Optimistic Temporal Difference Learning for 2048
Authors
Abstract
Temporal difference (TD) learning and its variants, such as multistage TD (MS-TD) learning and temporal coherence (TC) learning, have been successfully applied to the game 2048. These methods rely on the stochasticity of the environment of 2048 for exploration. In this paper, we propose to employ optimistic initialization (OI) to encourage exploration in 2048, and empirically show that the learning quality is significantly improved. This approach optimistically initializes feature weights to very large values. Since these weights tend to be reduced once the corresponding states are visited, agents are encouraged to explore states that are unvisited or visited only a few times. Our experiments show that both TD and TC learning with OI significantly improve performance. As a result, the network size required to achieve the same performance is significantly reduced. With additional tunings such as expectimax search and the tile-downgrading technique, our design achieves state-of-the-art performance, namely an average score of 625 377 and a rate of 72% of reaching 32768-tiles. In addition, with sufficiently many tests, 65536-tiles are reached at a rate of 0.02%.
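The idea described in the abstract, starting every feature weight at a large value so that visited afterstates lose their optimism while unvisited ones keep it, can be illustrated with a minimal afterstate TD(0) sketch for an n-tuple network. This is a toy illustration of the general technique rather than the paper's implementation; the constant OPTIMISTIC_INIT, the learning rate ALPHA, the two toy tuples in TUPLES, and the assumed helper simulate(board, move) (returning an afterstate and its merge reward) are all hypothetical choices, not values or names from the paper.

```python
from collections import defaultdict

# Toy sketch of afterstate TD(0) learning with optimistic initialization (OI)
# for a 2048-style n-tuple network. OPTIMISTIC_INIT, ALPHA, and TUPLES are
# illustrative assumptions, not values taken from the paper.

OPTIMISTIC_INIT = 320_000.0            # every feature starts at a very large value
ALPHA = 0.1                            # learning rate
TUPLES = [(0, 1, 2, 3), (4, 5, 6, 7)]  # two toy 4-tuples over board cells 0..15

# Each lookup table returns the optimistic value until a feature is updated,
# so afterstates containing unseen features look more attractive than visited ones.
weights = [defaultdict(lambda: OPTIMISTIC_INIT) for _ in TUPLES]

def value(board):
    """Sum the n-tuple lookups for a board given as a tuple of 16 tile exponents."""
    return sum(w[tuple(board[i] for i in idx)] for w, idx in zip(weights, TUPLES))

def td_update(afterstate, reward, next_afterstate):
    """TD(0) on afterstate values: V(s) <- V(s) + alpha * (r + V(s') - V(s))."""
    target = reward + (value(next_afterstate) if next_afterstate is not None else 0.0)
    delta = (target - value(afterstate)) / len(TUPLES)  # split the error across tuples
    for w, idx in zip(weights, TUPLES):
        w[tuple(afterstate[i] for i in idx)] += ALPHA * delta

def greedy_move(board, legal_moves, simulate):
    """Pick the move maximizing reward + V(afterstate). With OI, unseen
    afterstates still carry the large initial value, so no explicit
    exploration scheme such as epsilon-greedy is needed."""
    def score(move):
        afterstate, reward = simulate(board, move)  # assumed helper: apply move, no random tile
        return reward + value(afterstate)
    return max(legal_moves, key=score)
```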
Similar resources
Dual Temporal Difference Learning
Recently, researchers have investigated novel dual representations as a basis for dynamic programming and reinforcement learning algorithms. Although the convergence properties of classical dynamic programming algorithms have been established for dual representations, temporal difference learning algorithms have not yet been analyzed. In this paper, we study the convergence properties of tempor...
Preconditioned Temporal Difference Learning
LSTD is numerically unstable for some ergodic Markov chains with preferred visits among some states over the remaining ones, because the matrix that LSTD accumulates has large condition numbers. In this paper, we propose a variant of temporal difference learning with high data efficiency. A class of preconditioned temporal difference learning algorithms are also proposed to speed up the new met...
Emphatic Temporal-Difference Learning
Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps. Recent works by Sutton, Mahmood and White (2015), and Yu (2015) show that by varying the emphasis in a particular way, these algorithms become stable and convergent under off-policy training with linea...
Natural Temporal Difference Learning
In this paper we investigate the application of natural gradient descent to Bellman error based reinforcement learning algorithms. This combination is interesting because natural gradient descent is invariant to the parameterization of the value function. This invariance property means that natural gradient descent adapts its update directions to correct for poorly conditioned representations. ...
Algorithms for Fast Gradient Temporal Difference Learning
Temporal difference learning is one of the oldest and most used techniques in reinforcement learning to estimate value functions. Many modifications and extensions of the classical TD methods have been proposed. Recent examples are TDC and GTD(2) ([Sutton et al., 2009b]), the first approaches that are as fast as classical TD and have proven convergence for linear function approximation in on- and ...
Journal
Journal title: IEEE Transactions on Games
Year: 2022
ISSN: 2475-1502, 2475-1510
DOI: https://doi.org/10.1109/tg.2021.3109887