Dynamics of Temporal Difference Learning
Abstract
In the behavioural sciences, one considers the problem of a sequence of stimuli followed by a sequence of rewards r(t). The subject is to learn the full sequence of rewards from the stimuli, where the prediction is modelled by the Sutton-Barto rule. Over a sequence of n trials, this prediction rule is learned iteratively by temporal difference learning. We present a closed formula for the prediction of rewards at trial time t within trial n. From that formula, we show directly that for n → ∞ the predictions converge to the true rewards. In this approach, a new property of correlation-type Toeplitz matrices is proven. We give learning rates which optimally speed up the learning process.

1 Temporal Difference Learning

We consider here a mathematical treatment of a problem in behavioural biology. This problem has been described, e.g. in [Dayan and Abbott 2001], as learning to predict a reward. It is Pavlovian in the sense that classical conditioning is addressed; however, the reward is not immediate: after a series of stimuli there is a latency time, followed by a series of rewards. Using a simple linear rule, we will show that the subject is able to predict the remaining rewards by that rule, repeating the same stimulus-reward pattern over a series of trials. A review of learning theory related to these problems can be found in [Sutton and Barto 1998]. There are recent experimental biological correlates of this mathematical model, e.g. in the activity of primate dopamine cells during appetitive conditioning tasks, together with the psychological and pharmacological rationale for studying these cells. A review was given in [Schultz 1998]; the connection to temporal difference learning can be found in [Montague et al. 1996].

In this paper, the focus is on two mathematical issues: a) to provide a direct constructive proof of convergence by giving an explicit dependence of the prediction error on the trial number, and b) to minimize the learning time by giving a formula for the optimal setting of the learning rate. Hence, the paper also contributes to temporal difference learning as a purely mathematical issue, which may be valuable even without reference to behavioural biology and reinforcement learning. It can also be understood in a Dynamic Programming sense, see [Watkins 1989] and later work, e.g. [Gordon 2001] and [Szepesvari and Smart 2004].

We adopt the following notation in the course of this paper, following [Dayan and Abbott 2001]:

• Stimulus u(t)
• Future reward r(t)
• Sum of future rewards R(t)
• Weights w(k)
• Predicted reward v(t)

We want to compute, for all trials n of duration T, and for any trial time t, the predicted reward v(t). The (extended) stimulus u(t) is given at times t_{u,min} ... t_{u,max}; the (extended) reward r(t) is presented at times t_{r,min} ... t_{r,max}. Stimulus and reward do not overlap, i.e. t_{r,min} > t_{u,max}. The subject is to learn the total remaining reward at time t, i.e. the sum of future rewards R(t).
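To make the setting concrete, the following is a minimal sketch of trial-by-trial temporal difference learning of the Sutton-Barto rule in the textbook form of [Dayan and Abbott 2001]: v(t) = Σ_k w(k) u(t−k), δ(t) = r(t) + v(t+1) − v(t), w(k) ← w(k) + ε δ(t) u(t−k). It is not the paper's closed-form construction; the trial length, stimulus and reward times, and the learning rate eps are illustrative assumptions, chosen only to show the predictions approaching the remaining reward as the number of trials grows.

import numpy as np

# Minimal sketch (an illustration, not the paper's closed-form solution):
# trial-by-trial temporal difference learning of the Sutton-Barto rule
#   v(t) = sum_k w(k) u(t-k),  delta(t) = r(t) + v(t+1) - v(t),
#   w(k) <- w(k) + eps * delta(t) * u(t-k)
# as formulated in [Dayan and Abbott 2001]. Trial length, stimulus/reward
# times and the learning rate eps below are illustrative assumptions.

T = 25                      # trial duration
t_u, t_r = 5, 15            # stimulus time and reward time (assumed, t_r > t_u)
eps = 0.2                   # learning rate
n_trials = 300

u = np.zeros(T); u[t_u] = 1.0   # stimulus u(t): single pulse
r = np.zeros(T); r[t_r] = 1.0   # reward r(t): single pulse
w = np.zeros(T)                 # weights w(k)

for n in range(n_trials):
    # prediction v(t) = sum_{k=0..t} w(k) u(t-k), using the current weights
    v = np.array([w[:t + 1] @ u[t::-1] for t in range(T)])
    # temporal difference error delta(t) = r(t) + v(t+1) - v(t), with v(T) = 0
    delta = r + np.append(v[1:], 0.0) - v
    # weight update w(k) <- w(k) + eps * delta(t) * u(t-k)
    for t in range(T):
        w[:t + 1] += eps * delta[t] * u[t::-1]

# After many trials, v(t) approximates the remaining reward
# R(t) = sum_{tau >= 0} r(t + tau) from the stimulus onward
# (before the stimulus there is nothing to base the prediction on).
R = np.array([r[t:].sum() for t in range(T)])
print(np.round(v, 2))
print(R)

With a single stimulus pulse, v(t) simply reads out w(t − t_u), so printing the prediction over successive trials shows it spreading backwards from the reward time; the paper's closed formula describes this trial-by-trial behaviour explicitly, together with the learning rate that makes it fastest.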
Similar resources
Control of Multivariable Systems Based on Emotional Temporal Difference Learning Controller
One of the most important issues that we face in controlling delayed systems and non-minimum phase systems is to fulfill objective orientations simultaneously and in the best way possible. In this paper, proposing a new method, an objective orientation is presented for controlling multi-objective systems. The principles of this method are based on emotional temporal difference learning, and it has a...
Hand Gesture Recognition from RGB-D Data using 2D and 3D Convolutional Neural Networks: a comparative study
Despite considerable advances in recognizing hand gestures from still images, there are still many challenges in the classification of hand gestures in videos. The latter comes with more challenges, including higher computational complexity and the arduous task of representing temporal features. Hand movement dynamics, represented by temporal features, have to be extracted by analyzing the total fr...
Iranian EFL Learners’ Motivational Fluctuation in Task Performance over Different Timescales
Motivation for learning a new language is both self- and time-oriented. The language learner’s motivation experiences gradual fluctuation over time, and the view of oneself is different on each timescale of the study. Interaction among different timescales throughout Second Language Development (SLD) is a novel area of investigation (de Bot, 2015). In order to probe this interactive nature, t...
Why did TD-Gammon Work?
Although TD-Gammon is one of the major successes in machine learning, it has not led to similar impressive breakthroughs in temporal difference learning for other applications or even other games. We were able to replicate some of the success of TD-Gammon, developing a competitive evaluation function on a 4000-parameter feed-forward neural network, without using back-propagation, reinforcement ...
An Imperfect Dopaminergic Error Signal Can Drive Temporal-Difference Learning
An open problem in the field of computational neuroscience is how to link synaptic plasticity to system-level learning. A promising framework in this context is temporal-difference (TD) learning. Experimental evidence that supports the hypothesis that the mammalian brain performs temporal-difference learning includes the resemblance of the phasic activity of the midbrain dopaminergic neurons to...
Simulation-based Search of Combinatorial Games
Monte-Carlo Tree Search is a very successful game-playing algorithm. Unfortunately it suffers from the horizon effect: some important tactical sequences may be delayed beyond the depth of the search tree, causing evaluation errors. Temporal-difference search with function approximation is a method that was proposed to overcome these weaknesses, by adaptively changing the simulation policy out...
Publication date: 2007