Temporal Second Difference Traces

Author

  • Mitchell Keith Bloch
Abstract

Q-learning is a reliable but inefficient off-policy temporal-difference method, backing up reward only one step at a time. Replacing traces, which use a recency heuristic, are more efficient but less reliable. In this work, we introduce model-free, off-policy temporal-difference methods that make better use of experience than Watkins’ Q(λ). We introduce both Optimistic Q(λ) and the temporal second difference trace (TSDT). TSDT is particularly powerful in deterministic domains. TSDT uses neither recency nor frequency heuristics, storing (s, a, r, s′, δ) so that off-policy updates can be performed after apparently suboptimal actions have been taken. There are additional advantages when using state abstraction, as in MAXQ. We demonstrate that TSDT does significantly better than both Q-learning and Watkins’ Q(λ) in a deterministic cliff-walking domain. Results in a noisy cliff-walking domain are less advantageous for TSDT, but demonstrate the efficacy of Optimistic Q(λ), a replacing trace with some of the advantages of TSDT.
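The abstract does not give TSDT's update rule, but the idea it describes, storing (s, a, r, s′, δ) tuples so that off-policy backups can be replayed later rather than relying on a recency heuristic, can be illustrated. The Python sketch below is a loose reading of that idea; the buffer layout, the replay loop, and all constants are assumptions, not the paper's algorithm.

```python
# A hypothetical illustration of storing (s, a, r, s', delta) tuples so
# that off-policy Q updates can be replayed later, with no recency or
# frequency heuristic. This is NOT the paper's TSDT algorithm; buffer
# layout, replay loop, and constants are assumptions.
from collections import defaultdict

GAMMA, ALPHA = 0.95, 0.5
ACTIONS = range(4)
Q = defaultdict(float)      # Q[(state, action)] -> estimated value
stored = []                 # transitions as [s, a, r, s_next, delta]

def td_error(s, a, r, s_next):
    """One-step off-policy (max over actions) TD error."""
    target = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    return target - Q[(s, a)]

def observe(s, a, r, s_next):
    """Apply a one-step Q-learning backup and remember the transition."""
    delta = td_error(s, a, r, s_next)
    Q[(s, a)] += ALPHA * delta
    stored.append([s, a, r, s_next, delta])

def replay():
    """Refresh backups for stored transitions whose TD error has moved,
    so later reward propagates to earlier, apparently suboptimal steps."""
    for entry in stored:
        s, a, r, s_next, _ = entry
        delta = td_error(s, a, r, s_next)
        Q[(s, a)] += ALPHA * delta
        entry[4] = delta    # keep the refreshed TD error with the tuple
```

In a deterministic domain, sweeping replay() until the stored TD errors vanish would propagate a reward all the way back along a stored trajectory, which is one plausible reading of the abstract's claim that TSDT is particularly powerful there.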


Similar Resources

Experimental analysis of eligibility traces strategies in temporal difference learning

Temporal-difference (TD) learning is a model-free reinforcement learning technique that adopts an infinite-horizon discounted model and uses incremental updates for dynamic programming. The state-value function is updated from sample episodes. Eligibility traces are a key mechanism for improving the rate of convergence. TD(λ) represents the use of eligibility traces...
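For reference, the trace mechanism this entry analyzes can be written down directly: tabular TD(λ) spreads each one-step TD error backward over recently visited states via an eligibility vector. The sketch below is the textbook formulation; the random-walk environment and constants are illustrative, not from the cited paper.

```python
# Textbook tabular TD(lambda) with eligibility traces. Traces decay by
# gamma*lambda each step and spread each TD error backward over
# recently visited states; not code from the cited paper.
import random

N_STATES, GAMMA, LAMBDA, ALPHA = 5, 1.0, 0.9, 0.1
V = [0.0] * N_STATES

def episode(replacing=True):
    e = [0.0] * N_STATES                # eligibility traces
    s = N_STATES // 2                   # start mid-walk
    while 0 <= s < N_STATES:
        s_next = s + random.choice((-1, 1))
        r = 1.0 if s_next == N_STATES else 0.0   # reward on right exit
        v_next = V[s_next] if 0 <= s_next < N_STATES else 0.0
        delta = r + GAMMA * v_next - V[s]        # one-step TD error
        if replacing:
            e[s] = 1.0                  # replacing trace: reset to 1
        else:
            e[s] += 1.0                 # accumulating trace
        for i in range(N_STATES):       # broadcast the error along traces
            V[i] += ALPHA * delta * e[i]
            e[i] *= GAMMA * LAMBDA      # decay (the recency heuristic)
        s = s_next
```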


The kNN-TD Reinforcement Learning Algorithm

A reinforcement learning algorithm called kNN-TD is introduced. This algorithm has been developed using the classical formulation of temporal-difference methods and a k-nearest-neighbors scheme as its expectations memory. By means of this kind of memory, the algorithm is able to generalize properly over continuous state spaces and also benefit from collective action selection and learning ...
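The cited paper's exact formulation is not reproduced here, but a k-nearest-neighbors memory for TD learning can be sketched: the value of a continuous state is a distance-weighted average over its k nearest stored prototypes, and TD errors are distributed back over those same neighbors. The prototype set and weighting below are assumptions.

```python
# Illustrative k-nearest-neighbors TD value approximation over a
# continuous state space; the weighting scheme is an assumption, not
# the kNN-TD paper's exact formulation.
import math

prototypes = [(x / 10.0,) for x in range(11)]   # fixed 1-D prototype states
values = [0.0] * len(prototypes)
K, GAMMA, ALPHA = 3, 0.95, 0.2

def neighbors(state):
    """Indices and inverse-distance weights of the K nearest prototypes."""
    nearest = sorted(range(len(prototypes)),
                     key=lambda i: math.dist(state, prototypes[i]))[:K]
    w = [1.0 / (1e-6 + math.dist(state, prototypes[i])) for i in nearest]
    total = sum(w)
    return [(i, wi / total) for i, wi in zip(nearest, w)]

def value(state):
    """Estimate V(state) as a weighted average over nearby prototypes."""
    return sum(wi * values[i] for i, wi in neighbors(state))

def td_update(s, r, s_next):
    """Distribute the one-step TD error over s's nearest prototypes."""
    delta = r + GAMMA * value(s_next) - value(s)
    for i, wi in neighbors(s):
        values[i] += ALPHA * delta * wi
```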


A Unified Approach for Multi-step Temporal-Difference Learning with Eligibility Traces in Reinforcement Learning

Recently, a new multi-step temporal-difference learning algorithm, called Q(σ), unifies n-step Tree-Backup (when σ = 0) and n-step Sarsa (when σ = 1) by introducing a sampling parameter σ. However, like other multi-step temporal-difference learning algorithms, Q(σ) requires considerable memory and computation time. The eligibility trace is an important mechanism for transforming off-line updates into e...
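The unification this entry describes is easiest to see in the one-step backup target, where σ interpolates between the sampled Sarsa target (σ = 1) and the expected Tree-Backup target (σ = 0). Below is a sketch of that target; the ε-greedy policy is an assumption.

```python
# One-step Q(sigma) backup target: sigma = 1 gives the Sarsa (sampled)
# target, sigma = 0 gives the expected (Tree-Backup) target, and
# intermediate sigma blends the two.
GAMMA, EPSILON = 0.99, 0.1

def policy_probs(q_row):
    """Epsilon-greedy action probabilities for one state (an assumption)."""
    n = len(q_row)
    best = max(range(n), key=q_row.__getitem__)
    probs = [EPSILON / n] * n
    probs[best] += 1.0 - EPSILON
    return probs

def q_sigma_target(r, q_next_row, a_next, sigma):
    """r + gamma * [sigma * Q(s',a') + (1 - sigma) * E_pi Q(s',.)]."""
    expected = sum(p * q for p, q in zip(policy_probs(q_next_row), q_next_row))
    sampled = q_next_row[a_next]
    return r + GAMMA * (sigma * sampled + (1.0 - sigma) * expected)
```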


SMT-Based Checking of SOLOIST over Sparse Traces

SMT solvers have recently been applied to bounded model checking and satisfiability checking of metric temporal logic. In this paper we consider SOLOIST, an extension of metric temporal logic with aggregate temporal modalities; it was defined based on a field study of how specification patterns are used in provisioning service-based applications. We apply bounded satisf...
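As a rough illustration of the bounded-checking idea (not SOLOIST's actual encoding, and without its aggregate modalities), a metric-temporal property over a finite trace can be reduced to a propositional SMT query, here with the Z3 Python bindings. The property and trace length are invented for the example.

```python
# Hypothetical sketch: bounded satisfiability of a simple metric
# temporal property over a finite trace, checked with the Z3 SMT
# solver. Encodes "whenever p holds, q holds within 2 steps" over a
# trace of length K; this is not SOLOIST's encoding.
from z3 import Bool, Solver, Or, Implies, sat

K = 10
p = [Bool(f"p_{i}") for i in range(K)]
q = [Bool(f"q_{i}") for i in range(K)]

s = Solver()
for i in range(K):
    # p_i -> (q_i or q_{i+1} or q_{i+2}), truncated at the trace end
    window = [q[j] for j in range(i, min(i + 3, K))]
    s.add(Implies(p[i], Or(window)))

s.add(p[4])            # force at least one triggering event
print("satisfiable" if s.check() == sat else "unsatisfiable")
```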


Adaptive Agent Models Using Temporal Discounting, Memory Traces and Hebbian Learning with Inhibition, and their Rationality

In this paper, three adaptive agent models incorporating triggered emotional responses are explored and evaluated for their rationality. One model is based on temporal discounting, the second on memory traces, and the third on Hebbian learning with mutual inhibition. The models are assessed using a measure that reflects the environment's behaviour and expresses the extent of rationality. Simu...
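The paper's concrete equations are not given in this summary; purely as a hint of what the three model families look like, here are generic forms of a temporal-discounting valuation, a decaying memory trace, and a Hebbian update with mutual inhibition. All forms and constants are assumptions, not the paper's models.

```python
# Generic forms of the three mechanisms named above; the specific
# equations and constants are assumptions, not the cited paper's models.
ETA, GAMMA_TRACE, DISCOUNT, INHIBIT = 0.1, 0.9, 0.95, 0.05

def discounted_value(rewards):
    """Temporal discounting: value of a stream of delayed rewards."""
    return sum(r * DISCOUNT ** t for t, r in enumerate(rewards))

def update_trace(trace, stimulus):
    """Memory trace: exponentially decaying record of past stimuli."""
    return GAMMA_TRACE * trace + stimulus

def hebbian_step(w1, w2, pre, post1, post2):
    """Hebbian growth for co-active units; mutual inhibition lets the
    two competing weights suppress each other (simultaneous update)."""
    d1 = ETA * pre * post1 - INHIBIT * w2
    d2 = ETA * pre * post2 - INHIBIT * w1
    return w1 + d1, w2 + d2
```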



Journal:
  • CoRR

Volume: abs/1104.4664

Publication date: 2011