Q-Learning for Bandit Problems
Author
Abstract
Multi-armed bandits may be viewed as decompositionally-structured Markov decision processes (MDPs) with potentially very large state sets. A particularly elegant methodology for computing optimal policies was developed over twenty years ago by Gittins [Gittins & Jones, 1974]. Gittins' approach reduces the problem of finding optimal policies for the original MDP to a sequence of low-dimensional stopping problems whose solutions determine the optimal policy through the so-called "Gittins indices." Katehakis and Veinott [Katehakis & Veinott, 1987] have shown that the Gittins index for a task in state i may be interpreted as a particular component of the maximum-value function associated with the "restart-in-i" process, a simple MDP to which standard solution methods for computing optimal policies, such as successive approximation, apply. This paper explores the problem of learning the Gittins indices on-line without the aid of a process model; it suggests utilizing task-state-specific Q-learning agents to solve their respective restart-in-state-i subproblems, and includes an example in which the on-line reinforcement learning approach is applied to a simple problem of stochastic scheduling, one instance drawn from a wide class of problems that may be formulated as bandit problems.
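As a rough illustration of the idea described in the abstract, the sketch below runs tabular Q-learning on the restart-in-state-i processes of a single small Markov reward chain and reads off index estimates. The transition matrix, rewards, learning rate, and the (1 − γ) scaling used to turn restart values into indices are assumptions made for this example, not details taken from the paper.

```python
"""Illustrative sketch (not the paper's exact algorithm): Q-learning on the
"restart-in-state-i" processes of Katehakis & Veinott (1987) for a single
Markov reward chain.

Assumptions: a small known state set, a fixed learning rate ALPHA, discount
GAMMA, and synthetic tables P and R used only to simulate experience."""

import numpy as np

rng = np.random.default_rng(0)

N_STATES = 3          # task states
GAMMA = 0.7           # discount factor
ALPHA = 0.1           # learning rate
STEPS = 200_000       # simulated transitions

# Synthetic task model, used only to generate sample transitions.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])
R = np.array([1.0, 0.5, 0.0])   # reward received on leaving each state

# One Q-table per restart-in-state-i process.
# Q[i, j, 0] = value of "continue" in state j of process i
# Q[i, j, 1] = value of "restart (to i)" in state j of process i
Q = np.zeros((N_STATES, N_STATES, 2))

state = 0
for _ in range(STEPS):
    reward = R[state]
    next_state = rng.choice(N_STATES, p=P[state])

    for i in range(N_STATES):
        target = reward + GAMMA * Q[i, next_state].max()
        # "Continue" in `state` within process i matches this transition.
        Q[i, state, 0] += ALPHA * (target - Q[i, state, 0])
        # A transition that started in i also tells us what "restart"
        # would do from *any* state j of process i.
        if state == i:
            Q[i, :, 1] += ALPHA * (target - Q[i, :, 1])

    state = next_state

# Per Katehakis & Veinott, the Gittins index of state i corresponds to the
# value of state i in its own restart-in-i process (here scaled by 1 - GAMMA).
indices = (1 - GAMMA) * Q[np.arange(N_STATES), np.arange(N_STATES), 0]
print("estimated Gittins indices:", np.round(indices, 3))
```

In a multi-task bandit, such per-task index estimates would then be used in the usual Gittins fashion: at each decision time, advance the task whose current state has the largest estimated index.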
Similar papers
The Exploration vs Exploitation Trade-Off in Bandit Problems: An Empirical Study
We compare well-known action selection policies used in reinforcement learning, such as ε-greedy and softmax, with lesser-known ones, such as the Gittins index and the knowledge gradient, on bandit problems. The latter two perform very well in this comparison. Moreover, the knowledge gradient can be generalized to problems other than bandits.
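For concreteness, here is a minimal sketch of the first two action-selection rules named above on a synthetic Bernoulli bandit; the arm means, ε, temperature, and horizon are illustrative assumptions, not the setup of the study above.

```python
"""Minimal sketch of epsilon-greedy and softmax action selection on an
assumed Bernoulli bandit (values below are illustrative, not the paper's)."""

import numpy as np

rng = np.random.default_rng(1)
TRUE_MEANS = [0.3, 0.5, 0.7]          # assumed Bernoulli arm means
EPSILON, TAU, HORIZON = 0.1, 0.1, 5_000


def run(select):
    counts = np.zeros(len(TRUE_MEANS))
    values = np.zeros(len(TRUE_MEANS))     # sample-mean reward estimates
    total = 0.0
    for _ in range(HORIZON):
        arm = select(values)
        reward = float(rng.random() < TRUE_MEANS[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total


def epsilon_greedy(values):
    if rng.random() < EPSILON:                 # explore uniformly
        return int(rng.integers(len(values)))
    return int(np.argmax(values))              # exploit current best


def softmax(values):
    prefs = np.exp((values - values.max()) / TAU)   # Boltzmann weights
    return int(rng.choice(len(values), p=prefs / prefs.sum()))


print("epsilon-greedy total reward:", run(epsilon_greedy))
print("softmax total reward:", run(softmax))
```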
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability of at least 1−δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise act...
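The elimination idea can be sketched as follows, assuming Bernoulli arms and a Hoeffding-style confidence radius; the specific bonus formula, arm means, and failure probability are illustrative choices, not the algorithm or constants of the paper above.

```python
"""Hedged sketch of confidence-interval action elimination: pull the
surviving arms in rounds and drop any arm whose upper confidence bound
falls below another arm's lower bound. All constants are illustrative."""

import math
import numpy as np

rng = np.random.default_rng(2)
TRUE_MEANS = np.array([0.2, 0.4, 0.45, 0.8])   # assumed Bernoulli arms
DELTA = 0.05

active = list(range(len(TRUE_MEANS)))
counts = np.zeros(len(TRUE_MEANS))
means = np.zeros(len(TRUE_MEANS))

for round_ in range(1, 2000):
    if len(active) == 1:
        break
    for arm in active:                      # pull every surviving arm once
        reward = float(rng.random() < TRUE_MEANS[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    # Hoeffding-style confidence radius with a simple union bound.
    radius = {a: math.sqrt(math.log(4 * len(TRUE_MEANS) * round_ ** 2 / DELTA)
                           / (2 * counts[a])) for a in active}
    best_lower = max(means[a] - radius[a] for a in active)
    active = [a for a in active if means[a] + radius[a] >= best_lower]

print("surviving arm(s):", active)
```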
Sparsity, variance and curvature in multi-armed bandits
In (online) learning theory the concepts of sparsity, variance and curvature are well-understood and are routinely used to obtain refined regret and generalization bounds. In this paper we further our understanding of these concepts in the more challenging limited feedback scenario. We consider the adversarial multi-armed bandit and linear bandit settings and solve several open problems pertain...
Large-Scale Bandit Problems and KWIK Learning
We show that parametric multi-armed bandit (MAB) problems with large state and action spaces can be algorithmically reduced to the supervised learning model known as “Knows What It Knows” or KWIK learning. We give matching impossibility results showing that the KWIK-learnability requirement cannot be replaced by weaker supervised learning assumptions. We provide such results in both the standard...
Incentives in the Dark: Multi-armed Bandits for Evolving Users with Unknown Type
Design of incentives or recommendations to users is becoming more common as platform providers continually emerge. We propose a multi-armed bandit approach to the problem in which users' types are unknown a priori and evolve dynamically in time. Unlike the traditional bandit setting, observed rewards are generated by a single Markov process. We demonstrate via an illustrative example that blindl...
Publication date: 1995