Q-Learning for Bandit Problems
Author
Abstract
Multi-armed bandits may be viewed as decompositionally-structured Markov decision processes (MDPs) with potentially very large state sets. A particularly elegant methodology for computing optimal policies was developed over twenty years ago by Gittins [Gittins & Jones, 1974]. Gittins' approach reduces the problem of finding optimal policies for the original MDP to a sequence of low-dimensional stopping problems whose solutions determine the optimal policy through the so-called "Gittins indices." Katehakis and Veinott [Katehakis & Veinott, 1987] have shown that the Gittins index for a task in state i may be interpreted as a particular component of the maximum-value function associated with the "restart-in-i" process, a simple MDP to which standard solution methods for computing optimal policies, such as successive approximation, apply. This paper explores the problem of learning the Gittins indices on-line without the aid of a process model; it suggests utilizing task-state-specific Q-learning agents to solve their respective restart-in-state-i subproblems, and includes an example in which the on-line reinforcement learning approach is applied to a simple problem of stochastic scheduling, one instance drawn from a wide class of problems that may be formulated as bandit problems.
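As a rough illustration of the idea described in the abstract, the sketch below runs tabular Q-learning on the restart-in-state-i processes of a single small Markov reward chain and reads off index estimates. The transition matrix, rewards, learning rate, and the (1 − γ) scaling used to turn restart values into indices are assumptions made for this example, not details taken from the paper.

```python
"""Illustrative sketch (not the paper's exact algorithm): Q-learning on the
"restart-in-state-i" processes of Katehakis & Veinott (1987) for a single
Markov reward chain.

Assumptions: a small known state set, a fixed learning rate ALPHA, discount
GAMMA, and synthetic tables P and R used only to simulate experience."""

import numpy as np

rng = np.random.default_rng(0)

N_STATES = 3          # task states
GAMMA = 0.7           # discount factor
ALPHA = 0.1           # learning rate
STEPS = 200_000       # simulated transitions

# Synthetic task model, used only to generate sample transitions.
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.5, 0.3],
              [0.1, 0.3, 0.6]])
R = np.array([1.0, 0.5, 0.0])   # reward received on leaving each state

# One Q-table per restart-in-state-i process.
# Q[i, j, 0] = value of "continue" in state j of process i
# Q[i, j, 1] = value of "restart (to i)" in state j of process i
Q = np.zeros((N_STATES, N_STATES, 2))

state = 0
for _ in range(STEPS):
    reward = R[state]
    next_state = rng.choice(N_STATES, p=P[state])

    for i in range(N_STATES):
        target = reward + GAMMA * Q[i, next_state].max()
        # "Continue" in `state` within process i matches this transition.
        Q[i, state, 0] += ALPHA * (target - Q[i, state, 0])
        # A transition that started in i also tells us what "restart"
        # would do from *any* state j of process i.
        if state == i:
            Q[i, :, 1] += ALPHA * (target - Q[i, :, 1])

    state = next_state

# Per Katehakis & Veinott, the Gittins index of state i corresponds to the
# value of state i in its own restart-in-i process (here scaled by 1 - GAMMA).
indices = (1 - GAMMA) * Q[np.arange(N_STATES), np.arange(N_STATES), 0]
print("estimated Gittins indices:", np.round(indices, 3))
```

In a multi-task bandit, such per-task index estimates would then be used in the usual Gittins fashion: at each decision time, advance the task whose current state has the largest estimated index.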
Similar papers
The Exploration vs Exploitation Trade-Off in Bandit Problems: An Empirical Study
We compare well-known action selection policies used in reinforcement learning, such as ε-greedy and softmax, with lesser-known ones, such as the Gittins index and the knowledge gradient, on bandit problems. The latter two perform very well in this comparison. Moreover, the knowledge gradient can be generalized to problems other than bandits.
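For concreteness, here is a minimal sketch of the first two action-selection rules named above on a synthetic Bernoulli bandit; the arm means, ε, temperature, and horizon are illustrative assumptions, not the setup of the study above.

```python
"""Minimal sketch of epsilon-greedy and softmax action selection on an
assumed Bernoulli bandit (values below are illustrative, not the paper's)."""

import numpy as np

rng = np.random.default_rng(1)
TRUE_MEANS = [0.3, 0.5, 0.7]          # assumed Bernoulli arm means
EPSILON, TAU, HORIZON = 0.1, 0.1, 5_000


def run(select):
    counts = np.zeros(len(TRUE_MEANS))
    values = np.zeros(len(TRUE_MEANS))     # sample-mean reward estimates
    total = 0.0
    for _ in range(HORIZON):
        arm = select(values)
        reward = float(rng.random() < TRUE_MEANS[arm])
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total


def epsilon_greedy(values):
    if rng.random() < EPSILON:                 # explore uniformly
        return int(rng.integers(len(values)))
    return int(np.argmax(values))              # exploit current best


def softmax(values):
    prefs = np.exp((values - values.max()) / TAU)   # Boltzmann weights
    return int(rng.choice(len(values), p=prefs / prefs.sum()))


print("epsilon-greedy total reward:", run(epsilon_greedy))
print("softmax total reward:", run(softmax))
```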
Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems
We incorporate statistical confidence intervals in both the multi-armed bandit and the reinforcement learning problems. In the bandit problem we show that given n arms, it suffices to pull the arms a total of O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability of at least 1−δ. This bound matches the lower bound of Mannor and Tsitsiklis (2004) up to constants. We also devise act...
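The elimination idea can be sketched as follows, assuming Bernoulli arms and a Hoeffding-style confidence radius; the specific bonus formula, arm means, and failure probability are illustrative choices, not the algorithm or constants of the paper above.

```python
"""Hedged sketch of confidence-interval action elimination: pull the
surviving arms in rounds and drop any arm whose upper confidence bound
falls below another arm's lower bound. All constants are illustrative."""

import math
import numpy as np

rng = np.random.default_rng(2)
TRUE_MEANS = np.array([0.2, 0.4, 0.45, 0.8])   # assumed Bernoulli arms
DELTA = 0.05

active = list(range(len(TRUE_MEANS)))
counts = np.zeros(len(TRUE_MEANS))
means = np.zeros(len(TRUE_MEANS))

for round_ in range(1, 2000):
    if len(active) == 1:
        break
    for arm in active:                      # pull every surviving arm once
        reward = float(rng.random() < TRUE_MEANS[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
    # Hoeffding-style confidence radius with a simple union bound.
    radius = {a: math.sqrt(math.log(4 * len(TRUE_MEANS) * round_ ** 2 / DELTA)
                           / (2 * counts[a])) for a in active}
    best_lower = max(means[a] - radius[a] for a in active)
    active = [a for a in active if means[a] + radius[a] >= best_lower]

print("surviving arm(s):", active)
```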
Sparsity, variance and curvature in multi-armed bandits
In (online) learning theory the concepts of sparsity, variance and curvature are well-understood and are routinely used to obtain refined regret and generalization bounds. In this paper we further our understanding of these concepts in the more challenging limited feedback scenario. We consider the adversarial multi-armed bandit and linear bandit settings and solve several open problems pertain...
Large-Scale Bandit Problems and KWIK Learning
We show that parametric multi-armed bandit (MAB) problems with large state and action spaces can be algorithmically reduced to the supervised learning model known as “Knows What It Knows” or KWIK learning. We give matching impossibility results showing that the KWIK-learnability requirement cannot be replaced by weaker supervised learning assumptions. We provide such results in both the standard...
Incentives in the Dark: Multi-armed Bandits for Evolving Users with Unknown Type
Design of incentives or recommendations to users is becoming more common as platform providers continually emerge. We propose a multi-armed bandit approach to the problem in which users' types are unknown a priori and evolve dynamically in time. Unlike the traditional bandit setting, observed rewards are generated by a single Markov process. We demonstrate via an illustrative example that blindl...
Publication date: 1995