q policy

An Online Convergent Q-learning Algorithm with Linear Function Approximation

2014

Shalabh Bhatnagar

We present in this article a variant of Q-learning with linear function approximation that is based on two-timescale stochastic approximation. Whereas it is difficult to prove convergence of regular Q-learning with linear function approximation because of the off-policy problem, we prove that our algorithm is convergent. Numerical results on a multi-stage stochastic shortest path problem show t...
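For orientation, the baseline that the abstract calls "regular Q-learning with linear function approximation" can be sketched as a single update step. This is the standard semi-gradient form, not the two-timescale variant the article proposes; the function name, the feature vectors, and the step-size values are illustrative assumptions.

```python
def linear_q_update(theta, phi_sa, reward, next_phis, alpha=0.1, gamma=0.9):
    """One semi-gradient Q-learning step with Q(s, a) = theta . phi(s, a).

    theta     -- current weight vector (list of floats)
    phi_sa    -- feature vector of the visited state-action pair
    next_phis -- feature vectors of every action in the next state
                 (empty list if the next state is terminal)
    All names here are hypothetical; they do not come from the article.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    q_sa = dot(theta, phi_sa)
    # Off-policy target: greedy value of the next state under the same weights.
    q_next = max((dot(theta, p) for p in next_phis), default=0.0)
    td_error = reward + gamma * q_next - q_sa
    # Move the weights along the features of the visited pair.
    return [w + alpha * td_error * x for w, x in zip(theta, phi_sa)]


new_theta = linear_q_update([0.0, 0.0], [1.0, 0.0], reward=1.0, next_phis=[])
```

The maximization over next-state actions is what makes the target off-policy, which is the source of the convergence difficulty the abstract mentions.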

AMPol-Q: Adaptive Middleware Policy to Support QoS

2006

Raja Afandi, Jianqing Zhang, Carl A. Gunter

There are many problems hindering the design and development of Service-Oriented Architectures (SOAs), which can dynamically discover and compose multiple services so that the quality of the composite service is measured by its End-to-End (E2E) quality, rather than that of individual services in isolation. The diversity and complexity of QoS constraints further limit the widescale adoption of Q...

RTP-Q: A Reinforcement Learning System with Time Constraints Exploration Planning for Accelerating the Learning Rate

1999

Gang Zhao, Ruoying Sun

Reinforcement learning is an efficient method for solving Markov Decision Processes in which an agent improves its performance using scalar reward values, giving it a high capability for reactive and adaptive behavior. Q-learning is a representative reinforcement learning method that is guaranteed to obtain an optimal policy but needs numerous trials to achieve it. k-Certainty Exploration Learning Sy...
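The plain Q-learning method this abstract refers to can be illustrated with a minimal tabular sketch. The chain environment, the hyperparameter values, and all names below are assumptions made for the example; the sketch does not implement RTP-Q or k-Certainty exploration.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain of states 0..n_states-1.

    Action 0 moves left (staying put at state 0), action 1 moves right;
    entering the last state ends the episode with reward 1.
    """
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(n_states)]
    goal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy selection; exact ties are broken at random.
            if rng.random() < epsilon or q[s][0] == q[s][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # Bootstrap from the greedy value of the next state
            # (the goal row is never updated, so its value stays 0).
            target = r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q


q_table = q_learning()
greedy_policy = [max((0, 1), key=lambda a: q_table[s][a]) for s in range(4)]
```

Even on this five-state chain the agent replays hundreds of episodes, which gives a feel for the "numerous trials" the abstract mentions.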

A Fast Learning Agent Based on the Dyna Architecture

Journal: J. Inf. Sci. Eng., 2014

Yuan-Pao Hsu, Wei-Cheng Jiang

In this paper, we present a rapid learning algorithm called Dyna-QPC. The proposed algorithm requires considerably less training time than the Q-learning and table-based Dyna-Q algorithms, making it applicable to real-world control tasks. The Dyna-QPC algorithm is a combination of existing learning techniques: CMAC, Q-learning, and prioritized sweeping. In a practical experiment, the Dyna-QPC algori...

Decentralized Multi-Agent Reinforcement Learning in Average-Reward Dynamic DCOPs (Theoretical Proofs)

2014

Duc Thien Nguyen, William Yeoh, Hoong Chuin Lau, Shlomo Zilberstein, Chongjie Zhang

In this document, we show the proofs for the theoretical results described in the paper titled “Decentralized Multi-Agent Reinforcement Learning in Average-Reward Dynamic DCOPs” submitted to AAAI 2014. In this paper, we consider MDPs where a joint state can transition to any other joint state with non-zero probability, that is, the MDP is unichain. We are going to show the decomposability of th...

An Efficient Procedure for Computing an Optimal (R,Q) Policy in Continuous Review Systems with Poisson Demands and Constant Lead Time

2009

N. Yazdan Shenas, A. Eshraghniaye Jahromi, M. Modarres Yazdi

In this paper, a continuous review inventory system is considered in which an order in a batch of size Q is placed immediately after the inventory position reaches R. Transportation time is constant and demands are assumed to be generated by a stationary Poisson process with one unit demand at a time. Demands not covered immediately from the inventory are backordered. In a recent paper, the exa...
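The mechanics of the (R,Q) rule described above can be sketched as a small replay of unit demands. This is only a simulation of the policy, not the paper's procedure for computing optimal R and Q; the starting stock of R + Q, the function names, and the example numbers are assumptions.

```python
import random


def poisson_demand_times(rate, horizon, seed=0):
    """Arrival times of a Poisson process (exponential gaps) up to `horizon`."""
    rng = random.Random(seed)
    times, t = [], rng.expovariate(rate)
    while t <= horizon:
        times.append(t)
        t += rng.expovariate(rate)
    return times


def simulate_rq(demand_times, R, Q, lead_time):
    """Replay unit demands under an (R, Q) policy with a constant lead time.

    Returns (order_times, net_inventory), where net_inventory is on-hand
    stock minus backorders after the last demand.
    """
    position = R + Q   # inventory position = net inventory + stock on order
    net = R + Q        # assumed starting stock with nothing outstanding
    pending = []       # arrival times of outstanding orders
    order_times = []
    for t in sorted(demand_times):
        # Receive every order whose lead time has elapsed by time t.
        net += Q * sum(1 for a in pending if a <= t)
        pending = [a for a in pending if a > t]
        # One unit of demand; a negative net level means backorders.
        net -= 1
        position -= 1
        if position == R:  # reorder point reached: place a batch of size Q
            order_times.append(t)
            pending.append(t + lead_time)
            position += Q
    return order_times, net


orders, net_level = simulate_rq([1, 2, 3, 4, 5, 6], R=1, Q=3, lead_time=2.5)
```

In this deterministic replay, orders are placed at times 3 and 6, and the demand at time 5 is backordered until the first batch arrives at 5.5. Passing `poisson_demand_times(rate, horizon)` as the first argument gives the stochastic case.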

Rates of Convergence of Performance Gradient Estimates Using Function Approximation and Bias in Reinforcement Learning

2001

Gregory Z. Grudic, Lyle H. Ungar

We address two open theoretical questions in Policy Gradient Reinforcement Learning. The first concerns the efficacy of using function approximation to represent the state-action value function, Q. Theory is presented showing that linear function approximation representations of Q can degrade the rate of convergence of performance gradient estimates by a factor of O(ML) relative to when no func...

Q-learning with Censored Data.

Journal: Annals of Statistics, 2012

Yair Goldberg, Michael R. Kosorok

We develop methodology for a multistage-decision problem with a flexible number of stages in which the rewards are survival times that are subject to censoring. We present a novel Q-learning algorithm that is adjusted for censored data and allows a flexible number of stages. We provide finite sample bounds on the generalization error of the policy learned by the algorithm, and show that when the ...

On-Line EM Reinforcement Learning

2000

Junichiro Yoshimoto, Shin Ishii, Masa-aki Sato

In this article, we propose a new reinforcement learning (RL) method for a system having continuous state and action spaces. Our RL method has an architecture like the actor-critic model. The critic tries to approximate the Q-function, which is the expected future return for the current state-action pair. The actor tries to approximate a stochastic soft-max policy defined by the Q-function. The ...
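A soft-max policy defined by a Q-function can be sketched for a finite set of actions as follows. The paper itself works with continuous actions, so the discrete action set, the `temperature` parameter, and the function name are simplifying assumptions for illustration.

```python
import math


def softmax_policy(q_values, temperature=1.0):
    """Action probabilities proportional to exp(Q(s, a) / temperature)."""
    # Subtracting the maximum keeps exp() from overflowing
    # and leaves the resulting probabilities unchanged.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


probs = softmax_policy([1.0, 2.0, 3.0])
```

Lower temperatures concentrate probability on the highest-valued action, while higher temperatures move the policy toward uniform exploration.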

Inventory Cost Evaluation under VMI Program with Lot Splitting

Journal: International Journal of Industrial Engineering and Production Research

Rasoul Haji (Department of Industrial Engineering, Sharif University of Technology, Tehran, Iran), Mohammadmohsen Moarefdoost (Department of Industrial Engineering, Sharif University of Technology, Tehran, Iran), Seyed Babak Ebrahimi (Department of Industrial Engineering, Iran University of Science & Technology, Tehran, Iran)

This paper aims to evaluate the inventory cost of a two-echelon serial supply chain system under a vendor managed inventory program with stochastic demand, and to examine the effect of environmental factors on the cost of the overall system. For this purpose, we consider a two-echelon serial supply chain with a manufacturer and a retailer. Under the vendor managed inventory program, the decision on inventory le...
