Search results for: q policy

Number of results: 381585

Journal: CoRR 2017
Pierre H. Richemond Brendan Maginnis

Two main families of reinforcement learning algorithms, Q-learning and policy gradients, have recently been proven to be equivalent when using a softmax relaxation on one part, and an entropic regularization on the other. We relate this result to the well-known convex duality of Shannon entropy and the softmax function. Such a result is also known as the Donsker-Varadhan formula. This provides ...
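The duality this abstract invokes can be stated concretely: log-sum-exp (the "softmax" in the relaxed Bellman operator) is the convex conjugate of negative Shannon entropy over the probability simplex. This is a standard identity; the RL-flavored notation Q(s, a) below is an assumption, since the abstract is truncated:

```latex
\log \sum_{a} e^{Q(s,a)}
  \;=\; \max_{\pi \in \Delta}
        \Big( \sum_{a} \pi(a)\, Q(s,a) + H(\pi) \Big),
\qquad H(\pi) = -\sum_{a} \pi(a) \log \pi(a),
```

with the maximum attained at the softmax policy $\pi^*(a) = e^{Q(s,a)} / \sum_{a'} e^{Q(s,a')}$, which is the finite-space form of the Donsker-Varadhan variational formula.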

Journal: Automatica 2008
Shalabh Bhatnagar K. Mohan Babu

We propose two algorithms for Q-learning that use the two timescale stochastic approximation methodology. The first of these updates Q-values of all feasible state-action pairs at each instant while the second updates Q-values of states with actions chosen according to the ‘current’ randomized policy updates. A proof of convergence of the algorithms is shown. Finally, numerical experiments usin...
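The abstract is truncated, but the two-timescale structure it describes can be sketched as follows. The step-size exponents, the softmax policy target, and the toy `env_step` interface are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def two_timescale_q(env_step, n_states, n_actions, n_iters=20000, gamma=0.9):
    """Two-timescale sketch: Q-values move on a fast step size a_t,
    the randomized policy on a slower step size b_t, with b_t/a_t -> 0."""
    Q = np.zeros((n_states, n_actions))
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # randomized policy
    s = 0
    for t in range(1, n_iters + 1):
        a_t = t ** -0.6          # fast timescale for the Q-values
        b_t = 1.0 / t            # slow timescale for the policy
        a = np.random.choice(n_actions, p=pi[s])
        s_next, r = env_step(s, a)
        Q[s, a] += a_t * (r + gamma * Q[s_next].max() - Q[s, a])
        # track the softmax of the current Q-values on the slow timescale
        pi[s] = (1 - b_t) * pi[s] + b_t * softmax(Q[s])
        s = s_next
    return Q, pi
```

Because the policy averages much more slowly than the Q-values change, the Q-update effectively sees a quasi-static policy, which is what the two-timescale convergence argument exploits.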

2011
Marco Battaglini Salvatore Nunnari Thomas Palfrey

Legislative Bargaining and the Dynamics of Public Investment. We present a legislative bargaining model of the provision of a durable public good over an infinite horizon. In each period, there is a societal endowment which can either be invested in the public good or consumed. We characterize the optimal public policy, defined by the tim...

2013
Sherief Abdallah Michael Kaisers

Q-learning is a very popular reinforcement learning algorithm that has been proven to converge to optimal policies in Markov decision processes. However, Q-learning exhibits artifacts in non-stationary environments: e.g., the probability of playing the optimal action may decrease if Q-values deviate significantly from the true values, a situation that may arise in the initial phase as well as after change...
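For reference, the update rule this abstract builds on, plus a softmax action-selection scheme under which the described artifact is visible. The two-state setting and parameter values are purely illustrative:

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """Standard tabular Q-learning step: move Q(s, a) toward the
    bootstrap target r + gamma * max_a' Q(s_next, a')."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    return Q

def boltzmann(q_values, temp=1.0):
    """Softmax action selection: if a suboptimal Q-value is pushed up
    (deviates from its true value), the probability assigned to the
    truly optimal action drops -- the artifact described above."""
    z = np.exp((q_values - q_values.max()) / temp)
    return z / z.sum()
```

For example, with true-ish values `[1.0, 0.0]` the optimal action gets probability ≈ 0.73, but if the suboptimal estimate drifts up to 0.8 that probability falls to ≈ 0.55 even though the greedy choice is unchanged.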

Journal: Cancer discovery 2014
Howard Koh American Association Cancer

To mark the 50th anniversary of the Surgeon General's first report on smoking and health, and to promote this year's report, Howard Koh, MD, MPH, assistant secretary for health at the U.S. Department of Health and Human Services, spoke about the importance of and the continuing need for tobacco-control efforts.

2008
Bibhas Chakraborty Victor Strecher Susan Murphy

We consider finite-horizon fitted Q-iteration with linear function approximation to learn a policy from a training set of trajectories. We show that fitted Q-iteration can give biased estimates and invalid confidence intervals for the parameters that feature in the policy. We propose a regularized estimator called soft-threshold estimator, derive it as an approximate empirical Bayes estimator, ...
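The soft-threshold operator at the heart of such an estimator has a standard closed form, shown below as a generic sketch; how the paper applies it to the policy parameters is truncated here, and the threshold `lam` is illustrative:

```python
import numpy as np

def soft_threshold(x, lam):
    """S(x, lam) = sign(x) * max(|x| - lam, 0): shrinks every
    coefficient toward zero by lam and sets small ones exactly to zero,
    which is what stabilizes the near-nondifferentiable max operator."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)
```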

Journal: IJSDS 2011
Veena Goswami G. B. Mund

This paper analyzes a discrete-time infinite-buffer Geo/Geo/2 queue, in which the number of servers can be adjusted depending on the number of customers in the system one at a time at arrival or at service completion epoch. Analytical closed-form solutions of the infinite-buffer Geo/Geo/2 queueing system operating under the triadic (0, Q, N, M) policy are derived. The total expected cost functio...

Journal: IEEE transactions on neural networks 1999
Junhong Nie Simon Haykin

One of the fundamental issues in the operation of a mobile communication system is the assignment of channels to cells and to calls. Since the number of channels allocated to a mobile communication system is limited, efficient utilization of these communication channels by using efficient channel assignment strategies is not only desirable but also imperative. This paper presents a novel approa...

2000
Fabio C. Bagliano Alberto Dalmazzo Giancarlo Marini

In a model of oligopolistic competition in the banking sector, we analyse how the monetary policy rule chosen by the Central Bank can influence the incentive of banks to set high interest rates on loans over the business cycle. We exploit the basic model to investigate the potential impact of EMU implementation on collusion among banks. In particular, we consider the possible effects of the Europ...

2016
Matthew Hausknecht

Temporal-difference-based deep-reinforcement learning methods have typically been driven by off-policy, bootstrap Q-Learning updates. In this paper, we investigate the effects of using on-policy, Monte Carlo updates. Our empirical results show that for the DDPG algorithm in a continuous action space, mixing on-policy and off-policy update targets exhibits superior performance and stability comp...
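One concrete reading of "mixing on-policy and off-policy update targets" is a per-step convex combination of the Monte Carlo return and the one-step bootstrap target. The mixing weight `beta` and the names below are assumptions, since the abstract is truncated and the paper's exact scheme is not shown:

```python
import numpy as np

def mixed_target(rewards, q_boot, gamma=0.99, beta=0.2):
    """Blend on-policy Monte Carlo returns with off-policy one-step
    bootstrap targets for a finished trajectory.

    rewards: array r_0 .. r_{T-1}
    q_boot:  target-network values Q'(s_{t+1}, mu(s_{t+1})), length T,
             with q_boot[-1] = 0 for a terminal state."""
    T = len(rewards)
    G = np.zeros(T)                  # Monte Carlo returns G_t
    running = 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        G[t] = running
    y = rewards + gamma * np.asarray(q_boot)   # bootstrap targets y_t
    return beta * G + (1.0 - beta) * y
```

The Monte Carlo term injects low-bias on-policy information, while the bootstrap term keeps the low-variance off-policy signal, which is one plausible mechanism for the stability gain the abstract reports.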

[Chart: number of search results per year; click the chart to filter results by publication year]