policy iterations

Heuristic Dynamic Programming Nonlinear Optimal Controller

2012

Asma Al-tamimi, Murad Abu-Khalaf, Frank Lewis

This chapter is concerned with the application of approximate dynamic programming (ADP) techniques to solve for the value function, and hence the optimal control policy, in discrete-time nonlinear optimal control problems having continuous state and action spaces. ADP is a reinforcement learning approach (Sutton & Barto, 1998) based on adaptive critics (Barto et al., 1983), (Widrow et al., 1973...

Full text

Randomised Procedures for Initialising and Switching Actions in Policy Iteration

2016

Shivaram Kalyanakrishnan, Neeldhara Misra, Aditya Gopalan

Policy Iteration (PI) (Howard 1960) is a classical method for computing an optimal policy for a finite Markov Decision Problem (MDP). The method is conceptually simple: starting from some initial policy, “policy improvement” is repeatedly performed to obtain progressively dominating policies, until eventually, an optimal policy is reached. Being remarkably efficient in practice, PI is often fav...
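The evaluate-then-improve loop described in this abstract can be illustrated with a minimal sketch. It is not taken from the paper: the two-state MDP, the table layout (`P[s][a]` as a list of next-state probabilities, `R[s][a]` as an immediate reward), the discount factor 0.9, and all function names are assumptions chosen only for illustration.

```python
def q_value(s, a, v, P, R, gamma):
    """One-step lookahead value of taking action a in state s, given values v."""
    return R[s][a] + gamma * sum(p * v[t] for t, p in enumerate(P[s][a]))


def evaluate(policy, P, R, gamma, tol=1e-10):
    """Approximate the value of a deterministic policy by repeated in-place sweeps."""
    v = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            new = q_value(s, policy[s], v, P, R, gamma)
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v


def policy_iteration(P, R, gamma, policy=None, eps=1e-8):
    """Alternate policy evaluation and greedy improvement until no action changes."""
    n = len(P)
    policy = list(policy) if policy is not None else [0] * n
    while True:
        v = evaluate(policy, P, R, gamma)
        changed = False
        for s in range(n):
            current = q_value(s, policy[s], v, P, R, gamma)
            best = max(range(len(P[s])), key=lambda a: q_value(s, a, v, P, R, gamma))
            # Switch only on a strict improvement, so ties cannot cause cycling.
            if q_value(s, best, v, P, R, gamma) > current + eps:
                policy[s] = best
                changed = True
        if not changed:
            return policy, v


# Hypothetical two-state MDP (assumed for illustration only).
# State 0: action 0 stays (reward 0), action 1 moves to state 1 (reward 1).
# State 1: action 0 stays (reward 2), action 1 moves to state 0 (reward 0).
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
R = [[0.0, 1.0],
     [2.0, 0.0]]

policy, v = policy_iteration(P, R, gamma=0.9)
print(policy, [round(x, 4) for x in v])
```

In this toy problem the loop starts from the all-zeros policy and stops after one improvement step. The evaluation here uses iterative sweeps rather than an exact linear solve, which is a simplification made to keep the sketch dependency-free.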

Full text

Least-squares methods for policy iteration

2011

Lucian Buşoniu, Alessandro Lazaric, Mohammad Ghavamzadeh, Rémi Munos, Robert Babuška, Bart De Schutter

Approximate reinforcement learning deals with the essential problem of applying reinforcement learning in large and continuous state-action spaces, by using function approximators to represent the solution. This chapter reviews least-squares methods for policy iteration, an important class of algorithms for approximate reinforcement learning. We discuss three techniques for solving the core, po...

Full text

Strong polynomiality of policy iterations for average-cost MDPs modeling replacement and maintenance problems

Journal: Operations Research Letters, 2013

Full text

[hal-00829532, v2] Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

2013

Bruno Scherrer

Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the γ-discounted optimal policy. We consider two variations of PI: Howard’s PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advan...
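The difference between the two switching rules named in this abstract can be shown with a small sketch. It is an illustration under assumptions, not the paper's code: the advantage table `adv[s][a]` is hypothetical (with the current action's advantage equal to zero), the function names are invented, and taking the highest-advantage action within a state for the Howard rule is an assumed convention.

```python
def howard_switch(policy, adv, eps=1e-12):
    """Howard-style update: switch in every state that has a positive advantage."""
    new_policy = list(policy)
    for s, row in enumerate(adv):
        best = max(range(len(row)), key=row.__getitem__)
        if row[best] > eps:
            new_policy[s] = best
    return new_policy


def simplex_switch(policy, adv, eps=1e-12):
    """Simplex-style update: switch only in the single state with maximal advantage."""
    new_policy = list(policy)
    pairs = [(s, a) for s, row in enumerate(adv) for a in range(len(row))]
    s_best, a_best = max(pairs, key=lambda sa: adv[sa[0]][sa[1]])
    if adv[s_best][a_best] > eps:
        new_policy[s_best] = a_best
    return new_policy


# Hypothetical advantages for a three-state, two-action problem.
# Column 0 is the current action in every state, so its advantage is 0.
current = [0, 0, 0]
adv = [[0.0, 0.5],
       [0.0, 2.0],
       [0.0, -1.0]]

howard_result = howard_switch(current, adv)
simplex_result = simplex_switch(current, adv)
print(howard_result, simplex_result)
```

With these numbers the Howard rule changes the action in the first two states, while the Simplex rule changes only the second state, where the advantage is largest.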

Full text

Continuous-time Markov decision processes with nth-bias optimality criteria

Journal: Automatica, 2009

Junyu Zhang, Xi-Ren Cao

In this paper, we study the nth-bias optimality problem for finite continuous-time Markov decision processes (MDPs) with a multichain structure. We first provide nth-bias difference formulas for two policies and present some interesting characterizations of an nth-bias optimal policy by using these difference formulas. Then, we prove the existence of an nth-bias optimal policy by using nth-bias...

Full text

Q-learning and policy iteration algorithms for stochastic shortest path problems

Journal: Annals OR, 2013

Huizhen Yu, Dimitri P. Bertsekas

We consider the stochastic shortest path problem, a classical finite-state Markovian decision problem with a termination state, and we propose new convergent Q-learning algorithms that combine elements of policy iteration and classical Q-learning/value iteration. These algorithms are related to the ones introduced by the authors for discounted problems in Bertsekas and Yu (Math. Oper. Res. 37(1...

Full text

Error Propagation for Approximate Policy and Value Iteration

2010

Amir Massoud Farahmand, Rémi Munos, Csaba Szepesvári

We address the question of how the approximation error/Bellman residual at each iteration of the Approximate Policy/Value Iteration algorithms influences the quality of the resulting policy. We quantify the performance loss as the Lp norm of the approximation error/Bellman residual at each iteration. Moreover, we show that the performance loss depends on the expectation of the squared Radon-Niko...

Full text

Game Theoretic Controller Synthesis for Multi-Robot Motion Planning-Part II: Policy-based Algorithms

2015

Devesh K. Jha, Minghui Zhu

This paper presents the problem of distributed feedback motion planning for multiple robots. The problem of feedback multi-robot motion planning is formulated as a differential noncooperative game. We leverage the existing sampling-based algorithms and value iterations to develop an incremental policy synthesizer. The proposed algorithm makes use of an iterative best response algorithm to incre...

Full text

[hal-00829532, v3] Improved and Generalized Upper Bounds on the Complexity of Policy Iteration

2013

Bruno Scherrer

Given a Markov Decision Process (MDP) with n states and m actions per state, we study the number of iterations needed by Policy Iteration (PI) algorithms to converge to the γ-discounted optimal policy. We consider two variations of PI: Howard’s PI that changes the actions in all states with a positive advantage, and Simplex-PI that only changes the action in the state with maximal advan...

Full text