AlphaSnake: Policy Iteration on a Nondeterministic NP-Hard Markov Decision Process (Student Abstract)

Abstract

Reinforcement learning has been used to approach well-known NP-hard combinatorial problems in graph theory. Among these, Hamiltonian cycle problems are exceptionally difficult to analyze, even when restricted to individual instances of structurally complex graphs. In this paper, we use Monte Carlo Tree Search (MCTS), the search algorithm behind many state-of-the-art reinforcement learning algorithms such as AlphaZero, to create autonomous agents that learn to play the game of Snake, a game centered on properties of Hamiltonian cycles on grid graphs. The game of Snake can be formulated as a single-player discounted Markov Decision Process (MDP), where the agent must behave optimally in a stochastic environment. Determining the optimal policy for Snake, defined as the policy that maximizes the probability of winning (the win rate) with higher priority and minimizes the expected number of time steps with lower priority, is conjectured to be NP-hard. Performance-wise, compared to prior work on the game, our algorithm is the first to achieve a win rate over 0.5 (a uniform random policy achieves a win rate < 2.57 × 10^-15), demonstrating the versatility of AlphaZero in tackling NP-hard problems.
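The abstract does not spell out the state encoding, rewards, or discount used for the Snake MDP. As a minimal sketch under our own assumptions (a length-1 snake starting in the top-left corner, food placed uniformly at random on an empty cell, an episode lost on leaving the grid or hitting the body, and won when the snake fills the board), the Python below shows where the stochasticity in the transition comes from. It also estimates the win rate of a uniform random policy by simulation and encodes the lexicographic objective (win rate first, expected time steps second) as a sort key. The names `State`, `place_food`, `reset`, `step`, `random_policy_win_rate`, and `policy_key`, as well as the reward values, are hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple

Cell = Tuple[int, int]
# Assumed action set: four grid moves as (row delta, column delta).
ACTIONS = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


@dataclass(frozen=True)
class State:
    body: Tuple[Cell, ...]  # snake cells, head first
    food: Optional[Cell]    # None once the board is full


def place_food(rows: int, cols: int, body: Tuple[Cell, ...], rng: random.Random) -> Optional[Cell]:
    """Stochastic part of the transition: food lands uniformly on an empty cell."""
    empty = [(r, c) for r in range(rows) for c in range(cols) if (r, c) not in body]
    return rng.choice(empty) if empty else None


def reset(rows: int, cols: int, rng: random.Random) -> State:
    body = ((0, 0),)  # assumption: length-1 snake in the top-left corner
    return State(body, place_food(rows, cols, body, rng))


def step(state: State, action: str, rows: int, cols: int, rng: random.Random):
    """Apply one move; return (next_state, reward, done, won). Rewards are assumed."""
    dr, dc = ACTIONS[action]
    head = (state.body[0][0] + dr, state.body[0][1] + dc)
    eats = head == state.food
    # The tail vacates its cell this step unless the snake grows.
    rest = state.body if eats else state.body[:-1]
    off_grid = not (0 <= head[0] < rows and 0 <= head[1] < cols)
    if off_grid or head in rest:
        return state, -1.0, True, False  # collision: episode lost
    body = (head,) + rest
    if not eats:
        return State(body, state.food), 0.0, False, False
    food = place_food(rows, cols, body, rng)
    if food is None:
        return State(body, None), 1.0, True, True  # snake fills the grid: win
    return State(body, food), 1.0, False, False


def random_policy_win_rate(rows: int, cols: int, episodes: int,
                           seed: int = 0, max_steps: int = 10_000) -> float:
    """Monte Carlo estimate of the win rate of a uniform random policy."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(episodes):
        state = reset(rows, cols, rng)
        for _ in range(max_steps):  # episodes that never end count as losses
            state, _, done, won = step(state, rng.choice(list(ACTIONS)), rows, cols, rng)
            if done:
                wins += won
                break
    return wins / episodes


def policy_key(win_rate: float, expected_steps: float) -> Tuple[float, float]:
    """Lexicographic objective: higher win rate first, then fewer expected steps."""
    return (-win_rate, expected_steps)
```

For example, on a 1 × 2 grid only the move `R` from the start eats the single food item and fills the board, so `random_policy_win_rate(1, 2, episodes=2000)` should come out near 0.25. This toy setup does not reproduce the paper's grid size, its MCTS agent, or its reported win rates; it only illustrates why the random-policy baseline collapses as the grid grows.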

Similar resources

Online Markov decision processes with policy iteration

The online Markov decision process (MDP) is a generalization of the classical Markov decision process that incorporates changing reward functions. In this paper, we propose practical online MDP algorithms with policy iteration and theoretically establish a sublinear regret bound. A notable advantage of the proposed algorithm is that it can be easily combined with function approximation, and thu...

Policy Iteration for Decentralized Control of Markov Decision Processes

Coordination of distributed agents is required for problems arising in many areas, including multi-robot systems, networking and e-commerce. As a formal framework for such problems, we use the decentralized partially observable Markov decision process (DECPOMDP). Though much work has been done on optimal dynamic programming algorithms for the single-agent version of the problem, optimal algorit...

Simplified policy iteration for skip-free Markov decision processes

We describe and analyse a new simplified policy iteration type algorithm for finite average cost Markov decision processes that are skip-free in the negative direction. We show that the algorithm is guaranteed to converge after a finite number of iterations, but the computational effort required for each iteration step is comparable with that for value iteration. We show that the analysis can b...

Adiabatic Markov Decision Process: Convergence of Value Iteration Algorithm

Markov Decision Process (MDP) is a well-known framework for devising the optimal decision making strategies under uncertainty. Typically, the decision maker assumes a stationary environment which is characterized by a time-invariant transition probability matrix. However, in many real-world scenarios, this assumption is not justified, thus the optimal strategy might not provide the expected per...

Approximate Policy Iteration with a Policy Language Bias: Solving Relational Markov Decision Processes

We study an approach to policy selection for large relational Markov Decision Processes (MDPs). We consider a variant of approximate policy iteration (API) that replaces the usual value-function learning step with a learning step in policy space. This is advantageous in domains where good policies are easier to represent and learn than the corresponding value functions, which is often the case ...

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2023

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v37i13.26962