We propose a new, simple, and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov decision process (MDP) when the transition kernels are unknown. Unlike classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm does not depend on a stochastic approximation-based method. We show that our algorithm, which we call empirical Q-value iteration c...