This paper considers the problem of computing an optimal policy for a Markov Decision Process (MDP) without complete a priori knowledge of (i) the branching probability distributions that govern the evolution of the process state upon execution of the different actions, and (ii) the probability distributions characterizing the immediate rewards returned by the environment as a result...
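To make the setting concrete, the sketch below shows one standard way such a problem can be approached when the transition and reward distributions are unknown: tabular Q-learning on a toy 3-state chain. The environment, its dynamics, and all parameter values here are illustrative assumptions, not taken from the paper; the learner interacts with the environment only through sampled transitions, never reading its distributions directly.

```python
import random

def sample_env(state, action, rng):
    """Black-box environment: returns (next_state, reward).
    Toy 3-state chain (state 2 is terminal); the agent only ever
    samples this function, mirroring the unknown-model setting."""
    if action == 1:                       # move right
        nxt = state + 1
        reward = 1.0 if nxt == 2 else 0.0
    else:                                 # stay, with a small chance of slipping left
        nxt = state if rng.random() < 0.9 else max(state - 1, 0)
        reward = 0.0
    return nxt, reward

def q_learning(episodes=2000, alpha=0.2, gamma=0.95, eps=0.1, seed=0):
    """Learn Q[state][action] from sampled interaction alone."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(3)]
    for _ in range(episodes):
        s = 0
        while s != 2:                     # run until the terminal state
            # epsilon-greedy action selection
            if rng.random() < eps:
                a = rng.randrange(2)
            else:
                a = max(range(2), key=lambda x: Q[s][x])
            s2, r = sample_env(s, a, rng)
            # Q-learning update: bootstrap from the best next action
            best_next = 0.0 if s2 == 2 else max(Q[s2])
            Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])
            s = s2
    return Q

Q = q_learning()
# The greedy policy recovered from Q moves right in both non-terminal states.
```

The point of the sketch is only that an optimal policy can emerge from samples: after enough episodes, the greedy policy derived from `Q` prefers the rewarding action in every state, even though the agent never had access to the transition or reward distributions.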