In this paper, we study a few challenging theoretical and numerical issues in the well-known trust region policy optimization for deep reinforcement learning. The goal is to find a policy that maximizes the total expected reward when the agent acts according to the policy. The trust region subproblem is constructed with a surrogate function coherent with the total expected reward and a general distance constraint around the latest policy. We solve the subproblem using a preconditioned stochastic gradient...