Search results for: q policy
Number of results: 381585. Filter results by year:
Two main families of reinforcement learning algorithms, Q-learning and policy gradients, have recently been proven to be equivalent when using a softmax relaxation on one part, and an entropic regularization on the other. We relate this result to the well-known convex duality of Shannon entropy and the softmax function. Such a result is also known as the Donsker-Varadhan formula. This provides ...
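The convex duality the snippet refers to can be checked numerically: the log-sum-exp (softmax relaxation) of a value vector equals the maximum over distributions of expected value plus Shannon entropy, attained at the softmax distribution. A minimal sketch (the vector `q` is an arbitrary illustrative choice):

```python
import numpy as np

# Duality check: log-sum-exp(q) = max_p { <p, q> + H(p) },
# with the maximizer p = softmax(q). The vector q is illustrative.
q = np.array([1.0, -0.5, 2.0, 0.3])

# Left-hand side: the softmax (log-sum-exp) relaxation of max_i q_i.
lse = np.log(np.sum(np.exp(q)))

# Right-hand side: <p, q> + H(p) evaluated at the softmax distribution.
p = np.exp(q) / np.sum(np.exp(q))
entropy = -np.sum(p * np.log(p))
dual_value = np.dot(p, q) + entropy

assert np.isclose(lse, dual_value)
```

Evaluating the dual objective at any other distribution gives a strictly smaller value, which is exactly the sense in which softmax and entropy are conjugate.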
We propose two algorithms for Q-learning that use the two-timescale stochastic approximation methodology. The first updates the Q-values of all feasible state-action pairs at each instant, while the second updates the Q-values of states with actions chosen according to the ‘current’ randomized policy. A proof of convergence of both algorithms is given. Finally, numerical experiments usin...
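The flavor of the second algorithm can be sketched in a few lines: Q-values are updated on a faster step-size schedule while the randomized policy moves on a slower one. The toy MDP, the step-size schedules, and the softmax policy improvement below are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (all numbers are illustrative assumptions).
n_states, n_actions, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] = next-state dist
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # deterministic rewards

Q = np.zeros((n_states, n_actions))
policy = np.full((n_states, n_actions), 1.0 / n_actions)  # 'current' randomized policy

for n in range(1, 5001):
    step_fast = 1.0 / n ** 0.6   # faster timescale: Q-value updates
    step_slow = 1.0 / n          # slower timescale: policy updates

    for s in range(n_states):
        # Sample an action from the current randomized policy.
        a = rng.choice(n_actions, p=policy[s])
        s_next = rng.choice(n_states, p=P[s, a])
        # Fast-timescale Q-update.
        Q[s, a] += step_fast * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
        # Slow-timescale policy move toward a softmax of the current Q-values.
        target = np.exp(Q[s] / 0.1)
        target /= target.sum()
        policy[s] += step_slow * (target - policy[s])
```

Because `step_slow <= 1`, each policy row stays a convex combination of valid distributions and therefore remains a probability vector throughout.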
Legislative Bargaining and the Dynamics of Public Investment by Marco Battaglini, Salvatore Nunnari, Thomas Palfrey * We present a legislative bargaining model of the provision of a durable public good over an infinite horizon. In each period, there is a societal endowment which can either be invested in the public good or consumed. We characterize the optimal public policy, defined by the tim...
Q-learning is a very popular reinforcement learning algorithm that has been proven to converge to optimal policies in Markov decision processes. However, Q-learning shows artifacts in non-stationary environments: for example, the probability of playing the optimal action may decrease if Q-values deviate significantly from the true values, a situation that may arise in the initial phase as well as after change...
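The update rule in question, and the artifact the snippet describes, can both be shown in a short sketch. The bandit-style numbers are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Standard tabular Q-learning step toward the off-policy bootstrap target.
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# The non-stationarity artifact: if Q-values deviate strongly from the true
# values (e.g. after an environment change), a softmax behavior policy can
# assign the now-optimal action a *lower* probability than a stale favorite.
Q = np.array([[0.0, 5.0]])   # stale estimates: action 1 still looks best
true_best = 0                # but after the change, action 0 is optimal
probs = np.exp(Q[0]) / np.exp(Q[0]).sum()
assert probs[true_best] < probs[1]   # optimal action is played less often
```

Until enough corrective updates accumulate, the behavior policy keeps under-sampling the optimal action, which is precisely the transient the abstract points at.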
To mark the 50th anniversary of the Surgeon General's first report on smoking and health, and to promote this year's report, Howard Koh, MD, MPH, assistant secretary for health at the U.S. Department of Health and Human Services, spoke about the importance of and the continuing need for tobacco-control efforts.
We consider finite-horizon fitted Q-iteration with linear function approximation to learn a policy from a training set of trajectories. We show that fitted Q-iteration can give biased estimates and invalid confidence intervals for the parameters that feature in the policy. We propose a regularized estimator called soft-threshold estimator, derive it as an approximate empirical Bayes estimator, ...
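The paper's estimator itself is not reproduced here, but regularized estimators of this kind are built on the standard soft-thresholding operator, S_lambda(x) = sign(x) * max(|x| - lambda, 0), which shrinks coefficients elementwise toward zero. A generic illustration (the coefficient vector is made up):

```python
import numpy as np

# Standard soft-thresholding operator; the paper's actual estimator and its
# empirical-Bayes tuning of lambda are not reproduced here.
def soft_threshold(x, lam):
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

theta_hat = np.array([2.5, -0.3, 0.8, -1.7])   # illustrative raw estimates
shrunk = soft_threshold(theta_hat, 0.5)
# Each coefficient moves 0.5 toward zero; those within 0.5 of zero vanish:
# [2.0, 0.0, 0.3, -1.2]
```

Small coefficients are zeroed out entirely, which is what reduces the bias and interval-coverage problems the abstract attributes to the unregularized fit.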
This paper analyzes a discrete-time infinite-buffer Geo/Geo/2 queue, in which the number of servers can be adjusted depending on the number of customers in the system, one at a time, at arrival or at service completion epochs. Analytical closed-form solutions of the infinite-buffer Geo/Geo/2 queueing system operating under the triadic (0, Q, N, M) policy are derived. The total expected cost functio...
One of the fundamental issues in the operation of a mobile communication system is the assignment of channels to cells and to calls. Since the number of channels allocated to a mobile communication system is limited, efficient utilization of these communication channels by using efficient channel assignment strategies is not only desirable but also imperative. This paper presents a novel approa...
In a model of oligopolistic competition in the banking sector, we analyse how the monetary policy rule chosen by the Central Bank can influence the incentive of banks to set high interest rates on loans over the business cycle. We exploit the basic model to investigate the potential impact of EMU implementation on collusion among banks. In particular, we consider the possible effects of the Europ...
Temporal-difference-based deep reinforcement learning methods have typically been driven by off-policy, bootstrap Q-learning updates. In this paper, we investigate the effects of using on-policy, Monte Carlo updates. Our empirical results show that for the DDPG algorithm in a continuous action space, mixing on-policy and off-policy update targets exhibits superior performance and stability comp...
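The mixing described above can be sketched as a convex combination of the two targets. The mixing weight `beta` and the toy trajectory are illustrative assumptions, not the authors' exact DDPG modification:

```python
import numpy as np

gamma, beta = 0.99, 0.2
rewards = [1.0, 0.0, 0.5, 1.0]   # one sampled trajectory of rewards
q_bootstrap = 2.0                # critic's Q-estimate at the successor state

# On-policy Monte Carlo return over the full trajectory.
mc_return = sum(gamma ** t * r for t, r in enumerate(rewards))

# Off-policy one-step bootstrap target: r_0 + gamma * Q(s_1, mu(s_1)).
td_target = rewards[0] + gamma * q_bootstrap

# Mixed update target for the critic.
mixed_target = beta * mc_return + (1.0 - beta) * td_target
```

With `beta = 0` this reduces to the usual off-policy bootstrap target, and with `beta = 1` to a pure Monte Carlo target, so the weight interpolates between the two regimes the abstract compares.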