Search results for: q) policy

Number of results: 381,585

2009
Dominic Thomas Elliot Bendoly Monica Capra

How do social networks motivate people to connect not only to their previously existing friends but also to novel or blind new contacts? We report the results of an experiment to identify the value that participants give to alternative network characteristics when deciding to connect to a social network. We focus on network tie characteristics because they represent information that potentially...

Journal: مدیریت زنجیره تأمین (Supply Chain Management)
Zohreh Kaheh Reza Baradaran Kazemzadeh

In this paper, tender problems in an automobile company for procuring needed items from potential suppliers are resolved using the Q-learning algorithm. The purchaser, based on proposals received from potential suppliers (including price and delivery time), assigns orders for the needed parts to suppliers. The buyer's objective is minimizing the procurement costs thr...

2015
Rezwana Reaz Muqeet Ali Mohamed G. Gouda Marijn Heule Ehab S. Elmallah

A computing policy is a sequence of rules, where each rule consists of a predicate and an action, and where each action is either “accept” or “reject”. A policy P is said to accept (or reject, respectively) a request iff the action of the first rule in P that is matched by the request is “accept” (or “reject”, respectively). A pair of policies (P, Q) is called an accept-implication pair iff ...
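The first-match semantics described in this abstract can be sketched as follows; this is an illustrative sketch, not the authors' implementation, and the rule set, the dictionary-based request format, and the default "reject" when no rule matches are all assumptions.

```python
# Illustrative sketch of a computing policy: an ordered list of
# (predicate, action) rules with first-match semantics.

def evaluate(policy, request):
    """Return the action of the first rule whose predicate matches."""
    for predicate, action in policy:
        if predicate(request):
            return action
    return "reject"  # assumed default when no rule matches

# Hypothetical example policy over requests with a "port" field.
P = [
    (lambda r: r["port"] == 22, "reject"),   # block SSH
    (lambda r: r["port"] < 1024, "accept"),  # allow other well-known ports
]
```

Note that rule order matters: a request with port 22 hits the first rule and is rejected even though it would also match the second.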

2001
Eyal Even-Dar Yishay Mansour

We show the convergence of two deterministic variants of Q-learning. The first is the widely used optimistic Q-learning, which initializes the Q-values to large initial values and then follows a greedy policy with respect to the Q-values. We show that setting the initial value sufficiently large guarantees convergence to an ε-optimal policy. The second is a new and novel algo...
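Optimistic Q-learning as described here (large initial Q-values, greedy action selection) can be sketched on a toy problem; the two-state MDP, the constants, and the update schedule below are illustrative assumptions, not from the paper.

```python
import numpy as np

# Hedged sketch of optimistic Q-learning on a toy deterministic MDP.
n_states, n_actions = 2, 2
Q = np.full((n_states, n_actions), 10.0)  # optimistic large initial values

def step(s, a):
    # Toy MDP: the action chooses the next state; action 1 yields reward 1.
    return a, (1.0 if a == 1 else 0.0)

gamma, alpha = 0.9, 0.1
s = 0
for _ in range(2000):
    a = int(np.argmax(Q[s]))  # greedy w.r.t. the optimistic Q-values
    s2, r = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = s2
```

The optimism drives exploration: overestimated actions are tried, found wanting, and their Q-values shrink toward the true values, so the greedy policy eventually settles on the optimal action.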

Journal: CoRR 2016
Shixiang Gu Timothy P. Lillicrap Zoubin Ghahramani Richard E. Turner Sergey Levine

Model-free deep reinforcement learning (RL) methods have been successful in a wide variety of simulated domains. However, a major obstacle facing deep RL in the real world is the high sample complexity of such methods. Unbiased batch policy-gradient methods offer stable learning, but at the cost of high variance, which often requires large batches, while TD-style methods, such as off-policy act...

2007
Christian Larsen

We develop an algorithm to compute an optimal Q(s,S) policy for the joint replenishment problem when demands follow a compound correlated Poisson process. It is a non-trivial generalization of the work by Nielsen and Larsen (2005). We make some numerical analyses on two-item problems where we compare the optimal Q(s,S) policy to the optimal uncoordinated (s,S) policies. The results indicate tha...
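The uncoordinated (s,S) policy mentioned in this abstract has a simple order-up-to rule per item; the sketch below is illustrative (the parameter values are hypothetical, and the compound correlated Poisson demand model from the paper is not reproduced here).

```python
# Hedged sketch of a single-item (s,S) replenishment rule: when inventory
# falls to or below the reorder point s, order up to the level S.

def order_quantity(inventory, s, S):
    """Quantity to order under an (s,S) policy; 0 if above the reorder point."""
    return S - inventory if inventory <= s else 0
```

Under the coordinated Q(s,S) variant, by contrast, replenishments of different items are triggered jointly to share fixed ordering costs, which is what the paper's algorithm optimizes.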

2016
Brendan O'Donoghue Remi Munos Koray Kavukcuoglu Volodymyr Mnih

Policy gradient is an efficient technique for improving a policy in a reinforcement learning setting. However, vanilla online variants are on-policy only and not able to take advantage of off-policy data. In this paper we describe a new technique that combines policy gradient with off-policy Q-learning, drawing experience from a replay buffer. This is motivated by making a connection between th...

Journal: CoRR 2017
John Schulman Pieter Abbeel Xi Chen

Two of the leading approaches for model-free reinforcement learning are policy gradient methods and Q-learning methods. Q-learning methods can be effective and sample-efficient when they work, however, it is not well-understood why they work, since empirically, the Q-values they estimate are very inaccurate. A partial explanation may be that Q-learning methods are secretly implementing policy g...
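One way to see the policy/Q connection discussed in this abstract is the entropy-regularized setting, where a stochastic policy can be read off the Q-values as a softmax; the temperature value and the example Q-vector below are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: a Boltzmann (softmax) policy derived from Q-values.

def softmax_policy(q_values, tau=1.0):
    """Action probabilities proportional to exp(Q / tau)."""
    z = (q_values - q_values.max()) / tau  # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

pi = softmax_policy(np.array([1.0, 2.0, 3.0]))
```

Higher-valued actions get more probability mass, and as tau shrinks the policy approaches the greedy argmax over Q.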

2012
Christopher R. Dance Onno R. Zoeter Haengju Lee

We consider the stochastic joint replenishment problem in which several items must be ordered in the face of stochastic demand. Previous authors proposed multiple heuristic policies for this economically-important problem. We show that several such policies are not good approximations to an optimal policy, since as some items grow more expensive than others, the cost rate of the heuristic polic...

2012
Landon Kraemer Bikramjit Banerjee

Decentralized partially observable Markov decision processes (Dec-POMDPs) offer a formal model for planning in cooperative multi-agent systems where agents operate with noisy sensors and actuators and local information. While many techniques have been developed for solving Dec-POMDPs exactly and approximately, they have been primarily centralized and reliant on knowledge of the model parameters....
