Search results for: q policy

Number of results: 381585

2015
Robert William Wright, Xingye Qiao, Lei Yu, Steven Loscalzo

Fitted Q-Iteration (FQI) is a popular approximate value iteration (AVI) approach that makes effective use of off-policy data. FQI uses a 1-step return value update which does not exploit the sequential nature of trajectory data. Complex returns (weighted averages of the n-step returns) use trajectory data more effectively, but have not been used in an AVI context because of off-policy bias. In ...
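The n-step and complex returns this abstract refers to are straightforward to compute from a stored trajectory. Below is a minimal sketch, not the paper's FQI integration: the `rewards`/`values` layout and the geometric (lambda-style) weighting are illustrative assumptions.

```python
import numpy as np

def n_step_return(rewards, values, t, n, gamma=0.99):
    """n-step return from step t: n discounted rewards, then a bootstrap
    from the value estimate at step t+n. `values` holds one entry per
    state (len(rewards) + 1 entries, 0.0 for the terminal state)."""
    n = min(n, len(rewards) - t)
    g = sum(gamma**k * rewards[t + k] for k in range(n))
    return g + gamma**n * values[t + n]

def complex_return(rewards, values, t, lam=0.9, gamma=0.99):
    """A complex return: a weighted average of all n-step returns from t.
    Geometric weights in lam are one common choice, not the paper's."""
    ns = range(1, len(rewards) - t + 1)
    w = np.array([lam**(n - 1) for n in ns])
    w /= w.sum()                       # normalize: an average, not a sum
    g = np.array([n_step_return(rewards, values, t, n, gamma) for n in ns])
    return float(w @ g)
```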

2015
Iain Frame, Sarah Cant

In this video Q&A, we talk to Iain Frame and Sarah Cant from Prostate Cancer UK about the current challenges in prostate cancer research and policy and how these are being addressed.

Journal: CoRR, 2017
Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, Richard S. Sutton

Unifying seemingly disparate algorithmic ideas to produce better performing algorithms has been a longstanding goal in reinforcement learning. As a primary example, TD(λ) elegantly unifies one-step TD prediction with Monte Carlo methods through the use of eligibility traces and the trace-decay parameter λ. Currently, there are a multitude of algorithms that can be used to perform TD control, in...
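The unification described here is easiest to see in tabular TD(λ) for prediction: λ = 0 reduces to one-step TD, and λ = 1 approaches a Monte Carlo update. A minimal sketch, assuming a hypothetical episodic `env` with `reset()`/`step(a)` and a callable `policy`; it is not any of the control algorithms the paper goes on to unify.

```python
from collections import defaultdict

def td_lambda(env, policy, num_episodes, alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) with accumulating eligibility traces.
    lam=0 gives one-step TD(0); lam=1 approaches Monte Carlo."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        e = defaultdict(float)            # eligibility traces, reset each episode
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]
            e[s] += 1.0                   # accumulating trace for the visited state
            for x in list(e):             # credit all recently visited states
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam       # decay the trace
            s = s2
    return V
```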

Thesis: Ministry of Science, Research and Technology - University of Tabriz - Faculty of Physics, 1393 (2014-15)

In this dissertation we present a construction of new q-Hermite polynomials together with a complete characterization of their main properties, and then derive the algebra of the associated raising and lowering operators. We then introduce another family of q-Hermite polynomials, denoted h_n(x, s|q), and examine the important properties of these polynomials. In addition, another family of Her...
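For orientation, the classical continuous q-Hermite polynomials that such constructions generalize are defined by a three-term recurrence; the thesis's new families and the h_n(x, s|q) variant are not reproduced here.

```latex
% Continuous q-Hermite polynomials H_n(x|q): three-term recurrence,
% a q-deformation of the ordinary Hermite polynomials
% (recovered in a suitable q -> 1 limit).
\[
  H_{n+1}(x \mid q) = 2x\,H_n(x \mid q) - (1 - q^n)\,H_{n-1}(x \mid q),
  \qquad H_0(x \mid q) = 1, \quad H_1(x \mid q) = 2x .
\]
```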

2000
Doina Precup, Richard S. Sutton, Satinder P. Singh

Eligibility traces have been shown to speed reinforcement learning, to make it more robust to hidden states, and to provide a link between Monte Carlo and temporal-difference methods. Here we generalize eligibility traces to off-policy learning, in which one learns about a policy different from the policy that generates the data. Off-policy methods can greatly multiply learning, as many policie...
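One way to carry out the generalization this paper studies is to weight the eligibility trace by per-decision importance-sampling ratios pi(a|s)/b(a|s). A simplified tabular sketch of that idea, not the paper's exact algorithms: `pi(a, s)` and `b(a, s)` return action probabilities, `b.sample(s)` draws an action, and the `env` interface is assumed.

```python
from collections import defaultdict

def off_policy_td_lambda(env, pi, b, num_episodes,
                         alpha=0.1, gamma=0.99, lam=0.9):
    """Evaluate target policy pi from trajectories generated by behavior
    policy b, correcting eligibility traces with per-decision
    importance-sampling ratios."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        e = defaultdict(float)             # eligibility traces
        s, done = env.reset(), False
        while not done:
            a = b.sample(s)                # data comes from the behavior policy
            rho = pi(a, s) / b(a, s)       # per-decision importance ratio
            s2, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s2]) - V[s]
            for x in list(e):
                e[x] *= gamma * lam        # usual trace decay
            e[s] += 1.0
            for x in list(e):
                e[x] *= rho                # off-policy correction of the trace
                V[x] += alpha * delta * e[x]
            s = s2
    return V
```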

Journal: Inf. Sci., 2011
Kao-Shing Hwang, Hsin-Yi Lin, Yuan-Pao Hsu, Hung-Hsiu Yu

This work describes a novel algorithm that integrates an adaptive resonance method (ARM), i.e. an ART-based algorithm with a self-organized design, and a Q-learning algorithm. By dynamically adjusting the size of sensitivity regions of each neuron and adaptively eliminating one of the redundant neurons, ARM can preserve resources, i.e. available neurons, to accommodate additional categories. As...
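The Q-learning half of the integration is the standard tabular update; the ART-based categorization that supplies discrete states is too involved for a short sketch, so it appears below only as a hypothetical `categorize` stub.

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> estimated return

def q_update(s, a, r, s2, actions, alpha=0.1, gamma=0.95):
    """Standard Q-learning step:
    Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Hypothetical glue to an ARM-style categorizer (a stand-in, not the paper's ARM):
# s, s2 = categorize(obs), categorize(next_obs)   # ART network: input -> category
# q_update(s, a, r, s2, actions)
```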

2011
Waheed Iqbal, Matthew N. Dailey, David Carrera

Two experiments executed in three phases:

- Exploration: we initialize an empty policy and let the system learn in real time.
- Exploitation: the agent simply uses the previously learned policy to automatically resolve bottlenecks.
- Baseline: scale up every replicable tier every time a bottleneck occurs.

The workload pattern modeling method learns a clustering model: update the Q-value for each act...
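As the phase list above suggests, the same agent can run in both modes behind a single flag: epsilon-greedy selection while the Q-table is still being learned, purely greedy once it is frozen. A minimal sketch; the scaling actions and state encoding for the replicable tiers are assumptions.

```python
import random

def select_action(Q, s, actions, phase, epsilon=0.2):
    """Exploration: epsilon-greedy over a Q-table still being learned.
    Exploitation: act greedily on the previously learned, frozen policy."""
    if phase == "exploration" and random.random() < epsilon:
        return random.choice(actions)     # e.g. scale a randomly chosen tier
    return max(actions, key=lambda a: Q.get((s, a), 0.0))
```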

Journal: Journal of Machine Learning Research, 2002
John N. Tsitsiklis

We consider a finite-state Markov decision problem and establish the convergence of a special case of optimistic policy iteration that involves Monte Carlo estimation of Q-values, in conjunction with greedy policy selection. We provide convergence results for a number of algorithmic variations, including one that involves temporal difference learning (bootstrapping) instead of Monte Carlo estim...
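A compact sketch of the scheme the abstract studies: Q-values estimated by Monte Carlo rollouts under the current policy, followed by greedy policy selection. Everything here (the `env.step_from` interface, exploring starts over all state-action pairs, a fixed rollout count) is an illustrative assumption, not the paper's exact variant.

```python
from collections import defaultdict

def mc_rollout(env, s, a, policy, gamma=0.99, max_steps=200):
    """Return of one episode that starts with action a in state s,
    then follows the current policy. env.step_from is hypothetical."""
    s2, r, done = env.step_from(s, a)
    g, disc, steps = r, gamma, 0
    while not done and steps < max_steps:
        s2, r, done = env.step_from(s2, policy[s2])
        g += disc * r
        disc *= gamma
        steps += 1
    return g

def optimistic_policy_iteration(env, states, actions, num_iters, rollouts=10):
    """Monte Carlo Q-value estimation + greedy policy selection."""
    policy = {s: actions[0] for s in states}
    for _ in range(num_iters):
        Q = defaultdict(float)
        for s in states:                  # exploring starts: every (s, a) pair
            for a in actions:
                Q[(s, a)] = sum(mc_rollout(env, s, a, policy)
                                for _ in range(rollouts)) / rollouts
        for s in states:                  # greedy selection from the MC estimates
            policy[s] = max(actions, key=lambda a: Q[(s, a)])
    return policy
```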
