1 
Reinforcement Learning 
By: 
Chandra Prakash 
IIITM Gwalior
2
Outline
- Introduction
- Elements of reinforcement learning
- The reinforcement learning problem
- Problem-solving methods for RL
2
3
Introduction
Machine learning: Definition
- Machine learning is a scientific discipline concerned with the design and
development of algorithms that allow computers to learn from data, such as
sensor data or databases.
- A major focus of machine learning research is to automatically learn to
recognize complex patterns and make intelligent decisions based on data.
3
4 
Machine learning types:
With respect to the type of feedback given to the learner:
- Supervised learning: task driven (classification)
- Unsupervised learning: data driven (clustering)
- Reinforcement learning:
  - Close to human learning.
  - The algorithm learns a policy of how to act in a given environment.
  - Every action has some impact on the environment, and the environment
    provides rewards that guide the learning algorithm.
5 
Supervised Learning vs. Reinforcement Learning
Supervised Learning
Step 1:
Teacher: Does picture 1 show a car or a flower?
Learner: A flower.
Teacher: No, it's a car.
Step 2:
Teacher: Does picture 2 show a car or a flower?
Learner: A car.
Teacher: Yes, it's a car.
Step 3: ...
6 
Supervised Learning vs. Reinforcement Learning (Cont...)
Reinforcement Learning
Step 1:
World: You are in state 9. Choose action A or C.
Learner: Action A.
World: Your reward is 100.
Step 2:
World: You are in state 32. Choose action B or E.
Learner: Action B.
World: Your reward is 50.
Step 3: ...
7
Introduction (Cont..)
- Meaning of reinforcement: the occurrence of an event, in the proper relation
to a response, that tends to increase the probability that the response will
occur again in the same situation.
- Reinforcement learning is the problem faced by an agent that learns behavior
through trial-and-error interactions with a dynamic environment.
- Reinforcement learning is learning how to act in order to maximize a
numerical reward.
7
8
Introduction (Cont..)
- Reinforcement learning is not a type of neural network, nor is it an
alternative to neural networks. Rather, it is an orthogonal approach to
machine learning.
- Reinforcement learning emphasizes learning from feedback that evaluates the
learner's performance without providing standards of correctness in the form
of behavioral targets.
Example: learning to ride a bicycle
8
9
Elements of reinforcement learning
[Diagram: the agent, following its policy, sends an action to the environment;
the environment returns a new state and a reward to the agent.]
īą Agent: Intelligent programs 
īą Environment: External condition 
īą Policy: 
īąDefines the agent’s behavior at a given time 
īąA mapping from states to actions 
īąLookup tables or simple function 
9
10
Elements of reinforcement learning
- Reward function:
  - Defines the goal in an RL problem
  - The policy is altered to achieve this goal
- Value function:
  - The reward function indicates what is good in an immediate sense, while a
    value function specifies what is good in the long run.
  - The value of a state is the total amount of reward an agent can expect to
    accumulate over the future, starting from that state.
- Model of the environment:
  - Predicts (mimics) the behavior of the environment
  - Used for planning: given the current state and action, it predicts the
    resultant next state and next reward.
10
11
Agent-Environment Interface
11
12
Steps for Reinforcement Learning 
1. The agent observes an input state 
2. An action is determined by a decision making 
function (policy) 
3. The action is performed 
4. The agent receives a scalar reward or reinforcement 
from the environment 
5. Information about the reward given for that state / 
action pair is recorded 
12
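The five steps above form a loop. A minimal sketch of that loop in Python is
shown below; env, policy, and learner are hypothetical objects used only for
illustration, not a specific API from the slides.

```python
def run_episode(env, policy, learner):
    """One pass of the agent-environment loop (illustrative sketch only)."""
    state = env.reset()                                  # 1. observe the input state
    done = False
    while not done:
        action = policy(state)                           # 2. decision-making function (policy)
        next_state, reward, done = env.step(action)      # 3-4. perform action, receive scalar reward
        learner.record(state, action, reward, next_state)  # 5. record reward for the state/action pair
        state = next_state
```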
13
Salient Features of Reinforcement Learning:
- A set of problems rather than a set of techniques
- The goal is specified without specifying how the task is to be achieved.
- "RL as a tool" point of view:
  - RL is training by rewards and punishments.
  - A training tool for computer learning.
- The learning agent's point of view:
  - RL is learning from trial and error with the world.
  - E.g., how much reward will I get if I take this action?
13
14
Reinforcement Learning (Cont..)
- Reinforcement learning uses evaluative feedback.
- Purely evaluative feedback:
  - Indicates how good the action taken is.
  - Does not tell whether it is the best or the worst action possible.
  - The basis of methods for function optimization.
- Purely instructive feedback:
  - Indicates the correct action to take, independently of the action actually
    taken.
  - E.g., supervised learning
- Associative and non-associative tasks
14
15 
Associative & Non-Associative Tasks
- Associative:
  - Situation dependent
  - A mapping from situations to the actions that are best in those situations
- Non-associative:
  - Situation independent
  - No need to associate different actions with different situations.
  - The learner either tries to find a single best action when the task is
    stationary, or tries to track the best action as it changes over time when
    the task is non-stationary.
16
Reinforcement Learning (Cont..)
Exploration and exploitation
- Greedy action: the action chosen with the greatest estimated value.
- A greedy action is a case of exploitation.
- A non-greedy action is a case of exploration, as it enables us to improve
the estimate of the non-greedy action's value.
- N-armed bandit problem:
  - We have to choose repeatedly among n different options or actions.
  - We want to choose so as to maximize the expected reward.
16
Bandits Problem
[Figure: a one-armed bandit (slot machine); the levers are the "arms".]
- Pull arms sequentially so as to maximize the total expected reward.
- It is non-associative and evaluative.
18
Action Selection Policies
- Greediest action: the action with the highest estimated reward.
- Ɛ-greedy:
  - Most of the time the greediest action is chosen.
  - Every once in a while, with a small probability Ɛ, an action is selected at
    random, uniformly and independently of the action-value estimates.
- Ɛ-soft: the best action is selected with probability (1 - Ɛ) and the rest of
the time a random action is chosen uniformly.
18
19
Ɛ-Greedy Action Selection Method:
Let a* be the greedy action at time t and Qt(a) the estimated value of action
a at time t.
Greedy action selection:
    at = at* = argmax_a Qt(a)
Ɛ-greedy:
    at = at*             with probability 1 - Ɛ
    at = a random action with probability Ɛ
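A minimal Python sketch of the Ɛ-greedy rule above (illustrative only; the
function name and the representation of Qt(a) as a list are assumptions, not
from the slides):

```python
import random

def epsilon_greedy(Q, epsilon):
    """Pick an action index given value estimates Q (a list of Qt(a) values).

    With probability epsilon choose uniformly at random (exploration);
    otherwise choose the greedy action a* = argmax_a Qt(a) (exploitation).
    """
    if random.random() < epsilon:
        return random.randrange(len(Q))              # explore
    return max(range(len(Q)), key=lambda a: Q[a])    # exploit
```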
20 
Action Selection Policies (Cont...)
- Softmax:
  - Drawback of Ɛ-greedy and Ɛ-soft: random actions are selected uniformly.
  - Softmax remedies this by:
    - Assigning a weight to each action according to its action-value estimate.
    - Selecting a random action with probability proportional to the weight
      associated with each action.
    - The worst actions are then unlikely to be chosen.
  - This is a good approach to take when the worst actions are very
    unfavorable.
21 
Softmax Action Selection (Cont...)
- Problem with Ɛ-greedy: it neglects the action values when exploring.
- Softmax idea: grade action probabilities by the estimated values.
- Gibbs, or Boltzmann, action selection (exponential weights):
    P(a) = e^{Qt(a)/τ} / Σ_b e^{Qt(b)/τ}
  where τ is the "computational temperature".
- As τ → 0, softmax action selection becomes the same as greedy action
selection.
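A small sketch of Gibbs/Boltzmann (softmax) selection, under the same
assumptions as the Ɛ-greedy sketch; the tau argument plays the role of the
temperature τ above:

```python
import math
import random

def softmax_action(Q, tau):
    """Boltzmann (Gibbs) action selection.

    Each action a gets weight exp(Q[a] / tau), so higher-valued actions are
    more likely and the worst actions become very unlikely.  As tau -> 0 this
    approaches greedy selection; as tau grows it approaches uniform random.
    """
    weights = [math.exp(q / tau) for q in Q]
    return random.choices(range(len(Q)), weights=weights, k=1)[0]
```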
22 
Incremental Implementation 
īŽ Sample average: 
īŽ Incremental computation: 
īŽ Common update rule form: 
NewEstimate = OldEstimate + StepSize[Target – 
OldEstimate] 
īŽ The expression [ Target - Old Estimate] is an error in the 
estimate. 
īŽ It is reduce by taking a step toward the target. 
īŽ In proceeding the (t+1)st reward for action a the step-size 
parameter will be 1(t+1).
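The update rule as a one-line helper (a sketch; the names are illustrative,
not from the slides):

```python
def update_estimate(old_estimate, target, step_size):
    """NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)."""
    return old_estimate + step_size * (target - old_estimate)

# Sample-average case: when processing the (t+1)st reward for action a,
# use step_size = 1 / (t + 1):
#   Q[a] = update_estimate(Q[a], reward, 1.0 / (t + 1))
```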
Some terms in Reinforcement Learning
- The agent learns a policy:
  - Policy at step t, πt: a mapping from states to action probabilities,
    πt(s, a) = probability that at = a when st = s.
  - Agents change their policy with experience.
  - Objective: get as much reward as possible over the long run.
- Goals and rewards:
  - A goal should specify what we want to achieve, not how we want to achieve
    it.
23
24 
Some terms in RL (Cont...)
- Returns:
  - Rewards in the long term
  - Episodes: subsequences of agent-environment interaction, e.g., plays of a
    game or trips through a maze.
- Discounted return:
  - The geometrically discounted model of return:
      Rt = rt+1 + γ rt+2 + γ² rt+3 + ... ,  0 ≤ γ ≤ 1
  - Used to:
    - Determine the present value of future rewards
    - Give more weight to earlier rewards
25
The Markov Property
- The environment's response at time (t+1) depends only on the state (s) and
action (a) at time t.
Markov Decision Processes
- Finite MDP
  - Transition probabilities: P(s' | s, a) = Pr{st+1 = s' | st = s, at = a}
  - Expected reward: R(s, a, s') = E{rt+1 | st = s, at = a, st+1 = s'}
- Transition graph: summarizes the dynamics of a finite MDP.
25
Value function 
īŽ States-action pairs function that estimate how good it is 
for the agent to be in a given state 
īŽ Type of value function 
2266 
īą State-Value function 
īą Action-Value function 
26
27
Backup Diagrams
- Basis of the update or backup operations
- Backup diagrams provide graphical summaries of the algorithms (state-value
and action-value functions).
Bellman equation for a policy π:
    Vπ(s) = Σ_a π(s, a) Σ_s' P(s' | s, a) [ R(s, a, s') + γ Vπ(s') ]
27
28 
Model-free and Model-based Learning 
īŽ Model-based learning 
īą Learn from the model instead of interacting with the world 
īą Can visit arbitrary parts of the model 
īą Also called indirect methods 
īą E.g. Dynamic Programming 
īŽ Model-free learning 
īą Sample Reward and transition function by interacting with 
the world 
īą Also called direct methods
29 
Problem Solving Methods for RL 
1) Dynamic programming 
2) Monte Carlo methods 
3) Temporal-difference learning.
30
1. Dynamic Programming
- Classical solution method
- Requires a complete and accurate model of the environment.
- Popular methods in dynamic programming:
  - Policy evaluation: iterative computation of the value function for a given
    policy (the prediction problem)
  - Policy improvement: computation of an improved policy for a given value
    function
    V(st) ← Eπ{ rt+1 + γ V(st+1) }
30
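A sketch of iterative policy evaluation under an assumed tabular model
interface (P[s][a] as a list of (probability, next_state) pairs, R[s][a] as
the expected reward, policy[s][a] as action probabilities); this is
illustrative, not the slides' own code:

```python
def policy_evaluation(states, actions, P, R, policy, gamma, theta=1e-6):
    """Iterative policy evaluation: sweep V(s) <- E_pi{ r + gamma*V(s') }
    until the value function changes by less than theta."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = sum(policy[s][a] *
                    (R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a]))
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V
```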
Dynamic Programming
[Backup diagram: from state st, a full-width backup over all actions and all
possible successor states st+1.]
    V(st) ← Eπ{ rt+1 + γ V(st+1) }
31
32
33
34 
īŽ NewEstimate = OldEstimate + 
StepSize[Target – 
OldEstimate] 
NewEstimate = OldEstimate + StepSize[Target – OldEstimate]
Computation of improved policy for a given value function 
35
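A sketch of the greedy policy-improvement step, using the same assumed model
interface as the policy-evaluation sketch above:

```python
def greedy_policy(states, actions, P, R, V, gamma):
    """Policy improvement: pi(s) = argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    return {s: max(actions,
                   key=lambda a: R[s][a] +
                       gamma * sum(p * V[s2] for p, s2 in P[s][a]))
            for s in states}
```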
36 
īŽ Policy Evaluation and Policy improvement 
together leads 
īŽ Policy Iteration 
īŽ Value Iteration:
37
Policy Iteration: alternates policy evaluation and policy improvement steps
37
īą Combines evaluation and improvement in one update 
38
39
Value Iteration: 
Combines evaluation and improvement in one update
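A sketch of value iteration with the same assumed model interface; each sweep
applies V(s) <- max_a [ R(s,a) + gamma * Σ_s' P(s'|s,a) V(s') ]:

```python
def value_iteration(states, actions, P, R, gamma, theta=1e-6):
    """Value iteration: evaluation and improvement combined in one update."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            v = max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in actions)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V  # a policy greedy w.r.t. this V is then (near-)optimal
```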
40 
Generalized Policy Iteration (GPI) 
Consists of two interacting processes:
- Policy evaluation: making the value function consistent with the current
policy
- Policy improvement: making the policy greedy with respect to the current
value function
41
2. Monte Carlo Methods
- Features of Monte Carlo methods:
  - No need for complete knowledge of the environment
  - Based on averaging the sample returns observed after visits to a state
  - Experience is divided into episodes
  - Value estimates and policies are changed only after an episode is complete
  - Do not require a model
  - Not suited for step-by-step incremental computation
41
42
MC and DP Methods
- Compute the same value functions
- Same steps as in DP:
  - Policy evaluation: computation of Vπ and Qπ for a fixed arbitrary policy π
  - Policy improvement
  - Generalized policy iteration
42
43
To find the value of a state
- Estimate it from experience: average the returns observed after visits to
that state.
- The more returns are observed, the more closely the average converges to the
expected value.
43
First-visit Monte Carlo Policy evaluation 
44
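A sketch of first-visit Monte Carlo policy evaluation; the episode format (a
list of (state, reward) pairs, where the reward is the one received on leaving
that state) is an assumption made for illustration:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """Average the return following the first visit to each state."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        first_return = {}
        # Walk the episode backwards so G accumulates the discounted return;
        # the last assignment for a state corresponds to its earliest visit.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_return[state] = G
        for state, G in first_return.items():
            returns[state].append(G)
    return {s: sum(g) / len(g) for s, g in returns.items()}
```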
45
In MC ES (Monte Carlo with exploring starts), all returns for each
state-action pair are accumulated and averaged, irrespective of what policy
was in force when they were observed.
46
47
48
Monte Carlo Methods (Cont...)
- For optimization, all actions should be selected infinitely often.
- MC with exploring starts cannot converge to any suboptimal policy.
- Agents have two approaches:
  - On-policy
  - Off-policy
48
49
50
On-policy MC algorithm 
50
51 
Off-policy MC algorithm
Monte Carlo and Dynamic Programming 
52
- MC has several advantages over DP:
  - Can learn from interaction with the environment
  - No need for full models
  - No need to learn about ALL states
  - No bootstrapping
52
53
3. Temporal Difference (TD) Methods
- Learn from experience, like MC:
  - Can learn directly from interaction with the environment
  - No need for full models
- Estimate values based on the estimated values of next states, like DP
- Bootstrapping (like DP)
- Issue to watch for: maintaining sufficient exploration
53
54
TD Prediction
Policy evaluation (the prediction problem): for a given policy π, compute the
state-value function Vπ.
Simple every-visit Monte Carlo method:
    V(st) ← V(st) + α [ Rt - V(st) ]
  target: Rt, the actual return after time t
The simplest TD method, TD(0):
    V(st) ← V(st) + α [ rt+1 + γ V(st+1) - V(st) ]
  target: rt+1 + γ V(st+1), an estimate of the return
54
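A sketch of one TD(0) backup, assuming V is a dict of state-value estimates
with terminal states valued at 0:

```python
def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """V(st) <- V(st) + alpha * [ rt+1 + gamma * V(st+1) - V(st) ]."""
    target = r if terminal else r + gamma * V.get(s_next, 0.0)
    V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
```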
Simple Monte Carlo
[Backup diagram: from state st, a single sampled episode is followed to the
terminal state; the complete return is backed up to st.]
55
Simplest TD Method
[Backup diagram: from state st, take one sampled step to st+1, observing rt+1,
then back up using the current estimate V(st+1).]
    V(st) ← V(st) + α [ rt+1 + γ V(st+1) - V(st) ]
56
Dynamic Programming
[Backup diagram: from state st, a full-width backup over all actions and all
possible successor states st+1.]
    V(st) ← Eπ{ rt+1 + γ V(st+1) }
57
58
TD methods bootstrap and sample
- Bootstrapping: the update involves an estimate of a successor state's value
  - MC does not bootstrap
  - DP bootstraps
  - TD bootstraps
- Sampling: the update does not involve an expected value (a single sampled
transition is used)
  - MC samples
  - DP does not sample
  - TD samples
58
59
Learning an Action-Value Function
Estimate Qπ for the current behavior policy π.
After every transition from a nonterminal state st, do:
    Q(st, at) ← Q(st, at) + α [ rt+1 + γ Q(st+1, at+1) - Q(st, at) ]
If st+1 is terminal, then Q(st+1, at+1) = 0.
Q-Learning: Off-Policy TD Control 
60 
One-step Q-learning:
    Q(st, at) ← Q(st, at) + α [ rt+1 + γ max_a Q(st+1, a) - Q(st, at) ]
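A sketch of the one-step Q-learning backup, assuming Q is a dict keyed by
(state, action) with missing entries treated as 0:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Q(st,at) <- Q(st,at) + alpha*[ rt+1 + gamma*max_a' Q(st+1,a') - Q(st,at) ]."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next -
                                              Q.get((s, a), 0.0))
```

The behavior policy can still be Ɛ-greedy, but the backed-up value uses the
greedy (max) action, which is what makes Q-learning off-policy.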
61
Sarsa: On-Policy TD Control 
S A R S A: State Action Reward State Action 
61 
Turn this into a control method by always updating the policy 
to be greedy with respect to the current estimate:
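A sketch of the Sarsa backup using the quintuple (st, at, rt+1, st+1, at+1);
at+1 is chosen by the current (e.g., Ɛ-greedy) policy, which is what makes it
on-policy. Same assumed Q representation as the Q-learning sketch:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """Q(st,at) <- Q(st,at) + alpha*[ rt+1 + gamma*Q(st+1,at+1) - Q(st,at) ]."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```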
62
Advantages of TD Learning
- TD methods do not require a model of the environment, only experience.
- TD, but not MC, methods can be fully incremental:
  - You can learn before knowing the final outcome
    - Less memory
    - Less peak computation
  - You can learn without the final outcome
    - From incomplete sequences
- Both MC and TD converge.
62
63 
References: RL
- Univ. of Alberta:
  http://www.cs.ualberta.ca/~sutton/book/ebook/node1.html
- Sutton and Barto, "Reinforcement Learning: An Introduction."
- Univ. of New South Wales:
  http://www.cse.unsw.edu.au/~cs9417ml/RL1/tdlearning.html
64
Thank You 
64
