1. Reinforcement learning involves an agent learning through trial-and-error interactions with an environment. The agent learns a policy for how to act by maximizing rewards.
2. The document outlines the key elements of reinforcement learning, including states, actions, rewards, and value functions, and surveys methods for solving reinforcement learning problems: dynamic programming, Monte Carlo methods, and temporal-difference learning.
3. Temporal-difference learning combines the advantages of Monte Carlo methods and dynamic programming: like Monte Carlo methods it learns directly from experience, and like dynamic programming it learns incrementally by bootstrapping from existing value estimates.
2. Outline
• Introduction
• Elements of reinforcement learning
• The reinforcement learning problem
• Problem-solving methods for RL
3. Introduction
Machine learning: definition
• Machine learning is a scientific discipline concerned with the design and development of algorithms that allow computers to learn from data, such as sensor data or databases.
• A major focus of machine learning research is to automatically learn to recognize complex patterns and make intelligent decisions based on data.
4. Machine Learning Types
With respect to the type of feedback given to the learner:
• Supervised learning: task driven (classification)
• Unsupervised learning: data driven (clustering)
• Reinforcement learning:
  - Close to human learning.
  - The algorithm learns a policy for how to act in a given environment.
  - Every action has some impact on the environment, and the environment provides rewards that guide the learning algorithm.
5. Supervised Learning vs Reinforcement Learning
Supervised learning:
Step 1
Teacher: Does picture 1 show a car or a flower?
Learner: A flower.
Teacher: No, it's a car.
Step 2
Teacher: Does picture 2 show a car or a flower?
Learner: A car.
Teacher: Yes, it's a car.
Step 3 ...
6. Supervised Learning vs Reinforcement Learning (cont.)
Reinforcement learning:
Step 1
World: You are in state 9. Choose action A or C.
Learner: Action A.
World: Your reward is 100.
Step 2
World: You are in state 32. Choose action B or E.
Learner: Action B.
World: Your reward is 50.
Step 3 ...
7. Introduction (cont.)
• Meaning of reinforcement: the occurrence of an event, in the proper relation to a response, that tends to increase the probability that the response will occur again in the same situation.
• Reinforcement learning is the problem faced by an agent that learns behavior through trial-and-error interactions with a dynamic environment.
• Reinforcement learning is learning how to act in order to maximize a numerical reward.
8. Introduction (cont.)
• Reinforcement learning is not a type of neural network, nor is it an alternative to neural networks. Rather, it is an orthogonal approach to machine learning.
• Reinforcement learning emphasizes learning from feedback that evaluates the learner's performance without providing standards of correctness in the form of behavioral targets.
Example: learning to ride a bicycle.
9. Elements of reinforcement learning
[Diagram: the agent selects an action according to its policy; the environment returns a new state and a reward.]
• Agent: the intelligent program
• Environment: the external conditions
• Policy:
  - Defines the agent's behavior at a given time
  - A mapping from states to actions
  - May be a lookup table or a simple function
10. Elements of reinforcement learning (cont.)
• Reward function:
  - Defines the goal in an RL problem.
  - The policy is altered to achieve this goal.
• Value function:
  - The reward function indicates what is good in an immediate sense, while the value function specifies what is good in the long run.
  - The value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
• Model of the environment:
  - Mimics (predicts) the behavior of the environment.
  - Used for planning: given the current state and an action, it predicts the resultant next state and next reward.
12. Steps in Reinforcement Learning
1. The agent observes an input state.
2. An action is determined by a decision-making function (the policy).
3. The action is performed.
4. The agent receives a scalar reward or reinforcement from the environment.
5. Information about the reward given for that state/action pair is recorded.
A minimal code sketch of this loop follows.
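The slides give no code, but the five steps map onto a short Python loop. This is a minimal sketch only; the env.reset/env.step interface and the policy and learner objects are illustrative assumptions (loosely modeled on common RL toolkits), not anything defined in the deck.

    # Minimal agent-environment interaction loop (hypothetical interface).
    def run_episode(env, policy, learner):
        state = env.reset()                      # 1. observe an input state
        done = False
        while not done:
            action = policy(state)               # 2. the policy chooses an action
            next_state, reward, done = env.step(action)   # 3-4. act, receive a scalar reward
            learner.record(state, action, reward)         # 5. record reward for the state/action pair
            state = next_state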
13. Salient Features of Reinforcement Learning
• A set of problems rather than a set of techniques.
• The task is specified without specifying how it is to be achieved.
• "RL as a tool" point of view:
  - RL is training by rewards and punishments.
  - A training tool for computer learning.
• The learning agent's point of view:
  - RL is learning from trial and error with the world.
  - E.g., how much reward will I get if I take this action?
14. Reinforcement Learning (cont.)
• Reinforcement learning uses evaluative feedback.
• Purely evaluative feedback:
  - Indicates how good the action taken is.
  - Does not tell whether it is the best or the worst action possible.
  - The basis of methods for function optimization.
• Purely instructive feedback:
  - Indicates the correct action to take, independently of the action actually taken.
  - E.g., supervised learning.
• Tasks may be associative or non-associative (next slide).
15. Associative and Non-Associative Tasks
• Associative:
  - Situation dependent.
  - A mapping from situations to the actions that are best in those situations.
• Non-associative:
  - Situation independent.
  - No need to associate different actions with different situations.
  - The learner either tries to find a single best action when the task is stationary, or tries to track the best action as it changes over time when the task is nonstationary.
16. Reinforcement Learning (cont.)
Exploration and exploitation:
• Greedy action: the action with the greatest estimated value.
• Choosing a greedy action is a case of exploitation.
• Choosing a non-greedy action is a case of exploration, as it enables us to improve our estimate of that action's value.
• N-armed bandit problem: we choose repeatedly from n different options or actions, aiming to pick the one with the maximum expected reward.
17. Bandits Problem
[Figure: a row of one-armed bandit "arms" (slot machines).]
Pull the arms sequentially so as to maximize the total expected reward.
• The task is non-associative and the feedback is evaluative.
A minimal bandit simulation is sketched below.
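To make the later action-selection sketches concrete, here is a minimal n-armed bandit in Python. The class name and the Gaussian reward model are assumptions for illustration, not from the slides.

    import random

    class NArmedBandit:
        # Stationary n-armed bandit: each arm pays a noisy reward around
        # a fixed hidden mean (Gaussian model assumed for this sketch).
        def __init__(self, n, seed=0):
            self.rng = random.Random(seed)
            self.means = [self.rng.gauss(0.0, 1.0) for _ in range(n)]

        def pull(self, arm):
            # Reward = hidden mean of the chosen arm plus unit Gaussian noise.
            return self.means[arm] + self.rng.gauss(0.0, 1.0)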
18. Action Selection Policies
Greediest action: the action with the highest estimated reward.
• ε-greedy:
  - Most of the time the greediest action is chosen.
  - Every once in a while, with small probability ε, an action is selected at random, uniformly and independently of the action-value estimates.
• ε-soft: the best action is selected with probability (1 − ε), and the rest of the time a random action is chosen uniformly.
19. ε-Greedy Action Selection Method
Let a* be the greedy action at time t and Q_t(a) be the estimated value of action a at time t.
Greedy action selection:
    a_t = a_t* = argmax_a Q_t(a)
ε-greedy selection:
    a_t = a_t* with probability 1 − ε
    a_t = a random action with probability ε
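A minimal Python sketch of ε-greedy selection, assuming q is a list of sample-average value estimates such as those maintained for the bandit above:

    import random

    def epsilon_greedy(q, epsilon, rng=random):
        # With probability epsilon explore uniformly at random;
        # otherwise exploit the current argmax of the value estimates.
        if rng.random() < epsilon:
            return rng.randrange(len(q))
        return max(range(len(q)), key=lambda a: q[a])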
20. Action Selection Policies (cont.)
• Softmax:
  - Drawback of ε-greedy and ε-soft: they select among random actions uniformly.
  - Softmax remedies this by:
    - Assigning a weight to each action according to its action-value estimate.
    - Selecting a random action with regard to the weight associated with each action.
    - The worst actions are then unlikely to be chosen.
  - This is a good approach to take where the worst actions are very unfavorable.
21. Softmax Action Selection (cont.)
• Problem with ε-greedy: it neglects the action values when exploring.
• Softmax idea: grade the action probabilities by the estimated values.
• Gibbs (Boltzmann) action selection, or exponential weights:
    P(a) = e^(Q_t(a)/τ) / Σ_b e^(Q_t(b)/τ)
  where τ is the "computational temperature".
• As τ → 0, softmax action selection becomes the same as greedy action selection.
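A sketch of softmax (Boltzmann) selection under the same assumed q list of value estimates:

    import math, random

    def softmax_action(q, tau, rng=random):
        # Weight each action by exp(Q(a)/tau), then sample in proportion.
        weights = [math.exp(v / tau) for v in q]
        return rng.choices(range(len(q)), weights=weights, k=1)[0]

As tau grows the choice approaches uniform random selection; as tau shrinks toward 0 it approaches greedy selection, matching the slide's limit statement.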
22. Incremental Implementation
• Sample average:
    Q_t(a) = (r_1 + r_2 + ... + r_k) / k, where k is the number of times action a has been taken
• Incremental computation:
    Q_{k+1} = Q_k + (1 / (k + 1)) [r_{k+1} − Q_k]
• Common update rule form:
    NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]
• The expression [Target − OldEstimate] is the error in the estimate.
• It is reduced by taking a step toward the target.
• When processing the (t + 1)-st reward for action a, the step-size parameter is 1/(t + 1).
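The incremental form translates directly into an update that needs only a running count and the old estimate, as in this sketch (the list-based layout of q and counts is assumed):

    def update_estimate(q, counts, action, reward):
        # Incremental sample average: Q_{k+1} = Q_k + (1/(k+1)) * (r_{k+1} - Q_k).
        counts[action] += 1
        q[action] += (reward - q[action]) / counts[action]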
23. Some Terms in Reinforcement Learning
• The agent learns a policy:
  - The policy at step t, π_t, is a mapping from states to action probabilities:
      π_t(s, a) = the probability that a_t = a when s_t = s
  - Agents change their policy with experience.
  - Objective: get as much reward as possible over the long run.
• Goals and rewards:
  - A goal should specify what we want to achieve, not how we want to achieve it.
24. Some Terms in RL (cont.)
• Returns:
  - Reward in the long term.
  - Episodes: subsequences of agent-environment interaction, e.g., plays of a game or trips through a maze.
• Discounted return:
  - The geometrically discounted model of return:
      R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... = Σ_{k=0..∞} γ^k r_{t+k+1}, with 0 ≤ γ ≤ 1
  - Used to:
    - Determine the present value of future rewards.
    - Give more weight to earlier rewards.
25. The Markov Property
• The environment's response at time t + 1 depends only on the state s and action a at time t.
Markov decision processes:
• Finite MDP:
  - Transition probabilities:
      P(s' | s, a) = Pr{ s_{t+1} = s' | s_t = s, a_t = a }
  - Expected reward:
      R(s, a, s') = E{ r_{t+1} | s_t = s, a_t = a, s_{t+1} = s' }
• Transition graph: summarizes the dynamics of a finite MDP.
26. Value Function
• A function of states (or state-action pairs) that estimates how good it is for the agent to be in a given state.
• Types of value function:
  - State-value function V^π(s): the expected return when starting in s and following policy π.
  - Action-value function Q^π(s, a): the expected return when starting in s, taking action a, and thereafter following π.
27. Backup Diagrams
• The basis of the update or backup operations.
• Backup diagrams provide graphical summaries of the algorithms (for both the state-value and action-value functions).
Bellman equation for a policy π:
    V^π(s) = Σ_a π(s, a) Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V^π(s') ]
28. Model-Free and Model-Based Learning
• Model-based learning:
  - Learn from a model instead of interacting with the world.
  - Can visit arbitrary parts of the model.
  - Also called indirect methods.
  - E.g., dynamic programming.
• Model-free learning:
  - Sample the reward and transition functions by interacting with the world.
  - Also called direct methods.
29. Problem-Solving Methods for RL
1) Dynamic programming
2) Monte Carlo methods
3) Temporal-difference learning.
30. 1. Dynamic Programming
• The classical solution method.
• Requires a complete and accurate model of the environment.
• Popular methods for dynamic programming (a code sketch of the first follows):
  - Policy evaluation: iterative computation of the value function for a given policy (the prediction problem), using the backup
      V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }
  - Policy improvement: computation of an improved policy for a given value function.
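A tabular sketch of iterative policy evaluation under an assumed representation: P[s][a] is a list of (probability, next_state) pairs, R[s][a][s2] is the expected reward, and policy[s][a] is an action probability. None of these names come from the slides.

    def policy_evaluation(states, actions, P, R, policy, gamma=0.9, theta=1e-6):
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                # Full backup: expectation over the policy's actions and transitions.
                v = sum(policy[s][a] * sum(p * (R[s][a][s2] + gamma * V[s2])
                                           for p, s2 in P[s][a])
                        for a in actions)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < theta:
                return V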
31. Dynamic Programming
[Backup diagram for DP: from state s_t the backup branches over every available action and every possible successor state s_{t+1}, with reward r_{t+1} on each transition; terminal states are marked T.]
    V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }
40. Generalized Policy Iteration (GPI)
GPI consists of two interacting processes (see the sketch below):
• Policy evaluation: making the value function consistent with the current policy.
• Policy improvement: making the policy greedy with respect to the current value function.
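The improvement half of GPI can be sketched with the same assumed tabular model: greedify the policy by a one-step lookahead on the current value function V.

    def policy_improvement(states, actions, P, R, V, gamma=0.9):
        policy = {}
        for s in states:
            # Choose the action with the best one-step lookahead value.
            best = max(actions,
                       key=lambda a: sum(p * (R[s][a][s2] + gamma * V[s2])
                                         for p, s2 in P[s][a]))
            policy[s] = {a: 1.0 if a == best else 0.0 for a in actions}
        return policy

Alternating policy_evaluation and policy_improvement until the policy stops changing is the GPI loop.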
41. 2. Monte Carlo Methods
Features of Monte Carlo methods:
• No need for complete knowledge of the environment.
• Based on averaging the sample returns observed after visits to a state.
• Experience is divided into episodes.
• Value estimates and policies are changed only after an episode completes.
• Do not require a model.
• Not suited for step-by-step incremental computation.
42. MC and DP Methods
• Compute the same value functions.
• Same steps as in DP:
  - Policy evaluation: computation of V^π and Q^π for a fixed arbitrary policy π.
  - Policy improvement.
  - Generalized policy iteration.
43. To Find the Value of a State
• Estimate it from experience: average the returns observed after visits to that state.
• The more returns are observed, the closer the average converges to the expected value (see the sketch below).
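A sketch of this averaging, assuming each episode is a list of (state, reward) pairs where the reward follows the state; the every-visit variant is shown.

    from collections import defaultdict

    def mc_state_values(episodes, gamma=0.9):
        totals, counts = defaultdict(float), defaultdict(int)
        for episode in episodes:
            G = 0.0
            # Walk backward so G is the discounted return from each state.
            for state, reward in reversed(episode):
                G = reward + gamma * G
                totals[state] += G
                counts[state] += 1
        return {s: totals[s] / counts[s] for s in totals}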
46. In MCES (Monte Carlo with Exploring Starts), all returns for each state-action pair are accumulated and averaged, irrespective of what policy was in force when they were observed.
48. Monte Carlo Methods (cont.)
• All actions should be selected infinitely often for optimization.
• MC with exploring starts cannot converge to any suboptimal policy.
• Agents have two approaches:
  - On-policy
  - Off-policy
52. Monte Carlo and Dynamic Programming
MC has several advantages over DP:
• Can learn from interaction with the environment.
• No need for full models.
• No need to learn about ALL states.
• No bootstrapping.
53. 3. Temporal Difference (TD) Methods
• Learn from experience, like MC:
  - Can learn directly from interaction with the environment.
  - No need for full models.
• Estimate values based on the estimated values of next states, like DP.
• Bootstrapping (like DP).
• Issue to watch for: maintaining sufficient exploration.
54. TD Prediction
Policy evaluation (the prediction problem): for a given policy π, compute the state-value function V^π.
Simple every-visit Monte Carlo method:
    V(s_t) ← V(s_t) + α [ R_t − V(s_t) ]
    (target: R_t, the actual return after time t)
The simplest TD method, TD(0):
    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
    (target: r_{t+1} + γ V(s_{t+1}), an estimate of the return)
A code sketch follows.
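A sketch of TD(0) prediction over one episode, reusing the hypothetical env/policy interface from the earlier loop; V is assumed to be a pre-initialized dict of state values.

    def td0_episode(env, policy, V, alpha=0.1, gamma=0.9):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD target bootstraps from the estimated value of the next state.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state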
56. Simplest TD Method
[Backup diagram for TD(0): a single sampled transition from s_t, via reward r_{t+1}, to s_{t+1}; terminal states are marked T.]
    V(s_t) ← V(s_t) + α [ r_{t+1} + γ V(s_{t+1}) − V(s_t) ]
57. Dynamic Programming
[Backup diagram for DP, repeated for comparison: from state s_t the backup branches over all actions and all successor states s_{t+1}.]
    V(s_t) ← E_π{ r_{t+1} + γ V(s_{t+1}) }
58. TD Methods Bootstrap and Sample
• Bootstrapping: the update involves an existing estimate.
  - MC does not bootstrap.
  - DP bootstraps.
  - TD bootstraps.
• Sampling: the update does not involve an expected value.
  - MC samples.
  - DP does not sample.
  - TD samples.
59. Learning an Action-Value Function
Estimate Q^π for the current behavior policy π.
After every transition from a nonterminal state s_t, do:
    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
If s_{t+1} is terminal, then Q(s_{t+1}, a_{t+1}) = 0.
60. Q-Learning: Off-Policy TD Control
One-step Q-learning (code sketch below):
    Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]
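A sketch of the one-step Q-learning update, with Q assumed to be a dict of per-state dicts of action values:

    def q_learning_step(Q, state, action, reward, next_state, done,
                        alpha=0.1, gamma=0.9):
        # Off-policy: the target uses the max over next actions,
        # regardless of which action the behavior policy will actually take.
        best_next = 0.0 if done else max(Q[next_state].values())
        target = reward + gamma * best_next
        Q[state][action] += alpha * (target - Q[state][action])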
61. Sarsa: On-Policy TD Control
S A R S A: State, Action, Reward, State, Action.
Turn the TD prediction method into a control method by always updating the policy to be greedy with respect to the current estimate, as in the sketch below.
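A sketch of one Sarsa episode; select_action (e.g., ε-greedy over Q) supplies the behavior policy, which is why the update is on-policy. The env interface is the same hypothetical one used above.

    def sarsa_episode(env, Q, select_action, alpha=0.1, gamma=0.9):
        state = env.reset()
        action = select_action(Q, state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = None if done else select_action(Q, next_state)
            # On-policy target: uses the action actually selected next.
            target = reward + (0.0 if done else gamma * Q[next_state][next_action])
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action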
62. Advantages of TD Learning
• TD methods do not require a model of the environment, only experience.
• TD methods, unlike MC methods, can be fully incremental:
  - You can learn before knowing the final outcome:
    - Less memory.
    - Less peak computation.
  - You can learn without the final outcome:
    - From incomplete sequences.
• Both MC and TD converge.
63. References
• R. S. Sutton and A. G. Barto, "Reinforcement Learning: An Introduction."
• University of Alberta: http://www.cs.ualberta.ca/~sutton/book/ebook/node1.html
• University of New South Wales: http://www.cse.unsw.edu.au/~cs9417ml/RL1/tdlearning.html