[course site]
Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politecnica de Catalunya
Technical University of Catalonia
Reinforcement Learning
Day 7 Lecture 2
#DLUPC
2
Acknowledgments
Bellver M, Giró-i-Nieto X, Marqués F, Torres J. Hierarchical Object Detection with Deep Reinforcement Learning. In Deep
Reinforcement Learning Workshop, NIPS 2016. 2016.
3
Acknowledgments
Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
4
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
5
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
6
Motivation
What is Reinforcement Learning?
“a way of programming agents by reward and punishment without needing to
specify how the task is to be achieved”
[Kaelbling, Littman, & Moore, 96]
Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. "Reinforcement learning: A survey." Journal of artificial
intelligence research 4 (1996): 237-285.
Yann LeCun’s Black Forest cake
7
Motivation
We can categorize three types of learning procedures:
1. Supervised Learning:
= ƒ( )
2. Unsupervised Learning:
ƒ( )
3. Reinforcement Learning (RL):
= ƒ( )
8
Predict label y corresponding to
observation x
Estimate the distribution of
observation x
Predict action y based on
observation x, to maximize a future
reward z
Motivation
We can categorize three types of learning procedures:
1. Supervised Learning:
= ƒ( )
2. Unsupervised Learning:
ƒ( )
3. Reinforcement Learning (RL):
= ƒ( )
9
Motivation
10
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
11
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller.
"Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
12 Bernhard Schölkopf, “Learning to see and act”, Nature 2015.
Motivation
13
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
○ Policy
○ Optimal Policy
○ Value Function
○ Q-value function
○ Optimal Q-value function
○ Bellman equation
○ Value iteration algorithm
4. Deep Q-learning
5. RL Frameworks
6. Learn more
14 Figure: UCL Course on RL by David Silver
Architecture
15 Figure: UCL Course on RL by David Silver
Environment
Architecture
16 Figure: UCL Course on RL by David Silver
Environment
state (s_t)
Architecture
17 Figure: UCL Course on RL by David Silver
Environment
state (s_t)
Architecture
18 Figure: UCL Course on RL by David Silver
Environment
Agent
state (s_t)
Architecture
19 Figure: UCL Course on RL by David Silver
Environment
Agent
action (A_t), state (s_t)
Architecture
20 Figure: UCL Course on RL by David Silver
Environment
Agent
action (A_t), state (s_t)
Architecture
21 Figure: UCL Course on RL by David Silver
Environment
Agent
action (A_t), reward (r_t), state (s_t)
Architecture
22 Figure: UCL Course on RL by David Silver
Environment
Agent
action (A_t), reward (r_t), state (s_t)
Architecture
The reward is given to the agent with a delay, with respect to previous states and actions!
23 Figure: UCL Course on RL by David Silver
Environment
Agent
action (A_t), reward (r_t), state (s_t+1)
Architecture
24 Figure: UCL Course on RL by David Silver
Environment
Agent
action (A_t), reward (r_t), state (s_t+1)
Architecture
GOAL: Complete the game with the highest score.
25 Figure: UCL Course on RL by David Silver
Environment
Agent
action (A_t), reward (r_t), state (s_t+1)
Architecture
GOAL: Learn how to take actions to maximize cumulative reward.
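In code form, this agent-environment loop can be sketched as follows (a generic skeleton; the Environment and Agent classes and their method names are illustrative assumptions, not a specific library):

def run_episode(env, agent):
    """Generic sketch of the agent-environment loop."""
    total_reward = 0.0
    state = env.reset()                          # initial state s_0
    done = False
    while not done:
        action = agent.act(state)                # agent picks a_t given s_t
        state, reward, done = env.step(action)   # environment returns s_t+1 and r_t
        agent.observe(state, reward)             # the reward arrives after the action
        total_reward += reward                   # accumulate the (possibly delayed) reward
    return total_reward                          # GOAL: maximize the cumulative reward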
26
Other problems that can be formulated with an RL architecture.
Cart-Pole Problem
Objective: Balance a pole on top of a movable car
Architecture
Slide credit: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
27
Architecture
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Environment
Agent
state (s_t): angle, angular speed, position, horizontal velocity
action (a_t): horizontal force applied to the cart
reward (r_t): 1 at each time step if the pole is upright
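This loop can be run concretely with OpenAI Gym (introduced later in the RL Frameworks section). A minimal sketch with a random policy standing in for the agent; note that Gym’s CartPole-v0 discretizes the horizontal force into left/right pushes:

import gym

env = gym.make("CartPole-v0")
state = env.reset()                  # s_0: [position, velocity, angle, angular speed]
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()             # random a_t: push left or right
    state, reward, done, info = env.step(action)   # environment samples s_t+1 and r_t
    total_reward += reward                         # +1 per step while the pole is upright
print("episode return:", total_reward)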
28
Other problems that can be formulated with an RL architecture.
Robot Locomotion
Objective: Make the robot move forward
Architecture
Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized
advantage estimation." ICLR 2016 [project page]
29
Architecture
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Environment
Agent
state (s_t): angle and position of the joints
action (a_t): torques applied on the joints
reward (r_t): 1 at each time step upright + forward movement
30
Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized
advantage estimation." ICLR 2016 [project page]
31
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
○ Policy
○ Optimal Policy
○ Value Function
○ Q-value function
○ Optimal Q-value function
○ Bellman equation
○ Value iteration algorithm
4. Deep Q-learning
5. RL Frameworks
6. Learn more
32
Markov Decision Processes (MDP)
Markov Decision Processes provide a formalism for reinforcement learning
problems.
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Markov property:
Current state completely
characterises the state of the world.
33
Markov Decision Processes (MDP)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
An MDP is defined by the tuple (S, A, R, P, γ): the set of states S, the set of actions A, the reward distribution R, the transition probability distribution P, and the discount factor γ.
34
Markov Decision Processes (MDP)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
(S, A, R, P, γ)
The environment samples the initial state s_0 ~ p(s_0).
The agent selects an action a_t.
The environment samples the next state s_t+1 ~ P(.|s_t, a_t).
The environment samples the reward r_t ~ R(.|s_t, a_t).
state (s_t), action (a_t), reward (r_t)
35
MDP: Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
(S, A, R, P, γ)
The agent selects its action a_t with a policy π.
A policy π is a function S ➝ A that specifies which action to take in each state.
36
MDP: Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
The agent selects its action a_t with a policy π.
A policy π is a function S ➝ A that specifies which action to take in each state.
Agent GOAL: Learn how to take actions to maximize reward.
MDP GOAL: Find the policy π* that maximizes the cumulative discounted reward: Σ_{t>0} γ^t r_t
37
Other problems that can be formulated with an RL architecture.
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
MDP: Policy
Grid World (a simple MDP)
Objective: reach one of the terminal states (greyed out) in the least number of actions.
38 Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Environment
Agent
action (a_t), reward (r_t), state (s_t)
Each cell is a state.
A negative “reward” (penalty) is given for each transition: r_t = r = -1
MDP: Policy
39 Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
MDP: Policy
Example: Actions resulting from applying a random policy in this Grid World problem.
40 Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Exercise: Draw the actions resulting from applying an optimal policy in this Grid
World problem.
MDP: Optimal Policy π*
41 Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Solution: Draw the actions resulting from applying an optimal policy in this Grid World
problem.
MDP: Optimal Policy π*
42
MDP: Optimal Policy π*
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
How do we handle the randomness (initial state s_0, transition probabilities, actions...)?
GOAL: Find the policy π* that maximizes the cumulative discounted reward: Σ_{t>0} γ^t r_t
The environment samples the initial state s_0 ~ p(s_0).
The agent selects an action a_t ~ π(.|s_t).
The environment samples the next state s_t+1 ~ P(.|s_t, a_t).
The environment samples the reward r_t ~ R(.|s_t, a_t).
state (s_t), action (a_t), reward (r_t)
43 Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
How do we handle the randomness (initial state s_0, transition probabilities, actions)?
GOAL: Find the policy π* that maximizes the cumulative discounted reward: Σ_{t>0} γ^t r_t
The optimal policy π* will maximize the expected cumulative discounted reward:
π* = arg max_π E[ Σ_{t≥0} γ^t r_t | π ], with s_0 ~ p(s_0) (initial state), a_t ~ π(.|s_t) (selected action at t), s_t+1 ~ P(.|s_t, a_t) (sampled state for t+1)
MDP: Optimal Policy π*
44
MDP: Policy: Value Function V^π(s)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
How do we estimate how good a state s is for a given policy π?
With the value function at state s, V^π(s): the expected cumulative reward from following policy π from state s:
V^π(s) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, π ]
45
MDP: Policy: Q-value Function Q^π(s,a)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
How do we estimate how good a state-action pair (s,a) is for a given policy π?
With the Q-value function at state s and action a, Q^π(s,a): the expected cumulative reward from taking action a in state s and then following policy π:
Q^π(s,a) = E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
46
MDP: Policy: Optimal Q-value Function Q*(s,a)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
The optimal Q-value function at state s and action a, Q*(s,a), is the maximum expected cumulative reward achievable from a given (state, action) pair, i.e. we choose the policy that maximizes the expected cumulative reward (the Q-value function from the previous page):
Q*(s,a) = max_π E[ Σ_{t≥0} γ^t r_t | s_0 = s, a_0 = a, π ]
47
MDP: Policy: Bellman Equation
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
The optimal Q-value function Q*(s,a) (from the previous page) satisfies the following Bellman equation:
Q*(s,a) = E_{s'}[ r + γ max_{a'} Q*(s',a') | s, a ]
The maximum expected cumulative reward for the considered pair (s,a) combines the reward r for the considered pair (s,a) with the FUTURE REWARD: the maximum expected cumulative reward for the future pair (s',a'), scaled by the discount factor γ, with an expectation across the possible future states s' (randomness).
48
MDP: Policy: Bellman Equation
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Q*(s,a) satisfies the following Bellman equation:
Q*(s,a) = E_{s'}[ r + γ max_{a'} Q*(s',a') | s, a ]
The optimal policy π* corresponds to taking the best action in any state according to Q*: select the action a' that maximizes the expected cumulative reward.
GOAL: Find the policy π* that maximizes the cumulative discounted reward: Σ_{t>0} γ^t r_t
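For reference, the quantities of this section can be restated compactly in LaTeX (a restatement of the CS231n formulas above, nothing new):

V^{\pi}(s)   = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_{t} \mid s_{0}=s, \pi \right]
Q^{\pi}(s,a) = \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_{t} \mid s_{0}=s, a_{0}=a, \pi \right]
Q^{*}(s,a)   = \max_{\pi} \mathbb{E}\left[ \sum_{t \ge 0} \gamma^{t} r_{t} \mid s_{0}=s, a_{0}=a, \pi \right]
Q^{*}(s,a)   = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^{*}(s',a') \mid s, a \right]
\pi^{*}(s)   = \arg\max_{a} Q^{*}(s,a)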
49
MDP: Policy: Solving the Optimal Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Value iteration algorithm: estimate the Bellman equation (from the previous page) with an iterative update:
Q_{i+1}(s,a) = E[ r + γ max_{a'} Q_i(s',a') | s, a ]
The updated Q-value function Q_{i+1}(s,a) is computed from the current Q-value Q_i(s',a') for the future pair (s',a'). The iterative estimate Q_i(s,a) will converge to the optimal Q*(s,a) as i ➝ ∞.
50
MDP: Policy: Solving the Optimal Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Q_i(s,a) will converge to the optimal Q*(s,a) as i ➝ ∞: the updated Q-value for the current pair (s,a) is built from the current Q-value for the next pair (s',a').
This iterative approach is not scalable, because it requires computing Q_i(s,a) for every state-action pair.
E.g., if the state is the current game pixels, it is computationally unfeasible to compute Q_i(s,a) for the entire state space!
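To make the update concrete, here is a minimal sketch of tabular Q-value iteration in Python on a toy Grid World with r = -1 per transition, as in the example above (the 1-D grid layout, action set and discount factor are illustrative assumptions):

# Tabular Q-value iteration on a toy 1-D grid world: 5 states in a row,
# state 0 is terminal, actions move left/right, reward is -1 per transition.
GAMMA = 0.9
N_STATES = 5
ACTIONS = [-1, +1]               # move left / move right
TERMINAL = {0}

def step(s, a):
    """Deterministic transition: clamp to the grid and return (s', r)."""
    s_next = min(max(s + a, 0), N_STATES - 1)
    return s_next, -1.0          # negative "reward" (penalty) for each transition

Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
for i in range(100):             # Q_{i+1}(s,a) = E[r + gamma * max_a' Q_i(s',a')]
    Q_new = {}
    for s in range(N_STATES):
        for a in ACTIONS:
            if s in TERMINAL:
                Q_new[(s, a)] = 0.0
                continue
            s_next, r = step(s, a)
            Q_new[(s, a)] = r + GAMMA * max(Q[(s_next, b)] for b in ACTIONS)
    Q = Q_new

policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
print(policy)                    # every non-terminal state points toward state 0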
51
MDP: Policy: Solving the Optimal Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
This iterative approach is not scalable, because it requires computing Q(s,a) for every state-action pair.
E.g., if the state is the current game pixels, it is computationally unfeasible to compute Q(s,a) for the entire state space!
Solution: use a deep neural network as a function approximator of Q*(s,a):
Q(s,a,θ) ≈ Q*(s,a)
where θ are the neural network parameters.
52
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
○ Forward and Backward passes
○ DQN
○ Experience Replay
○ Examples
5. RL Frameworks
6. Learn more
○ Coming next…
53
Deep Q-learning
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
The function to approximate is a Q-function that satisfies the Bellman equation:
Q(s,a,θ) ≈ Q*(s,a)
54
Deep Q-learning
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
The function to approximate is a Q-function that satisfies the Bellman equation:
Q(s,a,θ) ≈ Q*(s,a)
Forward Pass
Loss function: L_i(θ_i) = E_{s,a}[ (y_i − Q(s,a;θ_i))² ], where the target y_i = E_{s'}[ r + γ max_{a'} Q(s',a';θ_{i−1}) | s, a ]
Sample an (s,a) pair and predict its Q-value with θ_i; sample a future state s' and predict its Q-value with θ_{i−1}.
55
Deep Q-learning
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Train the DNN to approximate a Q-value function that satisfies the Bellman equation.
56
Deep Q-learning
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
The reward must be computed during training.
57
Deep Q-learning
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Forward Pass
Loss function: L_i(θ_i) = E_{s,a}[ (y_i − Q(s,a;θ_i))² ]
Backward Pass
Gradient update (with respect to the Q-function parameters θ): ∇_{θ_i} L_i(θ_i) = E_{s,a,s'}[ (r + γ max_{a'} Q(s',a';θ_{i−1}) − Q(s,a;θ_i)) ∇_{θ_i} Q(s,a;θ_i) ]
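These two passes make up one training step. Below is a minimal sketch of one such DQN update in Keras (the library adopted later in the RL Frameworks section); the network size, variable names and data layout are illustrative assumptions, with q_net playing the role of θ_i and target_net the role of θ_{i−1}:

from keras.models import Sequential, clone_model
from keras.layers import Dense

n_states, n_actions, gamma = 4, 2, 0.99
q_net = Sequential([Dense(32, activation='relu', input_shape=(n_states,)),
                    Dense(n_actions)])           # outputs Q(s, a) for all actions
q_net.compile(optimizer='adam', loss='mse')      # squared TD error
target_net = clone_model(q_net)                  # frozen copy acting as theta_{i-1}
target_net.set_weights(q_net.get_weights())

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient step on L_i = E[(y_i - Q(s,a;theta_i))^2]."""
    q_values = q_net.predict(states)                      # forward pass with theta_i
    next_q = target_net.predict(next_states).max(axis=1)  # max_a' Q(s',a';theta_{i-1})
    targets = q_values.copy()
    for k in range(len(states)):                          # y_i = r + gamma * max_a' Q
        targets[k, actions[k]] = rewards[k] + (0.0 if dones[k] else gamma * next_q[k])
    q_net.train_on_batch(states, targets)                 # backward pass updates theta_i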
58 Source: Tambet Matiisen, “Demystifying Deep Reinforcement Learning” (Nervana)
Deep Q-learning: Deep Q-Network DQN
Q(s,a,θ) ≈ Q*(s,a)
59 Source: Tambet Matiisen, “Demystifying Deep Reinforcement Learning” (Nervana)
Deep Q-learning: Deep Q-Network DQN
Q(s,a,θ) ≈ Q*(s,a)
Efficiency: a single feed-forward pass computes the Q-values for all actions from the current state.
60
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al.
"Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
Deep Q-learning: Deep Q-Network DQN
The number of actions ranges between 4 and 18, depending on the Atari game.
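As a sketch of this architecture in Keras, the Nature DQN maps a stack of four 84x84 grayscale frames to one Q-value per action; the layer hyperparameters follow Mnih et al. (2015), but treat the code itself as an illustrative assumption rather than the authors' original implementation:

from keras.models import Sequential
from keras.layers import Conv2D, Flatten, Dense

n_actions = 18                                   # 4-18 depending on the Atari game
dqn = Sequential([
    Conv2D(32, (8, 8), strides=4, activation='relu', input_shape=(84, 84, 4)),
    Conv2D(64, (4, 4), strides=2, activation='relu'),
    Conv2D(64, (3, 3), strides=1, activation='relu'),
    Flatten(),
    Dense(512, activation='relu'),
    Dense(n_actions)                             # linear output: Q(s_t, a) per action
])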
61
Deep Q-learning: Deep Q-Network DQN
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Q(s_t, ⬅), Q(s_t, ➡), Q(s_t, ⬆), Q(s_t, ⬇)
62
Deep Q-learning: Experience Replay
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Learning from batches of consecutive samples is problematic:
● Samples are too correlated ➡ inefficient learning.
● The current Q-network parameters determine the next training samples ➡ this can lead to bad feedback loops.
63
Deep Q-learning: Experience Replay
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Experience replay:
● Continually update a replay memory table of transitions (s_t, a_t, r_t, s_t+1) as game (experience) episodes are played.
● Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples.
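Such a replay memory can be sketched in a few lines of Python (the capacity and batch size below are illustrative assumptions):

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)         # oldest transitions are evicted

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))  # one (s_t, a_t, r_t, s_t+1) entry

    def sample(self, batch_size=32):
        """Random minibatch of transitions, breaking temporal correlations."""
        return random.sample(self.buffer, batch_size)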
64 Andrej Karpathy, “ConvNetJS Deep Q Learning Demo”
Deep Q-learning: Demo
65
Miriam Bellver, Xavier Giro-i-Nieto, Ferran Marques, and Jordi Torres. "Hierarchical Object Detection
with Deep Reinforcement Learning." Deep Reinforcement Learning Workshop NIPS 2016.
Deep Q-learning: DQN: Computer Vision
Method for performing hierarchical object detection in images guided by a
deep reinforcement learning agent.
OBJECT
FOUND
66
Deep Q-learning: DQN: Computer Vision
State: The agent decides which action to choose based on:
● a visual description of the currently observed region
● a history vector that encodes the past actions performed
67
Deep Q-learning: DQN: Computer Vision
Reward:
Reward for movement actions
Reward for terminal action
68
Deep Q-learning: DQN: Computer Vision
Actions: Two kinds of actions:
● movement actions: select which of the 5 possible regions defined by the hierarchy to move to
● terminal action: the agent indicates that the object has been found
69
Miriam Bellver, Xavier Giro-i-Nieto, Ferran Marques, and Jordi Torres. "Hierarchical Object Detection
with Deep Reinforcement Learning." Deep Reinforcement Learning Workshop NIPS 2016.
Deep Q-learning: DQN: Computer Vision
70
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
71
RL Frameworks
OpenAI Gym + keras-rl
keras-rl implements some state-of-the-art deep reinforcement learning algorithms in Python and seamlessly integrates with the deep learning library Keras. Just like Keras, it works with either Theano or TensorFlow, which means that you can train your algorithm efficiently either on CPU or GPU. Furthermore, keras-rl works with OpenAI Gym out of the box.
Slide credit: Míriam Bellver
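As a sketch of how these pieces fit together, the snippet below trains a DQN agent on CartPole with keras-rl and OpenAI Gym, following the structure of the keras-rl examples (the exact hyperparameters are illustrative assumptions):

import gym
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

env = gym.make('CartPole-v0')
nb_actions = env.action_space.n

model = Sequential([Flatten(input_shape=(1,) + env.observation_space.shape),
                    Dense(16, activation='relu'),
                    Dense(nb_actions, activation='linear')])   # Q-value per action

agent = DQNAgent(model=model, nb_actions=nb_actions,
                 memory=SequentialMemory(limit=50000, window_length=1),
                 policy=EpsGreedyQPolicy(), nb_steps_warmup=100)
agent.compile(Adam(lr=1e-3), metrics=['mae'])
agent.fit(env, nb_steps=50000, visualize=False, verbose=1)     # train
agent.test(env, nb_episodes=5, visualize=False)                # evaluate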
72
OpenAI Universe environment
RL Frameworks
73
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
74
Deep Learning TV,
“Reinforcement learning - Ep. 30”
Siraj Raval, Deep Q Learning for Video
Games
Learn more
Emma Brunskill, Stanford CS234: Reinforcement Learning
Learn more
David Silver, UCL COMP050, Reinforcement Learning
Learn more
Nando de Freitas, “Machine Learning” (University of Oxford)
Learn more
78
Pieter Abbeel and John Schulman, CS 294-112 Deep Reinforcement Learning,
Berkeley.
Slides: “Reinforcement Learning - Policy Optimization” OpenAI / UC Berkeley (2017)
Learn more
79
Learn more
Slide credit: Míriam Bellver
Actor-Critic algorithm
actor: state ➝ action
critic: (state, action) ➝ ‘q-value’
The actor performs an action; the critic assesses how good the action was, and the gradients are used to train both the actor and the critic.
Grondman, Ivo, Lucian Busoniu, Gabriel AD Lopes, and Robert Babuska. "A survey of actor-critic reinforcement learning: Standard and natural
policy gradients." IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, no. 6 (2012): 1291-1307.
● Evolution Strategies
Learn more
81
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
○ Policy
○ Optimal Policy
○ Value Function
○ Q-value function
○ Optimal Q-value function
○ Bellman equation
○ Value iteration algorithm
4. Deep Q-learning
○ Forward and Backward passes
○ DQN
○ Experience Replay
○ Examples
5. RL Frameworks
6. Learn more
○ Coming next…
Conclusions
Reinforcement Learning
● There is no supervisor, only reward
signal
● Feedback is delayed, not
instantaneous
● Time really matters (sequential, non-i.i.d. data)
Slide credit: UCL Course on RL by David Silver
83
Coming next...
https://www.theguardian.com/technology/2014/jan/27/google-acquires-uk-artificial-intelligence-startup-deepmind
84
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I.,
Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree
search. Nature, 529(7587), pp.484-489
Coming next...
85
Greg Kohs, “AlphaGo” (2017)
Coming next...
86
Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A.S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J. and Quan, J., 2017. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782. [Press release]
Coming next...
87
Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A.S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J. and Quan, J., 2017. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782. [Press release]
Coming next...
88
Coming next...
Edifici Vèrtex (Auditori)
More Related Content

What's hot

Multi armed bandit
Multi armed banditMulti armed bandit
Multi armed banditJie-Han Chen
 
Frontier in reinforcement learning
Frontier in reinforcement learningFrontier in reinforcement learning
Frontier in reinforcement learningJie-Han Chen
 
Human-level Control Through Deep Reinforcement Learning (Presentation)
Human-level Control Through Deep Reinforcement Learning (Presentation)Human-level Control Through Deep Reinforcement Learning (Presentation)
Human-level Control Through Deep Reinforcement Learning (Presentation)Muhammed Kocabaş
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement LearningUsman Qayyum
 
Discrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RLDiscrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RLJie-Han Chen
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDongHyun Kwak
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningDongHyun Kwak
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningSalem-Kabbani
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learningbutest
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningNAVER Engineering
 
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning Chandra Meena
 
Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learningJie-Han Chen
 
A brief overview of Reinforcement Learning applied to games
A brief overview of Reinforcement Learning applied to gamesA brief overview of Reinforcement Learning applied to games
A brief overview of Reinforcement Learning applied to gamesThomas da Silva Paula
 
Reinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsReinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsPierre de Lacaze
 
Financial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning ApproachFinancial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning Approach謙益 黃
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningYigit UNALLAR
 

What's hot (20)

Multi armed bandit
Multi armed banditMulti armed bandit
Multi armed bandit
 
Frontier in reinforcement learning
Frontier in reinforcement learningFrontier in reinforcement learning
Frontier in reinforcement learning
 
Human-level Control Through Deep Reinforcement Learning (Presentation)
Human-level Control Through Deep Reinforcement Learning (Presentation)Human-level Control Through Deep Reinforcement Learning (Presentation)
Human-level Control Through Deep Reinforcement Learning (Presentation)
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Policy gradient
Policy gradientPolicy gradient
Policy gradient
 
Hierarchical Object Detection with Deep Reinforcement Learning
Hierarchical Object Detection with Deep Reinforcement LearningHierarchical Object Detection with Deep Reinforcement Learning
Hierarchical Object Detection with Deep Reinforcement Learning
 
Deep Reinforcement Learning
Deep Reinforcement LearningDeep Reinforcement Learning
Deep Reinforcement Learning
 
Discrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RLDiscrete sequential prediction of continuous actions for deep RL
Discrete sequential prediction of continuous actions for deep RL
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
Reinforcement learning
Reinforcement learning Reinforcement learning
Reinforcement learning
 
Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learning
 
A brief overview of Reinforcement Learning applied to games
A brief overview of Reinforcement Learning applied to gamesA brief overview of Reinforcement Learning applied to games
A brief overview of Reinforcement Learning applied to games
 
Reinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsReinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural Nets
 
Financial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning ApproachFinancial Trading as a Game: A Deep Reinforcement Learning Approach
Financial Trading as a Game: A Deep Reinforcement Learning Approach
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 

Viewers also liked

Recurrent Neural Networks (DLAI D7L1 2017 UPC Deep Learning for Artificial In...
Recurrent Neural Networks (DLAI D7L1 2017 UPC Deep Learning for Artificial In...Recurrent Neural Networks (DLAI D7L1 2017 UPC Deep Learning for Artificial In...
Recurrent Neural Networks (DLAI D7L1 2017 UPC Deep Learning for Artificial In...Universitat Politècnica de Catalunya
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Universitat Politècnica de Catalunya
 
Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)
Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)
Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)Universitat Politècnica de Catalunya
 
CNN(Convolutional Neural Nets) backpropagation
CNN(Convolutional Neural Nets) backpropagationCNN(Convolutional Neural Nets) backpropagation
CNN(Convolutional Neural Nets) backpropagationJooyoul Lee
 
Deep learning framework 제작
Deep learning framework 제작Deep learning framework 제작
Deep learning framework 제작Tae Young Lee
 
Deep Learning for Stock Prediction
Deep Learning for Stock PredictionDeep Learning for Stock Prediction
Deep Learning for Stock PredictionLim Zhi Yuan (Zane)
 
[222]대화 시스템 서비스 동향 및 개발 방법
[222]대화 시스템 서비스 동향 및 개발 방법[222]대화 시스템 서비스 동향 및 개발 방법
[222]대화 시스템 서비스 동향 및 개발 방법NAVER D2
 
Stock Prediction Using NLP and Deep Learning
Stock Prediction Using NLP and Deep Learning Stock Prediction Using NLP and Deep Learning
Stock Prediction Using NLP and Deep Learning Keon Kim
 
AlphaGo 알고리즘 요약
AlphaGo 알고리즘 요약AlphaGo 알고리즘 요약
AlphaGo 알고리즘 요약Jooyoul Lee
 
알파고 (바둑 인공지능)의 작동 원리
알파고 (바둑 인공지능)의 작동 원리알파고 (바둑 인공지능)의 작동 원리
알파고 (바둑 인공지능)의 작동 원리Shane (Seungwhan) Moon
 
기계학습 / 딥러닝이란 무엇인가
기계학습 / 딥러닝이란 무엇인가기계학습 / 딥러닝이란 무엇인가
기계학습 / 딥러닝이란 무엇인가Yongha Kim
 

Viewers also liked (11)

Recurrent Neural Networks (DLAI D7L1 2017 UPC Deep Learning for Artificial In...
Recurrent Neural Networks (DLAI D7L1 2017 UPC Deep Learning for Artificial In...Recurrent Neural Networks (DLAI D7L1 2017 UPC Deep Learning for Artificial In...
Recurrent Neural Networks (DLAI D7L1 2017 UPC Deep Learning for Artificial In...
 
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
Optimization (DLAI D4L1 2017 UPC Deep Learning for Artificial Intelligence)
 
Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)
Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)
Methodology (DLAI D6L2 2017 UPC Deep Learning for Artificial Intelligence)
 
CNN(Convolutional Neural Nets) backpropagation
CNN(Convolutional Neural Nets) backpropagationCNN(Convolutional Neural Nets) backpropagation
CNN(Convolutional Neural Nets) backpropagation
 
Deep learning framework 제작
Deep learning framework 제작Deep learning framework 제작
Deep learning framework 제작
 
Deep Learning for Stock Prediction
Deep Learning for Stock PredictionDeep Learning for Stock Prediction
Deep Learning for Stock Prediction
 
[222]대화 시스템 서비스 동향 및 개발 방법
[222]대화 시스템 서비스 동향 및 개발 방법[222]대화 시스템 서비스 동향 및 개발 방법
[222]대화 시스템 서비스 동향 및 개발 방법
 
Stock Prediction Using NLP and Deep Learning
Stock Prediction Using NLP and Deep Learning Stock Prediction Using NLP and Deep Learning
Stock Prediction Using NLP and Deep Learning
 
AlphaGo 알고리즘 요약
AlphaGo 알고리즘 요약AlphaGo 알고리즘 요약
AlphaGo 알고리즘 요약
 
알파고 (바둑 인공지능)의 작동 원리
알파고 (바둑 인공지능)의 작동 원리알파고 (바둑 인공지능)의 작동 원리
알파고 (바둑 인공지능)의 작동 원리
 
기계학습 / 딥러닝이란 무엇인가
기계학습 / 딥러닝이란 무엇인가기계학습 / 딥러닝이란 무엇인가
기계학습 / 딥러닝이란 무엇인가
 

Similar to Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intelligence)

Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018Universitat Politècnica de Catalunya
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017MLconf
 
0415_seminar_DeepDPG
0415_seminar_DeepDPG0415_seminar_DeepDPG
0415_seminar_DeepDPGHye-min Ahn
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperDataScienceLab
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptxRithikRaj25
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysYasutoTamura1
 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfVaishnavGhadge1
 
Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017mooopan
 
An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningPrabhu Kumar
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Universitat Politècnica de Catalunya
 
increasing the action gap - new operators for reinforcement learning
increasing the action gap - new operators for reinforcement learningincreasing the action gap - new operators for reinforcement learning
increasing the action gap - new operators for reinforcement learningRyo Iwaki
 
Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Dan Elton
 
Hierarchical Reinforcement Learning
Hierarchical Reinforcement LearningHierarchical Reinforcement Learning
Hierarchical Reinforcement LearningDavid Jardim
 
acai01-updated.ppt
acai01-updated.pptacai01-updated.ppt
acai01-updated.pptbutest
 
Head First Reinforcement Learning
Head First Reinforcement LearningHead First Reinforcement Learning
Head First Reinforcement Learningazzeddine chenine
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDing Li
 

Similar to Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intelligence) (20)

Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
Reinforcement Learning (Reloaded) - Xavier Giró-i-Nieto - UPC Barcelona 2018
 
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
Matineh Shaker, Artificial Intelligence Scientist, Bonsai at MLconf SF 2017
 
Introduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement LearningIntroduction to Deep Reinforcement Learning
Introduction to Deep Reinforcement Learning
 
0415_seminar_DeepDPG
0415_seminar_DeepDPG0415_seminar_DeepDPG
0415_seminar_DeepDPG
 
Reinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine SweeperReinforcement Learning on Mine Sweeper
Reinforcement Learning on Mine Sweeper
 
14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx14_ReinforcementLearning.pptx
14_ReinforcementLearning.pptx
 
How to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative waysHow to formulate reinforcement learning in illustrative ways
How to formulate reinforcement learning in illustrative ways
 
reinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdfreinforcement-learning-141009013546-conversion-gate02.pdf
reinforcement-learning-141009013546-conversion-gate02.pdf
 
Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017Model-Based Reinforcement Learning @NIPS2017
Model-Based Reinforcement Learning @NIPS2017
 
An efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game LearningAn efficient use of temporal difference technique in Computer Game Learning
An efficient use of temporal difference technique in Computer Game Learning
 
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
Q-Learning with a Neural Network - Xavier Giró - UPC Barcelona 2020
 
increasing the action gap - new operators for reinforcement learning
increasing the action gap - new operators for reinforcement learningincreasing the action gap - new operators for reinforcement learning
increasing the action gap - new operators for reinforcement learning
 
Reinforcement-Learning.ppt
Reinforcement-Learning.pptReinforcement-Learning.ppt
Reinforcement-Learning.ppt
 
Learning To Run
Learning To RunLearning To Run
Learning To Run
 
GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...
GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...
GDRR Opening Workshop - Deep Reinforcement Learning for Asset Based Modeling ...
 
Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design Introduction to Reinforcement Learning for Molecular Design
Introduction to Reinforcement Learning for Molecular Design
 
Hierarchical Reinforcement Learning
Hierarchical Reinforcement LearningHierarchical Reinforcement Learning
Hierarchical Reinforcement Learning
 
acai01-updated.ppt
acai01-updated.pptacai01-updated.ppt
acai01-updated.ppt
 
Head First Reinforcement Learning
Head First Reinforcement LearningHead First Reinforcement Learning
Head First Reinforcement Learning
 
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
 

More from Universitat Politècnica de Catalunya

The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...Universitat Politècnica de Catalunya
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoUniversitat Politècnica de Catalunya
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Universitat Politècnica de Catalunya
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosUniversitat Politècnica de Catalunya
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Universitat Politècnica de Catalunya
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Universitat Politècnica de Catalunya
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Universitat Politècnica de Catalunya
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Universitat Politècnica de Catalunya
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Universitat Politècnica de Catalunya
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Universitat Politècnica de Catalunya
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Universitat Politècnica de Catalunya
 

More from Universitat Politècnica de Catalunya (20)

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Deep Generative Learning for All
Deep Generative Learning for AllDeep Generative Learning for All
Deep Generative Learning for All
 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
 
Open challenges in sign language translation and production
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
 
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
 

Recently uploaded

Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSAishani27
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfLars Albertsson
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998YohFuh
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationshipsccctableauusergroup
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubaihf8803863
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 

Recently uploaded (20)

Ukraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICSUkraine War presentation: KNOW THE BASICS
Ukraine War presentation: KNOW THE BASICS
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998RA-11058_IRR-COMPRESS Do 198 series of 1998
RA-11058_IRR-COMPRESS Do 198 series of 1998
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships04242024_CCC TUG_Joins and Relationships
04242024_CCC TUG_Joins and Relationships
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls DubaiDubai Call Girls Wifey O52&786472 Call Girls Dubai
Dubai Call Girls Wifey O52&786472 Call Girls Dubai
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 

Reinforcement Learning (DLAI D7L2 2017 UPC Deep Learning for Artificial Intelligence)

  • 25. Figure: UCL Course on RL by David Silver. Architecture: Environment / Agent loop with action (At), reward (rt), state (st+1). GOAL: Learn how to take actions to maximize the cumulative reward.
  • 26. Other problems that can be formulated with an RL architecture. Cart-Pole Problem. Objective: balance a pole on top of a movable cart. Architecture. Slide credit: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
  • 27. Architecture. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Environment / Agent loop: state (st) = angle, angular speed, position, horizontal velocity; action (At) = horizontal force applied on the cart; reward (rt) = 1 at each time step if the pole is upright.
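To make the cart-pole loop concrete, here is a minimal sketch of the agent-environment interaction using OpenAI Gym (the framework introduced later under RL Frameworks); the random action is only a placeholder for a learned policy, and the classic pre-0.26 Gym API is assumed.

    import gym

    env = gym.make('CartPole-v0')
    state = env.reset()   # state = (position, horizontal velocity, angle, angular speed)
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()             # placeholder policy: random push left/right
        state, reward, done, info = env.step(action)   # reward = 1 per time step the pole stays upright
        total_reward += reward
    print('episode return:', total_reward)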
  • 28. Other problems that can be formulated with an RL architecture. Robot Locomotion. Objective: make the robot move forward. Architecture. Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page]
  • 29. Architecture. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Environment / Agent loop: state (st) = angle and position of the joints; action (At) = torques applied on the joints; reward (rt) = 1 at each time step upright + forward movement.
  • 30. 30 Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page]
  • 31. 31 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) ○ Policy ○ Optimal Policy ○ Value Function ○ Q-value function ○ Optimal Q-value function ○ Bellman equation ○ Value iteration algorithm 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  • 32. 32 Markov Decision Processes (MDP) Markov Decision Processes provide a formalism for reinforcement learning problems. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Markov property: Current state completely characterises the state of the world.
  • 33. Markov Decision Processes (MDP). Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. An MDP is defined by a tuple (S, A, R, P, γ): set of states S, set of actions A, reward distribution R, transition probabilities P, discount factor γ.
  • 34. Markov Decision Processes (MDP). Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The (S, A, R, P, γ) interaction loop: the environment samples an initial state s0 ~ p(s0); then, at each time step t, the agent selects action at, the environment samples the next state st+1 ~ P(·|st, at), and the environment samples the reward rt ~ R(·|st, at).
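Schematically, the sampling loop above can be written as follows; p0, P, R and agent are hypothetical stand-ins for the initial-state distribution, the transition and reward distributions, and the acting agent, not a real library API.

    def run_episode(agent, p0, P, R, T=1000):
        s = p0()                            # environment samples s0 ~ p(s0)
        for t in range(T):
            a = agent.select_action(s)      # agent selects action at
            s_next = P(s, a)                # environment samples st+1 ~ P(.|st, at)
            r = R(s, a)                     # environment samples reward rt ~ R(.|st, at)
            agent.observe(s, a, r, s_next)  # the agent can learn from the transition
            s = s_next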
  • 35. MDP: Policy. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The agent selects action at according to a policy π. A policy π is a function S ➝ A that specifies which action to take in each state.
  • 36. MDP: Policy. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The agent selects action at according to a policy π, a function S ➝ A that specifies which action to take in each state. Agent GOAL: learn how to take actions to maximize reward. MDP GOAL: find the policy π* that maximizes the cumulative discounted reward Σt≥0 γt rt.
  • 37. Other problems that can be formulated with an RL architecture. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. MDP: Policy. Grid World (a simple MDP). Objective: reach one of the terminal states (greyed out) in the least number of actions.
  • 38. 38Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Environment Agent action (At ) reward (rt ) state (st ) Each cell is a state: A negative “reward” (penalty) for each transition rt = r = -1 MDP: Policy
  • 39. 39Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. MDP: Policy Example: Actions resulting from applying a random policy on this Grid World problem.
  • 40. 40Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Exercise: Draw the actions resulting from applying an optimal policy in this Grid World problem. MDP: Optimal Policy π*
  • 41. 41Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Solution: Draw the actions resulting from applying an optimal policy in this Grid World problem. MDP: Optimal Policy π*
  • 42. MDP: Optimal Policy π*. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How do we handle the randomness (initial state s0, transition probabilities, action...)? GOAL: find the policy π* that maximizes the cumulative discounted reward. The environment samples the initial state s0 ~ p(s0); the agent selects action at ~ π(·|st); the environment samples the next state st+1 ~ P(·|st, at) and the reward rt ~ R(·|st, at).
  • 43. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How do we handle the randomness (initial state s0, transition probabilities, action)? GOAL: find the policy π* that maximizes the cumulative discounted reward. The optimal policy π* will maximize the expected cumulative discounted reward: π* = argmaxπ E[ Σt≥0 γt rt | π ], with initial state s0 ~ p(s0), selected action at ~ π(·|st), and sampled next state st+1 ~ P(·|st, at). MDP: Optimal Policy π*
  • 44. MDP: Policy: Value function Vπ(s). Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How to estimate how good state s is for a given policy π? With the value function at state s, Vπ(s): the expected cumulative reward from following policy π from state s, i.e. Vπ(s) = E[ Σt≥0 γt rt | s0 = s, π ].
  • 45. MDP: Policy: Q-value function Qπ(s,a). Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. How to estimate how good a state-action pair (s,a) is for a given policy π? With the Q-value function at state s and action a, Qπ(s,a): the expected cumulative reward from taking action a in state s, and then following policy π, i.e. Qπ(s,a) = E[ Σt≥0 γt rt | s0 = s, a0 = a, π ].
  • 46. MDP: Policy: Optimal Q-value function Q*(s,a). Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The optimal Q-value function at state s and action a, Q*(s,a), is the maximum expected cumulative reward achievable from a given (state, action) pair: Q*(s,a) = maxπ E[ Σt≥0 γt rt | s0 = s, a0 = a, π ], i.e. choose the policy that maximizes the expected cumulative reward.
  • 47. MDP: Policy: Bellman equation. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q*(s,a) satisfies the following Bellman equation: Q*(s,a) = Es'[ r + γ maxa' Q*(s',a') | s, a ]. Here r is the reward for the considered pair (s,a), γ is the discount factor, maxa' Q*(s',a') is the maximum expected cumulative reward for the future pair (s',a') (the FUTURE REWARD term), and the expectation is taken across the possible future states s' (randomness). The left-hand side is the maximum expected cumulative reward for the considered pair (s,a).
  • 48. MDP: Policy: Bellman equation. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Q*(s,a) satisfies the Bellman equation above. GOAL: find the policy π* that maximizes the cumulative discounted reward. The optimal policy π* corresponds to taking the best action in any state according to Q*: select the action a' that maximizes the expected cumulative reward, π*(s) = argmaxa' Q*(s,a').
  • 49. MDP: Policy: Solving for the Optimal Policy. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Value iteration algorithm: estimate the Bellman equation with an iterative update, Qi+1(s,a) = E[ r + γ maxa' Qi(s',a') | s, a ]. The updated Q-value for the current pair (s,a) uses the current Q-value estimate for the future pair (s',a'). The iterative estimate Qi(s,a) will converge to the optimal Q*(s,a) as i ➝ ∞.
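A tabular sketch of this update, assuming a small finite MDP whose dynamics are exposed as a hypothetical function dynamics(s, a) returning (probability, next state, reward) triples:

    import numpy as np

    def q_value_iteration(n_states, n_actions, dynamics, gamma=0.9, n_iters=100):
        Q = np.zeros((n_states, n_actions))        # Q[s, a] holds the estimate Qi(s, a)
        for i in range(n_iters):
            Q_new = np.zeros_like(Q)
            for s in range(n_states):
                for a in range(n_actions):
                    # Bellman update: Qi+1(s,a) = E[ r + gamma * max_a' Qi(s',a') ]
                    Q_new[s, a] = sum(p * (r + gamma * Q[s2].max())
                                      for p, s2, r in dynamics(s, a))
            Q = Q_new
        return Q   # the greedy policy pi*(s) = argmax_a Q[s, a] is then optimal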
  • 50. MDP: Policy: Solving for the Optimal Policy. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Qi(s,a) will converge to the optimal Q*(s,a) as i ➝ ∞. This iterative approach is not scalable, because it requires computing Qi(s,a) for every state-action pair. E.g., if the state is the current game pixels, it is computationally infeasible to compute Qi(s,a) for the entire state space!
  • 51. MDP: Policy: Solving for the Optimal Policy. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. This iterative approach is not scalable, because it requires computing Q(s,a) for every state-action pair. E.g., if the state is the current game pixels, it is computationally infeasible to compute Q(s,a) for the entire state space! Solution: use a deep neural network as a function approximator of Q*(s,a): Q(s,a,Ө) ≈ Q*(s,a), where Ө are the neural network parameters.
  • 52. 52 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning ○ Forward and Backward passes ○ DQN ○ Experience Replay ○ Examples 5. RL Frameworks 6. Learn more ○ Coming next…
  • 53. 53 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The function to approximate is a Q-function that satisfies the Bellman equation: Q(s,a,Ө) ≈ Q*(s,a)
  • 54. Deep Q-learning. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The function to approximate is a Q-function that satisfies the Bellman equation: Q(s,a,Ө) ≈ Q*(s,a). Forward Pass. Loss function: Li(Өi) = E[ (yi − Q(s,a;Өi))² ], where a (s,a) pair is sampled, Q(s,a;Өi) is the Q-value predicted with the current parameters Өi, and the target yi = Es'[ r + γ maxa' Q(s',a';Өi−1) ] is computed by sampling a future state s' and predicting its Q-value with the previous parameters Өi−1.
  • 55. 55 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Train the DNN to approximate a Q-value function that satisfies the Bellman equation
  • 56. 56 Deep Q-learning Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Must compute reward during training
  • 57. Deep Q-learning. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Forward Pass. Loss function: Li(Өi) = E[ (yi − Q(s,a;Өi))² ]. Backward Pass. Gradient update (with respect to the Q-function parameters Ө): ∇Өi Li(Өi) = E[ (r + γ maxa' Q(s',a';Өi−1) − Q(s,a;Өi)) ∇Өi Q(s,a;Өi) ].
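A hedged sketch of one such forward/backward pass with Keras (the library behind keras-rl, shown later). q_net and target_net are assumed to be compiled Keras models with one output per action; target_net holds the previous parameters Өi−1 as a frozen copy, in the spirit of Mnih et al. (2015). All names are illustrative, not the exact implementation.

    import numpy as np

    def dqn_update(q_net, target_net, batch, gamma=0.99):
        # batch of transitions, e.g. sampled from the replay memory introduced below
        states, actions, rewards, next_states, dones = batch
        next_q = target_net.predict(next_states)   # Q(s', .; Өi−1) with frozen parameters
        targets = q_net.predict(states)            # start from the current predictions
        for j, a in enumerate(actions):
            # y_j = r + gamma * max_a' Q(s', a'; Өi−1), or just r at the end of an episode
            targets[j, a] = rewards[j] + (0.0 if dones[j] else gamma * next_q[j].max())
        # one gradient step on L = (y − Q(s,a;Өi))² for the actions actually taken
        q_net.train_on_batch(states, targets)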
  • 58. Source: Tambet Matiisen, Demystifying Deep Reinforcement Learning (Nervana). Deep Q-learning: Deep Q-Network DQN. Q(s,a,Ө) ≈ Q*(s,a)
  • 59. Source: Tambet Matiisen, Demystifying Deep Reinforcement Learning (Nervana). Deep Q-learning: Deep Q-Network DQN. Q(s,a,Ө) ≈ Q*(s,a). Efficiency: a single feed-forward pass computes the Q-values for all actions from the current state.
  • 60. Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533. Deep Q-learning: Deep Q-Network DQN. Number of actions between 4 and 18, depending on the Atari game.
  • 61. Deep Q-learning: Deep Q-Network DQN. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. The network outputs one Q-value per action: Q(st, ⬅), Q(st, ➡), Q(st, ⬆), Q(st, ⬇).
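A sketch of such a network in Keras, following the architecture reported in Mnih et al. (2015): a stack of game frames goes in, one Q-value per action comes out, so a single forward pass scores every action.

    from keras.models import Sequential
    from keras.layers import Conv2D, Flatten, Dense

    n_actions = 18   # 4 to 18, depending on the Atari game
    model = Sequential([
        Conv2D(32, (8, 8), strides=4, activation='relu', input_shape=(84, 84, 4)),
        Conv2D(64, (4, 4), strides=2, activation='relu'),
        Conv2D(64, (3, 3), strides=1, activation='relu'),
        Flatten(),
        Dense(512, activation='relu'),
        Dense(n_actions, activation='linear'),   # Q(st, a) for every action a
    ])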
  • 62. 62 Deep Q-learning: Experience Replay Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Learning from batches of consecutive samples is problematic: ● Samples are too correlated ➡ inefficient learning ● Q-network parameters determine the next training samples ➡ can lead to bad feedback loops.
  • 63. Deep Q-learning: Experience Replay. Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017. Experience replay: ● Continually update a replay memory table of transitions (st, at, rt, st+1) as game (experience) episodes are played. ● Train a Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples.
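A minimal sketch of the replay memory just described (illustrative, not keras-rl's implementation):

    import random
    from collections import deque

    class ReplayMemory:
        def __init__(self, capacity=100000):
            self.buffer = deque(maxlen=capacity)   # oldest transitions are discarded first

        def push(self, s, a, r, s_next, done):
            # continually updated as game (experience) episodes are played
            self.buffer.append((s, a, r, s_next, done))

        def sample(self, batch_size=32):
            # random minibatches break the correlation between consecutive samples
            return random.sample(self.buffer, batch_size)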
  • 64. Andrej Karpathy, “ConvNetJS Deep Q Learning Demo”. Deep Q-learning: Demo.
  • 65. 65 Miriam Bellver, Xavier Giro-i-Nieto, Ferran Marques, and Jordi Torres. "Hierarchical Object Detection with Deep Reinforcement Learning." Deep Reinforcement Learning Workshop NIPS 2016. Deep Q-learning: DQN: Computer Vision Method for performing hierarchical object detection in images guided by a deep reinforcement learning agent. OBJECT FOUND
  • 66. Deep Q-learning: DQN: Computer Vision. State: the agent decides which action to choose based on: ● a visual description of the currently observed region ● a history vector that encodes the past actions performed
  • 67. Deep Q-learning: DQN: Computer Vision. Reward: one reward definition for the movement actions and another for the terminal action.
  • 68. Deep Q-learning: DQN: Computer Vision. Actions: two kinds of actions: ● movement actions: to which of the 5 possible regions defined by the hierarchy to move ● terminal action: the agent indicates that the object has been found
  • 69. 69 Miriam Bellver, Xavier Giro-i-Nieto, Ferran Marques, and Jordi Torres. "Hierarchical Object Detection with Deep Reinforcement Learning." Deep Reinforcement Learning Workshop NIPS 2016. Deep Q-learning: DQN: Computer Vision
  • 70. 70 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  • 71. RL Frameworks. OpenAI Gym + keras-rl. keras-rl implements some state-of-the-art deep reinforcement learning algorithms in Python and seamlessly integrates with the deep learning library Keras. Just like Keras, it works with either Theano or TensorFlow, which means that you can train your algorithm efficiently either on CPU or GPU. Furthermore, keras-rl works with OpenAI Gym out of the box. Slide credit: Míriam Bellver
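A sketch of wiring the two together, following keras-rl's standard CartPole example; class and argument names come from keras-rl's public API, with details indicative of the versions current in 2017.

    import gym
    from keras.models import Sequential
    from keras.layers import Dense, Flatten
    from keras.optimizers import Adam
    from rl.agents.dqn import DQNAgent
    from rl.memory import SequentialMemory
    from rl.policy import EpsGreedyQPolicy

    env = gym.make('CartPole-v0')
    n_actions = env.action_space.n

    # small Keras network mapping observations to one Q-value per action
    model = Sequential([
        Flatten(input_shape=(1,) + env.observation_space.shape),
        Dense(16, activation='relu'),
        Dense(n_actions, activation='linear'),
    ])

    memory = SequentialMemory(limit=50000, window_length=1)    # experience replay
    policy = EpsGreedyQPolicy()                                # epsilon-greedy exploration
    dqn = DQNAgent(model=model, nb_actions=n_actions, memory=memory, policy=policy)
    dqn.compile(Adam(lr=1e-3), metrics=['mae'])
    dqn.fit(env, nb_steps=50000, visualize=False, verbose=1)   # train against the Gym env
    dqn.test(env, nb_episodes=5, visualize=True)               # watch the learned agent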
  • 73. 73 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) 4. Deep Q-learning 5. RL Frameworks 6. Learn more
  • 74. 74 Deep Learning TV, “Reinforcement learning - Ep. 30” Siraj Raval, Deep Q Learning for Video Games Learn more
  • 75. Emma Brunskill, Stanford CS234: Reinforcement Learning Learn more
  • 76. David Silver, UCL COMP050, Reinforcement Learning Learn more
  • 77. Nando de Freitas, “Machine Learning” (University of Oxford) Learn more
  • 78. 78 Pieter Abbeel and John Schulman, CS 294-112 Deep Reinforcement Learning, Berkeley. Slides: “Reinforcement Learning - Policy Optimization” OpenAI / UC Berkeley (2017) Learn more
  • 79. Learn more. Slide credit: Míriam Bellver. Actor-Critic algorithm: the actor observes the state and performs an action; the critic assesses how good the action was (a ‘q-value’), and the gradients are used to train both the actor and the critic. Grondman, Ivo, Lucian Busoniu, Gabriel AD Lopes, and Robert Babuska. "A survey of actor-critic reinforcement learning: Standard and natural policy gradients." IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 42, no. 6 (2012): 1291-1307.
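To make the picture concrete, a toy one-step actor-critic sketch with linear approximators in NumPy; this is an illustrative reduction of the family surveyed by Grondman et al., assuming a Gym-style env with a vector observation and a discrete action space.

    import numpy as np

    def actor_critic(env, n_episodes=500, gamma=0.99, alpha_w=0.01, alpha_theta=0.01):
        n_s, n_a = env.observation_space.shape[0], env.action_space.n
        theta = np.zeros((n_s, n_a))   # actor: softmax policy over linear scores
        w = np.zeros(n_s)              # critic: linear state-value estimate
        for _ in range(n_episodes):
            s, done = env.reset(), False
            while not done:
                logits = s @ theta
                probs = np.exp(logits - logits.max())
                probs /= probs.sum()
                a = np.random.choice(n_a, p=probs)               # actor performs an action
                s2, r, done, _ = env.step(a)
                # TD error: the critic's assessment of how good the action was
                td = r + (0.0 if done else gamma * (s2 @ w)) - (s @ w)
                w += alpha_w * td * s                            # train the critic
                grad_log = -probs[None, :] * s[:, None]          # d log pi(a|s) / d theta
                grad_log[:, a] += s
                theta += alpha_theta * td * grad_log             # train the actor
                s = s2
        return theta, w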
  • 81. 81 Outline 1. Motivation 2. Architecture 3. Markov Decision Process (MDP) ○ Policy ○ Optimal Policy ○ Value Function ○ Q-value function ○ Optimal Q-value function ○ Bellman equation ○ Value iteration algorithm 4. Deep Q-learning ○ Forward and Backward passes ○ DQN ○ Experience Replay ○ Examples 5. RL Frameworks 6. Learn more ○ Coming next…
  • 82. Conclusions. Reinforcement Learning: ● There is no supervisor, only a reward signal ● Feedback is delayed, not instantaneous ● Time really matters (sequential, non-i.i.d. data) Slide credit: UCL Course on RL by David Silver
  • 84. 84 Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp.484-489 Coming next...
  • 85. 85 Greg Kohs, “AlphaGo” (2017) Coming next...
  • 86. Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A.S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J. and Quan, J., 2017. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782. [Press release] Coming next...