Deep Reinforcement Learning with Shallow Trees
Matineh Shaker
AI Scientist (Bonsai)
MLConf San Francisco
10 November 2017
Outline
● Introduction to RL (Reinforcement Learning)
● Markov decision processes
● Value-based methods
● Concept-Network Reinforcement Learning (CNRL)
● Use cases
2
A Reinforcement Learning Example
3
Rocket Trajectory Optimization:
OpenAI Gym’s LunarLander Simulator
A Reinforcement Learning Example
4
State:
x_position
y_position
x_velocity
y_velocity
angle
angular velocity
left_leg
right_leg
Action (Discrete):
do nothing (0)
fire left engine (1)
fire main engine (2)
fire right engine (3)
Action (Continuous):
main engine power
left/right engine power
Reward: Moving from the top of the screen to the landing pad with
zero speed is worth about 100-140 points. The episode finishes if the
lander crashes or comes to rest, with an additional -100 or +100.
Each leg's ground contact is +10. Firing the main engine costs 0.3
points per frame.
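To make the state, action, and reward description above concrete, here is a minimal sketch that creates the LunarLander environment in OpenAI Gym, inspects its spaces, and takes one random step. It assumes the classic Gym API (env.reset() returning the observation, env.step() returning a 4-tuple) and that the Box2D extra is installed; newer Gym/Gymnasium releases changed these signatures.

```python
import gym

env = gym.make("LunarLander-v2")    # discrete-action version; "LunarLanderContinuous-v2" uses a Box(2,) action
print(env.observation_space)        # Box(8,): x, y, vx, vy, angle, angular velocity, left_leg, right_leg
print(env.action_space)             # Discrete(4): do nothing, left engine, main engine, right engine

obs = env.reset()                                              # 8-dimensional state vector
obs, reward, done, info = env.step(env.action_space.sample())  # one random action
print(obs.shape, reward, done)
```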
Basic RL Concepts
5
Reward Hypothesis
Goals can be described by maximizing the expected cumulative reward.
Sequential Decision Making
Actions may have long-term consequences.
Rewards may be delayed, like a financial investment.
Sometimes the agent sacrifices instant rewards to maximize long-term reward (just like life!)
State Data
Sequential and non-i.i.d.
Agent’s actions affect the next data samples.
Definitions
Policy
Dictates agent’s behavior, and maps from state to action:
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P(A_t = a | S_t = s)
Value function
Determines how good each state (and action) is:
V_π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s ]
Q_π(s, a)
Model
Predicts what the environment will do next (simulator’s job for instance)
6
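A small numeric sketch of the discounted return behind the value-function definition above, G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ...; V_π(s) is the expectation of this return when starting from s and following π. The discount factor and rewards below are illustrative, not taken from the talk.

```python
def discounted_return(rewards, gamma=0.99):
    # G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
    g = 0.0
    for r in reversed(rewards):  # fold from the end: G = r + gamma * G_next
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1 + 0.9*0 + 0.81*2 = 2.62
```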
Agent and Environment
At each time step, the agent:
Receives observation
Receives reward
Takes action
The environment:
Receives action
Sends next observation
Sends next reward
7
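The loop on this slide, written out as a minimal sketch with a random placeholder policy; the `policy` helper is hypothetical, and the classic Gym reset/step API is assumed as before.

```python
import gym

env = gym.make("LunarLander-v2")

def policy(obs):
    # placeholder agent: ignores the observation and acts at random
    return env.action_space.sample()

obs = env.reset()                               # first observation
total_reward, done = 0.0, False
while not done:
    action = policy(obs)                        # agent takes an action
    obs, reward, done, info = env.step(action)  # environment sends next observation and reward
    total_reward += reward
print("episode return:", total_reward)
```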
Markov Decision Processes (MDP)
8
Mathematical framework for sequential decision making.
An environment in which all states are Markovian:
Markov Decision Process is a tuple:
Pictures from David Silver’s Slides
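As an illustration of the MDP tuple (S, A, P, R, γ) referred to above, here is a toy two-state MDP written out explicitly; the states, actions, and numbers are made up for illustration and are not from the talk.

```python
S = ["s0", "s1"]        # states
A = ["stay", "go"]      # actions
gamma = 0.9             # discount factor

# transition probabilities: P[s][a] = {next_state: probability}
P = {
    "s0": {"stay": {"s0": 1.0}, "go": {"s1": 1.0}},
    "s1": {"stay": {"s1": 1.0}, "go": {"s0": 1.0}},
}

# expected rewards: R[s][a]
R = {
    "s0": {"stay": 0.0, "go": 1.0},
    "s1": {"stay": 0.0, "go": 0.0},
}
```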
Exploration vs. Exploitation
Exploration vs. Exploitation Dilemma
● Reinforcement learning (especially model-free) is like trial-and-error learning.
● The agent should find a good policy that maximizes future rewards from its experiences
of the environment, in a potentially very large state space.
● Exploration finds more information about the environment, while exploitation exploits
known information to maximize reward.
9
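One common way (though not the only one) to trade exploration off against exploitation is ε-greedy action selection. A minimal sketch, assuming a tabular Q stored as a dict keyed by (state, action):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    if random.random() < epsilon:
        return random.choice(actions)                          # explore: random action
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit: greedy w.r.t. current Q
```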
Value Based Methods: Q-Learning
What are the problems?
● The iterative update is not scalable enough:
● Computing Q(s,a) for every state-action pair is not feasible most of the time.
Solution:
● Use a function approximator to estimate Q(s,a), such as a (differentiable) neural network!
10
Using the Bellman equation as an iterative update to find the optimal policy:
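A tabular sketch of that Bellman-based update, Q(s,a) ← Q(s,a) + α (r + γ max_a' Q(s',a') − Q(s,a)); the learning rate and discount below are illustrative defaults, not values from the talk.

```python
from collections import defaultdict

Q = defaultdict(float)  # Q[(state, action)] -> value, zero-initialized

def q_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    best_next = max(Q[(s_next, a2)] for a2 in actions)  # max_a' Q(s', a')
    td_target = r + gamma * best_next                   # Bellman target
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])        # move Q(s,a) toward the target
```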
Value Based Methods: Q-Learning
Use a function approximator to estimate the action-value function:
Q(s, a; θ) ≈ Q*(s, a)
θ is the function parameter (the weights of the NN)
Function approximator can be a deep neural network: DQN
11
Loss Function:
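As a hedged illustration of the loss referenced here: the standard DQN objective is the squared error between Q(s, a; θ) and the Bellman target r + γ max_a' Q(s', a'). The PyTorch sketch below assumes batch tensors from a replay buffer; in practice DQN evaluates the target with a separate, periodically copied target network, which is folded into a no_grad block here for brevity.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, batch, gamma=0.99):
    s, a, r, s_next, done = batch                          # tensors; `a` holds action indices, `done` is a 0/1 float mask
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    with torch.no_grad():                                  # a frozen target network is normally used here
        target = r + gamma * (1.0 - done) * q_net(s_next).max(1)[0]
    return F.mse_loss(q_sa, target)                        # squared Bellman error
```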
Value Based Methods: DQN
Learning from batches of consecutive samples is problematic and costly:
- Sample correlation: consecutive samples are correlated, which in turn makes learning inefficient
- Bad feedback loops: the current Q-network parameters dictate the next training samples and can
lead to bad feedback loops (e.g., if the maximizing action is to move left, training samples will
be dominated by samples from the left-hand side)
To solve both, use Experience Replay (sketched below):
- Continually update a replay memory table of transitions (s_t, a_t, r_t, s_{t+1}).
- Train the Q-network on random mini-batches of transitions from the replay memory.
12
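A minimal replay-memory sketch matching the description above: store transitions and sample random mini-batches to break the correlation between consecutive samples. Capacity and batch size are illustrative.

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)            # oldest transitions are evicted first

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))     # store transition (s_t, a_t, r_t, s_{t+1})

    def sample(self, batch_size=32):
        return random.sample(self.buffer, batch_size)   # random mini-batch for the Q-network update
```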
Concept Network Reinforcement Learning
● Solving complex tasks by decomposing them into high-level actions, or "concepts".
● A "multi-level hierarchical RL" approach, inspired by Sutton's Options:
○ enables efficient exploration through abstractions over low-level actions,
○ improves sample efficiency significantly,
○ especially in "sparse reward" settings.
● Allows existing solutions to sub-problems to be composed into an overall solution
without requiring re-training.
13
Temporal Abstractions
● At each time t, for each state s_t, a higher-level "selector" chooses a concept c_t among all
possible concepts available to the selector.
● Each concept remains active for some time, until a predefined terminal state is reached.
● An internal critic evaluates how close the agent is to satisfying the terminal condition of c_t,
and sends reward r_c(t) to the selector.
● Similar to baseline RL, except that an extra layer of abstraction is defined over the set of
"primitive" actions, forming a concept, so that executing each concept corresponds to taking a
single (temporally extended) action from the selector's point of view (see the sketch below).
14
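An illustrative sketch of the selector/concept/critic loop described above, not Bonsai's CNRL implementation: `selector`, `concepts`, and `critic` are hypothetical objects whose interfaces are implied by the bullets, and a Gym-style env API is assumed.

```python
def run_episode(env, selector, concepts, critic):
    # Hypothetical interfaces: selector.choose / selector.update,
    # concept.act / concept.terminal, critic.reward -- illustration only.
    obs, done = env.reset(), False
    while not done:
        c = selector.choose(obs, concepts)               # selector picks concept c_t for state s_t
        while not done and not c.terminal(obs):          # concept stays active until its terminal condition
            obs, reward, done, _ = env.step(c.act(obs))  # concept's own policy issues primitive actions
        selector.update(obs, critic.reward(c, obs))      # internal critic sends r_c(t) to the selector
```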
LunarLander with Concepts
15
LunarLander with Concepts
16
Robotics Pick and Place with Concepts
17
Lift Orient Stack
Robotics Pick and Place with Concepts
18
Robotics Pick and Place with Concepts
19
Deep Reinforcement Learning for Dexterous Manipulation with Concept Networks
https://arxiv.org/abs/1709.06977
Thank you!
20
Backup Slides for Q/A:
21
Definitions
State
The agent’s internal representation in the environment.
Information the agent uses to pick the next action.
Policy
Dictates agent’s behavior, and maps from state to action:
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P(A_t = a | S_t = s)
Value function
Determines how good each state (and action) is:
V_π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s ]
Q_π(s, a)
Model
Predicts what the environment will do next (simulator’s job for instance)
22
RL’s Main Loop
23
Value Based Methods: DQN with Experience Replay (2)
24
Learning vs Planning
25
Learning (Model-Free Reinforcement Learning):
The environment is initially unknown
The agent interacts with the environment without knowing its dynamics
The agent improves its policy based on previous interactions
Planning (Model-based Reinforcement Learning):
A model of the environment is known or acquired
The agent performs computations with the model, without any external interaction
The agent improves its policy based on those computations with the model
LunarLander with Concept Network
26
Introduction to RL: Challenges
27
Playing Atari with Deep Reinforcement Learning, Mnih et al., DeepMind
Policy-Based Methods
● The Q-function can be complex and unnecessary; all we want is the best action!
● Example: in a very high-dimensional state space, it is wasteful and costly to learn the exact
value of every (state, action) pair.
28
● Defining parameterized policies:
● For each policy, define its value:
● Gradient ascent on policy parameters to find the optimal policy!
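A minimal sketch of the parameterized-policy idea above, using a REINFORCE-style gradient estimator in PyTorch; `policy_net` (mapping states to action logits), the action indices, and the return estimates are assumed inputs, and this is one standard estimator rather than the talk's specific method.

```python
import torch

def reinforce_loss(policy_net, states, actions, returns):
    # pi(a|s; theta) as a softmax over the network's action logits
    log_probs = torch.log_softmax(policy_net(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log pi(a_t | s_t)
    # minimizing this negative objective performs gradient ascent on expected return
    return -(chosen * returns).mean()
```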