Introduction to Reinforcement Learning
Artificial Intelligence
• What is intelligence?
→ The ability to understand increasingly abstract information
• What is artificial intelligence?
→ Research that aims to realize this phenomenon of intelligence artificially
Artificial Intelligence & Machine Learning
Deep Learning in RL
•In DeepRL, deep learning is used merely as one module.
• Strengths of Deep Learning :
1. The only algorithm where stacking more layers enables increasingly abstract feature learning.
2. A Universal Function Approximator.
Universal Function Approximator
What is Reinforcement Learning?
•Supervised Learning :
y = f(x)
•Unsupervised Learning :
x ~ p(x) or x = f(x)
•Reinforcement Learning :
Find a policy p(a|s) which maximizes the expected sum of rewards
Machine Learning
Example of Supervised Learning :
Polynomial Curve Fitting
Trend line in Microsoft Excel 2007
Example of Unsupervised Learning :
Clustering
http://www.frankichamaki.com/data-driven-market-segmentation-more-effective-marketing-to-segments-using-ai/
Example of Reinforcement Learning :
Optimal Control Problem
http://graveleylab.cam.uchc.edu/WebData/mduff/older_papers.html
https://studywolf.wordpress.com/2015/03/29/reinforcement-learning-part-3-egocentric-learning/
Markov Decision Processes (MDP)
• Discrete state space : S = { A , B }
• State transition probability : P(S’|S) = {PAA = 0.7 , PAB = 0.3 , PBA = 0.5 , PBB = 0.5 }
• Purpose : Finding a steady state distribution
Markov Processes
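As a sketch of what "finding a steady-state distribution" means here, the two-state chain above can be iterated with numpy until the state distribution stops changing (variable names are illustrative, not from the slides):

```python
import numpy as np

# Transition matrix for S = {A, B}: row = current state, column = next state.
P = np.array([[0.7, 0.3],   # P(A->A), P(A->B)
              [0.5, 0.5]])  # P(B->A), P(B->B)

dist = np.array([1.0, 0.0])  # start in state A
for _ in range(100):         # repeated application converges quickly here
    dist = dist @ P

print(dist)  # ~[0.625, 0.375], the steady-state distribution
```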
• Discrete state space : S = { A , B }
• State transition probability : P(S’|S) = {PAA = 0.7 , PAB = 0.3 , PBA = 0.5 , PBB = 0.5 }
• Reward function : R(S’=A) = +1 , R(S’=B) = -1
• Purpose : Finding an expected reward distribution
Markov Reward Processes
• Discrete state space : S = { A , B }
• Discrete action space : A = { X, Y }
• (Action conditional) State transition probability : P(S’|S , A) = { … }
• Reward function : R(S’=A) = +1 , R(S’=B) = -1
• Purpose : Finding an optimal policy (maximizing the expected sum of future rewards)
Markov Decision Processes
•Markov decision processes :
a mathematical framework for modeling decision
making.
• MDPs are solved via dynamic programming and reinforcement learning.
• Applications : robotics, automated control, economics and manufacturing.
•Examples of MDPs :
1) AlphaGo formulates the game of Go as an MDP
2) Driving a car can be formulated as an MDP
3) The stock market can be formulated as an MDP
Markov Decision Processes
•Objective : Finding an optimal policy which maximizes the expected
sum of future rewards
•Algorithms
1) Planning : Exhaustive Search / Dynamic Programming
2) Reinforcement Learning : MC method / TD Learning (Q-learning, …)
Agent-Environment Interaction
Discount Factor
• Sum of future rewards in episodic tasks
→ Gt := Rt+1 + Rt+2 + Rt+3 + … + RT
• Sum of future rewards in continuing tasks
→ Gt := Rt+1 + Rt+2 + Rt+3 + … + RT + …
Gt → ∞ (diverges)
• Sum of discounted future rewards in both cases
→ Gt := Rt+1 + γRt+2 + γ²Rt+3 + … + γ^(T−1)RT + … = Σk=1..∞ γ^(k−1) Rt+k (converges)
(Rt is bounded / γ : discount factor, 0 ≤ γ < 1)
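A quick numeric sketch of the discounted return; the reward sequence below is made up purely for illustration:

```python
def discounted_return(rewards, gamma=0.9):
    # G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    return sum(gamma ** k * r for k, r in enumerate(rewards))

print(discounted_return([1, -1, 1, 1]))  # 1 - 0.9 + 0.81 + 0.729 = 1.639
```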
Deterministic Policy, a = f(s)

State s | f(s)
A       | Y
B       | Y

Stochastic Policy, p(a|s)

State s | P(X|s) | P(Y|s)
A       | 0.3    | 0.7
B       | 0.4    | 0.6
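Both tables translate directly into code: a deterministic policy is a plain lookup, while a stochastic policy samples an action from the conditional distribution. A minimal sketch using the table values above:

```python
import random

deterministic_policy = {"A": "Y", "B": "Y"}       # a = f(s)
stochastic_policy = {"A": {"X": 0.3, "Y": 0.7},   # p(a|s)
                     "B": {"X": 0.4, "Y": 0.6}}

def act(state):
    # Sample an action according to p(a|s).
    probs = stochastic_policy[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(deterministic_policy["A"], act("A"))
```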
• State : 5x5 grid
• Action : move in 4 directions
• Reward : +10 on reaching A, +5 on reaching B, -1 for hitting a wall, 0 otherwise
• Discount Factor : 0.9
Gridworld
Value-based Reinforcement Learning
Value Function
•We will introduce a value function, which tells us the expected
sum of future rewards from a given state s when following policy π.
1) State-value function
2) Action-value function
1) Policy from state-value function
→ One-step-ahead search over all actions,
using the state transition probability (the model).
2) Policy from action-value function
→ a = f(s) = argmax_a q_π(s, a)
Optimal Control(Policy) with Value Function
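Extracting a policy from the action-value function is a one-line argmax over actions; no model is needed, unlike the state-value case. A sketch with hypothetical tabular q values:

```python
q = {("A", "X"): 0.2, ("A", "Y"): 0.8,   # hypothetical action values
     ("B", "X"): 0.5, ("B", "Y"): 0.1}

def greedy_policy(state, actions=("X", "Y")):
    # a = f(s) = argmax_a q(s, a)
    return max(actions, key=lambda a: q[(state, a)])

print(greedy_policy("A"), greedy_policy("B"))  # Y X
```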
•Objective : Finding an optimal policy which maximizes
the expected sum of future rewards
•Algorithms
1) Planning : Exhaustive Search / Dynamic Programming
→ Expected value of a die roll = 1·(1/6) + 2·(1/6) + … + 6·(1/6) = 3.5
2) Reinforcement Learning : MC method / TD Learning
→ Expected value of a die roll = average of the faces observed over 100 rolls ≈ 3.5
Planning vs Learning
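The dice analogy in code: planning computes the expectation from the known distribution, while learning estimates it from samples of an unknown one. A minimal sketch:

```python
import random

# Planning: the distribution is known, so compute the mean exactly.
exact_mean = sum(face / 6 for face in range(1, 7))       # 3.5

# Learning: the distribution is treated as unknown, so average sampled rolls.
rolls = [random.randint(1, 6) for _ in range(10000)]
sample_mean = sum(rolls) / len(rolls)                    # close to 3.5

print(exact_mean, sample_mean)
```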
Solution of the MDP : Planning
•So our last job is to find the optimal policy, and there are two approaches :
1) Exhaustive Search
2) Dynamic Programming
Find Optimal Policy with Exhaustive Search
If we know the one-step dynamics of the MDP,
P(s',r|s,a), we can do an exhaustive search
iteratively until the final step T,
and then choose the optimal action path. But
this needs O(N^T) computations!
[Figure: exhaustive search tree. Actions X and Y branch each state into A (+1) and B (-1), doubling at every step out to depth T]
Dynamic Programming
•Solving the whole problem at once repeats the same computations →
divide it into subproblems so each is computed only once
•Computing path by path always re-does the earlier segments →
finish the subproblem computation two steps at a time, then move on to the next step.
Dynamic Programming
• We can apply DP to this problem, and the computational cost
drops to O(N²T). (But we still need to know the environment
dynamics.)
• DP is a computer-science technique that computes the final goal
value by composing cumulative partial values.
Policy Iteration
• Policy iteration consists of two simultaneous,
interacting processes.
• Policy evaluation (Iterative formulation):
First, make the value function more precise with the
current policy.
• Policy improvement (Greedy selection):
Second, make the policy greedy with respect to
the current value function.
Policy Iteration
•How can we get the state-value function with DP?
(Each state value is a subproblem; the action-value function is computed similarly.)
Policy Iteration = Policy Evaluation + Policy Improvement
Policy Iteration
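A compact sketch of policy iteration on the two-state, two-action MDP from the earlier slides. The slides leave P(s'|s,a) elided, so the dynamics below are hypothetical; the reward R(s'=A)=+1, R(s'=B)=-1 is from the slides:

```python
import numpy as np

# Hypothetical dynamics P[s, a, s'] (the slides leave P(s'|s,a) unspecified).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # from A, under actions X / Y
              [[0.6, 0.4], [0.3, 0.7]]])  # from B, under actions X / Y
R = np.array([1.0, -1.0])                 # R(s'=A) = +1, R(s'=B) = -1
gamma, n_states, n_actions = 0.9, 2, 2

policy = np.zeros(n_states, dtype=int)    # start with action X everywhere
while True:
    # Policy evaluation: iterate the Bellman expectation backup.
    V = np.zeros(n_states)
    for _ in range(500):
        V = np.array([P[s, policy[s]] @ (R + gamma * V) for s in range(n_states)])
    # Policy improvement: greedy selection w.r.t. the current V.
    improved = np.array([np.argmax([P[s, a] @ (R + gamma * V) for a in range(n_actions)])
                         for s in range(n_states)])
    if np.array_equal(improved, policy):
        break                             # policy is stable, hence optimal
    policy = improved

print(policy, V)  # optimal action index per state, and its state values
```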
Solution of the MDP : Learning
•The planning methods must know the perfect dynamics of the
environment, P(s',r|s,a).
•But typically this is very hard to know, and often empirically
impossible to obtain. Therefore we ignore the model and simply
estimate the mean from samples.
•This is the point where machine learning enters.
1) Monte Carlo Methods
2) Temporal-Difference Learning
(a.k.a. reinforcement learning)
Monte Carlo Methods
Starting state | Value
S1             | Average of G(S1)
S2             | Average of G(S2)
S3             | Average of G(S3)
Tabular state-value function
Monte Carlo Methods
•We need a full-length episode of
experience for each starting state.
•It is very time consuming to wait for
the end of an episode just to
update one state.
•Not applicable to continuing tasks.
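A minimal sketch of tabular every-visit Monte Carlo evaluation: collect complete episodes, then average the returns observed from each state. The toy episode generator is hypothetical:

```python
import random
from collections import defaultdict

def sample_episode():
    # Hypothetical episodic task: a short random walk over three states.
    states, rewards, s = [], [], "S1"
    for _ in range(4):
        states.append(s)
        s = random.choice(["S1", "S2", "S3"])
        rewards.append(1.0 if s == "S3" else 0.0)
    return states, rewards

returns = defaultdict(list)
for _ in range(1000):
    states, rewards = sample_episode()
    G = 0.0
    for t in reversed(range(len(states))):   # accumulate the return backwards
        G = rewards[t] + 0.9 * G
        returns[states[t]].append(G)         # every-visit MC

V = {s: sum(gs) / len(gs) for s, gs in returns.items()}
print(V)  # tabular state values: average of observed returns G(s)
```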
Q-learning (Temporal Difference Learning)
Bellman Equation
(iterative formulation)
Temporal-Difference Learning
•TD learning is a combination of Monte Carlo idea and
dynamic programming (DP) idea.
•Like Monte Carlo methods, TD methods can learn directly
from raw experience without a model of the environment's
dynamics.
•Like DP, TD methods update estimates based in part on other
learned estimates, without waiting for a final outcome (they bootstrap).
Temporal-Difference Learning
Q-learning Algorithm
Temporal-Difference Learning
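The heart of the Q-learning algorithm fits in a few lines: the TD target bootstraps from the current estimate max_a' Q(s',a') instead of waiting for a full Monte Carlo return. A tabular sketch (the transition values are illustrative):

```python
from collections import defaultdict

Q = defaultdict(float)       # tabular action-value estimates Q[(s, a)]
alpha, gamma = 0.1, 0.9
actions = ["X", "Y"]

def q_learning_update(s, a, r, s_next):
    # TD target bootstraps from the current estimate of the next state.
    td_target = r + gamma * max(Q[(s_next, b)] for b in actions)
    td_error = td_target - Q[(s, a)]
    Q[(s, a)] += alpha * td_error

q_learning_update("A", "X", 1.0, "B")   # one step of experience
print(Q[("A", "X")])                    # 0.1
```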
On-policy / Off-policy
•On-policy : Target policy = Behavior policy
→ there can be only one policy.
→ Can learn a stochastic policy. Ex) SARSA
•Off-policy : Target policy ≠ Behavior policy
→ there can be several policies.
→ Sample efficient. Ex) Q-learning
Sarsa: On-Policy TD Control
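The on-policy difference shows up in a single line: SARSA's target uses the action a' the behavior policy actually took, not the max over actions. A sketch in the same tabular setting as above:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    # On-policy: bootstrap from Q(s', a') for the action actually taken,
    # rather than max_a' Q(s', a') as in Q-learning.
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```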
Eligibility Trace
•Smoothly interpolates between TD and MC.
Eligibility Trace
Comparisons
Planning + Learning
•The only difference between planning and learning is
the existence of a model.
•So we call planning a model-based method, and learning a
model-free method.
Planning + Learning
Deep Reinforcement Learning
•Value function approximation with deep learning
→ Large-scale or infinite-dimensional state spaces become solvable
•Generalization with deep learning
This requires supervised learning techniques and online moving-target regression.
Deep Q Networks (DQN)
•In Q-learning, a CNN is used as the action-value function for a given state.
[Figure: network diagram. Input: State; outputs: Q(State, Action=0), Q(State, Action=1), …, Q(State, Action=K)]
Loss function
•The Q-learning update rule
•The magnitude of the TD error is used as the loss, and the network is trained by regression
TD error
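A sketch of the DQN loss on a single transition, with plain functions standing in for the networks; q_net and target_net here are hypothetical stubs that map a state to a vector of K action values:

```python
import numpy as np

def q_net(state):        # hypothetical online network: state -> Q(s, .)
    return np.array([0.4, 0.6])

def target_net(state):   # hypothetical frozen target network
    return np.array([0.3, 0.5])

def dqn_loss(s, a, r, s_next, gamma=0.99):
    target = r + gamma * target_net(s_next).max()  # fixed regression target
    td_error = target - q_net(s)[a]
    return td_error ** 2                           # squared TD error as the loss

print(dqn_loss(s=0, a=1, r=1.0, s_next=0))  # (1.495 - 0.6)^2
```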
Target Network
• Moving Target Regression :
the moment Q is updated, the target value changes along with it → degraded training
→ Target Network : a separate network with the same architecture as the DQN,
whose weights stay fixed during training; using it to compute Q(s',a') gives a
fixed target value and improves training stability
(the Target Network's weights are periodically copied from the DQN)
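The trick itself is just a periodic parameter copy; a sketch of the schedule (the period of 1000 steps is a typical choice, not from the slides):

```python
SYNC_EVERY = 1000   # copy period in training steps

def maybe_sync(step, online_params, target_params):
    # Between copies the target stays frozen, so the regression target is fixed.
    if step % SYNC_EVERY == 0:
        return {name: w.copy() for name, w in online_params.items()}
    return target_params
```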
Replay Memory
•Data inefficiency : reuse past experience via off-policy
learning (mini-batch learning)
•Data correlation : if the mini-batch is built online, the data
within it are similar to one another (correlated)
→ training becomes biased toward one side and unbalanced
→ Replay Memory : store (State, Action, Reward, Next State)
tuples in a buffer and build each mini-batch by random sampling
from that buffer
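A minimal replay memory: a bounded buffer of (state, action, reward, next state) tuples with uniform random sampling to decorrelate the mini-batch (a sketch):

```python
import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100000):
        self.buffer = deque(maxlen=capacity)   # oldest experience is evicted

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation.
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory()
for i in range(100):
    memory.push(i, 0, 1.0, i + 1)
print(len(memory.sample()))  # 32
```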
Reward Clipping
•Reward Scale problem : the scale of rewards differs by domain,
so the magnitude of the learned Q-values varies as well.
When the Q-values to be learned have very large variance
(e.g. -1000 to 1000), neural network training is difficult.
→ Reward Clipping : clip rewards into the range -1 to 1
for stable training
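Reward clipping is a one-liner applied to every observed reward before the update (sketch):

```python
def clip_reward(r):
    # Squash domain-specific reward scales into [-1, 1] for stable training.
    return max(-1.0, min(1.0, r))

print(clip_reward(250.0), clip_reward(-3.0), clip_reward(0.5))  # 1.0 -1.0 0.5
```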
Performance of DQN
• Evaluated on classic ATARI 2600 games
• Better than human players on more than half of the games
• A dramatic improvement over the previous (linear) approach
• Fails to learn some games
(sparse rewards, games with complex multi-stage structure)
DQN Demo
Improved DQN models - Double DQN
•Positive bias : the max operator tends to overestimate
Q-values relative to their true values
→ Double Q-learning prevents this phenomenon and
enables faster learning than DQN
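Double DQN decouples action selection from action evaluation: the online network picks the argmax action, and the target network scores it. A sketch reusing the hypothetical q_net/target_net stubs from the loss example:

```python
import numpy as np

def q_net(state):        # hypothetical online network
    return np.array([0.4, 0.6])

def target_net(state):   # hypothetical target network
    return np.array([0.3, 0.5])

def double_dqn_target(r, s_next, gamma=0.99):
    a_star = int(np.argmax(q_net(s_next)))         # select with the online net
    return r + gamma * target_net(s_next)[a_star]  # evaluate with the target net

print(double_dqn_target(r=1.0, s_next=0))  # 1.495
```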
Improved DQN models - Prioritized Replay Memory
•A standard DQN samples uniformly from the Replay Memory
→ Raising the probability that data with high TD error are selected
enables faster and more efficient training
References
[1] Sutton, Richard S., and Andrew G. Barto. Reinforcement learning: An
introduction. MIT press, 1998.
[2] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement
learning." Nature 518.7540 (2015): 529-533.
[3] Van Hasselt, Hado, Arthur Guez, and David Silver. "Deep Reinforcement
Learning with Double Q-Learning." AAAI. 2016.
[4] Wang, Ziyu, et al. "Dueling network architectures for deep reinforcement
learning." arXiv preprint arXiv:1511.06581 (2015).
[5] Schaul, Tom, et al. "Prioritized experience replay." arXiv preprint
arXiv:1511.05952 (2015).
Appendix
• Atari 2600 - https://www.youtube.com/watch?v=iqXKQf2BOSE
• Super MARIO - https://www.youtube.com/watch?v=qv6UVOQ0F44
• Robot Learns to Flip Pancakes - https://www.youtube.com/watch?v=W_gxLKSsSIE
• Stanford Autonomous Helicopter - Airshow #2 -
https://www.youtube.com/watch?v=VCdxqn0fcnE
• OpenAI Gym - https://gym.openai.com/envs
• Awesome RL - https://github.com/aikorea/awesome-rl
• Udacity RL course
• TensorFlow DRL - https://github.com/nivwusquorum/tensorflow-deepq
• Karpathy rldemo -
http://cs.stanford.edu/people/karpathy/convnetjs/demo/rldemo.html
Thank you.
Editor's Notes
  1. Hello, I'm DongHyun Kwak, a Ph.D. student at Seoul National University. I'll be presenting reinforcement learning, and you may wonder why reinforcement learning keeps coming up in a deep learning seminar. So I'd like to start by drawing the big picture and work our way toward reinforcement learning.
  2. In the past, artificial intelligence was subdivided, from a human-centric viewpoint, into eight topics. One of those topics was that machines should be able to learn, so machine learning was merely a subtopic of AI. Now, however, all eight AI topics are starting to be realized with machine-learning-based methods, and "artificial intelligence" and "machine learning" are used almost as synonyms. Deep learning, in turn, was just one subfield of machine learning, but as it advanced dramatically relative to the other algorithms, people began to talk as if "AI = machine learning = deep learning". Machine learning and deep learning must nevertheless be distinguished, because the problems that arise in deep learning research get solved by combining it with other machine learning algorithms. In other words, you need to know the existing machine learning algorithms in order to combine them with deep learning and create new breakthroughs. One prime example is Deep Reinforcement Learning. So the purpose of this talk is to introduce reinforcement learning as another machine learning algorithm that can be combined with deep learning.
  3. In DeepRL, deep learning is used merely as a function approximator. These two points are the genuinely powerful properties deep learning has over other machine learning algorithms, and they are all the deep learning background the DeepRL part of this talk requires. → In fact, deep learning's hypothesis is well suited to feature learning from such low-level observations; if good features have already been learned, a linear model may even do better.
  4. Whereas supervised and unsupervised learning are relatively dry mathematical models, reinforcement learning has a different character: the problem it sets out to solve is, from the outset, a theoretical approach to modeling the decision making of a living organism.
  5. To solve the rat-learning problem above with reinforcement learning, it must first be formulated with the theoretical tool called an MDP. A crawling robot: a Q-learning example: https://www.youtube.com/watch?v=2iNrJx6IDEo Deep Reinforcement Learning for Robotic Manipulation : https://youtu.be/ZhsEKTo7V04?t=1m27s
  6. Other reasons for the discount factor: 1) Prefer immediate reward over future reward 2) Reduce the variance of future returns
  7. A classic example of a stochastic policy: rock-paper-scissors
  8. Prediction : computing an accurate value function for a given policy. Control : finding the optimal policy.
  9. In dynamic programming, "programming" means a method of building a program that solves a problem (as in linear programming); "dynamic" refers to the problem being sequential.
  10. Prediction problem : computing an accurate value function. Control problem : finding the optimal policy.