Reinforcement Learning Anatomy Class: Rainbow from Theory to Implementation (2nd dlcat in Daejeon)

Reinforcement Learning Anatomy Class :
Rainbow
From Theory to Implementation

Speakers
박진우 Curt Park
curt.park@medipixel.io
AI Research / Developer in Medipixel
김경환 Kyunghwan Kim
kh.kim@medipixel.io
AI Research / Developer in Medipixel

Guidewire control by RL
https://youtu.be/uAZtUNwA4i0

Contents
1. Reinforcement learning?
2. Reinforcement learning and deep reinforcement learning
- Q-Learning
- Function approximation
- DQN
- Rainbow DQN

Reinforcement learning?
DeepMind AlphaGo

https://www.youtube.com/watch?v=TmPfTpjtdgg
Reinforcement learning?
DeepMind DQN with Atari game

https://youtu.be/UZHTNBMAfAA
Reinforcement learning?
OpenAI Dota2

https://www.youtube.com/watch?v=cUTMhmVh1qs
Reinforcement learning?
DeepMind AlphaStar

Reinforcement learning?
https://youtu.be/Dr0RvX1F-YQ
“Sim-to-Real Reinforcement Learning for Deformable Object Manipulation”
J. Matas, S. James and A. J Davison
CoRL, 2018

Reinforcement learning?
https://www.youtube.com/watch?v=FmMPHL3TcrE
“Learning to Walk via Deep Reinforcement Learning”
T. Haarnoja et al.
arXiv:1812.11103v1

Reinforcement learning?
● Learning through interaction
< Figure: example rewards of +10 and -100 >

Reinforcement learning?
ref : Reinforcement Learning: An Introduction, 2nd ed (Sutton and Barto)
Reinforcement learning is learning what to do - how to map situations to actions -
so as to maximize a numerical reward signal

Reinforcement learning?
● Reinforcement Learning
- Learns from rewards.
- A reward is the environment's response to the action the agent chose
in the current state.
- The agent learns to act so as to maximize reward.

Markov Decision Process
Agent Environment
State, Reward
Action
< MDP framework >

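Characteristics of reinforcement learning
● Characteristics of reinforcement learning
➢ Trial-and-error search
○ The learner (agent) needs many rounds of trial and error before it finds the
action best suited to each situation.
➢ Delayed reward
○ In complex situations, the action chosen now affects not only the immediate
reward but also future rewards.

● Trial-and-error search
➢ Exploitation
- Select actions using the learned policy
➢ Exploration
- Select actions at random in order to explore

Exploration and exploitation
< A favorite restaurant > < A new restaurant >
Exploitation Exploration
< Exploitation & Exploration >

● Delayed reward
- Because the current action affects future rewards, the short-term reward alone
does not reveal the value of the current action.
- So consider the long-term reward of an action!
→ Return !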

Return
S0 S1 S2 S3 S4 ST
R1
< Short-term reward >

Return
S0 S1 S2 S3 S4 ST
R1 R2 R3 R4 R5
< Long-term reward >

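Return
● Return
➢ The sum of the rewards received from time step t up to the final time step T
$G_t = R_{t+1} + R_{t+2} + \dots + R_T$

Return
● Discounted Return
➢ The sum of the rewards with a discount factor $\gamma$ applied
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = \sum_{k=0}^{T-t-1} \gamma^k R_{t+k+1}$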
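
The discounted return above can be computed with a short backward pass over the rewards. This is a minimal sketch; the function name `discounted_return` and the example reward list are illustrative assumptions, not taken from the talk.

```python
def discounted_return(rewards, gamma):
    """Return R1 + gamma * R2 + gamma^2 * R3 + ... for a finite list of rewards."""
    g = 0.0
    # Walk backwards so that each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [0.0, 0.0, 1.0]  # hypothetical rewards R1, R2, R3 of one episode
print(discounted_return(rewards, gamma=1.0))  # plain (undiscounted) return
print(discounted_return(rewards, gamma=0.9))  # discounted return
```

With gamma = 1 this reduces to the plain sum of rewards, which is the undiscounted return.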

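Policy
What answer are we looking for in the MDP model?

Policy
● Policy
➢ The probability of taking a given action in a given state.
○ Written as π(a∣s).
○ The answer we want to obtain through reinforcement learning.
○ Solving the MDP problem means finding the optimal policy!

Finding the optimal policy
- A function that can evaluate a policy π :
- The optimal policy

Which function can serve as the goodness function?
→ Value function !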

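Value function
● Value function
➢ The future value of states and actions when following policy π
- Evaluates how good every state (or action) is under a given policy.
- Measure of goodness : the expected return from state s.
- The value function can be used to evaluate a policy!

Value function
● State value function
➢ Expresses the value of a state when following policy π
○ Value : the expected sum of rewards
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ ($G_t$ : return)

Value function
● State-action value function (Q function)
➢ Expresses the value of a state and an action when following policy π
$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$ ($G_t$ : return)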

Reinforcement learning and the value function
Choose the action that maximizes the reward
Choose the action that maximizes the return (the reward including future rewards)
Choose the policy that maximizes the expected return
Expected Return
= Value Function

Value-based RL
Score every action available in the current state (value function)
and pick the action with the highest score (greedy policy)!

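Value-based RL
- Score : Q-value
action 1 → Q : 10
action 2 → Q : -5

Q Learning
● Value function update
- Q learning
$Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]$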
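
A single tabular Q-learning update can be sketched as follows. The function name `q_learning_update`, the step size `alpha`, and the 16 x 4 table shape (matching a 4x4 grid with four actions) are assumptions made for illustration.

```python
import numpy as np


def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Target: r + gamma * max_a' Q(s', a'); terminal states do not bootstrap.
    target = r if done else r + gamma * np.max(Q[s_next])
    td_error = target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error


Q = np.zeros((16, 4))  # 16 states x 4 actions, all Q-values start at 0
# Hypothetical transition: moving RIGHT (2) from state 14 reaches the goal (15).
q_learning_update(Q, s=14, a=2, r=1.0, s_next=15, done=True, alpha=1.0, gamma=1.0)
# One step earlier: moving RIGHT from state 13 now bootstraps from state 14.
q_learning_update(Q, s=13, a=2, r=0.0, s_next=14, done=False, alpha=1.0, gamma=1.0)
print(Q[13, 2], Q[14, 2])
```

With alpha = 1 and gamma = 1 the value of 1 is copied backwards one state per update, which is the kind of propagation a grid walk-through illustrates.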

FrozenLake-v0
[State]
S : starting point, safe
F : frozen surface, safe
H : hole, fall to your doom
G : goal
[Action]
LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3
S F F F
F H F H
F F F H
H F F G

Q-Learning
S F F F
F H F H
F F F H
H F F G
< Figure: Q-table overlaid on the grid; all Q-values start at 0 >
Reward : 1
( 𝛾 is assumed to be 1 )

Q-Learning
S F F F
F H F H
F F F H
H F F G
< Figure: Q-table overlaid on the grid; one Q-value has been updated from 0 to 1 >
Reward : 1

Q-Learning
S F F F
F H F H
F F F H
H F F G
< Figure: Q-table overlaid on the grid; two Q-values are now 1 >
Reward : 1

Q-Learning
S F F F
F H F H
F F F H
H F F G
< Figure: Q-table overlaid on the grid; six Q-values are now 1 as the value propagates back from G toward S >
Reward : 1

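Problem with the greedy policy
- Without exploration, once the agent has learned a near-optimal policy it stops
improving.
S F F F
F F F H
F F F H
H F F G

ϵ - greedy Policy
- Chooses between exploration and exploitation according to the value of ϵ :
with probability ϵ take a random action, otherwise take the greedy action
< Figure: a 60 % / 40 % split >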
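
A minimal sketch of ϵ-greedy action selection, assuming the Q-values of the current state are available as a NumPy array. The function name `epsilon_greedy`, the value ϵ = 0.4, and the example Q-values are illustrative choices.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # exploration
    return int(np.argmax(q_values))              # exploitation


rng = np.random.default_rng(0)
q_values = np.array([10.0, -5.0])  # hypothetical Q-values for two actions
actions = [epsilon_greedy(q_values, epsilon=0.4, rng=rng) for _ in range(10_000)]
print("fraction of non-greedy actions:", np.mean(np.array(actions) == 1))
```

Because a random draw can also land on the greedy action, the non-greedy action is chosen with probability ϵ / 2 when there are two actions.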

ϵ - greedy Policy
S F F F
F F F H
F F F H
H F F G

What we have learned?
ref : 가깝고도 먼 DeepRL (이웅원)
https://tykimos.github.io/warehouse/2018-2-7-ISS_Near_and_Far_DeepRL_4.pdf
● Tabular method
- Stores the Q-value for every state-action pair and updates them

Limits of the tabular method
Go : (number of states not recoverable)   Image : 84 x 84 x 3   Robot : continuous
What if the state space is large?

Limits of the tabular method
- Real-world environments → large state space
- What if we build a function that gives similar outputs for similar states? (Generalization!)
- Estimate the function using parameters → Function Approximation
- How? Neural networks + deep learning!

Function Approximation
Q Learning

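Tabular vs Function Approximation
● Tabular method
- Stores the Q-value for every state and action in a table and updates it.
- As the range of states and actions grows, space complexity makes it hard to apply.
● Function Approximation method
- Estimates Q-values using parameters.
- Nowadays this is mostly done with artificial neural networks.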

- parameter : w
ref : David Silver
http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf

- Function Approximation example
S F F F
F H F H
F F F H
H F F G
Function
Approximator
(w)
Q(s, Left)
Q(s, Right)
Q(s, Up)
Q(s, Down)
< Current State s >

Tabular vs Function Approximation
● What "learning" means
- Tabular : update the Q-value of each state and action.
- F.A. : update the parameters by gradient descent.
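
To make the contrast concrete, here is a sketch of a gradient-descent update for a linear function approximator with parameters w. The one-hot feature encoding, the helper names, and the learning rate are assumptions for illustration; a neural network would replace the linear map.

```python
import numpy as np


def one_hot_features(state, action, n_states, n_actions):
    """Hypothetical feature vector x(s, a): one-hot over state-action pairs."""
    x = np.zeros(n_states * n_actions)
    x[state * n_actions + action] = 1.0
    return x


def q_hat(w, x):
    """Linear approximation of Q(s, a) with parameters w."""
    return float(w @ x)


def semi_gradient_update(w, x, target, lr):
    """One gradient step on 0.5 * (target - q_hat)^2, treating target as constant."""
    error = target - q_hat(w, x)
    # For a linear approximator the gradient of q_hat with respect to w is x.
    return w + lr * error * x


w = np.zeros(16 * 4)
x = one_hot_features(state=14, action=2, n_states=16, n_actions=4)
w = semi_gradient_update(w, x, target=1.0, lr=0.5)
print(q_hat(w, x))
```

With one-hot features this reduces to the tabular update, and richer features let similar states share parameters.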

DQN
https://www.youtube.com/watch?v=TmPfTpjtdgg
● Demo video from the DQN paper

DQN
https://www.nature.com/articles/nature14236
● The paper shows the DQN agent surpassing human-level performance

DQN
● Deep Q Network (2015)
- Convolutional Neural Network
- Experience Replay
- Fixed Target Network

DQN - CNN
● Convolutional Neural Network
- Uses image (pixel) data as the input.
- Can be applied across environments without substantially redesigning the state representation.
- e.g. Car game : current speed, acceleration, … → image of the car game
Claw machine game : claw position, claw speed, … → image of the claw machine game

DQN - CNN
https://www.nature.com/articles/nature14236
Input : pixel values of the game screen
Output : Q-value for each action

DQN - Experience Replay
ref: 강화학습기초부터DQN까지 (박진우)
https://www.slideshare.net/CurtPark1/dqn-reinforcement-learning-from-basics-to-dqn
● Challenge 1 - Correlation between samples
- In reinforcement learning, samples are collected sequentially over time, so they are
highly correlated.
- High correlation between samples makes training unstable.

● Experience replay
- Store transitions (S, A, R, S’) in a memory (buffer) and train on batches.
- Breaks the correlation between data (transitions).
- Enables training in batches.

DQN Agent Environment
State, Reward
Action
Replay Buffer
[S, A, R, S’]
[S, A, R, S’]
Sample a batch
Stores N transitions
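
A replay buffer of this kind can be sketched with the standard library alone. The class name `ReplayBuffer`, the extra `done` flag stored with each transition, and the capacity used below are illustrative assumptions.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory that stores transitions and samples them uniformly."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen drops the oldest transition once it is full.
        self.memory = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, done):
        self.memory.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks the temporal ordering.
        batch = self.rng.sample(list(self.memory), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.memory)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.store(state=t, action=0, reward=0.0, next_state=t + 1, done=False)
states, actions, rewards, next_states, dones = buffer.sample(batch_size=2)
print(len(buffer), states)
```

In practice the sampled tuples would be stacked into arrays or tensors before the network update.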

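DQN - Target Network
● Challenge 2 - Non-stationary targets
- In the loss function, both the target and the current value are computed from the
parameters w.
- When w is updated, the target changes as well.
$L(w) = \left( r + \gamma \max_{a'} Q(s', a'; w) - Q(s, a; w) \right)^2$
target $= r + \gamma \max_{a'} Q(s', a'; w)$

DQN - Target Network
● Target network
- Add a network that is updated only every fixed number of steps and use it as the target during updates
Main Network → (copy every k steps) → Target Network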
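
The periodic copy can be sketched with plain parameter dictionaries. Here `sync_target`, the interval `k = 4`, and the stand-in "gradient step" are hypothetical; a real implementation would copy the weights of a neural network instead.

```python
import numpy as np


def sync_target(main_params, target_params):
    """Hard update: copy every main-network parameter into the target network."""
    for name, value in main_params.items():
        target_params[name] = value.copy()


main_params = {"w": np.zeros(3)}
target_params = {"w": np.zeros(3)}
k = 4  # hypothetical copy interval, in steps

for step in range(1, 11):
    main_params["w"] += 1.0  # stand-in for one gradient update of the main network
    if step % k == 0:
        sync_target(main_params, target_params)

print(main_params["w"], target_params["w"])
```

Between copies the target parameters stay frozen, so the regression target does not move with every update of the main network.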

DQN - Additional change
● Gradient Clipping
- When the absolute value of the gradient of the loss function is 1 or more, clip it to 1
ref: wiki
https://en.wikipedia.org/wiki/Huber_loss
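
Clipping the gradient of a squared error to [-1, 1] corresponds to using the Huber loss with delta = 1. The sketch below shows both the loss and its gradient with respect to the TD error; the function names are illustrative.

```python
import numpy as np


def huber_loss(td_error, delta=1.0):
    """Quadratic for |error| <= delta, linear beyond it."""
    abs_err = np.abs(td_error)
    quadratic = 0.5 * td_error ** 2
    linear = delta * (abs_err - 0.5 * delta)
    return np.where(abs_err <= delta, quadratic, linear)


def huber_grad(td_error, delta=1.0):
    """Gradient of the Huber loss w.r.t. the error: the error clipped to [-delta, delta]."""
    return np.clip(td_error, -delta, delta)


errors = np.array([0.5, 3.0, -3.0])
print(huber_loss(errors))
print(huber_grad(errors))
```

Large errors therefore contribute a gradient of magnitude at most 1, which limits the size of any single update.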

Rainbow DQN?
1. Deep Q Network
2. Double Q-learning
3. Prioritized Replay
4. Dueling Networks
5. Multi-step Learning
6. Distributional RL
7. Noisy Network

Rainbow DQN
● Rainbow is shown outperforming the other algorithms

Problems with Q-learning
- Q-learning updates Q with a maximization.
- The maximization causes overestimation.
- That is, the Q-values become overly optimistic predictions.
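
The overestimation can be reproduced with a small simulation. It assumes 10 actions whose true mean reward is -0.1 with unit-variance noise and 5 samples per action; those numbers, and all variable names, are illustrative assumptions. Taking the max over noisy estimates is compared with a double estimator that selects with one set of samples and evaluates with an independent one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_actions, n_samples = 10_000, 10, 5
true_mean = -0.1  # every action has the same true value

# Two independent sets of noisy reward samples for each action.
q1 = rng.normal(true_mean, 1.0, size=(n_trials, n_actions, n_samples)).mean(axis=2)
q2 = rng.normal(true_mean, 1.0, size=(n_trials, n_actions, n_samples)).mean(axis=2)

# Single estimator: select and evaluate with the same estimates.
single_mean = q1.max(axis=1).mean()

# Double estimator: select with q1, evaluate the chosen action with q2.
best = q1.argmax(axis=1)
double_mean = q2[np.arange(n_trials), best].mean()

print("true value      :", true_mean)
print("single estimator:", single_mean)
print("double estimator:", double_mean)
```

The single estimator comes out clearly positive even though no action is actually better than -0.1, while the double estimator stays close to the true value.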

Double Q-learning
< Figure: example MDP; mean reward -0.1 >

Double Q-learning
Left (mean reward : -0.1) < Right (reward : 0)

Double Q-learning
< Q-table >
A Right 0
A Left 0
B action1 0

Double Q-learning
< Q-table >
A Right 0
A Left 0
B action1 +0.2
(mean reward -0.1, sampled reward 0.2)

Double Q-learning
< Q-table >
A Right 0
A Left +0.2
B action1 +0.2

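Double Q-learning
Q → Q1, Q2
Q-learning : $Y = R + \gamma \max_a Q(S', a)$
Q-learning (rewritten) : $Y = R + \gamma\, Q(S', \arg\max_a Q(S', a))$
Double Q-learning : $Y = R + \gamma\, Q_2(S', \arg\max_a Q_1(S', a))$
or $Y = R + \gamma\, Q_1(S', \arg\max_a Q_2(S', a))$

Double Q-learning
< Graph: average over 10,000 runs >

Double DQN
- The two Q estimators Q1, Q2 → DQN's main Q and target Q
- main Q : selects the action whose Q-value is the maximum.
- target Q : estimates the value of that action, which is then used for the update.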
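
The difference between the DQN target and the Double DQN target can be sketched on a small batch. The array names (`next_q_main`, `next_q_target`) and the example numbers are hypothetical; in practice both arrays would be network outputs for the next states.

```python
import numpy as np


def dqn_target(reward, next_q_target, done, gamma):
    """DQN: the target network both selects and evaluates the next action."""
    return reward + gamma * (1.0 - done) * next_q_target.max(axis=1)


def double_dqn_target(reward, next_q_main, next_q_target, done, gamma):
    """Double DQN: main network selects the action, target network evaluates it."""
    best_action = next_q_main.argmax(axis=1)
    evaluated = next_q_target[np.arange(len(best_action)), best_action]
    return reward + gamma * (1.0 - done) * evaluated


reward = np.array([0.0, 1.0])
done = np.array([0.0, 1.0])                          # second transition is terminal
next_q_main = np.array([[1.0, 2.0], [3.0, 0.0]])     # Q_main(s', .)
next_q_target = np.array([[5.0, 0.5], [0.2, 4.0]])   # Q_target(s', .)

print(dqn_target(reward, next_q_target, done, gamma=0.9))
print(double_dqn_target(reward, next_q_main, next_q_target, done, gamma=0.9))
```

For the first transition the main network prefers action 1, so Double DQN evaluates that action with the target network instead of taking the target network's own maximum.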

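Prioritized Replay
" Let's improve the replay buffer! "

Prioritized Replay
- Which experiences should be stored?
- Which of the stored experiences should be drawn for training?

Prioritized Replay
Which experiences should get higher priority?
→ TD error !

Prioritized Replay
- TD error : $\delta_i = r + \gamma \max_{a'} Q_{target}(s', a') - Q(s, a)$
- Sampling probability : $P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$, with $p_i = |\delta_i| + \epsilon$
" The larger the TD error, the higher the priority! "
$\alpha \in [0, 1]$
$\alpha = 0$ gives uniform sampling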

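Prioritized Replay
Problem with applying PER, as pointed out in the paper
- Because transitions are sampled according to assigned priorities,
- the distribution of sampled transitions differs from the distribution of transitions
in the actual environment → bias during training
- Correct for it with importance-sampling weights when updating the network!

Prioritized Replay
$w_i = \left( \frac{1}{N} \cdot \frac{1}{P(i)} \right)^\beta$ (N : buffer size, P(i) : sampling probability of transition i)
$\beta \in [0, 1]$
$\beta = 0$ means no weight correction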
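
The roles of alpha and beta can be sketched as follows, assuming proportional prioritization. The function names, the small constant `eps`, and the example TD errors are illustrative; dividing the weights by their maximum follows the normalization described in the PER paper.

```python
import numpy as np


def per_probabilities(td_errors, alpha, eps=1e-6):
    """Sampling probabilities from |TD error|; alpha = 0 gives uniform sampling."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()


def importance_weights(probs, beta):
    """Importance-sampling weights (N * P(i))^-beta, scaled so the largest is 1."""
    n = len(probs)
    weights = (n * probs) ** (-beta)
    return weights / weights.max()


td_errors = np.array([0.1, 1.0, 2.0])  # hypothetical TD errors of stored transitions
probs = per_probabilities(td_errors, alpha=1.0)
weights = importance_weights(probs, beta=1.0)

rng = np.random.default_rng(0)
batch_idx = rng.choice(len(probs), size=2, p=probs)  # prioritized sampling
print(probs, weights, batch_idx)
```

Transitions that are sampled more often receive smaller weights, which offsets their over-representation in the update.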

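Dueling Networks
The value of taking a given action in a given state

Dueling Networks
Learning differences is easier than learning exact values!
10? 20?
-10? -20?
Baseline : 0
+5? 3?
-2? -3?

Dueling Networks
Value : the baseline (state value)
Advantage : the difference of the Q-value relative to that baseline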

Dueling Networks
Q(s, a1)
Q(s, a2)
Q(s, a3)
S

Dueling Networks
A(s, a1)
A(s, a2)
A(s, a3)
S
V(s)
Q

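Dueling Networks
● How Q is expressed in a dueling network
Sum : $Q(s, a) = V(s) + A(s, a)$
- But with a plain sum, V and A are not uniquely determined for a given Q
- e.g. if Q = 4, (V, A) could be (1, 3), (2, 2), (3, 1), and so on

Dueling Networks
Max : $Q(s, a) = V(s) + \left( A(s, a) - \max_{a'} A(s, a') \right)$
- Guarantees unique V and A
Average : $Q(s, a) = V(s) + \left( A(s, a) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a') \right)$
- Unlike max, does not guarantee unique V and A, but performs similarly to max
and makes the optimization more stable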
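
Both aggregation rules can be sketched in a few lines. The function name `dueling_q` and the example values are assumptions; `value` and `advantage` stand for the outputs of the two streams of the network.

```python
import numpy as np


def dueling_q(value, advantage, mode="mean"):
    """Combine V(s) with shape (batch, 1) and A(s, a) with shape (batch, n_actions)."""
    if mode == "max":
        baseline = advantage.max(axis=1, keepdims=True)
    elif mode == "mean":
        baseline = advantage.mean(axis=1, keepdims=True)
    else:
        raise ValueError("mode must be 'max' or 'mean'")
    # Subtracting a baseline removes the freedom to shift A by a constant.
    return value + advantage - baseline


value = np.array([[2.0]])
advantage = np.array([[1.0, 3.0, -1.0]])
print(dueling_q(value, advantage, mode="mean"))
print(dueling_q(value, advantage, mode="max"))
```

Adding any constant to every advantage leaves the resulting Q-values unchanged, which is what removes the ambiguity of the plain sum.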

Multi-Step Learning
S0 S1 S2 S3 S4 ST
R1
1 step : $R_1 + \gamma \max_a Q(S_1, a)$
S0 S1 S2 S3 S4 ST
R1 R2 R3
3 step : $R_1 + \gamma R_2 + \gamma^2 R_3 + \gamma^3 \max_a Q(S_3, a)$
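
An n-step target sums the first n discounted rewards and then bootstraps from the value of the state reached after n steps. In this sketch `n_step_target` is a hypothetical helper and `bootstrap_value` stands for the maximum Q-value of that state.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """R1 + gamma * R2 + ... + gamma^(n-1) * Rn + gamma^n * bootstrap_value."""
    target = bootstrap_value
    # Fold the rewards in from the last one back to the first.
    for r in reversed(rewards):
        target = r + gamma * target
    return target


# Hypothetical numbers: rewards of 1 per step and a bootstrap value of 10.
print(n_step_target([1.0], bootstrap_value=10.0, gamma=0.5))            # 1 step
print(n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5))  # 3 step
```

Using more real rewards before bootstrapping makes the target rely less on the current value estimate.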

Distributional RL
ref : RLKorea 민규식, A Distributional Perspective on Reinforcement Learning
https://reinforcement-learning-kr.github.io/2018/10/02/C51/

Distributional RL
" Let's represent the return as a distribution. "
< Figure: S → network → Q (a single value) versus S → network → a distribution over returns >

Distributional RL
Distributional :
General :

Distributional RL
● Value distribution
- x-axis : atoms (or support)
- y-axis : the probability of receiving each atom's value
< Figure: value distribution over N atoms >

Distributional RL
● Q-value
- Uses the expected value of the value distribution
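
Taking the expectation of a categorical value distribution can be sketched as below. The seven atoms 1..7, the two actions, and the raw network outputs (`logits`) are illustrative assumptions.

```python
import numpy as np


def softmax(logits):
    """Turn raw outputs into probabilities over the atoms (last axis)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def q_values(logits, atoms):
    """Q(s, a) = sum_i atom_i * p_i(s, a) for every action."""
    probs = softmax(logits)  # shape (n_actions, n_atoms)
    return probs @ atoms     # shape (n_actions,)


atoms = np.arange(1.0, 8.0)      # hypothetical support: 1, 2, ..., 7
logits = np.zeros((2, 7))        # action 0: uniform distribution
logits[1, -1] = 50.0             # action 1: almost all mass on the last atom
q = q_values(logits, atoms)
print(q, int(np.argmax(q)))
```

The greedy action is then chosen from these expected values exactly as with ordinary Q-values.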

Distributional RL
● Loss function
- Train to reduce the difference between the target value distribution and the
current value distribution
- KL divergence

Distributional RL
● Target value distribution
1. Compute the value distribution for the next state
2. Compute the target atoms

Distributional RL
● Target value distribution
- x-axis : atoms (or support)
- y-axis : the probability of receiving each atom's value
However ...

Distributional RL
● Target value distribution - Projection
- The atoms of the target value distribution do not match those of the value distribution
- The target atoms are shifted by the operation with the reward and γ
- Projection brings them back into alignment.
Atoms : 1 2 3 4 5 6 7 → Target atoms : 1.4 2.3 3.2 4.1 5 5.9 6.8
R = 0.5, γ = 0.9

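Distributional RL
Target atoms : 1.4 2.3 3.2 4.1 5 5.9 6.8
Target atom 3.2 (probability 0.5) falls between atoms 3 and 4 :
- probability moved to atom 3 : 0.5 * (4 - 3.2) = 0.4
- probability moved to atom 4 : 0.5 * (3.2 - 3) = 0.1
ref : RLKorea 민규식, A Distributional Perspective on Reinforcement Learning
https://reinforcement-learning-kr.github.io/2018/10/02/C51/

Distributional RL
Target atoms : 1.4 2.3 3.2 4.1 5 5.9 6.8
Projected onto atoms : 1 2 3 4 5 6 7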
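
The projection step can be sketched with NumPy for a single distribution. The function name `project_distribution` is hypothetical, the atoms are assumed to be evenly spaced, and the example reuses R = 0.5 and γ = 0.9 on atoms 1..7 with probability 0.5 on atoms 3 and 5.

```python
import numpy as np


def project_distribution(next_probs, atoms, reward, gamma):
    """Project the shifted distribution (reward + gamma * atoms) back onto atoms."""
    v_min, v_max = atoms[0], atoms[-1]
    delta_z = atoms[1] - atoms[0]
    # Shift and shrink the atoms, keeping them inside the support.
    tz = np.clip(reward + gamma * atoms, v_min, v_max)
    b = (tz - v_min) / delta_z  # fractional index of each target atom
    lower = np.floor(b).astype(int)
    upper = np.ceil(b).astype(int)

    projected = np.zeros_like(next_probs)
    # Split each probability between the two neighbouring atoms by distance.
    np.add.at(projected, lower, next_probs * (upper - b))
    np.add.at(projected, upper, next_probs * (b - lower))
    # If a target atom lands exactly on an atom, both weights above are 0.
    np.add.at(projected, lower, next_probs * (lower == upper))
    return projected


atoms = np.arange(1.0, 8.0)
next_probs = np.array([0.0, 0.0, 0.5, 0.0, 0.5, 0.0, 0.0])
projected = project_distribution(next_probs, atoms, reward=0.5, gamma=0.9)
print(projected)
```

Atom 3 is shifted to 3.2 and its probability of 0.5 is split into 0.4 for atom 3 and 0.1 for atom 4, while atom 5 maps onto itself and keeps its full probability.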

Distributional RL
● General
S → network → Q(s, a1), Q(s, a2), Q(s, a3), Q(s, a4)
Output size : action size

Distributional RL
● Distributional
S → network → distribution output (action size x atom size) → Expectation → Q(s, a1), Q(s, a2), Q(s, a3), Q(s, a4)
Output size : action size

Noisy Network
● Exploitation and Exploration
< Exploitation & Exploration >

Noisy Network
Is there a more efficient way to explore than ϵ - greedy?

Noisy Network
ϵ - greedy Policy
Random perturbations of the policy
Inefficient for large-scale behavioral patterns

Noisy Network
" Let's explore by adding noise (perturbations) to the network. "
Enables state-dependent exploration.

Noisy Network
S → noisy network → Q(s, a1), Q(s, a2), Q(s, a3)
⊙ : element-wise multiplication

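Noisy Network
● How to generate the Gaussian noise
1. Independent Gaussian noise
- Generate noise with the same sizes as the weight and bias, then apply it to them
- Noise values to generate : (p x q) + q
< Figure: weight noise of size p x q, bias noise of size q >

Noisy Network
2. Factorised Gaussian noise
- Generate one noise vector of the input size (p) and one of the output size (q)
- Combine the two to build noise of size (p x q)
- Noise values to generate : p + q
< Figure: noise vectors of size p and q >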
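
Factorised Gaussian noise can be sketched with NumPy as follows. The scaling function f(x) = sign(x) * sqrt(|x|) follows the Noisy Networks paper; the function names, the (q, p) weight layout, and the parameter names `mu_*` / `sigma_*` are assumptions of this sketch.

```python
import numpy as np


def scale_noise(x):
    """f(x) = sign(x) * sqrt(|x|), applied element-wise."""
    return np.sign(x) * np.sqrt(np.abs(x))


def factorised_noise(p, q, rng):
    """Draw only p + q Gaussian values and expand them to a (q, p) weight noise."""
    eps_in = scale_noise(rng.standard_normal(p))
    eps_out = scale_noise(rng.standard_normal(q))
    weight_noise = np.outer(eps_out, eps_in)  # shape (q, p)
    bias_noise = eps_out                      # shape (q,)
    return weight_noise, bias_noise


def noisy_linear(x, mu_w, sigma_w, mu_b, sigma_b, weight_noise, bias_noise):
    """y = (mu_w + sigma_w * eps_w) x + (mu_b + sigma_b * eps_b), element-wise products."""
    w = mu_w + sigma_w * weight_noise
    b = mu_b + sigma_b * bias_noise
    return w @ x + b


rng = np.random.default_rng(0)
p, q = 3, 2
weight_noise, bias_noise = factorised_noise(p, q, rng)
x = np.ones(p)
y = noisy_linear(x, np.ones((q, p)), np.full((q, p), 0.5),
                 np.zeros(q), np.full(q, 0.5), weight_noise, bias_noise)
print(weight_noise.shape, bias_noise.shape, y)
```

The mu and sigma parameters are learned during training, so the network itself can adjust how much noise, and therefore how much exploration, it uses.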

Reference
● Sutton, R. and Barto, A., Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
● V. Mnih et al., "Human-level control through deep reinforcement learning." Nature, 518 (7540):529–533, 2015.
● H. van Hasselt et al., "Deep Reinforcement Learning with Double Q-learning." arXiv preprint arXiv:1509.06461, 2015.
● T. Schaul et al., "Prioritized Experience Replay." arXiv preprint arXiv:1511.05952, 2015.
● Z. Wang et al., "Dueling Network Architectures for Deep Reinforcement Learning." arXiv preprint arXiv:1511.06581, 2015.
● M. Fortunato et al., "Noisy Networks for Exploration." arXiv preprint arXiv:1706.10295, 2017.
● M. G. Bellemare et al., "A Distributional Perspective on Reinforcement Learning." arXiv preprint arXiv:1707.06887, 2017.
● M. Hessel et al., "Rainbow: Combining Improvements in Deep Reinforcement Learning." arXiv preprint arXiv:1710.02298, 2017.
● 이웅원, “가깝고도 먼 DeepRL”, https://tykimos.github.io/warehouse/2018-2-7-ISS_Near_and_Far_DeepRL_4.pdf
● David Silver, “Lecture 6 in UCL Course on RL” , http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/FA.pdf
● RLKorea 민규식, “A Distributional Perspective on Reinforcement Learning”, https://reinforcement-learning-kr.github.io/2018/10/02/C51/
