3. Taxonomy of Generative Models
Generative models can be organized as models trained via Maximum Likelihood; various
strategies exist depending on how the likelihood is handled (whether it is approximated
or represented exactly, etc.)
4. Taxonomy of Generative Models
Explicit density: define the density (= prior distribution, model)
(+) Relatively tractable, and the model's behavior is somewhat predictable
(-) Limited in that it cannot produce results beyond what we put into the model
Implicit density: sample without defining a density
5. Taxonomy of Generative Models
Generate samples from the distribution the generator defines
(unlike a Markov chain, samples are generated without an input)
If samples x′ are drawn repeatedly, x′ eventually converges to a sample from pmodel(x)
(+) Decent performance when the variance between samples is not high
(-) Performance drops in high dimensions, and computation is slow
6. Taxonomy of Generative Models
Tractable density: at training time, the density can be computed mathematically (via calculus)
Neural Autoregressive →
: a model that predicts its current value from its own previous values
7. Taxonomy of Generative Models
• Encoder: from a data sample x, produce a latent code z
• Decoder: from a latent code z, reconstruct a sample x̂, which should be close to the data x used to obtain the latent code
maximize E_{z~q_φ(z|x)}[ log p_θ(x|z) ] − D_KL( q_φ(z|x) ‖ p(z) )
← The VAE writes the likelihood as an integral of the joint distribution over z; since this
integral cannot be computed 'directly', it is 'estimated' via variational inference
(by maximizing the lower bound above)
8. (1) Pixel RNN
• The key idea of an autoregressive model is to fix a dependency ordering over the data!
• One effective approach to tractably model a joint distribution of the pixels in the
image is to cast it as a product of conditional distributions.
→ Proceed over the pixels in raster-scan order 1 … n²: p(x) = ∏_{i=1}^{n²} p(x_i | x_1, …, x_{i−1})
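Sampling follows the same ordering: each pixel is drawn conditioned on all previously generated ones. A minimal sketch for a single-channel image, where the 256-way-logits interface model(x) is a hypothetical stand-in:

```python
import torch

@torch.no_grad()
def sample(model, n, device="cpu"):
    # model(x) is assumed to return per-pixel logits over the 256
    # intensity values, shape (1, 256, n, n) -- a hypothetical interface.
    x = torch.zeros(1, 1, n, n, device=device)
    for i in range(n):          # rows, top to bottom
        for j in range(n):      # columns, left to right
            logits = model(x)
            probs = logits[0, :, i, j].softmax(dim=0)  # p(x_ij | x_<ij)
            x[0, 0, i, j] = torch.multinomial(probs, 1).float() / 255.0
    return x
```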
Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).
9. (1) Pixel RNN
Architecture
Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).
10. (1) Pixel RNN
• Within each pixel, proceed in R, G, B order
MASK
: In the first layer (mask A), each of the RGB channels is connected to the previous
channels and to the context, but is not connected to itself.
: In subsequent layers (mask B), the channels are also connected to themselves.
Multiple residual blocks (the number varies by model)
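The masking can be implemented by zeroing kernel weights before each convolution. A minimal single-channel sketch (the per-RGB-channel connectivity described above is omitted for brevity):

```python
import torch
import torch.nn as nn

class MaskedConv2d(nn.Conv2d):
    # Mask 'A' (first layer) also hides the center position, so a pixel
    # never sees itself; mask 'B' (later layers) keeps the center.
    def __init__(self, mask_type, *args, **kwargs):
        super().__init__(*args, **kwargs)
        assert mask_type in ("A", "B")
        mask = torch.ones_like(self.weight)
        _, _, kh, kw = self.weight.shape
        mask[:, :, kh // 2, kw // 2 + (mask_type == "B"):] = 0  # center row, from (or past) center
        mask[:, :, kh // 2 + 1:] = 0                            # all rows below
        self.register_buffer("mask", mask)

    def forward(self, x):
        self.weight.data *= self.mask  # re-apply the mask before every forward pass
        return super().forward(x)
```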
Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).
12. (1) Pixel RNN
(Figure: input and hidden-state maps with the input-to-state and state-to-state convolutions;
the Diagonal BiLSTM uses a 2×1 convolution for the state-to-state component.)
• Since a convolution along the diagonals is hard to apply directly, skew the feature maps
→ the computation can then be parallelized (one column per step)
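The skew operation offsets each row by its row index, so every diagonal of the original map becomes a column of the skewed map. A small sketch of skew and its inverse (function names are mine):

```python
import torch

def skew(x):
    # Offset row i of a (B, C, H, W) map by i columns -> (B, C, H, 2W-1).
    # Each diagonal of the input now lies in a single column, so a
    # column-wise sweep (e.g. with a 2x1 conv) handles one diagonal
    # per step, in parallel over the whole column.
    b, c, h, w = x.shape
    out = x.new_zeros(b, c, h, 2 * w - 1)
    for i in range(h):
        out[:, :, i, i:i + w] = x[:, :, i]
    return out

def unskew(x, w):
    # Inverse of skew: recover the original (B, C, H, W) map.
    h = x.shape[2]
    return torch.stack([x[:, :, i, i:i + w] for i in range(h)], dim=2)
```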
https://www.slideshare.net/thinkingfactory/pr12-pixelrnn-jaejun-yoo?from_action=save
15. Experiments
• Discrete softmax distribution: each pixel value is modeled as a 256-way discrete (multinomial) distribution
Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).
16. Experiments
• Negative log-likelihood (NLL)
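For reference, results on color images are usually reported in bits per dimension, i.e. the total NLL in base 2 divided by the number of sub-pixels (a standard convention, stated here for context):

```latex
\text{bits/dim} \;=\; \frac{-\log_2 p(\mathbf{x})}{C \cdot H \cdot W}
\qquad \text{e.g. } C = 3,\ H = W = 32 \text{ for CIFAR-10}
```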
Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).
17. Experiments
Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).
18. Experiments
Oord, Aaron van den, Nal Kalchbrenner, and Koray Kavukcuoglu. "Pixel recurrent neural networks." arXiv preprint arXiv:1601.06759 (2016).
19. (3) Gated Pixel CNN
• Improving PixelCNN performance
1) ReLU → gated activation unit → Conditional PixelCNN
<A single layer in the Gated PixelCNN architecture>
Conditioned gated activation: y = tanh(W_{k,f} ∗ x + V_{k,f}^T h) ⊙ σ(W_{k,g} ∗ x + V_{k,g}^T h)
(when the condition is a spatial map s, V_{k,g} ∗ s is an unmasked 1 × 1 convolution, h = s)
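A minimal sketch of the gate itself; the masked convolutions producing the pre-activation x, and the choice of conditioning input h, are assumed upstream:

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    # Splits a 2*channels pre-activation into the tanh ("feature") and
    # sigmoid ("gate") halves and combines them, optionally adding a
    # projection of a conditioning vector h (e.g. a class embedding).
    def __init__(self, channels, cond_dim=None):
        super().__init__()
        if cond_dim is not None:
            self.v_f = nn.Linear(cond_dim, channels, bias=False)  # V_{k,f}
            self.v_g = nn.Linear(cond_dim, channels, bias=False)  # V_{k,g}

    def forward(self, x, h=None):
        f, g = x.chunk(2, dim=1)  # x carries both W_{k,f}*input and W_{k,g}*input
        if h is not None:
            f = f + self.v_f(h)[:, :, None, None]  # broadcast over H, W
            g = g + self.v_g(h)[:, :, None, None]
        return torch.tanh(f) * torch.sigmoid(g)
```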
Van den Oord, Aaron, et al. "Conditional image generation with pixelcnn decoders." Advances in neural information processing systems. 2016.
20. (3) Gated Pixel CNN
2) Stacks: removing the blind spot
(Figure: receptive fields of PixelCNN vs. Gated PixelCNN around the current pixel; stacked masked
convolutions in the plain PixelCNN leave a blind spot above and to the right of the current pixel.)
1. Horizontal stack: conditions only on the current row; it takes as input the output of the previous layer as
well as that of the vertical stack.
2. Vertical stack: conditions on all the rows above the current pixel. It doesn't need any masking; its output
is fed into the horizontal stack, and its receptive field grows in a rectangular fashion. (A combined sketch
of the two stacks follows below.)
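A simplified sketch of one two-stack layer: gating, residual connections, and the extra left-shift a mask-'A' first layer would need are omitted; causality is enforced by cropping after padded convolutions:

```python
import torch.nn as nn
import torch.nn.functional as F

class TwoStackLayer(nn.Module):
    # Vertical stack: sees the rows above (via cropping + a one-row
    # down-shift). Horizontal stack: sees the current pixel and those to
    # its left (mask-'B' behavior), plus the shifted vertical output.
    def __init__(self, ch, k=3):
        super().__init__()
        self.v_conv = nn.Conv2d(ch, ch, (k // 2 + 1, k), padding=(k // 2, k // 2))
        self.h_conv = nn.Conv2d(ch, ch, (1, k // 2 + 1), padding=(0, k // 2))

    def forward(self, v_in, h_in):
        H, W = v_in.shape[-2:]
        v = self.v_conv(v_in)[..., :H, :W]            # keep only valid rows
        v_shift = F.pad(v, (0, 0, 1, 0))[..., :H, :]  # shift down one row
        h = self.h_conv(h_in)[..., :, :W] + v_shift   # left context + rows above
        return v, h                                   # feed both stacks onward
```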
https://towardsdatascience.com/auto-regressive-generative-models-pixelrnn-pixelcnn-32d192911173
21. (4) Pixel CNN++
1) Discretized logistic mixture likelihood
The softmax layer used to compute the conditional distribution of a pixel, while flexible, is very costly in terms of
memory. It also makes gradients sparse early on during training.
→ To counter this, assume a latent color intensity with a continuous distribution, akin to the one used in variational
autoencoders, which is rounded to its nearest 8-bit representation to give the pixel value. The intensity follows a
mixture of logistic distributions, so the probabilities of the discrete pixel values can be computed easily.
→ This method is memory efficient, and the output is of lower dimension, which provides denser gradients, thus
solving both problems.
Salimans, Tim, et al. "Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications." arXiv preprint arXiv:1701.05517 (2017).
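Concretely, the paper models the latent intensity ν as a K-component logistic mixture and integrates it over each rounding bin (σ is the logistic sigmoid; at the edges, x − 0.5 is replaced by −∞ for x = 0 and x + 0.5 by +∞ for x = 255):

```latex
\nu \sim \sum_{i=1}^{K} \pi_i \,\mathrm{logistic}(\mu_i, s_i), \qquad
P(x \mid \pi, \mu, s) = \sum_{i=1}^{K} \pi_i
\left[ \sigma\!\left(\frac{x + 0.5 - \mu_i}{s_i}\right)
     - \sigma\!\left(\frac{x - 0.5 - \mu_i}{s_i}\right) \right]
```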
22. (4) Pixel CNN++
2) Other Modification
• Conditioning on whole pixels: PixelCNN factorizes the model over the three sub-pixels of each pixel according to
color (RGB), which complicates the model. The dependency between the color channels of a pixel is relatively
simple and doesn't require a deep model to capture.
→ Therefore, it is better to condition on whole pixels instead of separate colors, and then output a joint distribution
over all 3 channels of the predicted pixel.
• Downsampling: PixelCNN cannot capture long-range dependencies; this is one of its disadvantages and a reason it
cannot match the performance of PixelRNN. To overcome this, the layers are downsampled using convolutions of
stride 2. Downsampling reduces the input size and thus enlarges the relative size of the receptive field. It causes
some loss of information, but this can be compensated for by adding extra short-cut connections (a sketch follows
below).
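A minimal sketch of the resolution changes; channel counts are illustrative, and in the actual model these convolutions are also shifted/masked to preserve causality, which is omitted here:

```python
import torch.nn as nn

# A stride-2 convolution halves the spatial resolution, doubling the
# effective receptive field of every subsequent layer; a transposed
# convolution restores the resolution on the way back up.
down = nn.Conv2d(160, 160, kernel_size=3, stride=2, padding=1)
up = nn.ConvTranspose2d(160, 160, kernel_size=3, stride=2,
                        padding=1, output_padding=1)  # e.g. 16x16 -> 32x32
```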
https://towardsdatascience.com/auto-regressive-generative-models-pixelrnn-pixelcnn-32d192911173
23. (4) Pixel CNN++
2) Other Modification
• Short-cut connections: this mirrors the encoder-decoder structure of U-Net. Layers 2 and 3 are downsampled, and
layers 5 and 6 are then upsampled, with residual connections from the encoder to the decoder to carry over the
localized information lost by downsampling.
• Dropout: since PixelCNN and PixelCNN++ are both very powerful models, they are likely to overfit the data if not
regularized. So, dropout is applied on the residual path after the first convolution (see the sketch below).
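A simplified residual block showing where the dropout sits; plain convolutions stand in for the model's gated, shifted ones:

```python
import torch.nn as nn

class ResBlockWithDropout(nn.Module):
    # Dropout regularizes the residual path after the first convolution.
    def __init__(self, ch, p=0.5):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.drop = nn.Dropout2d(p)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        h = self.conv1(x).relu()
        h = self.drop(h)            # applied only on the residual branch
        return x + self.conv2(h)
```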
https://towardsdatascience.com/auto-regressive-generative-models-pixelrnn-pixelcnn-32d192911173
24. Experiments
Salimans, Tim, et al. "Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications." arXiv preprint arXiv:1701.05517 (2017).