Generative Adversarial Network and its Applications on Speech and Natural Language Processing, Part 1.
Speaker: Hung-yi Lee (Professor, National Taiwan University)
Date: July 2018
Generative adversarial network (GAN) is a new idea for training models, in which a generator and a discriminator compete against each other to improve the generation quality. Recently, GAN has shown amazing results in image generation, and a large number and a wide variety of new ideas, techniques, and applications have been developed based on it. Although there are only a few successful cases so far, GAN has great potential to be applied to text and speech generation to overcome limitations of conventional methods.
In the first part of the talk, I will give an introduction to GAN and provide a thorough review of this technology. In the second part, I will focus on the applications of GAN to speech and natural language processing. I will demonstrate the applications of GAN to voice conversion, unsupervised abstractive summarization, and sentiment-controllable chat-bots. I will also talk about research directions toward unsupervised speech recognition by GAN.
5. Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, "Variational Approaches for Auto-Encoding Generative Adversarial Networks", arXiv, 2017
All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo
GAN
ACGAN
BGAN
DCGAN
EBGAN
fGAN
GoGAN
CGAN
……
6. ICASSP: keyword search on the session index page, so session names are included. Bar chart: number of papers whose titles include the keywords "generative", "adversarial", or "reinforcement", 2012–2018. GAN becomes a very important technology.
7. Outline
Part I: General Introduction of Generative Adversarial Network (GAN)
Part II: Applications to Speech and Text Processing
Part III: Recent Progress at National Taiwan University
8. Part I: General Introduction
Generative Adversarial Network and its Applications to Speech Processing and Natural Language Processing
9. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
10. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
12. Basic Idea of GAN
Generator: it is a neural network (NN), or a function. It takes a (high-dimensional) vector as input and outputs an image. Each dimension of the input vector represents some characteristic of the output, e.g., longer hair, blue hair, open mouth.
Powered by: http://mattya.github.io/chainer-DCGAN/
13. Basic Idea of GAN
Discriminator: it is also a neural network (NN), or a function. It takes an image as input and outputs a scalar. A larger value means the image looks real (e.g., 1.0); a smaller value means it looks fake (e.g., 0.1).
14. Algorithm
• Initialize generator G and discriminator D
• In each training iteration:
Step 1: Fix generator G, and update discriminator D. The discriminator learns to assign high scores (1) to real objects randomly sampled from the database and low scores (0) to objects generated from random input vectors.
15. Algorithm
• Initialize generator and discriminator
• In each training iteration:
Step 2: Fix discriminator D, and update generator G. The generator learns to "fool" the discriminator: treating the generator and the (fixed) discriminator as one large network, update the generator's hidden layers by gradient ascent so that the discriminator's output (e.g., 0.13) becomes large.
16. Algorithm
• Initialize generator and discriminator
• In each training iteration: learning D (sample some real objects from the database, labeled 1; generate some fake objects with G from random vectors, labeled 0; update D with G fixed), then learning G (update G to make D's output large, with D fixed).
17. Basic Idea of GAN
The generator and discriminator improve together over versions: generator v1 against discriminator v1 (trained on real images), then v2 against v2, then v3 against v3, and so on. This is where the term "adversarial" comes from.
27. (Variational) Auto-encoder
An NN encoder maps an image to a code, and an NN decoder reconstructs the image from the code, trained so input and reconstruction are as close as possible. At generation time, randomly generate a vector as the code and feed it to the decoder: the decoder plays the role of the generator. Image quality?
28. Auto-encoder v.s. GAN
An auto-encoder decoder is trained to make each output as close as possible to a target image, so it tends to produce fuzzy images (it could even just copy an image). A GAN generator is instead trained against a discriminator; if the discriminator does not simply memorize the images, the generator learns the patterns of faces.
29. FID [Martin Heusel, et al., NIPS, 2017]: smaller is better. [Mario Lucic, et al., arXiv, 2017]
30. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
31. Generation
• We want to find the data distribution $P_{data}(x)$, where $x$ is an image (a high-dimensional vector). In the image space, only a small region has high probability; the rest has low probability.
32. Maximum Likelihood Estimation
• Given a data distribution $P_{data}(x)$ (we can sample from it)
• We have a distribution $P_G(x;\theta)$ parameterized by $\theta$
• We want to find $\theta$ such that $P_G(x;\theta)$ is close to $P_{data}(x)$
• E.g. $P_G(x;\theta)$ is a Gaussian mixture model, and $\theta$ collects the means and variances of the Gaussians
Sample $x^1, x^2, \dots, x^m$ from $P_{data}(x)$. We can compute $P_G(x^i;\theta)$, so the likelihood of generating the samples is
$$L = \prod_{i=1}^{m} P_G(x^i;\theta)$$
Find $\theta^*$ maximizing the likelihood.
33. Maximum Likelihood Estimation = Minimize KL Divergence
$$\theta^* = \arg\max_\theta \prod_{i=1}^m P_G(x^i;\theta) = \arg\max_\theta \log \prod_{i=1}^m P_G(x^i;\theta) = \arg\max_\theta \sum_{i=1}^m \log P_G(x^i;\theta)$$
$$\approx \arg\max_\theta E_{x\sim P_{data}}[\log P_G(x;\theta)] \quad (x^1,\dots,x^m \text{ sampled from } P_{data}(x))$$
$$= \arg\max_\theta \left[ \int_x P_{data}(x)\log P_G(x;\theta)\,dx - \int_x P_{data}(x)\log P_{data}(x)\,dx \right]$$
$$= \arg\min_\theta KL(P_{data}\,\|\,P_G)$$
How to define a general $P_G$?
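To make the MLE-equals-minimizing-KL connection concrete, here is a minimal numerical sketch (my own illustration, not from the slides): for a unit-variance Gaussian model $P_G(x;\theta)$, the $\theta$ maximizing the average log-likelihood of samples from $P_{data}$ is the one closest to the data mean, i.e., the minimizer of $KL(P_{data}\|P_G)$.

```python
import math, random

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(10000)]  # samples from P_data = N(2, 1)

def avg_log_likelihood(theta, xs, sigma=1.0):
    # average log P_G(x; theta) for a unit-variance Gaussian model
    return sum(-0.5*math.log(2*math.pi*sigma**2) - (x-theta)**2/(2*sigma**2)
               for x in xs) / len(xs)

# Scan candidate means; the maximizer lands near the true mean 2.0,
# i.e. at the theta that minimizes KL(P_data || P_G).
candidates = [i/10 for i in range(0, 41)]
best = max(candidates, key=lambda t: avg_log_likelihood(t, data))
print(best)
```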
34. Generator
• A generator G is a network. The network defines a probability distribution $P_G$: sample $z$ from a normal distribution and output $x = G(z)$, where $x$ is an image (a high-dimensional vector).
We want $P_G(x)$ and $P_{data}(x)$ to be as close as possible:
$$G^* = \arg\min_G Div(P_G, P_{data})$$
where $Div(P_G, P_{data})$ is a divergence between the distributions $P_G$ and $P_{data}$. How to compute the divergence?
35. Discriminator
$$G^* = \arg\min_G Div(P_G, P_{data})$$
Although we do not know the distributions $P_G$ and $P_{data}$, we can sample from them: sampling from $P_{data}$ means drawing from the database; sampling from $P_G$ means feeding vectors sampled from the normal distribution into G.
36. Discriminator
$$G^* = \arg\min_G Div(P_G, P_{data})$$
Train a discriminator D (with sigmoid output) on data sampled from $P_{data}$ and data sampled from $P_G$. Example objective function for D (G is fixed):
$$V(G,D) = E_{x\sim P_{data}}[\log D(x)] + E_{x\sim P_G}[\log(1-D(x))]$$
Training: $D^* = \arg\max_D V(D,G)$. Using this example objective function is exactly the same as training a binary classifier. The maximum objective value is related to JS divergence. [Goodfellow, et al., NIPS, 2014]
37. Given G, what is the optimal $D^*$ maximizing $V(G,D)$?
$$V = E_{x\sim P_{data}}[\log D(x)] + E_{x\sim P_G}[\log(1-D(x))]$$
$$= \int_x P_{data}(x)\log D(x)\,dx + \int_x P_G(x)\log(1-D(x))\,dx$$
$$= \int_x \left[ P_{data}(x)\log D(x) + P_G(x)\log(1-D(x)) \right] dx$$
• Given $x$, find the optimal $D^*$ maximizing $P_{data}(x)\log D(x) + P_G(x)\log(1-D(x))$, assuming $D(x)$ can be any function. This gives $D^*(x) = P_{data}(x)/(P_{data}(x)+P_G(x))$.
41. Discriminator
$$G^* = \arg\min_G Div(P_G, P_{data}), \quad D^* = \arg\max_D V(D,G)$$
When the divergence is small, the data sampled from $P_{data}$ and from $P_G$ are hard to discriminate, and the objective cannot be made large; when the divergence is large, they are easy to discriminate.
42. The maximum objective value is related to JS divergence: for candidate generators $G_1, G_2, G_3$, compare $\max_D V(G_1,D)$, $\max_D V(G_2,D)$, $\max_D V(G_3,D)$; for example, $V(G_1, D_1^*)$ measures the divergence between $P_{G_1}$ and $P_{data}$. Then pick $G^* = \arg\min_G Div(P_G, P_{data})$.
43. Putting it together:
$$G^* = \arg\min_G \max_D V(G,D)$$
where $\max_D V(G,D)$ is related to JS divergence. [Goodfellow, et al., NIPS, 2014]
• Initialize generator and discriminator
• In each training iteration:
Step 1: Fix generator G, and update discriminator D: $D^* = \arg\max_D V(D,G)$
Step 2: Fix discriminator D, and update generator G
44. Algorithm
$$G^* = \arg\min_G L(G), \quad L(G) = \max_D V(G,D)$$
• To find the best G minimizing the loss function $L(G)$: $\theta_G \leftarrow \theta_G - \eta\,\partial L(G)/\partial\theta_G$, where $\theta_G$ defines G.
How to differentiate a max? If $f(x) = \max\{f_1(x), f_2(x), f_3(x)\}$, then $df(x)/dx = df_i(x)/dx$, where $f_i(x)$ is the max one at that $x$.
45. Algorithm
• Given $G_0$
• Find $D_0^*$ maximizing $V(G_0,D)$ (using gradient ascent); $V(G_0,D_0^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_0}(x)$
• $\theta_G \leftarrow \theta_G - \eta\,\partial V(G,D_0^*)/\partial\theta_G$, obtaining $G_1$ — decrease JS divergence(?)
• Find $D_1^*$ maximizing $V(G_1,D)$; $V(G_1,D_1^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_1}(x)$
• $\theta_G \leftarrow \theta_G - \eta\,\partial V(G,D_1^*)/\partial\theta_G$, obtaining $G_2$ — decrease JS divergence(?)
• ……
46. Algorithm
Why "decrease JS divergence(?)": updating $G_0$ to $G_1$ makes $V(G_1, D_0^*)$ smaller than $V(G_0, D_0^*)$, but the maximizer may move: the actual divergence for $G_1$ is $V(G_1, D_1^*)$, which could be larger. The step only decreases JS divergence under the assumption $D_0^* \approx D_1^*$ — hence: don't update G too much.
47. In practice …
• Given G, how to compute $\max_D V(G,D)$?
• Sample $x^1, x^2, \dots, x^m$ from $P_{data}(x)$ (positive examples), and sample $\tilde{x}^1, \tilde{x}^2, \dots, \tilde{x}^m$ from the generator $P_G(x)$ (negative examples). Maximize
$$\tilde{V} = \frac{1}{m}\sum_{i=1}^m \log D(x^i) + \frac{1}{m}\sum_{i=1}^m \log\left(1 - D(\tilde{x}^i)\right)$$
D is a binary classifier with sigmoid output (it can be deep). Maximizing $\tilde{V}$ is exactly the same as minimizing the cross-entropy when training this binary classifier.
48. Algorithm
Initialize $\theta_d$ for D and $\theta_g$ for G.
• In each training iteration:
Learning D (repeat k times):
• Sample m examples $x^1, x^2, \dots, x^m$ from the data distribution $P_{data}(x)$
• Sample m noise samples $z^1, z^2, \dots, z^m$ from the prior $P_{prior}(z)$
• Obtain generated data $\tilde{x}^1, \tilde{x}^2, \dots, \tilde{x}^m$, $\tilde{x}^i = G(z^i)$
• Update discriminator parameters $\theta_d$ to maximize
$$\tilde{V} = \frac{1}{m}\sum_{i=1}^m \log D(x^i) + \frac{1}{m}\sum_{i=1}^m \log\left(1-D(\tilde{x}^i)\right), \quad \theta_d \leftarrow \theta_d + \eta\nabla\tilde{V}(\theta_d)$$
(with finitely many D updates we can only find a lower bound of $\max_D V(G,D)$)
Learning G (only once):
• Sample another m noise samples $z^1, z^2, \dots, z^m$ from the prior $P_{prior}(z)$
• Update generator parameters $\theta_g$ to minimize
$$\tilde{V} = \frac{1}{m}\sum_{i=1}^m \log D(x^i) + \frac{1}{m}\sum_{i=1}^m \log\left(1-D(G(z^i))\right), \quad \theta_g \leftarrow \theta_g - \eta\nabla\tilde{V}(\theta_g)$$
(the first term does not depend on $\theta_g$)
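The algorithm above can be sketched end to end on 1-D data. This is an illustrative toy of my own (logistic discriminator $D(x)=\sigma(wx+c)$, generator $x = az + b$ so that $P_G$ is Gaussian), not the lecture's implementation; for the generator step it uses the non-saturating form, maximizing $\log D(G(z))$.

```python
import math, random

random.seed(0)
sigmoid = lambda u: 1/(1 + math.exp(-max(-30.0, min(30.0, u))))

# P_data = N(3, 0.5). Generator: x = a*z + b with z ~ N(0,1), so P_G = N(b, a^2).
a, b = 1.0, 0.0          # theta_g
w, c = 0.0, 0.0          # theta_d
lr, m = 0.05, 32

for it in range(2000):
    # Learning D (k = 1): ascend mean log D(x_real) + mean log(1 - D(x_fake))
    reals = [random.gauss(3.0, 0.5) for _ in range(m)]
    fakes = [a*random.gauss(0, 1) + b for _ in range(m)]
    gw = gc = 0.0
    for x in reals:                       # d/du log sigmoid(u) = 1 - D(x)
        e = 1 - sigmoid(w*x + c); gw += e*x/m; gc += e/m
    for x in fakes:                       # d/du log(1 - sigmoid(u)) = -D(x)
        e = -sigmoid(w*x + c); gw += e*x/m; gc += e/m
    w += lr*gw; c += lr*gc
    # Learning G (only once): ascend mean log D(G(z)) (non-saturating form)
    zs = [random.gauss(0, 1) for _ in range(m)]
    ga = gb = 0.0
    for z in zs:
        x = a*z + b
        e = (1 - sigmoid(w*x + c))*w      # chain rule through x = a*z + b
        ga += e*z/m; gb += e/m
    a += lr*ga; b += lr*gb

print(round(b, 2))  # the generator mean drifts toward the data mean 3.0
```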
49. Objective Function for Generator in Real Implementation
Minimax GAN (MMGAN): $V = E_{x\sim P_G}[\log(1-D(x))]$ — slow at the beginning, because $\log(1-D(x))$ is flat when $D(x)$ is small.
Non-saturating GAN (NSGAN): $V = E_{x\sim P_G}[-\log D(x)]$ — the real implementation: label $x$ from $P_G$ as positive. (The term $E_{x\sim P_{data}}[\log D(x)]$ does not depend on the generator.)
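Why MMGAN is slow at the beginning can be checked directly: early in training D confidently rejects fakes ($D(x)\approx 0$), and the gradient of $\log(1-D(x))$ with respect to the discriminator logit vanishes while that of $-\log D(x)$ does not. A small sketch of my own (not from the slides):

```python
import math

def sigmoid(u):
    return 1/(1 + math.exp(-u))

# Gradients of the two generator losses w.r.t. the logit u, where D(x) = sigmoid(u):
#   MMGAN: minimize  log(1 - D(x))  ->  d/du = -D(x)       (tiny when D(x) ~ 0)
#   NSGAN: minimize -log D(x)       ->  d/du = -(1 - D(x)) (large when D(x) ~ 0)
u = -4.0                 # early training: D confidently rejects the fake, D(x) ~ 0.018
d = sigmoid(u)
grad_mm = -d
grad_ns = -(1 - d)
print(abs(grad_mm), abs(grad_ns))
```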
50. Using the divergence you like — can we use other divergences? [Sebastian Nowozin, et al., NIPS, 2016]
51. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
More tips and tricks:
https://github.com/soumith/ganhacks
52. JS divergence is not suitable
• In most cases, $P_G$ and $P_{data}$ do not overlap.
• 1. The nature of the data: both $P_{data}$ and $P_G$ are low-dimensional manifolds in a high-dimensional space, so the overlap can be ignored.
• 2. Sampling: even if $P_{data}$ and $P_G$ do overlap, with limited sampling the samples look disjoint.
53. What is the problem of JS divergence?
JS divergence is $\log 2$ whenever two distributions do not overlap:
$$JS(P_{G_0}, P_{data}) = \log 2, \quad JS(P_{G_1}, P_{data}) = \log 2, \quad \dots, \quad JS(P_{G_{100}}, P_{data}) = 0$$
Even though $P_{G_1}$ is closer to $P_{data}$ than $P_{G_0}$, they are equally bad: the same objective value is obtained, so the same divergence. Intuition: if two distributions do not overlap, a binary classifier achieves 100% accuracy.
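The saturation at $\log 2$ is easy to verify numerically for discrete distributions (illustrative code of my own, not from the lecture):

```python
import math

def kl(p, q):
    return sum(pi*math.log(pi/qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(pi + qi)/2 for pi, qi in zip(p, q)]
    return 0.5*kl(p, m) + 0.5*kl(q, m)

# Two distributions with disjoint support over 4 points:
p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]
print(js(p, q), math.log(2))   # JS saturates at log 2 regardless of how far apart the supports are
```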
54. Wasserstein GAN (WGAN): Earth Mover's Distance
• Considering one distribution P as a pile of earth, and another distribution Q as the target
• The earth mover's distance is the average distance the earth mover has to move the earth. For two point masses separated by distance d: $W(P,Q) = d$.
55. Wasserstein distance
Source of image: https://vincentherrmann.github.io/blog/wasserstein/
There are many possible "moving plans" from P to Q, some with smaller and some with larger average distance. The Wasserstein distance is defined using the "moving plan" with the smallest average distance.
56. WGAN: Earth Mover's Distance
Source of image: https://vincentherrmann.github.io/blog/wasserstein/
Using the "moving plan" with the smallest average distance to define the earth mover's distance. There are many possible "moving plans"; the figure shows the best "moving plan" of this example.
57. A "moving plan" is a matrix
The value of each element is the amount of earth moved from one position $x_p$ in P to another position $x_q$ in Q. The average distance of a plan $\gamma$:
$$B(\gamma) = \sum_{x_p, x_q} \gamma(x_p, x_q)\,\|x_p - x_q\|$$
Earth mover's distance: the best plan over all possible plans $\Pi$,
$$W(P,Q) = \min_{\gamma\in\Pi} B(\gamma)$$
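For distributions on a line, the best moving plan has a closed form: $W(P,Q)$ equals the area between the two CDFs. A small sketch of my own showing that, unlike JS, the distance keeps shrinking as the distributions get closer:

```python
# 1-D earth mover's distance: W(P, Q) = sum over the grid of |CDF_P - CDF_Q| * bin width.
def w1(p, q, positions):
    cdf_p = cdf_q = 0.0
    total = 0.0
    for i in range(len(positions) - 1):
        cdf_p += p[i]; cdf_q += q[i]
        total += abs(cdf_p - cdf_q) * (positions[i+1] - positions[i])
    return total

positions = [0, 1, 2, 3, 4, 5]
p      = [1.0, 0, 0, 0, 0, 0]   # all earth at position 0
q_far  = [0, 0, 0, 0, 1.0, 0]   # target at position 4
q_near = [0, 0, 1.0, 0, 0, 0]   # target at position 2
print(w1(p, q_far, positions), w1(p, q_near, positions))  # 4.0 and 2.0
```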
59. WGAN
Evaluate the Wasserstein distance between $P_{data}$ and $P_G$ [Martin Arjovsky, et al., arXiv, 2017]:
$$V(G,D) = \max_{D\in 1\text{-Lipschitz}} \left\{ E_{x\sim P_{data}}[D(x)] - E_{x\sim P_G}[D(x)] \right\}$$
D has to be smooth enough. Without the constraint, D would push the scores of real data toward $+\infty$ and of generated data toward $-\infty$, and the training of D would not converge; keeping D smooth prevents $D(x)$ from becoming $+\infty$ and $-\infty$.
60. WGAN
How to fulfill the constraint that D is 1-Lipschitz? A function f is a Lipschitz function if
$$\|f(x_1) - f(x_2)\| \le K\,\|x_1 - x_2\|$$
(output change bounded by K times input change; K = 1 for "1-Lipschitz"), i.e., it does not change fast.
Weight clipping: force every parameter w to lie between −c and c. After each parameter update, if w > c, set w = c; if w < −c, set w = −c.
61. Improved WGAN (WGAN-GP)
A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere: $\|\nabla_x D(x)\| \le 1$ for all $x$. So instead of $D\in 1\text{-Lipschitz}$, prefer $\|\nabla_x D(x)\|\le 1$ for all $x$, approximated by a penalty:
$$V(G,D) \approx \max_D \left\{ E_{x\sim P_{data}}[D(x)] - E_{x\sim P_G}[D(x)] - \lambda \int_x \max\left(0, \|\nabla_x D(x)\| - 1\right) dx \right\}$$
$$\approx \max_D \left\{ E_{x\sim P_{data}}[D(x)] - E_{x\sim P_G}[D(x)] - \lambda E_{x\sim P_{penalty}}\left[\max\left(0, \|\nabla_x D(x)\| - 1\right)\right] \right\}$$
i.e., prefer $\|\nabla_x D(x)\|\le 1$ only for $x$ sampled from $P_{penalty}$.
62. Improved WGAN (WGAN-GP)
$P_{penalty}$: only give the gradient constraint to the region between $P_{data}$ and $P_G$, because that region influences how $P_G$ moves toward $P_{data}$. "Given that enforcing the Lipschitz constraint everywhere is intractable, enforcing it only along these straight lines seems sufficient and experimentally results in good performance."
63. Improved WGAN (WGAN-GP)
In practice the penalty is $\left(\|\nabla_x D(x)\| - 1\right)^2$, which pushes the gradient norm in the penalty region toward 1. "Simply penalizing overly large gradients also works in theory, but experimentally we found that this approach converged faster and to better optima."
More: Improved the improved WGAN
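The penalty term is easy to evaluate for a toy critic whose gradient is known in closed form. A linear $D(x) = w\cdot x$ is assumed here purely for illustration (a real critic is a deep network whose gradient comes from automatic differentiation), using the squared form $(\|\nabla_x D(x)\| - 1)^2$ above:

```python
import math

# WGAN-GP penalty for one point sampled between P_data and P_G:
# lambda * (||grad_x D(x)|| - 1)^2. For a linear critic D(x) = w . x,
# the gradient at every x is just w.
def gradient_penalty(w, lam=10.0):
    grad_norm = math.sqrt(sum(wi*wi for wi in w))
    return lam * (grad_norm - 1.0)**2

print(gradient_penalty([3.0, 4.0]))  # ||w|| = 5 -> 10 * (5 - 1)^2 = 160
print(gradient_penalty([0.6, 0.8]))  # ||w|| = 1 -> no penalty
```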
64. Spectral Normalization → keep the gradient norm smaller than 1 everywhere [Miyato, et al., ICLR, 2018]
65. Algorithm of Original GAN → WGAN
Same algorithm as the original GAN (initialize $\theta_d$ and $\theta_g$; in each iteration, learn D k times and G only once), with these changes:
• No sigmoid for the output of D
• Learning D: update $\theta_d$ to maximize
$$\tilde{V} = \frac{1}{m}\sum_{i=1}^m D(x^i) - \frac{1}{m}\sum_{i=1}^m D(\tilde{x}^i)$$
followed by weight clipping / gradient penalty …
• Learning G: update $\theta_g$ to minimize
$$\tilde{V} = -\frac{1}{m}\sum_{i=1}^m D(G(z^i))$$
66. Energy-based GAN (EBGAN)
• Using an autoencoder as discriminator D [Junbo Zhao, et al., arXiv, 2016]: the discriminator score is the negative reconstruction error of the autoencoder (e.g., reconstruction error 0.1 → score −0.1), so 0 is the score of the best images. The generator is the same.
Using the negative reconstruction error of the autoencoder to determine the goodness. Benefit: the autoencoder can be pre-trained on real images, without a generator.
67. EBGAN
Hard to reconstruct, easy to destroy: an autoencoder-based discriminator only gives a large value to a limited region. The scores of generated samples are bounded below by a margin −m: they do not have to be very negative, and 0 is for the best.
69. Mode Dropping
BEGAN on CelebA: comparing the generator at iteration t, t+1, and t+2 shows that the generator switches modes during training.
71. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
72. Likelihood
The generator (from a prior distribution on z) defines $P_G$. Given real data $x^i$ (not observed during training), the log likelihood is
$$L = \frac{1}{N}\sum_i \log P_G(x^i)$$
We cannot compute $P_G(x^i)$; we can only sample from $P_G$.
73. Likelihood — Kernel Density Estimation
• Estimate the distribution $P_G(x)$ from sampling: draw samples from the generator, and treat each sample as the mean of a Gaussian with the same covariance. Now we have an approximation of $P_G$, so we can compute $P_G(x^i)$ for each real data point $x^i$, and then compute the likelihood.
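A minimal 1-D sketch of this estimate (pure Python; the bandwidth is chosen arbitrarily for illustration):

```python
import math, random

random.seed(1)
samples = [random.gauss(0.0, 1.0) for _ in range(2000)]  # draws obtained via the generator

def kde_density(x, samples, h=0.3):
    # Each sample is the mean of a Gaussian with the same (band)width h.
    n = len(samples)
    return sum(math.exp(-(x - s)**2/(2*h*h))/(math.sqrt(2*math.pi)*h)
               for s in samples) / n

# The approximated P_G can now be evaluated at any real data point x^i:
print(kde_density(0.0, samples))   # near the true N(0,1) density ~0.4
print(kde_density(5.0, samples))   # far in the tail: tiny density
```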
74. Likelihood v.s. Quality
• Low likelihood, high quality? Consider a model generating good images with a small variance: the data for evaluation $x^i$ falls outside it and gets $P_G(x^i) = 0$.
• High likelihood, low quality? Suppose generator $G_1$ achieves likelihood $L = \frac{1}{N}\sum_i \log P_G(x^i)$. A generator $G_2$ that produces $G_1$'s output only 1% of the time (and noise with probability 0.99) has likelihood
$$-\log 100 + \frac{1}{N}\sum_i \log P_G(x^i)$$
only $\log 100 \approx 4.6$ lower, even though it is 100 times worse.
75. Objective Evaluation
Feed each generated image $x$ into an off-the-shelf image classifier (e.g., Inception net, VGG, etc.) to get $P(y|x)$, where $y$ is the class (the output of the CNN) [Tim Salimans, et al., NIPS, 2016].
• A concentrated distribution $P(y|x)$ for a single image means higher visual quality.
• A uniform marginal distribution over many images,
$$P(y) = \frac{1}{N}\sum_n P(y^n|x^n),$$
means higher variety.
76. Objective Evaluation — Inception Score
$$\text{Inception Score} = \sum_x \sum_y P(y|x)\log P(y|x) - \sum_y P(y)\log P(y)$$
The first term is the negative entropy of $P(y|x)$ (large when each image is confidently classified); the second is the entropy of $P(y)$ (large when the classes are diverse), with $P(y) = \frac{1}{N}\sum_n P(y^n|x^n)$.
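Both terms can be computed directly from the classifier outputs. A toy sketch of my own (real evaluations use the Inception network over many samples):

```python
import math

def inception_score_log(cond_probs):
    # cond_probs: one P(y|x) distribution per generated image x.
    n = len(cond_probs)
    k = len(cond_probs[0])
    p_y = [sum(p[c] for p in cond_probs)/n for c in range(k)]        # marginal P(y)
    neg_cond_entropy = sum(pc*math.log(pc)
                           for p in cond_probs for pc in p if pc > 0)/n
    marg_entropy = -sum(py*math.log(py) for py in p_y if py > 0)
    return neg_cond_entropy + marg_entropy

# Sharp and diverse: confident classifications spread over classes -> high score (log 3)
sharp_diverse = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Blurry: every image gives a uniform P(y|x) -> score 0
blurry = [[1/3, 1/3, 1/3]] * 3
print(inception_score_log(sharp_diverse), inception_score_log(blurry))
```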
77. We don't want a memory GAN.
• Using k-nearest neighbors to check whether the generator generates new objects: https://arxiv.org/pdf/1511.01844.pdf
78. Mode Dropping
Sanjeev Arora, Andrej Risteski, Yi Zhang, "Do GANs learn the distribution? Some Theory and Empirics", ICLR, 2018
Birthday-paradox style test: if, sampling 400 images, you find "identical" images with probability > 50%, the generator produces approximately 0.16M different faces; measured this way, DCGAN produces approximately 0.16M different faces and ALI approximately 1M.
79. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
80. Text-to-Image
• Traditional supervised approach: an NN maps text (e.g., "train") to an image, trained to make the output as close as possible to the target of the NN output. But one text matches many images (e.g., many pictures fit "a dog is running" or "a bird is flying"), so the NN learns their average: a blurry image!
81. Conditional GAN
G takes a condition c ("train") and z sampled from a normal distribution and outputs x = G(c,z). With the original discriminator D(x) → scalar ("x is a real image or not"; real images → 1, generated images → 0), the generator will learn to generate realistic images but completely ignore the input conditions. [Scott Reed, et al., ICML, 2016]
82. Conditional GAN
A better discriminator takes both c and x and outputs a scalar: x is realistic or not AND c and x are matched or not. Training pairs: true text-image pairs (train, train image) → 1; (train, generated image) → 0; mismatched pairs such as (cat, train image) → 0. [Scott Reed, et al., ICML, 2016]
83. Algorithm (Conditional GAN)
• In each training iteration:
Learning D:
• Sample m positive examples $(c^1,x^1), (c^2,x^2), \dots, (c^m,x^m)$ from the database
• Sample m noise samples $z^1, z^2, \dots, z^m$ from a distribution; obtain generated data $\tilde{x}^1, \dots, \tilde{x}^m$, $\tilde{x}^i = G(c^i, z^i)$
• Sample m objects $\hat{x}^1, \dots, \hat{x}^m$ from the database (mismatched with the conditions)
• Update discriminator parameters $\theta_d$ to maximize
$$\tilde{V} = \frac{1}{m}\sum_{i=1}^m \log D(c^i, x^i) + \frac{1}{m}\sum_{i=1}^m \log\left(1 - D(c^i, \tilde{x}^i)\right) + \frac{1}{m}\sum_{i=1}^m \log\left(1 - D(c^i, \hat{x}^i)\right)$$
$$\theta_d \leftarrow \theta_d + \eta\nabla\tilde{V}(\theta_d)$$
Learning G:
• Sample m noise samples $z^1, \dots, z^m$ from a distribution and m conditions $c^1, \dots, c^m$ from a database
• Update generator parameters $\theta_g$ to maximize
$$\tilde{V} = \frac{1}{m}\sum_{i=1}^m \log D\left(c^i, G(c^i, z^i)\right), \quad \theta_g \leftarrow \theta_g + \eta\nabla\tilde{V}(\theta_g)$$
84. Conditional GAN - Discriminator
Two architectures for D(c, x):
• (almost every paper) Feed condition c and object x through networks, combine them, and output one score: x is realistic or not AND c and x are matched or not. [Takeru Miyato, et al., ICLR, 2018] [Han Zhang, et al., arXiv, 2017] [Augustus Odena et al., ICML, 2017]
• Separate the two criteria: one network scores "x is realistic or not" from x alone, and another scores "c and x are matched or not".
85. Conditional GAN
Paired data: collecting anime faces and the descriptions of their characteristics (e.g., "blue eyes, red hair, short hair"). Generated examples conditioned on "red hair, green eyes" and "blue hair, red eyes".
Image generation by: 吳宗翰、謝濬丞、陳延昊、錢柏均
86. StackGAN
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas, "StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks", ICCV, 2017
88. Image-to-image
• Traditional supervised approach: an NN maps an input image to an output image, trained to be as close as possible to the target. At testing, the output is blurry because it is the average of several images.
91. Image super resolution
• Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi, "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network", CVPR, 2016
96. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
102. GAN+Autoencoder
• We have a generator (input z, output x)
• However, given x, how can we find z?
• Learn an encoder (input x, output z): fix the generator as the decoder, and train the encoder so that x → z → x is as close as possible to the input (an autoencoder). The encoder can be initialized from the discriminator (only modify the last layer).
108. Back to z
Find $z^* = \arg\min_z L(G(z), x^T)$, where $L$ measures the difference between $G(z)$ and the target $x^T$ (e.g., pixel-wise).
• Method 1: gradient descent on z
• Method 2: predict z by another network
• Method 3: use the result of method 2 as the initialization of method 1
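Method 1 can be sketched with a toy linear "generator" (my own illustration; a real G is a deep network, and the gradient of the loss with respect to z comes from backpropagation):

```python
# Find z* = argmin_z || G(z) - x_T ||^2 by gradient descent, for G(z) = A z.
A = [[2.0, 0.5], [0.0, 1.0]]

def G(z):
    return [A[0][0]*z[0] + A[0][1]*z[1], A[1][0]*z[0] + A[1][1]*z[1]]

z_true = [1.0, -2.0]
x_T = G(z_true)                      # the target image

z = [0.0, 0.0]                       # zero init (Method 2 would init from an encoder)
lr = 0.05
for _ in range(500):
    x = G(z)
    r = [x[0] - x_T[0], x[1] - x_T[1]]      # residual G(z) - x_T
    # gradient of the squared pixel-wise loss: 2 * A^T r
    g = [2*(A[0][0]*r[0] + A[1][0]*r[1]), 2*(A[0][1]*r[0] + A[1][1]*r[1])]
    z = [z[0] - lr*g[0], z[1] - lr*g[1]]

print(z)   # converges to z_true = [1.0, -2.0]
```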
109. Editing Photos
• $z_0$ is the code of the input image. Find
$$z^* = \arg\min_z U(G(z)) + \lambda_1\|z - z_0\|^2 - \lambda_2 D(G(z))$$
where $U(G(z))$ checks whether the image fulfills the constraint of editing, $\|z - z_0\|^2$ keeps the result not too far away from the original image, and the discriminator term checks that the image is realistic.
110. VAE-GAN
Combine a VAE (encoder → z → decoder, minimizing reconstruction error, with z close to normal) and a GAN (generator = decoder, whose output goes to a discriminator giving a scalar). The decoder both minimizes the reconstruction error and cheats the discriminator; the discriminator learns to discriminate real, generated, and reconstructed images. In effect, the discriminator provides the reconstruction loss.
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther, "Autoencoding beyond pixels using a learned similarity metric", ICML, 2016
111. Algorithm (VAE-GAN)
• Initialize En, De, Dis
• In each iteration:
• Sample M images $x^1, x^2, \dots, x^M$ from the database
• Generate M codes $\tilde{z}^i = En(x^i)$ and M reconstructed images $\tilde{x}^i = De(\tilde{z}^i)$
• Sample M codes $z^1, \dots, z^M$ from the prior P(z) and generate M images $\hat{x}^i = De(z^i)$
• Update En to decrease $\|\tilde{x}^i - x^i\|$ and decrease $KL(P(\tilde{z}^i|x^i)\,\|\,P(z))$
• Update De to decrease $\|\tilde{x}^i - x^i\|$ and increase $Dis(\tilde{x}^i)$ and $Dis(\hat{x}^i)$
• Update Dis to increase $Dis(x^i)$ and decrease $Dis(\tilde{x}^i)$ and $Dis(\hat{x}^i)$
Another kind of discriminator: classify x as real / generated / reconstructed.
113. Algorithm (BiGAN)
• Initialize encoder En, decoder De, discriminator Dis
• In each iteration:
• Sample M images $x^1, \dots, x^M$ from the database; generate M codes $\tilde{z}^i = En(x^i)$
• Sample M codes $z^1, \dots, z^M$ from the prior P(z); generate M images $\tilde{x}^i = De(z^i)$
• Update Dis to increase $Dis(x^i, \tilde{z}^i)$ and decrease $Dis(\tilde{x}^i, z^i)$
• Update En and De to decrease $Dis(x^i, \tilde{z}^i)$ and increase $Dis(\tilde{x}^i, z^i)$
114. The encoder produces pairs (real image x, generated code z); the decoder produces pairs (generated image x, code z from the prior distribution). The discriminator looks at an (image x, code z) pair and decides: is it from the encoder or the decoder? Encoder pairs follow $P(x,z)$ and decoder pairs follow $Q(x,z)$; the discriminator evaluates the difference between P and Q, and at the optimum P and Q would be the same. Optimal encoder and decoder: En(x') = z' and De(z') = x' for all x'; En(x'') = z'' and De(z'') = x'' for all z''.
115. BiGAN
How about directly training the two reconstruction cycles, x → En → z → De → x̃ and z → De → x → En → z̃? The optimality conditions are the same as BiGAN's (En(x') = z', De(z') = x' for all x'; De(z'') = x'', En(x'') = z'' for all z''), but since the optimum is not reached in practice, the two approaches can behave differently.
116. Domain Adversarial Training
• Training and testing data are in different domains. Take digit classification as an example: feed both through the same generator (feature extractor) so that the features of training data and testing data have the same distribution.
117. Domain Adversarial Training
A feature extractor (generator) produces features from an image; a discriminator (domain classifier) predicts which domain the image came from (blue points vs. red points). The feature extractor learns to make the domain classifier fail. Degenerate solution: always output zero vectors.
118. Domain Adversarial Training
Add a label predictor ("which digits?") on top of the feature extractor. The feature extractor must not only cheat the domain classifier ("which domain?") but also satisfy the label predictor at the same time. Successfully applied to image classification [Ganin et al., ICML, 2015] [Ajakan et al., JMLR, 2016]. More speech- and text-related applications in the following.
119. Domain-adversarial training
Yaroslav Ganin, Victor Lempitsky, "Unsupervised Domain Adaptation by Backpropagation", ICML, 2015
Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, "Domain-Adversarial Training of Neural Networks", JMLR, 2016
120. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
121. Unsupervised Conditional Generation
Transform an object from one domain to another without paired data (e.g., style transfer): G maps Domain X to Domain Y. Examples: photos ↔ paintings; male ↔ female faces; positive sentences ("It is good. It's a good day. I love you.") ↔ negative sentences ("It is bad. It's a bad day. I don't love you.").
122. Unsupervised Conditional Generation
• Approach 1: Direct transformation $G_{X\to Y}$ from Domain X to Domain Y — for texture or color change.
• Approach 2: Projection to common space: an encoder of domain X ($EN_X$) projects to a latent code (e.g., face attributes), and a decoder of domain Y ($DE_Y$) generates from it — for larger change, keeping only the semantics.
124. Direct Transformation
Train $G_{X\to Y}$ with a discriminator $D_Y$ that outputs a scalar: does the input image belong to domain Y or not? The generator's output becomes similar to domain Y — but the generator could ignore the input. Not what we want!
125. Direct Transformation
Same setup; the issue can be avoided by network design: a simpler generator makes the input and output more closely related. [Tomer Galanti, et al., ICLR, 2018]
126. Direct Transformation
Baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]: feed the input and the output of $G_{X\to Y}$ through a pre-trained encoder network, and require the two embeddings to be as close as possible.
127. Direct Transformation
Cycle consistency [Jun-Yan Zhu, et al., ICCV, 2017]: add $G_{Y\to X}$ and require $G_{Y\to X}(G_{X\to Y}(x))$ to be as close as possible to x, in addition to the $D_Y$ loss. Otherwise the transformed image may lack the information needed for reconstruction.
128. Direct Transformation (both directions)
Train both cycles: X → $G_{X\to Y}$ → Y → $G_{Y\to X}$ → X as close as possible, and Y → $G_{Y\to X}$ → X → $G_{X\to Y}$ → Y as close as possible, with $D_Y$ (scalar: belongs to domain Y or not) and $D_X$ (scalar: belongs to domain X or not).
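The role of the cycle-consistency term can be seen in a toy 1-D sketch (my own illustration, not the lecture's code): the loss is zero only when $G_{Y\to X}$ undoes $G_{X\to Y}$, and large when a generator ignores its input.

```python
# Cycle-consistency loss || G_YX(G_XY(x)) - x || over toy 1-D "domains".
def cycle_loss(g_xy, g_yx, xs):
    return sum(abs(g_yx(g_xy(x)) - x) for x in xs) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0]
good_xy = lambda x: 2*x + 1       # invertible transformation X -> Y
good_yx = lambda y: (y - 1) / 2   # its inverse Y -> X
bad_yx  = lambda y: 0.0           # ignores its input

print(cycle_loss(good_xy, good_yx, xs))  # 0.0: consistent cycle
print(cycle_loss(good_xy, bad_yx, xs))   # large: cycle consistency violated
```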
137. Unsupervised Conditional Generation
• Approach 1: Direct transformation $G_{X\to Y}$ — for texture or color change.
• Approach 2: Projection to common space: encoder of domain X ($EN_X$), decoder of domain Y ($DE_Y$), latent code such as face attributes — larger change, only keep the semantics.
138. Projection to Common Space — Target
$EN_X$ and $EN_Y$ encode images of domain X and domain Y into a common latent space (e.g., face attributes); $DE_X$ and $DE_Y$ decode the latent code back into images of each domain.
139. Projection to Common Space — Training
Train ($EN_X$, $DE_X$) and ($EN_Y$, $DE_Y$) as two auto-encoders, minimizing the reconstruction error.
140. Projection to Common Space — Training
Also add discriminators $D_X$ and $D_Y$ on the decoded images of each domain, while minimizing the reconstruction errors. But because we train the two auto-encoders separately, images with the same attributes may not project to the same position in the latent space.
141. Projection to Common Space — Training
Sharing the parameters of the encoders and of the decoders: Coupled GAN [Ming-Yu Liu, et al., NIPS, 2016], UNIT [Ming-Yu Liu, et al., NIPS, 2017]
142. Projection to Common Space — Training
Add a domain discriminator that guesses whether a latent code came from $EN_X$ or $EN_Y$; $EN_X$ and $EN_Y$ learn to fool it. The domain discriminator forces the outputs of $EN_X$ and $EN_Y$ to have the same distribution. [Guillaume Lample, et al., NIPS, 2017]
143. Projection to Common Space — Training
Cycle consistency: X → $EN_X$ → $DE_Y$ → Y → $EN_Y$ → $DE_X$ → X, minimizing the reconstruction error through the whole cycle. Used in ComboGAN [Asha Anoosheh, et al., arXiv, 2017]
144. Projection to Common Space — Training
Semantic consistency: require an image and its transformed version to map to the same latent code. Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017]
145. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
147. Neural network as Actor
• Input of the neural network: the observation of the machine, represented as a vector or a matrix (e.g., pixels)
• Output of the neural network: each action corresponds to a neuron in the output layer, giving a score for each action (e.g., left 0.7, right 0.2, fire 0.1). Take the action based on the probability.
148. Example: Playing Video Game
Start with observation $s_1$ → take action $a_1$ ("right"), obtain reward $r_1 = 0$ → observation $s_2$ → take action $a_2$ ("fire", kill an alien), obtain reward $r_2 = 5$ → observation $s_3$ → …
149. Example: Playing Video Game
After many turns: action $a_T$, obtain reward $r_T$, game over (spaceship destroyed). This is an episode. We want to maximize the total reward
$$R = \sum_{t=1}^T r_t$$
151. Reinforcement Learning v.s. GAN
The actor takes $s_1$ and outputs $a_1$; the environment returns $s_2$ (with reward $r_1$); the actor outputs $a_2$ (with reward $r_2$); and so on, with total reward
$$R(\tau) = \sum_{t=1}^T r_t$$
The reward function is a "black box": you cannot use backpropagation through it. Analogy: actor → generator (updated); reward function → discriminator (but fixed, not updated).
152. Imitation Learning
Same actor-environment loop, but the reward function is not available. Instead, we have demonstrations of the expert: $\hat{\tau}^1, \hat{\tau}^2, \dots, \hat{\tau}^N$, where each $\hat{\tau}$ is a trajectory of the expert. Self-driving: record human drivers. Robot: grab the arm of the robot.
154. Framework of IRL
• The expert $\hat{\pi}$ provides trajectories $\hat{\tau}^1, \dots, \hat{\tau}^N$; the actor $\pi$ produces trajectories $\tau^1, \dots, \tau^N$.
• Obtain a reward function R such that the expert is always the best:
$$\sum_{n=1}^N R(\hat{\tau}^n) > \sum_{n=1}^N R(\tau^n)$$
• Find an actor based on the reward function R (by reinforcement learning), and iterate.
Reward function → discriminator; actor → generator.
155. GAN v.s. IRL
GAN: G generates; D gives a high score to real data and a low score to generated data; find a G whose output obtains a large score from D.
IRL: the actor generates $\tau^1, \dots, \tau^N$; the reward function gives a larger reward to the expert's $\hat{\tau}^n$ and a lower reward to $\tau$; find an actor that obtains a large reward.