Generative Adversarial Network and its Applications on Speech and Natural Language Processing, Part 1.
Speaker: Hung-yi Lee (Professor, National Taiwan University)
Date: July 2018
Generative adversarial network (GAN) is a new idea for training models, in which a generator and a discriminator compete against each other to improve the generation quality. Recently, GAN has shown amazing results in image generation, and a large number and a wide variety of new ideas, techniques, and applications have been developed based on it. Although there are only a few successful cases so far, GAN has great potential to be applied to text and speech generation to overcome limitations of conventional methods.
In the first part of the talk, I will give an introduction to GAN and provide a thorough review of this technology. In the second part, I will focus on the applications of GAN to speech and natural language processing. I will demonstrate the applications of GAN to voice conversion, unsupervised abstractive summarization, and sentiment-controllable chat-bots. I will also talk about research directions toward unsupervised speech recognition by GAN.
5. Mihaela Rosca, Balaji Lakshminarayanan, David Warde-Farley, Shakir Mohamed, "Variational Approaches for Auto-Encoding Generative Adversarial Networks", arXiv, 2017
All Kinds of GAN … https://github.com/hindupuravinash/the-gan-zoo
GAN
ACGAN
BGAN
DCGAN
EBGAN
fGAN
GoGAN
CGAN
……
6. ICASSP: keyword search on the session index page, so session names are included. Bar chart: number of papers whose titles include the keywords "generative", "adversarial", or "reinforcement", 2012–2018. GAN becomes a very important technology.
7. Outline
Part I: General Introduction of Generative Adversarial Network (GAN)
Part II: Applications to Speech and Text Processing
Part III: Recent Progress at National Taiwan University
8. Part I: General Introduction
Generative Adversarial Network and its Applications to Speech Processing and Natural Language Processing
9. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
10. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
12. Basic Idea of GAN
Generator: it is a neural network (NN), or a function. It takes a (high-dimensional) vector as input and outputs an image. Each dimension of the input vector represents some characteristic of the output, e.g., longer hair, blue hair, open mouth.
Powered by: http://mattya.github.io/chainer-DCGAN/
13. Basic Idea of GAN
Discriminator: it is also a neural network (NN), or a function. It takes an image as input and outputs a scalar. A larger value means the image looks real (e.g., 1.0); a smaller value means it looks fake (e.g., 0.1).
14. Algorithm
• Initialize generator G and discriminator D
• In each training iteration:
Step 1: Fix generator G, and update discriminator D. The discriminator learns to assign high scores (1) to real objects randomly sampled from the database and low scores (0) to objects generated from random input vectors.
15. Algorithm
• Initialize generator and discriminator
• In each training iteration:
Step 2: Fix discriminator D, and update generator G. The generator learns to "fool" the discriminator: treating the generator and the (fixed) discriminator as one large network, update the generator's hidden layers by gradient ascent so that the discriminator's output (e.g., 0.13) becomes large.
16. Algorithm
• Initialize generator and discriminator
• In each training iteration: learning D (sample some real objects from the database, labeled 1; generate some fake objects with G from random vectors, labeled 0; update D with G fixed), then learning G (update G to make D's output large, with D fixed).
17. Basic Idea of GAN
The generator and discriminator improve together over versions: generator v1 against discriminator v1 (trained on real images), then v2 against v2, then v3 against v3, and so on. This is where the term "adversarial" comes from.
27. (Variational) Auto-encoder
An NN encoder maps an image to a code, and an NN decoder reconstructs the image from the code, trained so input and reconstruction are as close as possible. At generation time, randomly generate a vector as the code and feed it to the decoder: the decoder plays the role of the generator. Image quality?
28. Auto-encoder v.s. GAN
An auto-encoder decoder is trained to make each output as close as possible to a target image, so it tends to produce fuzzy images (it could even just copy an image). A GAN generator is instead trained against a discriminator; if the discriminator does not simply memorize the images, the generator learns the patterns of faces.
29. FID [Martin Heusel, et al., NIPS, 2017]: smaller is better. [Mario Lucic, et al., arXiv, 2017]
30. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
31. Generation
• We want to find the data distribution $P_{data}(x)$, where $x$ is an image (a high-dimensional vector). In the image space, only a small region has high probability; the rest has low probability.
32. Maximum Likelihood Estimation
• Given a data distribution $P_{data}(x)$ (we can sample from it)
• We have a distribution $P_G(x;\theta)$ parameterized by $\theta$
• We want to find $\theta$ such that $P_G(x;\theta)$ is close to $P_{data}(x)$
• E.g. $P_G(x;\theta)$ is a Gaussian mixture model, and $\theta$ collects the means and variances of the Gaussians
Sample $x^1, x^2, \dots, x^m$ from $P_{data}(x)$. We can compute $P_G(x^i;\theta)$, so the likelihood of generating the samples is
$$L = \prod_{i=1}^{m} P_G(x^i;\theta)$$
Find $\theta^*$ maximizing the likelihood.
33. Maximum Likelihood Estimation = Minimize KL Divergence
$$\theta^* = \arg\max_\theta \prod_{i=1}^m P_G(x^i;\theta) = \arg\max_\theta \log \prod_{i=1}^m P_G(x^i;\theta) = \arg\max_\theta \sum_{i=1}^m \log P_G(x^i;\theta)$$
$$\approx \arg\max_\theta E_{x\sim P_{data}}[\log P_G(x;\theta)] \quad (x^1,\dots,x^m \text{ sampled from } P_{data}(x))$$
$$= \arg\max_\theta \left[ \int_x P_{data}(x)\log P_G(x;\theta)\,dx - \int_x P_{data}(x)\log P_{data}(x)\,dx \right]$$
$$= \arg\min_\theta KL(P_{data}\,\|\,P_G)$$
How to define a general $P_G$?
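To make the MLE-equals-minimizing-KL connection concrete, here is a minimal numerical sketch (my own illustration, not from the slides): for a unit-variance Gaussian model $P_G(x;\theta)$, the $\theta$ maximizing the average log-likelihood of samples from $P_{data}$ is the one closest to the data mean, i.e., the minimizer of $KL(P_{data}\|P_G)$.

```python
import math, random

random.seed(0)
data = [random.gauss(2.0, 1.0) for _ in range(10000)]  # samples from P_data = N(2, 1)

def avg_log_likelihood(theta, xs, sigma=1.0):
    # average log P_G(x; theta) for a unit-variance Gaussian model
    return sum(-0.5*math.log(2*math.pi*sigma**2) - (x-theta)**2/(2*sigma**2)
               for x in xs) / len(xs)

# Scan candidate means; the maximizer lands near the true mean 2.0,
# i.e. at the theta that minimizes KL(P_data || P_G).
candidates = [i/10 for i in range(0, 41)]
best = max(candidates, key=lambda t: avg_log_likelihood(t, data))
print(best)
```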
34. Generator
• A generator G is a network. The network defines a probability distribution $P_G$: sample $z$ from a normal distribution and output $x = G(z)$, where $x$ is an image (a high-dimensional vector).
We want $P_G(x)$ and $P_{data}(x)$ to be as close as possible:
$$G^* = \arg\min_G Div(P_G, P_{data})$$
where $Div(P_G, P_{data})$ is a divergence between the distributions $P_G$ and $P_{data}$. How to compute the divergence?
35. Discriminator
$$G^* = \arg\min_G Div(P_G, P_{data})$$
Although we do not know the distributions $P_G$ and $P_{data}$, we can sample from them: sampling from $P_{data}$ means drawing from the database; sampling from $P_G$ means feeding vectors sampled from the normal distribution into G.
36. Discriminator
$$G^* = \arg\min_G Div(P_G, P_{data})$$
Train a discriminator D (with sigmoid output) on data sampled from $P_{data}$ and data sampled from $P_G$. Example objective function for D (G is fixed):
$$V(G,D) = E_{x\sim P_{data}}[\log D(x)] + E_{x\sim P_G}[\log(1-D(x))]$$
Training: $D^* = \arg\max_D V(D,G)$. Using this example objective function is exactly the same as training a binary classifier. The maximum objective value is related to JS divergence. [Goodfellow, et al., NIPS, 2014]
37. Given G, what is the optimal $D^*$ maximizing $V(G,D)$?
$$V = E_{x\sim P_{data}}[\log D(x)] + E_{x\sim P_G}[\log(1-D(x))]$$
$$= \int_x P_{data}(x)\log D(x)\,dx + \int_x P_G(x)\log(1-D(x))\,dx$$
$$= \int_x \left[ P_{data}(x)\log D(x) + P_G(x)\log(1-D(x)) \right] dx$$
• Given $x$, find the optimal $D^*$ maximizing $P_{data}(x)\log D(x) + P_G(x)\log(1-D(x))$, assuming $D(x)$ can be any function. This gives $D^*(x) = P_{data}(x)/(P_{data}(x)+P_G(x))$.
41. Discriminator
$$G^* = \arg\min_G Div(P_G, P_{data}), \quad D^* = \arg\max_D V(D,G)$$
When the divergence is small, the data sampled from $P_{data}$ and from $P_G$ are hard to discriminate, and the objective cannot be made large; when the divergence is large, they are easy to discriminate.
42. The maximum objective value is related to JS divergence: for candidate generators $G_1, G_2, G_3$, compare $\max_D V(G_1,D)$, $\max_D V(G_2,D)$, $\max_D V(G_3,D)$; for example, $V(G_1, D_1^*)$ measures the divergence between $P_{G_1}$ and $P_{data}$. Then pick $G^* = \arg\min_G Div(P_G, P_{data})$.
43. Putting it together:
$$G^* = \arg\min_G \max_D V(G,D)$$
where $\max_D V(G,D)$ is related to JS divergence. [Goodfellow, et al., NIPS, 2014]
• Initialize generator and discriminator
• In each training iteration:
Step 1: Fix generator G, and update discriminator D: $D^* = \arg\max_D V(D,G)$
Step 2: Fix discriminator D, and update generator G
44. Algorithm
$$G^* = \arg\min_G L(G), \quad L(G) = \max_D V(G,D)$$
• To find the best G minimizing the loss function $L(G)$: $\theta_G \leftarrow \theta_G - \eta\,\partial L(G)/\partial\theta_G$, where $\theta_G$ defines G.
How to differentiate a max? If $f(x) = \max\{f_1(x), f_2(x), f_3(x)\}$, then $df(x)/dx = df_i(x)/dx$, where $f_i(x)$ is the max one at that $x$.
45. Algorithm
• Given $G_0$
• Find $D_0^*$ maximizing $V(G_0,D)$ (using gradient ascent); $V(G_0,D_0^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_0}(x)$
• $\theta_G \leftarrow \theta_G - \eta\,\partial V(G,D_0^*)/\partial\theta_G$, obtaining $G_1$ — decrease JS divergence(?)
• Find $D_1^*$ maximizing $V(G_1,D)$; $V(G_1,D_1^*)$ is the JS divergence between $P_{data}(x)$ and $P_{G_1}(x)$
• $\theta_G \leftarrow \theta_G - \eta\,\partial V(G,D_1^*)/\partial\theta_G$, obtaining $G_2$ — decrease JS divergence(?)
• ……
46. Algorithm
Why "decrease JS divergence(?)": updating $G_0$ to $G_1$ makes $V(G_1, D_0^*)$ smaller than $V(G_0, D_0^*)$, but the maximizer may move: the actual divergence for $G_1$ is $V(G_1, D_1^*)$, which could be larger. The step only decreases JS divergence under the assumption $D_0^* \approx D_1^*$ — hence: don't update G too much.
47. In practice …
• Given G, how to compute $\max_D V(G,D)$?
• Sample $x^1, x^2, \dots, x^m$ from $P_{data}(x)$ (positive examples), and sample $\tilde{x}^1, \tilde{x}^2, \dots, \tilde{x}^m$ from the generator $P_G(x)$ (negative examples). Maximize
$$\tilde{V} = \frac{1}{m}\sum_{i=1}^m \log D(x^i) + \frac{1}{m}\sum_{i=1}^m \log\left(1 - D(\tilde{x}^i)\right)$$
D is a binary classifier with sigmoid output (it can be deep). Maximizing $\tilde{V}$ is exactly the same as minimizing the cross-entropy when training this binary classifier.
48. Algorithm
Initialize $\theta_d$ for D and $\theta_g$ for G.
• In each training iteration:
Learning D (repeat k times):
• Sample m examples $x^1, x^2, \dots, x^m$ from the data distribution $P_{data}(x)$
• Sample m noise samples $z^1, z^2, \dots, z^m$ from the prior $P_{prior}(z)$
• Obtain generated data $\tilde{x}^1, \tilde{x}^2, \dots, \tilde{x}^m$, $\tilde{x}^i = G(z^i)$
• Update discriminator parameters $\theta_d$ to maximize
$$\tilde{V} = \frac{1}{m}\sum_{i=1}^m \log D(x^i) + \frac{1}{m}\sum_{i=1}^m \log\left(1-D(\tilde{x}^i)\right), \quad \theta_d \leftarrow \theta_d + \eta\nabla\tilde{V}(\theta_d)$$
(with finitely many D updates we can only find a lower bound of $\max_D V(G,D)$)
Learning G (only once):
• Sample another m noise samples $z^1, z^2, \dots, z^m$ from the prior $P_{prior}(z)$
• Update generator parameters $\theta_g$ to minimize
$$\tilde{V} = \frac{1}{m}\sum_{i=1}^m \log D(x^i) + \frac{1}{m}\sum_{i=1}^m \log\left(1-D(G(z^i))\right), \quad \theta_g \leftarrow \theta_g - \eta\nabla\tilde{V}(\theta_g)$$
(the first term does not depend on $\theta_g$)
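The algorithm above can be sketched end to end on 1-D data. This is an illustrative toy of my own (logistic discriminator $D(x)=\sigma(wx+c)$, generator $x = az + b$ so that $P_G$ is Gaussian), not the lecture's implementation; for the generator step it uses the non-saturating form, maximizing $\log D(G(z))$.

```python
import math, random

random.seed(0)
sigmoid = lambda u: 1/(1 + math.exp(-max(-30.0, min(30.0, u))))

# P_data = N(3, 0.5). Generator: x = a*z + b with z ~ N(0,1), so P_G = N(b, a^2).
a, b = 1.0, 0.0          # theta_g
w, c = 0.0, 0.0          # theta_d
lr, m = 0.05, 32

for it in range(2000):
    # Learning D (k = 1): ascend mean log D(x_real) + mean log(1 - D(x_fake))
    reals = [random.gauss(3.0, 0.5) for _ in range(m)]
    fakes = [a*random.gauss(0, 1) + b for _ in range(m)]
    gw = gc = 0.0
    for x in reals:                       # d/du log sigmoid(u) = 1 - D(x)
        e = 1 - sigmoid(w*x + c); gw += e*x/m; gc += e/m
    for x in fakes:                       # d/du log(1 - sigmoid(u)) = -D(x)
        e = -sigmoid(w*x + c); gw += e*x/m; gc += e/m
    w += lr*gw; c += lr*gc
    # Learning G (only once): ascend mean log D(G(z)) (non-saturating form)
    zs = [random.gauss(0, 1) for _ in range(m)]
    ga = gb = 0.0
    for z in zs:
        x = a*z + b
        e = (1 - sigmoid(w*x + c))*w      # chain rule through x = a*z + b
        ga += e*z/m; gb += e/m
    a += lr*ga; b += lr*gb

print(round(b, 2))  # the generator mean drifts toward the data mean 3.0
```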
49. Objective Function for Generator in Real Implementation
Minimax GAN (MMGAN): $V = E_{x\sim P_G}[\log(1-D(x))]$ — slow at the beginning, because $\log(1-D(x))$ is flat when $D(x)$ is small.
Non-saturating GAN (NSGAN): $V = E_{x\sim P_G}[-\log D(x)]$ — the real implementation: label $x$ from $P_G$ as positive. (The term $E_{x\sim P_{data}}[\log D(x)]$ does not depend on the generator.)
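Why MMGAN is slow at the beginning can be checked directly: early in training D confidently rejects fakes ($D(x)\approx 0$), and the gradient of $\log(1-D(x))$ with respect to the discriminator logit vanishes while that of $-\log D(x)$ does not. A small sketch of my own (not from the slides):

```python
import math

def sigmoid(u):
    return 1/(1 + math.exp(-u))

# Gradients of the two generator losses w.r.t. the logit u, where D(x) = sigmoid(u):
#   MMGAN: minimize  log(1 - D(x))  ->  d/du = -D(x)       (tiny when D(x) ~ 0)
#   NSGAN: minimize -log D(x)       ->  d/du = -(1 - D(x)) (large when D(x) ~ 0)
u = -4.0                 # early training: D confidently rejects the fake, D(x) ~ 0.018
d = sigmoid(u)
grad_mm = -d
grad_ns = -(1 - d)
print(abs(grad_mm), abs(grad_ns))
```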
50. Using the divergence you like — can we use other divergences? [Sebastian Nowozin, et al., NIPS, 2016]
51. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
More tips and tricks:
https://github.com/soumith/ganhacks
52. JS divergence is not suitable
• In most cases, $P_G$ and $P_{data}$ do not overlap.
• 1. The nature of the data: both $P_{data}$ and $P_G$ are low-dimensional manifolds in a high-dimensional space, so the overlap can be ignored.
• 2. Sampling: even if $P_{data}$ and $P_G$ do overlap, with limited sampling the samples look disjoint.
53. What is the problem of JS divergence?
JS divergence is $\log 2$ whenever two distributions do not overlap:
$$JS(P_{G_0}, P_{data}) = \log 2, \quad JS(P_{G_1}, P_{data}) = \log 2, \quad \dots, \quad JS(P_{G_{100}}, P_{data}) = 0$$
Even though $P_{G_1}$ is closer to $P_{data}$ than $P_{G_0}$, they are equally bad: the same objective value is obtained, so the same divergence. Intuition: if two distributions do not overlap, a binary classifier achieves 100% accuracy.
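The saturation at $\log 2$ is easy to verify numerically for discrete distributions (illustrative code of my own, not from the lecture):

```python
import math

def kl(p, q):
    return sum(pi*math.log(pi/qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    m = [(pi + qi)/2 for pi, qi in zip(p, q)]
    return 0.5*kl(p, m) + 0.5*kl(q, m)

# Two distributions with disjoint support over 4 points:
p = [0.5, 0.5, 0.0, 0.0]
q = [0.0, 0.0, 0.5, 0.5]
print(js(p, q), math.log(2))   # JS saturates at log 2 regardless of how far apart the supports are
```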
54. Wasserstein GAN (WGAN): Earth Mover's Distance
• Considering one distribution P as a pile of earth, and another distribution Q as the target
• The earth mover's distance is the average distance the earth mover has to move the earth. For two point masses separated by distance d: $W(P,Q) = d$.
55. Wasserstein distance
Source of image: https://vincentherrmann.github.io/blog/wasserstein/
There are many possible "moving plans" from P to Q, some with smaller and some with larger average distance. The Wasserstein distance is defined using the "moving plan" with the smallest average distance.
56. WGAN: Earth Mover's Distance
Source of image: https://vincentherrmann.github.io/blog/wasserstein/
Using the "moving plan" with the smallest average distance to define the earth mover's distance. There are many possible "moving plans"; the figure shows the best "moving plan" of this example.
57. A "moving plan" is a matrix
The value of each element is the amount of earth moved from one position $x_p$ in P to another position $x_q$ in Q. The average distance of a plan $\gamma$:
$$B(\gamma) = \sum_{x_p, x_q} \gamma(x_p, x_q)\,\|x_p - x_q\|$$
Earth mover's distance: the best plan over all possible plans $\Pi$,
$$W(P,Q) = \min_{\gamma\in\Pi} B(\gamma)$$
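For distributions on a line, the best moving plan has a closed form: $W(P,Q)$ equals the area between the two CDFs. A small sketch of my own showing that, unlike JS, the distance keeps shrinking as the distributions get closer:

```python
# 1-D earth mover's distance: W(P, Q) = sum over the grid of |CDF_P - CDF_Q| * bin width.
def w1(p, q, positions):
    cdf_p = cdf_q = 0.0
    total = 0.0
    for i in range(len(positions) - 1):
        cdf_p += p[i]; cdf_q += q[i]
        total += abs(cdf_p - cdf_q) * (positions[i+1] - positions[i])
    return total

positions = [0, 1, 2, 3, 4, 5]
p      = [1.0, 0, 0, 0, 0, 0]   # all earth at position 0
q_far  = [0, 0, 0, 0, 1.0, 0]   # target at position 4
q_near = [0, 0, 1.0, 0, 0, 0]   # target at position 2
print(w1(p, q_far, positions), w1(p, q_near, positions))  # 4.0 and 2.0
```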
59. WGAN
Evaluate the Wasserstein distance between $P_{data}$ and $P_G$ [Martin Arjovsky, et al., arXiv, 2017]:
$$V(G,D) = \max_{D\in 1\text{-Lipschitz}} \left\{ E_{x\sim P_{data}}[D(x)] - E_{x\sim P_G}[D(x)] \right\}$$
D has to be smooth enough. Without the constraint, D would push the scores of real data toward $+\infty$ and of generated data toward $-\infty$, and the training of D would not converge; keeping D smooth prevents $D(x)$ from becoming $+\infty$ and $-\infty$.
60. WGAN
How to fulfill the constraint that D is 1-Lipschitz? A function f is a Lipschitz function if
$$\|f(x_1) - f(x_2)\| \le K\,\|x_1 - x_2\|$$
(output change bounded by K times input change; K = 1 for "1-Lipschitz"), i.e., it does not change fast.
Weight clipping: force every parameter w to lie between −c and c. After each parameter update, if w > c, set w = c; if w < −c, set w = −c.
61. Improved WGAN (WGAN-GP)
A differentiable function is 1-Lipschitz if and only if it has gradients with norm less than or equal to 1 everywhere: $\|\nabla_x D(x)\| \le 1$ for all $x$. So instead of $D\in 1\text{-Lipschitz}$, prefer $\|\nabla_x D(x)\|\le 1$ for all $x$, approximated by a penalty:
$$V(G,D) \approx \max_D \left\{ E_{x\sim P_{data}}[D(x)] - E_{x\sim P_G}[D(x)] - \lambda \int_x \max\left(0, \|\nabla_x D(x)\| - 1\right) dx \right\}$$
$$\approx \max_D \left\{ E_{x\sim P_{data}}[D(x)] - E_{x\sim P_G}[D(x)] - \lambda E_{x\sim P_{penalty}}\left[\max\left(0, \|\nabla_x D(x)\| - 1\right)\right] \right\}$$
i.e., prefer $\|\nabla_x D(x)\|\le 1$ only for $x$ sampled from $P_{penalty}$.
62. Improved WGAN (WGAN-GP)
$P_{penalty}$: only give the gradient constraint to the region between $P_{data}$ and $P_G$, because that region influences how $P_G$ moves toward $P_{data}$. "Given that enforcing the Lipschitz constraint everywhere is intractable, enforcing it only along these straight lines seems sufficient and experimentally results in good performance."
63. Improved WGAN (WGAN-GP)
In practice the penalty is $\left(\|\nabla_x D(x)\| - 1\right)^2$, which pushes the gradient norm in the penalty region toward 1. "Simply penalizing overly large gradients also works in theory, but experimentally we found that this approach converged faster and to better optima."
More: Improved the improved WGAN
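The penalty term is easy to evaluate for a toy critic whose gradient is known in closed form. A linear $D(x) = w\cdot x$ is assumed here purely for illustration (a real critic is a deep network whose gradient comes from automatic differentiation), using the squared form $(\|\nabla_x D(x)\| - 1)^2$ above:

```python
import math

# WGAN-GP penalty for one point sampled between P_data and P_G:
# lambda * (||grad_x D(x)|| - 1)^2. For a linear critic D(x) = w . x,
# the gradient at every x is just w.
def gradient_penalty(w, lam=10.0):
    grad_norm = math.sqrt(sum(wi*wi for wi in w))
    return lam * (grad_norm - 1.0)**2

print(gradient_penalty([3.0, 4.0]))  # ||w|| = 5 -> 10 * (5 - 1)^2 = 160
print(gradient_penalty([0.6, 0.8]))  # ||w|| = 1 -> no penalty
```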
64. Spectral Normalization → keep the gradient norm smaller than 1 everywhere [Miyato, et al., ICLR, 2018]
65. Algorithm of Original GAN → WGAN
Same algorithm as the original GAN (initialize $\theta_d$ and $\theta_g$; in each iteration, learn D k times and G only once), with these changes:
• No sigmoid for the output of D
• Learning D: update $\theta_d$ to maximize
$$\tilde{V} = \frac{1}{m}\sum_{i=1}^m D(x^i) - \frac{1}{m}\sum_{i=1}^m D(\tilde{x}^i)$$
followed by weight clipping / gradient penalty …
• Learning G: update $\theta_g$ to minimize
$$\tilde{V} = -\frac{1}{m}\sum_{i=1}^m D(G(z^i))$$
66. Energy-based GAN (EBGAN)
• Using an autoencoder as discriminator D [Junbo Zhao, et al., arXiv, 2016]: the discriminator score is the negative reconstruction error of the autoencoder (e.g., reconstruction error 0.1 → score −0.1), so 0 is the score of the best images. The generator is the same.
Using the negative reconstruction error of the autoencoder to determine the goodness. Benefit: the autoencoder can be pre-trained on real images, without a generator.
67. EBGAN
Hard to reconstruct, easy to destroy: an autoencoder-based discriminator only gives a large value to a limited region. The scores of generated samples are bounded below by a margin −m: they do not have to be very negative, and 0 is for the best.
69. Mode Dropping
BEGAN on CelebA: comparing the generator at iteration t, t+1, and t+2 shows that the generator switches modes during training.
71. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
72. Likelihood
The generator (from a prior distribution on z) defines $P_G$. Given real data $x^i$ (not observed during training), the log likelihood is
$$L = \frac{1}{N}\sum_i \log P_G(x^i)$$
We cannot compute $P_G(x^i)$; we can only sample from $P_G$.
73. Likelihood — Kernel Density Estimation
• Estimate the distribution $P_G(x)$ from sampling: draw samples from the generator, and treat each sample as the mean of a Gaussian with the same covariance. Now we have an approximation of $P_G$, so we can compute $P_G(x^i)$ for each real data point $x^i$, and then compute the likelihood.
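A minimal 1-D sketch of this estimate (pure Python; the bandwidth is chosen arbitrarily for illustration):

```python
import math, random

random.seed(1)
samples = [random.gauss(0.0, 1.0) for _ in range(2000)]  # draws obtained via the generator

def kde_density(x, samples, h=0.3):
    # Each sample is the mean of a Gaussian with the same (band)width h.
    n = len(samples)
    return sum(math.exp(-(x - s)**2/(2*h*h))/(math.sqrt(2*math.pi)*h)
               for s in samples) / n

# The approximated P_G can now be evaluated at any real data point x^i:
print(kde_density(0.0, samples))   # near the true N(0,1) density ~0.4
print(kde_density(5.0, samples))   # far in the tail: tiny density
```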
74. Likelihood v.s. Quality
• Low likelihood, high quality? Consider a model generating good images with a small variance: the data for evaluation $x^i$ falls outside it and gets $P_G(x^i) = 0$.
• High likelihood, low quality? Suppose generator $G_1$ achieves likelihood $L = \frac{1}{N}\sum_i \log P_G(x^i)$. A generator $G_2$ that produces $G_1$'s output only 1% of the time (and noise with probability 0.99) has likelihood
$$-\log 100 + \frac{1}{N}\sum_i \log P_G(x^i)$$
only $\log 100 \approx 4.6$ lower, even though it is 100 times worse.
75. Objective Evaluation
Feed each generated image $x$ into an off-the-shelf image classifier (e.g., Inception net, VGG, etc.) to get $P(y|x)$, where $y$ is the class (the output of the CNN) [Tim Salimans, et al., NIPS, 2016].
• A concentrated distribution $P(y|x)$ for a single image means higher visual quality.
• A uniform marginal distribution over many images,
$$P(y) = \frac{1}{N}\sum_n P(y^n|x^n),$$
means higher variety.
76. Objective Evaluation — Inception Score
$$\text{Inception Score} = \sum_x \sum_y P(y|x)\log P(y|x) - \sum_y P(y)\log P(y)$$
The first term is the negative entropy of $P(y|x)$ (large when each image is confidently classified); the second is the entropy of $P(y)$ (large when the classes are diverse), with $P(y) = \frac{1}{N}\sum_n P(y^n|x^n)$.
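Both terms can be computed directly from the classifier outputs. A toy sketch of my own (real evaluations use the Inception network over many samples):

```python
import math

def inception_score_log(cond_probs):
    # cond_probs: one P(y|x) distribution per generated image x.
    n = len(cond_probs)
    k = len(cond_probs[0])
    p_y = [sum(p[c] for p in cond_probs)/n for c in range(k)]        # marginal P(y)
    neg_cond_entropy = sum(pc*math.log(pc)
                           for p in cond_probs for pc in p if pc > 0)/n
    marg_entropy = -sum(py*math.log(py) for py in p_y if py > 0)
    return neg_cond_entropy + marg_entropy

# Sharp and diverse: confident classifications spread over classes -> high score (log 3)
sharp_diverse = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
# Blurry: every image gives a uniform P(y|x) -> score 0
blurry = [[1/3, 1/3, 1/3]] * 3
print(inception_score_log(sharp_diverse), inception_score_log(blurry))
```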
77. We don't want a memory GAN.
• Using k-nearest neighbors to check whether the generator generates new objects: https://arxiv.org/pdf/1511.01844.pdf
78. Mode Dropping
Sanjeev Arora, Andrej Risteski, Yi Zhang, "Do GANs learn the distribution? Some Theory and Empirics", ICLR, 2018
Birthday-paradox style test: if, sampling 400 images, you find "identical" images with probability > 50%, the generator produces approximately 0.16M different faces; measured this way, DCGAN produces approximately 0.16M different faces and ALI approximately 1M.
79. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
80. Text-to-Image
• Traditional supervised approach: an NN maps text (e.g., "train") to an image, trained to make the output as close as possible to the target of the NN output. But one text matches many images (e.g., many pictures fit "a dog is running" or "a bird is flying"), so the NN learns their average: a blurry image!
81. Conditional GAN
G takes a condition c ("train") and z sampled from a normal distribution and outputs x = G(c,z). With the original discriminator D(x) → scalar ("x is a real image or not"; real images → 1, generated images → 0), the generator will learn to generate realistic images but completely ignore the input conditions. [Scott Reed, et al., ICML, 2016]
82. Conditional GAN
A better discriminator takes both c and x and outputs a scalar: x is realistic or not AND c and x are matched or not. Training pairs: true text-image pairs (train, train image) → 1; (train, generated image) → 0; mismatched pairs such as (cat, train image) → 0. [Scott Reed, et al., ICML, 2016]
83. Algorithm (Conditional GAN)
• In each training iteration:
Learning D:
• Sample m positive examples $(c^1,x^1), (c^2,x^2), \dots, (c^m,x^m)$ from the database
• Sample m noise samples $z^1, z^2, \dots, z^m$ from a distribution; obtain generated data $\tilde{x}^1, \dots, \tilde{x}^m$, $\tilde{x}^i = G(c^i, z^i)$
• Sample m objects $\hat{x}^1, \dots, \hat{x}^m$ from the database (mismatched with the conditions)
• Update discriminator parameters $\theta_d$ to maximize
$$\tilde{V} = \frac{1}{m}\sum_{i=1}^m \log D(c^i, x^i) + \frac{1}{m}\sum_{i=1}^m \log\left(1 - D(c^i, \tilde{x}^i)\right) + \frac{1}{m}\sum_{i=1}^m \log\left(1 - D(c^i, \hat{x}^i)\right)$$
$$\theta_d \leftarrow \theta_d + \eta\nabla\tilde{V}(\theta_d)$$
Learning G:
• Sample m noise samples $z^1, \dots, z^m$ from a distribution and m conditions $c^1, \dots, c^m$ from a database
• Update generator parameters $\theta_g$ to maximize
$$\tilde{V} = \frac{1}{m}\sum_{i=1}^m \log D\left(c^i, G(c^i, z^i)\right), \quad \theta_g \leftarrow \theta_g + \eta\nabla\tilde{V}(\theta_g)$$
84. Conditional GAN - Discriminator
Two architectures for D(c, x):
• (almost every paper) Feed condition c and object x through networks, combine them, and output one score: x is realistic or not AND c and x are matched or not. [Takeru Miyato, et al., ICLR, 2018] [Han Zhang, et al., arXiv, 2017] [Augustus Odena et al., ICML, 2017]
• Separate the two criteria: one network scores "x is realistic or not" from x alone, and another scores "c and x are matched or not".
85. Conditional GAN
Paired data: collecting anime faces and the descriptions of their characteristics (e.g., "blue eyes, red hair, short hair"). Generated examples conditioned on "red hair, green eyes" and "blue hair, red eyes".
Image generation by: 吳宗翰、謝濬丞、陳延昊、錢柏均
86. StackGAN
Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, Dimitris Metaxas, "StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks", ICCV, 2017
88. Image-to-image
• Traditional supervised approach: an NN maps an input image to an output image, trained to be as close as possible to the target. At testing, the output is blurry because it is the average of several images.
91. Image super resolution
• Christian Ledig, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, Wenzhe Shi, "Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network", CVPR, 2016
96. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
102. GAN+Autoencoder
• We have a generator (input z, output x)
• However, given x, how can we find z?
• Learn an encoder (input x, output z): fix the generator as the decoder, and train the encoder so that x → z → x is as close as possible to the input (an autoencoder). The encoder can be initialized from the discriminator (only modify the last layer).
108. Back to z
Find $z^* = \arg\min_z L(G(z), x^T)$, where $L$ measures the difference between $G(z)$ and the target $x^T$ (e.g., pixel-wise).
• Method 1: gradient descent on z
• Method 2: predict z by another network
• Method 3: use the result of method 2 as the initialization of method 1
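Method 1 can be sketched with a toy linear "generator" (my own illustration; a real G is a deep network, and the gradient of the loss with respect to z comes from backpropagation):

```python
# Find z* = argmin_z || G(z) - x_T ||^2 by gradient descent, for G(z) = A z.
A = [[2.0, 0.5], [0.0, 1.0]]

def G(z):
    return [A[0][0]*z[0] + A[0][1]*z[1], A[1][0]*z[0] + A[1][1]*z[1]]

z_true = [1.0, -2.0]
x_T = G(z_true)                      # the target image

z = [0.0, 0.0]                       # zero init (Method 2 would init from an encoder)
lr = 0.05
for _ in range(500):
    x = G(z)
    r = [x[0] - x_T[0], x[1] - x_T[1]]      # residual G(z) - x_T
    # gradient of the squared pixel-wise loss: 2 * A^T r
    g = [2*(A[0][0]*r[0] + A[1][0]*r[1]), 2*(A[0][1]*r[0] + A[1][1]*r[1])]
    z = [z[0] - lr*g[0], z[1] - lr*g[1]]

print(z)   # converges to z_true = [1.0, -2.0]
```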
109. Editing Photos
• $z_0$ is the code of the input image. Find
$$z^* = \arg\min_z U(G(z)) + \lambda_1\|z - z_0\|^2 - \lambda_2 D(G(z))$$
where $U(G(z))$ checks whether the image fulfills the constraint of editing, $\|z - z_0\|^2$ keeps the result not too far away from the original image, and the discriminator term checks that the image is realistic.
110. VAE-GAN
Combine a VAE (encoder → z → decoder, minimizing reconstruction error, with z close to normal) and a GAN (generator = decoder, whose output goes to a discriminator giving a scalar). The decoder both minimizes the reconstruction error and cheats the discriminator; the discriminator learns to discriminate real, generated, and reconstructed images. In effect, the discriminator provides the reconstruction loss.
Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, Ole Winther, "Autoencoding beyond pixels using a learned similarity metric", ICML, 2016
111. Algorithm (VAE-GAN)
• Initialize En, De, Dis
• In each iteration:
• Sample M images $x^1, x^2, \dots, x^M$ from the database
• Generate M codes $\tilde{z}^i = En(x^i)$ and M reconstructed images $\tilde{x}^i = De(\tilde{z}^i)$
• Sample M codes $z^1, \dots, z^M$ from the prior P(z) and generate M images $\hat{x}^i = De(z^i)$
• Update En to decrease $\|\tilde{x}^i - x^i\|$ and decrease $KL(P(\tilde{z}^i|x^i)\,\|\,P(z))$
• Update De to decrease $\|\tilde{x}^i - x^i\|$ and increase $Dis(\tilde{x}^i)$ and $Dis(\hat{x}^i)$
• Update Dis to increase $Dis(x^i)$ and decrease $Dis(\tilde{x}^i)$ and $Dis(\hat{x}^i)$
Another kind of discriminator: classify x as real / generated / reconstructed.
113. Algorithm (BiGAN)
• Initialize encoder En, decoder De, discriminator Dis
• In each iteration:
• Sample M images $x^1, \dots, x^M$ from the database; generate M codes $\tilde{z}^i = En(x^i)$
• Sample M codes $z^1, \dots, z^M$ from the prior P(z); generate M images $\tilde{x}^i = De(z^i)$
• Update Dis to increase $Dis(x^i, \tilde{z}^i)$ and decrease $Dis(\tilde{x}^i, z^i)$
• Update En and De to decrease $Dis(x^i, \tilde{z}^i)$ and increase $Dis(\tilde{x}^i, z^i)$
114. The encoder produces pairs (real image x, generated code z); the decoder produces pairs (generated image x, code z from the prior distribution). The discriminator looks at an (image x, code z) pair and decides: is it from the encoder or the decoder? Encoder pairs follow $P(x,z)$ and decoder pairs follow $Q(x,z)$; the discriminator evaluates the difference between P and Q, and at the optimum P and Q would be the same. Optimal encoder and decoder: En(x') = z' and De(z') = x' for all x'; En(x'') = z'' and De(z'') = x'' for all z''.
115. BiGAN
How about directly training the two reconstruction cycles, x → En → z → De → x̃ and z → De → x → En → z̃? The optimality conditions are the same as BiGAN's (En(x') = z', De(z') = x' for all x'; De(z'') = x'', En(x'') = z'' for all z''), but since the optimum is not reached in practice, the two approaches can behave differently.
116. Domain Adversarial Training
• Training and testing data are in different domains. Take digit classification as an example: feed both through the same generator (feature extractor) so that the features of training data and testing data have the same distribution.
117. Domain Adversarial Training
A feature extractor (generator) produces features from an image; a discriminator (domain classifier) predicts which domain the image came from (blue points vs. red points). The feature extractor learns to make the domain classifier fail. Degenerate solution: always output zero vectors.
118. Domain Adversarial Training
Add a label predictor ("which digits?") on top of the feature extractor. The feature extractor must not only cheat the domain classifier ("which domain?") but also satisfy the label predictor at the same time. Successfully applied to image classification [Ganin et al., ICML, 2015] [Ajakan et al., JMLR, 2016]. More speech- and text-related applications in the following.
119. Domain-adversarial training
Yaroslav Ganin, Victor Lempitsky, "Unsupervised Domain Adaptation by Backpropagation", ICML, 2015
Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, "Domain-Adversarial Training of Neural Networks", JMLR, 2016
120. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
121. Unsupervised Conditional Generation
Transform an object from one domain to another without paired data (e.g., style transfer): G maps Domain X to Domain Y. Examples: photos ↔ paintings; male ↔ female faces; positive sentences ("It is good. It's a good day. I love you.") ↔ negative sentences ("It is bad. It's a bad day. I don't love you.").
122. Unsupervised Conditional Generation
• Approach 1: Direct transformation $G_{X\to Y}$ from Domain X to Domain Y — for texture or color change.
• Approach 2: Projection to common space: an encoder of domain X ($EN_X$) projects to a latent code (e.g., face attributes), and a decoder of domain Y ($DE_Y$) generates from it — for larger change, keeping only the semantics.
124. Direct Transformation
Train $G_{X\to Y}$ with a discriminator $D_Y$ that outputs a scalar: does the input image belong to domain Y or not? The generator's output becomes similar to domain Y — but the generator could ignore the input. Not what we want!
125. Direct Transformation
Same setup; the issue can be avoided by network design: a simpler generator makes the input and output more closely related. [Tomer Galanti, et al., ICLR, 2018]
126. Direct Transformation
Baseline of DTN [Yaniv Taigman, et al., ICLR, 2017]: feed the input and the output of $G_{X\to Y}$ through a pre-trained encoder network, and require the two embeddings to be as close as possible.
127. Direct Transformation
Cycle consistency [Jun-Yan Zhu, et al., ICCV, 2017]: add $G_{Y\to X}$ and require $G_{Y\to X}(G_{X\to Y}(x))$ to be as close as possible to x, in addition to the $D_Y$ loss. Otherwise the transformed image may lack the information needed for reconstruction.
128. Direct Transformation (both directions)
Train both cycles: X → $G_{X\to Y}$ → Y → $G_{Y\to X}$ → X as close as possible, and Y → $G_{Y\to X}$ → X → $G_{X\to Y}$ → Y as close as possible, with $D_Y$ (scalar: belongs to domain Y or not) and $D_X$ (scalar: belongs to domain X or not).
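The role of the cycle-consistency term can be seen in a toy 1-D sketch (my own illustration, not the lecture's code): the loss is zero only when $G_{Y\to X}$ undoes $G_{X\to Y}$, and large when a generator ignores its input.

```python
# Cycle-consistency loss || G_YX(G_XY(x)) - x || over toy 1-D "domains".
def cycle_loss(g_xy, g_yx, xs):
    return sum(abs(g_yx(g_xy(x)) - x) for x in xs) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0]
good_xy = lambda x: 2*x + 1       # invertible transformation X -> Y
good_yx = lambda y: (y - 1) / 2   # its inverse Y -> X
bad_yx  = lambda y: 0.0           # ignores its input

print(cycle_loss(good_xy, good_yx, xs))  # 0.0: consistent cycle
print(cycle_loss(good_xy, bad_yx, xs))   # large: cycle consistency violated
```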
137. Unsupervised Conditional Generation
• Approach 1: Direct transformation $G_{X\to Y}$ — for texture or color change.
• Approach 2: Projection to common space: encoder of domain X ($EN_X$), decoder of domain Y ($DE_Y$), latent code such as face attributes — larger change, only keep the semantics.
138. Projection to Common Space — Target
$EN_X$ and $EN_Y$ encode images of domain X and domain Y into a common latent space (e.g., face attributes); $DE_X$ and $DE_Y$ decode the latent code back into images of each domain.
139. Projection to Common Space — Training
Train ($EN_X$, $DE_X$) and ($EN_Y$, $DE_Y$) as two auto-encoders, minimizing the reconstruction error.
140. Projection to Common Space — Training
Also add discriminators $D_X$ and $D_Y$ on the decoded images of each domain, while minimizing the reconstruction errors. But because we train the two auto-encoders separately, images with the same attributes may not project to the same position in the latent space.
141. Projection to Common Space — Training
Sharing the parameters of the encoders and of the decoders: Coupled GAN [Ming-Yu Liu, et al., NIPS, 2016], UNIT [Ming-Yu Liu, et al., NIPS, 2017]
142. Projection to Common Space — Training
Add a domain discriminator that guesses whether a latent code came from $EN_X$ or $EN_Y$; $EN_X$ and $EN_Y$ learn to fool it. The domain discriminator forces the outputs of $EN_X$ and $EN_Y$ to have the same distribution. [Guillaume Lample, et al., NIPS, 2017]
143. Projection to Common Space — Training
Cycle consistency: X → $EN_X$ → $DE_Y$ → Y → $EN_Y$ → $DE_X$ → X, minimizing the reconstruction error through the whole cycle. Used in ComboGAN [Asha Anoosheh, et al., arXiv, 2017]
144. Projection to Common Space — Training
Semantic consistency: require an image and its transformed version to map to the same latent code. Used in DTN [Yaniv Taigman, et al., ICLR, 2017] and XGAN [Amélie Royer, et al., arXiv, 2017]
145. Outline of Part 1
Generation by GAN
• Image Generation as Example
• Theory behind GAN
• Issues and Possible Solutions
• Evaluation
Conditional Generation
Feature Extraction
Unsupervised Conditional Generation
Relation to Reinforcement Learning
147. Neural network as Actor
• Input of the neural network: the observation of the machine, represented as a vector or a matrix (e.g., pixels)
• Output of the neural network: each action corresponds to a neuron in the output layer, giving a score for each action (e.g., left 0.7, right 0.2, fire 0.1). Take the action based on the probability.
148. Example: Playing Video Game
Start with observation $s_1$ → take action $a_1$ ("right"), obtain reward $r_1 = 0$ → observation $s_2$ → take action $a_2$ ("fire", kill an alien), obtain reward $r_2 = 5$ → observation $s_3$ → …
149. Example: Playing Video Game
After many turns: action $a_T$, obtain reward $r_T$, game over (spaceship destroyed). This is an episode. We want to maximize the total reward
$$R = \sum_{t=1}^T r_t$$
151. Reinforcement Learning v.s. GAN
The actor takes $s_1$ and outputs $a_1$; the environment returns $s_2$ (with reward $r_1$); the actor outputs $a_2$ (with reward $r_2$); and so on, with total reward
$$R(\tau) = \sum_{t=1}^T r_t$$
The reward function is a "black box": you cannot use backpropagation through it. Analogy: actor → generator (updated); reward function → discriminator (but fixed, not updated).
152. Imitation Learning
Same actor-environment loop, but the reward function is not available. Instead, we have demonstrations of the expert: $\hat{\tau}^1, \hat{\tau}^2, \dots, \hat{\tau}^N$, where each $\hat{\tau}$ is a trajectory of the expert. Self-driving: record human drivers. Robot: grab the arm of the robot.
154. Framework of IRL
• The expert $\hat{\pi}$ provides trajectories $\hat{\tau}^1, \dots, \hat{\tau}^N$; the actor $\pi$ produces trajectories $\tau^1, \dots, \tau^N$.
• Obtain a reward function R such that the expert is always the best:
$$\sum_{n=1}^N R(\hat{\tau}^n) > \sum_{n=1}^N R(\tau^n)$$
• Find an actor based on the reward function R (by reinforcement learning), and iterate.
Reward function → discriminator; actor → generator.
155. GAN v.s. IRL
GAN: G generates; D gives a high score to real data and a low score to generated data; find a G whose output obtains a large score from D.
IRL: the actor generates $\tau^1, \dots, \tau^N$; the reward function gives a larger reward to the expert's $\hat{\tau}^n$ and a lower reward to $\tau$; find an actor that obtains a large reward.