Description
Generative adversarial networks (GANs) have recently emerged as a promising approach to generative modeling. A GAN consists of a generative network and a discriminative network; through the competition between the two, it learns to model the data distribution. In addition to modeling image/video distributions in computer vision problems, the framework can define visual concepts from examples, largely eliminating the need to hand-craft objective functions for various computer vision problems. In this tutorial, we present an overview of generative adversarial network research. We cover several recent theoretical studies and training techniques, as well as several vision applications of generative adversarial networks.
3. Before we start
• Due to the exponential growth of the field, we will miss several important GAN works.
• The GAN Zoo: https://deephunt.in/the-gan-zoo-79597dc8c347
Image from Deep Hunt, blog by Avinash Hindupur
4. Outline
1. Introduction
2. GAN objective
3. GAN training
4. Joint image distribution and video distribution
5. Computer vision applications
8. Step 1: observe a set of samples
Example: Gaussian mixture models
$p(x \mid \theta) = \sum_i \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)$
Step 2: assume a GMM model
Step 3: perform maximum likelihood learning
$\max_\theta \sum_{x^{(j)} \in \text{Dataset}} \log p(x^{(j)} \mid \theta)$
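The three steps can be sketched numerically. Below is a minimal 1-D illustration (the function names and toy parameters are mine, not from the tutorial):

```python
import numpy as np

def gmm_density(x, weights, means, stds):
    """Step 2's model: p(x|theta) = sum_i pi_i N(x | mu_i, sigma_i^2), 1-D case."""
    x = np.asarray(x, dtype=float)
    dens = np.zeros_like(x)
    for w, m, s in zip(weights, means, stds):
        dens += w * np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))
    return dens

def log_likelihood(data, weights, means, stds):
    """Step 3's objective: sum_j log p(x^(j) | theta)."""
    return float(np.sum(np.log(gmm_density(data, weights, means, stds))))

# Step 1: observe samples (here drawn from two well-separated modes).
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 0.5, 100)])
theta = ([0.5, 0.5], [-2.0, 2.0], [0.5, 0.5])  # (pi, mu, sigma)

# Parameters that match the data score higher than a mismatched (shifted) model.
print(log_likelihood(data, *theta) > log_likelihood(data + 5.0, *theta))  # True
```

A full maximum-likelihood fit would optimize `theta` (e.g., with EM); the sketch only evaluates the objective.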
9. Example: Gaussian mixture models
$p(x \mid \theta) = \sum_i \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)$
1. [Density Estimation]
2. [Sample Generation]
Good for low-dimensional data
10. Manifold assumption
• Data lie approximately on a manifold of much lower
dimension than the input space.
Latent space Data space
Images credit Ward, A. D. et al 2007
12. Generative modeling via autoencoders and GMM
• Two-stage process
1. Apply an autoencoder to embed images in a low-dimensional space.
2. Fit a GMM to the embeddings.
• Drawbacks
• Not end-to-end
• The embedding space can still be high-dimensional.
• Tends to memorize samples.
Encoder network E: input image → latent code $E(x^{(j)})$; Decoder network D: latent code → reconstructed image
$\min_{E,D} \sum_{x^{(j)}} \big\| x^{(j)} - D(E(x^{(j)})) \big\|^2$
Vincent et al. JMLR 2011
13. Variational autoencoders
• Derived from a variational framework
• Constrain the encoder to output a conditional Gaussian distribution
• The decoder reconstructs inputs from samples drawn from that conditional distribution
$E(x^{(j)}) = q_\theta(z \mid x^{(j)}) = \mathcal{N}(z \mid \mu(x^{(j)}), \Sigma(x^{(j)}))$
$D(z^{(j)}) = p_\theta(x^{(j)} \mid z^{(j)} \sim q_\theta(z \mid x^{(j)}))$
• Kingma and Welling, “Auto-encoding variational Bayes,” ICLR 2014
• Rezende, Mohamed, and Wierstra, “Stochastic back-propagation and variational inference in deep latent Gaussian models,” ICML 2014
14. Maximizing a variational lower bound
Ideally, we would like to maximize
$\max_\theta \; L(\theta \mid \text{Dataset}) \equiv \sum_{x^{(j)}} \log p(x^{(j)} \mid \theta)$
But for tractability we maximize a lower bound $L_V(\theta \mid \text{Dataset}) \le L(\theta \mid \text{Dataset})$:
$\max_\theta \; L_V(\theta \mid \text{Dataset}) \equiv \sum_{x^{(j)}} -\mathrm{KL}\big(q_\theta(z \mid x^{(j)}) \,\big\|\, \mathcal{N}(z \mid 0, I)\big) + \mathbb{E}_{z \sim q_\theta(z \mid x^{(j)})}\big[\log p_\theta(x^{(j)} \mid z)\big]$
15. Reparameterization trick
• Make sampling a differentiable operation.
• The variational distribution: $q_\theta(z \mid x) = \mathcal{N}(z \mid \mu(x), \Sigma(x))$
• Sampling: $z = \mu(x) + \Sigma(x)^{1/2} \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$
Image credit, Courville 2017
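The trick is tiny in code. A minimal sketch, assuming a diagonal Gaussian encoder that outputs a mean and a log-variance (a common convention, not something shown on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(mu, log_var, eps=None):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, I).
    The randomness lives entirely in eps, so z is a differentiable
    function of the encoder outputs (mu, log_var)."""
    sigma = np.exp(0.5 * log_var)  # diagonal standard deviations
    if eps is None:
        eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps

# With eps held fixed, z responds deterministically to mu and log_var,
# which is what lets gradients flow through the sampling step.
mu, log_var = np.array([1.0, -1.0]), np.array([0.0, 0.0])
z = sample_latent(mu, log_var, eps=np.array([0.5, 0.5]))
print(z.tolist())  # [1.5, -0.5] (sigma == 1 here, so z = mu + eps)
```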
17. Why are generated samples blurry?
• Pixel values are modeled as a conditional Gaussian, which leads to Euclidean loss minimization:
$\log p_\theta(x^{(j)} \mid z) = -\big\| x^{(j)} - D(z) \big\|^2 + \text{Const}$
• Regression to the mean, which renders blurry images.
• It is difficult to hand-craft a good perceptual loss function.
18. Pitfall of Euclidean distance for image modeling
• Blue curve: the Euclidean distance between a reference image and its horizontal translation.
• Red curve: the Euclidean distance between and
20. PixelCNNs
• Oord, Aaron van den, et al. 2016
• Use masked convolutions to enforce the autoregressive relationship.
• Minimize a cross-entropy loss instead of a Euclidean loss.
Slide credit, Courville 2017
23. Parallel multiscale autoregressive density estimation
• Reed et al. ICML 2017
• Can we speed up the generation time of PixelCNNs?
• Yes, via multiscale generation.
Slide credit, Courville 2017
25. Generative adversarial networks
• Goodfellow et al. NIPS 2014
• Forget about designing a perceptual loss. Let's train a discriminator to differentiate real and fake images.
Discriminator network D: real input image → 1
Generator network G: random vector → generated image; Discriminator network D: generated image → 0
30. Taxonomy of generative models
Maximum likelihood
• Explicit density
  • Tractable density: PixelCNNs
  • Approximate density: VAEs
• Implicit density: GANs
Slide credit Goodfellow 2016
31. Quick recap
• Generative modeling
• Manifold assumption
• VAEs, PixelCNNs, GANs
• Taxonomy: maximum likelihood → explicit density (tractable: PixelCNN; approximate: VAEs) and implicit density (GAN)
32. Outline
1. Introduction
2. GAN objective
3. GAN training
4. Joint image distribution and video distribution
5. Computer vision applications
34. 2. GAN objective
1. GAN objective
2. Discriminator strategy
3. GAN theory
4. Effect of limited discriminator capacity
35. GAN objectives
Solving a minimax problem:
$\min_G \max_D \; \mathbb{E}_{x \sim p_X}[\log D(x)] + \mathbb{E}_{z \sim p_Z}[\log(1 - D(G(z)))]$
$p_X$: data distribution, usually represented by samples.
$p_{G(Z)}$: model distribution, where $Z$ is usually modeled as uniform or Gaussian.
36. Alternating gradient updates
• Step 1: Fix G and perform a gradient step on
$\max_D \; \mathbb{E}_{x \sim p_X}[\log D(x)] + \mathbb{E}_{z \sim p_Z}[\log(1 - D(G(z)))]$
• Step 2: Fix D and perform a gradient step on
(in theory) $\min_G \; \mathbb{E}_{z \sim p_Z}[\log(1 - D(G(z)))]$
(in practice) $\max_G \; \mathbb{E}_{z \sim p_Z}[\log D(G(z))]$
$\log D(G(z))$ is bounded above; $\log(1 - D(G(z)))$ is unbounded below, which is bad for minimization.
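A quick numerical check of why the "in practice" surrogate is preferred: compare the gradient magnitudes of the two generator losses with respect to the discriminator output when the discriminator is winning (a toy calculation of mine, not from the tutorial):

```python
def saturating_grad(d):
    """d/dD of log(1 - D), the loss minimized "in theory": -1 / (1 - D)."""
    return -1.0 / (1.0 - d)

def nonsaturating_grad(d):
    """d/dD of log D, the surrogate maximized "in practice": 1 / D."""
    return 1.0 / d

# Early in training the discriminator wins easily, so D(G(z)) is near 0.
d = 1e-3
print(abs(saturating_grad(d)))     # ~1: almost no signal reaches G
print(abs(nonsaturating_grad(d)))  # ~1000: G still receives a strong gradient
```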
38. Learning a Gaussian mixture model using GANs
• Red points: true samples, drawn from the data distribution.
• Blue points: fake samples, drawn from the model distribution.
• Z-space: a Gaussian distribution $\mathcal{N}(0, I)$.
• The model first learns 1 mode, then 2 modes, and finally 3 modes.
2D GMM learning
39. GAN theory
• Optimal (non-parametric) discriminator:
$D^*(x) = \dfrac{p_X(x)}{p_X(x) + p_{G(Z)}(x)}$
• Under an ideal discriminator, the generator minimizes the Jensen-Shannon divergence between $p_X$ and $p_{G(Z)}$. This also requires that D and G have sufficient capacity and a sufficiently large dataset.
$\mathrm{JS}(p_X \,\|\, p_{G(Z)}) = \tfrac{1}{2}\mathrm{KL}\Big(p_X \,\Big\|\, \tfrac{p_X + p_{G(Z)}}{2}\Big) + \tfrac{1}{2}\mathrm{KL}\Big(p_{G(Z)} \,\Big\|\, \tfrac{p_X + p_{G(Z)}}{2}\Big)$
where $\mathrm{KL}(p \,\|\, q) = \mathbb{E}_{p}\big[\log \tfrac{p(x)}{q(x)}\big]$
40. In practice
• Dataset sizes are limited.
• Network capacity is finite. (Capacity here means the number of trainable parameters.)
• For a discriminator with capacity $n$, the generator only needs to remember $O\!\big(\tfrac{n \log n}{\epsilon^2}\big)$ samples to be $\epsilon$ away from the optimal objective. (Sanjeev Arora et al. ICML 2017)
Consequences:
1. The generator does not need to generalize.
2. Larger training datasets have limited utility.
41. Effect of limited discriminator capacity
CIFAR-10 dataset
Image credit Arora and Zhang 2017
42. Quick recap
• Discriminator strategy
• GAN theory
• Effect of limited discriminator capacity
$D^*(x) = \dfrac{p_X(x)}{p_X(x) + p_{G(Z)}(x)}$
43. Outline
1. Introduction
2. GAN objective
3. GAN training
4. Joint image distribution and video distribution
5. Computer vision applications
45. 3. GAN Training
1. Non-convergence in GANs
2. Mode collapse
3. Lack of global structure
4. Tricks
5. New objective functions
6. Surrogate objective functions
7. Network architectures
46. Non-convergence in GANs
• GAN training is theoretically guaranteed to converge if we could modify the density functions directly, but:
• Instead, we modify G (the sample generation function).
• We represent G and D as highly non-convex parametric functions.
• "Oscillation": the model can train for a very long time, generating many different categories of samples, without clearly generating better samples.
• Mode collapse: the most severe form of non-convergence.
Slide credit Goodfellow 2016
52. Improve GAN training
Tricks
• Label smoothing
• Historical batches
• …
Surrogate or auxiliary objective
• UnrolledGAN
• WGAN-GP
• DRAGAN
• …
New objectives
• EBGAN
• LSGAN
• WGAN
• BEGAN
• fGAN
• …
Network architecture
• LAPGAN
• …
54. Tricks - one-sided label smoothing
$\mathbb{E}_{x \sim p_X}[0.9 \log D(x)] + \mathbb{E}_{z \sim p_Z}[\log(1 - D(G(z)))]$
• Does not reduce classification accuracy, only confidence.
• Prevents the discriminator from giving a very large gradient signal to the generator.
• Prevents extrapolation that would encourage extreme samples.
• Two-sided label smoothing: https://github.com/soumith/ganhacks
T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, "Improved techniques for training GANs," NIPS 2016
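One common implementation replaces the real label 1 with a soft target such as 0.9 in the cross-entropy. Under that form the pointwise optimal discriminator has a closed form, which makes the "confidence, not accuracy" claim concrete (an illustrative sketch of mine; the helper name is not from the paper):

```python
def optimal_d(p_data, p_model, real_target=1.0):
    """Pointwise maximizer of p_data*(t*log d + (1-t)*log(1-d)) + p_model*log(1-d),
    i.e. the best discriminator value when the real label 1 is replaced by t."""
    return real_target * p_data / (p_data + p_model)

# Where the generator puts no mass, a smoothed discriminator tops out at 0.9
# instead of saturating at 1 (its confidence is capped) ...
print(optimal_d(1.0, 0.0, real_target=0.9))  # 0.9
print(optimal_d(1.0, 0.0, real_target=1.0))  # 1.0
# ... yet it still classifies correctly wherever p_data > p_model.
print(optimal_d(0.7, 0.3, real_target=0.9) > 0.5)  # True
```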
55. Tricks - historical generator batches
• Helps stabilize discriminator training in the early stage.
• The discriminator at iteration t sees real input images (label 1) together with generated images from iterations t, t-1, ..., t-K (label 0).
A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, R. Webb, "Learning from simulated and unsupervised images through adversarial training," CVPR 2017
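A minimal sketch of a history buffer for this trick (the class and method names are mine, not from the paper):

```python
import random
from collections import deque

class HistoryBuffer:
    """Keep generated batches from the last K iterations and mix them into
    the discriminator's fake-image minibatch (illustrative sketch)."""

    def __init__(self, k=3):
        self.batches = deque(maxlen=k)  # oldest batches fall off automatically

    def push(self, batch):
        self.batches.append(batch)

    def fake_minibatch(self, current_batch, n_old=2):
        """Current fakes plus up to n_old randomly chosen historical batches."""
        old = []
        for b in random.sample(list(self.batches), min(n_old, len(self.batches))):
            old.extend(b)
        return current_batch + old  # all of these carry label 0 for D

buf = HistoryBuffer(k=2)
buf.push(["fake_t-2"])
buf.push(["fake_t-1"])
mix = buf.fake_minibatch(["fake_t"], n_old=2)
print(sorted(mix))  # current batch plus the two historical batches
```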
56. Other tricks
• https://github.com/soumith/ganhacks
• Use labels.
• Normalize the inputs to [-1, 1].
• Use tanh as the generator's last layer.
• Use BatchNorm.
• Avoid sparse gradients: ReLU, MaxPool.
• …
• Salimans et al. NIPS 2016
• Use virtual BatchNorm.
• Minibatch discrimination.
• …
57. Improve GAN training
Tricks
• Label smoothing
• Historical batches
• …
Surrogate or auxiliary objective
• UnrolledGAN
• WGAN-GP
• DRAGAN
• …
New objectives
• EBGAN
• LSGAN
• WGAN
• BEGAN
• fGAN
• …
Network architecture
• LAPGAN
• …
58. EBGAN
Junbo Zhao, Michael Mathieu, Yann LeCun, "Energy-based generative adversarial networks," ICLR 2017
• Replace the discriminator's classifier with an energy function: real input images receive a low energy value; generated images receive a high energy value.
• The energy function can be modeled by an auto-encoder.
59. EBGAN
$En(x) = \| \text{Decoder}(\text{Encoder}(x)) - x \|$
Discriminator objective:
$\min_{En} \; \mathbb{E}_{x \sim p_X}[En(x)] + \mathbb{E}_{z \sim p_Z}\big[[m - En(G(z))]^+\big]$
Generator objective:
$\min_G \; \mathbb{E}_{z \sim p_Z}[En(G(z))]$
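The EBGAN objectives can be sketched on scalar toy energies (the helper names are mine; a real implementation computes the energy from an actual autoencoder):

```python
import numpy as np

def energy(x, encode, decode):
    """EBGAN energy: autoencoder reconstruction error En(x) = ||Dec(Enc(x)) - x||."""
    return float(np.linalg.norm(decode(encode(x)) - x))

def d_loss(en_real, en_fake, m=1.0):
    """Discriminator: push energy down on reals, up on fakes, hinged at margin m."""
    return en_real + max(m - en_fake, 0.0)

def g_loss(en_fake):
    """Generator: make its samples low-energy."""
    return en_fake

# A perfect toy autoencoder assigns zero energy.
print(energy(np.ones(4), encode=lambda x: 0.5 * x, decode=lambda z: 2.0 * z))  # 0.0
# Once a fake's energy exceeds the margin, the hinge term vanishes.
print(d_loss(en_real=0.1, en_fake=0.2))  # 0.9
print(d_loss(en_real=0.1, en_fake=1.5))  # 0.1
```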
61. LSGAN
X. Mao, Q. Li, H. Xie, R. Lau, Z. Wang, "Least squares generative adversarial networks," 2016
Still use a classifier, but replace the cross-entropy loss with a Euclidean loss.
Discriminator:
GAN: $\min_D \; \mathbb{E}_{x \sim p_X}[-\log D(x)] + \mathbb{E}_{z \sim p_Z}[-\log(1 - D(G(z)))]$
LSGAN: $\min_D \; \mathbb{E}_{x \sim p_X}[(D(x) - 1)^2] + \mathbb{E}_{z \sim p_Z}[D(G(z))^2]$
Generator:
GAN: $\min_G \; \mathbb{E}_{z \sim p_Z}[-\log D(G(z))]$
LSGAN: $\min_G \; \mathbb{E}_{z \sim p_Z}[(D(G(z)) - 1)^2]$
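The least-squares losses are easy to state on toy critic scores (a sketch of mine, not the authors' code):

```python
import numpy as np

def lsgan_d_loss(d_real, d_fake):
    """Discriminator: pull real scores toward 1 and fake scores toward 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def lsgan_g_loss(d_fake):
    """Generator: pull fake scores toward 1."""
    return np.mean((d_fake - 1.0) ** 2)

d_real = np.array([0.8, 1.2])   # raw scores, not probabilities
d_fake = np.array([0.1, -0.1])
print(round(float(lsgan_d_loss(d_real, d_fake)), 6))  # 0.05
print(round(float(lsgan_g_loss(d_fake)), 6))          # 1.01
```

Note that unlike the sigmoid cross-entropy, the quadratic penalizes samples that are classified correctly but lie far from the decision boundary, which keeps gradients alive.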
62. LSGAN
GAN: minimize the Jensen-Shannon divergence between $p_X$ and $p_{G(Z)}$:
$\mathrm{JS}(p_X \,\|\, p_{G(Z)}) = \tfrac{1}{2}\mathrm{KL}\Big(p_X \,\Big\|\, \tfrac{p_X + p_{G(Z)}}{2}\Big) + \tfrac{1}{2}\mathrm{KL}\Big(p_{G(Z)} \,\Big\|\, \tfrac{p_X + p_{G(Z)}}{2}\Big)$
LSGAN: minimize the Pearson $\chi^2$ divergence between $p_X + p_{G(Z)}$ and $2 p_{G(Z)}$:
$\chi^2(p_X + p_{G(Z)} \,\|\, 2 p_{G(Z)}) = \int_X \frac{\big(2 p_{G(Z)}(x) - (p_{G(Z)}(x) + p_X(x))\big)^2}{p_{G(Z)}(x) + p_X(x)} \, dx$
63. LSGAN
64. Wasserstein GAN
M. Arjovsky, S. Chintala, L. Bottou, "Wasserstein GAN," 2016
Replace the classifier with a critic function. (The critic is constrained to be 1-Lipschitz, enforced by weight clipping in the original paper.)
Discriminator:
GAN: $\max_D \; \mathbb{E}_{x \sim p_X}[\log D(x)] + \mathbb{E}_{z \sim p_Z}[\log(1 - D(G(z)))]$
WGAN: $\max_D \; \mathbb{E}_{x \sim p_X}[D(x)] - \mathbb{E}_{z \sim p_Z}[D(G(z))]$
Generator:
GAN: $\max_G \; \mathbb{E}_{z \sim p_Z}[\log D(G(z))]$
WGAN: $\max_G \; \mathbb{E}_{z \sim p_Z}[D(G(z))]$
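A sketch of the WGAN critic loss, plus the weight clipping the original paper uses to keep the critic approximately 1-Lipschitz (function names are mine, not from the paper):

```python
import numpy as np

def critic_loss(d_real, d_fake):
    """Negated WGAN critic objective: minimize -(E[D(x)] - E[D(G(z))])."""
    return -(np.mean(d_real) - np.mean(d_fake))

def clip_weights(params, c=0.01):
    """Weight clipping: a crude way to bound the critic's Lipschitz constant."""
    return [np.clip(w, -c, c) for w in params]

# Critic outputs are unbounded scores, not probabilities.
print(float(critic_loss(np.array([2.0, 3.0]), np.array([-1.0, 0.0]))))  # -3.0
clipped = clip_weights([np.array([0.5, -0.2, 0.005])])[0]
print(clipped.tolist())  # [0.01, -0.01, 0.005]
```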
65. Wasserstein GAN
GAN: minimize the Jensen-Shannon divergence between $p_X$ and $p_{G(Z)}$.
WGAN: minimize the Earth Mover distance between $p_X$ and $p_{G(Z)}$:
$EM(p_X, p_{G(Z)}) = \inf_{\gamma \in \Pi(p_X, p_{G(Z)})} \mathbb{E}_{(x, y) \sim \gamma}\big[\| x - y \|\big]$
68. GAN vs WGAN
Jensen-Shannon divergence in this example:
$\mathrm{JS}(p_X \,\|\, p_{G(Z)}) = \tfrac{1}{2}\mathrm{KL}\Big(p_X \,\Big\|\, \tfrac{p_X + p_{G(Z)}}{2}\Big) + \tfrac{1}{2}\mathrm{KL}\Big(p_{G(Z)} \,\Big\|\, \tfrac{p_X + p_{G(Z)}}{2}\Big)$
Slide credit, Courville 2017
69. GAN vs WGAN
Earth Mover distance in this example:
$EM(p_X, p_{G(Z)}) = \inf_{\gamma \in \Pi(p_X, p_{G(Z)})} \mathbb{E}_{(x, y) \sim \gamma}\big[\| x - y \|\big] = |\theta|$
Slide credit, Courville 2017
70. GAN vs WGAN
• If we could directly change the density shape parameter, the Earth Mover distance would be the smoother objective.
• But we do not directly change the density shape parameter; we change the generation function.
71. Boundary Equilibrium GAN (BEGAN)
David Berthelot, Thomas Schumm, Luke Metz, "Boundary equilibrium generative adversarial networks," 2017
• Uses the EBGAN autoencoder-based discriminator (input images receive a low energy value; generated images receive a high energy value).
• Not based on minimizing a distance between $p_X$ and $p_{G(Z)}$.
• Based on minimizing a distance between $p_{En(X)}$ and $p_{En(G(Z))}$.
• That distance is a lower bound on the Wasserstein-1 (Earth Mover) distance.
72. Boundary Equilibrium GAN (BEGAN)
WGAN: $EM(p_X, p_{G(Z)}) = \inf_{\gamma \in \Pi(p_X, p_{G(Z)})} \mathbb{E}_{(x, y) \sim \gamma}\big[\| x - y \|\big]$
BEGAN: $EM(p_{En(X)}, p_{En(G(Z))}) = \inf_{\gamma \in \Pi(p_{En(X)}, p_{En(G(Z))})} \mathbb{E}_{(x_1, x_2) \sim \gamma}\big[| x_1 - x_2 |\big]$
Lower bound:
$\inf_\gamma \mathbb{E}_{(x_1, x_2) \sim \gamma}\big[| x_1 - x_2 |\big] \;\ge\; \inf_\gamma \big| \mathbb{E}_{(x_1, x_2) \sim \gamma}[x_1 - x_2] \big| = | \text{mean}(x_1) - \text{mean}(x_2) |$
Final objective function:
$\max_G \min_{En} \; \mathbb{E}_{x \sim p_X}[En(x)] - \mathbb{E}_{z \sim p_Z}[En(G(z))]$
73. Boundary Equilibrium GAN (BEGAN)
Discriminator objective: $\min_{En} \; En(x) - k_t \, En(G(z))$
Generator objective: $\min_G \; En(G(z))$
Equilibrium update: $k_{t+1} = k_t + \lambda_k \big(\gamma \, En(x) - En(G(z))\big)$
Hyperparameter: $\gamma = \frac{En(G(z))}{En(x)}$, a value between 0 and 1.
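The equilibrium update can be sketched as a scalar control loop (a sketch with made-up energies; the $\lambda_k$ step size and the clamp of $k$ to $[0, 1]$ follow the paper's description):

```python
def began_k_update(k, en_real, en_fake, gamma=0.5, lambda_k=0.001):
    """k_{t+1} = k_t + lambda_k * (gamma * En(x) - En(G(z))), clamped to [0, 1].
    k controls how much weight the discriminator puts on the fake term."""
    k = k + lambda_k * (gamma * en_real - en_fake)
    return min(max(k, 0.0), 1.0)

# If fakes are too low-energy relative to the target ratio gamma, k grows and
# the discriminator pushes harder on them; the loop self-balances.
k = 0.0
for _ in range(100):
    k = began_k_update(k, en_real=1.0, en_fake=0.2)
print(round(k, 3))  # 0.03
```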
74. Boundary Equilibrium GAN (BEGAN)
75. Improve GAN training
Tricks
• Label smoothing
• Historical batches
• …
Surrogate or auxiliary objective
• UnrolledGAN
• WGAN-GP
• DRAGAN
• …
New objectives
• EBGAN
• LSGAN
• WGAN
• BEGAN
• fGAN
• …
Network architecture
• LAPGAN
• …
76. UnrolledGAN
• Consider several future updates of the discriminator when updating the generator.
• The discriminator itself is updated in the usual way.
• Helps prevent the generator from producing samples from only a few modes.
L. Metz, B. Poole, D. Pfau, J. Sohl-Dickstein, "Unrolled generative adversarial networks," ICLR 2017
77. UnrolledGAN
• $D_K$ is a function of both D and G.
• Considers several future updates of the discriminator when updating the generator.
Surrogate objective: $\max_G \; \mathbb{E}_{z \sim p_Z}[\log D_K(G(z))]$
78. WGAN-GP
I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. Courville, "Improved training of Wasserstein GANs," 2017
$\min_G \max_D \; \mathbb{E}_{x \sim p_X}[D(x)] - \mathbb{E}_{z \sim p_Z}[D(G(z))] - \lambda \, \mathbb{E}_{y \sim p_Y}\big[(\| \nabla_y D(y) \|_2 - 1)^2\big]$
• $y$: imaginary samples, $y = u x + (1 - u) G(z)$.
• The optimal critic has unit gradient norm almost everywhere.
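A toy sketch of the gradient penalty. For a linear critic $D(y) = w \cdot y$ the gradient with respect to $y$ is just $w$, so the penalty can be computed analytically; a real implementation obtains $\nabla_y D(y)$ by automatic differentiation (function names are mine, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_penalty(w, x_real, x_fake, lam=10.0):
    """Penalty lam * E_y[(||grad_y D(y)|| - 1)^2] at interpolates
    y = u*x_real + (1-u)*x_fake. For the toy linear critic D(y) = w . y the
    gradient w.r.t. y equals w everywhere, so no autodiff is needed."""
    u = rng.uniform(size=(x_real.shape[0], 1))
    y = u * x_real + (1.0 - u) * x_fake        # the "imaginary" samples
    grads = np.tile(w, (len(y), 1))            # grad_y (w . y) = w at each y
    norms = np.linalg.norm(grads, axis=1)
    return float(lam * np.mean((norms - 1.0) ** 2))

x_real = rng.normal(size=(4, 2))
x_fake = rng.normal(size=(4, 2))
print(gradient_penalty(np.array([1.0, 0.0]), x_real, x_fake))  # 0.0: unit-norm critic
print(gradient_penalty(np.array([3.0, 4.0]), x_real, x_fake))  # 160.0 = 10 * (5 - 1)^2
```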
79. DRAGAN
N. Kodali, J. Abernethy, J. Hays, Z. Kira, "How to train your DRAGAN," 2017
$\min_G \max_D \; \mathbb{E}_{x \sim p_X}[\log D(x)] + \mathbb{E}_{z \sim p_Z}[\log(1 - D(G(z)))] - \lambda \, \mathbb{E}_{y \sim p_Y}\big[(\| \nabla_y D(y) \|_2 - 1)^2\big]$
• $y$: imaginary samples around true samples, $y = \alpha x + (1 - \alpha)(x + \delta)$.
80. Improve GAN training
Tricks
• Label smoothing
• Historical batches
• …
Surrogate or auxiliary objective
• UnrolledGAN
• WGAN-GP
• DRAGAN
• …
New objectives
• EBGAN
• LSGAN
• WGAN
• BEGAN
• fGAN
• …
Network architecture
• LAPGAN
• …
81. LAPGAN
E. Denton, S. Chintala, A. Szlam, R. Fergus “Deep Generative Image Models
using a Laplacian Pyramid of Adversarial Networks” NIPS 2015
Testing
82. LAPGAN
Training at each resolution is performed independently of the others.
83. Evaluation
• Popular metrics for GAN evaluation:
• Inception score
• Train an accurate classifier.
• Train a conditional image generation model.
• Check how accurately the classifier recognizes the generated images.
• Human evaluation
• Generative models are difficult to evaluate. (Theis et al. ICLR 2016)
• Every evaluation metric can be gamed.
• Still an open problem.
84. Outline
1. Introduction
2. GAN objective
3. GAN training
4. Joint image distribution and video distribution
5. Computer vision applications
86. 4. Joint image distribution and video
distribution
1. Joint image distribution learning
2. Video distribution learning
87. Joint distribution of multi-domain images
A scene projected through the camera onto the image plane yields a color image, a depth image (near/far), and a thermal image (cool/hot).
• $p(x_1, x_2, \ldots, x_N)$: where the $x_i$ are images of the scene in different modalities.
• Ex. $p(x_{\text{color}}, x_{\text{thermal}}, x_{\text{depth}})$
• In this presentation, "modality" = "domain".
88. Joint distribution of multi-domain images
• Define a domain by an attribute.
• Multi-domain images are views of an object with different attributes.
• $p(x_1, x_2, \ldots, x_N)$: where the $x_i$ are images of the object with different attributes.
Non-smiling / Smiling; No beard / Beard; Young / Senior; Summer / Winter; Images / Hand drawings; Font #1 / Font #2
89. Extending GAN for joint image distribution learning
Discriminator network D: real input image → 1
Generator network G: random vector → generated image; Discriminator network D: generated image → 0
90. What if we do not have paired images
from different domains for learning?
103. MoCoGAN
S. Tulyakov, M. Liu, X. Yang, J. Kautz, "MoCoGAN: Decomposing Motion and Content for Video Generation," 2017
104. MoCoGAN
Sampled content code $z_C$, repeated for every frame: $Z_C = [z_C, z_C, \ldots, z_C]$
Motion trajectory: $Z_M = [z_M^{(1)}, z_M^{(2)}, \ldots, z_M^{(K)}]$
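The content/motion decomposition can be sketched as follows: one content code shared across frames, one motion code per frame. In the paper the motion codes come from a recurrent network; plain Gaussian noise stands in for it here, and all names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)

def mocogan_latents(dim_c, dim_m, n_frames):
    """One shared content code z_C plus a per-frame motion code z_M^(t).
    (Gaussian noise stands in for the paper's recurrent motion generator.)"""
    z_c = rng.standard_normal(dim_c)
    frames = []
    for _ in range(n_frames):
        z_m = rng.standard_normal(dim_m)           # stand-in for the RNN output
        frames.append(np.concatenate([z_c, z_m]))  # frame code [z_C, z_M^(t)]
    return np.stack(frames)

Z = mocogan_latents(dim_c=4, dim_m=2, n_frames=5)
print(Z.shape)                                # (5, 6)
print(bool(np.allclose(Z[:, :4], Z[0, :4])))  # True: content is fixed over time
```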
106. MoCoGAN
Training:
$\min_{D_I} \; \mathbb{E}_{v \sim p_V}[-\log D_I(S_1(v))] + \mathbb{E}_{\tilde v \sim p_{\tilde V}}[-\log(1 - D_I(S_1(\tilde v)))]$
$\min_{D_V} \; \mathbb{E}_{v \sim p_V}[-\log D_V(S_T(v))] + \mathbb{E}_{\tilde v \sim p_{\tilde V}}[-\log(1 - D_V(S_T(\tilde v)))]$
$\max_{G_I, R_M} \; \mathbb{E}_{\tilde v \sim p_{\tilde V}}[-\log(1 - D_I(S_1(\tilde v)))] + \mathbb{E}_{\tilde v \sim p_{\tilde V}}[-\log(1 - D_V(S_T(\tilde v)))]$
107. MoCoGAN
108. VGAN vs. MoCoGAN

| | VGAN | MoCoGAN |
|---|---|---|
| Video generation | Directly generates a 3D volume | Video frames are generated sequentially |
| Generator | Motion-dynamic video and static background | Image |
| Discriminator | One discriminator (3D CNN for clips) | Two discriminators (3D CNN for clips, 2D CNN for frames) |
| Variable length | No | Yes |
| Motion and content separation | No | Yes |
| User preference on generated face video quality | 5% | 95% |
109. Outline
1. Introduction
2. GAN objective
3. GAN Training
4. Joint image distribution and video distribution
5. Computer vision applications
111. Why are GANs useful for computer vision?
Hand-crafted features → deep networks
Hand-crafted objective functions → generative adversarial networks
112. Image-to-image translation
• Let $X_1$ and $X_2$ be two different image domains (e.g., the day-time and night-time image domains).
• Let $x_1 \in X_1$.
• Image-to-image translation concerns the problem of translating $x_1$ to a corresponding image $x_2 \in X_2$.
• Correspondence can mean different things in different contexts.
Input image → image translator → output image
113. Examples and use cases
• Low-res to high-res
• Blurry to sharp
• Noisy to clean
• LDR to HDR
• Thermal to color
• Image to painting
• Day to night
• Summer to winter
• Synthetic to real
• Bad weather to good weather
• Greyscale to color
• …
114. Prior works
• Image-to-image translation has been studied for decades in computer vision.
• Different approaches have been explored, including:
• Filtering-based approaches
• Optimization-based approaches
• Dictionary-learning-based approaches
• Deep-learning-based approaches
• GAN-based approaches
115. Generative Adversarial Networks (GANs)
Goodfellow et al. NIPS'14
• Training is done by applying alternating stochastic gradient updates to the following two learning problems:
$\max_D \; \mathbb{E}_{x \sim p_X}[\log D(x)] + \mathbb{E}_{z \sim p_\mathcal{N}}[\log(1 - D(G(z)))]$
$\max_G \; \mathbb{E}_{z \sim p_\mathcal{N}}[\log D(G(z))]$
• Effect: minimizing the JSD between $p_{G(z), z \sim \mathcal{N}}$ and $p_X$.
$z \sim \mathcal{N} \to G \to x = G(z) \to D$; real $x \in X \to D$
116. Conditional GANs
• Training is done by applying alternating stochastic gradient updates to the following two learning problems:
$\max_D \; \mathbb{E}_{x_2 \sim p_{X_2}}[\log D(x_2)] + \mathbb{E}_{x_1 \sim p_{X_1}}[\log(1 - D(F(x_1)))]$
$\max_F \; \mathbb{E}_{x_1 \sim p_{X_1}}[\log D(F(x_1))]$
• Effect: minimizing the JSD between $p_{F(x_1), x_1 \sim X_1}$ and $p_{X_2}$.
$x_1 \sim X_1 \to F \to x = F(x_1) \to D$; $x_2 \sim X_2 \to D$
117. Conditional GAN for image-to-image translation
• A conditional GAN alone is insufficient for image translation.
• There is no guarantee that the conditionally generated image is related to the source image in a desired way.
• In the supervised setting,
$\text{Dataset} = \{(x_1^{(1)}, x_2^{(1)}), (x_1^{(2)}, x_2^{(2)}), \ldots, (x_1^{(N)}, x_2^{(N)})\}$
• This can be easily fixed.
119. SRGAN
C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, W. Shi, "Photo-realistic single image super-resolution using a generative adversarial network," CVPR 2017
121. Pix2Pix
P. Isola, J. Zhu, T. Zhou, A. Efros, "Image-to-image translation with conditional adversarial networks," CVPR 2017
122. Unsupervised image-to-image translation
• Corresponding images can be expensive to obtain.
• In the unsupervised setting,
$\text{Dataset}_1 = \{x_1^{(n_1)}, x_1^{(n_2)}, \ldots, x_1^{(n_N)}\}$
$\text{Dataset}_2 = \{x_2^{(m_1)}, x_2^{(m_2)}, \ldots, x_2^{(m_M)}\}$
• With no correspondences, we need additional constraints/assumptions to learn image-to-image translation.
123. SimGAN
• Shrivastava et al. CVPR'17: add a cross-domain content loss.
$\max_F \; \mathbb{E}_{p_{X_1}}\big[\log D(F(x_1)) - \| F(x_1) - x_1 \|_1\big]$
$x_1 \sim X_1 \to F \to x = F(x_1) \to D$
124. Cycle constraint
• Learn a two-way translation.
• DiscoGAN by Kim et al. ICML'17 (arXiv:1703.05192)
• CycleGAN by Zhu et al. (arXiv:1703.10593)
$\max_{F_{12}, F_{21}} \; \mathbb{E}_{p_{X_1}}\big[\log D_2(F_{12}(x_1)) - \| F_{21}(F_{12}(x_1)) - x_1 \|_p^p\big] + \mathbb{E}_{p_{X_2}}\big[\log D_1(F_{21}(x_2)) - \| F_{12}(F_{21}(x_2)) - x_2 \|_p^p\big]$
$x_1 \sim X_1 \xrightarrow{F_{12}} \hat x_2 = F_{12}(x_1) \xrightarrow{F_{21}} F_{21}(F_{12}(x_1))$
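The cycle-consistency term can be sketched on a toy 1-D "domain pair" (the function names and toy maps are mine, not from either paper):

```python
import numpy as np

def cycle_loss(x1, f12, f21, p=1):
    """Cycle-consistency penalty ||F21(F12(x1)) - x1||_p^p: translating to the
    other domain and back should recover the input."""
    return float(np.sum(np.abs(f21(f12(x1)) - x1) ** p))

x1 = np.array([0.2, 0.4, 0.6])
# Exact inverses incur no penalty; a sloppy reverse map leaves a residual.
print(cycle_loss(x1, lambda x: 2.0 * x, lambda x: x / 2.0))            # 0.0
print(round(cycle_loss(x1, lambda x: 2.0 * x, lambda x: x / 2.5), 2))  # 0.24
```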
126. Shared latent space constraint
• CoGAN by Liu et al. NIPS'16
• Assume corresponding images in different domains share the same representation in a hidden space.
• $G_1 = G_{L_1} \circ G_H$ and $G_2 = G_{L_2} \circ G_H$ (weight sharing)
$\max_{G_H, G_{L_1}, G_{L_2}} \; \mathbb{E}_{p_\mathcal{N}}\big[\log D_1(G_{L_1}(G_H(z))) + \log D_2(G_{L_2}(G_H(z)))\big]$
$z \sim \mathcal{N}$; $\hat x_1 = G_1(z)$; $\hat x_2 = G_2(z)$
127. Shared latent space constraint
• Under the sufficient-capacity and descent-to-a-local-minimum assumptions of Goodfellow et al., CoGAN learns a joint distribution of multi-domain images.
• Image translation: $G_2\big(\arg\min_z \| G_1(z) - x_1 \|_p^p\big)$
128. The CoVAE-GAN framework
Ming-Yu Liu, Thomas Breuel, Jan Kautz, "Unsupervised image-to-image translation networks," arXiv 2017
• Augment CoGAN with VAEs that map images to the shared latent space.
• Apply weight-sharing constraints to the encoders $E_1$ and $E_2$.
• Image reconstruction streams: $G_1(E_1(x_1))$, $G_2(E_2(x_2))$; image translation streams: $G_2(E_1(x_1))$, $G_1(E_2(x_2))$.
130. Intuition
In the ideal case (supervised setting), we can enforce that corresponding images map to the same point in the embedding space:
$E_1(x_1^{(i)}) \equiv E_2(x_2^{(i)})$
131. Intuition
In the unsupervised setting, although we can use a GAN discriminator to make the images coming out of $G_2$ resemble real images, there is no guarantee that the output images from $G_1$ and $G_2$ form a corresponding pair.
132. Intuition
However, if we apply the shared latent space assumption via weight sharing, then the generated images from $G_1$ and $G_2$ will resemble a pair of corresponding images.
133. Intuition
Similarly, we can apply weight sharing to $E_1$ and $E_2$ to encourage a pair of corresponding images to have the same embedding.
140. Conclusions
• Family of generative models
• Theory of GANs
• Various training techniques
• Joint image distribution and video distribution
• Image translation applications