Wasserstein GAN
JIN HO LEE
2018-11-30
Contents
• 1. Introduction
• 2. Different Distances
• 3. Wasserstein GAN
• 4. Empirical Results
▷ 4.1 Experimental Procedure
▷ 4.2 Meaningful loss metric
▷ 4.3 Improved stability
• 5. Related Work
1. Introduction
• Main goal: learning a GAN by minimizing the Wasserstein distance W(Pr, Pg).
• In Section 2, we show how the Earth Mover (EM) distance behaves in comparison to the Total Variation (TV) distance, the Kullback-Leibler (KL) divergence, and the Jensen-Shannon (JS) divergence.
• In Section 3, we define the Wasserstein GAN and an efficient approximation of the EM distance.
• We empirically show that WGANs cure the main training problems of GANs.
2. Different Distances
• A σ-algebra Σ of subsets of X is a collection of subsets of X satisfying the following conditions:
(a) ∅ ∈ Σ
(b) if B ∈ Σ then B^c ∈ Σ
(c) if B1, B2, · · · is a countable collection of sets in Σ, then ∪_{n=1}^∞ Bn ∈ Σ
• Borel algebra : the smallest σ-algebra containing the open sets.
• A probability space consists of a sample space Ω, a set of events F, and a probability measure P, where the set of events F is a σ-algebra.
• A function µ on a measurable space (X, Σ) is a probability measure if
(a) µ(X) = 1, µ(∅) = 0, and µ(A) ∈ [0, 1] for every A ∈ Σ
(b) countable additivity : for every countable collection {Ei} of pairwise disjoint sets, µ(∪_i Ei) = Σ_i µ(Ei).
2. Different Distances
• The Total Variation (TV) distance:
δ(Pr, Pg) = sup_{A∈Σ} |Pr(A) − Pg(A)|.
• The Kullback-Leibler (KL) divergence:
KL(Pr||Pg) = ∫ log(Pr(x)/Pg(x)) Pr(x) dµ(x).
• The Jensen-Shannon (JS) divergence:
JS(Pr, Pg) = KL(Pr||Pm) + KL(Pg||Pm),
where Pm = (Pr + Pg)/2 is the mixture.
• The Earth-Mover (EM) distance, or Wasserstein-1:
W(Pr, Pg) = inf_{γ∈Π(Pr,Pg)} E_{(x,y)∼γ}[||x − y||],
where Π(Pr, Pg) denotes the set of all joint distributions γ(x, y) whose marginals are respectively Pr and Pg, that is, γ is a coupling of Pr and Pg.
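To make these four quantities concrete, here is a minimal NumPy sketch for distributions supported on a finite set of real points. All names and the Ber(0.3)/Ber(0.6) example are illustrative, and js follows this slide's convention (no 1/2 factor in front of the two KL terms):

```python
import numpy as np

def tv(p, q):
    # On a finite set, sup_A |P(A) - Q(A)| equals half the L1 distance.
    return 0.5 * np.abs(p - q).sum()

def kl(p, q):
    # KL(p||q); +inf as soon as q has a zero where p has mass.
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q):
    # JS as defined above: KL(p||m) + KL(q||m), with m the mixture.
    m = 0.5 * (p + q)
    return kl(p, m) + kl(q, m)

def wasserstein1(xs, p, q):
    # W1 on the real line: integrate |F_p - F_q| between support points,
    # which avoids searching over couplings entirely.
    order = np.argsort(xs)
    xs, p, q = xs[order], p[order], q[order]
    cdf_gap = np.abs(np.cumsum(p) - np.cumsum(q))[:-1]
    return np.sum(cdf_gap * np.diff(xs))

xs = np.array([0.0, 1.0])
p = np.array([0.7, 0.3])   # Ber(0.3)
q = np.array([0.4, 0.6])   # Ber(0.6)
print(tv(p, q), js(p, q), wasserstein1(xs, p, q))  # W1 = 0.6 - 0.3 = 0.3
```

The W1 value 0.3 here is exactly the p2 − p1 that the Bernoulli example on the following slides derives by hand.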
2. Different Distances Couplings
Couplings
• χ : a compact metric space
• Σ : the set of all Borel subsets of χ
• Prob(χ) : the set of probability measures on χ
Definition
Let µ and ν be probability measures on the same measurable space (S, Σ). A coupling of µ and ν is a probability measure γ on the product space (S × S, Σ × Σ) whose marginals coincide with µ and ν, i.e.,
γ(A × S) = µ(A) and γ(S × A) = ν(A) for all A ∈ Σ.
2. Different Distances Couplings
Example
For 0 ≤ p1 ≤ p2 ≤ 1 and qi = 1 − pi (i = 1, 2), consider the two joint distributions f (the independent coupling) and g (the monotone coupling):
f(0, 0) = q1q2, f(0, 1) = q1p2, f(1, 0) = p1q2, f(1, 1) = p1p2
g(0, 0) = q2, g(0, 1) = p2 − p1, g(1, 0) = 0, g(1, 1) = p1
Since X ∼ Ber(p1) and Y ∼ Ber(p2) under both f and g, both are couplings of Ber(p1) and Ber(p2).
2. Different Distances Example of Wasserstein Distance
Example
For the previous joint distributions f and g, suppose (counterfactually) that
Π[Ber(p1), Ber(p2)] = {f, g}.
Then we would have
W(Ber(p1), Ber(p2)) = min{q1p2 + p1q2, p2 − p1}.
Proof.
Since Π[Ber(p1), Ber(p2)] = {f, g}, we only consider two cases.
Case 1. f ∈ Π[Ber(p1), Ber(p2)]:
E_{(x,y)∼f}[||x − y||] = f(0, 0)||0 − 0|| + f(0, 1)||0 − 1|| + f(1, 0)||1 − 0|| + f(1, 1)||1 − 1|| = q1p2 + p1q2
2. Different Distances Example of Wasserstein Distance
Case 2. g ∈ Π[Ber(p1), Ber(p2)]:
E_{(x,y)∼g}[||x − y||] = g(0, 0)||0 − 0|| + g(0, 1)||0 − 1|| + g(1, 0)||1 − 0|| + g(1, 1)||1 − 1|| = p2 − p1
By cases 1 and 2, we have
W(Ber(p1), Ber(p2)) = inf_{γ∈Π[Ber(p1),Ber(p2)]} E_{(x,y)∼γ}[||x − y||] = inf_{γ∈{f,g}} E_{(x,y)∼γ}[||x − y||] = min{q1p2 + p1q2, p2 − p1}.
2. Different Distances An example of couplings
Lemma
For p1, p2 ∈ [0, 1], the set of all couplings Π[Ber(p1), Ber(p2)] of Ber(p1) and Ber(p2) is {p_a | max(0, q1 − p2) ≤ a ≤ min(q1, q2)}, where
p_a(0, 0) = a
p_a(0, 1) = q1 − a
p_a(1, 0) = q2 − a
p_a(1, 1) = p2 − q1 + a
(the bounds on a are exactly the nonnegativity constraints on the four entries).
Proof.
Let γ ∈ Π[Ber(p1), Ber(p2)]. Then the marginal constraints give the following table:
γ            Y = 0    Y = 1    Σ_y γ(x, y)
X = 0        ·        ·        q1
X = 1        ·        ·        p1
Σ_x γ(x, y)  q2       p2       1
2. Different Distances An example of couplings
For admissible a, setting γ(0, 0) = a completely determines the table:
γ            Y = 0     Y = 1            Σ_y γ(x, y)
X = 0        a         q1 − a           q1
X = 1        q2 − a    p2 − (q1 − a)    p1
Σ_x γ(x, y)  q2        p2               1
It means that every coupling γ of Ber(p1) and Ber(p2) is of the form p_a, namely with a = γ(0, 0) and max(0, q1 − p2) ≤ a ≤ min(q1, q2). This completes the proof.
2. Different Distances A computational result of Wasserstein Distance
Theorem
For p1 ≤ p2, we have
W(Ber(p1), Ber(p2)) = p2 − p1.
Proof.
From the previous Lemma, every coupling is of the form p_a with p_a(0, 0) = a. Then we obtain
E_{(x,y)∼p_a}[||x − y||] = p_a(0, 0)||0 − 0|| + p_a(0, 1)||0 − 1|| + p_a(1, 0)||1 − 0|| + p_a(1, 1)||1 − 1|| = (q1 − a) + (q2 − a) = 2 − p1 − p2 − 2a.
Since a cannot exceed either marginal probability of the outcome 0, we have a ≤ min{q1, q2}. From the assumption p1 ≤ p2, we have q1 ≥ q2, so min{q1, q2} = q2.
2. Different Distances A computational result of Wasserstein Distance
The function E_{(x,y)∼p_a}[||x − y||] = 2 − p1 − p2 − 2a is decreasing in a, and a ≤ q2 = 1 − p2, so
2 − p1 − p2 − 2a ≥ 2 − p1 − p2 − 2(1 − p2) = p2 − p1,
with equality when a = q2. Taking the infimum over all couplings p_a therefore gives W(Ber(p1), Ber(p2)) = p2 − p1.
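Both the Lemma and the Theorem can be checked numerically. A small sketch, with p1 = 0.3 and p2 = 0.6 as illustrative values:

```python
import numpy as np

def coupling(a, p1, p2):
    # The family p_a from the Lemma, as a 2x2 table indexed by (x, y).
    q1, q2 = 1 - p1, 1 - p2
    return np.array([[a,      q1 - a],
                     [q2 - a, p2 - q1 + a]])

p1, p2 = 0.3, 0.6
q1, q2 = 1 - p1, 1 - p2
# Nonnegativity of the four entries pins down the admissible range of a.
lo, hi = max(0.0, q1 - p2), min(q1, q2)
costs = []
for a in np.linspace(lo, hi, 1001):
    g = coupling(a, p1, p2)
    assert np.allclose(g.sum(axis=1), [q1, p1])  # x-marginal is Ber(p1)
    assert np.allclose(g.sum(axis=0), [q2, p2])  # y-marginal is Ber(p2)
    costs.append(g[0, 1] + g[1, 0])              # E[|x - y|] = off-diagonal mass
print(min(costs), p2 - p1)  # both 0.3: W(Ber(p1), Ber(p2)) = p2 - p1
```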
2. Different Distances Example 1
Example (1)
• We assume that
▷ Z ∼ U[0, 1], the uniform distribution on the unit interval;
▷ P0 is the distribution of (0, Z) ∈ R2, uniform on a straight vertical line passing through the origin;
▷ gθ(z) = (θ, z), with θ a single real parameter, and Pθ the distribution of gθ(Z).
Then we obtain the following.
• W(P0, Pθ) = |θ|
• JS(P0, Pθ) = log 2 if θ ≠ 0, and 0 if θ = 0
• KL(Pθ||P0) = KL(P0||Pθ) = +∞ if θ ≠ 0, and 0 if θ = 0
2. Different Distances Example 1
• δ(P0, Pθ) = 1 if θ ≠ 0, and 0 if θ = 0
• When θt → 0, the sequence (Pθt)t∈N converges to P0 under the EM distance, but does not converge at all under the JS, KL, reverse KL, or TV divergences.
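The contrast in Example 1 can be read off the closed forms above; the sketch below simply tabulates the values stated on this slide as θ → 0:

```python
import numpy as np

def distances(theta):
    # Closed-form values from Example 1 for P0 versus P_theta.
    w = abs(theta)
    js = 0.0 if theta == 0 else np.log(2)
    kl = 0.0 if theta == 0 else np.inf
    tv = 0.0 if theta == 0 else 1.0
    return w, js, kl, tv

for theta in [1.0, 0.1, 0.01, 0.0]:
    print(theta, distances(theta))
# Only the EM column decays smoothly to 0 as theta -> 0; JS, KL and TV jump
# at theta = 0, so they give a gradient-based learner no usable signal.
```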
2. Different Distances Theorem 1
Theorem (1)
Let Pr be a fixed distribution over χ. Let Z be a random variable (e.g. Gaussian) over another space Z. Let g : Z × R^d → χ be a function, denoted gθ(z) with z the first coordinate and θ the second. Let Pθ denote the distribution of gθ(Z). Then,
1. If g is continuous in θ, so is W(Pr, Pθ).
2. If g is locally Lipschitz and satisfies regularity assumption 1, then W(Pr, Pθ) is continuous everywhere, and differentiable almost everywhere.
3. Statements 1-2 are false for the Jensen-Shannon divergence JS(Pr, Pθ) and all the KLs.
2. Different Distances Theorem 1
The following corollary tells us that learning by minimizing the EM
distance makes sense (at least in theory) with neural networks.
Corollary
Let gθ be any feedforward neural network parameterized by θ, and p(z) a
prior over z such that Ez∼p(z)[||z||] < ∞ (e.g. Gaussian, uniform, etc.).
Then assumption 1 is satisfied and therefore W(Pr, Pθ) is continuous
everywhere and differentiable almost everywhere.
2. Different Distances Theorem 2
Theorem (2)
Let P be a distribution on a compact space χ and (Pn)n∈N be a sequence of distributions on χ. Then, considering all limits as n → ∞,
1. The following statements are equivalent:
• δ(Pn, P) → 0, with δ the total variation distance.
• JS(Pn, P) → 0, with JS the Jensen-Shannon divergence.
2. The following statements are equivalent:
• W(Pn, P) → 0.
• Pn →_D P, where →_D denotes convergence in distribution for random variables.
3. KL(Pn||P) → 0 or KL(P||Pn) → 0 implies the statements in (2).
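A concrete instance of the gap between (1) and (2): take Pn to be the point mass at 1/n and P the point mass at 0. A small sketch using SciPy's one-dimensional Wasserstein helper:

```python
from scipy.stats import wasserstein_distance

for n in [1, 10, 100, 1000]:
    # P_n is a point mass at 1/n, P a point mass at 0.
    w = wasserstein_distance([1.0 / n], [0.0])
    # sup_A |P_n(A) - P(A)| = 1 whenever the two atoms differ.
    print(n, w, 1.0)
# W(P_n, P) = 1/n -> 0 and P_n -> P in distribution, yet delta(P_n, P) = 1
# for every n: the statements in (2) hold while those in (1) fail.
```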
3. Wasserstein GAN
• Computing W(Pr, Pθ) directly from the definition is intractable. However, the Kantorovich-Rubinstein duality tells us that
W(Pr, Pθ) = sup_{||f||_L≤1} Ex∼Pr[f(x)] − Ex∼Pθ[f(x)],
where ||f||_L ≤ 1 means that f is 1-Lipschitz.
• Note that if we replace ||f||_L ≤ 1 with ||f||_L ≤ K for some K, we obtain
K · W(Pr, Pθ) = sup_{||f||_L≤K} Ex∼Pr[f(x)] − Ex∼Pθ[f(x)].
• If we have a parametrized family of functions {fw}w∈W that are all K-Lipschitz for some K, then
max_{w∈W} Ex∼Pr[fw(x)] − Ex∼Pθ[fw(x)] ≤ sup_{||f||_L≤K} Ex∼Pr[f(x)] − Ex∼Pθ[f(x)] = K · W(Pr, Pθ),
so maximizing over the family gives a lower bound on K · W(Pr, Pθ).
3. Wasserstein GAN Theorem 3
Theorem (3)
Let Pr be any distribution. Let Pθ be the distribution of gθ(Z) with Z a random variable with density p and gθ a function satisfying assumption 1. Then there is a solution f : χ → R to the problem
max_{||f||_L≤1} Ex∼Pr[f(x)] − Ex∼Pθ[f(x)],
and we have
∇θW(Pr, Pθ) = −Ez∼p(z)[∇θf(gθ(z))]
when both terms are well-defined.
• Objective functions:
L_D^WGAN = Ex∼Pr[fw(x)] − Ez∼p(z)[fw(gθ(z))]   (maximized over w)
L_G^WGAN = Ez∼p(z)[fw(gθ(z))]   (maximized over θ)
where the critic weights are clipped, w ← clip(w, −0.01, 0.01), after each update of L_D.
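These objectives translate almost line for line into code. The following is a minimal PyTorch sketch, not the paper's reference implementation; critic, gen, and the shapes of x_real and z are placeholders:

```python
import torch

def critic_loss(critic, gen, x_real, z):
    # L_D is maximized over w; flip the sign so a standard optimizer
    # can minimize it by gradient descent.
    return -(critic(x_real).mean() - critic(gen(z).detach()).mean())

def generator_loss(critic, gen, z):
    # L_G is maximized over theta, i.e. the generator descends -E[f_w(g_theta(z))].
    return -critic(gen(z)).mean()

def clip_weights(critic, c=0.01):
    # The K-Lipschitz surrogate: w <- clip(w, -c, c) after every critic update.
    with torch.no_grad():
        for w in critic.parameters():
            w.clamp_(-c, c)
```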
3. Wasserstein GAN Algorithm 1
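This slide reproduces Algorithm 1 of the paper as an image. As a stand-in, here is a hedged PyTorch sketch of that training loop under the defaults the paper reports (n_critic = 5, clipping constant c = 0.01, RMSProp with learning rate 5e-5); critic, gen, sample_real, and z_dim are assumed placeholders:

```python
import torch

def train_wgan(critic, gen, sample_real, z_dim, steps=10000,
               n_critic=5, c=0.01, lr=5e-5, batch=64):
    opt_c = torch.optim.RMSprop(critic.parameters(), lr=lr)
    opt_g = torch.optim.RMSprop(gen.parameters(), lr=lr)
    for _ in range(steps):
        for _ in range(n_critic):              # train the critic toward optimality
            x = sample_real(batch)
            z = torch.randn(batch, z_dim)
            loss_c = -(critic(x).mean() - critic(gen(z).detach()).mean())
            opt_c.zero_grad(); loss_c.backward(); opt_c.step()
            with torch.no_grad():              # weight clipping keeps f_w K-Lipschitz
                for w in critic.parameters():
                    w.clamp_(-c, c)
        z = torch.randn(batch, z_dim)
        loss_g = -critic(gen(z)).mean()        # generator step
        opt_g.zero_grad(); loss_g.backward(); opt_g.step()
        # -loss_c is the running Wasserstein estimate, the "meaningful loss"
        # whose correlation with sample quality Section 4 reports.
    return critic, gen
```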
3. Wasserstein GAN Figure 2
In this paper, the authors call the discriminator a critic. In Figure 2, a GAN discriminator and a WGAN critic are trained until optimality. The discriminator learns very quickly to distinguish between fake and real and saturates, providing no reliable gradient. The critic, by contrast, cannot saturate and converges to a linear function, giving clean gradients everywhere.
4. Empirical Results
• We claim two main benefits:
▷ a meaningful loss metric that correlates with the generator’s
convergence and sample quality
▷ improved stability of the optimization process
4. Empirical Results 4.1 Experimental Procedure
• Training curves and the visualization of samples at different stages of training show a clear correlation between the Wasserstein estimate and the generated image quality.
Some knowledge to read Appendix
• Let χ ⊂ R^d be a compact set (that is, closed and bounded, by the Heine-Borel theorem) and Prob(χ) the set of probability measures over χ.
• We define
Cb(χ) = {f : χ → R | f is continuous and bounded}.
• For f ∈ Cb(χ), we can define a norm ||f||∞ = max_{x∈χ} |f(x)|, since f is bounded (and the maximum is attained because χ is compact).
• Then (Cb(χ), || · ||∞) is a normed vector space.
• The dual space
Cb(χ)* = {ϕ : Cb(χ) → R | ϕ is linear and continuous}
has norm ||ϕ|| = sup_{f∈Cb(χ), ||f||∞≤1} |ϕ(f)|.
Some knowledge to read Appendix
• Let µ be a signed measure over χ, and define the total variation norm
||µ||TV = sup_A |µ(A)|,
where A ranges over the Borel subsets of χ. For two probability distributions Pr and Pθ, the function
δ(Pr, Pθ) = ||Pr − Pθ||TV
is a distance on Prob(χ) (the Total Variation distance).
• We can consider
Φ : (Prob(χ), δ) → (Cb(χ)*, || · ||),
where Φ(P)(f) = Ex∼P[f(x)] is a linear functional on Cb(χ).
• By the Riesz Representation Theorem, Φ is an isometric immersion, that is, δ(P, Q) = ||Φ(P) − Φ(Q)||, and Φ is injective.