# NTHU AI Reading Group: Improved Training of Wasserstein GANs

Improved Training of Wasserstein GANs

### NTHU AI Reading Group: Improved Training of Wasserstein GANs

1. NTHU AI Reading Group: Improved Training of Wasserstein GANs. Mark Chang, 2017/6/6
2. Outline
    - Wasserstein GANs
    - Derivation of Kantorovich-Rubinstein Duality
    - Improved Training of WGANs
    - Experiments
3. Outline
    - Wasserstein GANs
        - Regular GANs
        - Source of Instability
        - Earth Mover's Distance
        - Kantorovich-Rubinstein Duality
        - Wasserstein GANs
        - Weight Clipping
    - Derivation of Kantorovich-Rubinstein Duality
    - Improved Training of WGANs
    - Experiments
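4. Regular GANs
    - Generator network $G(z)$: maps a sample from the prior, $z \sim P_z(z)$, to generated data.
    - Discriminator network $D(x)$: ends in a sigmoid function and is trained to output 1 for real data $x \sim P_r(x)$ and 0 for generated data.
    - Objective: $\min_G \max_D V(D, G)$, where

    $$V(D, G) = \mathbb{E}_{x \sim P_r(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$$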
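The value function on this slide can be estimated from samples. The following is a minimal sketch, assuming the discriminator outputs are already available as plain lists of probabilities; the helper name `gan_value` and the sample values are hypothetical and only serve as an illustration.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) from discriminator outputs in (0, 1).

    d_real: values D(x) for real samples x ~ P_r
    d_fake: values D(G(z)) for generated samples, z ~ P_z
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from generated outputs 0.5 everywhere.
v_confused = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

# A confident, correct discriminator pushes V towards its upper bound of 0.
v_confident = gan_value([0.99, 0.98, 0.99], [0.01, 0.02, 0.01])

print(v_confused, v_confident)
```

With $D = 0.5$ everywhere the estimate equals $-2\log 2$, and it moves towards 0 as the discriminator separates the two sets of samples.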
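5. Source of Instability
    - Disjoint distributions: the real data distribution $P_r(x)$ and the generated data distribution $P_g(x)$ have disjoint supports.
    - The optimal discriminator $D^*(x)$ then separates real data from generated data perfectly.
    - Vanishing gradient: with such a discriminator, the generator receives almost no gradient from

    $$V(D, G) = \mathbb{E}_{x \sim P_r(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$$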
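One way to see the vanishing gradient is to look at the generator's term $\log(1 - D(G(z)))$ when $D$ is a sigmoid of a logit $a$. Its derivative with respect to $a$ is $-\sigma(a)$, which goes to 0 as the discriminator becomes confident that a sample is generated. The sketch below is a toy calculation under that assumption; the function names and the chosen logits are illustrative only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def generator_term_grad(logit):
    """Derivative of log(1 - sigmoid(a)) with respect to the logit a.

    Analytically this equals -sigmoid(a).
    """
    return -sigmoid(logit)

# More negative logits mean the discriminator is more certain the sample is fake.
grads = {logit: generator_term_grad(logit) for logit in (0.0, -5.0, -20.0)}

for logit, grad in grads.items():
    print(f"logit={logit:6.1f}  gradient={grad:.3e}")
```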
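6. Earth Mover's Distance
    - Cost function of WGAN: the Earth Mover's Distance, in place of the regular GAN value function

    $$V(D, G) = \mathbb{E}_{x \sim P_r(x)}[\log D(x)] + \mathbb{E}_{z \sim P_z(z)}[\log(1 - D(G(z)))]$$

    - Definition:

    $$\mathrm{EMD}(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \|x - y\|$$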
7. Earth Mover's Distance (figure: the distributions $P_r(x)$ and $P_g(x)$)
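8. Earth Mover's Distance (photo credit: https://vincentherrmann.github.io/blog/wasserstein/)

    $$\mathrm{EMD}(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \|x - y\|$$

    - $x$ ranges over the real data and $y$ over the generated data; $\gamma(x_1, y_2)$ is the mass moved from $x_1$ to $y_2$.
    - The transport plan $\gamma$ must have the two distributions as its marginals:

    $$\sum_y \gamma(x, y) = P_r(x), \qquad \sum_x \gamma(x, y) = P_\theta(y)$$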
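The marginal constraints and the transport cost can be checked by hand on a tiny example. The sketch below assumes two discrete distributions on the points 0, 1, 2 of the real line; the distributions, the plan `gamma`, and all helper names are made up for illustration. It also compares the plan's cost with the standard one-dimensional identity that the EMD equals the integral of the absolute difference of the two CDFs.

```python
xs = [0.0, 1.0, 2.0]          # support points shared by both distributions
p_r = [0.5, 0.5, 0.0]         # "real" distribution
p_theta = [0.0, 0.5, 0.5]     # "generated" distribution

# gamma[i][j] = mass moved from xs[i] to xs[j]
gamma = [
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0],
]

def marginals(plan):
    """Row sums (should equal P_r) and column sums (should equal P_theta)."""
    rows = [sum(row) for row in plan]
    cols = [sum(col) for col in zip(*plan)]
    return rows, cols

def transport_cost(plan, points):
    """Sum over x, y of |x - y| * gamma(x, y)."""
    n = len(points)
    return sum(abs(points[i] - points[j]) * plan[i][j]
               for i in range(n) for j in range(n))

def emd_1d(p, q, points):
    """1D closed form: integral of |CDF_p - CDF_q| over the sorted support."""
    total, cdf_gap = 0.0, 0.0
    for k in range(len(points) - 1):
        cdf_gap += p[k] - q[k]
        total += abs(cdf_gap) * (points[k + 1] - points[k])
    return total

rows, cols = marginals(gamma)
cost = transport_cost(gamma, xs)
closed_form = emd_1d(p_r, p_theta, xs)
print(rows, cols, cost, closed_form)
```

Here the plan's cost matches the closed-form value, so this particular `gamma` attains the infimum for the example.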
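9. Kantorovich-Rubinstein Duality
    - The primal form is highly intractable:

    $$\mathrm{EMD}(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \mathbb{E}_{(x,y) \sim \gamma} \|x - y\|$$

    - Kantorovich-Rubinstein duality, with the 1-Lipschitz constraint $\|f\|_L \le 1$:

    $$\mathrm{EMD}(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r} f(x) - \mathbb{E}_{x \sim P_\theta} f(x)$$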
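10. Wasserstein GANs
    - Generator network $g_\theta$: maps a sample from the prior, $z \sim P_z(z)$, to generated data.
    - Critic network $f_w$: has no sigmoid function; it outputs $f_w(x)$ for real data $x \sim P_r(x)$ and $f_w(g_\theta(z))$ for generated data.
    - Objective, with the k-Lipschitz constraint imposed through $w \in [-k, k]^l$:

    $$\min_\theta \max_{w \in [-k, k]^l} \mathbb{E}_{x \sim P_r}[f_w(x)] - \mathbb{E}_{z \sim P_z}[f_w(g_\theta(z))]$$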
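11. Wasserstein GANs
    - k-Lipschitz continuous: a real function $f(x)$ is k-Lipschitz if there exists $k$ such that, for all $x_1 \ne x_2$,

    $$\frac{|f(x_1) - f(x_2)|}{|x_1 - x_2|} \le k$$

    - Figure: $f(x)$ stays within the cone bounded by the line $g(x) = kx$ (photo credit: https://en.wikipedia.org/wiki/Lipschitz_continuity)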
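The Lipschitz condition can be probed numerically by taking the largest slope between pairs of sample points. The sketch below is only an empirical lower bound on the Lipschitz constant over a grid, not a proof; the helper `max_slope` and the test functions are illustrative choices.

```python
import math

def max_slope(f, points):
    """Largest |f(a) - f(b)| / |a - b| over all distinct pairs of points.

    This is a lower bound on the Lipschitz constant of f on the sampled range.
    """
    best = 0.0
    for i, a in enumerate(points):
        for b in points[i + 1:]:
            best = max(best, abs(f(a) - f(b)) / abs(a - b))
    return best

grid = [i / 10.0 for i in range(-30, 31)]   # 61 points in [-3, 3]

slope_sin = max_slope(math.sin, grid)            # sin is 1-Lipschitz
slope_line = max_slope(lambda x: 3.0 * x, grid)  # 3x is 3-Lipschitz

print(slope_sin, slope_line)
```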
12. Weight Clipping
    - Enforce a k-Lipschitz constraint by restricting the weights: $w \in [-c, c]^l$
    - $f(x)$ is a multi-layer neural network.
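Weight clipping amounts to projecting every weight back into $[-c, c]$ after each critic update. The following is a minimal sketch on a plain nested list standing in for one layer's weight matrix; the helper name `clip_weights`, the example weights, and the value `c=0.01` are illustrative assumptions.

```python
def clip_weights(weights, c):
    """Clamp every entry of a weight matrix into the interval [-c, c]."""
    return [[min(max(w, -c), c) for w in row] for row in weights]

# Hypothetical weights of one critic layer after a gradient step.
layer = [[0.5, -0.003], [-1.2, 0.008]]

clipped = clip_weights(layer, c=0.01)
print(clipped)
```

Weights already inside the interval are left untouched, while larger ones are pinned to the boundary.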
13. Weight Clipping
14. Outline
    - Wasserstein GANs
    - Derivation of Kantorovich-Rubinstein Duality
        - Earth Mover's Distance
        - Linear Programming
        - Dual Form
    - Improved Training of WGANs
    - Experiments
15. Derivation of Kantorovich-Rubinstein Duality: references
    - Wasserstein GAN and the Kantorovich-Rubinstein Duality: https://vincentherrmann.github.io/blog/wasserstein/
    - Optimal Transportation: Continuous and Discrete: http://smat.epfl.ch/~zemel/vt/pdm.pdf
    - Optimal Transport: Old and New: http://www.springer.com/br/book/9783540710493
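16. Earth Mover's Distance (photo credit: https://vincentherrmann.github.io/blog/wasserstein/)

    $$\mathrm{EMD}(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \sum_{x,y} \|x - y\| \, \gamma(x, y) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \langle D, \Gamma \rangle_F$$

    where $\langle D, \Gamma \rangle_F$ is the sum of all the element-wise products of

    $$\Gamma = \begin{bmatrix} \gamma(x_1, y_1) & \gamma(x_1, y_2) & \cdots & \gamma(x_1, y_n) \\ \gamma(x_2, y_1) & \gamma(x_2, y_2) & \cdots & \gamma(x_2, y_n) \\ \vdots & \vdots & \ddots & \vdots \\ \gamma(x_n, y_1) & \gamma(x_n, y_2) & \cdots & \gamma(x_n, y_n) \end{bmatrix}, \qquad D = \begin{bmatrix} \|x_1 - y_1\| & \|x_1 - y_2\| & \cdots & \|x_1 - y_n\| \\ \|x_2 - y_1\| & \|x_2 - y_2\| & \cdots & \|x_2 - y_n\| \\ \vdots & \vdots & \ddots & \vdots \\ \|x_n - y_1\| & \|x_n - y_2\| & \cdots & \|x_n - y_n\| \end{bmatrix}$$

    The rows of $\Gamma$ sum to $P_r(x)$ and its columns sum to $P_\theta(y)$.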
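17. Linear Programming
    - Standard form
        - Objective function: minimize $z = c^T x$
        - Constraint: $Ax = b$, $x \ge 0$
    - Earth Mover's Distance
        - Objective function: $\mathrm{EMD}(P_r, P_\theta) = \inf_{\gamma \in \Pi(P_r, P_\theta)} \langle D, \Gamma \rangle_F$
        - Constraint: $\sum_y \gamma(x, y) = P_r(x)$, $\sum_x \gamma(x, y) = P_\theta(y)$, and $\gamma(x, y) \ge 0 \;\; \forall x, y$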
18. Linear Programming
    - Objective function: $\mathrm{EMD}(P_r, P_\theta) = \inf_{\gamma \in \Pi} \langle D, \Gamma \rangle_F$ becomes $z = c^T x$ with

    $$x = \mathrm{vec}(\Gamma) = \begin{bmatrix} \gamma(x_1, y_1) \\ \gamma(x_1, y_2) \\ \vdots \\ \gamma(x_2, y_1) \\ \gamma(x_2, y_2) \\ \vdots \\ \gamma(x_n, y_1) \\ \gamma(x_n, y_2) \\ \vdots \end{bmatrix}, \qquad c = \mathrm{vec}(D) = \begin{bmatrix} \|x_1 - y_1\| \\ \|x_1 - y_2\| \\ \vdots \\ \|x_2 - y_1\| \\ \|x_2 - y_2\| \\ \vdots \\ \|x_n - y_1\| \\ \|x_n - y_2\| \\ \vdots \end{bmatrix}$$
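19. Linear Programming
    - Constraint: $\sum_y \gamma(x, y) = P_r(x)$ and $\sum_x \gamma(x, y) = P_\theta(y)$ become $Ax = b$ with $x = \mathrm{vec}(\Gamma)$ and $b = \begin{bmatrix} P_r \\ P_\theta \end{bmatrix}$:

    $$\underbrace{\begin{bmatrix} 1 & 1 & \cdots & 0 & 0 & \cdots & 0 & 0 & \cdots \\ 0 & 0 & \cdots & 1 & 1 & \cdots & 0 & 0 & \cdots \\ \vdots & \vdots & & \vdots & \vdots & & \vdots & \vdots & \\ 0 & 0 & \cdots & 0 & 0 & \cdots & 1 & 1 & \cdots \\ 1 & 0 & \cdots & 1 & 0 & \cdots & 1 & 0 & \cdots \\ 0 & 1 & \cdots & 0 & 1 & \cdots & 0 & 1 & \cdots \\ \vdots & \vdots & & \vdots & \vdots & & \vdots & \vdots & \\ 0 & 0 & \cdots & 0 & 0 & \cdots & 0 & 0 & \cdots \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} \gamma(x_1, y_1) \\ \gamma(x_1, y_2) \\ \vdots \\ \gamma(x_2, y_1) \\ \gamma(x_2, y_2) \\ \vdots \\ \gamma(x_n, y_1) \\ \gamma(x_n, y_2) \\ \vdots \end{bmatrix}}_{x} = \underbrace{\begin{bmatrix} P_r(x_1) \\ P_r(x_2) \\ \vdots \\ P_r(x_n) \\ P_\theta(y_1) \\ P_\theta(y_2) \\ \vdots \\ P_\theta(y_n) \end{bmatrix}}_{b}$$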
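The pieces $c$, $A$, and $b$ of this linear program can be assembled explicitly for a small case. The sketch below assumes scalar support points and a row-major `vec`, matching the ordering shown on the slides; the helper names and the example distributions are hypothetical. It only builds the program and checks a hand-picked feasible plan, without calling an LP solver.

```python
def build_lp(points, p_r, p_theta):
    """Return (c, A, b) for the EMD linear program with x = vec(Gamma), row-major."""
    n = len(points)
    # c = vec(D): pairwise distances |x_i - y_j|
    c = [abs(points[i] - points[j]) for i in range(n) for j in range(n)]
    A = []
    for i in range(n):   # row-sum constraints: sum_j gamma(x_i, y_j) = P_r(x_i)
        A.append([1.0 if k // n == i else 0.0 for k in range(n * n)])
    for j in range(n):   # column-sum constraints: sum_i gamma(x_i, y_j) = P_theta(y_j)
        A.append([1.0 if k % n == j else 0.0 for k in range(n * n)])
    b = list(p_r) + list(p_theta)
    return c, A, b

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

points = [0.0, 1.0, 2.0]
p_r = [0.5, 0.5, 0.0]
p_theta = [0.0, 0.5, 0.5]
c, A, b = build_lp(points, p_r, p_theta)

# vec(Gamma) for the plan "move 0.5 from 0 to 1 and 0.5 from 1 to 2"
x = [0.0, 0.5, 0.0,
     0.0, 0.0, 0.5,
     0.0, 0.0, 0.0]

ax = matvec(A, x)   # should reproduce b
z = dot(c, x)       # transport cost of this plan
print(ax, z)
```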
20. Dual Form
    - Primal problem: minimize $z = c^T x$, with constraint $Ax = b$, $x \ge 0$
    - Dual problem: maximize $\tilde{z} = b^T y$, with constraint $A^T y \le c$
    - Weak duality: $z \ge \tilde{z}$, that is, $\tilde{z}$ is a lower bound of $z$:

    $$z = c^T x \ge y^T A x = y^T b = \tilde{z}$$

    - Strong duality: $z = \tilde{z}$
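Weak and strong duality can be checked numerically on the smallest non-trivial transport problem. The sketch below assumes two support points, 0 and 1, with all real mass at 0 and all generated mass at 1; the matrices are written out by hand and the dual vectors `y_loose` and `y_tight` are illustrative choices, not the output of a solver.

```python
# Two support points (0 and 1); x = vec(Gamma) in row-major order.
c = [0.0, 1.0, 1.0, 0.0]            # vec(D): pairwise distances
A = [[1.0, 1.0, 0.0, 0.0],          # row sums of Gamma
     [0.0, 0.0, 1.0, 1.0],
     [1.0, 0.0, 1.0, 0.0],          # column sums of Gamma
     [0.0, 1.0, 0.0, 1.0]]
b = [1.0, 0.0, 0.0, 1.0]            # [P_r; P_theta]

def dot(u, v):
    return sum(p * q for p, q in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def primal_feasible(x):
    return matvec(A, x) == b and all(v >= 0 for v in x)

def dual_feasible(y):
    # A^T y <= c, with a small tolerance for floating point
    return all(lhs <= rhs + 1e-12 for lhs, rhs in zip(matvec(transpose(A), y), c))

x = [0.0, 1.0, 0.0, 0.0]            # move all mass from point 0 to point 1
y_loose = [0.0, 0.0, 0.0, 0.0]      # feasible but not optimal dual point
y_tight = [0.0, -1.0, 0.0, 1.0]     # y = [f; g] with f = (0, -1), g = (0, 1)

z = dot(c, x)
z_loose = dot(b, y_loose)
z_tight = dot(b, y_tight)
print(z, z_loose, z_tight)
```

Any feasible `y` gives a value no larger than `z` (weak duality), and `y_tight` closes the gap for this example (strong duality).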
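21. Dual Form
    - Objective function: $\tilde{z} = b^T y$ with

    $$b = \begin{bmatrix} P_r \\ P_\theta \end{bmatrix} = \begin{bmatrix} P_r(x_1) \\ P_r(x_2) \\ \vdots \\ P_r(x_n) \\ P_\theta(x_1) \\ P_\theta(x_2) \\ \vdots \\ P_\theta(x_n) \end{bmatrix}, \qquad y = \begin{bmatrix} f \\ g \end{bmatrix} = \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \\ g(x_1) \\ g(x_2) \\ \vdots \\ g(x_n) \end{bmatrix}$$

    $$\Rightarrow \mathrm{EMD}(P_r, P_\theta) = f^T P_r + g^T P_\theta$$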
22. Dual Form
    - Constraint: $A^T y \le c$ with $y = \begin{bmatrix} f \\ g \end{bmatrix}$ and $c = \mathrm{vec}(D)$:

    $$\begin{bmatrix} 1 & 0 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ 0 & 1 & \cdots & 0 & 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & \cdots & 1 & 1 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 1 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & & \vdots & \vdots & \vdots & & \vdots \end{bmatrix} \begin{bmatrix} f(x_1) \\ f(x_2) \\ \vdots \\ f(x_n) \\ g(x_1) \\ g(x_2) \\ \vdots \\ g(x_n) \end{bmatrix} \le \begin{bmatrix} \|x_1 - x_1\| \\ \|x_1 - x_2\| \\ \vdots \\ \|x_2 - x_1\| \\ \|x_2 - x_2\| \\ \vdots \\ \|x_n - x_1\| \\ \|x_n - x_2\| \\ \vdots \end{bmatrix}$$

    $$\Rightarrow f(x_i) + g(x_j) \le \|x_i - x_j\| \quad \forall i, j$$
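23. Dual Form
    - Maximize: $\mathrm{EMD}(P_r, P_\theta) = f^T P_r + g^T P_\theta$
    - Constraint: $f(x_i) + g(x_j) \le \|x_i - x_j\| \quad \forall i, j$
    - If $i = j$: $f(x_i) + g(x_i) \le \|x_i - x_i\| = 0$, so $g(x_i) \le -f(x_i)$. Taking this bound with equality when maximizing gives

    $$f(x_i) = -g(x_i)$$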
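24. Dual Form
    - Constraint: $f(x_i) + g(x_j) \le \|x_i - x_j\| \;\; \forall i, j$, with $f(x_i) = -g(x_i)$, gives

    $$\begin{cases} f(x_i) - f(x_j) \le \|x_i - x_j\| \\ f(x_i) - f(x_j) \ge -\|x_i - x_j\| \end{cases} \;\Rightarrow\; -1 \le \frac{f(x_i) - f(x_j)}{\|x_i - x_j\|} \le 1 \;\Rightarrow\; \|f\|_L \le 1$$

    - 1-Lipschitz constraint: the slope should be between -1 and 1.
    - Substituting into the dual objective:

    $$\mathrm{EMD}(P_r, P_\theta) = \sup_{\|f\|_L \le 1} \mathbb{E}_{x \sim P_r} f(x) - \mathbb{E}_{x \sim P_\theta} f(x)$$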
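The dual form can be evaluated directly for a candidate $f$. The sketch below assumes two discrete distributions on the points 0, 1, 2, for which moving every unit of mass one step to the right costs 1.0 in the primal problem; the candidate functions and helper names are illustrative. A 1-Lipschitz $f$ reaches that value, while a steeper function gives a larger number only because it violates the constraint.

```python
points = [0.0, 1.0, 2.0]
p_r = [0.5, 0.5, 0.0]
p_theta = [0.0, 0.5, 0.5]

def is_1_lipschitz_on(f, pts):
    """Check |f(a) - f(b)| <= |a - b| for every pair of support points."""
    return all(abs(f(a) - f(b)) <= abs(a - b) + 1e-12 for a in pts for b in pts)

def dual_value(f, pts, p, q):
    """E_{x ~ p} f(x) - E_{x ~ q} f(x) for discrete p and q on pts."""
    return (sum(f(x) * w for x, w in zip(pts, p))
            - sum(f(x) * w for x, w in zip(pts, q)))

def f_opt(x):
    return -x          # slope -1: satisfies the 1-Lipschitz constraint

def f_steep(x):
    return -2.0 * x    # slope -2: violates the constraint

value_opt = dual_value(f_opt, points, p_r, p_theta)
value_steep = dual_value(f_steep, points, p_r, p_theta)

print(is_1_lipschitz_on(f_opt, points), value_opt)
print(is_1_lipschitz_on(f_steep, points), value_steep)
```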
25. Outline
    - Wasserstein GANs
    - Derivation of Kantorovich-Rubinstein Duality
    - Improved Training of WGANs
        - Difficulties with weight constraints
        - Gradient penalty
    - Experiments
26. Difficulties with weight constraints
    - Capacity underuse
        - Weights attain their maximum or minimum values
        - The critic can only learn simple functions
    - Exploding and vanishing gradients
        - Clipping parameter is too large -> exploding gradients
        - Clipping parameter is too small -> vanishing gradients
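A toy calculation hints at why the clipping parameter interacts badly with depth. The sketch below assumes a chain of scalar linear layers whose weights all sit at the clipping limit $c$, so the gradient of the output with respect to the input is $c^{\text{depth}}$; real critics have wide layers and nonlinearities, so this is only a rough illustration, and the function name and chosen values are hypothetical.

```python
def input_gradient_magnitude(c, depth):
    """|d output / d input| for a chain of `depth` scalar layers h -> c * h."""
    grad = 1.0
    for _ in range(depth):
        grad *= c
    return grad

depth = 12
grad_small_c = input_gradient_magnitude(0.01, depth)   # shrinks towards zero
grad_unit_c = input_gradient_magnitude(1.0, depth)     # stays at 1
grad_large_c = input_gradient_magnitude(10.0, depth)   # blows up

print(grad_small_c, grad_unit_c, grad_large_c)
```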
27. Difficulties with weight constraints
    - Capacity underuse
28. Difficulties with weight constraints
    - Capacity underuse
29. Difficulties with weight constraints
    - Exploding and vanishing gradients
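30. Gradient penalty
    - The optimal critic has gradients with norm 1 almost everywhere under $P_r$ and $P_g$. For $x \sim P_r$, $y \sim P_g$, and $x_t = (1 - t)x + ty$:

    $$\nabla f^*(x_t) = \frac{y - x_t}{\|y - x_t\|} \;\Rightarrow\; \|\nabla f^*(x_t)\| = 1$$

    - Loss: the original critic loss plus a gradient penalty:

    $$L = \underbrace{\mathbb{E}_{\tilde{x} \sim P_g}[f(\tilde{x})] - \mathbb{E}_{x \sim P_r}[f(x)]}_{\text{original critic loss}} + \underbrace{\mathbb{E}_{x_t \sim P_t}\left[(\|\nabla_{x_t} f(x_t)\| - 1)^2\right]}_{\text{gradient penalty}}$$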
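The loss on this slide can be sketched without a deep learning framework by approximating the critic's gradient with central finite differences. Everything below is an illustrative assumption: the critics are simple closed-form functions rather than neural networks, the samples and interpolation coefficients `ts` are fixed by hand instead of being drawn at random, and `lam` is a penalty weight that defaults to 1.0 to match the formula as written on the slide. In practice the gradient would come from automatic differentiation.

```python
import math

def norm(v):
    return math.sqrt(sum(a * a for a in v))

def numerical_grad(f, x, h=1e-5):
    """Central finite-difference gradient of a scalar function f at point x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def interpolate(x, y, t):
    """x_t = (1 - t) x + t y"""
    return [(1 - t) * a + t * b for a, b in zip(x, y)]

def critic_loss(f, real, fake, ts, lam=1.0):
    """Return (total loss, gradient penalty) for critic f on paired samples."""
    original = (sum(f(y) for y in fake) / len(fake)
                - sum(f(x) for x in real) / len(real))
    penalties = [(norm(numerical_grad(f, interpolate(x, y, t))) - 1.0) ** 2
                 for x, y, t in zip(real, fake, ts)]
    penalty = sum(penalties) / len(penalties)
    return original + lam * penalty, penalty

real = [[1.0, 2.0], [2.0, 1.0]]     # stand-ins for samples from P_r
fake = [[3.0, 4.0], [4.0, 2.0]]     # stand-ins for samples from P_g
ts = [0.25, 0.75]                   # fixed here; normally sampled uniformly

def unit_slope_critic(x):
    return norm(x)                  # gradient norm is 1 away from the origin

def steep_critic(x):
    return 2.0 * norm(x)            # gradient norm is 2

loss_unit, penalty_unit = critic_loss(unit_slope_critic, real, fake, ts)
loss_steep, penalty_steep = critic_loss(steep_critic, real, fake, ts)
print(penalty_unit, penalty_steep)
```

The critic with unit gradient norm incurs essentially no penalty, while doubling its slope yields a penalty close to $(2 - 1)^2 = 1$.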