3. WHAT IS A GENERATIVE MODEL
p̂_θ(x) = g_θ(z)
https://blog.openai.com/generative-models/
[Figure: the generator g_θ maps the latent space Z into the feature space.]
4. WHY GENERATIVE
A new way of simulating in applied math/engineering domains
Combining with Reinforcement Learning
Good for semi-supervised learning
Can work with multi-modal outputs
Can generate realistic data
8. Generative model
p̂_θ(x) = g_θ(z)
Let z ∼ N(0, 1)
Let g be a neural network with transpose convolutional layers (So Nice!!)
x ∼ X : MNIST dataset
L2 Loss (Mean Square Error)
[Handwritten note: the feature space is parameterized by θ; this is maximum likelihood. Assuming p(y|x) = N(g_θ(x), σ²), the log-likelihood L(θ) = Σ_i log p(y_i|x_i) is maximized exactly when the squared error Σ_i ||y_i − g_θ(x_i)||² is minimized.]
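To make the setup concrete, here is a minimal sketch of such a generator (PyTorch is assumed; the Generator class, z_dim=64, the layer sizes, and the random placeholder batch are illustrative choices, not the slides' exact network):

```python
import torch
import torch.nn as nn

# Generator g_theta: z ~ N(0, I) -> 28x28 image, built from transpose convolutions.
class Generator(nn.Module):
    def __init__(self, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(z_dim, 128, kernel_size=7),                    # 1x1 -> 7x7
            nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # 7x7 -> 14x14
            nn.ReLU(),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),    # 14x14 -> 28x28
            nn.Sigmoid(),
        )

    def forward(self, z):
        return self.net(z.view(z.size(0), -1, 1, 1))

g = Generator()
z = torch.randn(16, 64)                  # z ~ N(0, 1)
x_hat = g(z)                             # generated batch, shape (16, 1, 28, 28)
x = torch.rand(16, 1, 28, 28)            # placeholder standing in for MNIST images
loss = nn.functional.mse_loss(x_hat, x)  # L2 loss (mean square error)
loss.backward()
```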
12. Notations
x : Observed data, z : Latent variable
p(x) : Evidence, p(z) : Prior
p(x|z) : Likelihood, p(z|x) : Posterior
Probabilistic model defined as the joint distribution of x and z:
p(x, z)
13. Model
p(x, z) = p(x|z)p(z)
Our interest is the posterior!!
p(z|x) = p(x|z)p(z) / p(x) : infer a good value of z given x
p(x) = ∫ p(x, z)dz = ∫ p(x|z)p(z)dz
p(x) is hard to calculate (INTRACTABLE)
→ Approximate the posterior
[Handwritten note: Bayesian inference of the latent variable z from the observable x, via Bayes' rule and the product rule; computing the marginal p(x) is a hard job.]
14. Variational Inference
Pick a family of distributions over the latent variables with its own variational parameters, q_φ(z|x)
Find φ that makes q close to the posterior of interest
[Handwritten note: we can't access the true posterior directly; with a parameterized q, the sampling problem becomes an optimizing problem. The variational parameters are e.g. (μ, σ) for a Gaussian, (min, max) for a uniform.]
15. KULLBACK LEIBLER DIVERGENCE
Defined only if Q(i) = 0 implies P(i) = 0, for all i.
A measure of the non-symmetric difference between two probability distributions P and Q:
KL(P||Q) = ∫ p(x) log (p(x) / q(x)) dx
         = ∫ p(x) log p(x) dx − ∫ p(x) log q(x) dx
[Handwritten note: KL(P||Q) = cross-entropy − entropy; it is not symmetric. Entropy = uncertainty.]
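As a quick numerical illustration of the definition and the asymmetry noted above (hypothetical discrete distributions, plain NumPy):

```python
import numpy as np

# Two made-up distributions over 3 outcomes.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

kl_pq = np.sum(p * np.log(p / q))  # KL(P||Q)
kl_qp = np.sum(q * np.log(q / p))  # KL(Q||P)

print(kl_pq, kl_qp)  # both >= 0, and they differ: KL is not symmetric
```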
16. Property
The Kullback-Leibler divergence is always non-negative:
KL(P||Q) ≥ 0
17. Proof
X − 1 ≥ log X   ⟹   log(1/X) ≥ 1 − X
Using this,
KL(P||Q) = ∫ p(x) log (p(x) / q(x)) dx
         ≥ ∫ p(x) (1 − q(x) / p(x)) dx
         = ∫ {p(x) − q(x)} dx
         = ∫ p(x) dx − ∫ q(x) dx
         = 1 − 1 = 0
19. Maximizing Likelihood is equivalent to minimizing KL Divergence
φ* = argmin_φ (−∫ p(x) log q(x; φ) dx)
   = argmax_φ ∫ p(x) log q(x; φ) dx
   = argmax_φ E_{x∼p(x)}[log q(x; φ)]
   ≈ argmax_φ (1/N) Σ_{i=1}^{N} log q(x_i; φ)
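A tiny worked example of the last step, the sample-average approximation (an assumed setup: q(x; φ) = N(φ, 1) on synthetic data, where the maximizer is the sample mean):

```python
import numpy as np

x = np.random.normal(3.0, 1.0, size=10_000)  # samples from the unknown p(x)

# Average log-likelihood (1/N) sum_i log q(x_i; phi) for q(x; phi) = N(phi, 1)
phis = np.linspace(0.0, 6.0, 601)
avg_ll = [-0.5 * np.mean((x - phi) ** 2) - 0.5 * np.log(2 * np.pi) for phi in phis]

print(phis[int(np.argmax(avg_ll))])  # ~3.0: the MLE recovers the true mean
```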
20. JENSEN'S INEQUALITY
For a concave function, f(E[x]) ≥ E[f(x)]
For a convex function, f(E[x]) ≤ E[f(x)]
21. Evidence Lower BOund
log p(x) = log ∫_z p(x, z)dz
         = log ∫_z p(x, z) (q(z)/q(z))dz
         = log ∫_z q(z) (p(x, z)/q(z))dz
         = log E_q[p(x, z)/q(z)]
         ≥ E_q[log p(x, z)] − E_q[log q(z)]   (by Jensen's inequality)
[Handwritten note: q is a well-known probabilistic distribution that we choose, so the expectation E_q is tractable while log p(x) is not.]
22. Variational Distribution
q*_φ(z|x) = argmin_φ KL(q_φ(z|x)||p_θ(z|x))
Choose a family of variational distributions (q)
Fit the parameter (φ) to minimize the distance between the two distributions (KL divergence)
[Handwritten note: this is the reverse KL divergence.]
23. KL Divergence
KL(q_φ(z|x)||p_θ(z|x)) = E_{q_φ}[log (q_φ(z|x) / p_θ(z|x))]
= E_{q_φ}[log q_φ(z|x) − log p_θ(z|x)]
= E_{q_φ}[log q_φ(z|x) − log (p_θ(z|x) p_θ(x) / p_θ(x))]
= E_{q_φ}[log q_φ(z|x) − log p_θ(x, z) + log p_θ(x)]
= E_{q_φ}[log q_φ(z|x) − log p_θ(x, z)] + log p_θ(x)
24. Objective
q*_φ(z|x) = argmin_φ E_{q_φ}[log q_φ(z|x) − log p_θ(x, z)] + log p_θ(x)
The KL objective is the negative ELBO plus the log marginal probability of x
log p_θ(x) does not depend on q
Minimizing the KL divergence is the same as maximizing the ELBO
q*_φ(z|x) = argmax_φ ELBO
[Handwritten note: log p(x) is fixed w.r.t. q, so minimizing the KL is maximizing the ELBO.]
25. Variational Lower Bound
For each data point x_i, the marginal likelihood of an individual data point:
log p_θ(x_i) ≥ L(θ, φ; x_i)
= E_{q_φ(z|x_i)}[−log q_φ(z|x_i) + log p_θ(x_i, z)]
= E_{q_φ(z|x_i)}[log (p_θ(x_i|z) p_θ(z)) − log q_φ(z|x_i)]
= E_{q_φ(z|x_i)}[log p_θ(x_i|z) − (log q_φ(z|x_i) − log p_θ(z))]
= E_{q_φ(z|x_i)}[log p_θ(x_i|z)] − E_{q_φ(z|x_i)}[log (q_φ(z|x_i) / p_θ(z))]
= E_{q_φ(z|x_i)}[log p_θ(x_i|z)] − KL(q_φ(z|x_i)||p_θ(z))
[Handwritten note: the ELBO splits into a reconstruction term to maximize and a KL regularizer to minimize.]
26. ELBO
L(θ, φ; x_i) = E_{q_φ(z|x_i)}[log p_θ(x_i|z)] − KL(q_φ(z|x_i)||p_θ(z))
q_φ(z|x_i) : proposal distribution
p_θ(z) : prior (our belief)
How to choose a good proposal distribution:
Easy to sample
Differentiable (∵ backprop.)
[Handwritten note: q is the posterior approximation → typically Gaussian.]
27. Maximizing ELBO - I
L(φ; x_i) = E_{q_φ(z|x_i)}[log p(x_i|z)] − KL(q_φ(z|x_i)||p(z))
φ* = argmax_φ E_{q_φ(z|x_i)}[log p(x_i|z)]
E_{q_φ(z|x_i)}[log p(x_i|z)] : the log-likelihood (NOT A LOSS)
Maximize the likelihood to maximize the ELBO (NOT MINIMIZE!!)
28. Log Likelihood
In the case of a Bernoulli distribution, p(x|z) is
E_{q_φ(z|x)}[log p(x|z)] = Σ_{i=1}^{n} [x_i log x̂_i + (1 − x_i) log(1 − x̂_i)]
To maximize it, minimize the Negative Log Likelihood!!
Loss = −(1/n) Σ_{i=1}^{n} [x_i log x̂_i + (1 − x_i) log(1 − x̂_i)]
Already known as the sigmoid cross-entropy
x̂_i is the output of the decoder
We call it the reconstruction loss
[Handwritten note: also called binomial cross-entropy; in the case of a Gaussian distribution, the loss becomes the L2 (MSE) loss.]
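A short sketch of this reconstruction loss in code (PyTorch assumed; x_hat stands in for the decoder's sigmoid output, and the batch is placeholder data):

```python
import torch
import torch.nn.functional as F

x = torch.randint(0, 2, (16, 784)).float()  # binarized images (placeholder batch)
x_hat = torch.rand(16, 784)                 # decoder output after a sigmoid

# Negative Bernoulli log-likelihood == sigmoid/binary cross-entropy
recon_loss = F.binary_cross_entropy(x_hat, x, reduction='mean')

# With a Gaussian decoder the same term becomes an L2 / MSE loss instead
recon_loss_gaussian = F.mse_loss(x_hat, x)
```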
29. Maximizing ELBO - II
L(φ; x_i) = E_{q_φ(z|x_i)}[log p(x_i|z)] − KL(q_φ(z|x_i)||p(z))
φ* = argmin_φ KL(q_φ(z|x_i)||p(z))
Assume that the prior and the posterior approximation are Gaussian (actually it's not a critical issue...)
Then we can compute the KL divergence in closed form from its definition
Let the prior be N(0, 1)
How about q_φ(z|x_i)?
30. Posterior
The posterior approximation is Gaussian:
q_φ(z|x_i) = N(μ_i, σ_i²)
where (μ_i, σ_i) is the output of the encoder.
[Handwritten note: if dim(z) = n, the encoder outputs n pairs (μ, σ).]
31. Minimizing KL Divergence
KL(q_φ(z|x)||p(z)) = ∫ q_φ(z|x) log q_φ(z|x) dz − ∫ q_φ(z|x) log p(z) dz

∫ q_φ(z|x) log q_φ(z|x) dz = ∫ N(z; μ, σ²) log N(z; μ, σ²) dz
    = −(N/2) log 2π − (1/2) Σ_{i=1}^{N} (1 + log σ_i²)

∫ q_φ(z|x) log p(z) dz = ∫ N(z; μ, σ²) log N(z; 0, 1) dz
    = −(N/2) log 2π − (1/2) Σ_{i=1}^{N} (μ_i² + σ_i²)

Therefore,
KL(q_φ(z|x)||p(z)) = −(1/2) Σ_{i=1}^{N} (1 + log σ_i² − μ_i² − σ_i²)
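In code this closed-form KL term is essentially one line (PyTorch assumed; having the encoder predict log σ² instead of σ is a common numerical-stability choice, not something the slides mandate):

```python
import torch

mu = torch.randn(16, 20)      # encoder means: batch of 16, dim(z) = 20
logvar = torch.randn(16, 20)  # encoder log-variances, log(sigma^2)

# KL(q_phi(z|x) || N(0, I)) = -1/2 * sum_i (1 + log sigma_i^2 - mu_i^2 - sigma_i^2)
kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1).mean()
```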
32. AUTO-ENCODER
Encoder : MLPs to infer (μ_i, σ_i) for q_φ(z|x_i)
Decoder : MLPs to infer x̂ using latent variables z ∼ N(μ, σ²)
Is it differentiable? (= possible to backprop?)
33. REPARAMETERIZATION TRICK
Tutorial on Variational Autoencoders
[Handwritten note: sampling z directly is not able to backprop. With z = μ + σ·ε, ε ∼ N(0, 1), the sampling process is independent of the model: ε is just a constant w.r.t. the parameters.]
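A minimal sketch of the trick (PyTorch assumed; mu and logvar stand in for the encoder outputs):

```python
import torch

mu = torch.randn(16, 20, requires_grad=True)      # encoder output (mean)
logvar = torch.randn(16, 20, requires_grad=True)  # encoder output (log-variance)

# Reparameterization: push the randomness into a constant eps ~ N(0, I), so z is a
# deterministic, differentiable function of mu and logvar.
eps = torch.randn_like(mu)
z = mu + torch.exp(0.5 * logvar) * eps  # z ~ N(mu, sigma^2)

z.sum().backward()  # gradients now flow to mu and logvar; eps is just a constant
```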
40. Features
Advantage
Fast and easy to train
We can monitor the loss and evaluate it
Disadvantage
Low quality
Even though q reaches the optimal point, it is quite different from p
Issues
Reconstruction loss (cross-entropy, L1, L2, ...)
MLP structure
Regularizer loss (sometimes don't use log, sometimes use exp, ...)
...
45. Value Function
min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]
For the second term, E_{z∼p_z(z)}[log(1 − D(G(z)))]:
D wants to maximize it → do not get fooled
G wants to minimize it → fool D
47. Global Optimality of p_g = p_data
D*_G(x) = p_data(x) / (p_data(x) + p_g(x))
Note that this holds 'FOR ANY GIVEN generator G'.
48. Proof
For G fixed,
V(G, D) = ∫_x p_r(x) log(D(x))dx + ∫_z p_z(z) log(1 − D(G(z)))dz
        = ∫_x [p_r(x) log(D(x)) + p_g(x) log(1 − D(x))]dx
Let X = D(x), a = p_r(x), b = p_g(x). So,
V = a log X + b log(1 − X)
Find the X which maximizes the value function V.
49. Proof
∂_X V = ∂_X (a log X + b log(1 − X))
      = ∂_X a log X + ∂_X b log(1 − X)
      = a (1/X) + b (−1/(1 − X))
      = (a(1 − X) − bX) / (X(1 − X))
      = (a − aX − bX) / (X(1 − X))
      = (a − (a + b)X) / (X(1 − X))
51. Proof
Find the solution of
f(X) = a − (a + b)X = 0
Solution:
a − (a + b)X = 0
(a + b)X = a
X = a / (a + b)
The function f(X) (the numerator of ∂_X V) is monotone decreasing, so ∂_X V changes sign from + to − at this zero.
∴ X = a / (a + b) is the maximum point of V.
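A quick numeric sanity check of this optimum (a and b are made-up stand-ins for p_r(x) and p_g(x) at some fixed x):

```python
import numpy as np

a, b = 0.6, 0.4
X = np.linspace(0.001, 0.999, 9999)
V = a * np.log(X) + b * np.log(1 - X)  # the integrand from slide 48

print(X[np.argmax(V)])  # ~0.6
print(a / (a + b))      # 0.6 -> the argmax matches a/(a+b)
```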
52. Theorem
The global minimum of the virtual training criterion L(D, g_θ) is achieved if and only if p_g = p_r.
At that point, L(D, g_θ) achieves the value −log 4.
53. Proof
L(D*, g_θ) = max_D V(G, D)
= E_{x∼p_r}[log D*_G(x)] + E_{z∼p_z}[log(1 − D*_G(G(z)))]
= E_{x∼p_r}[log D*_G(x)] + E_{x∼p_g}[log(1 − D*_G(x))]
= E_{x∼p_r}[log (p_r(x) / (p_r(x) + p_g(x)))] + E_{x∼p_g}[log (p_g(x) / (p_r(x) + p_g(x)))]
= E_{x∼p_r}[log (p_r(x) / (p_r(x) + p_g(x)))] + E_{x∼p_g}[log (p_g(x) / (p_r(x) + p_g(x)))] + log 4 − log 4
= E_{x∼p_r}[log (p_r(x) / (p_r(x) + p_g(x)))] + log 2 + E_{x∼p_g}[log (p_g(x) / (p_r(x) + p_g(x)))] + log 2 − log 4
= E_{x∼p_r}[log (2 p_r(x) / (p_r(x) + p_g(x)))] + E_{x∼p_g}[log (2 p_g(x) / (p_r(x) + p_g(x)))] − log 4
54. Proof (cont.)
= E_{x∼p_r}[log (p_r(x) / ((p_r(x) + p_g(x)) / 2))] + E_{x∼p_g}[log (p_g(x) / ((p_r(x) + p_g(x)) / 2))] − log 4
= KL[p_r(x) || (p_r(x) + p_g(x)) / 2] + KL[p_g(x) || (p_r(x) + p_g(x)) / 2] − log 4
= −log 4 + 2 JS(p_r(x) || p_g(x))
where JS is the Jensen-Shannon divergence, defined as
JS(P||Q) = (1/2) KL(P||M) + (1/2) KL(Q||M), where M = (1/2)(P + Q)
∵ JS is always ≥ 0, −log 4 is the global minimum.
55. Jensen-Shannon Divergence
JS(P||Q) = (1/2) KL(P||M) + (1/2) KL(Q||M)
Two types of KL divergence:
KL(P||Q) : maximum likelihood. Approximations Q tend to overgeneralise P.
KL(Q||P) : reverse KL divergence; tends to favour under-generalisation. The optimal Q will typically describe the single largest mode of P well.
The Jensen-Shannon divergence exhibits behaviour that is kind of halfway between the two extremes above.
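Building on the KL snippet from slide 15, a few lines computing JS and checking its symmetry (NumPy, hypothetical distributions):

```python
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(js(p, q), js(q, p))  # equal: JS is symmetric, unlike KL
```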
58. Training
Cost function for D:
J^(D) = −(1/2) E_{x∼p_data}[log D(x)] − (1/2) E_z[log(1 − D(G(z)))]
Typical cross-entropy with labels 1 and 0 (Bernoulli)
Cost function for G:
J^(G) = −(1/2) E_z[log D(G(z))]
Maximize log D(G(z)) instead of minimizing log(1 − D(G(z))) (which causes vanishing gradients)
Also a standard cross-entropy, with label 1
Is this way really good?? (see the sketch after this slide)
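A sketch of the two cost functions as training steps (PyTorch assumed; D and G are hypothetical discriminator/generator modules, D is assumed to end in a sigmoid, and the 1/2 factors are dropped since they don't change the optima):

```python
import torch
import torch.nn.functional as F

def d_step(D, G, x, z):
    # J(D) = -E[log D(x)] - E[log(1 - D(G(z)))]: cross-entropy with labels 1 and 0
    out_real = D(x)
    out_fake = D(G(z).detach())  # detach: don't backprop into G on D's step
    return (F.binary_cross_entropy(out_real, torch.ones_like(out_real))
            + F.binary_cross_entropy(out_fake, torch.zeros_like(out_fake)))

def g_step(D, G, z):
    # Non-saturating G loss: maximize log D(G(z)) (cross-entropy with label 1)
    # instead of minimizing log(1 - D(G(z))), which vanishes early in training.
    out = D(G(z))
    return F.binary_cross_entropy(out, torch.ones_like(out))
```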
59. Secret of G Loss
We already know (slides 53-54) that the original G loss E_z[log(1 − D*(g_θ(z)))] is tied to 2 JS(P_r||P_g).
Furthermore,
KL(P_g||P_r) = E_{x∼p_g}[log (p_g(x) / p_r(x))]
             = E_{x∼p_g}[log ((1 − D*(x)) / D*(x))]   (since D*(x) = p_r(x) / (p_r(x) + p_g(x)))
             = E_z[log ((1 − D*(g_θ(z))) / D*(g_θ(z)))]
(from Martin Arjovsky)
60. Taking derivatives in θ at θ_0 we get
∇_θ KL(P_{g_θ}||P_r) |_{θ_0} = ∇_θ E_z[log ((1 − D*(g_θ(z))) / D*(g_θ(z)))] |_{θ_0}
                             = E_z[∇_θ log ((1 − D*(g_θ(z))) / D*(g_θ(z)))] |_{θ_0}
Subtracting this last equation from the result for the JSD,
E_z[−∇_θ log D*(g_θ(z))] |_{θ_0} = ∇_θ [KL(P_{g_θ}||P_r) − 2 JS(P_{g_θ}||P_r)] |_{θ_0}
The JS pushes the distributions to be different, which seems like a fault in the update.
The KL appearing here assigns an extremely high cost to generating fake-looking samples, and an extremely low cost to mode dropping.
78. Normalizing Input
Normalize the images between −1 and 1 (sketched in code after this slide)
Use tanh as the last layer of the generator output
A Modified Loss Function
Like maximizing D(G(z)) instead of minimizing 1 − D(G(z))
Use a spherical Z
Sample from a Gaussian distribution rather than a uniform one
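The normalization item above in code (a standard rescaling, shown on placeholder data):

```python
import torch

x = torch.rand(16, 1, 28, 28)  # images in [0, 1]
x = x * 2.0 - 1.0              # rescale to [-1, 1], matching a tanh generator output
```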
79. XX Norm
One label per mini-batch
Batch norm, layer norm, instance norm, or batch renorm ...
Avoid sparse gradients: ReLU, MaxPool
The stability of the GAN game suffers if you have sparse gradients
LeakyReLU = good (in both G and D)
For downsampling, use: average pooling, strided convolutions
For upsampling, use: Conv_transpose, PixelShuffle
80. Use Soft and Noisy Labels
real : 1 → 0.7 ~ 1.2
fake : 0 → 0.0 ~ 0.3
Flip the labels for the discriminator occasionally (see the sketch after this slide)
ADAM is Good
SGD for D, ADAM for G
If you have labels, use them
Go to the Conditional GAN
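A small sketch of the soft/noisy-label trick (PyTorch assumed; the ranges follow the slide, while the 5% flip probability is an assumption):

```python
import torch

batch = 16

# Soft labels: real in [0.7, 1.2), fake in [0.0, 0.3), instead of hard 1 / 0
real_labels = 0.7 + 0.5 * torch.rand(batch)
fake_labels = 0.3 * torch.rand(batch)

# Occasionally flip the labels fed to the discriminator
if torch.rand(1).item() < 0.05:
    real_labels, fake_labels = fake_labels, real_labels
```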
81. Add noise to inputs, decay over time
Add some artificial noise to the inputs of D
Add Gaussian noise to every layer of G
Use dropout in G in both the train and test phases
Provide the noise in the form of dropout
Apply it on several layers of G at both training and test time