Memo: Backpropagation in Convolutional Neural Network

Hiroshi Kuwajima
Created: 13-03-2014
Revised: 14-08-2014
Note

- Purpose
  The purpose of this memo is to understand, and help recall, the backpropagation algorithm in Convolutional Neural Networks, based on a discussion with Prof. Masayuki Tanaka.

- Table of Contents
  In this memo, backpropagation algorithms in different neural networks are explained in the following order:
  - Single neuron
  - Multi-layer neural network
  - General cases
  - Convolution layer
  - Pooling layer
  - Convolutional Neural Network

- Notation
  This memo follows the notation in the UFLDL tutorial (http://ufldl.stanford.edu/tutorial).
Neural Network as a Composite Function

A neural network is decomposed into a composite function in which each function element corresponds to a differentiable operation.

- Single neuron (the simplest neural network) example
  A single neuron is decomposed into a composite function of an affine function element parameterized by W and b, and an activation function element f, which we choose to be the sigmoid function. The derivatives of both the affine and sigmoid function elements w.r.t. both inputs and parameters are known. Note that the sigmoid function has neither parameters nor derivatives w.r.t. parameters. The sigmoid function is applied element-wise. '•' denotes the Hadamard product, i.e. the element-wise product.

  h_{W,b}(x) = f(W^T x + b) = sigmoid(affine_{W,b}(x)) = (sigmoid ∘ affine_{W,b})(x)

  ∂a/∂z = a • (1 − a), where a = h_{W,b}(x) = sigmoid(z) = 1 / (1 + exp(−z))

  ∂z/∂x = W,  ∂z/∂W = x,  ∂z/∂b = I, where z = affine_{W,b}(x) = W^T x + b and I is the identity matrix

  [Figure: decomposition of a neuron. The standard network representation maps inputs x1, x2, x3, +1 directly to h_{W,b}(x); the composite function representation splits this into an Affine element producing z, followed by an Activation element (e.g. sigmoid) producing a.]
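To make the decomposition concrete, here is a minimal NumPy sketch of the single-neuron forward pass and the local derivative ∂a/∂z. The values of x, W, and b are illustrative assumptions, not from the memo.

```python
import numpy as np

def affine(W, b, x):
    return W @ x + b                 # z = W^T x + b (W is a vector here)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])       # inputs x1, x2, x3 (toy values)
W = np.array([0.1, 0.2, -0.3])       # single neuron: W is a weight vector
b = 0.05

z = affine(W, b, x)
a = sigmoid(z)                       # a = h_{W,b}(x)
da_dz = a * (1 - a)                  # ∂a/∂z = a • (1 − a)
```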
Chain Rule of Error Signals and Gradients

Error signals are defined as the derivatives of any cost function J, which we choose to be the square error. Error signals are computed (propagated backward) by the chain rule of derivatives and are useful for computing the gradient of the cost function.

- Single neuron example
  Suppose we have m labeled training examples {(x^(1), y^(1)), …, (x^(m), y^(m))}. The square error cost function for each example is as follows; the overall cost function is the summation of the cost functions over all examples.

  J(W,b; x, y) = (1/2) ||y − h_{W,b}(x)||^2

  Error signals of the square error cost function for each example are propagated using the derivatives of the function elements w.r.t. inputs:

  δ^(a) = ∂J(W,b; x, y)/∂a = −(y − a)
  δ^(z) = ∂J(W,b; x, y)/∂z = (∂J/∂a)(∂a/∂z) = δ^(a) • a • (1 − a)

  The gradient of the cost function w.r.t. parameters for each example is computed using the error signals and the derivatives of the function elements w.r.t. parameters; summing the gradients over all examples gives the overall gradient.

  ∇_W J(W,b; x, y) = (∂J/∂z)(∂z/∂W) = δ^(z) x^T
  ∇_b J(W,b; x, y) = (∂J/∂z)(∂z/∂b) = δ^(z)
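The full chain for one example (x, y) can be sketched as follows; all values are illustrative assumptions, and the formulas follow the slide above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])       # toy example input
y = 1.0                              # toy label
W = np.array([0.1, 0.2, -0.3])
b = 0.05

# Forward pass
z = W @ x + b
a = sigmoid(z)

# Error signals for J = 1/2 (y - a)^2
delta_a = -(y - a)                   # δ^(a) = −(y − a)
delta_z = delta_a * a * (1 - a)      # δ^(z) = δ^(a) • a • (1 − a)

# Gradients for this example
grad_W = delta_z * x                 # ∇_W J = δ^(z) x^T (δ^(z) is scalar here)
grad_b = delta_z                     # ∇_b J = δ^(z)
```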
Decomposition of Multi-Layer Neural Network

- Composite function representation of a multi-layer neural network

  h_{W,b}(x) = (sigmoid ∘ affine_{W^(2),b^(2)} ∘ sigmoid ∘ affine_{W^(1),b^(1)})(x)

  a^(1) = x,  a^(lmax) = h_{W,b}(x)

- Derivatives of function elements w.r.t. inputs and parameters

  ∂a^(l+1)/∂z^(l+1) = a^(l+1) • (1 − a^(l+1)), where a^(l+1) = sigmoid(z^(l+1)) = 1 / (1 + exp(−z^(l+1)))

  ∂z^(l+1)/∂a^(l) = W^(l),  ∂z^(l+1)/∂W^(l) = a^(l),  ∂z^(l+1)/∂b^(l) = I, where z^(l+1) = W^(l) a^(l) + b^(l)

  [Figure: decomposition of a two-layer network. The standard representation maps inputs x1, x2, x3, +1 through Layer 1 activations a1^(2), a2^(2), a3^(2) and Layer 2 to h_{W,b}(x); the composite representation chains Affine 1 → z^(2), Sigmoid 1 → a^(2), Affine 2 → z^(3), Sigmoid 2 → a^(3) = h_{W,b}(x), with a^(1) = x.]
Error Signals and Gradients in Multi-Layer NN

- Error signals of the square error cost function for each example

  δ^(a^(l)) = ∂J(W,b; x, y)/∂a^(l) = −(y − a^(l)) for l = lmax; otherwise (∂J/∂z^(l+1))(∂z^(l+1)/∂a^(l)) = (W^(l))^T δ^(z^(l+1))

  δ^(z^(l)) = ∂J(W,b; x, y)/∂z^(l) = (∂J/∂a^(l))(∂a^(l)/∂z^(l)) = δ^(a^(l)) • a^(l) • (1 − a^(l))

- Gradient of the cost function w.r.t. parameters for each example

  ∇_{W^(l)} J(W,b; x, y) = (∂J/∂z^(l+1))(∂z^(l+1)/∂W^(l)) = δ^(z^(l+1)) (a^(l))^T
  ∇_{b^(l)} J(W,b; x, y) = (∂J/∂z^(l+1))(∂z^(l+1)/∂b^(l)) = δ^(z^(l+1))
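As a concrete illustration, here is a minimal NumPy sketch of these forward and backward passes for one example. The layer sizes, random initialization, and target are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                                  # widths of layers 1..lmax (toy)
Ws = [rng.normal(size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
x, y = rng.normal(size=sizes[0]), np.array([1.0, 0.0])

# Forward: a^(1) = x, a^(l+1) = sigmoid(W^(l) a^(l) + b^(l))
a = [x]
for W, b in zip(Ws, bs):
    a.append(sigmoid(W @ a[-1] + b))

# Backward: start from δ^(a^(lmax)), then alternate the two chain-rule steps
delta_a = -(y - a[-1])
grads = []
for l in reversed(range(len(Ws))):
    delta_z = delta_a * a[l + 1] * (1 - a[l + 1])  # δ^(z^(l+1))
    grads.append((np.outer(delta_z, a[l]),         # ∇_{W^(l)} J = δ^(z^(l+1)) (a^(l))^T
                  delta_z))                        # ∇_{b^(l)} J = δ^(z^(l+1))
    delta_a = Ws[l].T @ delta_z                    # δ^(a^(l)) = (W^(l))^T δ^(z^(l+1))
grads.reverse()
```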
Backpropagation in General Cases

1. Decompose the operations in the layers of a neural network into function elements whose derivatives w.r.t. inputs are known by symbolic computation.
2. Backpropagate the error signals corresponding to a differentiable cost function by numerical computation (starting from the cost function, plug in the error signals backward).
3. Use the backpropagated error signals to compute gradients w.r.t. parameters, only for the function elements with parameters, whose derivatives w.r.t. parameters are known by symbolic computation.
4. Sum the gradients over all examples to get the overall gradient (a sketch of these steps follows the formulas below).

h_θ(x) = (f^(lmax) ∘ … ∘ f^(l)_{θ^(l)} ∘ … ∘ f^(2)_{θ^(2)} ∘ f^(1))(x), where f^(1) = x, f^(lmax) = h_θ(x), and ∀l: ∂f^(l+1)/∂f^(l) is known

δ^(l) = ∂J(θ; x, y)/∂f^(l) = (∂J/∂f^(l+1))(∂f^(l+1)/∂f^(l)) = δ^(l+1) ∂f^(l+1)/∂f^(l), where ∂J/∂f^(lmax) is known

∇_{θ^(l)} J(θ; x, y) = ∂J(θ; x, y)/∂θ^(l) = (∂J/∂f^(l))(∂f^(l)_{θ^(l)}/∂θ^(l)) = δ^(l) ∂f^(l)_{θ^(l)}/∂θ^(l), where ∂f^(l)_{θ^(l)}/∂θ^(l) is known

∇_{θ^(l)} J(θ) = Σ_{i=1}^{m} ∇_{θ^(l)} J(θ; x^(i), y^(i))
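The four-step recipe can be sketched as a small layer abstraction: each function element exposes its forward map, its derivative w.r.t. inputs, and (if it has parameters) its derivative w.r.t. parameters. The class and method names here are hypothetical, not from the memo or any library.

```python
import numpy as np

class Sigmoid:
    def forward(self, x):
        self.a = 1.0 / (1.0 + np.exp(-x))
        return self.a
    def backward(self, delta):               # δ^(l) = δ^(l+1) ∂f^(l+1)/∂f^(l)
        return delta * self.a * (1 - self.a)
    def grads(self, delta):
        return ()                            # no parameters

class Affine:
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, x):
        self.x = x
        return self.W @ x + self.b
    def backward(self, delta):
        return self.W.T @ delta
    def grads(self, delta):                  # ∇_θ J = δ ∂f_θ/∂θ
        return (np.outer(delta, self.x), delta)

def backprop(layers, x, y):
    for layer in layers:                     # step 1: forward through elements
        x = layer.forward(x)
    delta = -(y - x)                         # ∂J/∂f^(lmax) for the square error
    grads = []
    for layer in reversed(layers):           # steps 2-3: propagate, collect grads
        grads.append(layer.grads(delta))
        delta = layer.backward(delta)
    return list(reversed(grads))             # step 4: sum these over all examples

# Usage: the two-layer network of the previous slides
rng = np.random.default_rng(0)
layers = [Affine(rng.normal(size=(4, 3)), np.zeros(4)), Sigmoid(),
          Affine(rng.normal(size=(2, 4)), np.zeros(2)), Sigmoid()]
grads = backprop(layers, rng.normal(size=3), np.array([1.0, 0.0]))
```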
Convolutional Neural Network

A convolution-pooling layer in a Convolutional Neural Network is a composite function decomposed into the function elements f^(conv), f^(sigm), and f^(pool). Let x be the output from the previous layer. The sigmoid nonlinearity is optional.

(f^(pool) ∘ f^(sigm) ∘ f^(conv)_w)(x)

[Figure: forward propagation runs x → Convolution → Sigmoid → Pooling; backward propagation runs through the same elements in reverse.]
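As a forward-only sketch, the composite can be written directly. The helper names (conv_valid, mean_pool) are illustrative assumptions; each element is detailed on the following slides.

```python
import numpy as np

def conv_valid(x, w):
    # y_n = w^T x_{n:n+|w|-1} (valid convolution, see next slide)
    return np.array([w @ x[n:n + len(w)] for n in range(len(x) - len(w) + 1)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # optional nonlinearity

def mean_pool(a, m):
    return a[:len(a) // m * m].reshape(-1, m).mean(axis=1)

x = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 2.0, 0.0])   # toy previous-layer output
w = np.array([0.3, -0.2])                             # toy feature
out = mean_pool(sigmoid(conv_valid(x, w)), m=2)       # (f^(pool) ∘ f^(sigm) ∘ f^(conv)_w)(x)
```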
Derivatives of Convolution

- Discrete convolution parameterized by a feature w, and its derivatives
  Let x be the input and y be the output of the convolution layer. Here we focus on only one feature vector w, although a convolution layer usually has multiple features W = [w1 w2 … wn]. n indexes x and y, where 1 ≤ n ≤ |x| for x_n and 1 ≤ n ≤ |y| = |x| − |w| + 1 for y_n. i indexes w, where 1 ≤ i ≤ |w|. (f ∗ g)[n] denotes the n-th element of f ∗ g.

  y = x ∗ w = [y_n],  y_n = (x ∗ w)[n] = Σ_{i=1}^{|w|} x_{n+i−1} w_i = w^T x_{n:n+|w|−1}

  ∂y_{n−i+1}/∂x_n = w_i,  ∂y_n/∂w_i = x_{n+i−1}, for 1 ≤ i ≤ |w|

  From the standpoint of a fixed x_n, x_n has outgoing connections to y_{n−|w|+1:n}, i.e., all of y_{n−|w|+1:n} have derivatives w.r.t. x_n. Note that the y and w indices run in reverse order. Conversely, y_n has incoming connections from x_{n:n+|w|−1}.

  [Figure: a convolution window of width |w| slides over x; x_n feeds y_{n−|w|+1}, …, y_n through w_|w|, …, w_1, while y_n collects x_n, …, x_{n+|w|−1} through w_1, …, w_|w|.]
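Here is a minimal NumPy sketch of this valid convolution, with a finite-difference check of ∂y_n/∂w_i = x_{n+i−1}; the toy values of x and w are assumptions for illustration.

```python
import numpy as np

def conv_valid(x, w):
    n_out = len(x) - len(w) + 1          # |y| = |x| - |w| + 1
    return np.array([w @ x[n:n + len(w)] for n in range(n_out)])

x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
w = np.array([0.2, -0.1, 0.4])
y = conv_valid(x, w)                     # y_n = sum_i x_{n+i-1} w_i

# Perturbing w_1 moves every y_n by eps * x_n, confirming ∂y_n/∂w_1 = x_n
eps = 1e-6
w2 = w.copy(); w2[0] += eps
print((conv_valid(x, w2) - y) / eps)     # ≈ [1.0, 2.0, -1.0] = x_{1:|y|}
```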
Backpropagation in Convolution Layer

The error signals and the gradient for each example are computed by convolution, using the commutativity property of convolution and the multivariable chain rule of derivatives. Let us focus on single elements of the error signals and of the gradient w.r.t. w.

δ^(x)_n = ∂J/∂x_n = (∂J/∂y)(∂y/∂x_n) = Σ_{i=1}^{|w|} (∂J/∂y_{n−i+1})(∂y_{n−i+1}/∂x_n) = Σ_{i=1}^{|w|} δ^(y)_{n−i+1} w_i = (δ^(y) ∗ flip(w))[n]

δ^(x) = [δ^(x)_n] = δ^(y) ∗ flip(w)   (full convolution; note the reverse-order linear combination)

∂J/∂w_i = (∂J/∂y)(∂y/∂w_i) = Σ_{n=1}^{|x|−|w|+1} (∂J/∂y_n)(∂y_n/∂w_i) = Σ_{n=1}^{|x|−|w|+1} δ^(y)_n x_{n+i−1} = (δ^(y) ∗ x)[i]

∂J/∂w = [∂J/∂w_i] = δ^(y) ∗ x = x ∗ δ^(y)   (valid convolution)

[Figure: forward propagation computes x ∗ w = y as a valid convolution over windows of width |w|; backward propagation computes δ^(x) = flip(w) ∗ δ^(y) as a full convolution; the gradient computation x ∗ δ^(y) = ∂J/∂w is a valid convolution over windows of width |y|.]
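The following sketch realizes both formulas with NumPy and checks the gradient numerically. The toy cost (square error against an all-ones target) and all values are illustrative assumptions. Note that np.convolve computes a true (flipped) convolution, so it directly realizes δ^(y) ∗ flip(w) under the memo's correlation-style ∗.

```python
import numpy as np

def conv_valid(x, w):
    return np.array([w @ x[n:n + len(w)] for n in range(len(x) - len(w) + 1)])

x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
w = np.array([0.2, -0.1, 0.4])
y = conv_valid(x, w)
delta_y = y - 1.0                        # δ^(y) for toy cost J = 1/2 ||y - 1||^2

# δ^(x) = δ^(y) * flip(w): full convolution (length |y| + |w| - 1 = |x|)
delta_x = np.convolve(delta_y, w, mode='full')

# ∂J/∂w = x * δ^(y): valid correlation under the memo's indexing (length |w|)
grad_w = np.correlate(x, delta_y, mode='valid')

# Finite-difference check on w_1
eps = 1e-6
J = lambda w_: 0.5 * np.sum((conv_valid(x, w_) - 1.0) ** 2)
w2 = w.copy(); w2[0] += eps
print(grad_w[0], (J(w2) - J(w)) / eps)   # the two values should match
```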
Derivatives of Pooling

The pooling layer subsamples statistics to obtain summary statistics with any aggregate function (or filter) g whose input is a vector and whose output is a scalar. Subsampling is an operation like convolution; however, g is applied to disjoint (non-overlapping) regions.

- Definition: subsample (or downsample)
  Let m be the size of the pooling region, x be the input, and y be the output of the pooling layer. subsample(x, g)[n] denotes the n-th element of subsample(x, g).

  y = subsample(x, g) = [y_n],  y_n = subsample(x, g)[n] = g(x_{(n−1)m+1:nm})

  Typical choices of g and their derivatives:

  Mean pooling: g(x) = (Σ_{k=1}^{m} x_k) / m,  ∂g/∂x = 1/m
  Max pooling:  g(x) = max(x),  ∂g/∂x_i = 1 if x_i = max(x), 0 otherwise
  Lp pooling:   g(x) = ||x||_p = (Σ_{k=1}^{m} x_k^p)^{1/p},  ∂g/∂x_i = (Σ_{k=1}^{m} x_k^p)^{1/p−1} x_i^{p−1}
  or any other differentiable R^m → R function

  [Figure: the aggregate function g maps each disjoint region x_{(n−1)m+1:nm} of size m to a single output y_n.]
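A minimal sketch of subsample with mean and max pooling as the aggregate function g; the input vector and region size are illustrative assumptions.

```python
import numpy as np

def subsample(x, g, m):
    # Apply g to disjoint regions of size m (any trailing remainder is dropped)
    return np.array([g(x[n * m:(n + 1) * m]) for n in range(len(x) // m)])

x = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])
print(subsample(x, np.mean, 2))   # mean pooling: [2.0, 5.0, 4.5]
print(subsample(x, np.max, 2))    # max pooling:  [3.0, 8.0, 5.0]
```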
Backpropagation in Pooling Layer

The error signals for each example are computed by upsampling. Upsampling is an operation which backpropagates (distributes) the error signals over the aggregate function g, using its derivatives g′_n = ∂g/∂x_{(n−1)m+1:nm}. g′_n can change depending on the pooling region n.

- In max pooling, the unit which was the max at forward propagation receives all the error at backward propagation, and that unit differs from region to region.

- Definition: upsample
  upsample(δ^(y), g′)[n] denotes the n-th element of upsample(δ^(y), g′).

  δ^(x)_{(n−1)m+1:nm} = upsample(δ^(y), g′)[n] = δ^(y)_n g′_n = δ^(y)_n ∂g/∂x_{(n−1)m+1:nm} = (∂J/∂y_n)(∂y_n/∂x_{(n−1)m+1:nm}) = ∂J/∂x_{(n−1)m+1:nm}

  δ^(x) = upsample(δ^(y), g′) = [δ^(x)_{(n−1)m+1:nm}]

  [Figure: forward propagation (subsampling) maps each region x_{(n−1)m+1:nm} through g to y_n; backward propagation (upsampling) distributes δ^(y)_n back over δ^(x)_{(n−1)m+1:nm} through ∂g/∂x.]
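A minimal sketch of upsample for mean and max pooling, distributing each δ^(y)_n over its pooling region through g′_n; function names and values are illustrative assumptions.

```python
import numpy as np

def upsample_mean(delta_y, m):
    # Mean pooling: g'_n = 1/m for every unit in the region
    return np.repeat(delta_y, m) / m

def upsample_max(delta_y, x, m):
    # Max pooling: g'_n is 1 at the argmax of region n, 0 elsewhere
    delta_x = np.zeros_like(x)
    for n, d in enumerate(delta_y):
        region = x[n * m:(n + 1) * m]
        delta_x[n * m + np.argmax(region)] = d
    return delta_x

x = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])
delta_y = np.array([0.6, -0.2, 1.0])
print(upsample_mean(delta_y, 2))    # [0.3, 0.3, -0.1, -0.1, 0.5, 0.5]
print(upsample_max(delta_y, x, 2))  # [0.0, 0.6, 0.0, -0.2, 1.0, 0.0]
```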
Remarks

- References
  - UFLDL Tutorial, http://ufldl.stanford.edu/tutorial
  - Chain Rule of Neural Network is Error Back Propagation, http://like.silk.to/studymemo/ChainRuleNeuralNetwork.pdf

- Acknowledgement
  This memo was written thanks to a good discussion with Prof. Masayuki Tanaka.