Memo: Backpropagation in Convolutional Neural Network

Hiroshi Kuwajima
Created: 13-03-2014
Revised: 14-08-2014
Note

- Purpose
  The purpose of this memo is to understand, and help recall, the backpropagation algorithm in Convolutional Neural Networks, based on a discussion with Prof. Masayuki Tanaka.

- Table of Contents
  In this memo, backpropagation algorithms in different neural networks are explained in the following order:
  - Single neuron
  - Multi-layer neural network
  - General cases
  - Convolution layer
  - Pooling layer
  - Convolutional Neural Network

- Notation
  This memo follows the notation in the UFLDL tutorial (http://ufldl.stanford.edu/tutorial).
Neural Network as a Composite Function

A neural network is decomposed into a composite function in which each function element corresponds to a differentiable operation.

- Single neuron (the simplest neural network) example
  A single neuron is decomposed into a composite function of an affine function element parameterized by W and b, and an activation function element f, which we choose to be the sigmoid function. The derivatives of both the affine and sigmoid function elements w.r.t. both inputs and parameters are known. Note that the sigmoid function has neither parameters nor derivatives w.r.t. parameters. The sigmoid function is applied element-wise. '•' denotes the Hadamard product, i.e. the element-wise product.

  h_{W,b}(x) = f(W^T x + b) = sigmoid(affine_{W,b}(x)) = (sigmoid ∘ affine_{W,b})(x)

  ∂a/∂z = a • (1 − a), where a = h_{W,b}(x) = sigmoid(z) = 1 / (1 + exp(−z))

  ∂z/∂x = W,  ∂z/∂W = x,  ∂z/∂b = I, where z = affine_{W,b}(x) = W^T x + b and I is the identity matrix

  [Figure: decomposition of a neuron. The standard network representation maps inputs x1, x2, x3, +1 directly to h_{W,b}(x); the composite function representation splits this into an Affine element producing z, followed by an Activation element (e.g. sigmoid) producing a.]
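To make the decomposition concrete, here is a minimal NumPy sketch of the single-neuron forward pass and the local derivative ∂a/∂z. The values of x, W, and b are illustrative assumptions, not from the memo.

```python
import numpy as np

def affine(W, b, x):
    return W @ x + b                 # z = W^T x + b (W is a vector here)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])       # inputs x1, x2, x3 (toy values)
W = np.array([0.1, 0.2, -0.3])       # single neuron: W is a weight vector
b = 0.05

z = affine(W, b, x)
a = sigmoid(z)                       # a = h_{W,b}(x)
da_dz = a * (1 - a)                  # ∂a/∂z = a • (1 − a)
```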
Chain Rule of Error Signals and Gradients

Error signals are defined as the derivatives of any cost function J, which we choose to be the square error. Error signals are computed (propagated backward) by the chain rule of derivatives and are useful for computing the gradient of the cost function.

- Single neuron example
  Suppose we have m labeled training examples {(x^(1), y^(1)), …, (x^(m), y^(m))}. The square error cost function for each example is as follows; the overall cost function is the summation of the cost functions over all examples.

  J(W,b; x, y) = (1/2) ||y − h_{W,b}(x)||^2

  Error signals of the square error cost function for each example are propagated using the derivatives of the function elements w.r.t. inputs:

  δ^(a) = ∂J(W,b; x, y)/∂a = −(y − a)
  δ^(z) = ∂J(W,b; x, y)/∂z = (∂J/∂a)(∂a/∂z) = δ^(a) • a • (1 − a)

  The gradient of the cost function w.r.t. parameters for each example is computed using the error signals and the derivatives of the function elements w.r.t. parameters; summing the gradients over all examples gives the overall gradient.

  ∇_W J(W,b; x, y) = (∂J/∂z)(∂z/∂W) = δ^(z) x^T
  ∇_b J(W,b; x, y) = (∂J/∂z)(∂z/∂b) = δ^(z)
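The full chain for one example (x, y) can be sketched as follows; all values are illustrative assumptions, and the formulas follow the slide above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])       # toy example input
y = 1.0                              # toy label
W = np.array([0.1, 0.2, -0.3])
b = 0.05

# Forward pass
z = W @ x + b
a = sigmoid(z)

# Error signals for J = 1/2 (y - a)^2
delta_a = -(y - a)                   # δ^(a) = −(y − a)
delta_z = delta_a * a * (1 - a)      # δ^(z) = δ^(a) • a • (1 − a)

# Gradients for this example
grad_W = delta_z * x                 # ∇_W J = δ^(z) x^T (δ^(z) is scalar here)
grad_b = delta_z                     # ∇_b J = δ^(z)
```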
Decomposition of Multi-Layer Neural Network

- Composite function representation of a multi-layer neural network

  h_{W,b}(x) = (sigmoid ∘ affine_{W^(2),b^(2)} ∘ sigmoid ∘ affine_{W^(1),b^(1)})(x)

  a^(1) = x,  a^(lmax) = h_{W,b}(x)

- Derivatives of function elements w.r.t. inputs and parameters

  ∂a^(l+1)/∂z^(l+1) = a^(l+1) • (1 − a^(l+1)), where a^(l+1) = sigmoid(z^(l+1)) = 1 / (1 + exp(−z^(l+1)))

  ∂z^(l+1)/∂a^(l) = W^(l),  ∂z^(l+1)/∂W^(l) = a^(l),  ∂z^(l+1)/∂b^(l) = I, where z^(l+1) = W^(l) a^(l) + b^(l)

  [Figure: decomposition of a two-layer network. The standard representation maps inputs x1, x2, x3, +1 through Layer 1 activations a1^(2), a2^(2), a3^(2) and Layer 2 to h_{W,b}(x); the composite representation chains Affine 1 → z^(2), Sigmoid 1 → a^(2), Affine 2 → z^(3), Sigmoid 2 → a^(3) = h_{W,b}(x), with a^(1) = x.]
Error Signals and Gradients in Multi-Layer NN

- Error signals of the square error cost function for each example

  δ^(a^(l)) = ∂J(W,b; x, y)/∂a^(l) = −(y − a^(l)) for l = lmax; otherwise (∂J/∂z^(l+1))(∂z^(l+1)/∂a^(l)) = (W^(l))^T δ^(z^(l+1))

  δ^(z^(l)) = ∂J(W,b; x, y)/∂z^(l) = (∂J/∂a^(l))(∂a^(l)/∂z^(l)) = δ^(a^(l)) • a^(l) • (1 − a^(l))

- Gradient of the cost function w.r.t. parameters for each example

  ∇_{W^(l)} J(W,b; x, y) = (∂J/∂z^(l+1))(∂z^(l+1)/∂W^(l)) = δ^(z^(l+1)) (a^(l))^T
  ∇_{b^(l)} J(W,b; x, y) = (∂J/∂z^(l+1))(∂z^(l+1)/∂b^(l)) = δ^(z^(l+1))
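As a concrete illustration, here is a minimal NumPy sketch of these forward and backward passes for one example. The layer sizes, random initialization, and target are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                                  # widths of layers 1..lmax (toy)
Ws = [rng.normal(size=(n, m)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
x, y = rng.normal(size=sizes[0]), np.array([1.0, 0.0])

# Forward: a^(1) = x, a^(l+1) = sigmoid(W^(l) a^(l) + b^(l))
a = [x]
for W, b in zip(Ws, bs):
    a.append(sigmoid(W @ a[-1] + b))

# Backward: start from δ^(a^(lmax)), then alternate the two chain-rule steps
delta_a = -(y - a[-1])
grads = []
for l in reversed(range(len(Ws))):
    delta_z = delta_a * a[l + 1] * (1 - a[l + 1])  # δ^(z^(l+1))
    grads.append((np.outer(delta_z, a[l]),         # ∇_{W^(l)} J = δ^(z^(l+1)) (a^(l))^T
                  delta_z))                        # ∇_{b^(l)} J = δ^(z^(l+1))
    delta_a = Ws[l].T @ delta_z                    # δ^(a^(l)) = (W^(l))^T δ^(z^(l+1))
grads.reverse()
```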
Backpropagation in General Cases

1. Decompose the operations in the layers of a neural network into function elements whose derivatives w.r.t. inputs are known by symbolic computation.
2. Backpropagate the error signals corresponding to a differentiable cost function by numerical computation (starting from the cost function, plug in the error signals backward).
3. Use the backpropagated error signals to compute gradients w.r.t. parameters, only for the function elements with parameters, whose derivatives w.r.t. parameters are known by symbolic computation.
4. Sum the gradients over all examples to get the overall gradient (a sketch of these steps follows the formulas below).

h_θ(x) = (f^(lmax) ∘ … ∘ f^(l)_{θ^(l)} ∘ … ∘ f^(2)_{θ^(2)} ∘ f^(1))(x), where f^(1) = x, f^(lmax) = h_θ(x), and ∀l: ∂f^(l+1)/∂f^(l) is known

δ^(l) = ∂J(θ; x, y)/∂f^(l) = (∂J/∂f^(l+1))(∂f^(l+1)/∂f^(l)) = δ^(l+1) ∂f^(l+1)/∂f^(l), where ∂J/∂f^(lmax) is known

∇_{θ^(l)} J(θ; x, y) = ∂J(θ; x, y)/∂θ^(l) = (∂J/∂f^(l))(∂f^(l)_{θ^(l)}/∂θ^(l)) = δ^(l) ∂f^(l)_{θ^(l)}/∂θ^(l), where ∂f^(l)_{θ^(l)}/∂θ^(l) is known

∇_{θ^(l)} J(θ) = Σ_{i=1}^{m} ∇_{θ^(l)} J(θ; x^(i), y^(i))
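The four-step recipe can be sketched as a small layer abstraction: each function element exposes its forward map, its derivative w.r.t. inputs, and (if it has parameters) its derivative w.r.t. parameters. The class and method names here are hypothetical, not from the memo or any library.

```python
import numpy as np

class Sigmoid:
    def forward(self, x):
        self.a = 1.0 / (1.0 + np.exp(-x))
        return self.a
    def backward(self, delta):               # δ^(l) = δ^(l+1) ∂f^(l+1)/∂f^(l)
        return delta * self.a * (1 - self.a)
    def grads(self, delta):
        return ()                            # no parameters

class Affine:
    def __init__(self, W, b):
        self.W, self.b = W, b
    def forward(self, x):
        self.x = x
        return self.W @ x + self.b
    def backward(self, delta):
        return self.W.T @ delta
    def grads(self, delta):                  # ∇_θ J = δ ∂f_θ/∂θ
        return (np.outer(delta, self.x), delta)

def backprop(layers, x, y):
    for layer in layers:                     # step 1: forward through elements
        x = layer.forward(x)
    delta = -(y - x)                         # ∂J/∂f^(lmax) for the square error
    grads = []
    for layer in reversed(layers):           # steps 2-3: propagate, collect grads
        grads.append(layer.grads(delta))
        delta = layer.backward(delta)
    return list(reversed(grads))             # step 4: sum these over all examples

# Usage: the two-layer network of the previous slides
rng = np.random.default_rng(0)
layers = [Affine(rng.normal(size=(4, 3)), np.zeros(4)), Sigmoid(),
          Affine(rng.normal(size=(2, 4)), np.zeros(2)), Sigmoid()]
grads = backprop(layers, rng.normal(size=3), np.array([1.0, 0.0]))
```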
Convolutional Neural Network

A convolution-pooling layer in a Convolutional Neural Network is a composite function decomposed into the function elements f^(conv), f^(sigm), and f^(pool). Let x be the output from the previous layer. The sigmoid nonlinearity is optional.

(f^(pool) ∘ f^(sigm) ∘ f^(conv)_w)(x)

[Figure: forward propagation runs x → Convolution → Sigmoid → Pooling; backward propagation runs through the same elements in reverse.]
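As a forward-only sketch, the composite can be written directly. The helper names (conv_valid, mean_pool) are illustrative assumptions; each element is detailed on the following slides.

```python
import numpy as np

def conv_valid(x, w):
    # y_n = w^T x_{n:n+|w|-1} (valid convolution, see next slide)
    return np.array([w @ x[n:n + len(w)] for n in range(len(x) - len(w) + 1)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))     # optional nonlinearity

def mean_pool(a, m):
    return a[:len(a) // m * m].reshape(-1, m).mean(axis=1)

x = np.array([1.0, -2.0, 0.5, 3.0, -1.0, 2.0, 0.0])   # toy previous-layer output
w = np.array([0.3, -0.2])                             # toy feature
out = mean_pool(sigmoid(conv_valid(x, w)), m=2)       # (f^(pool) ∘ f^(sigm) ∘ f^(conv)_w)(x)
```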
Derivatives of Convolution

- Discrete convolution parameterized by a feature w, and its derivatives
  Let x be the input and y be the output of the convolution layer. Here we focus on only one feature vector w, although a convolution layer usually has multiple features W = [w1 w2 … wn]. n indexes x and y, where 1 ≤ n ≤ |x| for x_n and 1 ≤ n ≤ |y| = |x| − |w| + 1 for y_n. i indexes w, where 1 ≤ i ≤ |w|. (f ∗ g)[n] denotes the n-th element of f ∗ g.

  y = x ∗ w = [y_n],  y_n = (x ∗ w)[n] = Σ_{i=1}^{|w|} x_{n+i−1} w_i = w^T x_{n:n+|w|−1}

  ∂y_{n−i+1}/∂x_n = w_i,  ∂y_n/∂w_i = x_{n+i−1}, for 1 ≤ i ≤ |w|

  From the standpoint of a fixed x_n, x_n has outgoing connections to y_{n−|w|+1:n}, i.e., all of y_{n−|w|+1:n} have derivatives w.r.t. x_n. Note that the y and w indices run in reverse order. Conversely, y_n has incoming connections from x_{n:n+|w|−1}.

  [Figure: a convolution window of width |w| slides over x; x_n feeds y_{n−|w|+1}, …, y_n through w_|w|, …, w_1, while y_n collects x_n, …, x_{n+|w|−1} through w_1, …, w_|w|.]
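Here is a minimal NumPy sketch of this valid convolution, with a finite-difference check of ∂y_n/∂w_i = x_{n+i−1}; the toy values of x and w are assumptions for illustration.

```python
import numpy as np

def conv_valid(x, w):
    n_out = len(x) - len(w) + 1          # |y| = |x| - |w| + 1
    return np.array([w @ x[n:n + len(w)] for n in range(n_out)])

x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
w = np.array([0.2, -0.1, 0.4])
y = conv_valid(x, w)                     # y_n = sum_i x_{n+i-1} w_i

# Perturbing w_1 moves every y_n by eps * x_n, confirming ∂y_n/∂w_1 = x_n
eps = 1e-6
w2 = w.copy(); w2[0] += eps
print((conv_valid(x, w2) - y) / eps)     # ≈ [1.0, 2.0, -1.0] = x_{1:|y|}
```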
Backpropagation in Convolution Layer

The error signals and the gradient for each example are computed by convolution, using the commutativity property of convolution and the multivariable chain rule of derivatives. Let us focus on single elements of the error signals and of the gradient w.r.t. w.

δ^(x)_n = ∂J/∂x_n = (∂J/∂y)(∂y/∂x_n) = Σ_{i=1}^{|w|} (∂J/∂y_{n−i+1})(∂y_{n−i+1}/∂x_n) = Σ_{i=1}^{|w|} δ^(y)_{n−i+1} w_i = (δ^(y) ∗ flip(w))[n]

δ^(x) = [δ^(x)_n] = δ^(y) ∗ flip(w)   (full convolution; note the reverse-order linear combination)

∂J/∂w_i = (∂J/∂y)(∂y/∂w_i) = Σ_{n=1}^{|x|−|w|+1} (∂J/∂y_n)(∂y_n/∂w_i) = Σ_{n=1}^{|x|−|w|+1} δ^(y)_n x_{n+i−1} = (δ^(y) ∗ x)[i]

∂J/∂w = [∂J/∂w_i] = δ^(y) ∗ x = x ∗ δ^(y)   (valid convolution)

[Figure: forward propagation computes x ∗ w = y as a valid convolution over windows of width |w|; backward propagation computes δ^(x) = flip(w) ∗ δ^(y) as a full convolution; the gradient computation x ∗ δ^(y) = ∂J/∂w is a valid convolution over windows of width |y|.]
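The following sketch realizes both formulas with NumPy and checks the gradient numerically. The toy cost (square error against an all-ones target) and all values are illustrative assumptions. Note that np.convolve computes a true (flipped) convolution, so it directly realizes δ^(y) ∗ flip(w) under the memo's correlation-style ∗.

```python
import numpy as np

def conv_valid(x, w):
    return np.array([w @ x[n:n + len(w)] for n in range(len(x) - len(w) + 1)])

x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
w = np.array([0.2, -0.1, 0.4])
y = conv_valid(x, w)
delta_y = y - 1.0                        # δ^(y) for toy cost J = 1/2 ||y - 1||^2

# δ^(x) = δ^(y) * flip(w): full convolution (length |y| + |w| - 1 = |x|)
delta_x = np.convolve(delta_y, w, mode='full')

# ∂J/∂w = x * δ^(y): valid correlation under the memo's indexing (length |w|)
grad_w = np.correlate(x, delta_y, mode='valid')

# Finite-difference check on w_1
eps = 1e-6
J = lambda w_: 0.5 * np.sum((conv_valid(x, w_) - 1.0) ** 2)
w2 = w.copy(); w2[0] += eps
print(grad_w[0], (J(w2) - J(w)) / eps)   # the two values should match
```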
Derivatives of Pooling

The pooling layer subsamples statistics to obtain summary statistics with any aggregate function (or filter) g whose input is a vector and whose output is a scalar. Subsampling is an operation like convolution; however, g is applied to disjoint (non-overlapping) regions.

- Definition: subsample (or downsample)
  Let m be the size of the pooling region, x be the input, and y be the output of the pooling layer. subsample(x, g)[n] denotes the n-th element of subsample(x, g).

  y = subsample(x, g) = [y_n],  y_n = subsample(x, g)[n] = g(x_{(n−1)m+1:nm})

  Typical choices of g and their derivatives:

  Mean pooling: g(x) = (Σ_{k=1}^{m} x_k) / m,  ∂g/∂x = 1/m
  Max pooling:  g(x) = max(x),  ∂g/∂x_i = 1 if x_i = max(x), 0 otherwise
  Lp pooling:   g(x) = ||x||_p = (Σ_{k=1}^{m} x_k^p)^{1/p},  ∂g/∂x_i = (Σ_{k=1}^{m} x_k^p)^{1/p−1} x_i^{p−1}
  or any other differentiable R^m → R function

  [Figure: the aggregate function g maps each disjoint region x_{(n−1)m+1:nm} of size m to a single output y_n.]
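A minimal sketch of subsample with mean and max pooling as the aggregate function g; the input vector and region size are illustrative assumptions.

```python
import numpy as np

def subsample(x, g, m):
    # Apply g to disjoint regions of size m (any trailing remainder is dropped)
    return np.array([g(x[n * m:(n + 1) * m]) for n in range(len(x) // m)])

x = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])
print(subsample(x, np.mean, 2))   # mean pooling: [2.0, 5.0, 4.5]
print(subsample(x, np.max, 2))    # max pooling:  [3.0, 8.0, 5.0]
```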
Backpropagation in Pooling Layer

The error signals for each example are computed by upsampling. Upsampling is an operation which backpropagates (distributes) the error signals over the aggregate function g, using its derivatives g′_n = ∂g/∂x_{(n−1)m+1:nm}. g′_n can change depending on the pooling region n.

- In max pooling, the unit which was the max at forward propagation receives all the error at backward propagation, and that unit differs from region to region.

- Definition: upsample
  upsample(δ^(y), g′)[n] denotes the n-th element of upsample(δ^(y), g′).

  δ^(x)_{(n−1)m+1:nm} = upsample(δ^(y), g′)[n] = δ^(y)_n g′_n = δ^(y)_n ∂g/∂x_{(n−1)m+1:nm} = (∂J/∂y_n)(∂y_n/∂x_{(n−1)m+1:nm}) = ∂J/∂x_{(n−1)m+1:nm}

  δ^(x) = upsample(δ^(y), g′) = [δ^(x)_{(n−1)m+1:nm}]

  [Figure: forward propagation (subsampling) maps each region x_{(n−1)m+1:nm} through g to y_n; backward propagation (upsampling) distributes δ^(y)_n back over δ^(x)_{(n−1)m+1:nm} through ∂g/∂x.]
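A minimal sketch of upsample for mean and max pooling, distributing each δ^(y)_n over its pooling region through g′_n; function names and values are illustrative assumptions.

```python
import numpy as np

def upsample_mean(delta_y, m):
    # Mean pooling: g'_n = 1/m for every unit in the region
    return np.repeat(delta_y, m) / m

def upsample_max(delta_y, x, m):
    # Max pooling: g'_n is 1 at the argmax of region n, 0 elsewhere
    delta_x = np.zeros_like(x)
    for n, d in enumerate(delta_y):
        region = x[n * m:(n + 1) * m]
        delta_x[n * m + np.argmax(region)] = d
    return delta_x

x = np.array([1.0, 3.0, 2.0, 8.0, 5.0, 4.0])
delta_y = np.array([0.6, -0.2, 1.0])
print(upsample_mean(delta_y, 2))    # [0.3, 0.3, -0.1, -0.1, 0.5, 0.5]
print(upsample_max(delta_y, x, 2))  # [0.0, 0.6, 0.0, -0.2, 1.0, 0.0]
```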
Remarks

- References
  - UFLDL Tutorial, http://ufldl.stanford.edu/tutorial
  - Chain Rule of Neural Network is Error Back Propagation, http://like.silk.to/studymemo/ChainRuleNeuralNetwork.pdf

- Acknowledgement
  This memo was written thanks to a good discussion with Prof. Masayuki Tanaka.