Memo: Backpropagation in Convolutional Neural Network

Hiroshi Kuwajima
13-03-2014 Created
14-08-2014 Revised

1 / 14
2 / 14
Note

■ Purpose
The purpose of this memo is to understand, and serve as a reminder of, the backpropagation algorithm in Convolutional Neural Networks, based on a discussion with Prof. Masayuki Tanaka.

■ Table of Contents
In this memo, backpropagation algorithms in different neural networks are explained in the following order.

  ▫ Single neuron (p. 3)
  ▫ Multi-layer neural network (p. 5)
  ▫ General cases (p. 7)
  ▫ Convolution layer (p. 9)
  ▫ Pooling layer (p. 11)
  ▫ Convolutional Neural Network (p. 13)

■ Notation
This memo follows the notation in the UFLDL tutorial (http://ufldl.stanford.edu/tutorial).
3 / 14
Neural Network as a Composite Function

A neural network is decomposed into a composite function where each function element corresponds to a differentiable operation.

■ Single neuron (the simplest neural network) example
A single neuron is decomposed into a composite function of an affine function element parameterized by W and b, and an activation function element f, which we choose to be the sigmoid function.

  h_{W,b}(x) = f(W^T x + b) = sigmoid(affine_{W,b}(x)) = (sigmoid ∘ affine_{W,b})(x)

Derivatives of both the affine and sigmoid function elements w.r.t. both inputs and parameters are known. Note that the sigmoid function has neither parameters nor derivatives w.r.t. parameters. The sigmoid function is applied element-wise. '•' denotes the Hadamard (element-wise) product.

  ∂a/∂z = a • (1 − a)    where a = h_{W,b}(x) = sigmoid(z) = 1 / (1 + exp(−z))

  ∂z/∂x = W,   ∂z/∂W = x,   ∂z/∂b = I    where z = affine_{W,b}(x) = W^T x + b, and I is the identity matrix

(Figure: decomposition of a neuron. Standard network representation: inputs x1, x2, x3 and a bias unit +1 feed a single neuron producing h_{W,b}(x). Composite function representation: the same inputs pass through an Affine element (output z) and an Activation element, e.g. the sigmoid (output a = h_{W,b}(x)).)
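The decomposition above maps directly onto code. The following is a minimal NumPy sketch (not part of the original memo) of the affine and sigmoid function elements and their known local derivatives; the function names `affine` and `sigmoid` mirror the memo, while the shapes and random test values are assumptions for illustration.

```python
# Minimal sketch of the single-neuron decomposition: affine then sigmoid.
import numpy as np

def affine(W, b, x):
    # z = W^T x + b
    return W.T @ x + b

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # three inputs x1, x2, x3 (assumed)
W = rng.normal(size=(3, 1))   # one output unit (assumed shape)
b = rng.normal(size=1)

z = affine(W, b, x)           # affine function element
a = sigmoid(z)                # activation function element, h_{W,b}(x)

# Known local derivatives, used later by backpropagation:
da_dz = a * (1.0 - a)         # sigmoid derivative, element-wise
dz_dx = W                     # ∂z/∂x = W
dz_dW = x                     # ∂z/∂W = x
dz_db = np.eye(1)             # ∂z/∂b = I
```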
4 / 14
Chain Rule of Error Signals and Gradients

Error signals are defined as the derivatives of any cost function J, which we choose to be the square error. Error signals are computed (propagated backward) by the chain rule of derivatives and are useful for computing the gradient of the cost function.

■ Single neuron example
Suppose we have m labeled training examples {(x(1), y(1)), …, (x(m), y(m))}. The square-error cost function for each example is as follows; the overall cost function is the summation of the cost functions over all examples.

  J(W, b; x, y) = (1/2) ||y − h_{W,b}(x)||^2

Error signals of the square-error cost function for each example are propagated using derivatives of the function elements w.r.t. inputs.

  δ(a) = ∂J(W, b; x, y)/∂a = −(y − a)
  δ(z) = ∂J(W, b; x, y)/∂z = (∂J/∂a)(∂a/∂z) = δ(a) • a • (1 − a)

The gradient of the cost function w.r.t. parameters for each example is computed using the error signals and the derivatives of the function elements w.r.t. parameters. Summing the gradients over all examples gives the overall gradient.

  ∇_W J(W, b; x, y) = ∂J(W, b; x, y)/∂W = (∂J/∂z)(∂z/∂W) = δ(z) x^T
  ∇_b J(W, b; x, y) = ∂J(W, b; x, y)/∂b = (∂J/∂z)(∂z/∂b) = δ(z)
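Continuing the sketch from the previous slide, the error signals and gradients for the single neuron can be computed and checked numerically. This reuses `affine`, `sigmoid`, `x`, `W`, and `b` from the earlier block; the target `y` and the finite-difference check are additions under the memo's square-error cost.

```python
# Error signals and gradients for the single neuron, with a numerical check.
import numpy as np

def cost(W, b, x, y):
    a = sigmoid(affine(W, b, x))
    return 0.5 * np.sum((y - a) ** 2)

y = np.array([1.0])                       # assumed target for the single output

z = affine(W, b, x)
a = sigmoid(z)

delta_a = -(y - a)                        # δ(a) = ∂J/∂a
delta_z = delta_a * a * (1.0 - a)         # δ(z) = δ(a) • a • (1 − a)

grad_W = np.outer(x, delta_z)             # ∇_W J, arranged to match W's shape in this sketch
grad_b = delta_z                          # ∇_b J = δ(z)

# Finite-difference check of one entry of W.
eps = 1e-6
W_pert = W.copy(); W_pert[0, 0] += eps
num = (cost(W_pert, b, x, y) - cost(W, b, x, y)) / eps
assert abs(num - grad_W[0, 0]) < 1e-4
```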
5 / 14
Decomposition of Multi-Layer Neural Network

■ Composite function representation of a multi-layer neural network

  h_{W,b}(x) = (sigmoid ∘ affine_{W(2),b(2)} ∘ sigmoid ∘ affine_{W(1),b(1)})(x)
  a(1) = x,   a(lmax) = h_{W,b}(x)

■ Derivatives of function elements w.r.t. inputs and parameters

  ∂a(l+1)/∂z(l+1) = a(l+1) • (1 − a(l+1))    where a(l+1) = sigmoid(z(l+1)) = 1 / (1 + exp(−z(l+1)))

  ∂z(l+1)/∂a(l) = W(l),   ∂z(l+1)/∂W(l) = a(l),   ∂z(l+1)/∂b(l) = I    where z(l+1) = (W(l))^T a(l) + b(l)

(Figure: decomposition of a two-layer network. Standard network representation: inputs x1, x2, x3 and a bias unit +1 feed Layer 1, producing a1(2), a2(2), a3(2), which with another bias unit feed Layer 2, producing h_{W,b}(x). Composite function representation: Affine 1 produces z1(2), z2(2), z3(2); Sigmoid 1 produces a1(2), a2(2), a3(2); Affine 2 produces z1(3); Sigmoid 2 produces a1(3) = h_{W,b}(x).)
6 / 14
Error Signals and Gradients in Multi-Layer NN

■ Error signals of the square-error cost function for each example

  δ(a(l)) = ∂J(W, b; x, y)/∂a(l) = −(y − a(l))                                 for l = lmax
  δ(a(l)) = (∂J/∂z(l+1))(∂z(l+1)/∂a(l)) = (W(l))^T δ(z(l+1))                   otherwise

  δ(z(l)) = ∂J(W, b; x, y)/∂z(l) = (∂J/∂a(l))(∂a(l)/∂z(l)) = δ(a(l)) • a(l) • (1 − a(l))

■ Gradient of the cost function w.r.t. parameters for each example

  ∇_{W(l)} J(W, b; x, y) = ∂J(W, b; x, y)/∂W(l) = (∂J/∂z(l+1))(∂z(l+1)/∂W(l)) = δ(z(l+1)) (a(l))^T
  ∇_{b(l)} J(W, b; x, y) = ∂J(W, b; x, y)/∂b(l) = (∂J/∂z(l+1))(∂z(l+1)/∂b(l)) = δ(z(l+1))
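A compact NumPy sketch of the forward and backward passes for the two-layer network above may help fix the recursion. The weight convention z(l+1) = W(l)^T a(l) + b(l) follows the memo; the concrete layer sizes, random values, and array shapes are assumptions for illustration.

```python
# Forward and backward passes for a two-layer sigmoid network (toy sizes).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
sizes = [3, 3, 1]                              # a(1) = x, a(2), a(3) = h_{W,b}(x)
W = [rng.normal(size=(sizes[l], sizes[l + 1])) for l in range(2)]
b = [rng.normal(size=sizes[l + 1]) for l in range(2)]
x, y = rng.normal(size=3), np.array([0.5])     # assumed input and target

# Forward pass: a(1) = x, z(l+1) = W(l)^T a(l) + b(l), a(l+1) = sigmoid(z(l+1)).
a = [x]
for l in range(2):
    z = W[l].T @ a[l] + b[l]
    a.append(sigmoid(z))

# Backward pass: error signals first, then gradients.
delta_a = -(y - a[2])                          # δ(a) at the output layer
grads_W, grads_b = [None, None], [None, None]
for l in reversed(range(2)):
    delta_z = delta_a * a[l + 1] * (1.0 - a[l + 1])   # δ(z(l+1))
    grads_W[l] = np.outer(a[l], delta_z)              # matches W[l]'s shape in this sketch
    grads_b[l] = delta_z                              # ∇_{b(l)} J = δ(z(l+1))
    delta_a = W[l] @ delta_z                          # propagate δ back through the affine element
```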
7 / 14
Backpropagation in General Cases

1. Decompose the operations in the layers of a neural network into function elements whose derivatives w.r.t. inputs are known by symbolic computation.
2. Backpropagate error signals corresponding to a differentiable cost function by numerical computation (starting from the cost function, plug in error signals backward).
3. Use the backpropagated error signals to compute gradients w.r.t. parameters, only for the function elements with parameters, whose derivatives w.r.t. parameters are known by symbolic computation.
4. Sum the gradients over all examples to get the overall gradient.

  h_θ(x) = (f(lmax) ∘ … ∘ f(l)_{θ(l)} ∘ … ∘ f(2)_{θ(2)} ∘ f(1))(x)    where f(1) = x, f(lmax) = h_θ(x), and ∀l: ∂f(l+1)/∂f(l) is known

  δ(l) = ∂J(θ; x, y)/∂f(l) = (∂J/∂f(l+1))(∂f(l+1)/∂f(l)) = δ(l+1) ∂f(l+1)/∂f(l)    where ∂J/∂f(lmax) is known

  ∇_{θ(l)} J(θ; x, y) = ∂J(θ; x, y)/∂θ(l) = (∂J/∂f(l))(∂f(l)_{θ(l)}/∂θ(l)) = δ(l) ∂f(l)_{θ(l)}/∂θ(l)    where ∂f(l)_{θ(l)}/∂θ(l) is known

  ∇_{θ(l)} J(θ) = Σ_{i=1}^{m} ∇_{θ(l)} J(θ; x(i), y(i))
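The four-step recipe can be phrased as a tiny driver loop over function elements, each of which carries its own forward rule and known local derivatives. The sketch below is schematic and not from the memo: the `Affine` and `Sigmoid` element classes and the `backprop` driver are assumed, minimal stand-ins for the general f(l) elements.

```python
# Schematic backpropagation over a chain of function elements.
import numpy as np

class Sigmoid:
    params = None
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x)); return self.out
    def backward_input(self, delta):         # multiply δ by the known ∂f/∂input
        return delta * self.out * (1.0 - self.out)
    def backward_params(self, delta):
        return None                           # no parameters

class Affine:
    def __init__(self, W, b):
        self.W, self.b = W, b
        self.params = (W, b)
    def forward(self, x):
        self.x = x; return self.W.T @ x + self.b
    def backward_input(self, delta):
        return self.W @ delta                 # known ∂f/∂input contribution
    def backward_params(self, delta):
        return np.outer(self.x, delta), delta # ∇_W and ∇_b for this element

def backprop(elements, x, y):
    # Step 1: forward through the decomposed function elements.
    a = x
    for f in elements:
        a = f.forward(a)
    # Step 2: start from the (square-error) cost and plug error signals in backward.
    delta = -(y - a)
    grads = []
    for f in reversed(elements):
        # Step 3: gradients only for elements that have parameters.
        if f.params is not None:
            grads.append(f.backward_params(delta))
        delta = f.backward_input(delta)
    return grads                              # Step 4 (not shown): sum over all examples

rng = np.random.default_rng(4)
net = [Affine(rng.normal(size=(3, 2)), rng.normal(size=2)), Sigmoid(),
       Affine(rng.normal(size=(2, 1)), rng.normal(size=1)), Sigmoid()]
grads = backprop(net, rng.normal(size=3), np.array([1.0]))
```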
8 / 14
Convolutional Neural Network

A convolution-pooling layer in a Convolutional Neural Network is a composite function decomposed into the function elements f(conv), f(sigm), and f(pool). Let x be the output from the previous layer. The sigmoid nonlinearity is optional.

  (f(pool) ∘ f(sigm) ∘ f(conv)_w)(x)

(Figure: forward propagation runs x → Convolution → Sigmoid → Pooling; backward propagation traverses the same elements in reverse.)
9 / 14
Derivatives of Convolution

■ Discrete convolution parameterized by a feature w, and its derivatives
Let x be the input and y be the output of the convolution layer. Here we focus on only one feature vector w, although a convolution layer usually has multiple features W = [w1 w2 … wn]. n indexes x and y, where 1 ≤ n ≤ |x| for x_n and 1 ≤ n ≤ |y| = |x| − |w| + 1 for y_n. i indexes w, where 1 ≤ i ≤ |w|. (f∗g)[n] denotes the n-th element of f∗g.

  y = x ∗ w = [y_n],   y_n = (x ∗ w)[n] = Σ_{i=1}^{|w|} x_{n+i−1} w_i = w^T x_{n:n+|w|−1}

  ∂y_{n−i+1}/∂x_n = w_i,   ∂y_n/∂w_i = x_{n+i−1}    for 1 ≤ i ≤ |w|

(Figure: in the forward convolution, y_n has incoming connections from x_{n:n+|w|−1}. From the standpoint of a fixed x_n, x_n has outgoing connections to y_{n−|w|+1:n}, i.e., all of y_{n−|w|+1:n} have derivatives w.r.t. x_n. Note that the y and w indices run in reverse order.)
10 / 14
Backpropagation in Convolution Layer

Error signals and the gradient for each example are computed by convolution, using the commutativity property of convolution and the multivariable chain rule. Let us focus on single elements of the error signals and of the gradient w.r.t. w.

  δ(x)_n = ∂J/∂x_n = (∂J/∂y)(∂y/∂x_n) = Σ_{i=1}^{|w|} (∂J/∂y_{n−i+1})(∂y_{n−i+1}/∂x_n) = Σ_{i=1}^{|w|} δ(y)_{n−i+1} w_i = (δ(y) ∗ flip(w))[n]
  δ(x) = [δ(x)_n] = δ(y) ∗ flip(w)    (reverse-order linear combination; full convolution)

  ∂J/∂w_i = (∂J/∂y)(∂y/∂w_i) = Σ_{n=1}^{|x|−|w|+1} (∂J/∂y_n)(∂y_n/∂w_i) = Σ_{n=1}^{|x|−|w|+1} δ(y)_n x_{n+i−1} = (δ(y) ∗ x)[i]
  ∂J/∂w = [∂J/∂w_i] = δ(y) ∗ x = x ∗ δ(y)    (valid convolution)

(Figure: forward propagation computes y = x ∗ w as a valid convolution; backward propagation computes δ(x) = flip(w) ∗ δ(y) as a full convolution; the gradient computation obtains ∂J/∂w = x ∗ δ(y) as a valid convolution.)
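A short NumPy sketch can verify both formulas numerically against a square-error cost. Using `np.convolve` in full mode for δ(x) and `np.correlate` in valid mode for ∂J/∂w mirrors the memo's "full"/"valid" labels; the toy data, the target `t`, and the finite-difference checks are assumptions added here.

```python
# Backward pass of the 1-D convolution layer, checked by finite differences.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=7)
w = rng.normal(size=3)
t = rng.normal(size=5)                       # assumed target, |y| = |x| - |w| + 1

def forward(x, w):
    return np.correlate(x, w, mode='valid')  # memo's y = x ∗ w (no kernel flip)

def cost(x, w):
    y = forward(x, w)
    return 0.5 * np.sum((y - t) ** 2)

y = forward(x, w)
delta_y = y - t                              # δ(y) for the square-error cost

delta_x = np.convolve(delta_y, w, mode='full')     # δ(x) = δ(y) ∗ flip(w), |δ(x)| = |x|
grad_w  = np.correlate(x, delta_y, mode='valid')   # ∂J/∂w = x ∗ δ(y), |∂J/∂w| = |w|

# Finite-difference check of one entry each.
eps = 1e-6
xp = x.copy(); xp[3] += eps
wp = w.copy(); wp[1] += eps
assert np.isclose((cost(xp, w) - cost(x, w)) / eps, delta_x[3], atol=1e-4)
assert np.isclose((cost(x, wp) - cost(x, w)) / eps, grad_w[1], atol=1e-4)
```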
11 / 14
Derivatives of Pooling

The pooling layer subsamples statistics to obtain summary statistics with any aggregate function (or filter) g whose input is a vector and whose output is a scalar. Subsampling is an operation like convolution; however, g is applied to disjoint (non-overlapping) regions.

■ Definition: subsample (or downsample)
Let m be the size of the pooling region, x be the input, and y be the output of the pooling layer. subsample(f, g)[n] denotes the n-th element of subsample(f, g).

  y_n = subsample(x, g)[n] = g(x_{(n−1)m+1:nm}),   y = subsample(x, g) = [y_n]

Examples of g (or any other differentiable R^m → R function):

  mean pooling:  g(x) = (Σ_{k=1}^{m} x_k) / m,              ∂g/∂x = 1/m
  max pooling:   g(x) = max(x),                             ∂g/∂x_i = 1 if x_i = max(x), 0 otherwise
  Lp pooling:    g(x) = ||x||_p = (Σ_{k=1}^{m} x_k^p)^{1/p},   ∂g/∂x_i = (Σ_{k=1}^{m} x_k^p)^{1/p−1} x_i^{p−1}

(Figure: pooling applies g to each region of m consecutive inputs x_{(n−1)m+1:nm} to produce y_n.)
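The subsample operation and two of the aggregate functions above translate directly into a few lines of NumPy. The following sketch is illustrative only; the pooling-region size and the toy input are assumed.

```python
# Subsampling with mean and max pooling, and their derivatives.
import numpy as np

def subsample(x, g, m):
    # Apply g to disjoint regions of size m: y_n = g(x_{(n-1)m+1:nm}).
    return np.array([g(x[n * m:(n + 1) * m]) for n in range(len(x) // m)])

def mean_pool(region):
    return np.mean(region)

def mean_pool_grad(region):
    return np.full(region.shape, 1.0 / region.size)       # ∂g/∂x = 1/m

def max_pool(region):
    return np.max(region)

def max_pool_grad(region):
    return (region == np.max(region)).astype(float)       # 1 at the max, 0 elsewhere

m = 2
x = np.array([1.0, 4.0, 2.0, 2.0, 5.0, 3.0])
y_mean = subsample(x, mean_pool, m)    # [2.5, 2.0, 4.0]
y_max  = subsample(x, max_pool, m)     # [4.0, 2.0, 5.0]
```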
12 / 14
Backpropagation in Pooling Layer

Error signals for each example are computed by upsampling. Upsampling is an operation which backpropagates (distributes) the error signals over the aggregate function g using its derivatives g'_n = ∂g/∂x_{(n−1)m+1:nm}. g'_n can change depending on the pooling region n.

  ▫ In max pooling, the unit which was the max at forward propagation receives all of the error at backward propagation, and that unit differs depending on the region n.

■ Definition: upsample
upsample(f, g')[n] denotes the n-th element of upsample(f, g').

  δ(x)_{(n−1)m+1:nm} = upsample(δ(y), g')[n] = δ(y)_n g'_n = δ(y)_n ∂g/∂x_{(n−1)m+1:nm} = (∂J/∂y_n)(∂y_n/∂x_{(n−1)m+1:nm}) = ∂J/∂x_{(n−1)m+1:nm}

  δ(x) = upsample(δ(y), g') = [δ(x)_{(n−1)m+1:nm}]

(Figure: forward propagation (subsampling) maps each region x_{(n−1)m+1:nm} through g to y_n; backward propagation (upsampling) distributes δ(y)_n back over the same region through ∂g/∂x.)
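Upsampling is equally short in code: each δ(y)_n is spread over its pooling region through g'_n. The sketch below reuses `subsample`, `max_pool`, and `max_pool_grad` from the previous block; the toy target and the finite-difference check are additions.

```python
# Upsampling: distribute each error signal over its pooling region via g'.
import numpy as np

def upsample(delta_y, x, g_grad, m):
    # δ(x)_{(n-1)m+1:nm} = δ(y)_n * g'_n, computed region by region.
    delta_x = np.zeros_like(x)
    for n in range(len(delta_y)):
        delta_x[n * m:(n + 1) * m] = delta_y[n] * g_grad(x[n * m:(n + 1) * m])
    return delta_x

m = 2
x = np.array([1.0, 4.0, 2.0, 2.0, 5.0, 3.0])
t = np.array([0.0, 1.0, 2.0])                      # assumed target for a toy cost
y = subsample(x, max_pool, m)
delta_y = y - t                                    # δ(y) for J = 0.5 * Σ (y - t)^2
delta_x = upsample(delta_y, x, max_pool_grad, m)   # error routed to the max units

# Check one entry: x[1] was the max of its region, so ∂J/∂x[1] should equal δ(y)[0].
eps = 1e-6
xp = x.copy(); xp[1] += eps
num = (0.5 * np.sum((subsample(xp, max_pool, m) - t) ** 2)
       - 0.5 * np.sum((y - t) ** 2)) / eps
assert np.isclose(num, delta_x[1], atol=1e-4)
```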
13 / 14
Backpropagation in CNN (Summary)

1. Propagate error signals δ(pool) back through the sigmoid:
   δ(conv) = upsample(δ(pool), g') • f(sigm) • (1 − f(sigm))    (the last two factors are the derivative of the sigmoid)

2. Propagate error signals δ(conv) back to the previous layer (full convolution):
   δ(x) = δ(conv) ∗ flip(w)

3. Compute the gradient ∇_w J (valid convolution):
   x ∗ δ(conv) = ∇_w J

(Figure: the error signals flow backward through Pooling → Sigmoid → Convolution; δ(conv) is plugged into both the full convolution that yields δ(x) and the valid convolution that yields ∇_w J.)
14 / 14
Remarks

■ References
  ▫ UFLDL Tutorial, http://ufldl.stanford.edu/tutorial
  ▫ Chain Rule of Neural Network is Error Back Propagation, http://like.silk.to/studymemo/ChainRuleNeuralNetwork.pdf

■ Acknowledgement
This memo was written thanks to a good discussion with Prof. Masayuki Tanaka.
