# Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark and TensorFlow Meetup - 08-04-2016

Advanced Spark and TensorFlow Meetup 08-04-2016

Fundamental Algorithms of Neural Networks including Gradient Descent, Back Propagation, Auto Differentiation, Partial Derivatives, Chain Rule

### Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark and TensorFlow Meetup - 08-04-2016

1. Backprop, Gradient Descent, and Auto Differentiation. Sam Abrahams, Memdump LLC
2. Link to these slides: https://goo.gl/tKOvr7
3. YO! I am Sam Abrahams. I am a data scientist and engineer. You can find me on GitHub @samjabrahams. Buy my book: TensorFlow for Machine Intelligence
5. Gradient Descent Outline ▣ Problem: fit data ▣ Basic OLS linear regression ▣ Visualize the error curve and regression line ▣ Step through the changes, one at a time
6. Simple Start: Linear Regression (scatter plot of the data)
7. Simple Start: Linear Regression (ordinary least squares fit)
8. Simple Start: Linear Regression ▣ Want to find a model that can fit our data ▣ Could do it algebraically… ▣ BUT that doesn’t generalize well
9. Simple Start: Linear Regression ▣ Step back: what does ordinary linear regression try to do? ▣ Minimize the sum (or average) of squared errors ▣ How else could we minimize?
10. Gradient Descent ▣ Start with a random guess ▣ Use the derivative (the gradient, when dealing with multiple variables) to get the slope of the error curve ▣ Move our parameters so that we move down the error curve
11.-23. Single Variable Cost Curve (figure sequence): the cost J is plotted against a single weight W. A random guess puts us at some point on the curve; the partial derivative ∂J/∂W gives the slope there. Here ∂J/∂W < 0, so we move W to the right. Repeating the process (evaluate ∂J/∂W at each new point and step downhill) walks us toward the minimum of the cost curve.
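
To make that loop concrete, here is a minimal sketch of batch gradient descent for a one-variable least-squares fit. It is not from the slides; the toy data, learning rate, and variable names are illustrative assumptions.

```python
import numpy as np

# Toy data: y is roughly 3 * x plus noise (illustrative values, not from the slides)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 3.2, 5.9, 9.1, 11.8])

w = np.random.randn()       # random initial guess for the single weight
learning_rate = 0.01

for step in range(500):
    y_hat = w * x                    # model prediction
    error = y_hat - y
    cost = np.mean(error ** 2)       # J(w): mean squared error
    grad = 2 * np.mean(error * x)    # dJ/dw of the MSE cost
    w -= learning_rate * grad        # step downhill along the cost curve

print(w)  # ends up near the true slope (~3)
```
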
25. Gradient Descent Variants ▣ There are additional techniques that can help speed up (or otherwise improve) gradient descent ▣ The next slides describe some of these! ▣ More details (and some awesome visuals) here: article by Sebastian Ruder
26. Gradient Descent ▣ Get the true gradient with respect to all training examples ▣ One step = one epoch ▣ Slow and generally infeasible for large training sets
28. Stochastic Gradient Descent ▣ Basic idea: approximate the derivative using only one example ▣ “Online learning” ▣ Update the weights after each example
30. Mini-Batch Gradient Descent ▣ Similar idea to stochastic gradient descent ▣ Approximate the derivative with a sampled batch of examples ▣ Middle ground between “true” stochastic gradient descent and full gradient descent
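
As a rough illustration (not from the slides; the helper names and default batch size are assumptions), the three variants differ only in how many examples feed each weight update:

```python
import numpy as np

def gradient(w, x_batch, y_batch):
    """dJ/dw of mean squared error for the toy linear model y_hat = w * x."""
    error = w * x_batch - y_batch
    return 2 * np.mean(error * x_batch)

def minibatch_sgd(x, y, w=0.0, learning_rate=0.01, batch_size=2, epochs=100):
    n = len(x)
    for _ in range(epochs):
        order = np.random.permutation(n)           # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # batch_size=1 -> SGD, batch_size=n -> full batch
            w -= learning_rate * gradient(w, x[idx], y[idx])
    return w
```

With batch_size=1 this reduces to the slide's stochastic ("online") update; with batch_size=len(x) it is full-batch gradient descent.
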
32. Momentum ▣ Idea: if we see multiple gradients in a row pointing in the same direction, we should increase our learning rate ▣ Accumulate a “momentum” vector to speed up descent
33. Without Momentum (figure)
34. Momentum (figure)
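
The standard momentum update (a textbook formulation, not taken verbatim from the slides) accumulates a velocity vector v and applies it to the weights, with γ the momentum coefficient and η the learning rate:

$$
v_t = \gamma\, v_{t-1} + \eta\, \nabla_\theta J(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - v_t
$$
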
35. Nesterov Momentum ▣ Idea: before updating our weights, look ahead to where the accumulated momentum would take us ▣ Adjust our update based on that “future” position
36. Nesterov Momentum (figure; source: lecture by Geoffrey Hinton). Legend: momentum vector, gradient/correction, Nesterov steps, standard momentum steps
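
A common way to write the Nesterov update (again a textbook form, not from the slides) is to evaluate the gradient at the look-ahead point before applying the step:

$$
v_t = \gamma\, v_{t-1} + \eta\, \nabla_\theta J\!\big(\theta_{t-1} - \gamma\, v_{t-1}\big), \qquad \theta_t = \theta_{t-1} - v_t
$$
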
37. AdaGrad ▣ Idea: update individual weights differently depending on how frequently they change ▣ Keeps a running tally of previous updates for each weight, and divides new updates by a factor of the previous updates ▣ Downside: for long-running training, eventually all gradients diminish ▣ Paper on jmlr.org
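
In symbols (the standard per-weight form, not spelled out in the transcript), AdaGrad keeps an element-wise sum of squared gradients G and scales each step by its inverse square root; ε is a small constant for numerical stability:

$$
G_t = G_{t-1} + \big(\nabla_\theta J(\theta_{t-1})\big)^2, \qquad
\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{G_t} + \epsilon}\, \nabla_\theta J(\theta_{t-1})
$$

Because G only grows, the effective step size shrinks over time, which is exactly the downside noted on the slide.
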
39. Adam ▣ Adam expands on the concepts introduced with AdaDelta and RMSProp ▣ Uses both first and second moments of the gradient, decayed over time ▣ Paper on arxiv.org
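
For reference (the standard form from the Adam paper, not printed in the transcript), the update keeps exponentially decayed estimates of the gradient's first moment m and second moment v, corrects their bias, and scales the step:

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
$$
$$
\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \epsilon}
$$
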
40. 2. Forward & Back Propagation. The Chain Rule got the last laugh, high-school-you
41. Beyond OLS Regression ▣ Can’t do everything with linear regression! ▣ Nor polynomial… ▣ Why can’t we let the computer figure out how to model?
42. Neural Networks: Idea ▣ Chain together non-linear functions ▣ Have lots of parameters that can be adjusted ▣ These “weights” determine the model function
43. Feed-forward neural network (figure): input layer l(1) with x1, x2 and a bias unit +1; two hidden layers l(2) and l(3); output layer l(4). Weight matrices W(1), W(2), W(3) connect the layers, producing activations a(2), a(3), a(4) and the output ŷ
44.-56. Annotated network diagram, one element at a time. Legend (repeated on every slide): xi: input value; ŷ: output vector; +1: bias (constant) unit; a(l): activation vector for layer l; W(l): weight matrix for layer l; z(l): input into layer l; σ: sigmoid (logistic) function; SM: softmax function. Successive slides highlight layers 1 through 4, the bias units, the input, the weight matrices, the layer inputs z(l) = W(l-1) a(l-1) + b(l-1), the activation vectors, the sigmoid activations in the hidden layers, the softmax activation at the output layer, and the output ŷ
57. Forward Propagation: the input vector is passed into the network
58. Forward Propagation: the input is multiplied by the W(1) weight matrix and the layer 1 biases are added to calculate z(2) = W(1) x + b(1)
59. Forward Propagation: the activation value for the second layer is calculated by passing z(2) through some function, in this case the sigmoid: a(2) = σ(z(2))
60. Forward Propagation: z(3) is calculated by multiplying the a(2) vector by the W(2) weight matrix and adding the layer 2 biases: z(3) = W(2) a(2) + b(2)
61. Forward Propagation: as in the previous layer, a(3) is calculated by passing z(3) through the sigmoid: a(3) = σ(z(3))
62. Forward Propagation: z(4) is calculated by multiplying the a(3) vector by the W(3) weight matrix and adding the layer 3 biases: z(4) = W(3) a(3) + b(3)
63. Forward Propagation: for the final layer, a(4) is calculated by passing z(4) through the softmax function: a(4) = SM(z(4))
64. Forward Propagation: we then make our prediction based on the final layer’s output ŷ
65. Page of Math: z(2) = W(1) x + b(1); a(2) = σ(z(2)); z(3) = W(2) a(2) + b(2); a(3) = σ(z(3)); z(4) = W(3) a(3) + b(3); a(4) = ŷ = SM(z(4))
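
As a compact illustration (not from the slides; the layer sizes and initialization are assumptions), the whole Page of Math forward pass fits in a few lines of numpy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

# Illustrative shapes: 2 inputs, two hidden layers of 3 units, 3 output classes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)
W3, b3 = rng.normal(size=(3, 3)), np.zeros(3)

x = np.array([0.5, -1.0])               # input vector (x1, x2)

z2 = W1 @ x + b1;  a2 = sigmoid(z2)     # z(2) = W(1) x + b(1),   a(2) = σ(z(2))
z3 = W2 @ a2 + b2; a3 = sigmoid(z3)     # z(3) = W(2) a(2) + b(2), a(3) = σ(z(3))
z4 = W3 @ a3 + b3; y_hat = softmax(z4)  # a(4) = ŷ = SM(z(4))
```
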
66. Goal: find which direction to shift the weights. How: find the partial derivatives of the cost with respect to the weight matrices. How (again): chain rule the sh*t out of this mofo
67. DANGER: MATH
68.-69. Chain Rule Reminder (the rule itself appears as an equation on the slides)
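
The reminder is the familiar rule for a composition y = f(g(x)) (the slide showed it as an image; this is the standard statement):

$$
\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx}, \qquad \text{where } u = g(x),\; y = f(u)
$$
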
70. Chain rule example: find the derivative with respect to x (the function is shown on the slide)
71. Chain rule example: first split it into two composed functions
72. Chain rule example: then get the derivative of each component
73.-79. Chain rule example: combine the component derivatives with the chain rule, step by step (worked out on the slides)
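
Since the worked equations on slides 70-79 appear only as images, here is a stand-in example of the same procedure; the particular function is an assumption chosen for illustration:

$$
y = (3x^2 + 1)^5 \quad\Rightarrow\quad \text{split: } y = u^5,\; u = 3x^2 + 1
$$
$$
\frac{dy}{du} = 5u^4, \qquad \frac{du}{dx} = 6x, \qquad
\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx} = 5(3x^2+1)^4 \cdot 6x = 30x\,(3x^2+1)^4
$$
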
80.-82. DEEPER: apply the same chain rule to deeper compositions of functions. Want: the derivative with respect to x (equations shown on the slides)
83.-85. DEEPER NOTE: “Cancelling out” isn’t how the math actually works, but it’s a handy way to think about it
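
For example (an illustrative composition, not necessarily the one on the slides), with y = f(u), u = g(v), v = h(x), the chained partials look as if the intermediate terms cancel:

$$
\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dv}\cdot\frac{dv}{dx}
$$
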
86. Back Prop: back to backpropagation. Want: the partial derivatives of the cost with respect to the weight matrices (equation shown on the slide)
87. Return of the Page of Math: z(2) = W(1) x + b(1); a(2) = σ(z(2)); z(3) = W(2) a(2) + b(2); a(3) = σ(z(3)); z(4) = W(3) a(3) + b(3); a(4) = ŷ = SM(z(4))
88. Partials, step by step: a(4) = ŷ = SM(z(4)), with a cross-entropy loss (equation shown on the slide)
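
For reference (a standard result, not printed in the transcript), with one-hot targets y the cross-entropy loss and the convenient softmax-plus-cross-entropy derivative are:

$$
J = -\sum_k y_k \log \hat y_k, \qquad \frac{\partial J}{\partial z^{(4)}} = \hat y - y
$$
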
89.-98. Back Propagation (figure sequence interleaved with “Partials, step by step” slides): the annotated network now has a cost node attached to the output. Working backwards from the cost, each slide highlights the next partial derivative we want and expands another factor of the chain rule, until we reach the partials of the cost with respect to W(3), W(2), and W(1) (equations shown on the slides)
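
A compact, textbook summary of the quantities those slides build up (notation matches the Page of Math; ⊙ is element-wise multiplication, and the softmax/cross-entropy shortcut from above is used):

$$
\delta^{(4)} = \frac{\partial J}{\partial z^{(4)}} = \hat y - y
$$
$$
\delta^{(3)} = \big(W^{(3)}\big)^{\!\top} \delta^{(4)} \odot \sigma'\!\big(z^{(3)}\big), \qquad
\delta^{(2)} = \big(W^{(2)}\big)^{\!\top} \delta^{(3)} \odot \sigma'\!\big(z^{(2)}\big)
$$
$$
\frac{\partial J}{\partial W^{(3)}} = \delta^{(4)} \big(a^{(3)}\big)^{\!\top}, \qquad
\frac{\partial J}{\partial W^{(2)}} = \delta^{(3)} \big(a^{(2)}\big)^{\!\top}, \qquad
\frac{\partial J}{\partial W^{(1)}} = \delta^{(2)}\, x^{\top}, \qquad
\frac{\partial J}{\partial b^{(l)}} = \delta^{(l+1)}
$$
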
99. As programmers... how do we NOT do this ourselves? We’re lazy by trade.
100. 3. Automatic Differentiation. Bringing sexy lazy back
101. Why not hard-code? ▣ Want to iterate fast! ▣ Want flexibility ▣ Want to reuse our code!
102. Auto-Differentiation: Idea ▣ Use functions that have easy-to-compute derivatives ▣ Compose these functions to create a more complex super-model ▣ Use the chain rule to get partial derivatives of the model
103. What makes a “good” function? ▣ Obvious stuff: differentiable (continuously and smoothly!) ▣ Simple operations: add, subtract, multiply ▣ Reuse previous computation
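
A toy sketch of the idea (purely illustrative, not how TensorFlow implements it): each operation records its inputs and local derivative, and the chain rule is applied backwards through that recorded graph. All class and variable names here are made up for the example.

```python
import math

class Var:
    """Minimal reverse-mode autodiff node: tracks a value and how to push gradients back."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, local_gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value, [(self, other.value), (other, self.value)])

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.value))
        return Var(s, [(self, s * (1.0 - s))])   # derivative reuses the forward value

    def backward(self, grad=1.0):
        # Chain rule: accumulate this node's gradient, then pass it along to the parents
        self.grad += grad
        for parent, local_grad in self.parents:
            parent.backward(grad * local_grad)

# Usage: y = sigmoid(w * x + b); get dy/dw and dy/db automatically
w, x, b = Var(0.5), Var(2.0), Var(-1.0)
y = (w * x + b).sigmoid()
y.backward()
print(w.grad, b.grad)
```
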
104.-105. Nice functions: sigmoid
106.-107. Nice functions: hyperbolic tangent
108.-109. Nice functions: rectified linear unit (ReLU)
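
The derivatives that make these functions “nice” (standard identities, not printed in the transcript) can all be computed cheaply from values already produced in the forward pass:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\big(1 - \sigma(x)\big)
$$
$$
\tanh'(x) = 1 - \tanh^2(x), \qquad
\mathrm{ReLU}(x) = \max(0, x), \qquad
\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}
$$
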