# Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark and TensorFlow Meetup - 08-04-2016

Advanced Spark and TensorFlow Meetup 08-04-2016

Fundamental Algorithms of Neural Networks including Gradient Descent, Back Propagation, Auto Differentiation, Partial Derivatives, Chain Rule

### Gradient Descent, Back Propagation, and Auto Differentiation - Advanced Spark and TensorFlow Meetup - 08-04-2016

1. Backprop, Gradient Descent, and Auto Differentiation. Sam Abrahams, Memdump LLC
2. Link to these slides: https://goo.gl/tKOvr7
3. YO! I am Sam Abrahams. I am a data scientist and engineer. You can find me on GitHub @samjabrahams. Buy my book: TensorFlow for Machine Intelligence
5. Gradient Descent Outline ▣ Problem: fit data ▣ Basic OLS linear regression ▣ Visualize the error curve and regression line ▣ Step through the changes, one at a time
6. Simple Start: Linear Regression (scatter plot of the data)
7. Simple Start: Linear Regression (ordinary least squares fit)
8. Simple Start: Linear Regression ▣ Want to find a model that can fit our data ▣ Could do it algebraically… ▣ BUT that doesn’t generalize well
9. Simple Start: Linear Regression ▣ Step back: what does ordinary linear regression try to do? ▣ Minimize the sum (or average) of squared errors ▣ How else could we minimize?
10. Gradient Descent ▣ Start with a random guess ▣ Use the derivative (the gradient, when dealing with multiple variables) to get the slope of the error curve ▣ Move our parameters so that we move down the error curve
11.-23. Single Variable Cost Curve (figure sequence): the cost J is plotted against a single weight W. A random guess puts us at some point on the curve; the partial derivative ∂J/∂W gives the slope there. Here ∂J/∂W < 0, so we move W to the right. Repeating the process (evaluate ∂J/∂W at each new point and step downhill) walks us toward the minimum of the cost curve.
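
To make that loop concrete, here is a minimal sketch of batch gradient descent for a one-variable least-squares fit. It is not from the slides; the toy data, learning rate, and variable names are illustrative assumptions.

```python
import numpy as np

# Toy data: y is roughly 3 * x plus noise (illustrative values, not from the slides)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 3.2, 5.9, 9.1, 11.8])

w = np.random.randn()       # random initial guess for the single weight
learning_rate = 0.01

for step in range(500):
    y_hat = w * x                    # model prediction
    error = y_hat - y
    cost = np.mean(error ** 2)       # J(w): mean squared error
    grad = 2 * np.mean(error * x)    # dJ/dw of the MSE cost
    w -= learning_rate * grad        # step downhill along the cost curve

print(w)  # ends up near the true slope (~3)
```
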
25. Gradient Descent Variants ▣ There are additional techniques that can help speed up (or otherwise improve) gradient descent ▣ The next slides describe some of these! ▣ More details (and some awesome visuals) here: article by Sebastian Ruder
26. Gradient Descent ▣ Get the true gradient with respect to all training examples ▣ One step = one epoch ▣ Slow and generally infeasible for large training sets
28. Stochastic Gradient Descent ▣ Basic idea: approximate the derivative using only one example ▣ “Online learning” ▣ Update the weights after each example
30. Mini-Batch Gradient Descent ▣ Similar idea to stochastic gradient descent ▣ Approximate the derivative with a sampled batch of examples ▣ Middle ground between “true” stochastic gradient descent and full gradient descent
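
As a rough illustration (not from the slides; the helper names and default batch size are assumptions), the three variants differ only in how many examples feed each weight update:

```python
import numpy as np

def gradient(w, x_batch, y_batch):
    """dJ/dw of mean squared error for the toy linear model y_hat = w * x."""
    error = w * x_batch - y_batch
    return 2 * np.mean(error * x_batch)

def minibatch_sgd(x, y, w=0.0, learning_rate=0.01, batch_size=2, epochs=100):
    n = len(x)
    for _ in range(epochs):
        order = np.random.permutation(n)           # shuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # batch_size=1 -> SGD, batch_size=n -> full batch
            w -= learning_rate * gradient(w, x[idx], y[idx])
    return w
```

With batch_size=1 this reduces to the slide's stochastic ("online") update; with batch_size=len(x) it is full-batch gradient descent.
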
32. Momentum ▣ Idea: if we see multiple gradients in a row pointing in the same direction, we should increase our learning rate ▣ Accumulate a “momentum” vector to speed up descent
33. Without Momentum (figure)
34. Momentum (figure)
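
The standard momentum update (a textbook formulation, not taken verbatim from the slides) accumulates a velocity vector v and applies it to the weights, with γ the momentum coefficient and η the learning rate:

$$
v_t = \gamma\, v_{t-1} + \eta\, \nabla_\theta J(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - v_t
$$
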
35. Nesterov Momentum ▣ Idea: before updating our weights, look ahead to where the accumulated momentum would take us ▣ Adjust our update based on that “future” position
36. Nesterov Momentum (figure; source: lecture by Geoffrey Hinton). Legend: momentum vector, gradient/correction, Nesterov steps, standard momentum steps
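
A common way to write the Nesterov update (again a textbook form, not from the slides) is to evaluate the gradient at the look-ahead point before applying the step:

$$
v_t = \gamma\, v_{t-1} + \eta\, \nabla_\theta J\!\big(\theta_{t-1} - \gamma\, v_{t-1}\big), \qquad \theta_t = \theta_{t-1} - v_t
$$
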
37. AdaGrad ▣ Idea: update individual weights differently depending on how frequently they change ▣ Keeps a running tally of previous updates for each weight, and divides new updates by a factor of the previous updates ▣ Downside: for long-running training, eventually all gradients diminish ▣ Paper on jmlr.org
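
In symbols (the standard per-weight form, not spelled out in the transcript), AdaGrad keeps an element-wise sum of squared gradients G and scales each step by its inverse square root; ε is a small constant for numerical stability:

$$
G_t = G_{t-1} + \big(\nabla_\theta J(\theta_{t-1})\big)^2, \qquad
\theta_t = \theta_{t-1} - \frac{\eta}{\sqrt{G_t} + \epsilon}\, \nabla_\theta J(\theta_{t-1})
$$

Because G only grows, the effective step size shrinks over time, which is exactly the downside noted on the slide.
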
39. Adam ▣ Adam expands on the concepts introduced with AdaDelta and RMSProp ▣ Uses both first and second moments of the gradient, decayed over time ▣ Paper on arxiv.org
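
For reference (the standard form from the Adam paper, not printed in the transcript), the update keeps exponentially decayed estimates of the gradient's first moment m and second moment v, corrects their bias, and scales the step:

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2
$$
$$
\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad
\hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad
\theta_t = \theta_{t-1} - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \epsilon}
$$
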
40. 2. Forward & Back Propagation. The Chain Rule got the last laugh, high-school-you
41. Beyond OLS Regression ▣ Can’t do everything with linear regression! ▣ Nor polynomial… ▣ Why can’t we let the computer figure out how to model?
42. Neural Networks: Idea ▣ Chain together non-linear functions ▣ Have lots of parameters that can be adjusted ▣ These “weights” determine the model function
43. Feed-forward neural network (figure): input layer l(1) with x1, x2 and a bias unit +1; two hidden layers l(2) and l(3); output layer l(4). Weight matrices W(1), W(2), W(3) connect the layers, producing activations a(2), a(3), a(4) and the output ŷ
44.-56. Annotated network diagram, one element at a time. Legend (repeated on every slide): xi: input value; ŷ: output vector; +1: bias (constant) unit; a(l): activation vector for layer l; W(l): weight matrix for layer l; z(l): input into layer l; σ: sigmoid (logistic) function; SM: softmax function. Successive slides highlight layers 1 through 4, the bias units, the input, the weight matrices, the layer inputs z(l) = W(l-1) a(l-1) + b(l-1), the activation vectors, the sigmoid activations in the hidden layers, the softmax activation at the output layer, and the output ŷ
57. Forward Propagation: the input vector is passed into the network
58. Forward Propagation: the input is multiplied by the W(1) weight matrix and the layer 1 biases are added to calculate z(2) = W(1) x + b(1)
59. Forward Propagation: the activation value for the second layer is calculated by passing z(2) through some function, in this case the sigmoid: a(2) = σ(z(2))
60. Forward Propagation: z(3) is calculated by multiplying the a(2) vector by the W(2) weight matrix and adding the layer 2 biases: z(3) = W(2) a(2) + b(2)
61. Forward Propagation: as in the previous layer, a(3) is calculated by passing z(3) through the sigmoid: a(3) = σ(z(3))
62. Forward Propagation: z(4) is calculated by multiplying the a(3) vector by the W(3) weight matrix and adding the layer 3 biases: z(4) = W(3) a(3) + b(3)
63. Forward Propagation: for the final layer, a(4) is calculated by passing z(4) through the softmax function: a(4) = SM(z(4))
64. Forward Propagation: we then make our prediction based on the final layer’s output ŷ
65. Page of Math: z(2) = W(1) x + b(1); a(2) = σ(z(2)); z(3) = W(2) a(2) + b(2); a(3) = σ(z(3)); z(4) = W(3) a(3) + b(3); a(4) = ŷ = SM(z(4))
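
As a compact illustration (not from the slides; the layer sizes and initialization are assumptions), the whole Page of Math forward pass fits in a few lines of numpy:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract max for numerical stability
    return e / e.sum()

# Illustrative shapes: 2 inputs, two hidden layers of 3 units, 3 output classes
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(3, 3)), np.zeros(3)
W3, b3 = rng.normal(size=(3, 3)), np.zeros(3)

x = np.array([0.5, -1.0])               # input vector (x1, x2)

z2 = W1 @ x + b1;  a2 = sigmoid(z2)     # z(2) = W(1) x + b(1),   a(2) = σ(z(2))
z3 = W2 @ a2 + b2; a3 = sigmoid(z3)     # z(3) = W(2) a(2) + b(2), a(3) = σ(z(3))
z4 = W3 @ a3 + b3; y_hat = softmax(z4)  # a(4) = ŷ = SM(z(4))
```
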
66. Goal: find which direction to shift the weights. How: find the partial derivatives of the cost with respect to the weight matrices. How (again): chain rule the sh*t out of this mofo
67. DANGER: MATH
68.-69. Chain Rule Reminder (the rule itself appears as an equation on the slides)
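
The reminder is the familiar rule for a composition y = f(g(x)) (the slide showed it as an image; this is the standard statement):

$$
\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx}, \qquad \text{where } u = g(x),\; y = f(u)
$$
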
70. Chain rule example: find the derivative with respect to x (the function is shown on the slide)
71. Chain rule example: first split it into two composed functions
72. Chain rule example: then get the derivative of each component
73.-79. Chain rule example: combine the component derivatives with the chain rule, step by step (worked out on the slides)
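
Since the worked equations on slides 70-79 appear only as images, here is a stand-in example of the same procedure; the particular function is an assumption chosen for illustration:

$$
y = (3x^2 + 1)^5 \quad\Rightarrow\quad \text{split: } y = u^5,\; u = 3x^2 + 1
$$
$$
\frac{dy}{du} = 5u^4, \qquad \frac{du}{dx} = 6x, \qquad
\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dx} = 5(3x^2+1)^4 \cdot 6x = 30x\,(3x^2+1)^4
$$
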
80.-82. DEEPER: apply the same chain rule to deeper compositions of functions. Want: the derivative with respect to x (equations shown on the slides)
83.-85. DEEPER NOTE: “Cancelling out” isn’t how the math actually works, but it’s a handy way to think about it
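
For example (an illustrative composition, not necessarily the one on the slides), with y = f(u), u = g(v), v = h(x), the chained partials look as if the intermediate terms cancel:

$$
\frac{dy}{dx} = \frac{dy}{du}\cdot\frac{du}{dv}\cdot\frac{dv}{dx}
$$
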
86. Back Prop: back to backpropagation. Want: the partial derivatives of the cost with respect to the weight matrices (equation shown on the slide)
87. Return of the Page of Math: z(2) = W(1) x + b(1); a(2) = σ(z(2)); z(3) = W(2) a(2) + b(2); a(3) = σ(z(3)); z(4) = W(3) a(3) + b(3); a(4) = ŷ = SM(z(4))
88. Partials, step by step: a(4) = ŷ = SM(z(4)), with a cross-entropy loss (equation shown on the slide)
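
For reference (a standard result, not printed in the transcript), with one-hot targets y the cross-entropy loss and the convenient softmax-plus-cross-entropy derivative are:

$$
J = -\sum_k y_k \log \hat y_k, \qquad \frac{\partial J}{\partial z^{(4)}} = \hat y - y
$$
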
89.-98. Back Propagation (figure sequence interleaved with “Partials, step by step” slides): the annotated network now has a cost node attached to the output. Working backwards from the cost, each slide highlights the next partial derivative we want and expands another factor of the chain rule, until we reach the partials of the cost with respect to W(3), W(2), and W(1) (equations shown on the slides)
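
A compact, textbook summary of the quantities those slides build up (notation matches the Page of Math; ⊙ is element-wise multiplication, and the softmax/cross-entropy shortcut from above is used):

$$
\delta^{(4)} = \frac{\partial J}{\partial z^{(4)}} = \hat y - y
$$
$$
\delta^{(3)} = \big(W^{(3)}\big)^{\!\top} \delta^{(4)} \odot \sigma'\!\big(z^{(3)}\big), \qquad
\delta^{(2)} = \big(W^{(2)}\big)^{\!\top} \delta^{(3)} \odot \sigma'\!\big(z^{(2)}\big)
$$
$$
\frac{\partial J}{\partial W^{(3)}} = \delta^{(4)} \big(a^{(3)}\big)^{\!\top}, \qquad
\frac{\partial J}{\partial W^{(2)}} = \delta^{(3)} \big(a^{(2)}\big)^{\!\top}, \qquad
\frac{\partial J}{\partial W^{(1)}} = \delta^{(2)}\, x^{\top}, \qquad
\frac{\partial J}{\partial b^{(l)}} = \delta^{(l+1)}
$$
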
99. As programmers... how do we NOT do this ourselves? We’re lazy by trade.
100. 3. Automatic Differentiation. Bringing sexy lazy back
101. Why not hard-code? ▣ Want to iterate fast! ▣ Want flexibility ▣ Want to reuse our code!
102. Auto-Differentiation: Idea ▣ Use functions that have easy-to-compute derivatives ▣ Compose these functions to create a more complex super-model ▣ Use the chain rule to get partial derivatives of the model
103. What makes a “good” function? ▣ Obvious stuff: differentiable (continuously and smoothly!) ▣ Simple operations: add, subtract, multiply ▣ Reuse previous computation
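
A toy sketch of the idea (purely illustrative, not how TensorFlow implements it): each operation records its inputs and local derivative, and the chain rule is applied backwards through that recorded graph. All class and variable names here are made up for the example.

```python
import math

class Var:
    """Minimal reverse-mode autodiff node: tracks a value and how to push gradients back."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, local_gradient) pairs
        self.grad = 0.0

    def __add__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value, [(self, other.value), (other, self.value)])

    def sigmoid(self):
        s = 1.0 / (1.0 + math.exp(-self.value))
        return Var(s, [(self, s * (1.0 - s))])   # derivative reuses the forward value

    def backward(self, grad=1.0):
        # Chain rule: accumulate this node's gradient, then pass it along to the parents
        self.grad += grad
        for parent, local_grad in self.parents:
            parent.backward(grad * local_grad)

# Usage: y = sigmoid(w * x + b); get dy/dw and dy/db automatically
w, x, b = Var(0.5), Var(2.0), Var(-1.0)
y = (w * x + b).sigmoid()
y.backward()
print(w.grad, b.grad)
```
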
104.-105. Nice functions: sigmoid
106.-107. Nice functions: hyperbolic tangent
108.-109. Nice functions: rectified linear unit (ReLU)
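
The derivatives that make these functions “nice” (standard identities, not printed in the transcript) can all be computed cheaply from values already produced in the forward pass:

$$
\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \sigma'(x) = \sigma(x)\,\big(1 - \sigma(x)\big)
$$
$$
\tanh'(x) = 1 - \tanh^2(x), \qquad
\mathrm{ReLU}(x) = \max(0, x), \qquad
\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}
$$
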