Francesco Gadaleta, Chief Data Officer at Abe AI, takes a deep dive into the secret behind deep learning: function optimization. Watch this video as he goes over the most used optimization techniques for artificial intelligence and deep learning technologies.
Read the full post here: http://bit.ly/2m12Nxd
4. FUNCTION OPTIMIZATION
Minimizing or maximizing a function,
e.g. the difference between the predicted
value and the true value.
This function is usually referred to as the
loss function.
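As a minimal illustration (not taken from the deck itself), a mean squared error loss measures that difference between predictions and true values:

```python
import numpy as np

def mse_loss(y_pred, y_true):
    """Mean squared error: average squared difference between
    predicted and true values."""
    return np.mean((y_pred - y_true) ** 2)

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8])
print(mse_loss(y_pred, y_true))  # small value means a good fit
```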
5. THE CORE OF DEEP NEURAL NETWORKS
[Diagram: inputs x1, x2, x3 and a bias unit b = +1 flow through weight
matrices W1 and W2 with biases b1 and b2; each layer acts as a logistic
regression]
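A minimal sketch (my own illustration, not code from the talk) of the logistic-regression unit that such a network stacks, assuming a sigmoid non-linearity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_layer(x, W, b):
    """One layer: affine transform followed by a sigmoid non-linearity."""
    return sigmoid(W @ x + b)

# Toy forward pass through two stacked logistic layers
x = np.array([0.5, -1.2, 3.0])               # inputs x1, x2, x3
W1, b1 = np.random.randn(4, 3), np.zeros(4)  # first layer weights and bias
W2, b2 = np.random.randn(1, 4), np.zeros(1)  # second layer weights and bias
hidden = logistic_layer(x, W1, b1)
output = logistic_layer(hidden, W2, b2)
```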
6. THE GRADIENT AND
GRADIENT DESCENT METHODS
The gradient is a vector-valued multivariable generalization of
the derivative.
Like the derivative, the gradient represents the slope of the
tangent of the graph of the function.
[Figure: gradient 3D plot]
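As a quick illustration (added here, not in the slides), the gradient of f(x, y) = x² + y² is (2x, 2y), which can be checked numerically with finite differences:

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + y**2

def numerical_gradient(f, p, eps=1e-6):
    """Approximate each partial derivative with central finite differences."""
    grad = np.zeros_like(p)
    for i in range(len(p)):
        step = np.zeros_like(p)
        step[i] = eps
        grad[i] = (f(p + step) - f(p - step)) / (2 * eps)
    return grad

print(numerical_gradient(f, np.array([1.0, 2.0])))  # ~ [2.0, 4.0]
```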
7. TYPES OF OPTIMIZATION
First-order methods minimize or maximize the loss function
using its gradient.
Second-order methods minimize or maximize the loss function
using its second derivative (the Hessian), which is very costly to compute.
L-BFGS (Limited-memory Broyden–Fletcher–Goldfarb–Shanno) uses an
approximation of the Hessian instead.
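As a hedged sketch (assuming SciPy, which the deck does not prescribe), L-BFGS is available through scipy.optimize.minimize:

```python
import numpy as np
from scipy.optimize import minimize

TARGET = np.array([1.0, -2.0])

def loss(w):
    """A simple convex quadratic loss as a stand-in for a real model loss."""
    return np.sum((w - TARGET) ** 2)

def grad(w):
    return 2 * (w - TARGET)

# L-BFGS-B builds a limited-memory approximation of the Hessian internally
result = minimize(loss, x0=np.zeros(2), jac=grad, method="L-BFGS-B")
print(result.x)  # ~ [1.0, -2.0]
```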
8. CONVEX FUNCTIONS
Computers are very good at minimizing only a specific family of
functions (convex functions)
[Figure: convex function vs. non-convex function]
Definition: a function is convex if the line segment between any two
points on its graph lies on or above the graph.
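In formula form (the standard definition, spelled out here since the slide only shows it as an image):

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y), \quad \text{for all } x, y \text{ and } \lambda \in [0, 1].$$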
9. THE CORE OF NEURAL NETWORKS
Layers:
● Output: predicts the supervised target
● Hidden: learns abstract representations
● Input: raw sensory inputs
10. THE CORE OF NEURAL NETWORKS
[Diagram: a stack of logistic-regression layers trained with SGD
(Stochastic Gradient Descent), with backpropagation computing the
gradients at each layer]
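A hedged sketch (mine, not the speaker's code) of backpropagation through two logistic layers, with an SGD update at each layer; it assumes a sigmoid output and a log loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy data: one example with 3 features, binary target
x = np.array([0.5, -1.2, 3.0])
y = 1.0
lr = 0.1

# Two logistic-regression layers
W1, b1 = np.random.randn(4, 3) * 0.1, np.zeros(4)
W2, b2 = np.random.randn(1, 4) * 0.1, np.zeros(1)

# Forward pass
h = sigmoid(W1 @ x + b1)
p = sigmoid(W2 @ h + b2)

# Backward pass (chain rule, layer by layer) for the log loss
dz2 = p - y                        # gradient at the output pre-activation
dW2 = np.outer(dz2, h)
db2 = dz2
dz1 = (W2.T @ dz2) * h * (1 - h)   # propagate through the hidden layer
dW1 = np.outer(dz1, x)
db1 = dz1

# SGD update at each layer
W2 -= lr * dW2; b2 -= lr * db2
W1 -= lr * dW1; b1 -= lr * db1
```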
11. GRADIENT DESCENT
LEADING TOWARDS THE MINIMUM
● Follow the negative gradient
● Tune the parameters to minimize the
loss function
● The update is defined by a direction and a learning rate
(see the sketch below)
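A minimal gradient-descent sketch (an illustration of the update rule θ ← θ − η ∇L on a toy loss, not the speaker's code):

```python
import numpy as np

TARGET = np.array([3.0, -1.0])

def grad_loss(theta):
    """Gradient of the toy quadratic loss L(theta) = ||theta - TARGET||^2."""
    return 2 * (theta - TARGET)

theta = np.zeros(2)   # parameters to tune
lr = 0.1              # learning rate (step size)

for _ in range(100):
    theta -= lr * grad_loss(theta)   # step in the negative gradient direction

print(theta)  # converges towards [3.0, -1.0]
```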
12. GRADIENT DESCENT METHODS
MOMENTUM
● SGD has trouble around local optima and in ravines,
where the surface curves much more steeply in one dimension
than in another
● Momentum accelerates SGD in the relevant
direction and reduces oscillations in the other
directions (see the sketch below)
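A hedged momentum sketch (illustrative hyperparameters): the update accumulates a velocity v ← γv + η∇L and steps θ ← θ − v:

```python
import numpy as np

def grad_loss(theta):
    """Toy ravine-shaped loss: much steeper in the second coordinate."""
    return np.array([2 * theta[0], 50 * theta[1]])

theta = np.array([5.0, 5.0])
velocity = np.zeros(2)
lr, gamma = 0.01, 0.9    # learning rate and momentum coefficient

for _ in range(200):
    velocity = gamma * velocity + lr * grad_loss(theta)
    theta -= velocity     # accumulated velocity dampens the oscillations

print(theta)  # approaches the minimum at [0, 0]
```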
13. GRADIENT DESCENT METHODS
ADAGRAD
● Adaptive Gradient adapts the learning rate
to the parameters
● Larger updates for parameters tied to infrequent features
● Smaller updates for frequent ones (see the sketch below)
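An Adagrad sketch (illustration only): each parameter's learning rate is divided by the square root of its accumulated squared gradients:

```python
import numpy as np

def grad_loss(theta):
    """Toy gradient where the first parameter gets much larger gradients."""
    return np.array([10 * theta[0], 0.1 * theta[1]])

theta = np.array([1.0, 1.0])
grad_squared_sum = np.zeros(2)   # per-parameter accumulator
lr, eps = 0.5, 1e-8

for _ in range(100):
    g = grad_loss(theta)
    grad_squared_sum += g ** 2
    # Parameters with a history of small gradients keep larger effective steps
    theta -= lr * g / (np.sqrt(grad_squared_sum) + eps)

print(theta)  # both parameters move towards the minimum at [0, 0]
```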
14. GRADIENT DESCENT METHODS
ADAM
● Best of both worlds
● Adaptive learning rates for each parameter
● Adam keeps an exponentially decaying average of past
squared gradients and, in a similar fashion to momentum,
of past gradients
● It thereby estimates the first moment (the mean) and the
second moment (the uncentered variance) of the gradients
(see the sketch below)
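A hedged Adam sketch following the published update rule, with the commonly used default hyperparameters chosen here for illustration:

```python
import numpy as np

TARGET = np.array([3.0, -1.0])

def grad_loss(theta):
    return 2 * (theta - TARGET)   # gradient of a toy quadratic loss

theta = np.zeros(2)
m = np.zeros(2)                   # first moment (mean of gradients)
v = np.zeros(2)                   # second moment (uncentered variance)
lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = grad_loss(theta)
    m = beta1 * m + (1 - beta1) * g          # decaying average of gradients
    v = beta2 * v + (1 - beta2) * g ** 2     # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # close to [3.0, -1.0]
```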
15. GRADIENT DESCENT METHODS
WHAT IS THE BEST OPTIMIZER?
● Input data is sparse: consider one of the adaptive
learning-rate methods. There is no need to tune the learning
rate to achieve the best results with the default values.
● Off-the-shelf SGD: good and reliable for simple networks.
In general SGD will get to the minimum, even though it might
struggle a bit near saddle points, taking longer to converge.
● Complex and deep nets: choose one of the adaptive
learning-rate methods for faster convergence.
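As a usage-level illustration (assuming PyTorch, which the talk does not mandate), switching between these strategies is typically a one-line change:

```python
import torch

model = torch.nn.Linear(10, 1)   # stand-in for a real network

# Off-the-shelf SGD (optionally with momentum) for simple networks
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# An adaptive learning-rate method for sparse data or complex, deep nets
adam = torch.optim.Adam(model.parameters(), lr=0.001)
```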