Everything You Wanted to
Know about Optimization
(and some you didn’t)
Madison May
madison@indico.io
Madison May
Machine Learning Architect @ Indico Data Solutions
Solve big problems with small data.
Email: madison@indico.io
Twitter: @pragmaticml
Github: @madisonmay
Ancient History
(an optimization primer)
Definitions:
● Loss: differentiable measure
of model error
● Gradient: direction of
steepest descent at point on
error surface
● Loss surface: how loss varies
with parameter value
● Learning rate: how far to
move params in direction of
gradient
Gradient descent
● Compute loss on entire
dataset
● Compute gradient of
parameters with respect to
loss
● Update parameters in the
direction of the gradient
scaled by some parameter
(the learning rate)
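As a concrete illustration, here is a minimal NumPy sketch of this loop on a toy least-squares problem; the data, learning rate, and step count are illustrative choices, not from the talk.

```python
import numpy as np

# Toy data: y = X @ w_true + noise (illustrative problem, not from the talk)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=100)

w = np.zeros(3)   # parameters
lr = 0.1          # learning rate

for step in range(200):
    # Loss is computed over the entire dataset (mean squared error)
    residual = X @ w - y
    # Gradient of the loss with respect to the parameters
    grad = 2 * X.T @ residual / len(y)
    # Move parameters against the gradient, scaled by the learning rate
    w -= lr * grad
```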
SGD
(mini-batch
gradient descent)
● Compute loss on small
number of examples
● Compute gradient of
parameters with respect to
loss
● Update parameters in the
direction of the gradient
scaled by some parameter
(the learning rate)
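The same toy problem, sketched with mini-batches; the batch size and the `minibatch_grad` helper are illustrative, and the later optimizer sketches reuse this helper.

```python
import numpy as np

# Toy least-squares problem reused by the later optimizer sketches
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

def minibatch_grad(w, batch_size=32):
    """Gradient of mean squared error on a small random subset of examples."""
    idx = rng.integers(0, len(y), size=batch_size)
    Xb, yb = X[idx], y[idx]
    return 2 * Xb.T @ (Xb @ w - yb) / batch_size

w = np.zeros(3)
lr = 0.05

for step in range(500):
    # Noisy gradient from a mini-batch instead of the full dataset
    w -= lr * minibatch_grad(w)
```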
Mini-batch vs. Full
● Don’t need to compute the
gradient on all of your training
examples to get a gradient
estimate that is good enough.
● Better to update your
parameters more frequently
with a noisy gradient than to
compute a perfect gradient
estimate and update less often.
● Stochastic gradient estimates
help avoid local minima /
saddle points
https://en.wikipedia.org/wiki/gradient_descent
SGD with
Momentum
● SGD is problematic when the
magnitude of gradients varies
between parameters.
● Parameters will oscillate
between the two sides of the
bowl (see right).
● Keeping an exponential moving
average of past gradients
(momentum) helps to dampen
oscillation (acts like heavy ball)
● Helps accelerate through flat
areas of the loss surface.
Images from sebastianruder.com
With Momentum
Without Momentum
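A minimal sketch of the momentum update, reusing the `minibatch_grad` helper from the SGD sketch above; beta = 0.9 is a conventional default, not a value from the slides.

```python
# Classical momentum, reusing minibatch_grad from the SGD sketch above.
w = np.zeros(3)
velocity = np.zeros_like(w)
lr, beta = 0.05, 0.9   # beta is the momentum coefficient (conventional default)

for step in range(500):
    g = minibatch_grad(w)
    velocity = beta * velocity + g   # running accumulation of past gradients
    w -= lr * velocity               # the "heavy ball" damps oscillation
```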
SGD with Nesterov
Momentum (NAG)
● Instead of computing the gradient
at the current parameters, first apply
the accumulated momentum step (a
“look-ahead”), then compute the
gradient at that point
● Allows optimizer to correct
more quickly to changes in the
loss landscape
Hinton Lecture 6c
Blue: momentum
Green: NAG
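A sketch of the Nesterov variant under the same toy setup: the gradient is evaluated at the point the accumulated momentum would move the parameters to. The constants remain illustrative.

```python
# Nesterov momentum: evaluate the gradient at the "looked-ahead" point.
# Reuses minibatch_grad from the SGD sketch above.
w = np.zeros(3)
velocity = np.zeros_like(w)
lr, beta = 0.05, 0.9

for step in range(500):
    lookahead = w - lr * beta * velocity   # apply the momentum step first
    g = minibatch_grad(lookahead)          # gradient measured after that step
    velocity = beta * velocity + g
    w -= lr * velocity
```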
Adagrad, Adadelta,
And RMSProp
● Different parameters require
differently scaled updates
● Values of previous gradients
are used to scale the current
gradient estimate in a
heuristic manner
● Significantly less sensitive to
hyperparameters thanks to
per parameter scaling
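An RMSProp-style sketch of per-parameter scaling, again reusing the `minibatch_grad` helper; the decay rate and epsilon are conventional defaults rather than values from the slides.

```python
# RMSProp-style per-parameter scaling.
# Reuses minibatch_grad from the SGD sketch above.
w = np.zeros(3)
sq_avg = np.zeros_like(w)
lr, decay, eps = 0.01, 0.9, 1e-8

for step in range(500):
    g = minibatch_grad(w)
    sq_avg = decay * sq_avg + (1 - decay) * g**2   # moving average of squared gradients
    w -= lr * g / (np.sqrt(sq_avg) + eps)          # larger past gradients -> smaller step
```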
Adam
● Most common go-to in
current deep learning
research
● Stores exponential moving
average of squared gradients
(Adadelta / RMSProp-like
term) and gradients
(momentum-like term)
● Behaves like a “heavy ball
with friction” and finds flat
minima of loss function.
● Empirically leads to quicker
convergence than SGD
http://ruder.io/optimizing-gradient-descent
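A sketch of the Adam update under the same toy setup; the hyperparameters are the defaults from the Adam paper, and `minibatch_grad` is the illustrative helper defined earlier.

```python
# Adam: moving averages of both gradients (m) and squared gradients (v),
# with bias correction. Reuses minibatch_grad from the SGD sketch above.
w = np.zeros(3)
m = np.zeros_like(w)
v = np.zeros_like(w)
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = minibatch_grad(w)
    m = beta1 * m + (1 - beta1) * g       # momentum-like term
    v = beta2 * v + (1 - beta2) * g**2    # RMSProp / Adadelta-like term
    m_hat = m / (1 - beta1**t)            # bias correction
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)
```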
Takeaways
● SGD: update params by scaling gradient
● Momentum: incorporating an exponential moving
average of past gradients allows SGD to escape
saddle points. Acts like the acceleration of a ball on a surface
due to its mass.
● Adadelta / RMSprop: inverse scaling by exponential
moving average of square of gradient to help with
sensitivity to hyperparameters
● Adam: incorporates elements of momentum and
Adadelta / RMSprop
Async Training, Batch Size,
and Regularization Affect
Learning Rate Dynamics
Batch Size +
Learning Rate
● Increasing the batch size has a
similar effect to decreasing the
learning rate (the SGD noise scale
is roughly proportional to
learning rate / batch size)
● Instead of learning rate
annealing, you could increase
batch size for faster training
times with equivalent
accuracy thanks to increased
parallelism and fewer
parameter updates
Image from “Don't Decay the Learning Rate,
Increase the Batch Size”
See also: Revisiting Small Batch Training for
Deep Neural Networks
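A hedged sketch of that idea: keep the learning rate fixed and grow the batch size on the schedule you would otherwise have used for learning rate decay. The epoch boundaries and sizes below are illustrative, not taken from the paper.

```python
# Illustrative schedule: wherever a step decay would have divided lr by 10,
# multiply the batch size by 10 instead and keep lr fixed.
lr = 0.1

def batch_size_for(epoch, base=128):
    if epoch < 30:
        return base          # lr would have been 0.1
    elif epoch < 60:
        return base * 10     # lr would have dropped to 0.01
    else:
        return base * 100    # lr would have dropped to 0.001
```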
Batch Size +
Learning Rate
● “...both large learning rate and
small batch size contribute
towards SGD finding flatter
minima that generalize well”
-“Finding Flatter Minima with SGD”
Images from “Qualitatively characterizing
neural network optimization problems” and
“Finding Flatter Minima with SGD”
Async Training &
Momentum
● Asynchronous data parallelism is
popular for training large models
(e.g., Hogwild!)
● Data parallelism acts similarly
to momentum (running
average of gradient updates
vs. true average)
● Reduce your momentum
parameter to compensate for
the increase in “effective
momentum”
Image from “Asynchrony Begets
Momentum, with an Application to Deep
Learning”
Regularization +
Learning Rate
● L2 regularization (penalizing
magnitude of weights)
decreases norm of weights
● Decreasing the norm of the
weights necessitates a
corresponding decrease in
learning rate for optimal
learning
Figure from “L2 Regularization versus Batch
and Weight Normalization”
Takeaways
● There’s a difference between the learning rate
parameter and the effective learning rate of models
● Understand how batch size, async training, and the
norm of model parameters interact with learning rate
for best results.
Learning Rate Scheduling
Learning Rate
Annealing
● For non-adaptive methods,
the learning rate that is
optimal at the beginning of
learning is not the same as the
learning rate that is optimal
near the end of learning
● Adjustments become finer
later on in optimization, and
learning rate should be
lowered to accommodate this
Figure from http://srdas.github.io/DLBook/
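Two common annealing schedules written as simple Python functions; all constants are illustrative.

```python
import math

def step_decay(step, base_lr=0.1, drop=0.5, every=10_000):
    """Halve the learning rate every `every` steps."""
    return base_lr * drop ** (step // every)

def exponential_decay(step, base_lr=0.1, k=1e-5):
    """Smoothly shrink the learning rate as training progresses."""
    return base_lr * math.exp(-k * step)
```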
Cyclic Learning
Rates + Snapshot
Ensembling
● Increase and decrease the
learning rate on a schedule?
● Good optima found when
learning rate is low
● High learning rate kicks model
out of local optima
● Averaging parameters acts
like ensembling
Figure from “Snapshot Ensembles:
Train 1 Get M for Free”
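A minimal sketch of a cyclic cosine schedule of the kind used for snapshot ensembling; the cycle length and learning rate bounds are illustrative.

```python
import math

def cyclic_cosine_lr(step, steps_per_cycle=10_000, max_lr=0.1, min_lr=1e-4):
    """Cosine-anneal from max_lr to min_lr, then jump back up at each new cycle."""
    t = (step % steps_per_cycle) / steps_per_cycle
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * t))

# Snapshot ensembling: save a checkpoint at the end of each cycle (when lr is
# lowest) and ensemble predictions from the saved snapshots at test time.
```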
Takeaways
● Use learning rate annealing when using vanilla SGD or
SGD w/ momentum
● Consider snapshot ensembling for easy incremental
model performance improvements.
Improving Adam
ICLR 2018 Optimization Papers
● On the convergence of Adam and beyond (Sashank J. Reddi, Satyen Kale, Sanjiv Kumar)
● Normalized, direction-preserving Adam (Zijun Zhang, Lin Ma, Zongpeng Li, Chuan Wu)
● Fixing Weight Decay Regularization in Adam (Ilya Loshchilov, Frank Hutter)
● Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients (Lukas Balles, Philipp
Hennig)
● YellowFin and the Art of Momentum Tuning (Jian Zhang, Ioannis Mitliagkas, Christopher Re)
What can we
improve about
Adam?
“Despite superior training outcomes, adaptive
optimization methods such as Adam, Adagrad or
RMSprop have been found to generalize poorly
compared to Stochastic Gradient Descent (SGD).
These methods tend to perform well in the initial
portion of training but are outperformed by SGD at
later stages of training.”
From “Improving Generalization Performance by
Switching from Adam to SGD”
Image from “The Marginal Value of Adaptive Gradient
Methods in Machine Learning”
Problems with
Exponential Moving
Averages
Hypotheses:
● Some features are rarely active
but when they are active, they
provide large gradients
● Exponential moving averages
don’t entirely deal with this kind of
behavior; the influence of past
gradient updates diminishes too
quickly
From “On the Convergence of Adam and Beyond”
Non-convergence of Adam in a 1D setting.
Image from “On the Convergence of Adam and Beyond”
How do we fix it?
Potential Solution:
● Instead of storing exponential
moving average of past squared
gradients, store the maximum past
squared gradient and use that to
adjust size of weight update
● Resultant algorithm is termed
“AMSGrad”
● Enjoys theoretical guarantees that
were missing from Adam.
● Empirically leads to better
generalization performance
From “On the Convergence of Adam and Beyond”
Image from
http://ruder.io/deep-learning-optimization-2017
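A sketch of the AMSGrad modification on the same toy problem: identical to the Adam sketch above except that the denominator uses the running maximum of the squared-gradient average, so the effective step size never grows.

```python
# AMSGrad: Adam with the running maximum of the squared-gradient average.
# Reuses minibatch_grad from the SGD sketch above.
w = np.zeros(3)
m = np.zeros_like(w)
v = np.zeros_like(w)
v_max = np.zeros_like(w)
lr, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = minibatch_grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    v_max = np.maximum(v_max, v)            # keep the maximum past squared-gradient average
    w -= lr * m / (np.sqrt(v_max) + eps)
```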
Other potential
problems
Hypotheses:
● “When combined with adaptive
gradients, L2 regularization leads to
weights with large gradients being
regularized less than they would be
when using weight decay.”
● In other words, using L2
regularization in conjunction with
Adam is not effective -- although
weight decay is.
From “Fixing Weight Decay Regularization in Adam”
Image from “Fixing Weight Decay Regularization in
Adam”
How do we fix it?
Potential Solution:
● Use weight decay as originally
formulated rather than L2
regularization
From “Fixing Weight Decay Regularization in Adam”
Image from “Fixing Weight Decay Regularization in
Adam”
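A sketch of decoupled weight decay in the style of AdamW, under the same toy setup; the decay coefficient is an illustrative choice.

```python
# Decoupled weight decay (AdamW-style): the decay is applied directly to the
# weights, outside of the adaptive gradient machinery.
# Reuses minibatch_grad from the SGD sketch above; wd = 0.01 is illustrative.
w = np.zeros(3)
m = np.zeros_like(w)
v = np.zeros_like(w)
lr, beta1, beta2, eps, wd = 0.001, 0.9, 0.999, 1e-8, 0.01

for t in range(1, 501):
    g = minibatch_grad(w)                      # gradient of the loss only, no L2 term added
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)   # adaptive Adam step
    w -= lr * wd * w                           # weight decay applied separately
```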
Update Directions
● Adam and other adaptive gradient
methods set a learning rate on a
per parameter basis
● Setting individual learning rates
results in different update
directions than vanilla SGD
● Adam trades reduction in variance
of update direction for increase in
bias of update direction from true
gradient direction
From “On the Convergence of Adam and Beyond”
Image from “Dissecting Adam: The Sign, Magnitude, and
Variance of Stochastic Gradients”
How do we fix it?
Potential Solution:
● YellowFin: since an individual
learning rate per parameter leads
to different update directions than
SGD, only use a global learning
rate, and solve the learning rate
setting problem separately
● Implements a learning rate &
momentum tuner w/ a negative feedback loop
that requires no hyperparameter
tuning and leads to faster
convergence than Adam in
practice.
From “YellowFin and the Art of Momentum Tuning”
Image: ResNet loss on CIFAR100 from
“YellowFin and the Art of Momentum Tuning”
Takeaways
● Adam generally performs well but has its limits
● Use with weight decay rather than L2 regularization
● At the upper extremes of training data availability try
SGD + Nesterov momentum + learning rate annealing
or YellowFin.
● Compare against AMSGrad
● Monitor arxiv.org and wait 6 months -- academia is still
deciding on whether it’s time to move past Adam.
Other considerations
● Weight initialization
● Batch norm / Layer norm
Shoutouts
● Sebastian Ruder has blogged extensively about
optimization -- his content forms the basis for much of
this talk
○ http://ruder.io/optimizing-gradient-descent/
○ http://ruder.io/deep-learning-optimization-2017/
● Chapter 8 of “Deep Learning” by Goodfellow, Bengio,
and Courville was a useful supplement
○ https://www.deeplearningbook.org
● Fei-Fei Li’s CS231n course at Stanford:
○ http://cs231n.github.io/neural-networks-3
Questions?
Appendix
Premature
Convergence
Hypothesis:
● Models converge before intended if
learning rate is strictly decayed
Potential Solution:
● Anneal learning rate on cosine
schedule, reset to default learning
rate every N epochs.
● Works well in conjunction with
weight decay for both vanilla SGD and Adam
● Reduces hyperparam sensitivity
From “Fixing Weight Decay Regularization in Adam”
Image from “Fixing Weight Decay
Regularization in Adam”
Weight Initialization
Properties of Good
Weight Init
● Break symmetry -- otherwise
all units will behave in the
same manner.
● Weight distribution should
have zero mean (prior that
features are uncorrelated).
● Uniform or Gaussian
Other Weight Init
Considerations
● Glorot initialization -- scaling
based on number of layer
inputs / outputs
● He initialization -- scaling
weight norm based on
number of layer inputs, for
ReLU activation
● Goal: preserve relative scale
of activation variance and
gradient variance through
many layers
Glorot Uniform
He Initialization
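Minimal NumPy sketches of both initializers; the layer sizes in the usage line are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def glorot_uniform(fan_in, fan_out):
    """Glorot/Xavier: scale by both the number of layer inputs and outputs."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

def he_normal(fan_in, fan_out):
    """He: scale by the number of layer inputs only; intended for ReLU layers."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

W = he_normal(512, 256)   # e.g. weights for a 512 -> 256 ReLU layer
```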
Takeaways
● Parameter initialization matters (more than you might
think)
● Take care to ensure that activation and gradient
variances stay roughly constant throughout layers
when training deep networks (consider visualization)