Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal Data Scientist, SAS Institute Inc. at MLconf ATL 2016

Copyr ight © 2016, SAS Institute Inc. All rights reser ved.
LOCAL SEARCH OPTIMIZATION FOR
HYPERPARAMETER TUNING
FUNDA GUNES, PATRICK KOCH

LSO FOR HYPER-
PARAMETER TUNING
OVERVIEW
1. Training Predictive Models
• Optimization problem
• Parameters – optimization and model training
2. Tuning Predictive Models
• Manual, grid search, optimization
• Optimization challenges
3. Sample Results
4. Distributed / Parallel Processing
5. Opportunities and Challenges

LSO FOR HYPER-
PARAMETER TUNING
PREDICTIVE MODELS
Transformation of relevant data to high-quality descriptive and predictive models
Ex.: Neural Network
• How to determine model
configuration?
• How to determine weights, activation
functions?
….
….
….
Input layer Hidden layer Output layer
x1
x2
x3
x4
xn
wij
wjk
f1(x)
f2(x)
fm(x)

LSO FOR HYPER-
PARAMETER TUNING
MODEL TRAINING – OPTIMIZATION PROBLEM
• Objective Function
• 𝑓 𝑤 =
1
𝑛 𝑖=1
𝑛
𝐿 𝑤; 𝑥𝑖, 𝑦𝑖 + 𝜆1 𝑤 1 +
𝜆2
2
𝑤 2
2
• Loss, 𝐿(𝑤; 𝑥𝑖, 𝑦𝑖), for observation (𝑥𝑖, 𝑦𝑖) and weights, 𝑤
• Stochastic Gradient Descent (SGD)
• Variation of Gradient Descent 𝑤 𝑘+1 = 𝑤 𝑘 − 𝜂𝛻𝑓(𝑤 𝑘)
• Approximate 𝛻𝑓 𝑤 𝑘 with gradient of mini-batch sample: {𝑥 𝑘𝑖, 𝑦 𝑘𝑖 }
• 𝛻𝑓𝑡(𝑤) =
1
𝑚 𝑖=1
𝑚
𝛻𝐿(𝑤; 𝑥 𝑘𝑖, 𝑦 𝑘𝑖)
• Choice of 𝜆1, 𝜆2, 𝜂, 𝑚 can have significant effects on solution quality

LSO FOR HYPER-
PARAMETER TUNING
MODEL TRAINING – OPTIMIZATION PARAMETERS (SGD)
• Learning rate, η
• Too high, diverges
• Too low, slow performance
• Momentum, μ
• Too high, could “pass” solution
• Too low, no performance improvement
• Other parameters
• Mini-batch size, m
• Adaptive decay rate, β
• Annealing rate, r
• Communication frequency, c
• Regularization parameters, λ1 and λ2
• Too low, has little effect
• Too high, drives iterates to 0

LSO FOR HYPER-
PARAMETER TUNING
HYPERPARAMETERS
• Quality of trained model governed by so-called ‘hyperparameters’
No clear defaults agreeable to a wide range of applications
• Optimization options (SGD)
• Neural Net training options
• Number of hidden layers
• Number of neurons in each hidden layer
• Random distribution for initial
connection weights (normal, uniform, Cauchy)
• Error function (gamma, normal, Poisson, entropy)
….
….
….
Input layer Hidden layer Output layer
x1
x2
x3
x4
xn
wij
wjk
f1(x)
f2(x)
fm(x)

LSO FOR HYPER-
PARAMETER TUNING
MODEL TUNING
• Traditional Approach: manual tuning
Even with expertise in machine learning algorithms and their parameters, best settings
are directly dependent on the data used in training and scoring
• Hyperparameter Optimization: grid vs. random vs. optimization
x1
x2
Standard Grid Search
x1
x2
Random Latin Hypercube
x1
x2
Random Search

LSO FOR HYPER-
PARAMETER TUNING
HYPERPARAMETER OPTIMIZATION CHALLENGES
x
Objective
blows up
Categorical / Integer
Variables
T(x)
Noisy/Nondeterministic
Flat-regions.
Node failure
Tuning objective, T(x), is validation error score (to avoid increased overfitting effect)

LSO FOR HYPER-
PARAMETER TUNING
PARALLEL HYBRID DERIVATIVE-FREE OPTIMIZATION STRATEGY
Default hybrid search strategy:
1. Initial Search: Latin Hypercube Sampling (LHS)
2. Global search: Genetic Algorithm (GA)
• Supports integer, categorical variables
• Handles nonsmooth, discontinuous space
3. Local search: Generating Set Search (GSS)
• Similar to Pattern Search
• First-order convergence properties
• Developed for continuous variables
All naturally parallelizable methods
Can run any solver combination in parallel

LSO FOR HYPER-
PARAMETER TUNING
SAMPLE TUNING RESULTS – AVERAGE IMPROVEMENT
• 13 test problems
• Single 30% validation
partition
• Single Machine
• Conservative default
tuning process:
• 5 Iterations
• 10 configurations per
iteration
• Run 10 times each
• Averaged results
ML
Average
Error %
Reduction
DT 5.2
GBT 5.0
RF 4.4
DT-P 3.0
Higher is better
No Free Lunch

LSO FOR HYPER-
PARAMETER TUNING
SAMPLE TUNING RESULTS – TUNING TIME
• 13 test problems
• Single 30% validation
partition
• Single Machine
• Conservative default
tuning process:
• 5 Iterations
• 10 configurations per
iteration
• Run 10 times each
• Averaged results
ML
Average
% Error
Average
Time
DT 15.3 9.2
DT-P 13.9 12.8
RF 12.7 49.1
GBT 11.7 121.8

LSO FOR HYPER-
PARAMETER TUNING
SINGLE PARTITION VS. CROSS VALIDATION
• For each problem:
• Tune with single 30% partition
• Score best model on test set
• Repeat 10 times
• Average difference between test
error and validation error
• Repeat process with 5-fold
cross validation
• Cross validation for small-
medium data sets
• 5x cost increase for sequential
tuning process
• manage in parallel / threaded
environment

LSO FOR HYPER-
PARAMETER TUNING
PARALLEL MODE: DATA PARALLEL VS. MODEL PARALLEL
LHS
GA
GSS
D
NM
…
Node 1
Node 2
Node 3
Node k
...
x
f(x)
Train
Score
LHS
GA
GSS
D
NM
…
Train
Node 1
Score
x
f(x)
Train
Node 2
Score
Train
Node k
Score
...
Hybrid Solver Manager Hybrid Solver Manager
...
Data Distributed / Parallel Model Parallel

LSO FOR HYPER-
PARAMETER TUNING
“DATA PARALLEL” – DISTRIBUTED TRAIN, SEQUENTIAL TUNE
15
36 34
40
65
112
140
226
0
50
100
150
200
250
1 2 4 8 16 32 64 128
Time(seconds)
Number of Nodes for Training
IRIS Forest Tuning Time
105 Train / 45 Validate
0
100
200
300
400
500
600
1 2 4 8 16 32 64
Time(seconds)
Number of Nodes for Training
Credit Data Tuning Time
49k Train / 21k Validate

LSO FOR HYPER-
PARAMETER TUNING
“DATA AND MODEL PARALLEL” – DISTRIBUTED TRAIN,
PARALLEL TUNE
• Distributed train,
parallel tune:
• Train: 4 workers
• Tune: n parallel
0
2
4
6
8
10
12
2 4 8 16 32
Time(hours)
Maximum Models Trained in Parallel
Handwritten Data
Train 60k / Validate 10k
LHS
GA
GSS
D
NM
…
Train, Score
Node 1
Node 2
Node 3
Node 4
Train, Score
Node 5
Node 6
Node 7
Node 8
Hybrid Solver Manager
...
...

LSO FOR HYPER-
PARAMETER TUNING
HYPERPARAMETER OPTIMIZATION RESULTS
• Default Neural Network, 9.6 % error
• Iteration 1 – Latin Hypercube best,
6.2% error
• Very similar configuration can have
very different error…
• Best after tuning: 2.6% error
0
20
40
60
80
100
0 50 100 150
Error%
Evaluation
Iteration 1: Latin Hypercube Sample Error
0
2
4
6
8
10
12
0 5 10 15 20
Objective(MisclassificationError%)
Iteration
Handwritten Data
Train 60k/Validate 10k
Tuned Neural Net
Hidden Layers 2
Neurons 200, 200
Learning Rate 0.012
Annealing Rate 2.3e-05

LSO FOR HYPER-
PARAMETER TUNING
MANY TUNING OPPORTUNITIES AND CHALLENGES
• Initial Implementation
• PROCs NNET, TREESPLIT, FOREST, GRADBOOST
• SAS® Viya™ Data Mining and Machine Learning
• Other search methods, extending hybrid solver framework
• Bayesian / surrogate-based Optimization
• New hybrid search strategies
• Selecting best machine learning algorithm
• Parallel tuning across algorithms & strategies for effective node usage
• Combining models
Better (generalized) Models = Better Predictions = Better Decisions

THANK YOU!
Funda.Gunes@sas.com
Patrick.Koch@sas.com
sas.com/viya

Parallel&Serial,Pub/Sub,
WebServices,MQs
Source-based
Engines
Microservices
UAA
Query
Gen
Folders
CAS
Mgmt
Data
Source
Mgmt
Analytics
GUIs
etc.…
BI
GUIs
Env
Mgr
Model
Mgmt
Log
Audit
UAAUAA
Data
Mgmt
GUIs
In-Memory
Engine
In-Cloud
In-Database
In-Hadoop
In-Stream
Solutions
APIs
Infrastructures
Platforms
SAS
®
Viya
™
PLATFORM
Analytics
Data ManagementFraud and Security
Intelligence
Business Visualization
Risk Management
!
Customer Intelligence
Cloud Analytics Services (CAS)
DISTRIBUTED
COMPUTING

EXPERIMENTAL
RESULTS
EXPERIMENT SETTINGS
• Decision tree:
• Depth
• Number of bins for interval variables
• Splitting criterion
• Random Forest:
• Number of trees
• Number of variables at each split
• Fraction of observations used to build a tree
• Gradient Boosting:
• Number of trees
• Number of variables at each split
• Fraction of observations used to build a tree
• L1 regularization
• L2 regularization
• Learning rate
• Neural Networks:
• Number of hidden layers
• Number of neurons in each hidden layer
• LBFGS optimization parameters
• SGD optimization parameters
For this study:
• Linux cluster with 144 nodes
• Each node has 16 cores
• Viya Data Mining and Machine Learning
HYPERPARAMETERS

Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal Data Scientist, SAS Institute Inc. at MLconf ATL 2016

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (20)

Similar to Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal Data Scientist, SAS Institute Inc. at MLconf ATL 2016

Similar to Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal Data Scientist, SAS Institute Inc. at MLconf ATL 2016 (20)

More from MLconf

More from MLconf (20)

Recently uploaded

Recently uploaded (20)

Funda Gunes, Senior Research Statistician Developer & Patrick Koch, Principal Data Scientist, SAS Institute Inc. at MLconf ATL 2016