Automating Data Science on Spark: A Bayesian Approach
Vu Pham, Huan Dao, Christopher Nguyen
San Francisco, June 8th 2016
@arimoinc
Data Science is so MANUAL!
• Feature Engineering
• Model Selection
• Hyperparameter Tuning
• Natural Interfaces
• Machine Learning
• Collaboration
Agenda
1. Bayesian Optimization for Hyper-parameter Tuning
2. Bayesian Optimization for “Automating” Data Science on Spark
3. Experiments
Bayesian Optimization for Hyper-parameter Tuning
What if… we can automate this?
• Feature Engineering
• Model Selection
• Hyper-parameter Tuning
Hyper-parameters
K-means: K
Neural Network: # of layers, dropout, momentum…
Random Forest: feature set, # of trees, max depth…
SVM: regularization term (C)
Gradient Descent: learning rate, number of iterations…
Manual Search
Grid Search
Random Search
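The grid search and random search strategies named above can be sketched in a few lines. This is a minimal illustration, not the deck's implementation: `toy_utility` is a hypothetical stand-in for validation performance, and the hyper-parameter names and candidate values are assumptions chosen for the example.

```python
import itertools
import random

def grid_search(utility, grid):
    """Evaluate every combination of the listed candidate values."""
    names = list(grid)
    configs = [dict(zip(names, values))
               for values in itertools.product(*grid.values())]
    return max(configs, key=utility), len(configs)

def random_search(utility, bounds, n_trials, seed=0):
    """Sample each hyper-parameter uniformly from its (low, high) interval."""
    rng = random.Random(seed)
    configs = [{name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
               for _ in range(n_trials)]
    return max(configs, key=utility), len(configs)

def toy_utility(config):
    # Hypothetical objective: peaks at learning_rate=0.04, dropout=0.4.
    return -((config["learning_rate"] - 0.04) ** 2) - (config["dropout"] - 0.4) ** 2

grid_best, n_grid = grid_search(
    toy_utility,
    {"learning_rate": [0.01, 0.05, 0.1], "dropout": [0.0, 0.25, 0.5]},
)
rand_best, n_rand = random_search(
    toy_utility,
    {"learning_rate": (0.01, 0.1), "dropout": (0.0, 0.5)},
    n_trials=9,
)
print(grid_best, n_grid)
print(rand_best, n_rand)
```

Both strategies spend the same budget of nine evaluations here. Neither uses the results of earlier evaluations to decide where to look next, which is the gap Bayesian optimization targets.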
Hyper-parameter tuning
Maximize a utility function over the hyper-parameter space:
$U : S \to \mathbb{R}, \quad c \mapsto U(c)$
where $S$ is the hyper-parameter space and $U(c)$ is the performance on the validation set.
How to intelligently select the next configuration, given the observations in the past?
Bayesian Inference
Courtesy: http://www2.stat.duke.edu/~mw/fineart.html
Bayesian Optimization Explained
1. Incorporate a prior over the space of possible objective functions (GP)
2. Combine the prior with the likelihood to obtain a posterior over
function values given observations
3. Select next configuration to evaluate based on the posterior
• According to an acquisition function
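The three steps above can be sketched for a single hyper-parameter on [0, 1]. This is a minimal sketch under stated assumptions, not Spearmint's implementation: the kernel, its length scale, the noise level, the candidate grid, and `toy_utility` are all illustrative choices, and the acquisition function is plain expected improvement.

```python
import math
import random

def rbf(a, b, length=0.2):
    """Squared-exponential kernel on scalars (length scale is an assumed value)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def gp_posterior(xs, ys, x_new, noise=1e-4):
    """Steps 1-2: zero-mean GP prior plus observations gives mean/std at x_new."""
    K = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
         for i, a in enumerate(xs)]
    k = [rbf(a, x_new) for a in xs]
    mean = sum(ki * ai for ki, ai in zip(k, solve(K, ys)))
    var = 1.0 - sum(ki * vi for ki, vi in zip(k, solve(K, k)))
    return mean, math.sqrt(max(var, 1e-12))

def expected_improvement(mean, std, best):
    """Step 3: expected improvement over the best utility seen (maximization)."""
    z = (mean - best) / std
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return (mean - best) * cdf + std * pdf

def bayes_opt(utility, n_init=3, n_iter=10, grid_size=101, seed=0):
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n_init)]   # a few random starting points
    ys = [utility(x) for x in xs]
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    for _ in range(n_iter):
        best = max(ys)
        candidates = [g for g in grid if g not in xs]
        # Pick the candidate with the highest acquisition value under the posterior.
        x_next = max(candidates,
                     key=lambda g: expected_improvement(*gp_posterior(xs, ys, g), best))
        xs.append(x_next)
        ys.append(utility(x_next))
    return xs, ys

def toy_utility(x):
    # Hypothetical stand-in for validation performance; maximum at x = 0.3.
    return -(x - 0.3) ** 2

xs, ys = bayes_opt(toy_utility)
best_x, best_y = max(zip(xs, ys), key=lambda pair: pair[1])
print(best_x, best_y)
```

Each iteration refits the posterior on all observations so far, so the choice of the next configuration depends on every past evaluation, unlike grid or random search.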
Bayesian Optimization is for globally optimizing functions that are:
• expensive
• multi-modal
• noisy
• blackbox
Bayesian Optimization for “Automating” Data Science on Spark
Automatic Data Science Workflow
• Feature Engineering
• Model Selection
• Hyperparameter Tuning
Machines doing part of Data Science
• Dataset
• Feature preprocessing
• Data Cleaning
• Generate new features
• Feature selection
• Dimensionality reduction
• Predictive modeling
• Scoring
A generic Machine Learning pipeline
Inputs: training set, val/test set
• Feature preprocessing
• Truncated SVD (n: # of dims)
• f-score with target variable (keep the top alpha percent of features)
• K-means (k: # of clusters)
• RandomForest #1 … RandomForest #k (number of trees, maximum depth, percentage of features)
• Scoring
Pipeline on DDF and BayesOpt
Diagram components: Client, Bayesian Optimizer, Spark Driver, Arimo Pipeline Manager, DDF, Spark Workers
Client uses Bayesian Optimizer to select the hyper-parameters
of the pipeline so that it maximizes the performance on a validation set
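The client-side loop described here can be sketched as follows. Everything in this block is an assumption made for illustration: `RandomSuggester` is a placeholder for the Bayesian Optimizer (it samples uniformly rather than using a posterior), the `suggest`/`observe` interface is hypothetical, and `fake_pipeline` stands in for training and scoring the pipeline on Spark.

```python
import random

class RandomSuggester:
    """Placeholder for the Bayesian Optimizer box; it samples uniformly instead."""

    def __init__(self, bounds, seed=0):
        self.bounds = bounds
        self.rng = random.Random(seed)
        self.history = []

    def suggest(self):
        # Propose the next hyper-parameter configuration to try.
        return {name: self.rng.uniform(lo, hi)
                for name, (lo, hi) in self.bounds.items()}

    def observe(self, params, score):
        # Record the outcome so a real optimizer could update its posterior.
        self.history.append((params, score))

def run_client(optimizer, evaluate_pipeline, max_iters):
    """Ask for a configuration, evaluate it, report the score back, repeat."""
    best_params, best_score, trace = None, float("-inf"), []
    for _ in range(max_iters):
        params = optimizer.suggest()
        score = evaluate_pipeline(params)  # train, then score on the validation set
        optimizer.observe(params, score)
        if score > best_score:
            best_params, best_score = params, score
        trace.append(best_score)           # best validation score so far
    return best_params, trace

def fake_pipeline(params):
    # Hypothetical stand-in for running the pipeline; best at top_feature_pct = 0.8.
    return 1.0 - abs(params["top_feature_pct"] - 0.8)

best_params, trace = run_client(
    RandomSuggester({"top_feature_pct": (0.1, 1.0)}), fake_pipeline, max_iters=20
)
print(best_params, trace[-1])
```

The point of the split is that the optimizer only sees configurations and scores, so the expensive evaluation can run anywhere, including on a cluster.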
Pipeline on DDF and BayesOpt
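# Load the training and validation sets as DDFs (arguments elided on the slide)
train_ddf = session.get_ddf(…)
valid_ddf = session.get_ddf(…)

# Spearmint-backed optimizer using the expected-improvement-per-second chooser
optimizer = SpearmintOptimizer(chooser_name='GPEIperSecChooser',
                               max_finished_jobs=max_iters, grid_size=5000, …)

# Tune the pipeline's hyper-parameters to predict the 'arrdelay' column
best_params, trace = auto_model(
    optimizer, train_ddf, 'arrdelay',
    classification=True,
    excluded_columns=['actualelapsedtime', 'arrtime', 'year'],
    validation_ddf=valid_ddf)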
Experimental Results
Experiment 1: SF Crimes
Dates | Category | Descript | DayOfWeek | PdDistrict | Resolution | Address | X | Y
2015-05-13 23:53:00 | WARRANTS | WARRANT ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.4258 | 37.7745
2015-05-13 23:53:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | OAK ST / LAGUNA ST | -122.4258 | 37.7745
2015-05-13 23:33:00 | OTHER OFFENSES | TRAFFIC VIOLATION ARREST | Wednesday | NORTHERN | ARREST, BOOKED | VANNESS AV / GREENWICH ST | -122.4243 | 37.8004
Experiment 1: SF Crimes dataset
Hyper-parameter | Type | Range
Number of hidden layers | INT | 1, 2, 3
Number of hidden units | INT | 64, 128, 256
Dropout at the input layer | FLOAT | [0, 0.5]
Dropout at the hidden layers | FLOAT | [0, 0.75]
Learning rate | FLOAT | [0.01, 0.1]
L2 weight decay | FLOAT | [0, 0.01]
Logloss on the validation set | |
Running time (hours), ~40 iterations | |
Experiment 1: SF Crimes dataset
Hyper-parameter | Type | Range | Spearmint
Number of hidden layers | INT | 1, 2, 3 | 2
Number of hidden units | INT | 64, 128, 256 | 256
Dropout at the input layer | FLOAT | [0, 0.5] | 0.423678
Dropout at the hidden layers | FLOAT | [0, 0.75] | 0.091693
Learning rate | FLOAT | [0.01, 0.1] | 0.025994
L2 weight decay | FLOAT | [0, 0.01] | 0.00238
Logloss on the validation set | | | 2.1502
Running time (hours), ~40 iterations | | | 15.8
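A search space like the one in the table above can be written down as data and checked programmatically. This is a minimal sketch: the ranges and the Spearmint result are copied from the table, while the dictionary layout, the key names, and the helper functions are assumptions for illustration and not Spearmint's configuration format.

```python
import random

# Ranges copied from the table; the ("choice"/"float", ...) encoding is an assumption.
SPACE = {
    "hidden_layers": ("choice", [1, 2, 3]),
    "hidden_units": ("choice", [64, 128, 256]),
    "input_dropout": ("float", 0.0, 0.5),
    "hidden_dropout": ("float", 0.0, 0.75),
    "learning_rate": ("float", 0.01, 0.1),
    "l2_weight_decay": ("float", 0.0, 0.01),
}

def sample(space, rng):
    """Draw one configuration uniformly at random from the space."""
    config = {}
    for name, spec in space.items():
        if spec[0] == "choice":
            config[name] = rng.choice(spec[1])
        else:
            config[name] = rng.uniform(spec[1], spec[2])
    return config

def in_space(config, space):
    """Check that every value lies inside its declared range."""
    for name, spec in space.items():
        value = config[name]
        if spec[0] == "choice":
            if value not in spec[1]:
                return False
        elif not spec[1] <= value <= spec[2]:
            return False
    return True

# Best configuration reported in the Spearmint column of the table.
spearmint_best = {
    "hidden_layers": 2,
    "hidden_units": 256,
    "input_dropout": 0.423678,
    "hidden_dropout": 0.091693,
    "learning_rate": 0.025994,
    "l2_weight_decay": 0.00238,
}

print(in_space(spearmint_best, SPACE))
print(sample(SPACE, random.Random(0)))
```

Keeping the space as plain data makes it easy to hand the same definition to different optimizers when comparing them.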
Experiment 1: SF Crimes dataset
[Figure: validation logloss against completed jobs; data labels 2.1600, 2.1555, 2.1552, 2.1551, 2.1519, 2.1502]
Completed jobs | 1 | 7 | 16 | 17 | 18 | 39
Elapsed time | 1021 | 8003 | 12742 | 1623 | 1983 | 31561
Number of layers | 1 | 3 | 3 | 2 | 3 | 2
Hidden units | 64 | 256 | 256 | 256 | 256 | 256
Learning rate | 0.01 | 0.1 | 0.01 | 0.01 | 0.021 | 0.026
Input dropout | 0 | 0.5 | 0.5 | 0.463 | 0.5 | 0.424
Hidden dropout | 0 | 0 | 0 | 0.024 | 0.089 | 0.092
Weight decay | 0 | 0 | 0.002 | 0.003 | 0 | 0.002
Experiment #1: With SigOpt
Hyper-parameter | Type | Range | Spearmint | SigOpt
Number of hidden layers | INT | 1, 2, 3 | 2 | 3
Number of hidden units | INT | 64, 128, 256 | 256 | 256
Dropout at the input layer | FLOAT | [0, 0.5] | 0.423678 | 0.3141
Dropout at the hidden layers | FLOAT | [0, 0.75] | 0.091693 | 0.0944
Learning rate | FLOAT | [0.01, 0.1] | 0.025994 | 0.0979
L2 weight decay | FLOAT | [0, 0.01] | 0.00238 | 0.0039
Logloss on the validation set | | | 2.1502 | 2.14892
Running time (hours), ~40 iterations | | | 15.8 | 20.1
SF Crimes - Time to results
Experiment #2: Airlines data
Field | Description
Year | 1987-2008
Month | 1-12
DayofMonth | 1-31
DayOfWeek | 1 (Monday) - 7 (Sunday)
DepTime | actual departure time
CRSDepTime | scheduled departure time
ArrTime | actual arrival time
CRSArrTime | scheduled arrival time
UniqueCarrier | unique carrier code
FlightNum | flight number
TailNum | plane tail number
ActualElapsedTime | in minutes
CRSElapsedTime | in minutes
AirTime | in minutes
ArrDelay | arrival delay, in minutes
DepDelay | departure delay, in minutes
Origin | origin IATA airport code
Dest | destination IATA airport code
Distance | in miles
TaxiIn | taxi in time, in minutes
TaxiOut | taxi out time, in minutes
Cancelled | was the flight cancelled?
CancellationCode | reason for cancellation
Diverted | 1 = yes, 0 = no
CarrierDelay | in minutes
WeatherDelay | in minutes
NASDelay | in minutes
SecurityDelay | in minutes
LateAircraftDelay | in minutes
Delayed | is the flight delayed?
Experiment #2
Inputs: training set, val/test set
• Feature preprocessing (~900 features)
• Truncated SVD (n: # of dims)
• f-score with target variable (keep the top alpha percent of features)
• K-means (k: # of clusters)
• RandomForest #1 … RandomForest #k (number of trees, maximum depth, percentage of features)
• Scoring
Experiment #2: Hyper-parameters
Hyperparameter | Type | Range | BayesOpt
Number of SVD dimensions | INT | [5, 100] | 98
Top feature percentage | FLOAT | [0.1, 1] | 0.8258
k (# of clusters) | INT | [1, 6] | 2
Number of trees (RF) | INT | [50, 500] | 327
Max. depth (RF) | INT | [1, 20] | 12
Min. instances per node (RF) | INT | [1, 1000] | 414
F1-score on validation set | | | 0.8736
Summary
1. Bayesian Optimization for Hyper-parameter Tuning
2. Bayesian Optimization for “Automating” Data Science on Spark
3. Experiments
Getting Started
• Blogpost: http://goo.gl/PFyBKI
• Open-source: spearmint, hyperopt, SMAC, AutoML
• Commercial: Whetlab, SigOpt, …
http://goo.gl/PFyBKI
https://www.arimo.com
@arimoinc @pentagoniac @phvu
CHECK IT OUT!
