Scaling Out Logistic Regression
with Apache Spark
Nir Cohen - CTO
General Background about the company
› The company was founded 8 years ago
› ~300 employees worldwide
› 240 employees in Israel
› Stay updated about our open positions on our website.
You can contact jobs@similarweb.com
› Nir Cohen – nir@similarweb.com
The product
Data Size
› 650 servers total
› Several Hadoop clusters – 120 servers in the biggest
› 5 HBase clusters
› Couchbase clusters
› Kafka clusters
› MySQL Galera clusters
› 5 TB of new data every day
› Full data backup to S3
Plan for the next hour or so
› The need
› Some history
› Spark related algorithmic intuitions
› Dive into Spark
› Our Additions
› Runtime issues
› Current Categorization Algorithm
The Need
Need: The Customer
Need: The Product
Need: The Product – Direct Competitors
Need: How would you classify the Web?
› Crawl the web
› Collect data about each website
› Manually classify a few
› Use machine learning to derive a model
› Classify all the websites we’ve seen
Some History
LEARNING SET:
CLASSES
› Shopping
– Clothing
– Consumer Electronics
– Jewelry
– …
› Sports
– Baseball
– Basketball
– Boxing
– …
› …
Manually defined
246 categories
2-level tree
25 Parent categories
LEARNING SET:
FEATURES
› Tag Count Source
– cnn.com | news | 1
– bbc.com | culture | 50
– …
› HTML Analyzer Source
– cnn.com | money | 14
– nba.com | nba draft | 2
– …
11 basic sources
Feature is:
site | tag | score
Some reintroduced after
additional processing
Eventually – 16 sources
18 GB of data
4M Unique features
Our challenge
› Large Scale Logistic Regression
– ~500K site samples
– 4M Unique features
– ~800K features/source
– 246 classes
– Eventually apply model to 400M sites
FIRST LOGISTIC REGRESSION ATTEMPT
Single-machine Java logistic regression implementation:
• Highly optimized
• Manually tuned loss function
• Multi-threaded
• Uses plain arrays and divides "stripes" between threads
• Works on "summed features"
Drawbacks:
› Only scales up
› Pre-combination of features reduces coverage
› Runtime: a few days
› Code is complex, and the algorithm is hard to tweak
› Bus test
SECOND LOGISTIC REGRESSION ATTEMPT
• Out-of-the-box solution
• Customizable
• Open source
• Distributable
Why we chose Spark
› Has an out-of-the-box distributed solution for large-scale multinomial logistic regression
› Simplicity
› Lower production maintenance costs compared to R
› Intent to move to Spark for large, complex algorithmics
Spark-Related Algorithmics
An intuitive reminder
Basic Regression Method
› We want to estimate the value of y based on samples (x, y)
  $y = f(x, \beta)$; $\beta$ – unknown function constants
› Define a loss function $l(\beta)$ that corresponds with accuracy
  – for example: $l(\beta) \equiv \frac{\sum_i^{samples} \left( f(x_i, \beta) - y_i \right)^2}{\#samples}$
› Find $\beta$ that minimizes $l(\beta)$
Logistic Regression
› In case of classification we want to use the logistic function
  $y = f(x, \beta) = P(y|x; \beta) = \frac{e^{\beta x}}{1 + e^{\beta x}}$
› Define a differentiable loss function (log-likelihood)
  $l(x, \beta) = \sum_i^{samples} \log P(y_i|x_i; \beta)$
› We cannot find $\beta$ analytically
› However, $l(x, \beta)$ is smooth, continuous and convex!
  – Has one global minimum
GRADIENT DESCENT
Generally
• The value of $-\nabla l(\beta)$ is a vector that points in the direction of steepest descent
• In every step: $\beta_{k+1} = \beta_k - \alpha \nabla l(\beta_k)$
• $\alpha$ – learning rate
• Converges when $\nabla l(\beta) \to 0$
Spark
• $rate = \frac{\alpha}{iteration\ number}$
• SGD – stochastic mini-batch GD
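To make the update rule above concrete, here is a minimal single-machine sketch of gradient descent for one parameter; it is illustrative only (not Spark code), and the toy loss is an assumption:

```scala
// Minimal sketch of gradient descent: beta_{k+1} = beta_k - alpha * grad(beta_k).
// Stops when the gradient is close to 0 (the convergence condition above).
def gradientDescent(grad: Double => Double,
                    beta0: Double,
                    alpha: Double = 0.1,
                    tol: Double = 1e-8,
                    maxIter: Int = 1000): Double = {
  var beta = beta0
  var iter = 0
  while (math.abs(grad(beta)) > tol && iter < maxIter) {
    beta -= alpha * grad(beta) // step in the direction of steepest descent
    iter += 1
  }
  beta
}

// Toy example: minimize l(beta) = (beta - 3)^2, so grad(beta) = 2 * (beta - 3).
val betaHat = gradientDescent(b => 2 * (b - 3), beta0 = 0.0) // converges to ~3.0
```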
LINE SEARCH – DETERMINING STEP SIZE
Approximate method; at each iteration:
• Find a step size that sufficiently decreases l
• By reducing the range of possible step sizes
Spark:
• StrongWolfeLineSearch
• The sufficiency check is a function of $l(\beta)$, $\nabla l(\beta)$
Is there a faster way?
Function Analysis for $y = l(\beta_0, \beta_1, \ldots, \beta_n)$
› At a minimum, the derivative is 0; so we want $\beta$ that satisfies $l'(\beta) = 0$

$\text{gradient vector} \triangleq \nabla l(\beta) = \begin{pmatrix} \frac{\partial l}{\partial \beta_1} \\ \frac{\partial l}{\partial \beta_2} \\ \vdots \\ \frac{\partial l}{\partial \beta_n} \end{pmatrix} \qquad \text{Hessian matrix} : H(\beta) \triangleq \nabla^2 l(\beta) = \begin{pmatrix} \frac{\partial^2 l}{\partial \beta_1 \partial \beta_1} & \cdots & \frac{\partial^2 l}{\partial \beta_1 \partial \beta_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 l}{\partial \beta_n \partial \beta_1} & \cdots & \frac{\partial^2 l}{\partial \beta_n \partial \beta_n} \end{pmatrix}$

In our case the Hessian would be 800K × 800K – way too much…
NEWTON'S METHOD (NEWTON-RAPHSON)

Newton's iteration: $x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)}$

Update function with one feature: $\beta_{k+1} = \beta_k - \frac{l'(\beta_k)}{l''(\beta_k)}$

Our case, multiple features: $\beta_{k+1} = \beta_k - \nabla l(\beta_k) \times H^{-1}(\beta_k)$, where $H^{-1}(\beta_k)$ is the inverse Hessian matrix

"NewtonIteration Ani" by Ralf Pfeifer - NewtonIteration_Ani.gif
https://en.wikipedia.org/wiki/Newton's_method
[Illustration for a simple parabola (1 feature): gradient descent vs. Newton's method. Images from here.]
Is there a faster and simpler way?
SECANT METHOD (QUASI-NEWTON)

Approximation of the derivative: $l''(\beta_1) \approx \frac{l'(\beta_1) - l'(\beta_0)}{\beta_1 - \beta_0}$

Newton's iteration becomes:
$\beta_{k+1} = \beta_k - \frac{l'(\beta_k)}{l''(\beta_{k-1})} = \beta_k - l'(\beta_k) \cdot \frac{\beta_k - \beta_{k-1}}{l'(\beta_k) - l'(\beta_{k-1})}$

! The Hessian is not needed !
In our case, we need only $\nabla l$
Animation from here
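For intuition, a minimal 1-D sketch of the secant iteration above; illustrative only, with the same toy loss as before assumed:

```scala
// Secant iteration: approximate l'' from the last two gradient values,
// so no second derivative (and, in many dimensions, no Hessian) is needed.
def secant(lPrime: Double => Double,
           b0: Double, b1: Double,
           tol: Double = 1e-10, maxIter: Int = 100): Double = {
  var prev = b0
  var cur = b1
  var iter = 0
  while (math.abs(lPrime(cur)) > tol && iter < maxIter) {
    val next = cur - lPrime(cur) * (cur - prev) / (lPrime(cur) - lPrime(prev))
    prev = cur
    cur = next
    iter += 1
  }
  cur
}

// Toy loss: l(b) = (b - 3)^2, l'(b) = 2 * (b - 3).
val betaHat = secant(b => 2 * (b - 3), b0 = 0.0, b1 = 1.0) // converges to ~3.0
```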
Requirements and Convergence Rate

Newton-Raphson                                   | Quasi-Newton
Analytical formula for gradient                  | Analytical formula for gradient
Compute gradient at each step – O(M × N)         | Compute gradient at each step – O(M × N)
Analytical formula for Hessian                   | Save last calculations of gradient
Compute inverse Hessian at each step – O(M² × N) |
Order of convergence q = 2                       | Order of convergence q = 1.6

Which is faster?
Which is cheaper (memory, CPU) in 1000 iterations for M = 100,000 features?
Which of gradient descent, Newton or quasi-Newton should we use?
BFGS – Quasi-Newton with Line Search
› Initially guess $\beta_0$ and set $H_0^{-1} = I$
› In each step k:
  – Calculate gradient value (direction): $p_k = -\nabla f(\beta_k) \times H_k^{-1}$
  – Find step size ($\alpha_k$) using line search (with Wolfe conditions)
  – Update $\beta_{k+1} = \beta_k + \alpha_k p_k$
  – Update $H_{k+1}^{-1} = H_k^{-1} + updateFunc(H_k^{-1}, \nabla f(\beta_k), \nabla f(\beta_{k+1}), \alpha_k, p_k)$
› Stop when the improvement is small enough
› More info: BFGS
Back To Engineering
Challenges Implementing Logistic Regression
› In order to get the values of the gradient, we need to instantiate the formula with the learning set
  – For every iteration we need to go over the learning set
› If we want to speed this up by parallelization, we need to ship the model or the learning set to each thread/process
› Single machine -> the process is CPU bound
› Multiple machines -> network bound
› With a large number of features, memory becomes a problem as well
Why we chose to use L-BFGS
› The only out-of-the-box multinomial logistic regression
› Gives good value for money
  – Good tradeoff between cost per iteration and number of iterations
› Uses Spark's GeneralizedLinearModel API (roughly as sketched below):
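The snippet this colon points at did not survive extraction; as a hedged reconstruction, training through this API in Spark 1.x MLlib looks roughly like this (the class and its setters are real MLlib API; the numbers are taken from the deck, not defaults):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// training: RDD[LabeledPoint] built from the (site | tag | score) features.
def trainL1Model(training: RDD[LabeledPoint]) = {
  val trainer = new LogisticRegressionWithLBFGS()
    .setNumClasses(25)        // e.g. the 25 L1 parent categories
  trainer.optimizer
    .setRegParam(0.006)       // lambda picked by cross-validation (see below)
    .setNumIterations(100)
  trainer.run(training)       // returns an mllib LogisticRegressionModel
}
```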
L-BFGS
› L stands for Limited Memory
  – Replaces the Hessian, which is an M × M matrix, with a few (~10) most recent updates of $\nabla f(\beta_k)$ and $f(\beta_k)$, which are M-sized vectors
› spark.LBFGS
  – Distributed wrapper over breeze.LBFGS
  – Mostly, distribution of the gradient calculation
    › The rest is not distributed
    › Shipping around the model and collecting gradient values
  – Uses L2 regularization
  – Scales features
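For a feel of the breeze layer underneath spark.LBFGS, here is a minimal single-machine sketch using breeze.optimize.LBFGS directly; the quadratic objective is a stand-in chosen for illustration:

```scala
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

// Toy objective: l(beta) = ||beta - 3||^2, with gradient 2 * (beta - 3).
val objective = new DiffFunction[DenseVector[Double]] {
  def calculate(beta: DenseVector[Double]): (Double, DenseVector[Double]) = {
    val diff = beta - 3.0
    (diff dot diff, diff * 2.0) // (loss, gradient) at beta
  }
}

// m = 7: keep only the ~7 most recent gradient updates instead of a Hessian.
val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 7)
val betaHat = lbfgs.minimize(objective, DenseVector.zeros[Double](5)) // ~(3,...,3)
```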
Spark internals (diagram annotations):
• distributed sub-loop (max 10)
• distributed but cached on executors
• partial aggregation on executors, final on the Driver
AGGREGATE & TREE AGGREGATE
Aggregate
• Each executor holds a portion of the learning set
• Broadcast the model (weights – $\beta$) to executors
• Collect results (partial gradients) to the driver
TreeAggregate
• Simple heuristic to add a level
• Performs partial aggregation by shipping results to other executors (by repartitioning)
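A hedged sketch of the per-iteration gradient pass that spark.LBFGS distributes (the real code lives inside its cost function); treeAggregate and broadcast are real Spark API, while the least-squares per-sample gradient here stands in for the logistic one:

```scala
import breeze.linalg.DenseVector
import org.apache.spark.rdd.RDD

// One distributed gradient pass over the cached learning set.
def gradientPass(data: RDD[(Double, DenseVector[Double])],
                 beta: DenseVector[Double],
                 n: Int): (DenseVector[Double], Double) = {
  val bcBeta = data.sparkContext.broadcast(beta) // ship the model (weights) once
  data.treeAggregate((DenseVector.zeros[Double](n), 0.0))(
    seqOp = { case ((gradSum, lossSum), (label, features)) =>
      // Least-squares example in place of the logistic gradient:
      val err = (features dot bcBeta.value) - label
      (gradSum + features * (2.0 * err), lossSum + err * err)
    },
    combOp = { case ((g1, l1), (g2, l2)) =>
      (g1 + g2, l1 + l2) // partial aggregation on executors, final on the driver
    },
    depth = 2) // the extra tree level treeAggregate adds over plain aggregate
}
```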
Job UI – big job
Implementation
Overfitting
› We have more features than samples
› Some features are poorly represented
› For example:
  – only one sample for the "carbon" tag
  – the sample is labeled "automotive"
› The model would give this feature a high weight for the "automotive" class and 0 for the others
  – Do you think that is correct?
› How would you solve this?
Regularization
› A solution internal to the regression mechanism
› We introduce regularization into the cost function:
  $l_{total}(\beta, x) = l_{model}(\beta, x) + \lambda \cdot l_{reg}(\beta)$
  L2 regularization: $l_{reg}(\beta) = \frac{1}{2} \|\beta\|^2$
› $\lambda$ – regularization constant
› What happens if $\lambda$ is too large?
› What happens if $\lambda$ is too small?
› Spark's LBFGS has L2 built-in
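To make the combined cost concrete, a small sketch of how the L2 term enters the loss and gradient (matching the formulas above; the names are illustrative, not Spark's internals):

```scala
import breeze.linalg.DenseVector

// l_total    = l_model    + lambda * (1/2) * ||beta||^2
// grad_total = grad_model + lambda * beta
def withL2(modelLoss: Double,
           modelGrad: DenseVector[Double],
           beta: DenseVector[Double],
           lambda: Double): (Double, DenseVector[Double]) = {
  (modelLoss + lambda * 0.5 * (beta dot beta), modelGrad + beta * lambda)
}
```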
Finding the Best Lambda
› We choose the best $\lambda$ using cross-validation
  – Set aside 30% of the learning set, and use it for test
› Build a model for every $\lambda$ and compare precision
› Let's parallelize? Is there a more efficient way to do this?
  – We use the fact that for a large $\lambda$ the model is underfitted and converges fast
  – Start from a large $\lambda$ and use its model as the starting point of the next iteration (sketched below)
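A hedged sketch of that warm-start sweep using Spark MLlib's low-level optimizer (LBFGS.runLBFGS, LogisticGradient and SquaredL2Updater are real Spark 1.x API; evaluatePrecision and the halving λ grid are assumptions for illustration):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.optimization.{LBFGS, LogisticGradient, SquaredL2Updater}
import org.apache.spark.rdd.RDD

// train: RDD[(label, features)]; precision is measured on the 30% holdout.
def sweepLambda(train: RDD[(Double, Vector)],
                evaluatePrecision: Vector => Double, // assumed helper
                numFeatures: Int): (Double, Vector) = {
  val lambdas = Iterator.iterate(25.0)(_ / 2).take(14) // 25, 12.5, 6.25, ...
  var weights: Vector = Vectors.zeros(numFeatures)     // start for the largest lambda
  var best = (0.0, weights)
  for (lambda <- lambdas) {
    val (w, _) = LBFGS.runLBFGS(
      train, new LogisticGradient(), new SquaredL2Updater(),
      10,    // numCorrections: size of the L-BFGS history
      1e-4,  // convergence tolerance
      30,    // max iterations per lambda, as in the deck
      lambda, weights)
    val p = evaluatePrecision(w)
    if (p > best._1) best = (p, w)
    weights = w // warm start: the underfitted model seeds the next, smaller lambda
  }
  best
}
```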
CHOOSING REGULARIZATION PARAMETER

Lambda | Precision | Iterations
25     | 35.06%    | 3
12.5   | 35.45%    | 12
6.25   | 36.68%    | 5
3.125  | 38.41%    | 5
1.563  | Failure!  | –
0.781  | 45.87%    | 13
0.391  | 50.64%    | 10
0.195  | 55.04%    | 13
0.098  | 58.33%    | 17
0.049  | 60.93%    | 19
0.024  | 62.33%    | 21
0.012  | 64.30%    | 25
0.006  | 65.95%    | 42
0.003  | 65.46%    | 38

• After choosing the best lambda, we can use the complete learning set to calculate the final model
• Failures can be caused externally or internally
• Avg iteration time: 2 sec
LBFGS EXTENSION & BUGFIXES
Goals:
• Enable passing starting weights into LBFGS
• More transparency
› The Spark layer of LBFGS swallows all failures
  – and returns bad weights
› Feature scaling was always on
  – Redundant in our case
  – Rendered passed weights unusable
  – Lowered model precision
› Expose the effective number of iterations to external monitoring
SPARK ADDITIONS & BUG FIXES
› PoliteLBFGS addition to spark.LBFGS
  (class PoliteLbfgs extends spark.Lbfgs)
  – 3-5% more precise (for our data)
  – 30% faster calculation
› Planning to contribute back to Spark

Was it worth the trouble?

po·lite /pəˈlīt/ : having or showing behavior that is respectful and considerate of others.
synonyms: well mannered, civil, courteous, mannerly, respectful, deferential, well behaved
Job UI – small job
RUNNING
Hardware
Cluster:
› 110 machines
› 5.20 TB memory
› 6600 VCores
› Yarn, block size 128 MB
› Shared with other MapReduce jobs and HBase
Per machine:
› 60 VCores
› 64 GB memory – ~1 GB per VCore
› 12 cores – 5 VCores per physical core (tuned for MapReduce)
› CentOS 6.6
› cdh-5.4.8
Execution – Good Neighboring
› Each source has a different number of samples and features
› Execution profiles for a single learning run:

Profile                | Small      | Large
#Samples               | ~50K       | 500K
Input size             | under 1 GB | 1–3 GB
#Executors             | 2          | 22
Executor memory        | 2g         | 4g
Driver memory          | 2g         | 18g
Yarn driver overhead   | 2g         | 2g
Yarn executor overhead | 1g         | 1g
#Jobs per profile      | 200        | 180
Execution Example

Hardware: driver                            | 2 cores, 20g memory
Hardware: executors                         | 22 machines × (2 cores, 5g memory)
Number of features                          | 100,000
Number of samples                           | 500,000
Total number of iterations (14 different λ) | 152
Avg iteration time                          | 18.8 sec
Total learning time                         | 2863 sec (48 minutes)
Max iterations for a single λ               | 30
Could you guess the reason for the difference?

Run | Phase name        | Real time [sec] | Iteration time [sec] | Iterations
1   | parent-glm-AVTags | 29101           | 153.2                | 190
2   | parent-glm-AVTags | 15226           | 82.3                 | 185
3   | parent-glm-AVTags | 2863            | 18.8                 | 152

• OK, I admit, the cluster was very loaded in the first run
• What about the second?
• org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle
• Increase spark.shuffle.memoryFraction=0.5
AKKA IN THE REAL WORLD
› spark.akka.frameSize = 100
› spark.akka.askTimeout = 200
› spark.akka.lookupTimeout = 200

Response times are slower when the cluster is loaded.
askTimeout seems to be particularly responsible for executor failures when removing broadcasts and unpersisting RDDs.
Kryo Stability
› Kryo uses quite a lot of memory
  – if the buffer is not sufficient, the process will crash
  – spark.kryoserializer.buffer.max.mb = 512
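Gathered in one place, a hedged sketch of these Spark 1.x era stability settings as a SparkConf (property names and values are the ones quoted in the deck):

```scala
import org.apache.spark.SparkConf

// Stability tuning for a loaded, shared Yarn cluster (Spark 1.x / cdh-5.4.8 era).
val conf = new SparkConf()
  .set("spark.akka.frameSize", "100")                // larger task/result messages
  .set("spark.akka.askTimeout", "200")               // slack for loaded clusters
  .set("spark.akka.lookupTimeout", "200")
  .set("spark.shuffle.memoryFraction", "0.5")        // avoid lost shuffle outputs
  .set("spark.kryoserializer.buffer.max.mb", "512")  // avoid Kryo buffer crashes
```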
LEARNING SET:
CLASSES
› Shopping
– Clothing
– Consumer Electronics
– Jewelry
– …
› Sports
– Baseball
– Basketball
– Boxing
– …
› …
Manually defined
246 categories
2-level tree
25 Parent categories
LEARNING SET:
FEATURES
› Tag Count Source
– cnn.com | news | 1
– bbc.com | culture | 50
– …
› HTML Analyzer Source
– cnn.com | money | 14
– nba.com | nba draft | 2
– …
11 basic sources
Feature is:
site | tag | score
Some reintroduced after
additional processing
Eventually – 16 sources
~500K site samples
18 GB of data
4M Unique features
~800K features/source
Need: How would you improve over time?
› We collect different kinds of data:
– Tags
– Links
– User behavior
– …
› How to identify where to focus collection efforts?
› How to improve classification algorithm?
Current Approach - Training (16 sources, 25 L1 classes; see the sketch below)
› foreach source
  – choose the 100K most influential features
  – train a model for L1
  – foreach L1 class (avg 9.2 L2 classes per L1)
    › train a model for L2
› foreach source
  – foreach sample in the training set
    › calculate probabilities (θ) of belonging to each of the L1 classes
› train a Random Forest using the L1 probabilities set
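A hedged structural sketch of that training loop; Model and all the train*/build* functions are assumed placeholders standing in for the MLlib logistic regression and Random Forest calls:

```scala
// Stand-ins for the real MLlib model types and training calls (assumptions):
type Model = Vector[Double]

def trainAll(sources: Seq[String],   // the 16 sources
             l1Classes: Seq[String], // the 25 L1 classes
             trainL1: String => Model,           // LR model per source
             trainL2: (String, String) => Model, // LR model per (source, L1)
             buildRfTrainingSet: Map[String, Model] => Seq[Array[Double]],
             trainRF: Seq[Array[Double]] => Model) = {
  val l1Models = sources.map(src => src -> trainL1(src)).toMap
  val l2Models = (for { src <- sources; l1 <- l1Classes }
                  yield (src, l1) -> trainL2(src, l1)).toMap
  // Score every training sample against each source's L1 model to get the
  // per-class probabilities (theta), then train the Random Forest on them.
  val rf = trainRF(buildRfTrainingSet(l1Models))
  (l1Models, l2Models, rf)
}
```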
Current Approach - Application
› foreach site to classify
  – foreach source
    › calculate probabilities (θ) of belonging to each L1 class
  – aggregate results and estimate L1 (using the RF model)
  – given the estimated L1, foreach source
    › calculate the estimated L2
  – choose (by voting) the final L2
OTHER EXTENSIONS
› Extend mllib.LogisticRegressionModel to return probabilities instead of the final decision from the "predict" method: model.advise(p: point)
› For example
  – Site: nhl.com
  – Instead of "is L1 = sports"
  – We produce:
    › P(news) = 30%
    › P(sports) = 65%
    › P(art) = 5%
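The deck shows the advise signature but not its body; as a hedged illustration, per-class probabilities for a multinomial model can be computed as a softmax over per-class margins (the weight-matrix layout here is an assumption, not mllib's internal format):

```scala
import breeze.linalg.{DenseMatrix, DenseVector, max, sum}
import breeze.numerics.exp

// w: numClasses x numFeatures weights, x: feature vector of one site.
// Returns P(class | x) instead of the argmax that predict() gives.
def advise(w: DenseMatrix[Double], x: DenseVector[Double]): DenseVector[Double] = {
  val margins = w * x                        // one raw score per class
  val shifted = exp(margins - max(margins))  // shift for numerical stability
  shifted / sum(shifted)
}
```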
Summary: This Approach vs. Straight Logistic Regression
› Increases precision by using more features
› Increases coverage by using very granular features
› We have feedback (from the RF) regarding the quality of each source
  – Using out-of-bag error
› Natural parallelization by source
› No need for feature scaling
Editor's Notes
1. My customer is VirtualJoias, an online jewelry retailer from Brazil. He would like to expand to the UK market. Who are the major players in the UK in this space?
2. Important to note here that there is distributed calculation involved here.
3. Regularization flattens the cost function. If λ is large, the algorithm converges early but not accurately. If λ is small, we go back to overfitting.
4. If passing in starting weights, they are messed up because of the automatic feature scaling, which breaks the whole point of passing weights.
5. 18G is a huge issue on a loaded Yarn cluster in FIFO mode.