Training Large-scale Ad Ranking Models in Spark
Presented by Patrick Pletscher, October 19, 2015
About Us
Michal Aharon, Oren Somekh, Yaacov Fernandess, Yair Koren,
Amit Kagian, Shahar Golan, Raz Nissim, Patrick Pletscher,
Amir Ingber
Haifa
Collaborator
What We Do
Research focused on ad ranking algorithms for Yahoo Gemini Native Ads
Ad Ranking Overview
• Advertisers run several campaigns each with several ads
• Each ad has a bid set by the advertiser; different ad price types
- pay per view
- pay per click
- various conversion price types
• Auction for each impression on a Gemini Native enabled property
- auction between all eligible ads (filter by targeting/budget)
- ad with the highest expected revenue is determined
• Need to know the (personalized!) probability of a click
- we mostly get money for clicks / conversions!
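[Figure: example auction for one user. Ad 1: bid $1, predicted CTR 5%, expected revenue 5c. Ad 2: bid $2, predicted CTR 1%, expected revenue 2c.]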
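The ranking rule above (pick the eligible ad with the highest expected revenue) can be sketched in a few lines of Scala. This is only an illustration for pay-per-click ads: the `Ad` type, its field names and the bid/CTR numbers are assumptions made for the example, not the production data model, and the eligible set is assumed to be non-empty.

```scala
// Hypothetical type for illustration; not taken from the Gemini codebase.
final case class Ad(name: String, bidPerClickCents: Double, predictedCtr: Double) {
  // Expected revenue per impression for a pay-per-click ad: bid * P(click).
  def expectedRevenueCents: Double = bidPerClickCents * predictedCtr
}

object AuctionSketch {
  // Pick the eligible ad with the highest expected revenue (assumes a non-empty list).
  def winner(eligible: Seq[Ad]): Ad = eligible.maxBy(_.expectedRevenueCents)

  // Illustrative numbers: a lower bid can still win when the predicted CTR is higher.
  val ads: Seq[Ad] = Seq(
    Ad("Ad 1", bidPerClickCents = 100.0, predictedCtr = 0.05),
    Ad("Ad 2", bidPerClickCents = 200.0, predictedCtr = 0.01)
  )
}
```

Calling `AuctionSketch.winner(AuctionSketch.ads)` selects "Ad 1" here, which is why an accurate, personalized click probability matters as much as the bid.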
Click-Through Rate (CTR) Prediction
• Given a user and context, predict probability of a click for an ad.
• Probably the most “profitable” machine learning problem in industry
- simple binary problem; but want probabilities, not just the label
- very skewed label distribution: clicks << skips
- tons of data (every impression generates a training example)
- limitations at serving: need to predict quickly
• Basic setting quite well-studied; scale makes it challenging
- Google (McMahan et al. 2013)
- Facebook (He et al. 2014)
- Yahoo (Aharon et al. 2013)
- others (Chapelle et al. 2014)
• Some more involved research topics
- Exploration/Exploitation tradeoff
- Learning from logged feedback
Overview - CTR Prediction for Gemini Native Ads
• Collaborative Filtering approach (Aharon et al. 2013)
- Current production system
- Implemented in Hadoop MapReduce
- Used in Gemini Native ad ranking
• Large-scale Logistic Regression
- A research prototype
- Implemented in Spark
- The combination of Spark & Scala allows us to iterate quickly
- Takes several concepts from the CF approach
Large-scale Logistic Regression in Spark
Apache Spark
• “Apache Spark is a fast and general engine for large-scale data processing”
• Similar to Hadoop
• Advantages over Hadoop MapReduce
- Option to cache data in memory, great for iterative computations
- A lot of syntactic sugar
‣ filter, reduceByKey, distinct, sortByKey, join
‣ in general Spark/Scala code very concise
- Spark Shell, great for interactive/ETL* workflows
- DataFrames interesting for data scientists coming from R / Python
• Includes modules for
- machine learning
- streaming
- graph computations
- SQL / DataFrames
*ETL: Extract, transform, load
Spark at Yahoo
• Spark 1.5.1, the latest version of Spark at the time of the talk (October 2015)
• Runs on top of Hadoop YARN 2.6
- integrates nicely with existing Hadoop tools and infrastructure at Yahoo
- data is generally stored in HDFS
• Clusters are centrally managed
• Large Hadoop deployment at Yahoo
- A few different clusters
- Each has at least a few thousand nodes
[Figure: stack diagram. Spark, MapReduce and Hive run on YARN (resource management), which sits on HDFS (storage).]
Dataset for CTR Prediction
• Billions of ad impressions daily
- Need for Streaming / Batched Streaming
- Each impression has a unique id
• Need click information for every impression for learning
- Join impressions with a click stream every x minutes
- Need to wait for the click; introduces some delay
[Figure: timeline in 15-minute batches (18:30, 18:45, 19:00, 19:15). The impressions and clicks of each batch are joined into labeled events. In Spark: union & reduceByKey.]
Example - Joining Impression & Click RDDs
val keyAndImpressions = impressions
  .map(e => (e.joinKey, ("i", e)))
val keyAndClicks = clicks
  .map(e => (e.joinKey, ("c", e)))
keyAndImpressions.union(keyAndClicks)
  .reduceByKey(smartCombine)
  .flatMap { case (k, (t, event)) => t match {
      case "ci" => Some(LabeledEvent(event, clicked=1)) // impression with a click
      case "i" => Some(LabeledEvent(event, clicked=0)) // impression without a click
      case "c" => None // click without a matching impression
    }
  }
def smartCombine(event1: (String, Event), event2: (String, Event)): (String, Event) = {
  (event1._1, event2._1) match {
    case ("c", "c") => event1 // de-dupe
    case ("i", "i") => event1 // de-dupe
    case ("c", "i") => ("ci", event2._2) // combine click and impression
    case ("i", "c") => ("ci", event1._2) // combine click and impression
    case ("ci", _) => event1 // de-dupe
    case (_, "ci") => event2 // de-dupe
  }
}
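To see what the tagging and combining logic does without a cluster, the same idea can be run on plain Scala collections. This is a local stand-in, not the Spark job itself: `groupBy` plus `reduce` plays the role of `union` and `reduceByKey`, and the `Event` and `LabeledEvent` classes are hypothetical minimal versions since the slides do not show the real ones.

```scala
object JoinSketch {
  // Hypothetical minimal event types; the real classes are not shown in the slides.
  final case class Event(joinKey: String, payload: String)
  final case class LabeledEvent(event: Event, clicked: Int)

  // Same combine rules as on the slide: de-dupe, and merge click + impression into "ci".
  def smartCombine(a: (String, Event), b: (String, Event)): (String, Event) =
    (a._1, b._1) match {
      case ("c", "c") => a
      case ("i", "i") => a
      case ("c", "i") => ("ci", b._2)
      case ("i", "c") => ("ci", a._2)
      case ("ci", _)  => a
      case (_, "ci")  => b
    }

  def label(impressions: Seq[Event], clicks: Seq[Event]): Seq[LabeledEvent] = {
    val tagged = impressions.map(e => (e.joinKey, ("i", e))) ++
      clicks.map(e => (e.joinKey, ("c", e)))
    tagged
      .groupBy(_._1) // local stand-in for the shuffle by join key
      .values
      .map(group => group.map(_._2).reduce(smartCombine))
      .collect {
        case ("ci", event) => LabeledEvent(event, clicked = 1)
        case ("i", event)  => LabeledEvent(event, clicked = 0)
        // a click without a matching impression ("c") is dropped
      }
      .toSeq
      .sortBy(_.event.joinKey)
  }

  // Key "a": duplicate impression plus a click; "b": impression only; "z": orphan click.
  val demo: Seq[LabeledEvent] = label(
    impressions = Seq(Event("a", "imp-a"), Event("b", "imp-b"), Event("a", "imp-a")),
    clicks = Seq(Event("a", "clk-a"), Event("z", "clk-z"))
  )
}
```

In `demo`, key "a" becomes a positive example that keeps the impression's payload, "b" becomes a negative example, and the orphan click "z" disappears.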
Incremental Learning Architecture
[Figure: the same 15-minute timeline (18:30, 18:45, 19:00, 19:15). Impressions and clicks are joined into labeled events, feature extraction turns them into learning examples, and each learning step updates the previous model.]
Large-scale Logistic Regression
• Industry standard for CTR prediction (McMahan et al. 2013, He et al. 2014)
• Models the probability of a click as p(click | x) = 1 / (1 + exp(-w·x))
- feature vector x
‣ high-dimensional vector but sparse (few non-zero values)
‣ model expressivity controlled by the features
‣ a lot of hand-tuning and playing around
- model parameters w
‣ need to be learned
‣ generally rather non-sparse
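A minimal sketch of the prediction step, assuming the standard logistic form P(click | x) = 1 / (1 + exp(-w·x)) with a dense weight array and a sparse feature vector. The object and method names are hypothetical; the point is only that the cost of a prediction depends on the number of non-zero features, not on the dimensionality.

```scala
object LogisticSketch {
  // Sparse feature vector: index -> value; only non-zero entries are stored.
  type SparseVector = Map[Int, Double]

  def sigmoid(a: Double): Double = 1.0 / (1.0 + math.exp(-a))

  // P(click | x) = sigmoid(w . x); only the non-zero features contribute to the dot product.
  def predict(weights: Array[Double], x: SparseVector): Double =
    sigmoid(x.iterator.map { case (i, v) => weights(i) * v }.sum)
}
```

With all-zero weights every example gets probability 0.5; a positive weight on an active feature pushes the prediction above that.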
Features for Logistic Regression
• Basic features
- age, gender
- browser, device
• Feature crosses
- E.g. age x gender x state (30 year old male from Boston)
- mostly indicator features
- Examples:
‣ gender^age m^30
‣ gender^device m^Windows_NT
‣ gender^section m^5417810
‣ gender^state m^2347579
‣ age^device 30^Windows_NT
• Feature hashing to get a vector of fixed length
- hash all the index tuples, e.g. (gender^age, m^30), to get a numeric index
- will introduce collisions! Choose dimensionality large enough
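Feature hashing as described above can be sketched as follows. The choice of MurmurHash3, the `name=value` string encoding and the helper names are assumptions made for this example; the slides do not say which hash function or encoding was used.

```scala
import scala.util.hashing.MurmurHash3

object HashingSketch {
  // Map a (feature name, feature value) pair to an index in [0, numBuckets).
  def index(name: String, value: String, numBuckets: Int): Int = {
    val h = MurmurHash3.stringHash(name + "=" + value)
    java.lang.Math.floorMod(h, numBuckets) // stays non-negative even when h is negative
  }

  // Turn raw attributes plus the requested crosses into a sparse indicator vector.
  // Colliding features simply share one index (and one indicator value of 1.0).
  def featurize(attrs: Map[String, String],
                crosses: Seq[(String, String)],
                numBuckets: Int): Map[Int, Double] = {
    val basic = attrs.toSeq
    val crossed = crosses.flatMap { case (a, b) =>
      for (va <- attrs.get(a); vb <- attrs.get(b)) yield (a + "^" + b, va + "^" + vb)
    }
    (basic ++ crossed).map { case (n, v) => index(n, v, numBuckets) -> 1.0 }.toMap
  }
}
```

For example, `HashingSketch.featurize(Map("gender" -> "m", "age" -> "30"), Seq(("gender", "age")), 1 << 20)` yields up to three active indices: one each for gender, age and the gender^age cross. A larger `numBuckets` lowers the chance that two of them collide.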
Parameter Estimation
• Basic Problem: Regularized Maximum Likelihood
- Often: L1 regularization instead of L2
‣ promotes sparsity in the weight vector
‣ more efficient predictions in serving (also requires less memory!)
- Batch vs. streaming
‣ in our case: batched streaming, every x min perform an incremental model update
• Follow-the-Regularized-Leader, FTRL (McMahan et al. 2013)
- sequential online algorithm: only use a data point once
- similar to stochastic gradient descent
- per coordinate learning rates
- encourages sparseness
- FTRL stores weight and accumulated gradient per coordinate
[Equation annotations: the likelihood term fits the training data; the regularization term prevents overfitting]
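For readers who have not seen FTRL, here is a single-machine sketch of the per-coordinate FTRL-Proximal update for logistic regression, following the pseudocode in McMahan et al. (2013) as I understand it. It is not the implementation behind these slides: the class names and hyper-parameter defaults are arbitrary, and the state kept per coordinate (`z` and `n`) follows the paper and may differ from what the authors' code stores.

```scala
object FtrlSketch {
  // Hyper-parameters named as in McMahan et al. (2013); the default values are arbitrary.
  final case class Params(alpha: Double = 0.1, beta: Double = 1.0,
                          lambda1: Double = 0.0, lambda2: Double = 0.0)

  final class Model(dim: Int, p: Params) {
    private val z = new Array[Double](dim) // per-coordinate adjusted gradient sums
    private val n = new Array[Double](dim) // per-coordinate sums of squared gradients

    // Closed-form weight for one coordinate; exactly 0 when |z_i| <= lambda1 (L1 sparsity).
    def weight(i: Int): Double =
      if (math.abs(z(i)) <= p.lambda1) 0.0
      else -(z(i) - math.signum(z(i)) * p.lambda1) /
        ((p.beta + math.sqrt(n(i))) / p.alpha + p.lambda2)

    def predict(x: Map[Int, Double]): Double = {
      val a = x.iterator.map { case (i, v) => weight(i) * v }.sum
      1.0 / (1.0 + math.exp(-a))
    }

    // One online step on a single example with label y in {0, 1}; each example is used once.
    def update(x: Map[Int, Double], y: Int): Unit = {
      val prob = predict(x) // prediction with the weights before this step
      x.foreach { case (i, v) =>
        val g = (prob - y) * v // gradient of the log loss for coordinate i
        // sigma encodes the per-coordinate learning-rate change
        val sigma = (math.sqrt(n(i) + g * g) - math.sqrt(n(i))) / p.alpha
        z(i) += g - sigma * weight(i)
        n(i) += g * g
      }
    }
  }
}
```

Only coordinates that are active in an example are touched, and with `lambda1 > 0` a coordinate keeps a weight of exactly zero until its accumulated gradient is large enough, which is where the sparsity comes from.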
Basic Parallelized FTRL in Spark
import org.apache.spark.rdd.RDD
import breeze.linalg.DenseVector // assumed: the vector arithmetic below matches Breeze

def train(examples: RDD[LearningExample]): Unit = {
  // each partition starts from the current model and returns its local change
  val delta = examples
    .repartition(numWorkers)
    .mapPartitions(xs => updatePartition(xs, weights, counts))
    .treeReduce { case (a, b) => (a._1 + b._1, a._2 + b._2) }
  // average the per-partition changes into the shared model
  weights += delta._1 / numWorkers.toDouble
  counts += delta._2 / numWorkers.toDouble
}

def updatePartition(examples: Iterator[LearningExample],
                    weights: DenseVector[Double],
                    counts: DenseVector[Double]):
    Iterator[(DenseVector[Double], DenseVector[Double])] = {
  // standard FTRL code for examples
  Iterator((deltaWeights, deltaCounts))
}
Note (hack): updatePartition actually returns a single result, but Spark expects an iterator!
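The averaging step in `train` can be illustrated without Spark. In this sketch each "partition" is a plain sequence, every worker computes a delta from the same starting weights, and the deltas are summed and divided by the number of partitions. The `LocalUpdate` function type and the toy update used in the example are hypothetical stand-ins for the per-partition FTRL code.

```scala
object AveragingSketch {
  // Hypothetical local update: given the shared weights and one partition's data,
  // return the change this worker would like to apply.
  type LocalUpdate = (Array[Double], Seq[Double]) => Array[Double]

  def trainStep(weights: Array[Double],
                partitions: Seq[Seq[Double]],
                localUpdate: LocalUpdate): Array[Double] = {
    // every worker starts from the same weights (the role of mapPartitions)
    val deltas = partitions.map(part => localUpdate(weights, part))
    // element-wise sum of all deltas (the role of treeReduce)
    val summed = deltas.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    // apply the average delta to the shared weights
    weights.zip(summed).map { case (w, d) => w + d / partitions.size }
  }

  // Toy example: each worker's delta is just the sum of its data points.
  val demo: Array[Double] =
    trainStep(Array(0.0), Seq(Seq(1.0, 2.0), Seq(3.0)), (_, xs) => Array(xs.sum))
}
```

Both toy partitions produce a delta of 3.0, so the averaged update moves the single weight from 0.0 to 3.0.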
Summary: LR with Spark
• Efficient: Can learn on all the data
- before: somewhat aggressive subsampling of the skips
• Possible to do feature pre-processing
- in Hadoop MapReduce much harder: only one pass over data
- drop infrequent features, TF-IDF, …
• Spark-shell as a life-saver
- helps to debug problems as one can inspect intermediate results at scale
- have yet to try Zeppelin notebooks
• Easy to unit test complex workflows
Spark: Lessons Learned
Upgrade!
• Spark has a fairly regular three-month release schedule
• Always run with the latest version
- Lots of bugs get fixed
- Difficult to keep up with new functionality (see DataFrame vs. RDD)
• Speed improvements over the past year
Configurations
• Our solution
- config directory containing
‣ Logging: log4j.properties
‣ Spark itself: spark-defaults.conf
‣ our code: application.conf
- two versions of configs: local & cluster
- in YARN: specify them using --files argument & SPARK_CONF_DIR variable
• Use Typesafe’s config library for all application related configs
- provide sensible defaults for everything
- override using application.conf
• Do not hard-code any configurations in code
Accumulators
• Use accumulators for ensuring correctness!
• Example:
- parse data, ignore event if there is a problem with the data
- use accumulator to count these failed lines
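import org.apache.spark.Accumulator

class Parser(failedLinesAccumulator: Accumulator[Int]) extends Serializable {
  def parse(s: String): Option[Event] = {
    try {
      // parsing logic goes here
      Some(...)
    }
    catch {
      case e: Exception => {
        failedLinesAccumulator += 1 // count the bad line instead of failing the job
        None
      }
    }
  }
}
// pass the accumulator itself; wrapping it in Some(...) would not match Parser's constructor
val accumulator = sc.accumulator(0, "failed lines")
val parser = new Parser(accumulator)
val events = sc.textFile("hdfs:///myfile")
  .flatMap(s => parser.parse(s))
// accumulator.value is only meaningful on the driver after an action has run on events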
RDD vs. DataFrame in Spark
• Initially Spark advocated Resilient Distributed Datasets (RDDs) as the data set abstraction
- type-safe
- usually stores some Scala case class
- code relatively easy to understand
• Recently Spark is pushing towards using DataFrame
- similar to R and Python’s Pandas data frames
- some advantages
‣ less rigid types: can append columns
‣ speed
- disadvantage: code readability suffers for non-basic types
‣ user defined types
‣ user defined functions
• Have not fully migrated to it yet
Every Day I’m Shuffling…
• Careful with operations which send a lot of data over the network
- reduceByKey
- repartition / shuffle
• Careful with sending too much data to the driver
- collect
- reduce
• Found mapPartitions & treeReduce useful in some cases (see FTRL example)
• Play with Spark configurations: frameSize (spark.akka.frameSize), maxResultSize (spark.driver.maxResultSize), timeouts…
[Figure: example pipeline textFile → flatMap → map → reduceByKey]
Machine Learning in Spark
• Relatively basic
- some algorithms don’t scale so well
- not customizable enough for experts:
‣ optimizers that assume a regularizer
‣ built our own DSL for feature extraction & combination
‣ a lot of the APIs are not exposed, i.e. private to Spark
- will hopefully get there eventually
• Nice: new Transformer / Estimator / Pipeline approach
- Inspired by scikit-learn, makes it easy to combine different algorithms
- Requires DataFrame
- Example (from Spark docs)
val tokenizer = new Tokenizer()
.setInputCol("text")
.setOutputCol("words")
val hashingTF = new HashingTF()
.setNumFeatures(1000)
.setInputCol(tokenizer.getOutputCol)
.setOutputCol("features")
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.01)
val pipeline = new Pipeline()
.setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(training)
Thank you!

KubeKraft presentation @CloudNativeHooghly
 
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank  Design by Working Stress - IS Method.pdfIntze Overhead Water Tank  Design by Working Stress - IS Method.pdf
Intze Overhead Water Tank Design by Working Stress - IS Method.pdf
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
notes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.pptnotes on Evolution Of Analytic Scalability.ppt
notes on Evolution Of Analytic Scalability.ppt
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
chapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineeringchapter 5.pptx: drainage and irrigation engineering
chapter 5.pptx: drainage and irrigation engineering
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 

Training Large-scale Ad Ranking Models in Spark

  • 1. Training Large-scale Ad Ranking Models in Spark PRESENTED BY Patrick Pletscher October 19, 2015
  • 2. About Us 2 Michal Aharon Oren Somekh Yaacov Fernandess Yair Koren Amit Kagian Shahar Golan Raz Nissim Patrick Pletscher Amir Ingber Haifa Collaborator
  • 3. What We Do 3 Research focused on ad ranking algorithms for Yahoo Gemini Native Ads
  • 4. Ad Ranking Overview 4 • Advertisers run several campaigns each with several ads • Each ad has a bid set by the advertiser; different ad price types - pay per view - pay per click - various conversion price types • Auction for each impression on a Gemini Native enabled property - auction between all eligible ads (filter by targeting/budget) - ad with the highest expected revenue is determined • Need to know the (personalized!) probability of a click - we mostly get money for clicks / conversions! Ad 1 Ad 2 2$1$ 5% 1%5c 2c user
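The ranking rule on this slide can be sketched as a small worked example. The figures are read off the slide's diagram, which appears to show Ad 1 with a $1 bid and a 5% click probability (5c expected) and Ad 2 with a $2 bid and a 1% click probability (2c expected). The `Ad` type and `AuctionSketch` object are hypothetical names used only for illustration, and only the pay-per-click price type is modeled.

```scala
// Hypothetical types for illustration; not taken from the talk.
final case class Ad(name: String, bidPerClick: Double, pClick: Double) {
  // Expected revenue per impression for a pay-per-click ad.
  def expectedRevenue: Double = bidPerClick * pClick
}

object AuctionSketch {
  // Pick the eligible ad with the highest expected revenue.
  def winner(eligible: Seq[Ad]): Ad = eligible.maxBy(_.expectedRevenue)

  // Numbers as read from the slide's figure (an interpretation of the diagram).
  val ads: Seq[Ad] = Seq(
    Ad("Ad 1", bidPerClick = 1.00, pClick = 0.05),
    Ad("Ad 2", bidPerClick = 2.00, pClick = 0.01)
  )
}
```

Under these numbers the lower bid wins because its click probability is five times higher, which is why a personalized click probability matters more than the raw bid.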
  • 5. Click-Through Rate (CTR) Prediction 5 • Given a user and context, predict probability of a click for an ad. • Probably the most “profitable” machine learning problem in industry - simple binary problem; but want probabilities, not just the label - very skewed label distribution: clicks << skips - tons of data (every impression generates a training example) - limitations at serving: need to predict quickly • Basic setting quite well-studied; scale makes it challenging - Google (McMahan et al. 2013) - Facebook (He et al. 2014) - Yahoo (Aharon et al. 2013) - others (Chapelle et al. 2014) • Some more involved research topics - Exploration/Exploitation tradeoff - Learning from logged feedback
  • 6. Overview - CTR Prediction for Gemini Native Ads 6 • Collaborative Filtering approach (Aharon et al. 2013) - Current production system - Implemented in Hadoop MapReduce - Used in Gemini Native ad ranking • Large-scale Logistic Regression - A research prototype - Implemented in Spark - The combination of Spark & Scala allows us to iterate quickly - Takes several concepts from the CF approach
  • 8. Apache Spark 8 • “Apache Spark is a fast and general engine for large-scale data processing” • Similar to Hadoop • Advantages over Hadoop MapReduce - Option to cache data in memory, great for iterative computations - A lot of syntactic sugar ‣ filter, reduceByKey, distinct, sortByKey, join ‣ in general Spark/Scala code very concise - Spark Shell, great for interactive/ETL* workflows - Dataframes interesting for data scientists coming from R / Python • Includes modules for - machine learning - streaming - graph computations - SQL / Dataframes *ETL: Extract, transform, load
  • 9. Spark at Yahoo 9 • Spark 1.5.1, the latest version of Spark • Runs on top of Hadoop YARN 2.6 - integrates nicely with existing Hadoop tools and infrastructure at Yahoo - data is generally stored in HDFS • Clusters are centrally managed • Large Hadoop deployment at Yahoo - A few different clusters - Each has at least a few thousand nodes [Diagram: Spark, MapReduce and Hive run on YARN (resource management), on top of HDFS (storage)]
  • 10. Dataset for CTR Prediction 10 • Billions of ad impressions daily - Need for Streaming / Batched Streaming - Each impression has a unique id • Need click information for every impression for learning - Join impressions with a click stream every x minutes - Need to wait for the click; introduces some delay 18:30 18:45 19:00 clicks impressions impressions clicks impressions clicks 19:15 impressions clicks labeled events labeled events in Spark: union & reduceByKey
  • 11. Example - Joining Impression & Click RDDs 11
 val keyAndImpressions = impressions
 .map(e => (e.joinKey, ("i", e)))
 val keyAndClicks = clicks
 .map(e => (e.joinKey, ("c", e)))
 keyAndImpressions.union(keyAndClicks)
 .reduceByKey(smartCombine)
 .flatMap { case (k, (t, event)) => t match {
 case "ci" => Some(LabeledEvent(event, clicked=1))
 case "i" => Some(LabeledEvent(event, clicked=0))
 case "c" => None // click without a matching impression: drop
 }
 }
 def smartCombine(event1: (String, Event), event2: (String, Event)): (String, Event) = {
 (event1._1, event2._1) match {
 case ("c", "c") => event1 // de-dupe
 case ("i", "i") => event1 // de-dupe
 case ("c", "i") => ("ci", event2._2) // combine click and impression
 case ("i", "c") => ("ci", event1._2) // combine click and impression
 case ("ci", _) => event1 // de-dupe
 case (_, "ci") => event2 // de-dupe
 }
 }
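For unit tests it can help to have a plain-collections reference for the same labeling rule, without an RDD. The sketch below is not the union & reduceByKey approach from the slide; it is a simpler set-lookup equivalent meant only for small in-memory inputs. `Event`, `LabeledEvent` and `JoinSketch` are simplified stand-ins, assumed for illustration.

```scala
object JoinSketch {
  // Simplified stand-ins for the event types used on the slide (assumed shapes).
  final case class Event(joinKey: String, payload: String)
  final case class LabeledEvent(event: Event, clicked: Int)

  // Label each distinct impression with 1 if a click shares its join key, else 0.
  // Clicks without a matching impression are dropped, as in the slide's "c" case.
  def label(impressions: Seq[Event], clicks: Seq[Event]): Seq[LabeledEvent] = {
    val clickedKeys: Set[String] = clicks.map(_.joinKey).toSet
    impressions
      .groupBy(_.joinKey)        // de-duplicate impressions per key
      .values
      .map(_.head)
      .map(e => LabeledEvent(e, if (clickedKeys.contains(e.joinKey)) 1 else 0))
      .toSeq
  }
}
```

The output order is unspecified, so a test comparing it against the Spark job's output should sort both sides by join key first.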
  • 12. Incremental Learning Architecture 12 [Diagram: impressions and clicks arriving at 18:30, 18:45, 19:00 and 19:15 are joined into labeled events; feature extraction turns these into learning examples; learning takes the previous model and produces an updated model]
  • 13. Large-scale Logistic Regression 13 • Industry standard for CTR prediction (McMahan et al. 2013, He et al. 2014) • Models the probability of a click as $p(\text{click} \mid x) = 1 / (1 + \exp(-w^\top x))$ - feature vector $x$ ‣ high-dimensional vector but sparse (few non-zero values) ‣ model expressivity controlled by the features ‣ a lot of hand-tuning and playing around - model parameters $w$ ‣ need to be learned ‣ generally rather non-sparse
  • 14. Features for Logistic Regression 14 • Basic features - age, gender - browser, device • Feature crosses - E.g. age x gender x state (30 year old male from Boston) - mostly indicator features - Examples: ‣ gender^age m^30 ‣ gender^device m^Windows_NT ‣ gender^section m^5417810 ‣ gender^state m^2347579 ‣ age^device 30^Windows_NT • Feature hashing to get a vector of fixed length - hash all the index tuples, e.g. (gender^age, m^30), to get a numeric index - will introduce collisions! Choose dimensionality large enough
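Feature hashing as described on this slide can be sketched in a few lines. The talk does not say which hash function is used; MurmurHash3 from the Scala standard library is an assumption here, and `FeatureHashingSketch`, `index` and `hashFeatures` are hypothetical names.

```scala
import scala.util.hashing.MurmurHash3

object FeatureHashingSketch {
  // Map a (feature name, feature value) pair, e.g. ("gender^age", "m^30"),
  // to an index in [0, numBuckets).
  def index(name: String, value: String, numBuckets: Int): Int = {
    val h = MurmurHash3.stringHash(name + "=" + value)
    // h may be negative; shift the remainder into the non-negative range.
    ((h % numBuckets) + numBuckets) % numBuckets
  }

  // Turn indicator features into a sparse vector (index -> count).
  // Features that collide on the same index simply add up.
  def hashFeatures(features: Seq[(String, String)], numBuckets: Int): Map[Int, Double] =
    features
      .map { case (n, v) => index(n, v, numBuckets) }
      .groupBy(identity)
      .map { case (i, hits) => i -> hits.size.toDouble }
}
```

With a small `numBuckets`, distinct features share indices more often, which is the collision trade-off the slide warns about.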
  • 15. Parameter Estimation 15 • Basic Problem: Regularized Maximum Likelihood, $\min_w \; -\sum_i \log p(y_i \mid x_i, w) + \lambda R(w)$, where the first term fits the training data and the regularizer $R(w)$ prevents overfitting - Often: L1 regularization instead of L2 ‣ promotes sparsity in the weight vector ‣ more efficient predictions in serving (also requires less memory!) - Batch vs. streaming ‣ in our case: batched streaming, every x min perform an incremental model update • Follow-the-regularized-leader, FTRL (McMahan et al. 2013) - sequential online algorithm: only use a data point once - similar to stochastic gradient descent - per-coordinate learning rates - encourages sparseness - FTRL stores weight and accumulated gradient per coordinate
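A minimal single-machine sketch of the per-coordinate update, following the FTRL-Proximal formulation published in McMahan et al. 2013 rather than the team's actual code. It keeps two numbers per coordinate (`z`, the adjusted gradient sum, and `n`, the squared gradient sum) and derives each weight from them on demand. The class name and the hyperparameter names `alpha`, `beta`, `l1`, `l2` are assumptions for illustration.

```scala
final class FtrlProximal(dim: Int, alpha: Double, beta: Double, l1: Double, l2: Double) {
  private val z = new Array[Double](dim) // accumulated (adjusted) gradients
  private val n = new Array[Double](dim) // accumulated squared gradients

  // Closed-form weight for coordinate i; exactly zero while |z_i| <= l1,
  // which is where the sparsity comes from.
  def weight(i: Int): Double =
    if (math.abs(z(i)) <= l1) 0.0
    else -(z(i) - math.signum(z(i)) * l1) / ((beta + math.sqrt(n(i))) / alpha + l2)

  // Predicted click probability for a sparse feature vector (index -> value).
  def predict(x: Map[Int, Double]): Double = {
    val score = x.foldLeft(0.0) { case (acc, (i, v)) => acc + weight(i) * v }
    1.0 / (1.0 + math.exp(-score))
  }

  // One online step on a single example; label is 1.0 for a click, 0.0 for a skip.
  def update(x: Map[Int, Double], label: Double): Unit = {
    val p = predict(x)
    x.foreach { case (i, v) =>
      val g = (p - label) * v // gradient of the log loss for coordinate i
      val w = weight(i)       // weight before this step changes z and n
      val sigma = (math.sqrt(n(i) + g * g) - math.sqrt(n(i))) / alpha
      z(i) += g - sigma * w
      n(i) += g * g
    }
  }
}
```

Each example is used once and only the coordinates present in `x` are touched, so the cost of a step scales with the number of non-zero features rather than with `dim`.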
  • 16. Basic Parallelized FTRL in Spark 16
 def train(examples: RDD[LearningExample]): Unit = {
 val delta = examples
 .repartition(numWorkers)
 .mapPartitions(xs => updatePartition(xs, weights, counts))
 .treeReduce { case (a, b) => (a._1 + b._1, a._2 + b._2) }
 
 weights += delta._1 / numWorkers.toDouble
 counts += delta._2 / numWorkers.toDouble
 }
 def updatePartition(examples: Iterator[LearningExample],
 weights: DenseVector[Double],
 counts: DenseVector[Double]):
 Iterator[(DenseVector[Double], DenseVector[Double])] = {
 // standard FTRL code for examples
 // hack: actually a single result, but Spark expects an iterator!
 Iterator((deltaWeights, deltaCounts))
 }
  • 17. Summary: LR with Spark 17 • Efficient: Can learn on all the data - before: somewhat aggressive subsampling of the skips • Possible to do feature pre-processing - in Hadoop MapReduce much harder: only one pass over data - drop infrequent features, TF-IDF, … • Spark-shell as a life-saver - helps to debug problems as one can inspect intermediate results at scale - have yet to try Zeppelin notebooks • Easy to unit test complex workflows
  • 19. Upgrade! 19 • Spark has a fairly regular 3-month release schedule • Always run with the latest version - Lots of bugs get fixed - Difficult to keep up with new functionality (see DataFrame vs. RDD) • Speed improvements over the past year
  • 20. Configurations 20 • Our solution - config directory containing ‣ Logging: log4j.properties ‣ Spark itself: spark-defaults.conf ‣ our code: application.conf - two versions of configs: local & cluster - in YARN: specify them using --files argument & SPARK_CONF_DIR variable • Use Typesafe’s config library for all application related configs - provide sensible defaults for everything - overwrite using application.conf • Do not hard-code any configurations in code
  • 21. Accumulators 21 • Use accumulators for ensuring correctness! • Example: - parse data, ignore event if there is a problem with the data - use accumulator to count these failed lines
 class Parser(failedLinesAccumulator: Accumulator[Int]) extends Serializable {
 def parse(s: String): Option[Event] = {
 try {
 // parsing logic goes here
 Some(...)
 } catch {
 case e: Exception => {
 failedLinesAccumulator += 1
 None
 }
 }
 }
 }
 val accumulator = sc.accumulator(0, "failed lines")
 val parser = new Parser(accumulator)
 val events = sc.textFile("hdfs:///myfile")
 .flatMap(s => parser.parse(s))
  • 22. RDD vs. DataFrame in Spark 22 • Initially Spark advocated Resilient Distributed Datasets (RDDs) as the data set abstraction - type-safe - usually stores some Scala case class - code relatively easy to understand • Recently Spark is pushing towards using DataFrame - similar to R and Python’s Pandas data frames - some advantages ‣ less rigid types: can append columns ‣ speed - disadvantage: code readability suffers for non-basic types ‣ user defined types ‣ user defined functions • Have not fully migrated to it yet
  • 23. Every Day I’m Shuffling… 23 • Careful with operations which send a lot of data over the network - reduceByKey - repartition / shuffle • Careful with sending too much data to the driver - collect - reduce • Found mapPartitions & treeReduce useful in some cases (see FTRL example) • Play with Spark configurations: frameSize, maxResultSize, timeouts… [Diagram: textFile → flatMap → map → reduceByKey]
  • 24. Machine Learning in Spark 24 • Relatively basic - some algorithms don’t scale so well - not customizable enough for experts: ‣ optimizers that assume a regularizer ‣ built our own DSL for feature extraction & combination ‣ a lot of the APIs are not exposed, i.e. private to Spark - will hopefully get there eventually • Nice: new Transformer / Estimator / Pipeline approach - Inspired by scikit-learn, makes it easy to combine different algorithms - Requires DataFrame - Example (from Spark docs)
 val tokenizer = new Tokenizer()
 .setInputCol("text")
 .setOutputCol("words")
 val hashingTF = new HashingTF()
 .setNumFeatures(1000)
 .setInputCol(tokenizer.getOutputCol)
 .setOutputCol("features")
 val lr = new LogisticRegression()
 .setMaxIter(10)
 .setRegParam(0.01)
 val pipeline = new Pipeline()
 .setStages(Array(tokenizer, hashingTF, lr))
 val model = pipeline.fit(training)