A full Machine learning pipeline
in Scikit-learn vs Scala-Spark:
pros and cons
Jose Quesada and David Anderson
@quesada, @alpinegizmo, @datascienceret
Why this talk?
• How do you get from a single-machine workload to a fully distributed
one?
• Answer: Spark machine learning
• Am I missing out on something by staying with Python?
• Mentors are world-class. CTOs, library authors, inventors, founders of
fast-growing companies, etc
• DSR accepts fewer than 5% of the applications
• Strong focus on commercial awareness
• 5 years of working experience on average
• 30+ partner companies in Europe
DSR participants do a portfolio project
Why is DSR talking about Scala/Spark?
They are behind Scala
IBM is behind this
They hired us to make training materials
Source: Spark 2015 infographic
[Figure: mindshare among 'data science badasses' (subjective), plotted over time]
Scala
“Scala offers the easiest refactoring experience that I've ever had due
to the type system.”
Jacob, Coursera engineer
Spark
• Basically distributed Scala
• API
• Scala, Java, Python, and R bindings
• Libraries
• SQL, streams, graph processing, machine learning
• One of the most active open source projects
“Spark will inevitably become the de-facto Big Data framework
for Machine Learning and Data Science.”
Dean Wampler, Lightbend
All under one roof (big Win)
Source: Spark 2015 infographic
• Spark Core
• Spark SQL
• Spark Streaming
• Spark.ml (machine learning)
• GraphX (graphs)
Spark Programming Model
[Diagram: Input, a Driver / SparkContext, and two Workers]
Data is partitioned; code is sent to the data
[Diagram: the same layout, with Data attached to each Worker]
Example: word count
hello world
foo bar
foo foo bar
bye world
Data is immutable, and is partitioned across the cluster
Example: word count
We get things done by creating new, transformed copies of the data. In parallel.
Lines | Words | Pairs
hello world | hello, world | (hello, 1), (world, 1)
foo bar | foo, bar | (foo, 1), (bar, 1)
foo foo bar | foo, foo, bar | (foo, 1), (foo, 1), (bar, 1)
bye world | bye, world | (bye, 1), (world, 1)
Example: word count
Some operations require a shuffle to group data together
Pairs after the shuffle:
(hello, 1)
(foo, 3)
(bar, 2)
(bye, 1)
(world, 2)
Example: word count
# sc is an existing SparkContext; input and output are paths
lines = sc.textFile(input)
# flatMap and map are pipelined into the same Python executor
words = lines.flatMap(lambda x: x.split(" "))
word_count = (words.map(lambda x: (x, 1))
                   .reduceByKey(lambda x, y: x + y))
# Nothing happens until the next line, when this "action" forces evaluation of the RDD
word_count.saveAsTextFile(output)
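The word count steps can also be traced on a single machine without Spark. What follows is a minimal pure-Python sketch, not PySpark: the helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins for the RDD methods of similar name, and the in-memory list stands in for a partitioned text file.

```python
from itertools import chain

# The four input lines from the slides; a stand-in for sc.textFile(input).
lines = ["hello world", "foo bar", "foo foo bar", "bye world"]


def flat_map(func, items):
    """Apply func to each item and flatten the results lazily."""
    return chain.from_iterable(func(item) for item in items)


def reduce_by_key(func, pairs):
    """Bring pairs with the same key together and fold their values with func."""
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


words = flat_map(lambda line: line.split(" "), lines)  # lazy: no line is split yet
pairs = ((word, 1) for word in words)                  # still lazy
word_count = reduce_by_key(lambda x, y: x + y, pairs)  # consumes the generators

print(word_count)
```

Only the last step does any work, which mirrors how the transformations stay pending until an action runs. The grouping inside `reduce_by_key` is the part that needs a shuffle on a real cluster.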
RDD – Resilient Distributed Dataset
• An immutable, partitioned collection of elements that can be
operated on in parallel
• Lazy
• Fault-tolerant
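To make "immutable" and "lazy" concrete, here is a toy sketch in plain Python. `LazyDataset` is a hypothetical class written for this illustration and is not part of Spark. It only records transformations and replays them when an action (`collect`) is called. Keeping the recorded steps is also the idea behind fault tolerance: a lost result can be recomputed from the source.

```python
class LazyDataset:
    """Toy immutable, lazy collection: transformations are recorded, not run."""

    def __init__(self, source, steps=()):
        self._source = tuple(source)  # never modified after construction
        self._steps = tuple(steps)    # recorded transformations (the lineage)

    def map(self, func):
        # Return a NEW dataset; the current one is left untouched.
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    def collect(self):
        """Action: replay the recorded steps over the source and return a list."""
        items = list(self._source)
        for kind, func in self._steps:
            if kind == "map":
                items = [func(x) for x in items]
            else:
                items = [x for x in items if func(x)]
        return items


calls = []


def noisy_double(x):
    calls.append(x)  # side effect lets us observe when work really happens
    return 2 * x


base = LazyDataset([1, 2, 3, 4])
doubled = base.map(noisy_double).filter(lambda x: x > 4)

calls_before_action = len(calls)  # nothing has been evaluated yet
result = doubled.collect()        # the action triggers evaluation
```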
PySpark RDD Execution Model
Whenever you provide a lambda to operate on an RDD:
• Each Spark worker forks a Python worker
• Data is serialized and piped to those Python workers
Impact of this execution model
• Worker overhead (forking, serialization)
• The cluster manager isn't aware of Python's memory needs
• Very confusing error messages
Spark DataFrames (and Datasets)
• Based on RDDs, but tabular; something like SQL tables
• Not Pandas
• Rescues Python from serialization overhead
• df.filter(df["color"] == "red") vs. rdd.filter(lambda x: x.color == "red")
• processed entirely in the JVM
• Python UDFs and maps still require serialization and piping to Python
• can write (and register) Scala code, and then call it from Python
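One way to picture the difference between the two filters: a column expression is data that describes the predicate, while a lambda is opaque Python code. The sketch below is a toy model in plain Python. The classes `Col` and `Eq` are hypothetical and are not Spark's implementation. They only show that an expression object can be inspected and handed to another engine, whereas a function can only be called from Python.

```python
class Eq:
    """Declarative 'column == value' predicate that an engine could inspect."""

    def __init__(self, column, value):
        self.column = column
        self.value = value

    def describe(self):
        return f"{self.column} = {self.value!r}"

    def evaluate(self, row):
        return row[self.column] == self.value


class Col:
    """Hypothetical column reference: comparing it builds an expression."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, value):
        # Return a description of the comparison instead of a bool.
        return Eq(self.name, value)


def opaque(row):
    # Plain Python code: nothing outside Python can look inside it.
    return row["color"] == "red"


rows = [
    {"color": "red", "n": 1},
    {"color": "blue", "n": 2},
    {"color": "red", "n": 3},
]

expr = Col("color") == "red"  # an Eq object, not True/False
plan = expr.describe()        # the predicate is visible as data
by_expr = [r for r in rows if expr.evaluate(r)]
by_lambda = [r for r in rows if opaque(r)]
```

Both filters select the same rows. The point of the sketch is only that `expr` carries enough structure to be run outside Python.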
DataFrame execution: unified across languages
[Diagram: Python DF, Java/Scala DF, R DF → Logical Plan → Execution]
• API wrappers create a logical plan (a DAG)
• Catalyst optimizes the plan; Tungsten compiles the plan into executable code
DataFrame performance
ML Workflow
Data Ingestion → Data Cleaning / Feature Engineering → Model Training → Testing and Validation → Deployment
Machine learning with scikit-learn
• Easy to use
• Rich ecosystem
• Limited to one machine (but see sparkit-learn package)
Machine learning with Hadoop (in short: NO)
• Each iteration is a new M/R job
• Each job must store data in HDFS – lots of overhead
How Spark killed Hadoop map/reduce
• Far easier to program
• More cost-effective since less hardware can perform the same tasks
much faster
• Can do real-time processing as well as batch processing
• Can do ML, graphs
Machine learning with Spark
• Spark was designed for ML workloads
• Caching (reuse data)
• Accumulators (keep state across iterations)
• Functional, lazy, fault-tolerant
• Many popular algorithms are supported out of the box
• Simple to productionalize models
• MLlib is RDD-based (the past); spark.ml is DataFrame-based (the future)
Spark is an Ecosystem of ML frameworks
• Spark was designed by people who understood the needs of ML practitioners (unlike Hadoop)
• MLlib
• Spark.ml
• SystemML (IBM)
• KeystoneML
Spark.ml – the basics
• DataFrame: ML requires DFs holding vectors
• Transformer: transforms one DF into another
• Estimator: fit on a DF; produces a transformer
• Pipeline: chain of transformers and estimators
• Parameter: there is a unified API for specifying parameters
• Evaluator: computes a metric that scores a fitted model's predictions
• CrossValidator: model selection via grid search
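The relationship between these pieces can be sketched in plain Python. This is an illustrative toy, not the spark.ml API: rows are dicts instead of DataFrames, and the classes `MeanCenter`, `MeanCenterModel` and `Doubler` are hypothetical stages invented for the example. It shows the contract: an Estimator's `fit` returns a Transformer, and fitting a Pipeline returns a fitted pipeline that is itself a Transformer.

```python
class Transformer:
    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    def fit(self, rows):
        raise NotImplementedError


class Doubler(Transformer):
    """Stateless transformer: needs no fitting."""

    def __init__(self, column):
        self.column = column

    def transform(self, rows):
        return [{**r, self.column: 2 * r[self.column]} for r in rows]


class MeanCenterModel(Transformer):
    """Transformer produced by fitting: subtracts a learned mean."""

    def __init__(self, column, mean):
        self.column, self.mean = column, mean

    def transform(self, rows):
        out = self.column + "_centered"
        return [{**r, out: r[self.column] - self.mean} for r in rows]


class MeanCenter(Estimator):
    """Estimator: learns the column mean from the data."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        mean = sum(r[self.column] for r in rows) / len(rows)
        return MeanCenterModel(self.column, mean)


class PipelineModel(Transformer):
    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if isinstance(stage, Estimator) else stage
            rows = model.transform(rows)  # later stages see transformed data
            fitted.append(model)
        return PipelineModel(fitted)


train = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
model = Pipeline([Doubler("x"), MeanCenter("x")]).fit(train)
scored = model.transform([{"x": 5.0}])
```

The mean is learned once from the training rows and then reused on new rows, which is why fitting and transforming are separate steps.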
Machine Learning scaling challenges that Spark solves
• Hyper-parameter tuning
• ETL/feature engineering
• Model
Q: Hardest scaling problem in data science?
A: Adding people
• Spark.ml has a clean architecture and APIs that should encourage
code sharing and reuse
• Good first step: can you refactor some ETL code as a Transformer?
• We don't see much sharing of components happening yet
• Entire libraries, yes; components, not so much
• Perhaps because Spark has been evolving so quickly
• E.g., a pull request implementing non-linear SVMs has been stuck for a year
Structured types in Spark
Error type | SQL | DataFrames | Datasets (Java/Scala only)
Syntax errors | Runtime | Compile time | Compile time
Analysis errors | Runtime | Runtime | Compile time
User experience Spark.ml – Scikit-learn
Indexing categorical features
• You are responsible for identifying and indexing categorical features
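import org.apache.spark.ml.feature.StringIndexer

// dataset is an existing DataFrame; each fit returns a StringIndexerModel
val rfcd_indexer = new StringIndexer()
  .setInputCol("color")
  .setOutputCol("color_index")
  .fit(dataset)
val seo_indexer = new StringIndexer()
  .setInputCol("status")
  .setOutputCol("status_index")
  .fit(dataset)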
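What an indexer learns during `fit` can be sketched in a few lines of plain Python. This is an illustration, not Spark code: `fit_string_indexer` and `apply_string_indexer` are hypothetical helpers. The sketch follows StringIndexer's default ordering, where the most frequent label gets index 0. The alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter


def fit_string_indexer(values):
    """Learn a label -> index mapping, most frequent label first.

    Ties are broken alphabetically; that tie-break is an assumption here.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(index) for index, label in enumerate(ordered)}


def apply_string_indexer(mapping, values):
    """Replace each label with its learned index (unseen labels raise KeyError)."""
    return [mapping[v] for v in values]


colors = ["red", "blue", "red", "green", "red", "blue"]
color_index = fit_string_indexer(colors)
indexed = apply_string_indexer(color_index, colors)

print(color_index)
print(indexed)
```

The mapping is learned from the data, which is why the indexer has to be fitted before it can transform anything.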
Assembling features
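• You must gather all of your features into one Vector, using a VectorAssembler
import org.apache.spark.ml.feature.VectorAssembler

// "..." stands for the remaining feature columns
val assembler = new VectorAssembler()
  .setInputCols(Array("color_index", "status_index", ...))
  .setOutputCol("features")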
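Conceptually, assembling means concatenating several columns of a row into one flat vector. The following plain-Python sketch shows that idea only. It is not the Spark API: `assemble` is a hypothetical helper, rows are dicts, and the `embedding` column is invented to show that a column which already holds a vector is flattened into the result.

```python
def assemble(rows, input_cols, output_col="features"):
    """Concatenate the named columns of each row into one flat list of floats."""
    assembled = []
    for row in rows:
        vector = []
        for col in input_cols:
            value = row[col]
            # A column may already hold a vector (list or tuple); flatten it.
            if isinstance(value, (list, tuple)):
                vector.extend(float(v) for v in value)
            else:
                vector.append(float(value))
        assembled.append({**row, output_col: vector})
    return assembled


rows = [
    {"color_index": 0.0, "status_index": 1.0, "embedding": [0.5, 0.25]},
    {"color_index": 2.0, "status_index": 0.0, "embedding": [0.1, 0.9]},
]
out = assemble(rows, ["color_index", "status_index", "embedding"])

print(out[0]["features"])
```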
Spark.ml – Scikit-learn: Pipelines (good news!)
• Spark ML and scikit-learn: same approach
• Chain together Estimators and Transformers
• Support non-linear pipelines (must be a DAG)
• Unify parameter passing
• Support for cross-validation and grid search
• Can write your own custom pipeline stages
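Grid search with cross-validation is the same idea in both libraries: expand a parameter grid, score each candidate on held-out folds, keep the best. The sketch below is a stdlib-only illustration and uses neither scikit-learn nor Spark. The "model" is a deliberately trivial shrunken mean, and `param_grid`, `fit`, `mse` and `cross_validate` are hypothetical helpers written for this example.

```python
from itertools import product


def param_grid(grid):
    """Expand {'a': [1, 2], 'b': [3]} into [{'a': 1, 'b': 3}, {'a': 2, 'b': 3}]."""
    names = sorted(grid)
    return [dict(zip(names, combo)) for combo in product(*(grid[n] for n in names))]


def fit(targets, shrink):
    """Toy estimator: the model is the training mean shrunk towards zero."""
    return (1.0 - shrink) * sum(targets) / len(targets)


def mse(prediction, targets):
    """Mean squared error of a constant prediction."""
    return sum((t - prediction) ** 2 for t in targets) / len(targets)


def cross_validate(targets, params, k=3):
    """Average validation error over k interleaved folds."""
    scores = []
    for fold in range(k):
        train = [t for i, t in enumerate(targets) if i % k != fold]
        valid = [t for i, t in enumerate(targets) if i % k == fold]
        scores.append(mse(fit(train, **params), valid))
    return sum(scores) / k


targets = [10, 12, 11, 9, 10, 8]
candidates = param_grid({"shrink": [0.0, 0.1, 0.5]})
results = {p["shrink"]: cross_validate(targets, p) for p in candidates}
best_shrink = min(results, key=results.get)

print(best_shrink)
```

Every (candidate, fold) pair is independent of the others, so the work can be spread across machines without coordination.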
Spark.ml just like scikit-learn
Transformer | Description | scikit-learn
Binarizer | Threshold numerical feature to binary | Binarizer
Bucketizer | Bucket numerical features into ranges |
ElementwiseProduct | Scale each feature/column separately |
HashingTF | Hash text/data to vector. Scale by term frequency | FeatureHasher
IDF | Scale features by inverse document frequency | TfidfTransformer
Normalizer | Scale each row to unit norm | Normalizer
OneHotEncoder | Encode k-category feature as binary features | OneHotEncoder
PolynomialExpansion | Create higher-order features | PolynomialFeatures
RegexTokenizer | Tokenize text using regular expressions | (part of text methods)
StandardScaler | Scale features to 0 mean and/or unit variance | StandardScaler
StringIndexer | Convert String feature to 0-based indices | LabelEncoder
Tokenizer | Tokenize text on whitespace | (part of text methods)
VectorAssembler | Concatenate feature vectors | FeatureUnion
VectorIndexer | Identify categorical features, and index |
Word2Vec | Learn vector representation of words |
Spark.ml – Scikit-learn: NLP tasks (thumbs up)
Graph stuff (GraphX, GraphFrames: not great)
• Extremely easy to run monster algorithms in a cluster
• GraphX has no Python API
• GraphFrames are cool, and should provide access to the graph tools in Spark from Python
• In practice, it didn't work too well
Things we liked in Spark ML
• Architecture encourages building reusable pieces
• Type safety, plus types are driving optimizations
• Model fitting returns an object that transforms the data
• Uniform way of passing parameters
• It's interesting to use the same platform for ETL and model fitting
• Very easy to parallelize ETL and grid search, or work with huge models
Disappointments using Spark ML
• Feature indexing and assembly can become tedious
• Surprised by the maximum depth limit for trees: 30
• Data exploration and visualization aren't easy in Scala
• Wish list: non-linear SVMs, deep learning (but see Deeplearning4j)
What is new for machine learning in Spark 2.0
• DataFrame-based Machine Learning API emerges as the primary ML
API: With Spark 2.0, the spark.ml package, with its “pipeline” APIs,
will emerge as the primary machine learning API. While the original
spark.mllib package is preserved, future development will focus on
the DataFrame-based API.
• Machine learning pipeline persistence: Users can now save and
load machine learning pipelines and models across all programming
languages supported by Spark.
What is new for data structures in Spark 2.0
Unifying the API for Streams and static data: Infinite datasets (same interface as dataframes)
What have Spark and Scala ever given us?
… Other than distributed dataframes,
distributed machine learning,
easy distributed grid search,
distributed SQL,
distributed stream analysis,
more performance than MapReduce,
an easier programming model,
and easier deployment …
Reminder: 25 videos explaining ML on Spark
• For people who already know ML
• http://datascienceretreat.com/videos/data-science-with-scala-and-spark
Thank you for your attention!
@quesada, @datascienceret

More Related Content

What's hot

Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in DeltaDatabricks
 
Pythonsevilla2019 - Introduction to MLFlow
Pythonsevilla2019 - Introduction to MLFlowPythonsevilla2019 - Introduction to MLFlow
Pythonsevilla2019 - Introduction to MLFlowFernando Ortega Gallego
 
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningTroubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningSergey Karayev
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelinesLars Albertsson
 
Simplifying Model Management with MLflow
Simplifying Model Management with MLflowSimplifying Model Management with MLflow
Simplifying Model Management with MLflowDatabricks
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduCloudera, Inc.
 
Grokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking VN
 
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...Databricks
 
Moving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduMoving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduCloudera, Inc.
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Guido Schmutz
 
Batch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkBatch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkVasia Kalavri
 
From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOpsCarl W. Handlin
 
MLOps by Sasha Rosenbaum
MLOps by Sasha RosenbaumMLOps by Sasha Rosenbaum
MLOps by Sasha RosenbaumSasha Rosenbaum
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...Databricks
 
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Ed Fernandez
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsDatabricks
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksDatabricks
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 

What's hot (20)

Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Pythonsevilla2019 - Introduction to MLFlow
Pythonsevilla2019 - Introduction to MLFlowPythonsevilla2019 - Introduction to MLFlow
Pythonsevilla2019 - Introduction to MLFlow
 
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep LearningTroubleshooting Deep Neural Networks - Full Stack Deep Learning
Troubleshooting Deep Neural Networks - Full Stack Deep Learning
 
Test strategies for data processing pipelines
Test strategies for data processing pipelinesTest strategies for data processing pipelines
Test strategies for data processing pipelines
 
Simplifying Model Management with MLflow
Simplifying Model Management with MLflowSimplifying Model Management with MLflow
Simplifying Model Management with MLflow
 
Part 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache KuduPart 1: Lambda Architectures: Simplified by Apache Kudu
Part 1: Lambda Architectures: Simplified by Apache Kudu
 
Grokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKIGrokking TechTalk #33: High Concurrency Architecture at TIKI
Grokking TechTalk #33: High Concurrency Architecture at TIKI
 
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
Building a MLOps Platform Around MLflow to Enable Model Productionalization i...
 
Moving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache KuduMoving Beyond Lambda Architectures with Apache Kudu
Moving Beyond Lambda Architectures with Apache Kudu
 
Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?Kafka as your Data Lake - is it Feasible?
Kafka as your Data Lake - is it Feasible?
 
Batch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache FlinkBatch and Stream Graph Processing with Apache Flink
Batch and Stream Graph Processing with Apache Flink
 
From Data Science to MLOps
From Data Science to MLOpsFrom Data Science to MLOps
From Data Science to MLOps
 
MLOps by Sasha Rosenbaum
MLOps by Sasha RosenbaumMLOps by Sasha Rosenbaum
MLOps by Sasha Rosenbaum
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...
MATS stack (MLFlow, Airflow, Tensorflow, Spark) for Cross-system Orchestratio...
 
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
Machine Learning Platformization & AutoML: Adopting ML at Scale in the Enterp...
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Scaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on DatabricksScaling Data Analytics Workloads on Databricks
Scaling Data Analytics Workloads on Databricks
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
 

Viewers also liked

Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelinesjeykottalam
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operationsStepan Pushkarev
 
Python as part of a production machine learning stack by Michael Manapat PyDa...
Python as part of a production machine learning stack by Michael Manapat PyDa...Python as part of a production machine learning stack by Michael Manapat PyDa...
Python as part of a production machine learning stack by Michael Manapat PyDa...PyData
 
PostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data CapturePostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data CaptureJeff Klukas
 
Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineRobert Dempsey
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsTuri, Inc.
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonSimon Frid
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in productionTuri, Inc.
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learnJeff Klukas
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanHakka Labs
 
Multi runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learningMulti runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learningStepan Pushkarev
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructurejoshwills
 
Using PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataUsing PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataRobert Dempsey
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In ProductionSamir Bessalah
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architectureStepan Pushkarev
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017Carol Smith
 

Viewers also liked (17)

Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Serverless machine learning operations
Serverless machine learning operationsServerless machine learning operations
Serverless machine learning operations
 
Python as part of a production machine learning stack by Michael Manapat PyDa...
Python as part of a production machine learning stack by Michael Manapat PyDa...Python as part of a production machine learning stack by Michael Manapat PyDa...
Python as part of a production machine learning stack by Michael Manapat PyDa...
 
PostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data CapturePostgreSQL + Kafka: The Delight of Change Data Capture
PostgreSQL + Kafka: The Delight of Change Data Capture
 
Building A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning PipelineBuilding A Production-Level Machine Learning Pipeline
Building A Production-Level Machine Learning Pipeline
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning Models
 
Managing and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in PythonManaging and Versioning Machine Learning Models in Python
Managing and Versioning Machine Learning Models in Python
 
Machine learning in production
Machine learning in productionMachine learning in production
Machine learning in production
 
Machine learning in production with scikit-learn
Machine learning in production with scikit-learnMachine learning in production with scikit-learn
Machine learning in production with scikit-learn
 
Square's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong YanSquare's Machine Learning Infrastructure and Applications - Rong Yan
Square's Machine Learning Infrastructure and Applications - Rong Yan
 
Multi runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learningMulti runtime serving pipelines for machine learning
Multi runtime serving pipelines for machine learning
 
Production machine learning_infrastructure
Production machine learning_infrastructureProduction machine learning_infrastructure
Production machine learning_infrastructure
 
Using PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of DataUsing PySpark to Process Boat Loads of Data
Using PySpark to Process Boat Loads of Data
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
Spark and machine learning in microservices architecture
Spark and machine learning in microservices architectureSpark and machine learning in microservices architecture
Spark and machine learning in microservices architecture
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
 

Similar to A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons

Spark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark - The Ultimate Scala Collections by Martin Odersky
Spark - The Ultimate Scala Collections by Martin OderskySpark Summit
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Apache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupApache spark-melbourne-april-2015-meetup
Apache spark-melbourne-april-2015-meetupNed Shawa
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataDatabricks
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Libraryjeykottalam
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache SparkMiklos Christine
 
Apache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopApache Spark for Everyone - Women Who Code Workshop
Apache Spark for Everyone - Women Who Code WorkshopAmanda Casari
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldDatabricks
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondDataWorks Summit
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
From Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsFrom Pipelines to Refineries: Scaling Big Data Applications
From Pipelines to Refineries: Scaling Big Data ApplicationsDatabricks
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark TutorialAhmet Bulut
 
Recent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondRecent Developments in Spark MLlib and Beyond
Recent Developments in Spark MLlib and BeyondXiangrui Meng
 
Integrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkIntegrating Deep Learning Libraries with Apache Spark
Integrating Deep Learning Libraries with Apache SparkDatabricks
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit
 

Similar to A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons (20)

Spark meetup TCHUG
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and cons

  • 1. A full Machine learning pipeline in Scikit-learn vs Scala-Spark: pros and cons Jose Quesada and David Anderson @quesada, @alpinegizmo, @datascienceret
  • 2.
  • 4. • How do you get from a single-machine workload to a fully distributed one? • Answer: Spark machine learning • Is there something I'm missing out on by staying with Python?
  • 5.
  • 6.
  • 7. • Mentors are world-class. CTOs, library authors, inventors, founders of fast-growing companies, etc • DSR accepts fewer than 5% of the applications • Strong focus on commercial awareness • 5 years of working experience on average • 30+ partner companies in Europe
  • 8.
  • 9. DSR participants do a portfolio project
  • 10.
  • 11. Why is DSR talking about Scala/Spark? They are behind Scala IBM is behind this They hired us to make training materials
  • 12. Source: Spark 2015 infographic
  • 14.
  • 15. Scala “Scala offers the easiest refactoring experience that I've ever had due to the type system.” Jacob, Coursera engineer
  • 16. Spark • Basically distributed Scala • API • Scala, Java, Python, and R bindings • Libraries • SQL, streams, graph processing, machine learning • One of the most active open source projects
  • 17. “Spark will inevitably become the de-facto Big Data framework for Machine Learning and Data Science.” Dean Wampler, Lightbend
  • 18. All under one roof (big Win) • Spark Core; Spark SQL; Spark Streaming; Spark.ml (machine learning); GraphX (graphs) • Source: Spark 2015 infographic
  • 19. Spark Programming Model Input Driver / SparkContext Worker Worker
  • 20. Data is partitioned; code is sent to the data Input Driver / SparkContext Worker Worker Data Data
  • 21. Example: word count • Lines: "hello world" | "foo bar" | "foo foo bar" | "bye world" • Data is immutable, and is partitioned across the cluster
  • 22. Example: word count • We get things done by creating new, transformed copies of the data. In parallel. • Lines: "hello world" | "foo bar" | "foo foo bar" | "bye world" → Words: hello, world | foo, bar | foo, foo, bar | bye, world → Pairs: (hello, 1), (world, 1) | (foo, 1), (bar, 1) | (foo, 1), (foo, 1), (bar, 1) | (bye, 1), (world, 1)
  • 23. Example: word count • Some operations require a shuffle to group data together • Lines: "hello world" | "foo bar" | "foo foo bar" | "bye world" → Words: hello, world | foo, bar | foo, foo, bar | bye, world → Pairs: (hello, 1), (world, 1) | (foo, 1), (bar, 1) | (foo, 1), (foo, 1), (bar, 1) | (bye, 1), (world, 1) → Counts after the shuffle: (hello, 1), (foo, 3), (bar, 2), (bye, 1), (world, 2)
  • 24. Example: word count
        lines = sc.textFile(input)
        words = lines.flatMap(lambda x: x.split(" "))
        word_count = (words.map(lambda x: (x, 1))
                           .reduceByKey(lambda x, y: x + y))
        # -------------------------------------------------
        word_count.saveAsTextFile(output)
      Annotations: "Pipelined into the same python executor". "Nothing happens until after this line, when this "action" forces evaluation of the RDD" (the dashed line above; saveAsTextFile is the action).
  • 25. RDD – Resilient Distributed Dataset • An immutable, partitioned collection of elements that can be operated on in parallel • Lazy • Fault-tolerant
  • 26. PySpark RDD Execution Model Whenever you provide a lambda to operate on an RDD: • Each Spark worker forks a Python worker • data is serialized and piped to those Python workers
  • 27.
  • 28. Impact of this execution model • Worker overhead (forking, serialization) • The cluster manager isn't aware of Python's memory needs • Very confusing error messages
  • 29. Spark DataFrames (and Datasets) • Based on RDDs, but tabular; something like SQL tables • Not Pandas • Rescues Python from serialization overhead • df.filter(df["color"] == "red") vs. rdd.filter(lambda x: x.color == "red") • processed entirely in the JVM • Python UDFs and maps still require serialization and piping to Python • can write (and register) Scala code, and then call it from Python
  • 30.
  • 31. DataFrame execution: unified across languages • Python DF, Java/Scala DF, R DF → Logical Plan → Execution • API wrappers create a logical plan (a DAG) • Catalyst optimizes the plan; Tungsten compiles the plan into executable code
  • 33. ML Workflow: Data Ingestion → Data Cleaning / Feature Engineering → Model Training → Testing and Validation → Deployment
  • 34. Machine learning with scikit-learn • Easy to use • Rich ecosystem • Limited to one machine (but see sparkit-learn package)
  • 35. Machine learning with Hadoop (in short: NO) • Each iteration is a new M/R job • Each job must store data in HDFS – lots of overhead
  • 36. How Spark killed Hadoop map/reduce • Far easier to program • More cost-effective since less hardware can perform the same tasks much faster • Can do real-time processing as well as batch processing • Can do ML, graphs
  • 37. Machine learning with Spark • Spark was designed for ML workloads • Caching (reuse data) • Accumulators (keep state across iterations) • Functional, lazy, fault-tolerant • Many popular algorithms are supported out of the box • Simple to productionalize models • MLlib is RDD-based (the past); spark.ml is DataFrame-based (the future)
  • 38. Spark is an Ecosystem of ML frameworks • Spark was designed by people who understood the need of ML practitioners (unlike Hadoop) • MLlib • Spark.ml • System.ml (IBM) • Keystone.ml
  • 39. Spark.ML– the basics • DataFrame: ML requires DFs holding vectors • Transformer: transforms one DF into another • Estimator: fit on a DF; produces a transformer • Pipeline: chain of transformers and estimators • Parameter: there is a unified API for specifying parameters • Evaluator: • CrossValidator: model selection via grid search
  • 40. Machine Learning scaling challenges that Spark solves • Hyper-parameter tuning
  • 41.
  • 42. Machine Learning scaling challenges that Spark solves • Hyper-parameter tuning • ETL/feature engineering
  • 43. Machine Learning scaling challenges that Spark solves • Hyper-parameter tuning • ETL/feature engineering • Model
  • 44. Q: Hardest scaling problem in data science? A: Adding people • Spark.ml has a clean architecture and APIs that should encourage code sharing and reuse • Good first step: can you refactor some ETL code as a Transformer? • Don't see much sharing of components happening yet • Entire libraries, yes; components, not so much • Perhaps because Spark has been evolving so quickly • E.g., pull request implementing non-linear SVMs that has been stuck for a year
  • 45. Structured types in Spark
                          SQL       DataFrames     DataSets (Java/Scala only)
        Syntax Errors     Runtime   Compile time   Compile time
        Analysis Errors   Runtime   Runtime        Compile time
  • 46. User experience Spark.ml – Scikit-learn
  • 47. Indexing categorical features • You are responsible for identifying and indexing categorical features
        import org.apache.spark.ml.feature.StringIndexer
        // dataset is assumed to be a DataFrame with "color" and "status" columns
        val rfcd_indexer = new StringIndexer()
          .setInputCol("color")
          .setOutputCol("color_index")
          .fit(dataset)
        val seo_indexer = new StringIndexer()
          .setInputCol("status")
          .setOutputCol("status_index")
          .fit(dataset)
  • 48. Assembling features • You must gather all of your features into one Vector, using a VectorAssembler
        import org.apache.spark.ml.feature.VectorAssembler
        val assembler = new VectorAssembler()
          .setInputCols(Array("color_index", "status_index", ...))
          .setOutputCol("features")
  • 49. Spark.ml – Scikit-learn: Pipelines (good news!) • Spark ML and scikit-learn: same approach • Chain together Estimators and Transformers • Support non-linear pipelines (must be a DAG) • Unify parameter passing • Support for cross-validation and grid search • Can write your own custom pipeline stages Spark.ml just like scikit-learn
  • 50. Spark.ml – Scikit-learn: NLP tasks (thumbs up)
        Transformer | Description | scikit-learn
        Binarizer | Threshold numerical feature to binary | Binarizer
        Bucketizer | Bucket numerical features into ranges | -
        ElementwiseProduct | Scale each feature/column separately | -
        HashingTF | Hash text/data to vector. Scale by term frequency | FeatureHasher
        IDF | Scale features by inverse document frequency | TfidfTransformer
        Normalizer | Scale each row to unit norm | Normalizer
        OneHotEncoder | Encode k-category feature as binary features | OneHotEncoder
        PolynomialExpansion | Create higher-order features | PolynomialFeatures
        RegexTokenizer | Tokenize text using regular expressions | (part of text methods)
        StandardScaler | Scale features to 0 mean and/or unit variance | StandardScaler
        StringIndexer | Convert String feature to 0-based indices | LabelEncoder
        Tokenizer | Tokenize text on whitespace | (part of text methods)
        VectorAssembler | Concatenate feature vectors | FeatureUnion
        VectorIndexer | Identify categorical features, and index | -
        Word2Vec | Learn vector representation of words | -
  • 51. Graph stuff (GraphX, GraphFrames, not great) • Extremely easy to run monster algorithms in a cluster • GraphX has no Python API • GraphFrames are cool, and should provide access to the graph tools in Spark from Python • In practice, it didn’t work too well
  • 52. Things we liked in Spark ML • Architecture encourages building reusable pieces • Type safety, plus types are driving optimizations • Model fitting returns an object that transforms the data • Uniform way of passing parameters • It's interesting to use the same platform for ETL and model fitting • Very easy to parallelize ETL and grid search, or work with huge models
  • 53. Disappointments using Spark ML • Feature indexing and assembly can become tedious • Surprised by the maximum depth limit for trees: 30 • Data exploration and visualization aren't easy in Scala • Wish list: non-linear SVMs, deep learning (but see Deeplearning4j)
  • 54. What is new for machine learning in Spark 2.0 • DataFrame-based Machine Learning API emerges as the primary ML API: With Spark 2.0, the spark.ml package, with its “pipeline” APIs, will emerge as the primary machine learning API. While the original spark.mllib package is preserved, future development will focus on the DataFrame-based API. • Machine learning pipeline persistence: Users can now save and load machine learning pipelines and models across all programming languages supported by Spark.
  • 55. What is new for data structures in Spark 2.0 Unifying the API for Streams and static data: Infinite datasets (same interface as dataframes)
  • 56. What have Spark and Scala ever given us?
  • 57. … Other than distributed dataframes, distributed machine learning, easy distributed grid search, distributed SQL, distributed stream analysis, better performance than MapReduce, an easier programming model, and easier deployment … What have Spark and Scala ever given us?
  • 58. Reminder: 25 videos explaining ML on Spark • For people who already know ML • http://datascienceretreat.com/videos/data-science-with-scala-and-spark
  • 59. Thank you for your attention! @quesada, @datascienceret
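
The Estimator / Transformer / Pipeline vocabulary from slides 39 and 47 to 49 can be illustrated without a cluster. The following is a minimal plain-Python sketch, not the Spark or scikit-learn API: every class here is a hypothetical stand-in, rows are plain dicts instead of a DataFrame, and the "most frequent label gets index 0, ties broken alphabetically" rule is an assumption of this sketch. It only shows the shape of the idea: fitting an Estimator returns a Transformer, and a Pipeline chains both kinds of stage.

```python
# Plain-Python sketch of the Estimator / Transformer / Pipeline idea.
# All classes are hypothetical stand-ins, not the Spark or scikit-learn API.
from collections import Counter


class StringIndexerModel:
    """Transformer: adds an index column using a learned label -> index mapping."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        # Build new rows; the input rows are left untouched.
        return [{**row, self.output_col: self.mapping[row[self.input_col]]}
                for row in rows]


class StringIndexer:
    """Estimator: fit() learns the mapping and returns a Transformer."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # Assumption of this sketch: most frequent label gets index 0,
        # ties are broken alphabetically.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        mapping = {label: float(i) for i, label in enumerate(ordered)}
        return StringIndexerModel(self.input_col, self.output_col, mapping)


class VectorAssembler:
    """Transformer: gathers several numeric columns into one feature list."""

    def __init__(self, input_cols, output_col):
        self.input_cols = input_cols
        self.output_col = output_col

    def transform(self, rows):
        return [{**row, self.output_col: [row[c] for c in self.input_cols]}
                for row in rows]


class PipelineModel:
    """A fitted pipeline: a chain of Transformers, itself a Transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline:
    """Estimator: fits each Estimator stage in order, keeps Transformers as-is."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)  # later stages see earlier outputs
            fitted.append(model)
        return PipelineModel(fitted)


dataset = [
    {"color": "red", "status": "open"},
    {"color": "blue", "status": "closed"},
    {"color": "red", "status": "open"},
]

pipeline = Pipeline([
    StringIndexer("color", "color_index"),
    StringIndexer("status", "status_index"),
    VectorAssembler(["color_index", "status_index"], "features"),
])
model = pipeline.fit(dataset)
features = [row["features"] for row in model.transform(dataset)]
print(features)
```

Because the fitted pipeline is a single object with a transform method, the same indexing and assembly steps can be applied unchanged to new data, which is the property slide 52 lists under "Model fitting returns an object that transforms the data".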

Editor's Notes

  1. Scala and Spark are very close: if you learn one you learn the other. Spark is distributed Scala.
  2. Scala and Spark are very close: if you learn one you learn the other. Spark is distributed Scala. This has been possible for years, but nowadays it’s not only possible but pleasant.
  3. You attend a Retreat, not a training
  4. A talk should give you a superpower. - Am I missing out?
  5. redo the diagram
  6. Fault-tolerant: missing partitions can be recomputed by using the lineage graph to rerun operations.
  7. When using Python, the SparkContext in Python is basically a proxy. py4j is used to launch a JVM and create a native Spark context. py4j manages communication between the Python and Java Spark context objects. In the workers, some operations can be executed directly in the JVM. But, for example, if you've implemented a map function in Python, a Python process is forked to execute this user-supplied mapping. Each thread in the Spark worker will have its own Python sub-process. When a Python wrapper calls the underlying Spark code, written in Scala and running on a JVM, translation between the two environments and languages can be a source of additional bugs and issues.
  8. Scala and Spark are very close: if you learn one you learn the other. Spark is distributed Scala. This has been possible for years, but nowadays it’s not only possible but pleasant.
  9. Just one Map/Reduce step, but many algorithms are iterative. Disk-based → long startup times. ------- Spark is a wholesale replacement for MapReduce that leverages lessons learned from MapReduce. The Hadoop community realized that a replacement for MR was needed. While MR has served the community well, it’s a decade old and shows clear limitations and problems, as we’ve seen. In late 2013, Cloudera, the largest Hadoop vendor, officially embraced Spark as the replacement. Most of the other Hadoop vendors have followed suit. For one-pass ETL-like jobs, such as data transformation or data integration, MapReduce is still a good fit: this is what it was designed for. Advantages for Hadoop: security, staffing.
  10. sample use case for accumulators: gradient descent
  11. Spark.ml departs from scikit-learn quite a bit
  12. Good
  13. from https://databricks.com/blog/2015/07/29/new-features-in-machine-learning-pipelines-in-apache-spark-1-4.html
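
Note 10 names gradient descent as a sample use case for accumulators. A rough plain-Python sketch of that pattern follows; it does not use Spark. The Accumulator class, the two-partition toy data for y = 3x, and the learning rate are all assumptions made for illustration. Each "partition" adds its share of the gradient to the accumulator, and the "driver" reads the total once per iteration to update the weight.

```python
# Plain-Python sketch: gradient descent where each partition adds its
# partial gradient to a shared accumulator. Accumulator is a hypothetical
# stand-in for a Spark accumulator, not the real API.


class Accumulator:
    def __init__(self, value=0.0):
        self.value = value

    def add(self, amount):
        self.value += amount


# Toy data for y = 3 * x, split into two partitions as a cluster would hold it.
partitions = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0), (4.0, 12.0)],
]
n = sum(len(part) for part in partitions)

w = 0.0               # single weight of the model y = w * x
learning_rate = 0.05  # assumed value, small enough to converge on this data

for _ in range(100):
    grad = Accumulator()
    for part in partitions:
        # In Spark this body would run on the workers, one task per partition.
        grad.add(sum(2 * (w * x - y) * x for x, y in part))
    # The driver reads the accumulated gradient and updates the weight.
    w -= learning_rate * grad.value / n

print(round(w, 4))
```

The data never moves in this loop: only the small partial sums travel to the accumulator, which is the reason the pattern suits iterative algorithms on partitioned data.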