Productionalizing Spark ML
https://github.com/shashankgowdal/productionalising_spark_ml
Stories from the Spark battlefield
● Shashank L
● Senior software engineer at Tellius
● Big data consultant and trainer at datamantra.io
● www.shashankgowda.com
Stages of ML
● Gathering Data
● Data preparation
● Choosing a Model
● Training
● Evaluation
● Operationalise
Motivation
● Spark ML is an end-to-end solution for distributed ML, but not everything is done by the framework
● Custom data preparation techniques may be needed depending on the quality of the data
● Efficient resource utilization when running at scale
● Operationalising the trained models for use
● Best practices
Introduction to Spark ML
● Provides a higher-level API for constructing and tuning ML workflows
● Built on top of Dataset
● Abstractions
○ Transformer
○ Estimator
○ Evaluator
○ Pipeline
Transformer
● A Transformer is an abstraction which transforms a dataframe into another
transform(dataset: DataFrame): DataFrame
● Prepares the dataframe for an ML algorithm to work with
● Typically contains logic which works with a single row of data
DF → Transformer → DF
Vector assembler
● A feature transformer that merges multiple columns into a vector as a new column
● Algorithm stages like LogisticRegression require a vector as input: a collection of feature values with which the algorithm is trained
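A minimal usage sketch (the column names "age" and "salary" are placeholders):

import org.apache.spark.ml.feature.VectorAssembler

// Merge two numeric columns into a single vector column named "features"
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "salary"))
  .setOutputCol("features")
val assembled = assembler.transform(df)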
Estimator
● An Estimator is an abstraction of a learning algorithm that fits a model on a dataset
fit(dataset: DataFrame): M
● An Estimator is run only in the training step
● The model returned is a Transformer
DF → Estimator → Model
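A minimal sketch of the fit/transform contract, assuming trainDf and testDf carry "features" and "label" columns:

import org.apache.spark.ml.classification.LogisticRegression

val lr = new LogisticRegression()
val lrModel = lr.fit(trainDf)                // Estimator produces a Model
val predictions = lrModel.transform(testDf)  // the Model is a Transformer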
String Indexer
● Encodes a set of String values into indices
● Label indices are stored in the StringIndexer model
● Transforming a dataset through this model adds an output column containing those indices
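A minimal sketch ("city" is a placeholder column name):

import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("city")
  .setOutputCol("cityIndex")
val indexerModel = indexer.fit(trainDf)        // learns the value -> index mapping
val indexed = indexerModel.transform(trainDf)  // adds the cityIndex column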
Pipeline
● A chain of Transformers and Estimators
● A Pipeline itself is an Estimator
● It is fitted on a DataFrame, turning it into a model called PipelineModel
● A PipelineModel can contain only Transformers
● The Pipeline is fitted on the train dataset; test datasets are transformed with the resulting PipelineModel
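A minimal sketch chaining the stages from the previous slides:

import org.apache.spark.ml.Pipeline

val pipeline = new Pipeline().setStages(Array(indexer, assembler, lr))
val pipelineModel = pipeline.fit(trainDf)      // fit on the train dataset
val scored = pipelineModel.transform(testDf)   // transform the test dataset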
What is missing?
Data Cleanup
Null values
● Data is rarely clean and can have missing values
● Important to identify and handle them
● Spark ML doesn’t handle nulls gracefully; it's mandatory to handle them before training or before using any Spark ML pipeline stage
● Domain expertise is necessary to decide on how to
handle missing values
Custom Spark ML stage
● Handling nulls should be part of the Spark ML pipeline
● Spark ML has APIs to create a custom Transformer
● Implementation
○ transform
○ transformSchema
com.shashank.sparkml.datapreparation.NullHandlerTransformer
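A minimal sketch of such a stage; the class name follows the talk's package, but the body (dropping rows with a null in one column) is an assumption, not the speaker's actual code:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

class NullHandlerTransformer(override val uid: String, inputCol: String)
    extends Transformer {

  def this(inputCol: String) =
    this(Identifiable.randomUID("nullHandlerTransformer"), inputCol)

  // Stateless, per-row logic: drop rows with a null in the configured column
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.na.drop(Seq(inputCol))

  // This stage leaves the schema unchanged
  override def transformSchema(schema: StructType): StructType = schema

  override def copy(extra: ParamMap): NullHandlerTransformer = this
}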
Null Handler Transformer - Cons
● Null handling may involve aggregating over the train data and storing state
○ Calculating the mean
○ Smart handling based on the % of null values
● A Transformer would rerun those aggregations on the test set
● Prediction will be slower
● Prediction accuracy also depends on the data in the test set
Null Handler Estimator
● The Null Handler Estimator fits the train data to produce a Null Handler Model, which is a Transformer
● The same abstraction as any other algorithm's training
● Implementation
○ fit
○ transformSchema
● NullHandler Model
○ transform
○ transformSchema
com.shashank.sparkml.datapreparation.NullHandlerEstimator
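A minimal sketch of the Estimator/Model pair, assuming mean imputation as the null-handling strategy (the real NullHandlerEstimator in the repo may differ):

import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, mean}
import org.apache.spark.sql.types.StructType

class NullHandlerEstimator(override val uid: String, inputCol: String)
    extends Estimator[NullHandlerModel] {

  def this(inputCol: String) = this(Identifiable.randomUID("nullHandler"), inputCol)

  override def fit(dataset: Dataset[_]): NullHandlerModel = {
    // Aggregate once over the train data; the state travels in the model
    val meanValue = dataset.select(mean(col(inputCol))).head.getDouble(0)
    new NullHandlerModel(uid, inputCol, meanValue)
  }
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): NullHandlerEstimator = this  // sketch: ignores extra
}

class NullHandlerModel(override val uid: String, inputCol: String, meanValue: Double)
    extends Model[NullHandlerModel] {

  // No aggregation at prediction time: just fill with the learned mean
  override def transform(dataset: Dataset[_]): DataFrame =
    dataset.na.fill(meanValue, Seq(inputCol))
  override def transformSchema(schema: StructType): StructType = schema
  override def copy(extra: ParamMap): NullHandlerModel = this
}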
NA Values
● Not all missing values are nulls
● Missing values can also be encoded as
○ "null" as a String
○ NA
○ Empty String
○ A custom value
● Convert these values to null and use the NullHandler to handle them
● Can be implemented as a Transformer
com.shashank.sparkml.datapreparation.NaValuesHandler
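A sketch of the core replacement logic, assuming a marker list of "null", "NA" and the empty string; doing all columns in a single select also avoids the growing-plan issue discussed later:

import org.apache.spark.sql.functions.{col, lit, when}

val naValues = Seq("null", "NA", "")   // assumed markers
val cleaned = df.select(df.columns.map { c =>
  when(col(c).isin(naValues: _*), lit(null)).otherwise(col(c)).as(c)
}: _*)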
Cast Transformer
● ML is all about mathematics and numerics
● The Double data type is widely used for representing features and labels
● Spark ML expects DoubleType in a few APIs and NumericType in most APIs
● Casting columns as part of the Pipeline solves DataType mismatch problems
● Cast can be a Transformer
com.shashank.sparkml.datapreparation.CastTransformer
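A sketch of the transform body such a stage might have (columnsToCast is an assumed parameter of the stage):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

val casted = df.select(df.columns.map { c =>
  if (columnsToCast.contains(c)) col(c).cast(DoubleType).as(c) else col(c)
}: _*)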
Building Pipeline
● Combine custom stages with built-in stages to build a Pipeline
● Categorical Columns
○ NaValuesHandler
○ NullHandler
○ StringIndexer
○ OneHotEncoder
● Continuous Columns
○ NullHandler
● VectorAssembler
● AlgorithmStage
com.shashank.sparkml.datapreparation.BuildingPipeline
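One possible assembly, sketched with built-in stages only; the custom NaValuesHandler/NullHandler stages from earlier would slot in before the indexers, and all column lists are assumptions:

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}

val categoricalCols = Seq("city", "gender")
val continuousCols = Seq("age", "salary")

val indexers = categoricalCols.map(c =>
  new StringIndexer().setInputCol(c).setOutputCol(s"${c}_idx"))
val encoders = categoricalCols.map(c =>
  new OneHotEncoder().setInputCol(s"${c}_idx").setOutputCol(s"${c}_vec"))
val assembler = new VectorAssembler()
  .setInputCols((categoricalCols.map(c => s"${c}_vec") ++ continuousCols).toArray)
  .setOutputCol("features")

val stages: Array[PipelineStage] =
  (indexers ++ encoders ++ Seq(assembler, new DecisionTreeClassifier())).toArray
val pipeline = new Pipeline().setStages(stages)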
Efficient?
Iterative programming in Spark
● Spark is one of the first big data frameworks to support iterative programming natively
● Iterative programs go over the data again and again to compute some result
● Spark ML is one of the iterative frameworks in Spark
Growing Logical plan
● Every iteration creates a new dataset, which keeps the logical plan growing
● An ML Transformer can have one or more iterations in it
● As more stages are added, the logical plan grows, adding overhead to analysing the plan
● This overhead is compute bound and is incurred on the master
com.shashank.sparkml.datapreparation.GrowingLineageIssue
Multi Column handling
● Reducing the number of stages in a Pipeline reduces iterations over the dataset
● Pipeline stages should be able to handle multiple columns instead of one stage per column
○ Handle nulls in all columns in a single stage
○ Replace NA values in all columns in a single stage
● Improves plan processing performance drastically, even for datasets with many columns
com.shashank.sparkml.datapreparation.MultiColumnNullHandler
com.shashank.sparkml.datapreparation.GrowingLineageIssueFixed
Training
Data sampling
● ML makes data-driven predictions by building a mathematical model from input data
● To avoid overfitting the model to the input data, the data is normally sampled into train and test sets
● Train data is used for learning, test data to verify model accuracy
● Normally the data is divided into 2 samples using random sampling, without overlapping rows
data.randomSplit(Array(0.6, 0.4))
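In practice the split is usually seeded for reproducibility and destructured:

val Array(train, test) = data.randomSplit(Array(0.6, 0.4), seed = 42L)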
Caching source data
● ML modelling is an iterative process
● ML training and preprocessing go over the data multiple times
● Spark transformations are lazily evaluated, so every pass over the data re-reads it from the source
● Caching the source dataset speeds up the ML modelling process
Caching source data
● Sampling and caching the data are necessary for both accuracy and performance
● Normally the data is cached and then sampled, which hurts performance
● randomSplit requires sorting the complete data to avoid overlapping rows
● The cached data is therefore re-sorted on every pass over it
com.shashank.sparkml.caching.PipelineWithSampling
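A sketch of caching the samples instead of the source, so the sort inside randomSplit happens only once:

val Array(train, test) = data.randomSplit(Array(0.6, 0.4), seed = 42L)
train.cache()   // the train data is read many times during fitting
test.cache()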
Caching source vs sample data
Caching only required columns
● Caching the source data speeds up processing
● Normally a model is not trained on all the columns in the dataset
● Consider a scenario where 10 of the 100 columns in the data are used for training
● Being smart about what is cached makes memory utilization efficient
● Cache only the columns which are used for training
com.shashank.sparkml.caching.CachingRequiredColumns
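A sketch, assuming requiredCols lists the feature and label columns actually used:

import org.apache.spark.sql.functions.col

val requiredCols = Seq("age", "salary", "city", "label")
val slim = data.select(requiredCols.map(col): _*).cache()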
Spark caching behaviour
● Spark uses memory for 2 purposes: caching and processing
● Earlier versions had fixed limits for both
● Caching data close to the size of the available memory can slow down processing
● Processing may have to flush cached data to disk to free up space
● This can repeat in a loop when caching and processing are done by the same Spark job
Tree Based classifier memory issue
● Tree based classifiers cache intermediate tree data using storage level MEMORY_AND_DISK
● The cached data is normally 3 times the source data size (the source data being a CSV)
● Training a DecisionTree classifier on 20 GB of data requires 60 to 80 GB of RAM, which is impractical
● There is no config to disable the cache or control the storage level
Adding config to Tree based classifier
● We added a new configuration parameter for Tree
based classifiers to control the storage level
decisionTreeClassifier.setIntermediateStorageLevel("DISK_ONLY")
● https://github.com/apache/spark/pull/17972
● Changes may land in Spark 2.3.0
"org.apache.spark" %% "spark-mllib" % "2.2.0_mod" from "url/to/jar/spark-mllib_2.11-2.2.0.jar",
Operationalise
Model persistence
● Built-in stages of Spark ML support model persistence out of the box
● Every custom stage should extend DefaultParamsWritable
● It provides a general implementation for persisting the stage's Params
● Only params are persisted, so all inputs and state should be params
● Persisting a pipeline internally calls persist on all its stages
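Persisting a fitted pipeline is a one-liner (the path is a placeholder):

pipelineModel.write.overwrite().save("/models/churn-pipeline")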
Reading Persisted model
● A custom ML stage should have a companion object which extends DefaultParamsReadable
● It provides a general implementation for reading the saved parameters into the stage's params
● PipelineModel.load internally calls the read method on all its stages to recreate the PipelineModel
com.shashank.sparkml.operationalize.stages.CastTransformer
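A minimal sketch; the one-line companion object is all Spark needs to rebuild a custom stage on load:

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.util.DefaultParamsReadable

object CastTransformer extends DefaultParamsReadable[CastTransformer]

// load() reads every stage back, including custom ones
val restored = PipelineModel.load("/models/churn-pipeline")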
Persistent Params
● Params of type Double, Float, Long, Int, Boolean, Array or Vector are persistent params
● Spark internally has logic to persist them
● Custom types like Map[K,V] or Option[Double], which we have used, cannot be persisted by Spark
● A param implementation has to be provided by the user, which requires the methods below to be implemented
def jsonEncode(value: Option[T]): String
def jsonDecode(json: String): Option[T]
com.shashank.sparkml.operationalize.stages.PersistentParams
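A hypothetical sketch of such a param for Option[Double], using the json4s types Spark's params already depend on (the repo's PersistentParams likely differs in detail):

import org.apache.spark.ml.param.Param
import org.json4s._
import org.json4s.jackson.JsonMethods._

class OptionDoubleParam(parent: String, name: String, doc: String)
    extends Param[Option[Double]](parent, name, doc) {

  // None is encoded as JSON null, Some(d) as a JSON number
  override def jsonEncode(value: Option[Double]): String =
    compact(render(value.map(JDouble(_)).getOrElse(JNull)))

  override def jsonDecode(json: String): Option[Double] =
    parse(json) match {
      case JNull      => None
      case JDouble(d) => Some(d)
      case other      => throw new IllegalArgumentException(s"Cannot decode $other")
    }
}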
Predict Schema check
● Stages in a trained model are simple transformations which transform the dataset from one form to another
● These transformations expect the feature columns to be present in the prediction dataset
● Spark ML has no built-in ability to validate whether a dataset is suitable for the model
● Information about the schema should be stored while training, so it can be verified and meaningful errors thrown
com.shashank.sparkml.operationalize.PredictSchemaIssue
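A sketch of the check, assuming the training-time field names were persisted with the model as trainingFieldNames:

val missing = trainingFieldNames.filterNot(predictDf.columns.contains)
require(missing.isEmpty,
  s"Prediction dataset is missing feature columns: ${missing.mkString(", ")}")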
FeatureNames extraction
● A pipeline model doesn't have an API to get the list of feature names which were used to train the model
● The feature Vector is just a collection of double values
● There is no information about what each of these values represents
● We can use the metadata of multiple stages to derive the feature names associated with each feature value
● These features would also contain OneHotEncoded values
com.shashank.sparkml.operationalize.FeatureExtraction
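A sketch using the ML attribute metadata that VectorAssembler and the encoders attach to the vector column (the column name "features" is an assumption):

import org.apache.spark.ml.attribute.AttributeGroup

val attrGroup = AttributeGroup.fromStructField(predictions.schema("features"))
val featureNames: Array[String] = attrGroup.attributes
  .map(_.map(a => a.name.getOrElse(s"f${a.index.getOrElse(-1)}")))
  .getOrElse(Array.empty)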
References
● https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-mllib/spark-mllib-pipelines.html
● https://spark.apache.org/docs/latest/ml-guide.html
● https://issues.apache.org/jira/browse/SPARK-20723: intermediateRDDStorageLevel for tree-based classifiers
● https://issues.apache.org/jira/browse/SPARK-8418: single- and multi-value support to ML Transformers
● https://issues.apache.org/jira/browse/SPARK-13434: Reduce Spark RandomForest memory footprint
Thank you