This talk highlights major improvements in Machine Learning (ML) targeted for Apache Spark 2.0. The MLlib 2.0 release focuses on ease of use for data science—both for casual and power users. We will discuss 3 key improvements: persisting models for production, customizing Pipelines, and improvements to models and APIs critical to data science.
(1) MLlib simplifies moving ML models to production by adding full support for model and Pipeline persistence. Individual models—and entire Pipelines including feature transformations—can be built on one Spark deployment, saved, and loaded onto other Spark deployments for production and serving.
(2) Users will find it much easier to implement custom feature transformers and models. Abstractions automatically handle input schema validation, as well as persistence for saving and loading models.
(3) For statisticians and data scientists, MLlib has doubled down on Generalized Linear Models (GLMs), which are key algorithms for many use cases. MLlib now supports more GLM families and link functions, handles corner cases more gracefully, and provides more model statistics. Also, expanded language APIs allow data scientists using Python and R to call many more algorithms.
Finally, we will demonstrate these improvements live and show how they facilitate getting started with ML on Spark, customizing implementations, and moving to production.
1. Apache Spark MLlib 2.0 Preview:
Data Science and Production
Joseph K. Bradley
June 8, 2016
2. Who am I?
• Apache Spark committer & PMC member
• Software Engineer @ Databricks
• Ph.D. in Machine Learning from Carnegie Mellon U.
3. MLlib trajectory
[Chart: commits per release (roughly 0–1000) across v0.8 through v2.0. Annotations: the Scala/Java API has been the primary API for MLlib; Python and R APIs were added along the way; the DataFrame-based API for MLlib arrives in 2.0.]
4. DataFrame-based API for MLlib
a.k.a. the “Pipelines” API, with utilities for constructing ML Pipelines
In 2.0, the DataFrame-based API will become the primary API for MLlib.
• Voted by community
• org.apache.spark.ml, pyspark.ml
The RDD-based API will enter maintenance mode.
• Still maintained with bug fixes, but no new features
• org.apache.spark.mllib, pyspark.mllib
5. Goals for MLlib in 2.0
Major initiatives
• Generalized Linear Models
• Python & R API expansion
• ML persistence: saving & loading models & Pipelines
Also in 2.0:
• Sketching algorithms: http://databricks.com/blog/2016/05/19
• For details, see SPARK-12626 roadmap JIRA + mailing list discussions.
6. [Outline: Exploratory analysis → Production]
7. Generalized Linear Models (GLMs)
Arguably the most important class of models for ML
• Logistic regression
• Linear regression
In Spark 1.6 & earlier
• Many other types of models
• Model summary statistics
[Figure: spam (+1) vs. ham (-1) labeled points illustrating binary classification]
8. GLMs in 2.0
GeneralizedLinearRegression
• Max 4096 features
• Solved using Iteratively Reweighted Least Squares (IRLS)
LinearRegression & LogisticRegression
• Millions of features
• Solved using L-BFGS / OWL-QN
Fixes for corner cases
• E.g., handle invalid labels gracefully

Model family | Supported link functions
Gaussian | Identity, Log, Inverse
Binomial | Logit, Probit, CLogLog
Poisson | Log, Identity, Sqrt
Gamma | Inverse, Identity, Log
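To make the table concrete, here is a minimal pure-Python sketch (not Spark code) of the three Binomial link functions listed above. A GLM link function g maps the mean μ to the linear predictor η = g(μ); its inverse maps η back to μ. The probit implementation via bisection is an illustrative shortcut, not how any library computes it.

```python
import math

# GLM link functions for the Binomial family: each maps a mean mu in (0, 1)
# to an unbounded linear predictor eta; each inverse maps eta back to mu.

def logit(mu):      return math.log(mu / (1.0 - mu))
def logit_inv(eta): return 1.0 / (1.0 + math.exp(-eta))

def probit(mu):
    # Inverse of the standard normal CDF, found by bisection (illustrative).
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < mu:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
def probit_inv(eta): return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))

def cloglog(mu):      return math.log(-math.log(1.0 - mu))
def cloglog_inv(eta): return 1.0 - math.exp(-math.exp(eta))

# Each inverse undoes its link: mu -> eta -> mu
for link, inv in [(logit, logit_inv), (probit, probit_inv), (cloglog, cloglog_inv)]:
    for mu in (0.1, 0.5, 0.9):
        assert abs(inv(link(mu)) - mu) < 1e-6
```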
9. Python & R APIs for MLlib
Goal: Expand ML APIs for critical languages for data science
Python
• Clustering algorithms: Bisecting K-Means, Gaussian Mixtures, LDA
• Meta-algorithms: OneVsRest, TrainValidationSplit
• GeneralizedLinearRegression
• Feature transformers: ChiSqSelector, MaxAbsScaler, QuantileDiscretizer
• Model inspection: summaries for Logistic Regression, Linear Regression, GLMs
R
• Regression & classification: Generalized Linear Regression, AFT survival regression
• Clustering: K-Means
11. Why ML persistence?
Data Science: Prototype (Python/R) → Create model
Software Engineering: Re-implement model for production (Java) → Deploy model
12. Why ML persistence?
Data Science: Prototype (Python/R) → Create Pipeline
• Extract raw features
• Transform features
• Select key features
• Fit multiple models
• Combine results to make prediction
Software Engineering: Re-implement Pipeline for production (Java) → Deploy Pipeline
Costs of re-implementation:
• Extra implementation work
• Different code paths
• Synchronization overhead
13. With ML persistence...
Data Science: Prototype (Python/R) → Create Pipeline
Persist model or Pipeline:
model.save("s3n://...")
Software Engineering: Load Pipeline (Scala/Java)
Model.load("s3n://...")
Deploy in production
14. ML persistence status
[Diagram: a Pipeline of stages — text preprocessing, feature generation, Generalized Linear Regression, model tuning — shown both unfitted (a “recipe”) and fitted (a “result”), at the single-Model and whole-Pipeline levels; MLlib's RDD-based API supported persistence of single models only.]
15. ML persistence status
Near-complete coverage in all Spark language APIs
• Scala & Java: complete
• Python: complete except for 2 algorithms
• R: complete for existing APIs
Single underlying implementation of models
Exchangeable data format
• JSON for metadata
• Parquet for model data (coefficients, etc.)
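To illustrate the idea of an exchangeable format, here is a minimal pure-Python sketch of a Pipeline persisted as top-level JSON metadata naming the stages, plus one subdirectory per stage. Spark stores model data as Parquet; this sketch uses JSON to stay dependency-free, and the directory layout and field names are illustrative, not Spark's actual on-disk schema.

```python
import json, os, tempfile

# Sketch of pipeline persistence: top-level metadata lists the stage uids,
# and each stage gets its own subdirectory with its own metadata.
# Illustrative layout only -- not Spark's exact schema.

def save_pipeline(path, stages):
    os.makedirs(os.path.join(path, "stages"), exist_ok=True)
    with open(os.path.join(path, "metadata.json"), "w") as f:
        json.dump({"class": "PipelineModel",
                   "stageUids": [s["uid"] for s in stages]}, f)
    for stage in stages:
        stage_dir = os.path.join(path, "stages", stage["uid"])
        os.makedirs(stage_dir, exist_ok=True)
        with open(os.path.join(stage_dir, "metadata.json"), "w") as f:
            json.dump(stage, f)

def load_pipeline(path):
    with open(os.path.join(path, "metadata.json")) as f:
        uids = json.load(f)["stageUids"]
    stages = []
    for uid in uids:
        with open(os.path.join(path, "stages", uid, "metadata.json")) as f:
            stages.append(json.load(f))
    return stages

path = os.path.join(tempfile.mkdtemp(), "pipeline_model")
save_pipeline(path, [
    {"uid": "tok_1", "class": "Tokenizer"},
    {"uid": "glm_2", "class": "GeneralizedLinearRegressionModel",
     "coefficients": [0.5, -1.2]},
])
stages = load_pipeline(path)
assert [s["class"] for s in stages] == ["Tokenizer",
                                        "GeneralizedLinearRegressionModel"]
```

Because the saved artifact is plain data rather than serialized objects, any language binding that understands the layout can reload it — the mechanism behind saving in Python and loading in Scala/Java.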
16. Customizing ML Pipelines
MLlib 2.0 will include:
• 29 feature transformers (Tokenizer, Word2Vec, ...)
• 21 models (for classification, regression, clustering, ...)
• Model tuning & evaluation
But some applications require customized Transformers & Models.
17. Options for customization
Extend abstractions
• Transformer
• Estimator & Model
• Evaluator
UnaryTransformer: input → output
E.g.:
• Tokenizer: text → sequence of words
• Normalizer: vector → normalized vector
Simple API which provides:
• DataFrame-based API
• Built-in parameter getters, setters
• Distributed row-wise transformation
Existing use cases:
• Natural Language Processing (Snowball, Stanford NLP)
• Featurization libraries
• Many others!
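The shape of the UnaryTransformer pattern can be sketched in plain Python. Spark's real class operates on DataFrames in Scala with schema validation and distributed execution; everything below (class name, methods, list-of-dicts "rows") is a toy stand-in to show the pattern of one input column, one output column, and a row-wise function in between.

```python
# Toy sketch of the UnaryTransformer pattern: a subclass supplies the
# row-wise function; the base handles wiring input column to output column.
# Not Spark's API -- an illustration of its shape.

class MyUnaryTransformer:
    def __init__(self, input_col, output_col):
        self.input_col = input_col    # in Spark these are Params with
        self.output_col = output_col  # built-in getters and setters

    def create_transform_func(self):
        # Subclasses override this; here, a Tokenizer-like transform.
        return lambda text: text.lower().split()

    def transform(self, rows):
        # Apply the function row-wise, adding the output column.
        f = self.create_transform_func()
        return [dict(row, **{self.output_col: f(row[self.input_col])})
                for row in rows]

tok = MyUnaryTransformer(input_col="text", output_col="words")
out = tok.transform([{"text": "Hello Spark MLlib"}])
assert out[0]["words"] == ["hello", "spark", "mllib"]
```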
18. Persistence for customized algorithms
2 traits provide persistence for simple Transformers:
• DefaultParamsWritable
• DefaultParamsReadable
Simply mix in these traits to provide persistence in Scala/Java.
→ Saves & loads algorithm parameters
val myParam: Param[Double]
→ Does not save member data
val myCoefficients: Vector
For an example, see UnaryTransformerExample.scala in spark/examples/
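The key behavior of these traits — persist declared Params, but not member data — can be sketched in plain Python. This is a conceptual illustration, not Spark's Scala traits; the class, field names, and file layout are invented for the example.

```python
import json, os, tempfile

# Sketch of params-only persistence: save() writes declared Params as
# JSON metadata; member data (e.g., cached statistics) is NOT written,
# so load() reconstructs the object from Params alone.

class MyTransformer:
    def __init__(self, shift=0.0):
        self.shift = shift              # a declared Param -> persisted
        self.cached_stats = [1, 2, 3]   # member data -> NOT persisted

    def save(self, path):
        os.makedirs(path, exist_ok=True)
        with open(os.path.join(path, "metadata.json"), "w") as f:
            json.dump({"paramMap": {"shift": self.shift}}, f)

    @classmethod
    def load(cls, path):
        with open(os.path.join(path, "metadata.json")) as f:
            params = json.load(f)["paramMap"]
        return cls(**params)  # member data is rebuilt, not read from disk

path = os.path.join(tempfile.mkdtemp(), "my_transformer")
MyTransformer(shift=2.5).save(path)
loaded = MyTransformer.load(path)
assert loaded.shift == 2.5
```

This is why the traits suffice for simple Transformers but algorithms with learned member data (like a coefficients Vector) need their own save/load logic.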
19. Recap
Exploratory analysis
• Generalized Linear Regression
• Python & R APIs
Production
• ML persistence
• Customizing Pipelines
Many thanks to the community for contributions & support!
20. What’s next?
Prioritized items on the 2.1 roadmap JIRA (SPARK-15581):
• Critical feature completeness for the DataFrame-based API
– Multiclass logistic regression
– Statistics
• Python API parity & R API expansion
• Scaling & speed tuning for key algorithms: trees & ensembles
GraphFrames
• Release for Spark 2.0
• Speed improvements (join elimination, connected components)
21. Get started
• Get involved via roadmap JIRA (SPARK-15581) + mailing lists
• Download notebook for this talk: http://dbricks.co/1UfvAH9
• ML persistence blog post: http://databricks.com/blog/2016/05/31
Try out the Apache Spark 2.0 preview release: http://databricks.com/try