SlideShare a Scribd company logo
1 of 37
Download to read offline
On the representation and reuse of
machine learning models
Villu Ruusmann
Openscoring OÜ
https://github.com/jpmml
2
Def: "Model"
Output = func(Input)
3
Def: "Representation"
Generic Specific
Data
structure
Application
code
4
The problem
"Train once, deploy anywhere"
5
A solution
Matching model representation (MR) with the task at hand:
1. Storing a generic and stable MR
2. Generating a wide variety of more specific and volatile
MRs upon request
6
The Predictive Model Markup Language (PMML)
● XML dialect for marking up models and associated data
transformations
● Version 1.0 in 1999, version 4.3 in 2016
● "Conventions over configuration"
● 17 top-level model types + ensembling
http://dmg.org/
http://dmg.org/pmml/pmml-v4-3.html
http://dmg.org/pmml/products.html
7
A continuum from black to white boxes
Introducing transparency in the form of rich, easy to use,
well-documented APIs:
1. Unmarshalling and marshalling
2. Static analyses. Ex: schema querying
3. Dynamic analyses. Ex: scoring
4. Tracing and explaining individual predictions
8
The Zen of Machine Learning
"Making the model requires large data and many cpus.
Using it does not"
--darren
https://www.mail-archive.com/user@spark.apache.org/msg40636.html
9
Model training workflow
Real-world
feature space
ML-platform
feature space
ML-platform
model
10
Model deployment workflow
Real-world
feature space
ML-platform
feature space
ML-platform
model
Real-world
feature space
Real-world
model
vs.
11
Model resources
R code
Scikit-Learn
code
Apache Spark
ML code
Original
PMML markup
Java code
Python code
Optimized
PMML markup
Training DeploymentVersioned storage
12
Comparison of model persistence options
R Scikit-Learn Apache Spark ML
Model data structure
stability
Fair to excellent Fair Poor
Native serialization
data format
RDS (binary) Pickle (binary) SER (binary) and
JSON (text)
Export to PMML Few external N/A Built-in trait
PMMLWritable
Import from PMML Few external N/A
JPMML projects JPMML-R and
r2pmml
JPMML-SkLearn
and sklearn2pmml
JPMML-SparkML
(-Package)
13
PMML production: R
library("r2pmml")
auto <- read.csv("Auto.csv")
auto$origin <- as.factor(auto$origin)
auto.formula <- formula(mpg ~
(.) ^ 2 + # simple features and their two way interactions
I(displacement / cylinders) + I(log(weight))) # derived features
auto.lm <- lm(auto.formula, data = auto)
r2pmml(auto.lm, "auto_lm.pmml", dataset = auto)
auto.glm <- glm(auto.formula, data = auto, family = "gaussian")
r2pmml(auto.glm, "auto_glm.pmml", dataset = auto)
14
R quirks
● No pipeline concept. Some workflow standardization
efforts by third parties. Ex: caret package
● Many (equally right-) ways of doing the same thing.
Ex: "formula interface" vs. "matrix interface"
● High variance in the design and quality of packages.
Ex: academia vs. industry
● Model objects may enclose the training data set
15
PMML production: Scikit-Learn
from sklearn2pmml import sklearn2pmml
from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain
audit_df = pandas.read_csv("Audit.csv")
audit_mapper = DataFrameMapper([
(["Age", "Income", "Hours"], ContinuousDomain()),
(["Employment", "Education", "Marital", "Occupation"], [CategoricalDomain(), LabelBinarizer()]),
(["Gender", "Deductions"], [CategoricalDomain(), LabelEncoder()]),
("Adjusted", None)])
audit = audit_mapper.fit_transform(audit_df)
audit_classifier = DecisionTreeClassifier(min_samples_split = 10)
audit_classifier.fit(audit[:, 0:48], audit[:, 48].astype(int))
sklearn2pmml(audit_classifier, audit_mapper, "audit_tree.pmml")
16
Scikit-Learn quirks
● Completely schema-less at algorithm level. Ex: no
identification of columns, no tracking of column groups
● Very limited, simple data structures. Mix of Python and C
● No built-in persistence mechanism. Serialization in
generic pickle data format. Upon de-serialization, hope
that class definitions haven't changed in the meantime.
17
PMML production: Apache Spark ML
// $ spark-shell --packages org.jpmml:jpmml-sparkml-package:1.0-SNAPSHOT ..
import org.jpmml.sparkml.ConverterUtil
val df = spark.read.option("header", "true").option("inferSchema", "true").csv("Wine.csv")
val formula = new RFormula().setFormula("quality ~ .")
val regressor = new DecisionTreeRegressor()
val pipeline = new Pipeline().setStages(Array(formula, regressor))
val pipelineModel = pipeline.fit(df)
val pmmlBytes = ConverterUtil.toPMMLByteArray(df.schema, pipelineModel)
Files.write(Paths.get("wine_tree.pmml"), pmmlBytes)
18
Apache Spark ML quirks
● Split schema. Static def via Dataset#schema(),
dynamic def via Dataset column metadata
● Models make predictions in transformed output space
● High internal complexity, overhead. Ex: temporary
Dataset columns for feature transformation
● Built-in PMML export capabilities leak the JPMML-Model
library to application classpath
19
PMML consumption: Apache Spark ML
// $ spark-submit --packages org.jpmml:jpmml-spark:1.0-SNAPSHOT ..
import org.jpmml.spark.EvaluatorUtil;
import org.jpmml.spark.TransformerBuilder;
Evaluator evaluator = EvaluatorUtil.createEvaluator(new File("audit_tree.pmml"));
TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator)
.withLabelCol("Adjusted") // String column
.withProbabilityCol("Adjusted_probability", Arrays.asList("0", "1")) // Vector column
.exploded(true);
Transformer pmmlTransformer = pmmlTransformerBuilder.build();
Dataset<Row> input = ...;
Dataset<Row> output = pmmlTransformer.transform(input);
20
Comparison of feature spaces
R Scikit-Learn Apache Spark ML
Feature
identification
Named Positional Pseudo-named
Feature data type Any Float, Double Double
Feature operational
type
Continuous,
Categorical, Ordinal
Continuous Continuous,
pseudo-categorical
Dataset abstraction List<Map<String,?>> float[][] or double[][] List<double[]>
Effect of
transformations on
dataset size
Low Medium (sparse) to high (dense)
21
Feature declaration
<DataField name="Age" dataType="float" optype="continuous">
<Interval closure="closedClosed" leftMargin="17.0" rightMargin="83.0"/>
</DataField>
<DataField name="Gender" dataType="string" optype="categorical">
<Value value="Male"/>
<Value value="Female"/>
<Value value="N/A" property="missing"/>
</DataField>
http://dmg.org/pmml/v4-3/DataDictionary.html
<MiningField name="Age" outliers="asExtremeValues" lowValue="18.0" highValue="75.0"/>
<MiningField name="Gender"/>
http://dmg.org/pmml/v4-3/MiningSchema.html
22
Feature statistics
<UnivariateStats field="Age">
<NumericInfo mean="38.30279" standardDeviation="13.01375" median="3.70"/>
<ContStats>
<Interval closure="openClosed" leftMargin="17.0" rightMargin="23.6"/>
<!-- Intervals 2 through 9 omitted for clarity -->
<Interval closure="openClosed" leftMargin="76.4" rightMargin="83.0"/>
<Array type="int">261 360 297 340 280 156 135 51 13 6</Array>
</ContStats>
</UnivariateStats>
<UnivariateStats field="Gender">
<Counts totalFreq="1899" missingFreq="0" invalidFreq="0"/>
<DiscrStats>
<Array type="string">Male Female</Array>
<Array type="int">1307 592</Array>
</DiscrStats>
</UnivariateStats>
http://dmg.org/pmml/v4-3/Statistics.html23
Comparison of tree models
R Scikit-Learn Apache Spark ML
Algorithms No built-in, many
external
Few built-in Single built-in
Split type(s) Binary or multi-way;
simple and derived
features
Binary; simple features
Continuous features Rel. op. (<, <=) Rel. op. (<=)
Categorical features Set op. (%in%) Pseudo-rel. op. (==) Pseudo-set op.
Reuse Hard Easy to medium
24
Tree model declaration
<TreeModel functionName="classification" splitCharacteristic="binarySplit">
<Node id="1" recordCount="165">
<True/>
<Node id="2" score="1" recordCount="35">
<SimplePredicate field="Education" operator="equal" value="Master"/>
<ScoreDistribution value="1" recordCount="25"/>
<ScoreDistribution value="0" recordCount="10"/>
</Node>
<Node id="3" score="0" recordCount="130">
<SimplePredicate field="Education" operator="notEqual" value="Master"/>
<ScoreDistribution value="1" recordCount="20"/>
<ScoreDistribution value="0" recordCount="110"/>
</Node>
</Node>
</TreeModel>
http://dmg.org/pmml/v4-3/TreeModel.html
25
Optimization (1/3)
<Node>
SimplePredicate: Gender != "Male"
<Node>
SimplePredicate: Age <= 34.5
</Node>
<Node>
SimplePredicate: Age > 34.5
</Node>
</Node>
<Node>
SimplePredicate: Gender == "Male"
</Node>
<Node>
SimplePredicate: Gender == "Male"
</Node>
<Node>
SimplePredicate: Age <= 34.5
</Node>
<Node>
SimplePredicate: Age > 34.5
</Node>
Replacing "deep" binary splits with "shallow" multi-splits:
26
Optimization (2/3)
<TreeModel
noTrueChildStrategy="returnNullPrediction"
>
<Node>
<True/>
<Node score="4.333333333333333">
SimplePredicate: pH <= 2.93
</Node>
<Node score="5.483870967741935">
SimplePredicate: pH > 2.93
</Node>
</Node>
</TreeModel>
<TreeModel
noTrueChildStrategy="returnLastPrediction"
>
<Node score="5.483870967741935">
<True/>
<Node score="4.333333333333333">
SimplePredicate: pH <= 2.93
</Node>
</Node>
</TreeModel>
Cutting the number of terminal nodes in half:
27
Optimization (3/3)
<Node>
SimplePredicate: Age > 35.5
<Node score="0" recordCount="40">
SimplePredicate: Income <= 194386
ScoreDistribution:
"0" = 40, "1" = 0
</Node>
<Node score="0" recordCount="8">
SimplePredicate: Income > 194386
ScoreDistribution:
"0" = 7, "1" = 1
</Node>
</Node>
<Node score="0" recordCount="48">
SimplePredicate: Age > 35.5
ScoreDistribution:
"0" = 47, "1" = 1
</Node>
Removing split levels that don't affect the prediction:
28
Code generation
<TreeModel
missingValueStrategy="defaultChild"
>
<Node id="1" defaultChild="3">
<True/>
<Node id="2" score="0">
SimplePredicate: Age <= 52
ScoreDistribution:
"0" = 7, "1" = 0
</Node>
<Node id="3" score="1">
SimplePredicate: Age > 52
ScoreDistribution:
"0" = 1, "1" = 2
</Node>
</Node>
</TreeModel>
Object[] node_1(FieldValue age, ...){
if(age != null && age.asFloat() <= 52f){
return node_2(...);
} else {
return node_3(...);
}
}
Object[] node_2(...){
return new Object[]{"0", 7d, 0d};
}
Object[] node_3(...){
return new Object[]{"1", 1d, 2d};
}
Bad Idea!
29
Common pitfalls
Not treating splits on continuous features with the required
precision (tolerance < 0.5 ULP):
● Truncating values. Ex: "1.5000(..)1" → "1.50"
● Changing data type. Ex: float ↔ double
● Changing arithmetic expressions. Ex: (x1
/ x2
) ↔ (1 / x2
) * x1
30
Comparison of regression models
R Scikit-Learn Apache Spark ML
Algorithms Few built-in, many
external
Many built-in Few built-in
Term type(s) Simple and derived
features;
interactions
Simple features
Reuse Hard Easy
31
Regression model declaration
<RegressionModel functionName="regression" normalizationMethod="none">
<RegressionTable intercept="15.50143741004145">
<!-- Simple continuous feature -->
<NumericPredictor name="cylinders" coefficient="2.1609496766686194"/>
<!-- Simple categorical feature -->
<CategoricalPredictor name="origin" value="2" coefficient="-35.87525051244351"/>
<CategoricalPredictor name="origin" value="3" coefficient="-38.206750156693424"/>
<!-- Interaction -->
<PredictorTerm coefficient="-0.007734946028064237">
<FieldRef field="cylinders"/>
<FieldRef field="displacement"/>
</PredictorTerm>
<!-- Derived feature; I(log(weight)) is backed by a DerivedField element -->
<NumericPredictor name="I(log(weight))" coefficient="4.874500863508498"/>
</RegressionTable>
</RegressionModel>
http://dmg.org/pmml/v4-3/Regression.html32
Generalized regression model declaration
<GeneralRegressionModel functionName="regression" linkFunction="identity">
<ParameterList>
<Parameter name="p0" label="(intercept)"/>
<Parameter name="p11" label="cylinders:displacement"/>
</ParameterList>
<PPMatrix>
<PPCell value="1" predictorName="cylinders" parameterName="p11"/>
<PPCell value="1" predictorName="displacement" parameterName="p11"/>
</PPMatrix>
<ParamMatrix>
<PCell parameterName="p0" beta="15.50143741004145"/>
<PCell parameterName="p11" beta="-0.007734946028064237"/>
</ParamMatrix>
</GeneralRegressionModel>
http://dmg.org/pmml/v4-3/GeneralRegression.html
33
Code generation
double mpg(FieldValue cylinders, FieldValue displacement, FieldValue weight, FieldValue origin){
return 15.50143741004145d + // intercept
2.1609496766686194d * cylinders.asDouble() + // simple continuous feature
origin(origin.asString()) + // simple categorical feature
-0.007734946028064237d * (cylinders.asDouble() * displacement.asDouble()) + // interaction
4.874500863508498d * Math.ln(weight.asDouble()) // derived feature
}
double origin(String origin){
switch(origin){
case "1": return 0d; // baseline
case "2": return -35.87525051244351d;
case "3": return -38.206750156693424d;
}
throw new IllegalArgumentException(origin);
}
34
Grand summary
R Scikit-Learn Apache Spark ML
ML-platform model
information density
Medium to high Low Low to medium
ML-platform model
interpretability
Good Bad Bad
PMML markup
optimizability
Low High Medium
ML-platform to
PMML learning and
migration effort
Medium Low to medium Low
35
Q&A
villu@openscoring.io
https://github.com/jpmml
https://github.com/openscoring
https://groups.google.com/forum/#!forum/jpmml
PMML vs. PFA
https://xkcd.com/927/
37

More Related Content

What's hot

XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ FyberDaniel Hen
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorVivian S. Zhang
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboostmichiaki ito
 
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)Abhishek Thakur
 
wk5ppt1_Titanic
wk5ppt1_Titanicwk5ppt1_Titanic
wk5ppt1_TitanicAliciaWei1
 
Parallel and Iterative Processing for Machine Learning Recommendations with S...
Parallel and Iterative Processing for Machine Learning Recommendations with S...Parallel and Iterative Processing for Machine Learning Recommendations with S...
Parallel and Iterative Processing for Machine Learning Recommendations with S...MapR Technologies
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with sparkModern Data Stack France
 
Circles graphic
Circles graphicCircles graphic
Circles graphicalldesign
 
Optimization toolbox presentation
Optimization toolbox presentationOptimization toolbox presentation
Optimization toolbox presentationRavi Kannappan
 
Deep Learning Meetup 7 - Building a Deep Learning-powered Search Engine
Deep Learning Meetup 7 - Building a Deep Learning-powered Search EngineDeep Learning Meetup 7 - Building a Deep Learning-powered Search Engine
Deep Learning Meetup 7 - Building a Deep Learning-powered Search EngineKoby Karp
 
MATLAB for Technical Computing
MATLAB for Technical ComputingMATLAB for Technical Computing
MATLAB for Technical ComputingNaveed Rehman
 
Computer graphics practical(jainam)
Computer graphics practical(jainam)Computer graphics practical(jainam)
Computer graphics practical(jainam)JAINAM KAPADIYA
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoostJoonyoung Yi
 
Optimization
OptimizationOptimization
OptimizationManas Das
 
An Introduction to Functional Programming - DeveloperUG - 20140311
An Introduction to Functional Programming - DeveloperUG - 20140311An Introduction to Functional Programming - DeveloperUG - 20140311
An Introduction to Functional Programming - DeveloperUG - 20140311Andreas Pauley
 
Competition 1 (blog 1)
Competition 1 (blog 1)Competition 1 (blog 1)
Competition 1 (blog 1)TarunPaparaju
 
No internet? No Problem!
No internet? No Problem!No internet? No Problem!
No internet? No Problem!Annyce Davis
 

What's hot (20)

XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
 
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its authorKaggle Winning Solution Xgboost algorithm -- Let us learn from its author
Kaggle Winning Solution Xgboost algorithm -- Let us learn from its author
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboost
 
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
Approaching (almost) Any Machine Learning Problem (kaggledays dubai)
 
wk5ppt1_Titanic
wk5ppt1_Titanicwk5ppt1_Titanic
wk5ppt1_Titanic
 
Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
 
Parallel and Iterative Processing for Machine Learning Recommendations with S...
Parallel and Iterative Processing for Machine Learning Recommendations with S...Parallel and Iterative Processing for Machine Learning Recommendations with S...
Parallel and Iterative Processing for Machine Learning Recommendations with S...
 
Hadoop France meetup Feb2016 : recommendations with spark
Hadoop France meetup  Feb2016 : recommendations with sparkHadoop France meetup  Feb2016 : recommendations with spark
Hadoop France meetup Feb2016 : recommendations with spark
 
Circles graphic
Circles graphicCircles graphic
Circles graphic
 
Optimization toolbox presentation
Optimization toolbox presentationOptimization toolbox presentation
Optimization toolbox presentation
 
Xgboost
XgboostXgboost
Xgboost
 
Deep Learning Meetup 7 - Building a Deep Learning-powered Search Engine
Deep Learning Meetup 7 - Building a Deep Learning-powered Search EngineDeep Learning Meetup 7 - Building a Deep Learning-powered Search Engine
Deep Learning Meetup 7 - Building a Deep Learning-powered Search Engine
 
MATLAB for Technical Computing
MATLAB for Technical ComputingMATLAB for Technical Computing
MATLAB for Technical Computing
 
03. oop concepts
03. oop concepts03. oop concepts
03. oop concepts
 
Computer graphics practical(jainam)
Computer graphics practical(jainam)Computer graphics practical(jainam)
Computer graphics practical(jainam)
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoost
 
Optimization
OptimizationOptimization
Optimization
 
An Introduction to Functional Programming - DeveloperUG - 20140311
An Introduction to Functional Programming - DeveloperUG - 20140311An Introduction to Functional Programming - DeveloperUG - 20140311
An Introduction to Functional Programming - DeveloperUG - 20140311
 
Competition 1 (blog 1)
Competition 1 (blog 1)Competition 1 (blog 1)
Competition 1 (blog 1)
 
No internet? No Problem!
No internet? No Problem!No internet? No Problem!
No internet? No Problem!
 

Viewers also liked

Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLVillu Ruusmann
 
Operationalizing analytics to scale
Operationalizing analytics to scaleOperationalizing analytics to scale
Operationalizing analytics to scaleLooker
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In ProductionSamir Bessalah
 
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopPaco Nathan
 
From R&D to production-ready predictive apps - Christophe Bourguignat & Yann ...
From R&D to production-ready predictive apps - Christophe Bourguignat & Yann ...From R&D to production-ready predictive apps - Christophe Bourguignat & Yann ...
From R&D to production-ready predictive apps - Christophe Bourguignat & Yann ...PAPIs.io
 
Boligløsninger for eldre 040214: Lene Schmidts presentasjon
Boligløsninger for eldre 040214: Lene Schmidts presentasjonBoligløsninger for eldre 040214: Lene Schmidts presentasjon
Boligløsninger for eldre 040214: Lene Schmidts presentasjonUrbanRegionalResearch
 
Velox at SF Data Mining Meetup
Velox at SF Data Mining MeetupVelox at SF Data Mining Meetup
Velox at SF Data Mining MeetupDan Crankshaw
 
PMML - Predictive Model Markup Language
PMML - Predictive Model Markup LanguagePMML - Predictive Model Markup Language
PMML - Predictive Model Markup Languageaguazzel
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsPaco Nathan
 
MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...
  MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...  MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...
MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...Spark Summit
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsTuri, Inc.
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to productionGeorg Heiler
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache SparkJen Aman
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Spark Summit
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 
Enabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoTEnabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoTSingleStore
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016Robert Grossman
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Spark Summit
 

Viewers also liked (20)

Representing TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMMLRepresenting TF and TF-IDF transformations in PMML
Representing TF and TF-IDF transformations in PMML
 
Operationalizing analytics to scale
Operationalizing analytics to scaleOperationalizing analytics to scale
Operationalizing analytics to scale
 
Machine Learning In Production
Machine Learning In ProductionMachine Learning In Production
Machine Learning In Production
 
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, HadoopACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
ACM Bay Area Data Mining Workshop: Pattern, PMML, Hadoop
 
From R&D to production-ready predictive apps - Christophe Bourguignat & Yann ...
From R&D to production-ready predictive apps - Christophe Bourguignat & Yann ...From R&D to production-ready predictive apps - Christophe Bourguignat & Yann ...
From R&D to production-ready predictive apps - Christophe Bourguignat & Yann ...
 
Boligløsninger for eldre 040214: Lene Schmidts presentasjon
Boligløsninger for eldre 040214: Lene Schmidts presentasjonBoligløsninger for eldre 040214: Lene Schmidts presentasjon
Boligløsninger for eldre 040214: Lene Schmidts presentasjon
 
Velox at SF Data Mining Meetup
Velox at SF Data Mining MeetupVelox at SF Data Mining Meetup
Velox at SF Data Mining Meetup
 
PMML - Predictive Model Markup Language
PMML - Predictive Model Markup LanguagePMML - Predictive Model Markup Language
PMML - Predictive Model Markup Language
 
Production Grade Data Science for Hadoop
Production Grade Data Science for HadoopProduction Grade Data Science for Hadoop
Production Grade Data Science for Hadoop
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 
MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...
  MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...  MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...
MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...
 
Production and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning ModelsProduction and Beyond: Deploying and Managing Machine Learning Models
Production and Beyond: Deploying and Managing Machine Learning Models
 
Machine learning model to production
Machine learning model to productionMachine learning model to production
Machine learning model to production
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Low Latency Execution For Apache Spark
Low Latency Execution For Apache SparkLow Latency Execution For Apache Spark
Low Latency Execution For Apache Spark
 
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
Making Sense of Spark Performance-(Kay Ousterhout, UC Berkeley)
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Enabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoTEnabling Real-Time Analytics for IoT
Enabling Real-Time Analytics for IoT
 
AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016AnalyticOps - Chicago PAW 2016
AnalyticOps - Chicago PAW 2016
 
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
Drizzle—Low Latency Execution for Apache Spark: Spark Summit East talk by Shi...
 

Similar to Deploy machine learning models anywhere with PMML

Machine learning key to your formulation challenges
Machine learning key to your formulation challengesMachine learning key to your formulation challenges
Machine learning key to your formulation challengesMarc Borowczak
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandFrançois Garillot
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaChetan Khatri
 
Data ops: Machine Learning in production
Data ops: Machine Learning in productionData ops: Machine Learning in production
Data ops: Machine Learning in productionStepan Pushkarev
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...Chetan Khatri
 
Red Hat Agile integration Workshop Labs
Red Hat Agile integration Workshop LabsRed Hat Agile integration Workshop Labs
Red Hat Agile integration Workshop LabsJudy Breedlove
 
The Ring programming language version 1.5.3 book - Part 53 of 184
The Ring programming language version 1.5.3 book - Part 53 of 184The Ring programming language version 1.5.3 book - Part 53 of 184
The Ring programming language version 1.5.3 book - Part 53 of 184Mahmoud Samir Fayed
 
The Ring programming language version 1.5.3 book - Part 43 of 184
The Ring programming language version 1.5.3 book - Part 43 of 184The Ring programming language version 1.5.3 book - Part 43 of 184
The Ring programming language version 1.5.3 book - Part 43 of 184Mahmoud Samir Fayed
 
Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019dgarijo
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logginglucenerevolution
 
Develop an App with the Odoo Framework
Develop an App with the Odoo FrameworkDevelop an App with the Odoo Framework
Develop an App with the Odoo FrameworkOdoo
 
Zotonic tutorial EUC 2013
Zotonic tutorial EUC 2013Zotonic tutorial EUC 2013
Zotonic tutorial EUC 2013Arjan
 
10 ways to make your code rock
10 ways to make your code rock10 ways to make your code rock
10 ways to make your code rockmartincronje
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 

Similar to Deploy machine learning models anywhere with PMML (20)

Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
 
Machine learning key to your formulation challenges
Machine learning key to your formulation challengesMachine learning key to your formulation challenges
Machine learning key to your formulation challenges
 
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in SwitzerlandMobility insights at Swisscom - Understanding collective mobility in Switzerland
Mobility insights at Swisscom - Understanding collective mobility in Switzerland
 
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed KafsiSpark Summit EU talk by Francois Garillot and Mohamed Kafsi
Spark Summit EU talk by Francois Garillot and Mohamed Kafsi
 
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scalaAutomate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala
 
Data ops: Machine Learning in production
Data ops: Machine Learning in productionData ops: Machine Learning in production
Data ops: Machine Learning in production
 
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
TransmogrifAI - Automate Machine Learning Workflow with the power of Scala an...
 
Xaml programming
Xaml programmingXaml programming
Xaml programming
 
Red Hat Agile integration Workshop Labs
Red Hat Agile integration Workshop LabsRed Hat Agile integration Workshop Labs
Red Hat Agile integration Workshop Labs
 
The Ring programming language version 1.5.3 book - Part 53 of 184
The Ring programming language version 1.5.3 book - Part 53 of 184The Ring programming language version 1.5.3 book - Part 53 of 184
The Ring programming language version 1.5.3 book - Part 53 of 184
 
The Ring programming language version 1.5.3 book - Part 43 of 184
The Ring programming language version 1.5.3 book - Part 43 of 184The Ring programming language version 1.5.3 book - Part 43 of 184
The Ring programming language version 1.5.3 book - Part 43 of 184
 
Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019
 
29. Treffen - Tobias Meier - TypeScript
29. Treffen - Tobias Meier - TypeScript29. Treffen - Tobias Meier - TypeScript
29. Treffen - Tobias Meier - TypeScript
 
Multi faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & loggingMulti faceted responsive search, autocomplete, feeds engine & logging
Multi faceted responsive search, autocomplete, feeds engine & logging
 
Develop an App with the Odoo Framework
Develop an App with the Odoo FrameworkDevelop an App with the Odoo Framework
Develop an App with the Odoo Framework
 
MDE in Practice
MDE in PracticeMDE in Practice
MDE in Practice
 
Zotonic tutorial EUC 2013
Zotonic tutorial EUC 2013Zotonic tutorial EUC 2013
Zotonic tutorial EUC 2013
 
10 ways to make your code rock
10 ways to make your code rock10 ways to make your code rock
10 ways to make your code rock
 
Into The Box 2018 - CBT
Into The Box 2018 - CBTInto The Box 2018 - CBT
Into The Box 2018 - CBT
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 

Recently uploaded

Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxellehsormae
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfJohn Sterrett
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degreeyuu sss
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 

Recently uploaded (20)

Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
办美国阿肯色大学小石城分校毕业证成绩单pdf电子版制作修改#真实留信入库#永久存档#真实可查#diploma#degree
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 

Deploy machine learning models anywhere with PMML

  • 1. On the representation and reuse of machine learning models Villu Ruusmann Openscoring OÜ
  • 3. Def: "Model" Output = func(Input) 3
  • 5. The problem "Train once, deploy anywhere" 5
  • 6. A solution Matching model representation (MR) with the task at hand: 1. Storing a generic and stable MR 2. Generating a wide variety of more specific and volatile MRs upon request 6
  • 7. The Predictive Model Markup Language (PMML) ● XML dialect for marking up models and associated data transformations ● Version 1.0 in 1999, version 4.3 in 2016 ● "Conventions over configuration" ● 17 top-level model types + ensembling http://dmg.org/ http://dmg.org/pmml/pmml-v4-3.html http://dmg.org/pmml/products.html 7
  • 8. A continuum from black to white boxes Introducing transparency in the form of rich, easy to use, well-documented APIs: 1. Unmarshalling and marshalling 2. Static analyses. Ex: schema querying 3. Dynamic analyses. Ex: scoring 4. Tracing and explaining individual predictions 8
  • 9. The Zen of Machine Learning "Making the model requires large data and many cpus. Using it does not" --darren https://www.mail-archive.com/user@spark.apache.org/msg40636.html 9
  • 10. Model training workflow Real-world feature space ML-platform feature space ML-platform model 10
  • 11. Model deployment workflow Real-world feature space ML-platform feature space ML-platform model Real-world feature space Real-world model vs. 11
  • 12. Model resources R code Scikit-Learn code Apache Spark ML code Original PMML markup Java code Python code Optimized PMML markup Training DeploymentVersioned storage 12
  • 13. Comparison of model persistence options R Scikit-Learn Apache Spark ML Model data structure stability Fair to excellent Fair Poor Native serialization data format RDS (binary) Pickle (binary) SER (binary) and JSON (text) Export to PMML Few external N/A Built-in trait PMMLWritable Import from PMML Few external N/A JPMML projects JPMML-R and r2pmml JPMML-SkLearn and sklearn2pmml JPMML-SparkML (-Package) 13
  • 14. PMML production: R library("r2pmml") auto <- read.csv("Auto.csv") auto$origin <- as.factor(auto$origin) auto.formula <- formula(mpg ~ (.) ^ 2 + # simple features and their two way interactions I(displacement / cylinders) + I(log(weight))) # derived features auto.lm <- lm(auto.formula, data = auto) r2pmml(auto.lm, "auto_lm.pmml", dataset = auto) auto.glm <- glm(auto.formula, data = auto, family = "gaussian") r2pmml(auto.glm, "auto_glm.pmml", dataset = auto) 14
  • 15. R quirks ● No pipeline concept. Some workflow standardization efforts by third parties. Ex: caret package ● Many (equally right-) ways of doing the same thing. Ex: "formula interface" vs. "matrix interface" ● High variance in the design and quality of packages. Ex: academia vs. industry ● Model objects may enclose the training data set 15
  • 16. PMML production: Scikit-Learn from sklearn2pmml import sklearn2pmml from sklearn2pmml.decoration import CategoricalDomain, ContinuousDomain audit_df = pandas.read_csv("Audit.csv") audit_mapper = DataFrameMapper([ (["Age", "Income", "Hours"], ContinuousDomain()), (["Employment", "Education", "Marital", "Occupation"], [CategoricalDomain(), LabelBinarizer()]), (["Gender", "Deductions"], [CategoricalDomain(), LabelEncoder()]), ("Adjusted", None)]) audit = audit_mapper.fit_transform(audit_df) audit_classifier = DecisionTreeClassifier(min_samples_split = 10) audit_classifier.fit(audit[:, 0:48], audit[:, 48].astype(int)) sklearn2pmml(audit_classifier, audit_mapper, "audit_tree.pmml") 16
  • 17. Scikit-Learn quirks ● Completely schema-less at algorithm level. Ex: no identification of columns, no tracking of column groups ● Very limited, simple data structures. Mix of Python and C ● No built-in persistence mechanism. Serialization in generic pickle data format. Upon de-serialization, hope that class definitions haven't changed in the meantime. 17
  • 18. PMML production: Apache Spark ML // $ spark-shell --packages org.jpmml:jpmml-sparkml-package:1.0-SNAPSHOT .. import org.jpmml.sparkml.ConverterUtil val df = spark.read.option("header", "true").option("inferSchema", "true").csv("Wine.csv") val formula = new RFormula().setFormula("quality ~ .") val regressor = new DecisionTreeRegressor() val pipeline = new Pipeline().setStages(Array(formula, regressor)) val pipelineModel = pipeline.fit(df) val pmmlBytes = ConverterUtil.toPMMLByteArray(df.schema, pipelineModel) Files.write(Paths.get("wine_tree.pmml"), pmmlBytes) 18
  • 19. Apache Spark ML quirks ● Split schema. Static def via Dataset#schema(), dynamic def via Dataset column metadata ● Models make predictions in transformed output space ● High internal complexity, overhead. Ex: temporary Dataset columns for feature transformation ● Built-in PMML export capabilities leak the JPMML-Model library to application classpath 19
  • 20. PMML consumption: Apache Spark ML // $ spark-submit --packages org.jpmml:jpmml-spark:1.0-SNAPSHOT .. import org.jpmml.spark.EvaluatorUtil; import org.jpmml.spark.TransformerBuilder; Evaluator evaluator = EvaluatorUtil.createEvaluator(new File("audit_tree.pmml")); TransformerBuilder pmmlTransformerBuilder = new TransformerBuilder(evaluator) .withLabelCol("Adjusted") // String column .withProbabilityCol("Adjusted_probability", Arrays.asList("0", "1")) // Vector column .exploded(true); Transformer pmmlTransformer = pmmlTransformerBuilder.build(); Dataset<Row> input = ...; Dataset<Row> output = pmmlTransformer.transform(input); 20
  • 21. Comparison of feature spaces R Scikit-Learn Apache Spark ML Feature identification Named Positional Pseudo-named Feature data type Any Float, Double Double Feature operational type Continuous, Categorical, Ordinal Continuous Continuous, pseudo-categorical Dataset abstraction List<Map<String,?>> float[][] or double[][] List<double[]> Effect of transformations on dataset size Low Medium (sparse) to high (dense) 21
  • 22. Feature declaration <DataField name="Age" dataType="float" optype="continuous"> <Interval closure="closedClosed" leftMargin="17.0" rightMargin="83.0"/> </DataField> <DataField name="Gender" dataType="string" optype="categorical"> <Value value="Male"/> <Value value="Female"/> <Value value="N/A" property="missing"/> </DataField> http://dmg.org/pmml/v4-3/DataDictionary.html <MiningField name="Age" outliers="asExtremeValues" lowValue="18.0" highValue="75.0"/> <MiningField name="Gender"/> http://dmg.org/pmml/v4-3/MiningSchema.html 22
  • 23. Feature statistics <UnivariateStats field="Age"> <NumericInfo mean="38.30279" standardDeviation="13.01375" median="3.70"/> <ContStats> <Interval closure="openClosed" leftMargin="17.0" rightMargin="23.6"/> <!-- Intervals 2 through 9 omitted for clarity --> <Interval closure="openClosed" leftMargin="76.4" rightMargin="83.0"/> <Array type="int">261 360 297 340 280 156 135 51 13 6</Array> </ContStats> </UnivariateStats> <UnivariateStats field="Gender"> <Counts totalFreq="1899" missingFreq="0" invalidFreq="0"/> <DiscrStats> <Array type="string">Male Female</Array> <Array type="int">1307 592</Array> </DiscrStats> </UnivariateStats> http://dmg.org/pmml/v4-3/Statistics.html23
  • 24. Comparison of tree models R Scikit-Learn Apache Spark ML Algorithms No built-in, many external Few built-in Single built-in Split type(s) Binary or multi-way; simple and derived features Binary; simple features Continuous features Rel. op. (<, <=) Rel. op. (<=) Categorical features Set op. (%in%) Pseudo-rel. op. (==) Pseudo-set op. Reuse Hard Easy to medium 24
  • 25. Tree model declaration <TreeModel functionName="classification" splitCharacteristic="binarySplit"> <Node id="1" recordCount="165"> <True/> <Node id="2" score="1" recordCount="35"> <SimplePredicate field="Education" operator="equal" value="Master"/> <ScoreDistribution value="1" recordCount="25"/> <ScoreDistribution value="0" recordCount="10"/> </Node> <Node id="3" score="0" recordCount="130"> <SimplePredicate field="Education" operator="notEqual" value="Master"/> <ScoreDistribution value="1" recordCount="20"/> <ScoreDistribution value="0" recordCount="110"/> </Node> </Node> </TreeModel> http://dmg.org/pmml/v4-3/TreeModel.html 25
  • 26. Optimization (1/3) <Node> SimplePredicate: Gender != "Male" <Node> SimplePredicate: Age <= 34.5 </Node> <Node> SimplePredicate: Age > 34.5 </Node> </Node> <Node> SimplePredicate: Gender == "Male" </Node> <Node> SimplePredicate: Gender == "Male" </Node> <Node> SimplePredicate: Age <= 34.5 </Node> <Node> SimplePredicate: Age > 34.5 </Node> Replacing "deep" binary splits with "shallow" multi-splits: 26
  • 27. Optimization (2/3) <TreeModel noTrueChildStrategy="returnNullPrediction" > <Node> <True/> <Node score="4.333333333333333"> SimplePredicate: pH <= 2.93 </Node> <Node score="5.483870967741935"> SimplePredicate: pH > 2.93 </Node> </Node> </TreeModel> <TreeModel noTrueChildStrategy="returnLastPrediction" > <Node score="5.483870967741935"> <True/> <Node score="4.333333333333333"> SimplePredicate: pH <= 2.93 </Node> </Node> </TreeModel> Cutting the number of terminal nodes in half: 27
  • 28. Optimization (3/3) <Node> SimplePredicate: Age > 35.5 <Node score="0" recordCount="40"> SimplePredicate: Income <= 194386 ScoreDistribution: "0" = 40, "1" = 0 </Node> <Node score="0" recordCount="8"> SimplePredicate: Income > 194386 ScoreDistribution: "0" = 7, "1" = 1 </Node> </Node> <Node score="0" recordCount="48"> SimplePredicate: Age > 35.5 ScoreDistribution: "0" = 47, "1" = 1 </Node> Removing split levels that don't affect the prediction: 28
  • 29. Code generation <TreeModel missingValueStrategy="defaultChild" > <Node id="1" defaultChild="3"> <True/> <Node id="2" score="0"> SimplePredicate: Age <= 52 ScoreDistribution: "0" = 7, "1" = 0 </Node> <Node id="3" score="1"> SimplePredicate: Age > 52 ScoreDistribution: "0" = 1, "1" = 2 </Node> </Node> </TreeModel> Object[] node_1(FieldValue age, ...){ if(age != null && age.asFloat() <= 52f){ return node_2(...); } else { return node_3(...); } } Object[] node_2(...){ return new Object[]{"0", 7d, 0d}; } Object[] node_3(...){ return new Object[]{"1", 1d, 2d}; } Bad Idea! 29
  • 30. Common pitfalls Not treating splits on continuous features with the required precision (tolerance < 0.5 ULP): ● Truncating values. Ex: "1.5000(..)1" → "1.50" ● Changing data type. Ex: float ↔ double ● Changing arithmetic expressions. Ex: (x1 / x2 ) ↔ (1 / x2 ) * x1 30
  • 31. Comparison of regression models R Scikit-Learn Apache Spark ML Algorithms Few built-in, many external Many built-in Few built-in Term type(s) Simple and derived features; interactions Simple features Reuse Hard Easy 31
  • 32. Regression model declaration <RegressionModel functionName="regression" normalizationMethod="none"> <RegressionTable intercept="15.50143741004145"> <!-- Simple continuous feature --> <NumericPredictor name="cylinders" coefficient="2.1609496766686194"/> <!-- Simple categorical feature --> <CategoricalPredictor name="origin" value="2" coefficient="-35.87525051244351"/> <CategoricalPredictor name="origin" value="3" coefficient="-38.206750156693424"/> <!-- Interaction --> <PredictorTerm coefficient="-0.007734946028064237"> <FieldRef field="cylinders"/> <FieldRef field="displacement"/> </PredictorTerm> <!-- Derived feature; I(log(weight)) is backed by a DerivedField element --> <NumericPredictor name="I(log(weight))" coefficient="4.874500863508498"/> </RegressionTable> </RegressionModel> http://dmg.org/pmml/v4-3/Regression.html32
  • 33. Generalized regression model declaration <GeneralRegressionModel functionName="regression" linkFunction="identity"> <ParameterList> <Parameter name="p0" label="(intercept)"/> <Parameter name="p11" label="cylinders:displacement"/> </ParameterList> <PPMatrix> <PPCell value="1" predictorName="cylinders" parameterName="p11"/> <PPCell value="1" predictorName="displacement" parameterName="p11"/> </PPMatrix> <ParamMatrix> <PCell parameterName="p0" beta="15.50143741004145"/> <PCell parameterName="p11" beta="-0.007734946028064237"/> </ParamMatrix> </GeneralRegressionModel> http://dmg.org/pmml/v4-3/GeneralRegression.html 33
  • 34. Code generation double mpg(FieldValue cylinders, FieldValue displacement, FieldValue weight, FieldValue origin){ return 15.50143741004145d + // intercept 2.1609496766686194d * cylinders.asDouble() + // simple continuous feature origin(origin.asString()) + // simple categorical feature -0.007734946028064237d * (cylinders.asDouble() * displacement.asDouble()) + // interaction 4.874500863508498d * Math.ln(weight.asDouble()) // derived feature } double origin(String origin){ switch(origin){ case "1": return 0d; // baseline case "2": return -35.87525051244351d; case "3": return -38.206750156693424d; } throw new IllegalArgumentException(origin); } 34
  • 35. Grand summary R Scikit-Learn Apache Spark ML ML-platform model information density Medium to high Low Low to medium ML-platform model interpretability Good Bad Bad PMML markup optimizability Low High Medium ML-platform to PMML learning and migration effort Medium Low to medium Low 35