Matthew Tovbin
Principal Engineer, Salesforce Einstein
mtovbin@salesforce.com
@tovbinm
Doubt Truth to be a Liar
Non Triviality of Type Safety for Machine Learning
“Doubt thou the stars are fire,
Doubt that the sun doth move,
Doubt truth to be a liar,
But never doubt I love.”
- William Shakespeare, Hamlet
A glimpse into the future
​ What I am going to talk about:
• Machine Learning (ML) 101
• Real-life ML
• Building ML application with Spark ML
• Typed Feature Engineering with Optimus Prime
• Behind the scenes
• Going forward
Machine Learning 101
​ What is Machine Learning?
• “The capacity of a computer to learn from experience, i.e. to modify its
processing on the basis of newly acquired information” – 1950s, IBM Journal
What is “experience” in the computer terms?
• It’s just data.
What are the tasks Machine Learning solves?
• Recognition, diagnosis, prediction, forecasting, planning, data mining, etc.
Machine Learning 101
​ Feature
• An individual measurable property of a phenomenon being observed.
• Choosing informative, discriminating and independent features is a crucial
step for building effective ML algorithms.
Feature Vector
• An n-dimensional vector that represents a set of features corresponding to a
single observation, e.g. an email open/click, a product purchase, etc.
Model
• A structure and corresponding interpretation that summarizes or partially
summarizes a set of data, for description or prediction.
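As a minimal sketch of the terms above, the snippet below turns a single observation into an n-dimensional feature vector. The `Observation` case class, its fields and the encoding choices are assumptions made for illustration; they are not taken from the talk.

```scala
// Hypothetical observation type; field names are illustrative only.
final case class Observation(age: Double, isFemale: Boolean, siblings: Int, parents: Int)

object FeatureVectorSketch {
  // Each element of the vector is one feature: an individual measurable property.
  def toFeatureVector(o: Observation): Vector[Double] =
    Vector(
      o.age,                                 // numeric feature used as is
      if (o.isFemale) 1.0 else 0.0,          // binary feature encoded as 0/1
      (o.siblings + o.parents + 1).toDouble  // derived feature: family size including self
    )
}
```

Here n = 3; a model would consume one such vector per observation.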
​ [1/2] - Terms
Machine Learning 101
​ Training
• The process of training an ML model involves providing an ML algorithm
with training data to learn from.
• During training, data is evaluated by the ML algorithm, which analyzes the
distribution and type of the data, looking for rules and patterns that can
later be used for prediction.
​ Scoring
• The process of applying a trained model to new data to generate predictions
and other values.
• Examples: a list of recommended items, forecast for time series models,
estimates of projected demand/volume/etc., probability scores.
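The split between training and scoring can be sketched with a deliberately tiny model. The `ThresholdModel` below is a hypothetical toy classifier on a single feature, chosen only to show the two phases. It assumes that both labels occur in the training data and that label 1 tends to have larger feature values.

```scala
object TrainScoreSketch {
  // A trained "model": here just a learned threshold on a single feature.
  final case class ThresholdModel(threshold: Double) {
    // Scoring: apply the trained model to new data to generate a prediction.
    def score(x: Double): Int = if (x >= threshold) 1 else 0
  }

  // Training: derive the model parameter from labeled examples (feature, label).
  def train(data: Seq[(Double, Int)]): ThresholdModel = {
    def mean(xs: Seq[Double]): Double = xs.sum / xs.size
    val mean0 = mean(data.collect { case (x, 0) => x })
    val mean1 = mean(data.collect { case (x, 1) => x })
    ThresholdModel((mean0 + mean1) / 2) // midpoint between the two class means
  }
}
```

Training happens once on historical data; scoring can then be repeated on any number of new observations.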
​ [2/2] - Terms
Machine Learning 101
​ Building a ML model
[Diagram: Feature Engineering -> Model Training (Model A, Model B, Model C) -> Model Evaluation]
Real-life ML
​ Building a ML model pipeline
[Diagram: pipeline stages ETL, Feature Engineering, Model Training (Model A, Model B, Model C), Model Evaluation, Deployment, Scoring]
Real-life ML
​ Just a few problems to mention
• ETL is tough
• Feature Engineering is even tougher
• Our prototype in R/Python/Octave works great, but …
• Copy/pasting code across projects doesn’t scale
• Model Training fails exactly two hours after you go to sleep
    • Data is not there
    • Data is there, but in the wrong format
    • Data is there, but is insufficient
    • OOM, Insufficient Space, Serialization…
• So we have the models/scores, but can we trust them?!
Real-life ML
​ Specifically for Salesforce
Multi-tenancy
• Multiple customers: Square, Fanatics, etc.
• Multiple customer environments: Live, Staging, QA, Dev.
• Multiple data sources: SFDC, Marketing Cloud, Service Cloud, IoT Cloud, etc.
• Multiple data entities: Leads, Opportunities, Email Campaigns, etc.
• Multiple applications: Lead Scoring, Predictive Journeys, or custom.
Security, Scale, Automation, Transparency, Cost Efficiency and so on.
Salesforce Einstein
​ AI for everyone
• ML Platform for customers, engineers, data scientists
• No need for ETL or PhD
​ PredictionIO
• Most starred Scala project on GitHub
• Now part of Apache’s incubator program
Optimus Prime
• In-house transformation framework
• Declarative, collaborative, reusable, typed
Services, Microservices, Nanoservices…
Building ML application
with Spark ML
Predict survival on the Titanic
Sources: https://www.kaggle.com/c/titanic/data, https://github.com/BenFradet/spark-kaggle
PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S
4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1 | C123 | S
5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.05 | | S
6 | 0 | 3 | Moran, Mr. James | male | | 0 | 0 | 330877 | 8.4583 | | Q
7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54 | 0 | 0 | 17463 | 51.8625 | E46 | S
8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2 | 3 | 1 | 349909 | 21.075 | | S
...
890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26 | 0 | 0 | 111369 | 30 | C148 | C
891 | 0 | 3 | Dooley, Mr. Patrick | male | 32 | 0 | 0 | 370376 | 7.75 | | Q
Building ML application with Spark ML
​ [1/5] - Spinning up Spark
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.evaluation._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.tuning._

// Spinning up Spark
val conf = new SparkConf().setMaster("local[2]")
val session = SparkSession.builder.config(conf).getOrCreate()
val (sc, sqlc) = (session.sparkContext, session.sqlContext)

import sqlc.implicits._
Building ML application with Spark ML
​ [2/5] - Reading the data
def readData(file: String): DataFrame = {
  val schema = StructType(Array(
    StructField("PassengerId", IntegerType, nullable = true), // <-- field names are easy to misspell
    StructField("Survived", DoubleType, nullable = true), // <-- we can set any type here
    StructField("Pclass", DoubleType, nullable = true),
    StructField("Name", StringType, nullable = true),
    StructField("Sex", StringType, nullable = true),
    StructField("Age", DoubleType, nullable = true),
    StructField("SibSp", DoubleType, nullable = true),
    StructField("Parch", DoubleType, nullable = true),
    StructField("Ticket", StringType, nullable = true),
    StructField("Fare", DoubleType, nullable = true),
    StructField("Cabin", StringType, nullable = true),
    StructField("Embarked", StringType, nullable = true)
  ))
  val df: DataFrame = sqlc.read.format("csv").option("header", "true").schema(schema).load(file)
  // Select and rename necessary fields
  val select = Array($"Survived".as("survived"), $"Sex".as("sex"), $"Age".as("age"),
    $"Pclass".as("pclass"), $"SibSp".as("sibsp"), $"Parch".as("parch"), $"Embarked".as("embarked"))
  df.select(select: _*)
}
val rawData: DataFrame = readData(file = "titanic.csv") // <-- runtime exceptions...
Building ML application with Spark ML
​ [3/5] - Feature Engineering
def addFeatures(df: DataFrame): DataFrame = {
  // Create a new family size field := siblings + spouses + parents + children + self
  val familySizeUDF = udf { (sibsp: Double, parch: Double) => sibsp + parch + 1 }

  df.withColumn("fsize", familySizeUDF(col("sibsp"), col("parch"))) // <-- full freedom to overwrite
}

def fillMissing(df: DataFrame): DataFrame = {
  // Fill missing age values with average age (extract the Double from the single result row)
  val avgAge: Double = df.agg(avg("age")).first().getDouble(0)

  // Fill missing embarked values with default "S" (i.e. Southampton)
  val embarkedUDF = udf { (e: String) => if (e == null || e.isEmpty) "S" else e }

  df.na.fill(Map("age" -> avgAge)).withColumn("embarked", embarkedUDF(col("embarked")))
}
// Modify the dataframe
val allData = fillMissing(addFeatures(rawData)).cache() // <-- need to remember about caching
// Split the data and cache it
val Array(trainSet, testSet) = allData.randomSplit(Array(0.75, 0.25)).map(_.cache())
Building ML application with Spark ML
​ [4/5] - Building the pipeline
// Prepare categorical columns
val categoricalFeatures = Array("pclass", "sex", "embarked")
val stringIndexers = categoricalFeatures.map(colName =>
  new StringIndexer().setInputCol(colName).setOutputCol(colName + "_index").fit(allData)
)
// Concat all the features into a numeric feature vector
val allFeatures = Array("age", "sibsp", "parch", "fsize") ++ stringIndexers.map(_.getOutputCol)

val vectorAssembler = new VectorAssembler().setInputCols(allFeatures).setOutputCol("feature_vector")
// Prepare Logistic Regression estimator
val logReg = new LogisticRegression().setFeaturesCol("feature_vector").setLabelCol("survived")
// Finally build the pipeline with the stages above
val pipeline = new Pipeline().setStages(stringIndexers ++ Array(vectorAssembler, logReg))
Building ML application with Spark ML
​ [5/5] - Model training
// Cross validate our pipeline with various parameters
val paramGrid =
  new ParamGridBuilder()
    .addGrid(logReg.regParam, Array(1, 0.1, 0.01))
    .addGrid(logReg.maxIter, Array(10, 50, 100))
    .build()

val crossValidator =
  new CrossValidator()
    .setEstimator(pipeline) // <-- set our pipeline here
    .setEstimatorParamMaps(paramGrid)
    .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("survived"))
    .setNumFolds(3)

// Train the model & compute scores
val model: CrossValidationModel = crossValidator.fit(trainSet)
val scores: DataFrame = model.transform(testSet)

// Save the model for later use
model.save("/models/titanic-model.ml")
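What the parameter grid and the 3-fold cross validation enumerate can be sketched in plain Scala, without Spark. This is a conceptual illustration only, not how Spark implements `ParamGridBuilder` or `CrossValidator`; the `Params` case class and the index-based fold assignment are assumptions.

```scala
object ParamGridSketch {
  // One candidate configuration: a value for each hyperparameter.
  final case class Params(regParam: Double, maxIter: Int)

  // Cartesian product of the candidate values, as a grid search would enumerate them.
  def grid(regParams: Seq[Double], maxIters: Seq[Int]): Seq[Params] =
    for (r <- regParams; m <- maxIters) yield Params(r, m)

  // Split the row indices 0 until n into k folds of (training indices, validation indices).
  def folds(n: Int, k: Int): Seq[(Seq[Int], Seq[Int])] =
    (0 until k).map { fold =>
      val (validation, training) = (0 until n).partition(_ % k == fold)
      (training, validation)
    }
}
```

With three values per parameter the grid holds 3 x 3 = 9 configurations, and evaluating each on 3 folds means 27 model fits.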
Building ML application with Spark ML
​ The good parts
• Simple abstraction: Transformers (.map), Estimators (.reduce) and Pipelines
• Serialization allows reusability of models
• Good implementations for various estimators: Word2Vec, LogReg, etc.
• All-In-One: data exploration, prototyping, productionization
• Multi language support: Java/Scala/Python
• Healthy ecosystem
Building ML application with Spark ML
​ The not so good parts
• No type checking (especially painful for Transformers, Estimators and Pipelines)
• Transformer and Estimator interfaces are too open: Dataset => DataFrame
• DataFrames are everywhere
    • No type checking
    • Easy to misspell column names
    • No integration with ML Vector
    • Missing a lot of RDD functionality
• Lack of support for common data I/O operations
• Schema and algorithms definitions are interleaved with data manipulations
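The type-checking and column-name problems listed above can be illustrated without Spark. In this sketch a `Map[String, Any]` stands in for an untyped row, which is an assumption made to keep the example self-contained; the `Passenger` case class is likewise hypothetical.

```scala
object TypeSafetySketch {
  // Untyped row: column names are strings and values are Any.
  type Row = Map[String, Any]

  // Compiles fine, but a misspelled column or a wrong cast only fails at runtime.
  def familySizeUntyped(row: Row): Double =
    row("sibsp").asInstanceOf[Double] + row("parch").asInstanceOf[Double] + 1

  // Typed record: a misspelled field or a wrong type is rejected by the compiler.
  final case class Passenger(sibsp: Double, parch: Double)

  def familySizeTyped(p: Passenger): Double = p.sibsp + p.parch + 1
}
```

Calling `familySizeUntyped` with a row whose key is spelled "parhc" throws a `NoSuchElementException` when the job runs, whereas `p.parhc` would never compile.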
Can we do better?!
Typed Feature Engineering
with Optimus Prime
Optimus Prime
What is Optimus Prime?
• A transformation framework to develop reusable, modular and typed ML
pipelines
Why are we building it?
• Declarative and intuitive syntax
• Typed operations with Spark ML
• Reusability of I/O operations, features, transformations, pipelines
• Separation of features and transformations from data operations
• Multitenant applications
Optimus Prime
Building ML application with Optimus Prime
​ [1/3] – Defining a reader & raw features
import com.salesforce.op._
import com.salesforce.op.test.avro.Passenger

// Define the reader from CSV to Avro
val trainReader: DataReader[Passenger] = DataReaders.Simple.csv[Passenger](
  path = Some("titanic.csv"),
  schema = Passenger.getClassSchema.toString
)
// Define the response feature (feature names are inferred from val names)
val survived = FeatureBuilder.Binary[Passenger].extract(p => Option(p.getSurvived).map(_ != 0)).asResponse
// Define the predictor features
val age = FeatureBuilder.NullableNumeric[Passenger].extract(p => Option(p.getAge)).asPredictor
val sex = FeatureBuilder.Categorical[Passenger].extract(p => Set(p.getSex)).asPredictor
val pclass = FeatureBuilder.Numeric[Passenger].extract(_.getPClass).asPredictor
val sibsp = FeatureBuilder.Numeric[Passenger].extract(_.getSibSp).asPredictor
val parch = FeatureBuilder.Numeric[Passenger].extract(_.getParCh).asPredictor
val embarked = FeatureBuilder.Text[Passenger].extract(p => Option(p.getEmbarked)).asPredictor
Building ML application with Optimus Prime
​ [2/3] – Feature engineering
// Create a new family size feature – no annoying UDFs here!
val fsize: FeatureLike[Numeric] = sibsp + parch + 1
// Fill missing age values with average age (i.e. from NullableNumeric we get Numeric)
val ageFilled: FeatureLike[Numeric] = age.fillMissingWithMean
// Fill missing embarked values with default "S" (i.e. Southampton)
val embarkedFilled: FeatureLike[Text] = embarked.fillMissingWith("S")
// Create a feature vector field using default vectorizers
val featureVector: FeatureLike[Vector] =
  Seq(sex, ageFilled, pclass, sibsp, parch, fsize, embarkedFilled).vectorize()
Building ML application with Optimus Prime
​ [3/3] – Model training
val modelSelector = { // Create a model selector with two algorithms
  new ModelSelector[Passenger]().setInput(survived, featureVector)
    .setParams(
      LogisticRegression.RegParam -> Array(1, 0.1, 0.01),
      LogisticRegression.MaxIter -> Array(10, 50, 100),
      RandomForest.NumTrees -> Array(3, 5, 10)
    ).setModels(Algs.LogisticRegression, Algs.RandomForest) // <-- multiple algorithms here
    .setEvaluator(Evals.BinaryClassification)
}
// Build the pipeline with the model selector
val pipeline = new OpPipeline[Passenger]().setInput(modelSelector)

// And only now we are spinning up Spark
val conf = new SparkConf().setMaster("local[2]")
implicit val session = SparkSession.builder.config(conf).getOrCreate()

// Train the model & compute scores
// (testReader is not shown on the slides; presumably a DataReader[Passenger] defined like trainReader)
val model: OpPipelineModel[Passenger] = pipeline.setReader(trainReader).train()
val scores: DataFrame = model.setReader(testReader).score()

model.save("/models/titanic-model.op") // Save the model for later use
Building ML application with Optimus Prime
​ Summary
• Everything is typed: readers, features, pipeline, model
• Declarative and intuitive syntax with code completion
• Feature names are inferred from val names –> no misspelled names
• Features are always unique in a code block, compilation error otherwise
• Common data I/O operations provided with DataReaders (joins, aggrs etc.)
• DataFrames are abstracted away – no direct interaction
• Features and transformations are separate from data operations
Behind the scenes of
Optimus Prime
Types and interactions
[Diagram of types and interactions. Nodes: Feature[T <: FeatureType] (Numeric, Text, Categorical, …); Transformers (.map) (Unary, Binary, …); Estimators (.reduce) (Average, Word2Vec, …); My Model; Data Readers (CSV, Avro, …); Pipelines (Titanic, Lead Scoring, …). Edge labels: transformed with, produce, fitted into, read, materialized by, trained, joined / aggr.]
Feature Types V1
case object FeatureTypes {
  type Numeric = Double
  type NullableNumeric = Option[Double]
  type Categorical = scala.collection.mutable.WrappedArray[String]
  type Text = Option[String]
  type Binary = Option[Boolean]
  type DateList = scala.collection.mutable.WrappedArray[Long]
  type KeyString = scala.collection.Map[String, String]
  type KeyNumeric = scala.collection.Map[String, Double]
  type KeyBinary = scala.collection.Map[String, Boolean]
  type Vector = org.apache.spark.ml.linalg.Vector

  // TBD: Specific/rich types: Email, Phone, URL, etc.
}
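One plausible reading of why rich types such as Email are still "TBD" in V1 is that plain type aliases do not create new types. The sketch below illustrates that limitation. The `Email` alias and the `TextValue`/`EmailValue` classes are hypothetical names introduced for this example, not the framework's own.

```scala
object AliasVsClassSketch {
  // V1 style: type aliases. Both names denote the very same type, Option[String].
  type Text  = Option[String]
  type Email = Option[String] // hypothetical alias, not on the V1 slide

  // Intended for emails only, yet any Text is accepted: the aliases are interchangeable.
  def domainOf(email: Email): Option[String] =
    email.map(_.dropWhile(_ != '@').drop(1))

  // Class-based style: distinct types, so an EmailValue cannot be confused with a TextValue.
  sealed trait FeatureValue { def isEmpty: Boolean }
  final class TextValue(val value: Option[String]) extends FeatureValue {
    def isEmpty: Boolean = value.isEmpty
  }
  final class EmailValue(val value: Option[String]) extends FeatureValue {
    def isEmpty: Boolean = value.isEmpty
    def domain: Option[String] = value.map(_.dropWhile(_ != '@').drop(1))
  }
}
```

With aliases, `domainOf` happily accepts a free-text value; with classes, `domain` exists only on `EmailValue`, so the mistake becomes a compilation error.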
Feature Types V2
[Diagram: class hierarchy rooted at FeatureType. Types shown: OPNumeric, Integral, Real, Binary, Percent, Currency, Date, DateTime; Text, Email, Base64, Phone, ID, URL, ComboBox, PickList, TextArea; Categorical, MultiPickList, Ordinal; OPCollection, OPSet, OPSortedSet, OPList, TextList, DateList, DateTimeList, OPVector; OPMap, BinaryMap, IntegralMap, RealMap, CategoricalMap, OrdinalMap, TextMap; NonNullable; ...]
Legend: - inheritance, bold - abstract class, italic - trait, normal - concrete class
Typed Features
// Typed value container
trait FeatureType extends Serializable {

  type Value // feature value type

  def value: Value // actual value

  def isEmpty: Boolean // true if value is empty

  def isNullable: Boolean // true if value is nullable

  // ...
}
// For example, a text feature value type
class Text(val value: Option[String]) extends FeatureType {
  type Value = Option[String]
  def this(value: String) = this(Option(value))
  final def isEmpty: Boolean = value.isEmpty
}
Typed Features
// Represents a single feature (dimension)
trait FeatureLike[O <: FeatureType] extends Serializable {

  implicit def wtt: WeakTypeTag[O] // Overcoming type erasure

  def name: String // name of the feature

  def defaultValue: O // feature default value

  def originStage: OpPipelineStage[O] // the stage which generated this feature

  def parents: Seq[FeatureLike[_ <: FeatureType]] // the input features for the origin stage

  // feature transformation function (i.e. map). We have more like this...
  final def transformWith[U <: FeatureType](stage: OpPipelineStage1[O, U]): FeatureLike[U]

  // ...
}
Feature names from vals using Macros magic
val sibsp = FeatureBuilder.Numeric[Passenger](name = "sibsp")

val sibsp = FeatureBuilder.Numeric[Passenger] // <-- can we just do this?!

object FeatureBuilder {
  def Numeric[I]: FeatureBuilder[I, Numeric] = macro FeatureBuilderMacros.apply[I, Numeric]
}

// HAHA! So we meet again!
private[op] object FeatureBuilderMacros {
  def apply[I: c.WeakTypeTag, O: c.WeakTypeTag](c: Context): c.Expr[FeatureBuilder[I, O]] = {
    import c.universe._
    val enclosingValName = MacrosHelper.definingValName(c)
    val featureName = c.Expr[String](Literal(Constant(enclosingValName)))
    val fbApply = Select(reify(FeatureBuilder).tree, TermName("apply"))
    val fbExpr = c.Expr[FeatureBuilder[I, O]](Apply(fbApply, featureName.tree :: Nil))

    reify(fbExpr.splice)
  }
}
// Read more in sbt codebase - https://goo.gl/OdPvry
Feature transformations with Implicit Classes
// Create a new family size feature := siblings + spouses + parents + children + self
val fsize = sibsp.transformWith(new BinaryNullableAndNumeric(_ + _), parch)
  .transformWith(new BinaryNumeric(_ + _), 1)

val fsize = sibsp + parch + 1 // <-- can we just do this?
// Sure thing!
implicit class RichNullableNumericFeature[I <: NullableNumeric : TypeTag](val f: FeatureLike[I]) {

  def +[I2 <: Numeric : TypeTag](that: FeatureLike[I2]): FeatureLike[NullableNumeric] = {
    val plus = (a: Numeric, b: Numeric) => a + b
    val stage = new BinaryNullableAndNumeric(plus)

    f.transformWith[I2, NullableNumeric](f = that, stage)
  }

}

// Note: we use another Macro here as well to infer the name of the feature to “fsize”
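The implicit-class technique can be tried out in isolation. Below is a minimal, self-contained sketch in which a simplified `Feature` type (a hypothetical stand-in for `FeatureLike`, carrying a name, parent names and an evaluation function) gains a `+` operator through an implicit class. Feature names are built by hand here rather than inferred by a macro.

```scala
object FeatureOpsSketch {
  // Hypothetical, simplified feature: a name, the names of its parents and a function
  // computing its value from a row of raw numeric inputs.
  final case class Feature[A](name: String, parents: Seq[String], eval: Map[String, Double] => A)

  def raw(name: String): Feature[Double] = Feature[Double](name, Nil, row => row(name))

  // Enrich numeric features with arithmetic, so that `a + b` builds a new feature.
  implicit class RichNumericFeature(f: Feature[Double]) {
    def +(that: Feature[Double]): Feature[Double] =
      Feature[Double](s"(${f.name} + ${that.name})", Seq(f.name, that.name),
        row => f.eval(row) + that.eval(row))

    def +(const: Double): Feature[Double] =
      Feature[Double](s"(${f.name} + $const)", Seq(f.name), row => f.eval(row) + const)
  }

  val sibsp: Feature[Double] = raw("sibsp")
  val parch: Feature[Double] = raw("parch")
  val fsize: Feature[Double] = sibsp + parch + 1 // two transformations, as on the slide
}
```

Nothing is computed when `fsize` is defined; the expression only records how to derive the value, which is what lets a framework assemble the transformations into pipeline stages later.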
Features and Transformers
val fsize: FeatureLike[Numeric] = sibsp + parch + 1

[Diagram: sibsp: Feature[Numeric] and parch: Feature[Numeric] -> Binary Transformer ( _ + _ ) -> _: Feature[Numeric] -> Binary Transformer ( _ + 1 ) -> fsize: Feature[Numeric]]

survived | pclass | sex | age | sibsp | parch | embarked | fsize
0 | 3 | male | 22 | 1 | 0 | S | 2
1 | 1 | female | 38 | 1 | 0 | C | 2
1 | 3 | female | 26 | 0 | 0 | S | 1
Transformation DAG
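[Diagram: sibsp and parch feed an intermediate feature _, which is combined with the constant 1 to produce fsize; 2 pipeline stages in total]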
Typed pipeline stages
import org.apache.spark.ml.util.MLWritable
import org.apache.spark.ml.PipelineStage

// OP pipeline stages represent a feature transformation,
// and also carry around the input and output features
trait OpPipelineStageBase extends OpPipelineStageParams with MLWritable { self: PipelineStage =>
  type InputFeatures
  type OutputFeatures

  def setInput(features: InputFeatures): this.type
  def getOutput(): OutputFeatures

  // This method allows us to modify the DataFrame schema accordingly
  final override def transformSchema(schema: StructType): StructType = { ... }
}
[Diagram: sibsp + parch is a Binary Transformer ( _ + _ ) that extends OpPipelineStage2; inputs: (Feature[Numeric], Feature[Numeric]) -> output: Feature[Numeric]]
Typed pipeline stages
// Stage providing a single feature[O]
trait OpPipelineStage[O <: FeatureType] extends OpPipelineStageBase {
  type InputFeatures
  final override type OutputFeatures = FeatureLike[O]
}
// Stage from feature[I] to a feature[O]
trait OpPipelineStage1[I <: FeatureType, O <: FeatureType] extends OpPipelineStage[O] {
  final override type InputFeatures = FeatureLike[I]
}
// Stage from a tuple of features to a feature[O]
trait OpPipelineStage2[I1 <: FeatureType, I2 <: FeatureType, O <: FeatureType]
  extends OpPipelineStage[O] {
  final override type InputFeatures = (FeatureLike[I1], FeatureLike[I2])
}
// ...
// And so on for various combinations: 1to2, ..., 1toN, 2to2, ..., Nto1, Nto2, ...
// See Scala Product types - https://goo.gl/J3V5DP
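How abstract type members pin down the input shape of a stage can be sketched without Spark. The `Feature`, `Stage`, `Stage1`, `Stage2` and `Plus` names below are simplified, hypothetical stand-ins for the traits on the slide, kept only detailed enough to compile on their own.

```scala
object TypedStageSketch {
  // Hypothetical, simplified feature: only a name and a phantom value type.
  final case class Feature[A](name: String)

  // A stage producing a Feature[O]; the shape of its inputs is an abstract type member.
  trait Stage[O] {
    type InputFeatures
    def setInput(features: InputFeatures): this.type
    def getOutput: Feature[O]
  }

  // Arity-specific subtraits fix the input shape.
  trait Stage1[I, O] extends Stage[O] { type InputFeatures = Feature[I] }
  trait Stage2[I1, I2, O] extends Stage[O] { type InputFeatures = (Feature[I1], Feature[I2]) }

  // A binary stage: accepts exactly a pair of numeric features.
  final class Plus extends Stage2[Double, Double, Double] {
    private var inputs: Option[InputFeatures] = None
    def setInput(features: InputFeatures): this.type = { inputs = Some(features); this }
    def getOutput: Feature[Double] = inputs match {
      case Some((a, b)) => Feature[Double](s"${a.name}+${b.name}")
      case None         => sys.error("input features are not set")
    }
  }
}
```

Passing a single feature, or a pair containing a `Feature[String]`, to `Plus.setInput` is rejected at compile time, which is the guarantee the typed stages are after.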
1-to-1 Transformer example
import org.apache.spark.ml.Transformer

// A simple 1 to 1 transformer
trait OpTransformer1[I <: FeatureType, O <: FeatureType]
  extends Transformer with OpPipelineStage1[I, O] {
  implicit def tti: TypeTag[I]
  implicit def tto: TypeTag[O]

  // User provided transform function that operates on input feature value I and produces O
  def transformFn: I => O

  // We wrap the transform function above into a UDF
  final override def transform(dataset: Dataset[_]): DataFrame = {
    val functionUDF = udf { (in: Any) => transformFn(FeatureTypes.as[I](in)).value }

    // Return a dataset with a new column
    dataset.withColumn(outputName, functionUDF(col(in1.name)))
  }
}
// ...
// And so on for various combinations: 1to2, 1to3, ..., 1toN, ..., Nto1, Nto2, ...
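The idea of wrapping a typed function into a column-level transformation can be mimicked on plain collections. In this sketch a `Map[String, Any]` stands in for a DataFrame row and `UnaryTransformer` is a hypothetical, much reduced analogue of `OpTransformer1`; neither is Spark or Optimus Prime API.

```scala
object UnaryTransformerSketch {
  type Row = Map[String, Any] // stand-in for a DataFrame row

  // A 1-to-1 transformer: reads column `in`, applies a typed function, writes column `out`.
  final class UnaryTransformer[I, O](in: String, out: String, transformFn: I => O) {
    def transform(rows: Seq[Row]): Seq[Row] =
      rows.map { row =>
        // The cast is the single untyped spot; user code only ever sees I => O.
        val input = row(in).asInstanceOf[I]
        row.updated(out, transformFn(input))
      }
  }
}
```

For example, `new UnaryTransformer[String, Int]("name", "nameLength", _.length)` adds a `nameLength` column while leaving the existing columns untouched.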
1-to-1 Estimator example
​ Last code slide. Really.
import org.apache.spark.ml.{Estimator, Model}

// A simple 1 to 1 estimator which is trained into a model (transformer)
class UnaryEstimator[I <: FeatureType : TypeTag, O <: FeatureType : TypeTag](
  val fitFn: Dataset[I#Value] => I => O
) extends Estimator[UnaryModel[I, O]] with OpPipelineStage1[I, O] {
  implicit val iEncoder: Encoder[I#Value] = ExpressionEncoder()

  final override def fit(dataset: Dataset[_]): UnaryModel[I, O] = {
    val df: DataFrame = dataset.select(in1.name)
    val ds: Dataset[I#Value] = df.map(r => FeatureTypes.as[I](r.get(0)).value) // needs encoder

    val transformFn: I => O = fitFn(ds) // fit function returns a transform function

    new UnaryModel[I, O](transformFn).setParent(this).setInput(in1)
  }
}
// Represents a trained model (transformer) from feature[I] to feature[O]
class UnaryModel[I <: FeatureType, O <: FeatureType](val transformFn: I => O)
  (implicit val tti: TypeTag[I], val tto: TypeTag[O])
  extends Model[UnaryModel[I, O]] with OpTransformer1[I, O]
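The key point, that fitting returns a transform function, can be shown on plain Scala collections. The classes below reuse the slide's names but are simplified, hypothetical versions with `Seq` in place of `Dataset`. The mean-filling example mirrors what a call like `fillMissingWithMean` might do, which is an assumption about its behavior.

```scala
object UnaryEstimatorSketch {
  // A fitted model is just a transform function applied value by value (the .map step).
  final class UnaryModel[I, O](val transformFn: I => O) {
    def transform(values: Seq[I]): Seq[O] = values.map(transformFn)
  }

  // An estimator aggregates the training values (the .reduce step) and returns a model.
  final class UnaryEstimator[I, O](fitFn: Seq[I] => I => O) {
    def fit(values: Seq[I]): UnaryModel[I, O] = new UnaryModel[I, O](fitFn(values))
  }

  // Example: fill missing values with the mean of the observed ones (0.0 if none observed).
  val fillWithMean: UnaryEstimator[Option[Double], Double] =
    new UnaryEstimator[Option[Double], Double](values => {
      val present = values.flatten
      val mean = if (present.isEmpty) 0.0 else present.sum / present.size
      (v: Option[Double]) => v.getOrElse(mean)
    })
}
```

The mean is computed once during `fit` and captured in the returned closure, so scoring new data never needs to see the training set again.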
Model training (.reduce)
[Diagram: Data Reader -> RDD[Passenger] -> Feature Extraction -> DataFrame (age, sibsp, parch, embarked, …); Transformation DAG (sibsp, parch -> _; _, 1 -> fsize) -> Topological sort -> Spark Pipeline .setStages(stages).fit(trainData) -> Titanic Model .transform(…)]
val model = new OpPipeline[Passenger]().setInput(modelSelector).setReader(trainReader).train()
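The topological sort that orders the transformation DAG into pipeline stages can be sketched as follows. The representation of the DAG as a map from feature name to parent names is an assumption made for this illustration. The sketch also assumes the graph is acyclic and that every parent is itself a key of the map.

```scala
object FeatureDagSketch {
  // Each feature name maps to the names of its parent features (empty for raw features).
  type Dag = Map[String, Seq[String]]

  // Depth-first topological sort: every feature appears after all of its parents.
  def topologicalOrder(dag: Dag): Seq[String] = {
    def visit(done: Vector[String], node: String): Vector[String] =
      if (done.contains(node)) done
      else dag(node).foldLeft(done)(visit) :+ node

    // Keys are visited in sorted order only to make the result deterministic.
    dag.keys.toSeq.sorted.foldLeft(Vector.empty[String])(visit)
  }
}
```

For the family-size DAG, with `fsize` depending on an intermediate `sibsp+parch` feature, the order comes out as sibsp, parch, sibsp+parch, fsize, so each stage can be fitted after the stages producing its inputs.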
Going forward with Optimus Prime
​ What is still missing?
• Wrapping existing Spark ML transformers/estimators
• Codegen for stage in-out combinations (1to1, …, 1toN, ... Nto1, Nto2, ...)
• Codegen for Macros (scary!)
• Automatic feature engineering
• Abstract away from Spark with Apache Beam
• More…
Key takeaways
• Real-life Machine Learning is hard
• Spark ML is great, but it needs type safety
• Simple and intuitive syntax saves you trouble down the road
• Scala has all the relevant facilities to provide the above – know how to use them
• Modularity and reusability are key
Further exploration
• Salesforce Einstein – http://einstein.com
    • “Democratizing AI to solve human bottlenecks” by Sarah Aerni, PhD – Scala Days, CPH 2017
• PredictionIO – http://predictionio.incubator.apache.org
• Optimus Prime – to be open-sourced (no ETA yet)
    • “Optimus Prime: declarative, collaborative, type-safe machine learning” by Shubha Nabar, PhD – https://goo.gl/hgWxJb
    • “The Lego Model for Machine Learning” by Leah McGuire, PhD – https://goo.gl/hmct4R
If You’re Curious …
einstein-recruiting@salesforce.com
Thank You

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdfintroduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
introduction-to-automotive Andoid os-csimmonds-ndctechtown-2021.pdf
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 

Doubt Truth to be a Liar: Non Triviality of Type Safety for Machine Learning (at Salesforce)

  • 1. Matthew Tovbin Principal Engineer, Salesforce Einstein mtovbin@salesforce.com @tovbinm Doubt Truth to be a Liar Non Triviality of Type Safety for Machine Learning
  • 2. “Doubt thou the stars are fire, Doubt that the sun doth move, Doubt truth to be a liar, But never doubt I love.” - William Shakespeare, Hamlet
  • 3. A glimpse into the future ​ What I am going to talk about: • Machine Learning (ML) 101 • Real-life ML • Building ML application with Spark ML • Typed Feature Engineering with Optimus Prime • Behind the scenes • Going forward
  • 4. Machine Learning 101 What is Machine Learning? • “The capacity of a computer to learn from experience, i.e. to modify its processing on the basis of newly acquired information” – 1950s, IBM Journal What is “experience” in computer terms? • It’s just data. What are the tasks Machine Learning solves? • Recognition, diagnosis, prediction, forecasting, planning, data mining, etc.
  • 5. Machine Learning 101 [1/2] - Terms Feature • An individual measurable property of a phenomenon being observed. • Choosing informative, discriminating and independent features is a crucial step for building effective ML algorithms. Feature Vector • An n-dimensional vector that represents a set of features corresponding to a single observation, e.g. email open/click, product purchase, etc. Model • A structure and corresponding interpretation that summarizes or partially summarizes a set of data, for description or prediction.
  • 6. Machine Learning 101 [2/2] - Terms Training • The process of training an ML model involves providing an ML algorithm with training data to learn from. • During training, data is evaluated by the ML algorithm, which analyzes the distribution and type of the data, looking for rules and patterns that are later used for prediction. Scoring • The process of applying a trained model to new data to generate predictions and other values. • Examples: a list of recommended items, forecast for time series models, estimates of projected demand/volume/etc., probability scores.
  • 7. Machine Learning 101 ​ Building a ML model Feature Engineering Model Training Model A Model B Model C Model Evaluation
  • 8. Real-life ML ​ Building a ML model pipeline ETL Model Evaluation Feature Engineering Scoring Model Training Model A Model B Model C Deployment
  • 9. Real-life ML ​ Just a few problems to mention • ETL is tough • Feature Engineering is even tougher • Our prototype in R/Python/Octave works great, but … • Copy/pasting code across projects doesn’t scale • Model Training fails exactly two hours after you go to sleep •  Data is not there •  Data is there, but in a wrong format •  Data is there, but is insufficient •  OOM, Insufficient Space, Serialization… • So we have the models/scores, but can we trust them?!
  • 10. Real-life ML ​ Specifically for Salesforce Multi-tenancy • Multiple customers: Square, Fanatics, etc. • Multiple customer environments: Live, Staging, QA, Dev. • Multiple data sources: SFDC, Marketing Cloud, Service Cloud, IoT Cloud, etc. • Multiple data entities: Leads, Opportunities, Email Campaigns, etc. • Multiple applications: Lead Scoring, Predictive Journeys, or custom. Security, Scale, Automation, Transparency, Cost Efficiency and so on.
  • 11. Salesforce Einstein ​ AI for everyone • ML Platform for customers, engineers, data scientists • No need for ETL or PhD ​ PredictionIO • Most starred Scala project on GitHub • Now part of Apache’s incubator program Optimus Prime • In-house transformation framework • Declarative, collaborative, reusable, typed Services, Microservices, Nanoservices…
  • 13. Predict survival on the Titanic Sources: https://www.kaggle.com/c/titanic/data, https://github.com/BenFradet/spark-kaggle
        PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
        1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
        2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
        3,1,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
        4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
        5,0,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S
        6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
        7,0,1,"McCarthy, Mr. Timothy J",male,54,0,0,17463,51.8625,E46,S
        8,0,3,"Palsson, Master. Gosta Leonard",male,2,3,1,349909,21.075,,S
        ...
        890,1,1,"Behr, Mr. Karl Howell",male,26,0,0,111369,30,C148,C
        891,0,3,"Dooley, Mr. Patrick",male,32,0,0,370376,7.75,,Q
  • 14. Building ML application with Spark ML [1/5] - Spinning up Spark
        import org.apache.spark._
        import org.apache.spark.sql._
        import org.apache.spark.sql.types._
        import org.apache.spark.sql.functions._
        import org.apache.spark.ml._
        import org.apache.spark.ml.classification._
        import org.apache.spark.ml.evaluation._
        import org.apache.spark.ml.feature._
        import org.apache.spark.ml.tuning._

        // Spinning up Spark
        val conf = new SparkConf().setMaster("local[2]")
        val session = SparkSession.builder.config(conf).getOrCreate
        val (sc, sqlc) = (session.sparkContext, session.sqlContext)

        import sqlc.implicits._
  • 15. Building ML application with Spark ML [2/5] - Reading the data
        def readData(file: String): DataFrame = {
          val schema = StructType(Array(
            StructField("PassengerId", IntegerType, nullable = true), // <-- field names are easy to misspell
            StructField("Survived", DoubleType, nullable = true), // <-- we can set any type here
            StructField("Pclass", DoubleType, nullable = true),
            StructField("Name", StringType, nullable = true),
            StructField("Sex", StringType, nullable = true),
            StructField("Age", DoubleType, nullable = true),
            StructField("SibSp", DoubleType, nullable = true),
            StructField("Parch", DoubleType, nullable = true),
            StructField("Ticket", StringType, nullable = true),
            StructField("Fare", DoubleType, nullable = true),
            StructField("Cabin", StringType, nullable = true),
            StructField("Embarked", StringType, nullable = true)
          ))
          val df: DataFrame = sqlc.read.format("csv").option("header", "true").schema(schema).load(file)
          // Select and rename necessary fields
          val select = Array($"Survived".as("survived"), $"Sex".as("sex"), $"Age".as("age"),
            $"Pclass".as("pclass"), $"SibSp".as("sibsp"), $"Parch".as("parch"), $"Embarked".as("embarked"))
          df.select(select: _*)
        }
        val rawData: DataFrame = readData(file = "titanic.csv") // <-- runtime exceptions...
  • 16. Building ML application with Spark ML [3/5] - Feature Engineering
        def addFeatures(df: DataFrame): DataFrame = {
          // Create a new family size field := siblings + spouses + parents + children + self
          val familySizeUDF = udf { (sibsp: Double, parch: Double) => sibsp + parch + 1 }

          df.withColumn("fsize", familySizeUDF(col("sibsp"), col("parch"))) // <-- full freedom to overwrite
        }

        def fillMissing(df: DataFrame): DataFrame = {
          // Fill missing age values with average age (extract the Double from the aggregated Row)
          val avgAge = df.select("age").agg(avg("age")).first().getDouble(0)

          // Fill missing embarked values with default "S" (i.e. Southampton)
          val embarkedUDF = udf { (e: String) => e match { case x if x == null || x.isEmpty => "S"; case x => x } }

          df.na.fill(Map("age" -> avgAge)).withColumn("embarked", embarkedUDF(col("embarked")))
        }
        // Modify the dataframe
        val allData = fillMissing(addFeatures(rawData)).cache() // <-- need to remember about caching
        // Split the data and cache it
        val Array(trainSet, testSet) = allData.randomSplit(Array(0.75, 0.25)).map(_.cache())
  • 17. Building ML application with Spark ML [4/5] - Building the pipeline
        // Prepare categorical columns
        val categoricalFeatures = Array("pclass", "sex", "embarked")
        val stringIndexers = categoricalFeatures.map(colName =>
          new StringIndexer().setInputCol(colName).setOutputCol(colName + "_index").fit(allData)
        )
        // Concat all the features into a numeric feature vector
        val allFeatures = Array("age", "sibsp", "parch", "fsize") ++ stringIndexers.map(_.getOutputCol)

        val vectorAssembler = new VectorAssembler().setInputCols(allFeatures).setOutputCol("feature_vector")
        // Prepare Logistic Regression estimator
        val logReg = new LogisticRegression().setFeaturesCol("feature_vector").setLabelCol("survived")
        // Finally build the pipeline with the stages above
        val pipeline = new Pipeline().setStages(stringIndexers ++ Array(vectorAssembler, logReg))
  • 18. Building ML application with Spark ML [5/5] - Model training
        // Cross validate our pipeline with various parameters
        val paramGrid =
          new ParamGridBuilder()
            .addGrid(logReg.regParam, Array(1, 0.1, 0.01))
            .addGrid(logReg.maxIter, Array(10, 50, 100))
            .build()

        val crossValidator =
          new CrossValidator()
            .setEstimator(pipeline) // <-- set our pipeline here
            .setEstimatorParamMaps(paramGrid)
            .setEvaluator(new BinaryClassificationEvaluator().setLabelCol("survived"))
            .setNumFolds(3)

        // Train the model & compute scores
        val model: CrossValidatorModel = crossValidator.fit(trainSet)
        val scores: DataFrame = model.transform(testSet)

        // Save the model for later use
        model.save("/models/titanic-model.ml")
  • 19. Building ML application with Spark ML ​ The good parts • Simple abstraction: Transformers (.map), Estimators (.reduce) and Pipelines • Serialization allows reusability of models • Good implementations for various estimators: Word2Vec, LogReg, etc. • All-In-One: data exploration, prototyping, productionization • Multi language support: Java/Scala/Python • Healthy ecosystem
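The "Transformers (.map), Estimators (.reduce)" analogy can be illustrated without Spark. The following is a minimal sketch in plain Scala, not Spark ML's API; the names `EstimatorSketch` and `fitMeanFill` are hypothetical. An estimator reduces the training data to a statistic and returns a fitted transformer, which is then mapped over the values.

```scala
object EstimatorSketch {
  // A transformer is just a function applied to each value (.map)
  type Transformer[A, B] = A => B

  // An estimator reduces the training data to a statistic (.reduce)
  // and returns a fitted transformer. Hypothetical example: mean imputation.
  def fitMeanFill(training: Seq[Option[Double]]): Transformer[Option[Double], Double] = {
    val present = training.flatten
    val mean = if (present.isEmpty) 0.0 else present.sum / present.size
    (value: Option[Double]) => value.getOrElse(mean)
  }

  val ages: Seq[Option[Double]] = Seq(Some(22.0), None, Some(38.0))
  val fillAge: Transformer[Option[Double], Double] = fitMeanFill(ages) // "training"
  val filled: Seq[Double] = ages.map(fillAge)                          // "scoring"
}
```

Here the missing age is replaced by the mean of the observed ages, which is the same idea as the `fillMissing` step in the Spark ML example.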
  • 20. Building ML application with Spark ML ​ The not so good parts • No type checking (especially painful for Transformers, Estimators and Pipelines) • Transformer and Estimator interfaces are too open: Dataset => DataFrame • DataFrames are everywhere •  No type checking •  Easy to misspell column names •  No integration with ML Vector •  Missing a lot of RDDs functionality • Lack of support for common data I/O operations • Schema and algorithms definitions are interleaved with data manipulations Can we do better?!
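The "easy to misspell column names" problem can be shown in plain Scala. This is a minimal sketch that uses a `Map` as a stand-in for an untyped row (an assumption for illustration, not how a DataFrame is implemented); `ColumnNameSketch` and the simplified `Passenger` case class are hypothetical.

```scala
object ColumnNameSketch {
  // Untyped row: column names are strings, values are Any
  val row: Map[String, Any] = Map("sibsp" -> 1.0, "parch" -> 0.0)

  // A misspelled column name compiles fine; the mistake only shows up at runtime
  val untyped: Option[Any] = row.get("sibps")

  // Typed row: field names and types are checked by the compiler
  final case class Passenger(sibsp: Double, parch: Double)
  val p: Passenger = Passenger(sibsp = 1.0, parch = 0.0)
  val familySize: Double = p.sibsp + p.parch + 1
  // p.sibps would not compile: sibps is not a member of Passenger
}
```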
  • 22. Optimus Prime What is Optimus Prime? • A transformation framework to develop reusable, modular and typed ML pipelines Why are we building it? • Declarative and intuitive syntax • Typed operations with Spark ML • Reusability of I/O operations, features, transformations, pipelines • Separation of features and transformations from data operations • Multitenant applications
  • 24. Building ML application with Optimus Prime [1/3] – Defining a reader & raw features
        import com.salesforce.op._
        import com.salesforce.op.test.avro.Passenger

        // Define the reader from CSV to Avro
        val trainReader: DataReader[Passenger] = DataReaders.Simple.csv[Passenger](
          path = Some("titanic.csv"),
          schema = Passenger.getClassSchema.toString
        )
        // Define the response feature (feature names are inferred from val names)
        val survived = FeatureBuilder.Binary[Passenger].extract(Option(_.getSurvived).map(_ != 0)).asResponse
        // Define the predictor features
        val age = FeatureBuilder.NullableNumeric[Passenger].extract(Option(_.getAge)).asPredictor
        val sex = FeatureBuilder.Categorical[Passenger].extract(Set(_.getSex)).asPredictor
        val pclass = FeatureBuilder.Numeric[Passenger].extract(_.getPClass).asPredictor
        val sibsp = FeatureBuilder.Numeric[Passenger].extract(_.getSibSp).asPredictor
        val parch = FeatureBuilder.Numeric[Passenger].extract(_.getParCh).asPredictor
        val embarked = FeatureBuilder.Text[Passenger].extract(Option(_.getEmbarked)).asPredictor
  • 25. Building ML application with Optimus Prime [2/3] – Feature engineering
        // Create a new family size feature – no annoying UDFs here!
        val fsize: FeatureLike[Numeric] = sibsp + parch + 1
        // Fill missing age values with average age (i.e. from NullableNumeric we get Numeric)
        val ageFilled: FeatureLike[Numeric] = age.fillMissingWithMean
        // Fill missing embarked values with default "S" (i.e. Southampton)
        val embarkedFilled: FeatureLike[Text] = embarked.fillMissingWith("S")
        // Create a feature vector field using default vectorizers
        val featureVector: FeatureLike[Vector] =
          Seq(sex, ageFilled, pclass, sibsp, parch, fsize, embarkedFilled).vectorize()
  • 26. Building ML application with Optimus Prime [3/3] – Model training
        val modelSelector = { // Create a model selector with two algorithms
          new ModelSelector[Passenger]().setInput(survived, featureVector)
            .setParams(
              LogisticRegression.RegParam -> Array(1, 0.1, 0.01),
              LogisticRegression.MaxIter -> Array(10, 50, 100),
              RandomForest.NumTrees -> Array(3, 5, 10)
            ).setModels(Algs.LogisticRegression, Algs.RandomForest) // <-- multiple algorithms here
            .setEvaluator(Evals.BinaryClassification)
        }
        // Build the pipeline with the model selector
        val pipeline = new OpPipeline[Passenger]().setInput(modelSelector)

        // And only now we are spinning up Spark
        val conf = new SparkConf().setMaster("local[2]")
        implicit val session = SparkSession.builder.config(conf).getOrCreate

        // Train the model & compute scores
        val model: OpPipelineModel[Passenger] = pipeline.setReader(trainReader).train()
        val scores: DataFrame = model.setReader(testReader).score()

        model.save("/models/titanic-model.op") // Save the model for later use
  • 27. Building ML application with Optimus Prime ​ Summary • Everything is typed: readers, features, pipeline, model • Declarative and intuitive syntax with code completion • Feature names are inferred from val names –> no misspelled names • Features are always unique in a code block, compilation error otherwise • Common data I/O operations provided with DataReaders (joins, aggrs etc.) • DataFrames are abstracted away – no direct interaction • Features and transformations are separate from data operations
  • 28. Behind the scenes of Optimus Prime
  • 29. Types and interactions [Diagram: Features of type Feature[T <: Feature Type] (Numeric, Text, Categorical, …) are transformed with Transformers (.map: Unary, Binary, …) and Estimators (.reduce: Average, Word2Vec, …), which produce new features; estimators are fitted into a model (My Model). Data Readers (CSV, Avro, …) are joined / aggregated and read to materialize features. Pipelines (Titanic, Lead Scoring, …) are trained.]
  • 30. Feature Types V1
        case object FeatureTypes {
          type Numeric = Double
          type NullableNumeric = Option[Double]
          type Categorical = scala.collection.mutable.WrappedArray[String]
          type Text = Option[String]
          type Binary = Option[Boolean]
          type DateList = scala.collection.mutable.WrappedArray[Long]
          type KeyString = scala.collection.Map[String, String]
          type KeyNumeric = scala.collection.Map[String, Double]
          type KeyBinary = scala.collection.Map[String, Boolean]
          type Vector = org.apache.spark.ml.linalg.Vector

          // TBD: Specific/rich types: Email, Phone, URL, etc.
        }
  • 32. Typed Features
        // Typed value container
        trait FeatureType extends Serializable {
          type Value // feature value type
          def value: Value // actual value
          def isEmpty: Boolean // true if value is empty
          def isNullable: Boolean // true if value is nullable
          // ...
        }
        // For example, a text feature value type
        class Text(val value: Option[String]) extends FeatureType {
          type Value = Option[String]
          def this(value: String) = this(Option(value))
          final def isEmpty: Boolean = value.isEmpty
        }
  • 33. Typed Features
        // Represents a single feature (dimension)
        trait FeatureLike[O <: FeatureType] extends Serializable {
          implicit def wtt: WeakTypeTag[O] // Overcoming type erasure
          def name: String // name of the feature
          def defaultValue: O // feature default value
          def originStage: OpPipelineStage[O] // the stage which generated this feature
          def parents: Seq[FeatureLike[_ <: FeatureType]] // the input features for the origin stage

          // feature transformation function (i.e. map). We have more like this...
          final def transformWith[U <: FeatureType](stage: OpPipelineStage1[O, U]): FeatureLike[U]
          // ...
        }
  • 34. Feature names from vals using Macros magic
        val sibsp = FeatureBuilder.Numeric[Passenger](name = "sibsp")

        val sibsp = FeatureBuilder.Numeric[Passenger] // <-- can we just do this?!

        object FeatureBuilder {
          def Numeric[I]: FeatureBuilder[I, Numeric] = macro FeatureBuilderMacros.apply[I, Numeric]
        }

        // HAHA! So we meet again!
        private[op] object FeatureBuilderMacros {
          def apply[I: c.WeakTypeTag, O: c.WeakTypeTag](c: Context): c.Expr[FeatureBuilder[I, O]] = {
            import c.universe._
            val enclosingValName = MacrosHelper.definingValName(c)
            val featureName = c.Expr[String](Literal(Constant(enclosingValName)))
            val fbApply = Select(reify(FeatureBuilder).tree, TermName("apply"))
            val fbExpr = c.Expr[FeatureBuilder[I, O]](Apply(fbApply, featureName.tree :: Nil))

            reify(fbExpr.splice)
          }
        }
        // Read more in sbt codebase - https://goo.gl/OdPvry
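To make the benefit of a typed value container concrete, here is a minimal, self-contained sketch that follows the shape of the slide's `FeatureType` trait. It is a simplification under stated assumptions, not the Optimus Prime implementation: `isNullable` is omitted, and `Real` and `orDefault` are hypothetical names added for illustration.

```scala
object FeatureTypeSketch {
  // Typed value container, simplified from the slide's trait
  trait FeatureType extends Serializable {
    type Value           // feature value type
    def value: Value     // actual value
    def isEmpty: Boolean // true if value is empty
  }

  final class Text(val value: Option[String]) extends FeatureType {
    type Value = Option[String]
    def this(v: String) = this(Option(v)) // a null String becomes None
    def isEmpty: Boolean = value.isEmpty
  }

  // Hypothetical non-nullable numeric type, for contrast
  final class Real(val value: Double) extends FeatureType {
    type Value = Double
    def isEmpty: Boolean = false
  }

  // Accepts only text features; passing a Real is a compile-time error
  def orDefault(t: Text, default: String): String = t.value.getOrElse(default)

  val embarked: Text = new Text(null: String)
  val port: String = orDefault(embarked, "S")
}
```

Because the wrapper turns `null` into `None`, a fill-with-default operation never has to check for `null` itself, and the compiler rejects attempts to apply a text operation to a numeric feature.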
  • 35. Feature transformations with Implicit Classes
        // Create a new family size feature := siblings + spouses + parents + children + self
        val fsize = sibsp.transformWith(new BinaryNullableAndNumeric(_ + _), parch)
          .transformWith(new BinaryNumeric(_ + _), 1)

        val fsize = sibsp + parch + 1 // <-- can we just do this?

        // Sure thing!
        implicit class RichNullableNumericFeature[I <: NullableNumeric : TypeTag](val f: FeatureLike[I]) {
          def +[I2 <: Numeric : TypeTag](that: FeatureLike[I2]): FeatureLike[NullableNumeric] = {
            val plus = (a: Numeric, b: Numeric) => a + b
            val stage = new BinaryNullableAndNumeric(plus)

            f.transformWith[I2, NullableNumeric](f = that, stage)
          }
        }
        // Note: we use another Macro here as well to infer the name of the feature to “fsize”
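The implicit-class technique can be tried in isolation. The sketch below is a simplified illustration in plain Scala, assuming a hypothetical `Feature` case class that wraps an extraction function over a `Map` row; it does not reproduce Optimus Prime's `FeatureLike` or its stages.

```scala
object FeatureOpsSketch {
  // Hypothetical, simplified feature: a name plus a function from a row to a value
  final case class Feature[A](name: String, extract: Map[String, Double] => A)

  // Enrich numeric features with `+` via an implicit class
  implicit class RichNumericFeature(val f: Feature[Double]) {
    // Combine two features into a new one
    def +(that: Feature[Double]): Feature[Double] =
      Feature(s"${f.name}+${that.name}", row => f.extract(row) + that.extract(row))

    // Add a constant to a feature
    def +(k: Double): Feature[Double] =
      Feature(s"${f.name}+$k", row => f.extract(row) + k)
  }

  val sibsp: Feature[Double] = Feature[Double]("sibsp", row => row("sibsp"))
  val parch: Feature[Double] = Feature[Double]("parch", row => row("parch"))

  // Reads like arithmetic, but builds a new feature definition
  val fsize: Feature[Double] = sibsp + parch + 1.0

  val result: Double = fsize.extract(Map("sibsp" -> 1.0, "parch" -> 0.0))
}
```

Nothing is computed when `fsize` is defined; the arithmetic runs only when the feature is applied to a row, which mirrors the declarative style the slides describe.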
  • 36. Features and Transformers [Diagram: val fsize: FeatureLike[Numeric] = sibsp + parch + 1 expands into a transformation DAG with 2 pipeline stages. A Binary Transformer ( _ + _ ) combines sibsp: Feature[Numeric] and parch: Feature[Numeric] into an intermediate _: Feature[Numeric]; a second Binary Transformer ( _ + 1 ) produces fsize: Feature[Numeric]. Sample rows (survived, pclass, sex, age, sibsp, parch, embarked, fsize): (0, 3, male, 22, 1, 0, S, 2), (1, 1, female, 38, 1, 0, C, 2), (1, 3, female, 26, 0, 0, S, 1).]
  • 37. Typed pipeline stages
        import org.apache.spark.ml.util.MLWritable
        import org.apache.spark.ml.PipelineStage

        // OP pipeline stages represent a feature transformation,
        // and also carry around the input and output features
        trait OpPipelineStageBase extends OpPipelineStageParams with MLWritable { self: PipelineStage =>
          type InputFeatures
          type OutputFeatures

          def setInput(features: InputFeatures): this.type
          def getOutput(): OutputFeatures

          // This method allows us to modify the DataFrame schema accordingly
          final override def transformSchema(schema: StructType): StructType = { ... }
        }
        [Diagram: sibsp + parch is a Binary Transformer ( _ + _ ) that extends OpPipelineStage2; inputs: (Feature[Numeric], Feature[Numeric]) -> output: Feature[Numeric]]
  • 38. Typed pipeline stages
        // Stage providing a single feature[O]
        trait OpPipelineStage[O <: FeatureType] extends OpPipelineStageBase {
          type InputFeatures
          final override type OutputFeatures = FeatureLike[O]
        }
        // Stage from feature[I] to a feature[O]
        trait OpPipelineStage1[I <: FeatureType, O <: FeatureType] extends OpPipelineStage[O] {
          final override type InputFeatures = FeatureLike[I]
        }
        // Stage from a tuple of features to a feature[O]
        trait OpPipelineStage2[I1 <: FeatureType, I2 <: FeatureType, O <: FeatureType]
          extends OpPipelineStage[O] {
          final override type InputFeatures = (FeatureLike[I1], FeatureLike[I2])
        }
        // ...
        // And so on for various combinations: 1to2, ..., 1toN, 2to2, ..., Nto1, Nto2, ...
        // See Scala Product types - https://goo.gl/J3V5DP
  • 39. 1-to-1 Transformer example
        import org.apache.spark.ml.Transformer

        // A simple 1 to 1 transformer
        trait OpTransformer1[I <: FeatureType, O <: FeatureType]
          extends Transformer with OpPipelineStage1[I, O] {
          implicit def tti: TypeTag[I]
          implicit def tto: TypeTag[O]

          // User provided transform function that operates on input feature value I and produces O
          def transformFn: I => O

          // We wrap the transform function above into a UDF
          final override def transform(dataset: Dataset[_]): DataFrame = {
            val functionUDF = udf { (in: Any) => transformFn(FeatureTypes.as[I](in)).value }

            // Return a dataset with a new column
            dataset.withColumn(outputName, functionUDF(col(in1.name)))
          }
        }
        // ...
        // And so on for various combinations: 1to2, 1to3, ..., 1toN, ..., Nto1, Nto2, ...
  • 40. 1-to-1 Estimator example (Last code slide. Really.)
        import org.apache.spark.ml.{Estimator, Model}

        // A simple 1 to 1 estimator which is trained into a model (transformer)
        class UnaryEstimator[I <: FeatureType : TypeTag, O <: FeatureType : TypeTag](
          val fitFn: Dataset[I#Value] => I => O
        ) extends Estimator[UnaryModel[I, O]] with OpPipelineStage1[I, O] {
          implicit val iEncoder: Encoder[I#Value] = ExpressionEncoder()

          final override def fit(dataset: Dataset[_]): UnaryModel[I, O] = {
            val df: DataFrame = dataset.select(in1.name)
            val ds: Dataset[I#Value] = df.map(r => FeatureTypes.as[I](r.get(0)).value) // needs encoder

            val transformFn: I => O = fitFn(ds) // fit function returns a transform function

            new UnaryModel[I, O](transformFn).setParent(this).setInput(in1)
          }
        }
        // Represents a trained model (transformer) from feature[I] to feature[O]
        class UnaryModel[I <: FeatureType, O <: FeatureType](val transformFn: I => O)
          (implicit val tti: TypeTag[I], val tto: TypeTag[O])
          extends Model[UnaryModel[I, O]] with OpTransformer1[I, O]
  • 41. Model training (.reduce) [Diagram: the Data Reader yields an RDD[Passenger]; Feature Extraction turns it into a DataFrame (age, sibsp, parch, embarked, …); the Transformation DAG (sibsp, parch, _, 1, fsize) is topologically sorted into stages; the Spark Pipeline runs .setStages(stages).fit(trainData) and produces the Titanic Model, which exposes .transform(…).] val model = new OpPipeline[Passenger]().setInput(modelSelector).setReader(trainReader).train()
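The topological sort that turns the transformation DAG into an ordered list of stages can be sketched as follows. This is a minimal depth-first version in plain Scala, assuming a hypothetical `Node` type that records each feature's parents; the slides do not show how Optimus Prime implements this step.

```scala
object DagSketch {
  // Hypothetical feature node: a name plus the parent features it was derived from
  final case class Node(name: String, parents: List[Node] = Nil)

  // Depth-first topological sort: every parent appears before its children
  def topoSort(result: Node): List[String] = {
    // acc holds already visited names, most recently visited first
    def visit(acc: List[String], n: Node): List[String] =
      if (acc.contains(n.name)) acc
      else n.name :: n.parents.foldLeft(acc)(visit)
    visit(Nil, result).reverse
  }

  val sibsp: Node = Node("sibsp")
  val parch: Node = Node("parch")
  val sum: Node   = Node("sibsp+parch", List(sibsp, parch))
  val fsize: Node = Node("fsize", List(sum))

  val order: List[String] = topoSort(fsize)
}
```

Starting from the final feature and walking back through its parents yields an order in which each stage's inputs already exist by the time the stage runs.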
  • 42. Going forward with Optimus Prime ​ What is still missing? • Wrapping existing Spark ML transformers/estimators • Codegen for stage in-out combinations (1to1, …, 1toN, ... Nto1, Nto2, ...) • Codegen for Macros (scary!) • Automatic feature engineering • Abstract away from Spark with Apache Beam • More…
  • 43. Key takeaways • Real-life Machine Learning is hard • Spark ML is great, but it needs type safety • Simple and intuitive syntax saves you trouble down the road • Scala has all the relevant facilities to provide the above – know how to use them • Modularity and reusability are key
  • 44. Further exploration • Salesforce Einstein – http://einstein.com •  “Democratizing AI to solve human bottlenecks” by Sarah Aerni, PhD – Scala Days, CPH 2017 • PredictionIO – http://predictionio.incubator.apache.org • Optimus Prime – to be open-sourced (no ETA yet) •  “Optimus Prime: declarative, collaborative, type-safe machine learning” by Shubha Nabar, PhD https://goo.gl/hgWxJb •  “The Lego Model for Machine Learning” by Leah McGuire, PhD – https://goo.gl/hmct4R
  • 45. If You’re Curious … einstein-recruiting@salesforce.com

Editor's Notes

  1. These lines are spoken by Polonius as he reads Hamlet’s letter to Ophelia aloud to Gertrude. They mean: doubt that the stars are fire, doubt that the sun moves across the sky, doubt even that truth is truthful, but never doubt that I love you.