Kaminski, Schlegel | Oct. 25, 2017
BUILDING CUSTOM ML PIPELINESTAGES
FOR FEATURE SELECTION.
SPARK SUMMIT EUROPE 2017.
WHAT YOU WILL LEARN DURING THIS SESSION.
• What data-driven car diagnostics look like at BMW.
• Get a good understanding of the most important elements in Spark ML PipelineStages (using a feature selection example).
• Attention: There will be Scala code examples!
• How to use spark-FeatureSelection in your Spark ML Pipeline.
• The impact of feature selection on learning performance and the understanding of the big data black box.
Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 2
MOTIVATION.
• The #3 contributor to warranty incidents for OEMs is “no trouble found” cases. [1]
• Potential root causes:
  • Manually formalized expert knowledge cannot cope with the vast number of possibilities.
  • Cars are getting more and more complex (hybridization, connectivity).
  • Less experienced workshop staff in evolving markets.
• Improve three workflows at once by shifting from a manual to a data-driven approach:
  • Automatic knowledge generation.
  • Automatic workshop diagnostics.
  • Predictive maintenance.
[1] BearingPoint, Global Automotive Warranty Survey Report 2009
THE DATASET AND ITS CHALLENGES.
MV_S MV_0 … MV_4000 MV_BGEE_TP SC_IP SC_1 SC_2 DTC_PU DTC_1 DTC_2 CP Label
44 3 … 20 -0.06 false 2 77 27 false true v.10 false
72 36 … 73 -0.01 false 16 29 false v.10 false
100 4 … 16 -0.02 true 45 1 false false v.10 false
44 14 … 54 -0.02 true 76 false v.10 true
95 34 … 73 -0.07 false 80 22 false false v.10 false
16 50 … 33 -0.02 true 61 93 false false false v.11 false
4 … 27 -0.09 false 59 91 false v.10 false
48 … 20 -0.07 false 32 31 false v.10 false
88 60 … 72 -0.01 true 1.9 96 53 true false true v.10 false
27 14 … 88 false 73 14 false v.10 false
High dimensional feature space (7000+ features)
High sparsity
High class imbalance
SPARK PIPELINE.
Relational DWH
ETL: Imputation, Loading
MV_S SC_IP DTC_PU CP Label
44 2 false v.10 false
72 1.5 true v.11 false
23 1.4 false v.11 false
44 1.5 true v.10 true
Handling imbalance (true vs. false): SMOTE [2], Undersampling, …
Preprocessing: StringIndexer, OneHotEncoder, VectorAssembler, Discretization, Std.Scaler
Features Label
[0.34,0.8,0,1] 0.0
[0.7,0.4,1,0] 0.0
[0.31,0.35,1,0] 1.0
[0.3,0.4,1.1] 1.0
Crossvalidation loop:
Feature selection [3]: InformationGain, Correlation, ChiSquared, Ran. Forest, Gini, L1 LogReg
Classifier: Logistic Regression / Random Forest
Model
[2] Chawla et al.: SMOTE: Synthetic Minority Over-sampling Technique
[3] Schlegel et al.: Design and optimization of an autonomous feature selection pipeline for high dimensional, heterogeneous feature spaces.
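The preprocessing and classifier part of such a pipeline can be sketched with stock Spark ML stages. This is only an illustration under assumptions: the object name PipelineSketch, the intermediate column names CP_idx and rawFeatures, and the encoding of the label as 0.0 / 1.0 are not from the talk, and imbalance handling, one-hot encoding, discretization, feature selection and cross-validation are left out for brevity.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object PipelineSketch {
  // Toy rows taken from the slide; the label is encoded as 0.0 / 1.0 (assumption).
  def sampleData(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      (44, 2.0, false, "v.10", 0.0),
      (72, 1.5, true, "v.11", 0.0),
      (23, 1.4, false, "v.11", 0.0),
      (44, 1.5, true, "v.10", 1.0)
    ).toDF("MV_S", "SC_IP", "DTC_PU", "CP", "Label")
  }

  // Preprocessing stages followed by a classifier, chained in one Pipeline.
  def build(): Pipeline = {
    // Map the categorical string column to numeric indices.
    val indexer = new StringIndexer().setInputCol("CP").setOutputCol("CP_idx")
    // Collect numeric and boolean columns into one vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("MV_S", "SC_IP", "DTC_PU", "CP_idx"))
      .setOutputCol("rawFeatures")
    // Scale every feature to unit standard deviation.
    val scaler = new StandardScaler().setInputCol("rawFeatures").setOutputCol("features")
    val classifier = new LogisticRegression().setFeaturesCol("features").setLabelCol("Label")
    new Pipeline().setStages(Array(indexer, assembler, scaler, classifier))
  }
}
```

Calling PipelineSketch.build().fit(df) returns a PipelineModel whose transform(df) appends the intermediate columns and a prediction column.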
SPARK PIPELINE API.
PipelineStage: Interface for usage in Pipeline (data → ?)
Transformer: ‘Transforms data’ (data → data)
Estimator: ‘Learns from data’ (data → Transformer)
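The Estimator / Transformer contract can be seen with a stock stage. The sketch below uses Spark's StringIndexer; the object name StageContract and the helper indexColumn are made up for this illustration.

```scala
import org.apache.spark.ml.{Estimator, Transformer}
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.DataFrame

object StageContract {
  def indexColumn(df: DataFrame, in: String, out: String): DataFrame = {
    // Estimator: holds configuration only, it has not seen any data yet.
    val estimator: Estimator[StringIndexerModel] =
      new StringIndexer().setInputCol(in).setOutputCol(out)
    // fit() learns from the data and returns a Model, which is a Transformer.
    val transformer: Transformer = estimator.fit(df)
    // transform() applies what was learnt and returns a new DataFrame.
    transformer.transform(df)
  }
}
```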
ORG.APACHE.SPARK.ML.*
PipelineStage: Interface for usage in Pipeline
Estimator: Learns from data
Transformer: Transforms data
Model: Fitted model
Pipeline: Concat PipelineStages
PipelineModel: Model from Pipeline
Predictor: Interface for Predictors
PredictionModel: Model from Predictor
FeatureSelector: Interface for FS
FeatureSelectorModel: Model from FeatureSelector
MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE.
abstract class FeatureSelector[
    Learner <: FeatureSelector[Learner, M],
    M <: FeatureSelectorModel[M]]
  extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable {
  // Setters for params in FeatureSelectorParams
  def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner]
  // PipelineStage and Estimator methods
  override def transformSchema(schema: StructType): StructType = {}
  override def fit(dataset: Dataset[_]): M = {}
  override def copy(extra: ParamMap): Learner
  // Abstract methods that are called from fit()
  protected def train(dataset: Dataset[_]): Array[(Int, Double)]
  protected def make(uid: String, selectedFeatures: Array[Int],
                     featureImportances: Map[String, Double]): M
}
Notes on the code:
• Estimator[M]: needs to know what it shall return.
• FeatureSelectorParams: defined later.
• DefaultParamsWritable: makes all Params writable.
• Setters return Learner for setter concatenation: mdl.setParam1(val1).setParam2(val2)...
• transformSchema: performs input checking and fails fast. Can throw exceptions.
• fit: learns from data and returns a Model. Here: calculate feature importances.
• train and make: not necessary, but avoid code duplication.
transformSchema (= input validation): DataFrame with schema → transformed schema, or exception.
features label
[0,1,0,1] 1.0
[1,0,0,0] 1.0
Schema: features: VectorColumn, selected: VectorColumn, label: Double
Attention: VectorColumns have metadata: name, type, range, etc.
fit (= learn from data): Dataset → Transformer
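To make the Estimator and Model roles concrete, here is a deliberately small, self-contained pair. It is not the speakers' FeatureSelector: the names ToySelector and ToySelectorModel, the fixed column names, and the toy score (sum of absolute values per feature) are assumptions chosen to keep the sketch short. It also leaves out Params and persistence.

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset, Row}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{StructField, StructType}

object ToySelector {
  // Hypothetical fixed column names, so the sketch needs no Params.
  val in = "features"
  val out = "selected"

  // Shared input validation: fail fast before any data is touched.
  def validate(schema: StructType): StructType = {
    require(schema.fieldNames.contains(in), s"Missing input column '$in'.")
    require(schema(in).dataType == VectorType, s"Column '$in' must be a vector column.")
    require(!schema.fieldNames.contains(out), s"Output column '$out' already exists.")
    StructType(schema.fields :+ StructField(out, VectorType, nullable = false))
  }
}

// Estimator: scores each feature by the sum of its absolute values and keeps the k best.
class ToySelector(override val uid: String, val k: Int) extends Estimator[ToySelectorModel] {
  def this(k: Int) = this(Identifiable.randomUID("toySelector"), k)

  override def transformSchema(schema: StructType): StructType = ToySelector.validate(schema)

  override def fit(dataset: Dataset[_]): ToySelectorModel = {
    transformSchema(dataset.schema)
    // Element-wise sum of absolute feature values over all rows.
    val scores = dataset.select(col(ToySelector.in)).rdd
      .map { case Row(v: Vector) => v.toArray.map(x => math.abs(x)) }
      .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    val selected = scores.zipWithIndex.sortBy { case (s, _) => -s }.take(k).map(_._2).sorted
    new ToySelectorModel(uid, selected).setParent(this)
  }

  override def copy(extra: ParamMap): ToySelector = copyValues(new ToySelector(uid, k), extra)
}

// Model (a Transformer): keeps only the selected vector entries.
class ToySelectorModel(override val uid: String, val selectedFeatures: Array[Int])
    extends Model[ToySelectorModel] {

  override def transformSchema(schema: StructType): StructType = ToySelector.validate(schema)

  override def transform(dataset: Dataset[_]): DataFrame = {
    transformSchema(dataset.schema)
    val indices = selectedFeatures // local copy, so the UDF does not capture the whole model
    val slice = udf((v: Vector) => Vectors.dense(indices.map(i => v(i))))
    dataset.withColumn(ToySelector.out, slice(col(ToySelector.in)))
  }

  override def copy(extra: ParamMap): ToySelectorModel =
    copyValues(new ToySelectorModel(uid, selectedFeatures), extra).setParent(parent)
}
```

Both classes route transformSchema through the same validation helper, which mirrors the "fails fast" idea from the slide: a wrong column type is reported before any Spark job starts.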
MAKING A TRANSFORMER – FEATURESELECTION EXAMPLE.
abstract class FeatureSelectorModel[M <: FeatureSelectorModel[M]](override val uid: String,
    val selectedFeatures: Array[Int],
    val featureImportances: Map[String, Double])
  extends Model[M] with FeatureSelectorParams with MLWritable {
  // Setters for params in FeatureSelectorParams
  def setFeaturesCol(value: String): this.type = set(featuresCol, value)
  // PipelineStage and Transformer methods
  override def transformSchema(schema: StructType): StructType = {}
  override def transform(dataset: Dataset[_]): DataFrame = {}
  def write: MLWriter
}
Notes on the code:
• MLWritable: for persistence.
• transformSchema: same idea as in Estimator, but different tasks.
• transform: transforms data.
• write: adds persistence.
GIVING YOUR NEW PIPELINESTAGE PARAMETERS.
import org.apache.spark.ml.param._
import org.apache.spark.ml.param.shared._
private[selection] trait FeatureSelectorParams extends Params
  with HasFeaturesCol with HasOutputCol with HasLabelCol {
  // Define params and getters here...
  final val param = new Param[Type](this, "name", "description")
  def getParam: Type = $(param)
}
Notes on the code:
• Using the shared param traits (HasFeaturesCol, HasOutputCol, HasLabelCol): possible, because the package is in org.apache.spark.ml.
• Param[Type]: available out of the box for several types, e.g. DoubleParam, IntParam, BooleanParam, StringArrayParam, ... For other types you need to implement jsonEncode and jsonDecode to maintain persistence.
• Getters are shared between Estimator and Transformer. Setters are not, so that setter calls can be concatenated.
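A concrete parameter could look like the following sketch. The trait HasKeepFraction, the param keepFraction and the class DemoStage are hypothetical names for this illustration. The sketch avoids the shared param traits on purpose, so it compiles outside org.apache.spark.ml as well.

```scala
import org.apache.spark.ml.param.{DoubleParam, ParamMap, ParamValidators, Params}
import org.apache.spark.ml.util.Identifiable

// Hypothetical param trait: the getter lives here and is shared by every stage mixing it in.
trait HasKeepFraction extends Params {
  final val keepFraction = new DoubleParam(this, "keepFraction",
    "fraction of features to keep, in [0, 1]", ParamValidators.inRange(0.0, 1.0))
  setDefault(keepFraction -> 0.5)
  def getKeepFraction: Double = $(keepFraction)
}

// Hypothetical stage-like class: the setter is defined per class and
// returns this.type, so setter calls can be chained.
class DemoStage(override val uid: String) extends HasKeepFraction {
  def this() = this(Identifiable.randomUID("demoStage"))
  def setKeepFraction(value: Double): this.type = set(keepFraction, value)
  override def copy(extra: ParamMap): DemoStage = defaultCopy(extra)
}
```

The validator passed to DoubleParam is checked when the value is set, so an out-of-range value such as 1.5 is rejected with an exception instead of surfacing later during fit.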
ADDING PERSISTENCE TO YOUR NEW PIPELINEMODEL.
• What has to be saved?
  • Metadata: uid, timestamp, version, …
  • Parameters
    Since we are in org.apache.spark.ml, use DefaultParamsWriter.saveMetadata() and DefaultParamsReader.loadMetadata().
  • Learnt data: selectedFeatures & featureImportances
    → Create a DataFrame and use write.parquet(…)
• How do we do that?
  • Create companion object FeatureSelectorModel, which offers the following classes:
    • abstract class FeatureSelectorModelReader[M <: FeatureSelectorModel[M]] extends MLReader[M] {…}
    • class FeatureSelectorModelWriter[M <: FeatureSelectorModel[M]](instance: M) extends MLWriter {…}
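The "learnt data" step can be sketched with public DataFrame APIs only. The object LearntDataIO, its method names and the three column names are assumptions for this illustration; the package's own reader and writer classes may store the data differently, and the metadata part (DefaultParamsWriter / DefaultParamsReader) is not shown.

```scala
import org.apache.spark.sql.SparkSession

object LearntDataIO {
  // Writes the learnt data as a single-row DataFrame in Parquet format.
  // The target path must not exist yet.
  def save(spark: SparkSession, path: String,
           selectedFeatures: Array[Int], featureImportances: Map[String, Double]): Unit = {
    import spark.implicits._
    // Split the map into two parallel sequences, which map to plain array columns.
    val (names, scores) = featureImportances.toSeq.unzip
    Seq((selectedFeatures.toSeq, names, scores))
      .toDF("selectedFeatures", "importanceNames", "importanceValues")
      .write.parquet(path)
  }

  // Reads the single row back and rebuilds both structures.
  def load(spark: SparkSession, path: String): (Array[Int], Map[String, Double]) = {
    val row = spark.read.parquet(path).head()
    val selected = row.getSeq[Int](0).toArray
    val importances = row.getSeq[String](1).zip(row.getSeq[Double](2)).toMap
    (selected, importances)
  }
}
```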
HOW TO USE SPARK-FEATURESELECTION.
import org.apache.spark.ml.feature.selection.filter._
import org.apache.spark.ml.feature.selection.util.VectorMerger
import org.apache.spark.ml.Pipeline
// Load the training data
val df = spark.read.parquet("path/to/data/train.parquet")
// Feature selectors offer different selection methods.
val corSel = new CorrelationSelector().setInputCol("features").setOutputCol("cor")
val giniSel = new GiniSelector().setInputCol("features").setOutputCol("gini")
// VectorMerger merges vector columns and removes duplicates. Requires vector columns with names!
val merger = new VectorMerger().setInputCols(Array("cor", "gini")).setOutputCol("selected")
// Put everything in a pipeline and fit together
val plModel = new Pipeline().setStages(Array(corSel, giniSel, merger)).fit(df)
// Transform the data and drop the original feature column
val dfT = plModel.transform(df).drop("features")
Input (df):
features   Label
[0,1,0,1]  1.0
[0,0,0,0]  0.0
[1,1,0,0]  0.0
[1,0,0,0]  1.0
Feature scores computed during fit:
Feature  F1   F2   F3   F4
Score 1  0.9  0.7  0.0  0.5
Score 2  0.6  0.8  0.0  0.4
Output after transform (dfT):
selected  Label
[0,1]     1.0
[0,0]     0.0
[1,1]     0.0
[1,0]     1.0
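The fit and transform steps on this slide can be mimicked without Spark. The following is only a plain-Scala sketch of the idea, not the package's implementation: `topK`, `mergeByName`, and `project` are hypothetical helpers, and keeping the two best features per selector is an assumption chosen so that the result matches the slide's output table.

```scala
// Plain-Scala sketch (no Spark): two selectors score the same named features,
// each keeps its best ones, and the selections are merged by name.
object MergeSketch {
  val names  = Vector("F1", "F2", "F3", "F4")
  val score1 = Vector(0.9, 0.7, 0.0, 0.5) // "Score 1" row from the slide
  val score2 = Vector(0.6, 0.8, 0.0, 0.4) // "Score 2" row from the slide

  // Hypothetical helper: names of the k highest-scoring features.
  def topK(scores: Vector[Double], k: Int): Vector[String] =
    names.zip(scores).sortBy { case (_, s) => -s }.take(k).map(_._1)

  // Hypothetical helper: union of all selections, duplicates removed,
  // returned in the original column order.
  def mergeByName(selections: Vector[String]*): Vector[String] = {
    val keep = selections.flatten.toSet
    names.filter(keep.contains)
  }

  // Keep only the selected columns of one feature row.
  def project(row: Vector[Int], selected: Vector[String]): Vector[Int] =
    names.zip(row).collect { case (n, v) if selected.contains(n) => v }

  def main(args: Array[String]): Unit = {
    // Assumption: each selector keeps its two best features.
    val selected = mergeByName(topK(score1, 2), topK(score2, 2))
    println(selected)                               // Vector(F1, F2)
    println(project(Vector(0, 1, 0, 1), selected))  // Vector(0, 1)
  }
}
```

Merging by name rather than by position is what makes the duplicate removal possible, which is presumably why the slide stresses that the vector columns need names.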
SPARK-FEATURESELECTION PACKAGE.
- Offers selection based on:
  - Gini coefficient
  - Correlation coefficient
  - Information gain
  - L1-logistic regression weights
  - Random forest importances
- Utility stage:
  - VectorMerger
- Three modes:
  - Percentile (default)
  - Fixed number of columns
  - Compare to random column [4]
Find on GitHub: spark-FeatureSelection or on Spark-packages
[4] Stoppiglia et al.: Ranking a Random Feature for Variable and Feature Selection
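The three modes can be read as three ways of turning a score per feature into a subset of features. The sketch below is plain Scala and purely illustrative: the function names, the rounding in the percentile mode, and the strict "greater than the random column's score" rule are assumptions, not the package's actual parameters or behaviour.

```scala
// Plain-Scala sketch of three ways to turn feature scores into a selection.
object SelectionModes {
  type Scored = Vector[(String, Double)]

  // Fixed number of columns: keep the k highest-scoring features.
  def byFixedNumber(scores: Scored, k: Int): Vector[String] =
    scores.sortBy { case (_, s) => -s }.take(k).map(_._1)

  // Percentile: keep the top fraction p (0 < p <= 1) of all features.
  // Rounding up is an assumption made for this sketch.
  def byPercentile(scores: Scored, p: Double): Vector[String] =
    byFixedNumber(scores, math.ceil(scores.size * p).toInt)

  // Compare to random column: keep features that score higher than an
  // artificial random feature scored by the same method.
  def byRandomCutoff(scores: Scored, randomScore: Double): Vector[String] =
    scores.collect { case (name, s) if s > randomScore => name }

  def main(args: Array[String]): Unit = {
    // "Score 1" row from the usage slide.
    val scores: Scored = Vector("F1" -> 0.9, "F2" -> 0.7, "F3" -> 0.0, "F4" -> 0.5)
    println(byPercentile(scores, 0.5))     // Vector(F1, F2)
    println(byFixedNumber(scores, 3))      // Vector(F1, F2, F4)
    println(byRandomCutoff(scores, 0.45))  // Vector(F1, F2, F4)
  }
}
```

The random-column mode removes the need to pick a threshold by hand: any feature that cannot beat a column of pure noise is dropped.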
PERFORMANCE.
[Figure: Area under normalized PRC and ROC. Bars for FS - 25 Trees, FS - 100 Trees, No FS - 25 Trees, No FS - 100 Trees. Series: normalized area under PRC, area under ROC. Axis: 0 to 1.]
[Figure: Time for FS methods and random forest. Bars for FS - 25 Trees, FS - 100 Trees, No FS - 25 Trees, No FS - 100 Trees. Series: Multibucketizer, Gini, Correlation, Information gain, Chi², Random forest. Axis: Time [s], 0 to 450.]
[Figure: Correlation between feature importances from feature selection and random forest. Bars for Chi², Correlation, Gini, InfoGain. Axis: 0 to 1.2.]
LESSONS LEARNT.
- Know what your data looks like and where it is located! Examples:
  - Operations can succeed in local mode but fail on a cluster.
  - Use .persist(StorageLevel.MEMORY_ONLY) when the data fits into memory. The default for .cache on DataFrames is MEMORY_AND_DISK.
- Do not reinvent the wheel for common methods → consider putting your stages into the spark.ml namespace.
- Use the Spark web UI to understand your Spark jobs.
QUESTIONS?
Marc.Kaminski@bmw.de
Bernhard.bb.Schegel@bmw.de
BACKUP.
DETERMINING WHERE YOUR PIPELINESTAGE SHOULD LIVE.
Own namespace vs. org.apache.spark.ml.*
Own namespace
  Pro: Safer solution
  Con: Code duplication
org.apache.spark.ml.*
  Pro: Less code duplication (sharedParams, SchemaUtils, …)
  Pro: Easier to implement persistence
  Con: More dangerous when not cautious
FEATURE SELECTION.
- Motivation:
  - Many sparse features → the feature space has to be reduced → select features that carry a lot of information for prediction.
  - Feature selection (unlike feature transformation) enables understanding of which features have a high impact on the model.
Input data:
F1  F2  Noise  Label = F1 XOR F2
0   0   0      0
1   0   0      1
0   1   0      1
1   1   1      0
Feature selection (e.g. Correlation, InformationGain, RandomForest, etc.) yields feature importances:
Feature    Importance
Feature 1  0.7
Feature 2  0.7
Noise      0.2
Data after feature selection:
F1  F2  Label = F1 XOR F2
0   0   0
1   0   1
0   1   1
1   1   0
FEATURE SELECTION.
Filter
  Description: Evaluate intrinsic data properties
  Advantages: Fast; scalable
  Disadvantages: Ignore inter-feature dependencies; ignore interaction with the classifier
  Examples: Chi-squared; information gain; correlation
Wrapper
  Description: Evaluate model performance of a feature subset
  Advantages: Account for feature dependencies; simple
  Disadvantages: Classifier-dependent selection; computationally expensive; risk of overfitting
  Examples: Genetic algorithms; search algorithms
Embedded
  Description: Feature selection is embedded in classifier training
  Advantages: Account for feature dependencies
  Disadvantages: Classifier-dependent selection
  Examples: L1-logistic regression; random forest
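The "ignore inter-feature dependencies" disadvantage of filter methods can be checked by hand on the four-row XOR table from the backup slides. The sketch below is plain Scala with a hand-written Pearson correlation. The `pearson` helper and the object name are illustrative and have nothing to do with the package's CorrelationSelector.

```scala
// Plain-Scala sketch: a per-feature correlation filter on the XOR toy data.
object XorFilterSketch {
  // Pearson correlation of two equally long columns; 0.0 if a column is constant.
  def pearson(x: Vector[Double], y: Vector[Double]): Double = {
    val mx = x.sum / x.size
    val my = y.sum / y.size
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val vx = x.map(a => (a - mx) * (a - mx)).sum
    val vy = y.map(b => (b - my) * (b - my)).sum
    if (vx == 0.0 || vy == 0.0) 0.0 else cov / math.sqrt(vx * vy)
  }

  // Columns of the XOR table: Label = F1 XOR F2, Noise is 1 only in the last row.
  val f1    = Vector(0.0, 1.0, 0.0, 1.0)
  val f2    = Vector(0.0, 0.0, 1.0, 1.0)
  val noise = Vector(0.0, 0.0, 0.0, 1.0)
  val label = Vector(0.0, 1.0, 1.0, 0.0)

  def main(args: Array[String]): Unit = {
    println(pearson(f1, label))    // 0.0
    println(pearson(f2, label))    // 0.0
    println(pearson(noise, label)) // about -0.577
  }
}
```

Looked at one feature at a time, F1 and F2 show no linear relation to the label on these four rows, although together they determine it completely. Capturing that kind of interaction requires a method that evaluates features jointly, such as the wrapper and embedded approaches.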
CHALLENGES.
- Big plans for DataFrames when performing many operations on many columns → it can take a long time to build and optimize the DAG.
- Column limit for DataFrames, reported in several Jira issues, especially SPARK-18016 → hopefully fixed in Spark 2.3.0.
- Spark PipelineStages are not consistent in how they handle DataFrame schemas → sometimes no schema is appended.

More Related Content

What's hot

Unlocking the Power of Generative AI An Executive's Guide.pdf
Unlocking the Power of Generative AI An Executive's Guide.pdfUnlocking the Power of Generative AI An Executive's Guide.pdf
Unlocking the Power of Generative AI An Executive's Guide.pdfPremNaraindas1
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AIJames Serra
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkDatabricks
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentDatabricks
 
Machine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflowMachine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflowDatabricks
 
Securing Data in Hadoop at Uber
Securing Data in Hadoop at UberSecuring Data in Hadoop at Uber
Securing Data in Hadoop at UberDataWorks Summit
 
GenAI at UBER: Scaling Infrastructure
GenAI at UBER: Scaling InfrastructureGenAI at UBER: Scaling Infrastructure
GenAI at UBER: Scaling InfrastructureMemory Fabric Forum
 
Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkDatabricks
 
Google Cloud Machine Learning
 Google Cloud Machine Learning  Google Cloud Machine Learning
Google Cloud Machine Learning India Quotient
 
AI-based re-identification of behavioral data
AI-based re-identification of behavioral dataAI-based re-identification of behavioral data
AI-based re-identification of behavioral dataMOSTLY AI
 
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Databricks
 
Big Data and ML on Google Cloud
Big Data and ML on Google CloudBig Data and ML on Google Cloud
Big Data and ML on Google CloudWlodek Bielski
 
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Sri Ambati
 
Conversational Artificial Intelligence with Ben Tomlinson and Wayne Thompson
Conversational Artificial Intelligence with Ben Tomlinson and Wayne ThompsonConversational Artificial Intelligence with Ben Tomlinson and Wayne Thompson
Conversational Artificial Intelligence with Ben Tomlinson and Wayne ThompsonDatabricks
 
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AINing Jiang
 
Intro to Azure OpenAI Service L100 (Thai Ver).pdf
Intro to Azure OpenAI Service L100 (Thai Ver).pdfIntro to Azure OpenAI Service L100 (Thai Ver).pdf
Intro to Azure OpenAI Service L100 (Thai Ver).pdfKorkrid Akepanidtaworn
 
MLflow with Databricks
MLflow with DatabricksMLflow with Databricks
MLflow with DatabricksLiangjun Jiang
 
擬人化で考えるオブジェクト指向
擬人化で考えるオブジェクト指向擬人化で考えるオブジェクト指向
擬人化で考えるオブジェクト指向yamada28go
 

What's hot (20)

Unlocking the Power of Generative AI An Executive's Guide.pdf
Unlocking the Power of Generative AI An Executive's Guide.pdfUnlocking the Power of Generative AI An Executive's Guide.pdf
Unlocking the Power of Generative AI An Executive's Guide.pdf
 
Machine Learning and AI
Machine Learning and AIMachine Learning and AI
Machine Learning and AI
 
Building a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache SparkBuilding a Feature Store around Dataframes and Apache Spark
Building a Feature Store around Dataframes and Apache Spark
 
Unified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model DeploymentUnified MLOps: Feature Stores & Model Deployment
Unified MLOps: Feature Stores & Model Deployment
 
Machine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflowMachine Learning with PyCarent + MLflow
Machine Learning with PyCarent + MLflow
 
Journey of Generative AI
Journey of Generative AIJourney of Generative AI
Journey of Generative AI
 
Securing Data in Hadoop at Uber
Securing Data in Hadoop at UberSecuring Data in Hadoop at Uber
Securing Data in Hadoop at Uber
 
GenAI at UBER: Scaling Infrastructure
GenAI at UBER: Scaling InfrastructureGenAI at UBER: Scaling Infrastructure
GenAI at UBER: Scaling Infrastructure
 
Zipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering FrameworkZipline—Airbnb’s Declarative Feature Engineering Framework
Zipline—Airbnb’s Declarative Feature Engineering Framework
 
Machine Learning on AWS
Machine Learning on AWSMachine Learning on AWS
Machine Learning on AWS
 
Google Cloud Machine Learning
 Google Cloud Machine Learning  Google Cloud Machine Learning
Google Cloud Machine Learning
 
AI-based re-identification of behavioral data
AI-based re-identification of behavioral dataAI-based re-identification of behavioral data
AI-based re-identification of behavioral data
 
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
Zipline: Airbnb’s Machine Learning Data Management Platform with Nikhil Simha...
 
Big Data and ML on Google Cloud
Big Data and ML on Google CloudBig Data and ML on Google Cloud
Big Data and ML on Google Cloud
 
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
Interpretable Machine Learning Using LIME Framework - Kasia Kulma (PhD), Data...
 
Conversational Artificial Intelligence with Ben Tomlinson and Wayne Thompson
Conversational Artificial Intelligence with Ben Tomlinson and Wayne ThompsonConversational Artificial Intelligence with Ben Tomlinson and Wayne Thompson
Conversational Artificial Intelligence with Ben Tomlinson and Wayne Thompson
 
AutoML - The Future of AI
AutoML - The Future of AIAutoML - The Future of AI
AutoML - The Future of AI
 
Intro to Azure OpenAI Service L100 (Thai Ver).pdf
Intro to Azure OpenAI Service L100 (Thai Ver).pdfIntro to Azure OpenAI Service L100 (Thai Ver).pdf
Intro to Azure OpenAI Service L100 (Thai Ver).pdf
 
MLflow with Databricks
MLflow with DatabricksMLflow with Databricks
MLflow with Databricks
 
擬人化で考えるオブジェクト指向
擬人化で考えるオブジェクト指向擬人化で考えるオブジェクト指向
擬人化で考えるオブジェクト指向
 

Viewers also liked

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...Databricks
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterDatabricks
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDatabricks
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Summit
 

Viewers also liked (7)

No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark with Ma...
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
From Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim HunterFrom Pipelines to Refineries: scaling big data applications with Tim Hunter
From Pipelines to Refineries: scaling big data applications with Tim Hunter
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim HunterDeep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
Deep-Dive into Deep Learning Pipelines with Sue Ann Hong and Tim Hunter
 
Spark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard MaasSpark Streaming Programming Techniques You Should Know with Gerard Maas
Spark Streaming Programming Techniques You Should Know with Gerard Maas
 

Similar to Building Custom ML PipelineStages for Feature Selection with Marc Kaminski

Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...Databricks
 
Evolution of Vehicle aftter it has been released, How its made and managed
Evolution of Vehicle aftter it has been released, How its made and managedEvolution of Vehicle aftter it has been released, How its made and managed
Evolution of Vehicle aftter it has been released, How its made and managedSamuel Festus
 
Murad Muradi - Quantum Annealing based Optimization of Robotic Movement in Ma...
Murad Muradi - Quantum Annealing based Optimization of Robotic Movement in Ma...Murad Muradi - Quantum Annealing based Optimization of Robotic Movement in Ma...
Murad Muradi - Quantum Annealing based Optimization of Robotic Movement in Ma...Tom Hubregtsen
 
StarTuned August 2013
StarTuned August 2013StarTuned August 2013
StarTuned August 2013RBMParts
 
MIPI DevCon 2020 | MASS: Automotive Displays Using VDC-M Visually Lossless C...
MIPI DevCon 2020 |  MASS: Automotive Displays Using VDC-M Visually Lossless C...MIPI DevCon 2020 |  MASS: Automotive Displays Using VDC-M Visually Lossless C...
MIPI DevCon 2020 | MASS: Automotive Displays Using VDC-M Visually Lossless C...MIPI Alliance
 
Triple Forward Camera from Tesla Model 3
 Triple Forward Camera from Tesla Model 3 Triple Forward Camera from Tesla Model 3
Triple Forward Camera from Tesla Model 3system_plus
 
Intland Software | codeBeamer ALM: What’s in the Pipeline for the Automotive ...
Intland Software | codeBeamer ALM: What’s in the Pipeline for the Automotive ...Intland Software | codeBeamer ALM: What’s in the Pipeline for the Automotive ...
Intland Software | codeBeamer ALM: What’s in the Pipeline for the Automotive ...Intland Software GmbH
 
Melexis Time of Flight Imager for Automotive Applications 2017 teardown rever...
Melexis Time of Flight Imager for Automotive Applications 2017 teardown rever...Melexis Time of Flight Imager for Automotive Applications 2017 teardown rever...
Melexis Time of Flight Imager for Automotive Applications 2017 teardown rever...Yole Developpement
 
Automotive supply chain visibility v2
Automotive supply chain visibility v2Automotive supply chain visibility v2
Automotive supply chain visibility v2Prasaga
 
Examining BMW´s Open Architecture for Telematic Applications - H Michel
Examining BMW´s Open Architecture for Telematic Applications - H MichelExamining BMW´s Open Architecture for Telematic Applications - H Michel
Examining BMW´s Open Architecture for Telematic Applications - H Michelmfrancis
 
Position Sensor IC Innovations Creating Value in Automotive Applications
Position Sensor IC Innovations Creating Value in Automotive ApplicationsPosition Sensor IC Innovations Creating Value in Automotive Applications
Position Sensor IC Innovations Creating Value in Automotive ApplicationsHEINZ OYRER
 
Finite Element Analysis and Optimization of Automotive Seat Floor Mounting Br...
Finite Element Analysis and Optimization of Automotive Seat Floor Mounting Br...Finite Element Analysis and Optimization of Automotive Seat Floor Mounting Br...
Finite Element Analysis and Optimization of Automotive Seat Floor Mounting Br...IRJET Journal
 
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela PoklukarDataScienceConferenc1
 
daimler presentataion
daimler presentataiondaimler presentataion
daimler presentataionAnubhav goel
 
Tas case study one
Tas case study oneTas case study one
Tas case study oneRalph Paglia
 
Maxim auto business update final
Maxim auto business update finalMaxim auto business update final
Maxim auto business update finalmaxim2015ir
 
Fvdi abrites commander
Fvdi abrites commanderFvdi abrites commander
Fvdi abrites commanderLandy Lan
 
DSD-INT 2015 - Model as a service - Fedor Baart
DSD-INT 2015 - Model as a service - Fedor BaartDSD-INT 2015 - Model as a service - Fedor Baart
DSD-INT 2015 - Model as a service - Fedor BaartDeltares
 

Similar to Building Custom ML PipelineStages for Feature Selection with Marc Kaminski (20)

Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
Next Generation Workshop Car Diagnostics at BMW Powered by Apache Spark with ...
 
Evolution of Vehicle aftter it has been released, How its made and managed
Evolution of Vehicle aftter it has been released, How its made and managedEvolution of Vehicle aftter it has been released, How its made and managed
Evolution of Vehicle aftter it has been released, How its made and managed
 
Murad Muradi - Quantum Annealing based Optimization of Robotic Movement in Ma...
Murad Muradi - Quantum Annealing based Optimization of Robotic Movement in Ma...Murad Muradi - Quantum Annealing based Optimization of Robotic Movement in Ma...
Murad Muradi - Quantum Annealing based Optimization of Robotic Movement in Ma...
 
StarTuned August 2013
StarTuned August 2013StarTuned August 2013
StarTuned August 2013
 
MIPI DevCon 2020 | MASS: Automotive Displays Using VDC-M Visually Lossless C...
MIPI DevCon 2020 |  MASS: Automotive Displays Using VDC-M Visually Lossless C...MIPI DevCon 2020 |  MASS: Automotive Displays Using VDC-M Visually Lossless C...
MIPI DevCon 2020 | MASS: Automotive Displays Using VDC-M Visually Lossless C...
 
Bmw cas4
Bmw cas4Bmw cas4
Bmw cas4
 
Triple Forward Camera from Tesla Model 3
 Triple Forward Camera from Tesla Model 3 Triple Forward Camera from Tesla Model 3
Triple Forward Camera from Tesla Model 3
 
Intland Software | codeBeamer ALM: What’s in the Pipeline for the Automotive ...
Intland Software | codeBeamer ALM: What’s in the Pipeline for the Automotive ...Intland Software | codeBeamer ALM: What’s in the Pipeline for the Automotive ...
Intland Software | codeBeamer ALM: What’s in the Pipeline for the Automotive ...
 
Melexis Time of Flight Imager for Automotive Applications 2017 teardown rever...
Melexis Time of Flight Imager for Automotive Applications 2017 teardown rever...Melexis Time of Flight Imager for Automotive Applications 2017 teardown rever...
Melexis Time of Flight Imager for Automotive Applications 2017 teardown rever...
 
Automotive supply chain visibility v2
Automotive supply chain visibility v2Automotive supply chain visibility v2
Automotive supply chain visibility v2
 
Examining BMW´s Open Architecture for Telematic Applications - H Michel
Examining BMW´s Open Architecture for Telematic Applications - H MichelExamining BMW´s Open Architecture for Telematic Applications - H Michel
Examining BMW´s Open Architecture for Telematic Applications - H Michel
 
Position Sensor IC Innovations Creating Value in Automotive Applications
Position Sensor IC Innovations Creating Value in Automotive ApplicationsPosition Sensor IC Innovations Creating Value in Automotive Applications
Position Sensor IC Innovations Creating Value in Automotive Applications
 
Finite Element Analysis and Optimization of Automotive Seat Floor Mounting Br...
Finite Element Analysis and Optimization of Automotive Seat Floor Mounting Br...Finite Element Analysis and Optimization of Automotive Seat Floor Mounting Br...
Finite Element Analysis and Optimization of Automotive Seat Floor Mounting Br...
 
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar
[DSC Europe 22] Reproducibility and Versioning of ML Systems - Spela Poklukar
 
daimler presentataion
daimler presentataiondaimler presentataion
daimler presentataion
 
Tas case study one
Tas case study oneTas case study one
Tas case study one
 
Maxim auto business update final
Maxim auto business update finalMaxim auto business update final
Maxim auto business update final
 
Fvdi abrites commander
Fvdi abrites commanderFvdi abrites commander
Fvdi abrites commander
 
ST-AUT_Guidelines_VI3e.pdf
ST-AUT_Guidelines_VI3e.pdfST-AUT_Guidelines_VI3e.pdf
ST-AUT_Guidelines_VI3e.pdf
 
DSD-INT 2015 - Model as a service - Fedor Baart
DSD-INT 2015 - Model as a service - Fedor BaartDSD-INT 2015 - Model as a service - Fedor Baart
DSD-INT 2015 - Model as a service - Fedor Baart
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Building Custom ML PipelineStages for Feature Selection with Marc Kaminski

  • 1. Kaminski, Schlegel | Oct. 25, 2017 BUILDING CUSTOM ML PIPELINESTAGES FOR FEATURE SELECTION. SPARK SUMMIT EUROPE 2017.
  • 2. WHAT YOU WILL LEARN DURING THIS SESSION.
      • What data-driven car diagnostics look like at BMW.
      • Get a good understanding of the most important elements in Spark ML PipelineStages (on a feature selection example).
      • Attention: There will be Scala code examples!
      • How to use spark-FeatureSelection in your Spark ML Pipeline.
      • The impact of feature selection on learning performance and on the understanding of the big data black box.
    Building custom ml PipelineStages for feature selection | BMW | Oct. 25, 2017 Page 2
  • 7. MOTIVATION.
      • The #3 contributor to warranty incidents for OEMs is “no trouble found” cases. [1]
      • Potential root causes:
          • Manually formalized expert knowledge cannot cope with the vast number of possibilities.
          • Cars are getting more and more complex (hybridization, connectivity).
          • Less experienced workshop staff in evolving markets.
      • Improve three workflows at once by shifting from a manual to a data-driven approach:
          • Automatic knowledge generation.
          • Automatic workshop diagnostics.
          • Predictive maintenance.
    [1] BearingPoint, Global Automotive Warranty Survey Report 2009
  • 11. THE DATASET AND ITS CHALLENGES.
    Columns: MV_S, MV_0, …, MV_4000, MV_BGEE_TP, SC_IP, SC_1, SC_2, DTC_PU, DTC_1, DTC_2, CP, Label.
    Sample rows (values in slide order; empty cells are not marked, so values do not align one-to-one with the column names):
      44 3 … 20 -0.06 false 2 77 27 false true v.10 false
      72 36 … 73 -0.01 false 16 29 false v.10 false
      100 4 … 16 -0.02 true 45 1 false false v.10 false
      44 14 … 54 -0.02 true 76 false v.10 true
      95 34 … 73 -0.07 false 80 22 false false v.10 false
      16 50 … 33 -0.02 true 61 93 false false false v.11 false
      4 … 27 -0.09 false 59 91 false v.10 false
      48 … 20 -0.07 false 32 31 false v.10 false
      88 60 … 72 -0.01 true 1.9 96 53 true false true v.10 false
      27 14 … 88 false 73 14 false v.10 false
    Challenges:
      • High-dimensional feature space (7000+ features).
      • High sparsity.
      • High class imbalance.
  • 17. SPARK PIPELINE. Relational DWH → ETL (imputation, loading) → Handling imbalance (SMOTE [2], undersampling, …) → Preprocessing (StringIndexer, OneHotEncoder, VectorAssembler, Discretization, Std.Scaler) → Cross-validation loop: Feature selection [3] (InformationGain, Correlation, ChiSquared, Ran. Forest Gini, L1 LogReg) and Classifier (Logistic Regression / Random Forest) → Model.
    Example data after ETL (MV_S, SC_IP, DTC_PU, CP, Label): (44, 2, false, v.10, false), (72, 1.5, true, v.11, false), (23, 1.4, false, v.11, false), (44, 1.5, true, v.10, true).
    Example data after preprocessing (Features, Label): ([0.34,0.8,0,1], 0.0), ([0.7,0.4,1,0], 0.0), ([0.31,0.35,1,0], 1.0), ([0.3,0.4,1.1], 1.0).
    [2] Chawla et al.: SMOTE: Synthetic Minority Over-sampling Technique
    [3] Schlegel et al.: Design and optimization of an autonomous feature selection pipeline for high dimensional, heterogeneous feature spaces.
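The preprocessing, feature selection and classifier stages of such a pipeline could be wired together roughly as follows. This is only a minimal sketch, assuming Spark ML (`spark-mllib`) on the classpath: the toy rows reuse the slide's example columns, Spark's built-in `ChiSqSelector` stands in for the talk's own feature selectors (whose API is not shown at this point), the stage parameters are illustrative, and the imbalance handling, OneHotEncoder, scaling and cross-validation steps are left out for brevity.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{ChiSqSelector, StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

object PipelineSketch {
  // Builds the stages only; nothing is learned until fit() is called.
  def buildPipeline(): Pipeline = {
    // Categorical string column -> numeric index.
    val cpIndexer = new StringIndexer().setInputCol("CP").setOutputCol("CP_idx")
    // Numeric and boolean columns -> one feature vector.
    val assembler = new VectorAssembler()
      .setInputCols(Array("MV_S", "SC_IP", "DTC_PU", "CP_idx"))
      .setOutputCol("features")
    // Stand-in feature selection stage (assumption: not the talk's selector).
    val selector = new ChiSqSelector()
      .setNumTopFeatures(2)
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setOutputCol("selected")
    val classifier = new LogisticRegression()
      .setFeaturesCol("selected")
      .setLabelCol("label")
    new Pipeline().setStages(Array[PipelineStage](cpIndexer, assembler, selector, classifier))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pipeline-sketch").getOrCreate()
    import spark.implicits._

    // Toy rows mirroring the slide's example table; the label is already a Double.
    val df = Seq(
      (44.0, 2.0, false, "v.10", 0.0),
      (72.0, 1.5, true, "v.11", 0.0),
      (23.0, 1.4, false, "v.11", 0.0),
      (44.0, 1.5, true, "v.10", 1.0)
    ).toDF("MV_S", "SC_IP", "DTC_PU", "CP", "label")

    val model = buildPipeline().fit(df) // fits every Estimator stage in order
    model.transform(df).select("selected", "label", "prediction").show(false)
    spark.stop()
  }
}
```

To reproduce the slide's cross-validation loop, the selector and classifier could be wrapped in Spark's `CrossValidator`; that step is omitted here to keep the sketch short.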
  • 20. SPARK PIPELINE API. PipelineStage: interface for usage in a Pipeline. Transformer: ‘transforms data’ (data → data). Estimator: ‘learns from data’ (data → Transformer).
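The Estimator/Transformer distinction can be seen with a built-in stage. The following is a small sketch, assuming Spark ML on the classpath; `StringIndexer` is used only as a familiar example and the column names are made up for illustration.

```scala
import org.apache.spark.ml.{Estimator, Transformer}
import org.apache.spark.ml.feature.{StringIndexer, StringIndexerModel}
import org.apache.spark.sql.SparkSession

object EstimatorVsTransformer {
  def run(spark: SparkSession): Array[Double] = {
    import spark.implicits._
    val df = Seq("v.10", "v.11", "v.10", "v.10").toDF("CP")

    // An Estimator learns from data ...
    val indexer: Estimator[StringIndexerModel] =
      new StringIndexer().setInputCol("CP").setOutputCol("CP_idx")

    // ... and fit() returns a Transformer (here a StringIndexerModel).
    val model: Transformer = indexer.fit(df)

    // A Transformer maps data to data; the most frequent label gets index 0.0.
    model.transform(df).select("CP_idx").as[Double].collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("estimator-vs-transformer").getOrCreate()
    println(run(spark).mkString(", "))
    spark.stop()
  }
}
```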
  • 23. ORG.APACHE.SPARK.ML.*
      • PipelineStage: interface for usage in Pipeline.
          • Estimator: learns from data.
              • Pipeline: concat PipelineStages.
              • Predictor: interface for predictors.
              • FeatureSelector: interface for FS.
          • Transformer: transforms data.
              • Model: fitted model.
                  • PipelineModel: model from Pipeline.
                  • PredictionModel: model from predictor.
                  • FeatureSelectionModel: model from FeatureSelector.
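For the built-in classes, these relationships can be checked by the compiler. The sketch below assumes Spark ML on the classpath and uses `LogisticRegression` as one example of a Predictor; the talk's own `FeatureSelector` classes are not included because their definitions are not available at this point.

```scala
import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, PredictionModel, Predictor, Transformer}
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.linalg.Vector

object HierarchySketch {
  // Each assignment or upcast compiles only if the subtype relationship holds.
  val pipeline: Estimator[PipelineModel] = new Pipeline()
  val predictor: Predictor[Vector, LogisticRegression, LogisticRegressionModel] = new LogisticRegression()
  val estimatorAsStage: PipelineStage = predictor // every Estimator is a PipelineStage

  def asModel(m: PipelineModel): Model[PipelineModel] = m
  def asTransformer(m: Model[PipelineModel]): Transformer = m
  def asPredictionModel(m: LogisticRegressionModel): PredictionModel[Vector, LogisticRegressionModel] = m
  def asStage(t: Transformer): PipelineStage = t // every Transformer is a PipelineStage
}
```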
  • 31. MAKING AN ESTIMATOR – FEATURESELECTION EXAMPLE.
      abstract class FeatureSelector[
          Learner <: FeatureSelector[Learner, M],
          M <: FeatureSelectorModel[M]]
        extends Estimator[M] with FeatureSelectorParams with DefaultParamsWritable {
        // Setters for params in FeatureSelectorParams
        def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner]
      }
    Slide annotations: Needs to know what it shall return. Defined later. Makes all Params writable. For setter concatenation: mdl.setParam1(val1).setParam2(val2)...
    transformSchema (= input validation): takes the schema of a DataFrame (features: VectorColumn, label: Double; example rows: [0,1,0,1] 1.0 and [1,0,0,0] 1.0) and returns either the transformed schema (with selected: VectorColumn) or an exception.
    Attention: VectorColumns have metadata: name, type, range, etc.
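The self-referential `Learner` type parameter is what lets inherited setters return the concrete subclass, so chained calls keep their type. Here is a Spark-free sketch of that idea; all class names, parameters and defaults (`ToySelector`, `numTopFeatures`, `threshold`, and so on) are hypothetical and chosen only for illustration.

```scala
// Toy stand-in for the slide's FeatureSelector; no Spark involved.
abstract class ToySelector[Learner <: ToySelector[Learner]] {
  protected var numTopFeatures: Int = 10
  protected var outputCol: String = "selected"

  // Returning Learner (not ToySelector) keeps subclass methods reachable in a chain.
  def setNumTopFeatures(value: Int): Learner = {
    numTopFeatures = value
    this.asInstanceOf[Learner]
  }

  def setOutputCol(value: String): Learner = {
    outputCol = value
    this.asInstanceOf[Learner]
  }

  def describe: String = s"numTopFeatures=$numTopFeatures, outputCol=$outputCol"
}

final class ToyGainSelector extends ToySelector[ToyGainSelector] {
  private var threshold: Double = 0.0

  def setThreshold(value: Double): ToyGainSelector = {
    threshold = value
    this
  }

  override def describe: String = s"${super.describe}, threshold=$threshold"
}

object SetterChainingDemo {
  def main(args: Array[String]): Unit = {
    // setThreshold is still available after the inherited setters,
    // because they return ToyGainSelector rather than the base type.
    val selector = new ToyGainSelector()
      .setNumTopFeatures(5)
      .setOutputCol("top5")
      .setThreshold(0.1)
    println(selector.describe) // numTopFeatures=5, outputCol=top5, threshold=0.1
  }
}
```

If the base-class setters returned the base type instead, the call to `setThreshold` at the end of the chain would not compile.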
  • 33. MAKING AN ESTIMATOR – FEATURE SELECTION EXAMPLE.
      abstract class FeatureSelector[
          Learner <: FeatureSelector[Learner, M],  // For setter concatenation: mdl.setParam1(val1).setParam2(val2)...
          M <: FeatureSelectorModel[M]]
        extends Estimator[M]                       // Needs to know what it shall return.
        with FeatureSelectorParams                 // Defined later.
        with DefaultParamsWritable {               // Makes all Params writable.
        // Setters for params in FeatureSelectorParams (setParam* is a placeholder for one setter per param)
        def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner]
        // PipelineStage and Estimator methods
        // Performs input checking and fails fast. Can throw exceptions.
        override def transformSchema(schema: StructType): StructType = ???  // body omitted on the slide
      }
    fit (= learn from data): a Dataset goes in, a Transformer comes out.
  • 36. MAKING AN ESTIMATOR – FEATURE SELECTION EXAMPLE.
      abstract class FeatureSelector[
          Learner <: FeatureSelector[Learner, M],  // For setter concatenation: mdl.setParam1(val1).setParam2(val2)...
          M <: FeatureSelectorModel[M]]
        extends Estimator[M]                       // Needs to know what it shall return.
        with FeatureSelectorParams                 // Defined later.
        with DefaultParamsWritable {               // Makes all Params writable.
        // Setters for params in FeatureSelectorParams (setParam* is a placeholder for one setter per param)
        def setParam*(value: ParamType): Learner = set(param, value).asInstanceOf[Learner]
        // PipelineStage and Estimator methods
        // Performs input checking and fails fast. Can throw exceptions.
        override def transformSchema(schema: StructType): StructType = ???  // body omitted on the slide
        // Learns from data and returns a Model. Here: calculate feature importances.
        override def fit(dataset: Dataset[_]): M = ???                      // body omitted on the slide
        override def copy(extra: ParamMap): Learner
        // Abstract methods that are called from fit(). Not necessary, but avoids code duplication.
        protected def train(dataset: Dataset[_]): Array[(Int, Double)]
        protected def make(uid: String, selectedFeatures: Array[Int], featureImportances: Map[String, Double]): M
      }
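    The self-referential type parameter Learner is what lets setters defined in the base class return the concrete subclass, so that calls can be chained. The following is a minimal sketch of that idea in plain Scala without Spark; Selector, ToySelector, and the parameter names are hypothetical stand-ins, not classes from the package.

```scala
// Simplified stand-ins for the Spark classes; all names here are hypothetical.
abstract class Selector[Learner <: Selector[Learner]] {
  protected var params: Map[String, Any] = Map.empty

  // Defined once in the base class, but returns the concrete subtype,
  // so subtype-specific setters can follow in the same chain.
  def setOutputCol(value: String): Learner = {
    params = params + ("outputCol" -> value)
    this.asInstanceOf[Learner]
  }

  def get(name: String): Option[Any] = params.get(name)
}

final class ToySelector extends Selector[ToySelector] {
  // A setter that only exists on the concrete class.
  def setNumTopFeatures(value: Int): ToySelector = {
    params = params + ("numTopFeatures" -> value)
    this
  }
}

object SetterChainingDemo {
  def main(args: Array[String]): Unit = {
    // Compiles only because setOutputCol returns ToySelector, not Selector[_].
    val sel = new ToySelector().setOutputCol("selected").setNumTopFeatures(2)
    println(sel.get("outputCol"))
    println(sel.get("numTopFeatures"))
  }
}
```

    If setOutputCol returned the base type instead, the call to setNumTopFeatures would not compile, which is the problem the Learner parameter avoids.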
  • 42. MAKING A TRANSFORMER – FEATURE SELECTION EXAMPLE.
      abstract class FeatureSelectorModel[M <: FeatureSelectorModel[M]](
          override val uid: String,
          val selectedFeatures: Array[Int],
          val featureImportances: Map[String, Double])
        extends Model[M]
        with FeatureSelectorParams
        with MLWritable {                          // For persistence.
        // Setters for params in FeatureSelectorParams
        def setFeaturesCol(value: String): this.type = set(featuresCol, value)
        // PipelineStage and Transformer methods
        // Same idea as in Estimator, but different tasks.
        override def transformSchema(schema: StructType): StructType = ???  // body omitted on the slide
        // Transforms data.
        override def transform(dataset: Dataset[_]): DataFrame = ???        // body omitted on the slide
        // Adds persistence.
        def write: MLWriter
      }
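    At its core, transform keeps only the vector entries whose indices were learnt during fit. The sketch below shows that step on plain arrays; it is an illustration only, since the real model works on Spark vector columns and also has to carry the column metadata along. The object name and the chosen indices are assumptions.

```scala
object SelectFeaturesSketch {
  // Keeps only the entries of `features` at the learnt indices, in the given order.
  def select(features: Array[Double], selectedFeatures: Array[Int]): Array[Double] =
    selectedFeatures.map(i => features(i))

  def main(args: Array[String]): Unit = {
    // Two example rows of the "features" column.
    val rows = Seq(Array(0.0, 1.0, 0.0, 1.0), Array(1.0, 0.0, 0.0, 0.0))
    // Assumed result of fit: the first two features were selected.
    val selected = Array(0, 1)
    rows.foreach(r => println(select(r, selected).mkString("[", ",", "]")))
  }
}
```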
  • 46. GIVING YOUR NEW PIPELINESTAGE PARAMETERS.
      import org.apache.spark.ml.param._
      import org.apache.spark.ml.param.shared._
      private[selection] trait FeatureSelectorParams extends Params
        with HasFeaturesCol with HasOutputCol with HasLabelCol {
        // Define params and getters here...
        final val param = new Param[Type](this, "name", "description")
        def getParam: Type = $(param)
      }
    Notes:
      - Shared param traits (HasFeaturesCol, HasOutputCol, HasLabelCol): possible, because the package is in org.apache.spark.ml.
      - Params: available out of the box for several types, e.g. DoubleParam, IntParam, BooleanParam, StringArrayParam, ... For other types, jsonEncode and jsonDecode need to be implemented to maintain persistence.
      - Getters are shared between Estimator and Transformer. Setters are not, so that setter calls can be concatenated.
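    A concrete parameter definition might look like the sketch below. It assumes Spark ML is on the classpath; the trait, the class, and the "threshold" parameter are hypothetical and not taken from the package. The shared Has*Col traits are deliberately left out so that the sketch does not depend on living inside org.apache.spark.ml.

```scala
import org.apache.spark.ml.param.{DoubleParam, ParamMap, ParamValidators, Params}
import org.apache.spark.ml.util.Identifiable

// Hypothetical params trait with one validated parameter and its getter.
trait ExampleSelectorParams extends Params {
  final val threshold: DoubleParam = new DoubleParam(
    this, "threshold", "share of features to keep, in [0, 1]",
    ParamValidators.inRange(0.0, 1.0))

  // The getter lives in the shared trait.
  def getThreshold: Double = $(threshold)
}

// Minimal concrete holder so that the trait can be instantiated.
class ExampleSelectorStage(override val uid: String) extends ExampleSelectorParams {
  def this() = this(Identifiable.randomUID("exampleSelector"))

  // The setter lives in the concrete class and returns this.type for chaining.
  def setThreshold(value: Double): this.type = set(threshold, value)

  override def copy(extra: ParamMap): ExampleSelectorStage = defaultCopy(extra)
}
```

    DoubleParam is one of the types that persist out of the box, so no custom jsonEncode or jsonDecode is needed here.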
  • 50. ADDING PERSISTENCE TO YOUR NEW PIPELINEMODEL.
    What has to be saved?
      - Metadata (uid, timestamp, version, …) and parameters. Since we are in org.apache.spark.ml, use DefaultParamsWriter.saveMetadata() and DefaultParamsReader.loadMetadata().
      - Learnt data: selectedFeatures & featureImportances. Create a DataFrame and use write.parquet(…).
    How do we do that?
      - Create a companion object FeatureSelectorModel, which offers the following classes:
          abstract class FeatureSelectorModelReader[M <: FeatureSelectorModel[M]] extends MLReader[M] {…}
          class FeatureSelectorModelWriter[M <: FeatureSelectorModel[M]](instance: M) extends MLWriter {…}
  • 57. HOW TO USE SPARK-FEATURESELECTION.
      import org.apache.spark.ml.feature.selection.filter._
      import org.apache.spark.ml.feature.selection.util.VectorMerger
      import org.apache.spark.ml.Pipeline

      // Load data
      val df = spark.read.parquet("path/to/data/train.parquet")

      // Feature selectors. Offer different selection methods.
      val corSel = new CorrelationSelector().setInputCol("features").setOutputCol("cor")
      val giniSel = new GiniSelector().setInputCol("features").setOutputCol("gini")

      // VectorMerger merges VectorColumns and removes duplicates. Requires vector columns with names!
      val merger = new VectorMerger().setInputCols(Array("cor", "gini")).setOutputCol("selected")

      // Put everything in a pipeline and fit together
      val plModel = new Pipeline().setStages(Array(corSel, giniSel, merger)).fit(df)
      val dfT = plModel.transform(df).drop("features")
    df (input):
      features    Label
      [0,1,0,1]   1.0
      [0,0,0,0]   0.0
      [1,1,0,0]   0.0
      [1,0,0,0]   1.0
    Scores after fit:
      Feature   F1    F2    F3    F4
      Score 1   0.9   0.7   0.0   0.5
      Score 2   0.6   0.8   0.0   0.4
    dfT (after transform):
      selected   Label
      [0,1]      1.0
      [0,0]      0.0
      [1,1]      0.0
      [1,0]      1.0
  • 58. SPARK-FEATURESELECTION PACKAGE.
    Offers selection based on:
      - Gini coefficient
      - Correlation coefficient
      - Information gain
      - L1-Logistic regression weights
      - Random forest importances
    Utility stage:
      - VectorMerger
    Three modes:
      - Percentile (default)
      - Fixed number of columns
      - Compare to random column [4]
    Find on GitHub: spark-FeatureSelection or on Spark-packages.
    [4]: Stoppiglia et al.: Ranking a Random Feature for Variable and Feature Selection
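    The three modes can be pictured as three ways of cutting a ranked list of (feature index, importance) pairs. The sketch below is plain Scala and only illustrates the idea; the function names, the rounding rule for the percentile mode, and the strict comparison against the random column are assumptions, not the package's actual semantics.

```scala
object SelectionModesSketch {
  type Importances = Array[(Int, Double)] // (feature index, importance)

  // Sorts features by descending importance.
  private def ranked(imps: Importances): Importances =
    imps.sortBy { case (_, score) => -score }

  // Percentile mode: keep the top `fraction` of all features (at least one).
  def byPercentile(imps: Importances, fraction: Double): Array[Int] = {
    val n = math.max(1, math.ceil(imps.length * fraction).toInt)
    ranked(imps).take(n).map(_._1)
  }

  // Fixed-number mode: keep the n highest-ranked features.
  def byFixedNumber(imps: Importances, n: Int): Array[Int] =
    ranked(imps).take(n).map(_._1)

  // Random-column mode: keep features that score strictly above a random column.
  def aboveRandom(imps: Importances, randomScore: Double): Array[Int] =
    ranked(imps).filter(_._2 > randomScore).map(_._1)

  def main(args: Array[String]): Unit = {
    // Illustrative scores for four features F1..F4 (indices 0..3).
    val scores: Importances = Array(0 -> 0.9, 1 -> 0.7, 2 -> 0.0, 3 -> 0.5)
    println(byPercentile(scores, 0.5).mkString(","))
    println(byFixedNumber(scores, 2).mkString(","))
    println(aboveRandom(scores, 0.4).mkString(","))
  }
}
```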
  • 59. PERFORMANCE. [Figure: "Area under normalized PRC and ROC" (normalized area under PRC, area under ROC) for four configurations: FS - 25 Trees, FS - 100 Trees, No FS - 25 Trees, No FS - 100 Trees.] [Figure: "Time for FS methods and random forest", time in s for the same four configurations, broken down into Multibucketizer, Gini, Correlation, Information gain, Chi², and Random forest; a second version of the chart shows only Multibucketizer, Gini, Information gain, and Random forest.] [Figure: "Correlation between feature importances from feature selection and random forest" for Chi², Correlation, Gini, and InfoGain.]
  • 64. LESSONS LEARNT.
      - Know what your data looks like and where it is located! Example: operations can succeed in local mode, but fail on a cluster.
      - Use .persist(StorageLevel.MEMORY_ONLY) when the data fits into memory. The default for .cache on a DataFrame is MEMORY_AND_DISK.
      - Do not reinvent the wheel for common methods. Consider putting your stages into the spark.ml namespace.
      - Use the Spark Web GUI to understand your Spark jobs.
  • 65. QUESTIONS? Marc.Kaminski@bmw.de Bernhard.bb.Schegel@bmw.de
  • 66. BACKUP.
  • 67. DETERMINING WHERE YOUR PIPELINESTAGE SHOULD LIVE.
    Own namespace:
      - Pro: safer solution.
      - Con: code duplication.
    vs. org.apache.spark.ml.*:
      - Pro: less code duplication (sharedParams, SchemaUtils, …); easier to implement persistence.
      - Con: more dangerous, when not cautious.
  • 69. FEATURE SELECTION.
    Motivation:
      - Many sparse features: the feature space has to be reduced, so select features that carry a lot of information for prediction.
      - Feature selection (unlike feature transformation) enables understanding of which features have a high impact on the model.
    Input data:
      F1   F2   Noise   Label = F1 XOR F2
      0    0    0       0
      1    0    0       1
      0    1    0       1
      1    1    1       0
    Feature selection (e.g. Correlation, InformationGain, RandomForest, etc.) yields feature importances:
      Feature     Importance
      Feature 1   0.7
      Feature 2   0.7
      Noise       0.2
    Data after selection:
      F1   F2   Label = F1 XOR F2
      0    0    0
      1    0    1
      0    1    1
      1    1    0
  • 70. FEATURE SELECTION.
    Filter:
      - Description: evaluate intrinsic data properties.
      - Advantages: fast; scalable.
      - Disadvantages: ignore inter-feature dependencies; ignore interaction with classifier.
      - Examples: Chi-squared; Information gain; Correlation.
    Wrapper:
      - Description: evaluate model performance of a feature subset.
      - Advantages: feature dependencies; simple.
      - Disadvantages: classifier-dependent selection; computationally expensive; risk of overfitting.
      - Examples: genetic algorithms; search algorithms.
    Embedded:
      - Description: feature selection is embedded in classifier training.
      - Advantages: feature dependencies.
      - Disadvantages: classifier-dependent selection.
      - Examples: L1-Logistic regression; Random forest.
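    The filter disadvantage "ignore inter-feature dependencies" can be checked by hand on the four-row XOR table from the feature selection slide. The sketch below computes the plain Pearson correlation of each column with the label; it is a stand-alone illustration and not the package's CorrelationSelector. On these four rows, F1 and F2 each have zero correlation with the label, while the noise column does not, so the importance values shown on the slide are not what this particular computation produces.

```scala
object XorFilterSketch {
  // Pearson correlation coefficient of two equally long sequences.
  def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    val mx = x.sum / x.length
    val my = y.sum / y.length
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    cov / (sx * sy)
  }

  // Columns of the XOR example table, row by row.
  val f1: Seq[Double] = Seq(0.0, 1.0, 0.0, 1.0)
  val f2: Seq[Double] = Seq(0.0, 0.0, 1.0, 1.0)
  val noise: Seq[Double] = Seq(0.0, 0.0, 0.0, 1.0)
  val label: Seq[Double] = Seq(0.0, 1.0, 1.0, 0.0) // F1 XOR F2

  def main(args: Array[String]): Unit = {
    println(s"F1 vs label:    ${pearson(f1, label)}")
    println(s"F2 vs label:    ${pearson(f2, label)}")
    println(s"Noise vs label: ${pearson(noise, label)}")
  }
}
```

    A per-feature score like this looks at each column in isolation, which is why wrapper and embedded methods are listed as capturing feature dependencies.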
  • 71. CHALLENGES.
      - Big plans for DataFrames when performing many operations on many columns: it can take a long time to build and optimize the DAG.
      - Column limit for DataFrames introduced by several Jiras, especially SPARK-18016: hopefully fixed in Spark 2.3.0.
      - Spark PipelineStages are not consistent in how they handle DataFrame schemas: sometimes no schema is appended.