| © Copyright 2015 Hitachi Consulting1
Applied Machine Learning
with Apache Spark
Khalid M. Salama, Ph.D.
Business Insights & Analytics
Hitachi Consulting UK
We Make it Happen. Better.
| © Copyright 2015 Hitachi Consulting2
Outline
 Overview on Data Mining
 Review on Spark Core Concepts
 Introducing Machine Learning with Spark
 Spark MLlib Data Types
 Statistics, Sampling, and Random Data Generation
 Spark ML Pipelines
 Data Pre-processing & Transformation
 Building and Evaluating Supervised ML Models (Classification and Regression)
 Other Unsupervised ML techniques (Clustering, Frequent Pattern Mining)
 Useful Resources
| © Copyright 2015 Hitachi Consulting3
Data Mining Overview
| © Copyright 2015 Hitachi Consulting4
Data Mining
… in a nutshell
Data
Mining
Machine
Learning
Statistics
Artificial
Intelligence
Databases
Other
Technologies
“Data mining, an interdisciplinary subfield of
computer science, is the computational
process of discovering patterns in large data
sets involving methods at the intersection of
artificial intelligence, machine learning,
statistics, and database systems.”
Other Related Technologies:
 Visualization
 Big Data
 High Performance Computing
 Cloud Computing
 Others..
| © Copyright 2015 Hitachi Consulting5
Learning Paradigms
Data as the teacher, machine as the student…
Supervised Learning
Labelled data = data + output (predictable, target, response, class) variable
Learn the relationship between data and output
Unsupervised Learning
Unlabelled data
Learn associations, similarities, groups, etc.
Semi-supervised Learning
Partially labelled data
Online/Active Learning
Real-time learning on data streams
Reinforcement Learning
game theory, control theory, simulation-based
optimization, operations research, robotics, etc.
| © Copyright 2015 Hitachi Consulting6
Data Mining Tasks
• Classification – Predicting the class of a given case (Supervised)
• Regression – Estimating the value of a response variable (Supervised)
• Clustering – Partitioning the cases into similar groups (Unsupervised)
• Association Rules Discovery – Finding frequent (co-)occurring items (Unsupervised)
• Similarity Analysis – Finding similar cases to a given case (Both)
• Probabilistic Inference – Calculating the probability of variables (Both)
• Time Series Analysis – Forecasting future values (Supervised)
Important Terms:
• Learning Paradigms:
− Supervised
− Unsupervised
− Semi-supervised
− Others (Reinforcement
learning, Active, etc.)
• Analytics Types:
− Predictive
− Descriptive (Exploratory)
− Prescriptive (Decisive)
Application Fields:
• Text Mining
• Information Retrieval
• (Social) Web Mining
• Speech Recognition
• Image Recognition
• Anomaly Detection
• State Transition Analysis
• Collaborative Filtering
(Recommender systems)
| © Copyright 2015 Hitachi Consulting7
Knowledge Discovery in Databases (KDD)
…or data science, if you like!
Cross Industry Standard Process for Data Mining (CRISP-DM):
Start → Understanding the Business → Understanding the Data → Preparing the Data → Modelling → Evaluation, Interpretation, Communication → Deployment
| © Copyright 2015 Hitachi Consulting8
Data Mining Implementation
Overall Procedure:
 Load dataset.
 Explore data (statistics, cardinality, variable types, correlations, dependencies etc.).
 Apply data transformations and pre-processing (type conversions; feature selection, extraction, construction, and reduction; handling outliers and missing values; scaling, etc.).
 Split the data into training set and test set (cross validation).
 Train a model.
 Evaluate and tune the model.
 Interpret the model and communicate results.
 Productionize the model.
Always Build Multiple Models:
 Using different approaches.
 Using different algorithms.
 Using different parameters (parameter sweeping).
 Using different dataset representations.
Empirical Evaluation for Model Selection
| © Copyright 2015 Hitachi Consulting9
Spark Core Concepts
| © Copyright 2015 Hitachi Consulting10
What is Spark?
The Lightning-fast Big Data Processing
General-purpose Big Data Processing
Integrates with HDFS
Graph Processing
Stream Processing
Machine Learning
Libraries
In-memory (fast)
Iterative
Processing
Interactive
Query
SQL
Scala – Python – Java – R – .NET
| © Copyright 2015 Hitachi Consulting11
Spark Components
Spark and the zoo…
Hadoop Distributed File System (HDFS)
Spark
Yet Another Resource Negotiator (YARN)
Name Node | DataNode 1 | DataNode 2 | DataNode 3 | … | DataNode N
Spark Core Engine (RDDs: Resilient Distributed Datasets)
Spark SQL
(structured data)
Spark Streaming
(real-time)
MLlib
(machine learning)
GraphX
(graph processing)
Scala
Java
Python
R
.NET (Mobius)
| © Copyright 2015 Hitachi Consulting12
Spark Core Concepts
Resilient Distributed Datasets (RDDs): Transformations and Actions
Key/Value (Pair) RDDs
Persisting & Removing RDDs
Per-Partition Operations
Accumulators & Broadcast Variables
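The sketch below (not from the original slides) illustrates these concepts in PySpark; it assumes a SparkContext named sc is available, and the sample data is made up.

# Minimal RDD sketch; assumes a SparkContext 'sc' and made-up sample data.
pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])     # key/value (pair) RDD

doubled = pairs.mapValues(lambda v: v * 2)                  # transformation (lazy)
doubled.persist()                                           # persist the RDD in memory

totals = doubled.reduceByKey(lambda x, y: x + y)            # per-key aggregation
print totals.collect()                                      # action, e.g. [('a', 8), ('b', 4)]

partition_sums = doubled.values().mapPartitions(lambda it: [sum(it)])  # per-partition operation
print partition_sums.collect()

counter = sc.accumulator(0)                                 # accumulator shared across workers
lookup = sc.broadcast({"a": "alpha", "b": "beta"})          # read-only broadcast variable
doubled.foreach(lambda kv: counter.add(1))
print counter.value
print lookup.value["a"]

doubled.unpersist()                                         # remove the RDD from the cache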
| © Copyright 2015 Hitachi Consulting13
Spark SQL
DataFrames
Distributed collection of data organized into named columns
Conceptually equivalent to a table in a relational database or a data frame in
R/Python, with richer Spark optimizations
Can be constructed from: RDDs, structured data files, Hive tables, RDBMS
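As a small illustration (not from the original slides), a DataFrame can be created from an RDD of Row objects and queried with SQL; the sc and sqlContext variables and the sample data are assumed.

from pyspark.sql import Row

rdd = sc.parallelize([Row(name="Anna", age=34), Row(name="Ben", age=28)])
df = sqlContext.createDataFrame(rdd)

df.printSchema()
df.filter(df.age > 30).show()

# Register the DataFrame as a temporary table and query it with Spark SQL (Spark 1.x API)
df.registerTempTable("people")
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()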
| © Copyright 2015 Hitachi Consulting14
Introducing Machine Learning
with Spark
| © Copyright 2015 Hitachi Consulting15
Machine Learning with Spark
Spark and Parallel Machine Learning
Running multiple learning
algorithms, or multiple parameter
configurations in parallel
 Standard ML algorithms from various libraries (Weka, scikit-learn, caret, etc.) can be used
 Simple, as it needs no synchronization
 Less efficient with large datasets; whole
dataset needs to be loaded in each worker
node
Parallelizing the execution of the
learning algorithm on a given
dataset
 The learning algorithm is specifically designed to run in parallel on a given dataset – this is the approach taken by Spark ML algorithms
 Efficient with large datasets; the dataset is partitioned across multiple worker nodes
 Involves synchronization and data shuffling
 Approximation methods might be used to reduce synchronization and data shuffling, which may affect model quality
| © Copyright 2015 Hitachi Consulting16
Machine Learning with Spark
Spark Machine Learning Libraries
Spark MLlib
 Provides base data types (vectors, matrices, etc.) for implementing your own ML algorithms
 Provides utilities for related ML functions (statistics, sampling, random data generation, etc.)
 Provides algorithm classes that operate on RDDs
Spark ML
 Provides a uniform set of high-level APIs for ML operations (transform, model, evaluate)
 Provides algorithm classes that use DataFrames
 Provides a framework for creating practical ML pipelines
We will introduce the base data types in Spark MLlib, but focus on the
algorithms and transformations in Spark ML
| © Copyright 2015 Hitachi Consulting17
Spark MLlib Data Types
| © Copyright 2015 Hitachi Consulting18
Spark MLlib Data Types
Base data type for implementing ML algorithms
Local Vectors (Dense, Sparse)
Local Matrix (Dense, Sparse)
Distributed Matrix
 RowMatrix
 IndexedRowMatrix
 CoordinateMatrix
LabeledPoint
Rating
| © Copyright 2015 Hitachi Consulting19
Spark MLlib Data Types
Vectors
 A vector represents the input feature values of a dataset example (case).
 Feature values in a vector have to be numeric.
 A collection of vectors represents a dataset, which can be parallelized as an RDD.
 A DenseVector stores the values of all the features, while a SparseVector stores only the nonzero values (efficient for sparse datasets, e.g., text document vectors).
from pyspark.mllib.linalg import Vectors

v1 = Vectors.dense([1.1, -2, 0.3, 45, -3.5])

# Create a vector of 5 features, where only features 1 and 3 have values 10
# and 20, respectively; the other features are zeros.
v2 = Vectors.sparse(5, [1, 3], [10, 20])

for i in range(0, 5):
    print "feature " + str(i) + ": " + str(v2[i])
| © Copyright 2015 Hitachi Consulting20
Spark MLlib Data Types
LabeledPoints
 Represents a labelled example in a supervised dataset
 Used in datasets for classification and regression learning
 Consists of a numerical label (class/target) and a features vector
 Categorical class values need a numerical representation (i.e., a class value index)
from pyspark.mllib.regression import LabeledPoint

example = LabeledPoint(1, [34, 56, 76])
print "input features: " + str(example.features) + " class index: " + str(example.label)
| © Copyright 2015 Hitachi Consulting21
Spark MLlib Data Types
Matrices
 A widely-used data structure in linear algebra
 Spark MLlib supports dense and sparse, local and distributed matrices
from pyspark.mllib.linalg import *

# Create a local dense matrix with 3 rows and 2 columns (values given in column-major order)
denseMatrix_local = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])

from pyspark.mllib.linalg.distributed import *

# Each vector is a row in the distributed matrix
v1 = Vectors.sparse(3, [1], [10])
v2 = Vectors.sparse(3, [0, 2], [5, 15])
v3 = Vectors.sparse(3, [0, 1, 2], [10, 20, 30])
sparseMatrix_distributed = RowMatrix(sc.parallelize([v1, v2, v3]))

print "rows: " + str(sparseMatrix_distributed.numRows()) + " - columns: " + str(sparseMatrix_distributed.numCols())
print sparseMatrix_distributed.rows.collect()

 Other distributed matrices: IndexedRowMatrix, CoordinateMatrix, and BlockMatrix
| © Copyright 2015 Hitachi Consulting22
Statistics, Sampling, and Random
Data Generation
| © Copyright 2015 Hitachi Consulting23
Spark MLlib Utilities
Statistics
from pyspark.mllib.stat import *
 Statistics.colStats(dataset)  count, max, min, mean, variance, numNonzeros per column
 Statistics.corr(rdd1, rdd2, method=<"pearson" | "spearman">)  correlation value between two equal-sized collections of numerical values (in two RDDs)
 Statistics.corr(dataset, method)  returns a correlation matrix between each pair of features in an RDD of feature vectors
 Statistics.chiSqTest(dataset)  performs an independence test between each input feature and the target label in a given dataset of LabeledPoints
 dataset.sampleByKey(withReplacement=<True|False>, fractions)  can be used with a dataset of LabeledPoints to perform stratified sampling (i.e., fetch a sample that preserves the class value distribution of the original dataset). The dataset should be key/value pairs, where the key is the label and the value is the LabeledPoint
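The following is a short sketch of these utilities on an assumed toy dataset (the values and column choices are made up):

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.stat import Statistics

vectors = sc.parallelize([Vectors.dense([1.0, 10.0]),
                          Vectors.dense([2.0, 20.0]),
                          Vectors.dense([3.0, 35.0])])

# column statistics
summary = Statistics.colStats(vectors)
print summary.mean()
print summary.variance()
print summary.numNonzeros()

# correlation matrix between the features
print Statistics.corr(vectors, method="pearson")

# chi-squared independence test between each feature and the label
labeled = sc.parallelize([LabeledPoint(0, [1.0, 10.0]),
                          LabeledPoint(1, [2.0, 20.0]),
                          LabeledPoint(1, [3.0, 35.0])])
for result in Statistics.chiSqTest(labeled):
    print result

# stratified sampling: key each case by its label, then sample by key
byLabel = labeled.map(lambda lp: (lp.label, lp))
sample = byLabel.sampleByKey(False, {0.0: 0.5, 1.0: 0.5})
print sample.count()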
| © Copyright 2015 Hitachi Consulting24
Spark MLlib Utilities
Random Data Generation
from pyspark.mllib.random import *
import math

# Generate 100 i.i.d. numbers drawn from the standard normal distribution N(0, 1)
data = RandomRDDs.normalRDD(sc, 100)

# Shift and scale the generated numbers so that they follow the normal distribution N(1, 4)
mean = 1
variance = 4
data_new = data.map(lambda number: mean + (math.sqrt(variance) * number))
| © Copyright 2015 Hitachi Consulting25
Spark MLlib Utilities
Kernel Density Estimator
Kernel density estimation is a non-parametric method for estimating
empirical probability without requiring assumptions about the particular
distribution that the observed samples are drawn from.
from pyspark.mllib.random import *
from pyspark.mllib.stat import *
data = RandomRDDs.normalRDD(sc, 100000)
kde = KernelDensity()
kde.setSample(data)
kde.setBandwidth(0.1)
densities = kde.estimate([0.0,1.0,2.0])
densities
Even without assuming that the sample follows a normal distribution with mean = 0 and stdev = 1, the kernel density estimator is able to estimate the density at 0, 1, and 2 from the sample.
The technique is most useful with data that does not follow a known probability density function.
| © Copyright 2015 Hitachi Consulting26
Spark ML Overview
| © Copyright 2015 Hitachi Consulting27
Spark ML Overview
A uniform set of high-level ML APIs
Spark ML standardizes APIs for machine learning algorithms to make it easier to combine
multiple algorithms into a single pipeline, or workflow.
 Transformers – used for data pre-processing. Input: DataFrame. Output: DataFrame
 Estimators – ML algorithm used to build a predictive model.
Input: DataFrame (with three columns: CaseIndex, Features, Class). Output: Model.
 Parameters – Configurations for Transformers and Estimators
 Pipeline – Chains Transformers and Estimators
ML Pipeline: Dataset (DataFrame) → Transformer A (pre-processing) → … → Transformer Z (pre-processing) → Estimator (ML learning algorithm) → Model → Evaluation, with Parameters configuring each stage
| © Copyright 2015 Hitachi Consulting28
Spark ML Transformers
| © Copyright 2015 Hitachi Consulting29
Spark ML Transformers
Overview
Text Feature Extraction
 TF-IDF (HashingTF and IDF)
 Word2Vec
 CountVectorizer
 Tokenizer
 StopWordsRemover
 n-gram
Features Vector
Preparation
 VectorAssembler
 VectorIndexer
 StringIndexer
 IndexToString
Feature Selection
 VectorSlicer
 RFormula
 ChiSqSelector
Feature Type Conversion
(Continuous → Discrete)
 Binarizer
 Discrete Cosine Transform (DCT)
 OneHotEncoder
 Bucketizer
 QuantileDiscretizer
Feature Scaling
 Normalizer
 StandardScaler
 MaxAbsScaler
 MinMaxScaler
Feature Construction
 SQLTransformer
 ElementwiseProduct
 PolynomialExpansion
Dimensionality Reduction
 PCA
| © Copyright 2015 Hitachi Consulting30
Spark ML Transformers
Preparing a dataset for Spark ML Pipeline
 A dataset is received with a mix of nominal and numerical attributes
 The target variable can be either categorical or numerical, as well.
 All the dataset attributes (input and target) have to be numeric.
 For nominal attributes, numerical indexes should be used instead of the actual “textual” values
 The training dataset should have three attributes:
 CaseIndex: a unique identifier of a training instance
 Features: a Spark MLlib DenseVector or SparseVector to represent all the input variables
 Target: a numerical attribute to represent the class or the response variable in the classification or
regression problems, respectively.
| © Copyright 2015 Hitachi Consulting31
Spark ML Transformers
Features vector preparation
from pyspark.mllib.linalg import *
from pyspark.ml.feature import *
dataframe = sqlContext.createDataFrame(
[
(1,34.45,20,'M','Y'),
(2,23.67,40,'M','Y'),
(3,78.23,20,'M','Y'),
(4,37.48,40,'L','Y'),
(5,48.32,20,'S','N'),
(6,67.45,40,'S','N')
]
,['caseId','score','category','level','label'])
dataframe.show()
dataframe.printSchema()
| © Copyright 2015 Hitachi Consulting32
Spark ML Transformers
Features vector preparation
StringIndexer – encodes a string column of labels to a column of label indices
stringIndexer = StringIndexer(inputCol="label", outputCol="labelIndex")
model = stringIndexer.fit(dataframe)
dataframe_transformed = model.transform(dataframe).select('caseId','score','category','level','labelIndex')
dataframe_transformed.show()
We need to do the same for the “level” input variable:

stringIndexer = StringIndexer(inputCol="level", outputCol="levelIndex")
model = stringIndexer.fit(dataframe_transformed)
dataframe_transformed = model.transform(dataframe_transformed).select('caseId','score','category','levelIndex','labelIndex')
dataframe_transformed.show()

The original textual values can be retrieved using the IndexToString transformer, as in the sketch below.
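A minimal sketch, reusing the columns above; IndexToString maps the label indices back to the original string values by reading the metadata attached by the StringIndexer:

converter = IndexToString(inputCol="labelIndex", outputCol="labelOriginal")
dataframe_restored = converter.transform(dataframe_transformed)
dataframe_restored.select('caseId', 'labelIndex', 'labelOriginal').show()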
| © Copyright 2015 Hitachi Consulting33
Spark ML Transformers
Features vector preparation
VectorAssembler – combines a given list of columns into a single vector column, to represent the input feature set of a training dataset
assembler = VectorAssembler(inputCols=["score", "category", "levelIndex"], outputCol="features")
dataframe_transformed = assembler.transform(dataframe_transformed).select('caseId','features','labelIndex')
dataframe_transformed.show()
Note that the “category” attribute is treated as a numerical attribute.
If it needs to be treated as a nominal attribute, we can either use StringIndexer,
or let VectorIndexer decide whether to treat it as nominal or numerical,
based on the maxCategories parameter. That is, if the number of distinct values is
less than or equal to maxCategories, the attribute will be indexed and treated as nominal.

indexer = VectorIndexer(inputCol="features", outputCol="features_indexed", maxCategories=3)
indexerModel = indexer.fit(dataframe_transformed)
dataframe_transformed = indexerModel.transform(dataframe_transformed) \
    .select('caseId','features_indexed','labelIndex')
dataframe_transformed.show()
| © Copyright 2015 Hitachi Consulting34
Spark ML Transformers
Feature type conversion
Binarizer – converts a numerical attribute into a binary (0/1) nominal attribute by thresholding its values.
binarizer = Binarizer(threshold=40.0, inputCol="score", outputCol="score_binarized")
dataframe_transformed = binarizer.transform(dataframe_transformed)
dataframe_transformed.select('score_binarized','category','levelIndex','labelIndex').show()
| © Copyright 2015 Hitachi Consulting35
Spark ML Transformers
Feature type conversion
OneHotEncoder – maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms that expect continuous features, such as logistic regression, to use categorical features. The resulting vector is a SparseVector.
stringIndexer = StringIndexer(inputCol="label", outputCol="labelIndex")
model = stringIndexer.fit(dataframe)
dataframe_transformed = model.transform(dataframe).select('caseId','score','category','level','labelIndex')
dataframe_transformed.show()
encoder = OneHotEncoder(dropLast=False, inputCol="levelIndex", outputCol="levelFlags")
dataframe_transformed = encoder.transform(dataframe_transformed)
dataframe_transformed.select('score','category','levelFlags','labelIndex').show()
| © Copyright 2015 Hitachi Consulting36
Spark ML Transformers
Feature type conversion
QuantileDiscretizer – converts a numerical attribute into a binned categorical attribute, using the quantiles of the values (a sketch follows the Bucketizer example below)
Bucketizer – converts a numerical attribute into a nominal attribute by assigning the values to pre-defined buckets (ranges)
splits = [-float("inf"), 25.0, 35.0, 50.0,float("inf")]
bucketizer = Bucketizer(splits=splits, inputCol="score", outputCol="score_buckets")
dataframe_transformed = bucketizer.transform(dataframe_transformed)
dataframe_transformed.select('score','score_buckets','category','levelIndex','labelIndex').show()
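A minimal QuantileDiscretizer sketch on the same 'score' column (an assumed illustration, not from the original slides); the bin boundaries are derived from the data quantiles rather than supplied as splits:

discretizer = QuantileDiscretizer(numBuckets=3, inputCol="score", outputCol="score_quantiles")
dataframe_transformed = discretizer.fit(dataframe_transformed).transform(dataframe_transformed)
dataframe_transformed.select('score', 'score_quantiles').show()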
| © Copyright 2015 Hitachi Consulting37
Spark ML Transformers
Feature scaling
Scaling is a very important pre-processing step when similarity/distance measures are involved. This includes clustering and instance-based learning.
dataframe = sqlContext.createDataFrame(
[
(-0.4 ,40 ,4),
(-0.3 ,30 ,3),
(-0.5 ,50 ,5),
(-0.7 ,70 ,7),
(-0.2 ,20 ,2),
(-0.1 ,10 ,1)
],['var1','var2','var3'])
assembler = VectorAssembler(inputCols=["var1", "var2", "var3"], outputCol="features")
dataframe = assembler.transform(dataframe).select('features')
| © Copyright 2015 Hitachi Consulting38
Spark ML Transformers
Feature scaling
scaler = StandardScaler(inputCol="features", outputCol="features_scaled1", withStd=True, withMean=False)
scalerModel = scaler.fit(dataframe)
scaledData = scalerModel.transform(dataframe)
scaledData.show(truncate=False)
scaler = MinMaxScaler(inputCol="features", outputCol="features_scaled2", min=0.0, max=1.0)
scalerModel = scaler.fit(dataframe)
scaledData = scalerModel.transform(dataframe)
scaledData.show(truncate=False)
| © Copyright 2015 Hitachi Consulting39
Spark ML Transformers
Others
Feature Selection (VectorSlicer, RFormula, ChiSqSelector) – Supervised methods to select the best features for predicting/estimating the target variable.
Dimensionality Reduction (Principal Component Analysis) – Converts a set of instances of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. Input: many correlated features. Output: a few uncorrelated features (see the sketch below).
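An illustrative PCA sketch (assumed toy data, not from the original slides), projecting a feature vector column onto a smaller number of principal components:

from pyspark.ml.feature import PCA
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame(
    [(Vectors.dense([2.0, 4.0, 1.0]),),
     (Vectors.dense([3.0, 6.0, 2.0]),),
     (Vectors.dense([5.0, 10.0, 3.0]),)], ["features"])

# project the 3 original (correlated) features onto 2 principal components
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
pca_model = pca.fit(df)
pca_model.transform(df).select("pca_features").show(truncate=False)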
Feature Construction
 SQLTransformer – SQL-based transformation
 Polynomial expansion – Expanding features into a polynomial space, which is formulated by an n-degree
combination of original dimensions. E.g. input x, n=3. Output x, x^2, x^3
 ElementwiseProduct – Multiplies each input vector by a provided “weight” vector, using element-wise
multiplication.
| © Copyright 2015 Hitachi Consulting40
Classification
| © Copyright 2015 Hitachi Consulting41
Spark ML Estimators
Classification algorithms
 Logistic regression
 Decision tree classifier
 Random forest classifier – Tree Ensemble
 Gradient-boosted tree classifier – Tree Ensemble
 Multilayer perceptron classifier – Artificial Neural Networks
 One-vs-Rest classifier – Uses binary classifiers for multiclass classification problems
 Naive Bayes – a probabilistic Bayesian model
| © Copyright 2015 Hitachi Consulting42
Spark ML Estimators
Classification template
from pyspark.ml.classification import *
# split the processed dataset to training and testing sets
(trainingData, testData) = dataset_processed.randomSplit([0.7, 0.3])
# initialize a classification algorithm
classification_algorithm = <classification algorithm>(labelCol="<Indexed Class Attribute>",
featuresCol="<Feature Vector Attribute>")
# create a classification model using the classification algorithm and the training set
classifier = classification_algorithm.fit(trainingData)
# make prediction using the constructed classifier and the test set (or new unseen dataset)
predictions = classifier.transform(testData)
# display actual vs predicted classes
predictions.select("prediction", "<Indexed Class Attribute>", "<Feature Vector Attribute>").show()
| © Copyright 2015 Hitachi Consulting43
Spark ML Estimators
Classification dataset
dataset_raw = sqlContext.sql("SELECT * FROM ds_car")
dataset_raw.show()
dataset_raw.printSchema()
stringIndexer = StringIndexer(inputCol="Class", outputCol="classIndex")
stringIndexer_model = stringIndexer.fit(dataset_raw)
dataset_processed = stringIndexer_model.transform(dataset_raw)
stringIndexer = StringIndexer(inputCol="buying", outputCol="buyingIndex")
stringIndexer_model = stringIndexer.fit(dataset_processed)
dataset_processed = stringIndexer_model.transform(dataset_processed)
stringIndexer = StringIndexer(inputCol="maint", outputCol="maintIndex")
stringIndexer_model = stringIndexer.fit(dataset_processed)
dataset_processed = stringIndexer_model.transform(dataset_processed)
stringIndexer = StringIndexer(inputCol="doors", outputCol="doorsIndex")
stringIndexer_model = stringIndexer.fit(dataset_processed)
dataset_processed = stringIndexer_model.transform(dataset_processed)
stringIndexer = StringIndexer(inputCol="persons", outputCol="personsIndex")
stringIndexer_model = stringIndexer.fit(dataset_processed)
dataset_processed = stringIndexer_model.transform(dataset_processed)
stringIndexer = StringIndexer(inputCol="lug_boot", outputCol="lug_bootIndex")
stringIndexer_model = stringIndexer.fit(dataset_processed)
dataset_processed = stringIndexer_model.transform(dataset_processed)
stringIndexer = StringIndexer(inputCol="safety", outputCol="safetyIndex")
stringIndexer_model = stringIndexer.fit(dataset_processed)
dataset_processed = stringIndexer_model.transform(dataset_processed)
assembler = VectorAssembler(inputCols=['buyingIndex','maintIndex','doorsIndex','personsIndex','lug_bootIndex','safetyIndex'], outputCol="features")
dataset_processed = assembler.transform(dataset_processed).select('features','classIndex')
| © Copyright 2015 Hitachi Consulting44
Spark ML Estimators
Classification example
from pyspark.ml.classification import *
# split the processed dataset to training and testing sets
(trainingData, testData) = dataset_processed.randomSplit([0.7, 0.3])
# initialize a classification algorithm
decisionTree_algorithm = DecisionTreeClassifier(labelCol="classIndex", featuresCol="features")
# create a classification model using the classification algorithm and the training sets
decisionTree_model = decisionTree_algorithm.fit(trainingData)
# make prediction using the constructed classifier and the test set (or new unseen dataset)
predictions = decisionTree_model.transform(testData)
# display actual vs predicted classes
predictions.select("prediction", "classIndex", "features").show()
| © Copyright 2015 Hitachi Consulting45
Regression
| © Copyright 2015 Hitachi Consulting46
Spark ML Estimators
Regression algorithms
 Linear regression
 Generalized linear regression (Gaussian, Binomial, Poisson, Gamma)
 Decision tree regression
 Random forest regression – Tree Ensemble
 Gradient-boosted tree regression – Tree Ensemble
 Survival regression - Accelerated failure time (AFT) model
 Isotonic regression – Fits a monotonically non-decreasing (order-preserving) function to the data
| © Copyright 2015 Hitachi Consulting47
Spark ML Estimators
Regression template
from pyspark.ml.regression import *
# split the processed dataset to training and testing sets
(trainingData, testData) = dataset_processed.randomSplit([0.7, 0.3])
# initialize a regression algorithm
regression_algorithm = <regression algorithm>(labelCol="<Numerical Target Attribute>",
featuresCol="<Features Vector Attribute>")
# create a regression model using the regression algorithm and the training sets
regression_model = regression_algorithm.fit(trainingData)
# make estimations using the constructed regression model and the test set (or new unseen dataset)
predictions = regression_model.transform(testData)
# display actual vs estimated values
predictions.select("prediction", "<Numerical Target Attribute >", "< Features Vector Attribute>").show()
| © Copyright 2015 Hitachi Consulting48
Spark ML Estimators
Regression dataset
dataset_raw = sqlContext.sql("SELECT * FROM ds_energy")
dataset_raw.show(5)
dataset_raw.printSchema()
from pyspark.mllib.linalg import *
from pyspark.ml.feature import *
assembler = VectorAssembler(inputCols=
['RelativeCompactness',
'SurfaceArea',
'WallArea',
'RoofArea',
'OverallHeight',
'Orientation',
'GlazingArea',
'GlazingAreaDistribution'], outputCol="features")
dataset_processed = assembler.transform(dataset_raw).select('features','HeatingLoad')
indexer = VectorIndexer(inputCol="features", outputCol="features_indexed", maxCategories=10)
indexerModel = indexer.fit(dataset_processed)
dataset_processed = indexerModel.transform(dataset_processed).select('features_indexed','HeatingLoad')
dataset_processed.show(truncate=False)
| © Copyright 2015 Hitachi Consulting49
Spark ML Estimators
Regression example
from pyspark.ml.regression import *
# split the processed dataset to training and testing sets
(trainingData, testData) = dataset_processed.randomSplit([0.7, 0.3])
# initialize a Generalized Linear Model (the label and features columns match the prepared dataset above)
GLM_algorithm = GeneralizedLinearRegression(labelCol="HeatingLoad", featuresCol="features_indexed",
                                            family="gaussian", link="identity", maxIter=10, regParam=0.3)
# create a regression model using the regression algorithm and the training sets
GLM_model = GLM_algorithm.fit(trainingData)
# make estimations using the constructed regression model and the test set (or new unseen dataset)
predictions = GLM_model.transform(testData)
# display actual vs estimated values
predictions.select("prediction", "HeatingLoad", "features_indexed").show()
| © Copyright 2015 Hitachi Consulting50
Spark ML Estimators
Regression example
print("Coefficients: " + str(GLM_model.coefficients))
print("Intercept: " + str(GLM_model.intercept))
# Summarize the GLM_model over the training set and print out some metrics
summary = GLM_model.summary
print("Coefficient Standard Errors: " + str(summary.coefficientStandardErrors))
print("T Values: " + str(summary.tValues))
print("P Values: " + str(summary.pValues))
print("Dispersion: " + str(summary.dispersion))
print("Null Deviance: " + str(summary.nullDeviance))
print("Residual Degree Of Freedom Null: " + str(summary.residualDegreeOfFreedomNull))
print("Deviance: " + str(summary.deviance))
print("Residual Degree Of Freedom: " + str(summary.residualDegreeOfFreedom))
print("AIC: " + str(summary.aic))
print("Deviance Residuals: ")
summary.residuals().show()
| © Copyright 2015 Hitachi Consulting51
Model Evaluation
| © Copyright 2015 Hitachi Consulting52
Model Evaluation
Classification model evaluation – Binary classification
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator(
    labelCol="classIndex", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
areaUnderROC = evaluator.evaluate(predictions)
metricName Parameter values
 areaUnderROC - Receiver Operating Characteristic
 areaUnderPR – Precision Recall Curve
| © Copyright 2015 Hitachi Consulting53
Model Evaluation
Classification model evaluation – Multiclass classification
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator(
labelCol="class_Indexed", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
metricName Parameter values
 "f1" (default)
 "accuracy"
 "precision"
 "recall"
 "weightedPrecision"
 "weightedRecall"
| © Copyright 2015 Hitachi Consulting54
Model Evaluation
Regression model evaluation
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(
labelCol="target", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(predictions)
metricName Parameter values
 "rmse" (default) – Root Means Square Error
 "mse" – Mean Square Error
 "r2"
 "mae" – Mean Absolut Error
| © Copyright 2015 Hitachi Consulting55
Model Selection and Parameter
Tuning
| © Copyright 2015 Hitachi Consulting56
Model Selection and Parameter Tuning
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

algorithm = <classification or regression algorithm>(<parameters>)

# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine the best model using the evaluator.
paramGrid = ParamGridBuilder() \
    .addGrid(algorithm.<parameter1>, [0.1, 0.01]) \
    .addGrid(algorithm.<parameter2>, [0.0, 0.5, 1.0]) \
    .build()

tvs = TrainValidationSplit(estimator=algorithm,
                           estimatorParamMaps=paramGrid,
                           evaluator=<Classification or Regression Evaluator>(),
                           trainRatio=0.8)

# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(trainingData)
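As a concrete sketch (assuming the classIndex/features columns and trainingData split from the earlier classification example), a sweep over two decision-tree parameters might look like:

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

dt = DecisionTreeClassifier(labelCol="classIndex", featuresCol="features")

paramGrid = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [3, 5, 10]) \
    .addGrid(dt.maxBins, [16, 32]) \
    .build()

tvs = TrainValidationSplit(estimator=dt,
                           estimatorParamMaps=paramGrid,
                           evaluator=MulticlassClassificationEvaluator(labelCol="classIndex"),
                           trainRatio=0.8)

best = tvs.fit(trainingData)
print best.bestModel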
| © Copyright 2015 Hitachi Consulting57
Spark ML Pipelines
| © Copyright 2015 Hitachi Consulting58
Spark ML Pipeline
Pipeline example
from pyspark.ml import Pipeline
#data preparation (e.g., VectorAssembler, VectorIndexer, etc.)
transformer1 = …
transformer2 = …
transformer3 = …
#Model algorithm (e.g. DecisionTreeClassifier)
model_algorithm = …
# Pipeline which applies the transformations and the model-building algorithm to the dataset
pipeline = Pipeline(stages=[transformer1, transformer2, transformer3, model_algorithm])
model = pipeline.fit(training)
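A concrete sketch (assumed, reusing the car dataset and column names from the classification example) might look like:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# index the class and one input attribute, assemble the features, and train a tree
classIndexer = StringIndexer(inputCol="Class", outputCol="classIndex")
buyingIndexer = StringIndexer(inputCol="buying", outputCol="buyingIndex")
assembler = VectorAssembler(inputCols=["buyingIndex"], outputCol="features")
decisionTree = DecisionTreeClassifier(labelCol="classIndex", featuresCol="features")

pipeline = Pipeline(stages=[classIndexer, buyingIndexer, assembler, decisionTree])

model = pipeline.fit(dataset_raw)            # fits/transforms all stages in order
predictions = model.transform(dataset_raw)
predictions.select("prediction", "classIndex").show(5)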
| © Copyright 2015 Hitachi Consulting59
Frequent Pattern Mining
| © Copyright 2015 Hitachi Consulting60
Frequent Pattern Mining
Association rule discovery example
from pyspark.mllib.fpm import *
data = [['a','b','c'],['a','b'],['a','d'],['b','c'],['a','b','d']]
transactions = sc.parallelize(data)
# retrieve the patterns that have at least 40% support in the dataset (transactions)
model = FPGrowth.train(transactions, minSupport=0.4)
result = model.freqItemsets().collect()

# retrieve the patterns that occurred 4 times or more
model.freqItemsets().filter(lambda itemset: itemset.freq >= 4).collect()

# retrieve the patterns that contain the item 'c'
model.freqItemsets().filter(lambda itemset: 'c' in itemset.items).collect()

# retrieve the patterns that have 2 items or more
model.freqItemsets().filter(lambda itemset: len(itemset.items) >= 2).collect()
| © Copyright 2015 Hitachi Consulting61
Clustering
| © Copyright 2015 Hitachi Consulting62
Clustering
Data clustering algorithms
 K-means – Spherical, centroid-based, non-parametric, partitioning, non-overlapping
 Latent Dirichlet allocation (LDA) – Probabilistic, usually used with Topic Models
 Power iteration clustering (PIC) – Graph Clustering
 Bisecting k-means – Hierarchical (Agglomerative & Divisive)
 Gaussian Mixture Model (GMM) – Expectation Maximization (EM) algorithm.
Probabilistic, parametric, overlapping
| © Copyright 2015 Hitachi Consulting63
Clustering
Data clustering example
from pyspark.mllib.clustering import *
from math import *
data = [[1,2],[5,3],[3,4],[35,20],[25,10],[30,15]]
dataPoints = sc.parallelize(data)
# create 2 clusters using k-means. The initial centroids can be chosen at random or using the k-means|| technique
clusters = KMeans.train(dataPoints, k=2, maxIterations=10, initializationMode=<'random' | 'k-means||'>)

# assign data points to clusters
predictions = dataPoints.map(lambda point: clusters.predict(point))
assignments = dataPoints.zip(predictions)

# compute the total error of the clustering (sum of the distances of each point to its cluster centre)
totalError = assignments.map(lambda (point, cluster): sqrt(sum([e**2 for e in (point -
    clusters.centers[cluster])]))).reduce(lambda error1, error2: error1 + error2)
| © Copyright 2015 Hitachi Consulting64
My Background
Applying Computational Intelligence in Data Mining
• Honorary Research Fellow, School of Computing, University of Kent.
• Ph.D. Computer Science, University of Kent, Canterbury, UK.
• M.Sc. Computer Science, The American University in Cairo, Egypt.
• 25+ published journal and conference papers, focusing on:
– classification rules induction,
– decision trees construction,
– Bayesian classification modelling,
– data reduction,
– instance-based learning,
– evolving neural networks, and
– data clustering
• Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing.
• Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio,
ECTA, IEEE WCCI and INNS-BigData.
ResearchGate.org
| © Copyright 2015 Hitachi Consulting65
Thank you!

More Related Content

What's hot

Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material Bryan Yang
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesDatabricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Databricks
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streamingdatamantra
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
Machine Learning with Spark MLlib
Machine Learning with Spark MLlibMachine Learning with Spark MLlib
Machine Learning with Spark MLlibTodd McGrath
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterDatabricks
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With SparkEdureka!
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph DatabasesMax De Marzi
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Spark Summit
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...Databricks
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 

What's hot (20)

Apache spark
Apache sparkApache spark
Apache spark
 
Spark MLlib - Training Material
Spark MLlib - Training Material Spark MLlib - Training Material
Spark MLlib - Training Material
 
Spark DataFrames and ML Pipelines
Spark DataFrames and ML PipelinesSpark DataFrames and ML Pipelines
Spark DataFrames and ML Pipelines
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2 Cost-Based Optimizer in Apache Spark 2.2
Cost-Based Optimizer in Apache Spark 2.2
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
Machine Learning with Spark MLlib
Machine Learning with Spark MLlibMachine Learning with Spark MLlib
Machine Learning with Spark MLlib
 
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLabApache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
Apache Spark - Basics of RDD | Big Data Hadoop Spark Tutorial | CloudxLab
 
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim HunterWeb-Scale Graph Analytics with Apache Spark with Tim Hunter
Web-Scale Graph Analytics with Apache Spark with Tim Hunter
 
Big Data Processing With Spark
Big Data Processing With SparkBig Data Processing With Spark
Big Data Processing With Spark
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Introduction to Graph Databases
Introduction to Graph DatabasesIntroduction to Graph Databases
Introduction to Graph Databases
 
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
Large-Scale Text Processing Pipeline with Spark ML and GraphFrames: Spark Sum...
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
 
Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 

Similar to Machine learning with Spark

Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFMLconf
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 
Introducing new AIOps innovations in Oracle 19c - San Jose AICUG
Introducing new AIOps innovations in Oracle 19c - San Jose AICUGIntroducing new AIOps innovations in Oracle 19c - San Jose AICUG
Introducing new AIOps innovations in Oracle 19c - San Jose AICUGSandesh Rao
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSyed Hadoop
 
Tools Used For Data Modeling And Predictive Analysis
Tools Used For Data Modeling And Predictive AnalysisTools Used For Data Modeling And Predictive Analysis
Tools Used For Data Modeling And Predictive AnalysisMichelle Meienburg
 
Microsoft R - ScaleR Overview
Microsoft R - ScaleR OverviewMicrosoft R - ScaleR Overview
Microsoft R - ScaleR OverviewKhalid Salama
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapDr. Mirko Kämpf
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache SparkDatabricks
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnBenjamin Bengfort
 
Data meets AI - AICUG - Santa Clara
Data meets AI  - AICUG - Santa ClaraData meets AI  - AICUG - Santa Clara
Data meets AI - AICUG - Santa ClaraSandesh Rao
 
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...Khalid Salama
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingAnimesh Chaturvedi
 
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...Karen Thompson
 
Machine Learning on the Microsoft Stack
Machine Learning on the Microsoft StackMachine Learning on the Microsoft Stack
Machine Learning on the Microsoft StackLynn Langit
 
Sparkr sigmod
Sparkr sigmodSparkr sigmod
Sparkr sigmodwaqasm86
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-Systeminside-BigData.com
 
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine Learning
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine LearningAUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine Learning
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine LearningSandesh Rao
 

Similar to Machine learning with Spark (20)

Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SFTed Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
Ted Willke, Senior Principal Engineer & GM, Datacenter Group, Intel at MLconf SF
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Introducing new AIOps innovations in Oracle 19c - San Jose AICUG
Introducing new AIOps innovations in Oracle 19c - San Jose AICUGIntroducing new AIOps innovations in Oracle 19c - San Jose AICUG
Introducing new AIOps innovations in Oracle 19c - San Jose AICUG
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
 
Tools Used For Data Modeling And Predictive Analysis
Tools Used For Data Modeling And Predictive AnalysisTools Used For Data Modeling And Predictive Analysis
Tools Used For Data Modeling And Predictive Analysis
 
Microsoft R - ScaleR Overview
Microsoft R - ScaleR OverviewMicrosoft R - ScaleR Overview
Microsoft R - ScaleR Overview
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
Big Data SE vs. SE for Big Data
Big Data SE vs. SE for Big DataBig Data SE vs. SE for Big Data
Big Data SE vs. SE for Big Data
 
DASK and Apache Spark
DASK and Apache SparkDASK and Apache Spark
DASK and Apache Spark
 
Introduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-LearnIntroduction to Machine Learning with SciKit-Learn
Introduction to Machine Learning with SciKit-Learn
 
Data meets AI - AICUG - Santa Clara
Data meets AI  - AICUG - Santa ClaraData meets AI  - AICUG - Santa Clara
Data meets AI - AICUG - Santa Clara
 
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
Operational Machine Learning: Using Microsoft Technologies for Applied Data S...
 
Big Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computingBig Data Analytics and Ubiquitous computing
Big Data Analytics and Ubiquitous computing
 
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
 
Machine Learning on the Microsoft Stack
Machine Learning on the Microsoft StackMachine Learning on the Microsoft Stack
Machine Learning on the Microsoft Stack
 
IT webinar 2016
IT webinar 2016IT webinar 2016
IT webinar 2016
 
Sparkr sigmod
Sparkr sigmodSparkr sigmod
Sparkr sigmod
 
The Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-SystemThe Analytics Frontier of the Hadoop Eco-System
The Analytics Frontier of the Hadoop Eco-System
 
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine Learning
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine LearningAUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine Learning
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine Learning
 

More from Khalid Salama

Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
 
Microservices, DevOps, and Continuous Delivery
Microservices, DevOps, and Continuous DeliveryMicroservices, DevOps, and Continuous Delivery
Microservices, DevOps, and Continuous DeliveryKhalid Salama
 
Spark with HDInsight
Spark with HDInsightSpark with HDInsight
Spark with HDInsightKhalid Salama
 
Enterprise Cloud Data Platforms - with Microsoft Azure
Enterprise Cloud Data Platforms - with Microsoft AzureEnterprise Cloud Data Platforms - with Microsoft Azure
Enterprise Cloud Data Platforms - with Microsoft AzureKhalid Salama
 
Microsoft Azure Batch
Microsoft Azure BatchMicrosoft Azure Batch
Microsoft Azure BatchKhalid Salama
 
NoSQL with Microsoft Azure
NoSQL with Microsoft AzureNoSQL with Microsoft Azure
NoSQL with Microsoft AzureKhalid Salama
 
Real-Time Event & Stream Processing on MS Azure
Real-Time Event & Stream Processing on MS AzureReal-Time Event & Stream Processing on MS Azure
Real-Time Event & Stream Processing on MS AzureKhalid Salama
 
Intorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft AzureIntorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft AzureKhalid Salama
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!Khalid Salama
 

More from Khalid Salama (11)

Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
Microservices, DevOps, and Continuous Delivery
Microservices, DevOps, and Continuous DeliveryMicroservices, DevOps, and Continuous Delivery
Microservices, DevOps, and Continuous Delivery
 
Graph Analytics
Graph AnalyticsGraph Analytics
Graph Analytics
 
Spark with HDInsight
Spark with HDInsightSpark with HDInsight
Spark with HDInsight
 
Enterprise Cloud Data Platforms - with Microsoft Azure
Enterprise Cloud Data Platforms - with Microsoft AzureEnterprise Cloud Data Platforms - with Microsoft Azure
Enterprise Cloud Data Platforms - with Microsoft Azure
 
Microsoft Azure Batch
Microsoft Azure BatchMicrosoft Azure Batch
Microsoft Azure Batch
 
NoSQL with Microsoft Azure
NoSQL with Microsoft AzureNoSQL with Microsoft Azure
NoSQL with Microsoft Azure
 
Hive with HDInsight
Hive with HDInsightHive with HDInsight
Hive with HDInsight
 
Real-Time Event & Stream Processing on MS Azure
Real-Time Event & Stream Processing on MS AzureReal-Time Event & Stream Processing on MS Azure
Real-Time Event & Stream Processing on MS Azure
 
Intorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft AzureIntorducing Big Data and Microsoft Azure
Intorducing Big Data and Microsoft Azure
 
Data Mining - The Big Picture!
Data Mining - The Big Picture!Data Mining - The Big Picture!
Data Mining - The Big Picture!
 

Recently uploaded

Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Cash Is Still King: ATM market research '2023
Cash Is Still King: ATM market research '2023Vladislav Solodkiy
 
SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024SFBA Splunk Usergroup meeting March 13, 2024
SFBA Splunk Usergroup meeting March 13, 2024Becky Burwell
 
Virtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product IntroductionVirtuosoft SmartSync Product Introduction
Virtuosoft SmartSync Product Introductionsanjaymuralee1
 
MEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptMEASURES OF DISPERSION I BSc Botany .ppt
MEASURES OF DISPERSION I BSc Botany .pptaigil2
 
AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)AI for Sustainable Development Goals (SDGs)
AI for Sustainable Development Goals (SDGs)Data & Analytics Magazin
 
Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...Elements of language learning - an analysis of how different elements of lang...
Elements of language learning - an analysis of how different elements of lang...PrithaVashisht1
 
CI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionCI, CD -Tools to integrate without manual intervention
CI, CD -Tools to integrate without manual interventionajayrajaganeshkayala
 
How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?How is Real-Time Analytics Different from Traditional OLAP?
How is Real-Time Analytics Different from Traditional OLAP?sonikadigital1
 
Persuasive E-commerce, Our Biased Brain @ Bikkeldag 2024
Machine learning with Spark

  • 8. Data Mining Implementation
Overall procedure:
 Load the dataset.
 Explore the data (statistics, cardinality, variable types, correlations, dependencies, etc.).
 Apply data transformations and pre-processing (type conversions, feature selection/extraction/construction/reduction, handling outliers and missing values, scaling, etc.).
 Split the data into a training set and a test set (cross validation).
 Train a model.
 Evaluate and tune the model.
 Interpret the model and communicate results.
 Productionize the model.
Always build multiple models:
 Using different approaches.
 Using different algorithms.
 Using different parameters (parameter sweeping).
 Using different dataset representations.
Empirical evaluation drives model selection.
  • 9. Spark Core Concepts
  • 10. What is Spark?
The lightning-fast big data processing engine
 General-purpose big data processing
 Integrates with HDFS
 Graph processing
 Stream processing
 Machine learning libraries
 In-memory (fast) iterative processing
 Interactive queries (SQL)
 APIs in Scala, Python, Java, R and .NET
  • 11. Spark Components
Spark and the zoo…
[Diagram: the Spark stack running over HDFS (NameNode + DataNodes) with YARN, Yet Another Resource Negotiator, managing the cluster]
 Spark Core Engine (RDDs: Resilient Distributed Datasets)
 Spark SQL (structured data)
 Spark Streaming (real-time)
 MLlib (machine learning)
 GraphX (graph processing)
 Language bindings: Scala, Java, Python, R, .NET (Mobius)
  • 12. Spark Core Concepts
 Resilient Distributed Datasets (RDDs)
 Transformations
 Actions
 Key/Value (Pair) RDDs
 Persisting & Removing RDDs
 Per-Partition Operations
 Accumulators & Broadcast Variables
  • 13. Spark SQL
DataFrames
 A distributed collection of data organized into named columns
 Conceptually equivalent to a table in a relational database or a data frame in R/Python, with richer Spark optimizations
 Can be constructed from RDDs, structured data files, Hive tables, and RDBMS sources
  • 14. Introducing Machine Learning with Spark
  • 15. Machine Learning with Spark
Spark and parallel machine learning
Running multiple learning algorithms, or multiple parameter configurations, in parallel:
 Conventional (single-machine) ML libraries can be used (Weka, scikit-learn, caret, etc.)
 Simple, as it needs no synchronization
 Less efficient with large datasets, as the whole dataset needs to be loaded on each worker node
Parallelizing the execution of the learning algorithm on a given dataset:
 The learning algorithm is specifically designed to run in parallel on a partitioned dataset – Spark ML algorithms
 Efficient with large datasets; the dataset is partitioned across multiple worker nodes
 Involves synchronization and data shuffling
 Approximation methods might be used to reduce synchronization and data shuffling, which may affect model quality
  • 16. Machine Learning with Spark
Spark machine learning libraries
Spark MLlib
 Provides base data types (vectors, matrices, etc.) for implementing your own ML algorithms
 Provides utilities to perform related ML functions (statistics, sampling, random data generation)
 Provides algorithm classes that operate on RDDs
Spark ML
 Provides a uniform set of high-level APIs for ML operations (transform, model, evaluate)
 Provides algorithm classes that operate on DataFrames
 Provides a framework for creating practical ML pipelines
We will introduce the base data types in Spark MLlib, but focus on the algorithms and transformations in Spark ML.
  • 17. Spark MLlib Data Types
  • 18. Spark MLlib Data Types
Base data types for implementing ML algorithms:
 Local Vectors (dense, sparse)
 Local Matrices (dense, sparse)
 Distributed Matrices: RowMatrix, IndexedRowMatrix, CoordinateMatrix
 LabeledPoint
 Rating
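Of these types, only Rating is not demonstrated in the slides that follow. As a minimal illustrative sketch (not part of the original deck), a Rating ties a user id and an item id to a rating value, and an RDD of Ratings is the expected input of MLlib's ALS recommender:

from pyspark.mllib.recommendation import Rating

# a single user/item/rating triple; an RDD of these is what ALS.train() expects
r = Rating(user=1, product=10, rating=4.0)
print(r.user)
print(r.product)
print(r.rating)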
  • 19. Spark MLlib Data Types
Vectors
 A vector represents the input feature values of a dataset example.
 Feature values in a vector have to be numeric.
 A collection of vectors represents a dataset, which can be parallelized as an RDD.
 A DenseVector stores the values of all the features, while a SparseVector only stores the values of the nonzero features (efficient with sparse datasets, e.g. text document vectors).

from pyspark.mllib.linalg import Vectors

v1 = Vectors.dense([1.1, -2, 0.3, 45, -3.5])

# create a vector of 5 features, where only features 1 and 3 have values 10 and 20, respectively; the other features are zeros
v2 = Vectors.sparse(5, [1, 3], [10, 20])

for i in range(0, 5):
    print "feature " + str(i) + ": " + str(v2[i])
  • 20. Spark MLlib Data Types
LabeledPoints
 Represents a labelled example in a supervised dataset.
 Used in datasets for classification and regression learning.
 Consists of a numerical label (class/target) and a features vector.
 Categorical class values need a numerical representation (i.e., a class value index).

from pyspark.mllib.regression import LabeledPoint

example = LabeledPoint(1, [34, 56, 76])
print "input features: " + str(example.features) + " class index: " + str(example.label)
  • 21. Spark MLlib Data Types
Matrices
 A widely-used data structure in linear algebra.
 Spark MLlib supports dense and sparse, local and distributed matrices.

from pyspark.mllib.linalg import *

# create a local dense matrix with 3 rows and 2 columns (values given in column-major order)
denseMatrix_local = Matrices.dense(3, 2, [1, 2, 3, 4, 5, 6])

from pyspark.mllib.linalg.distributed import *

v1 = Vectors.sparse(3, [1], [10])
v2 = Vectors.sparse(3, [0, 2], [5, 15])
v3 = Vectors.sparse(3, [0, 1, 2], [10, 20, 30])

# each vector is a row in the distributed matrix
sparseMatrix_distributed = RowMatrix(sc.parallelize([v1, v2, v3]))

print "rows:" + str(sparseMatrix_distributed.numRows()) + " - columns:" + str(sparseMatrix_distributed.numCols())
print sparseMatrix_distributed.rows.collect()

 Other distributed matrices: IndexedRowMatrix, CoordinateMatrix, and BlockMatrix.
  • 22. Statistics, Sampling, and Random Data Generation
  • 23. Spark MLlib Utilities
Statistics

from pyspark.mllib.stat import *

 Statistics.colStats(dataset) – column summary statistics: count, max, min, mean, variance, numNonzeros
 Statistics.corr(rdd1, rdd2, method=<"pearson" | "spearman">) – correlation value between two equal-sized collections of numerical values (in two RDDs)
 Statistics.corr(dataset, method) – returns a correlation matrix between each pair of features in an RDD of feature vectors
 Statistics.chiSqTest(dataset) – performs an independence test between each input feature and the target label in a given dataset of LabeledPoints
 dataset.sampleByKey(withReplacement=<True|False>, fractions) – can be used with a dataset of LabeledPoints to perform stratified sampling (i.e., fetch a sample that preserves the class value distribution of the original dataset). The dataset is of key/value pairs, where the key is the label and the value is the LabeledPoint
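A minimal sketch of these utilities in action (not from the original slides), assuming an existing SparkContext sc and a small made-up dataset:

from pyspark.mllib.stat import Statistics
from pyspark.mllib.linalg import Vectors

# hypothetical dataset of feature vectors, just for illustration
data = sc.parallelize([
    Vectors.dense([1.0, 10.0]),
    Vectors.dense([2.0, 22.0]),
    Vectors.dense([3.0, 29.0]),
    Vectors.dense([4.0, 41.0])
])

# column summary statistics
summary = Statistics.colStats(data)
print(summary.mean())
print(summary.variance())
print(summary.numNonzeros())

# correlation between two RDDs of numbers
x = sc.parallelize([1.0, 2.0, 3.0, 4.0])
y = sc.parallelize([10.0, 22.0, 29.0, 41.0])
print(Statistics.corr(x, y, method="pearson"))

# correlation matrix between every pair of features
print(Statistics.corr(data, method="spearman"))

# stratified sampling on a key/value RDD: keep roughly 50% of each class (the keys)
pairs = sc.parallelize([(0, "a"), (0, "b"), (1, "c"), (1, "d")])
sample = pairs.sampleByKey(False, {0: 0.5, 1: 0.5})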
  • 24. Spark MLlib Utilities
Random data generation

import math
from pyspark.mllib.random import *

# generate 100 i.i.d. numbers drawn from the standard normal distribution N(0, 1)
data = RandomRDDs.normalRDD(sc, 100)

# shift and scale the generated numbers so that they follow the normal distribution N(1, 4)
mean = 1
variance = 4
data_new = data.map(lambda number: mean + (math.sqrt(variance) * number))
  • 25. Spark MLlib Utilities
Kernel Density Estimation
Kernel density estimation is a non-parametric method for estimating an empirical probability density without requiring assumptions about the particular distribution that the observed samples are drawn from.

from pyspark.mllib.random import *
from pyspark.mllib.stat import *

data = RandomRDDs.normalRDD(sc, 100000)

kde = KernelDensity()
kde.setSample(data)
kde.setBandwidth(0.1)

densities = kde.estimate([0.0, 1.0, 2.0])
densities

Even without assuming that the sample follows a normal distribution with mean = 0 and stdev = 1, the kernel density estimator approximates the densities at 0, 1, and 2 from the sample. The technique is more useful with data that does not follow a known probability density function.
  • 26. Spark ML Overview
  • 27. Spark ML Overview
A uniform set of high-level ML APIs
Spark ML standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a single pipeline, or workflow:
 Transformers – used for data pre-processing. Input: DataFrame. Output: DataFrame
 Estimators – ML algorithms used to build predictive models. Input: DataFrame (with case index, features, and class columns). Output: Model
 Parameters – configurations for Transformers and Estimators
 Pipeline – chains Transformers and Estimators
[Diagram: ML pipeline – Dataset (DataFrame) → Transformers (pre-processing) → Estimator (ML learning algorithm) → Model → Evaluation, with Parameters feeding the stages]
  • 28. Spark ML Transformers
  • 29. Spark ML Transformers
Overview
Text Feature Extraction
 TF-IDF (HashingTF and IDF)
 Word2Vec
 CountVectorizer
 Tokenizer
 StopWordsRemover
 n-gram
Features Vector Preparation
 VectorAssembler
 VectorIndexer
 StringIndexer
 IndexToString
Feature Selection
 VectorSlicer
 RFormula
 ChiSqSelector
Feature Type Conversion (Continuous → Discrete)
 Binarizer
 Discrete Cosine Transform (DCT)
 OneHotEncoder
 Bucketizer
 QuantileDiscretizer
Feature Scaling
 Normalizer
 StandardScaler
 MaxAbsScaler
 MinMaxScaler
Feature Construction
 SQLTransformer
 ElementwiseProduct
 PolynomialExpansion
Dimensionality Reduction
 PCA
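The text feature extraction transformers are not demonstrated in the slides that follow. As an illustrative sketch (not from the original deck), assuming an existing sqlContext and a small made-up text DataFrame, TF-IDF features can be produced by chaining Tokenizer, HashingTF and IDF:

from pyspark.ml.feature import Tokenizer, HashingTF, IDF

# hypothetical mini text dataset, just to illustrate the TF-IDF transformers
docs = sqlContext.createDataFrame([
    (0, "spark makes big data processing simple"),
    (1, "machine learning with spark ml pipelines")
], ["id", "text"])

# split each document into words
tokenizer = Tokenizer(inputCol="text", outputCol="words")
words = tokenizer.transform(docs)

# hash the words into a fixed-size term-frequency vector
hashingTF = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=32)
tf = hashingTF.transform(words)

# IDF is an Estimator: it is fitted on the corpus, then rescales the term frequencies
idf = IDF(inputCol="rawFeatures", outputCol="features")
idfModel = idf.fit(tf)
tfidf = idfModel.transform(tf)
tfidf.select("id", "features").show(truncate=False)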
  • 30. Spark ML Transformers
Preparing a dataset for a Spark ML pipeline
 A dataset is received with a mix of nominal and numerical attributes.
 The target variable can be either categorical or numerical as well.
 All the dataset attributes (input and target) have to be numeric.
 For nominal attributes, numerical indexes should be used instead of the actual "textual" values.
 The training dataset should have three attributes:
  CaseIndex: a unique identifier of a training instance
  Features: a Spark MLlib DenseVector or SparseVector to represent all the input variables
  Target: a numerical attribute to represent the class or the response variable in classification or regression problems, respectively
  • 31. Spark ML Transformers
Features vector preparation

from pyspark.mllib.linalg import *
from pyspark.ml.feature import *

dataframe = sqlContext.createDataFrame(
  [
    (1, 34.45, 20, 'M', 'Y'),
    (2, 23.67, 40, 'M', 'Y'),
    (3, 78.23, 20, 'M', 'Y'),
    (4, 37.48, 40, 'L', 'Y'),
    (5, 48.32, 20, 'S', 'N'),
    (6, 67.45, 40, 'S', 'N')
  ], ['caseId', 'score', 'category', 'level', 'label'])

dataframe.show()
dataframe.printSchema()
  • 32. Spark ML Transformers
Features vector preparation
StringIndexer – encodes a string column of labels to a column of label indices.

stringIndexer = StringIndexer(inputCol="label", outputCol="labelIndex")
model = stringIndexer.fit(dataframe)
dataframe_transformed = model.transform(dataframe).select('caseId', 'score', 'category', 'level', 'labelIndex')
dataframe_transformed.show()

We need to do the same for the "level" input variable:

stringIndexer = StringIndexer(inputCol="level", outputCol="levelIndex")
model = stringIndexer.fit(dataframe_transformed)
dataframe_transformed = model.transform(dataframe_transformed).select('caseId', 'score', 'category', 'levelIndex', 'labelIndex')
dataframe_transformed.show()

The original textual values can be recovered using the IndexToString transformer.
  • 33. Spark ML Transformers
Features vector preparation
VectorAssembler – combines a given list of columns into a single vector attribute, representing the input feature set of the training data.

assembler = VectorAssembler(inputCols=["score", "category", "levelIndex"], outputCol="features")
dataframe_transformed = assembler.transform(dataframe_transformed).select('caseId', 'features', 'labelIndex')
dataframe_transformed.show()

Note that the "category" attribute is treated as a numerical attribute. If it needs to be treated as a nominal attribute, we can either use StringIndexer, or let VectorIndexer decide whether to treat it as nominal or numerical based on the maxCategories parameter: if the number of distinct values is less than or equal to maxCategories, the attribute is indexed and treated as nominal.

indexer = VectorIndexer(inputCol="features", outputCol="features_indexed", maxCategories=3)
indexerModel = indexer.fit(dataframe_transformed)
dataframe_transformed = indexerModel.transform(dataframe_transformed).select('caseId', 'features_indexed', 'labelIndex')
dataframe_transformed.show()
  • 34. Spark ML Transformers
Feature type conversion
Binarizer – converts a numerical attribute to a binary (0/1) attribute by thresholding.

binarizer = Binarizer(threshold=40.0, inputCol="score", outputCol="score_binarized")
dataframe_transformed = binarizer.transform(dataframe_transformed)
dataframe_transformed.select('score_binarized', 'category', 'levelIndex', 'labelIndex').show()
  • 35. Spark ML Transformers
Feature type conversion
OneHotEncoder – maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms that expect continuous features, such as logistic regression, to use categorical features. The resulting vector is a SparseVector.

stringIndexer = StringIndexer(inputCol="label", outputCol="labelIndex")
dataframe_transformed = stringIndexer.fit(dataframe).transform(dataframe)

# index the nominal "level" attribute before one-hot encoding it
stringIndexer = StringIndexer(inputCol="level", outputCol="levelIndex")
dataframe_transformed = stringIndexer.fit(dataframe_transformed).transform(dataframe_transformed)

encoder = OneHotEncoder(dropLast=False, inputCol="levelIndex", outputCol="levelFlags")
dataframe_transformed = encoder.transform(dataframe_transformed)
dataframe_transformed.select('score', 'category', 'levelFlags', 'labelIndex').show()
  • 36. Spark ML Transformers
Feature type conversion
QuantileDiscretizer – converts a numerical attribute to a binned categorical attribute, using the quantiles of the numerical attribute.
Bucketizer – converts a numerical attribute to a nominal attribute by assigning numerical values into pre-defined buckets (ranges).

splits = [-float("inf"), 25.0, 35.0, 50.0, float("inf")]
bucketizer = Bucketizer(splits=splits, inputCol="score", outputCol="score_buckets")
dataframe_transformed = bucketizer.transform(dataframe_transformed)
dataframe_transformed.select('score', 'score_buckets', 'category', 'levelIndex', 'labelIndex').show()
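Only the Bucketizer is shown above. As a minimal sketch (not from the original deck), the QuantileDiscretizer is an Estimator, so it is fitted first; this assumes the same dataframe_transformed with its numerical "score" column and a Spark version that includes QuantileDiscretizer:

from pyspark.ml.feature import QuantileDiscretizer

# learn approximate quantile boundaries for "score" and bin the values into 3 buckets
discretizer = QuantileDiscretizer(numBuckets=3, inputCol="score", outputCol="score_quantiles")
dataframe_transformed = discretizer.fit(dataframe_transformed).transform(dataframe_transformed)
dataframe_transformed.select('score', 'score_quantiles').show()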
  • 37. Spark ML Transformers
Feature scaling
Scaling is a very important pre-processing step when similarity/distance measures are involved. This includes clustering and instance-based learning.

dataframe = sqlContext.createDataFrame(
  [
    (-0.4, 40, 4),
    (-0.3, 30, 3),
    (-0.5, 50, 5),
    (-0.7, 70, 7),
    (-0.2, 20, 2),
    (-0.1, 10, 1)
  ], ['var1', 'var2', 'var3'])

assembler = VectorAssembler(inputCols=["var1", "var2", "var3"], outputCol="features")
dataframe = assembler.transform(dataframe).select('features')
  • 38. Spark ML Transformers
Feature scaling

scaler = StandardScaler(inputCol="features", outputCol="features_scaled1", withStd=True, withMean=False)
scalerModel = scaler.fit(dataframe)
scaledData = scalerModel.transform(dataframe)
scaledData.show(truncate=False)

scaler = MinMaxScaler(inputCol="features", outputCol="features_scaled2", min=0.0, max=1.0)
scalerModel = scaler.fit(dataframe)
scaledData = scalerModel.transform(dataframe)
scaledData.show(truncate=False)
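The Normalizer and MaxAbsScaler listed earlier are not shown above. A minimal sketch (not from the original deck), assuming the same dataframe with its assembled "features" column and a Spark version that includes MaxAbsScaler:

from pyspark.ml.feature import Normalizer, MaxAbsScaler

# Normalizer is a plain Transformer (no fit step): it rescales each feature vector
# to unit norm (here the L2 norm), which suits cosine-style similarity measures
normalizer = Normalizer(p=2.0, inputCol="features", outputCol="features_normalized")
normalizer.transform(dataframe).show(truncate=False)

# MaxAbsScaler is an Estimator: it divides each feature by its maximum absolute
# value, scaling features into [-1, 1] without destroying sparsity
maxAbs = MaxAbsScaler(inputCol="features", outputCol="features_scaled3")
maxAbs.fit(dataframe).transform(dataframe).show(truncate=False)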
  • 39. Spark ML Transformers
Others
Feature Selection (VectorSlicer, RFormula, ChiSqSelector) – supervised methods to select the features that best predict/estimate the target variable.
Dimensionality Reduction (Principal Component Analysis) – converts a set of instances of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. Input: many correlated features. Output: a few uncorrelated features.
Feature Construction:
 SQLTransformer – SQL-based transformation
 PolynomialExpansion – expands features into a polynomial space, formulated by an n-degree combination of the original dimensions. E.g. input x with degree 3 gives x, x^2, x^3
 ElementwiseProduct – multiplies each input vector by a provided "weight" vector, using element-wise multiplication
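As an illustrative sketch of PCA (not from the original deck), assuming an existing sqlContext; note that on Spark 2.x the vectors would come from pyspark.ml.linalg rather than pyspark.mllib.linalg used elsewhere in these slides:

from pyspark.ml.feature import PCA
from pyspark.mllib.linalg import Vectors

# hypothetical 3-dimensional feature vectors, just to illustrate PCA
df = sqlContext.createDataFrame([
    (Vectors.dense([2.0, 1.0, 0.5]),),
    (Vectors.dense([4.0, 2.1, 1.0]),),
    (Vectors.dense([6.0, 2.9, 1.4]),)
], ["features"])

# project the correlated features onto the top 2 principal components
pca = PCA(k=2, inputCol="features", outputCol="pca_features")
pcaModel = pca.fit(df)
pcaModel.transform(df).select("pca_features").show(truncate=False)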
  • 40. Classification
  • 41. Spark ML Estimators
Classification algorithms
 Logistic regression
 Decision tree classifier
 Random forest classifier – tree ensemble
 Gradient-boosted tree classifier – tree ensemble
 Multilayer perceptron classifier – artificial neural network
 One-vs-Rest classifier – uses binary classifiers for multiclass classification problems
 Naive Bayes – a probabilistic Bayesian model
  • 42. Spark ML Estimators
Classification template

from pyspark.ml.classification import *

# split the processed dataset into training and test sets
(trainingData, testData) = dataset_processed.randomSplit([0.7, 0.3])

# initialize a classification algorithm
classification_algorithm = <classification algorithm>(labelCol="<indexed class attribute>", featuresCol="<features vector attribute>")

# build a classification model using the classification algorithm and the training set
classifier = classification_algorithm.fit(trainingData)

# make predictions using the constructed classifier and the test set (or a new, unseen dataset)
predictions = classifier.transform(testData)

# display actual vs. predicted classes
predictions.select("prediction", "<indexed class attribute>", "<features vector attribute>").show()
  • 43. Spark ML Estimators
Classification dataset

dataset_raw = sqlContext.sql("SELECT * FROM ds_car")
dataset_raw.show()
dataset_raw.printSchema()

stringIndexer = StringIndexer(inputCol="Class", outputCol="classIndex")
stringIndexer_model = stringIndexer.fit(dataset_raw)
dataset_processed = stringIndexer_model.transform(dataset_raw)

stringIndexer = StringIndexer(inputCol="buying", outputCol="buyingIndex")
stringIndexer_model = stringIndexer.fit(dataset_processed)
dataset_processed = stringIndexer_model.transform(dataset_processed)

stringIndexer = StringIndexer(inputCol="maint", outputCol="maintIndex")
stringIndexer_model = stringIndexer.fit(dataset_processed)
dataset_processed = stringIndexer_model.transform(dataset_processed)

stringIndexer = StringIndexer(inputCol="doors", outputCol="doorsIndex")
stringIndexer_model = stringIndexer.fit(dataset_processed)
dataset_processed = stringIndexer_model.transform(dataset_processed)

stringIndexer = StringIndexer(inputCol="persons", outputCol="personsIndex")
stringIndexer_model = stringIndexer.fit(dataset_processed)
dataset_processed = stringIndexer_model.transform(dataset_processed)

stringIndexer = StringIndexer(inputCol="lug_boot", outputCol="lug_bootIndex")
stringIndexer_model = stringIndexer.fit(dataset_processed)
dataset_processed = stringIndexer_model.transform(dataset_processed)

stringIndexer = StringIndexer(inputCol="safety", outputCol="safetyIndex")
stringIndexer_model = stringIndexer.fit(dataset_processed)
dataset_processed = stringIndexer_model.transform(dataset_processed)

assembler = VectorAssembler(inputCols=['buyingIndex', 'maintIndex', 'doorsIndex', 'personsIndex', 'lug_bootIndex', 'safetyIndex'], outputCol="features")
dataset_processed = assembler.transform(dataset_processed).select('features', 'classIndex')
  • 44. Spark ML Estimators
Classification example

from pyspark.ml.classification import *

# split the processed dataset into training and test sets
(trainingData, testData) = dataset_processed.randomSplit([0.7, 0.3])

# initialize a classification algorithm (the label column is the "classIndex" created on the previous slide)
decisionTree_algorithm = DecisionTreeClassifier(labelCol="classIndex", featuresCol="features")

# build a classification model using the classification algorithm and the training set
decisionTree_model = decisionTree_algorithm.fit(trainingData)

# make predictions using the constructed classifier and the test set (or a new, unseen dataset)
predictions = decisionTree_model.transform(testData)

# display actual vs. predicted classes
predictions.select("prediction", "classIndex", "features").show()
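The One-vs-Rest wrapper listed among the classification algorithms follows the same pattern. A minimal sketch (not from the original deck), assuming the trainingData/testData splits above and a Spark version whose Python API includes OneVsRest, wrapping a binary logistic regression for the multiclass problem:

from pyspark.ml.classification import LogisticRegression, OneVsRest

# One-vs-Rest trains one binary logistic regression model per class value
lr = LogisticRegression(labelCol="classIndex", featuresCol="features", maxIter=10)
ovr = OneVsRest(classifier=lr, labelCol="classIndex", featuresCol="features")

ovr_model = ovr.fit(trainingData)
ovr_predictions = ovr_model.transform(testData)
ovr_predictions.select("prediction", "classIndex").show()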
  • 45. Regression
  • 46. Spark ML Estimators
Regression algorithms
 Linear regression
 Generalized linear regression (Gaussian, Binomial, Poisson, Gamma)
 Decision tree regression
 Random forest regression – tree ensemble
 Gradient-boosted tree regression – tree ensemble
 Survival regression – accelerated failure time (AFT) model
 Isotonic regression – used when the target attribute has a fixed range (min and max)
  • 47. Spark ML Estimators
Regression template

from pyspark.ml.regression import *

# split the processed dataset into training and test sets
(trainingData, testData) = dataset_processed.randomSplit([0.7, 0.3])

# initialize a regression algorithm
regression_algorithm = <regression algorithm>(labelCol="<numerical target attribute>", featuresCol="<features vector attribute>")

# build a regression model using the regression algorithm and the training set
regression_model = regression_algorithm.fit(trainingData)

# make estimations using the constructed regression model and the test set (or a new, unseen dataset)
predictions = regression_model.transform(testData)

# display actual vs. estimated values
predictions.select("prediction", "<numerical target attribute>", "<features vector attribute>").show()
  • 48. Spark ML Estimators
Regression dataset

dataset_raw = sqlContext.sql("SELECT * FROM ds_energy")
dataset_raw.show(5)
dataset_raw.printSchema()

from pyspark.mllib.linalg import *
from pyspark.ml.feature import *

assembler = VectorAssembler(inputCols=['RelativeCompactness', 'SurfaceArea', 'WallArea', 'RoofArea', 'OverallHeight', 'Orientation', 'GlazingArea', 'GlazingAreaDistribution'], outputCol="features")
dataset_processed = assembler.transform(dataset_raw).select('features', 'HeatingLoad')

indexer = VectorIndexer(inputCol="features", outputCol="features_indexed", maxCategories=10)
indexerModel = indexer.fit(dataset_processed)
dataset_processed = indexerModel.transform(dataset_processed).select('features_indexed', 'HeatingLoad')
dataset_processed.show(truncate=False)
  • 49. Spark ML Estimators
Regression example

from pyspark.ml.regression import *

# split the processed dataset into training and test sets
(trainingData, testData) = dataset_processed.randomSplit([0.7, 0.3])

# initialize a Generalized Linear Model, pointing it at the columns prepared on the previous slide
GLM_algorithm = GeneralizedLinearRegression(family="gaussian", link="identity", maxIter=10, regParam=0.3, labelCol="HeatingLoad", featuresCol="features_indexed")

# build a regression model using the regression algorithm and the training set
GLM_model = GLM_algorithm.fit(trainingData)

# make estimations using the constructed regression model and the test set (or a new, unseen dataset)
predictions = GLM_model.transform(testData)

# display actual vs. estimated values
predictions.select("prediction", "HeatingLoad", "features_indexed").show()
  • 50. Spark ML Estimators
Regression example

print("Coefficients: " + str(GLM_model.coefficients))
print("Intercept: " + str(GLM_model.intercept))

# summarize the GLM_model over the training set and print out some metrics
summary = GLM_model.summary
print("Coefficient Standard Errors: " + str(summary.coefficientStandardErrors))
print("T Values: " + str(summary.tValues))
print("P Values: " + str(summary.pValues))
print("Dispersion: " + str(summary.dispersion))
print("Null Deviance: " + str(summary.nullDeviance))
print("Residual Degree Of Freedom Null: " + str(summary.residualDegreeOfFreedomNull))
print("Deviance: " + str(summary.deviance))
print("Residual Degree Of Freedom: " + str(summary.residualDegreeOfFreedom))
print("AIC: " + str(summary.aic))
print("Deviance Residuals: ")
summary.residuals().show()
  • 51. Model Evaluation
  • 52. Model Evaluation
Classification model evaluation – binary classification

from pyspark.ml.evaluation import BinaryClassificationEvaluator

# the binary evaluator scores the raw prediction (probability/confidence) column
evaluator = BinaryClassificationEvaluator(
    labelCol="classIndex",
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC")

areaUnderROC = evaluator.evaluate(predictions)

metricName parameter values:
 areaUnderROC – area under the Receiver Operating Characteristic curve
 areaUnderPR – area under the Precision-Recall curve
  • 53. Model Evaluation
Classification model evaluation – multiclass classification

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

evaluator = MulticlassClassificationEvaluator(
    labelCol="classIndex",
    predictionCol="prediction",
    metricName="accuracy")

accuracy = evaluator.evaluate(predictions)

metricName parameter values:
 "f1" (default)
 "accuracy"
 "precision"
 "recall"
 "weightedPrecision"
 "weightedRecall"
  • 54. Model Evaluation
Regression model evaluation

from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(
    labelCol="HeatingLoad",
    predictionCol="prediction",
    metricName="rmse")

rmse = evaluator.evaluate(predictions)

metricName parameter values:
 "rmse" (default) – Root Mean Squared Error
 "mse" – Mean Squared Error
 "r2" – coefficient of determination
 "mae" – Mean Absolute Error
  • 55. Model Selection and Parameter Tuning
  • 56. Model Selection and Parameter Tuning

from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit

algorithm = <classification or regression algorithm>(<parameters>)

# We use a ParamGridBuilder to construct a grid of parameters to search over.
# TrainValidationSplit will try all combinations of values and determine the best model using the evaluator.
paramGrid = ParamGridBuilder() \
    .addGrid(algorithm.<parameter1>, [0.1, 0.01]) \
    .addGrid(algorithm.<parameter2>, [0.0, 0.5, 1.0]) \
    .build()

tvs = TrainValidationSplit(estimator=algorithm,
                           estimatorParamMaps=paramGrid,
                           evaluator=<classification or regression evaluator>(),
                           trainRatio=0.8)

# Run TrainValidationSplit, and choose the best set of parameters.
model = tvs.fit(trainingData)
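TrainValidationSplit evaluates each parameter combination on a single train/validation split; the same pattern works with k-fold cross validation via CrossValidator. A minimal sketch (not from the original deck), reusing the decision-tree classifier and evaluator from the earlier slides:

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

dt = DecisionTreeClassifier(labelCol="classIndex", featuresCol="features")

# sweep the tree depth
paramGrid = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [3, 5, 7]) \
    .build()

evaluator = MulticlassClassificationEvaluator(labelCol="classIndex", predictionCol="prediction", metricName="f1")

# 5-fold cross validation: each parameter combination is scored on 5 train/validation splits
cv = CrossValidator(estimator=dt,
                    estimatorParamMaps=paramGrid,
                    evaluator=evaluator,
                    numFolds=5)

cv_model = cv.fit(trainingData)
best_model = cv_model.bestModel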
  • 57. Spark ML Pipelines
  • 58. Spark ML Pipeline
Pipeline example

from pyspark.ml import Pipeline

# data preparation (e.g., VectorAssembler, VectorIndexer, etc.)
transformer1 = …
transformer2 = …
transformer3 = …

# model algorithm (e.g., DecisionTreeClassifier)
model_algorithm = …

# pipeline which applies the transformations and the model-building algorithm to the dataset
pipeline = Pipeline(stages=[transformer1, transformer2, transformer3, model_algorithm])
model = pipeline.fit(training)
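As a concrete, illustrative sketch (not part of the original deck), the transformers from slides 32-33 and a decision tree can be chained into a single pipeline; this assumes the small dataframe created on slide 31 with its caseId, score, category, level and label columns:

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# stage 1: index the nominal "level" and "label" columns
levelIndexer = StringIndexer(inputCol="level", outputCol="levelIndex")
labelIndexer = StringIndexer(inputCol="label", outputCol="labelIndex")

# stage 2: assemble the input columns into a single features vector
assembler = VectorAssembler(inputCols=["score", "category", "levelIndex"], outputCol="features")

# stage 3: the learning algorithm
dt = DecisionTreeClassifier(labelCol="labelIndex", featuresCol="features")

pipeline = Pipeline(stages=[levelIndexer, labelIndexer, assembler, dt])

# fitting the pipeline runs the indexers and the assembler, then trains the classifier
pipeline_model = pipeline.fit(dataframe)
pipeline_model.transform(dataframe).select("prediction", "labelIndex").show()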
  • 59. Frequent Pattern Mining
  • 60. Frequent Pattern Mining
Association rule discovery example

from pyspark.mllib.fpm import *

data = [['a','b','c'], ['a','b'], ['a','d'], ['b','c'], ['a','b','d']]
transactions = sc.parallelize(data)

# mine the patterns that have at least 40% support in the dataset (transactions)
model = FPGrowth.train(transactions, minSupport=0.4)
result = model.freqItemsets().collect()

# retrieve the patterns that occurred 4 times or more
model.freqItemsets().filter(lambda itemset: itemset.freq >= 4).collect()

# retrieve the patterns that contain the item 'c'
model.freqItemsets().filter(lambda itemset: 'c' in itemset.items).collect()

# retrieve the patterns that have 2 items or more
model.freqItemsets().filter(lambda itemset: len(itemset.items) >= 2).collect()
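The freqItemsets() output can also be used to derive rule statistics such as confidence by hand. A minimal sketch (not from the original deck) for the rule {a} => {b}, reusing the model and transactions above:

# confidence({a} => {b}) = support({a, b}) / support({a})
freq = {frozenset(itemset.items): itemset.freq for itemset in model.freqItemsets().collect()}

n_transactions = transactions.count()
support_ab = freq.get(frozenset(['a', 'b']), 0) / float(n_transactions)
support_a = freq.get(frozenset(['a']), 0) / float(n_transactions)

confidence = support_ab / support_a if support_a > 0 else 0.0
print(confidence)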
  • 61. Clustering
  • 62. Clustering
Data clustering algorithms
 K-means – spherical, centroid-based, non-parametric, partitioning, non-overlapping
 Latent Dirichlet allocation (LDA) – probabilistic, usually used with topic models
 Power iteration clustering (PIC) – graph clustering
 Bisecting k-means – hierarchical (agglomerative & divisive)
 Gaussian Mixture Model (GMM) – Expectation Maximization (EM) algorithm; probabilistic, parametric, overlapping
  • 63. Clustering
Data clustering example

from pyspark.mllib.clustering import *
from math import *

data = [[1,2], [5,3], [3,4], [35,20], [25,10], [30,15]]
dataPoints = sc.parallelize(data)

# create 2 clusters using k-means; the initial centroids can be either random or chosen with the k-means++ technique
clusters = KMeans.train(dataPoints, k=2, maxIterations=10, initializationMode=<'random' | 'k-means||'>)

# assign data points to clusters
predictions = dataPoints.map(lambda point: clusters.predict(point))
assignments = dataPoints.zip(predictions)

# compute the sum of the Euclidean distances between each point and its cluster centre (a simple clustering quality measure)
sse = assignments.map(lambda (point, cluster): sqrt(sum([error**2 for error in (point - clusters.centers[cluster])]))) \
                 .reduce(lambda error1, error2: error1 + error2)
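For comparison, a minimal sketch (not from the original deck) of the Gaussian Mixture Model listed on the previous slide, reusing the same dataPoints RDD; unlike k-means it produces soft (probabilistic) cluster assignments:

from pyspark.mllib.clustering import GaussianMixture

# fit a 2-component Gaussian mixture via Expectation Maximization
gmm = GaussianMixture.train(dataPoints, k=2, maxIterations=100)

# hard assignments and per-cluster membership probabilities for each point
print(gmm.predict(dataPoints).collect())
print(gmm.predictSoft(dataPoints).collect())

# the learned mixture weights and the mean/covariance of each Gaussian component
print(gmm.weights)
for g in gmm.gaussians:
    print(g.mu)
    print(g.sigma)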
  • 64. My Background
Applying Computational Intelligence in Data Mining
• Honorary Research Fellow, School of Computing, University of Kent.
• Ph.D. Computer Science, University of Kent, Canterbury, UK.
• M.Sc. Computer Science, The American University in Cairo, Egypt.
• 25+ published journal and conference papers, focusing on:
 – classification rules induction,
 – decision trees construction,
 – Bayesian classification modelling,
 – data reduction,
 – instance-based learning,
 – evolving neural networks, and
 – data clustering
• Journals: Swarm Intelligence, Swarm & Evolutionary Computation, Applied Soft Computing, and Memetic Computing.
• Conferences: ANTS, IEEE CEC, IEEE SIS, EvoBio, ECTA, IEEE WCCI and INNS-BigData.
ResearchGate.org
  • 65. Thank you!