© 2016 MapR Technologies 10-1
Machine Learning with Apache Spark
Agenda
• Brief overview of:
– Classification
– Clustering
– Collaborative Filtering
• Predicting Flight Delays using a Decision Tree
What is MLlib?
Spark SQL
• Structured data
• Querying with SQL/HQL
• DataFrames
Spark Streaming
• Processing of live streams
• Micro-batching
MLlib
• Machine learning
• Multiple types of ML algorithms
GraphX
• Graph processing
• Graph-parallel computations
Spark Core
• RDD transformations and actions
• Task scheduling
• Memory management
• Fault recovery
• Interacting with storage systems
MLlib Algorithms and Utilities
Algorithms and utilities, with descriptions:
• Basic statistics: summary statistics, correlations, hypothesis testing, random data generation
• Classification and regression: methods for linear models, decision trees, and Naïve Bayes
• Collaborative filtering: model-based collaborative filtering using the alternating least squares (ALS) algorithm
• Clustering: K-means clustering
• Dimensionality reduction: singular value decomposition (SVD) and principal component analysis (PCA) on the RowMatrix class
• Feature extraction and transformation: several classes for common feature transformations
Examples of ML Algorithms
Machine Learning
Supervised
• Classification
– Naïve Bayes
– SVM
– Random Decision Forests
• Regression
– Linear
– Logistic
Unsupervised
• Clustering
– K-means
• Dimensionality reduction
– Principal Component Analysis
– SVD
Three Categories of Techniques for Machine Learning
• Classification
• Clustering
• Collaborative Filtering (Recommendation)
Machine Learning: Classification
• Classification identifies the category an item belongs to
Classification: Definition
Form of ML that:
• Identifies which category an item belongs to
• Uses supervised learning algorithms
– Data is labeled
• Example: sentiment classification
If It Walks/Swims/Quacks Like a Duck ... Then It Must Be a Duck
Features: walks, swims, quacks
Building and Deploying a Classifier Model
Image reference: O’Reilly Learning Spark
Training Data
• Spam: "free money now!", "get this money", "free savings $$$"
• Non-spam: "how are you?", "that Spark job", "lunch plans"
Steps
1. Featurization: convert the training data into feature vectors
2. Training: fit a model to the feature vectors
3. Model evaluation: evaluate the trained models and select the best model
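The featurization step for the spam example can be sketched in plain Scala, without Spark. This is a minimal bag-of-words illustration, not the method from the slides or from Learning Spark: the object name `FeaturizationSketch`, the helpers `tokenize` and `featurize`, and the choice of raw term counts over a vocabulary built from the training text are all assumptions made for the example.

```scala
// Illustrative sketch only: all names here are hypothetical.
object FeaturizationSketch {
  // Lowercase the text and split on anything that is not a letter, digit, or '$'.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase.split("[^a-z0-9$]+").filter(_.nonEmpty).toSeq

  // Count how often each vocabulary term occurs in the text.
  def featurize(text: String, vocabulary: Seq[String]): Array[Double] = {
    val counts = tokenize(text).groupBy(identity).map {
      case (term, hits) => term -> hits.size.toDouble
    }
    vocabulary.map(term => counts.getOrElse(term, 0.0)).toArray
  }

  // Training data from the slide.
  val spam: Seq[String] = Seq("free money now!", "get this money", "free savings $$$")
  val nonSpam: Seq[String] = Seq("how are you?", "that Spark job", "lunch plans")

  // Sorted distinct terms give every message a vector of the same length.
  val vocabulary: Seq[String] = (spam ++ nonSpam).flatMap(tokenize).distinct.sorted

  // (label, features) pairs: 1.0 = spam, 0.0 = non-spam.
  val trainingVectors: Seq[(Double, Array[Double])] =
    spam.map(t => (1.0, featurize(t, vocabulary))) ++
      nonSpam.map(t => (0.0, featurize(t, vocabulary)))
}
```

Each message becomes a fixed-length numeric vector paired with its label, which is the shape a training step expects.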
Machine Learning: Clustering
Clustering: Definition
• Unsupervised learning task
• Groups objects into clusters of high similarity
– Search results grouping
– Grouping of customers
– Anomaly detection
– Text categorization
Clustering: Example
• Group similar objects
• Use MLlib K-means algorithm
1. Initialize the coordinates of the cluster centers (centroids)
2. Assign every point to its nearest centroid
3. Update each centroid to the center of the points assigned to it
4. Repeat steps 2 and 3 until the stopping conditions are met
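The four steps above can be sketched in plain Scala, without Spark. This is an illustration of the basic K-means loop, not MLlib's implementation. The names `KMeansSketch`, `nearest`, `mean`, and `fit` are hypothetical, the initial centroids are supplied by the caller, and the stopping conditions (centroids no longer change, or `maxIterations` is reached) are assumptions.

```scala
// Illustrative sketch only: all names here are hypothetical.
object KMeansSketch {
  type Point = Vector[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def squaredDistance(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Step 2: index of the centroid nearest to the point.
  def nearest(point: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => squaredDistance(point, centroids(i)))

  // Step 3: component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(column => column.sum / points.size).toVector

  // Steps 1 to 4: start from the given centroids and iterate until they stop changing.
  def fit(points: Seq[Point], initial: Seq[Point], maxIterations: Int = 20): Seq[Point] = {
    var centroids = initial
    var changed = true
    var iteration = 0
    while (changed && iteration < maxIterations) {
      val groups = points.groupBy(p => nearest(p, centroids))
      // A centroid with no assigned points keeps its previous position.
      val updated = centroids.indices.map { i =>
        groups.get(i).map(mean).getOrElse(centroids(i))
      }
      changed = updated != centroids
      centroids = updated
      iteration += 1
    }
    centroids
  }
}
```

With two well-separated groups of points and one initial centroid near each group, the loop settles after the second pass, because the assignments no longer change.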
Collaborative Filtering with Spark
• Recommend items (filtering)
• Based on user preference data (collaborative)
User Item Rating Matrix
        A   B   C
Ted     4   5   5
Carol   -   5   5
Bob     -   5   ?
Train a Model to Make Predictions
• Ted and Carol like movies B and C
• Bob likes movie B; what might he like?
• Bob likes movie B, so predict C
Training Data -> Algorithm -> Model
New Data -> Model -> Predictions
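The Ted, Carol, and Bob example can be worked through with a small plain-Scala sketch. It uses a simple neighbor-based weighted average rather than the ALS algorithm that MLlib provides, so it only illustrates the idea of predicting a missing rating from similar users. The names `RecommendSketch`, `similarity`, and `predict`, and the similarity formula itself, are assumptions made for this example.

```scala
// Illustrative sketch only: neighbor-based prediction, not ALS. All names are hypothetical.
object RecommendSketch {
  type Ratings = Map[String, Map[String, Double]]

  // Ratings from the slide; a missing entry means the user has not rated the movie.
  val ratings: Ratings = Map(
    "Ted"   -> Map("A" -> 4.0, "B" -> 5.0, "C" -> 5.0),
    "Carol" -> Map("B" -> 5.0, "C" -> 5.0),
    "Bob"   -> Map("B" -> 5.0)
  )

  // Similarity in (0, 1]: 1 / (1 + mean absolute rating difference) over co-rated items.
  def similarity(a: Map[String, Double], b: Map[String, Double]): Option[Double] = {
    val shared = a.keySet.intersect(b.keySet)
    if (shared.isEmpty) None
    else {
      val meanDiff = shared.toSeq.map(item => math.abs(a(item) - b(item))).sum / shared.size
      Some(1.0 / (1.0 + meanDiff))
    }
  }

  // Similarity-weighted average of other users' ratings for the item.
  def predict(ratings: Ratings, user: String, item: String): Option[Double] = {
    val own = ratings.getOrElse(user, Map.empty[String, Double])
    val weighted: Seq[(Double, Double)] = ratings.toSeq.flatMap { case (other, theirs) =>
      val pair = for {
        rating <- theirs.get(item)
        sim <- similarity(own, theirs)
      } yield (sim, rating)
      if (other == user) Nil else pair.toList
    }
    if (weighted.isEmpty) None
    else Some(weighted.map { case (s, r) => s * r }.sum / weighted.map(_._1).sum)
  }
}
```

Bob agrees with both Ted and Carol on movie B, and both rated movie C at 5, so the sketch predicts 5 for Bob on movie C.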
Predict Flight Delays
Use Case: Flight Data
• Predict whether a flight is going to be delayed
• Use a decision tree for prediction
– Decision trees are used for classification and regression
– A tree is made of nodes, with a binary decision at each node
Flight Data
Parse Input
// Define the schema
case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
  flnum: Int, org_id: String, origin: String, dest_id: String, dest: String,
  crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
  arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)
// Parse one comma-separated line into a Flight
def parseFlight(str: String): Flight = {
  val line = str.split(",")
  Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5), line(6),
    line(7), line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble,
    line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble,
    line(16).toInt)
}
// Load the file into an RDD (sc is the SparkContext that spark-shell provides)
val rdd = sc.textFile("flights.csv")
// Create an RDD of Flight objects
val flightsRDD = rdd.map(parseFlight).cache()
flightsRDD.take(1)
// Array(Flight(1,3,AA,N338AA,1,12478,JFK,12892,LAX,900.0,914.0,14.0,1225.0,1238.0,13.0,385.0,2475))
Building and Deploying a Classifier Model
Training Data
• Delayed: Friday, LAX, AA
• Not delayed: Wednesday, BNA, Delta
Featurization: convert the training data into feature vectors
Classification Learning Problem - Features
Label: delayed or not delayed (delayed if delay > 40 minutes)
Features: {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}
Transform non-numeric features into numeric values
// Create a map of airline carrier -> numeric index
val carrierMap: Map[String, Int] =
  flightsRDD.map(flight => flight.carrier).distinct.collect.zipWithIndex.toMap
carrierMap.toString
// String = Map(DL -> 5, US -> 9, AA -> 6, UA -> 4...)
// Create a map of origin airport -> numeric index
// (assumption: originMap is used in later code but its definition was not shown)
val originMap: Map[String, Int] =
  flightsRDD.map(flight => flight.origin).distinct.collect.zipWithIndex.toMap
// Create a map of destination airport -> numeric index
val destMap: Map[String, Int] =
  flightsRDD.map(flight => flight.dest).distinct.collect.zipWithIndex.toMap
destMap.toString
// Map(JFK -> 214, LAX -> 294, ATL -> 273, MIA -> 175 ...
MLlib Data Types
• Vector: contains the feature data points
• LabeledPoint: contains the feature vector and the label
Define the features, Create LabeledPoint with Vector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Define the features array: the label first, then the eight features
val mlprep = flightsRDD.map(flight => {
  val monthday = flight.dofM.toInt - 1 // category
  val weekday = flight.dofW.toInt - 1 // category
  val crsdeptime1 = flight.crsdeptime.toInt
  val crsarrtime1 = flight.crsarrtime.toInt
  val carrier1 = carrierMap(flight.carrier) // category
  val crselapsedtime1 = flight.crselapsedtime
  val origin1 = originMap(flight.origin) // category
  val dest1 = destMap(flight.dest) // category
  // Label: 1.0 if the departure delay is more than 40 minutes, otherwise 0.0
  val delayed = if (flight.depdelaymins > 40) 1.0 else 0.0
  Array(delayed, monthday.toDouble, weekday.toDouble, crsdeptime1.toDouble,
    crsarrtime1.toDouble, carrier1.toDouble, crselapsedtime1, origin1.toDouble,
    dest1.toDouble)
})
mlprep.take(1)
// Array(Array(0.0, 0.0, 2.0, 900.0, 1225.0, 6.0, 385.0, 214.0, 294.0))
// Create a LabeledPoint from the label x(0) and a dense Vector of the features x(1)..x(8)
val mldata = mlprep.map(x => LabeledPoint(x(0),
  Vectors.dense(x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8))))
mldata.take(1)
// Array[LabeledPoint] = Array((0.0,[0.0,2.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
Build Model
Split data into:
• Training data RDD (80%)
• Test data RDD (20%)
Split Data
// Randomly split the RDD into training data (80%) and test data (20%)
val splits = mldata.randomSplit(Array(0.8, 0.2))
val trainingData = splits(0).cache()
val testData = splits(1).cache()
testData.take(1)
// Array[LabeledPoint] = Array((0.0,[18.0,6.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
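The idea behind a random 80/20 split can be shown with a small plain-Scala sketch, without Spark. It is an illustration only, not how Spark implements `randomSplit`. The names `SplitSketch`, `trainFraction`, and `seed` are hypothetical; the fixed seed is an assumption added so that the same split can be reproduced.

```scala
import scala.util.Random

// Illustrative sketch only: all names here are hypothetical.
object SplitSketch {
  // Send each element to the training set with probability trainFraction,
  // otherwise to the test set. The same seed always yields the same split.
  def randomSplit[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
    val rng = new Random(seed)
    data.partition(_ => rng.nextDouble() < trainFraction)
  }
}
```

Because every element is assigned independently, the split sizes are only approximately 80% and 20%. Spark's RDD `randomSplit` also accepts an optional seed argument when a repeatable split is needed.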
Build Model
• Use the training set, with labels, to build a model
Build Model
import org.apache.spark.mllib.tree.DecisionTree
// Set the number of categories for each categorical feature (feature index -> category count)
val categoricalFeaturesInfo = Map[Int, Int](
  0 -> 31,              // dofM: 31 categories
  1 -> 7,               // dofW: 7 categories
  4 -> carrierMap.size, // number of carriers
  6 -> originMap.size,  // number of origin airports
  7 -> destMap.size     // number of destination airports
)
val numClasses = 2
val impurity = "gini"
val maxDepth = 9
val maxBins = 7000
// Call DecisionTree.trainClassifier with the training data, which returns the model
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
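The `"gini"` setting above selects Gini impurity as the measure of how mixed the labels in a node are: for class proportions p_k, the impurity is 1 minus the sum of p_k squared. The plain-Scala sketch below works through that formula and the weighted impurity of a binary split. The names `GiniSketch`, `gini`, and `splitImpurity` are hypothetical, and this is an illustration of the formula, not MLlib's code.

```scala
// Illustrative sketch only: all names here are hypothetical.
object GiniSketch {
  // Gini impurity of a set of labels: 1 - sum over classes of (class proportion)^2.
  def gini(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val n = labels.size.toDouble
      1.0 - labels.groupBy(identity).values.map(group => math.pow(group.size / n, 2)).sum
    }

  // Impurity of a binary split: the size-weighted average of the two children's impurity.
  def splitImpurity(left: Seq[Double], right: Seq[Double]): Double = {
    val n = (left.size + right.size).toDouble
    if (n == 0) 0.0
    else (left.size / n) * gini(left) + (right.size / n) * gini(right)
  }
}
```

A node with an even mix of delayed (1.0) and not delayed (0.0) flights has impurity 0.5, and a split that separates the two classes perfectly has impurity 0.0. A tree learner prefers splits that lower the impurity the most.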
Build Model
// Print out the decision tree
model.toDebugString
// Feature indices: 0=dofM 4=carrier 3=crsarrtime1 6=origin
// res20: String =
// DecisionTreeModel classifier of depth 9 with 919 nodes
//   If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,
//       22.0,23.0,24.0,25.0,26.0,27.0,30.0})
//     If (feature 4 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,13.0})
//       If (feature 3 <= 1603.0)
//         If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0})
//           If (feature 6 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0...
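The debug output above shows the two kinds of binary decision a node can make: a categorical test ("feature 0 in {...}") and a threshold test ("feature 3 <= 1603.0"). The plain-Scala sketch below models such a tree and walks it to reach a prediction. The types `Leaf`, `ContinuousSplit`, and `CategoricalSplit` are hypothetical and are not MLlib classes, and the small example tree and its leaf values are invented for illustration, since the printed tree is truncated.

```scala
import scala.annotation.tailrec

// Illustrative sketch only: these types are hypothetical, not MLlib's.
object TreeSketch {
  sealed trait Node
  case class Leaf(prediction: Double) extends Node
  // Go left when features(feature) <= threshold, otherwise right.
  case class ContinuousSplit(feature: Int, threshold: Double, left: Node, right: Node) extends Node
  // Go left when features(feature) is one of the listed categories, otherwise right.
  case class CategoricalSplit(feature: Int, categories: Set[Double], left: Node, right: Node) extends Node

  // Follow one binary decision per node until a leaf is reached.
  @tailrec
  def predict(node: Node, features: Vector[Double]): Double = node match {
    case Leaf(prediction) => prediction
    case ContinuousSplit(f, t, l, r) => predict(if (features(f) <= t) l else r, features)
    case CategoricalSplit(f, cs, l, r) => predict(if (cs.contains(features(f))) l else r, features)
  }

  // Invented example tree: feature 0 = dofM, feature 3 = crsarrtime1. Leaf values are made up.
  val exampleTree: Node =
    CategoricalSplit(0, Set(11.0, 12.0),
      ContinuousSplit(3, 1603.0, Leaf(0.0), Leaf(1.0)),
      Leaf(0.0))
}
```

Calling `TreeSketch.predict(TreeSketch.exampleTree, features)` with a feature vector follows the same kind of path that the printed "If" lines describe.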
Get Predictions
• Pass the test data, without labels, to the model
• The model predicts whether each flight is delayed or not
Get Predictions
// Get predictions: create an RDD of (test label, test prediction) pairs
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
labelAndPreds.take(1)
// (label, prediction)
// Array((0.0,0.0))
Test Model
// Get the instances where label != prediction
val wrongPrediction = labelAndPreds.filter {
  case (label, prediction) => label != prediction
}
val wrong = wrongPrediction.count()
// Long = 11040
val ratioWrong = wrong.toDouble / testData.count()
// ratioWrong: Double = 0.3157443157443157
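The ratio of wrong predictions is one way to summarize the test results. As a hedged sketch, the same (label, prediction) pairs can also yield accuracy, precision, and recall for the "delayed" class. The plain-Scala code below works on an ordinary `Seq` instead of an RDD, and the names `EvaluationSketch`, `Metrics`, and `evaluate` are hypothetical.

```scala
// Illustrative sketch only: all names here are hypothetical.
object EvaluationSketch {
  case class Metrics(accuracy: Double, errorRate: Double, precision: Double, recall: Double)

  // Labels and predictions are 1.0 (delayed) or 0.0 (not delayed).
  def evaluate(labelAndPreds: Seq[(Double, Double)]): Metrics = {
    val total = labelAndPreds.size.toDouble
    val tp = labelAndPreds.count { case (l, p) => l == 1.0 && p == 1.0 } // delayed, predicted delayed
    val fp = labelAndPreds.count { case (l, p) => l == 0.0 && p == 1.0 } // on time, predicted delayed
    val fn = labelAndPreds.count { case (l, p) => l == 1.0 && p == 0.0 } // delayed, predicted on time
    val wrong = fp + fn
    // Guard against division by zero when a denominator is empty.
    def ratio(num: Double, den: Double): Double = if (den == 0) 0.0 else num / den
    Metrics(
      accuracy = ratio(total - wrong, total),
      errorRate = ratio(wrong, total),
      precision = ratio(tp, tp + fp),
      recall = ratio(tp, tp + fn)
    )
  }
}
```

Here `errorRate` corresponds to the wrong-prediction ratio computed above, and `accuracy` is its complement.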
To Learn More:
• Download example code
– https://github.com/caroljmcdonald/sparkmldecisiontree
• Read explanation of example code
– https://www.mapr.com/blog/apache-spark-machine-learning-tutorial
• Engage with us!
– https://www.mapr.com/blog/author/carol-mcdonald
– https://community.mapr.com
© 2016 MapR Technologies 10-46
Q&A
@mapr
https://www.mapr.com/blog/author/carol-mcdonald
Engage with us!
mapr-technologies

More Related Content

What's hot

Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Carol McDonald
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Carol McDonald
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUsCarol McDonald
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Carol McDonald
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareCarol McDonald
 
Demystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningDemystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningCarol McDonald
 
Free Code Friday - Machine Learning with Apache Spark
Free Code Friday - Machine Learning with Apache SparkFree Code Friday - Machine Learning with Apache Spark
Free Code Friday - Machine Learning with Apache SparkMapR Technologies
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataCarol McDonald
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBaseCarol McDonald
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
Approximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingApproximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingGabriele Modena
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBMapR Technologies
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to NewMapR Technologies
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Carol McDonald
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsDatabricks
 

What's hot (20)

Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
Applying Machine Learning to IOT: End to End Distributed Pipeline for Real-Ti...
 
Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures Streaming patterns revolutionary architectures
Streaming patterns revolutionary architectures
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
Introduction to machine learning with GPUs
Introduction to machine learning with GPUsIntroduction to machine learning with GPUs
Introduction to machine learning with GPUs
 
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with Ma...
 
How Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health CareHow Big Data is Reducing Costs and Improving Outcomes in Health Care
How Big Data is Reducing Costs and Improving Outcomes in Health Care
 
Demystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep LearningDemystifying AI, Machine Learning and Deep Learning
Demystifying AI, Machine Learning and Deep Learning
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Free Code Friday - Machine Learning with Apache Spark
Free Code Friday - Machine Learning with Apache SparkFree Code Friday - Machine Learning with Apache Spark
Free Code Friday - Machine Learning with Apache Spark
 
Advanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming DataAdvanced Threat Detection on Streaming Data
Advanced Threat Detection on Streaming Data
 
Build a Time Series Application with Apache Spark and Apache HBase
Build a Time Series Application with Apache Spark and Apache  HBaseBuild a Time Series Application with Apache Spark and Apache  HBase
Build a Time Series Application with Apache Spark and Apache HBase
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
Approximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processingApproximation algorithms for stream and batch processing
Approximation algorithms for stream and batch processing
 
MapR & Skytree:
MapR & Skytree: MapR & Skytree:
MapR & Skytree:
 
NoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DBNoSQL Application Development with JSON and MapR-DB
NoSQL Application Development with JSON and MapR-DB
 
Mathematical bridges From Old to New
Mathematical bridges From Old to NewMathematical bridges From Old to New
Mathematical bridges From Old to New
 
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
Fast, Scalable, Streaming Applications with Spark Streaming, the Kafka API an...
 
On-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy ModelsOn-Prem Solution for the Selection of Wind Energy Models
On-Prem Solution for the Selection of Wind Energy Models
 

Viewers also liked

CourboSpark: Decision Tree for Time-series on Spark
CourboSpark: Decision Tree for Time-series on SparkCourboSpark: Decision Tree for Time-series on Spark
CourboSpark: Decision Tree for Time-series on SparkDataWorks Summit
 
Spark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsSpark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsAlpine Data
 
Small intro to Big Data - Old version
Small intro to Big Data - Old versionSmall intro to Big Data - Old version
Small intro to Big Data - Old versionSoftwareMill
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine LearningCarol McDonald
 
Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkDB Tsai
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with SparkKhalid Salama
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APICarol McDonald
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Libraryjeykottalam
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Sparkdatamantra
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningJames Ward
 
Ted Dunning - Keynote: How Can We Take Flink Forward?
Ted Dunning -  Keynote: How Can We Take Flink Forward?Ted Dunning -  Keynote: How Can We Take Flink Forward?
Ted Dunning - Keynote: How Can We Take Flink Forward?Flink Forward
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Spark Summit
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkPetr Zapletal
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientistsAjay Ohri
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Spark Summit
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMeeraj Kunnumpurath
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsCloudera, Inc.
 

Viewers also liked (20)

CourboSpark: Decision Tree for Time-series on Spark
CourboSpark: Decision Tree for Time-series on SparkCourboSpark: Decision Tree for Time-series on Spark
CourboSpark: Decision Tree for Time-series on Spark
 
Spark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System AdministratorsSpark Tuning for Enterprise System Administrators
Spark Tuning for Enterprise System Administrators
 
ESKibana
ESKibanaESKibana
ESKibana
 
Small intro to Big Data - Old version
Small intro to Big Data - Old versionSmall intro to Big Data - Old version
Small intro to Big Data - Old version
 
Apache Spark Machine Learning
Apache Spark Machine LearningApache Spark Machine Learning
Apache Spark Machine Learning
 
Unsupervised Learning with Apache Spark
Unsupervised Learning with Apache SparkUnsupervised Learning with Apache Spark
Unsupervised Learning with Apache Spark
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 
Machine learning with Spark
Machine learning with SparkMachine learning with Spark
Machine learning with Spark
 
Streaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka APIStreaming Patterns Revolutionary Architectures with the Kafka API
Streaming Patterns Revolutionary Architectures with the Kafka API
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 
Introduction to Machine Learning with Spark
Introduction to Machine Learning with SparkIntroduction to Machine Learning with Spark
Introduction to Machine Learning with Spark
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Ted Dunning - Keynote: How Can We Take Flink Forward?
Ted Dunning -  Keynote: How Can We Take Flink Forward?Ted Dunning -  Keynote: How Can We Take Flink Forward?
Ted Dunning - Keynote: How Can We Take Flink Forward?
 
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
Recipes for Running Spark Streaming Applications in Production-(Tathagata Das...
 
MLlib and Machine Learning on Spark
MLlib and Machine Learning on SparkMLlib and Machine Learning on Spark
MLlib and Machine Learning on Spark
 
Statistics for data scientists
Statistics for  data scientistsStatistics for  data scientists
Statistics for data scientists
 
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
Practical Large Scale Experiences with Spark 2.0 Machine Learning: Spark Summ...
 
Machine Learning by Example - Apache Spark
Machine Learning by Example - Apache SparkMachine Learning by Example - Apache Spark
Machine Learning by Example - Apache Spark
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark ApplicationsTop 5 Mistakes to Avoid When Writing Apache Spark Applications
Top 5 Mistakes to Avoid When Writing Apache Spark Applications
 

Similar to Apache Spark Machine Learning Decision Trees

Intro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezIntro to Apache Spark by Marco Vasquez
Intro to Apache Spark by Marco VasquezMapR Technologies
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop BigDataEverywhere
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache SparkIndicThreads
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkVince Gonzalez
 
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...Arvind Surve
 
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...
Apache SystemML Optimizer and Runtime techniques by Arvind Surve and Matthias...Arvind Surve
 
Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksDatabricks
 
Microsoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaMicrosoft R Server for Data Sciencea
Microsoft R Server for Data ScienceaData Science Thailand
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
 
Querying Network Packet Captures with Spark and Drill
Querying Network Packet Captures with Spark and DrillQuerying Network Packet Captures with Spark and Drill
Querying Network Packet Captures with Spark and DrillVince Gonzalez
 
TWDI Accelerate Seattle, Oct 16, 2017: Distributed and In-Database Analytics ...

Apache Spark Machine Learning Decision Trees

  • 1. © 2016 MapR Technologies 10-1 © 2016 MapR Technologies Machine Learning with Apache Spark
  • 3. Agenda: brief overview of classification, clustering, and collaborative filtering; predicting flight delays using a decision tree
  • 4. What is MLlib? Spark Core (RDD transformations and actions): task scheduling, memory management, fault recovery, interacting with storage systems. Spark SQL: structured data, querying with SQL/HQL, DataFrames. Spark Streaming: processing of live streams, micro-batching. MLlib: machine learning, multiple types of ML algorithms. GraphX: graph processing, graph-parallel computations.
  • 5. MLlib Algorithms and Utilities
      – Basic statistics: includes summary statistics, correlations, hypothesis testing, random data generation
      – Classification and regression: includes methods for linear models, decision trees, and Naïve Bayes
      – Collaborative filtering: supports model-based collaborative filtering using the alternating least squares (ALS) algorithm
      – Clustering: supports K-means clustering
      – Dimensionality reduction: supports dimensionality reduction on the RowMatrix class; singular value decomposition (SVD) and principal component analysis (PCA)
      – Feature extraction and transformation: contains several classes for common feature transformations
  • 6. Examples of ML Algorithms. Supervised: classification (Naïve Bayes, SVM, Random Decision Forests) and regression (linear, logistic). Unsupervised: clustering (K-means) and dimensionality reduction (Principal Component Analysis, SVD).
  • 7. Examples of ML Algorithms. Supervised: classification (Naïve Bayes, SVM, Random Decision Forests) and regression (linear, logistic). Unsupervised: clustering (K-means) and dimensionality reduction (Principal Component Analysis, SVD).
  • 8. Examples of ML Algorithms. Unsupervised: clustering (K-means) and dimensionality reduction (Principal Component Analysis, SVD). Supervised: classification (Naïve Bayes, SVM, Random Decision Forests) and regression (linear, logistic).
  • 9. Three Categories of Techniques for Machine Learning: Classification, Clustering, Collaborative Filtering (Recommendation)
  • 10. Machine Learning: Classification. Identifies the category for an item.
  • 11. Classification: Definition. Form of ML that identifies which category an item belongs to and uses supervised learning algorithms (data is labeled). Figure label: Sentiment.
  • 12. If It Walks/Swims/Quacks Like a Duck… Then It Must Be a Duck. Features: walks, swims, quacks.
  • 13. Building and Deploying a Classifier Model (image reference: O’Reilly Learning Spark). Training data: spam ("free money now!", "get this money", "free savings $$$") and non-spam ("how are you?", "that Spark job", "lunch plans").
  • 14. Building and Deploying a Classifier Model (image reference: O’Reilly Learning Spark). Training data (spam and non-spam examples as on slide 13) -> Featurization -> Feature Vectors.
  • 15. Building and Deploying a Classifier Model (image reference: O’Reilly Learning Spark). Training data -> Featurization -> Feature Vectors -> Training -> Model.
  • 16. Building and Deploying a Classifier Model (image reference: O’Reilly Learning Spark). Training data -> Featurization -> Feature Vectors -> Training -> Model -> Model Evaluation -> Best Model.
  • 17. Machine Learning: Clustering
  • 18. Clustering: Definition. Unsupervised learning task; groups objects into clusters of high similarity.
  • 19. Clustering: Definition. Unsupervised learning task; groups objects into clusters of high similarity. Uses: search results grouping, grouping of customers, anomaly detection, text categorization.
  • 20. Clustering: Example. Group similar objects using the MLlib K-means algorithm:
      1. Initialize coordinates to center of clusters (centroid)
      2. Assign all points to nearest centroid
      3. Update centroids to center of points
      4. Repeat until conditions met
  • 21. Three Categories of Techniques for Machine Learning: Classification, Clustering, Collaborative Filtering (Recommendation)
  • 22. Collaborative Filtering with Spark. Recommend items (filtering) based on user preferences data (collaborative). User-item rating matrix: users Ted, Carol, Bob; items A, B, C; ratings 4, 5, 5, 5, 5, 5, and one unknown rating "?".
  • 23. Train a Model to Make Predictions. Training data (Ted and Carol like movies B and C) -> Algorithm -> Model. New data (Bob likes movie B, what might he like?) -> Model -> Predictions (Bob likes movie B, predict C).
  • 24. Predict Flight Delays
  • 25. Use Case: Flight Data. Predict if a flight is going to be delayed, using a decision tree for prediction. Decision trees are used for classification and regression and represent a tree of nodes with a binary decision at each node.
  • 26. Flight Data
  • 27. Parse Input

        // Define the schema
        case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
          flnum: Int, org_id: String, origin: String, dest_id: String, dest: String,
          crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
          arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)

        // Parse one comma-separated line of the data file into a Flight
        def parseFlight(str: String): Flight = {
          val line = str.split(",")
          Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5), line(6),
            line(7), line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble,
            line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble,
            line(16).toInt)
        }

        // load file into an RDD
        val rdd = sc.textFile("flights.csv")
        // create an RDD of Flight objects
        val flightsRDD = rdd.map(parseFlight).cache()
        // Array(Flight(1,3,AA,N338AA,1,12478,JFK,12892,LAX,900.0,914.0,14.0,1225.0,1238.0,13.0,385.0,2475))
  • 28. Building and Deploying a Classifier Model. Training data: delayed (Friday, LAX, AA) and not delayed (Wednesday, BNA, Delta) -> Featurization -> Feature Vectors.
  • 29. Classification Learning Problem - Features. Label: delayed or not delayed (delayed if delay > 40 minutes). Features: {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}.
  • 30. Transform non-numeric features into numeric values

        // create map of airline -> number
        var carrierMap: Map[String, Int] = Map()
        var index: Int = 0
        flightsRDD.map(flight => flight.carrier).distinct.collect.foreach(
          x => { carrierMap += (x -> index); index += 1 })
        carrierMap.toString
        // String = Map(DL -> 5, US -> 9, AA -> 6, UA -> 4...)

        // create map of destination airport -> number
        var destMap: Map[String, Int] = Map()
        var index2: Int = 0
        flightsRDD.map(flight => flight.dest).distinct.collect.foreach(
          x => { destMap += (x -> index2); index2 += 1 })
        destMap.toString
        // Map(JFK -> 214, LAX -> 294, ATL -> 273, MIA -> 175 ...

        // originMap, used on later slides, is not shown on this slide;
        // presumably it is built the same way from flight.origin
  • 31. Classification Learning Problem - Features. Label: delayed or not delayed (delayed if delay > 40 minutes). Features: {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}. MLlib data types: Vector contains the feature data points; LabeledPoint contains the feature vector and label.
  • 32. Define the features, Create LabeledPoint with Vector

        import org.apache.spark.mllib.linalg.Vectors
        import org.apache.spark.mllib.regression.LabeledPoint

        // Defining the features array
        val mlprep = flightsRDD.map(flight => {
          val monthday = flight.dofM.toInt - 1 // category
          val weekday = flight.dofW.toInt - 1 // category
          val crsdeptime1 = flight.crsdeptime.toInt
          val crsarrtime1 = flight.crsarrtime.toInt
          val carrier1 = carrierMap(flight.carrier) // category
          val crselapsedtime1 = flight.crselapsedtime.toDouble
          val origin1 = originMap(flight.origin) // category
          val dest1 = destMap(flight.dest) // category
          val delayed = if (flight.depdelaymins.toDouble > 40) 1.0 else 0.0
          Array(delayed.toDouble, monthday.toDouble, weekday.toDouble, crsdeptime1.toDouble,
            crsarrtime1.toDouble, carrier1.toDouble, crselapsedtime1.toDouble,
            origin1.toDouble, dest1.toDouble)
        })
        mlprep.take(1)
        // Array(Array(0.0, 0.0, 2.0, 900.0, 1225.0, 6.0, 385.0, 214.0, 294.0))

        // The first array element is the label; the remaining eight are the features
        val mldata = mlprep.map(x =>
          LabeledPoint(x(0), Vectors.dense(x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8))))
        mldata.take(1)
        // Array[LabeledPoint] = Array((0.0,[0.0,2.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
  • 33. Build Model. Split data into a training data RDD (80%) and a test data RDD (20%). Data -> Training Set and Test Set -> Build Model.
  • 34. Split Data

        // Randomly split RDD into training data RDD (80%) and test data RDD (20%)
        val splits = mldata.randomSplit(Array(0.8, 0.2))
        val trainingData = splits(0).cache()
        val testData = splits(1).cache()
        testData.take(1)
        // Array[LabeledPoint] = Array((0.0,[18.0,6.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
  • 35. Build Model. Training set with labels; build a model. Data -> Training Set and Test Set -> Build Model.
  • 36. Use Case: Flight Data. Predict if a flight is going to be delayed, using a decision tree for prediction. Decision trees are used for classification and regression and represent a tree of nodes with a binary decision at each node.
  • 37. Build Model

        import org.apache.spark.mllib.tree.DecisionTree

        // set ranges for categorical features (feature index -> number of categories)
        var categoricalFeaturesInfo = Map[Int, Int]()
        categoricalFeaturesInfo += (0 -> 31) // dofM 31 categories
        categoricalFeaturesInfo += (1 -> 7) // dofW 7 categories
        categoricalFeaturesInfo += (4 -> carrierMap.size) // number of carriers
        categoricalFeaturesInfo += (6 -> originMap.size) // number of origin airports
        categoricalFeaturesInfo += (7 -> destMap.size) // number of dest airports

        val numClasses = 2
        val impurity = "gini"
        val maxDepth = 9
        val maxBins = 7000

        // call DecisionTree trainClassifier with the trainingData, which returns the model
        val model = DecisionTree.trainClassifier(trainingData, numClasses,
          categoricalFeaturesInfo, impurity, maxDepth, maxBins)
  • 38. Build Model

        // print out the decision tree
        model.toDebugString
        // 0=dofM 4=carrier 3=crsarrtime1 6=origin
        // res20: String = DecisionTreeModel classifier of depth 9 with 919 nodes
        //   If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,26.0,27.0,30.0})
        //    If (feature 4 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,13.0})
        //     If (feature 3 <= 1603.0)
        //      If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0})
        //       If (feature 6 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0...
  • 39. Get Predictions. Test data (without label) -> Model -> Predict delay or not.
  • 40. Get Predictions

        // Get predictions, create RDD of test label, test prediction
        val labelAndPreds = testData.map { point =>
          val prediction = model.predict(point.features)
          (point.label, prediction)
        }
        labelAndPreds.take(1)
        // Label, Prediction
        // Array((0.0,0.0))
  • 41. Test Model

        // get instances where label != prediction
        val wrongPrediction = labelAndPreds.filter {
          case (label, prediction) => label != prediction
        }
        val wrong = wrongPrediction.count()
        // res35: Long = 11040
        val ratioWrong = wrong.toDouble / testData.count()
        // ratioWrong: Double = 0.3157443157443157
  • 42. To Learn More:
      • Download example code
        – https://github.com/caroljmcdonald/sparkmldecisiontree
      • Read explanation of example code
        – https://www.mapr.com/blog/apache-spark-machine-learning-tutorial
      • Engage with us!
        – https://www.mapr.com/blog/author/carol-mcdonald
        – https://community.mapr.com
  • 43. Q&A. Engage with us! @mapr, mapr-technologies, https://www.mapr.com/blog/author/carol-mcdonald

Editor's Notes

  1. The goal of Spark’s machine learning (ML) library is to make practical machine learning scalable and easy.
  2. MLlib provides the machine learning algorithms and utilities listed here. In this talk we will be going over Classification using Decision trees.
  3. In general, machine learning may be broken down into two classes of algorithms: supervised and unsupervised.
  4. Supervised algorithms use labeled data in which both the input and output are provided to the algorithm.
  5. Unsupervised algorithms do not have the outputs in advance. These algorithms are left to make sense of the data without any hints.
  6. Next we will briefly describe three common categories of machine learning techniques, starting with Classification.
  7. Gmail uses a machine learning technique called classification to classify if an email is spam or not, based on the data of an email: the sender, recipients, subject, and message body. Classification takes a set of data with known labels and learns how to label new records based on that information. The algorithm identifies which category an item belongs to, based on labeled examples of known items. For example, it identifies whether an email is spam or not based on emails known to be spam or not.
  8. Classification is a supervised algorithm, meaning it uses labeled data (for example, labeled as spam/non-spam or fraud/non-fraud) to build a model. The model is then used to predict the label or class for new data. Some common use cases for classification include credit card fraud detection and email spam detection.
  9. You can classify something based on pre-determined features. Features are the “if questions” that you ask. The label is the answer to those questions. In this example, if it walks, swims, and quacks like a duck, then the label is "duck".
  10. To build a classifier model, you first extract the features that most contribute to the classification. In our email example, we find features that define an email as spam, or not spam.
  11. The features are transformed and put into feature vectors, which are arrays of numbers representing the value of each feature.
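
A minimal sketch of the featurization step described in this note, in plain Scala rather than MLlib. The `SpamFeaturesSketch` object, its four-word `vocabulary`, and the `featurize` function are hypothetical names chosen for illustration; the slides do not say how the spam features are actually built.

```scala
object SpamFeaturesSketch {
  // Hypothetical vocabulary (an assumption); a real pipeline would derive it from the data.
  val vocabulary: Vector[String] = Vector("free", "money", "lunch", "spark")

  // Turn a message into a feature vector: one count per vocabulary word.
  def featurize(text: String): Vector[Double] = {
    // Lower-case the text and split on anything that is not a letter.
    val words = text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty)
    vocabulary.map(v => words.count(_ == v).toDouble)
  }
}
```

Each message becomes a fixed-length array of numbers, so the spam and non-spam examples from the slides can be compared position by position.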
  12. An algorithm “trains” a model by making associations between the input features and the labeled output associated with those features.
  13. Then at runtime, we deploy the best model which can then be used to make predictions on new points.
  14. Google News uses a technique called clustering to group news articles into different categories, based on title and content. Clustering algorithms discover groupings that occur in collections of data.
  15. In clustering, an algorithm classifies inputs into categories by analyzing similarities between input examples. Clustering uses unsupervised algorithms, which do not have the outputs in advance. No known classes are used as a reference, as with a supervised algorithm like classification.
  16. Clustering can be used for many purposes, for example: grouping similar customers; anomaly detection, such as fraud detection; and text categorization, such as sorting books into genres.
  17. K-means is one of the most commonly used clustering algorithms. Given a set of data points, the objective of the K-means algorithm is to create K clusters that group the most similar (closest) points.
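
The four K-means steps from the clustering slide (initialize, assign, update, repeat) can be sketched on 2-D points in plain Scala. This is an illustrative sketch, not the MLlib implementation: the `KMeansSketch` object, the caller-supplied initial centroids, and the `maxIter` cutoff are all assumptions.

```scala
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two points.
  def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Step 2: index of the centroid nearest to p.
  def nearest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => dist2(p, centroids(i)))

  // Step 3: mean (center) of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // Step 1 is the caller-supplied `init`; step 4 repeats until the
  // centroids stop moving or maxIter iterations have run.
  def run(points: Seq[Point], init: Vector[Point], maxIter: Int = 20): Vector[Point] = {
    var centroids = init
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      val groups = points.groupBy(p => nearest(p, centroids))
      val updated = centroids.indices.map { i =>
        // A centroid that attracted no points keeps its old position.
        groups.get(i).map(mean).getOrElse(centroids(i))
      }.toVector
      changed = updated != centroids
      centroids = updated
      iter += 1
    }
    centroids
  }
}
```

With two well-separated groups of points, the loop settles after one update: each centroid moves to the center of its group and the next pass changes nothing.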
  18. Amazon uses a machine learning technique called collaborative filtering, or recommendation, to determine which products users will like based on their history and their similarity to other users.
  19. Collaborative filtering algorithms recommend items (this is the filtering part) based on preference information from many users (this is the collaborative part). The collaborative filtering approach is based on similarity: people who liked similar items in the past are likely to like similar items in the future. In the example shown, Ted likes movies A, B, and C. Carol likes movies B and C. Bob likes movie B. To recommend a movie to Bob, we calculate that users who liked B also liked C, so C is a possible recommendation for Bob.
  20. The goal of a collaborative filtering algorithm is to take preference data from users and create a model that can be used for recommendations or predictions. Ted likes movies A, B, and C. Carol likes movies B and C. We take this data and run it through an algorithm to build a model. Then, when we have new data, such as Bob liking movie B, we use the model to predict that C is a possible recommendation for Bob.
  21. Next, we will look at using a decision tree to predict whether a flight is going to be delayed.
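The Ted, Carol, and Bob example can be worked through with a few lines of plain Scala. The sketch below uses simple co-occurrence counting ("users who liked what you liked also liked ..."), which is only one way to realize the idea and is not the alternating least squares (ALS) approach that MLlib provides. The function name `recommend` and the ranking rule are assumptions made for illustration.

```scala
// Who liked which movies, taken from the example in the text.
val likes: Map[String, Set[String]] = Map(
  "Ted"   -> Set("A", "B", "C"),
  "Carol" -> Set("B", "C"),
  "Bob"   -> Set("B")
)

// Recommend items liked by users who share at least one liked item with
// the given user, ranked by how many such users liked each item.
def recommend(user: String, likes: Map[String, Set[String]]): Seq[(String, Int)] = {
  val mine = likes.getOrElse(user, Set.empty[String])
  val candidates = for {
    (other, items) <- likes.toSeq
    if other != user && items.intersect(mine).nonEmpty
    item <- items.toSeq
    if !mine.contains(item)
  } yield item
  candidates.groupBy(identity)
    .map { case (item, hits) => (item, hits.size) }
    .toSeq
    .sortBy { case (item, count) => (-count, item) }
}

val forBob = recommend("Bob", likes)
```

For Bob, movie C is ranked first because both Ted and Carol liked it alongside B, while A was liked only by Ted.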
  22. Decision trees create a model that predicts the class or label based on several input features. Decision trees work by evaluating an expression containing a feature at every node and selecting a branch to the next node based on the answer. The feature questions are the nodes, and the answers “yes” or “no” are the branches in the tree to the child nodes.
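The way a decision tree is evaluated can be sketched as a small recursive data structure in plain Scala. The types `Node`, `Leaf`, and `Question`, the thresholds, and the feature layout are all hypothetical; a tree trained by MLlib is learned from the data and will look different.

```scala
// A tiny decision tree: internal nodes ask a yes/no question about one
// feature, and leaves hold the predicted label.
sealed trait Node
case class Leaf(label: String) extends Node
case class Question(feature: Int, threshold: Double, yes: Node, no: Node) extends Node

// Walk from the root to a leaf, answering "is features(feature) <= threshold?"
// at every internal node.
def predict(node: Node, features: Array[Double]): String = node match {
  case Leaf(label) => label
  case Question(f, t, yes, no) =>
    if (features(f) <= t) predict(yes, features) else predict(no, features)
}

// Hypothetical tree: feature 0 = scheduled departure time (hhmm),
// feature 1 = day of the week.
val tree: Node =
  Question(0, 1700.0,
    yes = Leaf("not delayed"),
    no  = Question(1, 5.0, yes = Leaf("not delayed"), no = Leaf("delayed")))

val prediction = predict(tree, Array(1830.0, 6.0))
```

Here the flight departs after 17:00 on day 6, so the walk takes the "no" branch twice and ends at the "delayed" leaf.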
  23. We are using flight information for January 2014. For each flight, we have the following
  24. We use a Scala case class to define the Flight schema corresponding to a line in the CSV data file. The parseFlight function parses a line from the data file into the Flight class. We load the flight data into an RDD, then use the map transformation on the RDD. This applies the parseFlight function to each element in the RDD and returns a new RDD of Flight objects.
  25. Next, the features are transformed and put into a feature vector, which is a vector of numbers representing the value of each feature.
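A minimal sketch of the case class and parser follows. The slide's actual column list is not reproduced in this transcript, so the fields of `Flight` below are a hypothetical subset in an assumed order, and a plain `Seq` of strings stands in for the RDD of lines; only the names `Flight` and `parseFlight` come from the text.

```scala
// Hypothetical subset of the flight columns; the real file has its own
// fields and column order.
case class Flight(dayOfMonth: String, dayOfWeek: String, carrier: String,
                  origin: String, dest: String,
                  schedDepTime: Double, schedArrTime: Double, depDelayMins: Double)

// Parse one comma-separated line into a Flight.
def parseFlight(line: String): Flight = {
  val f = line.split(",").map(_.trim)
  Flight(f(0), f(1), f(2), f(3), f(4), f(5).toDouble, f(6).toDouble, f(7).toDouble)
}

// Stand-in for the lines of the CSV file (made-up values). With Spark, the
// same function would be passed to map on the RDD of lines instead.
val lines = Seq("1,3,AA,ATL,LAX,900,1100,45", "2,4,DL,JFK,ATL,1745,2010,0")
val flights = lines.map(parseFlight)
```

Because `map` has the same shape on Scala collections and on RDDs, the parsing function itself does not need to know whether it runs locally or on a cluster.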
  26. We want to extract the features that will contribute to the classification the most. In our example, we will use the features day of the month, day of the week, the carrier, departure time, arrival time, the origin airport and destination airport. The label is delayed or not delayed. A flight is considered to be delayed if it is more than 40 minutes late.
  27. Here we transform the non-numeric features into numeric values. For example, the carrier AA is mapped to the number 6, and the airport ATL is mapped to 273.
  28. Next, we put the label and the feature vector into a LabeledPoint, which holds the features and the label.
  29. Here, the features for each flight are put into an RDD of arrays of numbers representing the value of each feature. Next, we create an RDD of LabeledPoints consisting of the label and the features in numeric format for each flight.
  30. A typical machine learning workflow is shown here. To make our predictions, we will perform the following steps:
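One common way to turn categories into numbers is to give each distinct value an index. The sketch below does this in plain Scala; the helper name `indexCategories` and the sample carriers are assumptions, and the indices it produces depend on the data, so they will not match the specific numbers quoted on the slide.

```scala
// Assign each distinct category a numeric index, in sorted order.
def indexCategories(values: Seq[String]): Map[String, Int] =
  values.distinct.sorted.zipWithIndex.toMap

// Made-up carrier column.
val carriers = Seq("DL", "AA", "UA", "AA", "DL")
val carrierIndex = indexCategories(carriers)

// Replace each carrier code by its numeric index.
val encoded = carriers.map(c => carrierIndex(c).toDouble)
```

The same lookup map would be built once per categorical column (carrier, origin, destination) and then applied to every flight.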
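The label-plus-features pairing can be sketched without Spark as follows. `Labeled` is a hypothetical stand-in for MLlib's `LabeledPoint`, `NumericFlight` is an assumed record whose categorical fields have already been converted to numbers, and using the departure delay to decide the label is an assumption; the text only says a flight counts as delayed when it is more than 40 minutes late.

```scala
// Stand-in for MLlib's LabeledPoint: a numeric label plus a feature array.
case class Labeled(label: Double, features: Array[Double])

// Hypothetical flight record with all fields already numeric.
case class NumericFlight(dayOfMonth: Double, dayOfWeek: Double, carrier: Double,
                         depTime: Double, arrTime: Double,
                         origin: Double, dest: Double, delayMins: Double)

// Label is 1.0 (delayed) when the flight is more than 40 minutes late,
// otherwise 0.0 (not delayed).
def toLabeled(f: NumericFlight): Labeled = {
  val label = if (f.delayMins > 40) 1.0 else 0.0
  Labeled(label, Array(f.dayOfMonth, f.dayOfWeek, f.carrier,
                       f.depTime, f.arrTime, f.origin, f.dest))
}

// Made-up flight: carrier index 6, origin index 273, 45 minutes late.
val point = toLabeled(NumericFlight(1, 3, 6, 900, 1100, 273, 12, 45))
```

Note that the delay itself is used only to derive the label and is left out of the features, since it would otherwise give the answer away.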
  31. Split the data into two parts, one for building the model and one for testing the model.
  32. We then run the algorithm to build and train a model. We make predictions with the training data and observe the results. Then we test the model with the test data.
  33. Next, we split the data into two parts, one for building the model and one for testing the model.
  34. The top line in the code shown here applies the 80-20 split to our data.
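The slide's code is not reproduced in this transcript, so here is a rough plain-Scala sketch of what an 80-20 random split does. The function name `split`, the seed, and the sample data are assumptions. On an RDD, Spark offers `randomSplit` for this purpose, and as in the sketch, the resulting sizes are only approximately 80% and 20%.

```scala
import scala.util.Random

// Randomly assign each element to the training set (with probability
// trainFraction) or to the test set. A fixed seed makes the split repeatable.
def split[A](data: Seq[A], trainFraction: Double, seed: Long): (Seq[A], Seq[A]) = {
  val rng = new Random(seed)
  data.partition(_ => rng.nextDouble() < trainFraction)
}

// Made-up data set of 1000 records.
val (training, test) = split(1 to 1000, trainFraction = 0.8, seed = 42L)
```

Every record lands in exactly one of the two sets, so the model is never tested on data it was trained on.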
  35. Next, we build the model with the training data set, which has labels.
  36. As a reminder, we will use the decision tree algorithm to build the model.
  37. categoricalFeaturesInfo specifies which features are categorical and how many categorical values each of those features can take. It is given as a map from feature index to the number of categories for that feature. The first entry, categoricalFeaturesInfo = (0 -> 31), specifies that feature index 0 (the day of the month) has 31 categories (values {1, ..., 31}). The second entry represents the day of the week, which can take the values 1 through 7. The carrier value can range from 1 to the number of distinct carriers, and so on. maxDepth: the maximum depth of the tree. maxBins: the number of bins used when discretizing continuous features. impurity: the measure of the homogeneity of the labels at a node.
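As a sketch of these parameters, the map below is built in plain Scala, followed by a small Gini impurity function to show what "homogeneity of the labels" measures. The feature order follows the list given earlier in the text, and the carrier and airport counts are placeholders, not values from the data set. In MLlib the map is passed to `DecisionTree.trainClassifier` together with the number of classes, the impurity name, maxDepth, and maxBins; MLlib documents categories as indexed from 0, so 1-based values such as days of the month may need shifting before training.

```scala
// Placeholder category counts (assumptions, not taken from the data).
val numCarriers = 14
val numAirports = 300

// Feature index -> number of categories. Features not listed here
// (departure and arrival time) are treated as continuous.
val categoricalFeaturesInfo: Map[Int, Int] = Map(
  0 -> 31,           // day of the month
  1 -> 7,            // day of the week
  2 -> numCarriers,  // carrier
  5 -> numAirports,  // origin airport
  6 -> numAirports   // destination airport
)

// Gini impurity of a non-empty set of labels: 0.0 when all labels agree,
// larger when the labels are mixed.
def gini(labels: Seq[Double]): Double = {
  val n = labels.size.toDouble
  1.0 - labels.groupBy(identity).values.map(g => math.pow(g.size / n, 2)).sum
}

val mixed = gini(Seq(0.0, 0.0, 1.0, 1.0))  // evenly mixed two-class node
val pure  = gini(Seq(1.0, 1.0))            // all labels identical
```

A tree-building algorithm prefers splits whose child nodes have lower impurity than the parent, which is why a perfectly pure node needs no further splitting.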
  38. Model.toDebugString prints out the decision tree, which asks the following questions to determine if the flight was delayed or not:
  39. Now that we have trained our model, we want to get predictions for the test data without the label, in order to compare the predictions to the label.
  40. Next, we use the test data to get predictions. Here we create an RDD with the test labels and test predictions in order to compare the predictions to the actual flight delay label.
  41. Here we compare the predictions to the labels. The wrong prediction ratio is the count of wrong predictions divided by the count of test data values, which comes to 31% here.
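The ratio itself is a short computation. The sketch below uses a plain `Seq` of (label, prediction) pairs with made-up values in place of the RDD, and the name `wrongRatio` is an assumption; it does not reproduce the 31% figure, which comes from the real test data.

```scala
// (actual label, predicted label) pairs; made-up values for illustration.
val labelAndPreds: Seq[(Double, Double)] = Seq(
  (1.0, 1.0), (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 0.0)
)

// Count of wrong predictions divided by the total number of test records.
def wrongRatio(pairs: Seq[(Double, Double)]): Double =
  pairs.count { case (label, pred) => label != pred }.toDouble / pairs.size

val ratio = wrongRatio(labelAndPreds)
```

In this sample, two of the five predictions disagree with the label, so the ratio is 0.4.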