Apache Spark Machine Learning Decision Trees

Predict Flight Delays with Apache Spark's Machine Learning Decision Tree algorithm

Apache Spark Machine Learning Decision Trees

  1. Machine Learning with Apache Spark (© 2016 MapR Technologies)
  2. (no slide text)
  3. Agenda
      • Brief overview of
          – Classification
          – Clustering
          – Collaborative Filtering
      • Predicting Flight Delays using a Decision Tree
  4. What is MLlib? (Spark stack)
      • Spark SQL: structured data; querying with SQL/HQL; DataFrames
      • Spark Streaming: processing of live streams; micro-batching
      • MLlib: machine learning; multiple types of ML algorithms
      • GraphX: graph processing; graph-parallel computations
      • Spark Core: RDD transformations and actions; task scheduling; memory management; fault recovery; interacting with storage systems
  5. MLlib Algorithms and Utilities
      • Basic statistics: summary statistics, correlations, hypothesis testing, random data generation
      • Classification and regression: methods for linear models, decision trees, and Naïve Bayes
      • Collaborative filtering: model-based collaborative filtering using the alternating least squares (ALS) algorithm
      • Clustering: K-means clustering
      • Dimensionality reduction: supported on the RowMatrix class; singular value decomposition (SVD) and principal component analysis (PCA)
      • Feature extraction and transformation: several classes for common feature transformations
  6. Examples of ML Algorithms
      • Supervised
          – Classification: Naïve Bayes, SVM, Random Decision Forests
          – Regression: Linear, Logistic
      • Unsupervised
          – Clustering: K-means
          – Dimensionality reduction: Principal Component Analysis, SVD
  9. Three Categories of Techniques for Machine Learning
      • Classification
      • Clustering
      • Collaborative Filtering (Recommendation)
  10. Machine Learning: Classification
      • Classification identifies the category for an item
  11. Classification: Definition
      A form of ML that:
      • Identifies which category an item belongs to
      • Uses supervised learning algorithms (the data is labeled)
      • Image label: Sentiment
  12. If It Walks/Swims/Quacks Like a Duck... Then It Must Be a Duck
      • Features: walks, swims, quacks
  16. Building and Deploying a Classifier Model (image reference: O'Reilly, Learning Spark)
      • Training data
          – Spam: "free money now!", "get this money", "free savings $$$"
          – Non-spam: "how are you?", "that Spark job", "lunch plans"
      • Pipeline: Training Data -> Featurization -> Feature Vectors -> Training -> Model -> Model Evaluation -> Best Model
  17. Machine Learning: Clustering
  19. Clustering: Definition
      • Unsupervised learning task
      • Groups objects into clusters of high similarity
          – Search results grouping
          – Grouping of customers
          – Anomaly detection
          – Text categorization
  20. Clustering: Example
      • Group similar objects
      • Use the MLlib K-means algorithm
          1. Initialize the cluster centers (centroids)
          2. Assign each point to its nearest centroid
          3. Update each centroid to the center of its assigned points
          4. Repeat until the stopping conditions are met
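
The four K-means steps on slide 20 can be sketched in plain Scala collections. This is an illustration only, not MLlib's implementation: the object name `KMeansSketch`, the 2-D `Point` type, and the `maxIter` cap are assumptions made for the example, and the initial centroids are passed in by the caller rather than chosen automatically.

```scala
object KMeansSketch {
  // Hypothetical 2-D point type used only for this illustration.
  type Point = (Double, Double)

  // Squared Euclidean distance between two points.
  def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Step 2: index of the centroid nearest to p.
  def nearest(p: Point, centroids: Vector[Point]): Int =
    centroids.indices.minBy(i => dist2(p, centroids(i)))

  // Step 3: mean of a non-empty group of points.
  def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // Step 1 is the caller-supplied `init`; steps 2-4 repeat until the
  // centroids stop moving or maxIter iterations have run.
  def kMeans(points: Seq[Point], init: Vector[Point], maxIter: Int = 20): Vector[Point] = {
    var centroids = init
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      val groups = points.groupBy(p => nearest(p, centroids))
      val updated = centroids.indices.map { i =>
        // A centroid with no assigned points stays where it is.
        groups.get(i).map(mean).getOrElse(centroids(i))
      }.toVector
      changed = updated != centroids
      centroids = updated
      iter += 1
    }
    centroids
  }
}
```

On a cluster, the equivalent step would normally go through MLlib's own `KMeans.train(data, k, maxIterations)` on an `RDD[Vector]` rather than hand-written loops.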
  22. Collaborative Filtering with Spark
      • Recommend items (Filtering)
      • Based on user preferences data (Collaborative)
      • Figure: user-item rating matrix with users Ted, Carol, Bob and items A, B, C; known ratings are 4s and 5s, with one unknown rating (?)
  23. Train a Model to Make Predictions
      • Training data: Ted and Carol like movies B and C
      • New data: Bob likes movie B; what might he like?
      • Prediction: Bob likes movie B, predict C
      • Pipeline: Training Data -> Algorithm -> Model; New Data -> Model -> Predictions
      • Figure: the same user-item rating matrix as on slide 22
  24. Predict Flight Delays
  25. Use Case: Flight Data
      • Predict if a flight is going to be delayed
      • Use a Decision Tree for prediction
          – Used for classification and regression
          – Represents a tree with nodes
          – Binary decision at each node
  26. Flight Data
  27. Parse Input
      // Define the schema
      case class Flight(dofM: String, dofW: String, carrier: String, tailnum: String,
        flnum: Int, org_id: String, origin: String, dest_id: String, dest: String,
        crsdeptime: Double, deptime: Double, depdelaymins: Double, crsarrtime: Double,
        arrtime: Double, arrdelay: Double, crselapsedtime: Double, dist: Int)
      // Parse one comma-separated line into a Flight
      def parseFlight(str: String): Flight = {
        val line = str.split(",")
        Flight(line(0), line(1), line(2), line(3), line(4).toInt, line(5), line(6),
          line(7), line(8), line(9).toDouble, line(10).toDouble, line(11).toDouble,
          line(12).toDouble, line(13).toDouble, line(14).toDouble, line(15).toDouble,
          line(16).toInt)
      }
      // Load the file into an RDD
      val rdd = sc.textFile("flights.csv")
      // Create an RDD of Flight objects
      val flightsRDD = rdd.map(parseFlight).cache()
      // Array(Flight(1,3,AA,N338AA,1,12478,JFK,12892,LAX,900.0,914.0,14.0,1225.0,1238.0,13.0,385.0,2475))
  28. Building and Deploying a Classifier Model
      • Training data
          – Delayed: Friday, LAX, AA
          – Not delayed: Wednesday, BNA, Delta
      • Pipeline: Training Data -> Featurization -> Feature Vectors
  29. Classification Learning Problem - Features
      • Label: delayed or not delayed (delayed if delay > 40 minutes)
      • Features: {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}
  30. Transform non-numeric features into numeric values
      // Create map of airline -> number
      var carrierMap: Map[String, Int] = Map()
      var index: Int = 0
      flightsRDD.map(flight => flight.carrier).distinct.collect.foreach(
        x => { carrierMap += (x -> index); index += 1 })
      carrierMap.toString
      // String = Map(DL -> 5, US -> 9, AA -> 6, UA -> 4...)
      // Create map of origin airport -> number
      // (originMap is used on later slides but not defined in this transcript;
      //  assumed here to be built the same way as the other two maps)
      var originMap: Map[String, Int] = Map()
      var index1: Int = 0
      flightsRDD.map(flight => flight.origin).distinct.collect.foreach(
        x => { originMap += (x -> index1); index1 += 1 })
      // Create map of destination airport -> number
      var destMap: Map[String, Int] = Map()
      var index2: Int = 0
      flightsRDD.map(flight => flight.dest).distinct.collect.foreach(
        x => { destMap += (x -> index2); index2 += 1 })
      destMap.toString
      // Map(JFK -> 214, LAX -> 294, ATL -> 273, MIA -> 175 ...
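
The category-to-number mapping on slide 30, together with the labeling rule from slide 29, can also be written without mutable counters. The sketch below uses plain Scala collections rather than an RDD; the object and method names are hypothetical, and sorting the distinct values is an added assumption so that the indices are reproducible (they will therefore differ from the indices shown in the slide's output).

```scala
object CategoryIndex {
  // Assign each distinct value an integer index.
  // Sorting first is an assumption made here for reproducible numbering.
  def indexOf(values: Seq[String]): Map[String, Int] =
    values.distinct.sorted.zipWithIndex.toMap

  // Label rule from the slides: delayed (1.0) if the departure delay
  // exceeds 40 minutes, otherwise not delayed (0.0).
  def label(depDelayMins: Double): Double =
    if (depDelayMins > 40) 1.0 else 0.0
}
```

For example, `CategoryIndex.indexOf(Seq("DL", "AA", "UA", "AA"))` yields a three-entry map with one index per carrier, which is the shape `carrierMap`, `originMap`, and `destMap` need.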
  31. Classification Learning Problem - Features
      • Label: delayed or not delayed (delayed if delay > 40 minutes)
      • Features: {day_of_month, weekday, crsdeptime, crsarrtime, carrier, crselapsedtime, origin, dest}
      • MLlib datatypes
          – Vector: contains the feature data points
          – LabeledPoint: contains feature vector and label
  32. Define the features, Create LabeledPoint with Vector
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint
      // Define the features array; element 0 is the label, elements 1-8 are the features
      val mlprep = flightsRDD.map(flight => {
        val monthday = flight.dofM.toInt - 1 // category
        val weekday = flight.dofW.toInt - 1 // category
        val crsdeptime1 = flight.crsdeptime.toInt
        val crsarrtime1 = flight.crsarrtime.toInt
        val carrier1 = carrierMap(flight.carrier) // category
        val crselapsedtime1 = flight.crselapsedtime.toDouble
        val origin1 = originMap(flight.origin) // category
        val dest1 = destMap(flight.dest) // category
        val delayed = if (flight.depdelaymins.toDouble > 40) 1.0 else 0.0
        Array(delayed.toDouble, monthday.toDouble, weekday.toDouble, crsdeptime1.toDouble,
          crsarrtime1.toDouble, carrier1.toDouble, crselapsedtime1.toDouble,
          origin1.toDouble, dest1.toDouble)
      })
      mlprep.take(1)
      // Array(Array(0.0, 0.0, 2.0, 900.0, 1225.0, 6.0, 385.0, 214.0, 294.0))
      // Wrap each array as a LabeledPoint: label plus dense feature vector
      val mldata = mlprep.map(x => LabeledPoint(x(0),
        Vectors.dense(x(1), x(2), x(3), x(4), x(5), x(6), x(7), x(8))))
      mldata.take(1)
      // Array[LabeledPoint] = Array((0.0,[0.0,2.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
  33. Build Model
      Split the data into:
      • Training data RDD (80%)
      • Test data RDD (20%)
      Diagram: Data -> Training Set and Test Set; Training Set -> Build Model
  34. Split Data
      // Randomly split RDD into training data RDD (80%) and test data RDD (20%)
      val splits = mldata.randomSplit(Array(0.8, 0.2))
      val trainingData = splits(0).cache()
      val testData = splits(1).cache()
      testData.take(1)
      // Array[LabeledPoint] = Array((0.0,[18.0,6.0,900.0,1225.0,6.0,385.0,214.0,294.0]))
  35. Build Model
      • Build a model from the training set with labels
      Diagram: Data -> Training Set and Test Set; Training Set -> Build Model
  37. Build Model
      import org.apache.spark.mllib.tree.DecisionTree
      // Set ranges for categorical features (feature index -> number of categories)
      var categoricalFeaturesInfo = Map[Int, Int]()
      categoricalFeaturesInfo += (0 -> 31) // dofM: 31 categories
      categoricalFeaturesInfo += (1 -> 7) // dofW: 7 categories
      categoricalFeaturesInfo += (4 -> carrierMap.size) // number of carriers
      categoricalFeaturesInfo += (6 -> originMap.size) // number of origin airports
      categoricalFeaturesInfo += (7 -> destMap.size) // number of dest airports
      val numClasses = 2
      val impurity = "gini"
      val maxDepth = 9
      val maxBins = 7000
      // Call DecisionTree.trainClassifier with the training data; it returns the model
      val model = DecisionTree.trainClassifier(trainingData, numClasses,
        categoricalFeaturesInfo, impurity, maxDepth, maxBins)
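
Slide 37 sets `impurity = "gini"` without defining it. As a rough illustration, Gini impurity for a set of labels is one minus the sum of squared class proportions, and a candidate binary split is scored by the size-weighted impurity of its two sides. The plain-Scala sketch below shows that arithmetic only; the object and method names are hypothetical, and it is not how MLlib computes splits internally.

```scala
object GiniSketch {
  // Gini impurity: 1 - sum over classes of (class proportion)^2.
  // An empty set is treated as pure (0.0) by convention in this sketch.
  def gini(labels: Seq[Double]): Double =
    if (labels.isEmpty) 0.0
    else {
      val n = labels.size.toDouble
      1.0 - labels.groupBy(identity).values.map(g => math.pow(g.size / n, 2)).sum
    }

  // Size-weighted impurity of a binary split; lower means a cleaner split.
  def splitImpurity(left: Seq[Double], right: Seq[Double]): Double = {
    val n = (left.size + right.size).toDouble
    (left.size / n) * gini(left) + (right.size / n) * gini(right)
  }
}
```

A node holding two delayed and two on-time flights has impurity 0.5; a split that sends the delayed flights left and the on-time flights right has weighted impurity 0, which is why a tree learner would prefer it.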
  38. Build Model
      // Print out the decision tree
      model.toDebugString
      // Feature indices: 0=dofM 4=carrier 3=crsarrtime1 6=origin
      // res20: String = DecisionTreeModel classifier of depth 9 with 919 nodes
      //   If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0,20.0,21.0,22.0,23.0,24.0,25.0,26.0,27.0,30.0})
      //    If (feature 4 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0,11.0,13.0})
      //     If (feature 3 <= 1603.0)
      //      If (feature 0 in {11.0,12.0,13.0,14.0,15.0,16.0,17.0,18.0,19.0})
      //       If (feature 6 in {0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,10.0,11.0,12.0,13.0...
  39. Get Predictions
      Diagram: Test Data (without label) -> Model -> Predict Delay or Not
  40. Get Predictions
      // Get predictions; create an RDD of (test label, test prediction)
      val labelAndPreds = testData.map { point =>
        val prediction = model.predict(point.features)
        (point.label, prediction)
      }
      labelAndPreds.take(1)
      // Label, Prediction
      // Array((0.0,0.0))
  41. Test Model
      // Get instances where label != prediction
      val wrongPrediction = labelAndPreds.filter {
        case (label, prediction) => label != prediction
      }
      val wrong = wrongPrediction.count()
      // Long = 11040
      val ratioWrong = wrong.toDouble / testData.count()
      // ratioWrong: Double = 0.3157443157443157
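
The error ratio on slide 41 (about 0.316, so roughly 68.4% of test flights are classified correctly) is a single number; for a two-class problem it is often useful to break it down further. The plain-Scala sketch below computes the same error rate plus confusion counts from (label, prediction) pairs. The names are hypothetical, and the tuple layout of the result is an assumption made for this example.

```scala
object EvalSketch {
  // Each pair is (label, prediction); 1.0 = delayed, 0.0 = not delayed.
  // Assumes `pairs` is non-empty.
  def errorRate(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (l, p) => l != p }.toDouble / pairs.size

  // Confusion counts as (truePos, falsePos, trueNeg, falseNeg).
  def confusion(pairs: Seq[(Double, Double)]): (Int, Int, Int, Int) = {
    val tp = pairs.count { case (l, p) => l == 1.0 && p == 1.0 }
    val fp = pairs.count { case (l, p) => l == 0.0 && p == 1.0 }
    val tn = pairs.count { case (l, p) => l == 0.0 && p == 0.0 }
    val fn = pairs.count { case (l, p) => l == 1.0 && p == 0.0 }
    (tp, fp, tn, fn)
  }
}
```

Separating false positives from false negatives matters here because delayed flights are likely the minority class, so a low error rate alone may hide poor detection of actual delays.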
  42. To Learn More
      • Download example code: https://github.com/caroljmcdonald/sparkmldecisiontree
      • Read explanation of example code: https://www.mapr.com/blog/apache-spark-machine-learning-tutorial
      • Engage with us:
          – https://www.mapr.com/blog/author/carol-mcdonald
          – https://community.mapr.com
  43. Q&A
      • Engage with us: @mapr, mapr-technologies, https://www.mapr.com/blog/author/carol-mcdonald