Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning, Structured Streaming, Kafka with MapR-ES and MapR-DB

Real-Time Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning, Spark Structured Streaming, and Kafka with MapR-ES and MapR-DB


  1. Real-Time Analysis of Popular Uber Locations using Apache APIs: Spark Machine Learning, Spark Structured Streaming, and Kafka with MapR-ES and MapR-DB
  2. Agenda (© 2018 MapR Technologies, Inc)
      • Overview of unsupervised machine learning clustering
      • Use K-means to cluster Uber locations and save the ML model
      • Overview of the Kafka API
      • Use Spark Structured Streaming to:
        – Read from a Kafka topic
        – Enrich with the ML model
        – Write to the MapR-DB JSON document database
      • Use Spark SQL to query the MapR-DB database
  3. Use Case: Real-Time Analysis of Geographically Clustered Vehicles
  4. Intro to Machine Learning
  5. What is Machine Learning? [Diagram: data that contains patterns is used to train an algorithm, which finds the patterns and builds a model; the model (a prediction function) recognizes the patterns in new data and outputs predictions.]
  6. Data Discovery, Model Creation, and Production. [Diagram: during data discovery and model creation, historical Uber trip data goes through feature extraction and is split into a training set, used for model training/building, and a test set, used to test model predictions and evaluate results. In production, new Uber trip data arrives on a stream topic, goes through feature extraction, and is passed to the deployed model to produce insights.]
  7. Supervised and Unsupervised Machine Learning
      • Supervised (uses a label): classification, regression
      • Unsupervised: clustering, collaborative filtering, frequent pattern mining
  8. Supervised algorithms use labeled data. [Diagram: data with features X1, X2 and label Y is used to build a model f(X1, X2) = Y; the model is then used to predict Y for new data with features X1, X2.]
  9. Unsupervised algorithms use unlabeled data. [Diagram: customer purchase data, which contains patterns, is used to train an algorithm that finds the patterns and builds a model of customer groups; the model recognizes the patterns in new customer purchase data and returns a similar customer group.]
  10. Unsupervised Machine Learning: Clustering. Example: clustering groups news articles into different categories.
  11. Clustering: Definition. An unsupervised learning task that groups objects into clusters of high similarity.
  12. Clustering: Definition. An unsupervised learning task that groups objects into clusters of high similarity. Uses:
      – Search results grouping
      – Grouping of customers, patients
      – Text categorization
      – Recommendations
      • Anomaly detection: find what’s not similar
  13. Clustering: Example. Group similar objects.
  14. Clustering: Example. Group similar objects using the MLlib K-means algorithm:
      1. Initialize the coordinates of the K cluster centers (centroids)
  15. Clustering: Example. Group similar objects using the MLlib K-means algorithm:
      1. Initialize the coordinates of the K cluster centers (centroids)
      2. Assign all points to the nearest centroid
  16. Clustering: Example. Group similar objects using the MLlib K-means algorithm:
      1. Initialize the coordinates of the K cluster centers (centroids)
      2. Assign all points to the nearest centroid
      3. Update each centroid to the center of its assigned points
  17. Clustering: Example. Group similar objects using the MLlib K-means algorithm:
      1. Initialize the coordinates of the K cluster centers (centroids)
      2. Assign all points to the nearest centroid
      3. Update each centroid to the center of its assigned points
      4. Repeat steps 2 and 3 until the stopping conditions are met
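The four K-means steps above can be sketched in plain Scala, without Spark. This is only an illustration of the iteration, not MLlib's implementation: the names `KMeansSketch`, `fit`, `nearest` and `mean` are hypothetical, the starting centroids are passed in by the caller (MLlib chooses them itself), and the stopping condition is assumed to be "centroids stopped moving or `maxIter` was reached".

```scala
object KMeansSketch {
  type Point = (Double, Double)

  // Squared Euclidean distance between two points
  def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Index of the centroid closest to p
  def nearest(centroids: Vector[Point], p: Point): Int =
    centroids.indices.minBy(i => dist2(centroids(i), p))

  // Center (mean) of a non-empty group of points
  def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // Step 1 is the caller's choice of `init`; steps 2 to 4 happen in the loop.
  def fit(points: Vector[Point], init: Vector[Point], maxIter: Int = 20): Vector[Point] = {
    var centroids = init
    var changed = true
    var iter = 0
    while (changed && iter < maxIter) {
      // Step 2: assign every point to its nearest centroid
      val groups = points.groupBy(p => nearest(centroids, p))
      // Step 3: move each centroid to the center of its points
      // (a centroid with no assigned points stays where it is)
      val updated = centroids.indices
        .map(i => groups.get(i).map(mean).getOrElse(centroids(i)))
        .toVector
      // Step 4: repeat until nothing moves or maxIter is reached
      changed = updated != centroids
      centroids = updated
      iter += 1
    }
    centroids
  }
}
```

For example, `KMeansSketch.fit(Vector((0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)), Vector((0.0, 0.0), (10.0, 10.0)))` settles on one centroid per pair of nearby points.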
  18. Cluster Uber Trip Locations
  19. How a Spark Application Runs on a Cluster
  20. Spark Distributed Datasets
      • Read-only collection of typed objects: Dataset[T]
      • Partitioned across a cluster
      • Operated on in parallel
      • Can be cached in memory
  21. Loading a Dataset
  22. Dataset Read From a File. [Diagram: the driver sends tasks to three workers; each worker reads one block of the file (Block 1, Block 2, Block 3).]
  23. Dataset Read From a File. [Diagram: each of the three workers processes its block and caches the data (Cache 1, Cache 2, Cache 3).]
  24. Uber Data. The data records are in CSV format, with these fields:
      • Date/Time: the date and time of the Uber pickup
      • Lat: the latitude of the Uber pickup
      • Lon: the longitude of the Uber pickup
      • Base: the TLC base company affiliated with the Uber pickup
      An example line is shown below:
      2014-08-01 00:00:00,40.729,-73.9422,B02598
  25. Load the data into a DataFrame: Define the Schema
      import org.apache.spark.sql.types._
      // one record per Uber pickup
      case class Uber(dt: String, lat: Double, lon: Double, base: String)
      // explicit schema, so Spark does not have to infer the column types
      val schema = StructType(Array(
        StructField("dt", TimestampType, true),
        StructField("lat", DoubleType, true),
        StructField("lon", DoubleType, true),
        StructField("base", StringType, true)
      ))
  26. Load the data into a DataFrame
      // file is the path to the Uber CSV data
      val df = spark.read.format("csv")
        .option("inferSchema", "false")
        .schema(schema)
        .option("header", "false")
        .load(file)
  27. Load the data into a DataFrame. [Figure: the DataFrame shown as a table of rows and columns.]
  28. Load the data into a Dataset
      import spark.implicits._ // provides the encoder that .as[Uber] needs
      val df = spark.read.format("csv")
        .option("inferSchema", "false")
        .schema(schema)
        .option("header", "false")
        .load(file)
        .as[Uber]
  29. Load the data into a Dataset. [Figure: the Dataset shown as a collection of Uber objects, arranged in rows and columns.]
  30. Dataset merged with DataFrame
      • In Spark 2.0, the DataFrame APIs merged with the Dataset APIs
      • A Dataset is a collection of typed objects (SQL and functions): Dataset[T]
      • A DataFrame is a Dataset of generic Row objects (SQL): Dataset[Row]
  31. Spark Distributed Datasets
      • Transformations create a new Dataset from the current one and are lazily evaluated
      • Actions return a value to the driver
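The difference between a lazily evaluated transformation and an action can be illustrated by analogy with a Scala collection view. This is not Spark code, and a view is only a rough stand-in for a Dataset; the object `LazyDemo`, the sample hours, and the rush-hour window of 16:00 to 19:00 are all assumptions made for the example.

```scala
object LazyDemo {
  // Counts how often the mapping function actually runs
  var evaluations = 0

  // Returns (evaluations before the action, evaluations after it, result)
  def run(): (Int, Int, Int) = {
    evaluations = 0
    val pickupHours = Vector(0, 8, 17, 18, 23)

    // "Transformation": describes the work but computes nothing yet
    val isRushHour = pickupHours.view.map { h =>
      evaluations += 1
      h >= 16 && h <= 19
    }
    val beforeAction = evaluations

    // "Action": forces the computation and returns a value to the caller
    val rushHourPickups = isRushHour.count(identity)
    (beforeAction, evaluations, rushHourPickups)
  }
}
```

Calling `LazyDemo.run()` shows that the mapping function has not run at all until `count` asks for a result.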
  32. Spark ML workflow
  33. Extract the Features. Feature vectors are vectors of numbers representing the value for each feature. [Diagram: training data → featurization → feature vectors → training → model → model evaluation → best model. Image reference: O’Reilly, Learning Spark.]
  34. Uber Example
      • What are the “if questions” or properties we can use to group?
        – These are the features
        – We will group by latitude and longitude
      • Use Spark SQL to analyze: day of the week, time, rush hour for groups …
      • NOTE: this example uses real Uber data, but the code is from me, not Uber
      [Image caption: Near Real-Time Price Surging]
  35. Use VectorAssembler to put features in a vector column
      import org.apache.spark.ml.feature.VectorAssembler
      val featureCols = Array("lat", "lon")
      val assembler = new VectorAssembler()
        .setInputCols(featureCols)
        .setOutputCol("features")
      // adds a "features" column holding the [lat, lon] vector
      val df2 = assembler.transform(df)
  36. Create the KMeans Estimator, Set Features
      import org.apache.spark.ml.clustering.KMeans
      val kmeans = new KMeans()
        .setK(10)
        .setFeaturesCol("features")
        .setPredictionCol("cid")
        .setMaxIter(20)
  37. Fit the Model on the Training Data Features
      val model = kmeans.fit(df2)
  38. Cluster Centers from the fitted model
      model.clusterCenters.foreach(println)
      [40.76930621976264,-73.96034885367698]
      [40.67562793272868,-73.79810579052476]
      [40.68848772848041,-73.9634449047477]
      [40.78957777777776,-73.14270740740741]
      [40.32418330308531,-74.18665245009073]
      [40.732808848486286,-74.00150153727878]
      [40.75396549974632,-73.57692359208531]
      [40.901700842900674,-73.868760398198]
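To see what a cluster id means, here is a small plain-Scala sketch that assigns a pickup location to the nearest of the eight cluster centers printed above. It assumes a point's cluster id is the index of its closest center by Euclidean distance in (latitude, longitude) space; because the slide lists only eight centers, the indexes here are illustrative and need not match the ids in the real model. `NearestCenter` and `predict` are hypothetical names.

```scala
object NearestCenter {
  // Cluster centers (latitude, longitude) copied from the slide output
  val centers: Vector[(Double, Double)] = Vector(
    (40.76930621976264, -73.96034885367698),
    (40.67562793272868, -73.79810579052476),
    (40.68848772848041, -73.9634449047477),
    (40.78957777777776, -73.14270740740741),
    (40.32418330308531, -74.18665245009073),
    (40.732808848486286, -74.00150153727878),
    (40.75396549974632, -73.57692359208531),
    (40.901700842900674, -73.868760398198)
  )

  // Squared Euclidean distance in (lat, lon) space
  def dist2(a: (Double, Double), b: (Double, Double)): Double = {
    val dLat = a._1 - b._1
    val dLon = a._2 - b._2
    dLat * dLat + dLon * dLon
  }

  // Index of the closest center, used here as the cluster id
  def predict(p: (Double, Double)): Int =
    centers.indices.minBy(i => dist2(centers(i), p))
}
```

For instance, `NearestCenter.predict((40.729, -73.9422))` looks up the cluster for the example pickup from the Uber Data slide.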
  39. Clusters from the fitted model
  40. Analyze Clusters. The K-means model summary provides a DataFrame with the features plus the assigned cluster.
      val clusters = model.summary.predictions
      // or: val clusters = model.transform(df2)
      clusters.createOrReplaceTempView("uber")
      clusters.show()
  41. Which clusters had the highest number of pickups?
      import org.apache.spark.sql.functions.desc
      clusters.groupBy("cid").count().orderBy(desc("count")).show(5)
      +---+-----+
      |cid|count|
      +---+-----+
      |  6|83505|
      |  5|79472|
      |  0|56241|
      | 16|26933|
      | 13|23581|
      +---+-----+
  42. Which clusters had the highest number of pickups?
      %sql
      SELECT COUNT(cid), cid FROM uber GROUP BY cid ORDER BY COUNT(cid) DESC
  43. How many pickups occurred in the busiest 5 clusters by hour?
      SELECT hour(uber.dt) AS hr, cid, count(cid) AS ct FROM uber WHERE cid IN (0,8,9,13,17) GROUP BY hour(uber.dt), cid
  44. Which hours had the highest number of pickups?
      SELECT hour(uber.dt) AS hr, count(cid) AS ct FROM uber GROUP BY hour(uber.dt)
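The grouping that these SQL queries perform can be mirrored on an ordinary Scala collection, which may help when checking results on a small sample. This is a sketch rather than Spark SQL: the `Pickup` record and the function names are hypothetical, and the timestamp format is assumed to match the CSV example (`yyyy-MM-dd HH:mm:ss`).

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object PickupCounts {
  // Hypothetical record: pickup time as in the CSV, plus the assigned cluster id
  final case class Pickup(dt: String, cid: Int)

  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Equivalent of SQL hour(dt)
  def hourOf(dt: String): Int = LocalDateTime.parse(dt, fmt).getHour

  // SELECT hour(dt), cid, count(cid) FROM uber GROUP BY hour(dt), cid
  def countsByHourAndCluster(pickups: Seq[Pickup]): Map[(Int, Int), Int] =
    pickups.groupBy(p => (hourOf(p.dt), p.cid)).map { case (k, v) => k -> v.size }

  // SELECT hour(dt), count(cid) FROM uber GROUP BY hour(dt), busiest hour first
  def busiestHours(pickups: Seq[Pickup]): Seq[(Int, Int)] =
    pickups.groupBy(p => hourOf(p.dt))
      .map { case (h, v) => h -> v.size }
      .toSeq
      .sortBy { case (h, ct) => (-ct, h) }
}
```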
  45. Save the model to the distributed file system
      import org.apache.spark.ml.clustering.KMeansModel
      // save the fitted model
      model.write.overwrite().save("/path/savemodel")
      // load it again for later use
      val sameModel = KMeansModel.load("/user/user01/data/savemodel")
  46. The model on the distributed file system
      hadoop fs -ls /user/mapr/ubermodel/metadata
      /user/mapr/ubermodel/metadata/_SUCCESS
      /user/mapr/ubermodel/metadata/part-00000
      hadoop fs -ls /user/mapr/ubermodel/data
      /user/mapr/ubermodel/data/_SUCCESS
      /user/mapr/ubermodel/data/part-00000-4d20b313-ddc1-43cb-a863-434a36330639-c000.snappy.parquet
      hadoop fs -cat /user/mapr/ubermodel/metadata/part-00000
      {"class":"org.apache.spark.ml.clustering.KMeansModel","timestamp":1540826934502,"sparkVersion":"2.3.1-mapr-1808","uid":"kmeans_4ad427355253","paramMap":{"predictionCol":"cid","seed":1,"initMode":"k-means||","featuresCol":"features","initSteps":2,"maxIter":100,"tol":1.0E-4,"k":20}}
  47. Kafka API and Streaming Data
  48. Use Case: Real-Time Analysis of Geographically Clustered Vehicles
  49. What is a Stream?
      • A stream is a continuous sequence of events or records
      • Records are key-value pairs
  50. Examples of Streaming Data: fraud detection, smart machinery, smart meters, home automation, networks, manufacturing, security systems, patient monitoring
  51. Example of Streaming Data combined with Machine Learning. A Stanford team has shown that a machine-learning model can identify arrhythmias from an EKG better than an expert.
      • https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/
  52. Applying Machine Learning to Live Patient Data: https://mapr.com/blog/ml-iot-connected-medical-devices/
  53. Collect the Data. [Diagram: source → data ingest → stream topic.]
      • Data ingest:
        – Using the Kafka API
  54. Organize Data into Topics with the MapR Event Store for Kafka. Topics are a logical collection of events and organize events into categories. [Diagram: a MapR cluster holds the topics Pressure, Temperature, and Warnings; consumers read them through the Kafka API.]
  55. Scalable Messaging with MapR Event Streams. Topics are partitioned for throughput and scalability. [Diagram: each of three servers holds one partition of each topic (Pressure, Temperature, Warning): partition 1 on server 1, partition 2 on server 2, partition 3 on server 3.]
  56. Scalable Messaging with MapR Event Streams. Producers are load balanced between partitions. [Diagram: the three partitions of each topic (Pressure, Temperature, Warning), written through the Kafka API.]
  57. Scalable Messaging with MapR Event Streams. Consumer groups can read in parallel. [Diagram: consumers read the three partitions of each topic (Pressure, Temperature, Warning) through the Kafka API.]
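One way to picture how producers are spread across partitions and how a consumer group reads in parallel is the plain-Scala sketch below. It is a simplified model under stated assumptions, not the Kafka or MapR-ES client logic: keyed records are assumed to be placed by a hash of the key, keyless records round-robin, and partitions are handed out to the consumers of a group in turn. All names are hypothetical.

```scala
object PartitionSketch {
  // Keyed records: the same key always maps to the same partition
  def partitionForKey(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Keyless records: spread round-robin by send sequence number
  def partitionForSeq(seq: Long, numPartitions: Int): Int =
    (seq % numPartitions).toInt

  // Within one consumer group, each partition is read by exactly one consumer
  def assign(numPartitions: Int, consumers: Seq[String]): Map[String, Seq[Int]] =
    (0 until numPartitions)
      .groupBy(p => consumers(p % consumers.size))
      .map { case (consumer, parts) => consumer -> parts.toSeq }
}
```

With three partitions and two consumers, `PartitionSketch.assign(3, Seq("c1", "c2"))` gives one consumer two partitions and the other one, so adding a third consumer would let all three partitions be read in parallel.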
  58. Partition is like an Event Log. New messages are added to the end. [Diagram: messages numbered from 6 (new message) down to 1 (old message).]
  59. Partition is like a Queue. Messages are delivered in the order they are received.
  60. Unlike a queue, events are still persisted after they’re delivered. Messages remain on the partition, available to other consumers.
  61. When Are Messages Deleted? Messages can be persisted forever, or older messages can be deleted automatically based on time to live. [Diagram: partition 1 in the MapR cluster holds messages 6 down to 1, with 1 the oldest.]
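The log-like behaviour described on these slides (append at the end, ordered reads, messages surviving delivery, optional time to live) can be modelled with a small in-memory class. This is a toy stand-in for a single partition, not the MapR-ES or Kafka storage engine; `PartitionLog`, `Record`, and the method names are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```scala
object PartitionLog {
  final case class Record(offset: Long, timestampMs: Long, value: String)

  // Append-only log standing in for one topic partition
  final class Log {
    private var records = Vector.empty[Record]
    private var nextOffset = 0L

    // New messages are always added at the end and receive the next offset
    def append(value: String, nowMs: Long): Long = {
      val offset = nextOffset
      records = records :+ Record(offset, nowMs, value)
      nextOffset += 1
      offset
    }

    // Reading removes nothing: every consumer keeps track of its own offset
    def read(fromOffset: Long, max: Int): Vector[Record] =
      records.dropWhile(_.offset < fromOffset).take(max)

    // Time-to-live retention: drop records older than ttlMs
    def expire(nowMs: Long, ttlMs: Long): Unit =
      records = records.filter(r => nowMs - r.timestampMs <= ttlMs)
  }
}
```

Two consumers that both call `read(0, 10)` see the same messages in the same order, which is the difference from a queue that the slides point out.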
  62. How do we do this with High Performance at Scale?
      • Parallel operations
      • Minimizes disk reads/writes
  63. Processing Same Message for Different Purposes
  64. Spark Structured Streaming
  65. Process the Data with Spark Structured Streaming
  66. Datasets Read from Stream. Data is cached for aggregations and windowed functions. [Diagram: the driver runs one task per stream partition; each task reads its partition by offsets, then processes and caches the data.]
  67. Treat Stream as Unbounded Tables. Data stream as an unbounded table: new data in the data stream = new rows appended to an unbounded table.
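The "unbounded table" idea can be sketched in plain Scala: each batch of newly arrived rows is appended, and a running query result is updated from the new rows only. This is a conceptual illustration, not how Spark Structured Streaming is implemented; `UnboundedTable`, `update`, and `run` are hypothetical names, and the query is assumed to be a count of pickups per cluster id.

```scala
object UnboundedTable {
  // Running result of "SELECT cid, count(*) ... GROUP BY cid" over all rows so far
  type Counts = Map[Int, Long]

  // A batch is a set of new rows appended to the table; the result is
  // updated incrementally instead of rescanning the old rows.
  def update(counts: Counts, newCids: Seq[Int]): Counts =
    newCids.foldLeft(counts) { (acc, cid) =>
      acc.updated(cid, acc.getOrElse(cid, 0L) + 1L)
    }

  // Process the batches in arrival order
  def run(batches: Seq[Seq[Int]]): Counts =
    batches.foldLeft(Map.empty[Int, Long])(update)
}
```

Processing the batches one by one gives the same counts as running the query once over all rows, which is what lets a streaming query be written like a batch query.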
  68. The Stream is continuously processed
  69. Spark automatically streamifies SQL plans. Image reference: Databricks.
  70. Stream Processing
  71. Data Discovery, Model Creation, and Production. [Diagram, as on slide 6: historical Uber trip data is used to build and test the model; in production, new Uber trip data from the stream topic goes through feature extraction and the deployed model to produce insights.]
  72. Use Case: Real-Time Analysis of Geographically Clustered Vehicles
  73. Load the saved model
      // load the saved model from the distributed file system
      val model = KMeansModel.load(modelpath)
  74. Streaming pipeline: Kafka Data source
      val df1 = spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "maprdemo:9092")
        .option("subscribe", "/apps/uberstream:ubers")
        .option("startingOffsets", "earliest")
        .option("failOnDataLoss", false)
        .option("maxOffsetsPerTrigger", 1000)
        .load()
  75. Kafka DataFrame schema
      df1.printSchema()
      root
       |-- key: binary (nullable = true)
       |-- value: binary (nullable = true)
       |-- topic: string (nullable = true)
       |-- partition: integer (nullable = true)
       |-- offset: long (nullable = true)
       |-- timestamp: timestamp (nullable = true)
       |-- timestampType: integer (nullable = true)
  76. Function to Parse CSV data to Uber Object
      case class Uber(dt: String, lat: Double, lon: Double, base: String, rdt: String)
      // parse a comma-separated string into an Uber case class
      def parseUber(str: String): Uber = {
        val p = str.split(",")
        Uber(p(0), p(1).toDouble, p(2).toDouble, p(3), p(4))
      }
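The parser on this slide throws an exception if a message has too few fields or a non-numeric coordinate. A more defensive variant, shown here only as a sketch, returns `None` for such input so that the caller can decide what to do with bad records. The object name `SafeParse` is hypothetical, and the choice to require exactly five fields is an assumption based on the `Uber` case class.

```scala
import scala.util.Try

object SafeParse {
  final case class Uber(dt: String, lat: Double, lon: Double, base: String, rdt: String)

  // Returns None instead of throwing when a field is missing or not numeric
  def parseUber(str: String): Option[Uber] = {
    // limit -1 keeps trailing empty fields, so the field count is reliable
    val p = str.split(",", -1).map(_.trim)
    if (p.length != 5) None
    else Try(Uber(p(0), p(1).toDouble, p(2).toDouble, p(3), p(4))).toOption
  }
}
```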
  77. Parse message text to Uber Object
      // register a user-defined function (UDF) to deserialize the message
      spark.udf.register("deserialize", (message: String) => parseUber(message))
      // use the UDF in a select expression
      val df2 = df1.selectExpr("""deserialize(CAST(value as STRING)) AS message""")
        .select($"message".as[Uber])
  78. Use VectorAssembler to put Features in a column
      val featureCols = Array("lat", "lon")
      val assembler = new VectorAssembler()
        .setInputCols(featureCols)
        .setOutputCol("features")
      val df3 = assembler.transform(df2)
  79. Use Model to get Cluster Ids from the features
      // use the model to get the cluster ids from the features
      val clusters1 = model.transform(df3)
  80. Create Unique Id for MapR-DB row key
      import org.apache.spark.sql.functions.{concat, lit}
      import org.apache.spark.sql.types.TimestampType
      // select the columns we want to keep
      val clusters = clusters1.select($"dt".cast(TimestampType), $"lat", $"lon", $"base", $"rdt", $"cid")
      // create an object with a unique id for MapR-DB
      case class UberwId(_id: String, dt: java.sql.Timestamp, base: String, cid: Integer, clat: Double, clon: Double)
      val cdf = clusters.withColumn("_id", concat($"cid", lit("_"), $"rdt")).as[UberwId]
      // cdf is like this:
      +--------------------+-------------------+-------+--------+------+---+----------------+------------------+
      |                 _id|                 dt|    lat|     lon|  base|cid|            clat|              clon|
      +--------------------+-------------------+-------+--------+------+---+----------------+------------------+
      |0_922337049642672...|2014-08-18 08:36:00| 40.723|-74.0021|B02598|  0|40.7173662333218|-74.00933866774037|
      |0_922337049642672...|2014-08-18 08:36:00|40.7288|-74.0113|B02598|  0|40.7173662333218|-74.00933866774037|
      |0_922337049642672...|2014-08-18 08:35:00|40.7417|-74.0488|B02617|  0|40.7173662333218|-74.00933866774037|
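The row key joins the cluster id and the `rdt` field with an underscore. The sample values starting with 922337049642672 suggest, though the slides do not say so, that `rdt` is a reverse timestamp: `Long.MaxValue` minus a time in epoch milliseconds. Under that assumption, the sketch below shows why such keys keep the rows of one cluster together with the newest first when a store sorts by `_id`. `RowKey` and its functions are hypothetical names.

```scala
object RowKey {
  // Assumption: rdt = Long.MaxValue - epoch milliseconds (a reverse timestamp)
  def reverseTimestamp(epochMs: Long): Long = Long.MaxValue - epochMs

  // Same shape as concat(cid, "_", rdt) on the slide
  def rowKey(cid: Int, epochMs: Long): String =
    s"${cid}_${reverseTimestamp(epochMs)}"
}
```

Because a later time gives a smaller reverse timestamp with the same number of digits, plain string ordering of the keys puts the most recent pickup of each cluster first.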
  81. Writing to a MapR-DB Sink: write the results to MapR-DB and start running the query
      val query = cdf.writeStream
        .format(MapRDBSourceConfig.Format)
        .option(MapRDBSourceConfig.TablePathOption, tableName)
        .option(MapRDBSourceConfig.IdFieldPathOption, "_id")
        .option(MapRDBSourceConfig.CreateTableOption, false)
        .option("checkpointLocation", "/user/mapr/ubercheck")
        .option(MapRDBSourceConfig.BulkModeOption, true)
        .option(MapRDBSourceConfig.SampleSizeOption, 1000)
      query.start().awaitTermination()
  82. Streaming Application
      %sql
      select * from uber limit 3
  83. Streaming Application
      SELECT hour(uber.dt) AS hr, cid, count(cid) AS ct FROM uber GROUP BY hour(uber.dt), cid
  84. Spark & MapR-DB
  85. Stream Processing Pipeline
  86. MapR-DB Connector for Apache Spark: Spark Streaming writing to MapR-DB JSON
  87. Spark MapR-DB Connector
  88. Relational Database vs. MapR-DB: Storage Model
      • RDBMS: normalized schema → joins for queries can cause a bottleneck
      • MapR-DB: de-normalized schema → data that is read together is stored together
      [Diagram: several tables with columns Key, colB, colC.]
  89. Designed for Partitioning and Scaling
  90. MapR-DB JSON Document Store. Data is automatically partitioned and sorted by _id row key!
  91. Writing to a MapR-DB Sink: write the streaming DataFrame query results to MapR-DB and start running the query
      val query = cdf.writeStream
        .format(MapRDBSourceConfig.Format)
        .option(MapRDBSourceConfig.TablePathOption, tableName)
        .option(MapRDBSourceConfig.IdFieldPathOption, "_id")
        .option(MapRDBSourceConfig.CreateTableOption, false)
        .option("checkpointLocation", "/user/mapr/ubercheck")
        .option(MapRDBSourceConfig.BulkModeOption, true)
        .option(MapRDBSourceConfig.SampleSizeOption, 1000)
      query.start().awaitTermination()
  92. Streaming Application
  93. Explore the Data With Spark SQL
  94. Spark SQL Querying MapR-DB JSON
      • Spark SQL queries and updates to MapR-DB
      • With projection and filter pushdown, custom partitioning, and data locality
  95. Spark Distributed Datasets read from MapR-DB Partitions
      val df: Dataset[UberwId] = spark
        .loadFromMapRDB[UberwId](tableName, schema)
        .as[UberwId]
      [Diagram: the driver sends tasks to the workers; each task reads one MapR-DB partition, then processes and caches the data (Cache 1, Cache 2, Cache 3).]
  96. Load the data into a DataFrame. Data is automatically partitioned and sorted by _id row key!
      df.createOrReplaceTempView("uber")
      df.show
  97. Top 5 Cluster trip counts?
      df.groupBy("cid")
        .count()
        .orderBy(desc("count"))
        .show(5)
      +---+------+
      |cid| count|
      +---+------+
      |  6|197225|
      |  5|192073|
      |  0|131296|
      | 16| 62465|
      | 13| 52408|
      +---+------+
  98. Display latest locations and Cluster centers on a Google Map
      val points = df.select("lat", "lon", "cid").orderBy(desc("dt"))
  99. Which hours have the most pickups for cluster id 0?
      df.filter($"_id" <= "1")
        .select(hour($"dt").alias("hour"), $"cid")
        .groupBy("hour", "cid")
        .agg(count("cid").alias("count"))
  100. MapR-DB Projection and Filter push down
      df.filter($"_id" <= "1")
        .select(hour($"dt").alias("hour"), $"cid")
        .groupBy("hour", "cid")
        .agg(count("cid").alias("count"))
        .orderBy(desc("count"))
        .explain
      == Physical Plan ==
      *(3) Sort [count#120L DESC NULLS LAST], true, 0
      +- Exchange rangepartitioning(count#120L DESC NULLS LAST, 200)
         +- *(2) HashAggregate(keys=[hour#113, cid#5], functions=[count(cid#5)])
            +- Exchange hashpartitioning(hour#113, cid#5, 200)
               +- *(1) HashAggregate(keys=[hour#113, cid#5], functions=[partial_count(cid#5)])
                  +- *(1) Project [hour(dt#1, Some(Etc/UTC)) AS hour#113, cid#5]
                     +- *(1) Filter (isnotnull(_id#0) && (_id#0 <= 1))
                        +- *(1) Scan MapRDBRelation(/user/mapr/ubertable [dt#1,cid#5,_id#0] PushedFilters: [IsNotNull(_id), LessThanOrEqual(_id,1)]
  101. Spark MapR-DB Projection and Filter push down. Projection and filter pushdown reduces the amount of data passed between MapR-DB and the Spark engine when selecting and filtering data. Data is selected and filtered in MapR-DB!
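The effect of pushing the filter `_id <= "1"` and the projection down to a store that keeps rows sorted by `_id` can be imitated with a sorted map. This is a toy model, not the MapR-DB connector: `PushdownSketch`, `Doc`, and `scan` are hypothetical names, and the keys are assumed to have the `cid_rdt` shape from the row key slide, which is why comparing against "1" as a string keeps only cluster 0.

```scala
import scala.collection.immutable.TreeMap

object PushdownSketch {
  // Hypothetical stored document; the query needs only dt and cid
  final case class Doc(dt: String, lat: Double, lon: Double, base: String, cid: Int)

  // Stand-in for a table that keeps its documents sorted by _id
  type Table = TreeMap[String, Doc]

  // The "store" applies the filter (_id <= maxId) and the projection (dt, cid)
  // itself, so only matching rows and the needed fields reach the caller.
  def scan(table: Table, maxId: String): Vector[(String, Int)] =
    table.iterator
      .takeWhile { case (id, _) => id.compareTo(maxId) <= 0 } // sorted keys: stop at the first miss
      .map { case (_, doc) => (doc.dt, doc.cid) }             // keep only the projected fields
      .toVector
}
```

Keys such as "1_…" or "16_…" compare greater than "1" as strings, so the scan stops as soon as it leaves the rows of cluster 0.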
  102. Which hours and Clusters have the highest pickups?
      SELECT hour(uber.dt) AS hr, cid, count(cid) AS ct FROM uber GROUP BY hour(uber.dt), cid
  103. MapR Data Platform
  104. New Spark Ebook. A link to the code for this webinar is in the appendix of this book: https://mapr.com/ebook/getting-started-with-apache-spark-v2/
  106. To Learn More: New Spark 2.0 training. MapR Free ODT: http://learn.mapr.com/
  107. MapR Blog: https://mapr.com/blog/