Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
© 2017 MapR Technologies
Data Pipeline Using Apache APIs: Kafka,
Spark, and MapR-DB
© 2017 MapR Technologies
Data Pipeline Using Apache APIs: Kafka, Spark, and MapR-DB
•  Kafka
•  Spark Streaming
•  Spark S...
© 2017 MapR Technologies
Streaming ETL Pipeline
Data Collect Process Store
Stream
Topic
Spark
Streaming
Kafka API
SQL
Open...
© 2017 MapR Technologies
Traditional ETL
Image Reference: Databricks
© 2017 MapR Technologies
Streaming ETL
Image Reference: Databricks
© 2017 MapR Technologies
What is a Stream ?
Producers Consumers
•  A stream is an continuous sequence of events or records...
© 2017 MapR Technologies
Examples of Streaming Data
Fraud detection Smart Machinery Smart Meters Home Automation
Networks ...
© 2017 MapR Technologies
Examples of Streaming Data
•  Monitoring devices combined with ML can provide alerts for Sepsis,
...
© 2017 MapR Technologies
Examples of Streaming Data
•  A Stanford team has shown that a machine-learning model can identif...
© 2017 MapR Technologies
Applying Machine Learning to Live Patient Data
•  https://www.slideshare.net/caroljmcdonald/apply...
© 2017 MapR Technologies
What has changed in the past 10 years?
•  Distributed computing
•  Streaming analytics
•  Improve...
© 2017 MapR Technologies
Serve DataStore DataCollect Data
What Do We Need to Do ?
Process DataData Sources
? ? ? ?
© 2017 MapR Technologies
Collect the Data
Data IngestSource
Stream
Topic
•  Data Ingest:
–  Using the Kafka API
© 2017 MapR Technologies
Organize Data into Topics with MapR-Event Streams
Topics: Logical collection of events, Organize ...
© 2017 MapR Technologies
Scalable Messaging with MapR Event Streams
Server 1
Partition1: Topic - Pressure
Partition1: Topi...
© 2017 MapR Technologies
Scalable Messaging with MapR Event Streams
Partition1: Topic - Pressure
Partition1: Topic - Tempe...
© 2017 MapR Technologies
Scalable Messaging with MapR Event Streams
Partition1: Topic - Pressure
Partition1: Topic - Tempe...
© 2017 MapR Technologies
Partition is like an Event Log
Consumers
MapR Cluster
Topic: Admission / Server 1
Topic: Admissio...
© 2017 MapR Technologies
Partition is like a Queue
Messages are delivered in the order they are received
MapR Cluster
6 5 ...
© 2017 MapR Technologies
Unlike a queue, events are still persisted after they’re delivered
Messages remain on the partiti...
© 2017 MapR Technologies
When Are Messages Deleted?
•  Messages can be persisted forever
•  Or
•  Older messages can be de...
© 2017 MapR Technologies
Traditional Message queue
© 2017 MapR Technologies
How do we do this with High Performance at Scale?
•  Parallel operations
•  minimizes disk read/w...
© 2017 MapR Technologies
Processing Same Message for Different Purposes
Consumers
Consumers
Consumers
Producers
Producers
...
© 2017 MapR Technologies
Stream as the System of Record
© 2017 MapR Technologies
A Table is a Snapshot of a Stream
Updates
Imagine each event as a change to an entry in a databas...
© 2017 MapR Technologies
A Stream is a Change Log of a Table
Change Log
https://engineering.linkedin.com/distributed-syste...
© 2017 MapR Technologies
Rewind: Reprocessing Events
MapR Cluster
6 5 4 3 2 1Producers
Reprocess from
oldest
Consumer
Crea...
© 2017 MapR Technologies
Rewind Reprocessing Events
MapR Cluster
6 5 4 3 2 1Producers
To Newest
Consumer
new view
Read fro...
© 2017 MapR Technologies
Event Sourcing, Command Query Responsibility Separation:
Turning the Database Upside Down
Key-Val...
© 2017 MapR Technologies
Use Case: Streaming System of Record for Healthcare
Objective:
•  Build a flexible, secure
health...
© 2017 MapR Technologies32
ALLOY Health:
Exchange State HIE
Clinical Data Viewer
Reporting and Analytics
Clinical Data
Fin...
© 2017 MapR Technologies
Use Case: Streaming System of Record for Healthcare
© 2017 MapR Technologies
Spark Dataset
© 2017 MapR Technologies
Spark Distributed Datasets
Dataset
W
Executor
P4
W
Executor
P1 P3
W
Executor
P2
partitioned
Parti...
© 2017 MapR Technologies
val df: Dataset[Payment] = spark.read.json(”/p/file.json").as[Payment]
Spark Distributed Datasets...
© 2017 MapR Technologies
DataFrame is like a table
Dataset[Row]
row
columns
DataFrame = Dataset[Row] can use Spark SQL
© 2017 MapR Technologies
A Dataset is a collection of typed objects
Dataset[objects]
objects
columns
Dataset[Typed Object]...
© 2017 MapR Technologies
Spark Streaming
© 2017 MapR Technologies
Collect Data
Process the Data with Spark Streaming
Process Data
Stream
Topic
•  scalable, high-th...
© 2017 MapR Technologies
What is a Stream ?
Producers Consumers
•  A stream is an continuous sequence of events or records...
© 2017 MapR Technologies
Data stream Unbounded Table
new data in the
data stream
=
new rows appended
to an unbounded table...
© 2017 MapR Technologies
Spark Distributed Datasets read from Stream partitions
Task
Cache
Process
& Cache
Data
offsets
St...
© 2017 MapR Technologies
Streaming data =
Unbounded table
Static Data =
bounded table
Same Dataset operations & SQL
Stream...
© 2017 MapR Technologies
Conceptual model
incremental
query
3.  Append
© 2017 MapR Technologies
Continuous incremental execution
Spark SQL converts queries
to incremental execution plans
For in...
© 2017 MapR Technologies
Use Case: Payment Data
Payment input data
Stream
Input
"NEW","Covered Recipient Physician",,,,"13...
© 2017 MapR Technologies
Use Case: Open Payment Dataset
•  Payments Drug and Device companies make to
•  Physicians and Te...
© 2017 MapR Technologies
Scenario: Payment Data
Provider
ID
Date Payer Payer
State
Provider
Specialty
Provider
State
Amoun...
© 2017 MapR Technologies
Stream the data into a Dataframe: Define the Schema
case class Payment(physician_id: String,
date...
© 2017 MapR Technologies
Function to Parse CSV into Payment Class
def parse(str: String): Payment = {
val td = str.split("...
© 2017 MapR Technologies
Parsed and Transformed Payment Data
{
"_id":”TX_Gynecology_08/26/2016_346122858",
"physician_id":...
© 2017 MapR Technologies
Streaming pipeline Data source
Specify data source
returns a dataframe
val	df1	=	spark.readStream...
© 2017 MapR Technologies
Transformation
Cast bytes from Kafka records
to a string, parse csv , and
return Dataset[Payment]...
© 2017 MapR Technologies
Streams
Stream Processing
Stream
Processing
Storage
Raw
Enriched
Filtered
Stream Processing:
•  F...
© 2017 MapR Technologies
Dataframe Integrated Queries
L I Query Description
agg(expr, exprs) Aggregates on entire DataFram...
© 2017 MapR Technologies
Continuous aggregations
Continuously compute average payment amountval	d3=df2.avg(“amount")
© 2017 MapR Technologies
Continuous aggregations and filter
val	d3=df2.groupBy(“payer")	
					.avg(“amount")
© 2017 MapR Technologies
Continuous aggregations and filter
val	d3=df2	
					.filter($"amount"	>	20000)
© 2017 MapR Technologies
Streaming pipeline Kafka topic Data Sink
Write results to Kafka topic
Start running the query
val...
© 2017 MapR Technologies
Streaming Applicaton
Streaming Dataset
Streaming Dataset
Transformed
Stream
Topic
Stream
Topic
St...
© 2017 MapR Technologies
Spark & MapR-DB
© 2017 MapR Technologies
AnalyzeStore DataCollect Data
What Do We Need to Do ?
Process DataData Sources
? ? ? ?
Stream
Top...
© 2017 MapR Technologies
Stream Processing Pipeline
Data Collect Process Store
Stream
Topic
Spark
Streaming
Kafka API
SQL
...
© 2017 MapR Technologies
Where/How to store data ?
© 2017 MapR Technologies
Relational Database vs. MapR-DB
bottleneck
Storage ModelRDBMS MapR-DB
Normalized schema à Joins f...
© 2017 MapR Technologies
MapR-DB JSON Document Store
Data that is read together is stored
together
© 2017 MapR Technologies
Designed for Partitioning and Scaling
Key
Range
xxxx
xxxx
Key
Range
xxxx
xxxx
Key
Range
xxxx
xxxx...
© 2017 MapR Technologies
MapR-DB JSON Document Store
Data is automatically partitioned
by Key Range!
© 2017 MapR Technologies
Payment Data
Automatically sorted and
partitioned by Key Range
(_id)
{
"_id":”TX_Gynecology_08/26...
© 2017 MapR Technologies
Spark Streaming writing to MapR-DB JSON
© 2017 MapR Technologies
Spark MapR-DB Connector
•  Connection object in every Spark Executor:
•  distributed parallel wri...
© 2017 MapR Technologies
Use Case: Flight Delays
Payment input data
Stream
Input
"NEW","Covered Recipient Physician",,,,"1...
© 2017 MapR Technologies
Streaming pipeline Data Sink
Write result to maprdb
Start running the query
val	query	=	df2.write...
© 2017 MapR Technologies
Streaming Applicaton
Streaming Dataset
Streaming Dataset
Transformed
Tablets
Stream
Topic
Stream
...
© 2017 MapR Technologies
Explore the Data With Spark SQL
© 2017 MapR Technologies
AnalyzeStore DataCollect Data
What Do We Need to Do ?
Process DataData Sources
? ? ?
Stream
Topic...
© 2017 MapR Technologies
Spark SQL Querying MapR-DB JSON
© 2017 MapR Technologies
Data
Frame
Load data
Load the data into a Dataframe
val pdf: Dataset[Payment] =
spark.loadFromMap...
© 2017 MapR Technologies
val pdf: Dataset[Payment] =
spark.loadFromMapRDB[Payment]("/apps/paytable", schema)
.as[Payment]
...
© 2017 MapR Technologies
Language Integrated Queries
L I Query Description
agg(expr, exprs) Aggregates on entire DataFrame...
© 2017 MapR Technologies
Top 5 Nature of Payment by count
val res = pdf.groupBy(”Nature_of_Payment")
.count()
.orderBy(des...
© 2017 MapR Technologies
Top 5 Nature of Payment by amount of payment
%sql select Nature_of_payment,
sum(amount) as total ...
© 2017 MapR Technologies
What are the Nature of Payments with payments > $1000 with count
pdf.filter($"amount" > 1000)
.gr...
© 2017 MapR Technologies
Top 5 Physician Specialties by total Amount
%sql select physician_specialty, sum(amount) as total...
© 2017 MapR Technologies
Average Payment by Specialty
© 2017 MapR Technologies
Top Payers by Total Amount with count
%sql select payer, payer_state, count(*) as cnt,
sum(amount...
© 2017 MapR Technologies
Stream Processing
Building a Complete Data Architecture
MapR File System
(MapR-XD)
MapR Converged...
© 2017 MapR Technologies
Data Pipelines and Machine Learning Logistics
Input Data +
Actual Delay
Input Data +
Predictions
...
© 2017 MapR Technologies
© 2017 MapR Technologies
To Learn More:
•  MapR Free ODT http://learn.mapr.com/
© 2017 MapR Technologies
…helping you put data technology to work
●  Find answers
●  Ask technical questions
●  Join on-de...
© 2017 MapR Technologies
MapR Blog
• https://www.mapr.com/blog/
© 2017 MapR Technologies
To Learn More: ETL Payment data pipeline
•  https://mapr.com/blog/etl-pipeline-healthcare-dataset...
© 2017 MapR Technologies
To Learn More:
•  https://mapr.com/blog/how-stream-first-architecture-patterns-are-
revolutionizi...
© 2017 MapR Technologies
To Learn More:
•  https://mapr.com/blog/ml-iot-connected-medical-devices/
© 2017 MapR Technologies
Applying Machine Learning to Live Patient Data
•  https://www.slideshare.net/caroljmcdonald/apply...
© 2017 MapR Technologies
MapR Container for Developers
• https://maprdocs.mapr.com/home/MapRContainerDevelopers/
MapRConta...
© 2017 MapR Technologies
MapR Data Science Refinery
• https://mapr.com/products/data-science-refinery/
© 2017 MapR Technologies
MapR Data Platform
© 2017 MapR Technologies
Q&A
ENGAGE WITH US
Upcoming SlideShare
Loading in …5
×
Upcoming SlideShare
What to Upload to SlideShare
Next
Download to read offline and view in fullscreen.

5

Share

Download to read offline

Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB

Download to read offline

Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all

Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB

  1. 1. © 2017 MapR Technologies Data Pipeline Using Apache APIs: Kafka, Spark, and MapR-DB
  2. 2. © 2017 MapR Technologies Data Pipeline Using Apache APIs: Kafka, Spark, and MapR-DB •  Kafka •  Spark Streaming •  Spark SQL
  3. 3. © 2017 MapR Technologies Streaming ETL Pipeline Data Collect Process Store Stream Topic Spark Streaming Kafka API SQL Open JSON API Analyze JSON SQL
  4. 4. © 2017 MapR Technologies Traditional ETL Image Reference: Databricks
  5. 5. © 2017 MapR Technologies Streaming ETL Image Reference: Databricks
  6. 6. © 2017 MapR Technologies What is a Stream ? Producers Consumers •  A stream is an continuous sequence of events or records •  Records are key-value pairs Stream of Data key value key value key value key value
  7. 7. © 2017 MapR Technologies Examples of Streaming Data Fraud detection Smart Machinery Smart Meters Home Automation Networks Manufacturing Security Systems Patient Monitoring
  8. 8. © 2017 MapR Technologies Examples of Streaming Data •  Monitoring devices combined with ML can provide alerts for Sepsis, which is one of the leading causes for death in hospitals –  http://www.computerweekly.com/news/450422258/Putting-sepsis-algorithms-into-electronic- patient-records
  9. 9. © 2017 MapR Technologies Examples of Streaming Data •  A Stanford team has shown that a machine-learning model can identify heart arrhythmias from an electrocardiogram (ECG) better than an expert –  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready-to-play-doctor/
  10. 10. © 2017 MapR Technologies Applying Machine Learning to Live Patient Data •  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to- live-patient-data
  11. 11. © 2017 MapR Technologies What has changed in the past 10 years? •  Distributed computing •  Streaming analytics •  Improved machine learning
  12. 12. © 2017 MapR Technologies Serve DataStore DataCollect Data What Do We Need to Do ? Process DataData Sources ? ? ? ?
  13. 13. © 2017 MapR Technologies Collect the Data Data IngestSource Stream Topic •  Data Ingest: –  Using the Kafka API
  14. 14. © 2017 MapR Technologies Organize Data into Topics with MapR-Event Streams Topics: Logical collection of events, Organize Events into Categories Consumers MapR Cluster Topic: Pressure Topic: Temperature Topic: Warnings Consumers Consumers Kafka API Kafka API
  15. 15. © 2017 MapR Technologies Scalable Messaging with MapR Event Streams Server 1 Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Server 2 Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Server 3 Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Topics are partitioned for throughput and scalability
  16. 16. © 2017 MapR Technologies Scalable Messaging with MapR Event Streams Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Producers are load balanced between partitions Kafka API
  17. 17. © 2017 MapR Technologies Scalable Messaging with MapR Event Streams Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning Consumers Consumers Consumers Consumer groups can read in parallel Kafka API
  18. 18. © 2017 MapR Technologies Partition is like an Event Log Consumers MapR Cluster Topic: Admission / Server 1 Topic: Admission / Server 2 Topic: Admission / Server 3 Consumers Consumers Partition 1 New Messages are appended to the end Partition 2 Partition 3 6 5 4 3 2 1 3 2 1 5 4 3 2 1 Producers Producers Producers New Message 6 5 4 3 2 1 Old Message
  19. 19. © 2017 MapR Technologies Partition is like a Queue Messages are delivered in the order they are received MapR Cluster 6 5 4 3 2 1 Consumer groupProducers Read cursors Consumer group
  20. 20. © 2017 MapR Technologies Unlike a queue, events are still persisted after they’re delivered Messages remain on the partition, available to other consumers MapR Cluster (1 Server) Topic: Warning Partition 1 3 2 1 Unread Events Get Unread 3 2 1 Client Library ConsumerPoll
  21. 21. © 2017 MapR Technologies When Are Messages Deleted? •  Messages can be persisted forever •  Or •  Older messages can be deleted automatically based on time to live MapR Cluster (1 Server) 6 5 4 3 2 1Partition 1 Older message
  22. 22. © 2017 MapR Technologies Traditional Message queue
  23. 23. © 2017 MapR Technologies How do we do this with High Performance at Scale? •  Parallel operations •  minimizes disk read/writes
  24. 24. © 2017 MapR Technologies Processing Same Message for Different Purposes Consumers Consumers Consumers Producers Producers Producers MapR-FS Kafka API Kafka API
  25. 25. © 2017 MapR Technologies Stream as the System of Record
  26. 26. © 2017 MapR Technologies A Table is a Snapshot of a Stream Updates Imagine each event as a change to an entry in a database. Account Id Balance WillO 80.00 BradA 20.00 1: WillO : Deposit : 100.00 2: BradA : Deposit : 50.00 3: BradA : Withdraw : 30.00 4: WillO : Withdraw: 20.00 https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying Change log 4 3 2 1
  27. 27. © 2017 MapR Technologies A Stream is a Change Log of a Table Change Log https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying 3 2 1 3 2 1 3 2 1 Duality of Streams and Tables Master: Append writes Slave: Apply writes in order Replication of changes
  28. 28. © 2017 MapR Technologies Rewind: Reprocessing Events MapR Cluster 6 5 4 3 2 1Producers Reprocess from oldest Consumer Create new view, Index, cache
  29. 29. © 2017 MapR Technologies Rewind Reprocessing Events MapR Cluster 6 5 4 3 2 1Producers To Newest Consumer new view Read from new view
  30. 30. © 2017 MapR Technologies Event Sourcing, Command Query Responsibility Separation: Turning the Database Upside Down Key-Val Document Graph Wide Column Time Series Relational ???Events Updates
  31. 31. © 2017 MapR Technologies Use Case: Streaming System of Record for Healthcare Objective: •  Build a flexible, secure healthcare exchange Records Analysis Applications Challenges: •  Many different data models •  Security and privacy issues •  HIPAA compliance Records
  32. 32. © 2017 MapR Technologies32 ALLOY Health: Exchange State HIE Clinical Data Viewer Reporting and Analytics Clinical Data Financial Data Provider Organizations What are the outcomes in the entire state on diabetes? Are there doctors that are doing this better than others? Georgia Health Connect
  33. 33. © 2017 MapR Technologies Use Case: Streaming System of Record for Healthcare
  34. 34. © 2017 MapR Technologies Spark Dataset
  35. 35. © 2017 MapR Technologies Spark Distributed Datasets Dataset W Executor P4 W Executor P1 P3 W Executor P2 partitioned Partition 1 8213034705, 95, 2.927373, jake7870, 0…… Partition 2 8213034705, 115, 2.943484, Davidbresler2, 1…. Partition 3 8213034705, 100, 2.951285, gladimacowgirl, 58… Partition 4 8213034705, 117, 2.998947, daysrus, 95…. •  Read only collection of typed objects Dataset[T] •  Partitioned across a cluster •  Operated on in parallel •  can be Cached
  36. 36. © 2017 MapR Technologies val df: Dataset[Payment] = spark.read.json(”/p/file.json").as[Payment] Spark Distributed Datasets read from a file Worker Task Worker Block 1 Block 2 Block 3 Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data Task Task Block 1 Driver tasks tasks tasks
  37. 37. © 2017 MapR Technologies DataFrame is like a table Dataset[Row] row columns DataFrame = Dataset[Row] can use Spark SQL
  38. 38. © 2017 MapR Technologies A Dataset is a collection of typed objects Dataset[objects] objects columns Dataset[Typed Object] can use Spark SQL and Functions
  39. 39. © 2017 MapR Technologies Spark Streaming
  40. 40. © 2017 MapR Technologies Collect Data Process the Data with Spark Streaming Process Data Stream Topic •  scalable, high-throughput, stream processing of live data
  41. 41. © 2017 MapR Technologies What is a Stream ? Producers Consumers •  A stream is an continuous sequence of events or records •  Records are key-value pairs Stream of Data key value key value key value key value
  42. 42. © 2017 MapR Technologies Data stream Unbounded Table new data in the data stream = new rows appended to an unbounded table Data stream as an unbounded table Treat Stream as Unbounded Tables
  43. 43. © 2017 MapR Technologies Spark Distributed Datasets read from Stream partitions Task Cache Process & Cache Data offsets Stream partition Task Cache Process & Cache Data Task Cache Process & Cache Data Driver Stream partition Stream partition Data is cached for aggregations And windowed functions
  44. 44. © 2017 MapR Technologies Streaming data = Unbounded table Static Data = bounded table Same Dataset operations & SQL Stream Processing on Spark SQL Engine
  45. 45. © 2017 MapR Technologies Conceptual model incremental query 3.  Append
  46. 46. © 2017 MapR Technologies Continuous incremental execution Spark SQL converts queries to incremental execution plans For input of data Incremental Incremental Incremental
  47. 47. © 2017 MapR Technologies Use Case: Payment Data Payment input data Stream Input "NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745 AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States", 90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No Third Party Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT", ,"Covered","Device","Radiology","STAR Tumor Ablation System",,,,,,,,,,,,,,,,,"2016","06/30/2017" transform Spark Streaming { "_id":"317150_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "record_id":"346122858", "payer":"Mission Pharmacal Company", "amount":9.23, "Physician_Specialty":"Obstetrics & Gynecology", "Nature_of_payment":"Food and Beverage" } JSON
  48. 48. © 2017 MapR Technologies Use Case: Open Payment Dataset •  Payments Drug and Device companies make to •  Physicians and Teaching Hospitals for •  Travel, Research, Gifts, Speaking fees, and Meals
  49. 49. © 2017 MapR Technologies Scenario: Payment Data Provider ID Date Payer Payer State Provider Specialty Provider State Amount Payment Nature 1261770 01/11/2016 Southern Anesthesia & Surgical, Inc CO Oral and Maxillofacial Surgery CA 117.5 Food and Beverage
  50. 50. © 2017 MapR Technologies Stream the data into a Dataframe: Define the Schema case class Payment(physician_id: String, date_payment: String, payer: String, payer_state: String amount: Double, physician_specialty: String, phys_state: String, nature_of_payment:String) val schema = StructType(Array( StructField("_id", StringType, true), StructField("physician_id", StringType, true), StructField("date_payment", StringType, true), StructField("payer", StringType, true), StructField("payer_state", StringType, true), StructField("amount", DoubleType, true), StructField("physician_specialty", StringType, true), StructField("physician_type", StringType, true), StructField("physician_state", StringType, true), StructField("nature_of_payment", StringType, true) ))
  51. 51. © 2017 MapR Technologies Function to Parse CSV into Payment Class def parse(str: String): Payment = { val td = str.split(",(?=([^"]*"[^"]*")*[^"]*$)") val physician_id = td(5) val payer = td(27) . . . val physician_state = td(20) var focus =td(19) val id =physician_state+'_’+focus+ '_’+ date_payment+'_' + record_id Payment(id, physician_id, date_payment, payer, payer_state, amount, physician_type, focus, physician_state, nature_of_payment) }
  52. 52. © 2017 MapR Technologies Parsed and Transformed Payment Data { "_id":”TX_Gynecology_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "payer":"Mission Pharmacal Company", "payer_state":”CO", "amount":9.23, ”physician_specialty":”Gynecology", “physician_state":”TX" ”nature_of_payment":"Food and Beverage" } Example Dataset Row
  53. 53. © 2017 MapR Technologies Streaming pipeline Data source Specify data source returns a dataframe val df1 = spark.readStream.format("kafka") .option("kafka.boostrap.servers",...) .option("subscribe", "topic") .load()
  54. 54. © 2017 MapR Technologies Transformation Cast bytes from Kafka records to a string, parse csv , and return Dataset[Payment] spark.udf.register("deserialize", (message: String) => parse(message)) val df2=df1 .selectExpr("""deserialize(CAST(value as STRING)) AS message""") .select($"message".as[Payment])
  55. 55. © 2017 MapR Technologies Streams Stream Processing Stream Processing Storage Raw Enriched Filtered Stream Processing: •  Filtering •  Transformations •  Aggregations •  Enrichments with ML •  Enrichments with joins MapR-DB MapR-XD
  56. 56. © 2017 MapR Technologies Dataframe Integrated Queries L I Query Description agg(expr, exprs) Aggregates on entire DataFrame distinct Returns new DataFrame with unique rows except(other) Returns new DataFrame with rows from this DataFrame not in other DataFrame filter(expr); where(condition) Filter based on the SQL expression or condition groupBy(cols: Columns) Groups DataFrame using specified columns join (DataFrame, joinExpr) Joins with another DataFrame using given join expression sort(sortcol) Returns new DataFrame sorted by specified column select(col) Selects set of columns
  57. 57. © 2017 MapR Technologies Continuous aggregations Continuously compute average payment amountval d3=df2.avg(“amount")
  58. 58. © 2017 MapR Technologies Continuous aggregations and filter val d3=df2.groupBy(“payer") .avg(“amount")
  59. 59. © 2017 MapR Technologies Continuous aggregations and filter val d3=df2 .filter($"amount" > 20000)
  60. 60. © 2017 MapR Technologies Streaming pipeline Kafka topic Data Sink Write results to Kafka topic Start running the query val query = df3.write .format("kafka") .option("kafka.bootstrap.servers", "host1:port1,host2:port2") .option("topic", "/apps/uberstream:uberp") query.start.awaitTermination()
  61. 61. © 2017 MapR Technologies Streaming Applicaton Streaming Dataset Streaming Dataset Transformed Stream Topic Stream Topic Stream Topic Stream Topic Stream Topic Stream Topic
  62. 62. © 2017 MapR Technologies Spark & MapR-DB
  63. 63. © 2017 MapR Technologies AnalyzeStore DataCollect Data What Do We Need to Do ? Process DataData Sources ? ? ? ? Stream Topic Spark Streaming JSON
  64. 64. © 2017 MapR Technologies Stream Processing Pipeline Data Collect Process Store Stream Topic Spark Streaming Kafka API SQL Open JSON API Analyze JSON SQL
  65. 65. © 2017 MapR Technologies Where/How to store data ?
  66. 66. © 2017 MapR Technologies Relational Database vs. MapR-DB bottleneck Storage ModelRDBMS MapR-DB Normalized schema à Joins for queries can cause bottleneck De-Normalized schema à Data that is read together is stored together Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val
  67. 67. © 2017 MapR Technologies MapR-DB JSON Document Store Data that is read together is stored together
  68. 68. © 2017 MapR Technologies Designed for Partitioning and Scaling Key Range xxxx xxxx Key Range xxxx xxxx Key Range xxxx xxxx Fast Reads and Writes by Key! Data is automatically partitioned by Key Range! Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val Key colB colC xxx val val xxx val val
  69. 69. © 2017 MapR Technologies MapR-DB JSON Document Store Data is automatically partitioned by Key Range!
  70. 70. © 2017 MapR Technologies Payment Data Automatically sorted and partitioned by Key Range (_id) { "_id":”TX_Gynecology_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "payer":"Mission Pharmacal Company", "payer_state":”CO", "amount":9.23, ”physician_specialty":”Gynecology", “physician_state":”TX" ”nature_of_payment":"Food and Beverage" }
  71. 71. © 2017 MapR Technologies Spark Streaming writing to MapR-DB JSON
  72. 72. © 2017 MapR Technologies Spark MapR-DB Connector •  Connection object in every Spark Executor: •  distributed parallel writes & reads
  73. 73. © 2017 MapR Technologies Use Case: Flight Delays Payment input data Stream Input "NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745 AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States", 90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No Third Party Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT", ,"Covered","Device","Radiology","STAR Tumor Ablation System",,,,,,,,,,,,,,,,,"2016","06/30/2017" transform Spark Streaming { "_id":"317150_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "record_id":"346122858", "payer":"Mission Pharmacal Company", "amount":9.23, "Physician_Specialty":"Obstetrics & Gynecology", "Nature_of_payment":"Food and Beverage" } JSON
  74. 74. © 2017 MapR Technologies Streaming pipeline Data Sink Write result to maprdb Start running the query val query = df2.writeStream .format(MapRDBSourceConfig.Format) .option(MapRDBSourceConfig.TablePathOption, "/apps/paytable") .option(MapRDBSourceConfig.CreateTableOption, false) .option(MapRDBSourceConfig.IdFieldPathOption, "value") .outputMode(”append") query.start().awaitTermination()
  75. 75. © 2017 MapR Technologies Streaming Applicaton Streaming Dataset Streaming Dataset Transformed Tablets Stream Topic Stream Topic Stream Topic Data is rapidly available for complex, ad-hoc analytics
  76. 76. © 2017 MapR Technologies Explore the Data With Spark SQL
  77. 77. © 2017 MapR Technologies AnalyzeStore DataCollect Data What Do We Need to Do ? Process DataData Sources ? ? ? Stream Topic Spark Streaming JSON SQL
  78. 78. © 2017 MapR Technologies Spark SQL Querying MapR-DB JSON
  79. 79. © 2017 MapR Technologies Data Frame Load data Load the data into a Dataframe val pdf: Dataset[Payment] = spark.loadFromMapRDB[Payment]("/apps/paytable", schema) .as[Payment] pdf.select("_id", "payer", "amount”).show
  80. 80. © 2017 MapR Technologies val pdf: Dataset[Payment] = spark.loadFromMapRDB[Payment]("/apps/paytable", schema) .as[Payment] Spark Distributed Datasets read from MapR-DB Partitions Worker Task Worker Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data Task Task Driver tasks tasks tasks
  81. 81. © 2017 MapR Technologies Language Integrated Queries L I Query Description agg(expr, exprs) Aggregates on entire DataFrame distinct Returns new DataFrame with unique rows except(other) Returns new DataFrame with rows from this DataFrame not in other DataFrame filter(expr); where(condition) Filter based on the SQL expression or condition groupBy(cols: Columns) Groups DataFrame using specified columns join (DataFrame, joinExpr) Joins with another DataFrame using given join expression sort(sortcol) Returns new DataFrame sorted by specified column select(col) Selects set of columns
  82. 82. © 2017 MapR Technologies Top 5 Nature of Payment by count val res = pdf.groupBy(”Nature_of_Payment") .count() .orderBy(desc(count)) .show(5)
  83. 83. © 2017 MapR Technologies Top 5 Nature of Payment by amount of payment %sql select Nature_of_payment, sum(amount) as total from payments group by Nature_of_payment order by total desc limit 5
  84. 84. © 2017 MapR Technologies What are the Nature of Payments with payments > $1000 with count pdf.filter($"amount" > 1000) .groupBy("Nature_of_payment") .count().orderBy(desc("count")).show()
  85. 85. © 2017 MapR Technologies Top 5 Physician Specialties by total Amount %sql select physician_specialty, sum(amount) as total from payments where physician_specialty IS NOT NULL group by physician_specialty order by total desc limit 5
  86. 86. © 2017 MapR Technologies Average Payment by Specialty
  87. 87. © 2017 MapR Technologies Top Payers by Total Amount with count %sql select payer, payer_state, count(*) as cnt, sum(amount) as total from payments group by payer, payer_state order by total desc limit 10
  88. 88. © 2017 MapR Technologies Stream Processing Building a Complete Data Architecture MapR File System (MapR-XD) MapR Converged Data Platform MapR Database (MapR-DB) MapR Event Streams Sources/Apps Bulk Processing All of these components can run on the same cluster with the MapR Converged platform.
  89. 89. © 2017 MapR Technologies Data Pipelines and Machine Learning Logistics Input Data + Actual Delay Input Data + Predictions Consumer withML Model 2 Consumer withML Model 1 Decoy results Consumer Consumer withML Model 3 Consumer Stream Archive Stream Scores Stream Input SQL SQL Real time Flight Data Stream Input Actual Delay Input Data + Predictions + Actual Delay Real Time dashboard + Historical Analysis
  90. 90. © 2017 MapR Technologies
  91. 91. © 2017 MapR Technologies To Learn More: •  MapR Free ODT http://learn.mapr.com/
  92. 92. © 2017 MapR Technologies …helping you put data technology to work ●  Find answers ●  Ask technical questions ●  Join on-demand training course discussions ●  Follow release announcements ●  Share and vote on product ideas ●  Find Meetup and event listings Connect with fellow Apache Hadoop and Spark professionals community.mapr.com
  93. 93. © 2017 MapR Technologies MapR Blog • https://www.mapr.com/blog/
  94. 94. © 2017 MapR Technologies To Learn More: ETL Payment data pipeline •  https://mapr.com/blog/etl-pipeline-healthcare-dataset-with-spark-json-mapr- db/ •  https://mapr.com/blog/streaming-data-pipeline-transform-store-explore- healthcare-dataset-mapr-db/
  95. 95. © 2017 MapR Technologies To Learn More: •  https://mapr.com/blog/how-stream-first-architecture-patterns-are- revolutionizing-healthcare-platforms/
  96. 96. © 2017 MapR Technologies To Learn More: •  https://mapr.com/blog/ml-iot-connected-medical-devices/
  97. 97. © 2017 MapR Technologies Applying Machine Learning to Live Patient Data •  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to- live-patient-data
  98. 98. © 2017 MapR Technologies MapR Container for Developers • https://maprdocs.mapr.com/home/MapRContainerDevelopers/ MapRContainerDevelopersOverview.html
  99. 99. © 2017 MapR Technologies MapR Data Science Refinery • https://mapr.com/products/data-science-refinery/
  100. 100. © 2017 MapR Technologies MapR Data Platform
  101. 101. © 2017 MapR Technologies Q&A ENGAGE WITH US
  • tigerkdp

    Jan. 15, 2020
  • natraghu

    Jan. 2, 2020
  • ssuserce170b

    Jun. 15, 2019
  • UmapathyV

    Jan. 16, 2019
  • StreamingAnalytics

    Jun. 29, 2018

Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB

Views

Total views

1,038

On Slideshare

0

From embeds

0

Number of embeds

4

Actions

Downloads

73

Shares

0

Comments

0

Likes

5

×