Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Data	Pipeline	using:	Spark,	MapR	Event	Store	
for	Kafka,	and	MapR	Database
2 © 2018 MapR Technologies, Inc
•  Overview	of	Kafka	API	
•  Use	Spark	Structured	Streaming	to	
continously:	
•  Read	from...
3 © 2018 MapR Technologies, Inc
Use	Case:	Open	Payment	Dataset	
•  Payments Drug and Device companies make to
•  Physician...
4 © 2018 MapR Technologies, Inc
•  Medicare	payment	data	for	most	common	inpatient	diagnoses	
•  Data	shows	dramatically	d...
What	is	Streaming	Data?
6 © 2018 MapR Technologies, Inc
What	is	a	Stream	?	
•  A stream is a continuous sequence of events
•  Events are key-value...
7 © 2018 MapR Technologies, Inc
What	is	Streaming	Data?	
Fraud detection Smart Machinery Smart Meters Home Automation
Netw...
8 © 2018 MapR Technologies, Inc
Monitoring	devices	combined	with	ML	can	provide	alerts	for	Sepsis,	
which	is	one	of	the	le...
9 © 2018 MapR Technologies, Inc
A	Stanford	team	has	shown	that	a	machine-learning	model	can	identify	arrhythmias	
from	an	...
10 © 2018 MapR Technologies, Inc
	the	ECG	app	on	the	Apple	Watch	can	check	heart	rhythms	and	send	a	notification	if	an	
ir...
11 © 2018 MapR Technologies, Inc
https://mapr.com/blog/ml-iot-connected-medical-devices/	
Applying	Machine	Learning	to	Liv...
Kafka	API	and	Streaming	Data
13 © 2018 MapR Technologies, Inc
Intro	to	the	Kafka	API
14 © 2018 MapR Technologies, Inc
Topics:		
Logical	collection	of	events		
Organize	Events	into	Categories	
Organize	Data	i...
15 © 2018 MapR Technologies, Inc
Topics:		
Logical	collection	of	events		
Organize	Events	into	Categories	
Organize	Data	i...
16 © 2018 MapR Technologies, Inc
Topics	are	partitioned	for	throughput	and	
scalability	
	
Scalable	Messaging	with	MapR	Ev...
17 © 2018 MapR Technologies, Inc
New	Messages	are		
Added	to	the	end	
Partition	is	like	an	Event	Log	
New
Message
6 5 4 3 ...
18 © 2018 MapR Technologies, Inc
Messages	are	delivered	in	the	order	they	are	received	
Partition	is	like	a	Queue
19 © 2018 MapR Technologies, Inc
Events	remain	on	the	partition,	available	to	other	consumers	
	
	
Unlike	a	queue,	Events	...
20 © 2018 MapR Technologies, Inc
Messages	can	be	persisted	forever		
Or			
Older	messages	can	be	deleted	automatically	bas...
21 © 2018 MapR Technologies, Inc
High	Performance	at	Scale:		
•  Partitioning	for	Parallel	operations		
•  Not	deleting	me...
22 © 2018 MapR Technologies, Inc
Processing	Same	Message	for	Different	Purposes
23 © 2018 MapR Technologies, Inc
Georgia	Health	Connect	
ALLOY Health:
Exchange State HIE
Clinical Data Viewer
Reporting a...
24 © 2018 MapR Technologies, Inc
Use	Case:	Streaming	System	of	Record	for	Healthcare
Spark	Structured	Streaming
26 © 2018 MapR Technologies, Inc
Spark	Distributed	Datasets	
partitioned
•  Distributed collection of objects
Dataset[T]
•...
27 © 2018 MapR Technologies, Inc
DataFrame	=	Dataset[Row]	can	use	Spark	SQL	
	
DataFrame is like a table
Dataset[Row]
row
...
28 © 2018 MapR Technologies, Inc
Dataset[Typed	Object]		can	use	Spark	SQL	and	Functions	
A Dataset is a distributed collec...
29 © 2018 MapR Technologies, Inc
Process	the	Data	with	Spark	Structured	Streaming
30 © 2018 MapR Technologies, Inc
Datasets	Read	from	Stream	
Task	
Cache		
Process
& Cache
Data
offsets
Stream
partition
Ta...
31 © 2018 MapR Technologies, Inc
new data in the
data stream
=
new rows appended
to an unbounded table
Data stream as an u...
32 © 2018 MapR Technologies, Inc
The	Stream	is	continuously	processed
33 © 2018 MapR Technologies, Inc
Spark	automatically		streamifies	SQL	plans	
Image	reference	Databricks
34 © 2018 MapR Technologies, Inc
Stream	Processing
35 © 2018 MapR Technologies, Inc
Use	Case:	Payment	Data	
"NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZA...
36 © 2018 MapR Technologies, Inc
Scenario:	Payment	Data	
Provider	ID Date Payer Payer
State
Provider	
Specialty
Provider	
...
37 © 2018 MapR Technologies, Inc
val df1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "maprdemo:9...
38 © 2018 MapR Technologies, Inc
df1.printSchema()
root

|-- key: binary (nullable = true)

|-- value: binary (nullable = ...
39 © 2018 MapR Technologies, Inc
case class Payment(_id: String, physician_id: String,
date_payment: String, payer: String...
40 © 2018 MapR Technologies, Inc
//register a user-defined function (UDF) to deserialize the message
spark.udf.register("d...
41 © 2018 MapR Technologies, Inc
Writing	to	a	Memory	Sink	
Write		results	to	Memory	
Start	running	the	query	
val	query	=	...
42 © 2018 MapR Technologies, Inc
%sql	select	*	from	payments	limit	4:		
Streaming	Applicaton	
+--------------------+------...
43 © 2018 MapR Technologies, Inc
%sql	select	*	from	payments	limit	3:		
Streaming	Applicaton
44 © 2018 MapR Technologies, Inc
%sql	select	physician_specialty,	count(*)	as	cnt	from	payment2		
									group	by	physic...
Spark		&	MapR-DB
46 © 2018 MapR Technologies, Inc
MapR-DB Connector for Apache Spark 	
Spark	Streaming	writing	to	MapR-DB	JSON
47 © 2018 MapR Technologies, Inc
Spark	MapR-DB	Connector
48 © 2018 MapR Technologies, Inc
Designed	for	Partitioning	and	Scaling
49 © 2018 MapR Technologies, Inc
MapR-DB	JSON	Document	Store	
Data is automatically partitioned
and sorted by _id row key!...
50 © 2018 MapR Technologies, Inc
MapR	Database	shell	
maprdb mapr:> find /user/mapr/paytable --limit 2
{"_id":"AK_01/03/20...
51 © 2018 MapR Technologies, Inc
Writing	to	a	MapR-DB	Sink	
Write		Streaming	DataFrame	
Query	Results	to	MapR-DB	
	
Start	...
52 © 2018 MapR Technologies, Inc
Streaming	Dataset	Read	from	Topic	partitions,	Transformed,	
Written	to	MapR	Database	part...
53 © 2018 MapR Technologies, Inc
Store	in	MapR	Database
Explore	the	Data	With	Spark	SQL
55 © 2018 MapR Technologies, Inc
•  Spark SQL queries and updates to MapR-DB
•  With projection and filter pushdown, custo...
56 © 2018 MapR Technologies, Inc
val pdf: Dataset[Payment] = spark
.loadFromMapRDB[Payment](tableName, schema)
.as[Payment...
57 © 2018 MapR Technologies, Inc
Data
Frame
Load data
pdf.createOrReplaceTempView(”payment")	
pdf.select("_id",	"payer",	"...
58 © 2018 MapR Technologies, Inc
•  Which	physician	specialties,	or	physician	states,	are	getting	the	most	payments	?		
• ...
59 © 2018 MapR Technologies, Inc
Language	Integrated	Queries	
L	I	Query	 Description	
agg(expr, exprs) Aggregates	on	entir...
60 © 2018 MapR Technologies, Inc
MapR	Data	Platform
61 © 2018 MapR Technologies, Inc
Top	5	Nature	of	Payment	by	count	
val res = pdf.groupBy(”Nature_of_Payment")
.count()
.or...
62 © 2018 MapR Technologies, Inc
Top		5	Nature	of	Payment	by	amount	of	payment	
%sql select Nature_of_payment,
sum(amount)...
63 © 2018 MapR Technologies, Inc
Top	5	Physician	Specialties	by	total	Amount	
%sql select physician_specialty, sum(amount)...
64 © 2018 MapR Technologies, Inc
Average	Payment	by	Specialty
65 © 2018 MapR Technologies, Inc
What	are	the	Nature	of	Payments	with	payments	>	$1000	?	
pdf.filter($"amount" > 1000)
.gr...
66 © 2018 MapR Technologies, Inc
Top	Payers	by	Total	Amount	with	count	
%sql select payer, payer_state, count(*) as cnt,
s...
67 © 2018 MapR Technologies, Inc
Scenario:	InPatient	Payment	Data	
Provider	ID Provider	
State	
DRG Total
Discharges
Avera...
68 © 2018 MapR Technologies, Inc
MapR	Database	shell	
maprdb mapr:> find /user/mapr/ippstable --limit 2
{"_id":"001_2014_0...
69 © 2018 MapR Technologies, Inc
•  What	are	the	top	charges,	and		payments	by	Hospital?		
•  	 	by	Hospital	State?		
•  W...
70 © 2018 MapR Technologies, Inc
InPatient	Prospective	Payment	System	payments
71 © 2018 MapR Technologies, Inc
To	Download	the	code	:	Streaming	data	pipeline	blogs	
•  https://mapr.com/blog/etl-pipeli...
72 © 2018 MapR Technologies, Inc
http://mapr.com/ebooks/	
New	Spark	Ebook
73 © 2018 MapR Technologies, Inc
Read	about	Streaming	Architectures	
mapr.com/ebooks
74 © 2018 MapR Technologies, Inc
Read	about	Machine	Learning	Logistics	
mapr.com/ebooks
75 © 2018 MapR Technologies, Inc
MapR	Free	ODT			http://learn.mapr.com/	
To	Learn	More:	New	Spark	2.0	training
76 © 2018 MapR Technologies, Inc
https://mapr.com/blog/	
MapR	Blog
77 © 2018 MapR Technologies, Inc
To	Learn	More:	
•  https://mapr.com/blog/ml-iot-connected-medical-devices/
78 © 2018 MapR Technologies, Inc
To	Learn	More:	
•  https://mapr.com/blog/how-stream-first-architecture-patterns-are-revol...
79 © 2018 MapR Technologies, Inc
Applying	Machine	Learning	to	Live	Patient	Data	
•  https://www.slideshare.net/caroljmcdon...
Upcoming SlideShare
Loading in …5
×
Upcoming SlideShare
What to Upload to SlideShare
Next
Download to read offline and view in fullscreen.

2

Share

Download to read offline

Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with MapR Database

Download to read offline

Streaming Data Pipeline to Transform, Store and Explore Healthcare Dataset With Apache Kafka API, Apache Spark, Apache Drill, JSON and MapR-DB

Related Books

Free with a 30 day trial from Scribd

See all

Related Audiobooks

Free with a 30 day trial from Scribd

See all

Streaming healthcare Data pipeline using Apache APIs: Kafka and Spark with MapR Database

  1. 1. Data Pipeline using: Spark, MapR Event Store for Kafka, and MapR Database
  2. 2. 2 © 2018 MapR Technologies, Inc •  Overview of Kafka API •  Use Spark Structured Streaming to continously: •  Read from Kafka topic •  Transfom •  Write to MapR document database •  Analyze using Spark SQL Agenda 2
  3. 3. 3 © 2018 MapR Technologies, Inc Use Case: Open Payment Dataset •  Payments Drug and Device companies make to •  Physicians and Teaching Hospitals for •  Travel, Research, Gifts, Speaking fees, and Meals •  https://www.cms.gov/openpayments/
  4. 4. 4 © 2018 MapR Technologies, Inc •  Medicare payment data for most common inpatient diagnoses •  Data shows dramatically different charges and payments •  https://www.cms.gov/research-statistics-data-and-systems/statistics-trends-and- reports/medicare-provider-charge-data/inpatient.html InPatient Payment Dataset (IPPS) Provider ID Provider State DRG Total Discharges Average Charges Average Total payments Avg Medicare Payments 1261770 CA Heart Transplant 13 1,172,866 251,876 244,457
  5. 5. What is Streaming Data?
  6. 6. 6 © 2018 MapR Technologies, Inc What is a Stream ? •  A stream is a continuous sequence of events •  Events are key-value pairs
  7. 7. 7 © 2018 MapR Technologies, Inc What is Streaming Data? Fraud detection Smart Machinery Smart Meters Home Automation Networks Manufacturing Security Systems Patient Monitoring
  8. 8. 8 © 2018 MapR Technologies, Inc Monitoring devices combined with ML can provide alerts for Sepsis, which is one of the leading causes for death in hospitals – http://www.computerweekly.com/news/450422258/Putting-sepsis- algorithms-into-electronic-patient-records Examples of Streaming Data
  9. 9. 9 © 2018 MapR Technologies, Inc A Stanford team has shown that a machine-learning model can identify arrhythmias from an EKG better than an expert •  https://www.technologyreview.com/s/608234/the-machines-are-getting-ready- to-play-doctor/ Example of Streaming Data combined with Machine Learning
  10. 10. 10 © 2018 MapR Technologies, Inc the ECG app on the Apple Watch can check heart rhythms and send a notification if an irregular heart rhythm is identified. •  https://www.apple.com/newsroom/2018/12/ecg-app-and-irregular-heart- rhythm-notification-available-today-on-apple-watch/ Example of Streaming Data combined with Machine Learning
  11. 11. 11 © 2018 MapR Technologies, Inc https://mapr.com/blog/ml-iot-connected-medical-devices/ Applying Machine Learning to Live Patient Data
  12. 12. Kafka API and Streaming Data
  13. 13. 13 © 2018 MapR Technologies, Inc Intro to the Kafka API
  14. 14. 14 © 2018 MapR Technologies, Inc Topics: Logical collection of events Organize Events into Categories Organize Data into Topics with the MapR Event Store for Kafka
  15. 15. 15 © 2018 MapR Technologies, Inc Topics: Logical collection of events Organize Events into Categories Organize Data into Topics with the MapR Event Store for Kafka Consumers MapR Cluster Topic: Pressure Topic: Temperature Topic: Warnings Consumers Consumers Kafka API Kafka API
  16. 16. 16 © 2018 MapR Technologies, Inc Topics are partitioned for throughput and scalability Scalable Messaging with MapR Event Streams Server 1 Partition1: Topic - Pressure Partition1: Topic - Temperature Partition1: Topic - Warning Server 2 Partition2: Topic - Pressure Partition2: Topic - Temperature Partition2: Topic - Warning Server 3 Partition3: Topic - Pressure Partition3: Topic - Temperature Partition3: Topic - Warning
  17. 17. 17 © 2018 MapR Technologies, Inc New Messages are Added to the end Partition is like an Event Log New Message 6 5 4 3 2 1 Old Message
  18. 18. 18 © 2018 MapR Technologies, Inc Messages are delivered in the order they are received Partition is like a Queue
  19. 19. 19 © 2018 MapR Technologies, Inc Events remain on the partition, available to other consumers Unlike a queue, Events are not deleted when Read
  20. 20. 20 © 2018 MapR Technologies, Inc Messages can be persisted forever Or Older messages can be deleted automatically based on time to live Not deleting messages, minimizes disk read/writes When Are Messages Deleted? MapR Cluster 6 5 4 3 2 1Partition 1 Older message
  21. 21. 21 © 2018 MapR Technologies, Inc High Performance at Scale: •  Partitioning for Parallel operations •  Not deleting messages, minimizes disk read/writes
  22. 22. 22 © 2018 MapR Technologies, Inc Processing Same Message for Different Purposes
  23. 23. 23 © 2018 MapR Technologies, Inc Georgia Health Connect ALLOY Health: Exchange State HIE Clinical Data Viewer Reporting and Analytics Clinical Data Financial Data Provider Organizations What are the outcomes in the entire state on diabetes? Are there doctors that are doing this better than others? Georgia Health Connect
  24. 24. 24 © 2018 MapR Technologies, Inc Use Case: Streaming System of Record for Healthcare
  25. 25. Spark Structured Streaming
  26. 26. 26 © 2018 MapR Technologies, Inc Spark Distributed Datasets partitioned •  Distributed collection of objects Dataset[T] •  Partitioned across a cluster •  Operated on in parallel •  in memory can be Cached
  27. 27. 27 © 2018 MapR Technologies, Inc DataFrame = Dataset[Row] can use Spark SQL DataFrame is like a table Dataset[Row] row columns
  28. 28. 28 © 2018 MapR Technologies, Inc Dataset[Typed Object] can use Spark SQL and Functions A Dataset is a distributed collection of objects Dataset[objects] object columns partitioned
  29. 29. 29 © 2018 MapR Technologies, Inc Process the Data with Spark Structured Streaming
  30. 30. 30 © 2018 MapR Technologies, Inc Datasets Read from Stream Task Cache Process & Cache Data offsets Stream partition Task Cache Process & Cache Data Task Cache Process & Cache Data Driver Stream partition Stream partition Data is cached for aggregations And windowed functions
  31. 31. 31 © 2018 MapR Technologies, Inc new data in the data stream = new rows appended to an unbounded table Data stream as an unbounded table Treat Stream as Unbounded Tables
  32. 32. 32 © 2018 MapR Technologies, Inc The Stream is continuously processed
  33. 33. 33 © 2018 MapR Technologies, Inc Spark automatically streamifies SQL plans Image reference Databricks
  34. 34. 34 © 2018 MapR Technologies, Inc Stream Processing
  35. 35. 35 © 2018 MapR Technologies, Inc Use Case: Payment Data "NEW","Covered Recipient Physician",,,,"132655","GREGG","D","ALZATE",,"8745 AERO DRIVE","STE 200","SAN DIEGO","CA","92123","United States",,,"Medical Doctor","Allopathic & Osteopathic Physicians|Radiology|Diagnostic Radiology","CA",,,,,"DFINE, Inc","100000000326","DFINE, Inc","CA","United States", 90.87,"02/12/2016","1","In-kind items and services","Food and Beverage",,,,"No","No Third Party Payment",,,,,"No","346039438","No","Yes","Covered","Device","Radiology","StabiliT", ,"Covered","Device","Radiology","STAR Tumor Ablation System",,,,,,,,,,,,,,,,,"2016","06/30/2017" { "_id":"317150_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "record_id":"346122858", "payer":"Mission Pharmacal Company", "amount":9.23, "Physician_Specialty":"Obstetrics & Gynecology", "Nature_of_payment":"Food and Beverage" }
  36. 36. 36 © 2018 MapR Technologies, Inc Scenario: Payment Data Provider ID Date Payer Payer State Provider Specialty Provider State Amount Payment Nature 1261770 01/11/2016 Southern Anesthesia & Surgical, Inc CO Oral and Maxillofacial Surgery CA 117.5 Food and Beverage
  37. 37. 37 © 2018 MapR Technologies, Inc val df1 = spark.readStream.format("kafka") .option("kafka.bootstrap.servers", "maprdemo:9092") .option("subscribe", "/apps/stream:payments”) .option("startingOffsets", "earliest") .option("failOnDataLoss", false) .option("maxOffsetsPerTrigger", 1000) .load() Streaming pipeline Kafka Data source
  38. 38. 38 © 2018 MapR Technologies, Inc df1.printSchema() root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true) Kafka DataFrame schema
  39. 39. 39 © 2018 MapR Technologies, Inc case class Payment(_id: String, physician_id: String, date_payment: String, payer: String, payer_state: String, amount: Double, physician_type: String,physician_specialty: String, physician_state: String, nature_of_payment: String) def parsePayment(str: String): Payment = { val td = str.split(",(?=([^"]*"[^"]*")*[^"]*$)") val physician_id =td(5) val amount = td(30).toDouble . . . Payment(id, physician_id, date_payment, payer, payer_state, amount, physician_type, focus, physician_state, nature_of_payment) } Function to Parse CSV data to Payment Object
  40. 40. 40 © 2018 MapR Technologies, Inc //register a user-defined function (UDF) to deserialize the message spark.udf.register("deserialize", (message: String) => parsePayment(message)) //use the UDF in a select expression val df2 = df1.selectExpr("""deserialize(CAST(value as STRING)) AS message""").select($"message".as[Payment]) Parse message txt to Payment Object
  41. 41. 41 © 2018 MapR Technologies, Inc Writing to a Memory Sink Write results to Memory Start running the query val query = cdf .writeStream .queryName("uber") .format("memory") .outputMode("append”) query.start().awaitTermination()
  42. 42. 42 © 2018 MapR Technologies, Inc %sql select * from payments limit 4: Streaming Applicaton +--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+ | _id|physician_id|date_payment| payer|payer_state| amount| physician_type| physician_specialty|physician_state| nature_of_payment| +--------------------+------------+------------+--------------------+-----------+--------+-------------------+--------------------+---------------+------------------+ |CA_Diagnostic Rad...| 132655| 02/12/2016| DFINE, Inc| CA| 90.87| Medical Doctor|Diagnostic Radiology| CA| Food and Beverage| |AZ_Vascular & Int...| 1006832| 01/26/2016| DFINE, Inc| CA| 25.0| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging| |AZ_Vascular & Int...| 1006832| 02/13/2016| DFINE, Inc| CA| 32.0| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging| |AZ_Vascular & Int...| 1006832| 02/19/2016| DFINE, Inc| CA| 27.27| Medical Doctor|Vascular & Interv...| AZ|Travel and Lodging|
  43. 43. 43 © 2018 MapR Technologies, Inc %sql select * from payments limit 3: Streaming Applicaton
  44. 44. 44 © 2018 MapR Technologies, Inc %sql select physician_specialty, count(*) as cnt from payment2 group by physician_specialty order by cnt desc Streaming Applicaton
  45. 45. Spark & MapR-DB
  46. 46. 46 © 2018 MapR Technologies, Inc MapR-DB Connector for Apache Spark Spark Streaming writing to MapR-DB JSON
  47. 47. 47 © 2018 MapR Technologies, Inc Spark MapR-DB Connector
  48. 48. 48 © 2018 MapR Technologies, Inc Designed for Partitioning and Scaling
  49. 49. 49 © 2018 MapR Technologies, Inc MapR-DB JSON Document Store Data is automatically partitioned and sorted by _id row key! { "_id":”AK_08/26/2016_346122858", "physician_id":"317150", "date_payment":"08/26/2016", "payer":"Mission Pharmacal Company", "payer_state":”CO", "amount":9.23, ”physician_specialty":”Gynecology", “physician_state":”AK" ”nature_of_payment":"Food and Beverage" }
  50. 50. 50 © 2018 MapR Technologies, Inc MapR Database shell maprdb mapr:> find /user/mapr/paytable --limit 2 {"_id":"AK_01/03/2016_1309545_391702408", "amount":94.21, "date_payment":"01/03/2016", "nature_of_payment":"Food and Beverage", "payer":"Stryker Corporation", "payer_state":"MI","physician_id":"1309545", "physician_specialty":"Specialist", "physician_state":"AK", "physician_type":"Medical Doctor"} {"_id":"AK_01/04/2016_100801_396460394","amount":11.59, "date_payment":"01/04/2016", "nature_of_payment":"Food and Beverage", "payer":"AbbVie, Inc.", "payer_state":"IL", "physician_id":"100801", "physician_specialty":"Family Medicine", "physician_state":"AK", "physician_type":"Medical Doctor"} 2 document(s) found. Data is automatically partitioned and sorted by _id row key!
  51. 51. 51 © 2018 MapR Technologies, Inc Writing to a MapR-DB Sink Write Streaming DataFrame Query Results to MapR-DB Start running the query val query = cdf.writeStream .format(MapRDBSourceConfig.Format) .option(MapRDBSourceConfig.TablePathOption, tableName) .option(MapRDBSourceConfig.IdFieldPathOption, "_id") .option(MapRDBSourceConfig.CreateTableOption, false) .option("checkpointLocation", "/user/mapr/check") .option(MapRDBSourceConfig.BulkModeOption, true) .option(MapRDBSourceConfig.SampleSizeOption, 1000) query.start().awaitTermination()
  52. 52. 52 © 2018 MapR Technologies, Inc Streaming Dataset Read from Topic partitions, Transformed, Written to MapR Database partitions
  53. 53. 53 © 2018 MapR Technologies, Inc Store in MapR Database
  54. 54. Explore the Data With Spark SQL
  55. 55. 55 © 2018 MapR Technologies, Inc •  Spark SQL queries and updates to MapR-DB •  With projection and filter pushdown, custom partitioning, and data locality Spark SQL Querying MapR-DB JSON
  56. 56. 56 © 2018 MapR Technologies, Inc val pdf: Dataset[Payment] = spark .loadFromMapRDB[Payment](tableName, schema) .as[Payment] Spark Distributed Datasets read from MapR-DB Partitions Worker Task Worker Driver Cache 1 Cache 2 Cache 3 Process & Cache Data Process & Cache Data Process & Cache Data Task Task Driver tasks tasks tasks
  57. 57. 57 © 2018 MapR Technologies, Inc Data Frame Load data pdf.createOrReplaceTempView(”payment") pdf.select("_id", "payer", "amount”).show Load the data into a Dataframe Data is automatically partitioned and sorted by _id row key!
  58. 58. 58 © 2018 MapR Technologies, Inc •  Which physician specialties, or physician states, are getting the most payments ? •  Count , total amount $, average amount $, standard deviation •  Which companies, or company states, are making the most payments ? •  Count , total amount $, average amount $, standard deviation •  What are the Top Nature of Payments by count, amount ? •  What are the payments over $2,000,000 ? Now We Can Ask Questions Like: Physician ID Date Payer Payer State Physician Specialty Physician State Amount Payment Nature 1261770 01/11/2016 Southern Anesthesia & Surgical, Inc CO Oral and Maxillofacial Surgery CA 117.5 Food and Beverage
  59. 59. 59 © 2018 MapR Technologies, Inc Language Integrated Queries L I Query Description agg(expr, exprs) Aggregates on entire DataFrame distinct Returns new DataFrame with unique rows except(other) Returns new DataFrame with rows from this DataFrame not in other DataFrame filter(expr); where(condition) Filter based on the SQL expression or condition groupBy(cols: Columns) Groups DataFrame using specified columns join (DataFrame, joinExpr) Joins with another DataFrame using given join expression sort(sortcol) Returns new DataFrame sorted by specified column select(col) Selects set of columns
  60. 60. 60 © 2018 MapR Technologies, Inc MapR Data Platform
  61. 61. 61 © 2018 MapR Technologies, Inc Top 5 Nature of Payment by count val res = pdf.groupBy(”Nature_of_Payment") .count() .orderBy(desc(count)) .show(5)
  62. 62. 62 © 2018 MapR Technologies, Inc Top 5 Nature of Payment by amount of payment %sql select Nature_of_payment, sum(amount) as total from payments group by Nature_of_payment order by total desc limit 5
  63. 63. 63 © 2018 MapR Technologies, Inc Top 5 Physician Specialties by total Amount %sql select physician_specialty, sum(amount) as total from payments where physician_specialty IS NOT NULL group by physician_specialty order by total desc limit 5
  64. 64. 64 © 2018 MapR Technologies, Inc Average Payment by Specialty
  65. 65. 65 © 2018 MapR Technologies, Inc What are the Nature of Payments with payments > $1000 ? pdf.filter($"amount" > 1000) .groupBy("Nature_of_payment") .count().orderBy(desc("count")).show()
  66. 66. 66 © 2018 MapR Technologies, Inc Top Payers by Total Amount with count %sql select payer, payer_state, count(*) as cnt, sum(amount) as total from payments group by payer, payer_state order by total desc limit 10
  67. 67. 67 © 2018 MapR Technologies, Inc Scenario: InPatient Payment Data Provider ID Provider State DRG Total Discharges Average Charges Average Total payments Avg Medicare Payments 1261770 CA Heart Transplant 13 1,172,866 251,876 244,457
  68. 68. 68 © 2018 MapR Technologies, Inc MapR Database shell maprdb mapr:> find /user/mapr/ippstable --limit 2 {"_id":"001_2014_010033", "avg_covered_charges":1172866.385, "avg_medicare_payments": 244457.9231, "avg_total_payments":251876.3077, "drg_code":"001", "drg_definition":"HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC", "provider_address":"619 SOUTH 19TH STREET", "provider_city":"BIRMINGHAM", "provider_id":"010033", "provider_name":"UNIVERSITY OF ALABAMA HOSPITAL", "provider_region":"AL - Birmingham", "provider_state":"AL", "provider_zip":"35233", "total_discharges":13, "year":"2014"} {"_id":"001_2014_030103", "avg_covered_charges":437531.3, "avg_medicare_payments": 133509.55, "avg_total_payments":240422.8, "drg_code":"001", "drg_definition":"HEART TRANSPLANT OR IMPLANT OF HEART ASSIST SYSTEM W MCC", "provider_address":"5777 EAST MAYO BOULEVARD", "provider_city":"PHOENIX", "provider_id":"030103", "provider_name":"MAYO CLINIC HOSPITAL", "provider_region":"AZ - Phoenix", "provider_state":"AZ", "provider_zip":"85054", "total_discharges":20, "year":"2014”}
  69. 69. 69 © 2018 MapR Technologies, Inc •  What are the top charges, and payments by Hospital? •  by Hospital State? •  What are the top charges, and payments by diagnosis? •  Most common diagnosis ? •  What are the statistics for payments for most common diagnosis (Sepsis ) •  What are the most expensive states for most common diagnosis sepsis Now We Can Ask Questions Like: Provider ID Provider State DRG Total Discharges Average Charges Average Total payments Avg Medicare Payments 1261770 CA Heart Transplant 13 1,172,866 251,876 244,457
  70. 70. 70 © 2018 MapR Technologies, Inc InPatient Prospective Payment System payments
  71. 71. 71 © 2018 MapR Technologies, Inc To Download the code : Streaming data pipeline blogs •  https://mapr.com/blog/etl-pipeline-healthcare-dataset-with-spark-json-mapr-db/ •  https://mapr.com/blog/streaming-data-pipeline-transform-store-explore-healthcare- dataset-mapr-db/
  72. 72. 72 © 2018 MapR Technologies, Inc http://mapr.com/ebooks/ New Spark Ebook
  73. 73. 73 © 2018 MapR Technologies, Inc Read about Streaming Architectures mapr.com/ebooks
  74. 74. 74 © 2018 MapR Technologies, Inc Read about Machine Learning Logistics mapr.com/ebooks
  75. 75. 75 © 2018 MapR Technologies, Inc MapR Free ODT http://learn.mapr.com/ To Learn More: New Spark 2.0 training
  76. 76. 76 © 2018 MapR Technologies, Inc https://mapr.com/blog/ MapR Blog
  77. 77. 77 © 2018 MapR Technologies, Inc To Learn More: •  https://mapr.com/blog/ml-iot-connected-medical-devices/
  78. 78. 78 © 2018 MapR Technologies, Inc To Learn More: •  https://mapr.com/blog/how-stream-first-architecture-patterns-are-revolutionizing- healthcare-platforms/
  79. 79. 79 © 2018 MapR Technologies, Inc Applying Machine Learning to Live Patient Data •  https://www.slideshare.net/caroljmcdonald/applying-machine-learning-to-live- patient-data
  • SyedAbdullah104

    Feb. 8, 2020
  • ssuserce170b

    Jun. 15, 2019

Streaming Data Pipeline to Transform, Store and Explore Healthcare Dataset With Apache Kafka API, Apache Spark, Apache Drill, JSON and MapR-DB

Views

Total views

1,346

On Slideshare

0

From embeds

0

Number of embeds

40

Actions

Downloads

54

Shares

0

Comments

0

Likes

2

×