SlideShare a Scribd company logo
1 of 28
USING MULTIPLE PERSISTENCE LAYERS IN SPARK
TO BUILD A SCALABLE PREDICTION ENGINE
Richard Williamson | @SuperHadooper
22 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
• Some Background
• Model Building/Scoring Use Case
• Spark Persistence to Parquet via
Kafka
• Exploratory Analysis with Impala
• Spark Persistence to Cassandra
• Operational queries via CQL
• Impala Model Building/Scoring
• Spark Model Building/Scoring
• Questions
AGENDA
3 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
4 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
THE EXPERIMENTAL ENTERPRISE
We need to both support investigative
work and build a solid layer for
production.
Data science allows us to observe our
experiments and respond to the
changing environment.
The foundation of the experimental
enterprise focuses on making
infrastructure readily accessible.
5 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
6 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
• VOLUME: Storing longer
periods of history
• VELOCITY: Stream
processing allows for on
the fly forecast
adjustments
• VARIETY: Data Scientists
can use additional
features to improve
model accuracy
HOW HAS
“BIG DATA”
CHANGED OUR
ABILITY TO BETTER
PREDICT THE
FUTURE?
7 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
• Lowered Cost of Storing
and Processing Data
• Improved data scientist
efficiency by having
common repository
• Improved modeling
performance by bringing
models to the data versus
bringing data to the
models
HOW HAS
“BIG DATA
TECHNOLOGY”
CHANGED OUR
ABILITY TO BETTER
PREDICT THE
FUTURE?
88 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
Can we predict future RSVPs
by consuming Meetup.com
streaming API data?
http://stream.meetup.com/2/rs
vps
https://github.com/silicon-
valley-data-science/strataca-
2015/tree/master/Spark-to-
Build-a-Scalable-Prediction-
Engine
EXAMPLE USE
CASE
‹#›
WHERE DO WE
START?
10 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
BY TAKING A LOOK AT THE DATA…
{"response":"yes","member":{"member_name":"Richard
Williamson","photo":"http://photos3.meetupstatic.com/photos/member/d/a/4/0/thumb_23159587
2.jpeg","member_id":29193652},"visibility":"public","event":{"time":1424223000000,"event_url":"http://www.me
etup.com/Big-Data-Science/events/217322312/","event_id":"fbtgdlytdbbc","event_name":"Big Data
Science @Strata Conference,
2015"},"guests":0,"mtime":1424020205391,"rsvp_id":1536654666,"group":{"group_name":"Big Data
Science","group_state":"CA","group_city":"Fremont","group_lat":37.52,"group_urlname":"Big-Data-
Science","group_id":3168962,"group_country":"us","group_topics":[{"urlkey":"data-
visualization","topic_name":"Data Visualization"},{"urlkey":"data-mining","topic_name":"Data
Mining"},{"urlkey":"businessintell","topic_name":"Business
Intelligence"},{"urlkey":"mapreduce","topic_name":"MapReduce"},{"urlkey":"hadoop","topic_name":"Hadoop"},
{"urlkey":"opensource","topic_name":"Open Source"},{"urlkey":"r-project-for-statistical-
computing","topic_name":"R Project for Statistical Computing"},{"urlkey":"predictive-
analytics","topic_name":"Predictive Analytics"},{"urlkey":"cloud-computing","topic_name":"Cloud
Computing"},{"urlkey":"big-data","topic_name":"Big Data"},{"urlkey":"data-science","topic_name":"Data
Science"},{"urlkey":"data-analytics","topic_name":"Data
Analytics"},{"urlkey":"hbase","topic_name":"HBase"},{"urlkey":"hive","topic_name":"Hive"}],"group_lon":-
121.93},"venue":{"lon":-121.889122,"venue_name":"San Jose Convention Center, Room
210AE","venue_id":21805972,"lat":37.330341}}
11 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
FROM ONE TO MANY
Now that we know what one record looks like, let’s
capture a large dataset.
• First we capture stream to Kafka by curling to file
and tailing file to Kafka consumer — this will build
up some history for us to look at.
12 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
FROM ONE TO MANY
• Then use Spark to load to Parquet for interactive
exploratory analysis in Impala:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val meetup = sqlContext.jsonFile("/home/odroid/meetupstream”)
meetup.registerTempTable("meetup")
val meetup2 = sqlContext.sql("select event.event_id,event.event_name,event.event_url,event.time,
guests,member.member_id,member.member_name,member.other_services.facebook.identifier as
facebook_identifier, member.other_services.linkedin.identifier as linkedin_identifier,
member.other_services.twitter.identifier as twitter_identifier, member.photo,mtime,response,
rsvp_id,venue.lat,venue.lon,venue.venue_id,venue.venue_name,visibility from meetup”)
meetup2.saveAsParquetFile("/home/odroid/meetup.parquet")
13 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
INTERACTIVE EXPLORATORY
ANALYSIS IN IMPALA
• We could have done basic exploratory analysis in
Spark SQL, but date logic is not fully supported.
• It’s very easy to point Impala to the data and do it
there.
14 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
INTERACTIVE EXPLORATORY
ANALYSIS IN IMPALA
• Create external Impala table pointing to the data
we just created in Spark:
CREATE EXTERNAL TABLE meetup
LIKE PARQUET '/user/odroid/meetup/_metadata'
STORED AS PARQUET
LOCATION '/user/odroid/meetup’;
15 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
INTERACTIVE EXPLORATORY
ANALYSIS IN IMPALA
• Now we can query:
select from_unixtime(cast(mtime/1000 as bigint), "yyyy-MM-dd HH”) as data_hr
,count(*) as rsvp_cnt
from meetup
group by 1
order by 1;
16 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
val ssc = new StreamingContext(sc, Seconds(10))
val lines = KafkaUtils.createStream(ssc, ”localhost:2181", "meetupstream", Map("meetupstream" -> 10)).map(_._2)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
lines.foreachRDD(rdd => {
val lines2 = sqlContext.jsonRDD(rdd)
lines2.registerTempTable("lines2")
val lines3 = sqlContext.sql("select event.event_id,event.event_name,event.event_url,event.time,guests,
member.member_id, member.member_name,member.other_services.facebook.identifier as facebook_identifier,
member.other_services.linkedin.identifier as linkedin_identifier,member.other_services.twitter.identifier as
twitter_identifier,member.photo, mtime,response,
rsvp_id,venue.lat,venue.lon,venue.venue_id,venue.venue_name,visibility from lines2")
val lines4 = lines3.map(line => (line(0), line(5), line(13), line(1), line(2), line(3), line(4), line(6), line(7), line(8), line(9),
line(10), line(11), line(12), line(14), line(15), line(16), line(17), line(18) ))
lines4.saveToCassandra("meetup","rsvps", SomeColumns("event_id", "member_id","rsvp_id","event_name", "event_url",
"time","guests","member_name","facebook_identifier","linkedin_identifier","twitter_identifier","photo","mtime","response","l
at","lon","venue_id","venue_name","visibility"))})
ssc.start()
SPARK STREAMING:
Load Cassandra from Kafka
17 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
OPERATIONAL QUERIES IN
CASSANDRA
• When low millisecond queries are needed,
Cassandra provides favorable storage.
• Now we can perform queries like this one:
select * from meetup.rsvps where event_id=“fbtgdlytdbbc” and
member_id=29193652 and rsvp_id=1536654666;
18 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
Why do we want to store the
same data in two places?
• Different access patterns prefer different physical
storage.
• Large table scans needed for analytical queries
require fast sequential access to data, making
Parquet format in HDFS a perfect fit. It can then be
queried by Impala, Spark, or even Hive.
• Fast record lookups require a carefully designed key
to allow for direct access to the specific row of
data, making Cassandra (or HBase) a very good fit.
19 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
START BUILDING MODELS
• A quick way to do some exploratory predictions is
with Impala MADlib or Spark MLlib.
• First we need to build some features from the table
we created in Impala.
20 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
START BUILDING MODELS
• A simple way to do this for hourly sales is to create a
boolean indicator variable for each hour, and set it to
true if the given observation was for the given hour.
• We can do the same to indicate whether the
observation was taken on a weekend or weekday.
21 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
START BUILDING MODELS
• The SQL to do this look something like this:
create table rsvps_by_hr_training as (
select
case when mhour=0 then 1 else 0 end as hr0
,case when mhour=1 then 1 else 0 end as hr1
…
,case when mhour=22 then 1 else 0 end as hr22
,case when mhour=23 then 1 else 0 end as hr23
,case when mdate in ("2015-02-14","2015-02-15") then 1 else 0 end as
weekend_day
,mdate,mhour,rsvp_cnt from rsvps_by_hour);
22 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
• Install MADlib for Impala: http://blog.cloudera.com/blog/2013/10/how-to-use-madlib-pre-built-
analytic-functions-with-impala/
• Train regression model in SQL:
SELECT printarray(linr(toarray(hr0,hr1,hr2,hr3,hr4,hr5,hr6,hr7,hr8,hr9,hr10,hr11,hr12,hr13,hr14,
hr15,hr16,hr17,hr18,hr19,hr20,hr21,hr22,hr23,weekend_day), rsvp_cnt))
from meetup.rsvps_by_hr_training;
• Which gives us the regression coefficients:
<8037.43, 7883.93, 7007.68, 6851.91, 6307.91, 5468.24, 4792.58, 4336.91, 4330.24, 4360.91, 4373.24,
4711.58, 5649.91, 6752.24, 8056.24, 9042.58, 9761.37, 10205.9, 10365.6, 10048.6, 9946.12, 9538.87,
9984.37, 9115.12, -2323.73>
• Which we can then apply to new data:
select mdate,mhour,cast(linrpredict(toarray(8037.43, 7883.93, 7007.68, 6851.91, 6307.91,
5468.24, 4792.58, 4336.91, 4330.24, 4360.91, 4373.24, 4711.58, 5649.91, 6752.24, 8056.24, 9042.58,
9761.37, 10205.9, 10365.6, 10048.6, 9946.12, 9538.87, 9984.37, 9115.12, -2323.73), toarray(hr0, hr1,
hr2, hr3, hr4, hr5, hr6, hr7, hr8, hr9, hr10, hr11, hr12, hr13, hr14, hr15, hr16, hr17, hr18, hr19, hr20,
hr21, hr22, hr23, weekend_day)) as int) as rsvp_cnt_pred, rsvp_cnt from
meetup.rsvps_by_hr_testing
Train/score regression model in Impala
23 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
PREDICTED HOURLY RSVPS FROM IMPALA
0
2000
4000
6000
8000
10000
12000
1 13 25 37 49 61 73 85 97
hour_mod
rsvp_cnt_prediction
rsvp_cnt
24 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
Prepare data for regression in Spark
val meetup3 = sqlContext.sql("select mtime from meetup2")
val meetup4 = meetup3.map(x => new DateTime(x(0).toString().toLong)).map(x => (x.toString("yyyy-
MM-dd HH"),x.toString("yyyy-MM-dd"),x.toString("HH")) )
case class Meetup(mdatehr: String, mdate: String, mhour: String)
val meetup5 = meetup4.map(p => Meetup(p._1, p._2, p._3))
meetup5.registerTempTable("meetup5")
val meetup6 = sqlContext.sql("select mdate,mhour,count(*) as rsvp_cnt from meetup5 where
mdatehr >= '2015-02-15 02' group by mdatehr,mdate,mhour")
meetup6.registerTempTable("meetup6")
sqlContext.sql("cache table meetup6”)
val trainingData = meetup7.map { row =>
val features = Array[Double](row(24).toString().toDouble,row(0).toString().toDouble, …
LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))}
25 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
Train and Score regression model in
Spark
val trainingData = meetup7.map { row =>
val features = Array[Double](1.0,row(0).toString().toDouble,row(1).toString().toDouble, …
LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))
}
val model = new RidgeRegressionWithSGD().run(trainingData)
val scores = meetup7.map { row =>
val features = Vectors.dense(Array[Double](1.0,row(0).toString().toDouble, …
row(23).toString().toDouble))
(row(25),row(26),row(27), model.predict(features))
}
scores.foreach(println)
26 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
PREDICTED HOURLY RSVPS FROM SPARK
0
2000
4000
6000
8000
10000
12000
1 13 25 37 49 61 73 85 97
hour_mod
rsvp_cnt_prediction
rsvp_cnt
27 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
NEXT STEPS
• Include additional features (Holidays, Growth
Trend, etc.)
• Predict at lower granularity (Event Level, Topic
Level, etc.)
• Convert Spark regression model to utilize streaming
regression
• Build ensemble models to leverage strengths of
multiple algorithms
• Move to pilot, then production
‹#›
THANK YOU
Richard Williamson
richard@svds.com
@superhadooper
Yes, we’re hiring!
info@svds.com

More Related Content

What's hot

Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...StampedeCon
 
Lambda Architectures in Practice
Lambda Architectures in PracticeLambda Architectures in Practice
Lambda Architectures in PracticeC4Media
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraPatrick McFadin
 
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetInteractive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetHortonworks
 
Querying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it too
Querying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it tooQuerying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it too
Querying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it tooAll Things Open
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and CassandraNatalino Busa
 
DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?DataStax
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data prajods
 
Analytics with Spark and Cassandra
Analytics with Spark and CassandraAnalytics with Spark and Cassandra
Analytics with Spark and CassandraDataStax Academy
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream DataDataWorks Summit
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaHelena Edelson
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesDataWorks Summit
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData StoryLynn Langit
 
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaDatabricks
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieYahoo Developer Network
 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_finalAdam Muise
 
AWS Customer Presentation - VMIX AWS Experience
AWS Customer Presentation - VMIX AWS ExperienceAWS Customer Presentation - VMIX AWS Experience
AWS Customer Presentation - VMIX AWS ExperienceAmazon Web Services
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop EcosystemSlim Bouguerra
 
Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Imply
 

What's hot (20)

Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
Analytics, Big Data and Nonvolatile Memory Architectures – Why you Should Car...
 
Lambda Architectures in Practice
Lambda Architectures in PracticeLambda Architectures in Practice
Lambda Architectures in Practice
 
Analyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and CassandraAnalyzing Time Series Data with Apache Spark and Cassandra
Analyzing Time Series Data with Apache Spark and Cassandra
 
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and SupersetInteractive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
Interactive Realtime Dashboards on Data Streams using Kafka, Druid and Superset
 
Querying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it too
Querying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it tooQuerying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it too
Querying NoSQL with SQL: HAVING Your JSON Cake and SELECTing it too
 
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
Real-Time Anomaly Detection  with Spark MLlib, Akka and  CassandraReal-Time Anomaly Detection  with Spark MLlib, Akka and  Cassandra
Real-Time Anomaly Detection with Spark MLlib, Akka and Cassandra
 
DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?DataStax C*ollege Credit: What and Why NoSQL?
DataStax C*ollege Credit: What and Why NoSQL?
 
Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data Apache Cassandra and Python for Analyzing Streaming Big Data
Apache Cassandra and Python for Analyzing Streaming Big Data
 
Analytics with Spark and Cassandra
Analytics with Spark and CassandraAnalytics with Spark and Cassandra
Analytics with Spark and Cassandra
 
Visual Mapping of Clickstream Data
Visual Mapping of Clickstream DataVisual Mapping of Clickstream Data
Visual Mapping of Clickstream Data
 
Druid @ branch
Druid @ branch Druid @ branch
Druid @ branch
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and AkkaStreaming Analytics with Spark, Kafka, Cassandra and Akka
Streaming Analytics with Spark, Kafka, Cassandra and Akka
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
The Microsoft BigData Story
The Microsoft BigData StoryThe Microsoft BigData Story
The Microsoft BigData Story
 
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion StoicaSpark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
Spark Summit East 2015 Keynote -- Databricks CEO Ion Stoica
 
August 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache OozieAugust 2016 HUG: Recent development in Apache Oozie
August 2016 HUG: Recent development in Apache Oozie
 
2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final2015 nov 27_thug_paytm_rt_ingest_brief_final
2015 nov 27_thug_paytm_rt_ingest_brief_final
 
AWS Customer Presentation - VMIX AWS Experience
AWS Customer Presentation - VMIX AWS ExperienceAWS Customer Presentation - VMIX AWS Experience
AWS Customer Presentation - VMIX AWS Experience
 
Druid at Hadoop Ecosystem
Druid at Hadoop EcosystemDruid at Hadoop Ecosystem
Druid at Hadoop Ecosystem
 
Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)Druid: Under the Covers (Virtual Meetup)
Druid: Under the Covers (Virtual Meetup)
 

Viewers also liked

Intelligence Types
Intelligence TypesIntelligence Types
Intelligence Typesishlive
 
Improving Model Predictions via Stacking and Hyper-parameters Tuning
Improving Model Predictions via Stacking and Hyper-parameters TuningImproving Model Predictions via Stacking and Hyper-parameters Tuning
Improving Model Predictions via Stacking and Hyper-parameters TuningJo-fai Chow
 
Securing the Hadoop Ecosystem
Securing the Hadoop EcosystemSecuring the Hadoop Ecosystem
Securing the Hadoop EcosystemDataWorks Summit
 
Perspectives on Ethical Big Data Governance
Perspectives on Ethical Big Data GovernancePerspectives on Ethical Big Data Governance
Perspectives on Ethical Big Data GovernanceCloudera, Inc.
 
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016StampedeCon
 
Cassandra as event sourced journal for big data analytics
Cassandra as event sourced journal for big data analyticsCassandra as event sourced journal for big data analytics
Cassandra as event sourced journal for big data analyticsAnirvan Chakraborty
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduceHadoop User Group
 
Spend Analysis: What Your Data Is Telling You and Why It’s Worth Listening
Spend Analysis: What Your Data Is Telling You and Why It’s Worth ListeningSpend Analysis: What Your Data Is Telling You and Why It’s Worth Listening
Spend Analysis: What Your Data Is Telling You and Why It’s Worth ListeningSAP Ariba
 
Impression tecnique for implant supported rehabilitation/ dental courses
Impression tecnique for implant supported rehabilitation/ dental coursesImpression tecnique for implant supported rehabilitation/ dental courses
Impression tecnique for implant supported rehabilitation/ dental coursesIndian dental academy
 
Zen and the Art of Supplier Management
Zen and the Art of Supplier ManagementZen and the Art of Supplier Management
Zen and the Art of Supplier ManagementSAP Ariba
 
Master Supply Chain Management with Cloud and Business Networks
Master Supply Chain Management with Cloud and Business NetworksMaster Supply Chain Management with Cloud and Business Networks
Master Supply Chain Management with Cloud and Business NetworksSAP Ariba
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...Yahoo Developer Network
 

Viewers also liked (16)

Intelligence Types
Intelligence TypesIntelligence Types
Intelligence Types
 
Presentacion
PresentacionPresentacion
Presentacion
 
Phrases
PhrasesPhrases
Phrases
 
Improving Model Predictions via Stacking and Hyper-parameters Tuning
Improving Model Predictions via Stacking and Hyper-parameters TuningImproving Model Predictions via Stacking and Hyper-parameters Tuning
Improving Model Predictions via Stacking and Hyper-parameters Tuning
 
Spark tuning
Spark tuningSpark tuning
Spark tuning
 
Securing the Hadoop Ecosystem
Securing the Hadoop EcosystemSecuring the Hadoop Ecosystem
Securing the Hadoop Ecosystem
 
Perspectives on Ethical Big Data Governance
Perspectives on Ethical Big Data GovernancePerspectives on Ethical Big Data Governance
Perspectives on Ethical Big Data Governance
 
Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016Hadoop Security and Compliance - StampedeCon 2016
Hadoop Security and Compliance - StampedeCon 2016
 
Cassandra as event sourced journal for big data analytics
Cassandra as event sourced journal for big data analyticsCassandra as event sourced journal for big data analytics
Cassandra as event sourced journal for big data analytics
 
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReducePublic Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
Public Terabyte Dataset Project: Web crawling with Amazon Elastic MapReduce
 
Spend Analysis: What Your Data Is Telling You and Why It’s Worth Listening
Spend Analysis: What Your Data Is Telling You and Why It’s Worth ListeningSpend Analysis: What Your Data Is Telling You and Why It’s Worth Listening
Spend Analysis: What Your Data Is Telling You and Why It’s Worth Listening
 
3.implant components and basic techniques3
3.implant components and basic techniques33.implant components and basic techniques3
3.implant components and basic techniques3
 
Impression tecnique for implant supported rehabilitation/ dental courses
Impression tecnique for implant supported rehabilitation/ dental coursesImpression tecnique for implant supported rehabilitation/ dental courses
Impression tecnique for implant supported rehabilitation/ dental courses
 
Zen and the Art of Supplier Management
Zen and the Art of Supplier ManagementZen and the Art of Supplier Management
Zen and the Art of Supplier Management
 
Master Supply Chain Management with Cloud and Business Networks
Master Supply Chain Management with Cloud and Business NetworksMaster Supply Chain Management with Cloud and Business Networks
Master Supply Chain Management with Cloud and Business Networks
 
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
August 2016 HUG: Better together: Fast Data with Apache Spark™ and Apache Ign...
 

Similar to Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Engine - StampedeCon 2015

Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy NguyenGrokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy NguyenHuy Nguyen
 
Deploy your Multi-tier Application in Cloud Foundry
Deploy your Multi-tier Application in Cloud FoundryDeploy your Multi-tier Application in Cloud Foundry
Deploy your Multi-tier Application in Cloud Foundrycornelia davis
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQLYousun Jeong
 
Intro to Big Data - Orlando Code Camp 2014
Intro to Big Data - Orlando Code Camp 2014Intro to Big Data - Orlando Code Camp 2014
Intro to Big Data - Orlando Code Camp 2014John Ternent
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayAjay Shriwastava
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerDataWorks Summit
 
10 ways to make your code rock
10 ways to make your code rock10 ways to make your code rock
10 ways to make your code rockmartincronje
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Agile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBAgile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBRussell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Why SQL? | Kenny Gorman, Cloudera
Why SQL? | Kenny Gorman, ClouderaWhy SQL? | Kenny Gorman, Cloudera
Why SQL? | Kenny Gorman, ClouderaHostedbyConfluent
 
Evolution is Continuous, and so are Big Data and Streaming Pipelines
Evolution is Continuous, and so are Big Data and Streaming PipelinesEvolution is Continuous, and so are Big Data and Streaming Pipelines
Evolution is Continuous, and so are Big Data and Streaming PipelinesDatabricks
 
Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?Erik-Berndt Scheper
 
NSA for Enterprises Log Analysis Use Cases
NSA for Enterprises   Log Analysis Use Cases NSA for Enterprises   Log Analysis Use Cases
NSA for Enterprises Log Analysis Use Cases WSO2
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaTreasure Data, Inc.
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0Russell Jurney
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Databricks
 

Similar to Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Engine - StampedeCon 2015 (20)

Mstr meetup
Mstr meetupMstr meetup
Mstr meetup
 
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy NguyenGrokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
 
Deploy your Multi-tier Application in Cloud Foundry
Deploy your Multi-tier Application in Cloud FoundryDeploy your Multi-tier Application in Cloud Foundry
Deploy your Multi-tier Application in Cloud Foundry
 
Spark streaming , Spark SQL
Spark streaming , Spark SQLSpark streaming , Spark SQL
Spark streaming , Spark SQL
 
Intro to Big Data - Orlando Code Camp 2014
Intro to Big Data - Orlando Code Camp 2014Intro to Big Data - Orlando Code Camp 2014
Intro to Big Data - Orlando Code Camp 2014
 
Thu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjayThu-310pm-Impetus-SachinAndAjay
Thu-310pm-Impetus-SachinAndAjay
 
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services LayerLogical Data Warehouse: How to Build a Virtualized Data Services Layer
Logical Data Warehouse: How to Build a Virtualized Data Services Layer
 
10 ways to make your code rock
10 ways to make your code rock10 ways to make your code rock
10 ways to make your code rock
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDBAgile Data Science 2.0: Using Spark with MongoDB
Agile Data Science 2.0: Using Spark with MongoDB
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Why SQL? | Kenny Gorman, Cloudera
Why SQL? | Kenny Gorman, ClouderaWhy SQL? | Kenny Gorman, Cloudera
Why SQL? | Kenny Gorman, Cloudera
 
Evolution is Continuous, and so are Big Data and Streaming Pipelines
Evolution is Continuous, and so are Big Data and Streaming PipelinesEvolution is Continuous, and so are Big Data and Streaming Pipelines
Evolution is Continuous, and so are Big Data and Streaming Pipelines
 
Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?Big Data processing with Spark, Scala or Java?
Big Data processing with Spark, Scala or Java?
 
NSA for Enterprises Log Analysis Use Cases
NSA for Enterprises   Log Analysis Use Cases NSA for Enterprises   Log Analysis Use Cases
NSA for Enterprises Log Analysis Use Cases
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with Rocana
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Agile Data Science 2.0
Agile Data Science 2.0Agile Data Science 2.0
Agile Data Science 2.0
 
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
Expanding Apache Spark Use Cases in 2.2 and Beyond with Matei Zaharia and dem...
 

More from StampedeCon

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...StampedeCon
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017StampedeCon
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017StampedeCon
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...StampedeCon
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017StampedeCon
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017StampedeCon
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017StampedeCon
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...StampedeCon
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...StampedeCon
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017StampedeCon
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017StampedeCon
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017StampedeCon
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017StampedeCon
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017StampedeCon
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017StampedeCon
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...StampedeCon
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...StampedeCon
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016StampedeCon
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016StampedeCon
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016StampedeCon
 

More from StampedeCon (20)

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
 

Recently uploaded

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhYasamin16
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGIThomas Poetter
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degreeyuu sss
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Cathrine Wilhelmsen
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxaleedritatuxx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 

Recently uploaded (20)

GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhhThiophen Mechanism khhjjjjjjjhhhhhhhhhhh
Thiophen Mechanism khhjjjjjjjhhhhhhhhhhh
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)Data Factory in Microsoft Fabric (MsBIP #82)
Data Factory in Microsoft Fabric (MsBIP #82)
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 

Using Multiple Persistence Layers in Spark to Build a Scalable Prediction Engine - StampedeCon 2015

  • 1. USING MULTIPLE PERSISTENCE LAYERS IN SPARK TO BUILD A SCALABLE PREDICTION ENGINE Richard Williamson | @SuperHadooper
  • 2. 22 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. • Some Background • Model Building/Scoring Use Case • Spark Persistence to Parquet via Kafka • Exploratory Analysis with Impala • Spark Persistence to Cassandra • Operational queries via CQL • Impala Model Building/Scoring • Spark Model Building/Scoring • Questions AGENDA
  • 3. 3 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
  • 4. 4 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. THE EXPERIMENTAL ENTERPRISE We need to both support investigative work and build a solid layer for production. Data science allows us to observe our experiments and respond to the changing environment. The foundation of the experimental enterprise focuses on making infrastructure readily accessible.
  • 5. 5 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED.
  • 6. 6 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. • VOLUME: Storing longer periods of history • VELOCITY: Stream processing allows for on the fly forecast adjustments • VARIETY: Data Scientists can use additional features to improve model accuracy HOW HAS “BIG DATA” CHANGED OUR ABILITY TO BETTER PREDICT THE FUTURE?
  • 7. 7 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. • Lowered Cost of Storing and Processing Data • Improved data scientist efficiency by having common repository • Improved modeling performance by bringing models to the data versus bringing data to the models HOW HAS “BIG DATA TECHNOLOGY” CHANGED OUR ABILITY TO BETTER PREDICT THE FUTURE?
  • 8. 88 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. Can we predict future RSVPs by consuming Meetup.com streaming API data? http://stream.meetup.com/2/rs vps https://github.com/silicon- valley-data-science/strataca- 2015/tree/master/Spark-to- Build-a-Scalable-Prediction- Engine EXAMPLE USE CASE
  • 10. 10 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. BY TAKING A LOOK AT THE DATA… {"response":"yes","member":{"member_name":"Richard Williamson","photo":"http://photos3.meetupstatic.com/photos/member/d/a/4/0/thumb_23159587 2.jpeg","member_id":29193652},"visibility":"public","event":{"time":1424223000000,"event_url":"http://www.me etup.com/Big-Data-Science/events/217322312/","event_id":"fbtgdlytdbbc","event_name":"Big Data Science @Strata Conference, 2015"},"guests":0,"mtime":1424020205391,"rsvp_id":1536654666,"group":{"group_name":"Big Data Science","group_state":"CA","group_city":"Fremont","group_lat":37.52,"group_urlname":"Big-Data- Science","group_id":3168962,"group_country":"us","group_topics":[{"urlkey":"data- visualization","topic_name":"Data Visualization"},{"urlkey":"data-mining","topic_name":"Data Mining"},{"urlkey":"businessintell","topic_name":"Business Intelligence"},{"urlkey":"mapreduce","topic_name":"MapReduce"},{"urlkey":"hadoop","topic_name":"Hadoop"}, {"urlkey":"opensource","topic_name":"Open Source"},{"urlkey":"r-project-for-statistical- computing","topic_name":"R Project for Statistical Computing"},{"urlkey":"predictive- analytics","topic_name":"Predictive Analytics"},{"urlkey":"cloud-computing","topic_name":"Cloud Computing"},{"urlkey":"big-data","topic_name":"Big Data"},{"urlkey":"data-science","topic_name":"Data Science"},{"urlkey":"data-analytics","topic_name":"Data Analytics"},{"urlkey":"hbase","topic_name":"HBase"},{"urlkey":"hive","topic_name":"Hive"}],"group_lon":- 121.93},"venue":{"lon":-121.889122,"venue_name":"San Jose Convention Center, Room 210AE","venue_id":21805972,"lat":37.330341}}
  • 11. 11 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. FROM ONE TO MANY Now that we know what one record looks like, let’s capture a large dataset. • First we capture stream to Kafka by curling to file and tailing file to Kafka consumer — this will build up some history for us to look at.
  • 12. 12 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. FROM ONE TO MANY • Then use Spark to load to Parquet for interactive exploratory analysis in Impala: val sqlContext = new org.apache.spark.sql.SQLContext(sc) val meetup = sqlContext.jsonFile("/home/odroid/meetupstream”) meetup.registerTempTable("meetup") val meetup2 = sqlContext.sql("select event.event_id,event.event_name,event.event_url,event.time, guests,member.member_id,member.member_name,member.other_services.facebook.identifier as facebook_identifier, member.other_services.linkedin.identifier as linkedin_identifier, member.other_services.twitter.identifier as twitter_identifier, member.photo,mtime,response, rsvp_id,venue.lat,venue.lon,venue.venue_id,venue.venue_name,visibility from meetup”) meetup2.saveAsParquetFile("/home/odroid/meetup.parquet")
  • 13. 13 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. INTERACTIVE EXPLORATORY ANALYSIS IN IMPALA • We could have done basic exploratory analysis in Spark SQL, but date logic is not fully supported. • It’s very easy to point Impala to the data and do it there.
  • 14. 14 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. INTERACTIVE EXPLORATORY ANALYSIS IN IMPALA • Create external Impala table pointing to the data we just created in Spark: CREATE EXTERNAL TABLE meetup LIKE PARQUET '/user/odroid/meetup/_metadata' STORED AS PARQUET LOCATION '/user/odroid/meetup’;
  • 15. 15 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. INTERACTIVE EXPLORATORY ANALYSIS IN IMPALA • Now we can query: select from_unixtime(cast(mtime/1000 as bigint), "yyyy-MM-dd HH”) as data_hr ,count(*) as rsvp_cnt from meetup group by 1 order by 1;
  • 16. 16 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. val ssc = new StreamingContext(sc, Seconds(10)) val lines = KafkaUtils.createStream(ssc, ”localhost:2181", "meetupstream", Map("meetupstream" -> 10)).map(_._2) val sqlContext = new org.apache.spark.sql.SQLContext(sc) lines.foreachRDD(rdd => { val lines2 = sqlContext.jsonRDD(rdd) lines2.registerTempTable("lines2") val lines3 = sqlContext.sql("select event.event_id,event.event_name,event.event_url,event.time,guests, member.member_id, member.member_name,member.other_services.facebook.identifier as facebook_identifier, member.other_services.linkedin.identifier as linkedin_identifier,member.other_services.twitter.identifier as twitter_identifier,member.photo, mtime,response, rsvp_id,venue.lat,venue.lon,venue.venue_id,venue.venue_name,visibility from lines2") val lines4 = lines3.map(line => (line(0), line(5), line(13), line(1), line(2), line(3), line(4), line(6), line(7), line(8), line(9), line(10), line(11), line(12), line(14), line(15), line(16), line(17), line(18) )) lines4.saveToCassandra("meetup","rsvps", SomeColumns("event_id", "member_id","rsvp_id","event_name", "event_url", "time","guests","member_name","facebook_identifier","linkedin_identifier","twitter_identifier","photo","mtime","response","l at","lon","venue_id","venue_name","visibility"))}) ssc.start() SPARK STREAMING: Load Cassandra from Kafka
  • 17. 17 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. OPERATIONAL QUERIES IN CASSANDRA • When low millisecond queries are needed, Cassandra provides favorable storage. • Now we can perform queries like this one: select * from meetup.rsvps where event_id=“fbtgdlytdbbc” and member_id=29193652 and rsvp_id=1536654666;
  • 18. 18 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. Why do we want to store the same data in two places? • Different access patterns prefer different physical storage. • Large table scans needed for analytical queries require fast sequential access to data, making Parquet format in HDFS a perfect fit. It can then be queried by Impala, Spark, or even Hive. • Fast record lookups require a carefully designed key to allow for direct access to the specific row of data, making Cassandra (or HBase) a very good fit.
  • 19. 19 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. START BUILDING MODELS • A quick way to do some exploratory predictions is with Impala MADlib or Spark MLlib. • First we need to build some features from the table we created in Impala.
  • 20. 20 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. START BUILDING MODELS • A simple way to do this for hourly sales is to create a boolean indicator variable for each hour, and set it to true if the given observation was for the given hour. • We can do the same to indicate whether the observation was taken on a weekend or weekday.
  • 21. 21 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. START BUILDING MODELS • The SQL to do this look something like this: create table rsvps_by_hr_training as ( select case when mhour=0 then 1 else 0 end as hr0 ,case when mhour=1 then 1 else 0 end as hr1 … ,case when mhour=22 then 1 else 0 end as hr22 ,case when mhour=23 then 1 else 0 end as hr23 ,case when mdate in ("2015-02-14","2015-02-15") then 1 else 0 end as weekend_day ,mdate,mhour,rsvp_cnt from rsvps_by_hour);
  • 22. 22 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. • Install MADlib for Impala: http://blog.cloudera.com/blog/2013/10/how-to-use-madlib-pre-built- analytic-functions-with-impala/ • Train regression model in SQL: SELECT printarray(linr(toarray(hr0,hr1,hr2,hr3,hr4,hr5,hr6,hr7,hr8,hr9,hr10,hr11,hr12,hr13,hr14, hr15,hr16,hr17,hr18,hr19,hr20,hr21,hr22,hr23,weekend_day), rsvp_cnt)) from meetup.rsvps_by_hr_training; • Which gives us the regression coefficients: <8037.43, 7883.93, 7007.68, 6851.91, 6307.91, 5468.24, 4792.58, 4336.91, 4330.24, 4360.91, 4373.24, 4711.58, 5649.91, 6752.24, 8056.24, 9042.58, 9761.37, 10205.9, 10365.6, 10048.6, 9946.12, 9538.87, 9984.37, 9115.12, -2323.73> • Which we can then apply to new data: select mdate,mhour,cast(linrpredict(toarray(8037.43, 7883.93, 7007.68, 6851.91, 6307.91, 5468.24, 4792.58, 4336.91, 4330.24, 4360.91, 4373.24, 4711.58, 5649.91, 6752.24, 8056.24, 9042.58, 9761.37, 10205.9, 10365.6, 10048.6, 9946.12, 9538.87, 9984.37, 9115.12, -2323.73), toarray(hr0, hr1, hr2, hr3, hr4, hr5, hr6, hr7, hr8, hr9, hr10, hr11, hr12, hr13, hr14, hr15, hr16, hr17, hr18, hr19, hr20, hr21, hr22, hr23, weekend_day)) as int) as rsvp_cnt_pred, rsvp_cnt from meetup.rsvps_by_hr_testing Train/score regression model in Impala
  • 23. 23 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. PREDICTED HOURLY RSVPS FROM IMPALA 0 2000 4000 6000 8000 10000 12000 1 13 25 37 49 61 73 85 97 hour_mod rsvp_cnt_prediction rsvp_cnt
  • 24. 24 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. Prepare data for regression in Spark val meetup3 = sqlContext.sql("select mtime from meetup2") val meetup4 = meetup3.map(x => new DateTime(x(0).toString().toLong)).map(x => (x.toString("yyyy- MM-dd HH"),x.toString("yyyy-MM-dd"),x.toString("HH")) ) case class Meetup(mdatehr: String, mdate: String, mhour: String) val meetup5 = meetup4.map(p => Meetup(p._1, p._2, p._3)) meetup5.registerTempTable("meetup5") val meetup6 = sqlContext.sql("select mdate,mhour,count(*) as rsvp_cnt from meetup5 where mdatehr >= '2015-02-15 02' group by mdatehr,mdate,mhour") meetup6.registerTempTable("meetup6") sqlContext.sql("cache table meetup6”) val trainingData = meetup7.map { row => val features = Array[Double](row(24).toString().toDouble,row(0).toString().toDouble, … LabeledPoint(row(27).toString().toDouble, Vectors.dense(features))}
  • 25. 25 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. Train and Score regression model in Spark val trainingData = meetup7.map { row => val features = Array[Double](1.0,row(0).toString().toDouble,row(1).toString().toDouble, … LabeledPoint(row(27).toString().toDouble, Vectors.dense(features)) } val model = new RidgeRegressionWithSGD().run(trainingData) val scores = meetup7.map { row => val features = Vectors.dense(Array[Double](1.0,row(0).toString().toDouble, … row(23).toString().toDouble)) (row(25),row(26),row(27), model.predict(features)) } scores.foreach(println)
  • 26. 26 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. PREDICTED HOURLY RSVPS FROM SPARK 0 2000 4000 6000 8000 10000 12000 1 13 25 37 49 61 73 85 97 hour_mod rsvp_cnt_prediction rsvp_cnt
  • 27. 27 © 2015 SILICON VALLEY DATA SCIENCE LLC. ALL RIGHTS RESERVED. NEXT STEPS • Include additional features (Holidays, Growth Trend, etc.) • Predict at lower granularity (Event Level, Topic Level, etc.) • Convert Spark regression model to utilize streaming regression • Build ensemble models to leverage strengths of multiple algorithms • Move to pilot, then production