SlideShare a Scribd company logo
1 of 45
Download to read offline
SPARK SUMMIT
EUROPE2016
Distributed Time Series Analysis
Framework For Spark
Larisa Sawyer
Two Sigma
Larisa Sawyer
November1, 2016 2
$0.0
$500.0
$1,000.0
$1,500.0
$2,000.0
$2,500.0
1/3/1950
1/3/1953
1/3/1956
1/3/1959
1/3/1962
1/3/1965
1/3/1968
1/3/1971
1/3/1974
1/3/1977
1/3/1980
1/3/1983
1/3/1986
1/3/1989
1/3/1992
1/3/1995
1/3/1998
1/3/2001
1/3/2004
1/3/2007
1/3/2010
1/3/2013
1/3/2016
S&P 500
Time seriesexamples
November1, 2016
w Stock market prices
w Temperatures
w Height
w …
18°C
20°C
22°C
24°C
26°C
28°C
30°C
32°C
34°C
New York
Brussels
100cm
110cm
120cm
130cm
140cm
150cm
160cm
170cm
180cm
5 6 7 8 9 10 11 12 13 14 15
Age (years)
Avg US female
100cm
110cm
120cm
130cm
140cm
150cm
160cm
170cm
180cm
5 6 7 8 9 10 11 12 13 14 15
Age (years)
Avg US female
Larisa
3
What do we do with time series data?
November1, 2016
w Forecast future values given past observations
$8.90
$8.95
$8.90
$9.06
$9.10
10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10
corn price
?
?
?
4
November1, 2016
Univariate time series
5
Multivariate time series
November1, 2016
w We can forecast better by joining multiple time series
w Our framework enables fast distributed temporal join of large scale unalignedtime series
w Temporal join is a fundamental operation for time series analysis
$8.90
$8.95
$8.90
$9.06
$9.10
10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10
corn price
75°F
72°F
71°F
72°F
68°F
67°F
65°F
temperature
6
Multivariate time series
November1, 2016
w We can forecast better by joining multiple time series
w Our framework enables fast distributed temporal join of large scale unalignedtime series
w Temporal join is a fundamental operation for time series analysis
€7.94
€7.98
€7.94
€8.08
€8.12
10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10
corn price
23°C
22°C
21°C
22°C
20°C
19°C
18°C
temperature
7
What is a left join?
November1, 2016
time series 1 time series 2
8
What is temporal join?
November1, 2016
w A particular joinfunction defined by a matching criteriaover time
w Examples of criteria
w look-backward
w look-forward
time series 1 time series 2
look-forward
time series 1 time series 2
look-backward
observation
9
Temporaljoin with look-backwardcriteria
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AM
10
ImportantLegal Information
November1, 2016 11
The information presented here is offered for informational purposes only and should
not be used for any other purpose (including, without limitation, the making of
investment decisions). Examples provided herein are for illustrative purposes only and
are not necessarily based on actual data. Nothing herein constitutes an offer to sell or the
solicitation of any offer to buy any security or other interest. We consider this
information to be confidential and not for redistribution or dissemination.
Temporaljoin with look-backwardcriteria
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AM
time tweets BRK.A
08:00 AM
10:00 AM
12:00 PM
12
Temporaljoin with look-backwardcriteria
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AM
time tweets BRK.A
08:00 AM
10:00 AM
12:00 PM
13
Temporaljoin with look-backwardcriteria
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AM
time tweets BRK.A
08:00 AM
10:00 AM
12:00 PM
14
Temporaljoin with look-backwardcriteria
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AM
time tweets BRK.A
08:00 AM
10:00 AM
12:00 PM
15
Temporaljoins in practice
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AM
16
Time Series Scale
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AMWe need fast and scalable
distributed temporal join
17
Existing solutions
November1, 2016
w Existing packages don’t support temporal join or can’t handle large time series
w Pandas / R / Matlab
w Limited to single machine
w Spark
w Does scale, but all data is unordered
w spark-ts
w Expects univariate time series to fit on single machine
w Splits by col
w Supports only snapshot data
18
Flint: A new time series library for Spark
November1, 2016
w Goal
w Provide a collection of functions to manipulate and analyze time series at scale
w Group, temporal join, summarize, aggregate …
w How
w Build a time series aware data structure
w TimeSeriesRDD extends RDD
w Optimize using temporal locality
w Reduce shuffling
w Reduce memory pressure by streaming
19
What is a TimeSeriesRDD?
November1, 2016
w TimeSeriesRDD vs RDD
w Associate time range on each partition
w Track partition time-ranges
w Preserve temporal order
20
RDD
November1, 2016
time temperature
6:00AM 60°F
6:01AM 61°F
… …
7:00AM 70°F
7:01AM 71°F
… …
8:00AM 80°F
8:01AM 81°F
… …
RDD
(6:00 AM, 60°F)
(6:01 AM, 61°F)
…
(7:00 AM, 70°F)
(7:01 AM, 71°F)
…
(8:00 AM, 80°F)
(8:01 AM, 81°F)
…
(6:58 AM, 64°F)
(6:59 AM, 65°F)
…
(7:34 AM, 74°F)
(7:35 AM, 74°F)
…
(7:58 AM, 76°F)
(7:59 AM, 77°F)
…
Raw Data
21
TimeSeriesRDD
November1, 2016
time temperature
6:00AM 60°F
6:01AM 61°F
… …
7:00AM 70°F
7:01AM 71°F
… …
8:00AM 80°F
8:01AM 81°F
… …
RDD
(6:00 AM, 60°F)
(6:01 AM, 61°F)
…
(7:00 AM, 70°F)
(7:01 AM, 71°F)
…
(8:00 AM, 80°F)
(8:01 AM, 81°F)
…
(6:58 AM, 64°F)
(6:59 AM, 65°F)
…
(7:34 AM, 74°F)
(7:35 AM, 74°F)
…
(7:58 AM, 76°F)
(7:59 AM, 77°F)
…
(6:00 AM, 60°F)
(6:01 AM, 61°F)
…
(8:00 AM, 80°F)
(8:01 AM, 81°F)
…
(7:00 AM, 70°F)
(7:01 AM, 71°F)
…
TSRDD
[06:00AM,
07:00AM)
[07:00AM,
8:00AM)
[8:00AM,
∞)
Raw Data
22
Group function
November1, 2016
w A group function groups rows with exactly the same timestamps
time city temperature
1:00 PM New York 70°F
1:00 PM Brussels 60°F
2:00 PM New York 71°F
2:00 PM Brussels 61°F
3:00 PM New York 72°F
3:00 PM Brussels 62°F
4:00 PM New York 73°F
4:00 PM Brussels 63°F
group 1
group 2
group 3
group 4
23
Group function
November1, 2016
w A group function groups rows with nearby timestamps
time city temperature
1:00 PM New York 70°F
1:00 PM Brussels 60°F
2:00 PM New York 71°F
2:00 PM Brussels 61°F
3:00 PM New York 72°F
3:00 PM Brussels 62°F
4:00 PM New York 73°F
4:00 PM Brussels 63°F
group 1
group 2
24
Group in Spark
November1, 2016
w Groups rows with exactly the same timestamps
RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
25
w Data is shuffled and materialized on the workers
Group in Spark
November1, 2016
RDD
groupBy
RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
sortBy
RDD
w Back to Temporal Order
w Temporal order is not preserved
26
Group in TimeSeriesRDD
November1, 2016
w Data is grouped per partition locally as streams
TimeSeriesRDD
2:00PM
1:00PM
3:00PM
4:00PM
1:00PM
1:00PM
2:00PM
2:00PM
3:00PM
3:00PM
4:00PM
4:00PM
27
• Running time of count after group
• 16 executors (10G memory and 4 cores per executor)
• Data read from HDFS
Benchmark for group + count
November1, 2016
0s 20s 40s 60s 80s 100s
20M
40M
60M
80M
100M TimeseriesRDD
DataFrame
RDD
50 - 100X5 - 10X
28
Temporaljoin
November1, 2016
w A temporal join function is defined by a matching criteria over time
w A typical matching criteria has two parameters
w direction– look-backward or look-forward
w window – how much to look-backward or look-forward
look-backward temporal join
window
29
Temporaljoin
November1, 2016
w A temporal join function is defined by a matching criteria over time
w A typical matching criteria has two parameters
w direction– look-backward or look-forward
w window – how much to look-backward or look-forward
look-backward temporal join
window
30
Temporaljoin
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
time series 1
31
time series 2
Temporaljoin
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w How do we do temporal join in TimeSeriesRDD?
TimeSeriesRDD TimeSeriesRDD
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
32
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w partition time space into disjoint intervals
TimeSeriesRDD TimeSeriesRDDjoined
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
33
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Build dependencygraph for the joinedTimeSeriesRDD
TimeSeriesRDD TimeSeriesRDDjoined
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
[1:00AM, 4:00AM)
[4:00AM, 6:00AM)
[1:00AM, 4:00AM)
[4:00AM, 6:00AM)
34
1:00AM1:00AM
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Join data as streams per partition
1:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM
2:00AM
4:00AM
5:00AM
3:00AM
5:00AM
35
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Join data as streams
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM 1:00AM1:00AM
2:00AM
36
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Join data as streams
2:00AM
1:00AM
5:00AM
1:00AM
5:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM
1:00AM
1:00AM
2:00AM
4:00AM
3:00AM
4:00AM
3:00AM
37
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Join data as streams
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM
1:00AM
1:00AM
2:00AM
4:00AM 3:00AM
5:00AM 5:00AM
38
Benchmark for temporaljoin + count
November1, 2016
0s 10s 20s 30s 40s 50s 60s 70s 80s 90s 100s
20M
40M
60M
80M
100M TimeseriesRDD
DataFrame
RDD
20 - 50X5 - 10X
39
• Running time of count after temporal join
• 16 executors (10G memory and 4 cores per executor)
• Data read from HDFS
Functions over TimeSeriesRDD
November1, 2016
w Grouping functions
w Temporal joins such as look-forward, look-backward etc.
w Summarizers such as average, variance, z-score etc. over grouping functions
40
Open Source
November1, 2016
w True!
w https://github.com/twosigma/flint
41
What’snext?
November1, 2016
w TimeSeriesDataframe / TimeSeriesDataset
w Speed up
w RicherAPIs
w Python bindings
w Additional summarizers
42
Key contributors
November1, 2016
w ChristopherAycock
w Yuri Bogomolov
w Jonathan Coveney
w Li Jin
w David Medina
w Julia Meinwald
w David Palaitis
w Larisa Sawyer
w Leif Walsh
w Wenbo Zhao
43
Flint: Time Series For Spark
November1, 2016
A library to solve for general time series analysis operations at massive scale
Anne Hathaway has nothing to do with Berkshire Hathaway
Check it out in open source, and contribute
https://github.com/twosigma/flint
44
SPARK SUMMIT
EUROPE2016
THANK YOU.
Larisa.Sawyer@twosigma.com

More Related Content

Viewers also liked

Spark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross LawleySpark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross LawleySpark Summit
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit
 
Spark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit
 
Spark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit
 
Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman FarahatSpark Summit
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Spark Summit
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit
 
Spark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey StellaSpark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey StellaSpark Summit
 
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf SigmundSpark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf SigmundSpark Summit
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSpark Summit
 

Viewers also liked (20)

Spark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross LawleySpark Summit EU talk by Ross Lawley
Spark Summit EU talk by Ross Lawley
 
Spark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub HavaSpark Summit EU talk by Jakub Hava
Spark Summit EU talk by Jakub Hava
 
Spark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til PifflSpark Summit EU talk by Miha Pelko and Til Piffl
Spark Summit EU talk by Miha Pelko and Til Piffl
 
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim LauSpark Summit EU talk by Kent Buenaventura and Willaim Lau
Spark Summit EU talk by Kent Buenaventura and Willaim Lau
 
Spark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean WamplerSpark Summit EU talk by Dean Wampler
Spark Summit EU talk by Dean Wampler
 
Spark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat PattersonSpark Summit EU talk by Pat Patterson
Spark Summit EU talk by Pat Patterson
 
Spark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital KediaSpark Summit EU talk by Sital Kedia
Spark Summit EU talk by Sital Kedia
 
Spark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van HovellSpark Summit EU talk by Herman van Hovell
Spark Summit EU talk by Herman van Hovell
 
Spark Summit EU talk by John Musser
Spark Summit EU talk by John MusserSpark Summit EU talk by John Musser
Spark Summit EU talk by John Musser
 
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar VeliqiSpark Summit EU talk by Ruben Pulido and Behar Veliqi
Spark Summit EU talk by Ruben Pulido and Behar Veliqi
 
Spark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick PentreathSpark Summit EU talk by Nick Pentreath
Spark Summit EU talk by Nick Pentreath
 
Spark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim HunterSpark Summit EU talk by Tim Hunter
Spark Summit EU talk by Tim Hunter
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
26 Trillion App Recomendations using 100 Lines of Spark Code - Ayman Farahat
 
Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0Simplifying Big Data Applications with Apache Spark 2.0
Simplifying Big Data Applications with Apache Spark 2.0
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Spark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey StellaSpark Summit EU talk by Casey Stella
Spark Summit EU talk by Casey Stella
 
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf SigmundSpark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
Spark Summit EU talk by Sebastian Schroeder and Ralf Sigmund
 
Spark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas GeerdinkSpark Summit EU talk by Bas Geerdink
Spark Summit EU talk by Bas Geerdink
 
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan SharmaSparking up Data Engineering: Spark Summit East talk by Rohan Sharma
Sparking up Data Engineering: Spark Summit East talk by Rohan Sharma
 

More from Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

More from Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Recently uploaded

Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data VisualizationKianJazayeri1
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Thomas Poetter
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksdeepakthakur548787
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelBoston Institute of Analytics
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxSimranPal17
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfblazblazml
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxHaritikaChhatwal1
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaManalVerma4
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBoston Institute of Analytics
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Boston Institute of Analytics
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024Susanna-Assunta Sansone
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...Jack Cole
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectBoston Institute of Analytics
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...Amil Baba Dawood bangali
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Seán Kennedy
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Seán Kennedy
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Boston Institute of Analytics
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxHimangsuNath
 

Recently uploaded (20)

Principles and Practices of Data Visualization
Principles and Practices of Data VisualizationPrinciples and Practices of Data Visualization
Principles and Practices of Data Visualization
 
Insurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis ProjectInsurance Churn Prediction Data Analysis Project
Insurance Churn Prediction Data Analysis Project
 
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
Minimizing AI Hallucinations/Confabulations and the Path towards AGI with Exa...
 
Digital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing worksDigital Marketing Plan, how digital marketing works
Digital Marketing Plan, how digital marketing works
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis modelDecoding Movie Sentiments: Analyzing Reviews with Data Analysis model
Decoding Movie Sentiments: Analyzing Reviews with Data Analysis model
 
What To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptxWhat To Do For World Nature Conservation Day by Slidesgo.pptx
What To Do For World Nature Conservation Day by Slidesgo.pptx
 
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdfEnglish-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
English-8-Q4-W3-Synthesizing-Essential-Information-From-Various-Sources-1.pdf
 
SMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptxSMOTE and K-Fold Cross Validation-Presentation.pptx
SMOTE and K-Fold Cross Validation-Presentation.pptx
 
IBEF report on the Insurance market in India
IBEF report on the Insurance market in IndiaIBEF report on the Insurance market in India
IBEF report on the Insurance market in India
 
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis ProjectBank Loan Approval Analysis: A Comprehensive Data Analysis Project
Bank Loan Approval Analysis: A Comprehensive Data Analysis Project
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
FAIR, FAIRsharing, FAIR Cookbook and ELIXIR - Sansone SA - Boston 2024
 
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
why-transparency-and-traceability-are-essential-for-sustainable-supply-chains...
 
Decoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis ProjectDecoding Patterns: Customer Churn Prediction Data Analysis Project
Decoding Patterns: Customer Churn Prediction Data Analysis Project
 
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
NO1 Certified Black Magic Specialist Expert Amil baba in Lahore Islamabad Raw...
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...Student profile product demonstration on grades, ability, well-being and mind...
Student profile product demonstration on grades, ability, well-being and mind...
 
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
Data Analysis Project Presentation: Unveiling Your Ideal Customer, Bank Custo...
 
Networking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptxNetworking Case Study prepared by teacher.pptx
Networking Case Study prepared by teacher.pptx
 

Spark Summit EU talk by Larisa Sawyer

  • 1. SPARK SUMMIT EUROPE2016 Distributed Time Series Analysis Framework For Spark Larisa Sawyer Two Sigma
  • 3. $0.0 $500.0 $1,000.0 $1,500.0 $2,000.0 $2,500.0 1/3/1950 1/3/1953 1/3/1956 1/3/1959 1/3/1962 1/3/1965 1/3/1968 1/3/1971 1/3/1974 1/3/1977 1/3/1980 1/3/1983 1/3/1986 1/3/1989 1/3/1992 1/3/1995 1/3/1998 1/3/2001 1/3/2004 1/3/2007 1/3/2010 1/3/2013 1/3/2016 S&P 500 Time seriesexamples November1, 2016 w Stock market prices w Temperatures w Height w … 18°C 20°C 22°C 24°C 26°C 28°C 30°C 32°C 34°C New York Brussels 100cm 110cm 120cm 130cm 140cm 150cm 160cm 170cm 180cm 5 6 7 8 9 10 11 12 13 14 15 Age (years) Avg US female 100cm 110cm 120cm 130cm 140cm 150cm 160cm 170cm 180cm 5 6 7 8 9 10 11 12 13 14 15 Age (years) Avg US female Larisa 3
  • 4. What do we do with time series data? November1, 2016 w Forecast future values given past observations $8.90 $8.95 $8.90 $9.06 $9.10 10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10 corn price ? ? ? 4
  • 6. Multivariate time series November1, 2016 w We can forecast better by joining multiple time series w Our framework enables fast distributed temporal join of large scale unalignedtime series w Temporal join is a fundamental operation for time series analysis $8.90 $8.95 $8.90 $9.06 $9.10 10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10 corn price 75°F 72°F 71°F 72°F 68°F 67°F 65°F temperature 6
  • 7. Multivariate time series November1, 2016 w We can forecast better by joining multiple time series w Our framework enables fast distributed temporal join of large scale unalignedtime series w Temporal join is a fundamental operation for time series analysis €7.94 €7.98 €7.94 €8.08 €8.12 10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10 corn price 23°C 22°C 21°C 22°C 20°C 19°C 18°C temperature 7
  • 8. What is a left join? November1, 2016 time series 1 time series 2 8
  • 9. What is temporal join? November1, 2016 w A particular joinfunction defined by a matching criteriaover time w Examples of criteria w look-backward w look-forward time series 1 time series 2 look-forward time series 1 time series 2 look-backward observation 9
  • 10. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM 10
  • 11. ImportantLegal Information November1, 2016 11 The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes an offer to sell or the solicitation of any offer to buy any security or other interest. We consider this information to be confidential and not for redistribution or dissemination.
  • 12. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM time tweets BRK.A 08:00 AM 10:00 AM 12:00 PM 12
  • 13. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM time tweets BRK.A 08:00 AM 10:00 AM 12:00 PM 13
  • 14. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM time tweets BRK.A 08:00 AM 10:00 AM 12:00 PM 14
  • 15. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM time tweets BRK.A 08:00 AM 10:00 AM 12:00 PM 15
  • 16. Temporaljoins in practice November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM 16
  • 17. Time Series Scale November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AMWe need fast and scalable distributed temporal join 17
  • 18. Existing solutions November1, 2016 w Existing packages don’t support temporal join or can’t handle large time series w Pandas / R / Matlab w Limited to single machine w Spark w Does scale, but all data is unordered w spark-ts w Expects univariate time series to fit on single machine w Splits by col w Supports only snapshot data 18
  • 19. Flint: A new time series library for Spark November1, 2016 w Goal w Provide a collection of functions to manipulate and analyze time series at scale w Group, temporal join, summarize, aggregate … w How w Build a time series aware data structure w TimeSeriesRDD extends RDD w Optimize using temporal locality w Reduce shuffling w Reduce memory pressure by streaming 19
  • 20. What is a TimeSeriesRDD? November1, 2016 w TimeSeriesRDD vs RDD w Associate time range on each partition w Track partition time-ranges w Preserve temporal order 20
  • 21. RDD November1, 2016 time temperature 6:00AM 60°F 6:01AM 61°F … … 7:00AM 70°F 7:01AM 71°F … … 8:00AM 80°F 8:01AM 81°F … … RDD (6:00 AM, 60°F) (6:01 AM, 61°F) … (7:00 AM, 70°F) (7:01 AM, 71°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) … (6:58 AM, 64°F) (6:59 AM, 65°F) … (7:34 AM, 74°F) (7:35 AM, 74°F) … (7:58 AM, 76°F) (7:59 AM, 77°F) … Raw Data 21
  • 22. TimeSeriesRDD November1, 2016 time temperature 6:00AM 60°F 6:01AM 61°F … … 7:00AM 70°F 7:01AM 71°F … … 8:00AM 80°F 8:01AM 81°F … … RDD (6:00 AM, 60°F) (6:01 AM, 61°F) … (7:00 AM, 70°F) (7:01 AM, 71°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) … (6:58 AM, 64°F) (6:59 AM, 65°F) … (7:34 AM, 74°F) (7:35 AM, 74°F) … (7:58 AM, 76°F) (7:59 AM, 77°F) … (6:00 AM, 60°F) (6:01 AM, 61°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) … (7:00 AM, 70°F) (7:01 AM, 71°F) … TSRDD [06:00AM, 07:00AM) [07:00AM, 8:00AM) [8:00AM, ∞) Raw Data 22
  • 23. Group function November1, 2016 w A group function groups rows with exactly the same timestamps time city temperature 1:00 PM New York 70°F 1:00 PM Brussels 60°F 2:00 PM New York 71°F 2:00 PM Brussels 61°F 3:00 PM New York 72°F 3:00 PM Brussels 62°F 4:00 PM New York 73°F 4:00 PM Brussels 63°F group 1 group 2 group 3 group 4 23
  • 24. Group function November1, 2016 w A group function groups rows with nearby timestamps time city temperature 1:00 PM New York 70°F 1:00 PM Brussels 60°F 2:00 PM New York 71°F 2:00 PM Brussels 61°F 3:00 PM New York 72°F 3:00 PM Brussels 62°F 4:00 PM New York 73°F 4:00 PM Brussels 63°F group 1 group 2 24
  • 25. Group in Spark November1, 2016 w Groups rows with exactly the same timestamps RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 25
  • 26. w Data is shuffled and materialized on the workers Group in Spark November1, 2016 RDD groupBy RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM sortBy RDD w Back to Temporal Order w Temporal order is not preserved 26
  • 27. Group in TimeSeriesRDD November1, 2016 w Data is grouped per partition locally as streams TimeSeriesRDD 2:00PM 1:00PM 3:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM 27
  • 28. • Running time of count after group • 16 executors (10G memory and 4 cores per executor) • Data read from HDFS Benchmark for group + count November1, 2016 0s 20s 40s 60s 80s 100s 20M 40M 60M 80M 100M TimeseriesRDD DataFrame RDD 50 - 100X5 - 10X 28
  • 29. Temporaljoin November1, 2016 w A temporal join function is defined by a matching criteria over time w A typical matching criteria has two parameters w direction– look-backward or look-forward w window – how much to look-backward or look-forward look-backward temporal join window 29
  • 30. Temporaljoin November1, 2016 w A temporal join function is defined by a matching criteria over time w A typical matching criteria has two parameters w direction– look-backward or look-forward w window – how much to look-backward or look-forward look-backward temporal join window 30
  • 31. Temporaljoin November1, 2016 w Temporal join with criteria look-back and window of 1 hour 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM time series 1 31 time series 2
  • 32. Temporaljoin November1, 2016 w Temporal join with criteria look-back and window of 1 hour w How do we do temporal join in TimeSeriesRDD? TimeSeriesRDD TimeSeriesRDD 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM 32
  • 33. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w partition time space into disjoint intervals TimeSeriesRDD TimeSeriesRDDjoined 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM 33
  • 34. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Build dependencygraph for the joinedTimeSeriesRDD TimeSeriesRDD TimeSeriesRDDjoined 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM [1:00AM, 4:00AM) [4:00AM, 6:00AM) [1:00AM, 4:00AM) [4:00AM, 6:00AM) 34
  • 35. 1:00AM1:00AM Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Join data as streams per partition 1:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 2:00AM 4:00AM 5:00AM 3:00AM 5:00AM 35
  • 36. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Join data as streams 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM1:00AM 2:00AM 36
  • 37. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Join data as streams 2:00AM 1:00AM 5:00AM 1:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM 1:00AM 2:00AM 4:00AM 3:00AM 4:00AM 3:00AM 37
  • 38. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Join data as streams 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM 1:00AM 2:00AM 4:00AM 3:00AM 5:00AM 5:00AM 38
  • 39. Benchmark for temporaljoin + count November1, 2016 0s 10s 20s 30s 40s 50s 60s 70s 80s 90s 100s 20M 40M 60M 80M 100M TimeseriesRDD DataFrame RDD 20 - 50X5 - 10X 39 • Running time of count after temporal join • 16 executors (10G memory and 4 cores per executor) • Data read from HDFS
  • 40. Functions over TimeSeriesRDD November1, 2016 w Grouping functions w Temporal joins such as look-forward, look-backward etc. w Summarizers such as average, variance, z-score etc. over grouping functions 40
  • 41. Open Source November1, 2016 w True! w https://github.com/twosigma/flint 41
  • 42. What’snext? November1, 2016 w TimeSeriesDataframe / TimeSeriesDataset w Speed up w RicherAPIs w Python bindings w Additional summarizers 42
  • 43. Key contributors November1, 2016 w ChristopherAycock w Yuri Bogomolov w Jonathan Coveney w Li Jin w David Medina w Julia Meinwald w David Palaitis w Larisa Sawyer w Leif Walsh w Wenbo Zhao 43
  • 44. Flint: Time Series For Spark November1, 2016 A library to solve for general time series analysis operations at massive scale Anne Hathaway has nothing to do with Berkshire Hathaway Check it out in open source, and contribute https://github.com/twosigma/flint 44