4. What do we do with time series data?
November1, 2016
w Forecast future values given past observations
$8.90
$8.95
$8.90
$9.06
$9.10
10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10
corn price
?
?
?
4
6. Multivariate time series
November1, 2016
w We can forecast better by joining multiple time series
w Our framework enables fast distributed temporal join of large scale unalignedtime series
w Temporal join is a fundamental operation for time series analysis
$8.90
$8.95
$8.90
$9.06
$9.10
10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10
corn price
75°F
72°F
71°F
72°F
68°F
67°F
65°F
temperature
6
7. Multivariate time series
November1, 2016
w We can forecast better by joining multiple time series
w Our framework enables fast distributed temporal join of large scale unalignedtime series
w Temporal join is a fundamental operation for time series analysis
€7.94
€7.98
€7.94
€8.08
€8.12
10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10
corn price
23°C
22°C
21°C
22°C
20°C
19°C
18°C
temperature
7
8. What is a left join?
November1, 2016
time series 1 time series 2
8
9. What is temporal join?
November1, 2016
w A particular joinfunction defined by a matching criteriaover time
w Examples of criteria
w look-backward
w look-forward
time series 1 time series 2
look-forward
time series 1 time series 2
look-backward
observation
9
11. ImportantLegal Information
November1, 2016 11
The information presented here is offered for informational purposes only and should
not be used for any other purpose (including, without limitation, the making of
investment decisions). Examples provided herein are for illustrative purposes only and
are not necessarily based on actual data. Nothing herein constitutes an offer to sell or the
solicitation of any offer to buy any security or other interest. We consider this
information to be confidential and not for redistribution or dissemination.
17. Time Series Scale
November1, 2016
time tweets
08:00 AM
10:00 AM
12:00 PM
time BRK.A
08:00AM
11:00AMWe need fast and scalable
distributed temporal join
17
18. Existing solutions
November1, 2016
w Existing packages don’t support temporal join or can’t handle large time series
w Pandas / R / Matlab
w Limited to single machine
w Spark
w Does scale, but all data is unordered
w spark-ts
w Expects univariate time series to fit on single machine
w Splits by col
w Supports only snapshot data
18
19. Flint: A new time series library for Spark
November1, 2016
w Goal
w Provide a collection of functions to manipulate and analyze time series at scale
w Group, temporal join, summarize, aggregate …
w How
w Build a time series aware data structure
w TimeSeriesRDD extends RDD
w Optimize using temporal locality
w Reduce shuffling
w Reduce memory pressure by streaming
19
20. What is a TimeSeriesRDD?
November1, 2016
w TimeSeriesRDD vs RDD
w Associate time range on each partition
w Track partition time-ranges
w Preserve temporal order
20
23. Group function
November1, 2016
w A group function groups rows with exactly the same timestamps
time city temperature
1:00 PM New York 70°F
1:00 PM Brussels 60°F
2:00 PM New York 71°F
2:00 PM Brussels 61°F
3:00 PM New York 72°F
3:00 PM Brussels 62°F
4:00 PM New York 73°F
4:00 PM Brussels 63°F
group 1
group 2
group 3
group 4
23
24. Group function
November1, 2016
w A group function groups rows with nearby timestamps
time city temperature
1:00 PM New York 70°F
1:00 PM Brussels 60°F
2:00 PM New York 71°F
2:00 PM Brussels 61°F
3:00 PM New York 72°F
3:00 PM Brussels 62°F
4:00 PM New York 73°F
4:00 PM Brussels 63°F
group 1
group 2
24
25. Group in Spark
November1, 2016
w Groups rows with exactly the same timestamps
RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
25
26. w Data is shuffled and materialized on the workers
Group in Spark
November1, 2016
RDD
groupBy
RDD
1:00PM
2:00PM
2:00PM
1:00PM
3:00PM
3:00PM
4:00PM
4:00PM
sortBy
RDD
w Back to Temporal Order
w Temporal order is not preserved
26
27. Group in TimeSeriesRDD
November1, 2016
w Data is grouped per partition locally as streams
TimeSeriesRDD
2:00PM
1:00PM
3:00PM
4:00PM
1:00PM
1:00PM
2:00PM
2:00PM
3:00PM
3:00PM
4:00PM
4:00PM
27
28. • Running time of count after group
• 16 executors (10G memory and 4 cores per executor)
• Data read from HDFS
Benchmark for group + count
November1, 2016
0s 20s 40s 60s 80s 100s
20M
40M
60M
80M
100M TimeseriesRDD
DataFrame
RDD
50 - 100X5 - 10X
28
29. Temporaljoin
November1, 2016
w A temporal join function is defined by a matching criteria over time
w A typical matching criteria has two parameters
w direction– look-backward or look-forward
w window – how much to look-backward or look-forward
look-backward temporal join
window
29
30. Temporaljoin
November1, 2016
w A temporal join function is defined by a matching criteria over time
w A typical matching criteria has two parameters
w direction– look-backward or look-forward
w window – how much to look-backward or look-forward
look-backward temporal join
window
30
31. Temporaljoin
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
time series 1
31
time series 2
32. Temporaljoin
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w How do we do temporal join in TimeSeriesRDD?
TimeSeriesRDD TimeSeriesRDD
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
32
33. Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w partition time space into disjoint intervals
TimeSeriesRDD TimeSeriesRDDjoined
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
33
34. Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Build dependencygraph for the joinedTimeSeriesRDD
TimeSeriesRDD TimeSeriesRDDjoined
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
[1:00AM, 4:00AM)
[4:00AM, 6:00AM)
[1:00AM, 4:00AM)
[4:00AM, 6:00AM)
34
35. 1:00AM1:00AM
Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Join data as streams per partition
1:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM
2:00AM
4:00AM
5:00AM
3:00AM
5:00AM
35
36. Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Join data as streams
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM 1:00AM1:00AM
2:00AM
36
37. Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Join data as streams
2:00AM
1:00AM
5:00AM
1:00AM
5:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM
1:00AM
1:00AM
2:00AM
4:00AM
3:00AM
4:00AM
3:00AM
37
38. Temporaljoin in TimeSeriesRDD
November1, 2016
w Temporal join with criteria look-back and window of 1 hour
w Join data as streams
2:00AM
1:00AM
4:00AM
5:00AM
1:00AM
3:00AM
5:00AM
TimeSeriesRDD TimeSeriesRDDjoined
1:00AM
1:00AM
1:00AM
2:00AM
4:00AM 3:00AM
5:00AM 5:00AM
38
39. Benchmark for temporaljoin + count
November1, 2016
0s 10s 20s 30s 40s 50s 60s 70s 80s 90s 100s
20M
40M
60M
80M
100M TimeseriesRDD
DataFrame
RDD
20 - 50X5 - 10X
39
• Running time of count after temporal join
• 16 executors (10G memory and 4 cores per executor)
• Data read from HDFS
40. Functions over TimeSeriesRDD
November1, 2016
w Grouping functions
w Temporal joins such as look-forward, look-backward etc.
w Summarizers such as average, variance, z-score etc. over grouping functions
40
43. Key contributors
November1, 2016
w ChristopherAycock
w Yuri Bogomolov
w Jonathan Coveney
w Li Jin
w David Medina
w Julia Meinwald
w David Palaitis
w Larisa Sawyer
w Leif Walsh
w Wenbo Zhao
43
44. Flint: Time Series For Spark
November1, 2016
A library to solve for general time series analysis operations at massive scale
Anne Hathaway has nothing to do with Berkshire Hathaway
Check it out in open source, and contribute
https://github.com/twosigma/flint
44