Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Spark Summit EU talk by Larisa Sawyer

1,018 views

Published on

OrderedRDD: A Distributed Time Series Analysis Framework for Spark

Published in: Data & Analytics
  • Be the first to comment

Spark Summit EU talk by Larisa Sawyer

  1. 1. SPARK SUMMIT EUROPE2016 Distributed Time Series Analysis Framework For Spark Larisa Sawyer Two Sigma
  2. 2. Larisa Sawyer November1, 2016 2
  3. 3. $0.0 $500.0 $1,000.0 $1,500.0 $2,000.0 $2,500.0 1/3/1950 1/3/1953 1/3/1956 1/3/1959 1/3/1962 1/3/1965 1/3/1968 1/3/1971 1/3/1974 1/3/1977 1/3/1980 1/3/1983 1/3/1986 1/3/1989 1/3/1992 1/3/1995 1/3/1998 1/3/2001 1/3/2004 1/3/2007 1/3/2010 1/3/2013 1/3/2016 S&P 500 Time seriesexamples November1, 2016 w Stock market prices w Temperatures w Height w … 18°C 20°C 22°C 24°C 26°C 28°C 30°C 32°C 34°C New York Brussels 100cm 110cm 120cm 130cm 140cm 150cm 160cm 170cm 180cm 5 6 7 8 9 10 11 12 13 14 15 Age (years) Avg US female 100cm 110cm 120cm 130cm 140cm 150cm 160cm 170cm 180cm 5 6 7 8 9 10 11 12 13 14 15 Age (years) Avg US female Larisa 3
  4. 4. What do we do with time series data? November1, 2016 w Forecast future values given past observations $8.90 $8.95 $8.90 $9.06 $9.10 10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10 corn price ? ? ? 4
  5. 5. November1, 2016 Univariate time series 5
  6. 6. Multivariate time series November1, 2016 w We can forecast better by joining multiple time series w Our framework enables fast distributed temporal join of large scale unalignedtime series w Temporal join is a fundamental operation for time series analysis $8.90 $8.95 $8.90 $9.06 $9.10 10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10 corn price 75°F 72°F 71°F 72°F 68°F 67°F 65°F temperature 6
  7. 7. Multivariate time series November1, 2016 w We can forecast better by joining multiple time series w Our framework enables fast distributed temporal join of large scale unalignedtime series w Temporal join is a fundamental operation for time series analysis €7.94 €7.98 €7.94 €8.08 €8.12 10/1 10/2 10/3 10/4 10/5 10/6 10/7 10/8 10/9 10/10 corn price 23°C 22°C 21°C 22°C 20°C 19°C 18°C temperature 7
  8. 8. What is a left join? November1, 2016 time series 1 time series 2 8
  9. 9. What is temporal join? November1, 2016 w A particular joinfunction defined by a matching criteriaover time w Examples of criteria w look-backward w look-forward time series 1 time series 2 look-forward time series 1 time series 2 look-backward observation 9
  10. 10. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM 10
  11. 11. ImportantLegal Information November1, 2016 11 The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes an offer to sell or the solicitation of any offer to buy any security or other interest. We consider this information to be confidential and not for redistribution or dissemination.
  12. 12. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM time tweets BRK.A 08:00 AM 10:00 AM 12:00 PM 12
  13. 13. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM time tweets BRK.A 08:00 AM 10:00 AM 12:00 PM 13
  14. 14. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM time tweets BRK.A 08:00 AM 10:00 AM 12:00 PM 14
  15. 15. Temporaljoin with look-backwardcriteria November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM time tweets BRK.A 08:00 AM 10:00 AM 12:00 PM 15
  16. 16. Temporaljoins in practice November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AM 16
  17. 17. Time Series Scale November1, 2016 time tweets 08:00 AM 10:00 AM 12:00 PM time BRK.A 08:00AM 11:00AMWe need fast and scalable distributed temporal join 17
  18. 18. Existing solutions November1, 2016 w Existing packages don’t support temporal join or can’t handle large time series w Pandas / R / Matlab w Limited to single machine w Spark w Does scale, but all data is unordered w spark-ts w Expects univariate time series to fit on single machine w Splits by col w Supports only snapshot data 18
  19. 19. Flint: A new time series library for Spark November1, 2016 w Goal w Provide a collection of functions to manipulate and analyze time series at scale w Group, temporal join, summarize, aggregate … w How w Build a time series aware data structure w TimeSeriesRDD extends RDD w Optimize using temporal locality w Reduce shuffling w Reduce memory pressure by streaming 19
  20. 20. What is a TimeSeriesRDD? November1, 2016 w TimeSeriesRDD vs RDD w Associate time range on each partition w Track partition time-ranges w Preserve temporal order 20
  21. 21. RDD November1, 2016 time temperature 6:00AM 60°F 6:01AM 61°F … … 7:00AM 70°F 7:01AM 71°F … … 8:00AM 80°F 8:01AM 81°F … … RDD (6:00 AM, 60°F) (6:01 AM, 61°F) … (7:00 AM, 70°F) (7:01 AM, 71°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) … (6:58 AM, 64°F) (6:59 AM, 65°F) … (7:34 AM, 74°F) (7:35 AM, 74°F) … (7:58 AM, 76°F) (7:59 AM, 77°F) … Raw Data 21
  22. 22. TimeSeriesRDD November1, 2016 time temperature 6:00AM 60°F 6:01AM 61°F … … 7:00AM 70°F 7:01AM 71°F … … 8:00AM 80°F 8:01AM 81°F … … RDD (6:00 AM, 60°F) (6:01 AM, 61°F) … (7:00 AM, 70°F) (7:01 AM, 71°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) … (6:58 AM, 64°F) (6:59 AM, 65°F) … (7:34 AM, 74°F) (7:35 AM, 74°F) … (7:58 AM, 76°F) (7:59 AM, 77°F) … (6:00 AM, 60°F) (6:01 AM, 61°F) … (8:00 AM, 80°F) (8:01 AM, 81°F) … (7:00 AM, 70°F) (7:01 AM, 71°F) … TSRDD [06:00AM, 07:00AM) [07:00AM, 8:00AM) [8:00AM, ∞) Raw Data 22
  23. 23. Group function November1, 2016 w A group function groups rows with exactly the same timestamps time city temperature 1:00 PM New York 70°F 1:00 PM Brussels 60°F 2:00 PM New York 71°F 2:00 PM Brussels 61°F 3:00 PM New York 72°F 3:00 PM Brussels 62°F 4:00 PM New York 73°F 4:00 PM Brussels 63°F group 1 group 2 group 3 group 4 23
  24. 24. Group function November1, 2016 w A group function groups rows with nearby timestamps time city temperature 1:00 PM New York 70°F 1:00 PM Brussels 60°F 2:00 PM New York 71°F 2:00 PM Brussels 61°F 3:00 PM New York 72°F 3:00 PM Brussels 62°F 4:00 PM New York 73°F 4:00 PM Brussels 63°F group 1 group 2 24
  25. 25. Group in Spark November1, 2016 w Groups rows with exactly the same timestamps RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM 25
  26. 26. w Data is shuffled and materialized on the workers Group in Spark November1, 2016 RDD groupBy RDD 1:00PM 2:00PM 2:00PM 1:00PM 3:00PM 3:00PM 4:00PM 4:00PM sortBy RDD w Back to Temporal Order w Temporal order is not preserved 26
  27. 27. Group in TimeSeriesRDD November1, 2016 w Data is grouped per partition locally as streams TimeSeriesRDD 2:00PM 1:00PM 3:00PM 4:00PM 1:00PM 1:00PM 2:00PM 2:00PM 3:00PM 3:00PM 4:00PM 4:00PM 27
  28. 28. • Running time of count after group • 16 executors (10G memory and 4 cores per executor) • Data read from HDFS Benchmark for group + count November1, 2016 0s 20s 40s 60s 80s 100s 20M 40M 60M 80M 100M TimeseriesRDD DataFrame RDD 50 - 100X5 - 10X 28
  29. 29. Temporaljoin November1, 2016 w A temporal join function is defined by a matching criteria over time w A typical matching criteria has two parameters w direction– look-backward or look-forward w window – how much to look-backward or look-forward look-backward temporal join window 29
  30. 30. Temporaljoin November1, 2016 w A temporal join function is defined by a matching criteria over time w A typical matching criteria has two parameters w direction– look-backward or look-forward w window – how much to look-backward or look-forward look-backward temporal join window 30
  31. 31. Temporaljoin November1, 2016 w Temporal join with criteria look-back and window of 1 hour 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM time series 1 31 time series 2
  32. 32. Temporaljoin November1, 2016 w Temporal join with criteria look-back and window of 1 hour w How do we do temporal join in TimeSeriesRDD? TimeSeriesRDD TimeSeriesRDD 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM 32
  33. 33. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w partition time space into disjoint intervals TimeSeriesRDD TimeSeriesRDDjoined 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM 33
  34. 34. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Build dependencygraph for the joinedTimeSeriesRDD TimeSeriesRDD TimeSeriesRDDjoined 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM [1:00AM, 4:00AM) [4:00AM, 6:00AM) [1:00AM, 4:00AM) [4:00AM, 6:00AM) 34
  35. 35. 1:00AM1:00AM Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Join data as streams per partition 1:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 2:00AM 4:00AM 5:00AM 3:00AM 5:00AM 35
  36. 36. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Join data as streams 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM1:00AM 2:00AM 36
  37. 37. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Join data as streams 2:00AM 1:00AM 5:00AM 1:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM 1:00AM 2:00AM 4:00AM 3:00AM 4:00AM 3:00AM 37
  38. 38. Temporaljoin in TimeSeriesRDD November1, 2016 w Temporal join with criteria look-back and window of 1 hour w Join data as streams 2:00AM 1:00AM 4:00AM 5:00AM 1:00AM 3:00AM 5:00AM TimeSeriesRDD TimeSeriesRDDjoined 1:00AM 1:00AM 1:00AM 2:00AM 4:00AM 3:00AM 5:00AM 5:00AM 38
  39. 39. Benchmark for temporaljoin + count November1, 2016 0s 10s 20s 30s 40s 50s 60s 70s 80s 90s 100s 20M 40M 60M 80M 100M TimeseriesRDD DataFrame RDD 20 - 50X5 - 10X 39 • Running time of count after temporal join • 16 executors (10G memory and 4 cores per executor) • Data read from HDFS
  40. 40. Functions over TimeSeriesRDD November1, 2016 w Grouping functions w Temporal joins such as look-forward, look-backward etc. w Summarizers such as average, variance, z-score etc. over grouping functions 40
  41. 41. Open Source November1, 2016 w True! w https://github.com/twosigma/flint 41
  42. 42. What’snext? November1, 2016 w TimeSeriesDataframe / TimeSeriesDataset w Speed up w RicherAPIs w Python bindings w Additional summarizers 42
  43. 43. Key contributors November1, 2016 w ChristopherAycock w Yuri Bogomolov w Jonathan Coveney w Li Jin w David Medina w Julia Meinwald w David Palaitis w Larisa Sawyer w Leif Walsh w Wenbo Zhao 43
  44. 44. Flint: Time Series For Spark November1, 2016 A library to solve for general time series analysis operations at massive scale Anne Hathaway has nothing to do with Berkshire Hathaway Check it out in open source, and contribute https://github.com/twosigma/flint 44
  45. 45. SPARK SUMMIT EUROPE2016 THANK YOU. Larisa.Sawyer@twosigma.com

×