
Ron Crocker - Evaluating Streaming Framework Performance for a Large-Scale Aggregation Pipeline


Published on http://flink-forward.org/kb_sessions/evaluating-streaming-framework-performance-for-a-large-scale-aggregation-pipeline/

In this talk I present the results of a set of experiments comparing the performance of several implementations of a time-series aggregation task. There are three implementations: a baseline implementation that uses no streaming framework, an implementation using Apache Flink, and an implementation using Apache Spark Streaming. All three ran against the same Kafka cluster using the same data stream, with the goal of understanding the limitations of each implementation. The limitations were measured at three input data rates: 100%, 6000%, and breaking-point load.

Published in: Data & Analytics

Ron Crocker - Evaluating Streaming Framework Performance for a Large-Scale Aggregation Pipeline

  1. EVALUATING STREAMING FRAMEWORK PERFORMANCE FOR A LARGE-SCALE AGGREGATION PIPELINE. Ron Crocker (rcrocker@newrelic.com), Principal Engineer & Architect, Ingest Pipeline
  4. EVERY MINUTE: accepts over 16M requests, stores over 2M analytic events, aggregates over 800M metrics, queries over 3B data points
  6. Contains over 200 different services, maintained/built by 25+ engineering teams; more than 2.5 petabytes of SSD storage
  7. Thanks for the pic! https://www.flickr.com/photos/stephenyeargin/7466608166
  10. Goals for evaluating streaming systems • Understand performance characteristics • Understand operations characteristics
  11. How New Relic works… … the cartoon version
  12. A1: an instance of your application running on a host. A2: another instance of your application running on another host. An: more instances of your application running on more hosts…
  13. A1, A2, …, An: the New Relic Agent reports data to New Relic
  14. Each agent reports via an HTTP post to <something>.newrelic.com. The payload contains: Agent Token (≈ account ID, agent ID); Duration (time-period covered); Timeslices, where each timeslice contains a Metric name and Metric stats (count, total time, exclusive time, min, max, sum of squares)
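To make the shape of that agent payload concrete, here is a minimal sketch in Java. The class and field names are my own illustration of the structure described on the slide, not New Relic's actual types.

```java
// Hypothetical sketch of the agent payload described above; names are illustrative only.
import java.util.List;

class MetricStats {
    long count;
    double totalTime;      // total time across all calls
    double exclusiveTime;  // time excluding called metrics
    double min;
    double max;
    double sumOfSquares;   // supports variance/std-dev downstream
}

class Timeslice {
    String metricName;
    MetricStats stats;
}

class AgentPayload {
    String agentToken;     // ≈ (account ID, agent ID)
    long durationMillis;   // time period covered by this post
    List<Timeslice> timeslices;
}
```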
  16. The pipeline (HTTP termination → Timeslice Resolver → Minute Aggregator → Minute Writer / Hour Aggregator → Hour Writer, connected by the Kafka topics raw_timeslice_data, resolved_timeslice_data, aggregated_minute_timeslices_data, and aggregated_hourly_timeslices_data, with Other Consumers reading alongside), annotated with the message format at successive stages:
  ▪ Agent payload: Agent Token (≈ account ID, agent ID); Duration (time-period covered); Timeslices, each with a Metric name and Metric stats (count, total time, exclusive time, min, max, sum of squares)
  ▪ Account ID; Agent ID; Start time; Duration (time-period covered); Timeslices, each with a Metric name and Metric stats
  ▪ Account ID; Agent ID; Application Agent IDs; Start time; Duration (time-period covered); Timeslices, each with a Metric ID and Metric stats
  ▪ Account ID; Agent ID; Timeslices, each with a Metric ID, Start time, Duration (time-period covered), and Metric stats
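For contrast with the input payload, here is a sketch of the last format above, the per-metric aggregated record. It reuses MetricStats from the earlier sketch; again, these are illustrative names, not the production schema.

```java
// Hypothetical sketch of an aggregated minute timeslice record; illustrative only.
import java.util.List;

class AggregatedTimeslice {
    long metricId;         // metric names have been resolved to IDs by this stage
    long startTimeMillis;  // start of the minute bucket
    long durationMillis;   // time period covered (one minute)
    MetricStats stats;     // count, total time, exclusive time, min, max, sum of squares
}

class AggregatedMinuteBundle {
    long accountId;
    long agentId;
    List<AggregatedTimeslice> timeslices;
}
```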
  17. The Experiment
  18. (Pipeline diagram: HTTP termination → Timeslice Resolver → Minute Aggregator → Minute Writer / Hour Aggregator → Hour Writer, via the Kafka topics raw_timeslice_data, resolved_timeslice_data, aggregated_minute_timeslices_data, aggregated_hourly_timeslices_data; Other Consumers)
  19. Why Minute Aggregator? ▪ No external dependencies ▪ Performance comparisons solely focused on processing ▪ Repeatable ▪ We can compare across technologies without needing to normalize ▪ Important to our business ▪ Provides aggregation across instances of your application ▪ We could have benchmarked something else, like the Yahoo benchmark or word count, but would it have mattered?
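What "aggregation across instances of your application" means in practice is a per-minute merge of the metric stats reported by every agent instance. A minimal sketch, assuming the stats fields listed on slide 14:

```java
// Hypothetical merge of per-instance metric stats into an application-level
// aggregate for one minute; field names follow the earlier payload sketch.
class MetricMath {
    static MetricStats merge(MetricStats a, MetricStats b) {
        MetricStats out = new MetricStats();
        out.count = a.count + b.count;
        out.totalTime = a.totalTime + b.totalTime;
        out.exclusiveTime = a.exclusiveTime + b.exclusiveTime;
        out.min = Math.min(a.min, b.min);
        out.max = Math.max(a.max, b.max);
        out.sumOfSquares = a.sumOfSquares + b.sumOfSquares;
        return out;
    }
}
```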
  20. (The same pipeline diagram as slide 18)
  21. What about Hour Aggregator? ▪ Similar to Minute Aggregator ▪ No external dependencies, repeatable, important to the business ▪ Needs to run for several hours to understand performance ▪ … and I'm not that patient ▪ Extra credit: integrate it into the stream implementations
  22. Goals for evaluating streaming systems • Understand performance characteristics • Performance at different arrival rates: 100%, 6000%, to infinity and beyond • Understand operations characteristics • No explicit goal
  23. Evaluation framework: Datacenter (Staging Kafka) and an AWS VPC containing the Experiment Kafka, the Baseline, Flink, and Spark implementations, and the Load driver
  24. AWS configurations ▪ Kafka + ZK: 3 i2.8xlarge hosts ▪ Baseline: 3 m4.4xlarge hosts ▪ Flink: 4 m4.4xlarge hosts ▪ Spark: EMR, 1 master + 3 workers, all m4.4xlarge. Instance types: i2.8xlarge (32 cores, 244GB RAM, 10Gbps network); m4.4xlarge (16 cores, 64GB RAM, 2Gbps network)
  25. Experimental Kafka system ▪ Kafka 0.8.2.2 ▪ NR fork, includes backports of some 0.9 features ▪ # partitions: 16 ▪ It's possible that this is too few partitions for the Baseline system
  26. Load driver ▪ Generates simple synthetic load based on real traffic ▪ Real traffic = output of Timeslice Resolver ▪ Load generated by repeating messages ▪ Synthesizing interesting load is challenging: un-bundle the timeslices, re-bundle them with new IDs (Agent, Account, and/or Metric), and repeat as necessary to reach the load point (a sketch of this re-bundling follows below)
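A rough sketch of that un-bundle/re-bundle idea, reusing the Timeslice type from the earlier payload sketch. The ResolvedBundle shape and the ID-offset scheme are my own illustration of the approach, not the actual load driver.

```java
// Hypothetical sketch of the load-driver re-bundling step: clone a recorded
// resolved-timeslice bundle with remapped IDs so repeated copies look like
// distinct reporters. All names and the ID-offset scheme are illustrative.
import java.util.ArrayList;
import java.util.List;

class ResolvedBundle {
    long accountId;
    long agentId;
    long startTimeMillis;
    long durationMillis;
    List<Timeslice> timeslices;   // Timeslice as sketched earlier (metric name + stats)
}

class LoadSynthesizer {
    /** Produce `copies` variants of one recorded bundle, each under a synthetic account. */
    static List<ResolvedBundle> replicate(ResolvedBundle original, int copies) {
        List<ResolvedBundle> out = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            ResolvedBundle copy = new ResolvedBundle();
            copy.accountId = original.accountId + i * 1_000_000L; // keep synthetic accounts disjoint
            copy.agentId = original.agentId;
            copy.startTimeMillis = original.startTimeMillis;
            copy.durationMillis = original.durationMillis;
            copy.timeslices = original.timeslices;                // same metrics, repeated
            out.add(copy);
        }
        return out;
    }
}
```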
  27. Baseline system, our incumbent Minute Aggregator: consume from Kafka → aggregate per agent → aggregate per application → construct minute bundles → produce back to Kafka, with multiple parallel consume/aggregate instances
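For context on what the baseline does without a streaming framework, here is a minimal sketch of the consume-and-aggregate loop, assuming the MetricStats type and MetricMath.merge helper from the earlier sketches. The key layout and flush policy are illustrative, not the incumbent implementation.

```java
// Minimal sketch of a framework-free minute aggregator: keep per-minute,
// per-key aggregates in a map and emit minute bundles once a minute closes.
import java.util.HashMap;
import java.util.Map;

class BaselineMinuteAggregator {
    // bucket key: accountId | agentId | metric | minute-start (illustrative)
    private final Map<String, MetricStats> buckets = new HashMap<>();

    // called for every timeslice consumed from the input topic
    void onTimeslice(long accountId, long agentId, String metric,
                     long eventTimeMillis, MetricStats stats) {
        long minuteStart = (eventTimeMillis / 60_000L) * 60_000L;
        String key = accountId + "|" + agentId + "|" + metric + "|" + minuteStart;
        buckets.merge(key, stats, MetricMath::merge); // merge as sketched under slide 19
    }

    // called periodically: emit and drop every bucket whose minute has closed
    void flushClosedMinutes(long nowMillis) {
        buckets.entrySet().removeIf(entry -> {
            String key = entry.getKey();
            long minuteStart = Long.parseLong(key.substring(key.lastIndexOf('|') + 1));
            if (minuteStart + 60_000L <= nowMillis) {
                produceMinuteBundle(key, entry.getValue()); // publish to the output Kafka topic
                return true;
            }
            return false;
        });
    }

    void produceMinuteBundle(String key, MetricStats stats) {
        // Kafka produce elided in this sketch
    }
}
```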
  30. Distributions are not friendly… Average # timeslices: 279; geometric mean # timeslices: 64; median # timeslices: 44. Long tail…
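To illustrate what those three statistics say about the shape of the distribution, here is a tiny computation over a made-up long-tailed sample (not the measured data): a single large outlier drags the mean far above the median and geometric mean, which is the same effect the slide reports with 279 vs. 64 vs. 44.

```java
// Illustrative only: a synthetic long-tailed sample, not the measured distribution.
import java.util.Arrays;

public class SkewDemo {
    public static void main(String[] args) {
        long[] timeslicesPerMessage = {10, 20, 30, 40, 44, 50, 60, 80, 120, 2500}; // one huge outlier

        double mean = Arrays.stream(timeslicesPerMessage).average().orElse(0);
        double geoMean = Math.exp(Arrays.stream(timeslicesPerMessage)
                .mapToDouble(v -> Math.log(v)).average().orElse(0));
        long[] sorted = timeslicesPerMessage.clone();
        Arrays.sort(sorted);
        double median = (sorted[sorted.length / 2 - 1] + sorted[sorted.length / 2]) / 2.0;

        // The outlier pulls the mean far above the "typical" message size.
        System.out.printf("mean=%.0f geoMean=%.0f median=%.0f%n", mean, geoMean, median);
    }
}
```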
  31. Flink configuration: one Job Manager and three Task Managers with 16 slots each
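A minimal sketch of what a Flink job for this aggregation could look like with event-time minute windows. The Slice POJO, the key scheme, and the ResolvedTimesliceSource/MinuteBundleSink classes are hypothetical stand-ins (the actual source and sink were Kafka topics); this is not the job New Relic ran.

```java
// Hypothetical sketch of a Flink minute-aggregation job; not the actual implementation.
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class MinuteAggregatorJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        env.addSource(new ResolvedTimesliceSource())              // hypothetical Kafka-backed source of Slice records
           .assignTimestampsAndWatermarks(
               new BoundedOutOfOrdernessTimestampExtractor<Slice>(Time.seconds(30)) {
                   @Override
                   public long extractTimestamp(Slice s) {
                       return s.startTimeMillis;                  // event time comes from the record itself
                   }
               })
           .keyBy(s -> s.accountId + "|" + s.agentId + "|" + s.metricId)
           .timeWindow(Time.minutes(1))                           // event-time minute buckets
           .reduce((a, b) -> a.merge(b))                          // combine stats, e.g. as in the earlier merge sketch
           .addSink(new MinuteBundleSink());                      // hypothetical Kafka sink for minute bundles
        env.execute("minute-aggregator-experiment");
    }
}
```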
  34. AWS EMR Spark configuration: one Master and three Slaves
  36. But the Spark Streaming solution generates WRONG results ▪ … because there is no event-time windowing ▪ … leading me to abandon Spark Streaming
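Why the lack of event-time windowing matters: the minute an aggregate belongs to must come from the record's own timestamp, not from when the record happens to arrive. A small self-contained illustration (the timestamps below are made up):

```java
// Illustration of why processing-time windowing misplaces data in a minute aggregation.
public class WindowingExample {
    static long minuteBucket(long millis) { return (millis / 60_000L) * 60_000L; }

    public static void main(String[] args) {
        long eventTime   = 1_470_000_000_000L;       // when the agent measured the data
        long arrivalTime = eventTime + 90_000L;      // arrives 90s later (lag, retry, replay)

        // Event-time windowing (what the aggregation needs): bucket by the record's timestamp.
        long correctBucket = minuteBucket(eventTime);

        // Processing-time windowing (what Spark Streaming offered at the time):
        // bucket by arrival, so the same record lands in a later minute.
        long wrongBucket = minuteBucket(arrivalTime);

        System.out.println(correctBucket == wrongBucket); // false: the aggregate is split or misplaced
    }
}
```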
  37. Results
  38. Results across load points (100%, 500%, 4000%, 6000%, more…): ▪ Baseline: throughput goes flat, with Kafka lag ▪ Flink: throughput goes flat, without Kafka lag ▪ Spark: wrong answers…
  39. Opportunities to improve the experiment ▪ MORE BANDWIDTH: I don't know the limit of the Flink implementation ▪ Key-space domain expansion [All]: scaling in the rate domain only, with the same set of keys, is too easy on the key-based systems [Flink, Spark] and may be hard on the baseline system as well ▪ Inclusion of database sinks [Flink, Spark]: Kafka sinks are still needed for downstream functions
  40. Thank you
  41. Extra credit
  42. (The pipeline diagram again)
  45. Thank you
