
Ron Crocker - Evaluating Streaming Framework Performance for a Large-Scale Aggregation Pipeline


Published on http://flink-forward.org/kb_sessions/evaluating-streaming-framework-performance-for-a-large-scale-aggregation-pipeline/

In this talk I present the results of a set of experiments comparing the performance of several implementations of a time-series aggregation task. There are three implementations: a baseline implementation that uses no streaming framework, an implementation using Apache Flink, and an implementation using Apache Spark Streaming. All three ran against the same Kafka cluster using the same data stream, with the goal of understanding the limitations of each implementation. The limitations were measured at three input data rates: 100%, 6000%, and breaking-point load.

Published in: Data & Analytics

Ron Crocker - Evaluating Streaming Framework Performance for a Large-Scale Aggregation Pipeline

  1. EVALUATING STREAMING FRAMEWORK PERFORMANCE FOR A LARGE-SCALE AGGREGATION PIPELINE. Ron Crocker (rcrocker@newrelic.com), Principal Engineer & Architect, Ingest Pipeline
  4. EVERY MINUTE: accepts over 16M requests, stores over 2M analytic events, aggregates over 800M metrics, queries over 3B data points
  6. Contains over 200 different services, maintained/built by 25+ engineering teams; more than 2.5 petabytes of SSD storage
  7. Thanks for the pic! https://www.flickr.com/photos/stephenyeargin/7466608166
  10. Goals for evaluating streaming systems • Understand performance characteristics • Understand operations characteristics
  11. How New Relic works… … the cartoon version
  12. A1: an instance of your application running on a host. A2: another instance of your application running on another host. An: more instances of your application running on more hosts…
  13. A1, A2, …, An: the New Relic Agent reports data to New Relic
  14. Each agent reports via an HTTP post to <something>.newrelic.com. The payload contains: Agent Token (≈ account ID, agent ID); Duration (time-period covered); Timeslices, where each timeslice contains a Metric name and Metric stats (count, total time, exclusive time, min, max, sum of squares)
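To make the shape of that agent payload concrete, here is a minimal sketch in Java. The class and field names are my own illustration of the structure described on the slide, not New Relic's actual types.

```java
// Hypothetical sketch of the agent payload described above; names are illustrative only.
import java.util.List;

class MetricStats {
    long count;
    double totalTime;      // total time across all calls
    double exclusiveTime;  // time excluding called metrics
    double min;
    double max;
    double sumOfSquares;   // supports variance/std-dev downstream
}

class Timeslice {
    String metricName;
    MetricStats stats;
}

class AgentPayload {
    String agentToken;     // ≈ (account ID, agent ID)
    long durationMillis;   // time period covered by this post
    List<Timeslice> timeslices;
}
```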
  16. The pipeline (HTTP termination → Timeslice Resolver → Minute Aggregator → Minute Writer / Hour Aggregator → Hour Writer, connected by the Kafka topics raw_timeslice_data, resolved_timeslice_data, aggregated_minute_timeslices_data, and aggregated_hourly_timeslices_data, with Other Consumers reading alongside), annotated with the message format at successive stages:
  ▪ Agent payload: Agent Token (≈ account ID, agent ID); Duration (time-period covered); Timeslices, each with a Metric name and Metric stats (count, total time, exclusive time, min, max, sum of squares)
  ▪ Account ID; Agent ID; Start time; Duration (time-period covered); Timeslices, each with a Metric name and Metric stats
  ▪ Account ID; Agent ID; Application Agent IDs; Start time; Duration (time-period covered); Timeslices, each with a Metric ID and Metric stats
  ▪ Account ID; Agent ID; Timeslices, each with a Metric ID, Start time, Duration (time-period covered), and Metric stats
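For contrast with the input payload, here is a sketch of the last format above, the per-metric aggregated record. It reuses MetricStats from the earlier sketch; again, these are illustrative names, not the production schema.

```java
// Hypothetical sketch of an aggregated minute timeslice record; illustrative only.
import java.util.List;

class AggregatedTimeslice {
    long metricId;         // metric names have been resolved to IDs by this stage
    long startTimeMillis;  // start of the minute bucket
    long durationMillis;   // time period covered (one minute)
    MetricStats stats;     // count, total time, exclusive time, min, max, sum of squares
}

class AggregatedMinuteBundle {
    long accountId;
    long agentId;
    List<AggregatedTimeslice> timeslices;
}
```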
  17. The Experiment
  18. (Pipeline diagram: HTTP termination → Timeslice Resolver → Minute Aggregator → Minute Writer / Hour Aggregator → Hour Writer, via the Kafka topics raw_timeslice_data, resolved_timeslice_data, aggregated_minute_timeslices_data, aggregated_hourly_timeslices_data; Other Consumers)
  19. Why Minute Aggregator? ▪ No external dependencies ▪ Performance comparisons solely focused on processing ▪ Repeatable ▪ We can compare across technologies without needing to normalize ▪ Important to our business ▪ Provides aggregation across instances of your application ▪ We could have benchmarked something else, like the Yahoo benchmark or word count, but would it have mattered?
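What "aggregation across instances of your application" means in practice is a per-minute merge of the metric stats reported by every agent instance. A minimal sketch, assuming the stats fields listed on slide 14:

```java
// Hypothetical merge of per-instance metric stats into an application-level
// aggregate for one minute; field names follow the earlier payload sketch.
class MetricMath {
    static MetricStats merge(MetricStats a, MetricStats b) {
        MetricStats out = new MetricStats();
        out.count = a.count + b.count;
        out.totalTime = a.totalTime + b.totalTime;
        out.exclusiveTime = a.exclusiveTime + b.exclusiveTime;
        out.min = Math.min(a.min, b.min);
        out.max = Math.max(a.max, b.max);
        out.sumOfSquares = a.sumOfSquares + b.sumOfSquares;
        return out;
    }
}
```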
  20. (The same pipeline diagram as slide 18)
  21. What about Hour Aggregator? ▪ Similar to Minute Aggregator ▪ No external dependencies, repeatable, important to the business ▪ Needs to run for several hours to understand performance ▪ … and I'm not that patient ▪ Extra credit: integrate it into the stream implementations
  22. Goals for evaluating streaming systems • Understand performance characteristics • Performance at different arrival rates: 100%, 6000%, to infinity and beyond • Understand operations characteristics • No explicit goal
  23. Evaluation framework: Datacenter (Staging Kafka) and an AWS VPC containing the Experiment Kafka, the Baseline, Flink, and Spark implementations, and the Load driver
  24. AWS configurations ▪ Kafka + ZK: 3 i2.8xlarge hosts ▪ Baseline: 3 m4.4xlarge hosts ▪ Flink: 4 m4.4xlarge hosts ▪ Spark: EMR, 1 master + 3 workers, all m4.4xlarge. Instance types: i2.8xlarge (32 cores, 244GB RAM, 10Gbps network); m4.4xlarge (16 cores, 64GB RAM, 2Gbps network)
  25. Experimental Kafka system ▪ Kafka 0.8.2.2 ▪ NR fork, includes backports of some 0.9 features ▪ # partitions: 16 ▪ It's possible that this is too few partitions for the Baseline system
  26. Load driver ▪ Generates simple synthetic load based on real traffic ▪ Real traffic = output of Timeslice Resolver ▪ Load generated by repeating messages ▪ Synthesizing interesting load is challenging: un-bundle the timeslices, re-bundle them with new IDs (Agent, Account, and/or Metric), and repeat as necessary to reach the load point (a sketch of this re-bundling follows below)
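A rough sketch of that un-bundle/re-bundle idea, reusing the Timeslice type from the earlier payload sketch. The ResolvedBundle shape and the ID-offset scheme are my own illustration of the approach, not the actual load driver.

```java
// Hypothetical sketch of the load-driver re-bundling step: clone a recorded
// resolved-timeslice bundle with remapped IDs so repeated copies look like
// distinct reporters. All names and the ID-offset scheme are illustrative.
import java.util.ArrayList;
import java.util.List;

class ResolvedBundle {
    long accountId;
    long agentId;
    long startTimeMillis;
    long durationMillis;
    List<Timeslice> timeslices;   // Timeslice as sketched earlier (metric name + stats)
}

class LoadSynthesizer {
    /** Produce `copies` variants of one recorded bundle, each under a synthetic account. */
    static List<ResolvedBundle> replicate(ResolvedBundle original, int copies) {
        List<ResolvedBundle> out = new ArrayList<>();
        for (int i = 0; i < copies; i++) {
            ResolvedBundle copy = new ResolvedBundle();
            copy.accountId = original.accountId + i * 1_000_000L; // keep synthetic accounts disjoint
            copy.agentId = original.agentId;
            copy.startTimeMillis = original.startTimeMillis;
            copy.durationMillis = original.durationMillis;
            copy.timeslices = original.timeslices;                // same metrics, repeated
            out.add(copy);
        }
        return out;
    }
}
```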
  27. Baseline system, our incumbent Minute Aggregator: consume from Kafka → aggregate per agent → aggregate per application → construct minute bundles → produce back to Kafka, with multiple parallel consume/aggregate instances
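For context on what the baseline does without a streaming framework, here is a minimal sketch of the consume-and-aggregate loop, assuming the MetricStats type and MetricMath.merge helper from the earlier sketches. The key layout and flush policy are illustrative, not the incumbent implementation.

```java
// Minimal sketch of a framework-free minute aggregator: keep per-minute,
// per-key aggregates in a map and emit minute bundles once a minute closes.
import java.util.HashMap;
import java.util.Map;

class BaselineMinuteAggregator {
    // bucket key: accountId | agentId | metric | minute-start (illustrative)
    private final Map<String, MetricStats> buckets = new HashMap<>();

    // called for every timeslice consumed from the input topic
    void onTimeslice(long accountId, long agentId, String metric,
                     long eventTimeMillis, MetricStats stats) {
        long minuteStart = (eventTimeMillis / 60_000L) * 60_000L;
        String key = accountId + "|" + agentId + "|" + metric + "|" + minuteStart;
        buckets.merge(key, stats, MetricMath::merge); // merge as sketched under slide 19
    }

    // called periodically: emit and drop every bucket whose minute has closed
    void flushClosedMinutes(long nowMillis) {
        buckets.entrySet().removeIf(entry -> {
            String key = entry.getKey();
            long minuteStart = Long.parseLong(key.substring(key.lastIndexOf('|') + 1));
            if (minuteStart + 60_000L <= nowMillis) {
                produceMinuteBundle(key, entry.getValue()); // publish to the output Kafka topic
                return true;
            }
            return false;
        });
    }

    void produceMinuteBundle(String key, MetricStats stats) {
        // Kafka produce elided in this sketch
    }
}
```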
  30. Distributions are not friendly… Average # timeslices: 279; geometric mean # timeslices: 64; median # timeslices: 44. Long tail…
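To illustrate what those three statistics say about the shape of the distribution, here is a tiny computation over a made-up long-tailed sample (not the measured data): a single large outlier drags the mean far above the median and geometric mean, which is the same effect the slide reports with 279 vs. 64 vs. 44.

```java
// Illustrative only: a synthetic long-tailed sample, not the measured distribution.
import java.util.Arrays;

public class SkewDemo {
    public static void main(String[] args) {
        long[] timeslicesPerMessage = {10, 20, 30, 40, 44, 50, 60, 80, 120, 2500}; // one huge outlier

        double mean = Arrays.stream(timeslicesPerMessage).average().orElse(0);
        double geoMean = Math.exp(Arrays.stream(timeslicesPerMessage)
                .mapToDouble(v -> Math.log(v)).average().orElse(0));
        long[] sorted = timeslicesPerMessage.clone();
        Arrays.sort(sorted);
        double median = (sorted[sorted.length / 2 - 1] + sorted[sorted.length / 2]) / 2.0;

        // The outlier pulls the mean far above the "typical" message size.
        System.out.printf("mean=%.0f geoMean=%.0f median=%.0f%n", mean, geoMean, median);
    }
}
```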
  31. Flink configuration: one Job Manager and three Task Managers with 16 slots each
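A minimal sketch of what a Flink job for this aggregation could look like with event-time minute windows. The Slice POJO, the key scheme, and the ResolvedTimesliceSource/MinuteBundleSink classes are hypothetical stand-ins (the actual source and sink were Kafka topics); this is not the job New Relic ran.

```java
// Hypothetical sketch of a Flink minute-aggregation job; not the actual implementation.
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class MinuteAggregatorJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

        env.addSource(new ResolvedTimesliceSource())              // hypothetical Kafka-backed source of Slice records
           .assignTimestampsAndWatermarks(
               new BoundedOutOfOrdernessTimestampExtractor<Slice>(Time.seconds(30)) {
                   @Override
                   public long extractTimestamp(Slice s) {
                       return s.startTimeMillis;                  // event time comes from the record itself
                   }
               })
           .keyBy(s -> s.accountId + "|" + s.agentId + "|" + s.metricId)
           .timeWindow(Time.minutes(1))                           // event-time minute buckets
           .reduce((a, b) -> a.merge(b))                          // combine stats, e.g. as in the earlier merge sketch
           .addSink(new MinuteBundleSink());                      // hypothetical Kafka sink for minute bundles
        env.execute("minute-aggregator-experiment");
    }
}
```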
  34. AWS EMR Spark configuration: one Master and three Slaves
  36. But the Spark Streaming solution generates WRONG results ▪ … because there is no event-time windowing ▪ … leading me to abandon Spark Streaming
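Why the lack of event-time windowing matters: the minute an aggregate belongs to must come from the record's own timestamp, not from when the record happens to arrive. A small self-contained illustration (the timestamps below are made up):

```java
// Illustration of why processing-time windowing misplaces data in a minute aggregation.
public class WindowingExample {
    static long minuteBucket(long millis) { return (millis / 60_000L) * 60_000L; }

    public static void main(String[] args) {
        long eventTime   = 1_470_000_000_000L;       // when the agent measured the data
        long arrivalTime = eventTime + 90_000L;      // arrives 90s later (lag, retry, replay)

        // Event-time windowing (what the aggregation needs): bucket by the record's timestamp.
        long correctBucket = minuteBucket(eventTime);

        // Processing-time windowing (what Spark Streaming offered at the time):
        // bucket by arrival, so the same record lands in a later minute.
        long wrongBucket = minuteBucket(arrivalTime);

        System.out.println(correctBucket == wrongBucket); // false: the aggregate is split or misplaced
    }
}
```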
  37. Results
  38. Results across load points (100%, 500%, 4000%, 6000%, more…): ▪ Baseline: throughput goes flat, with Kafka lag ▪ Flink: throughput goes flat, without Kafka lag ▪ Spark: wrong answers…
  39. Opportunities to improve the experiment ▪ MORE BANDWIDTH: I don't know the limit of the Flink implementation ▪ Key-space domain expansion [All]: scaling in the rate domain only, with the same set of keys, is too easy on the key-based systems [Flink, Spark] and may be hard on the baseline system as well ▪ Inclusion of database sinks [Flink, Spark]: Kafka sinks are still needed for downstream functions
  40. Thank you
  41. Extra credit
  42. (The pipeline diagram again)
  45. Thank you
