Do applications using NoSQL still require performance management? Is it always the best option to throw more hardware at a MapReduce job? In both cases, performance management is still about the application, but "Big Data" technologies have added a new wrinkle.
Performance Management in ‘Big Data’ Applications
1. Performance Management in ‘Big Data’ Applications
It’s still about the Application
Michael Kopp, Technology Strategist (michael.kopp@compuware.com, @mikopp, blog.dynatrace.com)
Edward Capriolo (edward@m6d.com, @edwardcapriolo, m6d.com/blog)
2. Big Data: High Volume/Low Latency DBs
[Diagram: Web and Java tiers in front of the NoSQL database]
Key Challenges:
1) Even Distribution
2) Correct Schema and Access Patterns
3) Understanding Application Impact
Key Benefits:
1) Fast Read/Write
2) Horizontal Scalability
3) Redundancy and High Availability
3. Big Data: Large Parallel Batch Processing
[Diagram: the Hive Server translates a high-level query into map/reduce jobs, which are batched and triggered in sequence]
Key Challenges:
1) Optimal Distribution
2) Unwieldy Configuration
3) Can easily waste your resources
Key Benefits:
1) Massive Horizontal Batch Jobs
2) Split big Problems into smaller ones
8. Hadoop at m6d
• Critical piece of infrastructure
• Long Term Data Storage
– Raw logs
– Aggregations
– Reports
– Generated data (feedback loops)
• Numerous ETL (Extract, Transform, Load) processes
• Scheduled and ad hoc processes
• Used directly by Tech-Team, Ad Ops, Data Science
9. Hadoop at m6d
• Two deployments: 'production' and 'research'
– ~500 TB, 40+ nodes
– ~350 TB, 20+ nodes
• Thousands of jobs
– <5-minute jobs and 12-hour job flows
– Mostly Hive Jobs
– Some custom code and streaming jobs
10. Hadoop Design Tenets
• Linear scalability by adding more hardware
• HDFS: distributed file system
– User-space file system
– Blocks are replicated across nodes
– Limited semantics
• MapReduce (see the word-count sketch below)
– A paradigm that models computation as map and reduce steps
– Data Locality
– Split a Job into Tasks by Data
– Retry on failure
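To make the paradigm concrete, here is a minimal word-count sketch against the Hadoop mapreduce Java API. It illustrates the map/reduce model itself and is not code from the talk.

```java
// Word count: the canonical illustration of the map/reduce paradigm.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map: split each input line into words and emit (word, 1) pairs.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the framework groups pairs by word; sum the counts per word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```

The framework splits the input by data (roughly one map task per block), runs map tasks with data locality where possible, and retries failed tasks, which maps directly onto the tenet list above.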
11. Schema Design Challenges
• Partition data for good distribution
– By time interval (optionally a second level)
– Partition pruning with WHERE
• Clustering (aka bucketing)
– Optimized sampling and joins
• Columnar (column-oriented) formats
• Raw Data Growth
• Data features change (more distinct X)
12. Key Performance Challenges
• Intermediate I/O (see the tuning sketch below)
– Compression codec
– Block size
– Splittable formats
• Contention between jobs
• Data and Map/Reduce Distribution
• Data Skew
• Non-uniform Computation (long-running tasks)
• ‘Cost' of a new feature – is it justified?
• Tuning variables (spills, buffers, etc.)
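As a concrete illustration of these knobs, the sketch below enables map-output compression, enlarges the map-side sort buffer to reduce spills, and wires in a combiner. Property names are the Hadoop 2.x ones (Hadoop 1.x used mapred.compress.map.output and friends), the values are assumptions chosen to illustrate the idea, and SumReducer refers to the word-count sketch above.

```java
// Sketch: reducing intermediate I/O on a Hadoop job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class JobTuning {
  public static Job configureJob() throws Exception {
    Configuration conf = new Configuration();
    // Compress map output to shrink intermediate disk and shuffle traffic.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);
    // A larger map-side sort buffer means fewer spill files to merge.
    conf.setInt("mapreduce.task.io.sort.mb", 256);

    Job job = Job.getInstance(conf, "tuned-job");
    // The combiner pre-aggregates map output before it hits the wire.
    job.setCombinerClass(WordCount.SumReducer.class);
    return job;
  }
}
```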
13. How to handle Performance Issues?
• Profile the Job / Query?
– Who should do this? (DBA, Dev, Ops, DevOps, NoOps, Big Data Guru)
– How should we do this?
• Look at job run times day over day?
• Look at code and micro-benchmark?
• Collect Job Counters? (see the sketch below)
• Upgrade often for latest performance features?
• Investigate/purchase newer, better hardware?
– More cores? RAM? 10G Ethernet? SSD?
• Read blogs?
Test Data is not like Real Data
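Job counters are the cheapest of these options to add. The sketch below emits custom counters from a mapper; the counter names and the "bad record" test are illustrative assumptions, not code from the talk.

```java
// Sketch: custom job counters, aggregated by the framework and shown
// in the JobTracker / job history UI alongside the built-in counters.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {
  public enum Records { GOOD, BAD }  // illustrative counter names

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Hypothetical validity check: expect at least 3 tab-separated fields.
    if (value.toString().split("\t").length < 3) {
      context.getCounter(Records.BAD).increment(1);
      return;  // drop malformed records but keep count of them
    }
    context.getCounter(Records.GOOD).increment(1);
    context.write(value, NullWritable.get());
  }
}
```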
15. Understanding Map/Reduce Performance
[Diagram: maximum vs. actual mapping parallelism; attention: data volume, millions of executions, including your own code; a potential choke point between map and reduce; maximum vs. actual reduce parallelism, again including your own code]
18. Map/Reduce behind the scenes
[Diagram: data is de-serialized and serialized again at several stages; spills are potentially inefficient, producing too many intermediate files with the same key spread all over; the synchronous combine step, which de-serializes and serializes yet again, is expensive]
19. Map/Reduce Combine and Spill Performance
1) Pre-combine in the mapping step (see the in-mapper combining sketch below)
2) Avoid many intermediate files and combines
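One common way to pre-combine in the mapping step is in-mapper combining: aggregate in a map-local hash map and emit once per key in cleanup(), so far fewer records are serialized, spilled, and shuffled. A minimal sketch, reusing the word-count example from above and assuming the per-task key set fits in memory:

```java
// Sketch: in-mapper combining to cut intermediate files and records.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final Map<String, Integer> counts = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    // Aggregate locally instead of writing one record per token.
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        counts.merge(token, 1, Integer::sum);
      }
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Emit each key exactly once per map task.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}
```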
20. Map/Reduce “Map” Performance
Avoid Brute Force
Focus on Big Hotspots first
Then Optimize Hadoop
Save a lot of Hardware
21. Map/Reduce to the Max!
• Ensure Data Locality
• Optimize Map/Reduce Hotspots
• Reduce Intermediate Data and “Overhead”
• Ensure optimal Data and Compute Distribution
• Tune Hadoop Environment
23. A High Level look at RTB
1. Browsers visit Publishers and create impressions.
2. Publishers sell impressions via Exchanges.
3. Exchanges serve as auction houses for the impressions.
4. On behalf of the marketer, m6d bids on the impressions via the auction house. If m6d wins, we display our ad to the browser.
24. Cassandra at m6d for Real Time Bidding
• In RTB, only limited data is provided by the exchange
• System to store information on users
– Frequency Capping
– Visit History
– Segments (product service affinity)
• Low latency Requirements
– Less than 100 ms
– Requires fast read/write on discrete data
26. Key Cassandra Design Tenets
• Swap/paging not possible
• Mostly schema-less
• Writes do not read
– Read-then-write is an anti-pattern
• Optimize around put and get
– Not for scan and query
• De-normalize data
– Attempt to get all data in a single read (see the sketch below)
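A minimal sketch of that single-read pattern, assuming the era-appropriate Hector Java client (Cassandra 1.x over Thrift): all per-user data is denormalized into one wide row, so one slice query returns everything. The cluster, keyspace, and column family names are illustrative, not m6d's actual schema.

```java
// Sketch: denormalized "one row, one read" access with Hector.
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.SliceQuery;

public class UserProfileDao {
  private final Keyspace keyspace;

  public UserProfileDao(String hostPort) {
    Cluster cluster = HFactory.getOrCreateCluster("demo", hostPort);
    this.keyspace = HFactory.createKeyspace("rtb", cluster);
  }

  // Frequency caps, visit history, and segments come back in one read
  // because they were denormalized into the same row at write time.
  public ColumnSlice<String, String> readAll(String userId) {
    SliceQuery<String, String, String> query = HFactory.createSliceQuery(
        keyspace, StringSerializer.get(), StringSerializer.get(),
        StringSerializer.get());
    query.setColumnFamily("UserProfile");
    query.setKey(userId);
    query.setRange("", "", false, 1000);  // all columns, up to a cap
    QueryResult<ColumnSlice<String, String>> result = query.execute();
    return result.get();
  }
}
```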
27. Cassandra Design Challenges
• De-normalize
– Store data to optimize reads
– Composite (multi-column) keys
• Multi-column-family and multi-tenant scenarios
• Compression settings
– Disk and cache savings
– CPU and JVM costs
• Data/Compaction settings
– Size-tiered vs. leveled (LevelDB-style) compaction
• Caching, Memtable and other tuning
28. How to handle performance issues?
• Monitor standard vitals (CPU, disk)?
• Read blogs and documentation?
• Use Cassandra JMX to track req/sec (see the sketch below)
• Use Cassandra JMX to track the size of column families, rows, and columns
• Upgrade often to get the latest performance enhancements?
What about the Application?
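A minimal sketch of the JMX approach, assuming a Cassandra 1.x-era node with JMX on port 7199 and the classic StorageProxy MBean (newer versions expose org.apache.cassandra.metrics instead, so attribute names vary by version):

```java
// Sketch: polling Cassandra request counters over JMX; sample twice
// and divide the delta by the interval to get req/sec.
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CassandraJmxPoller {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection mbs = connector.getMBeanServerConnection();
      ObjectName proxy =
          new ObjectName("org.apache.cassandra.db:type=StorageProxy");
      // Cumulative request counts served by this coordinator node.
      long reads = (Long) mbs.getAttribute(proxy, "ReadOperations");
      long writes = (Long) mbs.getAttribute(proxy, "WriteOperations");
      System.out.printf("reads=%d writes=%d%n", reads, writes);
    } finally {
      connector.close();
    }
  }
}
```

Tracking counts alone still says nothing about which transactions caused the load, which is where the APM view below comes in.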
30. NoSQL APM is not so different after all…
[Diagram: Web → Java → Database tiers]
Key APM Problems Identified:
1) Response Time Contribution
2) Data Access Patterns
3) Transaction-to-Query Relationship (Transaction Flow)
32. Statement Analysis
[Screenshot: per-statement analysis showing executions per Business Transaction, average and total execution time, and total contribution to Transactions]
33. Where, Why, How and which Transaction…
[Screenshot callouts: Which Business Transaction; Which Web Service; Where and why in my Transaction; Single Statement Performance]
34. How does this apply to NoSQL Databases?
[Diagram: Web → Java → NoSQL tiers]
Key APM Problems Identified:
1) Response Time Contribution
2) Data Access Patterns
3) Transaction-to-Query Relationship (Transaction Flow)
Additional NoSQL concerns:
1) Data Access Distribution
2) End-to-End Monitoring
3) Storage (I/O, GC) Bottlenecks
4) Consistency Level
35. Real End-to-End Application Performance
[Diagram: end-user response time contribution broken down across our application, third-party/external services, and the end user]
36. Understanding Cassandra’s Contribution
Which statements did the Transaction execute?
Which node were they executed against?
What was the contribution of each Statement?
Too many calls? What are the Data Access patterns?
Which Consistency Level was used?
37. Understand Response Time Contribution
[Screenshot: 5 calls contributing ~50-80 ms vs. 4 calls contributing ~15 ms; access and data distribution across nodes]
38. Why and how was a statement executed?
45 ms latency? 60 ms waiting on the server?
39. Any Hotspots on the Cassandra Nodes?
Much more load on Node3? Which Transactions are responsible?
45. Big Data is about solving Application Problems.
APM is about Application Performance and Efficiency.
46. THANK YOU
Michael Kopp, Technology Strategist (michael.kopp@compuware.com, @mikopp, blog.dynatrace.com)
Edward Capriolo (edward@m6d.com, @edwardcapriolo, m6d.com/blog)
Editor's Notes
Map/Reduce problem patterns: uneven distribution; optimal splitting; optimizing design; the choke point (between map and reduce); complex jobs, i.e. Hive queries; data locality (a customer of ours has the problem that while the job itself is distributed, the data comes from only 3 HBase/data nodes); too many HBase calls; wasteful job code, i.e. adding more hardware instead of fixing hotspots; premature flushing (see http://blog.dynatrace.com/2012/01/25/about-the-performance-of-map-reduce-jobs/).
Cassandra/NoSQL: the theme here is that from an app point of view the problem patterns haven't really changed, but we actually have additional ones: too many calls; too much data read; non-optimal data access; data-driven locking issues; slow queries; uneven distribution; using the wrong consistency level; slower nodes; I/O issues; GC?
PurePath is the only solution spanning client and server (or edge and cloud). Keynote has no Real User Monitoring. AppDynamics? New Relic?
Done by Mike: explain NoSQL at a high level; explain key benefits and key challenges.
Done by Mike: explain MapReduce at a high level; explain key benefits and key challenges.
Ed: Describe how Hadoop works at a high level. Describe an m6d use case as an example? Typical performance issues and why this is hard (different jobs, different options). Point towards the developer and the Hive query: complicated, but with the most potential.
Mike: We are now starting to do things a little differently. When you look at the typical Map/Reduce flow you'll see the major parts, and now we can monitor these areas for each job. Therefore we can decide on a job-by-job basis whether we have one of the typical Hadoop problems, or whether it is worth our while to optimize things at the core, at the code level, where we get pretty decent hotspots. After all, cutting map code from 60% down to 20% helps a lot; after that it might be good enough, or, now that we spend most of our time in the framework, it is time to look at Hadoop itself again. The message, however, is the same as APM has always been: first identify on a per-job basis if and what the problem is, and then go after it. Don't just tune away; you'll need an expert like Ed to get anywhere that way.