Do applications using NoSQL still require performance management? Is it always the best option to throw more hardware at a MapReduce job? In both cases, performance management is still about the application, but "Big Data" technologies have added a new wrinkle.
Performance Management in ‘Big Data’ Applications
1. Performance Management in ‘Big Data’ Applications
It’s still about the Application
Michael Kopp, Technology Strategist (michael.kopp@compuware.com, @mikopp, blog.dynatrace.com)
Edward Capriolo (edward@m6d.com, @edwardcapriolo, m6d.com/blog)
2. Big Data: High Volume/Low Latency DBs
[Diagram: Web and Java tiers in front of the NoSQL database]
Key Challenges:
1) Even Distribution
2) Correct Schema and Access Patterns
3) Understanding Application Impact
Key Benefits:
1) Fast Read/Write
2) Horizontal Scalability
3) Redundancy and High Availability
3. Big Data: Large Parallel Batch Processing
[Diagram: the Hive Server translates a high-level query into map/reduce jobs, which are batched and triggered in sequence]
Key Challenges:
1) Optimal Distribution
2) Unwieldy Configuration
3) Can easily waste your resources
Key Benefits:
1) Massive Horizontal Batch Jobs
2) Split big Problems into smaller ones
8. Hadoop at m6d
• Critical piece of infrastructure
• Long Term Data Storage
– Raw logs
– Aggregations
– Reports
– Generated data (feedback loops)
• Numerous ETL (Extract, Transform, Load) processes
• Scheduled and ad hoc processes
• Used directly by Tech-Team, Ad Ops, Data Science
9. Hadoop at m6d
• Two deployments: 'production' and 'research'
– ~500 TB, 40+ nodes
– ~350 TB, 20+ nodes
• Thousands of jobs
– <5-minute jobs and 12-hour job flows
– Mostly Hive Jobs
– Some custom code and streaming jobs
10. Hadoop Design Tenets
• Linear scalability by adding more hardware
• HDFS: distributed file system
– User-space file system
– Blocks are replicated across nodes
– Limited semantics
• MapReduce (see the word-count sketch below)
– A paradigm that models computation as map and reduce steps
– Data Locality
– Split a Job into Tasks by Data
– Retry on failure
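To make the paradigm concrete, here is a minimal word-count sketch against the Hadoop mapreduce Java API. It illustrates the map/reduce model itself and is not code from the talk.

```java
// Word count: the canonical illustration of the map/reduce paradigm.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
  // Map: split each input line into words and emit (word, 1) pairs.
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (token.isEmpty()) continue;
        word.set(token);
        context.write(word, ONE);
      }
    }
  }

  // Reduce: the framework groups pairs by word; sum the counts per word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }
}
```

The framework splits the input by data (roughly one map task per block), runs map tasks with data locality where possible, and retries failed tasks, which maps directly onto the tenet list above.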
11. Schema Design Challenges
• Partition data for good distribution
– By time interval (optionally a second level)
– Partition pruning with WHERE
• Clustering (aka bucketing)
– Optimized sampling and joins
• Columnar (column-oriented) formats
• Raw Data Growth
• Data features change (more distinct X)
12. Key Performance Challenges
• Intermediate I/O (see the tuning sketch below)
– Compression codec
– Block size
– Splittable formats
• Contention between jobs
• Data and Map/Reduce Distribution
• Data Skew
• Non-uniform Computation (long-running tasks)
• ‘Cost' of a new feature – is it justified?
• Tuning variables (spills, buffers, etc.)
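As a concrete illustration of these knobs, the sketch below enables map-output compression, enlarges the map-side sort buffer to reduce spills, and wires in a combiner. Property names are the Hadoop 2.x ones (Hadoop 1.x used mapred.compress.map.output and friends), the values are assumptions chosen to illustrate the idea, and SumReducer refers to the word-count sketch above.

```java
// Sketch: reducing intermediate I/O on a Hadoop job.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class JobTuning {
  public static Job configureJob() throws Exception {
    Configuration conf = new Configuration();
    // Compress map output to shrink intermediate disk and shuffle traffic.
    conf.setBoolean("mapreduce.map.output.compress", true);
    conf.setClass("mapreduce.map.output.compress.codec",
        SnappyCodec.class, CompressionCodec.class);
    // A larger map-side sort buffer means fewer spill files to merge.
    conf.setInt("mapreduce.task.io.sort.mb", 256);

    Job job = Job.getInstance(conf, "tuned-job");
    // The combiner pre-aggregates map output before it hits the wire.
    job.setCombinerClass(WordCount.SumReducer.class);
    return job;
  }
}
```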
13. How to handle Performance Issues?
• Profile the Job / Query?
– Who should do this? (DBA, Dev, Ops, DevOps, NoOps, Big Data Guru)
– How should we do this?
• Look at job run times day over day?
• Look at code and micro-benchmark?
• Collect Job Counters? (see the sketch below)
• Upgrade often for latest performance features?
• Investigate/purchase newer, better hardware?
– More cores? RAM? 10G Ethernet? SSD?
• Read blogs?
Test Data is not like Real Data
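Job counters are the cheapest of these options to add. The sketch below emits custom counters from a mapper; the counter names and the "bad record" test are illustrative assumptions, not code from the talk.

```java
// Sketch: custom job counters, aggregated by the framework and shown
// in the JobTracker / job history UI alongside the built-in counters.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CountingMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {
  public enum Records { GOOD, BAD }  // illustrative counter names

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Hypothetical validity check: expect at least 3 tab-separated fields.
    if (value.toString().split("\t").length < 3) {
      context.getCounter(Records.BAD).increment(1);
      return;  // drop malformed records but keep count of them
    }
    context.getCounter(Records.GOOD).increment(1);
    context.write(value, NullWritable.get());
  }
}
```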
15. Understanding Map/Reduce Performance
[Diagram: maximum vs. actual mapping parallelism; attention: data volume, millions of executions, including your own code; a potential choke point between map and reduce; maximum vs. actual reduce parallelism, again including your own code]
18. Map/Reduce behind the scenes
[Diagram: data is de-serialized and serialized again at several stages; spills are potentially inefficient, producing too many intermediate files with the same key spread all over; the synchronous combine step, which de-serializes and serializes yet again, is expensive]
19. Map/Reduce Combine and Spill Performance
1) Pre-combine in the mapping step (see the in-mapper combining sketch below)
2) Avoid many intermediate files and combines
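One common way to pre-combine in the mapping step is in-mapper combining: aggregate in a map-local hash map and emit once per key in cleanup(), so far fewer records are serialized, spilled, and shuffled. A minimal sketch, reusing the word-count example from above and assuming the per-task key set fits in memory:

```java
// Sketch: in-mapper combining to cut intermediate files and records.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMapperCombiningMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final Map<String, Integer> counts = new HashMap<>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    // Aggregate locally instead of writing one record per token.
    for (String token : value.toString().split("\\s+")) {
      if (!token.isEmpty()) {
        counts.merge(token, 1, Integer::sum);
      }
    }
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Emit each key exactly once per map task.
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
    }
  }
}
```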
20. Map/Reduce “Map” Performance
Avoid Brute Force
Focus on Big Hotspots first
Then Optimize Hadoop
Save a lot of Hardware
21. Map/Reduce to the Max!
• Ensure Data Locality
• Optimize Map/Reduce Hotspots
• Reduce Intermediate Data and “Overhead”
• Ensure optimal Data and Compute Distribution
• Tune Hadoop Environment
23. A High Level look at RTB
1. Browsers visit Publishers and create impressions.
2. Publishers sell impressions via Exchanges.
3. Exchanges serve as auction houses for the impressions.
4. On behalf of the marketer, m6d bids on the impressions via the auction house. If m6d wins, we display our ad to the browser.
24. Cassandra at m6d for Real Time Bidding
• In RTB, only limited data is provided by the exchange
• System to store information on users
– Frequency Capping
– Visit History
– Segments (product service affinity)
• Low latency Requirements
– Less than 100 ms
– Requires fast read/write on discrete data
26. Key Cassandra Design Tenets
• Swap/paging not possible
• Mostly schema-less
• Writes do not read
– Read-then-write is an anti-pattern
• Optimize around put and get
– Not for scan and query
• De-normalize data
– Attempt to get all data in a single read (see the sketch below)
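A minimal sketch of that single-read pattern, assuming the era-appropriate Hector Java client (Cassandra 1.x over Thrift): all per-user data is denormalized into one wide row, so one slice query returns everything. The cluster, keyspace, and column family names are illustrative, not m6d's actual schema.

```java
// Sketch: denormalized "one row, one read" access with Hector.
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.beans.ColumnSlice;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.query.QueryResult;
import me.prettyprint.hector.api.query.SliceQuery;

public class UserProfileDao {
  private final Keyspace keyspace;

  public UserProfileDao(String hostPort) {
    Cluster cluster = HFactory.getOrCreateCluster("demo", hostPort);
    this.keyspace = HFactory.createKeyspace("rtb", cluster);
  }

  // Frequency caps, visit history, and segments come back in one read
  // because they were denormalized into the same row at write time.
  public ColumnSlice<String, String> readAll(String userId) {
    SliceQuery<String, String, String> query = HFactory.createSliceQuery(
        keyspace, StringSerializer.get(), StringSerializer.get(),
        StringSerializer.get());
    query.setColumnFamily("UserProfile");
    query.setKey(userId);
    query.setRange("", "", false, 1000);  // all columns, up to a cap
    QueryResult<ColumnSlice<String, String>> result = query.execute();
    return result.get();
  }
}
```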
27. Cassandra Design Challenges
• De-normalize
– Store data to optimize reads
– Composite (multi-column) keys
• Multi-column-family and multi-tenant scenarios
• Compression settings
– Disk and cache savings
– CPU and JVM costs
• Data/Compaction settings
– Size-tiered vs. leveled (LevelDB-style) compaction
• Caching, Memtable and other tuning
28. How to handle performance issues?
• Monitor standard vitals (CPU, disk)?
• Read blogs and documentation?
• Use Cassandra JMX to track req/sec (see the sketch below)
• Use Cassandra JMX to track the size of column families, rows, and columns
• Upgrade often to get the latest performance enhancements?
What about the Application?
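A minimal sketch of the JMX approach, assuming a Cassandra 1.x-era node with JMX on port 7199 and the classic StorageProxy MBean (newer versions expose org.apache.cassandra.metrics instead, so attribute names vary by version):

```java
// Sketch: polling Cassandra request counters over JMX; sample twice
// and divide the delta by the interval to get req/sec.
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CassandraJmxPoller {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    try {
      MBeanServerConnection mbs = connector.getMBeanServerConnection();
      ObjectName proxy =
          new ObjectName("org.apache.cassandra.db:type=StorageProxy");
      // Cumulative request counts served by this coordinator node.
      long reads = (Long) mbs.getAttribute(proxy, "ReadOperations");
      long writes = (Long) mbs.getAttribute(proxy, "WriteOperations");
      System.out.printf("reads=%d writes=%d%n", reads, writes);
    } finally {
      connector.close();
    }
  }
}
```

Tracking counts alone still says nothing about which transactions caused the load, which is where the APM view below comes in.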
30. NoSQL APM is not so different after all…
[Diagram: Web → Java → Database tiers]
Key APM Problems Identified:
1) Response Time Contribution
2) Data Access Patterns
3) Transaction-to-Query Relationship (Transaction Flow)
32. Statement Analysis
[Screenshot: per-statement analysis showing executions per Business Transaction, average and total execution time, and total contribution to Transactions]
33. Where, Why, How and which Transaction…
[Screenshot callouts: Which Business Transaction; Which Web Service; Where and why in my Transaction; Single Statement Performance]
34. How does this apply to NoSQL Databases?
[Diagram: Web → Java → NoSQL tiers]
Key APM Problems Identified:
1) Response Time Contribution
2) Data Access Patterns
3) Transaction-to-Query Relationship (Transaction Flow)
Additional NoSQL concerns:
1) Data Access Distribution
2) End-to-End Monitoring
3) Storage (I/O, GC) Bottlenecks
4) Consistency Level
35. Real End-to-End Application Performance
[Diagram: end-user response time contribution broken down across our application, third-party/external services, and the end user]
36. Understanding Cassandra’s Contribution
Which statements did the Transaction execute?
Which node were they executed against?
What was the contribution of each Statement?
Too many calls? What are the Data Access patterns?
Which Consistency Level was used?
37. Understand Response Time Contribution
[Screenshot: 5 calls contributing ~50-80 ms vs. 4 calls contributing ~15 ms; access and data distribution across nodes]
38. Why and how was a statement executed?
45 ms latency? 60 ms waiting on the server?
39. Any Hotspots on the Cassandra Nodes?
Much more load on Node3? Which Transactions are responsible?
45. Big Data is about solving Application Problems.
APM is about Application Performance and Efficiency.
46. THANK YOU
Michael Kopp, Technology Strategist (michael.kopp@compuware.com, @mikopp, blog.dynatrace.com)
Edward Capriolo (edward@m6d.com, @edwardcapriolo, m6d.com/blog)
Editor's Notes
Map/Reduce problem patterns: uneven distribution; optimal splitting; optimizing design; the choke point (between map and reduce); complex jobs, i.e. Hive queries; data locality (a customer of ours has the problem that while the job itself is distributed, the data comes from only 3 HBase/data nodes); too many HBase calls; wasteful job code, i.e. adding more hardware instead of fixing hotspots; premature flushing (see http://blog.dynatrace.com/2012/01/25/about-the-performance-of-map-reduce-jobs/).
Cassandra/NoSQL: the theme here is that from an app point of view the problem patterns haven't really changed, but we actually have additional ones: too many calls; too much data read; non-optimal data access; data-driven locking issues; slow queries; uneven distribution; using the wrong consistency level; slower nodes; I/O issues; GC?
PurePath is the only solution spanning client and server (or edge and cloud). Keynote has no Real User Monitoring. AppDynamics? New Relic?
Done by Mike: explain NoSQL at a high level; explain key benefits and key challenges.
Done by Mike: explain MapReduce at a high level; explain key benefits and key challenges.
Ed: Describe how Hadoop works at a high level. Describe an m6d use case as an example? Typical performance issues and why this is hard (different jobs, different options). Point towards the developer and the Hive query: complicated, but with the most potential.
Mike: We are now starting to do things a little differently. When you look at the typical Map/Reduce flow you'll see the major parts, and now we can monitor these areas for each job. Therefore we can decide on a job-by-job basis whether we have one of the typical Hadoop problems, or whether it is worth our while to optimize things at the core, at the code level, where we get pretty decent hotspots. After all, cutting map code from 60% down to 20% helps a lot; after that it might be good enough, or, now that we spend most of our time in the framework, it is time to look at Hadoop itself again. The message, however, is the same as APM has always been: first identify on a per-job basis if and what the problem is, and then go after it. Don't just tune away; you'll need an expert like Ed to get anywhere that way.