Top Java Performance Problems and Metrics To Check in Your Pipeline

And other Tips & Tricks to make you a “Performance Expert”
More @ http://blog.dynatrace.com – Tools @ http://bit.ly/dtpersonal
Andreas Grabner - @grabnerandi
Deep Dive Into Top Performance Mistakes

Why Performance?
Confidential, Dynatrace, LLC

700 deployments / YEAR
10 + deployments / DAY
50 – 60 deployments / DAY
Every 11.6 SECONDS

Not only fast delivered but also delivering fast!
Response Time   Conversions
-1000ms         +2%
-1000ms         +10%
+100ms          -1%

#1: Which Geo has which “User Experience”?
#2: Who are these users?

Daily Deployments + Mkt Push
Increase # of unhappy users!
Drop in Conversion Rate
Overall increase of Users!

Satisfied Users Click more Content

Tolerating Users click less content

Frustrated Users mainly click on Support

Update of Dependency Injection Library
impacts Memory & CPU

App with Regular Load supported by 10 Containers
Twice the Load but 48 (=4.8x!) Containers!
App doesn’t scale!!
Does it really scale?

How to analyze perf?

Time: Wall Clock, CPU, I/O, Wait/Sync, Susp, Page Load
Throughput: # of Requests per Timeinterval
Resources: CPU Cycles, Memory, I/O, Log Messages, ...
Pools and Queues: Sizes, Utilization, Acquisition Time,
# Publishers vs # Subscribers, Process Time
Interactions: # SQLs, # Messages, # Services, # Images, # CSS
Errors: Exceptions, HTTPs, TCP Packet Loss

https://dynatrace.github.io/ufo/
“In Your Face” Data!

Where do your Stories come from?

Share Your PurePath -
http://bit.ly/sharepurepath

Diagram labels: 3rd parties, Akamai, Cloudfront, Synthetic, Apache, IIS, Node.js, nginx, Java, .NET, PHP, IBM WMQ, ESBs, MongoDB, HBase, Cassandra, CICS, IMS, ORACLE, MSSQL, MySQL, DB2, Mobile, Collector, Plugins, Dynatrace Server, Hosts, Session Storage, Splunk, Elasticsearch, Solr, Rich Client, Web Interface, Web

Dev/Arch
Method Level Hotspots
+ Exceptions, Logs, Memory
Allocation, Threads, Actual Code ...

Export & Share

Frontend Performance
We are getting FATer!

Mobile landing page of Super Bowl ad
434 Resources in total on that page:
230 JPEGs, 75 PNGs, 50 GIFs, …
Total size of ~20MB

FIFA.com during the World Cup
Source: http://apmblog.compuware.com/2014/05/21/is-the-fifa-world-cup-website-ready-for-the-tournament/

8MB of background image for STPCon (WordPress)

Availability dropped to 0%
Availability And Response Time

Tip for handling Spike Load: GO LEAN!!
1h before
SuperBowl KickOff
1h after
Game ended

Make F12 or Browser Agent your friend!

Key Metrics
# of Resources
Size of Resources
Total Size of Content
HTTP 3xx, 4xx, 5xx
# of Domains

Backend Performance
The Usual Suspects

Project: Online Room Reservation System
• Symptoms
  • HTML takes between 60 and 120s to render
  • High GC Time
• Developer Assumptions
  • Bad GC Tuning
  • Probably bad Database Performance as rendering was simple
• Result: 2 Years of Finger pointing between Dev and DBA

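Developers built their own monitoring
void roomreservationReport(int officeId)
{
    // Wall-clock timestamp taken before the data access layer is called
    long startTime = System.currentTimeMillis();
    Object data = loadDataForOffice(officeId);
    // Elapsed time covers everything inside loadDataForOffice, not just SQL execution
    long dataLoadTime = System.currentTimeMillis() - startTime;
    generateReport(data, officeId);
}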
Result:
Avg. Data Load Time: 45s!
DB Tool says:
Avg. SQL Query: <1ms!

#1: Loading too much data
24889! Calls to the Database API!
Keeping all data in Memory results in
High Memory Usage and high GC

#2: On individual connections
12444! individual connections
Classical N+1 Query Problem
Individual SQL really <1ms
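A minimal sketch of the N+1 pattern named above, and of the single-query alternative. Nothing here comes from the original project: `FakeReservationDb` is a hypothetical in-memory stand-in that only counts round trips, and the table names in the comments are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory stand-in for a database; it only counts round trips. */
class FakeReservationDb {
    private final Map<Integer, List<String>> reservationsByRoom = new HashMap<>();
    int roundTrips = 0; // every method call below models one SQL statement

    FakeReservationDb(int rooms) {
        for (int room = 1; room <= rooms; room++) {
            List<String> reservations = new ArrayList<>();
            reservations.add("reservation-for-room-" + room);
            reservationsByRoom.put(room, reservations);
        }
    }

    /** Models something like: SELECT id FROM room WHERE office_id = ? (invented schema) */
    List<Integer> findRoomIds() {
        roundTrips++;
        return new ArrayList<>(reservationsByRoom.keySet());
    }

    /** Models something like: SELECT ... FROM reservation WHERE room_id = ? */
    List<String> findReservations(int roomId) {
        roundTrips++;
        return reservationsByRoom.get(roomId);
    }

    /** Models one joined query that returns the reservations of all rooms at once. */
    Map<Integer, List<String>> findReservationsForAllRooms() {
        roundTrips++;
        return new HashMap<>(reservationsByRoom);
    }
}

class NPlusOneDemo {
    /** N+1 pattern: one query for the rooms, then one more query per room. */
    static int loadOneByOne(FakeReservationDb db) {
        int total = 0;
        for (int roomId : db.findRoomIds()) {
            total += db.findReservations(roomId).size();
        }
        return total;
    }

    /** Alternative: fetch everything with a single round trip. */
    static int loadInOneQuery(FakeReservationDb db) {
        int total = 0;
        for (List<String> reservations : db.findReservationsForAllRooms().values()) {
            total += reservations.size();
        }
        return total;
    }

    public static void main(String[] args) {
        FakeReservationDb slow = new FakeReservationDb(100);
        FakeReservationDb fast = new FakeReservationDb(100);
        System.out.println(loadOneByOne(slow) + " rows, round trips: " + slow.roundTrips);
        System.out.println(loadInOneQuery(fast) + " rows, round trips: " + fast.roundTrips);
    }
}
```

With 100 rooms the first variant needs 101 round trips and the second needs one. Each individual statement can still be fast, which is why a per-statement average of <1ms does not contradict a slow page.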

#3: Putting all data in temp Hashtable
Lots of time spent
in Hashtable.get
Called from their
Entity Objects

Lessons Learned – Don’t Assume …
• … you know what the code you inherited is doing!!
• … you are not making mistakes like this
• Explore the Right Tools
  • Built-In Database Analysis Tools
  • “Logging” options of Frameworks such as Hibernate, …
  • JMX, Perf Counters, … of your Application Servers
  • Performance Tracing Tools: Dynatrace, Ruxit, NewRelic, AppDynamics, Your Profiler of Choice …

Key Metrics
# of SQL Calls
# of same SQL Execs (1+N)
# of Connections
Rows/Data Transferred
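One possible way to capture the first two metrics in a test, sketched under assumptions: `SqlCallStats` is a hypothetical recorder, and calling `record()` from your data access layer (or a JDBC wrapper) is left to you. It is not a feature of any of the tools named in this deck.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical per-request recorder; call record() wherever a statement is executed. */
class SqlCallStats {
    private final Map<String, Integer> executionsBySql = new LinkedHashMap<>();
    private int totalCalls = 0;

    /** Registers one execution of the given statement text. */
    void record(String sql) {
        totalCalls++;
        executionsBySql.merge(sql, 1, Integer::sum);
    }

    /** Metric: # of SQL Calls. */
    int totalCalls() {
        return totalCalls;
    }

    /** Metric: # of same SQL Execs, i.e. the highest count for one statement text. */
    int maxExecutionsOfSameSql() {
        int max = 0;
        for (int count : executionsBySql.values()) {
            max = Math.max(max, count);
        }
        return max;
    }

    /** Flags a request in which one statement text ran more often than the threshold. */
    boolean looksLikeNPlusOne(int threshold) {
        return maxExecutionsOfSameSql() > threshold;
    }
}
```

The threshold is a judgment call per use case; the useful part is that the number is recorded for every build so that a jump becomes visible.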

Logging
WE CAN LOG THIS!!
Or we just throw a lot of Exceptions
LOG

Log Hotspots in Frameworks!
callAppenders is a clear CPU and I/O Hotspot
Excessive logging through Spring Framework

Debug Log and outdated log4j library
#1: Top Problem: log4j.callAppenders
-> 71% Sync Time
#2: Most of logging done from
fillDetail method
#3: Doing “DEBUG” log
output: Is this necessary?
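A sketch of how debug output can be made cheap when it is switched off. The slide is about log4j; this example uses `java.util.logging` from the JDK only so that it runs without dependencies, and `buildDetail()` is a hypothetical stand-in for an expensive message. The same guard idea applies to log4j via `isDebugEnabled()`.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class GuardedDebugLogging {
    /** Counts how often the expensive message was actually built. */
    static int detailsBuilt = 0;

    /** Hypothetical stand-in for an expensive debug message. */
    static String buildDetail() {
        detailsBuilt++;
        return "detail-" + detailsBuilt;
    }

    /** The message is always built, even if FINE (debug) output is disabled. */
    static void logUnguarded(Logger log) {
        log.fine("state: " + buildDetail());
    }

    /** The message is only built if FINE output is enabled. */
    static void logGuarded(Logger log) {
        if (log.isLoggable(Level.FINE)) {
            log.fine("state: " + buildDetail());
        }
    }

    /** Same effect with a Supplier: the logger evaluates it only when needed. */
    static void logLazily(Logger log) {
        log.fine(() -> "state: " + buildDetail());
    }
}
```

Guards reduce the cost of building messages. They do not fix contention inside the appender, so the debug level in production and the library version still need a look.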

Overhead caused by Exceptions
fillInStackTrace is Top 2 in CPU Hotspots
All these Exceptions that never show up in
a log file are consuming all CPU
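If such exceptions cannot be removed right away, one possible mitigation (my suggestion, not something the slide proposes) is an exception type that skips stack trace capture. `StacklessException` is a hypothetical name; the four-argument `RuntimeException` constructor it calls has been part of the JDK since Java 7.

```java
/** Hypothetical exception type that does not capture a stack trace. */
class StacklessException extends RuntimeException {
    StacklessException(String message) {
        // enableSuppression = false, writableStackTrace = false:
        // with writableStackTrace = false the constructor does not call fillInStackTrace().
        super(message, null, false, false);
    }
}

class ExceptionCostDemo {
    public static void main(String[] args) {
        int regularFrames = new RuntimeException("regular").getStackTrace().length;
        int stacklessFrames = new StacklessException("stackless").getStackTrace().length;
        System.out.println("regular frames:   " + regularFrames);
        System.out.println("stackless frames: " + stacklessFrames);
    }
}
```

The trade-off is that these exceptions carry no location information, so this only suits exceptions used as expected signals. Not throwing them at all for normal control flow is usually the better fix.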

Too Many Exceptions vs Log Messages
Looking at the important (SEVERE, FATAL, …) log messages written:
2-5 Log Messages per 5 Min
Up to 20000 Custom Exceptions
That’s about 4000 Exceptions per Log Message

Key Metrics
# of Log Entries
Size of Logs per Use Case

Pools & Queues
Proper Sizing!!

Wrong Pool Sizes Configured
Do we have enough DB
CONNECTIONS per pool?

Threading Issues (Analysis)
Tip: I like the Thread Column as it tells me where we spawn off async threads and where the “main threads” might be waiting

Sync / Wait
1.63s in Object.wait
Means that this thread is put on hold
Waiting on the next
Connection to become
available!
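To make this kind of waiting visible as a number, a pool can record its own acquisition time. The following is a sketch only: `TimedPool` is a hypothetical class, not the API of any real connection pool, and it expects a non-empty resource list. Real pools typically expose comparable figures through JMX.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical fixed-size pool that records how long callers wait for a resource. */
class TimedPool<T> {
    private final BlockingQueue<T> idle;
    private final AtomicLong totalWaitNanos = new AtomicLong();
    private final AtomicLong timeouts = new AtomicLong();

    /** The list must contain at least one resource. */
    TimedPool(List<T> resources) {
        idle = new ArrayBlockingQueue<>(resources.size(), false, resources);
    }

    /** Returns a resource, or null if none became available within the timeout. */
    T acquire(long timeoutMillis) {
        long start = System.nanoTime();
        try {
            T resource = idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (resource == null) {
                timeouts.incrementAndGet();
            }
            return resource;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
            return null;
        } finally {
            totalWaitNanos.addAndGet(System.nanoTime() - start);
        }
    }

    /** Hands a resource back to the pool. */
    void release(T resource) {
        idle.offer(resource);
    }

    int idleCount() {
        return idle.size();
    }

    long timeouts() {
        return timeouts.get();
    }

    /** Metric: total acquisition time across all callers. */
    long totalWaitMillis() {
        return TimeUnit.NANOSECONDS.toMillis(totalWaitNanos.get());
    }
}
```

A growing `totalWaitMillis()` together with an `idleCount()` of zero under load suggests the pool is too small for the number of worker threads, or that resources are held for too long.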

Key Metrics
Pool and Queue Sizes
Time in Sync & Wait

(Micro)Services
Architectural Mistakes with
„Migrating“ to (Micro)Services

Example #2: Online Sports Club Search Service
Response Time and Users over time: 20xx, 2014, 2015, 2016+
1) Started as a small project
2) Slowly growing user base
3) Expanding to new markets – 1st performance degradation!
4) Adding more markets – performance becomes a business impact
5) Potentially start losing users

Early 2015: Monolithic App
Can’t scale vertically endlessly!
2.68s Load Time
94.09% CPU Bound

Proposal: Service approach!
Front End
to Cloud
Scale Backend
in Containers!

Testing the Backend Service alone scales well …
7:00 a.m.: Low Load and Service running on minimum redundancy
12:00 p.m.: Scaled up service during peak load with failover of problematic node
7:00 p.m.: Scaled down again to lower load and move to different geo location

Single search query end-to-end
26.7s Load Time
5kB Payload
33! Service Calls
99kB - 3kB for each call!
171! Total SQL Count
Architecture Violation: Direct access to DB from frontend service

The fixed end-to-end use case
“Re-architect” vs. “Migrate” to Service-Orientation
2.5s (vs 26.7)
5kB Payload
1! (vs 33!) Service Call
5kB (vs 99) Payload!
3! (vs 171!) Total SQL Count
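The difference between 33 service calls and 1 can be sketched as follows. Everything in this block is hypothetical: `ClubServiceClient`, its two methods, the JSON shape and the fixed `ENVELOPE_BYTES` overhead are assumptions made to show the pattern, not the actual service from the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical backend client; it only records call count and payload size. */
class ClubServiceClient {
    /** Assumed fixed per-response overhead (headers, envelope); illustrative value. */
    static final int ENVELOPE_BYTES = 200;

    int calls = 0;
    long payloadBytes = 0;

    /** Fine-grained API: one remote call per club. */
    String getClub(int clubId) {
        String body = "{\"club\":" + clubId + "}";
        track(body);
        return body;
    }

    /** Coarse-grained API: one remote call returns all requested clubs. */
    List<String> getClubs(List<Integer> clubIds) {
        List<String> clubs = new ArrayList<>();
        StringBuilder body = new StringBuilder();
        for (int clubId : clubIds) {
            String club = "{\"club\":" + clubId + "}";
            clubs.add(club);
            body.append(club);
        }
        track(body.toString());
        return clubs;
    }

    private void track(String body) {
        calls++;
        payloadBytes += ENVELOPE_BYTES + body.getBytes(StandardCharsets.UTF_8).length;
    }
}

class SearchFrontend {
    /** "Migrated" variant: the old in-process loop now crosses the network per hit. */
    static List<String> searchChatty(ClubServiceClient client, List<Integer> hits) {
        List<String> clubs = new ArrayList<>();
        for (int clubId : hits) {
            clubs.add(client.getClub(clubId));
        }
        return clubs;
    }

    /** "Re-architected" variant: one call with an API designed for the use case. */
    static List<String> searchBatched(ClubServiceClient client, List<Integer> hits) {
        return client.getClubs(hits);
    }
}
```

For 33 search hits the chatty variant makes 33 calls and pays the per-response overhead 33 times, while the batched variant makes one call. Both counters are the kind of numbers a use case test can assert on.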

You measure it! from Dev (to) Ops

Scenario: Monolithic App with 2 Key Features
Re-architecture into „Services“ + Performance Fixes
Use Case Tests and Monitors Service & App Metrics
Metrics from and for Dev(to)Ops

                               Dev                                  Ops
Build #   Use Case       Stat  # API Calls  # SQL  Payload  CPU     #ServInst  Usage  RT
Build 17  testNewsAlert  OK    1            5      2kb      70ms    1          0.5%   7.2s
          testSearch     OK    1            3      5kb      120ms   1          63%    5.2s

                               1            4      1kb      60ms
          testSearch     OK    34           171    104kb    550ms

                               1            4      1kb      60ms    1          0.6%   4.2s
          testSearch     OK    2            3      10kb     150ms   5          75%    2.5s

Build 35  testNewsAlert  -     -            -      -        -       -          -      -
          testSearch     OK    2            3      10kb     150ms   8          80%    2.0s
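A sketch of how such per-build metrics could gate a pipeline. The class names and the `allowedFactor` rule are assumptions for illustration; the deck does not prescribe a specific gate. The numbers in the usage note are read from the rows above, where 1 / 3 / 5kb / 120ms and 34 / 171 / 104kb / 550ms appear to belong to testSearch.

```java
import java.util.ArrayList;
import java.util.List;

/** Metrics captured for one use case test in one build. */
final class UseCaseMetrics {
    final int apiCalls;
    final int sqlCount;
    final int payloadKb;
    final int cpuMs;

    UseCaseMetrics(int apiCalls, int sqlCount, int payloadKb, int cpuMs) {
        this.apiCalls = apiCalls;
        this.sqlCount = sqlCount;
        this.payloadKb = payloadKb;
        this.cpuMs = cpuMs;
    }
}

/** Hypothetical quality gate comparing a build against a baseline build. */
class MetricsGate {
    /** Returns one message per metric that grew by more than the allowed factor. */
    static List<String> regressions(UseCaseMetrics baseline, UseCaseMetrics current,
                                    double allowedFactor) {
        List<String> found = new ArrayList<>();
        check(found, "# API Calls", baseline.apiCalls, current.apiCalls, allowedFactor);
        check(found, "# SQL", baseline.sqlCount, current.sqlCount, allowedFactor);
        check(found, "Payload (kb)", baseline.payloadKb, current.payloadKb, allowedFactor);
        check(found, "CPU (ms)", baseline.cpuMs, current.cpuMs, allowedFactor);
        return found;
    }

    private static void check(List<String> found, String name,
                              int before, int after, double allowedFactor) {
        if (after > before * allowedFactor) {
            found.add(name + ": " + before + " -> " + after);
        }
    }
}
```

Comparing `new UseCaseMetrics(34, 171, 104, 550)` against a baseline of `new UseCaseMetrics(1, 3, 5, 120)` with an allowed factor of 1.2 reports all four metrics, so that build would be stopped although its functional test status is OK.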

Key Metrics
# of Service Calls
Payload of Service Calls
# of Involved Threads
1+N Service Call Pattern!

Tips & Tricks
And more Metrics of course 

Tip: Layer Breakdown over Time
With increasing load: Which LAYER
doesn’t SCALE?

Tip: Exceptions and Log Messages
How are # of EXCEPTIONS
evolving over time?
How many SEVERE LOG
messages do we write in
relation to Exceptions?

Tip: Failed Transactions
Are more TRANSACTIONS
FAILING (HTTP 5xx, 4xx, …)
under heavier load?

Tip: Database Activity
Do we see an increase in AVG #
of SQL Executions over Time?
Do TOTAL # of SQL Executions
increase with load? Shouldn’t
it flatten due to CACHES?
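Why the curve should flatten can be shown with a small read-through cache. This is a sketch under assumptions: `ReadThroughCache` is a hypothetical class without eviction or expiry, and the loader function stands in for a database query.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical unbounded read-through cache that counts loader invocations. */
class ReadThroughCache<K, V> {
    private final Map<K, V> entries = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final AtomicInteger loads = new AtomicInteger();

    /** The loader stands in for the SQL query that fetches one value. */
    ReadThroughCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, invoking the loader only on the first request per key. */
    V get(K key) {
        return entries.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    /** Number of loader invocations, i.e. simulated SQL executions. */
    int loads() {
        return loads.get();
    }
}
```

If 1000 requests touch only 10 distinct keys, `loads()` stays at 10 however much the request count grows. A total SQL count that rises in step with load points to a cache that is missing, too small, or keyed in a way that never hits.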

Tip: Database History Dashboard
How many SQL Statements are
PREPARED?
What’s the overall Execution
Time of different SQL Types
(SELECT, INSERT, DELETE, …)?

For more Key Metrics
http://blog.dynatrace.com
http://blog.ruxit.com

Questions and/or Demo
Slides: slideshare.net/grabnerandi
Get Tools: bit.ly/dtpersonal
YouTube Tutorials: bit.ly/dttutorials
Contact Me: agrabner@dynatrace.com
Follow Me: @grabnerandi
Read More: blog.dynatrace.com

Andreas Grabner
Dynatrace Developer Advocate
@grabnerandi
http://blog.dynatrace.com
