Why is performance important? What are the most common reasons applications don't scale and perform well? Which technical metrics should you look at? How can you check them automatically in the pipeline?
Top Java Performance Problems and Metrics To Check in Your Pipeline
1. And other Tips & Tricks to make you a “Performance Expert”
More @ http://blog.dynatrace.com – Tools @ http://bit.ly/dtpersonal
Andreas Grabner - @grabnerandi
Deep Dive Into Top
Performance Mistakes
39. • Symptoms
• HTML takes between 60 and 120s to render
• High GC Time
• Developer Assumptions
• Bad GC Tuning
• Probably bad Database Performance as rendering was simple
• Result: 2 Years of Finger pointing between Dev and DBA
Project: Online Room Reservation System
40. Developers built own monitoring
void roomreservationReport(int officeId)
{
    long startTime = System.currentTimeMillis();
    Object data = loadDataForOffice(officeId);
    // Measures the whole data load, not the individual SQL statements
    long dataLoadTime = System.currentTimeMillis() - startTime;
    generateReport(data, officeId);
}
Result:
Avg. Data Load Time: 45s!
DB Tool says:
Avg. SQL Query: <1ms!
41. #1: Loading too much data
24889! Calls to the Database API!
High memory usage to keep all
data in memory, resulting
in high GC
42. #2: On individual connections
12444! individual
connections
Classical N+1
Query Problem
Individual SQL
really <1ms
43. #3: Putting all data in temp Hashtable
Lots of time spent
in Hashtable.get
Called from their
Entity Objects
44. • … you know what the code you inherited is doing!!
• … you are not making mistakes like this
• Explore the Right Tools
• Built-In Database Analysis Tools
• “Logging” options of Frameworks such as Hibernate, …
• JMX, Perf Counters, … of your Application Servers
• Performance Tracing Tools: Dynatrace, Ruxit, NewRelic,
AppDynamics, Your Profiler of Choice …
Lessons Learned – Don’t Assume …
45. Key Metrics
# of SQL Calls
# of same SQL Execs (1+N)
# of Connections
Rows/Data Transferred
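The first two key metrics can be captured in a plain functional or integration test. The following is a minimal sketch, assuming a hypothetical `SqlMetrics` helper that the test harness feeds with every executed statement; the class name, the table names and the way statements are observed are all assumptions, not something the slides prescribe.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: records every SQL statement a test executes so that
// "# of SQL Calls" and "# of same SQL Execs" can be checked afterwards.
class SqlMetrics {
    private final Map<String, Integer> executionsPerStatement = new LinkedHashMap<>();
    private int totalCalls;

    // Call this from wherever the test setup can observe executed SQL.
    void record(String sql) {
        totalCalls++;
        executionsPerStatement.merge(sql, 1, Integer::sum);
    }

    int totalCalls() {
        return totalCalls;
    }

    // Highest execution count of one identical statement (the 1+N indicator).
    int maxSameSqlExecutions() {
        int max = 0;
        for (int count : executionsPerStatement.values()) {
            max = Math.max(max, count);
        }
        return max;
    }

    public static void main(String[] args) {
        SqlMetrics metrics = new SqlMetrics();
        // Table and column names are made up for illustration.
        metrics.record("SELECT id FROM room WHERE office_id = ?");
        for (int i = 0; i < 10; i++) {
            metrics.record("SELECT * FROM reservation WHERE room_id = ?");
        }
        System.out.println("total SQL calls: " + metrics.totalCalls());
        System.out.println("max same SQL executions: " + metrics.maxSameSqlExecutions());
    }
}
```

A test could then fail when `maxSameSqlExecutions()` grows with the number of rows returned, which is the typical signature of the 1+N pattern.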
47. Log Hotspots in Frameworks!
callAppenders clear CPU and I/O Hotspot
Excessive logging through Spring Framework
48. Debug Log and outdated log4j library
#1: Top Problem: log4j.callAppenders
-> 71% Sync Time
#2: Most of logging done from
fillDetail method
#3: Doing “DEBUG” log
output: Is this necessary?
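To illustrate why the log level matters, here is a minimal sketch that uses `java.util.logging` instead of log4j (a deliberate substitution so the example runs without extra libraries). `buildDetail` is a hypothetical stand-in for an expensive message build; the sketch counts how often it runs when the level is INFO versus FINE.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;
import java.util.logging.Level;
import java.util.logging.Logger;

class GuardedLogging {
    static final AtomicInteger detailBuilds = new AtomicInteger();

    // Hypothetical stand-in for an expensive debug message.
    static String buildDetail() {
        detailBuilds.incrementAndGet();
        return "detail...";
    }

    // Logs 1000 debug-level messages and returns how often the message was built.
    static int logWith(Level level) {
        Logger logger = Logger.getLogger("GuardedLogging.demo");
        logger.setUseParentHandlers(false); // keep the demo output quiet
        logger.setLevel(level);
        detailBuilds.set(0);
        Supplier<String> message = GuardedLogging::buildDetail;
        for (int i = 0; i < 1000; i++) {
            // The supplier is only invoked if FINE is loggable at the current level.
            logger.fine(message);
        }
        return detailBuilds.get();
    }

    public static void main(String[] args) {
        System.out.println("message builds at INFO: " + logWith(Level.INFO));
        System.out.println("message builds at FINE: " + logWith(Level.FINE));
    }
}
```

With the level at INFO the message is never built, so the debug statements cost little more than a level check. Other logging frameworks offer comparable guards, but their exact API is not shown here.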
49. Overhead caused by Exceptions
fillInStackTrace is Top 2 in CPU Hotspots
All these Exceptions that never show up in
a log file are consuming all CPU
50. Too Many Exceptions vs Log Messages
2-5 Log Messages per 5 Min
Looking at the important
(SEVERE, FATAL, …) log messages
written
Up to 20000 Custom Exceptions
That’s about 4000x the number
of Exceptions per Log Message
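The cost behind `fillInStackTrace` is the capture of the full call stack for every exception created. The slides do not say how this case was fixed; the sketch below only shows one commonly used option for exceptions that are thrown as expected control flow, namely disabling the stack trace through the protected four-argument `Exception` constructor. `LightweightException` is a hypothetical name.

```java
class StacklessExceptionDemo {
    // Hypothetical custom exception that skips stack trace capture.
    static class LightweightException extends Exception {
        LightweightException(String message) {
            // cause = null, enableSuppression = false, writableStackTrace = false
            super(message, null, false, false);
        }
    }

    // Number of stack frames recorded in the exception.
    static int stackDepthOf(Exception e) {
        return e.getStackTrace().length;
    }

    public static void main(String[] args) {
        System.out.println("regular exception frames: " + stackDepthOf(new Exception("regular")));
        System.out.println("lightweight exception frames: " + stackDepthOf(new LightweightException("cheap")));
    }
}
```

The trade-off is that such exceptions carry no stack trace for debugging, so reducing the number of exceptions thrown is usually the first thing to check.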
55. Threading Issues (Analysis) Tip: I like the Thread Column as it tells me
where we spawn off async threads and
where the “main threads” might be waiting
56. Sync / Wait
1.63s in Object.wait
Means that this thread is put on hold
Waiting on the next
Connection to become
available!
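The waiting behavior can be reproduced with a minimal sketch, assuming a hypothetical `TinyPool` that models a connection pool as a `Semaphore`. Real pools are implemented differently, but the effect is the same: once all connections are in use, the next caller blocks until one is returned or a timeout expires.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified pool: each permit stands for one connection.
class TinyPool {
    private final Semaphore permits;

    TinyPool(int size) {
        permits = new Semaphore(size);
    }

    // Returns true if a connection slot was obtained within the timeout.
    boolean borrow(long timeoutMillis) throws InterruptedException {
        return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    void giveBack() {
        permits.release();
    }

    public static void main(String[] args) throws InterruptedException {
        TinyPool pool = new TinyPool(1);
        System.out.println("first borrow: " + pool.borrow(50));

        long start = System.nanoTime();
        boolean second = pool.borrow(50); // pool exhausted, this call waits
        long waitedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("second borrow: " + second + " after waiting ~" + waitedMs + " ms");

        pool.giveBack();
        System.out.println("borrow after return: " + pool.borrow(50));
    }
}
```

Time spent in such a wait shows up as sync or wait time on the thread rather than as CPU or SQL time, which is why it is easy to miss when looking only at query durations.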
59. Example #2: Online Sports Club Search Service
Timeline: 20xx, 2014, 2015, 2016+
Response Time
1) Started as a
small project
2) Slowly growing
user base
3) Expanding to
new markets –
1st performance
degradation!
4) Adding more markets
– performance becomes
a business impact
Users
5) Potentially start
losing users
60. Early 2015: Monolithic App
Can‘t scale vertically endlessly!
2.68s Load Time
94.09% CPU
Bound
62. 7:00 a.m.
Low Load and Service running
on minimum redundancy
12:00 p.m.
Scaled up service during peak load
with failover of problematic node
7:00 p.m.
Scaled down again to lower load
and move to different geo location
Testing the Backend Service alone scales well …
66. 26.7s Load Time
5kB Payload
33! Service Calls
99kB - 3kB for each call!
171! Total SQL Count
Architecture Violation
Direct access to DB from frontend service
Single search query end-to-end
67. The fixed end-to-end use case
“Re-architect” vs. “Migrate” to Service-Orientation
2.5s (vs 26.7)
5kB Payload
1! (vs 33!) Service Call
5kB (vs 99) Payload!
3! (vs 177) Total
SQL Count
76. Tip: Database Activity
Do we see an increase in AVG #
of SQL Executions over Time?
Do TOTAL # of SQL Executions
increase with load? Shouldn’t
it flatten due to CACHES?
77. Tip: Database History Dashboard
How many SQL Statements are
PREPARED?
What’s the overall Execution
Time of different SQL Types
(SELECT, INSERT, DELETE, …)
78. For more Key Metrics
http://blog.dynatrace.com
http://blog.ruxit.com
More detailed stories can also be found on our blog: http://blog.dynatrace.com
All examples have been found using Dynatrace Free Trial – http://bit.ly/dtpersonal
Several companies changed the way they develop and deploy software over the years. Here are some examples (numbers from 2011 – 2014)
Cars: from 2 deployments to 700
Flickr: 10+ per Day
Etsy: lets every new employee on their first day of employment make a code change and push it through the pipeline in production: THAT’S the right approach towards required culture change
Amazon: every 11.6s
Remember: these are very small changes – which is also a key goal of continuous delivery. The smaller the change, the easier it is to deploy, the less risk it has, the easier it is to test and the easier it is to take it out in case it has a problem.
But it is not only about delivering features faster – it is also about delivering fast features!
These stats come from here: http://nft.atcyber.com/infographics/infographic-the-importance-of-web-performance-20140913
Monitor your end users after you deployed something
Monitoring user experience and impact on conversion rate
Understand user behavior depending on who they are and what they are doing.
Screenshot from https://github.com/Dynatrace/Dynatrace-UEM-PureLytics-Heatmap
Does the behavior change if they have a less optimal user experience?
Screenshot from https://github.com/Dynatrace/Dynatrace-UEM-PureLytics-Heatmap
Seems like users that have a frustrating experience are more likely to click on Support
Screenshot from https://github.com/Dynatrace/Dynatrace-UEM-PureLytics-Heatmap
Even if the deployment seemed good because all features work and response time is the same as before: if your resource consumption goes up like this, the deployment is NOT GOOD, as you are now paying a lot of money for that extra compute power
We look at metrics – lots of them
Yes – I am working for a tool vendor – BUT – you can try this with most of the tools in the APM, Tracing, Diagnostics space out there.
Your chance to leave now in case you think this session is about optimizing your Java code by 0.01ms
It's about looking at common performance metrics and trying to figure out why your application really doesn't scale or perform
Because – thanks to my really awesome job – and thanks to dynatrace – I am allowed to travel the world and meet a lot of people that deal with real problems
People send me data and I analyze it for them
Quick overview of how APM tools such as Dynatrace work!
This is the data we collect!
And this is how easy it is to share data with me
This is a sample of what I send people back -> that's the input to many stories I have to tell
Based on my experience
80% of the problems are caused by only 20% of the problem patterns. Focusing on the 20% of potential problems that take away 80% of the pain is a very good starting point
Most of the problems can easily be identified by just looking at the right metrics. Most performance problems can also be found by looking at metrics while your app is not even under load -> a simple click through / functional / unit or integration test will do
We will start at the frontend but spend most time on the backend. It's important though to look at both sides
Let's start with the Frontend for all Web Developers
My all-time favorite is the mobile landing page for a soft drink company during SuperBowl 2014 – 400+ individual images of selfie uploads aligned in a 20x20 grid. They were pushed to my iPhone 4 in very high resolution, causing a 20MB data download and forcing my phone to shrink each picture to fit the 20x20 grid on my small display
Another common problem is individual very large images – or in this case a very large favicon which should normally only be a couple of bytes
Or people forgetting to shrink their high resolution images before putting them on public websites
Synthetic Availability Monitoring -> Clearly something went wrong
If you have a peak period coming up – consider switching to an optimized landing page for that period – just as GoDaddy did during the SuperBowl.
In case you didn't know – Hit F12 in your browser and you get all these metrics. Even better – you can automate that while running your browser driven tests
Done with the Frontend
Let's look at the backend
Now to the backend
This story is from Joe – a DB guy from a very large telco arguing with his developers over performance problems of an online room reservation system, which has evolved from a small project implemented by an intern to an application that is now used in their entire organization
Devs built custom monitoring to prove their point! Contradicting what Joe's DB Tools had to say
Reading this Transaction Flow showed what the real problem was: Loading Too Much Data from the Database, causing High Memory Usage and therefore high CPU to clean up the garbage
Every SQL was executed on its own Connection
The intern back then implemented their own O/R Mapper by loading the full database content into a Hashtable using individual queries
Thanks to Splunk, Elastic Search and others we are able to analyze every log message we put out – but – does this really make sense?
When logging becomes your performance issue -> misconfiguration of frameworks leads to CPU and I/O issues -> be aware of that!
Wrong log level and outdated log libraries can lead to serious performance impacts
Everybody seems to migrate to MicroServices -> but be aware of the common mistakes
They had a monolithic app that couldn't scale endlessly. Their popularity caused them to think about re-architecture and allowing developers to make faster changes to their code. They were moving towards a Service Approach
Separating frontend logic from backend (search service). The idea was to also host these services potentially in the public cloud (frontend) and in a dynamic virtual environment (backend) to be able to scale better globally
The Backend Search Service Team did a lot of testing on their backend services. Scaling up and down on demand. All looked pretty good! They gave it a Thumbs Up!
On Go Live Date with the new architecture everything looked good at 7AM, when not many folks were online yet!
By noon – when the real traffic started to come in – the picture was completely different. User Experience across the globe was bad. Response Time jumped from 2.5 to 25s and bounce rate tripled from 20% to 60%
The backend service itself was well tested. The problem was that they never looked at what happens under load "end-to-end". It turned out that the frontend had direct access to the database to execute the initial query when somebody executed a search. The returned list of search result IDs was then iterated over in a loop. For every element a "Micro" Service call was made to the backend, which resulted in 33! Service Invocations for this particular use case, where the search result returned 33 items. Lots of wasted traffic and resources, as these Key Architectural Metrics show us
They fixed the problem by understanding the end-to-end use cases and then defining backend service APIs that provide the data the frontend really needs. This reduced roundtrips, eliminated the architectural regression and improved performance and scalability
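The difference between the two designs can be sketched as follows. `Backend`, `detail` and `details` are hypothetical names, and the remote call is replaced by a simple counter; the only point is that a per-item API produces one call per search result while a coarse-grained API produces one call per search.

```java
import java.util.ArrayList;
import java.util.List;

class SearchCalls {
    // Hypothetical backend stub that counts "remote" invocations.
    static class Backend {
        int calls;

        // Fine-grained API: one call per search result.
        String detail(int id) {
            calls++;
            return "club-" + id;
        }

        // Coarse-grained API: one call for the whole result list.
        List<String> details(List<Integer> ids) {
            calls++;
            List<String> out = new ArrayList<>();
            for (int id : ids) {
                out.add("club-" + id);
            }
            return out;
        }
    }

    static int chattyCalls(List<Integer> ids) {
        Backend backend = new Backend();
        for (int id : ids) {
            backend.detail(id);
        }
        return backend.calls;
    }

    static int batchedCalls(List<Integer> ids) {
        Backend backend = new Backend();
        backend.details(ids);
        return backend.calls;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= 33; i++) {
            ids.add(i);
        }
        System.out.println("chatty service calls: " + chattyCalls(ids));
        System.out.println("batched service calls: " + batchedCalls(ids));
    }
}
```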
Lessons Learned!
If we monitor these key metrics in dev and in ops we can make much better decisions on which builds to deploy
We immediately detect bad changes and fix them. We will stop builds from making it into Production in case these metrics tell us that something is wrong.
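A build gate of this kind can be as simple as comparing measured metrics against allowed maximums. The sketch below is a hypothetical `BuildGate`; the metric names and the limits are assumptions chosen for illustration, while the measured values in `main` reuse the numbers from the search example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class BuildGate {
    // Returns the names of all metrics that are missing or above their allowed maximum.
    static List<String> violations(Map<String, Integer> measured, Map<String, Integer> allowedMax) {
        List<String> failed = new ArrayList<>();
        for (Map.Entry<String, Integer> limit : allowedMax.entrySet()) {
            Integer value = measured.get(limit.getKey());
            if (value == null || value > limit.getValue()) {
                failed.add(limit.getKey());
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Measured values taken from the slow search use case.
        Map<String, Integer> measured = new LinkedHashMap<>();
        measured.put("sqlCount", 171);
        measured.put("serviceCalls", 33);
        measured.put("payloadKb", 99);

        // Limits are assumptions; in practice they would come from a known good build.
        Map<String, Integer> allowedMax = new LinkedHashMap<>();
        allowedMax.put("sqlCount", 5);
        allowedMax.put("serviceCalls", 1);
        allowedMax.put("payloadKb", 10);

        List<String> failed = violations(measured, allowedMax);
        System.out.println(failed.isEmpty() ? "build OK" : "stop build, violated: " + failed);
    }
}
```

Treating a missing metric as a violation is a design choice here: a build that stops reporting a metric should not pass silently.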
We can also take features out that nobody uses if we have usage insights for our services. In this case we monitor the % of Visitors using a certain feature. If a feature is never used – even after we spent time improving its performance – it is about time to take this feature out. This removes code that nobody needs and therefore reduces technical debt: less code to maintain – fewer tests to maintain – fewer bugs in the system!
I love looking at Layers / APIs / Services -> if you have the chance to run a load test with slightly increasing load, just monitor which of your APIs/Services/Methods behave "out of the norm" -> that's your breaking point
I always look at Exceptions vs Log Messages. Especially with frameworks such as Hibernate/Spring you can end up with a lot of "internal exceptions" that impact performance but leave no "visible" entry in any log file. That's why I chart them and assume they correlate. If not – you know that something is wrong
Same is true for Failed Requests vs. Load -> at which point does your app break and return HTTP 4xx, 5xx?
Looking at Avg number of SQL Queries -> Do we have a data driven problem?
Looking at Total # of SQL -> should show a flattening curve as we assume we can cache some of the data
Are we preparing SQLs – how many INSERTS, UPDATES, DELETES -> do we have certain periods during the day when heavy REPORTS or clean up jobs run?