During our two-week engagement with Pivotal, we started with the Discovery and Framing phase, in which we:
1. Created Story Backlogs prioritized by the goals & activities (shown in the top-right corner)
2. Performed an Architecture Retrospective and assessed the application architecture holistically
3. Performed Fishbone Analysis to identify failure modes between microservices and queues, and failure points for different applications (shown in the bottom-right corner)
4. Performed Platform Health Check activities, which included reviewing the application's current RabbitMQ implementation against best practices, reviewing the autoscaling policies, and reviewing environment-specific differences
5. Performed an Application Source Code Review and evaluated the upgrade impact of Spring Boot 2.0, Java 11, and dependencies
6. Defined a Performance Plan and Isolation Segment setup to:
● Address inconsistent response times for the apps
● Determine the root cause of performance drift
● Understand differing CPU utilization across PCF instances
7. Discussed how we can leverage the Pivotal Platform Metrics Dashboard for application monitoring
A 360-degree health assessment of our application revealed many interesting observations about our applications and platform, identified several key risks, and produced actionable recommendations from Pivotal Solution Architects. Instead of going over each Application Health Check dimension, I will focus on a few of them in the interest of time.
First, I would like to talk about the Failure Mode Analysis dimension.
In this dimension, we performed failure mode testing in which we were able to reproduce the issue we faced in production: thread pool exhaustion caused by resource contention, leading to high CPU.
The risks identified for this dimension are that we would have to perform chaos testing under high load to reach the break point, that the app cannot tolerate the loss of RabbitMQ and run in degraded mode for an extended period of time, and that latency-based autoscaling in PCF does not work for this application.
The recommendations from Pivotal were to:
● Tune the size of the ForkJoin thread pool to 10
● Cap the ForkJoin queue max depth at 10
● Set the HTTP thread pool size to 100
● Configure autoscaling to be CPU-based [80, 160] with min and max instances set to [1, 10]
● Use CallerRunsPolicy for the ForkJoin pool
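A minimal sketch of what these settings could look like in code is shown below. Because CallerRunsPolicy and a bounded queue are ThreadPoolExecutor concepts (ForkJoinPool does not accept a rejection handler), the sketch uses a ThreadPoolExecutor; the pool and queue sizes mirror the recommendation, but the class and method names are hypothetical:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class BoundedPools {
    // Fixed pool of 10 threads over a queue capped at depth 10. When the
    // queue fills up, CallerRunsPolicy makes the submitting thread run the
    // task itself, applying back-pressure instead of letting unbounded work
    // pile up and exhaust the pool under resource contention.
    public static ExecutorService boundedWorkerPool() {
        return new ThreadPoolExecutor(
                10, 10,                        // core == max: fixed pool size of 10
                0L, TimeUnit.MILLISECONDS,     // no keep-alive for core threads
                new ArrayBlockingQueue<>(10),  // queue max depth = 10
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

The back-pressure from CallerRunsPolicy is what keeps producers from outrunning the pool, which is exactly the exhaustion scenario reproduced during failure mode testing.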
Next, I would like to talk about the Technical Debt & Code Hygiene dimension.
In this dimension, we discovered that the application has a bloated classpath. Microservices are as large as ~250 MB and embed five app servers (Netty, Jetty, Jersey, Spark server, and Tomcat).
The risk is that if time and resources are not spent reducing the number and scope of dependencies, the apps will take longer to start and eventually autoscaling will not work. We also need to speed up the inner loop of development.
The recommendations were to:
● Eliminate Shared Service Library
● Eliminate and prune external dependencies
● Migrate to Spring Boot 2.x and Java 11
● Run and profile apps in a local sandbox with all service dependencies
● Set a threshold on the size of app jars in the CI pipeline to stop third-party library proliferation
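As one hedged illustration of the last point, a small guard run as a CI step could fail the build when the packaged artifact exceeds a size budget; the artifact path and the 100 MB threshold below are placeholders, not values from the engagement:

```java
import java.io.File;

public final class JarSizeGuard {
    // Placeholder budget; tighten it as the classpath is pruned.
    private static final long MAX_BYTES = 100L * 1024 * 1024;

    public static void main(String[] args) {
        // Placeholder artifact path; pass the real jar path from the pipeline.
        File jar = new File(args.length > 0 ? args[0] : "target/app.jar");
        if (!jar.isFile()) {
            throw new IllegalStateException("Artifact not found: " + jar);
        }
        if (jar.length() > MAX_BYTES) {
            throw new IllegalStateException("Jar is " + jar.length()
                    + " bytes, over the " + MAX_BYTES + "-byte budget");
        }
        System.out.println("Jar size OK: " + jar.length() + " bytes");
    }
}
```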
Monitoring and Metrics
Establishing desired service behavior, measuring how the service is actually behaving, and correcting discrepancies.
The assessment of Monitoring & Metrics revealed that we were using too many tools to monitor our applications, which caused confusion when identifying the root cause of an issue. The recommendation was to reduce the number of monitoring tools and to use PCF Metrics along with Dynatrace or a similar Application Performance Monitoring tool for root cause analysis.
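As a minimal sketch of what consolidated instrumentation could look like, assuming Micrometer (the metrics facade that ships with Spring Boot 2) is on the classpath: a custom timer registered once is visible to PCF Metrics and an APM tool alike, so one dashboard can correlate application latency with platform CPU. The metric name and business method here are hypothetical:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public final class CheckoutMetrics {
    public static void main(String[] args) {
        // Spring Boot would inject its own MeterRegistry; SimpleMeterRegistry
        // keeps this sketch self-contained.
        MeterRegistry registry = new SimpleMeterRegistry();
        Timer checkoutTimer = Timer.builder("checkout.latency") // hypothetical metric name
                .description("End-to-end checkout time")
                .register(registry);

        // Timing the call records one latency sample into the registry.
        checkoutTimer.record(CheckoutMetrics::processCheckout);
        System.out.println("Recorded " + checkoutTimer.count() + " sample(s)");
    }

    private static void processCheckout() {
        // Business logic placeholder.
    }
}
```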
Failure Mode Analysis
Understand the impact of failure of critical external dependencies on the core service. Play out scenarios where there is partial or complete loss of business functionality and plan for appropriate countermeasures.
The assessment of Failure Mode Analysis revealed that we should perform chaos testing under high load to identify failure impact, such as the application being unable to tolerate the loss of RabbitMQ …
Technical Debt
Dependency Management and Library updates within the project. Is there a substantial bloat of libraries and third party dependencies in the project? Where is the technical debt accumulated in the components?
The assessment of Technical Debt revealed that the application has become bloated due to the inclusion of various dependency jars that are not being leveraged …
Emergency Response
Are run books in place to capture the right set of logs when a failure occurs? Does the development team follow a prescribed set of steps to triage and debug a problem in production? Are circuit breakers and other fallbacks in place to revert to a degraded functionality during failure?
The assessment of Emergency Response revealed that we need an automated way of collecting thread and heap dumps when the CPU is experiencing high utilization.
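A hedged sketch of such automation, using only the HotSpot-specific com.sun.management MXBeans, is shown below. The 80% threshold and dump file name are placeholders, and in practice this check would run on a schedule or behind an alert rather than once in main:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public final class HighCpuDumper {
    public static void main(String[] args) throws Exception {
        OperatingSystemMXBean os =
                ManagementFactory.getPlatformMXBean(OperatingSystemMXBean.class);
        // getProcessCpuLoad() returns the JVM's recent CPU usage in [0, 1].
        if (os.getProcessCpuLoad() > 0.8) {
            // Thread dump: ThreadInfo.toString() prints a readable stack trace.
            for (ThreadInfo info : ManagementFactory.getThreadMXBean()
                    .dumpAllThreads(true, true)) {
                System.out.print(info);
            }
            // Heap dump of live objects to an .hprof file for offline analysis.
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class)
                    .dumpHeap("high-cpu.hprof", true);
        }
    }
}
```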
Performance Optimization
Are the applications starting slowly? Are applications failing to meet their expected SLAs? Analysis of performance issues ranging from high memory allocation to increased latency and high CPU. Performance test plan evaluation.
The assessment of Performance Optimization revealed that the application was CPU-constrained due to unmanaged threads. Properly sizing the thread pools is necessary to drive performance, along with using the correct garbage collector. Local profiling of the application is very important for understanding thread utilization, which can be done with VisualVM and JMeter.
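One small habit that makes that profiling easier, offered here as a sketch rather than as what the team actually did: give every pool a named ThreadFactory, so its threads can be picked out in VisualVM's Threads tab during a JMeter run. The "pricing-worker" prefix is made up:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public final class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger();

    public NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable task) {
        // "prefix-N" names make each pool's threads identifiable in a
        // VisualVM thread dump instead of anonymous "pool-1-thread-N".
        Thread t = new Thread(task, prefix + "-" + counter.incrementAndGet());
        t.setDaemon(true);
        return t;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(
                10, new NamedThreadFactory("pricing-worker"));
        pool.submit(() -> System.out.println(Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```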
The next dimension I would like to focus on is the Architecture dimension.
This dimension reveals that our microservices are at the right level of granularity; however, there is tight coupling, and unnecessary big-data dependencies are present in the code.
The risk identified is considerable sharing of service libraries between microservices, leading to tight coupling: the shared service library is a monolith that is dragged into each service, and big-data dependencies are pushing the services toward monoliths. The standalone model-execution jar should run locally, on Spark, and on Cloud Foundry.
The recommendations were to:
● Eliminate the core service library sharing between microservices
● Decouple model execution in app from Hadoop and Spark to decompose dark mode functionality
● Reduce exceptions and errors at startup, and reduce startup time to < 30s
● Use BOSH DNS to remove SCS overhead
Architecture
Is the architecture tightly coupled? Are Microservices too fine grained? Is the architecture adding technical debt? Is the architecture tending in the right direction? Can it be extended easily?
The assessment of Architecture uncovered that our microservices were at the right level of granularity; however, there was tight coupling between them due to sharing of the core framework library, which bloated our application through the incorporation of unnecessary dependency components …
Change Management
How does feature development work? What changes need to be made to the architecture and code for sustainability and evolution along the right dimensions? What are the top 3 things needed to bring the code and design into alignment with design principles?
Assessment of Change Management …
Platform as a Product
The platform’s capabilities change in response to the needs of its users. It is treated as a product that is inclusive of not only Pivotal Platform but all the services and integrations that make it a viable environment for applications to run.
Assessment of Platform as a Product reveals that …
Balanced Team
The platform team consists of a product manager and at least two platform engineers with a combination of infrastructure and software engineering skills. Does the team have all the tools and workstation infrastructure it needs to perform at a high velocity?
Process and Path to Production
Developers are able to take full advantage of the platform via modern and optimized tools and processes. Do DevOps and CI/CD follow the right set of processes? How is code promoted across environments?
We made great progress in achieving the objectives that we set at the beginning of the two-week engagement. I would like to highlight some of the achievements we accomplished from the 360 Degree Health Assessment:
(2) Our team has started doing local profiling of the application from a startup, CPU, and latency perspective, using VisualVM and JMeter, before deploying to the cloud for performance testing
(3 and 4) We resolved the performance mystery from the production outage by implementing a managed-thread strategy and right-sizing our thread pool settings
(5) We got consistent performance testing results by running in an isolation segment setup
(6) We demonstrated that the app can scale under sustained load while keeping response times under the SLO
(7) We reduced the overall application size and improved startup time by 50% by reducing classpath bloat, removing unnecessary exceptions and errors, and pruning pom.xml
Understanding what users want from your service helps inform SLIs.
Be careful not to select too many SLIs, or you will lose focus on what users really care about.
The cost of increasing reliability is two-fold:
● The cost of extra hardware, software, and licenses (for redundancy)
● The opportunity cost of not working on new features