During our two-week engagement with Pivotal, we started with the Discovery and Framing phase, in which we:
1. Created Story Backlogs prioritized by the goals & activities (shown in the top-right corner)
2. Performed an Architecture Retrospective and assessed the application architecture holistically
3. Performed Fishbone Analysis to identify failure modes between microservices and queues, and failure points for different applications (shown in the bottom-right corner)
4. Performed Platform Health Check activities, which included reviewing the application's current RabbitMQ implementation against best practices, reviewing the autoscaling policies, and reviewing environment-specific differences
5. Performed an Application Source Code Review and evaluated the upgrade impact of Spring Boot 2.0, Java 11, and dependencies
6. Defined a Performance Plan and Isolation Segment setup to:
● Address inconsistent response times for the apps
● Determine the root cause of performance drift
● Understand differing CPU utilization across PCF instances
7. Discussed how we can leverage the Pivotal Platform Metrics Dashboard for application monitoring
A 360-degree health assessment of our application revealed many interesting observations about our applications and platform, identified several key risks, and produced actionable recommendations from Pivotal Solution Architects. Instead of going over each Application Health Check dimension, I will focus on a few of them in the interest of time.
First, I would like to talk about the Failure Mode Analysis dimension.
In this dimension, we performed failure mode testing in which we were able to reproduce the issue we faced in production: thread pool exhaustion caused by resource contention, leading to high CPU.
The risks identified for this dimension are that we would have to perform chaos testing under high load to reach the break point, that the app cannot tolerate the loss of RabbitMQ and run in degraded mode for an extended period of time, and that latency-based autoscaling in PCF does not work for this application.
The recommendations from Pivotal were to:
● Tune the size of the ForkJoin thread pool to 10
● Cap the ForkJoin queue max depth at 10
● Set the HTTP thread pool size to 100
● Configure autoscaling to be CPU-based [80, 160] with min and max instances set to [1, 10]
● Use CallerRunsPolicy for the ForkJoin pool
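A minimal sketch of what these settings could look like in code is shown below. Because CallerRunsPolicy and a bounded queue are ThreadPoolExecutor concepts (ForkJoinPool does not accept a rejection handler), the sketch uses a ThreadPoolExecutor; the pool and queue sizes mirror the recommendation, but the class and method names are hypothetical:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public final class BoundedPools {
    // Fixed pool of 10 threads over a queue capped at depth 10. When the
    // queue fills up, CallerRunsPolicy makes the submitting thread run the
    // task itself, applying back-pressure instead of letting unbounded work
    // pile up and exhaust the pool under resource contention.
    public static ExecutorService boundedWorkerPool() {
        return new ThreadPoolExecutor(
                10, 10,                        // core == max: fixed pool size of 10
                0L, TimeUnit.MILLISECONDS,     // no keep-alive for core threads
                new ArrayBlockingQueue<>(10),  // queue max depth = 10
                new ThreadPoolExecutor.CallerRunsPolicy());
    }
}
```

The back-pressure from CallerRunsPolicy is what keeps producers from outrunning the pool, which is exactly the exhaustion scenario reproduced during failure mode testing.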
Next, I would like to talk about the Technical Debt & Code Hygiene dimension.
In this dimension, we discovered that the application has a bloated classpath. Microservices are as large as ~250 MB and embed five app servers (Netty, Jetty, Jersey, Spark server, and Tomcat).
The risk is that if time and resources are not spent reducing the number and scope of dependencies, the apps will take longer to start and eventually autoscaling will not work. We also need to speed up the inner loop of development.
The recommendations were to:
● Eliminate Shared Service Library
● Eliminate and prune external dependencies
● Migrate to Spring Boot 2.x and Java 11
● Run and profile apps in a local sandbox with all service dependencies
● Set a threshold on the size of app jars in the CI pipeline to stop third-party library proliferation
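As one hedged illustration of the last point, a small guard run as a CI step could fail the build when the packaged artifact exceeds a size budget; the artifact path and the 100 MB threshold below are placeholders, not values from the engagement:

```java
import java.io.File;

public final class JarSizeGuard {
    // Placeholder budget; tighten it as the classpath is pruned.
    private static final long MAX_BYTES = 100L * 1024 * 1024;

    public static void main(String[] args) {
        // Placeholder artifact path; pass the real jar path from the pipeline.
        File jar = new File(args.length > 0 ? args[0] : "target/app.jar");
        if (!jar.isFile()) {
            throw new IllegalStateException("Artifact not found: " + jar);
        }
        if (jar.length() > MAX_BYTES) {
            throw new IllegalStateException("Jar is " + jar.length()
                    + " bytes, over the " + MAX_BYTES + "-byte budget");
        }
        System.out.println("Jar size OK: " + jar.length() + " bytes");
    }
}
```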
Monitoring and Metrics
Establishing desired service behavior, measuring how the service is actually behaving, and correcting discrepancies.
The assessment of Monitoring & Metrics revealed that we were using too many tools to monitor our applications, which caused confusion when identifying the root cause of an issue. The recommendation was to reduce the number of monitoring tools and to use PCF Metrics along with Dynatrace or a similar Application Performance Monitoring tool for root cause analysis.
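As a minimal sketch of what consolidated instrumentation could look like, assuming Micrometer (the metrics facade that ships with Spring Boot 2) is on the classpath: a custom timer registered once is visible to PCF Metrics and an APM tool alike, so one dashboard can correlate application latency with platform CPU. The metric name and business method here are hypothetical:

```java
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public final class CheckoutMetrics {
    public static void main(String[] args) {
        // Spring Boot would inject its own MeterRegistry; SimpleMeterRegistry
        // keeps this sketch self-contained.
        MeterRegistry registry = new SimpleMeterRegistry();
        Timer checkoutTimer = Timer.builder("checkout.latency") // hypothetical metric name
                .description("End-to-end checkout time")
                .register(registry);

        // Timing the call records one latency sample into the registry.
        checkoutTimer.record(CheckoutMetrics::processCheckout);
        System.out.println("Recorded " + checkoutTimer.count() + " sample(s)");
    }

    private static void processCheckout() {
        // Business logic placeholder.
    }
}
```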
Failure Mode Analysis
Understand the impact of failure of critical external dependencies on the core service. Play out scenarios where there is partial or complete loss of business functionality and plan for appropriate countermeasures.
The assessment of Failure Mode Analysis revealed that we should perform chaos testing under high load to identify failure impact, such as the application being unable to tolerate the loss of RabbitMQ …
Technical Debt
Dependency Management and Library updates within the project. Is there a substantial bloat of libraries and third party dependencies in the project? Where is the technical debt accumulated in the components?
The assessment of Technical Debt revealed that the application has become bloated due to the inclusion of various dependency jars that are not being leveraged …
Emergency Response
Are run books in place to capture the right set of logs when a failure occurs? Does the development team follow a prescribed set of steps to triage and debug a problem in production? Are circuit breakers and other fallbacks in place to revert to a degraded functionality during failure?
The assessment of Emergency Response revealed that we need an automated way of collecting thread and heap dumps when the CPU is experiencing high utilization.
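A hedged sketch of such automation, using only the HotSpot-specific com.sun.management MXBeans, is shown below. The 80% threshold and dump file name are placeholders, and in practice this check would run on a schedule or behind an alert rather than once in main:

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.OperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

public final class HighCpuDumper {
    public static void main(String[] args) throws Exception {
        OperatingSystemMXBean os =
                ManagementFactory.getPlatformMXBean(OperatingSystemMXBean.class);
        // getProcessCpuLoad() returns the JVM's recent CPU usage in [0, 1].
        if (os.getProcessCpuLoad() > 0.8) {
            // Thread dump: ThreadInfo.toString() prints a readable stack trace.
            for (ThreadInfo info : ManagementFactory.getThreadMXBean()
                    .dumpAllThreads(true, true)) {
                System.out.print(info);
            }
            // Heap dump of live objects to an .hprof file for offline analysis.
            ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class)
                    .dumpHeap("high-cpu.hprof", true);
        }
    }
}
```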
Performance Optimization
Are the applications starting slowly? Are applications failing to meet their expected SLAs? Analysis of performance issues ranging from high memory allocation to increased latency and high CPU. Performance test plan evaluation.
The assessment of Performance Optimization revealed that the application was CPU-constrained due to unmanaged threads. Properly sizing the thread pools is necessary to drive performance, along with using the correct garbage collector. Local profiling of the application is very important for understanding thread utilization, which can be done with VisualVM and JMeter.
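One small habit that makes that profiling easier, offered here as a sketch rather than as what the team actually did: give every pool a named ThreadFactory, so its threads can be picked out in VisualVM's Threads tab during a JMeter run. The "pricing-worker" prefix is made up:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public final class NamedThreadFactory implements ThreadFactory {
    private final String prefix;
    private final AtomicInteger counter = new AtomicInteger();

    public NamedThreadFactory(String prefix) {
        this.prefix = prefix;
    }

    @Override
    public Thread newThread(Runnable task) {
        // "prefix-N" names make each pool's threads identifiable in a
        // VisualVM thread dump instead of anonymous "pool-1-thread-N".
        Thread t = new Thread(task, prefix + "-" + counter.incrementAndGet());
        t.setDaemon(true);
        return t;
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(
                10, new NamedThreadFactory("pricing-worker"));
        pool.submit(() -> System.out.println(Thread.currentThread().getName()));
        pool.shutdown();
    }
}
```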
The next dimension I would like to focus on is the Architecture dimension.
This dimension reveals that our microservices are at the right level of granularity; however, there is tight coupling, and unnecessary big-data dependencies are present in the code.
The risk identified is considerable sharing of service libraries between microservices, leading to tight coupling: the shared service library is a monolith that is dragged into each service, and big-data dependencies are pushing the services toward monoliths. The standalone model-execution jar should run locally, on Spark, and on Cloud Foundry.
The recommendations were to:
● Eliminate the core service library sharing between microservices
● Decouple model execution in app from Hadoop and Spark to decompose dark mode functionality
● Reduce exceptions and errors at startup, and reduce startup time to < 30s
● Use BOSH DNS to remove SCS overhead
Architecture
Is the architecture tightly coupled? Are Microservices too fine grained? Is the architecture adding technical debt? Is the architecture tending in the right direction? Can it be extended easily?
The assessment of Architecture uncovered that our microservices were at the right level of granularity; however, there was tight coupling between them due to sharing of the core framework library, which bloated our application through the incorporation of unnecessary dependency components …
Change Management
How does feature development work? What changes need to be made to the architecture and code for sustainability and evolution along the right dimensions? What are the top 3 things needed to bring the code and design into alignment with design principles?
Assessment of Change Management …
Platform as a Product
The platform’s capabilities change in response to the needs of its users. It is treated as a product that is inclusive of not only Pivotal Platform but all the services and integrations that make it a viable environment for applications to run.
Assessment of Platform as a Product reveals that …
Balanced Team
The platform team consists of a product manager and at least two platform engineers with a combination of infrastructure and software engineering skills. Does the team have all the tools and workstation infrastructure it needs to perform at a high velocity?
Process and Path to Production
Developers are able to take full advantage of the platform via modern and optimized tools and processes. Do DevOps and CI/CD follow the right set of processes? How is code promoted across environments?
We made great progress in achieving the objectives that we set at the beginning of the two-week engagement. I would like to highlight some of the achievements we accomplished from the 360 Degree Health Assessment:
(2) Our team has started doing local profiling of the application from a startup, CPU, and latency perspective, using VisualVM and JMeter, before deploying to the cloud for performance testing
(3 and 4) We resolved the performance mystery from the production outage by implementing a managed-thread strategy and right-sizing our thread pool settings
(5) We got consistent performance testing results by running in an isolation segment setup
(6) We demonstrated that the app can scale under sustained load while keeping response times under the SLO
(7) We reduced the overall application size and improved startup time by 50% by reducing classpath bloat, removing unnecessary exceptions and errors, and pruning pom.xml
Understanding what users want from your service helps inform SLIs.
Be careful not to select too many SLIs, or you will lose focus on what users really care about.
The cost of increasing reliability is two-fold:
● The cost of extra hardware, software, and licenses (for redundancy)
● The opportunity cost of not working on new features