SpringOne Platform 2019
Session Title: Platform Health Assessment at Department of Homeland Security Citizenship and Immigration Services
Speakers: Chris Saunders, Platform Architect Manager, Pivotal; Kelly Walsh, Engagement Director, Pivotal; Paul Beccio, Developer, DHS USCIS
YouTube: https://youtu.be/LZsqqSH9VbI
8. Strong Platform enables Mission Outcomes
Reduced Processing Time
Increased Integrity and Safeguarding
Operational Efficiencies
Team Responsiveness
9. State of the Platform in Fall 2018
Matrixed Ops Team with Shared Responsibilities
Firefighting!
Behind on Platform Upgrades
Lack of Platform Monitoring
Platform not a Product
18. Health Markers Self-Assessment, 12/13/18
Platform Journey
Platform as a Product: 2.1
Balanced Team: 2.4
Path to Production: 3.1
Monitoring and Metrics: 2.3 (Priority 1)
Capacity Planning: 1.3
Change Management: 2.0
Emergency Response: 2.0 (Priority 2)
Self Service: 3.0
Performance Optimization: 1.7
Business Continuity: 1.3 (Priority 3)
Scale: 1 2 3 4 5 (journey stages: Start, Scale, Ingrain)
19. Priority 1: Monitoring and Metrics
Establishing desired service behavior, measuring how the service is actually behaving, and correcting discrepancies. Examples: response latency, error or unanswered query rate, peak utilization of resources.
Needs work:
● Desired service behavior not tied to specific application needs
● Process for monitoring is unclear/confusing
● We have a sense there are more tools we could be using
Working well:
● PCF 2.2 upgrade getting us closer
● We have had experiences with useful alerts (Redis)
● New Relic integration with Slack is promising
On a 1-5 scale, we give ourselves a: 2.3
Pivotal’s recommendations:
Things we can work on together at no cost:
● Set up a time for Pivotal to host a “Healthwatch 101” class; get to know functionality for metrics and event alerts
● Start a platform team book club to read Google’s SRE book so everyone is on the same page and learning together
● Identify most meaningful metrics and event alerts and start monitoring!
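The definition on this slide (set a desired behavior, measure the actual behavior, flag the discrepancies) can be sketched in a few lines of Python. This is only an illustration: the metric names, limits, and measured values below are hypothetical and do not come from the talk, and a real platform would pull measurements from its monitoring tool rather than a hard-coded dictionary.

```python
from dataclasses import dataclass


@dataclass
class Target:
    """Desired behavior for one metric (all names and values are hypothetical)."""
    name: str
    limit: float  # measured value must stay at or below this


def find_discrepancies(targets, measured):
    """Return (name, measured, limit) for every metric that misses its target."""
    discrepancies = []
    for target in targets:
        value = measured.get(target.name)
        if value is None:
            # A metric nobody measures is itself a monitoring gap.
            discrepancies.append((target.name, None, target.limit))
        elif value > target.limit:
            discrepancies.append((target.name, value, target.limit))
    return discrepancies


# Illustrative targets for the three example metrics named on the slide.
targets = [
    Target("p95_latency_ms", 500.0),
    Target("error_rate", 0.01),
    Target("peak_cpu_utilization", 0.80),
]

# Illustrative measurements; peak_cpu_utilization is deliberately missing.
measured = {"p95_latency_ms": 620.0, "error_rate": 0.004}

for name, value, limit in find_discrepancies(targets, measured):
    print(f"{name}: measured={value} limit={limit}")
```

Treating an unmeasured metric as a discrepancy is a design choice in this sketch; it makes "we are not monitoring this yet" visible alongside actual threshold breaches.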
20. Priority 2: Emergency Response
Noticing and responding effectively to service failures in order to preserve the service's conformance to SLA. Examples: on-call rotations, prober, dip detection, primary/secondary/escalation, playbooks, wheel of misfortune, prod VPN rooms.
Needs work:
● No engineer on call for ops
● Unclear who on the team should respond to emergencies
● No DR/COOP plan
Working well:
● The team has internal and Pivotal resources they can reach out to if something goes wrong
● Pivotal SLAs have been helpful
● Wiki docs + an enthusiastic team!
On a 1-5 scale, we give ourselves a: 2.0
Pivotal’s recommendations:
Things we can work on together at no cost:
● Pivotal can provide example playbooks and SLAs from other clients who have successfully tackled Emergency Response
● Establish priority (T1, T2, etc.) among applications in the DID app portfolio
● Create a first-draft Emergency Response document (the Google SRE book is a helpful guide)
● Establish regular cadence with developer teams to anticipate ER needs
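One of the gaps on this slide is that nobody is on call and it is unclear who responds. A minimal sketch of a weekly primary/secondary rotation is shown below, assuming a simple round-robin; the roster names, start date, and weekly cadence are hypothetical and not taken from the talk.

```python
from datetime import date, timedelta


def on_call_schedule(engineers, start, weeks):
    """Yield (week_start, primary, secondary) for a weekly round-robin rotation.

    The secondary for a given week is the engineer who becomes primary the
    following week.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary/secondary")
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        yield start + timedelta(weeks=week), primary, secondary


# Hypothetical roster and start date, for illustration only.
roster = ["engineer_a", "engineer_b", "engineer_c"]
for week_start, primary, secondary in on_call_schedule(roster, date(2019, 1, 7), 4):
    print(week_start.isoformat(), "primary:", primary, "secondary:", secondary)
```

Making next week's primary this week's secondary is one possible convention; it gives each engineer a week of backup duty before they take the lead.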
21. Priority 3: Business Continuity
Treat the platform like critical infrastructure with published RPO/RTOs that satisfy business requirements. Proving viability of disaster recovery plans by restoring platforms to a known good state through the use of automation.
Needs work:
● Right now, we have to rebuild manually in order to recover
● We probably take it for granted that PCF is always reliable, so there is a lot we haven’t covered on our end
Working well:
● Concourse has been great
● PCF automates a lot of the work required to rebuild
On a 1-5 scale, we give ourselves a: 1.3
Pivotal’s recommendations:
Things we can work on together at no cost:
● Interview application dev teams to understand DR needs
● Consider setting up an active-active architecture
● Validate backups, consider exercising DR with sandbox environments
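To make the RPO/RTO idea on this slide concrete, here is a small sketch that compares a backup cadence and a rehearsed restore time against published objectives. All figures are hypothetical; real RPO/RTO values would come from the business requirements, and the restore time from an actual DR exercise.

```python
from datetime import timedelta


def meets_objectives(backup_interval, measured_restore_time, rpo, rto):
    """Check a backup cadence and a rehearsed restore time against RPO/RTO.

    Worst-case data loss is approximated by the interval between backups;
    recovery time is whatever the last restore exercise actually took.
    """
    return {
        "rpo_met": backup_interval <= rpo,
        "rto_met": measured_restore_time <= rto,
    }


# Hypothetical figures, for illustration only.
result = meets_objectives(
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=3),
    rpo=timedelta(hours=4),
    rto=timedelta(hours=8),
)
print(result)
```

In this example the restore is fast enough, but daily backups cannot satisfy a 4-hour RPO, which is the kind of gap a DR exercise in a sandbox environment is meant to surface.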
22. 2nd Health Markers Self-Assessment, 10/2/19
Platform Journey
Platform as a Product: 2.1 (Priority 1)
Balanced Team: 2.4
Path to Production: 3.1 (Priority 2)
Monitoring and Metrics: 2.3
Capacity Planning: 1.3
Platform Upgrade Engine: 2.0
Emergency Response: 2.0 (Priority 3)
Self Service: 3.0
Performance Optimization: 1.7
Business Continuity: 1.3
Scale: 1 2 3 4 5 (journey stages: Start, Scale, Ingrain)
Assessment dates shown: 12/13/18 and 10/02/19
23. The Alliance - Pivotal Platform Value Metrics
Speed
● Daily releases to production
● Deployment timelines for new features <10 minutes
● Developer self services (Redis, Push)
Stability / Scalability
● 300+ containers across environments
● Business critical systems running within PCF (Global)
● < 2 mins to scale applications
Security
● 0 hours spent planning security and patching
● 44 CVEs resolved YTD with no downtime
● 100% of workloads created from security-approved buildpacks
● 200+ VMs replaced and hardened with latest security patches through automated pipelines
Savings
● 30+ year old mainframe application decommissioned
● Mainframe computers turned off, saving $10,000,000/year
● 0 minutes downtime during patching & routine maintenance
● 3 foundations
● 100% of platform life-cycle driven by automated pipelines and Infrastructure as Code (Q4)