Platform Health Assessment at Department of Homeland Security Citizenship and Immigration Services

October 7–10, 2019
Austin Convention Center

Unless otherwise indicated, these slides are © 2013-2019 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/
Hello!
Matt Dosberg, Chief, DID(it)
Loves mission, progressive tech, and working with talented team
Hands full with two small kids
Paul Beccio, SRE USCIS
Chris Saunders, Platform Architect Manager For Federal
Successful Customers Matter
Kelly Walsh, Engagement Director for New England Region

USCIS Mission

U.S. Citizenship & Immigration Services
U.S. Citizenship and Immigration
Services administers the nation’s
lawful immigration system,
safeguarding its integrity and promise
by efficiently and fairly adjudicating
requests for immigration benefits
while protecting Americans, securing
the homeland, and honoring our
values

Fiscal Year 2018 Snapshot
8.7 Million Receipts
757,000 Naturalizations
1.1 Million Lawful Permanent Residents
2.1 Million Employment Authorization Cards
37 Million New Hires Verified

Asylum Program
Every year people come to the United
States seeking protection because they
have suffered persecution or fear that
they will suffer persecution due to:
- Race
- Religion
- Nationality
- Membership in a particular social group
- Political opinion
Types of Asylum
Affirmative
Defensive
Credible Fear
Reasonable Fear

Digital Innovation & Development - DID(it)
Provides cutting edge Application
Development and associated IT
DevSecOps services through the
application of best of breed Agile and
Lean Startup best practices to support
web and mobile-based application
development of existing and new
systems based upon a mix of legacy
and emerging technologies

Strong Platform enables Mission Outcomes
Reduced Processing Time
Increased Integrity and Safeguarding
Operational Efficiencies
Team Responsiveness

State of the Platform in Fall 2018
Matrixed Ops Team with Shared Responsibilities
Firefighting!
Behind on Platform Upgrades
Lack of Platform Monitoring
Platform not a Product

Journey Health Markers

Customer Journeys
Stages: Start, Ingrain, Scale
Platform:
- Launch Platform Capability
- Extend the Platform as a Product
- Stay on Track with Continuous Improvement
Application (new and existing):
- Replatform and Modernize
- Construct and Begin Enterprise AppTX Plan
- Execute on AppTX Plan
- Tackle a Meaningful Problem with Custom Software
- Enable Practice Leaders
- Foster New Culture and Continue Learning
Cost Savings, Faster Releases, Reduce Risk, Stable Software, Get to Scale

Journey Health Markers are how we find out where a client
is on the journey (and focus on what they need to do next)

Platform Maturity Matrix Dimensions
Balanced Team
Business Continuity
Platform as a Product
Path to Production
Performance Optimization
Monitoring and Metrics
Capacity Planning
Platform Update Engine
Emergency Response
Self-Service

Monitoring and Metrics
Establishing desired service behavior, measuring how the service is actually behaving, and correcting discrepancies. Examples: response latency, error or unanswered query rate, peak utilization of resources
1 – Chaotic: No service level indicators or objectives (SLI/SLO) defined. No automated monitoring. Users report issues, and application / platform teams are not aware of the issues until they are reported.
2 – Defined: Clearly defined ownership of monitoring. Some SLI/SLO definition, but monitoring solutions don't provide clear optics to all the right things. Basic platform metrics are being sent to something like ELK, Splunk or Prometheus.
3 – Managed: Monitoring provides visibility and appropriate alerts are sent when defined thresholds are met.
4 – Measured: Monitoring and alerting strategies are adjusted as a response to violations of SLOs. They incorporate and adapt to customer feedback on a regular basis.
5 – Continuous Improvement: The team iterates on new monitoring graphs, and continually and proactively tweaks the alerting strategy to align with SLOs and minimize false alerts.
© Copyright 2018–19 Pivotal Software, Inc. All Rights Reserved

Facilitate Introduce Assess Recommend

Clients

CIS Health Markers

Health Markers Self-Assessment, 12/13/18
Platform Journey (1-5 scale; journey stages: Start, Ingrain, Scale)
Platform as a Product: 2.1
Balanced Team: 2.4
Path to Production: 3.1
Monitoring and Metrics: 2.3 (Priority 1)
Capacity Planning: 1.3
Change Management: 2.0
Emergency Response: 2.0 (Priority 2)
Self Service: 3.0
Performance Optimization: 1.7
Business Continuity: 1.3 (Priority 3)

Priority 1: Monitoring
and Metrics
Establishing desired service behavior,
measuring how the service is actually
behaving, and correcting
discrepancies. Examples: response
latency, error or unanswered query
rate, peak utilization of resources
Needs work:
● Desired service behavior not tied to
specific application needs
● Process for monitoring is
unclear/confusing
● We have a sense there are more
tools we could be using
Working well:
● PCF 2.2 upgrade is getting us closer
● We have had experiences with useful
alerts (Redis)
● New Relic Integration with Slack is
promising
On a 1-5 scale, we give ourselves a:
2.3
Pivotal’s recommendations:
Things we can work on together at no cost:
● Set up a time for Pivotal to host a “Healthwatch 101” class; get to know functionality for
metrics and event alerts
● Start a platform team book club to read Google’s SRE book so everyone is on the same
page and learning together
● Identify most meaningful metrics and event alerts and start monitoring!
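
The SLI/SLO idea on this slide (define desired behavior, measure actual behavior, flag discrepancies) can be illustrated with a small check. This is a minimal Python sketch, not the team's tooling: the `Slo` class, the `slo_violations` function, and the threshold values are hypothetical, and in practice the samples would come from whatever metrics source the platform exposes.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Slo:
    """Hypothetical service level objectives (illustrative thresholds)."""
    max_error_rate: float       # allowed fraction of failed requests
    max_p95_latency_ms: float   # ceiling for 95th-percentile latency


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_violations(latencies_ms: Sequence[float], failed: int, total: int,
                   slo: Slo) -> List[str]:
    """Compare measured SLIs with the SLO and describe each discrepancy."""
    violations = []
    error_rate = failed / total if total else 0.0
    if error_rate > slo.max_error_rate:
        violations.append(
            f"error rate {error_rate:.1%} exceeds {slo.max_error_rate:.1%}")
    if latencies_ms:
        p95 = percentile(latencies_ms, 95)
        if p95 > slo.max_p95_latency_ms:
            violations.append(
                f"p95 latency {p95:.0f} ms exceeds "
                f"{slo.max_p95_latency_ms:.0f} ms")
    return violations


if __name__ == "__main__":
    # Made-up samples: one slow request and a 3% failure rate.
    slo = Slo(max_error_rate=0.01, max_p95_latency_ms=500)
    samples = [120, 130, 110, 900, 140, 150, 125, 135, 145, 115]
    for line in slo_violations(samples, failed=3, total=100, slo=slo):
        print("ALERT:", line)
```

An empty result means the measured behavior is within the objectives; each returned string is a candidate for an event alert.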

Priority 2: Emergency
Response
Noticing and responding effectively to
service failures in order to preserve
the service's conformance to SLA.
Examples: on-call rotations, prober,
dip detection,
primary/secondary/escalation,
playbooks, wheel of misfortune, prod
VPN rooms
Needs work:
● No engineer on call for ops
● Unclear who on the team should
respond to emergencies
● No DR/COOP plan
Working well:
● The team has internal and Pivotal
resources they can reach out to if
something goes wrong
● Pivotal SLAs have been helpful
● Wiki docs + an enthusiastic team!
On a 1-5 scale, we give ourselves a:
2.0
Pivotal’s recommendations:
● Pivotal can provide example playbooks and SLAs from other clients who have successfully
tackled Emergency Response
● Establish priority (T1, T2, etc) among applications in the DID app portfolio
● Create a first-draft Emergency Response document (the Google SRE book is a helpful
guide)
● Establish regular cadence with developer teams to anticipate ER needs
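
The on-call rotation and primary/secondary/escalation examples named on this slide, together with the T1/T2 application tiers, could be sketched as follows. This is a rough Python illustration under stated assumptions: the roster names, `ROTATION_START`, and the acknowledgement windows in `ACK_TIMEOUT_MIN` are all hypothetical, not values from USCIS or Pivotal.

```python
from datetime import date
from typing import Sequence, Tuple

# Hypothetical acknowledgement windows per application tier, in minutes.
ACK_TIMEOUT_MIN = {"T1": 5, "T2": 15}

# Arbitrary Monday used as week zero of the rotation (an assumption).
ROTATION_START = date(2019, 1, 7)


def on_call(roster: Sequence[str], day: date) -> Tuple[str, str]:
    """Return (primary, secondary) for the week containing `day`."""
    if len(roster) < 2:
        raise ValueError("need at least two engineers for primary/secondary")
    week = (day - ROTATION_START).days // 7
    primary = roster[week % len(roster)]
    secondary = roster[(week + 1) % len(roster)]
    return primary, secondary


def page_target(tier: str, minutes_unacked: int, roster: Sequence[str],
                day: date, escalation_contact: str) -> str:
    """Pick who to page: primary first, then secondary, then escalation."""
    primary, secondary = on_call(roster, day)
    chain = [primary, secondary, escalation_contact]
    # Move one step down the chain each time the tier's window elapses.
    step = minutes_unacked // ACK_TIMEOUT_MIN[tier]
    return chain[min(step, len(chain) - 1)]
```

Shorter windows for T1 applications mean an unacknowledged alert reaches the secondary and the escalation contact sooner than it would for a T2 application.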

Priority 3: Business
Continuity
Treat the platform like critical
infrastructure with published
RPO/RTOs that satisfy business
requirements. Proving viability of
disaster recovery plans by restoring
platforms to a known good state
through the use of automation.
Needs work:
● Right now, we have to rebuild
manually in order to recover
● We probably take it for granted that
PCF is ever reliable, so there is a lot
we haven’t covered on our end
Working well:
● Concourse has been great
● PCF automates a lot of the work
required to rebuild
On a 1-5 scale, we give ourselves a:
1.3
Pivotal’s recommendations:
● Interview application dev teams to understand DR needs
● Consider setting up an active-active architecture
● Validate backups, consider exercising DR with sandbox environments
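
Exercising DR in a sandbox only proves something if the result is compared with the published RPO/RTO. The sketch below shows one way to record a drill and check it, assuming three timestamps are captured during the exercise. `DrDrill` and `meets_objectives` are hypothetical names, and the objectives passed in are placeholders rather than real USCIS targets.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict


@dataclass(frozen=True)
class DrDrill:
    """Timestamps recorded during a (hypothetical) sandbox restore drill."""
    last_good_backup: datetime
    failure: datetime
    service_restored: datetime

    @property
    def data_loss_window(self) -> timedelta:
        # Work done after the last good backup is lost on restore.
        return self.failure - self.last_good_backup

    @property
    def recovery_time(self) -> timedelta:
        # Elapsed time from the failure until service is back.
        return self.service_restored - self.failure


def meets_objectives(drill: DrDrill, rpo: timedelta,
                     rto: timedelta) -> Dict[str, bool]:
    """Report whether the drill stayed within the published RPO and RTO."""
    return {
        "rpo_met": drill.data_loss_window <= rpo,
        "rto_met": drill.recovery_time <= rto,
    }
```

A manual rebuild, as described under "Needs work", would typically show up here as a long `recovery_time`, which is the number that automation is meant to shrink.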

2nd Health Markers Self-Assessment, 10/2/19
Platform Journey
Priority 1 Platform as a Product: 2.1
Balanced Team: 2.4
Priority 2 Path to Production: 3.1
Monitoring and Metrics: 2.3
Capacity Planning: 1.3
Platform Upgrade Engine: 2.0
Priority 3 Emergency Response: 2.0
Self Service: 3.0
Performance Optimization: 1.7
Business Continuity: 1.3
Chart axis: 1-5 scale across the Start, Ingrain, and Scale stages; series plotted for 12/13/18 and 10/02/19

The Alliance - Pivotal Platform Value Metrics
Speed
Daily releases to production
Deployment timelines for new features <10 minutes
Developer self services (Redis, Push)
Stability, Scalability
300+ containers across environments
Business critical systems running within PCF (Global)
< 2 mins to scale applications
Security
0 hours spent planning security and patching
44 CVEs resolved YTD with no downtime
100% of workloads created from security approved buildpacks
200+ VMs replaced and hardened with latest security patches through automated pipelines
Savings
30+ year old MainFrame application decommissioned
MainFrame computers turned off saving $10,000,000/year
0 minutes downtime during patching & routine maintenance
3 foundations
100% of platform life-cycle driven by automated pipelines and Infrastructure as Code. (Q4)

We’re Hiring!
Product Managers
Product Designers
Software Engineers
Come help solve complex problems
using progressive tech alongside a very
talented team that supports USCIS
RAIO mission to provide immigration
and humanitarian services for people
who are fleeing oppression,
persecution or torture and facing
urgent humanitarian situations
Want to learn more? Come up and say hello.
Or reach out to Matthew.W.Dosberg@uscis.dhs.gov

Thank you!
