20. You don’t have to throw everything away to start over again.
21. App 2.0: under the hood
Legacy system
• 4-6 weeks for new infrastructure
• 4+* hour maintenance windows
• Sticky sessions
• Not mobile friendly
App 2.0
• 30 minutes for a new virtual data center
• Zero downtime deploys
• Stateless servers
• Responsive
• Throttled roll-out
22. “App 2.0 went into production 30 days early so we could see it in the wild during special enrollment. It was very successful.”
CMS staff, OIG report on HealthCare.gov
23. Immutable infrastructure
Bake binaries + code into image
Auto-scaling
Rollbacks
Generalized infrastructure
Reuse code across apps / microservices
CoreVPC: PaaS for government
30. “7,754 transactions per second was attained for one hour with 128 millisecond response times and zero errors.”
Brendan Neutra, Scaling HealthCare.gov to a Billion Users
Introduction
Rohan, CEO of Nava
Thank you for having me here...
Today I’ll be sharing with you some of Nava’s experiences with rebuilding HealthCare.gov, and what happens after a crisis
This is us. Well, some of us.
Nava is a public benefit corporation of 20 people headquartered in Washington DC, with offices in San Francisco and New York
Our mission is to “Radically improve how our government serves people”
Navigating a crisis
Nava was born out of the effort to move beyond a HealthCare.gov that was merely stabilized; our goal was to build our way out of trouble, towards a service that was robust, flexible, and a model for user-centered design
Crises can often be pivotal moments, and navigating those periods is not easy. There are no easy answers, no quick fixes.
With the intense, acute focus on the crisis, larger questions can fall away.
What happens after
And so what I want to talk about is what happens after the crisis
After the bright lights have dimmed a bit and the adrenaline is gone
How do we build a better future? How do we start looking further out?
How do we build lasting change? What kind of strategies and tactics get us there?
At Nava, we spend a lot of time on this question
And we’ll talk about some of our experiences and strategies we’ve used, that you may find helpful
What’s more important is not the strategies and tactics I talk about today, but that you ask these sorts of questions about the organizations and systems that you’re working on
And I want to acknowledge that this question is incredibly broad, even within our world of government digital services: the answer spans policy, procurement practices, culture, product practices, politics, and more
Where we are now
HealthCare.gov is on solid ground
Hard work across Centers for Medicare and Medicaid Services, etc.
Call center representatives
In the process, Nava has begun to help other agencies
Similar stories everywhere
At SSA...At Census...federal level...state level...
We’ve begun to see a pattern in the complexity after a crisis – folks feel caught, trapped…
I’m here to say that in that complexity, there is a huge amount of hope for creating firm, future-facing foundations
First things first – where do you start?
Folks that we talk to who are responsible for large, complex, government programs feel caught:
Caught between a rock and commercial off the shelf software
"we tried custom – it didn't work, it's complex, it's costly, it's insecure, we got locked into one vendor"
"we are trying COTS, we can't get the product to change for what we need it to do"
Going incremental with legacy systems? Takes forever, costs a lot
Start over? Low success rate, throw away lots of investments that are working
And what Nava has found is that
We can modernize complex architecture and systems without throwing away all your existing investments
These struggles are shared:
Policy people working on legislation for decades watch the implementation fall flat on its face without any way to steer
Folks on the IT side within gov struggling with requirements…
Auditing legacy systems only to find a fractal-like complexity
CIOs guide modern RFPs but contracting officers are unable to distinguish true skill from skillful RFP responses…
Meanwhile, the systems causing issues become more and more urgent
The feeling of having no good options
The only way to start over is to throw everything away or do something so marginalized it has effectively zero impact
Or just not knowing where the inefficiencies in the entire lifecycle are, only knowing costs in terms of production (lines of code, etc.)
A playbook for building the future
I’m going to share our strategies and some tactical examples at HealthCare.gov about how to move from crisis management to firm foundations
At a high level, it’s not rocket science
The first thing we do is close the distance
Government work is made more complicated by the complexity of the organization: multiple contractors
Combat that by getting really close in your collaboration
Product, development, design, requirements: tight loops of agile, research, prototyping
What you’re searching for is to adapt quickly, as a substitute for knowing the truth upfront.
Agile isn’t just a label, it’s a way of responding to changing requirements
“For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.” Richard Feynman (on the Challenger explosion)
Cut through to the core of user needs, requirements
[sha’s note from rohan speaking: closing the distance is a way of managing complexity, closing the distance across multiple dimensions is very important]
In the last two years, Nava has worked together with CMS to redesign the health insurance application from top to bottom – from the infrastructure all the way up to the user experience
Working directly with requirements people let us find simple paths through the legislation
76 screens → 16 screens
Call center complaints down
Fully responsive design
Big deal nowadays - 20% of traffic on mobile
And a big portion of the population doesn’t have access to a computer but they still have a smartphone for their work
Conversion rate 65% → 95%
We lose 1/7 of the people we used to lose (drop-off went from 35% to 5%): a 7x improvement in the number of people who give up and pull out their hair
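The 7x figure follows from the conversion rates above. A minimal sketch of the arithmetic (the function name is just for illustration, and it assumes "people we lose" means everyone who does not convert):

```python
from fractions import Fraction


def drop_off_improvement(old_conversion_pct: int, new_conversion_pct: int) -> Fraction:
    """Return how many times smaller the drop-off rate became."""
    old_drop = Fraction(100 - old_conversion_pct, 100)  # 35% of people lost at 65% conversion
    new_drop = Fraction(100 - new_conversion_pct, 100)  # 5% of people lost at 95% conversion
    return old_drop / new_drop


# Conversion rate 65% -> 95%, as quoted in the notes.
improvement = drop_off_improvement(65, 95)
print(improvement)  # 7
```

Exact fractions are used instead of floats so the ratio comes out as a clean 7.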
Time to completion 21 min → 9 min
Faster than Geico
Build proof of concepts to answer doubts
Give people something to react to
Working software >> excel or word
Something is wrong when:
“We can start once we document the requirements”
“We’ll have that in 6 months”
“That will take 10,000 additional hours”
This one is close to my heart
And where complexity becomes opportunity in a concrete sense
Find the threads to pull on that let you move through the problem one meaningful step at a time
Business process seam, technical seam, use case seam
Soft launching
[not sure if this quote is good] http://www.shirky.com/weblog/2013/11/healthcare-gov-and-the-gulf-between-planning-and-reality/ “It is hard for policy people to imagine that Healthcare.gov could have had a phased rollout, even while it is having one.” - Clay Shirky
Find the seam
Create account page -> SLS -> account flows
You can use what you have, you don’t have to throw everything away to start over
App 2.0
65% of the most common case of users
Slow rollout: 0%, 1%, 10%, 50%, 100%
A/B testing
Strangler pattern
SLS
Hot swap at the interface
Data migration
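To make "strangler pattern" and "hot swap at the interface" concrete, here is a minimal sketch. It is not the actual HealthCare.gov code; every class and method name below is hypothetical. Callers talk only to a facade, so individual flows can be moved from the legacy backend to the replacement one at a time:

```python
from typing import Protocol


class AccountService(Protocol):
    """The seam: the one interface both backends agree on (hypothetical)."""

    def create_account(self, email: str) -> str: ...


class LegacyAccounts:
    def create_account(self, email: str) -> str:
        return f"legacy:{email}"


class NewAccounts:
    def create_account(self, email: str) -> str:
        return f"new:{email}"


class AccountFacade:
    """Callers depend on this facade only, so the backend can be swapped behind it."""

    def __init__(self, legacy: AccountService, replacement: AccountService) -> None:
        self._legacy = legacy
        self._replacement = replacement
        self._migrated = set()  # names of flows already moved to the replacement

    def migrate(self, flow: str) -> None:
        """Mark one flow as served by the replacement from now on."""
        self._migrated.add(flow)

    def create_account(self, email: str) -> str:
        if "create_account" in self._migrated:
            return self._replacement.create_account(email)
        return self._legacy.create_account(email)


facade = AccountFacade(LegacyAccounts(), NewAccounts())
before = facade.create_account("a@example.com")  # served by the legacy system
facade.migrate("create_account")                 # hot swap behind the interface
after = facade.create_account("a@example.com")   # served by the replacement
```

The legacy system keeps running untouched for every flow that has not been migrated, which is the point: nothing has to be thrown away up front.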
65% of consumers: found the seam
Throttled the rollout so that we could gradually go from 0% to 1% to 10% and eventually to 100%, with minimal risk
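One common way to throttle a rollout like this is to hash each user into a stable bucket and compare it with the current percentage. This is a sketch under that assumption, not a description of how App 2.0 actually did it; the function names and the user ID format are made up:

```python
import hashlib

# Rollout stages quoted in the notes.
ROLLOUT_STAGES = [0, 1, 10, 50, 100]


def bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def use_new_app(user_id: str, rollout_percent: int) -> bool:
    """Send the user to the new application if their bucket is under the dial."""
    return bucket(user_id) < rollout_percent


for percent in ROLLOUT_STAGES:
    share = sum(use_new_app(f"user-{n}", percent) for n in range(10_000)) / 10_000
    print(f"{percent:>3}% dial -> {share:.1%} of sample users on the new app")
```

Because the bucket depends only on the user ID, a user who gets the new app at 10% still gets it at 50%, and turning the dial back to 0% is an immediate rollback.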
Found the other seam at the DevOps layer, built our own PaaS
Made sure that our work here was simultaneously government compliant, and fully modern
Reliable, scalable, secure, flexible
Ultimately, these investments result in UX improvements, even if infrastructure doesn’t seem to directly affect users
[Find the investments that start paying off immediately]
Set records
SLS supporting 20 million accounts, great
Set the right metrics
Planned downtime is still downtime
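A small sketch of why the metric matters. The numbers are hypothetical (a 30-day month with a single 4-hour maintenance window and no unplanned outages), and the function name is for illustration only:

```python
def availability(period_hours: float, unplanned_hours: float, planned_hours: float = 0.0) -> float:
    """Fraction of the period in which users could actually reach the service."""
    return (period_hours - unplanned_hours - planned_hours) / period_hours


MONTH_HOURS = 30 * 24  # hypothetical 30-day month

# Metric that ignores maintenance windows: looks perfect.
reported = availability(MONTH_HOURS, unplanned_hours=0.0)

# Metric that counts the 4-hour window: what users experienced.
experienced = availability(MONTH_HOURS, unplanned_hours=0.0, planned_hours=4.0)

print(f"ignoring planned downtime: {reported:.3%}")
print(f"counting planned downtime: {experienced:.3%}")
```

If the metric excludes maintenance windows, there is no measured benefit to zero downtime deploys; counting them makes the improvement visible.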
How far can we push this?
Thinking proactively to see how systems will break
Billion user load test
Drop-in replacement
1/100th the cost, 1000x performance
Backend system, not sexy, critical to user experience
How far can we push this?
7754 tps was attained for 1 hour with the 90th percentile at 128ms and zero errors.
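For scale, a sketch of what those numbers mean. The total is plain arithmetic on the quoted figures; the percentile helper uses the nearest-rank definition, which is an assumption, since the notes do not say how the load test tool computed its 90th percentile:

```python
import math
from typing import Sequence


def total_transactions(tps: int, seconds: int) -> int:
    """Transactions completed at a sustained rate over a duration."""
    return tps * seconds


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# 7754 tps sustained for one hour, as quoted above.
print(total_transactions(7754, 3600))  # 27914400

# Made-up response times in milliseconds, only to show the helper in use.
sample_ms = [90, 95, 100, 101, 105, 110, 112, 120, 128, 400]
print(percentile(sample_ms, 90))  # 128
```

A 90th percentile of 128ms means nine out of ten requests finished in 128ms or less; the slowest tenth is not described by that number.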
Build ambition
Going up the hierarchy of needs (Program Integrity to Futureproofing to Prototyping)
Magic: emergent effects
Build new foundations
This is a dramatic time in IT modernization efforts
This is a moment for gov to lead in terms of scalability, security, privacy, etc.
Promote good behavior; don’t police bad behavior
Celebrate successes, learn from failures