Example Scenario - Actor
"Ron Lew, a site reliability engineer, gets paged during a
deployment of a web application. On investigating the logs, he
sees a mini outage during the deployment, when certain users of
the website saw an error message. This is the third time in a
week that the monitoring service has paged him during a deployment.
He is anxious about the application's health and notices an
unusual number of errors in the logs, with no pattern to them,
but these errors did not stop the application from getting built
and deployed. His immediate concern is to stop these outages.
He suspects that the internal errors are causing the outages
during deployment."
Example Scenario 2 - Task flow
"A massive earthquake strikes the city of San Francisco. The bay
area headquarters located in the city is badly damaged, and there
is a power outage. The secondary systems have not come on, and
there is chaos all over. The incident managers on-site in San
Francisco cannot access any of the systems because of damage to
the networking infrastructure. The incident managers in Austin,
TX, get notified of the disruption when multiple systems go down.
In Austin, the incident managers send a notification to the mobile
devices of all employees in San Francisco to confirm their safety.
The system does not log any responses. It has been 24 hours, and
there are still no answers from San Francisco personnel. The ops team
in Austin waits for emergency responders and is glued to the TV."
● Enable mission critical systems, processes and services to
failover to a redundant system in case of disruption or a
● Enable incident management by providing an interface to
coordinate and manage an incident
● Provide accurate and timely information to the employees
● Determine the whereabouts of missing employees and
● Coordinate with local emergency response organizations
So What? Many flavors of a problem
● Customer data loss
● Critical System Down
● Earthquake in San Francisco
● We need to integrate this new product with ours. They wrote
their code in Sumerian
● Loss of power and network
● Too many users!
● How do I find all the comms where I was tagged?
● "How do we automate this thing?"
● "How can we make this process faster and more efficient?"
● "How the hell do we increase confidence in the data we are
● "How do we make our site faster and more performant?"
● "How do we help the on-call engineer better diagnose
● "How do we make this self-serve?"