Presentation at the International Industry-Academia Workshop on Cloud Reliability and Resilience. 7-8 November 2016, Berlin, Germany.
Organized by EIT Digital and Huawei GRC, Germany.
Failures happen. Building resilient cloud infrastructure requires an end-to-end automated approach to failure remediation. This approach must go beyond the current DevOps model of monitoring the system and alerting engineers when a failure condition occurs.
Recently, event-driven automation and workflows have re-emerged as a way to automate troubleshooting, remediation, and a variety of Day-2 operations. Facebook famously uses FBAR to "save 16,000 engineer-hours, a day, in ops". Similar approaches have been reported by other hyper-scale cloud providers. Open-source auto-remediation platforms like StackStorm are replacing legacy runbook automation products, and have been used successfully to automate applications, networks, security, and cloud infrastructure.
In this presentation we give a brief history of workflow automation, outline the common architectural components of a typical event-driven automation framework, compare and contrast alternative approaches to Day-2 automation, and, most importantly, share real-world use cases and examples of applying event-driven automation in operations.