Cloud Resilience
Fault Injection for Increased Resilience
Jorge Cardoso
(jorge.cardoso@huawei.com)
Huawei European Research Center
Riesstraße 25, 80992 München
The Butterfly Effect Project
OpenStack Munich - Cloud Resilience &
Experiences with OpenStack
Wednesday, April 13, 2016
6:30 PM
FAILURES ARE INEVITABLE!
THE BEST WE CAN DO IS BE PREPARED FOR THEM AND LEARN FROM THEM
TEST, REPAIR, LEARN & PREDICT!
Unplanned downtime is caused by*
software bugs … 27%
hardware … 23%
human error … 18%
network failures … 17%
natural disasters … 8%
*Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
Google's 2007 study found the following annualized failure rates (AFRs) for disk drives:
1-year-old drives: 1.7%
3-year-old drives: >8.6%
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proceedings of
the 5th USENIX conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
Why does using a cloud infrastructure require advanced approaches to resiliency?
One reason [Netflix]: it's the lack of control over the underlying hardware, the inability to configure it to try to ensure 100% uptime.
Netflix: Chaos Monkey
Chaos Monkey
Randomly terminates instances in a cluster
Chaos Gorilla
Simulates an Availability Zone becoming unavailable
Chaos Kong
Simulates an entire region outage
Latency Monkey
Introduces latency into network packets to simulate degradation of the EC2 network
Janitor Monkey
Cleans up unused resources
Security Monkey
Analyzes and notifies on security profile changes
AWS recently recommended that firms using its infrastructure test their resilience by using Chaos Monkey to induce failures.
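A minimal sketch of a Chaos Monkey-style experiment against an OpenStack cloud (not Netflix's actual tool; assumes OpenStack CLI credentials are already sourced and that every listed instance is fair game):
$ victim=$(openstack server list -f value -c ID | shuf -n 1)  # pick a random instance
$ echo "Terminating instance $victim"
$ openstack server delete "$victim"                           # terminate it and watch what breaks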
Netflix: Chaos Monkey
April 29, 2011: Amazon EC2 and Amazon RDS service disruption in the US East Region
September 20, 2015: Amazon's DynamoDB service experienced an availability issue in US-EAST-1
Netflix's response: transfer traffic out of the affected region
Result: fewer alerts for the ops team
Amazon AWS: GameDay
A program designed to increase resilience by purposely injecting major failures
Goal: discover flaws and subtle dependencies
"That seems totally bizarre on the face of it, but as you dig down, you end up finding some dependency no one knew about previously […] We've had situations where we brought down a network in, say, São Paulo, only to find that in doing so we broke our links in Mexico."
Google: DiRT (Disaster Recovery Testing)
Annual disaster recovery and testing exercise, now 8 years since inception
A multi-day exercise triggering (controlled) failures in systems and processes
Premise
30-day incapacitation of headquarters following a disaster
Other offices and facilities may be affected
When
"Big disaster": annually, for 3-5 days
Continuous testing: year-round
Who
100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities)
Business units (Human Resources, Finance, Safety, Crisis Response, etc.)
http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
Goal
-- Butterfly Effect System --
Enables automatic testing and repair of OpenStack and cloud applications (from the CLOUD APPLICATION layer down to HUAWEI FusionSphere)
The system works by intentionally injecting different failures, testing the ability to survive them, and learning how to predict and repair failures preemptively: a Failure → Test → Repair cycle.
Use Case: OpenStack Resiliency
Best way to avoid failure: fail constantly. Example fault injections (see the sketch below):
Kill the cinder database (simulate an update failure)
Introduce delay in messages (full-scale traffic shows where the real bottlenecks are)
Operation error: OPENSTACK_KEYSTONE_URL = "http://%s:5000/v2.0" % OPENSTACK_HOST
Operation error: in /etc/nova/nova.conf, delete auth_strategy=keystone
Remove the driver to the hard disk / remove access to NFS (simulate hardware failure)
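A minimal sketch of how such faults can be injected on a test node (assumes a Packstack-style all-in-one host where MariaDB backs the cinder database and eth0 carries the RPC traffic; paths and names are illustrative):
$ systemctl stop mariadb                                   # kill the cinder database
$ tc qdisc add dev eth0 root netem delay 200ms             # introduce delay in messages
$ sed -i '/^auth_strategy=keystone/d' /etc/nova/nova.conf  # operation error: drop auth_strategy
$ tc qdisc del dev eth0 root                               # roll back the injected latency
$ systemctl start mariadb                                  # restore the database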
Use Case 1: Increasing Reliability
The public cloud exhibits a damage pattern; the Butterfly Effect system maps it to a fault type and recommends a repair:
fix configurations, fix bugs, replace hardware, or upgrade memory.
Use Case 2: Run Book Automation (RBA)
Public cloud incident management asks: is this really an incident? If so, the Major Incident Procedure is triggered. The Butterfly Effect system maps the fault type and damage pattern to a recovery script (see the toy sketch below).
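A toy illustration of the run-book mapping (all pattern and script names are hypothetical):
$ cat /runbooks/dispatch.sh
case "$DAMAGE_PATTERN" in
  cinder-db-down)   exec /runbooks/restart_cinder_db.sh ;;   # known pattern: run its recovery script
  rpc-latency-high) exec /runbooks/reset_network_qos.sh ;;
  *) echo "No run book for $DAMAGE_PATTERN; paging the on-call operator" ;;
esac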
Components of a Solution
The solution loop has seven steps:
1. Design & deploy test infrastructure
2. Monitoring facilities
3. Design & execute fault-injection plan
4. Identify damages
5. Predict future errors
6. Automatic repair
7. Repair & learn
Candidate technologies per component:
MONITORING: Nagios, Zabbix, Cacti, StackTach, Synaps, Monasca
CONFIGURATION AUTOMATION: Ansible, CFEngine, Chef, Puppet, Salt, Heat
FAULT-INJECTION ENGINES: DestroyStack, FSaaS, ChaosMonkey, AnarchyApe
FAULT LIBRARIES AND PLANS: pyCallGraph, Intellect, RunDeck, Nose
DAMAGE DETECTION: Tempest, Nose
DATA SOURCES: log files, Collectd plugins, FlumeNG, OpenStack tables, Zabbix agents, Nagios plugins
DATA TRANSPORT: rsyslog, ZeroMQ
DATA AGGREGATION: Logstash, Collectd, Flume, Fluentd, Heka, Ceilometer
DATA STORAGE: ElasticSearch, OpenTSDB, Neo4j, Graphite, Cassandra, Redis
DATA PROCESSING: Hadoop, Pig, Hive, Spark, Storm
OPERATIONS ANALYTICS: StatsD, R, Pandas, Weka, machine learning
DATA VISUALIZATION: Kibana, Graylog2, Grafana
ALERTING: Errbit, Honeybadger, Nagios, Zabbix, OpenPager, Riemann
MANUAL REPAIR: Bash, Python, Chef, Puppet
AUTOMATED REPAIR: jCOLIBRI, myCBR, Puppet, Rundeck, (R)?ex, Chef
Design & Deploy Test Environment
Customizable, automated OpenStack deployment
FusionServer RH2288 + VirtualBox + Vagrant + RDO
2 hours to deploy an OpenStack infrastructure with 32 VMs
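A sketch of such a deployment for a single node (box name and RDO repository URL follow the public RDO instructions; adjust for a local mirror):
$ vagrant init centos/7 && vagrant up --provider virtualbox   # bring up the VirtualBox VM
$ vagrant ssh
$ sudo yum install -y https://rdoproject.org/repos/rdo-release.rpm
$ sudo yum install -y openstack-packstack
$ sudo packstack --allinone                                   # install an all-in-one OpenStack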
Faults to Inject (Step 3: Inject Faults)
Disk temporarily unavailable (see the sketch after this list)
unmount a disk
wait for replicas to regenerate
remount the disk with the data intact
wait; the extra replicas on the handoff nodes should get removed
Disk replacement
unmount a disk
wait for replicas to regenerate
delete the disk's data and remount it
wait for replicas to regenerate
the extra replicas from handoff nodes should get removed
Expected failure
damage three disks at the same time (more if the replica count is higher)
check that the replicas didn't regenerate even after some time period
fail if the replicas regenerated (this tests whether the tests themselves are correct)
VM failures
send a VM creation request
find the compute node where the request was scheduled
damage that compute server
check whether the VM creation was re-scheduled to another node
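A sketch of the first scenario on a Swift storage node (device and mount point are assumptions for an example node):
$ umount /srv/node/sdb1            # disk temporarily unavailable
$ swift-recon --replication        # watch replicas regenerate elsewhere
$ mount /dev/sdb1 /srv/node/sdb1   # remount the disk with the data intact
$ swift-recon --replication        # extra replicas on handoff nodes drain away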
Damage Detection
The main testing framework of OpenStack is Tempest, an open-source project with more than 2,000 tests. It performs only black-box testing: tests access only the public interfaces.
Network tests
• create keypairs
• create security groups
• create networks
Compute tests
• create a keypair
• create a security group
• boot an instance
Swift tests
• create a volume
• get the volume
• delete the volume
Identity tests
…
Cinder tests
…
Glance tests
…
echo "$ tempest init cloud-01"
echo "$ cp tempest/etc/tempest.conf cloud-01/etc/"
echo "$ cd cloud-01"
echo "Next is the full test suite:"
echo "$ ostestr -c 3 --regex '(?!.*[.*bslowb.*])(^tempest.(api|scenario))'"
echo "Next ist the minimum basic test:"
echo "$ ostestr -c 3 --regex '(?!.*[.*bslowb.*])(^tempest.scenario.test_minimum_basic)'"
Monasca
Overview: Uses the Keystone OpenStack Identity Service for authentication,
authorization and multi-tenancy. Monasca integrates with several other
OpenStack services such as Heat for auto-scaling and Ceilometer for
monitoring OpenStack resources.
Apache Kafka: A high-throughput distributed messaging system. Kafka is a central component in Monasca and provides the infrastructure for all internal communications between components.
Apache Storm: A free and open source distributed realtime computation
system. Apache Storm is used in the Monasca Threshold Engine.
InfluxDB: An open-source distributed time series database with no external
dependencies. InfluxDB is one of the supported databases for storing metrics
and alarm history.
MySQL: MySQL is one of the supported databases for the Monasca Config
Database.
Grafana: An open-source, feature-rich metrics dashboard and graph editor. Support for Monasca as a data source in Grafana has been added.
Anomaly Detection: The engine implements real-time streaming anomaly detection with two algorithms: the Numenta Platform for Intelligent Computing (NuPIC) and the Kolmogorov-Smirnov (K-S) two-sample test. It uses StackTach for real-time streaming.
Performance: 3 HP ProLiant SL390s G7 servers + an InfluxDB cluster sustain 25K-30K metrics/sec; monasca-api exceeds 150K metrics/sec for a 3-node cluster with load balancing; for higher throughput, use the HP Vertica database.
See https://www.openstack.org/assets/presentation-media/Monasca-Deep-Dive-Paris-Summit.pdf
[Dashboard screenshots: Grafana (compute_instance_create_time); Anomaly Detection (cpu.user_perc)]
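As a hedged example of how an alarm on the cpu.user_perc metric above could be defined (assumes python-monascaclient is installed and Keystone credentials are exported; the alarm name and thresholds are illustrative):
$ monasca metric-list --name cpu.user_perc                                  # confirm the metric exists
$ monasca alarm-definition-create high-user-cpu "avg(cpu.user_perc) > 90 times 3"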
Join the Cause!
Internship positions for MSc students
Fault injection, fault models, fault libraries, fault plans, break and rebuild systems all day long, …
OpenStack Engineer positions
Rapid prototyping of cool ideas: propose it today, code it, and show it running in 3 months…
Innovative PoCs
Solving difficult challenges of real problems using quick-and-dirty prototyping