Lecture given at the Technical University of Munich, 12 December 2016, on Cloud Operations and Analytics: Improving Distributed Systems Reliability using Fault Injection.
Cloud Operations and Analytics: Improving Distributed Systems Reliability using Fault Injection
1. Cloud Operations and Analytics
Improving Distributed Systems Reliability
Using Fault Injection
December 12, 2016
Technical University of Munich
www.tum.de
Dr. Jorge Cardoso (jorge.cardoso@huawei.com)
Chief Architect for Cloud Operations and Analytics
IT R&D Division
2. 1
About Me
Jorge Cardoso
http://jorge-cardoso.github.io/
Interests
Cloud Computing
Service Science and Internet of Services
Business Process Management
Semantic Web
Positions in Industry
Prof. Jorge Cardoso obtained his PhD degree in Computer Science from the University of Georgia (US) in 2002. He is Chief Architect for
Cloud Operations and Analytics at Huawei GRC in Munich, Germany, and Professor at the University of Coimbra, Portugal.
He frequently publishes papers in first-tier conferences such as ICWS, CAISE, and ESWC, and first-tier journals such as IEEE TSC and
Journal of Web Semantics. He has published several books on distributed systems, process management systems, and service systems.
Short Bio
4. 3
From Virtualization to Clouds
Cloud Computing Deployment Stages of Enterprises
• Computing virtualization
• Storage virtualization
• Network and security virtualization
• Automatic management
• Elastic resource scheduling
• HA based on large clusters
• Consolidation of multiple DCs
• Multi-level backup and DR
• Software-defined networking (SDN)
• Unified management
• Optimal resource allocation
• Flexible service migration
[Figure: deployment stages — Virtualization (focus on resources) → Private Cloud → Data Center Consolidation (gradually focus on business) → Hybrid Cloud combining private and public (focus on global business; flexible and service-driven)]
5. 4
Server virtualization is the partitioning of a physical server into smaller
virtual servers to maximize resources. The resources of the server are
hidden from users. Software is used to divide the physical server into
multiple virtual environments.
Communications of the ACM, vol 17, no 7, 1974, pp.412-421
Virtualization
[Figure: without virtualization, four x86 servers (Windows XP, Windows 2003, SUSE, Red Hat) each run their apps at only 10-18% hardware utilization; with virtualization, the same four guests share one multi-core, multi-processor x86 server at 70% hardware utilization]
7. 6
Azure, Amazon, Google,
Oracle, OpenStack,
SoftLayer, etc.
Transforms datacenters into
pools of resources
Provides a management
layer for controlling,
automating, and efficiently
allocating resources
Adopts a self-service mode
Enables developers to build
cloud-aware applications via
standard APIs
Cloud Operating Systems
8. 7
Started by Rackspace and NASA (2010)
Driven by the emergence of virtualization
Rackspace wanted to rewrite its cloud servers offering
NASA had published code for Nova, a Python-based
cloud computing controller
OpenStack History
Series   | Status                                     | Initial Release Date  | EOL Date
Queens   | Future                                     | TBD                   | TBD
Pike     | Future                                     | TBD                   | TBD
Ocata    | Under Development                          | 2017-02-22 (planned)  | TBD
Newton   | Current stable release, security-supported | 2016-10-06            | TBD
Mitaka   | Security-supported                         | 2016-04-07            | 2017-04-10
Liberty  | Security-supported                         | 2015-10-15            | 2016-11-17
Kilo     | EOL                                        | 2015-04-30            | 2016-05-02
Juno     | EOL                                        | 2014-10-16            | 2015-12-07
Icehouse | EOL                                        | 2014-04-17            | 2015-07-02
Havana   | EOL                                        | 2013-10-17            | 2014-09-30
Grizzly  | EOL                                        | 2013-04-04            | 2014-03-29
Folsom   | EOL                                        | 2012-09-27            | 2013-11-19
Essex    | EOL                                        | 2012-04-05            | 2013-05-06
Diablo   | EOL                                        | 2011-09-22            | 2013-05-06
Cactus   | Deprecated                                 | 2011-04-15            |
Bexar    | Deprecated                                 | 2011-02-03            |
Austin   | Deprecated                                 | 2010-10-21            |
https://www.nextplatform.com/2016/11/03/building-stack-openstack/
9. 8
OpenStack Community
1,500+ active participants!
17 countries represented at Design Summit!
60,000+ downloads!
Worldwide network of user groups (North
America, South America, Europe, Asia and
Africa)
11. 10
OpenStack User Survey: A snapshot of OpenStack users’ attitudes and deployments. April 2016. (https://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf). Fig. 4.6, p. 31.
16. 15
One reason [Netflix]: It’s the lack of control over the underlying
hardware, the inability to configure it to ensure 100% uptime
Why does using a cloud infrastructure require
advanced approaches to resiliency?
17. 16
Unplanned downtime
is caused by*
software bugs … 27%
hardware … 23%
human error … 18%
network failures … 17%
natural disasters … 8%
* Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
A 2007 Google study found
annualized failure rates (AFRs) for drives:
1 year old: 1.7%
3 years old: >8.6%
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proc.
of the 5th USENIX conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
18. 17
A program designed to increase resilience by purposely injecting
major failures
Discover flaws and subtle dependencies
Amazon AWS: GameDay
“That seems totally bizarre on the face of it, but as you dig down, you end up finding
some dependency no one knew about previously […] We’ve had situations where we
brought down a network in, say, São Paulo, only to find that in doing so we broke our
links in Mexico.”
19. 18
Google DIRT (Disaster Recovery Test)
Annual disaster recovery & testing exercise
8 years since inception
Multi-day exercise triggering (controlled) failures in systems and process
Premise
30-day incapacitation of headquarters following a disaster
Other offices and facilities may be affected
When
“Big disaster”: Annually for 3-5 days
Continuous testing: Year-round
Who
100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities)
Business units (Human Resources, Finance, Safety, Crisis response etc.)
Google: DiRT
Source http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
20. 19
Netflix: Chaos Monkey
Amazon EC2 and Amazon RDS service disruption in the US East Region (April 29, 2011).
September 20, 2015: Amazon’s DynamoDB service experienced an availability issue in US-EAST-1.
[Figure labels: “Fewer alerts for ops team”, “Transfer traffic to east region”]
21. 20
Dependability. Concepts, techniques and tools developed
over the past four decades and include the attributes:
Availability. readiness for correct service.
Reliability. Continuity of correct service.
Safety. absence of catastrophic consequences on the
User(s) and the environment.
Integrity. absence of improper system alterations.
Maintainability. ability to undergo modifications and
repairs.
Means to attain dependability
Fault prevention means to prevent the occurrence or
introduction of faults.
Fault tolerance means to avoid service failures in the
presence of faults [Voas98].
Fault removal means to reduce the number and
severity of faults.
Fault forecasting means to estimate the present number,
the future incidence, and the likely consequences of
faults.
Reliability
A. Avizienis, JC Laprie, B. Randell, and C. Landwehr. 2004. Basic Concepts and Taxonomy of
Dependable and Secure Computing. IEEE Trans. Dependable Secur. Comput. 1, 1, 11-33
J. Voas, G. McGraw. “Software Fault Injection: Inoculating programs against errors”. Edit. Wiley. USA, 1998.
Dependability
22. 21
Fault. Adjudged or hypothesized cause of an error.
Error. Discrepancy between a computed, observed, or
measured value or condition and the true, specified, or
theoretically correct value or condition. An error is a
consequence of a fault.
Failure. Deviation of the delivered service from
fulfilling the system function.
Threats
M. Cinque, D. Cotroneo, A. Pecchia. Event Logs for the Analysis of Software Failures: A Rule-Based Approach.
IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 806-821.
[Figure: fault → error → failure propagation chain]
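The fault → error → failure chain can be made concrete with a toy example (not from the slides; the conversion function and its bug are invented for illustration):

```python
# Toy example of the chain above: a dormant software FAULT is activated,
# produces an internal ERROR, and the error propagates to the service
# interface as a FAILURE.

def fahrenheit(celsius):
    # FAULT: the programmer typed 1.9 instead of the correct factor 1.8.
    return celsius * 1.9 + 32

def report(celsius):
    temp = fahrenheit(celsius)   # the wrong internal value is the ERROR
    return f"{temp:.1f} F"       # the wrong delivered output is the FAILURE

# The specification says report(100) should be "212.0 F":
print(report(100))  # -> "222.0 F": the fault has surfaced as a failure
```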
24. 23
Fault Injection
• FI on simulated models
  - VHDL simulation models
  - Other languages
• FI on prototypes
  - Hardware injection (HWIFI)
    External: HWIFI at pin level, electromagnetic perturbations
    Internal: heavy-ion radiation, laser radiation, scan chain
  - Software injection (SWIFI)
    Time: static, dynamic
    Level: high level, machine language
Injection objectives: prediction, elimination
Fault Injection Techniques
Fault injection techniques introduce faults to
perturb the normal flow of a program, to
extend test coverage or stress-test the system.
Software-implemented fault injection (SWIFI)
injects a fault into a software system at run time.
Advantages:
• Experiments can be run in near real time
• No model development needed
• Can be extended for new classes of faults
Limitations:
• Limited set of injection instants
• Cannot inject faults into locations that
are inaccessible to software
• Requires modification of the source code
to support the fault injection
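A minimal SWIFI sketch in Python, assuming an invented wrapper name (`inject_fault`, not part of any real tool): the target function is patched at run time so calls raise an injected error with a chosen probability, perturbing the program's normal flow.

```python
import random

# Minimal software-implemented fault injection (SWIFI) sketch: wrap a target
# function at run time so that calls raise an injected error with probability
# `rate`. All names here are illustrative assumptions, not a real library.

def inject_fault(func, error=IOError("injected fault"), rate=1.0, rng=random.random):
    """Return a wrapper around `func` that raises `error` with probability `rate`."""
    def wrapper(*args, **kwargs):
        if rng() < rate:
            raise error                    # the injected fault
        return func(*args, **kwargs)       # normal path
    return wrapper

def read_block(device):
    return f"data from {device}"

# With rate=1.0 every call fails, exercising the caller's error-handling path.
read_block_faulty = inject_fault(read_block, rate=1.0)
try:
    read_block_faulty("/dev/sda")
except IOError as e:
    print("caught:", e)
```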
25. 24
Huawei: Butterfly Effect
-- Butterfly Effect System --
Automatically tests and repairs OpenStack and cloud
applications
The system works by intentionally injecting different failures, testing the ability to
survive them, and learning how to predict and repair failures preemptively
[Figure: Failure → Test → Repair cycle over a cloud application running on HUAWEI FusionSphere]
In chaos theory, the butterfly effect is the sensitive dependence on initial conditions in which a small change in one state of a deterministic nonlinear
system can result in large differences in a later state. [Wikipedia]
26. 25
The Strategy
VM failures
• send a VM creation request
• find the compute node where the request was scheduled
• damage the compute server
• check whether the VM creation was re-scheduled to another node
Disk temporarily unavailable
• unmount a disk
• wait for replicas to regenerate
• remount the disk with the data intact
• wait for the extra replicas on handoff nodes to be removed
Disk replacement
• unmount a disk
• wait for replicas to regenerate
• delete the disk and remount it
• wait for replicas to regenerate; the extra replicas on handoff nodes should be removed
Replication
• damage three disks at the same time (more if the replica count is higher)
• check that the replicas did not regenerate even after some time period
• fail if the replicas regenerated (this checks whether the tests themselves are correct)
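The "VM failures" scenario above could be driven by a polling check like the following sketch. The `cloud.host_of` call is a hypothetical stand-in for a real OpenStack client query (e.g. asking Nova for the VM's host); it is not an actual API.

```python
import time

# Hedged sketch of the "VM failures" scenario: after damaging the compute node
# a VM landed on, poll until the scheduler places the VM on a different node.

def check_rescheduled(cloud, vm_id, failed_node, timeout=300, poll=5, sleep=time.sleep):
    """Return True once `vm_id` runs on a node other than the damaged one."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        node = cloud.host_of(vm_id)           # where is the VM scheduled now?
        if node is not None and node != failed_node:
            return True                       # recovery: request was re-scheduled
        sleep(poll)                           # injectable for testing
    return False                              # no re-scheduling within the budget
```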
27. 26
Approach
Fully automated and customizable
Simple: uses ssh and bash scripting
FusionServer RH2288
Deploy and Destroy: 2 hours to deploy an OpenStack infrastructure with 32 VMs…32
seconds to destroy it…
Vagrant. Provides easy-to-configure, reproducible, and portable environments for
OpenStack; interfaces to VirtualBox, VMware, AWS, and other providers.
VirtualBox. Free open-source hypervisor for x86 computers from Oracle;
used to manage the virtual machines.
RDO. Freely available distribution of OpenStack from Red Hat; runs OpenStack Mitaka.
Test Environment
Huawei RH2288 + Fedora
Vagrant
Virtualbox
VM VM VM VM VM VM VM
28. 27
Service to Destroy
Database
Message Queue
Authentication
Hypervisor
Hard drive
The main testing framework of OpenStack is Tempest, an open-source project with more than 2,000 tests; it performs only black-box testing (tests only access the public interfaces).
Nodes, services,
processes, network,
hypervisor, storage, etc.
Nova-Compute
29. 28
1. Request provisioning UI/CLI
2. Validate Auth data
3. Send API request to NOVA API
4. Validate API token
5. Process API request
6. Publish Provisioning Request
7. Schedule Provisioning
8. Start VM provisioning
9. Configure Network
10. Request Volume
11. Request VM image from Glance
12. Get image URL from Glance
13. Direct Image File Copy
14. Start VM rendering via Hypervisor
Scenario Driven
http://www.slideshare.net/mirantis/openstack-architecture-43160012
http://docs.openstack.org/developer/tempest/field_guide/scenario.html
[Figure: scenario-driven testing — for each step of a scenario process (e.g., create server), faults are injected]
flavor create
flavor delete
flavor list
host list
hypervisor list
hypervisor show
image add project
image create
image delete
image list
image show
ip fixed add
…
openstack server create --flavor m1.medium --image "fedora-23" --key-name ayoung-pubkey --security-group default --nic net-id=63258623-1fd5-497c-b62d-e0651e03bdca windows_dev
31. 30
Faults to Inject
• Bit-flips - CPU registers/memory
• Memory errors - memory corruption/leaks, lack of memory
• Disk faults - read/write errors, lack of disk space
• Network faults - packet loss, network congestion, etc.
• Terminate instance
• Introduce delays in message delivery
• Corrupt data in the DB
• Service, process, and application crashes
• Reboot node
• Configuration errors
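One fault class from the list, bit-flips, can be sketched in a few lines of Python; `flip_random_bit` is an illustrative helper written for this slide, not part of any fault-injection tool.

```python
import random

# Sketch of the "bit-flips" fault class: flip one random bit in a byte buffer,
# the kind of single-bit corruption a SWIFI tool would apply to a process
# image or a stored block.

def flip_random_bit(data: bytes, rng=random.Random(0)) -> bytes:
    buf = bytearray(data)
    byte_i = rng.randrange(len(buf))   # pick a byte ...
    bit_i = rng.randrange(8)           # ... and a bit inside it
    buf[byte_i] ^= 1 << bit_i          # XOR flips exactly that one bit
    return bytes(buf)

original = b"replica-block"
corrupted = flip_random_bit(original)
print(original != corrupted)  # -> True: the copies differ by exactly one bit
```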
32. 31
Detect Failures
Tempest, OpenStack’s main testing framework, provides more than 2,000 black-box tests that only access the public interfaces.
Network tests
• create keypairs
• create security
groups
• create networks
Compute tests
• create a keypair
• create a security
group
• boot an instance
Swift tests
• create a volume
• get the volume
• delete the volume
Identity tests
…
Cinder tests
…
Glance tests
…
$ tempest init cloud-01
$ cp tempest/etc/tempest.conf cloud-01/etc/
$ cd cloud-01
Next is the full test suite:
$ ostestr -c 3 --regex '(?!.*[.*\bslow\b.*])(^tempest.(api|scenario))'
Next is the minimum basic test:
$ ostestr -c 3 --regex '(?!.*[.*\bslow\b.*])(^tempest.scenario.test_minimum_basic)'
33. 32
Detect Failures
[Figure: localizing failures by branch-and-bound over Tempest runs — Tempest 0 is the full suite (~1400 tests, 45 min-2 h); successive runs (Tempest 1-3) use progressively smaller overlapping and then mutually exclusive test subsets (100% → 40% → 5% → 4%)]
Side Effects. Integration tests often have side effects and require specific setups, so they often cannot be used
in production systems. For example, integration tests that delete all the virtual machines running on a
production platform cannot be run in production.
Reuse. Integration tests are composed of many types of tests (e.g., unit tests, API tests, integration tests,
scenario tests, and positive and negative tests). Reusing test code for damage detection is useful, but the selection
can be difficult.
Filtering. Most tests are not relevant for damage detection on production systems. While damage detection
looks for components, services, and processes that are no longer working properly, tests determine whether commits to
the code generate errors. When software code is tested, many functional tests are irrelevant in production.
Specificity. New code for damage detection always needs to be developed, since testing does not typically look
for problems that can happen when a system is in a particular operational state.
Limitations of Integration Tests
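The Filtering limitation can be illustrated with a toy test selector. The conventions used here (skipping test names containing "slow" or destructive verbs) are assumptions for this sketch, not an actual Tempest feature.

```python
import re

# Toy illustration of "Filtering": from a full suite, keep only tests that are
# safe and relevant for damage detection in production (no destructive or slow
# tests). Naming conventions are assumed for this example.

DESTRUCTIVE = re.compile(r"delete|destroy|terminate", re.IGNORECASE)

def select_for_production(test_names):
    return [t for t in test_names
            if "slow" not in t and not DESTRUCTIVE.search(t)]

suite = [
    "tempest.scenario.test_minimum_basic",
    "tempest.api.compute.test_delete_server",  # destructive: excluded
    "tempest.api.volume.test_volume_list",
    "tempest.scenario.test_slow_migration",    # slow: excluded
]
print(select_for_production(suite))
# -> ['tempest.scenario.test_minimum_basic', 'tempest.api.volume.test_volume_list']
```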
36. Dmitri Zimine (Brocade) giving his speech on workflows for auto-remediation (credits to Johannes Weingart).
Sebastian Kirsch (Google), co-author of the bestselling book Site Reliability Engineering from Google, and the workshop organizer Jorge Cardoso (Huawei).
The International Industry-Academia Workshop on Cloud Reliability and Resilience was held
in Berlin on 7-8 November 2016. The workshop gathered close to 50 participants from
industry (Intel, Red Hat, CISCO, SAP, Google, LinkedIn, Microsoft, Mirantis, Brocade,
T-Systems, SysEleven, Deutsche Telekom, Flexiant, Hastexo) and academia (TU Wien, TU
Berlin, U Oxford, ETH, U Stuttgart, TU Chemnitz, U Potsdam, TU Darmstadt, U Lisbon,
U Coimbra).
International Industry-Academia
Workshop on
Cloud Reliability and Resilience
Berlin on 7-8 November 2016
37. Current Team: Cloud Operations and Analytics
Objective
Planet-scale distributed systems = automation
Highly complex systems = AI and machine learning
Skills and knowledge
OpenStack Software Development
Machine Learning and Real-time Analysis
Reliability for Cloud Native Applications
Large-scale distributed systems
Working Student
Distributed Execution Graphs (DEG) for OpenStack.
Master Students
Efficient Diagnosis in Cloud Platforms.
DEG-driven Fault Injection for Cloud platforms.
PhD Students
Risk-aware Cloud Recovery using Machine Learning
(automation + AI).
Internship for PhD student
Next generation of DEG-driven systems beyond
Google’s Dapper and Twitter’s Zipkin.
Working & MSc students
Fault injection, fault models,
fault libraries, fault plans,
break and rebuild systems all
day long, …
PhD Students
Rapid prototyping of cool
ideas: propose it today, code
it, and show it running in 3
months…
Postdocs
Solving difficult challenges of
real problems using quick and
dirty prototyping
Open Positions
39. 38
The complexity and dynamicity of large-scale cloud platforms require automated solutions
to reduce the risks of eventual failures.
Fault injection mechanisms make it possible to determine (and repair) the types of failures that
platforms cannot tolerate, under controlled conditions, rather than taking a passive
approach and waiting for Murphy’s law to come into play on a Sunday at 2am when engineers are
off duty.
Pioneers such as Amazon, Google, and Netflix have already developed fault injection
mechanisms and have also changed their mindset with respect to the importance of the
resiliency of cloud platforms.
As an innovation topic, we take one step further towards fault-tolerant platforms by
exploring not only fault injection but also the automated recovery of platforms.
Executive Summary