Lecture given at the Technical University of Munich, 12 December 2016, on Cloud Operations and Analytics: Improving Distributed Systems Reliability using Fault Injection.
Cloud Operations and Analytics: Improving Distributed Systems Reliability using Fault Injection
1. Cloud Operations and Analytics
Improving Distributed Systems Reliability
Using Fault Injection
December 12, 2016
Technical University of Munich
www.tum.de
Dr. Jorge Cardoso (jorge.cardoso@huawei.com)
Chief Architect for Cloud Operations and Analytics
IT R&D Division
2. 1
About Me
Jorge Cardoso
http://jorge-cardoso.github.io/
Interests
Cloud Computing
Service Science and Internet of Services
Business Process Management
Semantic Web
Positions in Industry
Prof. Jorge Cardoso obtained his PhD degree in Computer Science from the University of Georgia (US) in 2002. He is Chief Architect for
Cloud Operations and Analytics at Huawei GRC in Munich, Germany, and Professor at the University of Coimbra, Portugal.
He frequently publishes papers in first-tier conferences such as ICWS, CAISE, and ESWC, and first-tier journals such as IEEE TSC and
Journal of Web Semantics. He has published several books on distributed systems, process management systems, and service systems.
Short Bio
4. 3
From Virtualization to Clouds
Cloud Computing Deployment Stages of Enterprises
• Computing virtualization
• Storage virtualization
• Network and security virtualization
• Automatic management
• Elastic resource scheduling
• HA based on large clusters
• Consolidation of multiple DCs
• Multi-level backup and DR
• Software-defined networking (SDN)
• Unified management
• Optimal resource allocation
• Flexible service migration
[Figure: deployment stages — Virtualization (focus on resources) → Private Cloud → Data Center Consolidation (gradually focus on business) → Hybrid Cloud combining private and public (focus on global business; flexible and service-driven)]
5. 4
Server virtualization is the partitioning of a physical server into smaller
virtual servers to maximize resources. The resources of the server are
hidden from users. Software is used to divide the physical server into
multiple virtual environments.
Communications of the ACM, vol 17, no 7, 1974, pp.412-421
Virtualization
[Figure: without virtualization, four x86 servers (Windows XP, Windows 2003, SUSE, Red Hat) each run their apps at only 10-18% hardware utilization; with virtualization, the same four guests share one multi-core, multi-processor x86 server at 70% hardware utilization]
7. 6
Azure, Amazon, Google,
Oracle, OpenStack,
SoftLayer, etc.
Transforms datacenters into
pools of resources
Provides a management
layer for controlling,
automating, and efficiently
allocating resources
Adopts a self-service mode
Enables developers to build
cloud-aware applications via
standard APIs
Cloud Operating Systems
8. 7
Started by Rackspace and NASA (2010)
Driven by the emergence of virtualization
Rackspace wanted to rewrite its cloud servers offering
NASA had published code for Nova, a Python-based
cloud computing controller
OpenStack History
Series   | Status                                     | Initial Release Date  | EOL Date
Queens   | Future                                     | TBD                   | TBD
Pike     | Future                                     | TBD                   | TBD
Ocata    | Under Development                          | 2017-02-22 (planned)  | TBD
Newton   | Current stable release, security-supported | 2016-10-06            | TBD
Mitaka   | Security-supported                         | 2016-04-07            | 2017-04-10
Liberty  | Security-supported                         | 2015-10-15            | 2016-11-17
Kilo     | EOL                                        | 2015-04-30            | 2016-05-02
Juno     | EOL                                        | 2014-10-16            | 2015-12-07
Icehouse | EOL                                        | 2014-04-17            | 2015-07-02
Havana   | EOL                                        | 2013-10-17            | 2014-09-30
Grizzly  | EOL                                        | 2013-04-04            | 2014-03-29
Folsom   | EOL                                        | 2012-09-27            | 2013-11-19
Essex    | EOL                                        | 2012-04-05            | 2013-05-06
Diablo   | EOL                                        | 2011-09-22            | 2013-05-06
Cactus   | Deprecated                                 | 2011-04-15            |
Bexar    | Deprecated                                 | 2011-02-03            |
Austin   | Deprecated                                 | 2010-10-21            |
https://www.nextplatform.com/2016/11/03/building-stack-openstack/
9. 8
OpenStack Community
1,500+ active participants!
17 countries represented at Design Summit!
60,000+ downloads!
Worldwide network of user groups (North
America, South America, Europe, Asia and
Africa)
11. 10
OpenStack User Survey: A snapshot of OpenStack users’ attitudes and deployments. April 2016. (https://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf). Fig. 4.6, p. 31.
16. 15
One reason [Netflix]: It’s the lack of control over the underlying
hardware, the inability to configure it to ensure 100% uptime
Why does using a cloud infrastructure require
advanced approaches to resiliency?
17. 16
Unplanned downtime
is caused by*
software bugs … 27%
hardware … 23%
human error … 18%
network failures … 17%
natural disasters … 8%
* Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
A 2007 Google study found
annualized failure rates (AFRs) for drives:
1 year old: 1.7%
3 years old: >8.6%
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proc.
of the 5th USENIX conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
18. 17
A program designed to increase resilience by purposely injecting
major failures
Discover flaws and subtle dependencies
Amazon AWS: GameDay
“That seems totally bizarre on the face of it, but as you dig down, you end up finding
some dependency no one knew about previously […] We’ve had situations where we
brought down a network in, say, São Paulo, only to find that in doing so we broke our
links in Mexico.”
19. 18
Google DIRT (Disaster Recovery Test)
Annual disaster recovery & testing exercise
8 years since inception
Multi-day exercise triggering (controlled) failures in systems and process
Premise
30-day incapacitation of headquarters following a disaster
Other offices and facilities may be affected
When
“Big disaster”: Annually for 3-5 days
Continuous testing: Year-round
Who
100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities)
Business units (Human Resources, Finance, Safety, Crisis response etc.)
Google: DiRT
Source http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
20. 19
Netflix: Chaos Monkey
Amazon EC2 and Amazon RDS service disruption in the US East Region (April 29, 2011).
September 20, 2015: Amazon’s DynamoDB service experienced an availability issue in US-EAST-1.
[Figure labels: “Fewer alerts for ops team”, “Transfer traffic to east region”]
21. 20
Dependability. Concepts, techniques and tools developed
over the past four decades and include the attributes:
Availability. readiness for correct service.
Reliability. Continuity of correct service.
Safety. absence of catastrophic consequences on the
User(s) and the environment.
Integrity. absence of improper system alterations.
Maintainability. ability to undergo modifications and
repairs.
Means to attain dependability
Fault prevention means to prevent the occurrence or
introduction of faults.
Fault tolerance means to avoid service failures in the
presence of faults [Voas98].
Fault removal means to reduce the number and
severity of faults.
Fault forecasting means to estimate the present number,
the future incidence, and the likely consequences of
faults.
Reliability
A. Avizienis, JC Laprie, B. Randell, and C. Landwehr. 2004. Basic Concepts and Taxonomy of
Dependable and Secure Computing. IEEE Trans. Dependable Secur. Comput. 1, 1, 11-33
J. Voas, G. McGraw. “Software Fault Injection: Inoculating programs against errors”. Edit. Wiley. USA, 1998.
Dependability
22. 21
Fault. Adjudged or hypothesized cause of an error.
Error. Discrepancy between a computed, observed, or
measured value or condition and the true, specified, or
theoretically correct value or condition. An error is a
consequence of a fault.
Failure. Deviation of the delivered service from
fulfilling the system function.
Threats
M. Cinque, D. Cotroneo, A. Pecchia. Event Logs for the Analysis of Software Failures: A Rule-Based Approach.
IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 806-821.
[Figure: fault → error → failure propagation chain]
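The fault → error → failure chain can be made concrete with a toy example (not from the slides; the conversion function and its bug are invented for illustration):

```python
# Toy example of the chain above: a dormant software FAULT is activated,
# produces an internal ERROR, and the error propagates to the service
# interface as a FAILURE.

def fahrenheit(celsius):
    # FAULT: the programmer typed 1.9 instead of the correct factor 1.8.
    return celsius * 1.9 + 32

def report(celsius):
    temp = fahrenheit(celsius)   # the wrong internal value is the ERROR
    return f"{temp:.1f} F"       # the wrong delivered output is the FAILURE

# The specification says report(100) should be "212.0 F":
print(report(100))  # -> "222.0 F": the fault has surfaced as a failure
```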
24. 23
Fault Injection
• FI on simulated models
  - VHDL simulation models
  - Other languages
• FI on prototypes
  - Hardware injection (HWIFI)
    External: HWIFI at pin level, electromagnetic perturbations
    Internal: heavy-ion radiation, laser radiation, scan chain
  - Software injection (SWIFI)
    Time: static, dynamic
    Level: high level, machine language
Injection objectives: prediction, elimination
Fault Injection Techniques
Fault injection techniques introduce faults to
perturb the normal flow of a program, to
extend test coverage or stress-test the system.
Software-implemented fault injection (SWIFI)
injects a fault into a software system at run time.
Advantages:
• Experiments can be run in near real time
• No model development needed
• Can be extended for new classes of faults
Limitations:
• Limited set of injection instants
• Cannot inject faults into locations that
are inaccessible to software
• Requires modification of the source code
to support the fault injection
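A minimal SWIFI sketch in Python, assuming an invented wrapper name (`inject_fault`, not part of any real tool): the target function is patched at run time so calls raise an injected error with a chosen probability, perturbing the program's normal flow.

```python
import random

# Minimal software-implemented fault injection (SWIFI) sketch: wrap a target
# function at run time so that calls raise an injected error with probability
# `rate`. All names here are illustrative assumptions, not a real library.

def inject_fault(func, error=IOError("injected fault"), rate=1.0, rng=random.random):
    """Return a wrapper around `func` that raises `error` with probability `rate`."""
    def wrapper(*args, **kwargs):
        if rng() < rate:
            raise error                    # the injected fault
        return func(*args, **kwargs)       # normal path
    return wrapper

def read_block(device):
    return f"data from {device}"

# With rate=1.0 every call fails, exercising the caller's error-handling path.
read_block_faulty = inject_fault(read_block, rate=1.0)
try:
    read_block_faulty("/dev/sda")
except IOError as e:
    print("caught:", e)
```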
25. 24
Huawei: Butterfly Effect
-- Butterfly Effect System --
Automatically tests and repairs OpenStack and cloud
applications
The system works by intentionally injecting different failures, testing the ability to
survive them, and learning how to predict and repair failures preemptively
[Figure: Failure → Test → Repair cycle over a cloud application running on HUAWEI FusionSphere]
In chaos theory, the butterfly effect is the sensitive dependence on initial conditions in which a small change in one state of a deterministic nonlinear
system can result in large differences in a later state. [Wikipedia]
26. 25
The Strategy
VM failures
• send a VM creation request
• find the compute node where the request was scheduled
• damage the compute server
• check whether the VM creation was re-scheduled to another node
Disk temporarily unavailable
• unmount a disk
• wait for replicas to regenerate
• remount the disk with the data intact
• wait for the extra replicas on handoff nodes to be removed
Disk replacement
• unmount a disk
• wait for replicas to regenerate
• delete the disk and remount it
• wait for replicas to regenerate; the extra replicas on handoff nodes should be removed
Replication
• damage three disks at the same time (more if the replica count is higher)
• check that the replicas did not regenerate even after some time period
• fail if the replicas regenerated (this checks whether the tests themselves are correct)
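The "VM failures" scenario above could be driven by a polling check like the following sketch. The `cloud.host_of` call is a hypothetical stand-in for a real OpenStack client query (e.g. asking Nova for the VM's host); it is not an actual API.

```python
import time

# Hedged sketch of the "VM failures" scenario: after damaging the compute node
# a VM landed on, poll until the scheduler places the VM on a different node.

def check_rescheduled(cloud, vm_id, failed_node, timeout=300, poll=5, sleep=time.sleep):
    """Return True once `vm_id` runs on a node other than the damaged one."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        node = cloud.host_of(vm_id)           # where is the VM scheduled now?
        if node is not None and node != failed_node:
            return True                       # recovery: request was re-scheduled
        sleep(poll)                           # injectable for testing
    return False                              # no re-scheduling within the budget
```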
27. 26
Approach
Fully automated and customizable
Simple: uses ssh and bash scripting
FusionServer RH2288
Deploy and Destroy: 2 hours to deploy an OpenStack infrastructure with 32 VMs…32
seconds to destroy it…
Vagrant. Provides easy-to-configure, reproducible, and portable environments for
OpenStack; interfaces to VirtualBox, VMware, AWS, and other providers.
VirtualBox. Free open-source hypervisor for x86 computers from Oracle;
used to manage the virtual machines.
RDO. Freely available distribution of OpenStack from Red Hat; runs OpenStack Mitaka.
Test Environment
Huawei RH2288 + Fedora
Vagrant
Virtualbox
VM VM VM VM VM VM VM
28. 27
Service to Destroy
Database
Message Queue
Authentication
Hypervisor
Hard drive
The main testing framework of OpenStack is Tempest, an open-source project with more than 2,000 tests; it performs only black-box testing (tests only access the public interfaces).
Nodes, services,
processes, network,
hypervisor, storage, etc.
Nova-Compute
29. 28
1. Request provisioning UI/CLI
2. Validate Auth data
3. Send API request to NOVA API
4. Validate API token
5. Process API request
6. Publish Provisioning Request
7. Schedule Provisioning
8. Start VM provisioning
9. Configure Network
10. Request Volume
11. Request VM image from Glance
12. Get image URL from Glance
13. Direct Image File Copy
14. Start VM rendering via Hypervisor
Scenario Driven
http://www.slideshare.net/mirantis/openstack-architecture-43160012
http://docs.openstack.org/developer/tempest/field_guide/scenario.html
[Figure: scenario-driven testing — for each step of a scenario process (e.g., create server), faults are injected]
flavor create
flavor delete
flavor list
host list
hypervisor list
hypervisor show
image add project
image create
image delete
image list
image show
ip fixed add
…
openstack server create --flavor m1.medium --image "fedora-23" --key-name ayoung-pubkey --security-group default --nic net-id=63258623-1fd5-497c-b62d-e0651e03bdca windows_dev
31. 30
Faults to Inject
• Bit-flips - CPU registers/memory
• Memory errors - memory corruption/leaks, lack of memory
• Disk faults - read/write errors, lack of disk space
• Network faults - packet loss, network congestion, etc.
• Terminate instance
• Introduce delays in message delivery
• Corrupt data in the DB
• Service, process, and application crashes
• Reboot node
• Configuration errors
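One fault class from the list, bit-flips, can be sketched in a few lines of Python; `flip_random_bit` is an illustrative helper written for this slide, not part of any fault-injection tool.

```python
import random

# Sketch of the "bit-flips" fault class: flip one random bit in a byte buffer,
# the kind of single-bit corruption a SWIFI tool would apply to a process
# image or a stored block.

def flip_random_bit(data: bytes, rng=random.Random(0)) -> bytes:
    buf = bytearray(data)
    byte_i = rng.randrange(len(buf))   # pick a byte ...
    bit_i = rng.randrange(8)           # ... and a bit inside it
    buf[byte_i] ^= 1 << bit_i          # XOR flips exactly that one bit
    return bytes(buf)

original = b"replica-block"
corrupted = flip_random_bit(original)
print(original != corrupted)  # -> True: the copies differ by exactly one bit
```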
32. 31
Detect Failures
Tempest, OpenStack’s main testing framework, provides more than 2,000 black-box tests that only access the public interfaces.
Network tests
• create keypairs
• create security
groups
• create networks
Compute tests
• create a keypair
• create a security
group
• boot an instance
Swift tests
• create a volume
• get the volume
• delete the volume
Identity tests
…
Cinder tests
…
Glance tests
…
$ tempest init cloud-01
$ cp tempest/etc/tempest.conf cloud-01/etc/
$ cd cloud-01
Next is the full test suite:
$ ostestr -c 3 --regex '(?!.*[.*\bslow\b.*])(^tempest.(api|scenario))'
Next is the minimum basic test:
$ ostestr -c 3 --regex '(?!.*[.*\bslow\b.*])(^tempest.scenario.test_minimum_basic)'
33. 32
Detect Failures
[Figure: localizing failures by branch-and-bound over Tempest runs — Tempest 0 is the full suite (~1400 tests, 45 min-2 h); successive runs (Tempest 1-3) use progressively smaller overlapping and then mutually exclusive test subsets (100% → 40% → 5% → 4%)]
Side Effects. Integration tests often have side effects and require specific setups, so they often cannot be used
in production systems. For example, integration tests that delete all the virtual machines running on a
production platform cannot be run in production.
Reuse. Integration tests are composed of many types of tests (e.g., unit tests, API tests, integration tests,
scenario tests, and positive and negative tests). Reusing test code for damage detection is useful, but the selection
can be difficult.
Filtering. Most tests are not relevant for damage detection on production systems. While damage detection
looks for components, services, and processes that are no longer working properly, tests determine whether commits to
the code generate errors. When software code is tested, many functional tests are irrelevant in production.
Specificity. New code for damage detection always needs to be developed, since testing does not typically look
for problems that can happen when a system is in a particular operational state.
Limitations of Integration Tests
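The Filtering limitation can be illustrated with a toy test selector. The conventions used here (skipping test names containing "slow" or destructive verbs) are assumptions for this sketch, not an actual Tempest feature.

```python
import re

# Toy illustration of "Filtering": from a full suite, keep only tests that are
# safe and relevant for damage detection in production (no destructive or slow
# tests). Naming conventions are assumed for this example.

DESTRUCTIVE = re.compile(r"delete|destroy|terminate", re.IGNORECASE)

def select_for_production(test_names):
    return [t for t in test_names
            if "slow" not in t and not DESTRUCTIVE.search(t)]

suite = [
    "tempest.scenario.test_minimum_basic",
    "tempest.api.compute.test_delete_server",  # destructive: excluded
    "tempest.api.volume.test_volume_list",
    "tempest.scenario.test_slow_migration",    # slow: excluded
]
print(select_for_production(suite))
# -> ['tempest.scenario.test_minimum_basic', 'tempest.api.volume.test_volume_list']
```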
36. Dmitri Zimine (Brocade) giving his speech on workflows for auto-remediation (credits to Johannes Weingart).
Sebastian Kirsch (Google), co-author of the bestselling book Site Reliability Engineering from Google, and the workshop organizer Jorge Cardoso (Huawei).
The International Industry-Academia Workshop on Cloud Reliability and Resilience was held
in Berlin on 7-8 November 2016. The workshop gathered close to 50 participants from
industry (Intel, Red Hat, CISCO, SAP, Google, LinkedIn, Microsoft, Mirantis, Brocade,
T-Systems, SysEleven, Deutsche Telekom, Flexiant, Hastexo) and academia (TU Wien, TU
Berlin, U Oxford, ETH, U Stuttgart, TU Chemnitz, U Potsdam, TU Darmstadt, U Lisbon,
U Coimbra).
International Industry-Academia
Workshop on
Cloud Reliability and Resilience
Berlin on 7-8 November 2016
37. Current Team: Cloud Operations and Analytics
Objective
Planet-scale distributed systems = automation
Highly complex systems = AI and machine learning
Skills and knowledge
OpenStack Software Development
Machine Learning and Real-time Analysis
Reliability for Cloud Native Applications
Large-scale distributed systems
Working Student
Distributed Execution Graphs (DEG) for OpenStack.
Master Students
Efficient Diagnosis in Cloud Platforms.
DEG-driven Fault Injection for Cloud platforms.
PhD Students
Risk-aware Cloud Recovery using Machine Learning
(automation + AI).
Internship for PhD student
Next generation of DEG-driven systems beyond
Google’s Dapper and Twitter’s Zipkin.
Working & MSc students
Fault injection, fault models,
fault libraries, fault plans,
break and rebuild systems all
day long, …
PhD Students
Rapid prototyping of cool
ideas: propose it today, code
it, and show it running in 3
months…
Postdocs
Solving difficult challenges of
real problems using quick and
dirty prototyping
Open Positions
39. 38
The complexity and dynamicity of large-scale cloud platforms require automated solutions
to reduce the risks of eventual failures.
Fault injection mechanisms make it possible to determine (and repair) the types of failures that
platforms cannot tolerate, under controlled conditions, rather than taking a passive
approach and waiting for Murphy’s law to come into play on a Sunday at 2am when engineers are
off duty.
Pioneers such as Amazon, Google, and Netflix have already developed fault injection
mechanisms and have also changed their mindset with respect to the importance of the
resiliency of cloud platforms.
As an innovation topic, we take one step further towards fault-tolerant platforms by
exploring not only fault injection but also the automated recovery of platforms.
Executive Summary