Cloud Operations and Analytics
Improving Distributed Systems Reliability
Using Fault Injection
December 12, 2016
Technical University of Munich
www.tum.de
Dr. Jorge Cardoso (jorge.cardoso@huawei.com)
Chief Architect for Cloud Operations and Analytics
IT R&D Division
About Me
Jorge Cardoso
http://jorge-cardoso.github.io/
Interests
Cloud Computing
Service Science and Internet of Services
Business Process Management
Semantic Web
Positions in Industry
Prof. Jorge Cardoso obtained his PhD degree in Computer Science from the University of Georgia (US) in 2002. He is Chief Architect for
Cloud Operations and Analytics at Huawei GRC in Munich, Germany, and Professor at the University of Coimbra, Portugal.
He frequently publishes papers in first-tier conferences such as ICWS, CAISE, and ESWC, and first-tier journals such as IEEE TSC and
Journal of Web Semantics. He has published several books on distributed systems, process management systems, and service systems.
Short Bio
Contents
1 Cloud Computing
2 Cloud Operating Systems
3 Cloud Reliability
4 Fault Injection Techniques
5 Butterfly Effect Project
From Virtualization to Clouds
Cloud Computing Deployment Stages of Enterprises
Virtualization (focus on resources)
• Computing virtualization
• Storage virtualization
• Network and security virtualization
Private Cloud (gradually focus on business)
• Automatic management
• Elastic resource scheduling
• HA based on large clusters
Data Center Consolidation (focus on global business)
• Consolidation of multiple DCs
• Multi-level backup and DR
• Software-defined networking (SDN)
Hybrid Cloud (flexible and service-driven; private, public, and hybrid clouds)
• Unified management
• Optimal resource allocation
• Flexible service migration
Server virtualization is the partitioning of a physical server into smaller
virtual servers to maximize resources. The resources of the server are
hidden from users. Software is used to divide the physical server into
multiple virtual environments.
Communications of the ACM, vol 17, no 7, 1974, pp.412-421
Virtualization
[Figure: before virtualization, four separate x86 servers (Windows XP, Windows 2003, Suse, Red Hat) run eight apps in total, with hardware utilization between 10% and 18% (12%, 15%, 18%, 10%). After virtualization, the same four systems and eight apps share one x86 multi-core, multi-processor host at 70% hardware utilization.]
• Azure, Amazon, Google, Oracle, OpenStack, SoftLayer, etc.
• Transforms datacenters into pools of resources
• Provides a management layer for controlling, automating, and efficiently allocating resources
• Adopts a self-service mode
• Enables developers to build cloud-aware applications via standard APIs
Cloud Operating Systems
• Started by Rackspace and NASA (2010)
• Driven by the emergence of virtualization
• Rackspace wanted to rewrite its cloud servers offering
• NASA had published code for Nova, a Python-based cloud computing controller
OpenStack History
Series    Status                                        Initial Release Date    EOL Date
Queens    Future                                        TBD                     TBD
Pike      Future                                        TBD                     TBD
Ocata     Under development                             2017-02-22 (planned)    TBD
Newton    Current stable release, security-supported    2016-10-06              TBD
Mitaka    Security-supported                            2016-04-07              2017-04-10
Liberty   Security-supported                            2015-10-15              2016-11-17
Kilo      EOL                                           2015-04-30              2016-05-02
Juno      EOL                                           2014-10-16              2015-12-07
Icehouse  EOL                                           2014-04-17              2015-07-02
Havana    EOL                                           2013-10-17              2014-09-30
Grizzly   EOL                                           2013-04-04              2014-03-29
Folsom    EOL                                           2012-09-27              2013-11-19
Essex     EOL                                           2012-04-05              2013-05-06
Diablo    EOL                                           2011-09-22              2013-05-06
Cactus    Deprecated                                    2011-04-15
Bexar     Deprecated                                    2011-02-03
Austin    Deprecated                                    2010-10-21
https://www.nextplatform.com/2016/11/03/building-stack-openstack/
OpenStack Community
 1,500+ active participants!
 17 countries represented at Design Summit!
 60,000+ downloads!
 Worldwide network of user groups (North
America, South America, Europe, Asia and
Africa)
9
OpenStack Architecture
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/architecture-guide/chapter-1-components
OpenStack User Survey: A snapshot of OpenStack users’ attitudes and deployments. April 2016. (https://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf). Fig. 4.6, p. 31.
Compute Architecture
https://access.redhat.com/documentation/en/red-hat-openstack-platform/8/paged/architecture-guide/chapter-1-components
Adopters
Apr 6, 2016
http://cloud.telekom.de/Deutsche-Cloud
$ sudo yum install -y centos-release-openstack-newton
$ sudo yum update -y
$ sudo yum install -y openstack-packstack
$ packstack --allinone
Deploying OpenStack
https://www.rdoproject.org/install/quickstart/
One reason [Netflix]: It’s the lack of control over the underlying
hardware, the inability to configure it to ensure 100% uptime
Why does using a cloud infrastructure require
advanced approaches for resiliency?
Unplanned downtime
is caused by*
software bugs … 27%
hardware … 23%
human error … 18%
network failures … 17%
natural disasters … 8%
* Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
Google's 2007 study found the following
annualized failure rates (AFRs) for drives:
1 year old: 1.7%
3 years old: >8.6%
Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proc.
of the 5th USENIX conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
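To see why these rates matter at cluster scale, the following is a minimal sketch that turns a per-drive AFR into the expected number of failures per year and the chance of seeing at least one failure. It assumes drives fail independently with the same rate, which real fleets only approximate. The fleet size of 1,000 drives is an illustrative assumption, not a figure from the study, and 8.6% is used as a lower bound for three-year-old drives.

```python
def prob_at_least_one_failure(afr: float, drives: int) -> float:
    """Probability that one or more of `drives` fail within a year.

    Assumes independent failures with the same annualized failure rate.
    """
    if not 0.0 <= afr <= 1.0:
        raise ValueError("afr must be a fraction between 0 and 1")
    return 1.0 - (1.0 - afr) ** drives


def expected_failures(afr: float, drives: int) -> float:
    """Expected number of drive failures per year."""
    return afr * drives


if __name__ == "__main__":
    fleet = 1000  # hypothetical fleet size, not from the slides
    for label, afr in [("1 year old", 0.017), ("3 years old", 0.086)]:
        print(label,
              expected_failures(afr, fleet),
              prob_at_least_one_failure(afr, fleet))
```

Even at the lower rate, a fleet of this size should expect failures every year, which is why the platform has to tolerate them rather than try to prevent them.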
A program designed to increase resilience by purposely injecting
major failures
Discover flaws and subtle dependencies
Amazon AWS: GameDay
“That seems totally bizarre on the face of it, but as you dig down, you end up finding
some dependency no one knew about previously […] We’ve had situations where we
brought down a network in, say, São Paulo, only to find that in doing so we broke our
links in Mexico.”
• Google DiRT (Disaster Recovery Test)
  • Annual disaster recovery and testing exercise
  • 8 years since inception
  • Multi-day exercise triggering (controlled) failures in systems and processes
• Premise
  • 30-day incapacitation of headquarters following a disaster
  • Other offices and facilities may be affected
• When
  • “Big disaster”: annually for 3-5 days
  • Continuous testing: year-round
• Who
  • 100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities)
  • Business units (Human Resources, Finance, Safety, Crisis response, etc.)
Google: DiRT
Source http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
Netflix: Chaos Monkey
Fewer alerts
for ops team
Amazon EC2 and Amazon RDS Service
Disruption in the US East Region
April 29, 2011
September 20th, 2015
Amazon’s DynamoDB service experienced an availability issue in their US-EAST-1
Transfer traffic
to east region
• Dependability. Concepts, techniques, and tools developed over the past four decades. It includes the attributes:
  • Availability. Readiness for correct service.
  • Reliability. Continuity of correct service.
  • Safety. Absence of catastrophic consequences on the user(s) and the environment.
  • Integrity. Absence of improper system alterations.
  • Maintainability. Ability to undergo modifications and repairs.
• Means to attain dependability
  • Fault prevention: means to prevent the occurrence or introduction of faults.
  • Fault tolerance: means to avoid service failures in the presence of faults [Voas98].
  • Fault removal: means to reduce the number and severity of faults.
  • Fault forecasting: means to estimate the present number, the future incidence, and the likely consequences of faults.
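The slide defines availability qualitatively. A common quantitative reading, added here as an assumption rather than taken from the slides, is steady-state availability, MTTF / (MTTF + MTTR). The sketch below converts it into a yearly downtime budget; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time the service is ready: MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(availability: float) -> float:
    """Yearly downtime implied by an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    a = steady_state_availability(mttf_hours=999.0, mttr_hours=1.0)
    print(a, downtime_hours_per_year(a))
```

The formula makes the two levers explicit: fault prevention and removal raise MTTF, while fault tolerance and automated repair lower MTTR.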
Reliability
A. Avizienis, JC Laprie, B. Randell, and C. Landwehr. 2004. Basic Concepts and Taxonomy of
Dependable and Secure Computing. IEEE Trans. Dependable Secur. Comput. 1, 1, 11-33
J. Voas, G. McGraw. “Software Fault Injection: Inoculating programs against errors”. Edit. Wiley. USA, 1998.
Dependability
• Fault. Adjudged or hypothesized cause of an error.
• Error. Discrepancy between a computed, observed, or measured value or condition and a true, specified, or theoretically correct value or condition. An error is a consequence of a fault.
• Failure. Deviation of the delivered service from fulfilling the system function.
Threats
Marcello Cinque, Domenico Cotroneo, Antonio Pecchia. Event Logs for the Analysis of Software Failures: A Rule-Based Approach. IEEE Transactions on Software Engineering, vol. 39, no. 6, pp. 806-821, 2013.
Fault → Error → Failure
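The chain from fault to error to failure can be illustrated with a small sketch. Everything in it is hypothetical: the seeded bug, the latency samples, and the 100 ms alarm threshold are chosen only to show a fault that stays dormant, a fault that produces an error without a visible failure, and an error that reaches the service output.

```python
def mean_latency_ms(samples):
    """Contains a seeded FAULT: divides by a hard-coded 4 instead of len(samples)."""
    return sum(samples) / 4


def classify(samples, alarm_threshold_ms=100):
    """Return which stage of the fault -> error -> failure chain was reached."""
    correct = sum(samples) / len(samples)
    computed = mean_latency_ms(samples)
    if computed == correct:
        return "fault dormant"       # fault present but not activated
    # ERROR: the computed value deviates from the correct one.
    alarm_raised = computed > alarm_threshold_ms
    alarm_expected = correct > alarm_threshold_ms
    if alarm_raised == alarm_expected:
        return "error, no failure"   # the error did not change the delivered service
    return "failure"                 # the delivered service deviates from its function


if __name__ == "__main__":
    print(classify([10, 20, 30, 40]))  # four samples: the fault is not activated
    print(classify([10, 20]))          # wrong mean, but the alarm decision is still right
    print(classify([120, 120]))        # wrong mean suppresses an alarm that should fire
```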
Fault Injection
  FI on simulated models
    VHDL simulation models
    Other languages
  FI on prototypes
    Hardware injection (HWIFI)
      External
        HWIFI at pin level
        Electromagnetic perturbations
      Internal
        Heavy ion radiation
        Laser radiation
        Scan chain
    Software injection (SWIFI) (1)
      Time
        Static
        Dynamic
      Level
        High level
        Machine language
Injection objectives
• Prediction
• Elimination
Fault Injection Techniques
Software-implemented
fault injection (SWIFI)
Fault injection techniques introduce faults to
perturb the normal flow of a program to
extend test coverage or stress test the
system.
Inject a fault into a
software system at run
time.
Advantages
• Experiments can be run in near real time.
• No model development needed.
• Can be expanded for new classes of faults.
Drawbacks
• Limited set of injection instants.
• Cannot inject faults into locations that are inaccessible to software.
• Requires modification of the source code to support the fault injection.
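As a minimal sketch of software-implemented fault injection at run time, the decorator below raises an exception instead of calling the wrapped function with a configurable probability. The names `inject_fault`, `InjectedFault`, and `read_block` are hypothetical and not part of any tool named in this deck. Note that the target has to be wrapped, which is an instance of the source-modification drawback listed above.

```python
import functools
import random


class InjectedFault(RuntimeError):
    """Raised in place of the real call when a fault is injected."""


def inject_fault(probability, rng=None, exc=InjectedFault):
    """Decorator: with the given probability, raise instead of calling the target."""
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < probability:
                raise exc(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_fault(probability=0.3, rng=random.Random(42))
def read_block(block_id):
    """Hypothetical storage call used as the injection target."""
    return f"data-{block_id}"


def run_experiment(calls=1000):
    """Call the target repeatedly and count how many calls hit an injected fault."""
    failures = 0
    for i in range(calls):
        try:
            read_block(i)
        except InjectedFault:
            failures += 1
    return failures


if __name__ == "__main__":
    print("injected failures:", run_experiment())
```

Passing a seeded `random.Random` keeps an experiment repeatable, which helps when a failure found by injection has to be reproduced.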
Huawei: Butterfly Effect
-- Butterfly Effect System --
Automatically Tests and Repairs OpenStack and Cloud
Applications
CLOUD APPLICATION
HUAWEI FusionSphere
The system works by intentionally injecting different failures, testing the ability to
survive them, and learning how to predict and repair failures preemptively
Failure
Repair
Test
In chaos theory, the butterfly effect is the sensitive dependence on initial conditions in which a small change in one state of a deterministic nonlinear
system can result in large differences in a later state. [Wikipedia]
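The failure, test, repair cycle can be sketched as a loop over an in-memory stand-in for the platform. This is an illustration under stated assumptions, not the Butterfly Effect implementation: `SimulatedPlatform`, its service names, and `run_campaign` are hypothetical, and "repair" is reduced to restarting the stopped service.

```python
from dataclasses import dataclass, field


@dataclass
class SimulatedPlatform:
    """Stand-in for a cloud platform: service name -> running flag."""
    services: dict = field(default_factory=lambda: {
        "database": True, "message-queue": True, "nova-compute": True})

    def inject(self, service):
        self.services[service] = False   # failure: stop the service

    def unhealthy(self):
        # test: report every service that is not running
        return sorted(name for name, up in self.services.items() if not up)

    def repair(self, service):
        self.services[service] = True    # repair: restart the service


def run_campaign(platform, targets):
    """Inject one failure at a time, detect it, repair it, and log the result."""
    log = []
    for target in targets:
        platform.inject(target)
        detected = platform.unhealthy()
        for service in detected:
            platform.repair(service)
        log.append({"injected": target,
                    "detected": detected,
                    "recovered": platform.unhealthy() == []})
    return log


if __name__ == "__main__":
    for entry in run_campaign(SimulatedPlatform(), ["database", "nova-compute"]):
        print(entry)
```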
The Strategy
• VM failures
  • send VM creation request
  • find the compute node where the request was scheduled
  • damage the compute server
  • check if the VM creation was re-scheduled to another node
• Disk temporarily unavailable
  • unmount a disk
  • wait for replicas to regenerate
  • remount the disk with the data intact
  • wait for replicas to regenerate; the extra replicas from handoff nodes should get removed
• Disk replacement
  • unmount a disk
  • wait for replicas to regenerate
  • delete the disk and remount it
  • wait for replicas to regenerate
  • extra replicas from handoff nodes should get removed
• Replication
  • damage three disks at the same time
  • more if the replica count is higher
  • check that the replicas didn’t regenerate even after some time period
  • fail if the replicas regenerated
  • this tests if the tests themselves are correct
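The replication scenarios above can be sketched against a toy in-memory object store. The model is an assumption made for illustration and does not reproduce how OpenStack Swift places or regenerates replicas; `SimulatedObjectStore` and the disk names are hypothetical. The second scenario is the self-check from the last group: if replicas "regenerate" after every copy was destroyed, the test harness itself is wrong.

```python
REPLICA_COUNT = 3


class SimulatedObjectStore:
    """Toy model: each disk holds a set of object names."""

    def __init__(self, disks, replica_count=REPLICA_COUNT):
        self.replica_count = replica_count
        self.disks = {disk: set() for disk in disks}
        self.damaged = set()

    def put(self, name, disks):
        for disk in disks:
            self.disks[disk].add(name)

    def damage(self, disk):
        self.disks[disk].clear()       # data on the disk is lost
        self.damaged.add(disk)

    def healthy_copies(self, name):
        return [d for d, objs in self.disks.items()
                if d not in self.damaged and name in objs]

    def replicate(self, name):
        """Regenerate missing replicas from any surviving copy; return success."""
        copies = self.healthy_copies(name)
        if not copies:
            return False               # nothing left to copy from
        spare = [d for d in self.disks
                 if d not in self.damaged and d not in copies]
        for disk in spare[: self.replica_count - len(copies)]:
            self.disks[disk].add(name)
        return len(self.healthy_copies(name)) >= self.replica_count


def scenario_single_disk():
    store = SimulatedObjectStore(["d1", "d2", "d3", "d4", "d5", "d6"])
    store.put("obj", ["d1", "d2", "d3"])
    store.damage("d1")
    return store.replicate("obj")      # replicas should regenerate


def scenario_all_replicas():
    store = SimulatedObjectStore(["d1", "d2", "d3", "d4", "d5", "d6"])
    store.put("obj", ["d1", "d2", "d3"])
    for disk in ["d1", "d2", "d3"]:
        store.damage(disk)
    return store.replicate("obj")      # regeneration here would mean a broken test


if __name__ == "__main__":
    print(scenario_single_disk(), scenario_all_replicas())
```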
• Approach
  • Fully automated and customizable
  • Simple, using ssh and bash scripting
• FusionServer RH2288
  • Deploy and destroy: 2 hours to deploy an OpenStack infrastructure with 32 VMs… 32 seconds to destroy…
• Vagrant. Provides easy-to-configure, reproducible, and portable environments for OpenStack
  • Interfaces to VirtualBox, VMware, AWS, and other providers
• VirtualBox. Free open-source hypervisor for x86 computers from Oracle
  • Management of virtual machines
• RDO. Freely available distribution of OpenStack from Red Hat
  • OpenStack Mitaka
Test Environment
[Figure: test stack consisting of Huawei RH2288 + Fedora, Vagrant, VirtualBox, and the VMs.]
Service to Destroy
Database
Message Queue
Authentication
Hypervisor
Hard drive
The main testing framework of OpenStack is called Tempest, an open-source project with more than 2000 tests. It performs only black-box testing: tests access only the public interfaces.
Nodes, services,
processes, network,
hypervisor, storage, etc.
Nova-Compute
1. Request provisioning UI/CLI
2. Validate Auth data
3. Send API request to NOVA API
4. Validate API token
5. Process API request
6. Publish Provisioning Request
7. Schedule Provisioning
8. Start VM provisioning
9. Configure Network
10. Request Volume
11. Request VM image from Glance
12. Get image URL from Glance
13. Direct Image File Copy
14. Start VM rendering via Hypervisor
Scenario Driven
http://www.slideshare.net/mirantis/openstack-architecture-43160012
http://docs.openstack.org/developer/tempest/field_guide/scenario.html
Create Server
• Create server
Inject Faults
Scenario Faults Process
flavor create
flavor delete
flavor list
host list
hypervisor list
hypervisor show
image add project
image create
image delete
image list
image show
ip fixed add
…
openstack server create --flavor m1.medium --image "fedora-23" --key-name ayoung-pubkey --security-group default --nic net-id=63258623-1fd5-497c-b62d-e0651e03bdca windows_dev
Localized Injection
 State based
 Time based
 State
 Time
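One plausible reading of state-based versus time-based injection, offered as an assumption because the slide only names the two options, is that a state-based trigger fires when the scenario reaches a given step, while a time-based trigger fires once a given amount of time has elapsed. The sketch below uses a simulated clock and hypothetical step names modeled on the provisioning flow.

```python
class TimeTrigger:
    """Fire once the elapsed time since scenario start reaches `after_seconds`."""

    def __init__(self, after_seconds):
        self.after_seconds = after_seconds

    def should_fire(self, step, elapsed):
        return elapsed >= self.after_seconds


class StateTrigger:
    """Fire when the scenario reaches the named step."""

    def __init__(self, step_name):
        self.step_name = step_name

    def should_fire(self, step, elapsed):
        return step == self.step_name


def run_scenario(steps, trigger, seconds_per_step=1.0):
    """Walk through the steps; return the step at which the fault is injected."""
    elapsed = 0.0  # simulated clock, so the sketch never sleeps
    for step in steps:
        if trigger.should_fire(step, elapsed):
            return step
        elapsed += seconds_per_step
    return None


# Hypothetical step names, loosely following the create-server flow.
STEPS = ["validate auth", "schedule provisioning", "configure network",
         "request image", "start VM"]

if __name__ == "__main__":
    print(run_scenario(STEPS, StateTrigger("configure network")))
    print(run_scenario(STEPS, TimeTrigger(after_seconds=3.0)))
```

A state-based trigger targets the same step on every run, whereas a time-based trigger may hit a different step whenever step durations vary.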
Faults to Inject
• Bit-flips: CPU registers/memory
• Memory errors: memory corruption/leaks, lack of memory
• Disk faults: read/write errors, lack of disk space
• Network faults: packet loss, network congestion, etc.
• Terminate instance
• Introduce delays in message delivery
• Corrupt data in DB
• Services, processes, and application crash
• Reboot node
• Configuration error
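Two of the fault types listed above are easy to sketch on plain data. The helpers below are illustrative only and act on in-memory values rather than real registers or network interfaces: `flip_bit` inverts one bit in a byte string, and `drop_messages` simulates packet loss on a list of messages.

```python
import random


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with one bit inverted (bit 0 = LSB of byte 0)."""
    if not 0 <= bit_index < len(data) * 8:
        raise IndexError("bit_index out of range")
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def drop_messages(messages, loss_rate, rng):
    """Simulate packet loss: keep each message with probability 1 - loss_rate."""
    return [m for m in messages if rng.random() >= loss_rate]


if __name__ == "__main__":
    print(flip_bit(b"\x00\x00", 9))
    print(drop_messages(["a", "b", "c", "d"], 0.5, random.Random(1)))
```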
Detect Failures
The main testing framework of OpenStack is called Tempest, an open-source project with more than 2000 tests. It performs only black-box testing: tests access only the public interfaces.
Network tests
• create keypairs
• create security
groups
• create networks
Compute tests
• create a keypair
• create a security
group
• boot an instance
Swift tests
• create a volume
• get the volume
• delete the volume
Identity tests
…
Cinder tests
…
Glance tests
…
echo "$ tempest init cloud-01"
echo "$ cp tempest/etc/tempest.conf cloud-01/etc/"
echo "$ cd cloud-01"
echo "Next is the full test suite:"
echo "$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))'"
echo "Next is the minimum basic test:"
echo "$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.scenario\.test_minimum_basic)'"
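The effect of the selection expression can be checked offline. The sketch below assumes the expression is the usual slow-test filter with its backslash escapes intact, and that the runner applies it with search semantics. The test IDs are made up for illustration and are not real Tempest tests.

```python
import re

# Negative lookahead rejects IDs tagged "slow"; the rest must start with
# tempest.api or tempest.scenario.
SLOW_FILTER = re.compile(r"(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))")

# Hypothetical test IDs in the "module.Class.test[tags]" shape.
SAMPLE_IDS = [
    "tempest.api.compute.test_example.ExampleTest.test_list[id-1,smoke]",
    "tempest.scenario.test_example.ExampleScenario.test_long_run[id-2,slow]",
    "tempest.scenario.test_example.ExampleScenario.test_boot[id-3]",
    "other.tests.tempest.api.test_example.ExampleTest.test_create[id-4]",
]


def select(test_ids):
    """Return the IDs the filter would keep."""
    return [t for t in test_ids if SLOW_FILTER.search(t)]


if __name__ == "__main__":
    for test_id in select(SAMPLE_IDS):
        print(test_id)
```

Only the first and third IDs are kept: the second carries the slow tag, and the fourth does not start with the tempest prefix.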
Detect Failures
[Figure: Tempest 0 (1400 tests, 45 min to 2 h), Tempest 1, Tempest 2, and Tempest 3. Labels: 100%, 100; 40%, 40; overlapping tests; mutually exclusive tests; 5%, log2 40; branch and bound; 4%, log2 20.]
• Side effects. Integration tests often have side effects and require specific setups, so they often cannot be used in production systems. For example, an integration test that deletes all the virtual machines running on a platform cannot be run in production.
• Reuse. Integration test suites are composed of many types of tests (e.g., unit testing, API testing, integration testing, scenario testing, and positive and negative tests). Reusing test code for damage detection is useful, but selecting the right tests can be difficult.
• Filtering. Most of the tests are not relevant for damage detection on production systems. Damage detection looks for components, services, and processes that are no longer working properly, whereas tests determine whether commits to the code generate errors. Many functional tests written to test software code are irrelevant in production.
• Specificity. New code for damage detection always needs to be developed, since testing does not typically look for problems that can happen when a system is in a particular operational state.
Limitations of Integration Tests
Butterfly Effect: Example of Fault Injection
Dmitri Zimine (Brocade) giving his
speech on workflows for auto-
remediation (credits to Johannes
Weingart).
Sebastian Kirsch (Google), co-
author of the bestselling book Site
Reliability Engineering from Google,
and the workshop organizer Jorge
Cardoso (Huawei).
The International Industry-Academia Workshop on Cloud Reliability and Resilience was held
in Berlin on 7-8 November 2016. The workshop gathered close to 50 participants from
industry (Intel, Red Hat, CISCO, SAP, Google, LinkedIn, Microsoft, Mirantis, Brocade, T-
Systems, SysEleven, Deutsche Telekom, Flexiant, Hastexo) and academia (TU Wien, TU
Berlin, U Oxford, ETH, U Stuttgart, TU Chemnitz, U Potsdam, TU Darmstadt, U Lisbon, U
Coimbra).
International Industry-Academia
Workshop on
Cloud Reliability and Resilience
Berlin on 7-8 November 2016
Current Team: Cloud Operations and Analytics
• Objective
  • Planet-scale distributed systems = automation
  • Highly complex systems = AI and machine learning
• Skills and knowledge
  • OpenStack Software Development
  • Machine Learning and Real-time Analysis
  • Reliability for Cloud Native Applications
  • Large-scale distributed systems
• Working Student
  • Distributed Execution Graphs (DEG) for OpenStack.
• Master Students
  • Efficient Diagnosis in Cloud Platforms.
  • DEG-driven Fault Injection for Cloud platforms.
• PhD Students
  • Risk-aware Cloud Recovery using Machine Learning (automation + AI).
• Internship for PhD student
  • Next generation of DEG-driven systems beyond Google’s Dapper and Twitter’s Zipkin.
Open Positions
• Working & MSc students
  • Fault injection, fault models, fault libraries, fault plans, break and rebuild systems all day long, …
• PhD Students
  • Rapid prototyping of cool ideas: propose it today, code it, and show it running in 3 months…
• Postdocs
  • Solving difficult challenges of real problems using quick and dirty prototyping
Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved.
• The complexity and dynamicity of large-scale cloud platforms require automated solutions to reduce the risk of possible failures.
• Fault injection mechanisms make it possible to determine (and repair), under controlled conditions, the types of failures that platforms cannot tolerate, rather than passively waiting for Murphy’s law to come into play on a Sunday at 2 am when engineers are off duty.
• Pioneers such as Amazon, Google, and Netflix have already developed fault injection mechanisms and have changed their mindset with respect to the importance of the resiliency of cloud platforms.
• As an innovation topic, we take one step further towards fault-tolerant platforms by exploring not only fault injection but also the automated recovery of platforms.
Executive Summary
• FIAT: Fault Injection Based Automated Testing Environment, Carnegie Mellon University.
• EFI, PROFI: Processor Fault Injector, Dortmund University.
• FERRARI: Fault and ERRor Automatic Real-time Injector, Texas University.
• SFI, DOCTOR: integrateD sOftware implemented fault injeCTiOn enviRonment, Michigan University.
• FINE: Fault Injection and moNitoring Environment, University of Illinois.
• FTAPE: Fault Tolerance and Performance Evaluator, University of Illinois.
• XCEPTION: Coimbra University.
• MAFALDA, MAFALDA-RT: Microkernel Assessment by Fault injection AnaLysis and Design Aid, LAAS-CNRS, Toulouse.
• BALLISTA: Carnegie Mellon University.
SW Fault Injection Tools

More Related Content

What's hot

Introduction to red team operations
Introduction to red team operationsIntroduction to red team operations
Introduction to red team operationsSunny Neo
 
資訊安全入門
資訊安全入門資訊安全入門
資訊安全入門Tyler Chen
 
CentOS Linux Server Hardening
CentOS Linux Server HardeningCentOS Linux Server Hardening
CentOS Linux Server HardeningMyOwn Telco
 
Understanding Application Threat Modelling & Architecture
 Understanding Application Threat Modelling & Architecture Understanding Application Threat Modelling & Architecture
Understanding Application Threat Modelling & ArchitecturePriyanka Aash
 
Configuring Data Sources in AlienVault
Configuring Data Sources in AlienVaultConfiguring Data Sources in AlienVault
Configuring Data Sources in AlienVaultAlienVault
 
DevOps You Build It, You Own It!
DevOpsYou Build It, You Own It!DevOpsYou Build It, You Own It!
DevOps You Build It, You Own It!Amazon Web Services
 
AI & ML in Cyber Security - Why Algorithms are Dangerous
AI & ML in Cyber Security - Why Algorithms are DangerousAI & ML in Cyber Security - Why Algorithms are Dangerous
AI & ML in Cyber Security - Why Algorithms are DangerousRaffael Marty
 
Introduction To Vulnerability Assessment & Penetration Testing
Introduction To Vulnerability Assessment & Penetration TestingIntroduction To Vulnerability Assessment & Penetration Testing
Introduction To Vulnerability Assessment & Penetration TestingRaghav Bisht
 
Pentesting react native application for fun and profit - Abdullah
Pentesting react native application for fun and profit - AbdullahPentesting react native application for fun and profit - Abdullah
Pentesting react native application for fun and profit - Abdullahidsecconf
 
멀티 클라우드 시대의 정보보호 관리체계
멀티 클라우드 시대의 정보보호 관리체계멀티 클라우드 시대의 정보보호 관리체계
멀티 클라우드 시대의 정보보호 관리체계Logpresso
 
Integrating Security into DevOps
Integrating Security into DevOpsIntegrating Security into DevOps
Integrating Security into DevOpsCloudPassage
 
Sql injection bypassing hand book blackrose
Sql injection bypassing hand book blackroseSql injection bypassing hand book blackrose
Sql injection bypassing hand book blackroseNoaman Aziz
 
Encoded Attacks And Countermeasures
Encoded Attacks And CountermeasuresEncoded Attacks And Countermeasures
Encoded Attacks And CountermeasuresMarco Morana
 
Static Analysis Security Testing for Dummies... and You
Static Analysis Security Testing for Dummies... and YouStatic Analysis Security Testing for Dummies... and You
Static Analysis Security Testing for Dummies... and YouKevin Fealey
 
Fantastic Red Team Attacks and How to Find Them
Fantastic Red Team Attacks and How to Find ThemFantastic Red Team Attacks and How to Find Them
Fantastic Red Team Attacks and How to Find ThemRoss Wolf
 
Scanning web vulnerabilities
Scanning web vulnerabilitiesScanning web vulnerabilities
Scanning web vulnerabilitiesMohit Dholakiya
 
How to implement DevSecOps on AWS for startups
How to implement DevSecOps on AWS for startupsHow to implement DevSecOps on AWS for startups
How to implement DevSecOps on AWS for startupsAleksandr Maklakov
 
Infra as Code with Packer, Ansible and Terraform
Infra as Code with Packer, Ansible and TerraformInfra as Code with Packer, Ansible and Terraform
Infra as Code with Packer, Ansible and TerraformInho Kang
 

What's hot (20)

Introduction to red team operations
Introduction to red team operationsIntroduction to red team operations
Introduction to red team operations
 
資訊安全入門
資訊安全入門資訊安全入門
資訊安全入門
 
CentOS Linux Server Hardening
CentOS Linux Server HardeningCentOS Linux Server Hardening
CentOS Linux Server Hardening
 
Understanding Application Threat Modelling & Architecture
 Understanding Application Threat Modelling & Architecture Understanding Application Threat Modelling & Architecture
Understanding Application Threat Modelling & Architecture
 
Configuring Data Sources in AlienVault
Configuring Data Sources in AlienVaultConfiguring Data Sources in AlienVault
Configuring Data Sources in AlienVault
 
Secure Code Review 101
Secure Code Review 101Secure Code Review 101
Secure Code Review 101
 
DevOps You Build It, You Own It!
DevOpsYou Build It, You Own It!DevOpsYou Build It, You Own It!
DevOps You Build It, You Own It!
 
AI & ML in Cyber Security - Why Algorithms are Dangerous
AI & ML in Cyber Security - Why Algorithms are DangerousAI & ML in Cyber Security - Why Algorithms are Dangerous
AI & ML in Cyber Security - Why Algorithms are Dangerous
 
Introduction To Vulnerability Assessment & Penetration Testing
Introduction To Vulnerability Assessment & Penetration TestingIntroduction To Vulnerability Assessment & Penetration Testing
Introduction To Vulnerability Assessment & Penetration Testing
 
Pentesting react native application for fun and profit - Abdullah
Pentesting react native application for fun and profit - AbdullahPentesting react native application for fun and profit - Abdullah
Pentesting react native application for fun and profit - Abdullah
 
멀티 클라우드 시대의 정보보호 관리체계
멀티 클라우드 시대의 정보보호 관리체계멀티 클라우드 시대의 정보보호 관리체계
멀티 클라우드 시대의 정보보호 관리체계
 
Integrating Security into DevOps
Integrating Security into DevOpsIntegrating Security into DevOps
Integrating Security into DevOps
 
Sql injection bypassing hand book blackrose
Sql injection bypassing hand book blackroseSql injection bypassing hand book blackrose
Sql injection bypassing hand book blackrose
 
Encoded Attacks And Countermeasures
Encoded Attacks And CountermeasuresEncoded Attacks And Countermeasures
Encoded Attacks And Countermeasures
 
Automation CICD
Automation CICDAutomation CICD
Automation CICD
 
Static Analysis Security Testing for Dummies... and You
Static Analysis Security Testing for Dummies... and YouStatic Analysis Security Testing for Dummies... and You
Static Analysis Security Testing for Dummies... and You
 
Fantastic Red Team Attacks and How to Find Them
Fantastic Red Team Attacks and How to Find ThemFantastic Red Team Attacks and How to Find Them
Fantastic Red Team Attacks and How to Find Them
 
Scanning web vulnerabilities
Scanning web vulnerabilitiesScanning web vulnerabilities
Scanning web vulnerabilities
 
How to implement DevSecOps on AWS for startups
How to implement DevSecOps on AWS for startupsHow to implement DevSecOps on AWS for startups
How to implement DevSecOps on AWS for startups
 
Infra as Code with Packer, Ansible and Terraform
Infra as Code with Packer, Ansible and TerraformInfra as Code with Packer, Ansible and Terraform
Infra as Code with Packer, Ansible and Terraform
 

Viewers also liked

Building Fault Tolerant Applications in the cloud - AWS Summit 2012 - NYC
Building Fault Tolerant Applications in the cloud - AWS Summit 2012 - NYC Building Fault Tolerant Applications in the cloud - AWS Summit 2012 - NYC
Building Fault Tolerant Applications in the cloud - AWS Summit 2012 - NYC Amazon Web Services
 
È l'ora del Cloud Managed IT
È l'ora del Cloud Managed ITÈ l'ora del Cloud Managed IT
È l'ora del Cloud Managed ITMatteo Masi
 
Azure Services Platform Oc Event Ned
Azure Services Platform Oc Event NedAzure Services Platform Oc Event Ned
Azure Services Platform Oc Event NedWes Yanaga
 
Gov 2.0: Scaling, Automation, & Management in the Cloud
Gov 2.0: Scaling, Automation, & Management in the CloudGov 2.0: Scaling, Automation, & Management in the Cloud
Gov 2.0: Scaling, Automation, & Management in the CloudJesse Robbins
 
Oracle Management Cloud
Oracle Management CloudOracle Management Cloud
Oracle Management CloudFabio Batista
 
Smau Bologna 2015 - Microsoft - Azure
Smau Bologna 2015 - Microsoft - AzureSmau Bologna 2015 - Microsoft - Azure
Smau Bologna 2015 - Microsoft - AzureSMAU
 
Meraki cloud managed products
Meraki cloud managed productsMeraki cloud managed products
Meraki cloud managed productsAtanas Gergiminov
 
Simplify IT Operations by Unifying Element Management with Vistara
Simplify IT Operations by Unifying Element Management with VistaraSimplify IT Operations by Unifying Element Management with Vistara
Simplify IT Operations by Unifying Element Management with VistaraVistara
 
The Microsoft Cloud - Azure | Office 365 | Intune
The Microsoft Cloud - Azure | Office 365 | IntuneThe Microsoft Cloud - Azure | Office 365 | Intune
The Microsoft Cloud - Azure | Office 365 | IntuneRola Ezzeddine
 
Meraki Company And Product Overview
Meraki Company And Product OverviewMeraki Company And Product Overview
Meraki Company And Product Overviewxanstevenson
 
The Power and Promise of SaaS: CA Cloud Service Management Case Study
The Power and Promise of SaaS: CA Cloud Service Management Case StudyThe Power and Promise of SaaS: CA Cloud Service Management Case Study
The Power and Promise of SaaS: CA Cloud Service Management Case StudyCA Technologies
 
Microsoft Operations Management Suite
Microsoft Operations Management Suite Microsoft Operations Management Suite
Microsoft Operations Management Suite Engin Özkurt
 
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...dev2ops
 
Meraki Cloud Networking Workshop
Meraki Cloud Networking WorkshopMeraki Cloud Networking Workshop
Meraki Cloud Networking WorkshopCisco Canada
 
Microsoft Azure Explained - Hitesh D Kesharia
Microsoft Azure Explained - Hitesh D KeshariaMicrosoft Azure Explained - Hitesh D Kesharia
Microsoft Azure Explained - Hitesh D KeshariaHARMAN Services
 
Delivering operations management success at Morningstar (a case study)
Delivering operations management success at Morningstar (a case study)Delivering operations management success at Morningstar (a case study)
Delivering operations management success at Morningstar (a case study)BMC Software
 
Microsoft Azure And The Competitive Cloud Industry - Collab365
Microsoft Azure And The Competitive Cloud Industry - Collab365Microsoft Azure And The Competitive Cloud Industry - Collab365
Microsoft Azure And The Competitive Cloud Industry - Collab365Richard Harbridge
 
Service-now.com SaaS vs. ASP vs. traditional software
Service-now.com   SaaS vs. ASP vs. traditional softwareService-now.com   SaaS vs. ASP vs. traditional software
Service-now.com SaaS vs. ASP vs. traditional softwareRhett Glauser
 
Lattice Energy LLC - Adequate reasonably priced dispatchable power generation...
Lattice Energy LLC - Adequate reasonably priced dispatchable power generation...Lattice Energy LLC - Adequate reasonably priced dispatchable power generation...
Lattice Energy LLC - Adequate reasonably priced dispatchable power generation...Lewis Larsen
 


Cloud Operations and Analytics: Improving Distributed Systems Reliability using Fault Injection

  • 1. Cloud Operations and Analytics: Improving Distributed Systems Reliability Using Fault Injection. December 12, 2016, Technical University of Munich (www.tum.de). Dr. Jorge Cardoso (jorge.cardoso@huawei.com), Chief Architect for Cloud Operations and Analytics, IT R&D Division.
  • 2. About Me: Jorge Cardoso (http://jorge-cardoso.github.io/). Interests: Cloud Computing; Service Science and Internet of Services; Business Process Management; Semantic Web. Positions in Industry. Short Bio: Prof. Jorge Cardoso obtained his PhD degree in Computer Science from the University of Georgia (US) in 2002. He is Chief Architect for Cloud Operations and Analytics at Huawei GRC in Munich, Germany, and Professor at the University of Coimbra, Portugal. He frequently publishes papers in first-tier conferences such as ICWS, CAISE, and ESWC, and first-tier journals such as IEEE TSC and Journal of Web Semantics. He has published several books on distributed systems, process management systems, and service systems.
  • 3. Contents: 1 Cloud Computing; 2 Cloud Operating Systems; 3 Cloud Reliability; 4 Fault Injection Techniques; 5 Butterfly Effect Project
  • 4. From Virtualization to Clouds: Cloud Computing Deployment Stages of Enterprises
      • Virtualization: computing virtualization; storage virtualization; network and security virtualization
      • Private Cloud: automatic management; elastic resource scheduling; HA based on large clusters
      • Data Center Consolidation: consolidation of multiple DCs; multi-level backup and DR; software-defined networking (SDN)
      • Hybrid Cloud (private and public): unified management; optimal resource allocation; flexible service migration
      • Focus by stage: focus on resources; gradually focus on business; focus on global business; flexible and service-driven
  • 5. Virtualization: Server virtualization is the partitioning of a physical server into smaller virtual servers to maximize resources. The resources of the server are hidden from users. Software is used to divide the physical server into multiple virtual environments (Communications of the ACM, vol. 17, no. 7, 1974, pp. 412-421). [Figure: four separate x86 servers (Windows XP, Windows 2003, Suse, Red Hat), each running two apps at 10% to 18% hardware utilization (12%, 15%, 18%, 10%), are consolidated as virtual machines onto one multi-core, multi-processor x86 host at 70% hardware utilization.]
  • 6. Contents: 1 Cloud Computing; 2 Cloud Operating Systems; 3 Cloud Reliability; 4 Fault Injection Techniques; 5 Butterfly Effect Project
  • 7. Cloud Operating Systems
      • Azure, Amazon, Google, Oracle, OpenStack, SoftLayer, etc.
      • Transforms datacenters into pools of resources
      • Provides a management layer for controlling, automating, and efficiently allocating resources
      • Adopts a self-service mode
      • Enables developers to build cloud-aware applications via standard APIs
  • 8. OpenStack History
      • Started by Rackspace and NASA (2010)
      • Driven by the emergence of virtualization
      • Rackspace wanted to rewrite its cloud servers offering
      • NASA had published code for Nova, a Python-based cloud computing controller

      Series    Status                                      Initial Release Date   EOL Date
      Queens    Future                                      TBD                    TBD
      Pike      Future                                      TBD                    TBD
      Ocata     Under Development                           2017-02-22 (planned)   TBD
      Newton    Current stable release, security-supported  2016-10-06             TBD
      Mitaka    Security-supported                          2016-04-07             2017-04-10
      Liberty   Security-supported                          2015-10-15             2016-11-17
      Kilo      EOL                                         2015-04-30             2016-05-02
      Juno      EOL                                         2014-10-16             2015-12-07
      Icehouse  EOL                                         2014-04-17             2015-07-02
      Havana    EOL                                         2013-10-17             2014-09-30
      Grizzly   EOL                                         2013-04-04             2014-03-29
      Folsom    EOL                                         2012-09-27             2013-11-19
      Essex     EOL                                         2012-04-05             2013-05-06
      Diablo    EOL                                         2011-09-22             2013-05-06
      Cactus    Deprecated                                  2011-04-15
      Bexar     Deprecated                                  2011-02-03
      Austin    Deprecated                                  2010-10-21

      Source: https://www.nextplatform.com/2016/11/03/building-stack-openstack/
  • 9. OpenStack Community
      • 1,500+ active participants!
      • 17 countries represented at Design Summit!
      • 60,000+ downloads!
      • Worldwide network of user groups (North America, South America, Europe, Asia and Africa)
  • 11. OpenStack User Survey: A snapshot of OpenStack users’ attitudes and deployments. April 2016 (https://www.openstack.org/assets/survey/April-2016-User-Survey-Report.pdf). Fig. 4.6, p. 31.
  • 14. Deploying OpenStack (https://www.rdoproject.org/install/quickstart/)
        $ sudo yum install -y centos-release-openstack-newton
        $ sudo yum update -y
        $ sudo yum install -y openstack-packstack
        $ packstack --allinone
  • 15. Contents: 1 Cloud Computing; 2 Cloud Operating Systems; 3 Cloud Reliability; 4 Fault Injection Techniques; 5 Butterfly Effect Project
  • 16. Why does using a cloud infrastructure require advanced approaches for resiliency? One reason [Netflix]: the lack of control over the underlying hardware and the inability to configure it to ensure 100% uptime.
  • 17. Unplanned downtime is caused by*:
      • software bugs: 27%
      • hardware: 23%
      • human error: 18%
      • network failures: 17%
      • natural disasters: 8%
      * Marcus, E., and Stern, H. Blueprints for High Availability: Designing Resilient Distributed Systems. John Wiley & Sons, Inc., 2003.
      Google's 2007 study found the following annualized failure rates (AFRs) for drives:
      • 1 year old: 1.7%
      • 3 years old: >8.6%
      Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz André Barroso. 2007. Failure trends in a large disk drive population. In Proc. of the 5th USENIX Conference on File and Storage Technologies (FAST '07). USENIX Association, Berkeley, CA, USA, 2-2.
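
To make the AFR figures concrete, the following minimal sketch estimates how many drive failures a fleet might see in a year. The fleet size is a hypothetical number, the independence assumption is a simplification, and the 8.6% value is used as a lower bound because the slide reports ">8.6%".

```python
def expected_annual_failures(drive_count: int, afr: float) -> float:
    """Expected number of drive failures per year, given an AFR as a fraction (0..1)."""
    if not 0.0 <= afr <= 1.0:
        raise ValueError("afr must be a fraction between 0 and 1")
    return drive_count * afr


def prob_at_least_one_failure(drive_count: int, afr: float) -> float:
    """Probability that at least one drive fails in a year.

    Assumes drive failures are independent, which is a simplification.
    """
    if not 0.0 <= afr <= 1.0:
        raise ValueError("afr must be a fraction between 0 and 1")
    return 1.0 - (1.0 - afr) ** drive_count


if __name__ == "__main__":
    fleet_size = 10_000  # hypothetical fleet size, not from the slides
    for label, afr in [("1 year old", 0.017), ("3 years old (lower bound)", 0.086)]:
        failures = expected_annual_failures(fleet_size, afr)
        print(f"{label}: about {failures:.0f} expected failures per year")
```

Even with a low per-drive AFR, a large fleet should expect failures routinely, which is why resilience has to be designed in rather than assumed.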
  • 18. Amazon AWS: GameDay
      • A program designed to increase resilience by purposely injecting major failures
      • Discover flaws and subtle dependencies
      “That seems totally bizarre on the face of it, but as you dig down, you end up finding some dependency no one knew about previously […] We’ve had situations where we brought down a network in, say, São Paulo, only to find that in doing so we broke our links in Mexico.”
  • 19. Google: DiRT
      • Google DiRT (Disaster Recovery Test)
          • Annual disaster recovery and testing exercise
          • 8 years since inception
          • Multi-day exercise triggering (controlled) failures in systems and processes
      • Premise
          • 30-day incapacitation of headquarters following a disaster
          • Other offices and facilities may be affected
      • When
          • “Big disaster”: annually for 3-5 days
          • Continuous testing: year-round
      • Who
          • 100s of engineers (Site Reliability, Network, Hardware, Software, Security, Facilities)
          • Business units (Human Resources, Finance, Safety, Crisis response, etc.)
      Source: http://flowcon.org/dl/flowcon-sanfran-2014/slides/KripaKrishnan_LearningContinuouslyFromFailures.pdf
  • 20. Netflix: Chaos Monkey
      • April 29, 2011: Amazon EC2 and Amazon RDS Service Disruption in the US East Region
      • September 20th, 2015: Amazon’s DynamoDB service experienced an availability issue in their US-EAST-1 region
      • Fewer alerts for ops team
      • Transfer traffic to east region
  • 21. Reliability: Dependability
      • Dependability. Concepts, techniques, and tools developed over the past four decades. It includes the following attributes:
          • Availability. Readiness for correct service.
          • Reliability. Continuity of correct service.
          • Safety. Absence of catastrophic consequences on the user(s) and the environment.
          • Integrity. Absence of improper system alterations.
          • Maintainability. Ability to undergo modifications and repairs.
      • Means to attain dependability
          • Fault prevention: means to prevent the occurrence or introduction of faults.
          • Fault tolerance: means to avoid service failures in the presence of faults [Voas98].
          • Fault removal: means to reduce the number and severity of faults.
          • Fault forecasting: means to estimate the present number, the future incidence, and the likely consequences of faults.
      A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. 2004. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Trans. Dependable Secur. Comput. 1, 1, 11-33.
      [Voas98] J. Voas, G. McGraw. Software Fault Injection: Inoculating Programs Against Errors. Wiley, USA, 1998.
  • 22. Threats
      • Fault. Adjudged or hypothesized cause of an error.
      • Error. Discrepancy between a computed, observed, or measured value or condition and a true, specified, or theoretically correct value or condition. An error is a consequence of a fault.
      • Failure. Deviation of the delivered service from fulfilling the system function.
      Figure: Fault (Ft), Error (E), Failure (Fl).
      Marcello Cinque, Domenico Cotroneo, Antonio Pecchia. Event Logs for the Analysis of Software Failures: A Rule-Based Approach, vol. 39, no. 6, pp. 806-821.
  • 23. Contents: 1 Cloud Computing; 2 Cloud Operating Systems; 3 Cloud Reliability; 4 Fault Injection Techniques; 5 Butterfly Effect Project
  • 24. Fault Injection Techniques
      Fault Injection
      • FI on simulated models
          • VHDL simulation models
          • Other languages
      • FI on prototypes
          • Hardware injection (HWIFI)
              • External: HWIFI at pin level, electromagnetic perturbations
              • Internal: heavy ion radiation, laser radiation, scan chain
          • Software injection: SWIFI (1)
              • Time: static, dynamic
              • Level: high level, machine language
      • Injection objectives
          • Prediction
          • Elimination
      Software-implemented fault injection (SWIFI)
      Fault injection techniques introduce faults to perturb the normal flow of a program in order to extend test coverage or stress test the system. SWIFI injects a fault into a software system at run time.
      + Experiments can be run in near real-time
      + No model development needed
      + Can be expanded for new classes of faults
      - Limited set of injection instants
      - Cannot inject faults into locations that are inaccessible to software
      - Requires modification of the source code to support the fault injection
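As an illustration of run-time software fault injection, the following is a minimal Python sketch, not the tooling used in the project. The names `inject_fault`, `InjectedFault`, `read_with_retry`, and `make_faulty_reader` are hypothetical. A decorator perturbs the normal flow of a stand-in storage read so that the retry logic under test can be exercised.

```python
import functools
import random


class InjectedFault(IOError):
    """Marks an error raised on purpose by the injector (hypothetical name)."""


def inject_fault(probability, rng):
    """Wrap a function so that a call fails with the given probability."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Perturb the normal flow before the real code runs.
            if rng.random() < probability:
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


def read_with_retry(read, block_id, attempts=3):
    """Code under test: retry a read a bounded number of times."""
    for attempt in range(attempts):
        try:
            return read(block_id)
        except IOError:
            # Give up and re-raise once the last attempt has failed.
            if attempt == attempts - 1:
                raise


def make_faulty_reader(probability, seed=0):
    """Build a stand-in storage read that fails at the given rate."""
    @inject_fault(probability, random.Random(seed))
    def read_block(block_id):
        return f"data-{block_id}"
    return read_block


if __name__ == "__main__":
    flaky = make_faulty_reader(probability=0.3, seed=42)
    print(read_with_retry(flaky, 7, attempts=10))
```

The decorator also shows one of the limitations listed on the slide: the target function has to be modified (here, wrapped) to support injection, and faults can only be placed where software can reach.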
  • 25. Huawei: Butterfly Effect
      Butterfly Effect System: automatically tests and repairs OpenStack and cloud applications.
      The system works by intentionally injecting different failures, testing the ability to survive them, and learning how to predict and repair failures preemptively.
      Figure labels: Failure, Repair, Test; CLOUD APPLICATION; HUAWEI FusionSphere.
      In chaos theory, the butterfly effect is the sensitive dependence on initial conditions in which a small change in one state of a deterministic nonlinear system can result in large differences in a later state. [Wikipedia]
  • 26. The Strategy
      • VM failures
          • Send a VM creation request
          • Find the compute node where the request was scheduled
          • Damage the compute server
          • Check if the VM creation was re-scheduled to another node
      • Disk temporarily unavailable
          • Unmount a disk
          • Wait for replicas to regenerate
          • Remount the disk with the data intact
          • Wait for replicas to regenerate
          • The extra replicas from handoff nodes should get removed
      • Disk replacement
          • Unmount a disk
          • Wait for replicas to regenerate
          • Delete the disk and remount it
          • Wait for replicas to regenerate
          • The extra replicas from handoff nodes should get removed
      • Replication
          • Damage three disks at the same time (more if the replica count is higher)
          • Check that the replicas did not regenerate even after some time period
          • Fail if the replicas regenerated
          • This tests if the tests themselves are correct
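The "disk temporarily unavailable" and "replication" scenarios above can be sketched against a toy replica model. This is a simplified simulation under stated assumptions, not the project's scripts and not Swift's actual replicator. `ToyObjectStore` and `disk_temporarily_unavailable` are hypothetical names, and a single object with three replicas and two handoff disks is assumed.

```python
class ToyObjectStore:
    """Toy replica model for one object (hypothetical, for illustration)."""

    def __init__(self, disks, replica_count=3):
        self.replica_count = replica_count
        self.mounted = {d: True for d in disks}
        self.primaries = list(disks[:replica_count])  # assigned disks
        self.handoffs = list(disks[replica_count:])   # spare disks
        self.copies = set(self.primaries)             # disks holding a copy

    def unmount(self, disk):
        self.mounted[disk] = False

    def remount(self, disk):
        self.mounted[disk] = True

    def live_copies(self):
        return {d for d in self.copies if self.mounted[d]}

    def replicate(self):
        """One replication pass: restore the count, then drop extras."""
        if not self.live_copies():
            return  # no reachable copy left, nothing to regenerate from
        # Push copies to mounted handoff disks while replicas are missing.
        for disk in self.handoffs:
            if len(self.live_copies()) >= self.replica_count:
                break
            if self.mounted[disk]:
                self.copies.add(disk)
        # Handoff copies are removed once every primary is reachable again.
        if all(self.mounted[d] and d in self.copies for d in self.primaries):
            self.copies -= set(self.handoffs)


def disk_temporarily_unavailable(store, disk):
    """Run the scenario and report whether the store behaved as expected."""
    store.unmount(disk)
    store.replicate()
    regenerated = len(store.live_copies()) == store.replica_count
    store.remount(disk)  # the data on the disk is intact
    store.replicate()
    extras_removed = store.copies == set(store.primaries)
    return regenerated and extras_removed


if __name__ == "__main__":
    store = ToyObjectStore(["d1", "d2", "d3", "d4", "d5"])
    print(disk_temporarily_unavailable(store, "d2"))
```

The same model covers the sanity check in the replication scenario: if all three primary disks are unmounted at once, `replicate()` has no source copy and nothing regenerates, which is the outcome that scenario expects.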
  • 27. Test Environment
      • Approach
          • Fully automated and customizable
          • Simple, using ssh and bash scripting
      • FusionServer RH2288
          • Deploy and destroy: 2 hours to deploy an OpenStack infrastructure with 32 VMs, 32 seconds to destroy it
      • Vagrant. Provides easy-to-configure, reproducible, and portable environments for OpenStack
          • Interfaces to VirtualBox, VMware, AWS, and other providers
      • VirtualBox. Free open-source hypervisor for x86 computers from Oracle
          • Management of virtual machines
      • RDO. Freely available distribution of OpenStack from Red Hat
          • OpenStack Mitaka
      Figure: Huawei RH2288 + Fedora, Vagrant, VirtualBox, VMs.
  • 28. Service to Destroy
      Database, Message Queue, Authentication, Hypervisor, Hard drive, Nova-Compute
      Nodes, services, processes, network, hypervisor, storage, etc.
      The main testing framework of OpenStack is called Tempest, an open-source project with more than 2000 tests. It performs only black-box testing (tests only access the public interfaces).
  • 29. Scenario Driven
      Scenario: Create Server
      Process:
          1. Request provisioning UI/CLI
          2. Validate Auth data
          3. Send API request to NOVA API
          4. Validate API token
          5. Process API request
          6. Publish Provisioning Request
          7. Schedule Provisioning
          8. Start VM provisioning
          9. Configure Network
          10. Request Volume
          11. Request VM image from Glance
          12. Get image URL from Glance
          13. Direct Image File Copy
          14. Start VM rendering via Hypervisor
      Faults: Inject Faults
      Commands: flavor create, flavor delete, flavor list, host list, hypervisor list, hypervisor show, image add project, image create, image delete, image list, image show, ip fixed add, …
      openstack server create --flavor m1.medium --image "fedora-23" --key-name ayoung-pubkey --security-group default --nic net-id=63258623-1fd5-497c-b62d-e0651e03bdca windows_dev
      http://www.slideshare.net/mirantis/openstack-architecture-43160012
      http://docs.openstack.org/developer/tempest/field_guide/scenario.html
  • 30. Localized Injection
      • State based
      • Time based
      Figure labels: State, Time.
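One way to read the two localization modes is as two kinds of trigger: a state-based trigger fires when a scenario reaches a given step, and a time-based trigger fires once a given amount of time has elapsed. The sketch below is an interpretation rather than the project's implementation. The function names and step durations are hypothetical, and the step names are borrowed from the server-creation flow.

```python
def time_trigger(at_seconds):
    """Fire once the scenario has been running for at_seconds."""
    return lambda step, elapsed: elapsed >= at_seconds


def state_trigger(step_name):
    """Fire when the scenario enters the named step."""
    return lambda step, elapsed: step == step_name


def run_scenario(steps, trigger, inject):
    """Walk (step, duration) pairs and inject once when the trigger fires.

    The trigger is only evaluated at step boundaries, so a time-based
    fault lands at the first step that starts after the deadline.
    """
    elapsed, injected_at = 0.0, None
    for step, duration in steps:
        if injected_at is None and trigger(step, elapsed):
            inject(step)
            injected_at = step
        elapsed += duration
    return injected_at


# Hypothetical durations in seconds for a few provisioning steps.
STEPS = [
    ("Validate API token", 1.0),
    ("Schedule Provisioning", 2.0),
    ("Start VM provisioning", 5.0),
    ("Configure Network", 3.0),
]

if __name__ == "__main__":
    log = []
    run_scenario(STEPS, state_trigger("Start VM provisioning"), log.append)
    run_scenario(STEPS, time_trigger(0.5), log.append)
    print(log)
```

In practice `inject` would be replaced by one of the concrete faults, such as terminating an instance or delaying message delivery.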
  • 31. Faults to Inject
      • Bit-flips: CPU registers/memory
      • Memory errors: memory corruptions/leaks, lack of memory
      • Disk faults: read/write errors, lack of disk space
      • Network faults: packet loss, network congestion, etc.
      • Terminate instance
      • Introduce delays in message delivery
      • Corrupt data in DB
      • Services, processes, and application crash
      • Reboot node
      • Configuration error
  • 32. Detect Failures
      The main testing framework of OpenStack is called Tempest, an open-source project with more than 2000 tests. It performs only black-box testing (tests only access the public interfaces).
      • Network tests: create keypairs, create security groups, create networks
      • Compute tests: create a keypair, create a security group, boot an instance
      • Swift tests: create a volume, get the volume, delete the volume
      • Identity tests: …
      • Cinder tests: …
      • Glance tests: …

      echo "$ tempest init cloud-01"
      echo "$ cp tempest/etc/tempest.conf cloud-01/etc/"
      echo "$ cd cloud-01"
      echo "Next is the full test suite:"
      echo "$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))'"
      echo "Next is the minimum basic test:"
      echo "$ ostestr -c 3 --regex '(?!.*\[.*\bslow\b.*\])(^tempest\.scenario.test_minimum_basic)'"
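The test-selection regex for the full suite, with its backslash escapes written out, combines a negative lookahead that rejects any test tagged `slow` inside its bracketed attribute list with an anchor that keeps only `tempest.api` and `tempest.scenario` tests. The short sketch below checks that behaviour with Python's `re` module. The test IDs are made up for illustration and `select_tests` is a hypothetical helper, not part of Tempest or ostestr.

```python
import re

# Full-suite selection pattern, with the backslash escapes written out.
FULL_SUITE = re.compile(r"(?!.*\[.*\bslow\b.*\])(^tempest\.(api|scenario))")


def select_tests(test_ids, pattern=FULL_SUITE):
    """Return the test IDs that the pattern would select."""
    return [test_id for test_id in test_ids if pattern.search(test_id)]


# Made-up test IDs in the usual "module.Class.test[attributes]" shape.
EXAMPLE_IDS = [
    "tempest.api.compute.test_a[id-1,smoke]",
    "tempest.scenario.test_b[id-2,slow]",
    "neutron.tests.test_c[id-3]",
    "tempest.api.compute.test_slowpath[id-4]",
]

if __name__ == "__main__":
    for test_id in select_tests(EXAMPLE_IDS):
        print(test_id)
```

Only the first and last IDs are selected: the second carries the `slow` attribute, the third is not a Tempest test, and `test_slowpath` survives because `slow` appears in its name rather than as a whole word inside the brackets.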
  • 33. Detect Failures: Limitations of Integration Tests
      Figure: Tempest 0 (1400 tests, 45 min to 2 h), Tempest 1, Tempest 2, Tempest 3. Labels: overlapping tests; mutually exclusive tests; branch and bound; 100%, 100; 40%, 40; 5%, log2 40; 4%, log2 20.
      • Side effects. Integration tests often have side effects and require specific setups, so they often cannot be used in production systems. For example, integration tests that delete all the virtual machines running on a platform cannot be run in production.
      • Reuse. Integration tests are composed of many types of tests (e.g., unit testing, API testing, integration testing, scenario testing, and positive and negative tests). Reusing test code for damage detection is useful, but the selection can be difficult.
      • Filtering. Most of the tests are not relevant for damage detection on production systems. While damage detection looks for components, services, and processes that are no longer working properly, tests determine if commits to code generate errors. When software code is tested, many functional tests are irrelevant in production.
      • Specificity. New code for damage detection always needs to be developed, since testing does not typically look for problems that can happen when a system is in a particular operational state.
  • 34. Butterfly Effect: Example of Fault Injection
  • 35. Butterfly Effect: Example of Fault Injection
  • 36. International Industry-Academia Workshop on Cloud Reliability and Resilience, Berlin, 7-8 November 2016
      The International Industry-Academia Workshop on Cloud Reliability and Resilience was held in Berlin on 7-8 November 2016. The workshop gathered close to 50 participants from industry (Intel, Red Hat, Cisco, SAP, Google, LinkedIn, Microsoft, Mirantis, Brocade, T-Systems, SysEleven, Deutsche Telekom, Flexiant, Hastexo) and academia (TU Wien, TU Berlin, U Oxford, ETH, U Stuttgart, TU Chemnitz, U Potsdam, TU Darmstadt, U Lisbon, U Coimbra).
      Photo: Dmitri Zimine (Brocade) giving his speech on workflows for auto-remediation (credits to Johannes Weingart).
      Photo: Sebastian Kirsch (Google), co-author of the bestselling book Site Reliability Engineering from Google, and the workshop organizer Jorge Cardoso (Huawei).
  • 37. Current Team: Cloud Operations and Analytics (Open Positions)
      • Objective
          • Planet-scale distributed systems = automation
          • Highly complex systems = AI and machine learning
      • Skills and knowledge
          • OpenStack software development
          • Machine learning and real-time analysis
          • Reliability for cloud-native applications
          • Large-scale distributed systems
      • Working student
          • Distributed Execution Graphs (DEG) for OpenStack
      • Master students
          • Efficient diagnosis in cloud platforms
          • DEG-driven fault injection for cloud platforms
      • PhD students
          • Risk-aware cloud recovery using machine learning (automation + AI)
      • Internship for PhD student
          • Next generation of DEG-driven systems beyond Google’s Dapper and Twitter’s Zipkin
      • Working and MSc students
          • Fault injection, fault models, fault libraries, fault plans, break and rebuild systems all day long, …
      • PhD students
          • Rapid prototyping of cool ideas: propose it today, code it, and show it running in 3 months…
      • Postdocs
          • Solving difficult challenges of real problems using quick and dirty prototyping
  • 38. Copyright©2015 Huawei Technologies Co., Ltd. All Rights Reserved. The information in this document may contain predictive statements including, without limitation, statements regarding the future financial and operating results, future product portfolio, new technology, etc. There are a number of factors that could cause actual results and developments to differ materially from those expressed or implied in the predictive statements. Therefore, such information is provided for reference purpose only and constitutes neither an offer nor an acceptance. Huawei may change the information at any time without notice. HUAWEI ENTERPRISE ICT SOLUTIONS A BETTER WAY
  • 39. Executive Summary
      • The complexity and dynamicity of large-scale cloud platforms require automated solutions to reduce the risks of eventual failures.
      • Fault injection mechanisms make it possible to determine (and repair) the types of failures that platforms cannot tolerate under controlled conditions, rather than passively waiting for Murphy’s law to come into play on a Sunday at 2am when engineers are off duty.
      • Pioneers such as Amazon, Google, and Netflix have already developed fault injection mechanisms and have also changed their mindset with respect to the importance of the resiliency of cloud platforms.
      • As an innovation topic, we take one step further towards fault-tolerant platforms by exploring not only fault injection but also the automated recovery of platforms.
  • 40. SW Fault Injection Tools
      • FIAT: Fault Injection Based Automated Testing Environment, Carnegie Mellon University.
      • EFI, PROFI: Processor Fault Injector, Dortmund University.
      • FERRARI: Fault and ERRor Automatic Real-time Injector, Texas University.
      • SFI, DOCTOR: integrateD sOftware implemented fault injeCTiOn enviRonment, Michigan University.
      • FINE: Fault Injection and moNitoring Environment, Illinois University.
      • FTAPE: Fault Tolerance and Performance Evaluator, Illinois University.
      • XCEPTION: Coimbra University.
      • MAFALDA, MAFALDA-RT: Microkernel Assessment by Fault injection AnaLysis and Design Aid, LAAS-CNRS, Toulouse.
      • BALLISTA: Carnegie Mellon University.