SlideShare a Scribd company logo
1 of 77
Download to read offline
Dependable Systems
!
Summary
Dr. Peter Trƶger

Lena Herscheid
Dependable Systems Course PT 2014
Dependability
ā€¢ Umbrella term for operational requirements on a system

ā€¢ IFIP WG 10.4: "[..] the trustworthiness of a computing system which allows reliance
to be justiļ¬ably placed on the service it delivers [..]"

ā€¢ IEC IEV: "dependability (is) the collective term used to describe the availability
performance and its inļ¬‚uencing factorsĀ : reliability performance, maintainability
performance and maintenance support performance"

ā€¢ Laprie: ā€ž Trustworthiness of a computer system such that ā€Ø
reliance can be placed on the service it delivers to the user ā€œ

ā€¢ Adds a third dimension to system quality

ā€¢ General question: How to deal with unexpected events ?

ā€¢ In German: ,VerlƤsslichkeitā€˜ vs. ,ZuverlƤssigkeitā€˜ā€Ø
2
Dependable Systems Course PT 2014
Course Topics
ā€¢ Deļ¬nitions and metrics

ā€¢ Fault, error, failure, fault models, fault tolerance, fault forecasting, ...

ā€¢ Reliability, availability, maintainability, error mitigation, damage conļ¬nement, ...

ā€¢ Reliability engineering, reliability testing, MTTF, MTBF, failure rate, ...

ā€¢ Fault tolerance patterns

ā€¢ N-out-of-M, voting, heartbeat, ...

ā€¢ Analytical evaluation

ā€¢ Reliability block diagrams, fault tree analysis, ...

ā€¢ Petri nets, Markov chains

ā€¢ Root cause analysis, risk analysis and assessment
3
Dependable Systems Course PT 2014
Course Topics
ā€¢ Hardware-Implemented Dependability

ā€¢ Failure rates, testing, hardware redundancy approaches 

ā€¢ Software-Implemented Dependability

ā€¢ Fault-tolerant programming: simplex, recovery blocks, checkpointing, roll-back, ...

ā€¢ Testing, software metrics

ā€¢ Fault-tolerant distributed systems: Transactions, consensus, clocks, ...

ā€¢ Advanced approaches

ā€¢ Autonomic computing, rejuvenation, online failure prediction, performability

ā€¢ Case studies from practice
4
Dependable Systems Course PT 2014
Dependability Tree (Laprie)
5
Dependable Systems Course PT 2014
Dependability Stakeholders
ā€¢ System - Entity with function, behavior, and structure

ā€¢ A number of components or subsystems, which interact under the control of a
design [Robinson]

ā€¢ Service - System behavior abstraction, as perceived by the user

ā€¢ User - Human or physical system that interacts with the systems service

ā€¢ Speciļ¬cation - Deļ¬nition of expected service and delivery conditions 

ā€¢ On diļ¬€erent levels, can lead to speciļ¬cation fault

ā€¢ Reliance demands assessment of non-functional dependability attributes
ā€¢ Provide ability for trustworthy service delivery by dependability means
ā€¢ Undesired (maybe expected) circumstances form dependability threats
6
Dependable Systems Course PT 2014
2 - Dependability Threats
ā€¢ System failure - ,Ausfallā€˜

ā€¢ Event that occurs when the service no longer complies with the speciļ¬cation /
deviates from the correct service.

ā€¢ System error - ,Fehler(zustand)ā€˜

ā€¢ Part of system state that can lead to subsequent failure

ā€¢ Some sources deļ¬ne errors as active faults - not in this course ... 

ā€¢ System fault - ,Fehler(ursache)ā€˜

ā€¢ Adjudged or hypothesized cause of an error

ā€¢ Failure occurs when error state alters the provided service

ā€¢ Systems are build from connected components, which are again systems

ā€¢ Fault is the consequence of a failure of some other system to deliver its service
7
Fault
Error
Failure
Dependable Systems Course PT 2014
Chain of Dependability Threats (Avizienis)
8
Fault
Error
Failure
Activation
Propagation
Causation
Fault
Dependable Systems Course PT 2014
Faults
ā€¢ High diversity in possible sources and types

ā€¢ Fault nature

ā€¢ Accidental faults (,Zufallsfehlerā€˜) vs. intentional faults (,Absichtsfehlerā€˜)

ā€¢ Intentional faults are created deliberately, presumably malevolently

ā€¢ Fault origin viewpoints (not exclusive)

ā€¢ Phenomenological causes: Physical / natural faults vs. human-made faults

ā€¢ System boundaries: Internal faults (part of system state that produces an error)
vs. external faults (interference with the environment)

ā€¢ Phase of creation: Design faults vs. operational faults

ā€¢ Temporal persistence

ā€¢ Permanent faults vs. temporary faults
9
Dependable Systems Course PT 2014
System-Level Fault Model
ā€¢ Fault model idea originates from hardware

ā€¢ How many faults of diļ¬€erent classes can occur ? ā€Ø
What do I tolerate ?

ā€¢ Timing of faults: Fault delay, repeat time, recovery time, ...

ā€¢ Also mappable to software or even complete systems

ā€¢ Activities as black box, only look on input and output messages

ā€¢ Link faults are mapped to the ā€Ø
participating components

ā€¢ Every participating component ā€Ø
would need a fault model - ā€Ø
pick the most urgent ones
10
Dependable Systems Course PT 2014
Error Propagation
11
(C) Avizienis
Dependable Systems Course PT 2014
Error Message Occurrence (Hansen & Siewiorek)
ā€¢ Same fault can lead to diļ¬€erent (detected or undetected) errors

ā€¢ Errors become detected by error detection mechanism

ā€¢ Some undetected errors are detected by several detectors

ā€¢ Some detectors report several undetected errors as one

ā€¢ Some undetected errors are never uncovered 

ā€¢ Detected errors might not be logged, if the system stops too fast
12
Dependable Systems Course PT 2014
Failures
ā€¢ Non-compliance with the speciļ¬cation - arbitrary failure (ā€˜willkĆ¼rlicher Ausfallā€˜)
ā€¢ System failures can be further categorized in failure modes
ā€¢ Fail-silent / crash failure mode - incorrect results are not delivered

ā€¢ Fail-stop mode - constant value is delivered

ā€¢ Failure mode domain - what is inļ¬‚uenced

ā€¢ Service result - value failures
ā€¢ Service timeliness - timing failures

ā€¢ Service availability - stopping failures
ā€¢ User perception in the mode - consistent / inconsistent for all users

ā€¢ Failure mode consequences for ranking the identiļ¬ed issues
13
Dependable Systems Course PT 2014
Failure Severity (,Schweregrad des Ausfallsā€˜)
ā€¢ Denotes consequences of failure
ā€¢ Benign failures (,unkritische AusfƤlleā€˜)

ā€¢ Failure costs and operational beneļ¬ts are similar

ā€¢ Sometimes also umbrella term for failures only detected by inspection

ā€¢ A system with only such failures is fail-safe

ā€¢ Catastrophic failures (,kritische AusfƤlleā€˜)

ā€¢ Costs of failure consequences are much larger than service beneļ¬t

ā€¢ Signiļ¬cant / serious failures - Intermediate steps expressing reduced service

ā€¢ Grading of failure consequences on overall system depends on application

ā€¢ Flying airplane - Catastrophic stopping failure, Train - Benign stopping failure

ā€¢ Criticality - Highest severity of possible failure modes in the system
14
Dependable Systems Course PT 2014
Example: DO-178B Standard
15
Dependable Systems Course PT 2014
3 - Means for Dependability
ā€¢ Fault prevention - Prevent fault occurrence or introduction

ā€¢ Fault tolerance - Provide service matching the speciļ¬cation under faults

ā€¢ Fault removal - How to reduce the presence of faults

ā€¢ Fault forecasting- Estimate the present number, future incidence, and the
consequences of faults

ā€¢ Combined utilization
16
Dependable Systems Course PT 2014
Fault Tolerance
ā€¢ Fault tolerance is the ability of a system to operate correctly in presence of faults. 

or

ā€¢ A system S is called k-fault-tolerant with respect to a set of ā€Ø
algorithms {A1, A2, ... , Ap} and a set of faults {F1, F2, ... , Fp} ā€Ø
if for every k-fault F in S, Ai is executable by a subsystem of system S with k faults.
(Hayes, 9/76)

or

ā€¢ Fault tolerance is the use of redundancy (time or space) to achieve the desired level
of system dependability - costs !

ā€¢ Accepts that an implemented system will not be fault-free

ā€¢ Implements automatic recovery from errors

ā€¢ Is a recursive concept (voter replication, self-checking checkers, stable memory)
17
Error Processing
Dependable Systems Course PT 2014
Phases of Fault Tolerance (Hanmer)
18
Latent
Fault
Error
Normal
OperationFault
Activation
Error Recovery
Error Mitigation
Error
Detection
Fault
Treatment
Dependable Systems Course PT 2014
Fault Tolerance - Error Processing Through Recovery
ā€¢ Forward error recovery

ā€¢ Error is masked to reach again a consistent state (fault compensation)

ā€¢ Corrective actions need detailled knowledge (damage assessment)

ā€¢ New state is typically computed in another way

ā€¢ Examples with compensation: Error correcting codes, non-journaling ļ¬le
system check, advanced exception handlers, voters

ā€¢ Backward error recovery

ā€¢ Roll back to previous consistent state (recovery point / checkpoint) 

ā€¢ Very suitable for transient faults

ā€¢ Computation can be re-done with same components (retry), ā€Ø
with alternate components (reconļ¬gure), or can be ignored (skip frame)
19
Dependable Systems Course PT 2014
ā€žFail-Fastā€œ
ā€¢ A common concept from system engineering, company management, ...

ā€¢ ā€žReport failure and stop immediately without further actionā€œ
ā€¢ Discussed by Jim Gray in 1985 as part of his famous articleā€Ø
ā€žWhy do computers stop and what can be done about it ?ā€œ

ā€¢ Useful when beneļ¬t from recovery is not good enough for its costs, ā€Ø
or if error propagation is highly probable

ā€¢ Single units of a redundant set

ā€¢ Deeply interwired IT system components

ā€¢ Components under heavy request load

ā€¢ And ... crappy start-up companies
20
Dependable Systems Course PT 2014
4 - Fault Tolerance Patterns
ā€¢ Architectural patterns
ā€¢ Considerations that cut across all parts of the system

ā€¢ Need to be applied in early design phase

ā€¢ Detection patterns
ā€¢ Detect the presence of root faults, error states, and failures

ā€¢ Errors vs. failures -> a-priori knowledge vs. comparison of redundant elements

ā€¢ Error Recovery Patterns
ā€¢ Methods to continue execution in a new error-free state

ā€¢ Undoing the error eļ¬€ects + creating the new state
21
Dependable Systems Course PT 2014
4 - Fault Tolerance Patterns
ā€¢ Error Mitigation Patterns
ā€¢ Do not change application or system state, ā€Ø
but mask the error and compensate for the eļ¬€ects

ā€¢ Typical strategies for timing or performance faults

ā€¢ Fault Treatment Patterns
ā€¢ Prevent the error from reoccurring by repairing the fault

ā€¢ System veriļ¬cation

ā€¢ Diagnosis of fault location and nature

ā€¢ Correction of the system and / or the procedures
22
Dependable Systems Course PT 2014
5 - Attributes of Dependability
ā€¢ Non-functional attributes such as reliability and maintainability

ā€¢ Complementary nature of viewpoints in the area of dependability

ā€¢ In comparison to functional properties

ā€¢ ... hard to deļ¬ne

ā€¢ ... hard to abstract

ā€¢ ... ,Divide and conquerā€˜ does not work as good

ā€¢ ... diļ¬ƒcult interrelationships

ā€¢ ... often probabilistic dependencies
23
Dependable Systems Course PT 2014
In Detail
ā€¢ Reliability - Function R(t) 

ā€¢ Probability that a system is functioning properly and constantly over time period t

ā€¢ Assumes that system was fully operational at t=0

ā€¢ Denotes failure-free interval of operation
ā€¢ Availability - Statement if a system is operational at a point in time / fraction of time

ā€¢ Describe system behavior in presence of error treatment mechanisms 

ā€¢ Instantaneous availability (at t) - Probability that a system is performing
correctly at time t; equal to reliability for non-repairable systems: Ai(t) = R(t)

ā€¢ Steady-state availability - Probability that a system will be operational at any
random point of time, expressed as the fraction of time a system is operational
during its expected lifetime: As = Uptime / Lifetime
24
Dependable Systems Course PT 2014
Why Exponential ?
ā€¢ Distribution function that models the memoryless property of the Poisson process
ā€¢ P(T > t + s|T > t) = P(T > s), e.g. PFailure(5 years|T > 2 years) = PFailure(3 years)

ā€¢ Failure is not the result of wear-out

ā€¢ Models ,intrinsic failureā€˜ behavior, assumed for the majority of hardware life time

ā€¢ Weibull distribution as alternative, can also model tear-in and wear-out

ā€¢ Some natural phenomena have constant failure rate (e.g. cosmic ray particles)
25
Dependable Systems Course PT 2014
Variable Failure Rate in Real World
26
Burn in Use Wear out Integration
& Test
Use Obsolete
Hardware Software
ā€¢ Failure rate is treated as constant parameter of the exponential distribution

ā€¢ (maybe invalid) simpliļ¬cation, mostly combined solution in practice:

ā€¢ Exponential distribution when failure rate is constant

ā€¢ Weibull distribution when failure rate is monotonic decreasing / increasing
ā€¢ Mean time to failure (MTTF) - ā€Ø
Average time it takes to failā€Ø
-> average uptime

ā€¢ Mean time to recover / repair (MTTR) -
Average time it takes to recover

ā€¢ Mean time between failures (MTBF) -
Average time between two successive failures

ā€¢ Availability = Uptime / Lifetimeā€Ø
= MTTF / MTBF
Dependable Systems Course PT 2014
Steady-State Availability
27
up down up
MTBF
down up
up
C1
up
C3
up
C2
MTTF
Dependable Systems Course PT 2014
Steady-State Availability
28
Availability Downtime per year Downtime per week
90.0 % (1 nine) 36.5 days 16.8 hours
99.0 % (2 nines) 3.65 days 1.68 hours
99.9 % (3 nines) 8.76 hours 10.1 min
99.99 % (4 nines) 52.6 min 1.01 min
99.999 % (5 nines) 5.26 min 6.05 s
99.9999 % (6 nines) 31.5 s 0.605 s
99.99999 % (7 nines) 0.3 s 6 ms
A = Uptime
Uptime+Downtime = MT T F
MT T F +MT T R
Dependable Systems Course PT 2014
MTTR << MTTF [Fox]
ā€¢ Armando Fox on ,Recovery-Oriented Computingā€˜

ā€¢ A = MTTF / (MTTF + MTTR)

ā€¢ 10x decrease of MTTR as good as 10x increase of MTTF ?

ā€¢ MTTFā€˜s are not claimable, but MTTR claims are veriļ¬able

ā€¢ Proving MTTF numbers demands system-years of observation and experience

ā€¢ Lowering MTTR directly improves user experience of one speciļ¬c outage, ā€Ø
since MTTF is typically longer than one user session

ā€¢ HCI factor of failed system

ā€¢ Miller, 1968: >1sec ā€œsluggishā€, >10sec ā€œdistractedā€ (user moves away)

ā€¢ 2001 Web user study: ~5sec ā€žacceptableā€, ~10sec ā€žexcessively slowā€œ
29
Dependable Systems Course PT 2014
6 - Dependability Modeling
ā€¢ Default approach: Utilize a formalism to model system dependability

ā€¢ Quantify the availability of components, calculate system availability based on this
data and a set of assumptions (the availability model)

ā€¢ Most models expose the same expressiveness

ā€¢ Each formalism allows to focus on certain aspects

ā€¢ Structure-based models: Reliability block diagram, fault tree

ā€¢ State-based models: Markov chain, petri net

ā€¢ System understanding evolved from hardware to software to IT infrastructures

ā€¢ Example: Organization management inļ¬‚uence on business service reliability

ā€¢ Information Technology Infrastructure Library (ITIL)

ā€¢ CoBiT(Control Objectives for Information and related Technology)
30
Dependable Systems Course PT 2014
Dependability Modeling
ā€¢ The Failure Space-Success Space concept

ā€¢ Often easier to agree on what constitutes a system failure

ā€¢ Success tends to be associated with system eļ¬ƒciency, which makes it harder to
formulate events in the model (ā€žThe car drives fast.ā€œ, ā€žThe car stops driving.ā€œ)

ā€¢ In practice, there are more ways to success than to failure
31
Fault Tree Handbook with Aerospace Applications Version 1.1
2. System Logical Modeling Approaches
2.1 Success vs. Failure Approaches
The operation of a system can be considered from two standpoints: the various ways for system
success can be enumerated or the various ways for system failure can be enumerated. Such an
enumeration would include completely successful system operation and total system failure, as
well as intermediate conditions such as minimum acceptable success. Figure 2-1 depicts the
Failure/Success space concept.
Figure 2-1. The Failure Space-Success Space Concept
Dependable Systems Course PT 2014
Dependability Modeling
ā€¢ System analysis approaches

ā€¢ Inductive methods - Reasoning from speciļ¬c cases to a general conclusion

ā€¢ Postulate a particular fault or initiating event, ļ¬nd out system eļ¬€ect

ā€¢ Determine what system (failure) states are possible

ā€¢ Trivial approach: ā€žparts countā€œ method

ā€¢ Examples: Failure Mode and Eļ¬€ect Analysis (FMEA), Preliminary Hazards
Analysis (PHA), Event Tree Analysis, Reliability Block Diagrams (RBD), ...

ā€¢ Deductive methods - Postulate a system failure, ļ¬nd out what system modes or
component behaviors contribute to this failure

ā€¢ Determine how a particular system state can occur

ā€¢ Examples: Fault Tree Analysis (FTA)
32
S = cLB (cW S1 ā‡„ cW S2) (cDB1 ā‡„ cDB2)
Asite = aLB ā‡„ AW Sset ā‡„ ADBset
= aLB ā‡„ [1 (1 aW S)nW S
] ā‡„ [1 (1 aDB)nDB
]
Dependable Systems Course PT 2014
Examples
33
aWS
aWS
aDB
aDB
aLB
Dependable Systems Course PT 2014
Reliability Block Diagrams (RBD)
ā€¢ Model logical interaction for success-oriented analysis of system reliability

ā€¢ Building blocks: series structure, parallel structure, k-out-of-n structure

ā€¢ System is available only if there is a path between s and t
ā€¢ Granularity based on data and lowest actionable item concept

ā€¢ Structure formula can be obtained from RBD by identifying the ā€Ø
subset of nodes that disconnects s from t if removed
34
s t
A1
A2
A3
A4
B
Spare
C
D
E
2/3
C
C
D
D
E
E
Dependable Systems Course PT 2014
Deductive Analysis - Fault Trees
ā€¢ Structure analysis eļ¬€ort grows exponentially with the number of components

ā€¢ Fault Trees

ā€¢ Invented 1961 by H. Watson (Bell Telephone Laboratories)

ā€¢ Facilitate analysis of the launch control system of the intercontinental
Minuteman missile 

ā€¢ Used by Boeing since 1966, meanwhile adopted by diļ¬€erent industries

ā€¢ Root cause analysis, risk assessment, safety assessment

ā€¢ Basic idea 

ā€¢ Technique for describing the possible ways in which an ā€Ø
undesired system state can occur

ā€¢ Complex system failures are broken down into basic events
35
Example: 2-of-3 System
State-Based Modeling | Dependable Systems 2014 24
The complexity of the petri net does not
depend on the number of components!
Dependable Systems Course PT 2014
7 - State-Based Modeling
ā€¢ Structural vs. state-based modeling

ā€¢ State Transition Diagrams

ā€¢ Markov Chains

ā€¢ Petri Nets
36
37
State-Based Modeling | Dependable Systems 2014 34
Dependable Systems Course
Real World
Systems
Model
Solution
Technique
Evaluations
Input
Parameters
Modeling Errors
ā€¢ Structural Errors: Initial state,
missing or extra states / transitions
ā€¢ Error Propagation Model
ā€¢ Parametric Models: Failure and repair rates,
coverage parameters
ā€¢ Errors due to non-independence
Solution Errors
ā€¢ Approximation Errors: System partition, state aggregation
ā€¢ Numerical Errors: Truncation, Round-off
ā€¢ Programming Errors
ā€¢ Estimation Errors: Stochastic estimators, not enough data
Parametric Errors
ā€¢ Different component parameter sources
ā€¢ Projected stress factors assume unrealistic
operational conditions
Dependable Systems Course PT 2014
8 - Qualitative Dependability Investigation
ā€¢ Diļ¬€erent approaches that focus on structural (qualitative) system evaluation

ā€¢ Root cause analysis

ā€¢ Broad research / industrial topic targeting error diagnosis

ā€¢ Specialized topic in quality methodologies

ā€¢ Development process investigation

ā€¢ Procedures for ensuring industry quality in production

ā€¢ Software development process

ā€¢ Organizational investigation

ā€¢ Non-technical inļ¬‚uence factors on system reliability
38
Dependable Systems Course PT 2014
FMEA
ā€¢ Failure Mode and Eļ¬€ects Analysis
ā€¢ Engineering quality method for early concept phase - identify and tackle weak points

ā€¢ Introduced in the late 1940s for military usage (MIL-P-1629)

ā€¢ Later also used for aerospace program, automotive industry, ā€Ø
semiconductor processing, software development, healthcare, ...

ā€¢ Main goal is to identify and prevent critical failures

ā€¢ Performed by cross-functional team of subject matter experts
ā€¢ Most important task in many reliability programs

ā€¢ Six Sigma certiļ¬cation, reliability-centered maintenance (RCM) approach

ā€¢ Automotive industry (ISO16949, SAE J1739)

ā€¢ Medical devices
39
Dependable Systems Course PT 2014
Hazard & Operability Studies (HAZOPS)
ā€¢ Process for identiļ¬cation of potential hazard & operability problems,ā€Ø
caused by deviations from the design intend

ā€¢ Diļ¬€erence between deviation (failure) and its cause (fault)

ā€¢ Conduct intended functionality in the safest and most eļ¬€ective manner

ā€¢ Initially developed to investigate chemical production processes, ā€Ø
meanwhile for petroleum, food, and water industries

ā€¢ Extended for complex (software) systems

ā€¢ Qualitative technique

ā€¢ Take full description of process and systematically question every part of it

ā€¢ Assess possible deviations and their consequences

ā€¢ Based on guide-words and multi-disciplinary meetings
40
Dependable Systems Course PT 2014
Root Cause Analysis
ā€¢ What caused the fault ? - Starting point of dependability chain

ā€¢ Peeling back the layers

ā€¢ Must be performed systematically as an investigation

ā€¢ Establish sequence of events / timeline
41
(C) Avizienis
Dependable Systems Course PT 2014
Reliability Models for IT Infrastructures
ā€¢ System reliability in a commercial environment is determined by many factors:

ā€¢ Software and hardware reliability

ā€¢ Training of maintenance personnel

ā€¢ ā€žBusiness processesā€œ how maintenance is handled

ā€¢ The way the IT department is organized

ā€¢ ...

ā€¢ Impact of management organization on reliability is an emerging research ļ¬eld

ā€¢ Standards for IT organization, based on best practices

ā€¢ Describe which processes have to be established in an IT department

ā€¢ Provide reference models for organization of IT department
42
Dependable Systems Course PT 2014
CMMI Maturity Levels
43
(C)Wikipedia
Dependable Systems Course PT 2014
ITIL
ā€¢ Information Technology Infrastructure Library (ITIL), latest version v3

ā€¢ Started as set of recommendations by the UK Government

ā€¢ Concepts and guides for IT service management

ā€¢ Supports to deliver business-oriented ā€Ø
quality IT services

ā€¢ Core publications: Service Strategy, ā€Ø
Service Design, Service Transition, ā€Ø
Service Operation, ā€Ø
Continual Service Improvement

ā€¢ Broad tool support from vendors

ā€¢ High costs for certiļ¬cation and training

ā€¢ Methodology sometimes over-respected at ā€Ø
the expense of pragmatism
44
Dependable Systems Course PT 2014
9 - Predicting System Reliability
ā€¢ Feasibility evaluation for given model, identiļ¬cation of potential problem sources

ā€¢ Comparison of competing designs

ā€¢ Identiļ¬cation of potential reliability problems - low quality, over-stressed parts

ā€¢ Input to other reliability-related tasks

ā€¢ Maintainability analysis, testability evaluation

ā€¢ Failure modes and eļ¬€ects analysis (FMEA)

ā€¢ Ultimate goal is the prediction of system failures by a reliability methodology

ā€¢ Approach depends on nature of component (electrical, electronic, mechanical)
45
Dependable Systems Course PT 2014
Reliability Data
ā€¢ Field (operational) failure data
ā€¢ Meaningful source of information, experience from real world operation

ā€¢ Operational and environmental conditions may be not fully known

ā€¢ Service life data with / without failure
ā€¢ Helpful in assessing time characteristics of reliability issues

ā€¢ Data from engineering tests
ā€¢ Example: Accelerated life tests

ā€¢ Results from controlled environment

ā€¢ Trustworthy for analysis purposes

ā€¢ Lack of failure information is the ,greatest deļ¬ciencyā€˜ in reliability research [Misra]
46
Dependable Systems Course PT 2014
Reliability Prediction Models
ā€¢ Economic requirements aļ¬€ect ļ¬eld-data availability

ā€¢ Customers do not need generic ļ¬eld data, so collection is extra eļ¬€ort

ā€¢ Numerical reliability prediction models available for speciļ¬c areas

ā€¢ Hardware-oriented reliability modeling
ā€¢ MIL-HDBK-217, Bellcore Reliability Prediction Program, ...

ā€¢ Software-oriented reliability modeling
ā€¢ Jelinski-Moranda model, Basic Execution model, software metrics, ...

ā€¢ Prediction models look for corrective factors, so they are always focused

ā€¢ Demands conļ¬dence in understanding of environmental conditions

ā€¢ Reconciliation is often needed between the diļ¬€erent values of failure data from
various sources
47
Dependable Systems Course PT 2014
Software Reliability Assessment
ā€¢ Software bugs are permanent faults, but behave as transient faults

ā€¢ Activation depends on software usage pattern and input values

ā€¢ Timing eļ¬€ects with parallel or distributed execution

ā€¢ Software aging eļ¬€ects

ā€¢ Fraction of system failures reasoned by software increased in the last decades

ā€¢ Possibilities for software reliability assessment

ā€¢ Black box models - No details known, observations from testing or operation

ā€¢ Reliability growth models

ā€¢ Software metric models - Derive knowledge from static code properties
48
Dependable Systems Course PT 2014
Halstead Metric
ā€¢ Statistical approach - Complexity is related to number of operators and operands

ā€¢ Only deļ¬ned on method level, OO code analysis must consider additional structure

ā€¢ Program length N = Number of operators NOP + total number of operands NOD
ā€¢ Vocabulary size n = Number of unique operators nOP + number of unique operands nOD
ā€¢ Program volume V = N * log2(n)

ā€¢ Describes size of an implementation by vocabulary and (mainly) by program length

ā€¢ In-place changes are ok ...

ā€¢ Diļ¬ƒculty level D = (nOP / 2) * (NOD / nOD), program level L = 1 / D
ā€¢ Error proneness is proportional to number of unique operators ā€Ø
-> since most languages only oļ¬€er a few, this should be small 

ā€¢ Also proportional to the level of operand reuse through the code
49
x = x + 1;
NOP=nOP=3
NOD=3
nOD=2
Dependable Systems Course PT 2014
10 - Distributed Systems Theory
ā€¢ Fallacies of Distributed Computing

ā€¢ The network is reliable, Latency is zero, Bandwidth is inļ¬nite, The network is
secure, Topology doesnā€™t change, There is one administrator, Transport cost is
zero, The network is homogeneous

ā€¢ Timing Models

ā€¢ Logical time, Lamport clocks

ā€¢ Fault Models

ā€¢ Consensus Problems

ā€¢ 2PC

ā€¢ Paxos
50
Asynchronous*
Par-ally*synchronous*
Synchronous*
Dependable Systems Course PT 2014
11 - Fault-Tolerant Distributed Systems
ā€¢ Consistency Models

ā€¢ Strict consistency, sequential consistency, eventual consistency

ā€¢ Trade-Oļ¬€s

ā€¢ Brewer, CAP Theorem

ā€¢ Replication

ā€¢ State Machine Replication, ā€¦
51
P1# W(x)#!#a#
P2# R(x)#!#a#
P1# W(x)#!#a#
P2# R(x)#!#NIL# R(x)#!#a#
Dependable Systems Course PT 2014
12 - Dependable Distributed Applications
ā€¢ Frameworks - Erlang / FT-CORBA

ā€¢ Coordination Services

ā€¢ Chubby, Zookeeper

ā€¢ Distributed Storage

ā€¢ Cassandra, RIAK, HDFS ā€¦
52
Dependable Systems Course PT 2014
13 - Hardware Diagnosis
ā€¢ Contains of fault detection (what ?) and fault location (where ?)

ā€¢ Fault detection

ā€¢ Replication checks - Test operation / execution against alternate implementation

ā€¢ Timing checks - Test operation / execution against timing constraints

ā€¢ Reversal checks - Make use of operation reversibility

ā€¢ Coding checks - Utilize redundant (but diļ¬€erent) representation of data

ā€¢ Reasonableness checks - Check against known system / data properties

ā€¢ Structural checks - Ensure consistent structure of data, diagnostic tests, ...

ā€¢ Diagnostic checks - Use set of inputs for which the outputs are known

ā€¢ Algorithmic checks - Check invariants of an algorithm
53
Dependable Systems Course PT 2014
14 - Hardware Redundancy
ā€¢ Redundancy for error detection and forward error recovery

ā€¢ Redundancy types: spatial, temporal, informational (presentation, version)

ā€¢ Redundant not mean identical functionality, just perform the same work

ā€¢ Static redundancy implements error mitigation

ā€¢ Fault does not show up, since it is transparently removed

ā€¢ Examples: Voting, error-correcting codes, N-modular redundancy 

ā€¢ Dynamic redundancy implements error processing
ā€¢ After fault detection, the system is reconļ¬gured to avoid a failure

ā€¢ Examples: Back-up sparing, duplex and share, pair and spare

ā€¢ Hybrid approaches
54
Dependable Systems Course PT 2014
Sphere of Replication
ā€¢ Components outside the
sphere must be protected by
other means

ā€¢ Level of output comparison
decides upon fault coverage

ā€¢ Larger sphere tends to
decrease the required
bandwidth on input and output

ā€¢ More state changing
happens just inside the
sphere

ā€¢ Vendor might be restricted on
choice of sphere size
55
HDD HDD
Input Replication
Output
Comparison
I/O Bus
HDD HDD
Input Replication
Output
Comparison
HDD HDD
Controll
er
Controll
er
RAID1
RAID10
Dependable Systems Course PT 2014
Masking / Static Redundancy: Voting
ā€¢ Exact voting: Only one correct result possible 

ā€¢ Majority vote for uneven module numbers

ā€¢ Generalized median voting - Select result that is the median, ā€Ø
by iteratively removing extremes

ā€¢ Formalized plurality voting - Divide results in partitions, ā€Ø
choose random member from the largest partition

ā€¢ Inexact voting: Comparison at high level might lead to multiple correct results

ā€¢ Non-adaptive voting - Use allowable result discrepancy, put boundary on
discrepancy minimum or maximum (e.g. 1,4 = 1,3)

ā€¢ Adaptive voting - Rank results based on past experience with module results

ā€¢ Compute the correct value based on ā€žtrustā€œ in modules from experience

ā€¢ Example: Weighted sum R=W1*R1 + W2*R2 + W3*R3 with W1+W2+W3=1
56
Voter
A B C
Dependable Systems Course PT 2014
N-Modular Redundancy (with perfect voter)
57
Module 1
Module 2
Module 3 Voter
...
Module N
Input Output
RNMR =
mX
i=0
āœ“
N
i
ā—†
(1 R)i
RN i
āœ“
n
k
ā—†
=
n!
k!(n k)!
R2 of 3 =
āœ“
3
0
ā—†
(1 R)0
R3
+
āœ“
3
1
ā—†
(1 R)R2
R2 of 3 = R3
+ 3(1 R)R2
R3 of 5 = ...
ā€¢ Special cases for combination of duplex (with comparator) and sparing (with switch)

ā€¢ Pair and spare - Multiple duplex pairs, connected as standby sparing setup 

ā€¢ Two replicated modules operate as duplex pair (lockstep execution), ā€Ø
connected by comparator as voting circuit

ā€¢ Same setting again as spare unit, ā€Ø
spare units connected by switch

ā€¢ On module output mismatch, comparators ā€Ø
signal switch to perform failover

ā€¢ Commercially used, e.g. Stratus XA/R Series 300
Dependable Systems Course PT 2014
Pair and Spare
58
MODULES
1
2
3
OUTPUTINPUT
COMPARATOR
4 COMPARATOR
SWITCH/COMPARATOR
Dependable Systems Course PT 2014
Hybrid Approaches
ā€¢ N-modular redundancy ā€Ø
with spares
ā€¢ Also called hybrid redundancy

ā€¢ System has basic NMR ā€Ø
conļ¬guration

ā€¢ Disagreement detector replacesā€Ø
modules with spares if their ā€Ø
output is not matchingā€Ø
the voting result

ā€¢ Reliability as long as the spare pool is not exhausted

ā€¢ Improves fault masking capability of NMR

ā€¢ Can tolerate two faults with one spare, while classic NMR would need 5
modules with majority voting to tolerate two faults
59
Adds fault
localization
Dependable Systems Course PT 2014
Coding Checks in Memory Hardware
ā€¢ Primary memory
ā€¢ Parity code

ā€¢ m-out-of-n resp. m-of-n resp. m/n code

ā€¢ Checksumming

ā€¢ Berger Code

ā€¢ Hamming code

ā€¢ Secondary storage
ā€¢ RAID codes

ā€¢ Reed-Solomon code
60
Dependable Systems Course PT 2014
Memory Redundancy
ā€¢ Standard technology in DRAMs

ā€¢ Bit-per-byte parity, check on read access

ā€¢ Implemented by additional parity memory chip

ā€¢ ECC with Hamming codes - 7 check bits for 32 bit data words, 8 bit for 64 bit 

ā€¢ Leads to 72 bit data bus between DIMM and chipset

ā€¢ Computed by memory controller on write, checked on read

ā€¢ Study by IBM: ECC memory achieves R=0.91 over three years

ā€¢ Can correct single bit errors and detect double bit errors

ā€¢ Hewlett Packard Advanced ECC (1996)

ā€¢ Can detect and correct single bit and double bit errors
61
Dependable Systems Course PT 2014
RAID
ā€¢ Redundant Array of Independent Disks (RAID) [Patterson et al. 1988]

ā€¢ Improve I/O performance and / or reliability by building raid groups

ā€¢ Replication for information reconstruction on disk failure (degrading)

ā€¢ Requires computational eļ¬€ort (dedicated controller vs. software)

ā€¢ Assumes failure independence
62
Dependable Systems Course PT 2014
Parity With XOR
ā€¢ Self-inverse operation

ā€¢ 101 XOR 011 = 110, 110 XOR 011 = 101
63
Disk Byte
1 1 1 0 0 1 0 0 1
2 0 1 1 0 1 1 1 0
3 0 0 0 1 0 0 1 1
4 1 1 1 0 1 0 1 1
Parity 0 1 0 1 1 1 1 1
Disk Byte
1 1 1 0 0 1 0 0 1
Parity 0 1 0 1 1 1 1 1
3 0 0 0 1 0 0 1 1
4 1 1 1 0 1 0 1 1
Hot Spare 0 1 1 0 1 1 1 0
Dependable Systems Course PT 2014
15 - Software Dependability
ā€¢ Fault elimination

ā€¢ Reduce number of dormant faults at development time

ā€¢ Fault-tolerant software

ā€¢ Techniques to achieve fault tolerance for software faults

ā€¢ Application of redundancy idea to software modules

ā€¢ Software fault tolerance

ā€¢ Techniques to achieve fault tolerance by software mechanisms

ā€¢ Typically for hardware failures on lower levels in the system stack

ā€¢ Redundancy managed by operating system, cluster framework, application code
64
Dependable Systems Course PT 2014
Fault Elimination through Software Testing
65
(C) Ian Sommerville
Responsibility of
the component
developer, based
on experience
Responsibility of an
independent testing
team, based on
speciļ¬cation
Black-Box
Tests
Access to source
code, tested while
components are
integrated
Steadily
increasing
load
Excess
maximum
design load
Dependable Systems Course PT 2014
Testing Through Software Fault Injection
ā€¢ Intentionally trigger erroneous behavior of the execution environment

ā€¢ Typical approach for communicating software entities

ā€¢ Orientation towards implementation details - program state, functional behavior

ā€¢ Several non-intrusive implementation techniques, such as AOP

ā€¢ Injection can be done as part of execution under real conditions

ā€¢ Huge variety of fault classes, including SWIFI fault classes

ā€¢ New approaches support remote fault injection (e.g. fuzzing)

ā€¢ Compile-time injection: Leads to erroneous image being executed

ā€¢ Run-time injection: Demands some altering of application state during runtime

ā€¢ Typical triggers: Time-out, exception, debugging trap, code insertion
66
Dependable Systems Course PT 2014
Software Fault Model [Goloubeva]
ā€¢ Example: Control loop writing computation result to a variable

ā€¢ Temporary fault: Write NULL to result variable after end of usage

ā€¢ Permanent fault: Compute and store incorrect output from input data

ā€¢ Single vs. multiple faults depends on granularity level of investigation

ā€¢ On source code level

ā€¢ Program: Structured collection of features with syntax and semantic

ā€¢ Syntax allows to describe fault model

ā€¢ Negation of semantic properties deļ¬nes a fault model

ā€¢ Examples: Calling non-existent functions, Functions not returning a value, values
out of their type range, missing input parameters 

ā€¢ On executable level, e.g. stack overļ¬‚ow due to recursion
67
Dependable Systems Course PT 2014
Fault-Tolerant Software -
Another categorization [Lyu 95]
ā€¢ Single version techniques
ā€¢ Add mechanisms for detection, containment, and handling of errors to the
software component itself

ā€¢ Examples: Software structure and actions approaches, error detection, ā€Ø
exception handling, checkpointing and restart, process pairs, data diversity

ā€¢ Multi version techniques
ā€¢ Rely on structured utilization of variants of the same software

ā€¢ Examples: Recovery blocks, N-version programming

ā€¢ Principles can be applied to any software layer

ā€¢ Identify source of most design faults

ā€¢ Typically no problem with parallel application, beside cost factor
68
Dependable Systems Course PT 2014
Single-Version Approaches -
Wrapper
ā€¢ Piece of software that encapsulates a given program when it is being executed

ā€¢ Typical approach for operating systems and middleware stacks

ā€¢ Structure: Wrapper software and wrapped entity

ā€¢ Inputs and outputs are checked by the wrapper

ā€¢ Examples:

ā€¢ Dealing with buļ¬€er overļ¬‚ow, checking scheduler correctness (e.g., EDF),
bypassing known bugs, checking output correctness

ā€¢ When pre- or postconditions are violated, usually an exception is being raised

ā€¢ Wrapper forms an acceptance test in the dependability rings

ā€¢ Good approach for ļ¬xing issues in the operational phase of software
69
ā€¢ Save application state data at recovery points

ā€¢ Can be reloaded on crash or any other kind ā€Ø
of data loss

ā€¢ Possible on diļ¬€erent levels: local per process, ā€Ø
partial, complete, distributed

ā€¢ Optimum checkpointing interval

ā€¢ Checkpointing too frequent: Majority of time spent for data saving

ā€¢ Checkpointing too rare: May take long time to recover

ā€¢ Several specialized solutions for C / C++ language, easier with reļ¬‚ection support

ā€¢ Popular approach in clusters / high-performance computing

ā€¢ Latest trend: In-memory checkpointing
Dependable Systems Course PT 2014
Single Version Approaches -
Checkpointing
70
taken from ā€Ø
Software Fault Tolerance: A Tutorial
Dependable Systems Course PT 2014
Single Version Approaches -
Data Diversity [Ammann 88]
71
takenfromā€Ø
SoftwareFaultTolerance:ATutorial
Dependable Systems Course PT 2014
Single Version Approaches -
High-Level Instruction Duplication [Goloubeva]
ā€¢ Introduce data and code redundancy through
high-level transformation

ā€¢ Duplicate every variable

ā€¢ Perform every write operation on both copies
of the variable

ā€¢ After each read operation, the copies must be
checked for consistency

ā€¢ Should be close to read operation, ā€Ø
in order to avoid error propagation

ā€¢ Includes also expression evaluation

ā€¢ Procedure parameters treated as variables

ā€¢ Independent from underlying hardware, ā€Ø
targets cache / main memory faults ā€Ø
72
a=b;
.. becomes ...
a0=b0;
a1=b1;
if (b0 != b1)
error();
...
a=b+c;
... becomes ...
a0=b0+c0;
a1=b1+c1;
if ((b0!=b1)||(c0!=c1))
error();
Dependable Systems Course PT 2014
Control Flow Error
ā€¢ Control ļ¬‚ow error (CFE) leads to unexpected instruction
execution

ā€¢ Terminology: 

ā€¢ Basic block (BB) - branch-free serial code fragment

ā€¢ Branch / jump instruction is modeled as last instruction
of a basic block

ā€¢ Control ļ¬‚ow graph (CFG) - one node per basic block

ā€¢ Illegal branch - Node transition is not part of the CFG

ā€¢ Wrong branch - Node transition is already part of the CFG

ā€¢ Inter-block error - Erroneous branch to diļ¬€erent block

ā€¢ Intra-block errors - Erroneous branch inside a block
73
i=0;
while(i<n) {
------------------
if (a[i] < b[i])
------------------
x[i]=a[i];
------------------
else
x[i]=b[i];
------------------
i++;
}
------------------
BB 0
BB 1
BB 2 BB 3
B 4
BB 5
Dependable Systems Course PT 2014
Multi-Version Approaches
Recovery Blocks
ā€¢ Redundant system
implementations are
typically used
simultaneously, ā€Ø
best answer is picked i.e.
by voting

ā€¢ Alternative way:
Sequential execution of
recovery blocks
ā€¢ Introduced in 1974 by
Horning et. al.

ā€¢ Dynamic fault tolerance
approach, related to
stand-by sparing in
hardware
74
establish Checkpoint
Primary Module
Acceptance Test, else
load Checkpoint
Alternative Module 1
Acceptance Test, else
load Checkpoint
Alternative Module 2
...
else Failure Exception
Dependable Systems Course PT 2014
Multi-Version Approaches -
N-Version Programming
ā€¢ Common mode errors are only catchable by design diversity
ā€¢ Design diversity is a complex issue

ā€¢ Design philosophies, software tools, programming languages, test philosophies

ā€¢ Typical approaches try to utilize randomness - separate teams on diļ¬€erent locations

ā€¢ N-Version Programming
ā€¢ Suggested by Elmendorf in 1972, developed by Avizienis & Chen in 1977

ā€¢ Static approach, combination of decision mechanism and forward recovery

ā€¢ At least two independently designed and functionally equivalent variants

ā€¢ Variants are executed in parallel, decision mechanism selects the ā€žbestā€œ result

ā€¢ Can support reliability, but also system security against malicious logic
75
Dependable Systems Course PT 2014
Simplex
ā€¢ Idea: Using simplicity to control complexity

ā€¢ Forward recovery approach, based on feedback loop

ā€¢ HAC subsystem - Simple construction, formal methods, reliable hardware

ā€¢ HPC subsystem - Complex technology, advanced features

ā€¢ HPC can use HAC output, but not vice versa

ā€¢ Decision logic based on control loop output

ā€¢ Typically performance degredation with HAC

ā€¢ Example: Boeing 777 primary and secondary ā€Ø
ļ¬‚ight controller
76
77

More Related Content

What's hot

Dependable Systems - Hardware Dependability with Redundancy (14/16)
Dependable Systems - Hardware Dependability with Redundancy (14/16)Dependable Systems - Hardware Dependability with Redundancy (14/16)
Dependable Systems - Hardware Dependability with Redundancy (14/16)Peter Trƶger
Ā 
Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)Peter Trƶger
Ā 
What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.Peter Trƶger
Ā 
OpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissionsOpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissionsPeter Trƶger
Ā 
Fault Tolerance System
Fault Tolerance SystemFault Tolerance System
Fault Tolerance Systemprakashjjaya
Ā 
Fault tolearant system
Fault tolearant systemFault tolearant system
Fault tolearant systemarvinthsaran
Ā 
Fault tolerance and computing
Fault tolerance  and computingFault tolerance  and computing
Fault tolerance and computingPalani murugan
Ā 
Real Time Systems
Real Time SystemsReal Time Systems
Real Time Systemsleo3004
Ā 
Fault tolerance techniques tsp
Fault tolerance techniques tspFault tolerance techniques tsp
Fault tolerance techniques tspPradeep Kumar TS
Ā 
Software Fault Tolerance
Software Fault ToleranceSoftware Fault Tolerance
Software Fault ToleranceAnkit Singh
Ā 
Fault tolerance
Fault toleranceFault tolerance
Fault toleranceGaurav Rawat
Ā 
Fault tolerance techniques for real time operating system
Fault tolerance techniques for real time operating systemFault tolerance techniques for real time operating system
Fault tolerance techniques for real time operating systemanujos25
Ā 
Fault tolerant presentation
Fault tolerant presentationFault tolerant presentation
Fault tolerant presentationskadyan1
Ā 
Fault Tolerance System
Fault Tolerance SystemFault Tolerance System
Fault Tolerance SystemEhsan Ilahi
Ā 
Fault tolerance techniques
Fault tolerance techniquesFault tolerance techniques
Fault tolerance techniquesECEDepartmentJSREC
Ā 
Availability tactics
Availability tacticsAvailability tactics
Availability tacticsahsan riaz
Ā 

What's hot (20)

Dependable Systems - Hardware Dependability with Redundancy (14/16)
Dependable Systems - Hardware Dependability with Redundancy (14/16)Dependable Systems - Hardware Dependability with Redundancy (14/16)
Dependable Systems - Hardware Dependability with Redundancy (14/16)
Ā 
Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Dependable Systems - Hardware Dependability with Diagnosis (13/16)
Ā 
What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.What activates a bug? A refinement of the Laprie terminology model.
What activates a bug? A refinement of the Laprie terminology model.
Ā 
OpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissionsOpenSubmit - How to grade 1200 code submissions
OpenSubmit - How to grade 1200 code submissions
Ā 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
Ā 
Fault Tolerance System
Fault Tolerance SystemFault Tolerance System
Fault Tolerance System
Ā 
Fault tolearant system
Fault tolearant systemFault tolearant system
Fault tolearant system
Ā 
Fault tolerance and computing
Fault tolerance  and computingFault tolerance  and computing
Fault tolerance and computing
Ā 
Real Time Systems
Real Time SystemsReal Time Systems
Real Time Systems
Ā 
Fault tolerance techniques tsp
Fault tolerance techniques tspFault tolerance techniques tsp
Fault tolerance techniques tsp
Ā 
Software Fault Tolerance
Software Fault ToleranceSoftware Fault Tolerance
Software Fault Tolerance
Ā 
Fault tolerance
Fault toleranceFault tolerance
Fault tolerance
Ā 
Fault tolerance techniques for real time operating system
Fault tolerance techniques for real time operating systemFault tolerance techniques for real time operating system
Fault tolerance techniques for real time operating system
Ā 
Fault tolerant presentation
Fault tolerant presentationFault tolerant presentation
Fault tolerant presentation
Ā 
SRE Tools
SRE ToolsSRE Tools
SRE Tools
Ā 
Fault Tolerance System
Fault Tolerance SystemFault Tolerance System
Fault Tolerance System
Ā 
Fault tolerance techniques
Fault tolerance techniquesFault tolerance techniques
Fault tolerance techniques
Ā 
Voting protocol
Voting protocolVoting protocol
Voting protocol
Ā 
Availability tactics
Availability tacticsAvailability tactics
Availability tactics
Ā 
Sw testing
Sw testingSw testing
Sw testing
Ā 

Similar to Dependable Systems - Summary (16/16)

Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)Peter Trƶger
Ā 
Chromatography Data System: Getting It ā€œRight First Timeā€ Seminar Series ā€“ Pa...
Chromatography Data System: Getting It ā€œRight First Timeā€ Seminar Series ā€“ Pa...Chromatography Data System: Getting It ā€œRight First Timeā€ Seminar Series ā€“ Pa...
Chromatography Data System: Getting It ā€œRight First Timeā€ Seminar Series ā€“ Pa...Chromatography & Mass Spectrometry Solutions
Ā 
HA & DR System Design - Concepts and Solution
HA & DR System Design - Concepts and SolutionHA & DR System Design - Concepts and Solution
HA & DR System Design - Concepts and SolutionContinuity and Resilience
Ā 
Software Reliability_CS-3059_VISHAL_PADME.pptx
Software Reliability_CS-3059_VISHAL_PADME.pptxSoftware Reliability_CS-3059_VISHAL_PADME.pptx
Software Reliability_CS-3059_VISHAL_PADME.pptxVishalPadme2
Ā 
Software archiecture lecture05
Software archiecture   lecture05Software archiecture   lecture05
Software archiecture lecture05Luktalja
Ā 
Introduction to performance testing
Introduction to performance testingIntroduction to performance testing
Introduction to performance testingRichard Bishop
Ā 
IT6701 Information Management - Unit II
IT6701 Information Management - Unit II   IT6701 Information Management - Unit II
IT6701 Information Management - Unit II pkaviya
Ā 
What is onTune for management
What is onTune for managementWhat is onTune for management
What is onTune for managementTeemStone Pty Ltd
Ā 
basic concepts of reliability
basic concepts of reliabilitybasic concepts of reliability
basic concepts of reliabilitydennis gookyi
Ā 
Unit 2-software development process notes
Unit 2-software development process notes Unit 2-software development process notes
Unit 2-software development process notes arvind pandey
Ā 
Comp8 unit9a lecture_slides
Comp8 unit9a lecture_slidesComp8 unit9a lecture_slides
Comp8 unit9a lecture_slidesCMDLMS
Ā 
Reliability Levels of Subsea Production Systems During Operations
Reliability Levels of Subsea Production Systems During OperationsReliability Levels of Subsea Production Systems During Operations
Reliability Levels of Subsea Production Systems During OperationsLloyd's Register Energy
Ā 
Lec # 1 chapter 2
Lec # 1 chapter 2Lec # 1 chapter 2
Lec # 1 chapter 2rereelshahed
Ā 
Non Functional Requirement.
Non Functional Requirement.Non Functional Requirement.
Non Functional Requirement.Khushboo Shaukat
Ā 
Software Reliability
Software ReliabilitySoftware Reliability
Software ReliabilityGurkamal Rakhra
Ā 
testing strategies and tactics
 testing strategies and tactics testing strategies and tactics
testing strategies and tacticsPreeti Mishra
Ā 
22-REQUIREMENT.ppt
22-REQUIREMENT.ppt22-REQUIREMENT.ppt
22-REQUIREMENT.pptssuser5e271f1
Ā 
Critical Systems
Critical SystemsCritical Systems
Critical SystemsUsman Bin Saad
Ā 
Testing software
Testing softwareTesting software
Testing softwareBlueTree5
Ā 

Similar to Dependable Systems - Summary (16/16) (20)

Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)Dependable Systems - System Dependability Evaluation (8/16)
Dependable Systems - System Dependability Evaluation (8/16)
Ā 
Chromatography Data System: Getting It ā€œRight First Timeā€ Seminar Series ā€“ Pa...
Chromatography Data System: Getting It ā€œRight First Timeā€ Seminar Series ā€“ Pa...Chromatography Data System: Getting It ā€œRight First Timeā€ Seminar Series ā€“ Pa...
Chromatography Data System: Getting It ā€œRight First Timeā€ Seminar Series ā€“ Pa...
Ā 
HA & DR System Design - Concepts and Solution
HA & DR System Design - Concepts and SolutionHA & DR System Design - Concepts and Solution
HA & DR System Design - Concepts and Solution
Ā 
metrics.ppt
metrics.pptmetrics.ppt
metrics.ppt
Ā 
Software Reliability_CS-3059_VISHAL_PADME.pptx
Software Reliability_CS-3059_VISHAL_PADME.pptxSoftware Reliability_CS-3059_VISHAL_PADME.pptx
Software Reliability_CS-3059_VISHAL_PADME.pptx
Ā 
Software archiecture lecture05
Software archiecture   lecture05Software archiecture   lecture05
Software archiecture lecture05
Ā 
Introduction to performance testing
Introduction to performance testingIntroduction to performance testing
Introduction to performance testing
Ā 
IT6701 Information Management - Unit II
IT6701 Information Management - Unit II   IT6701 Information Management - Unit II
IT6701 Information Management - Unit II
Ā 
What is onTune for management
What is onTune for managementWhat is onTune for management
What is onTune for management
Ā 
basic concepts of reliability
basic concepts of reliabilitybasic concepts of reliability
basic concepts of reliability
Ā 
Unit 2-software development process notes
Unit 2-software development process notes Unit 2-software development process notes
Unit 2-software development process notes
Ā 
Comp8 unit9a lecture_slides
Comp8 unit9a lecture_slidesComp8 unit9a lecture_slides
Comp8 unit9a lecture_slides
Ā 
Reliability Levels of Subsea Production Systems During Operations
Reliability Levels of Subsea Production Systems During OperationsReliability Levels of Subsea Production Systems During Operations
Reliability Levels of Subsea Production Systems During Operations
Ā 
Lec # 1 chapter 2
Lec # 1 chapter 2Lec # 1 chapter 2
Lec # 1 chapter 2
Ā 
Non Functional Requirement.
Non Functional Requirement.Non Functional Requirement.
Non Functional Requirement.
Ā 
Software Reliability
Software ReliabilitySoftware Reliability
Software Reliability
Ā 
testing strategies and tactics
 testing strategies and tactics testing strategies and tactics
testing strategies and tactics
Ā 
22-REQUIREMENT.ppt
22-REQUIREMENT.ppt22-REQUIREMENT.ppt
22-REQUIREMENT.ppt
Ā 
Critical Systems
Critical SystemsCritical Systems
Critical Systems
Ā 
Testing software
Testing softwareTesting software
Testing software
Ā 

More from Peter Trƶger

WannaCry - An OS course perspective
WannaCry - An OS course perspectiveWannaCry - An OS course perspective
WannaCry - An OS course perspectivePeter Trƶger
Ā 
Cloud Standards and Virtualization
Cloud Standards and VirtualizationCloud Standards and Virtualization
Cloud Standards and VirtualizationPeter Trƶger
Ā 
Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2Peter Trƶger
Ā 
Design of Software for Embedded Systems
Design of Software for Embedded SystemsDesign of Software for Embedded Systems
Design of Software for Embedded SystemsPeter Trƶger
Ā 
Humans should not write XML.
Humans should not write XML.Humans should not write XML.
Humans should not write XML.Peter Trƶger
Ā 
Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0Peter Trƶger
Ā 
Operating Systems 1 (12/12) - Summary
Operating Systems 1 (12/12) - SummaryOperating Systems 1 (12/12) - Summary
Operating Systems 1 (12/12) - SummaryPeter Trƶger
Ā 
Operating Systems 1 (8/12) - Concurrency
Operating Systems 1 (8/12) - ConcurrencyOperating Systems 1 (8/12) - Concurrency
Operating Systems 1 (8/12) - ConcurrencyPeter Trƶger
Ā 
Operating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management ConceptsOperating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management ConceptsPeter Trƶger
Ā 
Operating Systems 1 (7/12) - Threads
Operating Systems 1 (7/12) - ThreadsOperating Systems 1 (7/12) - Threads
Operating Systems 1 (7/12) - ThreadsPeter Trƶger
Ā 
Operating Systems 1 (1/12) - History
Operating Systems 1 (1/12) - HistoryOperating Systems 1 (1/12) - History
Operating Systems 1 (1/12) - HistoryPeter Trƶger
Ā 
Operating Systems 1 (2/12) - Hardware Basics
Operating Systems 1 (2/12) - Hardware BasicsOperating Systems 1 (2/12) - Hardware Basics
Operating Systems 1 (2/12) - Hardware BasicsPeter Trƶger
Ā 
Operating Systems 1 (11/12) - Input / Output
Operating Systems 1 (11/12) - Input / OutputOperating Systems 1 (11/12) - Input / Output
Operating Systems 1 (11/12) - Input / OutputPeter Trƶger
Ā 
Operating Systems 1 (3/12) - Architectures
Operating Systems 1 (3/12) - ArchitecturesOperating Systems 1 (3/12) - Architectures
Operating Systems 1 (3/12) - ArchitecturesPeter Trƶger
Ā 
Operating Systems 1 (6/12) - Processes
Operating Systems 1 (6/12) - ProcessesOperating Systems 1 (6/12) - Processes
Operating Systems 1 (6/12) - ProcessesPeter Trƶger
Ā 
Operating Systems 1 (5/12) - Architectures (Unix)
Operating Systems 1 (5/12) - Architectures (Unix)Operating Systems 1 (5/12) - Architectures (Unix)
Operating Systems 1 (5/12) - Architectures (Unix)Peter Trƶger
Ā 
Operating Systems 1 (4/12) - Architectures (Windows)
Operating Systems 1 (4/12) - Architectures (Windows)Operating Systems 1 (4/12) - Architectures (Windows)
Operating Systems 1 (4/12) - Architectures (Windows)Peter Trƶger
Ā 

More from Peter Trƶger (17)

WannaCry - An OS course perspective
WannaCry - An OS course perspectiveWannaCry - An OS course perspective
WannaCry - An OS course perspective
Ā 
Cloud Standards and Virtualization
Cloud Standards and VirtualizationCloud Standards and Virtualization
Cloud Standards and Virtualization
Ā 
Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2Distributed Resource Management Application API (DRMAA) Version 2
Distributed Resource Management Application API (DRMAA) Version 2
Ā 
Design of Software for Embedded Systems
Design of Software for Embedded SystemsDesign of Software for Embedded Systems
Design of Software for Embedded Systems
Ā 
Humans should not write XML.
Humans should not write XML.Humans should not write XML.
Humans should not write XML.
Ā 
Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0Verteilte Software-Systeme im Kontext von Industrie 4.0
Verteilte Software-Systeme im Kontext von Industrie 4.0
Ā 
Operating Systems 1 (12/12) - Summary
Operating Systems 1 (12/12) - SummaryOperating Systems 1 (12/12) - Summary
Operating Systems 1 (12/12) - Summary
Ā 
Operating Systems 1 (8/12) - Concurrency
Operating Systems 1 (8/12) - ConcurrencyOperating Systems 1 (8/12) - Concurrency
Operating Systems 1 (8/12) - Concurrency
Ā 
Operating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management ConceptsOperating Systems 1 (9/12) - Memory Management Concepts
Operating Systems 1 (9/12) - Memory Management Concepts
Ā 
Operating Systems 1 (7/12) - Threads
Operating Systems 1 (7/12) - ThreadsOperating Systems 1 (7/12) - Threads
Operating Systems 1 (7/12) - Threads
Ā 
Operating Systems 1 (1/12) - History
Operating Systems 1 (1/12) - HistoryOperating Systems 1 (1/12) - History
Operating Systems 1 (1/12) - History
Ā 
Operating Systems 1 (2/12) - Hardware Basics
Operating Systems 1 (2/12) - Hardware BasicsOperating Systems 1 (2/12) - Hardware Basics
Operating Systems 1 (2/12) - Hardware Basics
Ā 
Operating Systems 1 (11/12) - Input / Output
Operating Systems 1 (11/12) - Input / OutputOperating Systems 1 (11/12) - Input / Output
Operating Systems 1 (11/12) - Input / Output
Ā 
Operating Systems 1 (3/12) - Architectures
Operating Systems 1 (3/12) - ArchitecturesOperating Systems 1 (3/12) - Architectures
Operating Systems 1 (3/12) - Architectures
Ā 
Operating Systems 1 (6/12) - Processes
Operating Systems 1 (6/12) - ProcessesOperating Systems 1 (6/12) - Processes
Operating Systems 1 (6/12) - Processes
Ā 
Operating Systems 1 (5/12) - Architectures (Unix)
Operating Systems 1 (5/12) - Architectures (Unix)Operating Systems 1 (5/12) - Architectures (Unix)
Operating Systems 1 (5/12) - Architectures (Unix)
Ā 
Operating Systems 1 (4/12) - Architectures (Windows)
Operating Systems 1 (4/12) - Architectures (Windows)Operating Systems 1 (4/12) - Architectures (Windows)
Operating Systems 1 (4/12) - Architectures (Windows)
Ā 

Recently uploaded

Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)cama23
Ā 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
Ā 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
Ā 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
Ā 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
Ā 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
Ā 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
Ā 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
Ā 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
Ā 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
Ā 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
Ā 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
Ā 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
Ā 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Celine George
Ā 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A BeƱa
Ā 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
Ā 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
Ā 

Recently uploaded (20)

Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)Global Lehigh Strategic Initiatives (without descriptions)
Global Lehigh Strategic Initiatives (without descriptions)
Ā 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
Ā 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Ā 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
Ā 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
Ā 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
Ā 
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptxYOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
YOUVE GOT EMAIL_FINALS_EL_DORADO_2024.pptx
Ā 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
Ā 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
Ā 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
Ā 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
Ā 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Ā 
Model Call Girl in Tilak Nagar Delhi reach out to us at šŸ”9953056974šŸ”
Model Call Girl in Tilak Nagar Delhi reach out to us at šŸ”9953056974šŸ”Model Call Girl in Tilak Nagar Delhi reach out to us at šŸ”9953056974šŸ”
Model Call Girl in Tilak Nagar Delhi reach out to us at šŸ”9953056974šŸ”
Ā 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Ā 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
Ā 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
Ā 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
Ā 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
Ā 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
Ā 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
Ā 

Dependable Systems - Summary (16/16)

  • 1. Dependable Systems ! Summary Dr. Peter Trƶger Lena Herscheid
  • 2. Dependable Systems Course PT 2014 Dependability ā€¢ Umbrella term for operational requirements on a system ā€¢ IFIP WG 10.4: "[..] the trustworthiness of a computing system which allows reliance to be justiļ¬ably placed on the service it delivers [..]" ā€¢ IEC IEV: "dependability (is) the collective term used to describe the availability performance and its inļ¬‚uencing factorsĀ : reliability performance, maintainability performance and maintenance support performance" ā€¢ Laprie: ā€ž Trustworthiness of a computer system such that ā€Ø reliance can be placed on the service it delivers to the user ā€œ ā€¢ Adds a third dimension to system quality ā€¢ General question: How to deal with unexpected events ? ā€¢ In German: ,VerlƤsslichkeitā€˜ vs. ,ZuverlƤssigkeitā€˜ā€Ø 2
  • 3. Dependable Systems Course PT 2014 Course Topics ā€¢ Deļ¬nitions and metrics ā€¢ Fault, error, failure, fault models, fault tolerance, fault forecasting, ... ā€¢ Reliability, availability, maintainability, error mitigation, damage conļ¬nement, ... ā€¢ Reliability engineering, reliability testing, MTTF, MTBF, failure rate, ... ā€¢ Fault tolerance patterns ā€¢ N-out-of-M, voting, heartbeat, ... ā€¢ Analytical evaluation ā€¢ Reliability block diagrams, fault tree analysis, ... ā€¢ Petri nets, Markov chains ā€¢ Root cause analysis, risk analysis and assessment 3
  • 4. Dependable Systems Course PT 2014 Course Topics ā€¢ Hardware-Implemented Dependability ā€¢ Failure rates, testing, hardware redundancy approaches ā€¢ Software-Implemented Dependability ā€¢ Fault-tolerant programming: simplex, recovery blocks, checkpointing, roll-back, ... ā€¢ Testing, software metrics ā€¢ Fault-tolerant distributed systems: Transactions, consensus, clocks, ... ā€¢ Advanced approaches ā€¢ Autonomic computing, rejuvenation, online failure prediction, performability ā€¢ Case studies from practice 4
  • 5. Dependable Systems Course PT 2014 Dependability Tree (Laprie) 5
  • 6. Dependable Systems Course PT 2014 Dependability Stakeholders ā€¢ System - Entity with function, behavior, and structure ā€¢ A number of components or subsystems, which interact under the control of a design [Robinson] ā€¢ Service - System behavior abstraction, as perceived by the user ā€¢ User - Human or physical system that interacts with the systems service ā€¢ Speciļ¬cation - Deļ¬nition of expected service and delivery conditions ā€¢ On diļ¬€erent levels, can lead to speciļ¬cation fault ā€¢ Reliance demands assessment of non-functional dependability attributes ā€¢ Provide ability for trustworthy service delivery by dependability means ā€¢ Undesired (maybe expected) circumstances form dependability threats 6
  • 7. Dependable Systems Course PT 2014 2 - Dependability Threats ā€¢ System failure - ,Ausfallā€˜ ā€¢ Event that occurs when the service no longer complies with the speciļ¬cation / deviates from the correct service. ā€¢ System error - ,Fehler(zustand)ā€˜ ā€¢ Part of system state that can lead to subsequent failure ā€¢ Some sources deļ¬ne errors as active faults - not in this course ... ā€¢ System fault - ,Fehler(ursache)ā€˜ ā€¢ Adjudged or hypothesized cause of an error ā€¢ Failure occurs when error state alters the provided service ā€¢ Systems are build from connected components, which are again systems ā€¢ Fault is the consequence of a failure of some other system to deliver its service 7 Fault Error Failure
  • 8. Dependable Systems Course PT 2014 Chain of Dependability Threats (Avizienis) 8 Fault Error Failure Activation Propagation Causation Fault
  • 9. Dependable Systems Course PT 2014 Faults ā€¢ High diversity in possible sources and types ā€¢ Fault nature ā€¢ Accidental faults (,Zufallsfehlerā€˜) vs. intentional faults (,Absichtsfehlerā€˜) ā€¢ Intentional faults are created deliberately, presumably malevolently ā€¢ Fault origin viewpoints (not exclusive) ā€¢ Phenomenological causes: Physical / natural faults vs. human-made faults ā€¢ System boundaries: Internal faults (part of system state that produces an error) vs. external faults (interference with the environment) ā€¢ Phase of creation: Design faults vs. operational faults ā€¢ Temporal persistence ā€¢ Permanent faults vs. temporary faults 9
  • 10. Dependable Systems Course PT 2014 System-Level Fault Model ā€¢ Fault model idea originates from hardware ā€¢ How many faults of diļ¬€erent classes can occur ? ā€Ø What do I tolerate ? ā€¢ Timing of faults: Fault delay, repeat time, recovery time, ... ā€¢ Also mappable to software or even complete systems ā€¢ Activities as black box, only look on input and output messages ā€¢ Link faults are mapped to the ā€Ø participating components ā€¢ Every participating component ā€Ø would need a fault model - ā€Ø pick the most urgent ones 10
  • 11. Dependable Systems Course PT 2014 Error Propagation 11 (C) Avizienis
  • 12. Dependable Systems Course PT 2014 Error Message Occurrence (Hansen & Siewiorek) ā€¢ Same fault can lead to diļ¬€erent (detected or undetected) errors ā€¢ Errors become detected by error detection mechanism ā€¢ Some undetected errors are detected by several detectors ā€¢ Some detectors report several undetected errors as one ā€¢ Some undetected errors are never uncovered ā€¢ Detected errors might not be logged, if the system stops too fast 12
  • 13. Dependable Systems Course PT 2014 Failures ā€¢ Non-compliance with the speciļ¬cation - arbitrary failure (ā€˜willkĆ¼rlicher Ausfallā€˜) ā€¢ System failures can be further categorized in failure modes ā€¢ Fail-silent / crash failure mode - incorrect results are not delivered ā€¢ Fail-stop mode - constant value is delivered ā€¢ Failure mode domain - what is inļ¬‚uenced ā€¢ Service result - value failures ā€¢ Service timeliness - timing failures ā€¢ Service availability - stopping failures ā€¢ User perception in the mode - consistent / inconsistent for all users ā€¢ Failure mode consequences for ranking the identiļ¬ed issues 13
  • 14. Dependable Systems Course PT 2014 Failure Severity (,Schweregrad des Ausfallsā€˜) ā€¢ Denotes consequences of failure ā€¢ Benign failures (,unkritische AusfƤlleā€˜) ā€¢ Failure costs and operational beneļ¬ts are similar ā€¢ Sometimes also umbrella term for failures only detected by inspection ā€¢ A system with only such failures is fail-safe ā€¢ Catastrophic failures (,kritische AusfƤlleā€˜) ā€¢ Costs of failure consequences are much larger than service beneļ¬t ā€¢ Signiļ¬cant / serious failures - Intermediate steps expressing reduced service ā€¢ Grading of failure consequences on overall system depends on application ā€¢ Flying airplane - Catastrophic stopping failure, Train - Benign stopping failure ā€¢ Criticality - Highest severity of possible failure modes in the system 14
  • 15. Dependable Systems Course PT 2014 Example: DO-178B Standard 15
  • 16. Dependable Systems Course PT 2014 3 - Means for Dependability ā€¢ Fault prevention - Prevent fault occurrence or introduction ā€¢ Fault tolerance - Provide service matching the speciļ¬cation under faults ā€¢ Fault removal - How to reduce the presence of faults ā€¢ Fault forecasting- Estimate the present number, future incidence, and the consequences of faults ā€¢ Combined utilization 16
  • 17. Dependable Systems Course PT 2014 Fault Tolerance ā€¢ Fault tolerance is the ability of a system to operate correctly in presence of faults. or ā€¢ A system S is called k-fault-tolerant with respect to a set of ā€Ø algorithms {A1, A2, ... , Ap} and a set of faults {F1, F2, ... , Fp} ā€Ø if for every k-fault F in S, Ai is executable by a subsystem of system S with k faults. (Hayes, 9/76) or ā€¢ Fault tolerance is the use of redundancy (time or space) to achieve the desired level of system dependability - costs ! ā€¢ Accepts that an implemented system will not be fault-free ā€¢ Implements automatic recovery from errors ā€¢ Is a recursive concept (voter replication, self-checking checkers, stable memory) 17
  • 18. Error Processing Dependable Systems Course PT 2014 Phases of Fault Tolerance (Hanmer) 18 Latent Fault Error Normal OperationFault Activation Error Recovery Error Mitigation Error Detection Fault Treatment
  • 19. Dependable Systems Course PT 2014 Fault Tolerance - Error Processing Through Recovery ā€¢ Forward error recovery ā€¢ Error is masked to reach again a consistent state (fault compensation) ā€¢ Corrective actions need detailled knowledge (damage assessment) ā€¢ New state is typically computed in another way ā€¢ Examples with compensation: Error correcting codes, non-journaling ļ¬le system check, advanced exception handlers, voters ā€¢ Backward error recovery ā€¢ Roll back to previous consistent state (recovery point / checkpoint) ā€¢ Very suitable for transient faults ā€¢ Computation can be re-done with same components (retry), ā€Ø with alternate components (reconļ¬gure), or can be ignored (skip frame) 19
  • 20. Dependable Systems Course PT 2014 ā€žFail-Fastā€œ ā€¢ A common concept from system engineering, company management, ... ā€¢ ā€žReport failure and stop immediately without further actionā€œ ā€¢ Discussed by Jim Gray in 1985 as part of his famous articleā€Ø ā€žWhy do computers stop and what can be done about it ?ā€œ ā€¢ Useful when beneļ¬t from recovery is not good enough for its costs, ā€Ø or if error propagation is highly probable ā€¢ Single units of a redundant set ā€¢ Deeply interwired IT system components ā€¢ Components under heavy request load ā€¢ And ... crappy start-up companies 20
  • 21. Dependable Systems Course PT 2014 4 - Fault Tolerance Patterns ā€¢ Architectural patterns ā€¢ Considerations that cut across all parts of the system ā€¢ Need to be applied in early design phase ā€¢ Detection patterns ā€¢ Detect the presence of root faults, error states, and failures ā€¢ Errors vs. failures -> a-priori knowledge vs. comparison of redundant elements ā€¢ Error Recovery Patterns ā€¢ Methods to continue execution in a new error-free state ā€¢ Undoing the error eļ¬€ects + creating the new state 21
  • 22. Dependable Systems Course PT 2014 4 - Fault Tolerance Patterns ā€¢ Error Mitigation Patterns ā€¢ Do not change application or system state, ā€Ø but mask the error and compensate for the eļ¬€ects ā€¢ Typical strategies for timing or performance faults ā€¢ Fault Treatment Patterns ā€¢ Prevent the error from reoccurring by repairing the fault ā€¢ System veriļ¬cation ā€¢ Diagnosis of fault location and nature ā€¢ Correction of the system and / or the procedures 22
  • 23. Dependable Systems Course PT 2014 5 - Attributes of Dependability ā€¢ Non-functional attributes such as reliability and maintainability ā€¢ Complementary nature of viewpoints in the area of dependability ā€¢ In comparison to functional properties ā€¢ ... hard to deļ¬ne ā€¢ ... hard to abstract ā€¢ ... ,Divide and conquerā€˜ does not work as good ā€¢ ... diļ¬ƒcult interrelationships ā€¢ ... often probabilistic dependencies 23
  • 24. Dependable Systems Course PT 2014 In Detail ā€¢ Reliability - Function R(t) ā€¢ Probability that a system is functioning properly and constantly over time period t ā€¢ Assumes that system was fully operational at t=0 ā€¢ Denotes failure-free interval of operation ā€¢ Availability - Statement if a system is operational at a point in time / fraction of time ā€¢ Describe system behavior in presence of error treatment mechanisms ā€¢ Instantaneous availability (at t) - Probability that a system is performing correctly at time t; equal to reliability for non-repairable systems: Ai(t) = R(t) ā€¢ Steady-state availability - Probability that a system will be operational at any random point of time, expressed as the fraction of time a system is operational during its expected lifetime: As = Uptime / Lifetime 24
  • 25. Dependable Systems Course PT 2014 Why Exponential ? ā€¢ Distribution function that models the memoryless property of the Poisson process ā€¢ P(T > t + s|T > t) = P(T > s), e.g. PFailure(5 years|T > 2 years) = PFailure(3 years) ā€¢ Failure is not the result of wear-out ā€¢ Models ,intrinsic failureā€˜ behavior, assumed for the majority of hardware life time ā€¢ Weibull distribution as alternative, can also model tear-in and wear-out ā€¢ Some natural phenomena have constant failure rate (e.g. cosmic ray particles) 25
  • 26. Dependable Systems Course PT 2014 Variable Failure Rate in Real World 26 Burn in Use Wear out Integration & Test Use Obsolete Hardware Software ā€¢ Failure rate is treated as constant parameter of the exponential distribution ā€¢ (maybe invalid) simpliļ¬cation, mostly combined solution in practice: ā€¢ Exponential distribution when failure rate is constant ā€¢ Weibull distribution when failure rate is monotonic decreasing / increasing
  • 27. ā€¢ Mean time to failure (MTTF) - ā€Ø Average time it takes to failā€Ø -> average uptime ā€¢ Mean time to recover / repair (MTTR) - Average time it takes to recover ā€¢ Mean time between failures (MTBF) - Average time between two successive failures ā€¢ Availability = Uptime / Lifetimeā€Ø = MTTF / MTBF Dependable Systems Course PT 2014 Steady-State Availability 27 up down up MTBF down up up C1 up C3 up C2 MTTF
  • 28. Dependable Systems Course PT 2014 Steady-State Availability 28 Availability Downtime per year Downtime per week 90.0 % (1 nine) 36.5 days 16.8 hours 99.0 % (2 nines) 3.65 days 1.68 hours 99.9 % (3 nines) 8.76 hours 10.1 min 99.99 % (4 nines) 52.6 min 1.01 min 99.999 % (5 nines) 5.26 min 6.05 s 99.9999 % (6 nines) 31.5 s 0.605 s 99.99999 % (7 nines) 0.3 s 6 ms A = Uptime Uptime+Downtime = MT T F MT T F +MT T R
  • 29. Dependable Systems Course PT 2014 MTTR << MTTF [Fox] ā€¢ Armando Fox on ,Recovery-Oriented Computingā€˜ ā€¢ A = MTTF / (MTTF + MTTR) ā€¢ 10x decrease of MTTR as good as 10x increase of MTTF ? ā€¢ MTTFā€˜s are not claimable, but MTTR claims are veriļ¬able ā€¢ Proving MTTF numbers demands system-years of observation and experience ā€¢ Lowering MTTR directly improves user experience of one speciļ¬c outage, ā€Ø since MTTF is typically longer than one user session ā€¢ HCI factor of failed system ā€¢ Miller, 1968: >1sec ā€œsluggishā€, >10sec ā€œdistractedā€ (user moves away) ā€¢ 2001 Web user study: ~5sec ā€žacceptableā€, ~10sec ā€žexcessively slowā€œ 29
  • 30. Dependable Systems Course PT 2014 6 - Dependability Modeling ā€¢ Default approach: Utilize a formalism to model system dependability ā€¢ Quantify the availability of components, calculate system availability based on this data and a set of assumptions (the availability model) ā€¢ Most models expose the same expressiveness ā€¢ Each formalism allows to focus on certain aspects ā€¢ Structure-based models: Reliability block diagram, fault tree ā€¢ State-based models: Markov chain, petri net ā€¢ System understanding evolved from hardware to software to IT infrastructures ā€¢ Example: Organization management inļ¬‚uence on business service reliability ā€¢ Information Technology Infrastructure Library (ITIL) ā€¢ CoBiT(Control Objectives for Information and related Technology) 30
  • 31. Dependable Systems Course PT 2014 Dependability Modeling ā€¢ The Failure Space-Success Space concept ā€¢ Often easier to agree on what constitutes a system failure ā€¢ Success tends to be associated with system eļ¬ƒciency, which makes it harder to formulate events in the model (ā€žThe car drives fast.ā€œ, ā€žThe car stops driving.ā€œ) ā€¢ In practice, there are more ways to success than to failure 31 Fault Tree Handbook with Aerospace Applications Version 1.1 2. System Logical Modeling Approaches 2.1 Success vs. Failure Approaches The operation of a system can be considered from two standpoints: the various ways for system success can be enumerated or the various ways for system failure can be enumerated. Such an enumeration would include completely successful system operation and total system failure, as well as intermediate conditions such as minimum acceptable success. Figure 2-1 depicts the Failure/Success space concept. Figure 2-1. The Failure Space-Success Space Concept
  • 32. Dependable Systems Course PT 2014 Dependability Modeling ā€¢ System analysis approaches ā€¢ Inductive methods - Reasoning from speciļ¬c cases to a general conclusion ā€¢ Postulate a particular fault or initiating event, ļ¬nd out system eļ¬€ect ā€¢ Determine what system (failure) states are possible ā€¢ Trivial approach: ā€žparts countā€œ method ā€¢ Examples: Failure Mode and Eļ¬€ect Analysis (FMEA), Preliminary Hazards Analysis (PHA), Event Tree Analysis, Reliability Block Diagrams (RBD), ... ā€¢ Deductive methods - Postulate a system failure, ļ¬nd out what system modes or component behaviors contribute to this failure ā€¢ Determine how a particular system state can occur ā€¢ Examples: Fault Tree Analysis (FTA) 32
  • 33. S = cLB (cW S1 ā‡„ cW S2) (cDB1 ā‡„ cDB2) Asite = aLB ā‡„ AW Sset ā‡„ ADBset = aLB ā‡„ [1 (1 aW S)nW S ] ā‡„ [1 (1 aDB)nDB ] Dependable Systems Course PT 2014 Examples 33 aWS aWS aDB aDB aLB
  • 34. Dependable Systems Course PT 2014 Reliability Block Diagrams (RBD) ā€¢ Model logical interaction for success-oriented analysis of system reliability ā€¢ Building blocks: series structure, parallel structure, k-out-of-n structure ā€¢ System is available only if there is a path between s and t ā€¢ Granularity based on data and lowest actionable item concept ā€¢ Structure formula can be obtained from RBD by identifying the ā€Ø subset of nodes that disconnects s from t if removed 34 s t A1 A2 A3 A4 B Spare C D E 2/3 C C D D E E
  • 35. Dependable Systems Course PT 2014 Deductive Analysis - Fault Trees ā€¢ Structure analysis eļ¬€ort grows exponentially with the number of components ā€¢ Fault Trees ā€¢ Invented 1961 by H. Watson (Bell Telephone Laboratories) ā€¢ Facilitate analysis of the launch control system of the intercontinental Minuteman missile ā€¢ Used by Boeing since 1966, meanwhile adopted by diļ¬€erent industries ā€¢ Root cause analysis, risk assessment, safety assessment ā€¢ Basic idea ā€¢ Technique for describing the possible ways in which an ā€Ø undesired system state can occur ā€¢ Complex system failures are broken down into basic events 35
  • 36. Example: 2-of-3 System State-Based Modeling | Dependable Systems 2014 24 The complexity of the petri net does not depend on the number of components! Dependable Systems Course PT 2014 7 - State-Based Modeling ā€¢ Structural vs. state-based modeling ā€¢ State Transition Diagrams ā€¢ Markov Chains ā€¢ Petri Nets 36
  • 37. 37 State-Based Modeling | Dependable Systems 2014 34 Dependable Systems Course Real World Systems Model Solution Technique Evaluations Input Parameters Modeling Errors ā€¢ Structural Errors: Initial state, missing or extra states / transitions ā€¢ Error Propagation Model ā€¢ Parametric Models: Failure and repair rates, coverage parameters ā€¢ Errors due to non-independence Solution Errors ā€¢ Approximation Errors: System partition, state aggregation ā€¢ Numerical Errors: Truncation, Round-off ā€¢ Programming Errors ā€¢ Estimation Errors: Stochastic estimators, not enough data Parametric Errors ā€¢ Different component parameter sources ā€¢ Projected stress factors assume unrealistic operational conditions
  • 38. Dependable Systems Course PT 2014 8 - Qualitative Dependability Investigation ā€¢ Diļ¬€erent approaches that focus on structural (qualitative) system evaluation ā€¢ Root cause analysis ā€¢ Broad research / industrial topic targeting error diagnosis ā€¢ Specialized topic in quality methodologies ā€¢ Development process investigation ā€¢ Procedures for ensuring industry quality in production ā€¢ Software development process ā€¢ Organizational investigation ā€¢ Non-technical inļ¬‚uence factors on system reliability 38
  • 39. Dependable Systems Course PT 2014 FMEA ā€¢ Failure Mode and Eļ¬€ects Analysis ā€¢ Engineering quality method for early concept phase - identify and tackle weak points ā€¢ Introduced in the late 1940s for military usage (MIL-P-1629) ā€¢ Later also used for aerospace program, automotive industry, ā€Ø semiconductor processing, software development, healthcare, ... ā€¢ Main goal is to identify and prevent critical failures ā€¢ Performed by cross-functional team of subject matter experts ā€¢ Most important task in many reliability programs ā€¢ Six Sigma certiļ¬cation, reliability-centered maintenance (RCM) approach ā€¢ Automotive industry (ISO16949, SAE J1739) ā€¢ Medical devices 39
  • 40. Dependable Systems Course PT 2014 Hazard & Operability Studies (HAZOPS) ā€¢ Process for identiļ¬cation of potential hazard & operability problems,ā€Ø caused by deviations from the design intend ā€¢ Diļ¬€erence between deviation (failure) and its cause (fault) ā€¢ Conduct intended functionality in the safest and most eļ¬€ective manner ā€¢ Initially developed to investigate chemical production processes, ā€Ø meanwhile for petroleum, food, and water industries ā€¢ Extended for complex (software) systems ā€¢ Qualitative technique ā€¢ Take full description of process and systematically question every part of it ā€¢ Assess possible deviations and their consequences ā€¢ Based on guide-words and multi-disciplinary meetings 40
  • 41. Dependable Systems Course PT 2014 Root Cause Analysis ā€¢ What caused the fault ? - Starting point of dependability chain ā€¢ Peeling back the layers ā€¢ Must be performed systematically as an investigation ā€¢ Establish sequence of events / timeline 41 (C) Avizienis
  • 42. Dependable Systems Course PT 2014 Reliability Models for IT Infrastructures ā€¢ System reliability in a commercial environment is determined by many factors: ā€¢ Software and hardware reliability ā€¢ Training of maintenance personnel ā€¢ ā€žBusiness processesā€œ how maintenance is handled ā€¢ The way the IT department is organized ā€¢ ... ā€¢ Impact of management organization on reliability is an emerging research ļ¬eld ā€¢ Standards for IT organization, based on best practices ā€¢ Describe which processes have to be established in an IT department ā€¢ Provide reference models for organization of IT department 42
  • 43. Dependable Systems Course PT 2014 CMMI Maturity Levels 43 (C)Wikipedia
  • 44. Dependable Systems Course PT 2014 ITIL ā€¢ Information Technology Infrastructure Library (ITIL), latest version v3 ā€¢ Started as set of recommendations by the UK Government ā€¢ Concepts and guides for IT service management ā€¢ Supports to deliver business-oriented ā€Ø quality IT services ā€¢ Core publications: Service Strategy, ā€Ø Service Design, Service Transition, ā€Ø Service Operation, ā€Ø Continual Service Improvement ā€¢ Broad tool support from vendors ā€¢ High costs for certiļ¬cation and training ā€¢ Methodology sometimes over-respected at ā€Ø the expense of pragmatism 44
  • 45. Dependable Systems Course PT 2014 9 - Predicting System Reliability ā€¢ Feasibility evaluation for given model, identiļ¬cation of potential problem sources ā€¢ Comparison of competing designs ā€¢ Identiļ¬cation of potential reliability problems - low quality, over-stressed parts ā€¢ Input to other reliability-related tasks ā€¢ Maintainability analysis, testability evaluation ā€¢ Failure modes and eļ¬€ects analysis (FMEA) ā€¢ Ultimate goal is the prediction of system failures by a reliability methodology ā€¢ Approach depends on nature of component (electrical, electronic, mechanical) 45
  • 46. Dependable Systems Course PT 2014 Reliability Data ā€¢ Field (operational) failure data ā€¢ Meaningful source of information, experience from real world operation ā€¢ Operational and environmental conditions may be not fully known ā€¢ Service life data with / without failure ā€¢ Helpful in assessing time characteristics of reliability issues ā€¢ Data from engineering tests ā€¢ Example: Accelerated life tests ā€¢ Results from controlled environment ā€¢ Trustworthy for analysis purposes ā€¢ Lack of failure information is the ,greatest deļ¬ciencyā€˜ in reliability research [Misra] 46
  • 47. Dependable Systems Course PT 2014 Reliability Prediction Models ā€¢ Economic requirements aļ¬€ect ļ¬eld-data availability ā€¢ Customers do not need generic ļ¬eld data, so collection is extra eļ¬€ort ā€¢ Numerical reliability prediction models available for speciļ¬c areas ā€¢ Hardware-oriented reliability modeling ā€¢ MIL-HDBK-217, Bellcore Reliability Prediction Program, ... ā€¢ Software-oriented reliability modeling ā€¢ Jelinski-Moranda model, Basic Execution model, software metrics, ... ā€¢ Prediction models look for corrective factors, so they are always focused ā€¢ Demands conļ¬dence in understanding of environmental conditions ā€¢ Reconciliation is often needed between the diļ¬€erent values of failure data from various sources 47
  • 48. Dependable Systems Course PT 2014 Software Reliability Assessment ā€¢ Software bugs are permanent faults, but behave as transient faults ā€¢ Activation depends on software usage pattern and input values ā€¢ Timing eļ¬€ects with parallel or distributed execution ā€¢ Software aging eļ¬€ects ā€¢ Fraction of system failures reasoned by software increased in the last decades ā€¢ Possibilities for software reliability assessment ā€¢ Black box models - No details known, observations from testing or operation ā€¢ Reliability growth models ā€¢ Software metric models - Derive knowledge from static code properties 48
  • 49. Dependable Systems Course PT 2014 Halstead Metric ā€¢ Statistical approach - Complexity is related to number of operators and operands ā€¢ Only deļ¬ned on method level, OO code analysis must consider additional structure ā€¢ Program length N = Number of operators NOP + total number of operands NOD ā€¢ Vocabulary size n = Number of unique operators nOP + number of unique operands nOD ā€¢ Program volume V = N * log2(n) ā€¢ Describes size of an implementation by vocabulary and (mainly) by program length ā€¢ In-place changes are ok ... ā€¢ Diļ¬ƒculty level D = (nOP / 2) * (NOD / nOD), program level L = 1 / D ā€¢ Error proneness is proportional to number of unique operators ā€Ø -> since most languages only oļ¬€er a few, this should be small ā€¢ Also proportional to the level of operand reuse through the code 49 x = x + 1; NOP=nOP=3 NOD=3 nOD=2
  • 50. Dependable Systems Course PT 2014 10 - Distributed Systems Theory ā€¢ Fallacies of Distributed Computing ā€¢ The network is reliable, Latency is zero, Bandwidth is inļ¬nite, The network is secure, Topology doesnā€™t change, There is one administrator, Transport cost is zero, The network is homogeneous ā€¢ Timing Models ā€¢ Logical time, Lamport clocks ā€¢ Fault Models ā€¢ Consensus Problems ā€¢ 2PC ā€¢ Paxos 50 Asynchronous* Par-ally*synchronous* Synchronous*
  • 51. Dependable Systems Course PT 2014 11 - Fault-Tolerant Distributed Systems ā€¢ Consistency Models ā€¢ Strict consistency, sequential consistency, eventual consistency ā€¢ Trade-Oļ¬€s ā€¢ Brewer, CAP Theorem ā€¢ Replication ā€¢ State Machine Replication, ā€¦ 51 P1# W(x)#!#a# P2# R(x)#!#a# P1# W(x)#!#a# P2# R(x)#!#NIL# R(x)#!#a#
  • 52. Dependable Systems Course PT 2014 12 - Dependable Distributed Applications ā€¢ Frameworks - Erlang / FT-CORBA ā€¢ Coordination Services ā€¢ Chubby, Zookeeper ā€¢ Distributed Storage ā€¢ Cassandra, RIAK, HDFS ā€¦ 52
  • 53. Dependable Systems Course PT 2014 13 - Hardware Diagnosis ā€¢ Contains of fault detection (what ?) and fault location (where ?) ā€¢ Fault detection ā€¢ Replication checks - Test operation / execution against alternate implementation ā€¢ Timing checks - Test operation / execution against timing constraints ā€¢ Reversal checks - Make use of operation reversibility ā€¢ Coding checks - Utilize redundant (but diļ¬€erent) representation of data ā€¢ Reasonableness checks - Check against known system / data properties ā€¢ Structural checks - Ensure consistent structure of data, diagnostic tests, ... ā€¢ Diagnostic checks - Use set of inputs for which the outputs are known ā€¢ Algorithmic checks - Check invariants of an algorithm 53
  • 54. Dependable Systems Course PT 2014 14 - Hardware Redundancy ā€¢ Redundancy for error detection and forward error recovery ā€¢ Redundancy types: spatial, temporal, informational (presentation, version) ā€¢ Redundant not mean identical functionality, just perform the same work ā€¢ Static redundancy implements error mitigation ā€¢ Fault does not show up, since it is transparently removed ā€¢ Examples: Voting, error-correcting codes, N-modular redundancy ā€¢ Dynamic redundancy implements error processing ā€¢ After fault detection, the system is reconļ¬gured to avoid a failure ā€¢ Examples: Back-up sparing, duplex and share, pair and spare ā€¢ Hybrid approaches 54
  • 55. Dependable Systems Course PT 2014 Sphere of Replication ā€¢ Components outside the sphere must be protected by other means ā€¢ Level of output comparison decides upon fault coverage ā€¢ Larger sphere tends to decrease the required bandwidth on input and output ā€¢ More state changing happens just inside the sphere ā€¢ Vendor might be restricted on choice of sphere size 55 HDD HDD Input Replication Output Comparison I/O Bus HDD HDD Input Replication Output Comparison HDD HDD Controll er Controll er RAID1 RAID10
  • 56. Dependable Systems Course PT 2014 Masking / Static Redundancy: Voting ā€¢ Exact voting: Only one correct result possible ā€¢ Majority vote for uneven module numbers ā€¢ Generalized median voting - Select result that is the median, ā€Ø by iteratively removing extremes ā€¢ Formalized plurality voting - Divide results in partitions, ā€Ø choose random member from the largest partition ā€¢ Inexact voting: Comparison at high level might lead to multiple correct results ā€¢ Non-adaptive voting - Use allowable result discrepancy, put boundary on discrepancy minimum or maximum (e.g. 1,4 = 1,3) ā€¢ Adaptive voting - Rank results based on past experience with module results ā€¢ Compute the correct value based on ā€žtrustā€œ in modules from experience ā€¢ Example: Weighted sum R=W1*R1 + W2*R2 + W3*R3 with W1+W2+W3=1 56 Voter A B C
  • 57. Dependable Systems Course PT 2014 N-Modular Redundancy (with perfect voter) 57 Module 1 Module 2 Module 3 Voter ... Module N Input Output RNMR = mX i=0 āœ“ N i ā—† (1 R)i RN i āœ“ n k ā—† = n! k!(n k)! R2 of 3 = āœ“ 3 0 ā—† (1 R)0 R3 + āœ“ 3 1 ā—† (1 R)R2 R2 of 3 = R3 + 3(1 R)R2 R3 of 5 = ...
  • 58. ā€¢ Special cases for combination of duplex (with comparator) and sparing (with switch) ā€¢ Pair and spare - Multiple duplex pairs, connected as standby sparing setup ā€¢ Two replicated modules operate as duplex pair (lockstep execution), ā€Ø connected by comparator as voting circuit ā€¢ Same setting again as spare unit, ā€Ø spare units connected by switch ā€¢ On module output mismatch, comparators ā€Ø signal switch to perform failover ā€¢ Commercially used, e.g. Stratus XA/R Series 300 Dependable Systems Course PT 2014 Pair and Spare 58 MODULES 1 2 3 OUTPUTINPUT COMPARATOR 4 COMPARATOR SWITCH/COMPARATOR
  • 59. Dependable Systems Course PT 2014 Hybrid Approaches ā€¢ N-modular redundancy ā€Ø with spares ā€¢ Also called hybrid redundancy ā€¢ System has basic NMR ā€Ø conļ¬guration ā€¢ Disagreement detector replacesā€Ø modules with spares if their ā€Ø output is not matchingā€Ø the voting result ā€¢ Reliability as long as the spare pool is not exhausted ā€¢ Improves fault masking capability of NMR ā€¢ Can tolerate two faults with one spare, while classic NMR would need 5 modules with majority voting to tolerate two faults 59 Adds fault localization
  • 60. Dependable Systems Course PT 2014 Coding Checks in Memory Hardware ā€¢ Primary memory ā€¢ Parity code ā€¢ m-out-of-n resp. m-of-n resp. m/n code ā€¢ Checksumming ā€¢ Berger Code ā€¢ Hamming code ā€¢ Secondary storage ā€¢ RAID codes ā€¢ Reed-Solomon code 60
  • 61. Dependable Systems Course PT 2014 Memory Redundancy ā€¢ Standard technology in DRAMs ā€¢ Bit-per-byte parity, check on read access ā€¢ Implemented by additional parity memory chip ā€¢ ECC with Hamming codes - 7 check bits for 32 bit data words, 8 bit for 64 bit ā€¢ Leads to 72 bit data bus between DIMM and chipset ā€¢ Computed by memory controller on write, checked on read ā€¢ Study by IBM: ECC memory achieves R=0.91 over three years ā€¢ Can correct single bit errors and detect double bit errors ā€¢ Hewlett Packard Advanced ECC (1996) ā€¢ Can detect and correct single bit and double bit errors 61
  • 62. Dependable Systems Course PT 2014 RAID ā€¢ Redundant Array of Independent Disks (RAID) [Patterson et al. 1988] ā€¢ Improve I/O performance and / or reliability by building raid groups ā€¢ Replication for information reconstruction on disk failure (degrading) ā€¢ Requires computational eļ¬€ort (dedicated controller vs. software) ā€¢ Assumes failure independence 62
  • 63. Dependable Systems Course PT 2014 Parity With XOR ā€¢ Self-inverse operation ā€¢ 101 XOR 011 = 110, 110 XOR 011 = 101 63 Disk Byte 1 1 1 0 0 1 0 0 1 2 0 1 1 0 1 1 1 0 3 0 0 0 1 0 0 1 1 4 1 1 1 0 1 0 1 1 Parity 0 1 0 1 1 1 1 1 Disk Byte 1 1 1 0 0 1 0 0 1 Parity 0 1 0 1 1 1 1 1 3 0 0 0 1 0 0 1 1 4 1 1 1 0 1 0 1 1 Hot Spare 0 1 1 0 1 1 1 0
  • 64. Dependable Systems Course PT 2014 15 - Software Dependability ā€¢ Fault elimination ā€¢ Reduce number of dormant faults at development time ā€¢ Fault-tolerant software ā€¢ Techniques to achieve fault tolerance for software faults ā€¢ Application of redundancy idea to software modules ā€¢ Software fault tolerance ā€¢ Techniques to achieve fault tolerance by software mechanisms ā€¢ Typically for hardware failures on lower levels in the system stack ā€¢ Redundancy managed by operating system, cluster framework, application code 64
  • 65. Dependable Systems Course PT 2014 Fault Elimination through Software Testing 65 (C) Ian Sommerville Responsibility of the component developer, based on experience Responsibility of an independent testing team, based on speciļ¬cation Black-Box Tests Access to source code, tested while components are integrated Steadily increasing load Excess maximum design load
  • 66. Dependable Systems Course PT 2014 Testing Through Software Fault Injection ā€¢ Intentionally trigger erroneous behavior of the execution environment ā€¢ Typical approach for communicating software entities ā€¢ Orientation towards implementation details - program state, functional behavior ā€¢ Several non-intrusive implementation techniques, such as AOP ā€¢ Injection can be done as part of execution under real conditions ā€¢ Huge variety of fault classes, including SWIFI fault classes ā€¢ New approaches support remote fault injection (e.g. fuzzing) ā€¢ Compile-time injection: Leads to erroneous image being executed ā€¢ Run-time injection: Demands some altering of application state during runtime ā€¢ Typical triggers: Time-out, exception, debugging trap, code insertion 66
  • 67. Dependable Systems Course PT 2014 Software Fault Model [Goloubeva] ā€¢ Example: Control loop writing computation result to a variable ā€¢ Temporary fault: Write NULL to result variable after end of usage ā€¢ Permanent fault: Compute and store incorrect output from input data ā€¢ Single vs. multiple faults depends on granularity level of investigation ā€¢ On source code level ā€¢ Program: Structured collection of features with syntax and semantic ā€¢ Syntax allows to describe fault model ā€¢ Negation of semantic properties deļ¬nes a fault model ā€¢ Examples: Calling non-existent functions, Functions not returning a value, values out of their type range, missing input parameters ā€¢ On executable level, e.g. stack overļ¬‚ow due to recursion 67
  • 68. Dependable Systems Course PT 2014 Fault-Tolerant Software - Another categorization [Lyu 95] ā€¢ Single version techniques ā€¢ Add mechanisms for detection, containment, and handling of errors to the software component itself ā€¢ Examples: Software structure and actions approaches, error detection, ā€Ø exception handling, checkpointing and restart, process pairs, data diversity ā€¢ Multi version techniques ā€¢ Rely on structured utilization of variants of the same software ā€¢ Examples: Recovery blocks, N-version programming ā€¢ Principles can be applied to any software layer ā€¢ Identify source of most design faults ā€¢ Typically no problem with parallel application, beside cost factor 68
  • 69. Dependable Systems Course PT 2014 Single-Version Approaches - Wrapper ā€¢ Piece of software that encapsulates a given program when it is being executed ā€¢ Typical approach for operating systems and middleware stacks ā€¢ Structure: Wrapper software and wrapped entity ā€¢ Inputs and outputs are checked by the wrapper ā€¢ Examples: ā€¢ Dealing with buļ¬€er overļ¬‚ow, checking scheduler correctness (e.g., EDF), bypassing known bugs, checking output correctness ā€¢ When pre- or postconditions are violated, usually an exception is being raised ā€¢ Wrapper forms an acceptance test in the dependability rings ā€¢ Good approach for ļ¬xing issues in the operational phase of software 69
  • 70. ā€¢ Save application state data at recovery points ā€¢ Can be reloaded on crash or any other kind ā€Ø of data loss ā€¢ Possible on diļ¬€erent levels: local per process, ā€Ø partial, complete, distributed ā€¢ Optimum checkpointing interval ā€¢ Checkpointing too frequent: Majority of time spent for data saving ā€¢ Checkpointing too rare: May take long time to recover ā€¢ Several specialized solutions for C / C++ language, easier with reļ¬‚ection support ā€¢ Popular approach in clusters / high-performance computing ā€¢ Latest trend: In-memory checkpointing Dependable Systems Course PT 2014 Single Version Approaches - Checkpointing 70 taken from ā€Ø Software Fault Tolerance: A Tutorial
  • 71. Dependable Systems Course PT 2014 Single Version Approaches - Data Diversity [Ammann 88] 71 takenfromā€Ø SoftwareFaultTolerance:ATutorial
  • 72. Dependable Systems Course PT 2014 Single Version Approaches - High-Level Instruction Duplication [Goloubeva] ā€¢ Introduce data and code redundancy through high-level transformation ā€¢ Duplicate every variable ā€¢ Perform every write operation on both copies of the variable ā€¢ After each read operation, the copies must be checked for consistency ā€¢ Should be close to read operation, ā€Ø in order to avoid error propagation ā€¢ Includes also expression evaluation ā€¢ Procedure parameters treated as variables ā€¢ Independent from underlying hardware, ā€Ø targets cache / main memory faults ā€Ø 72 a=b; .. becomes ... a0=b0; a1=b1; if (b0 != b1) error(); ... a=b+c; ... becomes ... a0=b0+c0; a1=b1+c1; if ((b0!=b1)||(c0!=c1)) error();
  • 73. Dependable Systems Course PT 2014 Control Flow Error ā€¢ Control ļ¬‚ow error (CFE) leads to unexpected instruction execution ā€¢ Terminology: ā€¢ Basic block (BB) - branch-free serial code fragment ā€¢ Branch / jump instruction is modeled as last instruction of a basic block ā€¢ Control ļ¬‚ow graph (CFG) - one node per basic block ā€¢ Illegal branch - Node transition is not part of the CFG ā€¢ Wrong branch - Node transition is already part of the CFG ā€¢ Inter-block error - Erroneous branch to diļ¬€erent block ā€¢ Intra-block errors - Erroneous branch inside a block 73 i=0; while(i<n) { ------------------ if (a[i] < b[i]) ------------------ x[i]=a[i]; ------------------ else x[i]=b[i]; ------------------ i++; } ------------------ BB 0 BB 1 BB 2 BB 3 B 4 BB 5
  • 74. Dependable Systems Course PT 2014 Multi-Version Approaches Recovery Blocks ā€¢ Redundant system implementations are typically used simultaneously, ā€Ø best answer is picked i.e. by voting ā€¢ Alternative way: Sequential execution of recovery blocks ā€¢ Introduced in 1974 by Horning et. al. ā€¢ Dynamic fault tolerance approach, related to stand-by sparing in hardware 74 establish Checkpoint Primary Module Acceptance Test, else load Checkpoint Alternative Module 1 Acceptance Test, else load Checkpoint Alternative Module 2 ... else Failure Exception
  • 75. Dependable Systems Course PT 2014 Multi-Version Approaches - N-Version Programming ā€¢ Common mode errors are only catchable by design diversity ā€¢ Design diversity is a complex issue ā€¢ Design philosophies, software tools, programming languages, test philosophies ā€¢ Typical approaches try to utilize randomness - separate teams on diļ¬€erent locations ā€¢ N-Version Programming ā€¢ Suggested by Elmendorf in 1972, developed by Avizienis & Chen in 1977 ā€¢ Static approach, combination of decision mechanism and forward recovery ā€¢ At least two independently designed and functionally equivalent variants ā€¢ Variants are executed in parallel, decision mechanism selects the ā€žbestā€œ result ā€¢ Can support reliability, but also system security against malicious logic 75
  • 76. Dependable Systems Course PT 2014 Simplex ā€¢ Idea: Using simplicity to control complexity ā€¢ Forward recovery approach, based on feedback loop ā€¢ HAC subsystem - Simple construction, formal methods, reliable hardware ā€¢ HPC subsystem - Complex technology, advanced features ā€¢ HPC can use HAC output, but not vice versa ā€¢ Decision logic based on control loop output ā€¢ Typically performance degredation with HAC ā€¢ Example: Boeing 777 primary and secondary ā€Ø ļ¬‚ight controller 76
  • 77. 77