2. Topics Covered in this Presentation
What software reliability engineering is and why it is
needed.
Defining software reliability targets.
Operational profiles.
Reliability risk management.
Code inspection.
Software testing.
Reliable system design.
Reliability modeling.
Reliability demonstration.
4. Different Views of Reliability
Product development teams
View reliability at the sub-domain level, addressing mechanical,
electronic and software issues.
Customers
View reliability at the system level, with minimal consideration placed on
sub-domain distinction.
The primary measure of reliability is defined by the customer.
To develop a reliable product, engineering teams must consider both views (system and sub-domain).
(Diagram: System = Mechanical Reliability + Electronic Reliability + Software Reliability.)
Although this presentation focuses on software reliability engineering, it should be viewed as a
component of an overall Design for Reliability process, not as a separate activity; treating it
separately risks missing hardware-software interactions.
This presentation does not make any distinction between software and firmware; the same
techniques apply equally to both.
5. System-Level Reliability Modeling (1 of 2)
A system is made up of components/sub-systems; each has its own inherent
reliability.
(Diagram: Software, R = 0.99; Computer Server, R = 0.9665.)
A “traditional” reliability program may include modeling, evaluation and testing to prove that
the hardware meets the reliability target, but software should not be forgotten: it is also a
system component.
Individually the hardware and software may each meet the reliability target…but the target
must also be met when they are combined.
System reliability = H/W reliability × S/W reliability
I.e., H/W = 0.9665, S/W = 0.99, System = 0.9665 × 0.99 = 0.9568
System Reliability = 95.68%
6. System-Level Reliability Modeling
(2 of 2)
Therefore the software
reliability should also be
accounted for in the
system-level reliability
model.
Software may consist of both the operating system (OS) and configurable (turnkey)
software. It may not be possible to influence the OS design, but turnkey software can be
focused on.
This may consist of re-used software such as library functions and newly developed
software.
If the reliability of the library functions is already understood then library function re-use
simplifies the software reliability engineering process.
7. What is Software Reliability Engineering
(SRE)?
The quantitative study of the
operational behavior of software-based
systems with respect to user
requirements concerning reliability.
SRE has been adopted either as standard or as best practice by more than
50 organizations in their software projects including AT&T, Lucent, IBM,
NASA and Microsoft, plus many others worldwide.
This presentation will provide an introduction to software
reliability engineering…..
8. Why is SRE Important?
There are several key reasons a reliability engineering program should be
implemented:
To determine how satisfactorily products are functioning.
To avoid over-design – an over-engineered product costs more than necessary and lowers profit.
If more features are added to meet customer demand, reliability should be monitored to ensure
that new defects are not designed in.
If a customer’s product is not designed well, with reliability and quality in mind, then they
may well turn to a COMPETITOR!
Having a software reliability engineering process can make
organizations more competitive as customers will always
expect reliable software that is better and cheaper.
9. Why is SRE Beneficial?
For Engineers:
Managing customer demands:
Enables software to be produced that is more reliable, built faster and
cheaper.
Makes engineers more successful in meeting customer demands.
In turn this avoids conflicts – risk, pressure, schedule, functionality, cost etc.
For the organization:
Improves competitiveness.
Reduces development costs.
Provides customers with quantitative reliability metrics.
Places less emphasis on tools and a greater emphasis on
“designing in reliability.”
Products can be developed that are delivered to the customer at the right time, at an
acceptable cost, and with satisfactory reliability.
10. Common SRE Challenges
Data is collected during test phases, so if problems are discovered it is too late
for fundamental design changes to be made.
Failure data collected during in-house testing may be limited,
and may not represent failures that would be uncovered in the
product’s actual operational environment.
Reliability metrics derived from such restricted testing data may
therefore be inaccurate.
There are many possible models that can be used to predict the
reliability of the software, which can be very confusing.
Even if the correct model is selected there may be no way of validating it due to
having insufficient field data.
11. Fault Lifecycle Techniques
Prevent faults from being inserted.
Avoids faults being designed into the software when it is being
constructed.
Remove faults that have been inserted.
Detect and eliminate faults that have been inserted through inspection and test.
Design the software so that it is fault tolerant.
Provide redundant services so that the software continues to work even though faults have
occurred or are occurring.
Forecast faults and/or failures.
Evaluate the code and estimate how many faults are present and the occurrences and
consequences of software failures.
12. Preventing Faults From Being Inserted
Initial approach for reliable software
A fault that is never created does not cost anything to fix. This should be the ultimate
objective of software engineering.
This requires:
A formal requirement specification always being available that has been thoroughly
reviewed and agreed to.
Formal inspection and test methods being implemented and used.
Early interaction with end-users (field trials) and requirement refinement if necessary.
The correct analysis tools and disciplined tool use.
Formal programming principles and environments that are enforced.
Systematic techniques for software reuse.
Formal software engineering processes and tools, if applied successfully, can
be very effective in preventing faults (but are no guarantee!). However, software
reuse without proper verification can result in disappointment.
13. Removing Faults
When faults have been inserted into the software, the next
method that can be used is fault removal.
Approaches:
Software inspection.
Software testing.
Both have become standard industry practices. This presentation will focus closely
on these.
14. Fault Tolerance
This is a survival attribute – the software has to
continue to work even though a failure has
occurred.
Fault tolerance techniques enable a system to:
Prevent dormant software faults from becoming active (e.g.,
defensive programming to check input and output
conditions and forbid illegal operations).
Contain software errors within a confined boundary to prevent them from propagating
further (e.g., exception handling routines to treat unsuccessful operations).
Recover software operations from erroneous conditions by using techniques such as
checkpointing and rollback.
15. Fault/Failure Forecasting
If software failures are likely to occur it is critical to estimate the number of
failures and predict when each is likely to occur.
This will help concentrate on failures that have the greatest probability of occurring, provide
reliability improvement opportunities and improve customer satisfaction.
Fault/failure forecasting requires:
Defining a fault/failure relationship – why the failure occurs and its effect.
Establishing a software reliability model.
Developing procedures for measuring software reliability.
Analyzing and evaluating the measurement results.
Measuring software reliability provides:
Useful metrics that can be used to plan further testing and
debug efforts, to calculate warranty costs and to plan further
software releases.
A means of determining when testing can be terminated.
16. SRE Process Overview
This slide shows a general SRE
process flow that has six major
components:
Determine the reliability target.
Define a software operational profile.
Conduct code inspection.
Perform software testing.
Conduct reliability modelling to measure
the software reliability – continuously
improve the software reliability until the
target is reached.
Validate field reliability.
(Flow chart: Determine Reliability Objective → Define Operational Profile → Perform Code
Inspection → Perform Software Testing → Collect Failure Data → Use Software Reliability
Model(s) to Calculate Current Reliability → Reliability objectives met? If not, continue
testing; if met, the Software Release is Acceptable from a Reliability Perspective →
Validate Field Reliability.)
17. SRE Terms
Reliability objective: The product’s reliability goal from the customer’s viewpoint.
Operational profile: A set of system operational scenarios with their associated
probability of occurrence.
This encourages testers to select test cases according to the system’s likely operational
usage.
Reliability modeling: This is an essential element of SRE that determines whether the
product meets its reliability objective.
One or more models can be used to calculate, from failure data collected during system
testing, various estimates of a product’s reliability as a function of test time. It can also provide
the following information:
Product reliability at the end of various test phases.
Amount of additional test time required to reach the product’s reliability objective.
The reliability growth that is still required (ratio of initial to target reliability).
Prediction of field reliability.
Field Reliability Validation: Determination of whether the actual field reliability meets
the customer’s target.
19. Software Reliability Objectives
Reliability target(s) should be defined and used to:
Manage customer expectations.
Determine how reliability growth can and will be tracked throughout
the program.
Determine availability targets. Software reliability is commonly
expressed as an availability metric rather than as a
probabilistic reliability metric. This is defined as:
Availability = Software uptime / (Software uptime + downtime)
A data collection and analysis methodology also has to be defined:
How inspections will be conducted.
How failure data will be collected.
How the data will be analyzed, i.e., what model will be used?
This helps project managers track metrics and plan resources.
20. Managing the Software Reliability Objective
Defects are often inserted from the beginning of the project.
This is usually related to the intensity of the effort, i.e.,
the number of engineers working on the program, the
project schedule, the various design decisions that
are made, etc.
Defects are most often detected and
addressed at a later date than the original
design effort.
Test efforts are relied on to discover most defects; this lag can have a negative impact on
the program.
This can be mitigated by using code inspection, but some testing will still be
necessary. Code inspections should be conducted to IEEE 1028.
There is still a lag, though, between defect insertion and correction, which can have a
negative impact on the program.
The eventual defect rate represents the reliability target, and as defects are
discovered and addressed the software reliability is increased, or grown –
this is termed “reliability growth management”.
21. Initial Reliability Growth Model - The
Rayleigh Curve (1 of 3)
The eventual goal should be to forecast the discovery rate of defects as a
function of time throughout the software development program.
This cannot be achieved until data from prior similar projects becomes available. This may
take time but the effort provides value as it enables accurate forecasts to be achieved from
the beginning of the project.
Industry data is also available.
This helps to manage customer
expectations as it demonstrates a strategy
for improving software reliability.
To produce this curve, reliability data from
prior software developments has to be
available. It is therefore a goal, not a
technique that can be used immediately.
To get to this stage, metrics need to be
collected by using the methods discussed
in this presentation.
f(t) = K × (t / t_peak²) × e^(−t² / (2 × t_peak²))
where K is the total defect count and t_peak is the time at which the defect discovery rate
peaks.
22. The Rayleigh Curve (2 of 3)
The model's cumulative distribution function (CDF) describes the total-to-date effort
expended or defects found at each interval – it returns the
software reliability at various points in time.
F(t) = K × (1 − e^(−t² / (2 × t_peak²)))
23. The Rayleigh Curve (3 of 3)
Example: A software project has a 12-month delivery schedule.
Prior data is available to generate a reliability forecast.
The customer wants to know what the effect is of pulling
the delivery in to 9 months. What is the answer?
It reduces the total containment effectiveness
(TCE), otherwise expressed as reliability, from
89.6% to 61%.
Tradeoff:
This allows expectations to be managed by explaining that to achieve early delivery there
will be a tradeoff in the reliability, which may require a later release. This type of
management helps to avoid possible customer dissatisfaction.
24. Further Information
Software reliability growth using the Rayleigh Curve is discussed in greater
depth in Appendix A of How Reliable Is Your Product?: 50 Ways to Improve
Product Reliability, by Mike Silverman. The text of Appendix A was provided
by the author of this presentation.
This book is highly recommended for anybody who is interested in improving
product reliability; it is available from Amazon or directly from Ops A La Carte.
25. Software Availability and Failure
Intensity (1 of 2)
As mentioned earlier, instead of a reliability metric being provided,
customers may ask for a certain ‘availability’.
This is the average (over time) probability that a system or a capability of
a system is currently functional in a specified environment.
It depends on:
The probability of software failure
Length of downtime when failure occurs.
It essentially describes the expected fraction of the operating time during
which a software component or system is functioning acceptably.
If the software is not being modified (if further development or further
releases are not planned) then the failure rate will be constant and therefore
the availability will be constant.
26. Software Availability and Failure Intensity (2 of 2)
From earlier, availability is defined as:
Availability = Software uptime / (Software uptime + downtime)
Downtime can be expressed as:
Downtime = t_m × λ
where t_m = downtime per failure and λ = failure intensity.
For software, the downtime per failure is the time to recover from the
failure, not the time required to find and remove the fault.
∴ Availability = 1 / (1 + t_m × λ)
If an availability specification for the software is specified, then the
downtime per failure will determine a failure intensity objective:
λ = (1 − Availability) / (Availability × t_m)
Either an availability or a failure intensity objective has to be defined.
27. Example
A product must be available 99% of the time.
Required downtime per failure = 6 minutes (0.1 hr).
The downtime per failure can be used to determine the failure intensity objective:
λ = (1 − A) / (A × t_m) = (1 − 0.99) / (0.99 × 0.1) ≈ 0.1 failure/hr
or 100 failures/kHr
28. Availability, Failure Intensity, Reliability and MTBF
This presentation will discuss reliability in terms of availability, probability
and MTBF. These are the relationships between the three metrics.
A customer specifies an availability target of 0.99999 and a maximum software downtime
per failure of 5 minutes, or 0.083 hours. The failure intensity is determined from:
λ = (1 − Availability) / (Availability × t_m) = (1 − 0.99999) / (0.99999 × 0.083)
≈ 1.2 × 10⁻⁴ failures/hr
What is the reliability probability for a period of 2 years (17,520 hours)?
R(t) = e^(−λT) = e^(−1.2 × 10⁻⁴ × 17520) ≈ 0.12
What is the Mean Time Between Failures (MTBF)?
MTBF = 1 / λ = 1 / (1.2 × 10⁻⁴) ≈ 8,300 hours
30. Defining an Operational Profile
An operational profile is a quantitative characterization of how a system will
be used in the field by customers.
Why is it useful?
It provides information on how users will employ the product.
It enables the most critical operations to be focused on during testing.
This allows the efficiency of the reliability test effort to be improved.
It allows more realistic test cases to be designed.
To do this the individual software operations have to be identified, which are:
Major system logical tasks that return control to the system when complete.
Major = a task that is related to a functional requirement or feature rather than a subtask.
The operation can be initiated by a user, another part of the system, or by the system’s
own controller.
For more information on operational profiles refer to Software Reliability
Engineering: More Reliable Software Faster and Cheaper – John D. Musa
31. Developing an Operational Profile (1 of 5)
Five steps are needed to develop an operational profile:
1. Identify operation initiators (users, other sub-systems, external systems, the product’s own
controller, etc.).
2. Create an operations list – this is a list of operations that each initiator can execute. If all
initiators can execute every operation then the initiators can be omitted, and instead just
focus on producing a thorough operations list.
32. Developing an Operational Profile (2 of 5)
A good way to generate an operations list for a menu-driven product is to
produce a “walk tree” rather than use an initiators list. An example of a
menu-driven system is provided below.
This is based on a medical enteral pump, used for feeding patients.
33. Developing an Operational Profile (3 of 5)
Step 3. Once the operations list is complete it should be reviewed to
ensure:
All operations are of short duration in execution time (seconds at most).
Each operation must have substantially different processing from the others.
All operations must be well-formed, i.e., sending messages and displaying data are parts
of the operation and not operations in themselves.
The final list is complete with high probability.
The total number of operations is reasonable, taking the test budget into account. This is
because each operation will be focused on individually using a test case, so if the list is
too long it may result in the project test phase being very lengthy.
34. Developing an Operational Profile (4 of 5)
Step 4. Determine occurrence rates for each operation – this may need to be
estimated to begin with, but can be revised later.
Occurrence rate = Number of operation occurrences / Time the total set of operations is running
35. Developing an Operational Profile (5 of 5)
Step 5. Determine the occurrence probabilities.
Occurrence probability = Occurrence rate of each operation / Total operation occurrence rate
The resulting table can be rearranged by sorting the operations in order of descending
probability. This presents the operational profile in a form that is more convenient to use.
36. Establish Failure Definitions
What is critical to the customer? How does the customer define a failure?
A failure is any departure of system behavior in execution from the user needs.
A Fault is a defect that causes the failure (i.e., missing code).
A fault may not result in failure…but a failure can only occur if a fault exists.
Faults have to be detected – how can this be done?
Answer – by developing an operational profile. This enables resources to be focused on
addressing issues in the operations that have the highest probability of failure, resulting in
a low failure intensity.
Failure modes should be defined early in the project – this provides a specification for
what the system should NOT be doing!
Failure severity classes can also be defined; the failures that have the highest severity
should be focused on first.
38. Software FMEA and Risk Analysis
A software Failure Mode and Effects Analysis (SFMEA) is a systematic method
that:
Recognizes, evaluates, and prioritizes potential failures and their effects.
Identifies and prioritizes actions that could eliminate or reduce the likelihood of potential
failures occurring.
(Diagram: a cause – a material or process input at a process step – leads to a failure mode
(defect), which then leads to an effect – a software failure.)
An FMEA aids in anticipating failure modes in order to determine and assess the risk to
the customer or product.
Risks then have to be reduced to acceptable levels.
39. Software FMEA and Risk Analysis (1 of 2)
Fault trees provide a graphical and logical framework within which system failure modes
can be analyzed. These can then be used to assess the overall impact of
software failures on a system, or to prove that certain failure modes cannot
occur.
Here is a simple example of how to use a fault tree to perform a software
FMEA. It is far better to begin an FMEA using a fault tree; filling in a
spreadsheet immediately can easily result in confusion and is rarely
successful!
(System block diagram: Sensor → Controller → Actuator.)
Potential failure mode – unintended system function. This results in undesirable system
behavior, which could include potential controller or sensor failures.
The first step is to produce a fault tree.
42. Why Inspect Code?
Formal inspections should be carried out on the:
Requirements.
Design.
Code.
Approximately 18 man hours
plus rework are required per
300-400 lines of code.
Test plans.
“…formal design and code inspections rank as the most effective
methods of defect removal yet discovered…(defect removal) can top
85%, about twice those of any form of testing.”
-Capers Jones
Applied Software Measurement, 3rd Ed.
McGraw Hill 2008
Case study performed by the Data Analysis Center for Software (DACS):
85% Defect Containment: cost = $1,000,000, Duration = 12 months
95% Defect Containment: cost = $750,000, Duration = 10.8 months
43. Formal “Fagan Style” Inspections
This is a defined process that is quantitatively
managed.
The objective is to do the thing right. There is no discussion of
options, it is either right or wrong, or it requires investigation.
Ideally 4 inspectors participate (it can be 3-5, but not less than
3). Participants have roles – Leader, Reader, Author and Tester.
The review rate target is 150-200 lines of code per hour. What is found depends on
how closely the inspectors look at the code.
This is a 6 step process that is defined in IEEE 1028.
Data is stored in a repository for future reference.
The outcome should be that defects are found and fixed, and that data is collected and
analyzed.
44. Relationship Between Inspection and
Reliability (1 of 2)
For a four-phase test process the reliability is likely to vary between 74% and
92% (based on industry data).
Note that not all fixes address problems completely. Some fixes may not be totally effective,
while others may also introduce further problems. This is where inspection can be of value.
Adapted from a similar approach in : Capers Jones
Applied Software Measurement, 3rd Ed.
McGraw Hill 2008
45. Relationship Between Inspection and
Reliability (2 of 2)
Introducing inspection can increase the reliability to 93–99% (based on
industry data).
Inspection alone can enable the software to surpass the reliability that is obtained from a
test-only process!
This also increases the scope for reducing the emphasis on testing.
Adapted from: Capers Jones
Applied Software Measurement, 3rd Ed.
McGraw Hill 2008
47. Static Analysis (1 of 2)
This should be performed after the code is developed.
It is pattern based – it scans the code to check for patterns that are known to
cause defects.
This type of analysis uses coding standard rules and enforces internal coding guidelines.
This is a simple task, easily automated, that reduces future debugging effort.
It is data flow based, in that it statically simulates execution paths, so is able
to automatically detect potential runtime errors such as:
Resource leaks.
NullPointerExceptions.
SQL injections.
Security vulnerabilities.
The benefits of static analysis are:
It can examine more execution paths than conventional testing.
It can be applied early in the software design, providing significant time and cost savings.
48. Static Analysis (2 of 2)
Examples of warning classes that can be obtained from static analysis are:
Buffer overrun
Buffer underrun
Cast alters value
Ignored return value
Division by zero
Missing return statement
Null pointer dereference
Redundant condition
Shift amount exceeds bit width
Type overrun
Type underrun
Uninitialized variable
Unreachable code
Unused value
Useless assignment
49. Buffer Overflow Example
Consider the code segment below:
char arr[32];
for (int i = 0; i < 64; i++)
{
arr[i] = (char)i;
}
Here, memory that is beyond the range of the stack-based variable “arr” is being explicitly
addressed. This results in memory being overwritten, which could include the stack frame
information that is required for the function to successfully return to its caller, etc.
This coding pattern is typical of security vulnerabilities that exist in software. The specifics
of the vulnerability may change from one instance to another, but the underlying problem
remains the same, performing array copy operations that are incorrectly or insufficiently
guarded against exploit.
Static analysis can assist in detecting such coding patterns.
50. Types of Tests
Functional tests
This is single execution of operations with interactions between the various operations
minimized. The focus is on whether the operation executes correctly.
Load tests
These tests attempt to represent field use and the environment as accurately as possible,
with operations executing simultaneously and interacting. Interactions can occur directly,
through the data, or as a result of resource conflicts. This testing should use the operational
profile.
Regression tests
Functional tests that can be conducted after every build involving significant change. The
focus during these tests is to reveal faults that may have been created during the change
process.
Endurance/ad-hoc tests
These are similar to load tests in that they should represent the field use and environment
as accurately as possible, focusing on how the product is to be used…and may be
misused.
52. Reliable System Design (1 of 7)
To achieve reliable system design software should be designed such that it is
fault tolerant.
Typical responses to system or software faults during operation include a
sequence of stages:
Fault confinement,
Fault detection,
Diagnosis,
Reconfiguration,
Recovery,
Restart,
Repair,
Reintegration.
53. Reliable System Design (2 of 7)
Fault Confinement.
Limits the spread of fault effects to one area of the system – prevents
contamination of other areas.
Achieved through use of:
- self-checking acceptance tests,
- exception handling routines,
- consistency checking mechanisms,
- multiple requests/confirmations.
Erroneous system behaviors due to software faults are often difficult to detect.
Reduction of dependencies can help.
54. Reliable System Design (3 of 7)
Fault Detection.
This stage recognizes that something unexpected has occurred in the system.
Fault latency – period of time between fault occurrence and detection.
The shorter the fault latency is, the better the system can recover. Two technique
classes are off-line and on-line fault diagnosis:
- Off-line techniques are diagnostic programs.
System cannot perform useful work under test.
- On-line techniques provide real-time detection capability.
System can still perform useful work.
Watchdog monitors and redundancy schemes.
55. Reliable System Design (4 of 7)
Diagnosis.
This is necessary if the fault detection technique does not provide information about the
failure location and/or properties.
This is often an off-line technique that may require a system reset.
On-line techniques can also be used i.e., when a diagnosis indicates unhealthy system
conditions (such as low available resources), low-priority resources can be released
automatically in order to achieve in-time transient failure prevention.
Reconfiguration.
This occurs when a fault is detected and a permanent failure is located.
The system may reconfigure its components either to replace the failed component or to
isolate it from the rest of the system (i.e., redundant memory, error checking of memory in
case of partial corruption etc).
Successful reconfiguration requires robust and flexible software architecture and
reconfiguration schemes.
56. Reliable System Design (5 of 7)
Recovery.
Uses techniques to eliminate the effects of faults.
There are two approaches:
- fault masking,
- retry and rollback.
Fault masking hides effects of failures by allowing redundant, correct information to
outweigh the incorrect information.
Retry makes a second try at an operation as many faults are transient in nature.
Rollback makes use of a backed-up (checkpointed) state saved at some point in
processing prior to fault detection, and operation recommences from that point.
Fault latency is very important because the rollback must go back far enough to
avoid the effects of undetected errors that occurred before the detected error.
57. Reliable System Design (6 of 7)
Restart.
This occurs after the recovery of undamaged information.
There are three approaches:
- hot restart,
- warm restart;
- cold restart.
Hot restart – resumption of all operations from the point of fault detection (this is only
possible if no damage has occurred).
Warm restart – only some of the processes can be resumed without loss.
Cold restart – complete reload of the system is performed with no processes surviving.
58. Reliable System Design (7 of 7)
Repair.
Replacement of failed component – on or off-line.
Off-line – system brought down to perform repair. System availability depends on how fast
a fault can be located and removed.
On-line – the component is replaced immediately with a backup spare (similar to
reconfiguration), or perhaps operation can continue without using the faulty component
(e.g., masking redundancy or graceful degradation).
On-line repair prevents system operation interruption.
Reintegration.
Repaired module must be reintegrated into the system.
For on-line repair, reintegration must be performed without interrupting system operation.
Non-redundant systems are fault intolerant and, to achieve reliability, fault avoidance is
often the best approach. Redundant systems should use fault detection, masking
redundancy (e.g., disabling 1 out of N units), and dynamic redundancy (e.g., temporarily
disabling certain operations) to automate one or more stages of fault handling.
60. Reliability Modeling (1 of 4)
This is used to calculate what the current reliability is, and if the reliability
target is not yet being achieved, determine how much testing and debug
needs to be completed in order to achieve the reliability target.
The questions that reliability modeling aims to answer are:
How many failures are we likely to experience during a fixed time period?
What is the probability of experiencing a failure in the next time period?
What is the availability of the software system?
Is the system ready for release (from a reliability perspective)?
(Figure: a timeline of software failures from T = 0 to the end of test, TE, with failures at
cumulative times T1 … T8.)
Ti is the cumulative time to failure.
ti is the inter-arrival time = Ti − Ti−1.
61. Reliability Modeling (2 of 4)
In reliability engineering it is usual to identify a failure distribution,
especially when modeling non-repairable products*. This approach can be
used because it is assumed that hardware faults are statistically
independent and identically distributed.
Where software is concerned, events (failures) are not necessarily
independent due to interactions with other system elements, so in most
cases failures are not identically distributed.
When a failure occurs in a software system the next failure may depend on
the current operational time of the unit, and therefore each failure event in
the system may be DEPENDENT.
* Although it can be argued that a software system can be repaired by fixing the fault, in
reliability terms it is still a non-repairable product because it is not wearing out. For instance,
a car is a repairable device as parts can be changed when they wear out, but this does not
necessarily make it as good as new. If a software fault is repaired it is actually as good as
new again, and in fact the improvement may make it better than new.
62. Reliability Modeling (3 of 4)
Therefore what is needed is to model the Rate of Occurrence of Failures
and the Number of Failures within a given time.
As an example, with reference to the figure below, a model is needed that
will report the fact that 8 failures are expected by time TE and that the Rate
of Occurrence of Failures is increasing with time.
[Figure: the same failure timeline as on the previous slide, with failures T1
through T8 occurring before the end of observation TE.]
63. Reliability Modeling (4 of 4)
If a Distribution Analysis is performed on the Time-Between-Failures, then
this is equivalent to saying that there are 9 different systems, where
System 1 failed after t1 hours of operation, System 2 after t2,…, etc.
[Figure: the times-between-failures redrawn as nine separate systems, where
System 1 fails after t1 hours, System 2 after t2, and so on, with System 9
still running at T9 (a suspension*).]
This is the same as assuming that the system is failure free if the fault is
addressed, which may not necessarily be true as further failures may
occur.
Example: Changing the brake pads on a car. This does not mean that the
car is now failure free!
* A suspension is a unit that continues to work at the end of the analysis period or is removed
from a test in working condition, i.e., it may fail at some point in the future.
64. An Example of an Incorrect Approach (1 of 4)
This example has been included because it is a common approach to hardware
reliability modeling but it CANNOT be used for modeling software reliability.
This method is normally used to model a non-repairable hardware product.
Unfortunately when used in analyzing software reliability it returns incorrect
results…but it is an easy trap for a reliability engineer to fall into!!!
Both firmware and hardware failure data is collected from three systems:
A total of 6 different firmware and 4 different hardware failure modes are
identified
65. An Example of an Incorrect Approach (2 of 4)
The conventional reliability engineering approach is to take the Time-Between-Failures for each system and then fit a distribution.
[Table: times-between-failures calculated for each system, e.g., 319 – 152 = 167 hours.]
Notice that hardware failures have been removed.
The time between the last failure and the current age is a Suspension.
66. An Example of an Incorrect Approach (3 of 4)
A Weibull (life data) Analysis is conducted, but with software this is not
appropriate!
This analysis assumes a sample of 20 systems, where one system failed
after 152 hrs, another after 319 hrs, etc.
67. An Example of an Incorrect Approach (4 of 4)
This system will be used for a total of 250 hours. What will the software
reliability be?
Distribution analysis is okay for non-repairable products containing only
hardware, but not for anything containing software (or for repairable
hardware-only products).
In products that contain software, events are dependent, and therefore
alternative analysis methods should be used.
However, it is correct to fit a distribution on the First-Time-to-Failure of
each system.
97.63% – GREAT RESULT, BUT COMPLETELY WRONG!!!
68. An Example of a Correct Approach
This is the probability that the unit will NOT fail in the first 250 hours.
Reliability = 68.36%.
Notice that the confidence interval is very wide.
69. Three Possible SRE approaches…
Are multiple systems being tested?
Yes – Use the NHPP model (this is the best option).
No – Can testing be stopped after each phase to fix failure modes?
Yes – Use the Crow-Extended model.
No – Use the 3-Parameter Crow-Extended model.
The NHPP model is the current state of the art in software reliability
modeling, and is suitable for most projects. However, this approach is not
suitable for testing a single unit (i.e., a large expensive system), or where
not all faults are going to be fixed in between compiles. A better model is
needed for this type of application.
It is hypothesized by the author that the Crow-Extended models may be
more suitable for developments where the NHPP model cannot be well
applied. This essentially represents a future state of software reliability
testing. However, before being readily accepted they should be validated,
i.e., by comparing their predicted reliability with actual field data. Use of
these models has been included in this presentation for completeness and
possible future application.
70. A Better SRE Analysis Approach (1 of 4)
A model is needed that takes into account the fact that when a failure
occurs the system has a “Current Age”; in other words, a further failure
is likely to occur and will depend on that age.
For example, in System 1, the system has
an age of 152 hours after the first firmware
failure mode has been detected.
In other words, all other operations that can
result in a failure also have an age of 152 hours
and the next failure event is based on this fact.
71. A Better SRE Analysis Approach (2 of 4)
The NHPP (Non-Homogeneous Poisson Process) with a Power Law failure
intensity is such a model:
Pr[N(T) = n] = [λT^β]^n × e^(–λT^β) / n!
Where:
Pr[N(T)=n] is the probability that n failures will be observed by time T.
The Failure Intensity Function (Rate of Occurrence of Failures) is u(T) = λβT^(β–1).
Just because a model is used for hardware does not mean that it cannot be suitable for software
as well, as models simply describe times-to-failure. Therefore a hardware model can also be
used for software, providing that it is a dependent model (failures are dependent on the
operational time, rather than being independent).
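A minimal sketch of fitting this model, assuming a single system observed to a fixed end time: the maximum-likelihood estimates for the power-law NHPP are beta = n / Σ ln(T_end/T_i) and lambda = n / T_end^beta. The failure times below are hypothetical, not the presentation's data:

```python
import math

# Power-law NHPP (Crow-AMSAA form): expected failures E[N(T)] = lam * T**beta,
# failure intensity u(T) = lam * beta * T**(beta - 1).

def fit_power_law_nhpp(failure_times, t_end):
    """MLE for a single system, time-terminated at t_end."""
    n = len(failure_times)
    beta = n / sum(math.log(t_end / t) for t in failure_times)
    lam = n / t_end ** beta
    return lam, beta

def prob_n_failures(lam, beta, t, n):
    """Pr[N(t) = n]: Poisson with mean lam * t**beta."""
    mean = lam * t ** beta
    return mean ** n * math.exp(-mean) / math.factorial(n)

times = [152, 319, 610, 980, 1290]   # hypothetical failure times (hours)
lam, beta = fit_power_law_nhpp(times, t_end=1380)
print(beta > 1)                      # True for this data: intensity increasing
print(prob_n_failures(lam, beta, 1380, len(times)))
```

A fitted beta greater than 1 corresponds to the observation two slides on that inter-arrival times between failures are decreasing.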
72. A Better SRE Analysis Approach (3 of 4)
NHPP model parameters:
Here the failure events of System 1 are analyzed between the period of 0 and 1380 hours.
This folio also contains the failure events for Systems 2 and 3 (not shown).
Of interest is the fact that
Beta >1, which indicates
that the inter-arrival times
between unique failures
are decreasing, so there
may be little opportunity
for reliability improvement.
73. A Better SRE Analysis Approach (4 of 4)
NHPP model results:
Plot shows the cumulative number of failures vs. time, from which conclusions and further
predictions can be obtained. The Weibull plot intersects the X-axis, so out-of-box failures
should not be present. If it had intersected the Y-axis then this would indicate potential for
out-of-box failures.
The expected cumulative number of failures is 0.1352 per 250 operational
hours, or 13.52 failures per 25,000 operational hours.
74. An Example Using the NHPP Model (1 of 8)
Software is under development – the reliability
requirement is to have no more than 1 fault in
every 8 hours of software operation.
Three Test Engineers provide a total of 24 hours
of testing each day.
One new compile is available for testing each week,
when fixes are implemented.
The failure rate goal is:
FR = 1/8 = 0.125 failures per hour
In a testing day the failure intensity goal is 3 faults/day:
FRI = 0.125 × 24 = 3 faults per day
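A quick check of this arithmetic (a sketch, not part of the original deck):

```python
# Failure-rate goal: no more than 1 fault per 8 hours of operation,
# with 3 test engineers providing 24 hours of testing per day.
fr_goal = 1 / 8          # failures per hour
fi_goal = fr_goal * 24   # failures per testing day
print(fr_goal, fi_goal)  # 0.125 3.0
```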
75. An Example Using the NHPP Model (2 of 8)
Failure data is obtained:
The data is grouped by the number of days until a new compile is available,
i.e., the first 45 failures are contained in one group and are fixed in compile
#1.
NHPP model
parameters
76. An Example Using the NHPP Model (3 of 8)
The instantaneous failure intensity after 28
days of testing is 4.4947 faults/day.
If testing is continued with the same growth rate, when will the goal of no
more than 3 faults/day be achieved?
The answer is after an additional 149 – 28 = 121 days of testing and
development (test-analyze-fix).
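The projection works by inverting the power-law failure intensity u(t) = lam·beta·t^(beta–1), which gives t_goal = t_now × (goal / u_now)^(1/(beta–1)). The slide does not quote the fitted beta, so the value below is an assumption chosen to be consistent with the slide's numbers:

```python
beta = 0.758     # assumed fitted growth parameter (not quoted on the slide)
t_now = 28       # days of testing so far
u_now = 4.4947   # instantaneous failure intensity after 28 days (faults/day)
goal = 3.0       # failure intensity goal (faults/day)

# u(t) is proportional to t**(beta - 1), so
# u(t_goal) / u(t_now) = (t_goal / t_now)**(beta - 1); solve for t_goal.
t_goal = t_now * (goal / u_now) ** (1 / (beta - 1))
print(round(t_goal), round(t_goal) - t_now)  # 149 121
```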
77. An Example Using the NHPP Model (4 of 8)
An extra 121 days is longer than anticipated. Let’s take a closer look by
generating a Failure Intensity vs. Time plot…
Each of these lines indicates the
failure intensity over a given
interval (which in this case is 5
days).
It can be seen that there was a
jump in the failure intensity between
20 and 23 days.
This is why it is estimated that more
development time is required.
The next step is to analyze the data set for the period up to 20 days of testing,
before the failure intensity increased…
78. An Example Using the NHPP Model (5 of 8)
The NHPP model data is limited to the first 20 days of testing and
another Failure Intensity vs. Time plot is generated, but this time for the
first 20 days:
This plot shows the decrease in the
failure intensity rate over the first 20
days of testing.
This confirms that the failure intensity
continuously reduced during the first
20 days.
79. An Example Using the NHPP Model (6 of 8)
Based on the first 20 days of data the
additional test and development duration can
be recalculated, which results in there being
an additional 55-28=27 days to achieve the
goal of having no more than 3 faults/day,
rather than 121!
This generates questions:
Why is there such a big difference in the test duration
still required?
What happened when the failure intensity jumped on
the 23rd day of testing and development?
Answer – New functionality was added. The jump in required test time is
typical when new features are introduced, and applies to software and
hardware alike.
Because new functionality has been added it would be wise to reset the clock and track the
reliability growth from the 20th day forward…
80. An Example Using the NHPP Model (7 of 8)
Now the NHPP model parameters need to be obtained and plotted for the
last 8 days of testing (8 days is an arbitrary number; enough data needs to
be available to have confidence in any conclusions that are drawn).
This provides better resolution. By taking a “macro” view it can be seen
that the failure intensity is starting to increase, so the minimum failure
intensity point has been determined. For improved accuracy, calculations
should be based on this.
81. An Example Using the NHPP Model (8 of 8)
Based on this data set 51-8=43 more days
of developmental testing are required.
It may be too early to make any
predictions based on only 8 days of
testing, but the result can be used to
obtain a general idea of the remaining
development time required and produce
a test plan.
To pull in the schedule 3 more Test
Engineers could be added and the code
recompiled every 2 days, which will
complete the project within 1 month.
There are also situations where some issues are fixed immediately, others are addressed
later and more minor issues may not be addressed at all. In this type of situation the
Crow Extended Model can be useful…
82. Crow-Extended Model Introduction (1 of 2)
This is not a common SRE model but it does have the benefit of supporting
decision making by providing metrics such as:
Failure intensity vs. time.
Demonstrated Mean Time Between Failures (MTBF*).
MTBF growth that can be achieved through implementation of corrective actions.
Maximum potential MTBF that can be achieved through implementation of
corrective actions.
Maximum potential MTBF that can likely be achieved for the software and estimates
regarding latent failure modes that have not yet been uncovered through testing.
This model utilizes A, BC and BD failure mode classifications to analyze
growth data.
A = A failure mode that will not be fixed.
BC = A failure mode that will be fixed while the test is in progress.
BD = A failure mode that will be corrected at the end of the test.
* This model uses MTBF rather than failure intensity or reliability metrics. A conversion between these various
metrics is provided in slide 28.
83. Crow-Extended Model Introduction (2 of 2)
There is no reliability growth for A modes.
The effectiveness of the corrective actions for BC modes is assumed to be
demonstrated during the test.
BD modes require a factor to be assigned that estimates the effectiveness of
the correction that will be implemented after the test.
Analysis using the Crow Extended model allows different management
strategies to be considered by reviewing whether the reliability goal will be
achieved.
There is one constraint to this approach – the testing must be stopped at the
end of the test phase and all BD modes must be fixed. The Crow Extended
model will return misleading conclusions if it is used across multiple test
phases. For those situations use the 3-Parameter Crow-Extended model
(discussed next).
DO NOT APPLY THIS MODEL TO A MULTIPLE SYSTEM TEST, USE THE NHPP
MODEL INSTEAD.
84. Crow-Extended Model Example (1 of 8)
A product underwent development testing, during which failure modes were
observed. Some modes were corrected during the test (BC modes), some
modes were corrected after the end of the test (delayed fixes, BD modes) and
some modes were left in the system (A modes). The test was terminated after
400 hours; the times-to-failure are provided below:
85. Crow-Extended Model Example (2 of 8)
An effectiveness factor has been assigned for each BD failure mode
(delayed fixes). The effectiveness factor is based on engineering
assessment and represents the fractional decrease in failure intensity of a
failure mode after the implementation of a corrective action.
The effectiveness factors for the BD failure modes are provided below:
This is a metric that enables an assessment to be made of whether or not the corrective actions
have been effective, and if they have, how effective they were. This is often a subjective judgment.
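As a sketch of what an effectiveness factor does (the mode names, failure intensities and factors below are hypothetical, not the example's data): a fix with effectiveness factor d removes the fraction d of that mode's failure intensity, leaving (1 – d) behind.

```python
# mode: (failure intensity in failures/hr, effectiveness factor) - hypothetical
bd_modes = {"BD1": (0.010, 0.7), "BD2": (0.004, 0.5)}

# Failure intensity remaining after each delayed corrective action is applied.
remaining = {mode: lam * (1 - ef) for mode, (lam, ef) in bd_modes.items()}
print(remaining)
```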
86. Crow-Extended Model Example (3 of 8)
The times-to-failure data and effectiveness factors are entered:
Note that this data sheet only
displays 29 rows of data, but all
data is entered even though it
has not been shown.
Effectiveness factor is expressed as 0-1 (0-100% of the failure intensity being removed by the
corrective action).
87. Crow-Extended Model Example (4 of 8)
Model parameter calculation:
Here the failure events are analyzed between the period of 0 and 400 hours.
88. Crow-Extended Model Example (5 of 8)
Growth potential MTBF plot:
Growth potential MTBF (maximum achievable MTBF based on the current strategy).
Projected MTBF (estimated MTBF after delayed corrective actions have been implemented).
Demonstrated MTBF (MTBF at the end of the test without corrective actions).
Instantaneous MTBF (demonstrated MTBF with time).
The demonstrated MTBF (the result of fixing BC modes during the test) is about 7.76 hours.
The projected MTBF (the result of fixing BD modes after the test) is about 11.13 hours.
The growth potential MTBF (if testing continues with the current strategy, i.e. modes corrected
vs. modes not corrected and with the current effectiveness of each corrective action) is
estimated to be about 14.7 hours. This is the maximum attainable MTBF.
89. Crow-Extended Model Example (6 of 8)
An Average Failure Mode Strategy plot is a pie chart that breaks down the
average failure intensity of the software into the following categories:
A modes – 9.546%.
BC modes addressed – 14.211%.
BC modes still undetected – 30.655%.
BD modes removed – 8.846%.
BD modes to be removed – 3.355% (because corrective actions were <100% effective).
BD modes still undetected – 33.386%.
90. Crow-Extended Model Example (7 of 8)
Individual Mode MTBF plot, which shows the MTBF of each individual failure
mode. This enables the failure modes with the lowest MTBF to be identified.
These are the failure modes
that cause the majority of
software failures, and should
be addressed as the highest
priority when reliability
improvement activities are to
be implemented.
Blue = Failure mode MTBF
before corrective action.
Green = Failure mode MTBF
after corrective action.
91. Crow-Extended Model Example (8 of 8)
Failure Intensity vs. Time plot:
This can be analyzed in
exactly the same way as in
the NHPP example.
92. 3-Parameter Crow-Extended Model
Introduction (1 of 2)
This is not a common SRE model either, but it has the same benefits as the
single-parameter Crow-Extended model and can also take multiple test
phases into account.
This model is ideal in situations where software is to be tested over multiple phases
but where all bug fixes cannot be introduced as faults are discovered, i.e., all bugs
will be addressed on an ad-hoc basis over an extended time period.
The model provides the flexibility of not having to specify when the test will end, so
it can be continuously updated with new test data. Therefore this model is optimized
for continuous evaluation rather than fixed test periods.
It can only be applied to an individual system, so it lends itself ideally to situations
where an individual complex system is being tested. DO NOT APPLY ANY CROW
MODEL TO A MULTIPLE SYSTEM TEST, USE THE NHPP MODEL INSTEAD.
93. 3-Parameter Crow-Extended Model
Introduction (2 of 2)
This model uses several event codes:
F – Failure time.
I – Time at which a certain BD failure mode has been corrected. BD modes that have
not received a corrective action by time T will not have an associated I event in the
data set.
Q – A failure that was due to a quality issue, such as a build problem rather than a
design problem. The reliability engineer can decide whether or not to include
quality issues in the analysis.
P – A failure that was due to a performance issue, such as an incorrect component
being installed in a device where the embedded code is being tested. The reliability
engineer can decide whether or not to include performance issues in the analysis.
AP – An analysis point, used to track overall project progress, which can be
compared to planned growth phases.
PH – The end of a test phase. Test phases can be used to track overall project
progress, which can be compared to planned growth phases.
X – A data point that is to be excluded from the analysis.
94. 3-Parameter Crow-Extended Model
Example (1 of 11)
Software is under development. Testing is to be conducted in 3 phases.
Phase 1 – 6 weeks of manual testing that is run 45 hours per week, total 270 hours.
Phase 2 – 4 weeks of automated testing that is run 24/7, total 672 hours.
Phase 3 – 8 weeks of field manual testing that is run 40 hours per week, total 320 hours.
One hour of continuous testing equates to 7 hours of customer usage, so
the testing includes a usage acceleration factor of 7.
The average fix delays for the three phases are 90 hours, 90 hours and 180
hours respectively (fix delay = the time between discovering a failure mode
and incorporating the corrective action into the design).
Taking usage acceleration into account, the cumulative test times for the
three phases are 1890 hours, 6594 hours and 8834 hours respectively.
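The cumulative accelerated test times quoted above can be checked directly (a sketch; Python is used only for illustration):

```python
phase_hours = [270, 672, 320]  # manual, automated and field test hours
accel = 7                      # 1 test hour = 7 hours of customer usage

cumulative, total = [], 0
for hours in phase_hours:
    total += hours
    cumulative.append(total * accel)
print(cumulative)  # [1890, 6594, 8834]
```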
95. 3-Parameter Crow-Extended Model
Example (2 of 11)
Customer reliability target = 2 failures per year.
Usage duty cycle = 0.1428.
Therefore for continuous usage the reliability target is 2 failures every
1251 hrs (8760 hrs × 0.1428 ≈ 1251 hrs).
Failure intensity = 2 / 1251 = 0.0016 failures per hour.
Equivalent test time = 8834 hrs.
Required MTBF = 8834 / (0.0016 × 8834) = 625 hrs.
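This conversion can be checked in a few lines, assuming 8760 calendar hours per year (an assumption that reproduces the slide's figures):

```python
failures_per_year = 2
duty_cycle = 0.1428

operating_hours = 8760 * duty_cycle                      # ~1251 hrs of use/year
failure_intensity = failures_per_year / operating_hours  # ~0.0016 failures/hr
mtbf = 1 / failure_intensity                             # ~625 hrs
print(round(operating_hours), round(failure_intensity, 4), round(mtbf))
# 1251 0.0016 625
```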
96. 3-Parameter Crow-Extended Model
Example (3 of 11)
The growth potential (GP) design margin = 1.3.
This is the amount by which the MTBF target should exceed the requirement for margin.
It is an initial estimate based on prior experience; the higher the GP margin,
the lower the risk to the program.
Average effectiveness factor = 0.5 (1.0 = a perfect fix, 0 = inadequate fix)
This is also an initial estimate based on prior experience.
Management strategy – address at least 90% of all unique failure modes
prior to formal release.
Beta parameter = 0.7
This is the rate for discovering new, distinct failure modes found during testing.
Again, this is an estimate based on prior experience.
A discovery beta of less than 1 indicates that the inter-arrival times between unique B
modes are increasing. This is desirable, as it is assumed that most failures will be
identified early on, and their inter-arrival times will become larger as the test progresses.
This is an initial estimate; the actual discovery beta can be obtained from the final results,
allowing this parameter estimation to be refined with testing experience.
97. 3-Parameter Crow-Extended Model
Example (4 of 11)
Based on the assumptions on the previous slide, an overall growth planning
model can be created that shows the nominal and actual idealized growth
curve and the planned growth MTBF for each phase.
A growth planning folio is created and 1890, 6594 and 8834 are entered for
the Cumulative Phase Times and 630, 630 and 1260 for the Average phase
delays.
Note that inter-phase average fix delays have
been multiplied by 7 to take the usage
acceleration factor into account.
98. 3-Parameter Crow-Extended Model
Example (5 of 11)
The project parameters are input into the Planning Calculations window.
Given the MTBF target and design
margin that has been specified,
along with other required inputs to
describe the planned reliability
growth management strategy, the
final MTBF that can be achieved is
calculated, along with other useful
results. Here it is verified that 625
hours is achievable (if it was not
achievable a figure of less than 625
hours would be calculated).
99. 3-Parameter Crow-Extended Model
Example (6 of 11)
Effectiveness Factors for all BD modes are specified, together with when
they are to be implemented.
A growth planning plot can then be obtained:
[Plot annotations: MTBF at the end of phases 1, 2 and 3.]
This plot displays the MTBF
vs. Time values for the three
phases that have been
planned for the test.
100. 3-Parameter Crow-Extended Model
Example (7 of 11)
Test failure data is collected during the three phases:
Actual discovery beta (the original estimate was 0.7).
101. 3-Parameter Crow-Extended Model
Example (8 of 11)
The growth potential MTBF plot can now be obtained:
Growth potential MTBF (maximum achievable MTBF based on the current strategy).
Demonstrated MTBF (MTBF at the end of the test without corrective actions).
Projected MTBF (estimated MTBF after delayed corrective actions have been implemented).
Instantaneous MTBF.
If the MTBF goal is higher than the Growth Potential line then the current design
cannot achieve the desired goal and a redesign or change of goals may be
required. For this example, the goal MTBF of 650 hours is well within the growth
potential and is expected to be achieved after the implementation of the delayed
BD fixes.
102. 3-Parameter Crow-Extended Model
Example (9 of 11)
Average Failure Mode Strategy plot, breaking down the average failure
intensity of the software into categories:
A modes – 13.432%.
BC modes addressed – 19.281%.
BC modes still undetected – 13.76%.
BD modes removed – 25.893%.
BD modes remaining (because corrective actions were <100% effective) – 5.813%.
BD modes still undetected – 21.882%.
103. 3-Parameter Crow-Extended Model
Example (10 of 11)
Individual Mode MTBF plot showing the MTBF of each individual failure mode,
thus enabling the failure modes with the lowest MTBF to be identified.
Blue = Failure mode MTBF before
corrective action.
Green = Failure mode MTBF after
corrective action.
104. 3-Parameter Crow-Extended Model
Example (11 of 11)
The RGA Quick calculation pad indicates
that the discovery rate of new unseen BD
modes at 630 hours is 0.0006 per hour.
The Beta bounds are less than 1, indicating
that there is still growth in the system
(think of this as the leading edge slope of
the bathtub curve; when beta=1 there is no
more growth potential)
106. Reliability Demonstration Testing (1 of 2)
There can be occasions when the actual software reliability has to be
measured through a practical demonstration rather than estimated from a
reliability growth program. However, this is more applicable where all known
faults have been removed and the software is considered to be stable.
If the reliability has already been determined by conducting a reliability growth
program then there may be little value in conducting this test; it is more suitable
for situations where a reliability growth program has not been conducted.
This can be achieved through
sequential sampling theory.
107. Reliability Demonstration Testing (2 of 2)
A project specific chart depends on:
Discrimination Ratio – this is an error in the failure intensity estimation that is considered to
be acceptable.
Consumer Risk Level – this is the probability of falsely claiming the failure intensity has
been met when it has not.
Supplier Risk Level – this is the probability of falsely claiming the failure intensity objective
has not been met when it has.
Common values are:
Discrimination Ratio: 2.
Consumer Risk Level: 0.1 (10%).
Supplier Risk Level: 0.1 (10%).
108. Example
Requirement: 4 failures/million operations.
[Chart: sequential reliability demonstration chart with Accept, Continue and
Reject regions. The normalized boundary values for failures 1, 2 and 3 are
0.4, 0.625 and 1.2; multiplying by the requirement target gives 1.6, 2.5 and 4.8.]
The software can be accepted after failure 3 with 90% confidence that it is
within the reliability target and a 10% risk that it is not. The boundary has
to be crossed though.
109. Reliability Demonstration Test Chart
Design (1 of 2)
What if the software is still in the Continue region at the end of the test?
Assume that the end of the test is reached just after failure 2.
Option 1 – Calculate the achievable Failure Intensity Objective:
Factor = F_CURRENT / F_PREVIOUS = 3.6 / 2.5 = 1.44
∴ FIO = 1.44 × 4 = 5.76
Option 2 – Extend the test time by ≥ this factor.
Grouped data CANNOT be used; the data has to be obtained from individual units.
110. Reliability Demonstration Test Chart
Design (2 of 2)
The following formulae are used to design RDT charts:
Accept-Continue boundary: TN = (A – n·ln γ) / (1 – γ)
Reject-Continue boundary: TN = (B – n·ln γ) / (1 – γ)
Where:
TN: Normalized measure of when failures occur (horizontal coordinate).
n: Failure number.
γ: Discrimination ratio (ratio of the maximum acceptable failure intensity to the failure
intensity objective).
A and B are defined from:
A = ln(β / (1 – α))
B = ln((1 – β) / α)
Where:
α: Supplier risk (probability of falsely claiming objective is not met when it is).
β: Consumer risk (probability of falsely claiming objective is met when it is not).
111. Reliability Demonstration Test Chart
Design Example
The boundary intersections with the x and y axes can be calculated using the
following formulae:
Accept boundary points: ((A – n·ln γ) / (1 – γ), n), crossing the x-axis at (A / (1 – γ), 0).
Reject boundary points: ((B – n·ln γ) / (1 – γ), n), crossing the y-axis at (0, B / ln γ).
In this example
n=16.
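The chart design can be sketched directly from these formulae. The inputs below are the "common values" quoted earlier (discrimination ratio 2, 10% consumer and supplier risks); they are used only for illustration, and note that the presentation's own example chart (n=16) was built with different parameters:

```python
import math

alpha = 0.1   # supplier risk
beta = 0.1    # consumer risk
gamma = 2     # discrimination ratio

A = math.log(beta / (1 - alpha))   # accept-line constant (negative)
B = math.log((1 - beta) / alpha)   # reject-line constant (positive)

def accept_tn(n):
    """Normalized time TN on the Accept-Continue boundary at failure n."""
    return (A - n * math.log(gamma)) / (1 - gamma)

def reject_tn(n):
    """Normalized time TN on the Reject-Continue boundary at failure n."""
    return (B - n * math.log(gamma)) / (1 - gamma)

print(round(accept_tn(0), 3))         # x-axis intercept of the accept line: 2.197
print(round(B / math.log(gamma), 2))  # y-axis intercept of the reject line: 3.17
```

Accepting the software requires the accumulated normalized test time to cross accept_tn(n) after the n-th failure, as in the example chart.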
112. SRE Review
Enables defect discovery rates to be
forecast and monitored – helps all staff –
enables customer expectations to be
managed.
Enables reliability targets to be established
and monitored.
Software FMEA enables failure modes and risks to be identified.
Establishes formal and thorough test and analysis methodologies.
Provides a method for modeling and demonstrating software reliability.
Defines code inspection processes.
Guarantees customer satisfaction!
113. References
Adamantios Mettas, “Repairable Systems: Data Analysis and Modeling,” Applied
Reliability Symposium, 2008.
Michael R. Lyu, “Software Reliability Engineering: A Roadmap.”
Dr. Larry Crow, “An Extended Reliability Growth Model for Managing and
Assessing Corrective Actions,” Reliability and Maintainability Symposium, 2004.
John D. Musa, “Software Reliability Engineering: More Reliable Software Faster
and Cheaper,” AuthorHouse, 2004.
ReliaSoft RGA 7 Training Guide.
Capers Jones, “Applied Software Measurement,” 3rd Edition, McGraw-Hill, 2008.