Software Reliability Engineering
Mark Turner
Topics Covered in this Presentation
What software reliability engineering is and why it is
needed.
Defining software reliability targets.
Operational profiles.
Reliability risk management.
Code inspection.
Software testing.
Reliable system design.
Reliability modeling.
Reliability demonstration.
INTRODUCTION
What Software Reliability Engineering is
and why it is needed
Different Views of Reliability
Product development teams
View reliability at the sub-domain level, addressing mechanical,
electronic and software issues.

Customers
View reliability at the system level, with minimal consideration placed on
sub-domain distinction.

The primary measure of reliability is defined by the customer.
To develop a reliable product engineering teams must consider both
views (system and sub-domain).

System Reliability = Mechanical Reliability + Electronic Reliability + Software Reliability
Although this presentation focuses on software reliability engineering, it should be viewed as a component of an overall Design for Reliability process, not as a disparate activity; otherwise hardware/software interactions may be missed.
This presentation does not make any distinction between software and firmware; the same techniques apply equally to both.
System-Level Reliability Modeling (1 of 2)
A system is made up of components/sub-systems; each has its own inherent
reliability.

Example: Computer Server (hardware), R = 0.9665; Software, R = 0.99.

A “traditional” reliability program may include modeling, evaluation and testing to prove that the hardware meets the reliability target, but software should not be forgotten, as it too is a system component.
Individually the hardware and software may meet the reliability target…but they must also meet it when they are combined.

System reliability = H/W reliability × S/W reliability
I.e., H/W = 0.9665, S/W = 0.99, System = 0.9665 × 0.99 = 0.9568
System Reliability = 95.68%
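This series relationship is straightforward to check numerically. A minimal C sketch using the values above:

#include <stdio.h>

/* Series system: every component must work for the system to work,
   so the subsystem reliabilities multiply. Values are taken from the
   example above. */
int main(void)
{
    double r_hw = 0.9665;  /* computer server (hardware) reliability */
    double r_sw = 0.99;    /* software reliability */
    double r_sys = r_hw * r_sw;

    printf("System reliability = %.4f (%.2f%%)\n", r_sys, r_sys * 100.0);
    return 0;
}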
System-Level Reliability Modeling (2 of 2)

Therefore the software reliability should also be accounted for in the system-level reliability model.

Software may consist of both the operating system (OS) and configurable (turnkey)
software. It may not be possible to influence the OS design, but turnkey software can be
focused on.
This may consist of re-used software such as library functions and newly developed
software.
If the reliability of the library functions is already understood then library function re-use
simplifies the software reliability engineering process.
What is Software Reliability Engineering
(SRE)?
The quantitative study of the
operational behavior of software-based
systems with respect to user
requirements concerning reliability.
SRE has been adopted either as standard or as best practice by more than
50 organizations in their software projects including AT&T, Lucent, IBM,
NASA and Microsoft, plus many others worldwide.

This presentation will provide an introduction to software reliability engineering…
Why is SRE Important?
There are several key reasons a reliability engineering program should be
implemented:
To determine how satisfactorily products are functioning.
To avoid over-design – over-designed products cost more than necessary and lower profit.
If more features are added to meet customer demand, then reliability should be monitored to ensure that defects are not designed in.
If a customer’s product is not designed well, with reliability and quality in mind, then they
may well turn to a COMPETITOR!

Having a software reliability engineering process can make
organizations more competitive as customers will always
expect reliable software that is better and cheaper.
Why is SRE Beneficial?
For Engineers:
Managing customer demands:
Enables software to be produced that is more reliable, built faster, and delivered cheaper.
Makes engineers more successful in meeting customer demands.
In turn this avoids conflicts – risk, pressure, schedule, functionality, cost etc.

For the organization:
Improves competitiveness.
Reduces development costs.
Provides customers with quantitative reliability metrics.
Places less emphasis on tools and a greater emphasis on
“designing in reliability.”
Products can be developed that are delivered to the customer at the right time, at an
acceptable cost, and with satisfactory reliability.
Common SRE Challenges
Data is collected during test phases, so if problems are discovered it is too late
for fundamental design changes to be made.
Failure data collected during in-house testing may be limited,
and may not represent failures that would be uncovered in the
product’s actual operational environment.
Reliability metrics obtained from such restricted testing data may therefore be inaccurate.
There are many possible models that can be used to predict the
reliability of the software, which can be very confusing.
Even if the correct model is selected there may be no way of validating it due to
having insufficient field data.
Fault Lifecycle Techniques
Prevent faults from being inserted.
Avoids faults being designed into the software when it is being
constructed.

Remove faults that have been inserted.
Detect and eliminate faults that have been inserted through inspection and test.

Design the software so that it is fault tolerant.
Provide redundant services so that the software continues to work even though faults have
occurred or are occurring.

Forecast faults and/or failures.
Evaluate the code and estimate how many faults are present and the occurrences and
consequences of software failures.
Preventing Faults From Being Inserted
Initial approach for reliable software
A fault that is never created does not cost anything to fix. This should be the ultimate
objective of software engineering.

This requires:
A formal requirement specification always being available that has been thoroughly
reviewed and agreed to.
Formal inspection and test methods being implemented and used.
Early interaction with end-users (field trials) and requirement refinement if necessary.
The correct analysis tools and disciplined tool use.
Formal programming principles and environments that are enforced.
Systematic techniques for software reuse.

Formal software engineering processes and tools, if applied successfully, can be very effective in preventing faults (but are no guarantee!). However, software reuse without proper verification can result in disappointment.
Removing Faults
When faults are injected into the software, the next
method that can be used is fault removal.
Approaches:
Software inspection.
Software testing.
Both have become standard industry practices. This presentation will focus closely
on these.
Fault Tolerance
This is a survival attribute – the software has to
continue to work even though a failure has
occurred.
Fault tolerance techniques enable a system to:
Prevent dormant software faults from becoming active (i.e.,
defensive programming to check for input and output
conditions and forbid illegal operations).
Contain software errors within a confined boundary to prevent them from propagating
further (i.e., exception handling routines to treat unsuccessful operations).
Recover software operations from erroneous conditions by using techniques such as
check pointing and rollback.
Fault/Failure Forecasting
If software failures are likely to occur it is critical to estimate the number of
failures and predict when each is likely to occur.
This will help concentrate on failures that have the greatest probability of occurring, provide
reliability improvement opportunities and improve customer satisfaction.

Fault/failure forecasting requires:
Defining a fault/failure relationship – why the failure occurs and its effect.
Establishing a software reliability model.
Developing procedures for measuring software reliability.
Analyzing and evaluating the measurement results.

Measuring software reliability provides:
Useful metrics that can be used to plan further testing and
debug efforts, to calculate warranty costs and plan further
software releases.
A determination of when testing can be terminated.
SRE Process Overview
This slide shows a general SRE process flow that has six major components:
1. Determine the reliability target.
2. Define a software operational profile.
3. Perform code inspection.
4. Perform software testing and collect failure data.
5. Use software reliability model(s) to calculate the current reliability – continue testing and improving until the reliability objective is met, at which point the software release is acceptable from a reliability perspective.
6. Validate field reliability.
SRE Terms
Reliability objective: The product’s reliability goal from the customer’s viewpoint.
Operational profile: A set of system operational scenarios with their associated
probability of occurrence.
This encourages testers to select test cases according to the system’s likely operational
usage.

Reliability modeling: This is an essential element of SRE that determines whether the
product meets its reliability objective.
One or more models can be used to calculate, from failure data collected during system
testing, various estimates of a product’s reliability as a function of test time. It can also provide
the following information:
Product reliability at the end of various test phases.
Amount of additional test time required to reach the product’s reliability objective.
The reliability growth that is still required (ratio of initial to target reliability).
Prediction of field reliability.

Field Reliability Validation: Determination of whether the actual field reliability meets
the customer’s target.
OBJECTIVES
Defining software reliability targets
Software Reliability Objectives
Reliability target(s) should be defined and used to:
Manage customer expectations.
Determine how reliability growth can and will be tracked throughout
the program.
Determine availability targets. Software reliability is commonly expressed as an availability metric rather than as a probabilistic reliability metric. This is defined as:

Availability = Software uptime / (Software uptime + downtime)

A data collection and analysis methodology also has to be defined:
How inspections will be conducted.
How failure data will be collected.
How the data will be analyzed, i.e., what model will be used?
This helps project managers track metrics and plan resources.
Managing the Software Reliability Objective
Defects are often inserted from the beginning of a project. This is usually related to the intensity of the effort, i.e. the number of engineers working on the program, the project schedule, the various design decisions that are made, etc.

Defects are most often detected and addressed at a later date than the original design effort.
Test efforts are relied on to discover most defects, and this lag between defect insertion and correction can have a negative impact on the program.
This can be mitigated by using code inspection, but some testing will still be necessary. Code inspections should be conducted to IEEE 1028. Even with inspection there is still a lag, which can have a negative impact on the program.

The eventual defect rate represents the reliability target, and as defects are discovered and addressed the software reliability is increased, or grown – this is termed “Reliability Growth Management”.
Initial Reliability Growth Model - The
Rayleigh Curve (1 of 3)
The eventual goal should be to forecast the discovery rate of defects as a
function of time throughout the software development program.
This cannot be achieved until data from prior similar projects becomes available. This may
take time but the effort provides value as it enables accurate forecasts to be achieved from
the beginning of the project.
Industry data is also available.
This helps to manage customer
expectations as it demonstrates a strategy
for improving software reliability.
To produce this curve, reliability data from
prior software developments has to be
available. Therefore this is a goal; it is not a technique that can be used immediately.
To get to this stage metrics need to be
collected by using the methods discussed
in this presentation.

 1  2 − 1 2 t 2 


 2 Peak 
f (t ) = K 

 te
 Peak 



The Rayleigh Curve (2 of 3)
The model's cumulative distribution function (CDF) describes the total-to-date effort expended or defects found at each interval – it returns the software reliability at various points in time.

 1 2

−
t 
 2 Peak2 
F (t ) = K 1 − e





The Rayleigh Curve (3 of 3)
Example: A software project has a 12-month delivery schedule, and prior data is available to generate a reliability forecast. The customer wants to know the effect of pulling the delivery in to 9 months. What is the answer?

It reduces the total containment effectiveness (TCE), otherwise expressed as reliability, from 89.6% to 61%.

Tradeoff: this allows expectations to be managed by explaining that to achieve early delivery there will be a tradeoff in reliability, which may require a later release. This type of management helps to avoid possible customer dissatisfaction.
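To make the Rayleigh model concrete, the sketch below evaluates the CDF from the previous slide and compares defect containment at a 9-month versus a 12-month delivery. The values of K (total expected defects) and t_Peak are illustrative assumptions; the 89.6%/61% figures above came from project data that is not reproduced here, so the output will differ:

#include <math.h>
#include <stdio.h>

/* Rayleigh defect-discovery model from the previous slides.
   K = total expected defect count, t_peak = time of the peak discovery
   rate. Both values below are assumptions for illustration only. */
static double rayleigh_cdf(double t, double K, double t_peak)
{
    return K * (1.0 - exp(-(t * t) / (2.0 * t_peak * t_peak)));
}

int main(void)
{
    double K = 500.0, t_peak = 5.0;                  /* assumed values */
    double c12 = rayleigh_cdf(12.0, K, t_peak) / K;  /* containment, 12 months */
    double c9  = rayleigh_cdf(9.0, K, t_peak) / K;   /* containment, 9 months */

    printf("Containment at 12 months: %.1f%%\n", 100.0 * c12);
    printf("Containment at  9 months: %.1f%%\n", 100.0 * c9);
    return 0;
}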
Further Information
Software reliability growth using the Rayleigh Curve is discussed in greater
depth in Appendix A of How Reliable Is Your Product?: 50 Ways to Improve
Product Reliability, by Mike Silverman. The text of Appendix A was provided
by the author of this presentation.

This book is highly recommended for anybody who is interested in improving product reliability; it is available from Amazon or directly from Ops A La Carte.
Software Availability and Failure
Intensity (1 of 2)
As mentioned earlier, instead of a reliability metric being provided,
customers may ask for a certain ‘availability’.
This is the average (over time) probability that a system or a capability of
a system is currently functional in a specified environment.
It depends on:
The probability of software failure
Length of downtime when failure occurs.

It essentially describes the expected fraction of the operating time during
which a software component or system is functioning acceptably.
If the software is not being modified (if further development or further
releases are not planned) then the failure rate will be constant and therefore
the availability will be constant.
Software Availability and Failure Intensity (2 of 2)

From earlier, availability is defined as:

Availability = Software uptime / (Software uptime + downtime)

Downtime can be expressed as:

Downtime = t_m × λ

where t_m = downtime per failure and λ = failure intensity.

For software, the downtime per failure is the time to recover from the failure, not the time required to find and remove the fault.

$$\therefore \ \text{Availability} = \frac{1}{1 + t_m \lambda}$$
If an availability specification for the software is specified, then the downtime per failure will determine a failure intensity objective:

$$\lambda = \frac{1 - \text{Availability}}{\text{Availability} \times t_m}$$

Either an availability or a failure intensity objective has to be defined.
Example
A product must be available 99% of the time.
Required downtime per failure = 6 minutes (0.1 hr).
The downtime per failure can be used to determine the failure intensity objective:

$$\lambda = \frac{1 - A}{A t_m} = \frac{1 - 0.99}{0.99 \times 0.1} \approx 0.1 \text{ failures/hr}$$

or 100 failures/kHr.
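The same calculation expressed as a minimal C sketch, using λ = (1 − A)/(A × t_m) from the previous slide. Note that the exact result is 0.101 failures/hr; the slide rounds this to 0.1 (100 failures/kHr):

#include <stdio.h>

/* Failure intensity objective from an availability target:
   lambda = (1 - A) / (A * tm). */
int main(void)
{
    double A  = 0.99;  /* required availability */
    double tm = 0.1;   /* downtime per failure, hours (6 minutes) */
    double lambda = (1.0 - A) / (A * tm);

    printf("Failure intensity objective = %.3f failures/hr (%.0f failures/kHr)\n",
           lambda, lambda * 1000.0);
    return 0;
}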
Availability, Failure Intensity, Reliability and MTBF
This presentation discusses reliability in terms of availability, probability and MTBF. These are the relationships between the three metrics.
A customer specifies an availability target of 0.99999 and a maximum software downtime per failure of 5 minutes, or 0.083 hours. The failure intensity is determined from:

$$\lambda = \frac{1 - A}{A \times t_m} = \frac{1 - 0.99999}{0.99999 \times 0.083} \approx 1.2 \times 10^{-4} \text{ failures/hr}$$

What is the reliability probability for a period of 2 years (17,520 hours)?

$$R(T) = e^{-\lambda T} = e^{-1.2 \times 10^{-4} \times 17520} \approx 0.12$$

What is the Mean Time Between Failures (MTBF)?

$$MTBF = \frac{1}{\lambda} = \frac{1}{1.2 \times 10^{-4}} \approx 8300 \text{ hours}$$

Note that a very high availability does not by itself guarantee a high probability of surviving a long period without any failure, because availability assumes each failure is recovered from quickly.
THE OPERATIONAL PROFILE
Defining a structured approach to
inspection and test
Defining an Operational Profile
An operational profile is a quantitative characterization of how a system will
be used in the field by customers.
Why is it useful?
It provides information on how users will employ the product.
It enables the most critical operations to be focused on during testing.
This allows the efficiency of the reliability test effort to be improved.
It allows more realistic test cases to be designed.

To do this the individual software operations have to be identified. An operation is a major system logical task that returns control to the system when complete.
Major = a task that is related to a functional requirement or feature rather than a subtask.
The operation can be initiated by a user, another part of the system, or by the system’s own controller.

For more information on operational profiles refer to Software Reliability
Engineering: More Reliable Software Faster and Cheaper – John D. Musa
Developing an Operational Profile (1 of 5)
Five steps are needed to develop an operational profile:
1. Identify operation initiators (users, other sub-systems, external systems, the product’s own controller, etc.).
2. Create an operations list – this is a list of operations that each initiator can execute. If all
initiators can execute every operation then the initiators can be omitted, and instead just
focus on producing a thorough operations list.
Developing an Operational Profile (2 of 5)
A good way to generate an operations list for a menu-driven product is to produce a ‘walk tree’ rather than use an initiators list. An example of a menu-driven system is provided below.
This is based on a medical enteral pump, used for feeding patients.
Developing an Operational Profile (3 of 5)
Step 3. Once the operational profile is complete it should be reviewed to
ensure:
All operations are of short duration in execution time (seconds at most).
Each operation must have substantially different processing from the others.
All operations must be well-formed, i.e., sending messages and displaying data are parts
of the operation and not operations in themselves.
The final list is complete with high probability.
The total number of operations is reasonable, taking the test budget into account. This is
because each operation will be focused on individually using a test case, so if the list is
too long it may result in the project test phase being very lengthy.
Developing an Operational Profile (4 of 5)
Step 4. Determine occurrence rates for each operation – this may need to be
estimated to begin with, but can be revised later.
Occurrence Rate = Number of operation occurrences / Time the total set of operations is running
Developing an Operational Profile (5 of 5)
Step 5. Determine the occurrence probabilities.
Occurrence Probability = Occurrence rate of each operation / Total operation occurrence rate

This table has been rearranged by sorting the operations in order of descending
probabilities. This presents the operational profile in a form that is more convenient to use.
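A minimal C sketch of steps 4 and 5 – converting per-operation occurrence rates into occurrence probabilities. The operation names and rates below are hypothetical, loosely modeled on the enteral pump example:

#include <stdio.h>

/* Occurrence probability = occurrence rate of each operation divided by
   the total occurrence rate. Names and rates are hypothetical. */
int main(void)
{
    const char *ops[] = { "Start feed", "Set rate", "Pause feed", "Clear alarm" };
    double     rate[] = { 120.0, 45.0, 30.0, 5.0 };  /* occurrences per hour */
    int n = 4;
    double total = 0.0;

    for (int i = 0; i < n; i++)
        total += rate[i];
    for (int i = 0; i < n; i++)
        printf("%-12s rate %6.1f/hr  probability %.3f\n",
               ops[i], rate[i], rate[i] / total);
    return 0;
}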
Establish Failure Definitions
What is critical to the customer? How does the customer define a failure?
A failure is any departure of system behavior in execution from the user needs.
A Fault is a defect that causes the failure (i.e., missing code).
A fault may not result in failure…but a failure can only occur if a fault exists.
Faults have to be detected – how can this be done?
Answer – by developing an operational profile. This enables resources to be focused on addressing issues in the operations that have the highest probability of failure, resulting in a lower overall failure intensity.
Failure modes should be defined early in the project – this provides a specification for
what the system should NOT be doing!
Failure severity classes can be defined as shown below. The failures that have the highest
severity should be focused on first.
SOFTWARE FMEA
Software reliability risk management
Software FMEA and Risk Analysis
A software Failure Mode and Effects Analysis (SFMEA) is a systematic method
that:
Recognizes, evaluates, and prioritizes potential failures and their effects.
Identifies and prioritizes actions that could eliminate or reduce the likelihood of potential
failures occurring.
[Diagram: a cause (a material or process input at a process step) produces a failure mode (defect), which in turn produces an effect – a software failure.]

An FMEA aids in anticipating failure modes in order to determine and assess the risk to the customer or product. Risks then have to be reduced to acceptable levels.
Software FMEA and Risk Analysis (1 of 2)
Fault trees provide a graphical and logical framework within which system failure modes can be analyzed. These can then be used to assess the overall impact of software failures on a system, or to prove that certain failure modes cannot occur.
Here is a simple example of how to use a fault tree to perform a Software
FMEA. It is far better to begin an FMEA using a fault tree. Filling in a
spreadsheet immediately can easily result in confusion and is rarely
successful!!
[System block diagram: Sensor → Controller → Actuator]

Potential failure mode – unintended system function. This results in undesirable system behavior, and could include potential controller or sensor failures.
The first step is to produce a fault tree.

Software FMEA and Risk Analysis (2 of 2)
[Fault tree diagram, branches 1 and 2]
CODE INSPECTION
A reliability improvement and risk
management technique
Why Inspect Code?
Formal inspections should be carried out on the:
Requirements.
Design.
Code.

Approximately 18 man-hours plus rework are required per 300–400 lines of code.

Test plans.
“…formal design and code inspections rank as the most effective
methods of defect removal yet discovered…(defect removal) can top
85%, about twice those of any form of testing.”
-Capers Jones
Applied Software Measurement, 3rd Ed.
McGraw Hill 2008

Case study performed by the Data Analysis Center for Software (DACS):
85% Defect Containment: cost = $1,000,000, Duration = 12 months
95% Defect Containment: cost = $750,000, Duration = 10.8 months
Formal “Fagan Style” Inspections
This is a defined process that is quantitatively
managed.
The objective is to do the thing right. There is no discussion of
options, it is either right or wrong, or it requires investigation.
Ideally 4 inspectors participate (it can be 3-5, but not less than
3). Participants have roles – Leader, Reader, Author and Tester.
The review rate target is 150-200 lines of code per hour. What is found depends on
how closely the inspectors look at the code.
This is a 6 step process that is defined in IEEE 1028.
Data is stored in a repository for future reference.
The outcome should be that defects are found and fixed, and that data is collected and
analyzed.
Relationship Between Inspection and
Reliability (1 of 2)
For a four-phase test process the reliability is likely to vary between 74% and
92% (based on industry data).
Note that not all fixes address problems completely. Some fixes may not be totally effective,
while others may also introduce further problems. This is where inspection can be of value.

Adapted from a similar approach in: Capers Jones, Applied Software Measurement, 3rd Ed., McGraw Hill, 2008.
Relationship Between Inspection and
Reliability (2 of 2)
Introducing inspection can increase the reliability to 93–99% (based on industry data).
Inspection alone can enable the software to surpass the reliability that is obtained from a test-only process!
This also increases the scope for reducing the emphasis on testing.

Adapted from: Capers Jones, Applied Software Measurement, 3rd Ed., McGraw Hill, 2008.
SOFTWARE TESTING
Further defect detection and elimination
Static Analysis (1 of 2)
This should be performed after the code is developed.
It is pattern based – it scans the code to check for patterns that are known to
cause defects.
This type of analysis uses coding standard rules and enforces internal coding guidelines.
This is a simple task, easily automated, that reduces future debugging effort.

It is data flow based, in that it statically simulates execution paths, so is able
to automatically detect potential runtime errors such as:
Resource leaks.
NullPointerExceptions.
SQL injections.
Security vulnerabilities.

The benefits of static analysis are:
It can examine more execution paths than conventional testing.
It can be applied early in the software design, providing significant time and cost savings.
Static Analysis (2 of 2)
Examples of warning classes that can be obtained from static analysis are:
Buffer overrun
Buffer underrun
Cast alters value
Ignored return value
Division by zero
Missing return statement
Null pointer dereference
Redundant condition
Shift amount exceeds bit width
Type overrun
Type underrun
Uninitialized variable
Unreachable code
Unused value
Useless assignment
Buffer Overflow Example
Consider the code segment below:
char arr[32];
for (int i = 0; i < 64; i++)
{
    arr[i] = (char)i;  /* out-of-bounds write once i reaches 32 */
}
Here, memory that is beyond the range of the stack-based variable “arr” is being explicitly
addressed. This results in memory being overwritten, which could include the stack frame
information that is required for the function to successfully return to its caller, etc.
This coding pattern is typical of security vulnerabilities that exist in software. The specifics of the vulnerability may change from one instance to another, but the underlying problem remains the same: performing array copy operations that are incorrectly or insufficiently guarded.
Static analysis can assist in detecting such coding patterns.
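For comparison, a corrected version of the loop bounds the index by the real size of the destination buffer – the kind of guard whose absence static analysis flags:

#include <stddef.h>

/* Bounds-guarded fill: the index is limited by the destination size
   passed by the caller, so no out-of-range write can occur.
   Usage: char arr[32]; fill_array(arr, sizeof(arr)); */
void fill_array(char *dst, size_t dst_size)
{
    for (size_t i = 0; i < dst_size; i++)
    {
        dst[i] = (char)i;
    }
}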
Types of Tests
Functional tests
This is single execution of operations with interactions between the various operations
minimized. The focus is on whether the operation executes correctly.

Load tests
These tests attempt to represent field use and the environment as accurately as possible,
with operations executing simultaneously and interacting. Interactions can occur directly,
through the data, or as a result of resource conflicts. This testing should use the operational
profile.

Regression tests
Functional tests that can be conducted after every build involving significant change. The
focus during these tests is to reveal faults that may have been created during the change
process.

Endurance tests
These are similar to load tests in that the testing should represent the field use and environment as accurately as possible. The focus is on how the product is to be used…and may be misused.
RELIABLE SYSTEM DESIGN
A look at fault tolerance, an essential
aspect of system design
Reliable System Design (1 of 7)
To achieve reliable system design software should be designed such that it is
fault tolerant.
Typical responses to system or software faults during operation include a
sequence of stages:
Fault confinement,
Fault detection,
Diagnosis,
Reconfiguration,
Recovery,
Restart,
Repair,
Reintegration.
Reliable System Design (2 of 7)
Fault Confinement.
Limits the spread of fault effects to one area of the system – prevents
contamination of other areas.
Achieved through use of:
- self-checking acceptance tests,
- exception handling routines,
- consistency checking mechanisms,
- multiple requests/confirmations.
Erroneous system behaviors due to software faults are typically undetectable.
Reduction of dependencies can help.
Reliable System Design (3 of 7)
Fault Detection.
This stage recognizes that something unexpected has occurred in the system.
Fault latency – period of time between fault occurrence and detection.
The shorter the fault latency is, the better the system can recover. Two technique
classes are off-line and on-line fault diagnosis:
- Off-line techniques are diagnostic programs.
System cannot perform useful work under test.
- On-line techniques provide real-time detection capability.
System can still perform useful work.
Watchdog monitors and redundancy schemes.
Reliable System Design (4 of 7)
Diagnosis.
This is necessary if the fault detection technique does not provide information about the
failure location and/or properties.
This is often an off-line technique that may require a system reset.
On-line techniques can also be used, i.e., when a diagnosis indicates unhealthy system conditions (such as low available resources), low-priority resources can be released automatically in order to prevent transient failures in time.

Reconfiguration.
This occurs when a fault is detected and a permanent failure is located.
The system may reconfigure its components either to replace the failed component or to
isolate it from the rest of the system (i.e., redundant memory, error checking of memory in
case of partial corruption etc).
Successful reconfiguration requires robust and flexible software architecture and
reconfiguration schemes.
Reliable System Design (5 of 7)
Recovery.
Uses techniques to eliminate the effects of faults.
There are two approaches:
- fault masking,
- retry and rollback.
Fault masking hides effects of failures by allowing redundant, correct information to
outweigh the incorrect information.
Retry makes a second try at an operation as many faults are transient in nature.
Rollback makes use of backed up (checkpointed) operations at some point in its
processing prior to fault detection, and operation recommences from this point.
Fault latency is very important because the rollback must go back far enough to
avoid the effects of undetected errors that occurred before the detected error.
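A minimal sketch of the retry-then-rollback idea in C. The perform_operation, save_checkpoint and restore_checkpoint hooks are hypothetical placeholders for application-specific logic, not a real API:

#include <stdbool.h>

#define MAX_RETRIES 3

/* Save a known-good state, attempt the operation, and roll back and
   retry on failure; many faults are transient, so a retry often succeeds. */
bool execute_with_recovery(bool (*perform_operation)(void),
                           void (*save_checkpoint)(void),
                           void (*restore_checkpoint)(void))
{
    save_checkpoint();  /* checkpoint taken before the fault can corrupt state */
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++)
    {
        if (perform_operation())
            return true;         /* success, possibly after a transient fault */
        restore_checkpoint();    /* roll back to the checkpoint before retrying */
    }
    return false;                /* escalate to reconfiguration or restart */
}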
Reliable System Design (6 of 7)
Restart.
This occurs after the recovery of undamaged information.
There are three approaches:
- hot restart,
- warm restart;
- cold restart.
Hot restart – resumption of all operations from the point of fault detection (this is only
possible if no damage has occurred).
Warm restart – only some of the processes can be resumed without loss.
Cold restart – complete reload of the system is performed with no processes surviving.
Reliable System Design (7 of 7)
Repair.
Replacement of failed component – on or off-line.
Off-line – system brought down to perform repair. System availability depends on how fast
a fault can be located and removed.
On-line – the component is replaced immediately with a backup spare (similar to reconfiguration), or perhaps operation can continue without using the faulty component (i.e., masking redundancy or graceful degradation).
On-line repair prevents system operation interruption.

Reintegration.
Repaired module must be reintegrated into the system.
For on-line repair, reintegration must be performed without interrupting system operation.
Non-redundant systems are fault intolerant and, to achieve reliability, fault avoidance is often the best approach. Redundant systems should use fault detection, masking redundancy (i.e., disabling 1 out of N units), and dynamic redundancy (i.e., temporarily disabling certain operations) to automate one or more stages of fault handling.
RELIABILITY MODELING
Determining what reliability has actually
been achieved
Reliability Modeling (1 of 4)
This is used to calculate what the current reliability is, and if the reliability
target is not yet being achieved, determine how much testing and debug
needs to be completed in order to achieve the reliability target.
The questions that reliability modeling aims to answer are:
How many failures are we likely to experience during a fixed time period?
What is the probability of experiencing a failure in the next time period?
What is the availability of the software system?
Is the system ready for release (from a reliability perspective)?

[Figure: timeline of software failures between T = 0 and the end of test T_E. T_i is the cumulative time to failure of the i-th failure; t_i is the inter-arrival time, t_i = T_i − T_{i−1}.]
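A minimal C sketch of this bookkeeping – deriving the inter-arrival times t_i from the cumulative times-to-failure T_i. The failure times used here are illustrative:

#include <stdio.h>

/* Inter-arrival times from cumulative times-to-failure: t_i = T_i - T_(i-1).
   The cumulative times below are assumed for illustration. */
int main(void)
{
    double T[] = { 152.0, 319.0, 547.0, 806.0, 1380.0 };  /* hours */
    int n = sizeof(T) / sizeof(T[0]);
    double prev = 0.0;

    for (int i = 0; i < n; i++)
    {
        printf("failure %d: T = %7.1f hr, inter-arrival t = %7.1f hr\n",
               i + 1, T[i], T[i] - prev);
        prev = T[i];
    }
    return 0;
}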
Reliability Modeling (2 of 4)
In reliability engineering it is usual to identify a failure distribution,
especially when modeling non-repairable products*. This approach can be
used because it is assumed that hardware faults are statistically
independent and identically distributed.
Where software is concerned, events (failures) are not necessarily
independent due to interactions with other system elements, so in most
cases failures are not identically distributed.
When a failure occurs in a software system the next failure may depend on
the current operational time of the unit, and therefore each failure event in
the system may be DEPENDENT.
* Although it can be argued that a software system can be repaired by fixing the fault, in reliability terms it is still a non-repairable product because it is not wearing out. For instance, a car is a repairable device as parts can be changed when they wear out, but this does not necessarily make it as good as new. If a software fault is repaired it is actually as good as new again, and in fact the improvement may make it better than new.
Reliability Modeling (3 of 4)
Therefore what is needed is to model the Rate of Occurrence of Failures
and the Number of Failures within a given time.
As an example, with reference to the figure below, a model is needed that will report the fact that 8 failures are expected by time T_E and that the rate of occurrence of failures is increasing with time.

Software
Failures

t2

t1
T=0

T1

T2

t3

t4

T3 T4 T5

t6

t5
T6

t7
T7 T8 TE
Reliability Modeling (4 of 4)
If a Distribution Analysis is performed on the Time-Between-Failures, then
this is equivalent to saying that there are 9 different systems, where
System 1 failed after t1 hours of operation, System 2 after t2,…, etc.
[Figure: the inter-arrival times redrawn as 9 separate systems, each starting at T = 0 – System 1 fails after t_1 hours, System 2 after t_2, …, with T_9 a suspension*.]

This is the same as assuming that the system is failure free once the fault is addressed, which may not necessarily be true as further failures may occur.
Example: changing the brake pads on a car does not mean that the car is now failure free!
* A unit that continues to work at the end of the analysis period or is removed from a test in working condition, i.e., it may fail at some point in the future.
An Example of an Incorrect Approach (1 of 4)
This example has been included because it is a common approach to hardware
reliability modeling but it CANNOT be used for modeling software reliability.
This method is normally used to model a non-repairable hardware product.
Unfortunately when used in analyzing software reliability it returns incorrect
results…but it is an easy trap for a reliability engineer to fall into!!!
Both firmware and hardware failure data are collected from three systems.
[Table: failure times for the three systems.]
A total of 6 different firmware and 4 different hardware failure modes are identified.
An Example of an Incorrect Approach (2 of 4)
The conventional reliability engineering approach is to take the Time-Between-Failures for each system and then fit a distribution (e.g., 319 − 152 hours between the first two failures).
Notice that hardware failures have been removed.
The time between the last failure and the current age is a suspension.
An Example of an Incorrect Approach (3 of 4)
A Weibull (life data) analysis is conducted, but with software this is not appropriate!
This analysis assumes a sample of 20 systems, where one system failed after 152 hrs, another after 319 hrs, etc.
An Example of an Incorrect Approach (4 of 4)
This system will be used for a total of 250 hours. What will the software
reliability be?
Distribution analysis is okay for non-repairable products containing only hardware, but not for anything containing software (or for repairable hardware-only products).
In products that contain software, events are dependent, and therefore alternative analysis methods should be used.
However, it is correct to fit a distribution on the First-Time-to-Failure of each system.
The Weibull analysis returns 97.63% – a GREAT RESULT… BUT COMPLETELY WRONG!!!
An Example of a Correct Approach

The correct analysis returns the probability that the unit will NOT fail in the first 250 hours: Reliability = 68.36%. Notice that the confidence interval is very wide.
Three Possible SRE Approaches…
Are multiple systems being tested?
- Yes → use the NHPP model (this is the best option). This is the current state of the art in software reliability modeling, and is suitable for most projects. However, it is not suitable for testing a single unit (i.e., a large expensive system), or where not all faults are going to be fixed in between compiles; a better model is needed for that type of application.
- No → can testing be stopped after each phase to fix failure modes?
  - Yes → use the Crow-Extended model.
  - No → use the 3-Parameter Crow-Extended model.
It is hypothesized by the author that the Crow-Extended models may be more suitable for developments where the NHPP model cannot be well applied. This essentially represents a future state of software reliability testing. However, before being readily accepted they should be validated, i.e., by comparing their predicted reliability with actual field data. Use of these models has been included in this presentation for completeness and possible future application.
A Better SRE Analysis Approach (1 of 4)
A model is needed that takes into account the fact that when a failure occurs the system has a “current age” – in other words, the likelihood of the next failure depends on how long the system has already been operating.
For example, in System 1, the system has
an age of 152 hours after the first firmware
failure mode has been detected.
In other words, all other operations that can
result in a failure also have an age of 152 hours
and the next failure event is based on this fact.
A Better SRE Analysis Approach (2 of 4)
The NHPP (Non-Homogeneous Poisson Process) with a Power Law failure intensity is such a model:

$$\Pr[N(T) = n] = \frac{\left(\lambda T^{\beta}\right)^{n} e^{-\lambda T^{\beta}}}{n!}, \qquad \Lambda(T) = \lambda \beta T^{\beta - 1}$$

Where:
Pr[N(T)=n] is the probability that n failures will be observed by time T.
Λ(T) is the Failure Intensity Function (Rate of Occurrence of Failures).
Just because a model is used for hardware does not mean that it cannot be suitable for software
as well, as models simply describe times-to-failure. Therefore a hardware model can also be
used for software, providing that it is a dependent model (failures are dependent on the
operational time, rather than being independent).
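A minimal C sketch that evaluates these quantities. The λ and β values are illustrative assumptions, not parameters fitted from any data set in this presentation:

#include <math.h>
#include <stdio.h>

/* NHPP with power-law mean: expected failures E[N(T)] = lambda * T^beta,
   intensity Lambda(T) = lambda * beta * T^(beta - 1), and
   Pr[N(T) = n] = (lambda * T^beta)^n * exp(-lambda * T^beta) / n!. */
static double factorial(int n)
{
    double f = 1.0;
    for (int i = 2; i <= n; i++)
        f *= i;
    return f;
}

int main(void)
{
    double lambda = 0.002, beta = 1.2;  /* assumed model parameters */
    double T = 1380.0;                  /* analysis period, hours */
    double mean = lambda * pow(T, beta);
    double rho  = lambda * beta * pow(T, beta - 1.0);

    printf("Expected failures by T: %.2f, intensity: %.4f failures/hr\n", mean, rho);
    for (int n = 8; n <= 16; n++)
        printf("Pr[N(T)=%d] = %.4f\n", n, pow(mean, n) * exp(-mean) / factorial(n));
    return 0;
}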
A Better SRE Analysis Approach (3 of 4)
NHPP model parameters:
Here the failure events of System 1 are analyzed between the period of 0 and 1380 hours.
This folio also contains the failure events for Systems 2 and 3 (not shown).

Of interest is the fact that
Beta >1, which indicates
that the inter-arrival times
between unique failures
are decreasing, so there
may be little opportunity
for reliability improvement.
A Better SRE Analysis Approach (4 of 4)
NHPP model results:
The plot shows the cumulative number of failures vs. time, from which conclusions and further predictions can be obtained. The Weibull plot intersects the X-axis, so out-of-box failures should not be present; had it intersected the Y-axis, this would indicate potential for out-of-box failures.

The cumulative number of failures is
0.1352, or 13.52 failures per 25000
operational hours.
An Example Using the NHPP Model (1 of 8)
Software is under development – the reliability
requirement is to have no more than 1 fault in
every 8 hours of software operation.
Three Test Engineers provide a total of 24 hours
of testing each day.
One new compile is available for testing each week,
when fixes are implemented.
The failure rate goal is:

FR = 1/8 = 0.125 failures per hour

With 24 hours of testing per day, the failure intensity goal is:

FI = 0.125 × 24 = 3 faults per day
An Example Using the NHPP Model (2 of 8)
Failure data is obtained:

The data is grouped by the number of days until a new compile is available,
i.e., the first 45 failures are contained in one group and are fixed in compile
#1.

NHPP model
parameters
An Example Using the NHPP Model (3 of 8)
The instantaneous failure intensity after 28
days of testing is 4.4947 faults/day.

If testing is continued with
the same growth rate,
when will the goal of no
more than 3 faults/day be
achieved?

The answer is after
an additional 14928=121 days of
testing and
development
(test-analyze-fix)
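The remaining-test-time question can be answered directly from the power-law intensity ρ(t) = λβt^(β−1) by solving ρ(t) = goal for t. In the sketch below λ and β are assumed values, chosen so that the output lands near the slide’s figures (≈4.5 faults/day at day 28, goal reached near day 149); the actual fitted parameters are not reproduced on the slide:

#include <math.h>
#include <stdio.h>

/* Solve lambda * beta * t^(beta - 1) = goal for t, valid while beta < 1
   (improving reliability): t = (goal / (lambda * beta))^(1 / (beta - 1)). */
int main(void)
{
    double lambda = 13.27, beta = 0.758;  /* assumed fitted parameters, per day */
    double goal = 3.0;                    /* target intensity, faults per day */
    double rho28  = lambda * beta * pow(28.0, beta - 1.0);
    double t_goal = pow(goal / (lambda * beta), 1.0 / (beta - 1.0));

    printf("Intensity at day 28: %.2f faults/day\n", rho28);
    printf("Goal of %.0f faults/day reached near day %.0f\n", goal, t_goal);
    return 0;
}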
An Example Using the NHPP Model (4 of 8)
An extra 121 days is longer than anticipated. Let’s take a closer look by
generating a Failure Intensity vs. Time plot…

Each of these lines indicates the
failure intensity over a given
interval (which in this case is 5
days).
It can be seen that there was a
jump in the failure intensity between
20 and 23 days.
This is why it is estimated that more
development time is required.

The next step is to analyze the data set for the period up to 20 days of testing,
before the failure intensity increased…
An Example Using the NHPP Model (5 of 8)
The NHPP model data is limited to the first 20 days of testing and
another Failure Intensity vs. Time plot is generated, but this time for the
first 20 days:

This plot shows the decrease in the
failure intensity rate over the first 20
days of testing.
This confirms that the failure intensity
continuously reduced during the first
20 days.
An Example Using the NHPP Model (6 of 8)
Based on the first 20 days of data the additional test and development duration can be recalculated, which results in there being an additional 55 − 28 = 27 days to achieve the goal of having no more than 3 faults/day, rather than 121!
This generates questions:
Why is there such a big difference in the test duration
still required?
What happened when the failure intensity jumped on
the 23rd day of testing and development?

Answer – New functionality was added. The jump in required test time is
typical when new features are introduced, and applies to software and
hardware alike.
Because new functionality has been added it would be wise to reset the clock and track the
reliability growth from the 20th day forward…
An Example Using the NHPP Model (7 of 8)
Now the NHPP model parameters need to be obtained and plotted for the
last 8 days of testing (8 days is an arbitrary number; enough data needs to
be available to have confidence in any conclusions that are drawn).

This provides better resolution. By taking a “macro” view it can be seen that the failure intensity is starting to increase, so the minimum failure intensity point has been determined. For improved accuracy, calculations should be based on this.
An Example Using the NHPP Model (8 of 8)
Based on this data set, 51 − 8 = 43 more days of developmental testing are required.
It may be too early to make any
predictions based on only 8 days of
testing, but the result can be used to
obtain a general idea of the remaining
development time required and produce
a test plan.
To pull in the schedule 3 more Test
Engineers could be added and the code
recompiled every 2 days, which will
complete the project within 1 month.
There are also situations where some issues are fixed immediately, others are addressed
later and more minor issues may not be addressed at all. In this type of situation the
Crow Extended Model can be useful…
Crow-Extended Model Introduction (1 of 2)
This is not a common SRE model but does have the benefit of supporting
decision making by providing metrics such as
Failure intensity vs. time.
Demonstrated Mean Time Between Failures (MTBF*).
MTBF growth that can be achieved through implementation of corrective actions.
Maximum potential MTBF that can be achieved through implementation of
corrective actions.
Maximum potential MTBF that can likely be achieved for the software and estimates
regarding latent failure modes that have not yet been uncovered through testing.

This model utilizes A, BC and BD failure mode classifications to analyze
growth data.
A = a failure mode that will not be fixed.
BC = a failure mode that will be fixed while the test is in progress.
BD = a failure mode that will be corrected at the end of the test.
* This model uses MTBF rather than failure intensity or reliability metrics. A conversion between these various
metrics is provided in slide 28.
Crow-Extended Model Introduction (2 of 2)
There is no reliability growth for A modes.
The effectiveness of the corrective actions for BC modes is assumed to be
demonstrated during the test.
BD modes require a factor to be assigned that estimates the effectiveness of
the correction that will be implemented after the test.
Analysis using the Crow Extended model allows different management
strategies to be considered by reviewing whether the reliability goal will be
achieved.
There is one constraint to this approach – the testing must be stopped at the
end of the test phase and all BD modes must be fixed. The Crow Extended
model will return misleading conclusions if it is used across multiple test
phases. For those situations use the 3-Parameter Crow-Extended model
(discussed next).
DO NOT APPLY THIS MODEL TO A MULTIPLE SYSTEM TEST, USE THE NHPP
MODEL INSTEAD.
Crow-Extended Model Example (1 of 8)
A product underwent development testing, during which failure modes were
observed. Some modes were corrected during the test (BC modes), some
modes were corrected after the end of the test (delayed fixes, BD modes) and
some modes were left in the system (A modes). The test was terminated after
400 hours; the times-to-failure are provided below:
Crow-Extended Model Example (2 of 8)
An effectiveness factor has been assigned for each BD failure mode
(delayed fixes). The effectiveness factor is based on engineering
assessment and represents the fractional decrease in failure intensity of a
failure mode after the implementation of a corrective action.
The effectiveness factors for the BD failure modes are provided below:

This is a metric that enables an assessment to be made of whether or not the corrective actions
have been effective, and if they have, how effective they were. This is often a subjective judgment.
Crow-Extended Model Example (3 of 8)
The times-to-failure data and effectiveness factors are entered:
Note that this data sheet only
displays 29 rows of data, but all
data is entered even though it
has not been shown.

Effectiveness factor is expressed as 0-1 (0-100% of the failure intensity being removed by the
corrective action).
Crow-Extended Model Example (4 of 8)
Model parameter calculation:
Here the failure events are analyzed between the period of 0 and 400 hours.
Crow-Extended Model Example (5 of 8)
Growth potential MTBF plot:
- Growth potential MTBF: the maximum achievable MTBF based on the current strategy.
- Projected MTBF: the estimated MTBF after delayed corrective actions have been implemented.
- Demonstrated MTBF: the MTBF at the end of test without delayed corrective actions.
- Instantaneous MTBF: the demonstrated MTBF as a function of time.
The demonstrated MTBF (the result of fixing BC modes during the test) is about 7.76 hours.
The projected MTBF (the result of fixing BD modes after the test) is about 11.13 hours.
The growth potential MTBF (if testing continues with the current strategy, i.e., modes corrected vs. modes not corrected, and with the current effectiveness of each corrective action) is estimated to be about 14.7 hours. This is the maximum attainable MTBF.
Crow-Extended Model Example (6 of 8)
An Average Failure Mode Strategy plot is a pie chart that breaks down the
average failure intensity of the software into the following categories:

A modes – 9.546%.
BC modes addressed – 14.211%.
BC modes still undetected – 30.655%.
BD modes removed – 8.846%.
BD modes still to be removed (because corrective actions were <100% effective) – 3.355%.
BD modes still undetected – 33.386%.
Crow-Extended Model Example (7 of 8)
Individual Mode MTBF plot, which shows the MTBF of each individual failure
mode. This enables the failure modes with the lowest MTBF to be identified.
These are the failure modes
that cause the majority of
software failures, and should
be addressed as the highest
priority when reliability
improvement activities are to
be implemented.

Blue = Failure mode MTBF
before corrective action.
Green = Failure mode MTBF
after corrective action.
Crow-Extended Model Example (8 of 8)
Failure Intensity vs. Time plot:

This can be analyzed in
exactly the same way as in
the NHPP example.
3-Parameter Crow-Extended Model
Introduction (1 of 2)
This is not a common SRE model either, but it has the same benefits as the single-parameter Crow-Extended model and, in addition, multiple test phases can be taken into account.
This model is ideal in situations where software is to be tested over multiple phases
but where all bug fixes cannot be introduced as faults are discovered, i.e., all bugs
will be addressed on an ad-hoc basis over an extended time period.
The model provides the flexibility of not having to specify when the test will end, so
it can be continuously updated with new test data. Therefore this model is optimized
for continuous evaluation rather than fixed test periods.
It can only be applied to an individual system, so it lends itself ideally to situations
where an individual complex system is being tested. DO NOT APPLY ANY CROW
MODEL TO A MULTIPLE SYSTEM TEST, USE THE NHPP MODEL INSTEAD.
3-Parameter Crow-Extended Model
Introduction (2 of 2)
This model uses several event codes:
F – failure time.
I – the time at which a certain BD failure mode has been corrected. BD modes that have not received a corrective action by time T will not have an associated I event in the data set.
Q – a failure that was due to a quality issue, such as a build problem rather than a design problem. The reliability engineer can decide whether or not to include quality issues in the analysis.
P – a failure that was due to a performance issue, such as an incorrect component being installed in a device where the embedded code is being tested. The reliability engineer can decide whether or not to include performance issues in the analysis.
AP – an analysis point, used to track overall project progress, which can be compared to planned growth phases.
PH – the end of a test phase. Test phases can be used to track overall project progress, which can be compared to planned growth phases.
X – a data point that is to be excluded from the analysis.
3-Parameter Crow-Extended Model
Example (1 of 11)
Software is under development. Testing is to be conducted in 3 phases.
Phase 1 – 6 weeks of manual testing that is run 45 hours per week, total 270 hours.
Phase 2 – 4 weeks of automated testing that is run 24/7, total 672 hours.
Phase 3 – 8 weeks of field manual testing that is run 40 hours per week, total 320 hours.

One hour of continuous testing equates to 7 hours of customer usage, so
the testing includes a usage acceleration factor of 7.
The average fix delays for the three phases are 90 hours, 90 hours and 180 hours respectively (fix delay = the time between discovering a failure mode and the corrective action being incorporated into the design).
Taking usage acceleration into account, the cumulative test times for the three phases are 1890 hours, 6594 hours and 8834 hours respectively.
3-Parameter Crow-Extended Model
Example (2 of 11)
Customer reliability target = 2 failures per year.
Usage duty cycle = 0.1428.
Therefore, for continuous usage, the reliability target is 2 failures every 8760 × 0.1428 ≈ 1251 hours.

Failure intensity: λ = 2 / 1251 = 0.0016 failures/hr

Equivalent test time = 8834 hrs.

Required MTBF = 1/λ = 1/0.0016 = 625 hrs
3-Parameter Crow-Extended Model
Example (3 of 11)
The growth potential design margin = 1.3.
This is the amount by which the MTBF target should exceed the requirement, providing margin. It is an initial estimate based on prior experience; the higher the growth potential margin, the lower the risk to the program.

Average effectiveness factor = 0.5 (1.0 = a perfect fix, 0 = inadequate fix)
This is also an initial estimate based on prior experience.

Management strategy – address at least 90% of all unique failure modes
prior to formal release.
Beta parameter = 0.7
This is the rate for discovering new, distinct failure modes found during testing.
Again, this is an estimate based on prior experience.
A discovery beta of less than 1 indicates that the inter-arrival times between unique B
modes are increasing. This is desirable, as it is assumed that most failures will be
identified early on, and their inter-arrival times will become larger as the test progresses.
This is an initial estimate; the actual discovery beta can be obtained from the final results,
allowing this parameter estimation to be refined with testing experience.
3-Parameter Crow-Extended Model
Example (4 of 11)
Based on the assumptions on the previous slide, an overall growth planning
model can be created that shows the nominal and actual idealized growth
curve and the planned growth MTBF for each phase.
A growth planning folio is created and 1890, 6594 and 8834 are entered for
the Cumulative Phase Times and 630, 630 and 1260 for the Average phase
delays.

Note that inter-phase average fix delays have
been multiplied by 7 to take the usage
acceleration factor into account.
3-Parameter Crow-Extended Model
Example (5 of 11)
The project parameters are input into the Planning Calculations window.
Given the MTBF target and design
margin that has been specified,
along with other required inputs to
describe the planned reliability
growth management strategy, the
final MTBF that can be achieved is
calculated, along with other useful
results. Here it is verified that 625
hours is achievable (if it was not
achievable a figure of less than 625
hours would be calculated).
3-Parameter Crow-Extended Model
Example (6 of 11)
Effectiveness Factors for all BD modes are specified, together with when
they are to be implemented.
A growth planning plot can then be obtained:

[Growth planning plot: MTBF vs. time for the three planned test phases, marking the MTBF at the end of phases 1, 2 and 3.]
This plot displays the MTBF vs. time values for the three phases that have been planned for the test.
3-Parameter Crow-Extended Model
Example (7 of 11)
Test failure data is collected during the three phases:

[Table: failure data for the three phases. The actual discovery beta is calculated from the data; the original estimate was 0.7.]
3-Parameter Crow-Extended Model
Example (8 of 11)
The growth potential MTBF plot can now be obtained:
- Growth potential MTBF: the maximum achievable MTBF based on the current strategy.
- Demonstrated MTBF: the MTBF at the end of test without delayed corrective actions.
- Projected MTBF: the estimated MTBF after delayed corrective actions have been implemented.
- Instantaneous MTBF.
If the MTBF goal is higher than the Growth Potential line then the current design
cannot achieve the desired goal and a redesign or change of goals may be
required. For this example, the goal MTBF of 650 hours is well within the growth
potential and is expected to be achieved after the implementation of the delayed
BD fixes.
3-Parameter Crow-Extended Model
Example (9 of 11)
Average Failure Mode Strategy plot, breaking down the average failure
intensity of the software into categories:

A modes – 13.432%.
BC modes addressed – 19.281%.
BC modes still undetected – 13.76%.
BD modes removed – 25.893%.
BD modes remaining (because corrective actions were <100% effective) – 5.813%.
BD modes still undetected – 21.882%.
3-Parameter Crow-Extended Model
Example (10 of 11)
Individual Mode MTBF plot showing the MTBF of each individual failure mode,
thus enabling the failure modes with the lowest MTBF to be identified.

Blue = Failure mode MTBF before
corrective action.
Green = Failure mode MTBF after
corrective action.
3-Parameter Crow-Extended Model
Example (11 of 11)
The RGA Quick calculation pad indicates
that the discovery rate of new unseen BD
modes at 630 hours is 0.0006 per hour.
The Beta bounds are less than 1, indicating
that there is still growth in the system
(think of this as the leading edge slope of
the bathtub curve; when beta=1 there is no
more growth potential)
RELIABILITY DEMONSTRATION
Demonstration that a minimum software
reliability has been achieved
Reliability Demonstration Testing (1 of 2)
There can be occasions when the actual software reliability has to be measured through practical demonstration rather than growth testing. This is most applicable where all known faults have been removed and the software is considered to be stable.
If the reliability has already been established by conducting a reliability growth program then there may be little value in conducting this test; it is actually more suitable for situations where a reliability growth program has not been conducted.
This can be achieved through sequential sampling theory.
Reliability Demonstration Testing (2 of 2)
A project-specific chart depends on:
Discrimination Ratio – the error in the failure intensity estimation that is considered to
be acceptable.
Consumer Risk Level – the probability of falsely claiming the failure intensity objective has
been met when it has not.
Supplier Risk Level – the probability of falsely claiming the failure intensity objective
has not been met when it has.
Common values are:
Discrimination Ratio: 2.
Consumer Risk Level: 0.1 (10%).
Supplier Risk Level: 0.1 (10%).
Example
Requirement: 4 failures/million operations.

Failure number (n)   Failure time (million operations)   Normalized units (time × requirement)
1                    0.4                                 1.6
2                    0.625                               2.5
3                    1.2                                 4.8

The normalized measure is obtained by multiplying each failure time by the
requirement target. The software can be accepted after failure 3, with 90%
confidence that it is within the reliability target and a 10% risk that it
is not. Note that the boundary has to be crossed for acceptance.
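The acceptance decision can be reproduced numerically. This is a minimal sketch that applies the sequential-sampling boundary formulae given on the chart-design slides that follow; all numeric values (the objective of 4 failures/million operations, discrimination ratio 2, 10% consumer and supplier risks, and the three failure times) come from the example above.

```python
import math

alpha, beta, gamma = 0.1, 0.1, 2.0   # supplier risk, consumer risk, discrimination ratio
A = math.log(beta / (1 - alpha))     # accept-line constant, ~= -2.197
B = math.log((1 - beta) / alpha)     # reject-line constant, ~= +2.197

def accept_boundary(n):
    # T_N = (A - n*ln(gamma)) / (1 - gamma)
    return (A - n * math.log(gamma)) / (1 - gamma)

def reject_boundary(n):
    return (B - n * math.log(gamma)) / (1 - gamma)

fio = 4.0                            # objective: failures per million operations
failure_times = [0.4, 0.625, 1.2]    # millions of operations

for n, t in enumerate(failure_times, start=1):
    T_n = t * fio                    # normalized units: 1.6, 2.5, 4.8
    if T_n >= accept_boundary(n):
        print(f"failure {n}: T = {T_n:.2f} -> ACCEPT")
    elif T_n <= reject_boundary(n):
        print(f"failure {n}: T = {T_n:.2f} -> REJECT")
    else:
        print(f"failure {n}: T = {T_n:.2f} -> CONTINUE")
```

Running this prints CONTINUE for failures 1 and 2 and ACCEPT for failure 3, matching the chart-based decision.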
Reliability Demonstration Test Chart
Design (1 of 2)
What if the software is still in the Continue region at the end of the test?
Assume that the end of test is reached just after failure 2.
Option 1 – Calculate the Failure Intensity Objective that can be claimed:

Factor = F_CURRENT / F_PREVIOUS = 3.6 / 2.5 = 1.44
∴ FIO = 1.44 × 4 = 5.76 failures/million operations

Option 2 – Extend the test time by ≥ this factor.

Grouped data CANNOT be used; the failure data has to be obtained from individual units.
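A quick numeric check of Option 1 (all values are taken from the slide above):

```python
# Option 1: the failure intensity objective that can be claimed if the
# test ends in the Continue region just after failure 2 (slide values).
f_current, f_previous, fio = 3.6, 2.5, 4.0
factor = f_current / f_previous           # 1.44
claimable_fio = factor * fio              # 5.76 failures/million operations
print(factor, claimable_fio)
```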
Reliability Demonstration Test Chart
Design (2 of 2)
The following formulae are used to design RDT charts:

T_N = (A − n·ln γ) / (1 − γ)   Accept–Continue boundary

T_N = (B − n·ln γ) / (1 − γ)   Reject–Continue boundary

Where:
T_N: Normalized measure of when failures occur (horizontal coordinate).
n: Failure number.
γ: Discrimination ratio (ratio of the maximum acceptable failure intensity to the failure intensity objective).

A and B are defined from:

A = ln(β / (1 − α))
B = ln((1 − β) / α)

Where:
α: Supplier risk (probability of falsely claiming the objective is not met when it is).
β: Consumer risk (probability of falsely claiming the objective is met when it is not).
Reliability Demonstration Test Chart
Design Example
The boundary intersections with the x and y axes can be calculated using the
following formulae. Each boundary passes through the points
((A − n·ln γ) / (1 − γ), n) and ((B − n·ln γ) / (1 − γ), n) respectively, so:

The Accept–Continue boundary meets the x-axis at (A / (1 − γ), 0).
The Reject–Continue boundary meets the y-axis at (0, B / ln γ).

In this example the chart is constructed up to n = 16.
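For completeness, here is a small sketch of how the chart boundaries and axis intersections might be generated from these formulae (α, β and γ repeat the common values used earlier; n = 16 is the example's chart limit):

```python
import math

alpha, beta, gamma = 0.1, 0.1, 2.0
A = math.log(beta / (1 - alpha))      # ~= -2.197
B = math.log((1 - beta) / alpha)      # ~= +2.197

n_max = 16
ln_g, one_minus_g = math.log(gamma), 1 - gamma

# Boundary points ((A - n*ln(gamma)) / (1 - gamma), n), and the reject analogue.
accept_line = [((A - n * ln_g) / one_minus_g, n) for n in range(n_max + 1)]
reject_line = [((B - n * ln_g) / one_minus_g, n) for n in range(n_max + 1)]

print("first accept-line points:", [(round(t, 2), n) for t, n in accept_line[:3]])
print("accept line meets x-axis at:", (round(A / one_minus_g, 2), 0))   # (2.2, 0)
print("reject line meets y-axis at:", (0, round(B / ln_g, 2)))          # (0, 3.17)
```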
SRE Review
Enables defect discovery rates to be forecast and monitored – this helps all staff and enables customer expectations to be managed.
Enables reliability targets to be established and monitored.
Software FMEA enables failure modes and risks to be identified.
Establishes formal and thorough test and analysis methodologies.
Provides a method for modeling and demonstrating software reliability.
Defines code inspection processes.
Helps to ensure customer satisfaction!
References
Adamantios Mettas, “Repairable Systems: Data Analysis and Modeling,” Applied Reliability Symposium, 2008.
Michael R. Lyu, “Software Reliability Engineering: A Roadmap.”
Dr. Larry Crow, “An Extended Reliability Growth Model for Managing and Assessing Corrective Actions,” Reliability and Maintainability Symposium, 2004.
John D. Musa, “Software Reliability Engineering: More Reliable Software Faster and Cheaper,” AuthorHouse, 2004.
ReliaSoft RGA 7 Training Guide.
Capers Jones, “Applied Software Measurement,” 3rd Edition, McGraw-Hill, 2008.
More Related Content

What's hot

Unit iii(part c - user interface design)
Unit   iii(part c - user interface design)Unit   iii(part c - user interface design)
Unit iii(part c - user interface design)BALAJI A
 
Software engineering note
Software engineering noteSoftware engineering note
Software engineering noteNeelamani Samal
 
X-Zone - Garantia da Qualidade de Software
X-Zone - Garantia da Qualidade de SoftwareX-Zone - Garantia da Qualidade de Software
X-Zone - Garantia da Qualidade de SoftwareAlexandreBartie
 
Software Quality Attributes
Software Quality AttributesSoftware Quality Attributes
Software Quality AttributesHayim Makabee
 
Software Testing Process, Testing Automation and Software Testing Trends
Software Testing Process, Testing Automation and Software Testing TrendsSoftware Testing Process, Testing Automation and Software Testing Trends
Software Testing Process, Testing Automation and Software Testing TrendsKMS Technology
 
Security metrics
Security metrics Security metrics
Security metrics PRAYAGRAJ11
 
Risk management(software engineering)
Risk management(software engineering)Risk management(software engineering)
Risk management(software engineering)Priya Tomar
 
Basics of Software Testing
Basics of Software TestingBasics of Software Testing
Basics of Software TestingShakal Shukla
 
Engineering Software Products: 1. software products
Engineering Software Products: 1. software productsEngineering Software Products: 1. software products
Engineering Software Products: 1. software productssoftware-engineering-book
 
Engineering Software Products: 2. agile software engineering
Engineering Software Products: 2. agile software engineeringEngineering Software Products: 2. agile software engineering
Engineering Software Products: 2. agile software engineeringsoftware-engineering-book
 
Risk-based Testing
Risk-based TestingRisk-based Testing
Risk-based TestingJohan Hoberg
 

What's hot (20)

Unit iii(part c - user interface design)
Unit   iii(part c - user interface design)Unit   iii(part c - user interface design)
Unit iii(part c - user interface design)
 
Software engineering note
Software engineering noteSoftware engineering note
Software engineering note
 
X-Zone - Garantia da Qualidade de Software
X-Zone - Garantia da Qualidade de SoftwareX-Zone - Garantia da Qualidade de Software
X-Zone - Garantia da Qualidade de Software
 
Risk Management by Roger S. Pressman
Risk Management by Roger S. PressmanRisk Management by Roger S. Pressman
Risk Management by Roger S. Pressman
 
Ch1 introduction
Ch1 introductionCh1 introduction
Ch1 introduction
 
Software Quality Attributes
Software Quality AttributesSoftware Quality Attributes
Software Quality Attributes
 
Slides chapter 2
Slides chapter 2Slides chapter 2
Slides chapter 2
 
Software Testing Process, Testing Automation and Software Testing Trends
Software Testing Process, Testing Automation and Software Testing TrendsSoftware Testing Process, Testing Automation and Software Testing Trends
Software Testing Process, Testing Automation and Software Testing Trends
 
Risk Management
Risk ManagementRisk Management
Risk Management
 
Security metrics
Security metrics Security metrics
Security metrics
 
CMMI
CMMICMMI
CMMI
 
Risk management(software engineering)
Risk management(software engineering)Risk management(software engineering)
Risk management(software engineering)
 
Basics of Software Testing
Basics of Software TestingBasics of Software Testing
Basics of Software Testing
 
Engineering Software Products: 1. software products
Engineering Software Products: 1. software productsEngineering Software Products: 1. software products
Engineering Software Products: 1. software products
 
Engineering Software Products: 2. agile software engineering
Engineering Software Products: 2. agile software engineeringEngineering Software Products: 2. agile software engineering
Engineering Software Products: 2. agile software engineering
 
Vulnerability and Patch Management
Vulnerability and Patch ManagementVulnerability and Patch Management
Vulnerability and Patch Management
 
Risk-based Testing
Risk-based TestingRisk-based Testing
Risk-based Testing
 
Checkpoints of the Process
Checkpoints of the ProcessCheckpoints of the Process
Checkpoints of the Process
 
Unit1
Unit1Unit1
Unit1
 
Software quality management standards
Software quality management standardsSoftware quality management standards
Software quality management standards
 

Similar to Software reliability engineering

A Survey of Software Reliability factor
A Survey of Software Reliability factorA Survey of Software Reliability factor
A Survey of Software Reliability factorIOSR Journals
 
Top 7 reasons why software testing is crucial in SDLC
Top 7 reasons why software testing is crucial in SDLCTop 7 reasons why software testing is crucial in SDLC
Top 7 reasons why software testing is crucial in SDLCSLAJobs Chennai
 
A Combined Approach of Software Metrics and Software Fault Analysis to Estima...
A Combined Approach of Software Metrics and Software Fault Analysis to Estima...A Combined Approach of Software Metrics and Software Fault Analysis to Estima...
A Combined Approach of Software Metrics and Software Fault Analysis to Estima...IOSR Journals
 
Software Testing Interview Questions For Experienced
Software Testing Interview Questions For ExperiencedSoftware Testing Interview Questions For Experienced
Software Testing Interview Questions For Experiencedzynofustechnology
 
Why Software Testing is Crucial in Software Development_.pdf
Why Software Testing is Crucial in Software Development_.pdfWhy Software Testing is Crucial in Software Development_.pdf
Why Software Testing is Crucial in Software Development_.pdfXDuce Corporation
 
Software Engineering
Software EngineeringSoftware Engineering
Software EngineeringMohamed Essam
 
16103271 software-testing-ppt
16103271 software-testing-ppt16103271 software-testing-ppt
16103271 software-testing-pptatish90
 
CHAPTER 15Security Quality Assurance TestingIn this chapter yo
CHAPTER 15Security Quality Assurance TestingIn this chapter yoCHAPTER 15Security Quality Assurance TestingIn this chapter yo
CHAPTER 15Security Quality Assurance TestingIn this chapter yoJinElias52
 
Importance of software quality metrics
Importance of software quality metricsImportance of software quality metrics
Importance of software quality metricsPiyush Sohaney
 
Welingkar_final project_ppt_IMPORTANCE & NEED FOR TESTING
Welingkar_final project_ppt_IMPORTANCE & NEED FOR TESTINGWelingkar_final project_ppt_IMPORTANCE & NEED FOR TESTING
Welingkar_final project_ppt_IMPORTANCE & NEED FOR TESTINGSachin Pathania
 
Elementary Probability theory Chapter 2.pptx
Elementary Probability theory Chapter 2.pptxElementary Probability theory Chapter 2.pptx
Elementary Probability theory Chapter 2.pptxethiouniverse
 

Similar to Software reliability engineering (20)

A Survey of Software Reliability factor
A Survey of Software Reliability factorA Survey of Software Reliability factor
A Survey of Software Reliability factor
 
Top 7 reasons why software testing is crucial in SDLC
Top 7 reasons why software testing is crucial in SDLCTop 7 reasons why software testing is crucial in SDLC
Top 7 reasons why software testing is crucial in SDLC
 
Qa analyst training
Qa analyst training Qa analyst training
Qa analyst training
 
Introduction to SDET
Introduction to SDETIntroduction to SDET
Introduction to SDET
 
A Combined Approach of Software Metrics and Software Fault Analysis to Estima...
A Combined Approach of Software Metrics and Software Fault Analysis to Estima...A Combined Approach of Software Metrics and Software Fault Analysis to Estima...
A Combined Approach of Software Metrics and Software Fault Analysis to Estima...
 
Softwaretesting
SoftwaretestingSoftwaretesting
Softwaretesting
 
Software Testing Interview Questions For Experienced
Software Testing Interview Questions For ExperiencedSoftware Testing Interview Questions For Experienced
Software Testing Interview Questions For Experienced
 
Software Engineering by Pankaj Jalote
Software Engineering by Pankaj JaloteSoftware Engineering by Pankaj Jalote
Software Engineering by Pankaj Jalote
 
Why Software Testing is Crucial in Software Development_.pdf
Why Software Testing is Crucial in Software Development_.pdfWhy Software Testing is Crucial in Software Development_.pdf
Why Software Testing is Crucial in Software Development_.pdf
 
Software Engineering
Software EngineeringSoftware Engineering
Software Engineering
 
Software quality assurance
Software quality assuranceSoftware quality assurance
Software quality assurance
 
16103271 software-testing-ppt
16103271 software-testing-ppt16103271 software-testing-ppt
16103271 software-testing-ppt
 
Quality Assurance and Testing services
Quality Assurance and Testing servicesQuality Assurance and Testing services
Quality Assurance and Testing services
 
CHAPTER 15Security Quality Assurance TestingIn this chapter yo
CHAPTER 15Security Quality Assurance TestingIn this chapter yoCHAPTER 15Security Quality Assurance TestingIn this chapter yo
CHAPTER 15Security Quality Assurance TestingIn this chapter yo
 
Importance of software quality metrics
Importance of software quality metricsImportance of software quality metrics
Importance of software quality metrics
 
M017548895
M017548895M017548895
M017548895
 
Software testing ppt
Software testing pptSoftware testing ppt
Software testing ppt
 
Slides chapters 26-27
Slides chapters 26-27Slides chapters 26-27
Slides chapters 26-27
 
Welingkar_final project_ppt_IMPORTANCE & NEED FOR TESTING
Welingkar_final project_ppt_IMPORTANCE & NEED FOR TESTINGWelingkar_final project_ppt_IMPORTANCE & NEED FOR TESTING
Welingkar_final project_ppt_IMPORTANCE & NEED FOR TESTING
 
Elementary Probability theory Chapter 2.pptx
Elementary Probability theory Chapter 2.pptxElementary Probability theory Chapter 2.pptx
Elementary Probability theory Chapter 2.pptx
 

Recently uploaded

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 

Recently uploaded (20)

Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 

Software reliability engineering

  • 2. Topics Covered in this Presentation What software reliability engineering is and why it is needed. Defining software reliability targets. Operational profiles. Reliability risk management. Code inspection. Software testing. Reliable system design. Reliability modeling. Reliability demonstration.
  • 3. INTRODUCTION What Software Reliability Engineering is and why it is needed
  • 4. Different Views of Reliability Product development teams View reliability at the sub-domain level, addressing mechanical, electronic and software issues. Customers View reliability at the system level, with minimal consideration placed on sub-domain distinction. The primary measure of reliability is defined by the customer. To develop a reliable product engineering teams must consider both views (system and sub-domain). System Mechanical Reliability + Electronic Reliability + Software Reliability Although this presentation focuses on software reliability engineering, it should be viewed as a component part of an overall Design for Reliability process, not as a disparate activity as hardwaresoftware interactions may be missed. This presentation does not make any distinction between software and firmware, but the same techniques apply equally to both.
  • 5. System-Level Reliability Modeling (1 of 2) A system is made up of components/sub-systems; each has its own inherent reliability. Software R=0.99 Computer Server R=0.9665 A “traditional” reliability program may include modeling, evaluation and testing to prove that the hardware meets the reliability target, but software should not be forgotten as it is a system component. Individually the hardware and software may meet the reliability target…but they also have to when they are combined. System probability of failure = H/W Failure Probability x S/W Failure Probability I.e., H/W = 0.9665, S/W = 0.99, System = 0.9665 x 0.99 = 0.9568 System Reliability = 95.68%
  • 6. System-Level Reliability Modeling (2 of 2) Therefore the software reliability should also be accounted for in the system-level reliability model. Software may consist of both the operating system (OS) and configurable (turnkey) software. It may not be possible to influence the OS design, but turnkey software can be focused on. This may consist of re-used software such as library functions and newly developed software. If the reliability of the library functions is already understood then library function re-use simplifies the software reliability engineering process.
  • 7. What is Software Reliability Engineering (SRE)? The quantitative study of the operational behavior of software-based systems with respect to user requirements concerning reliability. SRE has been adopted either as standard or as best practice by more than 50 organizations in their software projects including AT&T, Lucent, IBM, NASA and Microsoft, plus many others worldwide. This presentation will provide an introduction to software reliability engineering…..
  • 8. Why is SRE Important? There are several key reasons a reliability engineering program should be implemented: So that it can be determined how satisfactorily products are functioning. Avoid over-designing – products could cost more than necessary and lower profit. If more features are added to meet customer demand then reliability should be monitored to ensure that defects are not designed in, which could impact reliability. If a customer’s product is not designed well, with reliability and quality in mind, then they may well turn to a COMPETITOR! Having a software reliability engineering process can make organizations more competitive as customers will always expect reliable software that is better and cheaper.
  • 9. Why is SRE Beneficial? For Engineers: Managing customer demands: Enables software to be produced that is more reliable; built faster and cheaper. Makes engineers more successful in meeting customer demands. In turn this avoids conflicts – risk, pressure, schedule, functionality, cost etc. For the organization: Improves competitiveness. Reduces development costs. Provides customers with quantitative reliability metrics. Places less emphasis on tools and a greater emphasis on “designing in reliability.” Products can be developed that are delivered to the customer at the right time, at an acceptable cost, and with satisfactory reliability.
  • 10. Common SRE Challenges Data is collected during test phases, so if problems are discovered it is too late for fundamental design changes to be made. Failure data collected during in-house testing may be limited, and may not represent failures that would be uncovered in the product’s actual operational environment. Reliability metrics obtained from restricted testing data may result in reliability metrics being inaccurate. There are many possible models that can be used to predict the reliability of the software, which can be very confusing. Even if the correct model is selected there may be no way of validating it due to having insufficient field data.
  • 11. Fault Lifecycle Techniques Prevent faults from being inserted. Avoids faults being designed into the software when it is being constructed. Remove faults that have been inserted. Detect and eliminate faults that have been inserted through inspection and test. Design the software so that it is fault tolerant. Provide redundant services so that the software continues to work even though faults have occurred or are occurring. Forecast faults and/or failures. Evaluate the code and estimate how many faults are present and the occurrences and consequences of software failures.
  • 12. Preventing Faults From Being Inserted Initial approach for reliable software A fault that is never created does not cost anything to fix. This should be the ultimate objective of software engineering. This requires: A formal requirement specification always being available that has been thoroughly reviewed and agreed to. Formal inspection and test methods being implemented and used. Early interaction with end-users (field trials) and requirement refinement if necessary. The correct analysis tools and disciplined tool use. Formal programming principles and environments that are enforced. Systematic techniques for software reuse. Formal software engineering processes and tools, if applied successfully, can be very effective in preventing faults (but is no guarantee!) However, software reuse without proper verification can result in disappointment.
  • 13. Removing Faults When faults are injected into the software, the next method that can be used is fault removal. Approaches: Software inspection. Software testing. Both have become standard industry practices. This presentation will focus closely on these.
  • 14. Fault Tolerance This is a survival attribute – the software has to continue to work even though a failure has occurred. Fault tolerance techniques enables a system to: Prevent dormant software faults from becoming active (i.e., defensive programming to check for input and output conditions and forbid illegal operations). Contain software errors within a confined boundary to prevent them from propagating further (i.e., exception handling routines to treat unsuccessful operations). Recover software operations from erroneous conditions by using techniques such as check pointing and rollback.
  • 15. Fault/Failure Forecasting If software failures are likely to occur it is critical to estimate the number of failures and predict when each is likely to occur. This will help concentrate on failures that have the greatest probability of occurring, provide reliability improvement opportunities and improve customer satisfaction. Fault/failure forecasting requires: Defining a fault/failure relationship – why the failure occurs and its effect. Establishing a software reliability model. Developing procedures for measuring software reliability. Analyzing and evaluating the measurement results. Measuring software reliability provides: Useful metrics that can be used to plan further testing and debug efforts, to calculate warranty costs and plan further software releases. Determines when testing can be terminated.
  • 16. SRE Process Overview This slide shows a general SRE process flow that has six major components: Determine Reliability Determine Reliability Objective Objective Define Operational Define Operational Profile Profile Perform Code Inspection Perform Code Inspection Determine the reliability Target. Define a software operational Profile. Perform Software Testing Continue Testing Select Appropriate Software Model Conduct code inspection. Perform software testing. Conduct reliability modelling to measure the software reliability – continuously improve the software reliability until the target is reached. Collect Failure Data Reliability Objectives met? Use software Reliability Model(s) Use software Reliability Model(s) to Calculate Current Reliability to Calculate Current Reliability Software Release Acceptable from Reliability Perspective Field reliability validation. Validate Field Reliability Validate Field Reliability
  • 17. SRE Terms Reliability objective: The product’s reliability goal from the customer’s viewpoint. Operational profile: A set of system operational scenarios with their associated probability of occurrence. This encourages testers to select test cases according to the system’s likely operational usage. Reliability modeling: This is an essential element of SRE that determines whether the product meets its reliability objective. One or more models can be used to calculate, from failure data collected during system testing, various estimates of a product’s reliability as a function of test time. It can also provide the following information: Product reliability at the end of various test phases. Amount of additional test time required to reach the product’s reliability objective. The reliability growth that is still required (ratio of initial to target reliability). Prediction of field reliability. Field Reliability Validation: Determination of whether the actual field reliability meets the customer’s target.
  • 19. Software Reliability Objectives Reliability target(s) should be defined and used to: Manage customer expectations. Determine how reliability growth can and will be tracked throughout the program. Determine availability targets. Software reliability is commonly expressed as an availability metric though rather than as a probabilistic reliability metric. This is defined as: Availability = Software uptime Software uptime + downtime A data collection and analysis methodology also has to be defined: How inspections will be conducted. How failure data will be collected. How the data will be analyzed, i.e., what model will be used? This helps project managers track metrics and plan resource.
  • 20. Managing the Software Reliability Objective Defects are often inserted from the beginning of project. This is usually related to the intensity of the effort, i.e. the number of engineers working on the program, the project schedule and the various design decisions that are made etc. Defects are most often detected and addressed at a later date than the original design effort. Test efforts are relied on to discover most defects, this lag can have a negative impact on the program. This can be mitigated against by using code inspection, but some testing will still be necessary. Code inspections should be conducted to IEEE 1028. There is still a lag though between defect insertion and correction, which can have a negative impact on the program. The eventual defect rate represents the reliability target, and as defects are discovered and addressed the software reliability is increased, or grown – this is termed ‘Reliability Growth Management”.
  • 21. Initial Reliability Growth Model - The Rayleigh Curve (1 of 3) The eventual goal should be to forecast the discovery rate of defects as a function of time throughout the software development program. This cannot be achieved until data from prior similar projects becomes available. This may take time but the effort provides value as it enables accurate forecasts to be achieved from the beginning of the project. Industry data is also available. This helps to manage customer expectations as it demonstrates a strategy for improving software reliability. To produce this curve, reliability data from prior software developments has to be available. Therefore this is a goal, it’s not a technique that can be used immediately. To get to this stage metrics need to be collected by using the methods discussed in this presentation.  1  2 − 1 2 t 2     2 Peak  f (t ) = K    te  Peak    
  • 22. The Rayleigh Curve (2 of 3) The model's cumulative distribution function (CDF) describes the totalto-date effort expended or defects found at each interval – returns the software reliability at various points in time.  1 2  − t   2 Peak2  F (t ) = K 1 − e     
  • 23. The Rayleigh Curve (3 of 3) Example: A software project has a 12 month delivery Prior data is available to generate a reliability forecast. The customer wants to know what the effect is of pulling the delivery in to 9 months. What is the answer? It reduces the total containment effectiveness (TCE), otherwise expressed as reliability, from 89.6% to 61%. Tradeoff: This allows expectations to be managed by explaining that to achieve early delivery their will be a tradeoff in the reliability, which may require a later release. This type of management helps to avoid possible customer dissatisfaction.
  • 24. Further Information Software reliability growth using the Rayleigh Curve is discussed in greater depth in Appendix A of How Reliable Is Your Product?: 50 Ways to Improve Product Reliability, by Mike Silverman. The text of Appendix A was provided by the author of this presentation. This book is highly recommended for anybody that is interested in improving product reliability, available from Amazon or directly from Ops A La Carte.
  • 25. Software Availability and Failure Intensity (1 of 2) As mentioned earlier, instead of a reliability metric being provided, customers may ask for a certain ‘availability’. This is the average (over time) probability that a system or a capability of a system is currently functional in a specified environment. It depends on: The probability of software failure Length of downtime when failure occurs. It essentially describes the expected fraction of the operating time during which a software component or system is functioning acceptably. If the software is not being modified (if further development or further releases are not planned) then the failure rate will be constant and therefore the availability will be constant.
  • 26. Software Availability and Failure Intensity (2 of 2) Software uptime From earlier, availability is defined as: Availability = Software uptime + downtime Downtime can be expressed as: Downtime = t m λ Where: tm=downtime per failure λ=failure intensity For software , the downtime per failure is the time to recover from the failure, not the time required to find and remove the fault. 1 ∴ Availabilt y = 1 + tm λ If an availability specification for the software is specified, then the downtime per failure will determine a failure intensity objective: 1 − Availabili ty λ= Availabili ty × t m Either an availability or a failure intensity objective have to be defined.
  • 27. Example A product must be available 99% of time. Required Downtime = 6 minutes (0.1hr) The downtime per failure can be used to determine the failure intensity objective. 1− A λ= Atm 1 − 0.99 ∴λ = 0.99 × 0.1 ∴ λ = 0.1 failure / hr or 100 failures/kHrs
  • 28. Availability, Failure Intensity, Reliability and MTBF This presentation will discuss reliability in terms of availability, probability and MTBF. These are the relationships between each of these three metrics. A customer specifies an availability target of 0.99999 and a maximum software downtime of 5 minutes, or 0.083 hours. The failure intensity is determined from: λ= Downtime 0.00083 = = 0.083 failures / Hr Availability 0.0099999 What is the reliability probability for a period of 2 years? R (t ) = e − λT 1×10 9 =e − 0.99999×17520 = 0.99998 1×109 What is the Mean Time Between Failures (MTBF)? MTBF = 1 × 109 λ 1 × 109 = = 1.2 × 1010 0.083 Hours
  • 29. THE OPERATIONAL PROFILE Defining a structured approach to inspection and test
  • 30. Defining an Operational Profile An operational profile is a quantitative characterization of how a system will be used in the field by customers. Why is it useful? It provides information on how users will employ the product. It enables the most critical operations to be focused on during testing. This allows the efficiency of the reliability test effort to be improved. It allows more realistic test cases to be designed. To do this the individual software operations have to be identified, which are: Major system logical tasks that returns control to the system when complete. Major = a task that is related to a functional requirement or feature rather than a subtask. The operation can be initiated by a user, another part of the system, or by the systems own controller. For more information on operational profiles refer to Software Reliability Engineering: More Reliable Software Faster and Cheaper – John D. Musa
  • 31. Developing an Operational Profile (1 of 5) Five steps are needed to develop an operational profile: 1. Identify operation initiators (users, other sub-systems, external systems, product’s own controller etc. 2. Create an operations list – this is a list of operations that each initiator can execute. If all initiators can execute every operation then the initiators can be omitted, and instead just focus on producing a thorough operations list.
  • 32. Developing an Operational Profile (2 of 5) A good way to generate an operations list for a menu-driven product is to produce a ‘walk tree” rather than use an initiators list. An example of a menu driven system is provided below. This is based on a medical enteral pump, used for feeding patients.
  • 33. Developing an Operational Profile (3 of 5) Step 3. Once the operational profile is complete it should be reviewed to ensure: All operations are of short duration in execution time (seconds at most). Each operation must have substantially different processing from the others. All operations must be well-formed, i.e., sending messages and displaying data are parts of the operation and not operations in themselves. The final list is complete with high probability. The total number of operations is reasonable, taking the test budget into account. This is because each operation will be focused on individually using a test case, so if the list is too long it may result in the project test phase being very lengthy.
  • 34. Developing an Operational Profile (4 of 5) Step 4. Determine occurrence rates for each operation – this may need to be estimated to begin with, but can be revised later. Occurrence Rate = Number of operation occurrences Time the total set of operations is running
  • 35. Developing an Operational Profile (5 of 5) Step 5. Determine the occurrence probabilities. Occurrence Probability = Occurrence rate of each operation Total operation occurrence rate This table has been rearranged by sorting the operations in order of descending probabilities. This presents the operational profile in a form that is more convenient to use.
  • 36. Establish Failure Definitions What is critical to the customer? How does the customer define a failure? A failure is any departure of system behavior in execution from the user needs. A Fault is a defect that causes the failure (i.e., missing code). A fault may not result in failure…but a failure can only occur if a fault exists. Faults have to be detect – how can this be done? Answer – by developing an operational profile. This enables resource to be focused on addressing issues in operations that have the highest probability of failure. Results in failures having a low failure intensity. Failure modes should be defined early in the project – this provides a specification for what the system should NOT be doing! Failure severity classes can be defined as shown below. The failures that have the highest severity should be focused on first.
  • 38. Software FMEA and Risk Analysis A software Failure Mode and Effects Analysis (SFMEA) is a systematic method that: Recognizes, evaluates, and prioritizes potential failures and their effects. Identifies and prioritizes actions that could eliminate or reduce the likelihood of potential failures occurring. Failure Mode (Defect) Cause Material or process input Process Step Effect Software Failure An FMEA aids in anticipating failure modes in order to determine and assess the risk to the customer or product. then Risks have to be reduced to acceptable levels.
  • 39. Software FMEA and Risk Analysis (1 of 2) Fault trees provide a graphical and logical framework system failure modes to be analyzed. These can then be used to assess the overall impact of software failures on a system, or to prove that certain failure modes cannot occur. Here is a simple example of how to use a fault tree to perform a Software FMEA. It is far better to begin an FMEA using a fault tree. Filling in a spreadsheet immediately can easily result in confusion and is rarely successful!! SYSTEM BLOCK DIAGRAM Sensor Controller Actuator Potential failure mode - unintended system function. Results in undesirable system behavior - could include potential controller or sensor failures. The first step is to produce a fault tree
  • 40. Software FMEA and Risk Analysis (2 of 2) 1 2
  • 41. CODE INSPECTION A reliability improvement and risk management technique
  • 42. Why Inspect Code? Formal inspections should be carried out on the: Requirements. Design. Code. Approximately 18 man hours plus rework are required per 300-400 lines of code. Test plans. “…formal design and code inspections rank as the most effective methods of defect removal yet discovered…(defect removal) can top 85%, about twice those of any form of testing.” -Capers Jones Applied Software Measurement, 3rd Ed. McGraw Hill 2008 Case study performed by the Data Analysis Center for Software (DACS): 85% Defect Containment: cost = $1,000,000, Duration = 12 months 95% Defect Containment: cost = $750,000, Duration = 10.8 months
  • 43. Formal “Fagan Style” Inspections This is a defined process that is quantitatively managed. The objective is to do the thing right. There is no discussion of options, it is either right or wrong, or it requires investigation. Ideally 4 inspectors participate (it can be 3-5, but not less than 3). Participants have roles – Leader, Reader, Author and Tester. The review rate target is 150-200 lines of code per hour. What is found depends on how closely the inspectors look at the code. This is a 6 step process that is defined in IEEE 1028. Data is stored in a repository for future reference. The outcome should be that defects are found and fixed, and that data is collected and analyzed.
  • 44. Relationship Between Inspection and Reliability (1 of 2) For a four-phase test process the reliability is likely to vary between 74% and 92% (based on industry data). Note that not all fixes address problems completely. Some fixes may not be totally effective, while others may also introduce further problems. This is where inspection can be of value. Adapted from a similar approach in : Capers Jones Applied Software Measurement, 3rd Ed. McGraw Hill 2008
  • 45. Relationship Between Inspection and Reliability (2 of 2) Introducing inspection can increase the reliability to 93 – 99%(based on industry data). Inspection alone can enable the software to surpass the reliability that is obtained from a testonly process! This also increases the scope for reducing the emphasis on testing. Adapted from: Capers Jones Applied Software Measurement, 3rd Ed. McGraw Hill 2008
  • 46. SOFTWARE TESTING Further defect detection and elimination
  • 47. Static Analysis (1 of 2) This should be performed after the code is developed. It is pattern based – it scans the code to check for patterns that are known to cause defects. This type of analysis uses coding standard rules and enforces internal coding guidelines. This is a simple task, easily automated, that reduces future debugging effort. It is data flow based, in that it statically simulates execution paths, so is able to automatically detect potential runtime errors such as: Resource leaks. NullpointerExceptions. SQL injections. Security vulnerabilities. The benefits of static analysis are: It can examine more execution paths than conventional testing. It can be applied early in the software design, providing significant time and cost savings.
  • 48. Static Analysis (2 of 2) Examples of warning classes that can be obtained from static analysis are: Buffer overrun Buffer underrun Cast alters value Ignored return value Division by zero Missing return statement Null pointer dereference Redundant condition Shift amount exceeds bit width Type overrun Type underrun Uninitialized variable Unreachable code Unused value Useless assignment
  • 49. Buffer Overflow Example Consider the code segment below: char arr[32]; For (int i = 0; i < 64; i++) { arr[i] = (char)i; } Here, memory that is beyond the range of the stack-based variable “arr” is being explicitly addressed. This results in memory being overwritten, which could include the stack frame information that is required for the function to successfully return to its caller, etc. This coding pattern is typical of security vulnerabilities that exist in software. The specifics of the vulnerability may change from one instance to another, but the underlying problem remains the same, performing array copy operations that are incorrectly or insufficiently guarded against exploit. Static analysis can assist in detecting such coding patterns
  • 50. Types of Tests Functional tests This is single execution of operations with interactions between the various operations minimized. The focus is on whether the operation executes correctly. Load tests These tests attempt to represent field use and the environment as accurately as possible, with operations executing simultaneously and interacting. Interactions can occur directly, through the data, or as a result of resource conflicts. This testing should use the operational profile. Regression tests Functional tests that can be conducted after every build involving significant change. The focus during these tests is to reveal faults that may have been created during the change process. Endurance tests Ad-hoc testing is similar to load testing in that it should represent the field use and environment as accurately as possible. This will focus on how the product is to be used…and may be misused.
  • 51. RELIABLE SYSTEM DESIGN A look at fault tolerance, an essential aspect of system design
  • 52. Reliable System Design (1 of 7) To achieve reliable system design software should be designed such that it is fault tolerant. Typical responses to system or software faults during operation includes a sequence of stages: Fault confinement, Fault detection, Diagnosis, Reconfiguration, Recovery, Restart, Repair, Reintegration.
  • 53. Reliable System Design (2 of 7) Fault Confinement. Limits the spread of fault effects to one area of the system – prevents contamination of other areas. Achieved through use of: - self-checking acceptance tests, - exception handling routines, - consistency checking mechanisms, - multiple requests/confirmations. Erroneous system behaviors due to software faults are typically undetectable. Reduction of dependencies can help.
  • 54. Reliable System Design (3 of 7) Fault Detection. This stage recognizes that something unexpected has occurred in the system. Fault latency – period of time between fault occurrence and detection. The shorter the fault latency is, the better the system can recover. Two technique classes are off-line and on-line fault diagnosis: - Off-line techniques are diagnostic programs. System cannot perform useful work under test. - On-line techniques provide real-time detection capability. System can still perform useful work. Watchdog monitors and redundancy schemes.
  • 55. Reliable System Design (4 of 7) Diagnosis. This is necessary if the fault detection technique does not provide information about the failure location and/or properties. This is often an off-line technique that may require a system reset. On-line techniques can also be used i.e., when a diagnosis indicates unhealthy system conditions (such as low available resources), low-priority resources can be released automatically in order to achieve in-time transient failure prevention. Reconfiguration. This occurs when a fault is detected and a permanent failure is located. The system may reconfigure its components either to replace the failed component or to isolate it from the rest of the system (i.e., redundant memory, error checking of memory in case of partial corruption etc). Successful reconfiguration requires robust and flexible software architecture and reconfiguration schemes.
  • 56. Reliable System Design (5 of 7) Recovery. Uses techniques to eliminate the effects of faults. There are two approaches: - fault masking, - retry and rollback. Fault masking hides effects of failures by allowing redundant, correct information to outweigh the incorrect information. Retry makes a second try at an operation as many faults are transient in nature. Rollback makes use of backed up (checkpointed) operations at some point in its processing prior to fault detection, and operation recommences from this point. Fault latency is very important because the rollback must go back far enough to avoid the effects of undetected errors that occurred before the detected error.
  • 57. Reliable System Design (6 of 7) Restart. This occurs after the recovery of undamaged information. There are three approaches: - hot restart, - warm restart; - cold restart. Hot restart – resumption of all operations from the point of fault detection (this is only possible if no damage has occurred). Warm restart – only some of the processes can be resumed without loss. Cold restart – complete reload of the system is performed with no processes surviving.
  • 58. Reliable System Design (7 of 7) Repair. Replacement of failed component – on or off-line. Off-line – system brought down to perform repair. System availability depends on how fast a fault can be located and removed. On-line – Component replaced immediately with a back up spare (similar to reconfiguration), or perhaps operation can continue without using the faulty component (i.e., masking redundancy or graceful degradation). On-line repair prevents system operation interruption. Reintegration. Repaired module must be reintegrated into the system. For on-line repair, reintegration must be performed without interrupting system operation. Non-redundant systems are fault intolerant and, to achieve reliability, fault avoidance is often the best approach. Redundant systems should use fault detection, masking redundancy (i.e., disabling 1 out of N units), and dynamic redundancy (i.e., temporarily disabling certain operations ) to automate one or more stages of fault handling.
  • 59. RELIABILITY MODELING Determining what reliability has actually been achieved
  • 60. Reliability Modeling (1 of 4) This is used to calculate what the current reliability is, and if the reliability target is not yet being achieved, determine how much testing and debug needs to be completed in order to achieve the reliability target. The questions that reliability modeling aims to answer are: How many failures are we likely to experience during a fixed time period? What is the probability of experiencing a failure in the next time period? What is the availability of the software system? Is the system ready for release (from a reliability perspective)? Software Failures t2 t1 T=0 T1 T2 t3 t4 T3 T4 T5 Ti is the Cumulative Time To Failure ti is the inter-arrival time = Ti – Ti-1 t6 t5 T6 t7 T7 T8 TE
  • 61. Reliability Modeling (2 of 4) In reliability engineering it is usual to identify a failure distribution, especially when modeling non-repairable products*. This approach can be used because it is assumed that hardware faults are statistically independent and identically distributed. Where software is concerned, events (failures) are not necessarily independent due to interactions with other system elements, so in most cases failures are not identically distributed. When a failure occurs in a software system the next failure may depend on the current operational time of the unit, and therefore each failure event in the system may be DEPENDENT. * Although it can be argued that a software system can be repaired by fixing the fault, in reliability terms it is still a non-repairable product because it is not wearing out. For instance, a car is a repairable device as parts can be changed when they wear out, but this does not necessarily make it as good as new. If a software fault is repaired it is actually as good as new again, and in fact the improvement may make it better than new.
  • 62. Reliability Modeling (3 of 4) Therefore what is needed is to model the Rate of Occurrence of Failures and the Number of Failures within a given time. As an example, with reference to the figure below, a model is needed that will report the fact that 8 failures are expected by timeTE and that the Rate of Occurrence of Failures is Increasing with Time. Software Failures t2 t1 T=0 T1 T2 t3 t4 T3 T4 T5 t6 t5 T6 t7 T7 T8 TE
  • 63. Reliability Modeling (4 of 4) If a Distribution Analysis is performed on the Time-Between-Failures, then this is equivalent to saying that there are 9 different systems, where System 1 failed after t1 hours of operation, System 2 after t2,…, etc. T=0 System 1 System 2 System 3 . . . System 9 t1 t2 t3 T9 (suspension*) This is the same as assuming that the system is failure free if the fault is addressed, which may not necessarily be true as further failures may occur. Example: Changing the break pads on a car. This does not mean that the car is now failure free! * A unit that continues to work at the end of the analysis period or is removed from a test in working condition. I.e., it may fail at some point in the future.
  • 64. An Example of an Incorrect Approach (1 of 4) This example has been included because it is a common approach to hardware reliability modeling, but it CANNOT be used for modeling software reliability. The method is normally used to model a non-repairable hardware product. Unfortunately, when used to analyze software reliability it returns incorrect results…but it is an easy trap for a reliability engineer to fall into!!! Both firmware and hardware failure data are collected from three systems: a total of 6 different firmware and 4 different hardware failure modes are identified.
  • 65. An Example of an Incorrect Approach (2 of 4) The conventional reliability engineering approach is to take the Time-Between-Failures for each system and then fit a distribution (e.g., 319 − 152 = 167 hours between consecutive failures). Notice that hardware failures have been removed. The time between the last failure and the current age is a suspension.
  • 66. An Example of an Incorrect Approach (3 of 4) A Weibull (life data) analysis is conducted, but with software this is not appropriate! The analysis assumes a sample of 20 systems, where one system failed after 152 hrs, another after 319 hrs, etc.
  • 67. An Example of an Incorrect Approach (4 of 4) This system will be used for a total of 250 hours. What will the software reliability be? The analysis returns 97.63% – a great result, but COMPLETELY WRONG!!! Distribution analysis is acceptable for non-repairable products containing only hardware, but not for anything containing software (or for repairable hardware-only products). In products that contain software, events are dependent, and therefore alternative analysis methods should be used. However, it is correct to fit a distribution to the First-Time-to-Failure of each system.
  • 68. An Example of a Correct Approach Fitting a distribution to the First-Time-to-Failure of each system gives the probability that the unit will NOT fail in the first 250 hours: Reliability = 68.36%. Notice that the confidence interval is very wide.
  • 69. Three Possible SRE Approaches… Are multiple systems being tested?
Yes – Use the NHPP model (this is the best option). This is the current state of the art in software reliability modeling and is suitable for most projects. However, it is not suitable for testing a single unit (i.e., a large, expensive system), or where not all faults are going to be fixed between compiles; a better model is needed for that type of application.
No – Can testing be stopped after each phase to fix failure modes?
  Yes – Use the Crow-Extended model.
  No – Use the 3-Parameter Crow-Extended model.
It is hypothesized by the author that the Crow models may be more suitable for developments where the NHPP model cannot be well applied. This essentially represents a future state of software reliability testing. However, before being readily accepted they should be validated, i.e., by comparing their predicted reliability with actual field data. Use of these models has been included in this presentation for completeness and possible future application.
  • 70. A Better SRE Analysis Approach (1 of 4) A model is needed that takes into account the fact that when a failure occurs the system has a “current age” from which the next failure is likely to occur. For example, System 1 has an age of 152 hours when the first firmware failure mode is detected. In other words, all other operations that could result in a failure also have an age of 152 hours, and the next failure event is based on this fact.
  • 71. A Better SRE Analysis Approach (2 of 4) The NHPP (Non-Homogeneous Poisson Process) with a power-law failure intensity is such a model: Pr[N(T) = n] = [Λ(T)]^n · e^(−Λ(T)) / n!, where Pr[N(T) = n] is the probability that n failures will be observed by time T, Λ(T) = λT^β is the expected cumulative number of failures by time T, and the failure intensity function (Rate of Occurrence of Failures) is its derivative, λ(T) = λβT^(β−1). Just because a model is used for hardware does not mean that it cannot be suitable for software as well, as models simply describe times-to-failure. Therefore a hardware model can also be used for software, provided that it is a dependent model (failures depend on the operational time, rather than being independent).
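  A minimal sketch of these power-law NHPP quantities in Python (this is not the presentation's RGA tooling, and the parameter values are assumptions purely for illustration):

```python
import math

def cumulative_failures(lam, beta, T):
    """Expected cumulative number of failures by time T: Lambda(T) = lam*T^beta."""
    return lam * T ** beta

def failure_intensity(lam, beta, T):
    """Rate of occurrence of failures at time T: lambda(T) = lam*beta*T^(beta-1)."""
    return lam * beta * T ** (beta - 1)

def prob_n_failures(lam, beta, T, n):
    """Pr[N(T) = n]: Poisson probability with mean Lambda(T)."""
    m = cumulative_failures(lam, beta, T)
    return m ** n * math.exp(-m) / math.factorial(n)

# Hypothetical fitted parameters, not taken from the example folio
lam, beta = 0.001, 1.2
print(cumulative_failures(lam, beta, 1380))  # expected failures by 1380 h
print(prob_n_failures(lam, beta, 1380, 0))   # probability of zero failures
```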
  • 72. A Better SRE Analysis Approach (3 of 4) NHPP model parameters: here the failure events of System 1 are analyzed over the period from 0 to 1380 hours. This folio also contains the failure events for Systems 2 and 3 (not shown). Of interest is the fact that Beta > 1, which indicates that the inter-arrival times between unique failures are decreasing, i.e., the failure intensity is increasing with time, so reliability growth is not being achieved.
  • 73. A Better SRE Analysis Approach (4 of 4) NHPP model results: the plot shows the cumulative number of failures vs. time, from which conclusions and further predictions can be obtained. The Weibull plot intersects the X-axis, so out-of-box failures should not be present; if it had intersected the Y-axis, this would indicate potential for out-of-box failures. The cumulative number of failures is 0.1352, equivalent to 13.52 failures per 25,000 operational hours.
  • 74. An Example Using the NHPP Model (1 of 8) Software is under development – the reliability requirement is no more than 1 fault in every 8 hours of software operation. Three Test Engineers provide a total of 24 hours of testing each day, and one new compile, in which fixes are implemented, is available for testing each week. The failure rate goal is FR = 1/8 = 0.125 failures per hour. Over a 24-hour testing day, the failure intensity goal is FRI = 0.125 × 24 = 3 faults per day.
  • 75. An Example Using the NHPP Model (2 of 8) Failure data is obtained and the NHPP model parameters are calculated. The data is grouped by the number of days until a new compile is available, i.e., the first 45 failures are contained in one group and are fixed in compile #1.
  • 76. An Example Using the NHPP Model (3 of 8) The instantaneous failure intensity after 28 days of testing is 4.4947 faults/day. If testing is continued at the same growth rate, when will the goal of no more than 3 faults/day be achieved? The answer is after an additional 149 − 28 = 121 days of testing and development (test-analyze-fix).
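  The 121-day figure can be reproduced by inverting the power-law intensity function. A sketch, assuming parameter values back-solved from the two data points on the slide (4.4947 faults/day at 28 days, goal of 3 faults/day met at 149 days); the real values would come from the fitted RGA folio:

```python
import math

# Back-solved, illustrative parameters (not the fitted folio values)
beta = 1 + math.log(3 / 4.4947) / math.log(149 / 28)   # ~0.758
lam = 4.4947 / (beta * 28 ** (beta - 1))

def days_to_reach(lam, beta, goal):
    """Solve lam*beta*T^(beta-1) = goal for T; requires beta != 1."""
    return (goal / (lam * beta)) ** (1 / (beta - 1))

total = days_to_reach(lam, beta, 3.0)
print(f"goal reached at ~{total:.0f} days, {total - 28:.0f} additional")
# -> ~149 days total, ~121 additional days, matching the slide
```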
  • 77. An Example Using the NHPP Model (4 of 8) An extra 121 days is longer than anticipated. Let’s take a closer look by generating a Failure Intensity vs. Time plot. Each line on the plot indicates the failure intensity over a given interval (in this case 5 days). It can be seen that there was a jump in the failure intensity between 20 and 23 days, which is why it is estimated that more development time is required. The next step is to analyze the data set for the period up to 20 days of testing, before the failure intensity increased…
  • 78. An Example Using the NHPP Model (5 of 8) The NHPP model data is limited to the first 20 days of testing and another Failure Intensity vs. Time plot is generated for that period. This plot confirms that the failure intensity continuously decreased during the first 20 days of testing.
  • 79. An Example Using the NHPP Model (6 of 8) Based on the first 20 days of data, the additional test and development duration can be recalculated, which results in an additional 55 − 28 = 27 days to achieve the goal of no more than 3 faults/day, rather than 121! This generates questions: Why is there such a big difference in the test duration still required? What happened when the failure intensity jumped on the 23rd day of testing and development? Answer – new functionality was added. A jump in required test time is typical when new features are introduced, and applies to software and hardware alike. Because new functionality has been added, it would be wise to reset the clock and track the reliability growth from the 20th day forward…
  • 80. An Example Using the NHPP Model (7 of 8) Now the NHPP model parameters are obtained and plotted for the last 8 days of testing (8 days is an arbitrary number; enough data needs to be available to have confidence in any conclusions that are drawn). This provides better resolution. By taking a “macro” view it can be seen that the failure intensity is starting to increase, so the minimum failure intensity point has been determined. For improved accuracy, calculations should be based on this.
  • 81. An Example Using the NHPP Model (8 of 8) Based on this data set, 51 − 8 = 43 more days of developmental testing are required. It may be too early to make predictions based on only 8 days of testing, but the result can be used to get a general idea of the remaining development time and to produce a test plan. To pull in the schedule, 3 more Test Engineers could be added and the code recompiled every 2 days, which would complete the project within 1 month. There are also situations where some issues are fixed immediately, others are addressed later and more minor issues may not be addressed at all. In these situations the Crow-Extended model can be useful…
  • 82. Crow-Extended Model Introduction (1 of 2) This is not a common SRE model, but it has the benefit of supporting decision making by providing metrics such as:
Failure intensity vs. time.
Demonstrated Mean Time Between Failures (MTBF*).
MTBF growth that can be achieved through implementation of corrective actions.
Maximum potential MTBF that can be achieved through implementation of corrective actions, together with estimates regarding latent failure modes that have not yet been uncovered through testing.
The model uses A, BC and BD failure mode classifications to analyze growth data: A = a failure mode that will not be fixed; BC = a failure mode that will be fixed while the test is in progress; BD = a failure mode that will be corrected at the end of the test. * This model uses MTBF rather than failure intensity or reliability metrics; a conversion between these metrics is provided in slide 28.
  • 83. Crow-Extended Model Introduction (2 of 2) There is no reliability growth for A modes. The effectiveness of the corrective actions for BC modes is assumed to be demonstrated during the test. BD modes require an assigned factor that estimates the effectiveness of the correction to be implemented after the test. Analysis using the Crow-Extended model allows different management strategies to be considered by reviewing whether the reliability goal will be achieved. There is one constraint to this approach – the testing must be stopped at the end of the test phase and all BD modes must be fixed. The Crow-Extended model will return misleading conclusions if it is used across multiple test phases; for those situations use the 3-Parameter Crow-Extended model (discussed next). DO NOT APPLY THIS MODEL TO A MULTIPLE-SYSTEM TEST; USE THE NHPP MODEL INSTEAD.
  • 84. Crow-Extended Model Example (1 of 8) A product underwent development testing, during which failure modes were observed. Some modes were corrected during the test (BC modes), some modes were corrected after the end of the test (delayed fixes, BD modes) and some modes were left in the system (A modes). The test was terminated after 400 hours; the times-to-failure are provided below:
  • 85. Crow-Extended Model Example (2 of 8) An effectiveness factor has been assigned for each BD failure mode (delayed fixes). The effectiveness factor is based on engineering assessment and represents the fractional decrease in failure intensity of a failure mode after the implementation of a corrective action. The effectiveness factors for the BD failure modes are provided below: This is a metric that enables an assessment to be made of whether or not the corrective actions have been effective, and if they have, how effective they were. This is often a subjective judgment.
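  To make the role of effectiveness factors concrete, here is a heavily simplified sketch of how a BD mode's contribution to the failure intensity is discounted by its EF. It uses raw failure counts and hypothetical data rather than the fitted Crow-Extended model, and it ignores the model's projection term for still-undetected modes:

```python
# (mode class, observed failures, effectiveness factor) - hypothetical data.
# EF applies only to BD modes; BC fixes are demonstrated during the test,
# so their effect is already reflected in the observed failure counts.
modes = [
    ("A",  3, None),   # will not be fixed
    ("BD", 4, 0.7),    # delayed fix, assumed 70% effective
    ("BD", 2, 0.6),    # delayed fix, assumed 60% effective
]
T = 400.0  # test duration in hours (as in the example)

# Failure intensity at end of test from the modes still present
lam_end = sum(n for _, n, _ in modes) / T

# After delayed fixes: each BD contribution is reduced by its EF
lam_projected = sum(
    (n / T) * (1 - ef if cls == "BD" else 1) for cls, n, ef in modes
)
print(f"end-of-test MTBF ~{1 / lam_end:.1f} h, "
      f"projected ~{1 / lam_projected:.1f} h")
```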
  • 86. Crow-Extended Model Example (3 of 8) The times-to-failure data and effectiveness factors are entered: Note that this data sheet only displays 29 rows of data, but all data is entered even though it has not been shown. Effectiveness factor is expressed as 0-1 (0-100% of the failure intensity being removed by the corrective action).
  • 87. Crow-Extended Model Example (4 of 8) Model parameter calculation: here the failure events are analyzed over the period of 0 to 400 hours.
  • 88. Crow-Extended Model Example (5 of 8) Growth potential MTBF plot. The plot shows the growth potential MTBF (the maximum achievable MTBF based on the current strategy), the projected MTBF (the estimated MTBF after the delayed corrective actions have been implemented), the demonstrated MTBF (the MTBF at the end of the test, without the delayed corrective actions) and the instantaneous MTBF (the demonstrated MTBF over time). The demonstrated MTBF (the result of fixing BC modes during the test) is about 7.76 hours. The projected MTBF (the result of fixing BD modes after the test) is about 11.13 hours. The growth potential MTBF (if testing continues with the current strategy, i.e., modes corrected vs. modes not corrected, and with the current effectiveness of each corrective action) is estimated to be about 14.7 hours; this is the maximum attainable MTBF.
  • 89. Crow-Extended Model Example (6 of 8) An Average Failure Mode Strategy plot is a pie chart that breaks down the average failure intensity of the software into the following categories: A modes – 9.546%. BC modes addressed – 14.211%. BC modes still undetected – 30.655%. BD modes removed – 8.846%. BD modes still to be removed – 3.355% (because corrective actions were <100% effective). BD modes still undetected – 33.386%.
  • 90. Crow-Extended Model Example (7 of 8) Individual Mode MTBF plot, which shows the MTBF of each individual failure mode. This enables the failure modes with the lowest MTBF to be identified. These are the failure modes that cause the majority of software failures, and should be addressed as the highest priority when reliability improvement activities are to be implemented. Blue = Failure mode MTBF before corrective action. Green = Failure mode MTBF after corrective action.
  • 91. Crow-Extended Model Example (8 of 8) Failure Intensity vs. Time plot: This can be analyzed in exactly the same way as in the NHPP example.
  • 92. 3-Parameter Crow-Extended Model Introduction (1 of 2) This is not a common SRE model either, but it has the same benefits as the single-parameter Crow-Extended model, and multiple test phases can also be taken into account. The model is ideal where software is to be tested over multiple phases but not all bug fixes can be introduced as faults are discovered, i.e., bugs will be addressed on an ad-hoc basis over an extended time period. The model provides the flexibility of not having to specify when the test will end, so it can be continuously updated with new test data; it is therefore optimized for continuous evaluation rather than fixed test periods. It can only be applied to an individual system, so it lends itself ideally to situations where an individual complex system is being tested. DO NOT APPLY ANY CROW MODEL TO A MULTIPLE-SYSTEM TEST; USE THE NHPP MODEL INSTEAD.
  • 93. 3-Parameter Crow-Extended Model Introduction (2 of 2) This model uses several event codes:
F – A failure time.
I – The time at which a certain BD failure mode has been corrected. BD modes that have not received a corrective action by time T will not have an associated I event in the data set.
Q – A failure that was due to a quality issue, such as a build problem rather than a design problem. The reliability engineer can decide whether or not to include quality issues in the analysis.
P – A failure that was due to a performance issue, such as an incorrect component being installed in a device where the embedded code is being tested. The reliability engineer can decide whether or not to include performance issues in the analysis.
AP – An analysis point, used to track overall project progress, which can be compared to planned growth phases.
PH – The end of a test phase. Test phases can be used to track overall project progress, which can be compared to planned growth phases.
X – A data point that is to be excluded from the analysis.
  • 94. 3-Parameter Crow-Extended Model Example (1 of 11) Software is under development and testing is to be conducted in 3 phases. Phase 1 – 6 weeks of manual testing run 45 hours per week, total 270 hours. Phase 2 – 4 weeks of automated testing run 24/7, total 672 hours. Phase 3 – 8 weeks of field manual testing run 40 hours per week, total 320 hours. One hour of continuous testing equates to 7 hours of customer usage, so the testing includes a usage acceleration factor of 7. The average fix delays for the three phases are 90 hours, 90 hours and 180 hours respectively (fix delay = the time between discovering a failure mode and the corrective action being incorporated into the design). Taking usage acceleration into account, the cumulative test times for the three phases are 1890 hours, 6594 hours and 8834 hours respectively.
  • 95. 3-Parameter Crow-Extended Model Example (2 of 11) Customer reliability target = 2 failures per year. Usage duty cycle = 0.1428, so a year of customer usage equates to 8760 × 0.1428 ≈ 1251 hours of continuous operation. Therefore, for continuous usage, the reliability target is 2 failures every 1251 hrs. Failure intensity = 2 / 1251 = 0.0016 failures per hour. Equivalent test time = 8834 hrs. Required MTBF = 8834 / (0.0016 × 8834) = 1 / 0.0016 = 625 hrs.
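  The conversion above can be sanity-checked in a few lines (a sketch; the 8760 hours-per-year figure is an assumption implied by the duty cycle and the 1251-hour result):

```python
HOURS_PER_YEAR = 8760
duty_cycle = 0.1428            # fraction of calendar time in operation
target_failures_per_year = 2   # customer reliability target

operating_hours = HOURS_PER_YEAR * duty_cycle                    # ~1251 h/yr
failure_intensity = target_failures_per_year / operating_hours  # ~0.0016 /h
required_mtbf = 1 / failure_intensity                            # ~625 h
print(f"{operating_hours:.0f} h/yr of use, "
      f"FI = {failure_intensity:.4f}/h, MTBF = {required_mtbf:.0f} h")
```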
  • 96. 3-Parameter Crow-Extended Model Example (3 of 11) The growth potential (GP) margin = 1.3. This is the amount by which the MTBF target should exceed the requirement; it is an initial estimate based on prior experience, and the higher the GP margin, the lower the risk to the program. Average effectiveness factor = 0.5 (1.0 = a perfect fix, 0 = an ineffective fix); this is also an initial estimate based on prior experience. Management strategy – address at least 90% of all unique failure modes prior to formal release. Discovery beta parameter = 0.7. This is the rate at which new, distinct failure modes are discovered during testing; again, an estimate based on prior experience. A discovery beta of less than 1 indicates that the inter-arrival times between unique B modes are increasing. This is desirable, as it is assumed that most failures will be identified early on and their inter-arrival times will become larger as the test progresses. The actual discovery beta can be obtained from the final results, allowing this parameter estimate to be refined with testing experience.
  • 97. 3-Parameter Crow-Extended Model Example (4 of 11) Based on the assumptions on the previous slide, an overall growth planning model can be created that shows the nominal and actual idealized growth curve and the planned growth MTBF for each phase. A growth planning folio is created and 1890, 6594 and 8834 are entered for the Cumulative Phase Times and 630, 630 and 1260 for the Average phase delays. Note that inter-phase average fix delays have been multiplied by 7 to take the usage acceleration factor into account.
  • 98. 3-Parameter Crow-Extended Model Example (5 of 11) The project parameters are input into the Planning Calculations window. Given the specified MTBF target and design margin, along with the other inputs that describe the planned reliability growth management strategy, the final achievable MTBF is calculated, along with other useful results. Here it is verified that 625 hours is achievable (if it were not, a figure of less than 625 hours would be calculated).
  • 99. 3-Parameter Crow-Extended Model Example (6 of 11) Effectiveness factors for all BD modes are specified, together with when they are to be implemented. A growth planning plot can then be obtained, displaying the planned MTBF vs. time values for the three test phases and marking the MTBF at the end of each phase.
  • 100. 3-Parameter Crow-Extended Model Example (7 of 11) Test failure data is collected during the three phases, from which the actual discovery beta is obtained (the original estimate was 0.7).
  • 101. 3-Parameter Crow-Extended Model Example (8 of 11) The growth potential MTBF plot can now be obtained, showing the growth potential MTBF (maximum achievable MTBF based on the current strategy), the demonstrated MTBF (MTBF at the end of the test without corrective actions), the projected MTBF (estimated MTBF after the delayed corrective actions have been implemented) and the instantaneous MTBF. If the MTBF goal is higher than the growth potential line, then the current design cannot achieve the desired goal and a redesign or change of goals may be required. For this example, the goal MTBF of 650 hours is well within the growth potential and is expected to be achieved after the implementation of the delayed BD fixes.
  • 102. 3-Parameter Crow-Extended Model Example (9 of 11) Average Failure Mode Strategy plot, breaking down the average failure intensity of the software into categories: A modes – 13.432%. BC modes addressed – 19.281%. BC modes still undetected – 13.76%. BD modes removed – 25.893%. BD modes remaining – 5.813% (because corrective actions were <100% effective). BD modes still undetected – 21.882%.
  • 103. 3-Parameter Crow-Extended Model Example (10 of 11) Individual Mode MTBF plot showing the MTBF of each individual failure mode, thus enabling the failure modes with the lowest MTBF to be identified. Blue = Failure mode MTBF before corrective action. Green = Failure mode MTBF after corrective action.
  • 104. 3-Parameter Crow-Extended Model Example (11 of 11) The RGA Quick Calculation Pad indicates that the discovery rate of new, unseen BD modes at 630 hours is 0.0006 per hour. The beta bounds are less than 1, indicating that there is still growth in the system (think of this as the leading-edge slope of the bathtub curve; when beta = 1 there is no more growth potential).
  • 105. RELIABILITY DEMONSTRATION Demonstration that a minimum software reliability has been achieved
  • 106. Reliability Demonstration Testing (1 of 2) There can be occasions when the actual software reliability has to be measured through practical demonstration. This is most applicable when all known faults have been removed and the software is considered stable, and it is most valuable where a reliability growth program has not been conducted (if the reliability has already been established through a growth program, there may be little value in this test). The demonstration can be achieved through sequential sampling theory.
  • 107. Reliability Demonstration Testing (2 of 2) A project-specific chart depends on: Discrimination ratio – the error in the failure intensity estimation that is considered acceptable. Consumer risk level – the probability of falsely claiming the failure intensity objective has been met when it has not. Supplier risk level – the probability of falsely claiming the failure intensity objective has not been met when it has. Common values are: discrimination ratio 2, consumer risk level 0.1 (10%), supplier risk level 0.1 (10%).
  • 108. Example Requirement: 4 failures/million operations. Failures 1, 2 and 3 occur after 0.4, 0.625 and 1.2 million operations respectively; multiplying by the requirement target gives normalized failure measures of 1.6, 2.5 and 4.8. The software can be accepted after failure 3, with 90% confidence that it is within the reliability target and a 10% risk that it is not. The accept boundary has to be crossed, though.
  • 109. Reliability Demonstration Test Chart Design (1 of 2) What if the software is still in the Continue region at the end of the test? Assume that the end of the test is reached just after failure 2. Option 1 – Recalculate the Failure Intensity Objective: Factor = F_CURRENT / F_PREVIOUS = 3.6 / 2.5 = 1.44, therefore FIO = 1.44 × 4 = 5.76. Option 2 – Extend the test time by ≥ this factor. Grouped data CANNOT be used; it has to be obtained from individual units.
  • 110. Reliability Demonstration Test Chart Design (2 of 2) The following formulae are used to design RDT charts: Accept–Continue boundary: T_N = (A − n·ln γ) / (1 − γ). Reject–Continue boundary: T_N = (B − n·ln γ) / (1 − γ). Where: T_N is the normalized measure of when failures occur (the horizontal coordinate); n is the failure number; γ is the discrimination ratio (the ratio of the maximum acceptable failure intensity to the failure intensity objective); A = ln(β / (1 − α)) and B = ln((1 − β) / α), where α is the supplier risk (the probability of falsely claiming the objective is not met when it is) and β is the consumer risk (the probability of falsely claiming the objective is met when it is not).
  • 111. Reliability Demonstration Test Chart Design Example The boundary coordinates at failure number n are ((B − n·ln γ) / (1 − γ), n) for the reject boundary and ((A − n·ln γ) / (1 − γ), n) for the accept boundary. The axis intersections can be calculated as follows: the reject boundary crosses the y-axis at (0, B / ln γ) and the accept boundary crosses the x-axis at (A / (1 − γ), 0). In this example n = 16.
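  A minimal sketch of these boundary formulae in Python, using the common values from earlier (γ = 2, α = β = 0.1); applying it to the earlier example (failure 3 at a normalized measure of 4.8) reproduces the accept decision:

```python
import math

def rdt_boundaries(n, gamma=2.0, alpha=0.1, beta=0.1):
    """Accept and reject boundary values of the normalized measure T_N
    at failure number n, per the sequential-sampling formulae above."""
    A = math.log(beta / (1 - alpha))   # accept-line constant (negative)
    B = math.log((1 - beta) / alpha)   # reject-line constant (positive)
    accept = (A - n * math.log(gamma)) / (1 - gamma)
    reject = (B - n * math.log(gamma)) / (1 - gamma)
    return accept, reject

accept, reject = rdt_boundaries(n=3)
print(f"accept if T_N >= {accept:.2f}")  # ~4.28; T_N = 4.8 crosses it
# A negative reject value means rejection is not yet possible at this n
print(f"reject if 0 <= T_N <= {reject:.2f}")
```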
  • 112. SRE Review Enables defect discovery rates to be forecast and monitored – helps all staff – enables customer expectations to be managed. Enables reliability targets to be established and monitored. Software FMEA enables failure modes and risks to be identified. Establishes formal and thorough test and analysis methodologies. Provides a method for modeling and demonstrating software reliability. Defines code inspection processes. Drives customer satisfaction!
  • 113. References Adamantios Mettas, “Repairable Systems: Data Analysis and Modeling,” Applied Reliability Symposium, 2008. Michael R. Lyu, “Software Reliability Engineering: A Roadmap.” Dr. Larry Crow, “An Extended Reliability Growth Model for Managing and Assessing Corrective Actions,” Reliability and Maintainability Symposium, 2004. John D. Musa, “Software Reliability Engineering: More Reliable Software Faster and Cheaper,” AuthorHouse, 2004. ReliaSoft RGA 7 Training Guide. Capers Jones, “Applied Software Measurement,” 3rd Edition, McGraw-Hill, 2008.