Lecture entitled "Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective Test Generation and Selection" at the International Summer School
on Search- and Machine Learning-based Software Engineering
June 22-24, 2022 - Córdoba, Spain
Sebastiano Panichella and Christian Birchler
1. “Testing with Fewer Resources:
Toward Adaptive Approaches for Cost-effective
Test Generation and Selection”
June 22-24, 2022 - Córdoba, Spain
Sebastiano Panichella
Zurich University of Applied Sciences
https://spanichella.github.io/
Christian Birchler
Zurich University of Applied Sciences
https://christianbirchler.github.io/
International Summer School
on Search- and Machine Learning-based
Software Engineering
3. 2015 - Physics
2017 - Computer Science
- Software Systems
- Data Science
2021 - Research Assistant
- Testing Self-driving Cars
3
About the Speakers (Christian)
4. University of Sannio
PhD Student
June 2014
University of Salerno
Master Student
December 2010
University of Zurich
Research Associate
October 2014 - August 2018
Zurich University of
Applied Sciences
Senior Computer Science Researcher
Since August 2018
4
2010
2014
2018
Today
About the Speakers (Sebastiano)
5. 5
Program Comprehension & Maintenance (PC&M)
Generation of source code documentation
Profiling developers
Dependencies analysis in software ecosystems
Mobile computing (MC):
- Machine Learning & Genetic Algorithms
Summarization Techniques
for Code,
Change,
Testing and User Feedback
PhD thesis:
“Supporting Newcomers in Software Development Projects”
Approved Project
Approved Project
Approved Project
Development & Testing challenges:
- Test case generation and assessment
- User Feedback Analysis
- Continuous Delivery
CD anti-patterns
Branch Coverage Prediction
Documentation defects detection
“Complex Systems”
2010
2014
2018
Today
About the Speakers (Sebastiano)
6. Outline
• Context & Motivation:
• Cyber-physical Systems
• DevOps and Artificial Intelligence
• Search-based Software Testing (SBST) Barriers:
• A successful SE story
• Overcoming Barriers of SBST Adoption in Industry
• Cost-effective Testing for Self-Driving Cars (SDCs):
• Regression Testing
• Test Selection & Test Prioritization for SDCs
[Diagram: Initial Tests, Search, Variants Generation, Test Execution, Test Case Selection]
6
10. 1) Cyber-physical Systems
2) Artificial Intelligence (AI)
3) DevOps, IoT, Automated Testing (AT)
Next 10-15 Years (and beyond)
Context
“Our main research goal is to conduct industrial research, involving both industrial and
academic collaborations, to sustain the Internet of Things (IoT) vision of future ‘smart cities’,
with millions of smart systems connected over the internet, and/or controlled by complex
embedded software implemented for the cloud.”
10
11. Sebastiano Panichella Sajad Khatiri
Christian Birchler
COSMOS:
DevOps for Complex Cyber-physical Systems
https://www.cosmos-devops.org/ https://twitter.com/COSMOS_DEVOPS
12. “Emerging Cyber-physical Systems (CPS) will play a crucial role in the quality of
life of European citizens and the future of the European economy”
COSMOS Context
• CPS relevant sectors:
• Healthcare
• Automotive
• Water Monitoring
• Railway
• Manufacturing
• Avionics
• etc.
MEDICAL DELIVERY
FOOD DELIVERY
12
16. • Our (Software Engineering) view of DevOps and AI for IoT systems:
• DevOps and Continuous Delivery (CD): What is it?
• Present, Challenges, and Opportunities
• Relevant Research Questions
• Artificial Intelligence (AI) and Testing Automation:
• Present, Challenges, and Opportunities
• User-oriented Testing Automation
• Relevant Research Questions
“We all recognize the relevance and capacity of contemporary cyber-physical
systems for building the future of our society, but ongoing research in the
field is also clearly failing to take the right countermeasures to prevent
CPS usage from affecting human safety.” In:
“Self-driving Uber kills Arizona
woman in first fatal crash involving
pedestrian”
Problem Statement
“A simple software update was
the direct cause of the fatal
crashes of the Boeing 737”
16
17. Question:
What are the main Challenges of Testing Cyber-physical Systems?
17
Answers from the Audience (1)
18. Question:
What are the main Challenges of Testing Cyber-physical Systems?
18
Answers from the Audience (2)
19. • Our (Software Engineering) view of DevOps and AI for IoT systems:
• DevOps and Continuous Delivery (CD): What is it?
• Present, Challenges, and Opportunities
• Relevant Research Questions
• Artificial Intelligence (AI) and Testing Automation:
• Present, Challenges, and Opportunities
• User-oriented Testing Automation
• Relevant Research Questions
“Self-driving Uber kills Arizona
woman in first fatal crash involving
pedestrian”
“Swiss Post drone
crashes in Zurich”
Challenges
“A simple software update was
the direct cause of the fatal
crashes of the Boeing 737”
Challenge 1: Observability, testability, and predictability of the behavior
of emerging CPS are highly limited and, unfortunately, their usage in the real
world can lead to fatal crashes, sometimes tragically involving humans
19
Tools
23. Research Challenges and Opportunities
As reported by the National Academies
[“A 21st Century Cyber-Physical Systems Education”]:
“today's practice of IoT system design and implementation are often unable to support the level of ‘complexity, scalability, security, safety, […]’ required to meet future needs”
[“Complexity challenges in development of cyber-physical systems”]
(Martin Törngren, Ulf Sellgren, pp. 478-503):
“The main problem is that contemporary development methodologies for CPS need to incorporate core aspects of both the systems and software engineering communities, with the goal to explicitly embrace and consider the several direct and indirect physical effects of software”
“As identified by agile methodologies, the development of modern/emerging systems (e.g., e-health, automotive, satellite, and IoT manufacturing systems) should evolve with the systems, ‘as development never ends’”
These concepts are closely related to DevOps and Artificial Intelligence technologies, and several researchers and practitioners advocate them as promising solutions for the development, maintenance, testing, and evolution of these complex systems
23
Crash of Boeing 737
Tools
24. Research Challenges and Opportunities
Challenge 1: Observability, testability, and predictability of the behavior of emerging CPS are highly limited and, unfortunately, their usage in the real world can lead to fatal crashes, sometimes tragically involving humans
These concepts are closely related to DevOps and Artificial Intelligence technologies, and several researchers and practitioners advocate them as promising solutions for the development, maintenance, testing, and evolution of these complex systems
Challenge 2: Contemporary DevOps and AI practices and tools are potentially the right solution to this problem, but they are not developed to be applied in CPS domains
24
25. • Context & Motivation:
• Cyber-physical Systems
• DevOps and Artificial Intelligence
• Search-based Software Testing (SBST) Barriers:
• A successful SE story
• Overcoming Barriers of SBST Adoption in Industry
• Cost-effective Testing for Self-Driving Cars (SDCs):
• Regression Testing
• Test Selection & Test Prioritization for SDCs
Outline
[Diagram: Initial Tests, Search, Variants Generation, Test Execution, Test Case Selection]
25
26. Traditional DevOps Pipeline
ADSs
• Generate Diversified Test Inputs (or Scenarios)
• Evaluation-based Failure Detection
Manual vs. Automated Testing (SBST)
26
27. Search-Based Software Testing (SBST)
“The initial population” is a set of randomly generated test cases.
Initial Population
Selection
Crossover
Mutation
End?
YES NO
(Fitness Function) We need to select the “fittest” test cases for reproduction
Single-Point Crossover
Mutation: randomly changes some genes (elements within each chromosome).
Mutation probability: each statement is mutated with prob = 1/n, where n = #statements
27
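To make the loop concrete, here is a minimal, self-contained Java sketch of the same GA scheme applied to the Triangle example: a test case is an input triple (a, b, c), the fitness function measures the distance to the EQUILATERAL branch, and each gene is mutated with probability 1/n. All names and parameters are illustrative, not EvoSuite's API.

import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Minimal GA loop for test-input generation (illustrative sketch).
public class SimpleSbst {
    static final Random RND = new Random();
    static final int POP = 20, GENES = 3;

    // Fitness function: distance to the EQUILATERAL branch (0 = covered).
    static int fitness(int[] t) {
        return Math.abs(t[0] - t[1]) + Math.abs(t[1] - t[2]);
    }

    static int[] randomTest() {
        int[] t = new int[GENES];
        for (int i = 0; i < GENES; i++) t[i] = RND.nextInt(1000);
        return t;
    }

    public static void main(String[] args) {
        // Initial population: randomly generated test cases.
        int[][] pop = new int[POP][];
        for (int i = 0; i < POP; i++) pop[i] = randomTest();

        for (int gen = 0; gen < 10000; gen++) {              // End? (search budget)
            Arrays.sort(pop, Comparator.comparingInt(SimpleSbst::fitness));
            if (fitness(pop[0]) == 0) {                      // YES: target covered
                System.out.println("Covered by " + Arrays.toString(pop[0]));
                return;
            }
            int[][] next = new int[POP][];
            for (int i = 0; i < POP; i++) {
                // Selection: pick parents among the fittest half.
                int[] p1 = pop[RND.nextInt(POP / 2)];
                int[] p2 = pop[RND.nextInt(POP / 2)];
                // Single-point crossover.
                int cut = 1 + RND.nextInt(GENES - 1);
                int[] child = new int[GENES];
                for (int g = 0; g < GENES; g++) child[g] = g < cut ? p1[g] : p2[g];
                // Mutation: each gene mutated with probability 1/n.
                for (int g = 0; g < GENES; g++)
                    if (RND.nextDouble() < 1.0 / GENES) child[g] = RND.nextInt(1000);
                next[i] = child;
            }
            pop = next;                                      // NO: next generation
        }
        System.out.println("Budget consumed without covering the target.");
    }
}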
28. Search-Based Software Testing (SBST)
Initial Population
Selection
Crossover
Mutation
End?
YES NO
Class Under Test
class Triangle {
void computeTriangleType() {
if (isTriangle()){
if (side1 == side2) {
if (side2 == side3)
type = "EQUILATERAL";
else
type = "ISOSCELES";
} else {
if (side1 == side3) {
type = "ISOSCELES";
} else {
if (side2 == side3)
type = "ISOSCELES";
else
checkRightAngle();
}
}
}// if isTriangle()
}}
[The slide numbers the statements of this listing 1-10]
Goal: Iterate until line 9 is covered or the search budget (running time or #iterations) is consumed
28
29. Search-Based Software Testing (SBST)
Class Under Test
class Triangle {
void computeTriangleType() {
if (isTriangle()){
if (side1 == side2) {
if (side2 == side3)
type = "EQUILATERAL";
else
type = "ISOSCELES";
} else {
if (side1 == side3) {
type = "ISOSCELES";
} else {
if (side2 == side3)
type = "ISOSCELES";
else
checkRightAngle();
}
}
}// if isTriangle()
}}
[The slide numbers the statements of this listing 1-10]
[Figure: Control Flow Graph of the method, with nodes 1-10]
Goal: Covering as many code elements as possible
Branch coverage
Targets = {<1,5>, <1,2>, <5,6>, <5,7>, <2,3>, <2,4>, <6,10>, <7,8>, <7,9>, <3,10>, <4,10>, <8,10>, <9,10>}
Statement coverage
Targets = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
Path coverage
Targets = {<1,5,6,10>, <1,5,7,8,10>, <1,5,7,9,10>, <1,2,3,10>, <1,2,4,10>}
29
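The coverage targets above become numeric fitness values through the classic "approach level plus normalized branch distance" formulation. The Java sketch below is illustrative (not taken from any specific tool) and computes the fitness of a target branch that is only reachable once (side1 == side2) already holds.

// Branch-distance fitness for one coverage target (illustrative sketch).
public class BranchDistance {

    // Branch distance of the true branch of (lhs == rhs): 0 when taken.
    static int distanceEq(int lhs, int rhs) {
        return Math.abs(lhs - rhs);
    }

    // Normalization keeps the distance in [0, 1) so that the integer
    // approach level (missed control dependencies) always dominates.
    static double normalize(int d) {
        return d / (d + 1.0);
    }

    // Fitness of the target "take the true branch of (side2 == side3)",
    // which is nested under (side1 == side2). Lower is better; 0 = covered.
    static double fitness(int side1, int side2, int side3) {
        if (side1 != side2) {
            // Execution diverged one decision above the target:
            // approach level 1 plus the distance at the diverging decision.
            return 1 + normalize(distanceEq(side1, side2));
        }
        return normalize(distanceEq(side2, side3));
    }

    public static void main(String[] args) {
        System.out.println(fitness(3, 5, 5)); // diverged early: ~1.67
        System.out.println(fitness(4, 4, 6)); // reached the decision: ~0.67
        System.out.println(fitness(4, 4, 4)); // target covered: 0.0
    }
}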
33. What are the main barriers to the adoption of SBST tools in practice
(e.g., in industrial settings)?
Question
“Manual Testing is still Dominant in Industry…”
33
Answers from the Audience
34. SBST Barrier to Practical Adoption:
Test Code Comprehension
Are Generated Tests Helpful?
“Modeling Readability to Improve Unit Tests”, by Ermira Daka, José Campos, and Gordon Fraser (University of Sheffield, UK), and Jonathan Dorn and Westley Weimer (University of Virginia, USA).
[The slide shows the paper's first page. In brief: the authors build a domain-specific model of unit test readability based on human judgements and use it to augment automated unit test generation; in human studies, users preferred the improved tests and answered maintenance questions about them 14% more quickly at the same level of accuracy. The paper's Figure 1 shows two versions of a test that exercise the same functionality but differ in appearance and readability.]
“Does Automated White-Box Test Generation Really Help Software Testers?”, by Gordon Fraser and Phil McMinn (University of Sheffield, UK), Matt Staats (KAIST, South Korea), Andrea Arcuri (Simula Research Laboratory, Norway), and Frank Padberg (Karlsruhe Institute of Technology, Germany).
[The slide shows the paper's first page. In brief: a controlled experiment with 49 subjects compared writing tests manually against writing tests with the aid of the automated white-box test generation tool EVOSUITE; tool support led to clear improvements in code coverage (up to a 300% increase) but to no measurable improvement in the number of bugs actually found by developers.]
Developers spend up to 50% of their time in
understanding and analyzing the output of
automatic tools.
Fraser et al.
“Professional developers perceive
generated test cases as hard to
understand.”
Daka et al.
34
36. SBST Barrier to Practical Adoption:
Test Code Comprehension
Why?
Class Name: Option.java
Library: Apache Commons-Cli
Q1: What are the main
differences?
Generated Tests
Q2: Do they cover different
parts of the code?
Candidate
Assertions
Q3: Are these
assertions correct?
Earl T. Barr et al., “The Oracle Problem in Software Testing: A Survey”. IEEE Transactions on Software Engineering, 2015.
36
38. SBST Barrier to Practical Adoption:
Test Code Comprehension
Are Generated Tests Helpful?
G. Fraser et al., Does Automated Unit Test Generation
Really Help Software Testers? A Controlled Empirical
Study, TOSEM 2015. 38
Automatically generated tests do not
improve the ability of developers to detect
faults when compared to manual testing.
39. 39
?
SBST Barrier to Practical Adoption:
Addressing Test Code Comprehension
Test Case
How to Generate Test Case Summary?
Panichella et al. “The Impact of Test Case Summaries
on Bug Fixing Performance: An Empirical Investigation”.
ICSE 2016
40. How to Generate Test Case Summary?
SBST Barrier to Practical Adoption:
Addressing Test Code Comprehension
Panichella et al. “The Impact of Test Case Summaries
on Bug Fixing Performance: An Empirical Investigation”.
ICSE 2016
40
?
Generated Unit Test
… with Descriptions
40
41. How to Generate Test Case Summary?
SBST Barrier to Practical Adoption:
Addressing Test Code Comprehension ?
Panichella et al. “The Impact of Test Case Summaries
on Bug Fixing Performance: An Empirical Investigation”.
ICSE 2016
41
http://textcompactor.com/
Intuition
41
42. Summary Generator
Software Words Usage Model: deriving <actions>, <themes>, and
<secondary arguments> from class, methods, attributes and
variable identifiers
E. Hill et al. Automatically capturing
source code context of NL-queries for
software maintenance and reuse.
ICSE 2009
42
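As a rough illustration of the identifier analysis behind such a model, the Java sketch below splits a camelCase method name into an <action> and a <theme>; this is a deliberate simplification of Hill et al.'s approach, for intuition only.

import java.util.Arrays;
import java.util.List;

// Toy Software-Words-Usage-Model-style identifier analysis (illustrative).
public class SwumSketch {

    // "computeTriangleType" -> [compute, Triangle, Type]
    static List<String> splitCamelCase(String identifier) {
        return Arrays.asList(identifier.split("(?<=[a-z])(?=[A-Z])"));
    }

    public static void main(String[] args) {
        List<String> words = splitCamelCase("computeTriangleType");
        String action = words.get(0).toLowerCase();            // <action>: "compute"
        String theme = String.join(" ",
                words.subList(1, words.size())).toLowerCase(); // <theme>: "triangle type"
        System.out.println("The method " + action + "s the " + theme + ".");
    }
}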
49. Summary Generator
[Figure: the test code parsed into part-of-speech tags (NOUN, VERB, ADJ, CON), shown as “Parsed Code” next to the generated “Natural Language Sentences”]
The test case instantiates an "Option" with:
- option equal to “...”
- long option equal to “...”
- it has no argument
- description equal to “…”
An option validator validates it
The test exercises the following condition:
- "Option" has no argument
49
50. The test case instantiates an "Option"
with:
- option equal to “...”
- long option equal to “...”
- it has no argument
- description equal to “…”
An option validator validates it
The test exercises the following
condition:
- "Option" has no argument
Natural Language Sentences
Class
Level
Method
Level
Statement
Level
Branch
Level
Summarization Levels
50
52. Summary Generator
The test case instantiates an "Option"
with:
- option equal to “...”
- long option equal to “...”
- it has no argument
- description equal to “…”
An option validator validates it
The test exercises the following
condition:
- "Option" has no argument
Natural Language Sentences
52
Q1: Do Test Summaries Help Developers find more bugs?
Q2: Do Test Summaries Improve Test Readability?
54. Subjects: 30 participants (23 researchers and 7 developers)
Context
Object: two Java classes from Apache Commons Primitives and
Math4J that have been used in previous studies on search-based
software testing [by Fraser et al. TOSEM 2015]
ArrayIntList.java
Rational.java
55. Bug Fixing Tasks
Group 1: ArrayIntList.java, Rational.java
Group 2: ArrayIntList.java, Rational.java
55
58. Bug Fixing Tasks
Group 1 Group 2
ArrayIntList.java
Rational.java ArrayIntList.java
Rational.java
Comments Comments
TestDescriber
58
59. Bug Fixing Tasks
Experiment conducted offline via a survey platform
Each participant received the experiment package consisting of:
1. A pretest questionnaire
2. Instructions and materials to perform the experiment
3. A post-test questionnaire
We did not reveal the goal of the study
45 minutes allotted for each task
59
61. Q1: How do test case summaries impact the
number of bugs fixed by developers?
61
Comments
Summary: Using automatically generated test case summaries
significantly helps developers to identify and fix more bugs.
64. WITHOUT Summaries:
(i) Only 15% of participants
consider the test cases as
“easy to understand”.
(ii) 40% of participants
considered the test cases
as incomprehensible.
WITH Summaries:
(i) 46% of participants
consider the test cases as
“easy to understand”.
(ii) Only 18% of participants
considered the test cases
as incomprehensible.
[Bar chart: ratings on a scale of Very Low, Low, Medium, High, Very High]
Perceived test comprehensibility WITH and
WITHOUT TestDescriber summaries
64
Q2: Do Test Summaries Improve Test Readability?
Comments
Summary: Test summaries statistically improve the comprehensibility of
automatically generated test cases according to human judgments.
65. 1) Using automatically generated test
case summaries significantly helps
developers to identify and fix more bugs.
2) Test summaries statistically improve
the comprehensibility of automatically
generated test cases according to
human judgments.
Panichella et al. “The Impact of Test Case
Summaries on Bug Fixing Performance: An
Empirical Investigation”. ICSE 2016
65
SBST Barrier to Practical Adoption:
Addressing Test Code Comprehension
66. 66
SBST Barrier to Practical Adoption:
Addressing Test Code Comprehension
Other Studies Addressing this Open Problem…
Daka et al. Generating unit tests with descriptive names or:
would you name your children thing1 and thing2? ISSTA 2017
Generating unit tests with descriptive names
Panichella et al. Revisiting Test Smells in Automatically Generated
Tests: Limitations, Pitfalls, and Opportunities. ICSME 2020
Test Smells in Automatically Generated Tests
68. SBST Barrier to Practical Adoption:
Cost-effectiveness of Generated Tests
Why?
68
class Triangle {
int a, b, c; //sides
String type = "NOT_TRIANGLE";
Triangle (int a, int b, int c){…}
void computeTriangleType() {
1. if (a == b) {
2. if (b == c)
3. type = "EQUILATERAL";
else
4. type = "ISOSCELES";
} else {
5. if (a == c) {
6. type = "ISOSCELES";
} else {
7. if (b == c)
8. type = "ISOSCELES";
else
9. type = "SCALENE";
}
}
}
Java Class Under Test (CUT)
@Test
public void test(){
Triangle t = new Triangle (1,2,3);
t.computeTriangleType();
String type = t.getType();
assertTrue(type.equals("SCALENE"));
}
Test Case
Code Coverage:
The main
Quality Assessment
Criteria
69. SBST Barrier to Practical Adoption:
Cost-effectiveness of Generated Tests
Why?
69
class Triangle {
int a, b, c; //sides
String type = "NOT_TRIANGLE";
Triangle (int a, int b, int c){…}
void computeTriangleType() {
1. if (a == b) {
2. if (b == c)
3. type = "EQUILATERAL";
else
4. type = "ISOSCELES";
} else {
5. if (a == c) {
6. type = "ISOSCELES";
} else {
7. if (b == c)
8. type = "ISOSCELES";
else
9. type = "SCALENE";
}
}
}
Java Class Under Test (CUT)
Practical Constraints? Software Quality, Money, Time
71. Test Generation Tool 1
TestClass A TestClass B
Class A Class B Class C Class D
Cost Effectiveness: example
TestClass C TestClass D
Test Generation Tool 2
TestClass A TestClass B TestClass C TestClass D
BUG BUG
71
73. Test Generation Tool 1
TestClass A TestClass B
Cost Effectiveness: example
TestClass C TestClass D
Test Generation Tool 2
TestClass A TestClass B TestClass C TestClass D
BUG BUG
Coverage 67%
Coverage 66.5%
We need COST-oriented models
+20%
Manual vs. Automated Testing
73
74. 74
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
Automatically generating tests with
appropriate performance (CPU, memory, etc.)
when deployed in different environments
Grano et al. “Testing with Fewer Resources: An
Adaptive Approach to Performance-Aware Test
Case Generation”. TSE 2019
It uses indicators of Test Coverage…
+
Further Performance Indicators…
?
75. 75
Grano et al. “Testing with Fewer Resources: An
Adaptive Approach to Performance-Aware Test
Case Generation”. TSE 2019
Cached history
information
Kim at al.
ICSE 2007
Change Metrics
Moser at al.
ICSE 2008.
A metrics suite for
object oriented design
Chidamber at al.
TSE 1994
Indicators of Complexity
Cost-effective Generation of Tests
76. 76
“We needed indicators (or
proxies) of test performance
(CPU, memory, etc.)…”
Grano et al. “Testing with Fewer Resources: An
Adaptive Approach to Performance-Aware Test
Case Generation”. TSE 2019
Cached history information
Kim et al., ICSE 2007
Change Metrics
Moser et al., ICSE 2008
A metrics suite for object oriented design
Chidamber et al., TSE 1994
Indicators of Complexity
Cost-effective Generation of Tests
77. pDynaMOSA (Adaptive Performance-Aware DynaMOSA)
pDynaMOSA Pipeline
2)
1)
First Criteria
Second Criteria
Yes
No
Yes
No
Coverage
Performance
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
77
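Reading the pipeline from the slide: candidate tests are compared on coverage first, and performance proxies act only as the second criterion. Below is a minimal Java sketch of that two-criteria preference; the class and field names are hypothetical, not the tool's actual API.

import java.util.Comparator;

// Two-criteria preference: coverage first, performance as tie-breaker.
class CandidateTest {
    double coverageFitness;   // 1st criterion: distance to uncovered targets (lower = better)
    double performanceScore;  // 2nd criterion: runtime/memory proxy score (lower = cheaper)
}

class TwoCriteriaPreference {
    static final Comparator<CandidateTest> PREFERENCE =
        Comparator.<CandidateTest>comparingDouble(t -> t.coverageFitness)
                  .thenComparingDouble(t -> t.performanceScore);
}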
81. De Oliveira “Perphecy: Performance regression test
selection made simple but effective,” (ICST 2017)
I1. Number of executed loops (branches)
(Higher loop cycle counts influence the runtime of the test case).
2)
1)
First Criteria
Second Criteria
Yes
No
Yes
No
Coverage
Performance
I2. Number of (test and code) method calls
(Fewer method calls result in shorter runtimes and
lower heap memory usage due to potentially fewer object
instantiations.).
I3. Number of object instantiations (not size)
(reducing the number of instantiated objects may lead to
decreased usage of heap memory - e.g., arrays dimension).
I4. Number of executed (test and code) Statements
(Statement execution frequency is a well-known proxy for
runtime).
I5. Test Length (LOC of test case)
(Superset of I2 and I4)
A set of static (test) and dynamic (prod. code) performance proxies that provide
an approximation of the test execution costs (i.e., runtime and memory usage).
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
Second Criteria
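A hedged sketch of how the five proxies might be combined into one cost score: the counters would come from instrumentation, and the flat, unweighted sum here is our assumption for illustration, not the paper's exact formula.

// Aggregating the I1-I5 performance proxies into one score (illustrative).
class PerformanceProxies {
    long executedLoops;        // I1: executed loop iterations
    long methodCalls;          // I2: test and code method calls
    long objectInstantiations; // I3: instantiated objects (heap pressure)
    long executedStatements;   // I4: executed statements
    long testLength;           // I5: LOC of the test case

    // Lower is cheaper; a real implementation would weight and normalize.
    double score() {
        return executedLoops + methodCalls + objectInstantiations
             + executedStatements + testLength;
    }
}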
82. De Oliveira “Perphecy: Performance regression test
selection made simple but effective,” (ICST 2017)
I1. Number of executed loops (branches)
(Higher loop cycle counts influence the runtime of the test case).
I2. Number of (test and code) method calls
(Fewer method calls result in shorter runtimes and
lower heap memory usage due to potentially fewer object
instantiations.).
I3. Number of object instantiations (not size)
(reducing the number of instantiated objects may lead to
decreased usage of heap memory - e.g., arrays dimension).
I4. Number of executed (test and code) Statements
(Statement execution frequency is a well-known proxy for
runtime).
I5. Test Length (LOC of test case)
(Superset of I2 and I4)
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
Dataset from SBST Community
G. Fraser et al., “A large-scale evaluation of automated unit test generation using EvoSuite,”
ACM Transactions on Software Engineering and Methodology (TOSEM).
Evaluation
83. Q1. (Effectiveness) What is the target coverage achieved by
pDynaMOSA compared to DynaMOSA?
Small gain in terms of coverage… 83
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
84. Q2. (Fault Detection) What is the mutation score achieved
by pDynaMOSA compared to DynaMOSA?
We may lose in terms of fault detection… 84
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
85. Q3. (Performance) Does the adoption of performance proxies
lead to shorter runtime and lower heap memory consumption?
Huge benefits in terms of runtime and memory consumption…
Small gain in terms of coverage…
We may lose in terms of fault detection…
“Statistically rigorous Java performance evaluation,” OOPSLA ’07.
85
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
86. • Context & Motivation:
• Cyber-physical Systems
• DevOps and Artificial Intelligence
• Search-based Software Testing (SBST) Barriers:
• A successful SE story
• Overcoming Barriers of SBST Adoption in Industry
• Cost-effective Testing for Self-Driving Cars (SDCs):
• Regression Testing
• Test Selection & Test Prioritization for SDCs
Outline
[Diagram: Initial Tests, Search, Variants Generation, Test Execution, Test Case Selection]
86
87. Tesla Car
Autonomous Driving Systems (ADSs)
Multi-sensing Systems:
• Autonomous systems capture surrounding
environmental data at run-time via
multiple sensors (e.g. camera, radar, lidar)
as inputs
• Process these data with Deep Neural Networks (DNNs) and output control decisions (e.g. steering).
• Require robust testing that creates realistic, diverse test cases
87
88. Traffic Sign Recognition (TSR)
Pedestrian Protection (PP) Lane Departure Warning (LDW)
Automated Emergency Braking (AEB)
Environmental Data Collection With ADSs Sensors
88
91. Testing Steps in ADSs
91
Requirements of Testing ADSs
• Generate Diversified Test Inputs (or Scenarios)
• Evaluation-based Failure Detection
“Manual Testing is still
Dominant…”
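To illustrate diversified test-input generation for ADSs: the Java sketch below models a test scenario as a sequence of 2D road control points and derives variants by mutating one point. This is a toy model; real scenario generators must also enforce road validity constraints.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy road-scenario generator for ADS testing (illustrative only).
public class RoadScenarioGenerator {
    static final Random RND = new Random();

    // A random road: points move forward in x and drift in y, producing curves.
    static List<double[]> randomRoad(int points) {
        List<double[]> road = new ArrayList<>();
        double x = 0, y = 0;
        for (int i = 0; i < points; i++) {
            x += 10 + RND.nextDouble() * 10;
            y += RND.nextGaussian() * 15;
            road.add(new double[] { x, y });
        }
        return road;
    }

    // Variant generation: perturb one control point (sharpen/flatten a curve).
    static List<double[]> mutate(List<double[]> road) {
        List<double[]> copy = new ArrayList<>();
        for (double[] p : road) copy.add(p.clone());
        copy.get(RND.nextInt(copy.size()))[1] += RND.nextGaussian() * 20;
        return copy;
    }
}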
93. 93
class Triangle {
int a, b, c; //sides
String type = "NOT_TRIANGLE";
Triangle (int a, int b, int c){…}
void computeTriangleType() {
1. if (a == b) {
2. if (b == c)
3. type = "EQUILATERAL";
else
4. type = "ISOSCELES";
} else {
5. if (a == c) {
6. type = "ISOSCELES";
} else {
7. if (b == c)
8. type = "ISOSCELES";
else
9. type = "SCALENE";
}
}
}
Java Class Under Test (CUT)
@Test
public void test(){
Triangle t = new Triangle (1,2,3);
t.computeTriangleType();
String type = t.getType();
assertTrue(type.equals("SCALENE"));
}
Test Case
Code Coverage:
The main
Quality Assessment
Criteria
Traditional Development Pipeline:
Coding vs. Testing
94. 94
class Triangle {
int a, b, c; //sides
String type = "NOT_TRIANGLE";
Triangle (int a, int b, int c){…}
void computeTriangleType() {
1. if (a == b) {
2. if (b == c)
3. type = "EQUILATERAL";
else
4. type = "ISOSCELES";
} else {
5. if (a == c) {
6. type = "ISOSCELES";
} else {
7. if (b == c)
8. type = "ISOSCELES";
else
9. type = "SCALENE";
}
}
}
Java Class Under Test (CUT)
@Test
public void test(){
Triangle t = new Triangle (1,2,3);
t.computeTriangleType();
String type = t.getType();
assertTrue(type.equals("SCALENE"));
}
Test Case
Traditional Development Pipeline:
Coding vs. Testing
Code Coverage:
Not Sufficient as
Quality Assessment
Criteria
95. Challenges of Testing ADSs
95
Challenge 1:
Code coverage
vs.
Scenario Coverage
Challenge 2:
Code coverage
&
CPU & Memory
consumption
Challenge 3:
Unit-Test
vs.
System-level Testing
118. Regression Testing
118
“Regression testing is re-running functional and non-functional tests to ensure that
previously developed and tested software still performs after a change.”
Anirban Basu, 2015
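A minimal Java sketch of one classic realization of this idea, change-based regression test selection: re-run only the tests whose covered classes intersect the changed classes. The coverage map would come from a previous instrumented run; all names are illustrative.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Change-based regression test selection (illustrative sketch).
public class RegressionTestSelection {
    static List<String> select(Map<String, Set<String>> coveragePerTest,
                               Set<String> changedClasses) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : coveragePerTest.entrySet()) {
            // A test is affected if it covers at least one changed class.
            if (!Collections.disjoint(e.getValue(), changedClasses)) {
                selected.add(e.getKey());
            }
        }
        return selected; // only these tests need to be re-run
    }
}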
140. Cost-effectiveness
140
[Bar chart: speed-up compared to a random-selection baseline (scale 0%-170%) for Dataset 1 (Logistic), Dataset 1 (Naïve Bayes), and Dataset 2 (Logistic)]
Finding 1:
SDC-Scissor effectively speeds up the testing by 170%.
Finding 2:
Logistic and Naïve Bayes classifiers save the most time.
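A sketch of the selection idea behind these findings: predict, from static road features, whether an SDC test is likely to fail, and send only the likely failers to the (expensive) simulator. The classifier interface below is hypothetical; in SDC-Scissor the models (e.g., Logistic Regression, Naïve Bayes) are trained offline on previously executed tests.

import java.util.ArrayList;
import java.util.List;

// Hypothetical classifier interface: probability that an SDC test fails,
// computed from static road features (e.g., number of turns, curvature).
interface FailurePredictor {
    double failureProbability(double[] roadFeatures);
}

// Pre-execution test selection: skip tests that are likely to pass.
class SdcTestSelection {
    static List<double[]> selectLikelyFailing(List<double[]> candidateFeatures,
                                              FailurePredictor model,
                                              double threshold) {
        List<double[]> selected = new ArrayList<>();
        for (double[] features : candidateFeatures) {
            if (model.failureProbability(features) >= threshold) {
                selected.add(features); // worth the simulation time
            }
        }
        return selected;
    }
}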
167. • Context & Motivation:
• Cyber-physical Systems
• DevOps and Artificial Intelligence
• Search-based Software Testing (SBST) Barriers:
• A successful SE story
• Overcoming Barriers of SBST Adoption in Industry
• Cost-effective Testing for Self-Driving Cars (SDCs):
• Regression Testing
• Test Selection & Test Prioritization for SDCs
Summary
[Diagram: Initial Tests, Search, Variants Generation, Test Execution, Test Case Selection]
167
168. Thanks for the Attention!
• Any Questions?
“Testing with Fewer Resources:
Toward Adaptive Approaches for Cost-effective Test Generation and Selection”
June 22-24, 2022 - Córdoba, Spain
Christian Birchler
Zurich University of Applied Sciences
https://christianbirchler.github.io/
Sebastiano Panichella
Zurich University of Applied Sciences
https://spanichella.github.io/