Lecture entitled "Testing with Fewer Resources: Toward Adaptive Approaches for Cost-effective Test Generation and Selection" at the International Summer School
on Search- and Machine Learning-based Software Engineering
June 22-24, 2022 - Córdoba, Spain
Sebastiano Panichella and Christian Birchler
1. “Testing with Fewer Resources:
Toward Adaptive Approaches for Cost-effective
Test Generation and Selection”
June 22-24, 2022 - Córdoba, Spain
Sebastiano Panichella
Zurich University of Applied Sciences
https://spanichella.github.io/
Christian Birchler
Zurich University of Applied Sciences
https://christianbirchler.github.io/
International Summer School
on Search- and Machine Learning-based
Software Engineering
3. 2015 - Physics
2017 - Computer Science
- Software Systems
- Data Science
2021 - Research Assistant
- Testing Self-driving Cars
3
About the Speakers (Christian)
4. University of Sannio
PhD Student
June 2014
University of Salerno
Master Student
December 2010
University of Zurich
Research Associate
October 2014 - August 2018
Zurich University of
Applied Sciences
Senior Computer Science Researcher
Since August 2018
4
2010
2014
2018
Today
About the Speakers (Sebastiano)
5. 5
Program Comprehension & Maintenance (PC&M)
Generation of source code documentation
Profiling developers
Dependencies analysis in software ecosystems
Mobile computing (MC):
- Machine Learning & Genetic Algorithms
Summarization Techniques
for Code,
Change,
Testing and User Feedback
PhD thesis:
“Supporting Newcomers in Software Development Projects”
Approved Project
Approved Project
Approved Project
Development & Testing challenges:
- Test case generation and assessment
- User Feedback Analysis
- Continuous Delivery
CD anti-patterns
Branch Coverage Prediction
Documentation defects detection
“Complex Systems”
2010
2014
2018
Today
About the Speakers (Sebastiano)
6. Outline
• Context & Motivation:
• Cyber-physical Systems
• DevOps and Artificial Intelligence
• Search-based Software Testing (SBST) Barriers:
• A successful SE story
• Overcoming Barriers of SBST Adoption in Industry
• Cost-effective Testing for Self-Driving Cars (SDCs):
• Regression Testing
• Test Selection & Test Prioritization for SDCs
[Diagram: Initial Tests, Search, Variants Generation, Test Execution, Test Case Selection]
6
10. 1) Cyber-physical Systems
2) Artificial Intelligence (AI)
3) DevOps, IoT, Automated Testing (AT)
Next 10-15 Years (and beyond)
Context
“Our main research goal is to conduct industrial research, involving both industrial and
academic collaborations, to sustain the Internet of Things (IoT) vision of future ‘smart cities’,
with millions of smart systems connected over the internet, and/or controlled by complex
embedded software implemented for the cloud.”
10
11. Sebastiano Panichella Sajad Khatiri
Christian Birchler
COSMOS:
DevOps for Complex Cyber-physical Systems
https://www.cosmos-devops.org/ https://twitter.com/COSMOS_DEVOPS
12. “Emerging Cyber-physical Systems (CPS) will play a crucial role in the quality of
life of European citizens and the future of the European economy”
COSMOS Context
• CPS relevant sectors:
• Healthcare
• Automotive
• Water Monitoring
• Railway
• Manufacturing
• Avionics
• etc.
MEDICAL DELIVERY
FOOD DELIVERY
12
16. • Our (Software Engineering) view of DevOps and AI for IoT systems:
• DevOps and Continuous Delivery (CD): What is it?
• Present, Challenges, and Opportunities
• Relevant Research Questions
• Artificial Intelligence (AI) and Testing Automation:
• Present, Challenges, and Opportunities
• User-oriented Testing Automation
• Relevant Research Questions
“We all recognize the relevance and capacity of contemporary cyber-physical
systems for building the future of our society, but ongoing research in the
field is also clearly failing to take the right countermeasures to prevent
CPS usage from affecting human safety.” In:
“Self-driving Uber kills Arizona
woman in first fatal crash involving
pedestrian”
Problem Statement
“A simple software update was
the direct cause of the fatal
crashes of the Boeing 737”
16
17. Question:
What are the main Challenges of Testing Cyber-physical Systems?
17
Answers from the Audience (1)
18. Question:
What are the main Challenges of Testing Cyber-physical Systems?
18
Answers from the Audience (2)
19. • Our (Software Engineering) view of DevOps and AI for IoT systems:
• DevOps and Continuous Delivery (CD): What is it?
• Present, Challenges, and Opportunities
• Relevant Research Questions
• Artificial Intelligence (AI) and Testing Automation:
• Present, Challenges, and Opportunities
• User-oriented Testing Automation
• Relevant Research Questions
“Self-driving Uber kills Arizona
woman in first fatal crash involving
pedestrian”
“Swiss Post drone
crashes in Zurich”
Challenges
“A simple software update was
the direct cause of the fatal
crashes of the Boeing 737”
Challenge 1: Observability, testability, and predictability of the behavior
of emerging CPS are highly limited and, unfortunately, their usage in the real
world can lead to fatal crashes, sometimes tragically involving humans
19
Tools
23. Research Challenges and Opportunities
As reported by the National Academies
[“A 21st Century Cyber-Physical Systems Education”]:
“today's practice of IoT system design and implementation are often unable to support the level of ‘complexity, scalability, security, safety, […]’ required to meet future needs”
[“Complexity challenges in development of cyber-physical systems”]
(Martin Törngren, Ulf Sellgren, pp. 478-503):
“The main problem is that contemporary development methodologies for CPS need to incorporate core aspects of both the systems and software engineering communities, with the goal to explicitly embrace and consider the several direct and indirect physical effects of software”
“As identified by agile methodologies, the development of modern/emerging systems (e.g., e-health, automotive, satellite, and IoT manufacturing systems) should evolve with the systems, ‘as development never ends’”
These concepts are closely related to DevOps and Artificial Intelligence technologies, and several researchers and practitioners advocate them as promising solutions for the development, maintenance, testing, and evolution of these complex systems
23
Crash of Boeing 737
Tools
24. Research Challenges and Opportunities
Challenge 1: Observability, testability, and predictability of the behavior of emerging CPS are highly limited and, unfortunately, their usage in the real world can lead to fatal crashes, sometimes tragically involving humans
These concepts are closely related to DevOps and Artificial Intelligence technologies, and several researchers and practitioners advocate them as promising solutions for the development, maintenance, testing, and evolution of these complex systems
Challenge 2: Contemporary DevOps and AI practices and tools are potentially the right solution to this problem, but they are not developed to be applied in CPS domains
24
25. • Context & Motivation:
• Cyber-physical Systems
• DevOps and Artificial Intelligence
• Search-based Software Testing (SBST) Barriers:
• A successful SE story
• Overcoming Barriers of SBST Adoption in Industry
• Cost-effective Testing for Self-Driving Cars (SDCs):
• Regression Testing
• Test Selection & Test Prioritization for SDCs
Outline
[Diagram: Initial Tests, Search, Variants Generation, Test Execution, Test Case Selection]
25
26. Traditional DevOps Pipeline
ADSs
• Generate Diversified Test Inputs (or Scenarios)
• Evaluation-based Failure Detection
Manual vs. Automated Testing (SBST)
26
27. Search-Based Software Testing (SBST)
“The initial population” is a set of randomly generated test cases.
Initial Population
Selection
Crossover
Mutation
End?
YES NO
(Fitness Function) We need to select the “fittest” test cases for reproduction
Single-Point Crossover
Mutation: randomly changes some genes (elements within each chromosome).
Mutation probability: each statement is mutated with prob = 1/n, where n = #statements
27
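To make the loop concrete, here is a minimal, self-contained Java sketch of the same GA scheme applied to the Triangle example: a test case is an input triple (a, b, c), the fitness function measures the distance to the EQUILATERAL branch, and each gene is mutated with probability 1/n. All names and parameters are illustrative, not EvoSuite's API.

import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

// Minimal GA loop for test-input generation (illustrative sketch).
public class SimpleSbst {
    static final Random RND = new Random();
    static final int POP = 20, GENES = 3;

    // Fitness function: distance to the EQUILATERAL branch (0 = covered).
    static int fitness(int[] t) {
        return Math.abs(t[0] - t[1]) + Math.abs(t[1] - t[2]);
    }

    static int[] randomTest() {
        int[] t = new int[GENES];
        for (int i = 0; i < GENES; i++) t[i] = RND.nextInt(1000);
        return t;
    }

    public static void main(String[] args) {
        // Initial population: randomly generated test cases.
        int[][] pop = new int[POP][];
        for (int i = 0; i < POP; i++) pop[i] = randomTest();

        for (int gen = 0; gen < 10000; gen++) {              // End? (search budget)
            Arrays.sort(pop, Comparator.comparingInt(SimpleSbst::fitness));
            if (fitness(pop[0]) == 0) {                      // YES: target covered
                System.out.println("Covered by " + Arrays.toString(pop[0]));
                return;
            }
            int[][] next = new int[POP][];
            for (int i = 0; i < POP; i++) {
                // Selection: pick parents among the fittest half.
                int[] p1 = pop[RND.nextInt(POP / 2)];
                int[] p2 = pop[RND.nextInt(POP / 2)];
                // Single-point crossover.
                int cut = 1 + RND.nextInt(GENES - 1);
                int[] child = new int[GENES];
                for (int g = 0; g < GENES; g++) child[g] = g < cut ? p1[g] : p2[g];
                // Mutation: each gene mutated with probability 1/n.
                for (int g = 0; g < GENES; g++)
                    if (RND.nextDouble() < 1.0 / GENES) child[g] = RND.nextInt(1000);
                next[i] = child;
            }
            pop = next;                                      // NO: next generation
        }
        System.out.println("Budget consumed without covering the target.");
    }
}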
28. Search-Based Software Testing (SBST)
Initial Population
Selection
Crossover
Mutation
End?
YES NO
Class Under Test
class Triangle {
void computeTriangleType() {
if (isTriangle()){
if (side1 == side2) {
if (side2 == side3)
type = "EQUILATERAL";
else
type = "ISOSCELES";
} else {
if (side1 == side3) {
type = "ISOSCELES";
} else {
if (side2 == side3)
type = "ISOSCELES";
else
checkRightAngle();
}
}
}// if isTriangle()
}}
[The slide numbers the statements of this listing 1-10]
Goal: Iterate until line 9 is covered or the search budget (running time or #iterations) is consumed
28
29. Search-Based Software Testing (SBST)
Class Under Test
class Triangle {
void computeTriangleType() {
if (isTriangle()){
if (side1 == side2) {
if (side2 == side3)
type = "EQUILATERAL";
else
type = "ISOSCELES";
} else {
if (side1 == side3) {
type = "ISOSCELES";
} else {
if (side2 == side3)
type = "ISOSCELES";
else
checkRightAngle();
}
}
}// if isTriangle()
}}
[The slide numbers the statements of this listing 1-10]
[Figure: Control Flow Graph of the method, with nodes 1-10]
Goal: Covering as many code elements as possible
Branch coverage
Targets = {<1,5>, <1,2>, <5,6>, <5,7>, <2,3>, <2,4>, <6,10>, <7,8>, <7,9>, <3,10>, <4,10>, <8,10>, <9,10>}
Statement coverage
Targets = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
Path coverage
Targets = {<1,5,6,10>, <1,5,7,8,10>, <1,5,7,9,10>, <1,2,3,10>, <1,2,4,10>}
29
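The coverage targets above become numeric fitness values through the classic "approach level plus normalized branch distance" formulation. The Java sketch below is illustrative (not taken from any specific tool) and computes the fitness of a target branch that is only reachable once (side1 == side2) already holds.

// Branch-distance fitness for one coverage target (illustrative sketch).
public class BranchDistance {

    // Branch distance of the true branch of (lhs == rhs): 0 when taken.
    static int distanceEq(int lhs, int rhs) {
        return Math.abs(lhs - rhs);
    }

    // Normalization keeps the distance in [0, 1) so that the integer
    // approach level (missed control dependencies) always dominates.
    static double normalize(int d) {
        return d / (d + 1.0);
    }

    // Fitness of the target "take the true branch of (side2 == side3)",
    // which is nested under (side1 == side2). Lower is better; 0 = covered.
    static double fitness(int side1, int side2, int side3) {
        if (side1 != side2) {
            // Execution diverged one decision above the target:
            // approach level 1 plus the distance at the diverging decision.
            return 1 + normalize(distanceEq(side1, side2));
        }
        return normalize(distanceEq(side2, side3));
    }

    public static void main(String[] args) {
        System.out.println(fitness(3, 5, 5)); // diverged early: ~1.67
        System.out.println(fitness(4, 4, 6)); // reached the decision: ~0.67
        System.out.println(fitness(4, 4, 4)); // target covered: 0.0
    }
}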
33. What are the main barriers to the adoption of SBST tools in practice
(e.g., in industrial settings)?
Question
“Manual Testing is still Dominant in Industry…”
33
Answers from the Audience
34. SBST Barrier to Practical Adoption:
Test Code Comprehension
Are Generated Tests Helpful?
“Modeling Readability to Improve Unit Tests”, by Ermira Daka, José Campos, and Gordon Fraser (University of Sheffield, UK), and Jonathan Dorn and Westley Weimer (University of Virginia, USA).
[The slide shows the paper's first page. In brief: the authors build a domain-specific model of unit test readability based on human judgements and use it to augment automated unit test generation; in human studies, users preferred the improved tests and answered maintenance questions about them 14% more quickly at the same level of accuracy. The paper's Figure 1 shows two versions of a test that exercise the same functionality but differ in appearance and readability.]
“Does Automated White-Box Test Generation Really Help Software Testers?”, by Gordon Fraser and Phil McMinn (University of Sheffield, UK), Matt Staats (KAIST, South Korea), Andrea Arcuri (Simula Research Laboratory, Norway), and Frank Padberg (Karlsruhe Institute of Technology, Germany).
[The slide shows the paper's first page. In brief: a controlled experiment with 49 subjects compared writing tests manually against writing tests with the aid of the automated white-box test generation tool EVOSUITE; tool support led to clear improvements in code coverage (up to a 300% increase) but to no measurable improvement in the number of bugs actually found by developers.]
Developers spend up to 50% of their time in
understanding and analyzing the output of
automatic tools.
Fraser et al.
“Professional developers perceive
generated test cases as hard to
understand.”
Daka et al.
34
36. SBST Barrier to Practical Adoption:
Test Code Comprehension
Why?
Class Name: Option.java
Library: Apache Commons-Cli
Q1: What are the main
differences?
Generated Tests
Q2: Do they cover different
parts of the code?
Candidate
Assertions
Q3: Are these
assertions correct?
Earl T. Barr et al., “The Oracle Problem in Software Testing: A Survey”. IEEE Transactions on Software Engineering, 2015.
36
38. SBST Barrier to Practical Adoption:
Test Code Comprehension
Are Generated Tests Helpful?
G. Fraser et al., Does Automated Unit Test Generation
Really Help Software Testers? A Controlled Empirical
Study, TOSEM 2015. 38
Automatically generated tests do not
improve the ability of developers to detect
faults when compared to manual testing.
39. 39
?
SBST Barrier to Practical Adoption:
Addressing Test Code Comprehension
Test Case
How to Generate Test Case Summary?
Panichella et al. “The Impact of Test Case Summaries
on Bug Fixing Performance: An Empirical Investigation”.
ICSE 2016
40. How to Generate Test Case Summary?
SBST Barrier to Practical Adoption:
Addressing Test Code Comprehension
Panichella et al. “The Impact of Test Case Summaries
on Bug Fixing Performance: An Empirical Investigation”.
ICSE 2016
40
?
Generated Unit Test
… with Descriptions
40
41. How to Generate Test Case Summary?
SBST Barrier to Practical Adoption:
Addressing Test Code Comprehension ?
Panichella et al. “The Impact of Test Case Summaries
on Bug Fixing Performance: An Empirical Investigation”.
ICSE 2016
41
http://textcompactor.com/
Intuition
41
42. Summary Generator
Software Words Usage Model: deriving <actions>, <themes>, and
<secondary arguments> from class, methods, attributes and
variable identifiers
E. Hill et al. Automatically capturing
source code context of NL-queries for
software maintenance and reuse.
ICSE 2009
42
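As a rough illustration of the identifier analysis behind such a model, the Java sketch below splits a camelCase method name into an <action> and a <theme>; this is a deliberate simplification of Hill et al.'s approach, for intuition only.

import java.util.Arrays;
import java.util.List;

// Toy Software-Words-Usage-Model-style identifier analysis (illustrative).
public class SwumSketch {

    // "computeTriangleType" -> [compute, Triangle, Type]
    static List<String> splitCamelCase(String identifier) {
        return Arrays.asList(identifier.split("(?<=[a-z])(?=[A-Z])"));
    }

    public static void main(String[] args) {
        List<String> words = splitCamelCase("computeTriangleType");
        String action = words.get(0).toLowerCase();            // <action>: "compute"
        String theme = String.join(" ",
                words.subList(1, words.size())).toLowerCase(); // <theme>: "triangle type"
        System.out.println("The method " + action + "s the " + theme + ".");
    }
}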
49. Summary Generator
[Figure: the test code parsed into part-of-speech tags (NOUN, VERB, ADJ, CON), shown as “Parsed Code” next to the generated “Natural Language Sentences”]
The test case instantiates an "Option" with:
- option equal to “...”
- long option equal to “...”
- it has no argument
- description equal to “…”
An option validator validates it
The test exercises the following condition:
- "Option" has no argument
49
50. The test case instantiates an "Option"
with:
- option equal to “...”
- long option equal to “...”
- it has no argument
- description equal to “…”
An option validator validates it
The test exercises the following
condition:
- "Option" has no argument
Natural Language Sentences
Class
Level
Method
Level
Statement
Level
Branch
Level
Summarization Levels
50
52. Summary Generator
The test case instantiates an "Option"
with:
- option equal to “...”
- long option equal to “...”
- it has no argument
- description equal to “…”
An option validator validates it
The test exercises the following
condition:
- "Option" has no argument
Natural Language Sentences
52
Q1: Do Test Summaries Help Developers find more bugs?
Q2: Do Test Summaries Improve Test Readability?
54. Subjects: 30 participants (23 researchers and 7 developers)
Context
Object: two Java classes from Apache Commons Primitives and
Math4J that have been used in previous studies on search-based
software testing [by Fraser et al. TOSEM 2015]
ArrayIntList.java
Rational.java
55. Bug Fixing Tasks
Group 1: ArrayIntList.java, Rational.java
Group 2: ArrayIntList.java, Rational.java
55
58. Bug Fixing Tasks
Group 1 Group 2
ArrayIntList.java
Rational.java ArrayIntList.java
Rational.java
Comments Comments
TestDescriber
58
59. Bug Fixing Tasks
Experiment conducted offline via a survey platform
Each participant received the experiment package consisting of:
1. A pretest questionnaire
2. Instructions and materials to perform the experiment
3. A post-test questionnaire
We did not reveal the goal of the study
45 minutes allotted for each task
59
61. Q1: How do test case summaries impact the
number of bugs fixed by developers?
61
Comments
Summary: Using automatically generated test case summaries
significantly helps developers to identify and fix more bugs.
64. WITHOUT Summaries:
(i) Only 15% of participants
consider the test cases as
“easy to understand”.
(ii) 40% of participants
considered the test cases
as incomprehensible.
WITH Summaries:
(i) 46% of participants
consider the test cases as
“easy to understand”.
(ii) Only 18% of participants
considered the test cases
as incomprehensible.
[Bar chart: ratings on a scale of Very Low, Low, Medium, High, Very High]
Perceived test comprehensibility WITH and
WITHOUT TestDescriber summaries
64
Q2: Do Test Summaries Improve Test Readability?
Comments
Summary: Test summaries statistically improve the comprehensibility of
automatically generated test cases according to human judgments.
65. 1) Using automatically generated test
case summaries significantly helps
developers to identify and fix more bugs.
2) Test summaries statistically improve
the comprehensibility of automatically
generated test cases according to
human judgments.
Panichella et al. “The Impact of Test Case
Summaries on Bug Fixing Performance: An
Empirical Investigation”. ICSE 2016
65
SBST Barrier to Practical Adoption:
Addressing Test Code Comprehension
66. 66
SBST Barrier to Practical Adoption:
Addressing Test Code Comprehension
Other Studies Addressing this Open Problem…
Daka et al. Generating unit tests with descriptive names or:
would you name your children thing1 and thing2? ISSTA 2017
Generating unit tests with descriptive names
Panichella et al. Revisiting Test Smells in Automatically Generated
Tests: Limitations, Pitfalls, and Opportunities. ICSME 2020
Test Smells in Automatically Generated Tests
68. SBST Barrier to Practical Adoption:
Cost-effectiveness of Generated Tests
Why?
68
class Triangle {
int a, b, c; //sides
String type = "NOT_TRIANGLE";
Triangle (int a, int b, int c){…}
void computeTriangleType() {
1. if (a == b) {
2. if (b == c)
3. type = "EQUILATERAL";
else
4. type = "ISOSCELES";
} else {
5. if (a == c) {
6. type = "ISOSCELES";
} else {
7. if (b == c)
8. type = "ISOSCELES";
else
9. type = "SCALENE";
}
}
}
Java Class Under Test (CUT)
@Test
public void test(){
Triangle t = new Triangle (1,2,3);
t.computeTriangleType();
String type = t.getType();
assertTrue(type.equals("SCALENE"));
}
Test Case
Code Coverage:
The main
Quality Assessment
Criteria
69. SBST Barrier to Practical Adoption:
Cost-effectiveness of Generated Tests
Why?
69
class Triangle {
int a, b, c; //sides
String type = "NOT_TRIANGLE";
Triangle (int a, int b, int c){…}
void computeTriangleType() {
1. if (a == b) {
2. if (b == c)
3. type = "EQUILATERAL";
else
4. type = "ISOSCELES";
} else {
5. if (a == c) {
6. type = "ISOSCELES";
} else {
7. if (b == c)
8. type = "ISOSCELES";
else
9. type = "SCALENE";
}
}
}
Java Class Under Test (CUT)
Practical Constraints? Software Quality, Money, Time
71. Test Generation Tool 1
TestClass A TestClass B
Class A Class B Class C Class D
Cost Effectiveness: example
TestClass C TestClass D
Test Generation Tool 2
TestClass A TestClass B TestClass C TestClass D
BUG BUG
71
73. Test Generation Tool 1
TestClass A TestClass B
Cost Effectiveness: example
TestClass C TestClass D
Test Generation Tool 2
TestClass A TestClass B TestClass C TestClass D
BUG BUG
Coverage 67%
Coverage 66.5%
We need COST-oriented models
+20%
Manual vs. Automated Testing
73
74. 74
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
Automatically generating tests with
appropriate performance (CPU, memory, etc.)
when deployed in different environments
Grano et al. “Testing with Fewer Resources: An
Adaptive Approach to Performance-Aware Test
Case Generation”. TSE 2019
It uses indicators of Test Coverage…
+
Further Performance Indicators…
?
75. 75
Grano et al. “Testing with Fewer Resources: An
Adaptive Approach to Performance-Aware Test
Case Generation”. TSE 2019
Cached history
information
Kim at al.
ICSE 2007
Change Metrics
Moser at al.
ICSE 2008.
A metrics suite for
object oriented design
Chidamber at al.
TSE 1994
Indicators of Complexity
Cost-effective Generation of Tests
76. 76
“We needed indicators (or
proxies) of test performance
(CPU, memory, etc.)…”
Grano et al. “Testing with Fewer Resources: An
Adaptive Approach to Performance-Aware Test
Case Generation”. TSE 2019
Cached history information
Kim et al., ICSE 2007
Change Metrics
Moser et al., ICSE 2008
A metrics suite for object oriented design
Chidamber et al., TSE 1994
Indicators of Complexity
Cost-effective Generation of Tests
77. pDynaMOSA (Adaptive Performance-Aware DynaMOSA)
pDynaMOSA Pipeline
2)
1)
First Criteria
Second Criteria
Yes
No
Yes
No
Coverage
Performance
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
77
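Reading the pipeline from the slide: candidate tests are compared on coverage first, and performance proxies act only as the second criterion. Below is a minimal Java sketch of that two-criteria preference; the class and field names are hypothetical, not the tool's actual API.

import java.util.Comparator;

// Two-criteria preference: coverage first, performance as tie-breaker.
class CandidateTest {
    double coverageFitness;   // 1st criterion: distance to uncovered targets (lower = better)
    double performanceScore;  // 2nd criterion: runtime/memory proxy score (lower = cheaper)
}

class TwoCriteriaPreference {
    static final Comparator<CandidateTest> PREFERENCE =
        Comparator.<CandidateTest>comparingDouble(t -> t.coverageFitness)
                  .thenComparingDouble(t -> t.performanceScore);
}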
81. De Oliveira “Perphecy: Performance regression test
selection made simple but effective,” (ICST 2017)
I1. Number of executed loops (branches)
(Higher loop cycle counts influence the runtime of the test case).
2)
1)
First Criteria
Second Criteria
Yes
No
Yes
No
Coverage
Performance
I2. Number of (test and code) method calls
(Fewer method calls result in shorter runtimes and
lower heap memory usage due to potentially fewer object
instantiations.).
I3. Number of object instantiations (not size)
(reducing the number of instantiated objects may lead to
decreased usage of heap memory - e.g., arrays dimension).
I4. Number of executed (test and code) Statements
(Statement execution frequency is a well-known proxy for
runtime).
I5. Test Length (LOC of test case)
(Superset of I2 and I4)
A set of static (test) and dynamic (prod. code) performance proxies that provide
an approximation of the test execution costs (i.e., runtime and memory usage).
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
Second Criteria
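A hedged sketch of how the five proxies might be combined into one cost score: the counters would come from instrumentation, and the flat, unweighted sum here is our assumption for illustration, not the paper's exact formula.

// Aggregating the I1-I5 performance proxies into one score (illustrative).
class PerformanceProxies {
    long executedLoops;        // I1: executed loop iterations
    long methodCalls;          // I2: test and code method calls
    long objectInstantiations; // I3: instantiated objects (heap pressure)
    long executedStatements;   // I4: executed statements
    long testLength;           // I5: LOC of the test case

    // Lower is cheaper; a real implementation would weight and normalize.
    double score() {
        return executedLoops + methodCalls + objectInstantiations
             + executedStatements + testLength;
    }
}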
82. De Oliveira “Perphecy: Performance regression test
selection made simple but effective,” (ICST 2017)
I1. Number of executed loops (branches)
(Higher loop cycle counts influence the runtime of the test case).
I2. Number of (test and code) method calls
(Fewer method calls result in shorter runtimes and
lower heap memory usage due to potentially fewer object
instantiations.).
I3. Number of object instantiations (not size)
(reducing the number of instantiated objects may lead to
decreased usage of heap memory - e.g., arrays dimension).
I4. Number of executed (test and code) Statements
(Statement execution frequency is a well-known proxy for
runtime).
I5. Test Length (LOC of test case)
(Superset of I2 and I4)
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
Dataset from SBST Community
G. Fraser et al., “A large-scale evaluation of automated unit test generation using EvoSuite,”
ACM Transactions on Software Engineering and Methodology (TOSEM).
Evaluation
83. Q1. (Effectiveness) What is the target coverage achieved by
pDynaMOSA compared to DynaMOSA?
Small gain in terms of coverage… 83
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
84. Q2. (Fault Detection) What is the mutation score achieved
by pDynaMOSA compared to DynaMOSA?
We may lose in terms of fault detection… 84
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
85. Q3. (Performance) Does the adoption of performance proxies
lead to shorter runtime and lower heap memory consumption?
Huge benefits in terms of runtime and memory consumption…
Small gain in terms of coverage…
We may lose in terms of fault detection…
“Statistically rigorous Java performance evaluation,” OOPSLA ’07.
85
SBST Barrier to Practical Adoption:
Cost-effective Generation of Tests
86. • Context & Motivation:
• Cyber-physical Systems
• DevOps and Artificial Intelligence
• Search-based Software Testing (SBST) Barriers:
• A successful SE story
• Overcoming Barriers of SBST Adoption in Industry
• Cost-effective Testing for Self-Driving Cars (SDCs):
• Regression Testing
• Test Selection & Test Prioritization for SDCs
Outline
[Diagram: Initial Tests, Search, Variants Generation, Test Execution, Test Case Selection]
86
87. Tesla Car
Autonomous Driving Systems (ADSs)
Multi-sensing Systems:
• Autonomous systems capture surrounding
environmental data at run-time via
multiple sensors (e.g. camera, radar, lidar)
as inputs
• Process these data with Deep Neural Networks (DNNs) and output control decisions (e.g. steering).
• Require robust testing that creates realistic, diverse test cases
87
88. Traffic Sign Recognition (TSR)
Pedestrian Protection (PP) Lane Departure Warning (LDW)
Automated Emergency Braking (AEB)
Environmental Data Collection With ADSs Sensors
88
91. Testing Steps in ADSs
91
Requirements of Testing ADSs
• Generate Diversified Test Inputs (or Scenarios)
• Evaluation-based Failure Detection
“Manual Testing is still
Dominant…”
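To illustrate diversified test-input generation for ADSs: the Java sketch below models a test scenario as a sequence of 2D road control points and derives variants by mutating one point. This is a toy model; real scenario generators must also enforce road validity constraints.

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy road-scenario generator for ADS testing (illustrative only).
public class RoadScenarioGenerator {
    static final Random RND = new Random();

    // A random road: points move forward in x and drift in y, producing curves.
    static List<double[]> randomRoad(int points) {
        List<double[]> road = new ArrayList<>();
        double x = 0, y = 0;
        for (int i = 0; i < points; i++) {
            x += 10 + RND.nextDouble() * 10;
            y += RND.nextGaussian() * 15;
            road.add(new double[] { x, y });
        }
        return road;
    }

    // Variant generation: perturb one control point (sharpen/flatten a curve).
    static List<double[]> mutate(List<double[]> road) {
        List<double[]> copy = new ArrayList<>();
        for (double[] p : road) copy.add(p.clone());
        copy.get(RND.nextInt(copy.size()))[1] += RND.nextGaussian() * 20;
        return copy;
    }
}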
93. 93
class Triangle {
int a, b, c; //sides
String type = "NOT_TRIANGLE";
Triangle (int a, int b, int c){…}
void computeTriangleType() {
1. if (a == b) {
2. if (b == c)
3. type = "EQUILATERAL";
else
4. type = "ISOSCELES";
} else {
5. if (a == c) {
6. type = "ISOSCELES";
} else {
7. if (b == c)
8. type = "ISOSCELES";
else
9. type = "SCALENE";
}
}
}
Java Class Under Test (CUT)
@Test
public void test(){
Triangle t = new Triangle (1,2,3);
t.computeTriangleType();
String type = t.getType();
assertTrue(type.equals("SCALENE"));
}
Test Case
Code Coverage:
The main
Quality Assessment
Criteria
Traditional Development Pipeline:
Coding vs. Testing
94. 94
class Triangle {
int a, b, c; //sides
String type = "NOT_TRIANGLE";
Triangle (int a, int b, int c){…}
void computeTriangleType() {
1. if (a == b) {
2. if (b == c)
3. type = "EQUILATERAL";
else
4. type = "ISOSCELES";
} else {
5. if (a == c) {
6. type = "ISOSCELES";
} else {
7. if (b == c)
8. type = "ISOSCELES";
else
9. type = "SCALENE";
}
}
}
Java Class Under Test (CUT)
@Test
public void test(){
Triangle t = new Triangle (1,2,3);
t.computeTriangleType();
String type = t.getType();
assertTrue(type.equals("SCALENE"));
}
Test Case
Traditional Development Pipeline:
Coding vs. Testing
Code Coverage:
Not Sufficient as
Quality Assessment
Criteria
95. Challenges of Testing ADSs
95
Challenge 1:
Code coverage
vs.
Scenario Coverage
Challenge 2:
Code coverage
&
CPU & Memory
consumption
Challenge 3:
Unit-Test
vs.
System-level Testing
118. Regression Testing
118
“Regression testing is re-running functional and non-functional tests to ensure that
previously developed and tested software still performs after a change.”
Anirban Basu, 2015
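A minimal Java sketch of one classic realization of this idea, change-based regression test selection: re-run only the tests whose covered classes intersect the changed classes. The coverage map would come from a previous instrumented run; all names are illustrative.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Change-based regression test selection (illustrative sketch).
public class RegressionTestSelection {
    static List<String> select(Map<String, Set<String>> coveragePerTest,
                               Set<String> changedClasses) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : coveragePerTest.entrySet()) {
            // A test is affected if it covers at least one changed class.
            if (!Collections.disjoint(e.getValue(), changedClasses)) {
                selected.add(e.getKey());
            }
        }
        return selected; // only these tests need to be re-run
    }
}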
140. Cost-effectiveness
140
[Bar chart: speed-up compared to a random-selection baseline (scale 0%-170%) for Dataset 1 (Logistic), Dataset 1 (Naïve Bayes), and Dataset 2 (Logistic)]
Finding 1:
SDC-Scissor effectively speeds up the testing by 170%.
Finding 2:
Logistic and Naïve Bayes classifiers save the most time.
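A sketch of the selection idea behind these findings: predict, from static road features, whether an SDC test is likely to fail, and send only the likely failers to the (expensive) simulator. The classifier interface below is hypothetical; in SDC-Scissor the models (e.g., Logistic Regression, Naïve Bayes) are trained offline on previously executed tests.

import java.util.ArrayList;
import java.util.List;

// Hypothetical classifier interface: probability that an SDC test fails,
// computed from static road features (e.g., number of turns, curvature).
interface FailurePredictor {
    double failureProbability(double[] roadFeatures);
}

// Pre-execution test selection: skip tests that are likely to pass.
class SdcTestSelection {
    static List<double[]> selectLikelyFailing(List<double[]> candidateFeatures,
                                              FailurePredictor model,
                                              double threshold) {
        List<double[]> selected = new ArrayList<>();
        for (double[] features : candidateFeatures) {
            if (model.failureProbability(features) >= threshold) {
                selected.add(features); // worth the simulation time
            }
        }
        return selected;
    }
}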
167. • Context & Motivation:
• Cyber-physical Systems
• DevOps and Artificial Intelligence
• Search-based Software Testing (SBST) Barriers:
• A successful SE story
• Overcoming Barriers of SBST Adoption in Industry
• Cost-effective Testing for Self-Driving Cars (SDCs):
• Regression Testing
• Test Selection & Test Prioritization for SDCs
Summary
[Diagram: Initial Tests, Search, Variants Generation, Test Execution, Test Case Selection]
167
168. Thanks for the Attention!
• Any Questions?
“Testing with Fewer Resources:
Toward Adaptive Approaches for Cost-effective Test Generation and Selection”
June 22-24, 2022 - Córdoba, Spain
Christian Birchler
Zurich University of Applied Sciences
https://christianbirchler.github.io/
Sebastiano Panichella
Zurich University of Applied Sciences
https://spanichella.github.io/