Complete lecture on controlled experiments in software engineering. It presents practical guidelines for conducting controlled experiments and describes the concepts of dependent, independent, and controlled variables, significance, and p-value. It also explains how to select the appropriate statistical test for a hypothesis, and gives examples of data for different typical tests.
Finally, it discusses threats to validity in controlled experiments and gives indications for reporting.
Find the video lectures here: https://www.youtube.com/playlist?list=PLSKM4VZcJjV-P3fFJYMu2OhlTjEr9Bjl0
Controlled experiments, Hypothesis Testing, Test Selection, Threats to Validity
1. Controlled Experiments in Software Engineering
cf. Pfleeger, 1995 https://doi.org/10.1007/BF02249052
cf. S. Easterbrook http://www.cs.toronto.edu/~sme/CSC2130/04-experiments.pdf
Alessio Ferrari, ISTI-CNR, Pisa, Italy
alessio.ferrari@isti.cnr.it
2. Controlled Experiments
aka Laboratory Experiments
aka Experiment
The ABC of Software Engineering Research: In Vitro Experiment
The GOAL is a Precise Measure of Behaviour
3. Typical Examples
• With software subjects: Tools A and B are automatic tools for testing; I want to compare them (no need to involve people)
• With human subjects: Method M is a manual strategy for finding bugs. How effective is it for experts? How effective is it for novices?
• With human and software subjects:
• Tool T is an interactive tool for testing; I want to see whether it is more appropriate for novices or for experts
• Tools A and B are interactive tools for testing; I want to compare them (I have to involve people)
• Tools A and B are interactive tools for testing; I want to see which one is more appropriate for novices and which one for experts
• Tool A and method M are two approaches for finding bugs; I want to see which one is better
4. Controlled Experiments and Theories
[Diagram: the research cycle. Theory leads, by Deduction, to a Hypothesis, which is put to the Test; Observation leads, by Induction (or Abduction), back to Theory. Controlled experiments follow the DEDUCTIVE APPROACH.]
5. Controlled Experiments: Process
PREPARATION: Theory → Research Question → Hypothesis and Variable Definition → Research Design → Define Measures for Variables
EXECUTION: Recruit Participants / Select Artifacts → Collect Data
REPORTING: Analyse Data → Report Answers → Discuss
Validity concerns accompany the whole process: Internal Validity, External Validity, Construct & Conclusion Validity.
The process normally starts from a Theory and discusses/modifies it in relation to the results.
Typically QUANTITATIVE
9. Controlled Experiments: Elements
Hypothesis → Design: independent variables, dependent variables, controlled variables; the combinations of independent-variable values define the Treatments. This part requires your creativity.
Collect Data: Variable Measurements → Data from Experiment.
Analyse Data: the Test turns the Data from the Experiment into a Test Statistic and a p-value (compared against the Significance α); an Effect size computation yields the Effect Size. This part is mostly automated (but you need to understand it!).
11. Controlled Experiment
• "Experimental investigation of a testable hypothesis, in which conditions are set up to isolate the variables of interest (independent variables) and test how they affect certain measurable outcomes (the dependent variables)"
INDEPENDENT variables, aka FACTORS (e.g., testing tool) → DEPENDENT variables (e.g., number of bugs)
Each combination of values of the independent variables is a TREATMENT: e.g., Treatment 1 (testing tool A), Treatment 2 (testing tool B).
To ISOLATE the independent variables, the other variables need to be CONTROLLED (e.g., variables concerning the code samples on which the test is performed).
cf. S. Easterbrook http://www.cs.toronto.edu/~sme/CSC2130/04-experiments.pdf
13. Controlled Experiments
INDEPENDENT variables (e.g., testing tool) → TREATMENTS: Treatment 1 (e.g., testing tool A), Treatment 2 (e.g., testing tool B) → DEPENDENT variables (e.g., number of bugs)
CONTROLLED variables (e.g., sample length, type of language, complexity): those related to objects should be general, representative, and equivalent for each treatment; those related to human subjects should be homogeneous.
Controlled variables when human subjects are involved may concern experience of developers, age, etc.
14. Definitions
• Hypothesis: the statement I want to test with the experiment
• Derived from a research question (e.g., What is the difference between A and B in terms of bug detection capability?)
• Includes variables that represent constructs of interest (e.g., tools, methods, actors, number of bugs)
• Concerns the measurable impact that a certain variation of some construct can have on other constructs (e.g., Tool A finds more bugs than tool B; Tool A finds fewer or the same number of bugs as tool B)
• I normally have a NULL and an Alternative hypothesis; the one I will test is the NULL hypothesis, but the one I am interested in is the Alternative one (we'll see this later)
15. Definitions
• Independent Variables (INPUT): operationalisation of constructs that I want to isolate, and whose values I want to manipulate (e.g., the tool, the expertise of actors)
• Treatments: combinations of values for the independent variables (tool A, tool B → 1 variable, two treatments; tool A and experts, tool A and novices, tool B and experts, tool B and novices → 2 variables, 4 treatments)
• Dependent Variables (OUTPUT): operationalisation of constructs that I want to measure based on the manipulation of the independent variables (e.g., number of bugs)
• Controlled Variables: attributes* of human subjects or objects that I need to control to mask or prevent their impact on the dependent variables (e.g., I have to test on some code that is sufficiently general, and equivalent for all cases)
* = operationalisation of constructs
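The notion of treatments as combinations of independent-variable values can be made concrete with a short sketch (the factor names and levels are illustrative, not from a real experiment):

```python
from itertools import product

# Each independent variable (factor) has a set of levels.
factors = {
    "tool": ["A", "B"],
    "experience": ["novice", "expert"],
}

# A treatment is one combination of levels, one value per factor.
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]

# 2 factors with 2 levels each -> 4 treatments,
# e.g. {"tool": "A", "experience": "novice"}
```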
18. Example: Software
• Objective: I want to understand which is the better testing tool among two available choices, A and B
• The independent variable is already identified: the tool (one factor)
• Treatments are also straightforward: tool A and tool B (two treatments)
• I am missing the dependent variable: I have to detail what I mean by better. Better in terms of speed? Better in terms of bugs found? Both! Ok, I already have two dependent variables, which I can define as:
• "effectiveness" = number of bugs found / total number of bugs
• "efficiency" = running time / number of bugs found
• Now I have to identify the controlled variables: what can impact effectiveness and efficiency, besides the type of tool? The user? Maybe not, if the tool is fully automatic. The language of the code? Well, I want to focus only on C code. The chosen code? Well yes, but which attributes of the chosen code?
• number of bugs in the code module
• length of the module
• complexity of the module
• domain of the code
• …
I have to create a code sample that has sufficient variation in all of the controlled variables.
If I cannot vary a certain variable, I have to fix it (e.g., C code, domain) and make this choice explicit, as it limits my scope of interest.
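A minimal sketch of the two dependent-variable measures defined above (the numbers are invented for illustration):

```python
def effectiveness(bugs_found, total_bugs):
    """Fraction of the known bugs that the tool detected (higher is better)."""
    return bugs_found / total_bugs

def efficiency(running_time_s, bugs_found):
    """Running time spent per bug found (lower is better)."""
    return running_time_s / bugs_found

# Hypothetical run of tool A on one code module:
eff = effectiveness(bugs_found=8, total_bugs=10)      # 0.8
cost = efficiency(running_time_s=120, bugs_found=8)   # 15.0 seconds per bug
```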
19. Example: Software and Humans
• Objective: I want to see if the experience of the user affects the effectiveness of a certain testing tool
• The dependent variable is already identified: the effectiveness (bugs found / total bugs)
• I have to identify the independent variables: they should concern the experience of the user. How can I measure it? Years of experience in testing? Score from other colleagues? Well, normally it is better to select one independent variable only, otherwise I need too many treatments and I may not find enough participants! Ok, but what should I compare? 1, 2, 3, 4, 5, etc. years? That is also a lot of treatments; will I find enough people? I have to separate years of experience into levels. How do I select the levels? I have to make some assumptions based on existing literature, or I can take a decision that can be defended
• I decide on two levels, and I partition into two treatments (i.e., two homogeneous groups of people):
• from 0 to 1 years: novices
• more than 5 years: experts
• Now I have to identify the controlled variables: what can impact my outcomes besides the experience of users? Well, age, gender, all demographic variables… and of course, the code on which the tool is applied (previous variables)
• I have to make some choices: I should fix a representative code base, use the same one for all subjects, make sure none of them knows the code in advance, and control demographic variables
• Therefore, for each treatment, I have a group with comparable experience (novice OR expert) but variation in terms of age, gender, and other demographic variables
20. Controlled Experiments: 👍 and 👎
• 👍 Advantages:
• It is SCIENCE, with NUMBERS
• Can be applied to identify cause-effect relationships for specific, well-defined variables
• 👎 Disadvantages:
• Applicable only to well-defined problems in which you can clearly define and isolate variables
• Hard to apply if you cannot simulate the right conditions in the lab (confounding variables may be too many to be controlled)
• The reality of SE has several contextual factors that may make the experiment unrealistic
• It may be hard and costly to recruit adequate subjects (developers have to develop, managers need to manage… often, students are used as proxies)
• Design is time consuming and can get very complicated very easily (which implies that it is also difficult to analyse the results and retain actual control)
21. Hypothesis Testing
cf. Sharma, 2015 https://bit.ly/2wTf7VX
I will provide enough information for you to understand the principles, but to REALLY understand you will need more resources.
I will use the word MAGIC when some concepts need to be taken on faith, or when some measures can be obtained directly from common tools.
Alessio Ferrari, ISTI-CNR, Pisa, Italy
alessio.ferrari@isti.cnr.it
22. Hypothesis
• A hypothesis is a statistically testable statement derived from a theory (and, in practice, from a research question)
• A hypothesis is a predictive statement concerning the impact of some independent variable on some dependent variable
• When we do hypothesis testing, our goal is to refute the negation of the theory
• H0, the NULL hypothesis: The theory does not apply
• Usually expressed as "There is no effect […]": changes of the independent variable do not affect the dependent variable
• It is assumed to be TRUE, unless there is evidence from the data that allows us to REJECT the NULL hypothesis (for this, you need statistical tests)
• H1, the ALTERNATIVE hypothesis: The theory predicts…
• If H0 is rejected, this is evidence that H1 can be correct
24. Example
I imagine I have a method M or tool T for finding bugs, and two groups, novices and experts.
• H0: The experience of the developer does not affect the average time to find bugs
• H0: Average-Time-Novices = Average-Time-Experts
• H1: The experience of the developer affects the average time to find bugs
• H1: Average-Time-Novices ≠ Average-Time-Experts
We speak of a Two-tailed hypothesis to be tested (later you will understand why).
What if I want to know WHO is QUICKER? This formulation does not say anything about that…
25. Example
• But I can find another formulation, with exactly the same experiment: two groups, novices and experts, and I measure the average time to find bugs
• H0: The average time to find bugs of novices is less than or equal to that of experts
• H0: Average-Time-Novices <= Average-Time-Experts
• H1: The average time to find bugs of novices is greater than that of experts
• H1: Average-Time-Novices > Average-Time-Experts
We speak of a One-tailed hypothesis to be tested
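Assuming Python with SciPy, the two formulations above map directly onto the `alternative` parameter of `scipy.stats.ttest_ind` (the timing data are invented for illustration):

```python
from scipy import stats

# Hypothetical minutes needed to find a seeded bug (invented data).
novices = [30, 35, 28, 40, 33, 37]
experts = [22, 25, 27, 20, 24, 26]

# Two-tailed: H0 is Average-Time-Novices = Average-Time-Experts.
t_two, p_two = stats.ttest_ind(novices, experts)

# One-tailed (right): H0 is Average-Time-Novices <= Average-Time-Experts,
# H1 is Average-Time-Novices > Average-Time-Experts.
t_one, p_one = stats.ttest_ind(novices, experts, alternative="greater")

# The test statistic is identical in both cases; only the p-value changes:
# for a positive t, the one-tailed p-value is half the two-tailed one.
```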
26. Test Statistic
• Hypothesis tests normally take all my sample data and convert them into a single value, which is called the test statistic
• The test statistic is just a number, but its value can tell me whether the NULL hypothesis can be REJECTED or not
• Depending on the test that I have to perform, I will have different test statistics
Data from Experiment (time novice 1, time expert 1, time novice 2, time expert 2, …) → Test (e.g., unpaired t-test, which compares the means of two independent samples) → Test Statistic (e.g., t-value = -0.38)
cf. https://bit.ly/39LLOU5
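As a sanity check on the MAGIC, the unpaired (Student's) t statistic can be computed by hand from the pooled variance and compared with SciPy's result (same invented timing data as before):

```python
import math
from statistics import mean, variance
from scipy import stats

def unpaired_t(a, b):
    """Student's t statistic for two independent samples (pooled variance)."""
    na, nb = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

novices = [30, 35, 28, 40, 33, 37]
experts = [22, 25, 27, 20, 24, 26]

t_manual = unpaired_t(novices, experts)
t_scipy, _ = stats.ttest_ind(novices, experts, equal_var=True)
# The two values coincide.
```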
28. Probability Distribution of the Test Statistic
• The assumption is that the NULL hypothesis is TRUE
• Given a population in which the NULL hypothesis is true, I imagine repeating my experiment multiple times and computing the test statistic
• The test statistic will follow a certain distribution. Which one? MAGIC, e.g., Student's t-distribution
[Plot: number of samples with value x vs the possible values x of the test statistic. The distribution is centred on the value that the test statistic has when the data of my experiment confirm exactly the NULL hypothesis, e.g., a t-value = 0 indicates that my data confirm H0 precisely. If H0 is TRUE, most of the times I repeat the experiment the test statistic will fall near the centre; it is unlikely to fall in either tail. If my test statistic falls near the tails, I can REJECT H0… and this is my hope!]
29. • Our final goal is to evaluate whether the test statistic value obtained from our experiment is so rare that it justifies rejecting the NULL hypothesis for the entire population, based on our sample data
• How can I proceed if I do not know the entire distribution of my test statistic? It can be inferred from the statistics of the sampled data and the hypothesis I want to test…
• …in this context we will assume that some MAGIC occurs and we know the distribution of the test statistic
30. Critical Regions
[Plot: distribution of the test statistic (# of samples vs test statistic value), with the two tails marked in red.]
I want the test statistic of my experiment to fall in the tails of the distribution. The Critical Regions (Rejection Regions) are the values acceptable for rejecting the NULL; they identify a red area in the distribution. This area is the risk of rejecting the NULL when it is TRUE. Before the experiment, I set the Critical Regions.
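Once the distribution of the test statistic is known, the critical regions can be computed explicitly; a sketch assuming a t-distribution and SciPy (the degrees of freedom are illustrative):

```python
from scipy import stats

alpha = 0.05
df = 10  # degrees of freedom, e.g. n1 + n2 - 2 for an unpaired t-test

# Two-tailed test: alpha is split between the two tails,
# so each critical region starts at the (1 - alpha/2) quantile.
t_crit = stats.t.ppf(1 - alpha / 2, df)

def in_rejection_region(t_value):
    """Reject H0 when the test statistic falls in either tail."""
    return abs(t_value) > t_crit
```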
31. Level of Significance and Confidence
• The significance level indicates the risk of rejecting a NULL hypothesis when it is true; it is denoted by α
• 0.01, 0.05, 0.1: these are the typical values for α
• (1 − α) is the confidence level; it indicates how confident I want to be about the result of my test
• 0.99, 0.95, 0.9: typical values for (1 − α)
Alpha sets the standard for how extreme the data MUST BE before we can reject the null hypothesis. The p-value indicates how extreme the data ARE (later).
32. Significance and Confidence
[Plot: distribution of the test statistic, with a central area of Confidence Level (1 − α) and, in the tails, the Critical Regions: the acceptable values of the test statistic for rejecting the NULL, whose total area is the Significance Level α.]
Before any experiment I set the significance level, and the corresponding confidence level.
35. Risk of Rejecting the NULL Hypothesis when TRUE

Risk Level   | Significance α | Confidence Level (1 − α) | Intuitive Meaning
Catastrophic | 0.001          | 0.999                    | More than 100 million Euros (large loss of life, e.g., nuclear disaster)
Critical     | 0.01           | 0.99                     | Less than 100 million Euros (a few lives lost, e.g., accident)
Important    | 0.05           | 0.95                     | Less than 100 thousand Euros (no lives lost, some injuries)
Moderate     | 0.10           | 0.90                     | Less than 500 Euros (no injuries)

In software engineering, we normally use these values (α = 0.05 or 0.10).
36. Type I and Type II Errors

REAL Population | Fail to Reject                                                     | Reject
NULL is True    | No Error: my theory is FALSE (1 − α)                               | Type I Error: incorrectly reject the NULL hypothesis (α)
NULL is False   | Type II Error: incorrectly fail to reject the NULL hypothesis (β)  | No Error: my theory is TRUE (1 − β)

Type I: my (alternative) hypothesis is wrong, but I support it anyway
Type II: my (alternative) hypothesis is correct, but I rejected it
We normally focus on minimising Type I errors
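The meaning of α and β can be checked by simulation: repeat the experiment many times on populations where H0 is true (respectively false) and count the wrong decisions. A sketch assuming NumPy and SciPy; the effect size and sample size are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, runs = 0.05, 20, 2000

# Type I error rate: both groups are drawn from the SAME population
# (H0 is true), so every rejection is a false positive.
type1 = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue <= alpha
    for _ in range(runs)
) / runs

# Type II error rate: the group means really differ by one standard
# deviation (H0 is false), so every failure to reject is a miss (beta).
type2 = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(1, 1, n)).pvalue > alpha
    for _ in range(runs)
) / runs

# type1 hovers around alpha; type2 depends on effect and sample size.
```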
37. Two-tailed Test
• H0: The experience of the developer does not affect the average time to find bugs (Average-Time-Novices = Average-Time-Experts); H1: Average-Time-Novices ≠ Average-Time-Experts
[Plot: a central Acceptance region with confidence level (1 − α) = 0.95, and a Rejection Region in each tail, each of area α/2 = 0.025 (2.5%).]
The value of α = 0.05 is split between the tails; α is the risk of rejecting the NULL when true.
38. One-tailed Test (Left)
• H0: The average time to find bugs of novices is greater than or equal to that of experts (Average-Time-Novices >= Average-Time-Experts); H1: Average-Time-Novices < Average-Time-Experts
[Plot: an Acceptance region with confidence level (1 − α) = 0.95, and a single Rejection Region of area α = 0.05 (5%) in the left tail.]
The value of α = 0.05 is all in one tail.
39. One-tailed Test (Right)
• H0: The average time to find bugs of novices is less than or equal to that of experts (Average-Time-Novices <= Average-Time-Experts); H1: Average-Time-Novices > Average-Time-Experts
[Plot: an Acceptance region with confidence level (1 − α) = 0.95, and a single Rejection Region of area α = 0.05 (5%) in the right tail.]
The value of α = 0.05 is all in one tail.
40. p-value
Data from Experiment (time novice 1, time expert 1, time novice 2, time expert 2, …) → Test (e.g., unpaired t-test) → Test Statistic (e.g., t-value = -0.38) and p-value.
The p-value is another number produced by the test: LOW values (e.g., 0.001) are GOOD, HIGH values (e.g., 0.3) are BAD.
41. p-value and α (one-tailed)
[Plot: the test-statistic distribution; MY test statistic value, derived from MY data, is a point on the x axis. The p-value is the (blue) tail area beyond that point; α is the red area plus the blue area.]
cf. https://statisticsbyjim.com/hypothesis-testing/hypothesis-tests-significance-levels-alpha-p-values/
42. p-value and α (two-tailed)
[Plot: the test-statistic distribution; my test statistic value, derived from my data, is a point on the x axis. In each tail, p-value/2 is the blue area and α/2 is the red area plus the blue area.]
For two-tailed tests, α and p are the sum of the areas in the two tails; both α and p are shared between the tails.
cf. https://statisticsbyjim.com/hypothesis-testing/hypothesis-tests-significance-levels-alpha-p-values/
cf. https://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests-significance-levels-alpha-and-p-values-in-statistics
43. p-value
Three different intuitive ways to understand it:
• 1) The p-value indicates the believability of the devil's advocate case that the NULL hypothesis is TRUE given the sample data
• 2) The p-value is the probability of observing a test statistic that is at least as extreme as your test statistic, when you assume that the NULL hypothesis is true
• 3) The p-value indicates to what extent the result may be due to random variation within your data, which makes them different from the actual population
• If the p-value is "very low", then the NULL hypothesis is REJECTED in favour of the alternative hypothesis; otherwise I fail to REJECT
• The meaning of "very low" depends on the selected significance level α
• p-value <= α: I fall in the REJECTION region, H0 is rejected
• p-value > α: I fall in the ACCEPTANCE region, I fail to reject H0
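The decision rule in the last two bullets is mechanical; as a tiny sketch:

```python
def decide(p_value, alpha=0.05):
    """Decision rule of a hypothesis test: compare the p-value with alpha."""
    if p_value <= alpha:
        return "reject H0"         # the result is statistically significant
    return "fail to reject H0"     # NOT the same as accepting H0

decide(0.001)  # "reject H0"
decide(0.3)    # "fail to reject H0"
```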
44. Effect Size
Data from Experiment → Test (e.g., unpaired t-test) → Test Statistic (e.g., t-value = -0.38) and p-value; in addition, an Effect size computation (e.g., Cohen's d) produces the Effect Size (e.g., d = 2).
A statistically significant effect does not necessarily mean a big effect: the effect size measures how big the effect is.
cf. https://en.wikipedia.org/wiki/Effect_size
cf. https://www.simplypsychology.org/effect-size.html
45. Effect Size
• Effect size is a quantitative measure of the magnitude of the treatment effect (e.g., HOW MUCH better is my tool?)
• Effect sizes measure either:
• the size of associations/relationships between variables (HOW MUCH is experience correlated with development speed?)
• the size of differences between group means (HOW MUCH is the difference between tool A and B?)
• There are different ways to measure effect size; the most common are Cohen's d (for differences) and the Pearson r correlation (for associations/relationships), but the choice may also depend on the type of data (categorical vs numeric) and on the type of samples (paired vs unpaired)
Check Wikipedia to find the most appropriate for your case:
cf. https://en.wikipedia.org/wiki/Effect_size
cf. Lakens, 2013 https://doi.org/10.3389/fpsyg.2013.00863
46. Cohen's d
• Difference between the means divided by the standard deviation of the population from which the data were sampled. But how can we know the standard deviation of the population? The same MAGIC as before
• A d of 1 indicates the two groups differ by 1 standard deviation, a d of 2 indicates they differ by 2 standard deviations, and so on. This is how you interpret the values of d that you obtain
https://en.wikipedia.org/wiki/Effect_size
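A minimal implementation of Cohen's d, using the pooled sample standard deviation as a stand-in for the unknown population value (the data are the invented novice/expert times used earlier):

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d: difference of means in units of the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

novices = [30, 35, 28, 40, 33, 37]
experts = [22, 25, 27, 20, 24, 26]
d = cohens_d(novices, experts)  # ~2.7: the groups differ by almost 3 SDs
```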
47. Pearson's r
• Indicates the correlation between variables (e.g., number of bugs vs length of the code)
• Pearson's r can vary in magnitude from −1 to 1:
• −1: perfect negative linear relation
• 1: perfect positive linear relation
• 0: no linear relation between the two variables
• The effect size is low if the value of r is around 0.1, medium if r is around 0.3, and large if r is above 0.5
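With SciPy, Pearson's r is a single call; the module lengths and bug counts below are invented to show a strong positive correlation:

```python
from scipy import stats

# Hypothetical measurements: module length (LOC) vs bugs found in it.
length = [100, 250, 300, 420, 500, 610]
bugs = [2, 4, 5, 7, 9, 11]

r, p_value = stats.pearsonr(length, bugs)
# r close to +1: strong positive linear relation (a large effect, > 0.5)
```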
48. What about Type II Errors?
• In all our evaluations, we assumed that the population conforms to the NULL hypothesis; but what if we make a Type II error (we fail to reject the NULL hypothesis when the actual population rejects it)?
• Well, in these cases, we should also establish a value, normally called β, which is the probability of accepting the NULL hypothesis although it is FALSE
• If the NULL hypothesis is FALSE, this means that my real population follows the alternative hypothesis
51. Type II Errors
[Plot: two overlapping distributions of the test statistic over the set of its possible values x: the distribution if H0 were true and the distribution if H1 were true, with the decision threshold between them marking the areas α and β.]
To have a smaller α I have to push the threshold to the right… α then becomes really small, but β gets larger!
β is the probability of accepting the NULL hypothesis when it is FALSE.
α is the probability of rejecting the NULL hypothesis when it is TRUE.
52. The Hard Truth
• Whenever you try to minimise Type I errors, you end up increasing the chance of Type II errors
• In practice, we mostly look at REJECTING null hypotheses, so we generally focus on Type I errors and alpha values
• Why do we look at rejecting the NULL? (intuitive explanation)
• We are using just one sample to reason about an entire population, so we can REJECT a hypothesis, or FAIL to REJECT, but never accept one
• Accepting the alternative hypothesis would imply repeating the experiment many more times with different samples taken from my actual population and showing that the test statistic follows the distribution of the alternative hypothesis
• Additional intuition: it is easier to disprove "all swans are white" (I need to find only one black swan) than to prove it (I would need to check all possible swans)
53. Summary of Concepts
• When you perform an experiment you have to keep in mind the following key concepts:
• Level of significance α: tells me how much risk I can take; normally set to 0.05, a moderate risk; it is set at the beginning of the experiment
• Test statistic: a value depending on the type of test that I perform; it serves to understand how rare my sample is in a population in which the NULL hypothesis is TRUE; it is produced from my experimental data; the number alone does not say much
• p-value: indicates the probability of rejecting the NULL hypothesis when it is actually TRUE; it is produced from my experimental data; it needs to be compared with α; if it is lower than α, I am happy
• Effect size: indicates how large the difference between two treatments is, or how strong the correlation between independent and dependent variable is; it depends on the chosen test; tables exist to evaluate the effect size
58. Summary from Previous Lecture
Every experiment produces a test statistic (a numerical summary of the data). I imagine performing a set of experiments on a population in which the NULL is true: the test statistic then follows a distribution (# of samples vs test statistic value), centred on the value that the test statistic has when the sample confirms EXACTLY the NULL hypothesis.
[Plot: the distribution of the test statistic when samples come from a population where the NULL is true; my test statistic value, derived from my data via the Statistical Test, is a point on the x axis; α is the tail area beyond the critical value, and the p-value is the (blue) area beyond my test statistic.]
59. Statistical Tests
• A statistical test is a means to establish a test statistic, i.e., a single value derived from the data of my experiment
• Several tests exist, and each test is appropriate for a specific type of experiment
• Two categories of tests exist:
• Parametric Tests: tests that make some assumptions about the population's distribution, e.g., normality, or homogeneous variances of the samples
• Nonparametric Tests: tests that do not make assumptions about the population's distribution. For most of the parametric tests, a nonparametric alternative exists
• Parametric tests have more statistical power (a concept that we did not explore); roughly, they are more likely to lead to the rejection of the NULL hypothesis when it is FALSE (they lead to lower p-values when the NULL is false, and hence reduce Type II errors). You cannot use them for nominal or ordinal data.
• Nonparametric tests are more robust, as they are valid for a larger set of cases, since they do not make strict assumptions about the data. You can use them for nominal and ordinal data, or when the assumptions of the parametric tests do not hold
• You do not know the population, so, in order to use parametric tests, you first have to test how likely it is that your data follow the assumptions of the test that you are going to apply; if they do not follow the assumptions, then use a nonparametric alternative (cf. https://help.xlstat.com/s/article/which-statistical-test-should-you-use?language=en_US)
60. Normality Test (does not apply to nominal or ordinal data)
• Many parametric statistical tests assume that your data are normally distributed (strictly, that the sampling distribution of the mean is normal; in general, if you have more than 30 samples you are safe)
• To check this, apply a normality test to your data, for example Shapiro-Wilk (several others exist)
• The null hypothesis of this test is H0 = the population is normally distributed.
• Thus, if the p-value is less than the chosen α level, the NULL hypothesis is rejected and there is evidence that the tested data are NOT normally distributed.
Here you want the p-value to be LARGER than α, as your NULL hypothesis is the one that you want to support! Hence, THE LARGER the p-value, the BETTER!
There are also ways to transform your data if they are not normally distributed, but be careful, because then the interpretation of the results is not straightforward (check whether non-normality is due to the presence of outliers).
cf. https://bit.ly/2wJAl9l
61. Parametric and Non-parametric Tests (Remark)
• Parametric tests are all those tests that make some assumptions on your data (normality, above all). To use a parametric test you first need to check that its assumptions hold for your data
• Non-parametric tests are the alternatives to use when the normality test (or any other assumption) fails, OR when you are dealing with categorical or ordinal data
• Sometimes non-parametric tests have assumptions too! (check carefully which are the assumptions of non-parametric tests, e.g., cf. https://www.isixsigma.com/tools-templates/hypothesis-testing/nonparametric-distribution-free-not-assumption-free/ )
65. Selecting the right test HOWTO
• In the following, a diagram will be shown to guide you in the selection of the right test, assuming that you have only ONE DEPENDENT VARIABLE, as in most experiments with a manageable design in SE
• The selection of the test depends on:
• The type of dependent variable (nominal, ordinal, interval/ratio)
• The type of hypothesis (difference or relationship/association)
• The number of treatments
• The type of design (single group of subjects vs two groups)
• The number of independent variables
You will not memorise the diagram, but you should know how to follow it.
I will not explain how each test works; you only need to know which one to use.
In this lecture a test is a BLACK box that produces two numbers: test statistic and p-value.
66. Type of Dependent Variable
(decision diagram, assuming ONE dependent variable)
• Nominal DV (labels) → number of independent variables:
  • Zero (only the dependent variable) → Chi-square Goodness of Fit
  • One or more → Chi-square Test of Independence
• Ordinal DV (ordered labels) → type of hypothesis:
  • Relationship → Spearman's Rho
  • Difference → type of design:
    • Single group of subjects → Wilcoxon signed-rank test
    • Different groups of subjects → Mann-Whitney U test
• Interval/Ratio DV (numbers) → see the next diagram
68. Dependent Variable is Interval/Ratio (numbers)
(decision diagram; the list of tests is NOT exhaustive; cf. https://www.socscistatistics.com)
• Relationship hypothesis → Spearman's Rho / Pearson's R
• Difference hypothesis → number of independent variables:
  • Zero → population standard deviation:
    • known → Z-test
    • unknown → T-test (single sample)
  • One or more → type of design:
    • Single group of subjects (repeated measures) → treatments:
      • Two → T-test (paired) / Wilcoxon signed-rank test
      • More than two → One-way ANOVA
    • Different groups of subjects (independent measures) → treatments:
      • Two → T-test (unpaired) / Mann-Whitney U test
      • More than two → number of independent variables:
        • One → One-way ANOVA
        • More than one → Factorial ANOVA
70. Type of Dependent Variable (Example)
IV = independent variable, DV = dependent variable
e.g., IV: none; DV: type of defect (nominal). To what extent does the proportion of defects of a certain type match the expected proportion? → Chi-square Goodness of Fit
71. Type of Dependent Variable (Example)
e.g., IV: code author; DV: defect type (nominal). Is there a link between defect type and code author? → Chi-square Test of Independence
74. Chi-Square Test of Independence (Example)
• RQ: Is there a link between defect type and code author?
• H0: There is no relationship between defect type and code author
• Data: a contingency table of type of defect vs author (e.g., one cell counts the "null pointer" defects in Homer's code)
• Result: Chi-square = 56.32, p < 0.00001 → H0 is REJECTED
• Cramér's V should be used to check the Effect Size (check Wikipedia)!
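Both the chi-square statistic and Cramér's V can be computed directly from the contingency table. A minimal pure-Python sketch follows; the 2×2 author × defect-type table is hypothetical (the slide's own table is not reproduced here):

```python
from math import sqrt

def chi2_independence(table):
    """Chi-square statistic for a contingency table (rows x columns)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n   # expected count under H0
            chi2 += (obs - exp) ** 2 / exp
    return chi2

def cramers_v(table):
    """Cramer's V effect size: sqrt(chi2 / (n * (min(r, c) - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2_independence(table) / (n * k))

# Hypothetical table: rows = two authors, columns = two defect types
table = [[20, 5], [5, 20]]
print(round(chi2_independence(table), 2), round(cramers_v(table), 2))  # 18.0 0.6
```

With real data you would then look up the p-value for the chi-square value with (rows − 1)(columns − 1) degrees of freedom.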
75. Type of Dependent Variable (Example)
e.g., IV: level of experience (two levels: novices, experts); DV: degree of project success (ordinal). Is there a difference in the degree of project success between novices and experts? Different groups of subjects (independent measures) → Mann-Whitney U test
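Used as a black box, the Mann-Whitney U test only needs the ranks of the pooled observations. The following pure-Python sketch computes the U statistic for hypothetical degree-of-success scores with no ties (real implementations also handle ties via average ranks):

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for two independent samples (no ties)."""
    pooled = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # rank 1 = smallest
    r1 = sum(rank[v] for v in x)                     # rank sum of group x
    u1 = r1 - len(x) * (len(x) + 1) / 2
    u2 = len(x) * len(y) - u1
    return min(u1, u2)

# Hypothetical degree-of-success scores: novices vs experts
novices = [12, 14, 17, 21]
experts = [15, 19, 23, 26]
print(mann_whitney_u(novices, experts))  # 3.0
```

The smaller U is then compared against critical-value tables (or a normal approximation for larger samples) to obtain the p-value.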
76. Type of Dependent Variable (Example)
e.g., IV: time of the day (morning, afternoon); DV: level of performance (ordinal). Is there a difference in the performance of the developers between morning and afternoon? Single group of subjects (repeated measures) → Wilcoxon signed-rank test
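The Wilcoxon signed-rank statistic works on the per-subject differences between the two conditions. A minimal pure-Python sketch with hypothetical morning/afternoon scores (chosen so there are no zero differences or ties, which real implementations must handle):

```python
def wilcoxon_w(before, after):
    """Wilcoxon signed-rank W: the smaller of the positive and negative
    rank sums of the per-subject differences (no zeros/ties handled)."""
    diffs = [a - b for b, a in zip(before, after)]
    ranked = sorted(diffs, key=abs)                  # rank by |difference|
    w_pos = sum(i + 1 for i, d in enumerate(ranked) if d > 0)
    w_neg = sum(i + 1 for i, d in enumerate(ranked) if d < 0)
    return min(w_pos, w_neg)

# Hypothetical performance scores of the same subjects, morning vs afternoon
morning   = [10, 14, 9, 16, 12]
afternoon = [11, 12, 12, 20, 17]
print(wilcoxon_w(morning, afternoon))  # 2
```

W is then compared against critical-value tables for the given number of non-zero differences.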
77. Type of Dependent Variable (Example)
e.g., IV: motivation; DV: degree of project success (ordinal). Is there a relationship between the motivation of a person and the degree of project success? → Spearman's Rho
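Spearman's rho only uses the ranks of the two variables. A minimal pure-Python sketch using the rank-difference formula, with hypothetical motivation/success scores and no ties (with ties the general Pearson-on-ranks formula is used instead):

```python
def spearman_rho(x, y):
    """Spearman's rho via rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)),
    valid when there are no ties in either variable."""
    n = len(x)
    rank_x = {v: i + 1 for i, v in enumerate(sorted(x))}
    rank_y = {v: i + 1 for i, v in enumerate(sorted(y))}
    d2 = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical motivation and project-success scores for five subjects
motivation = [3, 7, 5, 9, 1]
success    = [4, 8, 10, 6, 2]
print(spearman_rho(motivation, success))  # 0.6
```

A rho near +1 or −1 indicates a strong monotonic association; a significance test on rho then gives the p-value.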
78. Dependent Variable is Interval/Ratio (Example)
e.g., IV: review duration; DV: number of defects identified. Is there a relationship between review duration and number of defects identified? → Pearson's R (or Spearman's Rho)
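Pearson's R measures the strength of a linear relationship. A minimal pure-Python sketch with hypothetical review durations and defect counts:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ssx * ssy)

# Hypothetical review durations (minutes) and defects identified
duration = [10, 20, 30, 40, 50]
defects  = [1, 3, 2, 5, 4]
print(round(pearson_r(duration, defects), 2))  # 0.8
```

R ranges from −1 to +1; its significance is tested with a t distribution with n − 2 degrees of freedom.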
79. Dependent Variable is Interval/Ratio (Example)
e.g., IV: none; DV: number of defects per code module. Is there a difference between the number of defects identified in the modules and the expected mean value? → T-test (single sample), or Z-test if the population standard deviation is known
80. Dependent Variable is Interval/Ratio (Example)
e.g., IV: tool (WITH/WITHOUT); DV: speed in finding bugs. Does the tool improve the users' speed in finding bugs, i.e., is there a difference in speed WITH and WITHOUT the tool? Single group of subjects, two treatments → T-test (paired)
83. Paired T-test (Example)
a.k.a. repeated-measures t-test, paired samples t-test, matched pairs t-test, matched samples t-test
• I have a new tool to support bug identification in code review, and I want to understand whether it is effective or not
• RQ: Does the tool improve the users' speed of finding bugs?
• Independent Variable: tool (YES/NO), two treatments (TOOL/NO-TOOL)
• Dependent Variable: speed = number of bugs found/minute
• H0: the speed with the tool is lower than or equal to the speed without the tool (i.e., the tool does not improve speed)
• Design: I have 13 users and ONE code file to review. I will let them first do the bug search WITHOUT the tool (treatment NO-TOOL), and then do the search WITH the tool (treatment TOOL). Then, I will compare the speed of each user in the two tasks, to see if they improve.
What's wrong with this design?
Learning Bias: if I use the same file in both tasks, by the second task the subjects will have learned where the bugs are, so the second treatment (TOOL) will look faster regardless of the tool!
86. Paired T-test (Corrected Example)
• I have a new tool to support bug identification in code review, and I want to understand whether it is effective or not
• RQ: Does the tool improve the users' speed of finding bugs?
• Independent Variable: tool (YES/NO), two treatments
• Dependent Variable: speed = number of bugs found/minute
• H0: the speed with the tool is lower than or equal to the speed without the tool
• Design: I have 13 users and ONE code file to review. I will let them first do the bug search WITH the tool (treatment TOOL), and THEN do the search WITHOUT the tool (treatment NO-TOOL). Then, I will compare the speed of each subject in the two tasks.
Now the learning bias works in favour of the NO-TOOL treatment; if I am still able to reject the hypothesis, I can be quite confident that the tool increases the speed.
Is ONE code file sufficient?
88. Paired T-test (Corrected Example)
• Design: I have 13 users and TWO equivalent code files to review (files X and Y). I will let them first do the bug search WITH the tool on file X (treatment TOOL), and THEN do the search WITHOUT the tool on file Y (treatment NO-TOOL). Then, I will compare the speed of each subject in the two tasks.
• With TWO equivalent code files, I am more confident that the first treatment does not influence the second treatment.
But what if the task lasts too long, and the subjects get tired in the second task? The effect of fatigue needs to be considered, so I need to run the two treatments on two separate days (or allow sufficient time between tasks).
90. Paired T-test
• H0: the speed with the tool is lower than or equal to the speed without the tool (one-tailed hypothesis)

Bugs/min by user:

USER  NO-TOOL  TOOL
u0      3       6
u1      3       6
u2      4       5
u3      3       8
u4      5       3
u5      7       5
u6      2       6
u7      1       5
u8      2       3
u9      8       9
u10     9      11
u11     1       4
u12     7       9

t = 3.24, p-value = 0.00354

CURIOSITY: What calculations are made to find the t-value (the test statistic)?
91. Computing the t-test statistic (paired case)
The paired t-test statistic is based on the per-subject difference between the two measures (TOOL − NO-TOOL). Compute each difference, the mean of the differences M, each deviation (difference − M), and its square; SS is the sum of the squared deviations. μ is the expected difference if H0 is true (hence no difference, μ = 0). With N subjects, the test statistic is:

t = (M − μ) / √( SS / (N(N − 1)) )
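The formula above can be checked with a short pure-Python sketch (no statistics library needed); the data are the bugs/min values from the table, and the result matches the t = 3.24 reported on the slide:

```python
from math import sqrt

def paired_t(before, after, mu=0.0):
    """Paired t statistic: t = (M - mu) / sqrt(SS / (N(N-1))),
    where M is the mean of the per-subject differences and SS the
    sum of squared deviations of the differences from M."""
    n = len(before)
    diffs = [a - b for b, a in zip(before, after)]
    m = sum(diffs) / n                        # mean difference M
    ss = sum((d - m) ** 2 for d in diffs)     # sum of squared deviations
    return (m - mu) / sqrt(ss / (n * (n - 1)))

# Bugs/min per user, from the slide's table
no_tool = [3, 3, 4, 3, 5, 7, 2, 1, 2, 8, 9, 1, 7]
tool    = [6, 6, 5, 8, 3, 5, 6, 5, 3, 9, 11, 4, 9]

t = paired_t(no_tool, tool)
print(round(t, 2))  # 3.24
```

The p-value is then obtained from the t distribution with N − 1 = 12 degrees of freedom.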
92. Dependent Variable is Interval/Ratio (Example)
e.g., IV: tool (A, B, C, D); DV: speed in finding bugs. What is the difference between tools A, B, C and D in terms of the speed of bug detection achieved by users? Single group of subjects (repeated measures), more than two treatments → One-way ANOVA
93. Dependent Variable is Interval/Ratio (Example)
e.g., IV: tool (WITH/WITHOUT); DV: speed in finding bugs. Does the tool improve the users' speed in finding bugs, i.e., is there a difference in speed WITH and WITHOUT the tool? Different groups of subjects (independent measures), two treatments → T-test (unpaired)
94. Unpaired T-test (Example)
a.k.a. independent-measures t-test, unpaired samples t-test
• RQ: Does the tool improve the users' speed of finding bugs? (the research question and hypothesis are the same as for the paired T-test; only the design changes)
• I want to completely get rid of the learning bias and of the fatigue effect, and I have a sufficient number of users (26 instead of 13)
• I change the design by having two groups: I randomly allocate the subjects and assign each group to one of the treatments (TOOL, NO-TOOL)
• I have to assess that there is no difference in the initial competence of the users. To this end, I can run a pre-test, which allows me to check that the subjects in the two groups have the same (average) degree of competence in finding bugs.
• Otherwise, I can provide sound arguments to justify that ALL the subjects have the same degree of competence (e.g., all the subjects are students from the same course and all novices; hence my results are valid solely for this category of users)
• Note that the two groups need to be balanced, but you do not need exactly the same number of people in each group (e.g., 25 people can be divided into groups of 13 and 12 subjects)
96. Unpaired T-test (Example)

USER  NO-TOOL     USER  TOOL
u0      3         u13     6
u1      3         u14     6
u2      4         u15     5
u3      3         u16     8
u4      5         u17     3
u5      7         u18     5
u6      2         u19     6
u7      1         u20     5
u8      2         u21     3
u9      8         u22     9
u10     9         u23    11
u11     1         u24     4
u12     7         u25     9

t-value = -1.89889, p-value = .034833

Note that the t-value is different from the t-value of the paired case, although the numbers in the tables are THE SAME (but coming from different subjects)!

CURIOSITY: What calculations are made to find this t-value (the test statistic)?
97. Computing the t-test statistic (unpaired case)
For each group, compute the mean (Mx for NO-TOOL, My for TOOL), each deviation from the group mean (x − Mx, y − My), and its square; SSx and SSy are the sums of the squared deviations. With nx and ny subjects per group, the pooled variance is s² = (SSx + SSy) / (nx + ny − 2), and the test statistic is:

t = (Mx − My) / √( s² (1/nx + 1/ny) )
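As for the paired case, the formula can be checked with a short pure-Python sketch on the slide's data; it reproduces the reported t-value of about −1.899:

```python
from math import sqrt

def unpaired_t(x, y):
    """Unpaired (independent samples) t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)       # sum of squared deviations
    ssy = sum((v - my) ** 2 for v in y)
    s2 = (ssx + ssy) / (nx + ny - 2)          # pooled variance
    return (mx - my) / sqrt(s2 * (1 / nx + 1 / ny))

# Same numbers as the paired example, but from different subjects
no_tool = [3, 3, 4, 3, 5, 7, 2, 1, 2, 8, 9, 1, 7]
tool    = [6, 6, 5, 8, 3, 5, 6, 5, 3, 9, 11, 4, 9]

t = unpaired_t(no_tool, tool)
print(round(t, 3))  # -1.899
```

The p-value comes from the t distribution with nx + ny − 2 = 24 degrees of freedom; note how the same numbers give a weaker result than in the paired design, which removes between-subject variability.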
99. What about the Effect Size?
• In this case, my hypothesis is about a difference, therefore I will use Cohen's d, computed from the two group means (NO-TOOL: 4.23, TOOL: 6.15) and the spread of the two samples:
d = (6.15 - 4.23) / 6.701138 = 0.286519
I have a SMALL to MEDIUM effect size (see the table from some slides ago)
100. Dependent Variable is Interval/Ratio (Example)
e.g., IV: tool (A, B, C); DV: speed of bug detection. What is the difference between tools A, B and C in terms of the speed of bug detection achieved by users? (same question as for repeated measures, but with a different design involving different people) Different groups of subjects (independent measures), more than two treatments, one independent variable → One-way ANOVA
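The F statistic behind one-way ANOVA is the ratio of between-group to within-group variability. A minimal pure-Python sketch with hypothetical bugs/min values for three tools, each used by a different group of users:

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square
    divided by within-group mean square."""
    k = len(groups)                                   # number of groups
    n = sum(len(g) for g in groups)                   # total observations
    grand = sum(sum(g) for g in groups) / n           # grand mean
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical speeds (bugs/min) for tools A, B, C, different users per tool
f = one_way_anova_f([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(f)  # 21.0
```

The p-value is obtained from the F distribution with (k − 1, n − k) degrees of freedom; a large F means the group means differ more than the within-group noise would explain.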
101. Dependent Variable is Interval/Ratio (Example)
e.g., IV: tool AND experience; DV: bug detection speed. What is the influence of different tools and of experience on bug detection speed? (I consider not only the tool, but also the experience as an independent variable) More than one independent variable → Factorial ANOVA
102. Factorial ANOVA (Example)
• Let's imagine we have two tools A and B to support bug detection; I want to see which one is better, but I also want to see whether there is some difference between people with different degrees of experience in bug detection
• RQ: What is the influence of different tools and of experience on bug detection speed?
• Here I want to see which of the two factors (users' experience and type of tool, my independent variables) has more impact on bug detection speed
• I have three NULL hypotheses this time:
• H0-1: The speed does not depend on the type of adopted tool
• H0-2: The speed does not depend on the level of experience of the user
• H0-3: The speed does not depend on the interaction between the type of adopted tool and the level of experience
• Design:
• User experience has 3 levels: low, medium, high
• Type of tool has 2 levels: tool A, tool B (in principle, I should also have NO tool)
• Therefore, I have 3 x 2 = 6 possible combinations (i.e., people with low experience using tool A, people with low experience using tool B, etc.), and I have to split my subjects into 6 groups
103. Factorial ANOVA (Example)

Data (excerpt):

User  Exp.    Tool  Speed
1     low     A     12
2     low     B     4
3     low     A     7
4     low     B     3
5     medium  A     9
6     medium  B     12
7     medium  A     16
8     medium  B     23
9     high    A     23
10    high    B     16
11    high    A     14
12    high    B     12
...   ...     ...   ...

ANOVA Results (the F-value is the test statistic for ANOVA):

Factor       Mean Square  F-value  p-value
Exp.         2664         147.51   <0.001
Tool         29.4         1.62     0.207
Exp. x Tool  83.85        4.64     0.014

The experience is significant (reject H0-2); the tool is not significant (cannot reject H0-1); the interaction of the two factors is significant (reject H0-3).
104. How to Select the Right Test
• Follow the diagram
• Use the wizard at https://www.socscistatistics.com/tests/what_stats_test_wizard.aspx
• Use the exhaustive table at https://stats.idre.ucla.edu/other/mult-pkg/whatstat/ which also contains R code and code for other tools
• To find non-parametric alternatives: https://help.xlstat.com/s/article/which-statistical-test-should-you-use?language=en_US
• Always remember to check that the test assumptions hold
• It takes time to acquire confidence with experiment design, so DO NOT BE SCARED
105. How To Select the Right Test
Table 10.3 from Wohlin et al. (https://doi.org/10.1007/978-3-642-29044-2): overview of parametric/non-parametric tests for different designs

Design                                                   | Parametric     | Non-parametric
One factor, one treatment                                | -              | Chi-2, Binomial test
One factor, two treatments, completely randomized design | t-test, F-test | Mann-Whitney, Chi-2
One factor, two treatments, paired comparison            | Paired t-test  | Wilcoxon, Sign test
One factor, more than two treatments                     | ANOVA          | Kruskal-Wallis, Chi-2
More than one factor                                     | ANOVA          | -
(the more-than-one-factor case is not described in that book; refer instead to, for example, Marascuilo and Serlin [119] and Montgomery [125])

Factor = number of independent variables
Treatments = possible values of the independent variables
These are the fundamental tests.
107. Threats to Validity for Controlled Experiments
• Construct Validity: to what extent do the measured variables represent what I intended to measure? Did I operationalise my research questions in the proper manner? Did I use an appropriate design?
• Internal Validity: are there any confounding factors that may have influenced the outcome of the experiments? Did I control all the variables?
• External Validity: for which values of the controlled variables are the results valid? To what extent can the results be considered general?
• (Statistical) Conclusion Validity: to what extent are my findings credible? Have I used the appropriate statistical tests? Did I check their assumptions? Have I sampled the population in the appropriate way? Have I used reliable measurement procedures (low measurement error)?
108. Internal Validity
• Factors jeopardising internal validity are, e.g.:
• History: did time impact the treatments? (e.g., people participating at different times of the day, or treatments performed on different days)
• Maturation: did subjects learn throughout the experiment? Did the time spent in the experiment affect performance? (e.g., people can get bored or tired)
• Experimental mortality: how many subjects left the experiment, and how did this affect the treatment groups? Are the remaining subjects the most motivated ones?
• Researcher bias: in which ways could the researcher influence the outcomes? (e.g., the presence of the researcher influences the participants)
• Experimental context: to what extent does the experimental context influence the behaviour of the subjects?
cf. https://web.pdx.edu/~stipakb/download/PA555/ResearchDesign.html
109. External Validity
• Factors jeopardising external validity are, e.g.:
• Selection bias: are the selected subjects really random, and are they randomly assigned to treatments?
• Representativeness: to what extent does the experiment represent a real context? To what extent was I able to properly represent all the realistic combinations of the control variables? To what extent was I able to select representative people and representative situations?
110. Construct Validity
• Factors jeopardising construct validity are:
• Hypothesis guessing: does knowing the expected result
influence the behaviour of the participants?
• Bias in experimental design: were my operationalisation and
design correct?
• Subjective measures: to what extent are the subjective
measures reliable?
111. Conclusion Validity
• Factors jeopardising conclusion validity are:
• Low statistical power: power is the probability of
correctly rejecting the NULL hypothesis when it is FALSE; I
may fail to reject the NULL hypothesis if I have low
statistical power; low statistical power occurs when I have
few samples and a small effect size.
• Violated assumptions: remember that all tests have
assumptions to check
• Unreliable measures of the variables: a large amount of
measurement error
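To get a feel for how sample size and effect size drive statistical power, here is a rough normal-approximation of the power of a two-sided, two-sample t-test, using only the Python standard library (a sketch for building intuition, not a replacement for a proper power analysis tool):

```python
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample t-test via the
    normal approximation: the probability of rejecting the NULL
    hypothesis when the true standardized effect (Cohen's d) is
    `effect_size` and each group has `n_per_group` subjects."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)             # e.g. 1.96 for alpha=0.05
    ncp = effect_size * (n_per_group / 2) ** 0.5  # non-centrality parameter
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)
```

With a medium effect (d = 0.5), 10 subjects per group yield power of only about 0.2, while 64 per group reach roughly the conventional 0.8 threshold.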
112. Preparing, Executing and Reporting Experiments
[Flow diagram] PREPARATION: Theory → Research Question → Hypothesis and
Variable Definition → Research Design → Define Measures for Variables →
Recruit Participants / Select Artifacts. EXECUTION: Collect Data →
Analyse Data. REPORTING: Report Answers → Discuss. (Validity concerns are
annotated across the phases: Internal, External, Construct and
Conclusion Validity.)
113. Reporting Experiments (1)
Table 11.1 Proposed reporting structure for experiment reports, by Jedlitschka and Pfahl [86]
Sections/subsections: Contents
Title, authorship
Structured abstract: Summarizes the paper under headings of background or
context, objectives or aims, method, results, and conclusions
Motivation: Sets the scope of the work and encourages readers to read the
rest of the paper
Problem statement: Reports what the problem is, where it occurs, and who
observes it
Research objectives: Defines the experiment using the formalized style used
in GQM
Context: Reports environmental factors such as settings and locations
Related work: How the current study relates to other research
Experimental design: Describes the outcome of the experimental planning stage
Goals, hypotheses and variables: Presents the refined research objectives
Design: Defines the type of experimental design
Subjects: Defines the methods used for subject sampling and group allocation
Objects: Defines what experimental objects were used
Instrumentation: Defines any guidelines and measurement instruments used
Data collection procedure: Defines the experimental schedule, timing and
data collection procedures
Analysis procedure: Specifies the mathematical analysis model to be used
Evaluation of validity: Describes the validity of materials, procedures to
ensure participants keep to the experimental method, and methods to ensure
the reliability and validity of data collection methods and tools
Execution: Describes how the experimental plan was implemented
Sample: Description of the sample characteristics
Preparation: How the experimental groups were formed and trained
cf. Wohlin et al. https://doi.org/10.1007/978-3-642-29044-2
114. Reporting Experiments (2)
Data collection performed: How data collection took place and any
deviations from the plan
Validity procedure: How the validity process was followed and any
deviations from the plan
Analysis: Summarizes the collected data and describes how it was analyzed
Descriptive statistics: Presentation of the data using descriptive statistics
Data set reduction: Describes any reduction of the data set, e.g., removal
of outliers
Hypothesis testing: Describes how the data was evaluated and how the
analysis model was validated
Interpretation: Interprets the findings from the Analysis section
Evaluation of results and implications: Explains the results
Limitations of study: Discusses threats to validity
Inferences: How the results generalize given the findings and limitations
Lessons learnt: Descriptions of what went well and what did not during the
course of the experiment
Conclusions and future work: Presents a summary of the study
Relation to existing evidence: Describes the contribution of the study in
the context of earlier experiments
Impact: Identifies the most important findings
Limitations: Identifies main limitations of the approach, i.e.,
circumstances when the expected benefits will not be delivered
Future work: Suggestions for other experiments to further investigate
Acknowledgements: Identifies any contributors who do not fulfill authorship
criteria
References: Lists all cited literature
Appendices: Includes raw data and/or detailed analyses which might help
others
cf. Wohlin et al. https://doi.org/10.1007/978-3-642-29044-2
116. What about Quasi-Experiments?
• In experiments, I randomly assign subjects to treatments;
• In quasi-experiments, the assignment is based on some choices of
the designer (e.g., the Factorial ANOVA example, in which I have more
than one level of experience)
• Note that a quasi-experiment does not always allow one to convincingly
establish causal relationships (e.g., different degrees of experience
may be related to other factors that have influenced the outcome)
• When I use a group of students from a certain class for my research, I
am neither performing an experiment nor a quasi-experiment, but a
case study, as I am focusing on a specific environment and I selected
the subjects opportunistically
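The random assignment that distinguishes a true experiment from a quasi-experiment can be sketched in a few lines of Python (names are illustrative, not from the lecture):

```python
import random

def random_assignment(subjects, treatments, seed=42):
    """Randomly assign subjects to treatment groups of near-equal size.
    Randomization breaks the link between subject characteristics and
    treatment, which quasi-experimental assignment cannot guarantee."""
    rng = random.Random(seed)   # fixed seed only for reproducibility
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, s in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(s)
    return groups
```

For example, 20 subjects assigned to "Tool A" vs "Tool B" yield two random groups of 10 each.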
117. Summary
• Controlled Experiments in SE are a research strategy mostly aimed at testing the
impact of some treatment (a method, a tool) on a certain dependent variable (e.g.,
speed, bugs, success, happiness)
• They are based on Hypothesis testing, which implies showing that the
experimental data REJECT the NULL hypothesis (i.e., the hypothesis of no impact
on the dependent variable)
• Hypothesis testing uses Statistical tests to decide whether the NULL can be
REJECTED
• The selection of the statistical test depends on the Experimental design (look at
https://stats.idre.ucla.edu/other/mult-pkg/whatstat/)
• When I perform a statistical test, I hope to obtain small p-values and a large
effect size
• Remember to analyse and report Threats to Validity
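The hypothesis-testing loop summarized above can be made concrete with a simple two-sample permutation test in plain Python: it computes a p-value for the NULL hypothesis that two groups of measurements (e.g., bugs found with Tool A vs Tool B) come from the same distribution. This is a didactic sketch; in practice the test must match the experimental design, as discussed above.

```python
import random

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means.
    Returns a p-value: small values are evidence against the NULL
    hypothesis that groups a and b come from the same distribution."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                        # relabel under the NULL
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)            # add-one smoothing
```

If the two groups clearly differ, the returned p-value is small and the NULL can be REJECTED; if they overlap heavily, the p-value stays large and nothing can be concluded.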