Controlled Experiments
in Software Engineering
cf. Pfleeger, 1995 https://doi.org/10.1007/BF02249052
cf. S. Easterbrook http://www.cs.toronto.edu/~sme/CSC2130/04-experiments.pdf
Alessio Ferrari, ISTI-CNR, Pisa, Italy

alessio.ferrari@isti.cnr.it
Controlled Experiments
aka Laboratory Experiments
aka Experiment
The ABC of Software Engineering Research: In Vitro Experiment
The GOAL is the precise measurement of behaviour
Typical Examples
• With software subjects: Tools A and B are automatic tools for testing, and I want to compare them (no need to involve people)

• With human subjects: Method M is a manual strategy for finding bugs. How effective is it for experts? How effective is it for novices?

• With human and software subjects:
• Tool T is an interactive tool for testing, and I want to see whether it is more appropriate for novices or for experts
• Tools A and B are interactive tools for testing, and I want to compare them (I have to involve people)

• Tools A and B are interactive tools for testing, and I want to see which one is more appropriate for novices and which one for experts

• Tool A and method M are two approaches for finding bugs, and I want to see which one is better
Controlled Experiments and Theories
[Diagram: the reasoning cycle] Observation leads to a Theory (via induction or abduction); from the Theory we derive a Hypothesis (deduction), which we Test. Controlled experiments follow this DEDUCTIVE APPROACH.
Controlled Experiments: Process
PREPARATION: Theory → Research Question → Hypothesis and Variable Definition → Research Design → Define Measures for Variables (Construct Validity)
EXECUTION: Recruit Participants / Select Artifacts → Collect Data → Analyse Data (Internal Validity, Construct & Conclusion Validity)
REPORTING: Report Answers → Discuss (External Validity)
The process normally starts from a Theory and discusses/modifies it in relation to the results.
Typically QUANTITATIVE
Controlled Experiments: Elements
[Diagram]
Design: Hypothesis and Treatments, expressed over independent variables, dependent variables, and controlled variables. This part requires your creativity.
Collect Data: Variable Measurements produce the Data from the Experiment.
Analyse Data: a statistical Test turns the data into a Test Statistic and a p-value (compared against a significance level α); an effect size computation yields the Effect Size. This part is mostly automated (but you need to understand it!). The outcome tells you whether the Hypothesis is supported ✅ or remains in question ❓.
Controlled Experiment
• "Experimental investigation of a testable hypothesis, in which conditions are set up to isolate the variables of interest (independent variables) and test how they affect certain measurable outcomes (the dependent variables)"
INDEPENDENT variables (e.g., testing tool), aka FACTORS
DEPENDENT variables (e.g., number of bugs)
TREATMENTS: each combination of values of the independent variables is a TREATMENT
Treatment 1 (e.g., testing tool A)
Treatment 2 (e.g., testing tool B)
cf. S. Easterbrook http://www.cs.toronto.edu/~sme/CSC2130/04-experiments.pdf
To ISOLATE the independent variables, the other variables need to be CONTROLLED (e.g., variables concerning the code samples on which the test is performed)
Controlled Experiments
INDEPENDENT variables (e.g., testing tool)
DEPENDENT variables (e.g., number of bugs)
TREATMENTS: Treatment 1 (e.g., testing tool A), Treatment 2 (e.g., testing tool B)
CONTROLLED variables (e.g., sample length, type of language, complexity): they should be equivalent for each treatment, homogeneous, general, and representative; some relate to human subjects, some to objects.
Controlled variables when human subjects are involved may concern experience of developers, age, etc.
Deļ¬nitions
ā€¢ Hypothesis: the statement I want to test with the experiment

ā€¢ Derived from a research question (e.g., What is the diļ¬€erence
between A and B in terms of bug detection capability?)

ā€¢ Include variables that represent constructs of interest (e.g., tools,
methods, actors, number of bugs)

ā€¢ Concern the measurable impact that a certain variation on some
construct can have on other constructs (e.g., Tool A ļ¬nds more bugs
than tool B; Tool A ļ¬nds less or equal bugs than tool B)

ā€¢ I normally have NULL and Alternative hypothesis; the one I will test is
the NULL hypothesis, but the one I am interested in is the Alternative
one (weā€™ll see this later)
Deļ¬nitions
ā€¢ Independent Variables (INPUT): operationalisation of
constructs that I want to isolate, and whose values I want to
manipulate (e.g., the tool, the expertise of actors)
ā€¢ Treatments: combinations of values for the independent
variables (tool A, tool B ā€” 1 variable, two treatments; tool A and
experts, tool A and novices, tool B and experts, tool B and
novicesā€” 2 variable, 4 treatments)
ā€¢ Dependent Variables (OUTPUT): operationalisation of
constructs that I want to measure based on the manipulation of
the independent variables (e.g., number of bugs)
ā€¢ Controlled Variables: attributes* of human subjects or objects
that I need to control to mask or prevent their impact on the
dependent variables (e.g., I have to test on some code that is
suļ¬ƒciently general, and equivalent for all cases)
* = operationalisation of constructs
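The treatment enumeration described above (every combination of values of the independent variables) can be sketched in a few lines; the factor names and levels below are illustrative, matching the slide's example of 2 variables giving 4 treatments.

```python
from itertools import product

# Independent variables (factors) and their levels; the names are
# illustrative, matching the slide's example (2 variables -> 4 treatments).
factors = {
    "tool": ["A", "B"],
    "experience": ["novice", "expert"],
}

# Each treatment is one combination of values of the independent variables.
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]

for t in treatments:
    print(t)
```

With one factor of two levels you would get two treatments; adding a second two-level factor doubles that, which is why the number of participants needed grows quickly with the number of factors.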
Example: Software
• Objective: I want to understand which is the better testing tool among two available choices, A and B

• The independent variable is already identified: the tool (one factor)

• Treatments are also straightforward: tools A and B (two treatments)

• The dependent variable is still missing: I have to detail what I mean by better. Better in terms of speed? Better in terms of bugs found? Both! OK, I already have two dependent variables, which I can define as:

• "effectiveness" = number of bugs found / total number of bugs

• "efficiency" = running time / number of bugs found

• Now I have to identify the controlled variables: what can impact effectiveness and efficiency, besides the type of tool? The user? Maybe not, if the tool is fully automatic. The language of the code? Well, I want to focus only on C code. The chosen code? Well, yes, but which attributes of the chosen code?

• number of bugs in the code module

• length of the module

• complexity of the module

• domain of the code

• …
I have to create a code sample that has sufficient variation in all of the controlled variables.
If I cannot vary a certain variable, I have to fix it (e.g., C code, domain) and make this choice explicit, as it limits my scope of interest.
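The two dependent variables defined on this slide can be made concrete as small functions; the measurements in the usage lines are made up for illustration.

```python
def effectiveness(bugs_found: int, total_bugs: int) -> float:
    """Effectiveness = number of bugs found / total number of bugs."""
    return bugs_found / total_bugs

def efficiency(running_time: float, bugs_found: int) -> float:
    """Efficiency = running time / number of bugs found
    (lower is better: fewer seconds spent per bug)."""
    return running_time / bugs_found

# Hypothetical measurements: a tool finds 18 of 24 seeded bugs in 90 seconds.
print(effectiveness(18, 24))  # 0.75
print(efficiency(90.0, 18))   # 5.0 seconds per bug
```

Defining the measures as explicit formulas like this is the "Define Measures for Variables" step of the process: it forces you to commit to an operationalisation before collecting data.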
Example: Software and Humans
• Objective: I want to see whether the experience of the user affects the effectiveness of a certain testing tool

• The dependent variable is already identified: the effectiveness (bugs found / total bugs)

• I have to identify the independent variables: they should concern the experience of the user. How can I measure it? Years of experience in testing? A score given by colleagues? Well, normally it is better to select one independent variable only, otherwise I need too many treatments and I may not find enough participants! OK, but what should I compare? 1, 2, 3, 4, 5, etc. years? That is also a lot of treatments; will I find enough people? I have to group years of experience into levels. How do I select the levels? I have to make some assumptions based on the existing literature, or I can make a decision that can be defended

• I decide on two levels, and I partition into two treatments (i.e., two homogeneous groups of people)

• from 0 to 1 years: novices

• more than 5 years: experts

• Now I have to identify the controlled variables: what can impact my outcomes besides the experience of the users? Well, age, gender, all demographic variables… and of course, the code on which the tool is applied (the previous variables)

• I have to make some choices: I should fix a representative code base, use the same one for all subjects, make sure none of them know the code in advance, and control demographic variables
• Therefore, for each treatment, I have a group with comparable experience (novice OR expert) but variation in terms of age, gender, and other demographic variables
Controlled Experiments: 🙂 and ☹
• 🙂 Advantages:
• It is SCIENCE, with NUMBERS

• Can be applied to identify cause-effect relationships for specific, well-defined variables

• ☹ Disadvantages:
• Applicable only to well-defined problems in which you can clearly define and isolate variables

• Hard to apply if you cannot simulate the right conditions in the lab (there may be too many confounding variables to control)

• The reality of SE has several contextual factors that may make the experiment unrealistic

• It may be hard and costly to recruit adequate subjects (developers have to develop, managers need to manage… often, students are used as proxies)

• Design is time consuming and can get very complicated very easily (which implies that it is also difficult to analyse the results and retain actual control)
Hypothesis Testing
cf. Sharma, 2015 https://bit.ly/2wTf7VX
I will provide enough information for you to understand the principles, but to REALLY understand you need more resources.
I will use the word MAGIC when some concepts need to be assumed, or when some measures are simply given by common tools.
Alessio Ferrari, ISTI-CNR, Pisa, Italy

alessio.ferrari@isti.cnr.it
Hypothesis
• A hypothesis is a statistically testable statement derived from a theory (and, in practice, from a research question)

• A hypothesis is a predictive statement concerning the impact of some independent variable on some dependent variable

• When we do hypothesis testing, our goal is to refute the negation of the theory
• H0, the NULL hypothesis: the theory does not apply
• Usually expressed as "There is no effect […]": changes of the independent variable do not affect the dependent variable

• It is assumed to be TRUE, unless there is evidence from the data that allows us to REJECT the NULL hypothesis (for this, you need statistical tests)
• H1, the ALTERNATIVE hypothesis: the theory predicts…
• If H0 is rejected, this is evidence that H1 may be correct
Example
• H0: The experience of the developer does not affect the average time to find bugs
• H0: Average-Time-Novices = Average-Time-Experts
• H1: The experience of the developer affects the average time to find bugs
• H1: Average-Time-Novices ≠ Average-Time-Experts
Imagine we have two groups, novices and experts, and a method M or tool T for finding bugs.
Here we speak of a Two-tailed hypothesis to be tested (later you will understand why).
What if I want to know WHO is QUICKER? This formulation does not say anything about that…
Example
• But I can find another formulation, with exactly the same experiment: two groups, novices and experts, and I measure the average time to find bugs

• H0: The average time to find bugs of novices is less than or equal to that of experts
• H0: Average-Time-Novices <= Average-Time-Experts
• H1: The average time to find bugs of novices is greater than that of experts
• H1: Average-Time-Novices > Average-Time-Experts
Here we speak of a One-tailed hypothesis to be tested.
Test Statistic
• Hypothesis tests normally take all my sample data and convert them into a single value, which is called the test statistic
• The test statistic is just a number, but its value can tell me whether the NULL hypothesis can be REJECTED or not

• Depending on the test that I have to perform, I will have different test statistics
[Diagram] Data from the Experiment (time novice 1, time expert 1, time novice 2, time expert 2, …) go into a Test (e.g., an unpaired t-test, which compares the means of two independent samples) and come out as a Test Statistic (e.g., a t-value such as -0.38).
cf. https://bit.ly/39LLOU5
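The t-value from the diagram above is simple to compute by hand. In practice you would call a library routine such as scipy.stats.ttest_ind, which also returns the p-value; the stdlib sketch below computes only the t statistic (the p-value needs the t-distribution, the "MAGIC" discussed next), on made-up bug-finding times.

```python
from statistics import mean, variance

def unpaired_t(sample_a, sample_b):
    """Student's unpaired t statistic (pooled variance) for two
    independent samples: how far apart the two means are, measured
    in units of the pooled standard error."""
    na, nb = len(sample_a), len(sample_b)
    # Pooled variance combines the two sample variances,
    # weighted by their degrees of freedom.
    pooled = ((na - 1) * variance(sample_a) +
              (nb - 1) * variance(sample_b)) / (na + nb - 2)
    se = (pooled * (1 / na + 1 / nb)) ** 0.5
    return (mean(sample_a) - mean(sample_b)) / se

# Hypothetical bug-finding times (minutes) for novices and experts.
novices = [30.0, 42.0, 35.0, 51.0]
experts = [25.0, 31.0, 28.0, 33.0]
print(round(unpaired_t(novices, experts), 2))  # about 2.1
```

The sign of t depends only on which sample comes first; its magnitude grows as the difference between the means grows relative to the spread of the data.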
Probability Distribution of the Test Statistic
• The assumption is that the NULL hypothesis is TRUE
• Given a population in which the NULL hypothesis is true, imagine repeating the experiment many times and computing the test statistic each time
• The test statistic will follow a certain distribution. Which one? MAGIC, e.g., Student's t-distribution
[Diagram: number of samples with value x over the possible values x of the test statistic] The distribution is centred on the value that the test statistic has when the data of my experiment confirm the NULL hypothesis exactly (e.g., a t-value of 0 indicates that my data confirm H0 precisely). If H0 is TRUE, most of the times I repeat the experiment the test statistic will be around the centre; it is unlikely to fall in the tails. If my test statistic falls around the tails, I can REJECT H0… and this is my hope!
ā€¢ Our ļ¬nal goal is to evaluate whether our test statistic
value, obtained from our experiment, is so rare that it
justiļ¬es rejecting the NULL hypothesis for the entire
population based on our sample data

ā€¢ How can I do if I do not know the entire distribution of my
test statistic? This can be inferred based on the statistics
of the sampled data and the hypothesis I want to testā€¦

ā€¢ ā€¦in this context we will assume that some MAGIC
occurs and we know the distribution of the test statistic
Critical Regions
[Diagram: distribution of the test statistic (# of samples over test statistic values)]
I want the test statistic of my experiment to fall in the tails of the distribution. The Critical Regions (Rejection Regions) contain the values acceptable for rejecting the NULL; they identify a red area in each tail of the distribution. That area is the risk of rejecting the NULL when it is TRUE. Before the experiment, I set the Critical Regions.
Level of Signiļ¬cance and
Conļ¬dence
ā€¢ Signiļ¬cance level indicates the risk to reject a NULL
hypothesis when it is true; it is denoted by š›¼ 

ā€¢ 0.01, 0.05, 0.1: these are the typical values for š›¼

ā€¢ (1 āˆ’ š›¼) is the conļ¬dence level indicates how conļ¬dent I
want to be about the result of my test

ā€¢ 0.99, 0.95, 0.9: typical values for (1 āˆ’ š›¼)
Alpha sets the standard for how extreme the data
MUST BE before we can reject the null hypothesis.
The p-value indicates how extreme the data ARE (later).
Signiļ¬cance and Conļ¬dence
test statistic
Before any experiment I
set the signiļ¬cance level,
and corresponding
conļ¬dence level
Critical Region = acceptable
values of test statistic to reject
NULL
Critical Region = acceptable
values of test statistic to reject
NULL
Conļ¬dence Level (1-š›¼)
Signiļ¬cance Level š›¼
Risk of Rejecting the NULL Hypothesis when TRUE

Risk Level   | Significance α | Confidence Level (1 − α) | Intuitive Meaning
Catastrophic | 0.001          | 0.999                    | More than 100 million Euros (large loss of life, e.g., nuclear disaster)
Critical     | 0.01           | 0.99                     | Less than 100 million Euros (a few lives lost, e.g., accident)
Important    | 0.05           | 0.95                     | Less than 100 thousand Euros (no lives lost, some injuries)
Moderate     | 0.10           | 0.90                     | Less than 500 Euros (no injuries)

In software engineering, we normally use these values (the Important and Moderate rows).
Type I and Type II Errors

REAL Population | Fail to Reject                                                    | Reject
NULL is True    | No Error: my theory is FALSE (1 − α)                              | Type I Error: incorrectly reject the NULL hypothesis (α)
NULL is False   | Type II Error: incorrectly fail to reject the NULL hypothesis (β) | No Error: my theory is TRUE (1 − β)

Type I 🤥 my (alternative) hypothesis is wrong, but I support it anyway
Type II 🥺 my (alternative) hypothesis is correct, but I rejected it
We normally focus on minimising Type I errors.
Two-tailed Test
• H0: The experience of the developer does not affect the average time to find bugs
[Diagram] Acceptance region in the centre: Average-Time-Novices = Average-Time-Experts, confidence level (1 − α) = 0.95. Rejection Regions in both tails: Average-Time-Novices ≠ Average-Time-Experts, significance level α/2 = 0.025 (or 2.5%) per tail.
The value of α = 0.05 is split between the two tails; α is the risk of rejecting the NULL when true, and α/2 is the area of each tail.
One-tailed Test (Left)
• H0: The average time to find bugs of novices is greater than or equal to that of experts
[Diagram] Acceptance region: Average-Time-Novices >= Average-Time-Experts, confidence level (1 − α) = 0.95. Rejection Region in the left tail: Average-Time-Novices < Average-Time-Experts, significance level α = 0.05 (or 5%).
The value of α = 0.05 is all in one tail; α is the area of that tail.
One-tailed Test (Right)
• H0: The average time to find bugs of novices is less than or equal to that of experts
[Diagram] Acceptance region: Average-Time-Novices <= Average-Time-Experts, confidence level (1 − α) = 0.95. Rejection Region in the right tail: Average-Time-Novices > Average-Time-Experts, significance level α = 0.05 (or 5%).
The value of α = 0.05 is all in one tail; α is the area of that tail.
p-value
[Diagram] Data from the Experiment (time novice 1, time expert 1, time novice 2, time expert 2, …) go into a Test (e.g., an unpaired t-test), which produces both a Test Statistic (e.g., a t-value of -0.38) and a p-value.
The p-value is another number produced by the test: LOW values (e.g., 0.001) are GOOD, HIGH values (e.g., 0.3) are BAD.
p-value and š›¼ (one-tailed)
p-value is
this blue area
This point is MY test statistic value,
derived from MY data
š›¼ is the red plus
the blue area
cf. https://statisticsbyjim.com/hypothesis-testing/hypothesis-tests-signiļ¬cance-levels-alpha-p-values/
p-value and š›¼ (two-tailed)
p-value/2 is
this blue area
This point in the x axis is
my test statistic value,
derived from my data
š›¼/2 is the red plus
the blue area
cf. https://statisticsbyjim.com/hypothesis-testing/hypothesis-tests-signiļ¬cance-levels-alpha-p-values/
For two-tailed tests, š›¼ and p are the sum
of the areas in the two tails, both š›¼ and p are
shared between the tails
cf. https://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests-signiļ¬cance-levels-alpha-and-p-values-in-statistics
š›¼/2 is the red plus
the blue area
p-value/2 is
this blue area
p-value
Three different intuitive ways to understand it:
• 1) The p-value indicates the believability of the devil's advocate case that the NULL hypothesis is TRUE, given the sample data
• 2) The p-value is the probability of observing a test statistic at least as extreme as your test statistic, assuming that the NULL hypothesis is true

• 3) The p-value indicates to what extent the result may be due to random variation within your data, which makes them different from the actual population

• If the p-value is "very low", then the NULL hypothesis is REJECTED in favour of the alternative hypothesis; otherwise I fail to REJECT

• The meaning of "very low" depends on the selected significance level α

• p-value <= α: I fall in the REJECTION region, H0 is rejected

• p-value > α: I fall in the ACCEPTANCE region, I fail to reject H0
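The decision rule in the last two bullets is entirely mechanical once α has been fixed before the experiment; a minimal sketch, using the conventional α = 0.05 as the default:

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Compare the p-value produced by the test with the significance
    level alpha, which must be chosen BEFORE running the experiment."""
    if p_value <= alpha:
        return "reject H0"          # the result falls in the rejection region
    return "fail to reject H0"      # never "accept H0"

print(decide(0.001))  # reject H0
print(decide(0.3))    # fail to reject H0
```

Note that the second branch is deliberately worded "fail to reject H0" rather than "accept H0", for the reasons discussed later under "The Hard Truth".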
Effect Size
[Diagram] Data from the Experiment (time novice 1, time expert 1, …) go into a Test (e.g., an unpaired t-test), which produces a Test Statistic (e.g., a t-value of -0.38) and a p-value; a separate effect size computation (e.g., Cohen's d) produces the Effect Size (e.g., d = 2).
A statistically significant effect does not necessarily mean a big effect. Effect size measures how big the effect is.
cf. https://en.wikipedia.org/wiki/Effect_size
cf. https://www.simplypsychology.org/effect-size.html
Effect Size
• Effect size is a quantitative measure of the magnitude of the treatment effect (e.g., HOW MUCH better is my tool?)

• Effect sizes measure either:

• the size of the association/relationship between variables (HOW MUCH is experience correlated with development speed?)

• the size of the difference between group means (HOW LARGE is the difference between tools A and B?)

• There are different ways to measure effect size; the most common are Cohen's d (for differences) and the Pearson r correlation (for associations/relationships), but the choice may also depend on the type of data (categorical vs numeric) and on the type of samples (paired vs unpaired)
Check Wikipedia to find the most appropriate for your case:
cf. https://en.wikipedia.org/wiki/Effect_size
cf. Lakens, 2013 https://doi.org/10.3389/fpsyg.2013.00863
Cohen's d
• The difference between the means divided by the standard deviation of the population from which the data were sampled. But how can we know the standard deviation of the population? The same MAGIC as before

• A d of 1 indicates that the two groups differ by 1 standard deviation, a d of 2 indicates that they differ by 2 standard deviations, and so on
This is how you interpret the values of d that you obtain:
https://en.wikipedia.org/wiki/Effect_size
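A common way to estimate the unknown population standard deviation is the pooled sample standard deviation; a minimal sketch, on made-up data constructed so that the two groups differ by exactly one standard deviation:

```python
from statistics import mean, stdev

def cohens_d(sample_a, sample_b):
    """Cohen's d: difference between the two group means divided by the
    pooled standard deviation (used here as a stand-in for the unknown
    population standard deviation, the 'MAGIC' on the slide)."""
    na, nb = len(sample_a), len(sample_b)
    pooled_var = ((na - 1) * stdev(sample_a) ** 2 +
                  (nb - 1) * stdev(sample_b) ** 2) / (na + nb - 2)
    return (mean(sample_a) - mean(sample_b)) / pooled_var ** 0.5

# Two hypothetical groups whose means differ by exactly
# one pooled standard deviation -> d = 1.
group_a = [10.0, 12.0, 14.0]
group_b = [12.0, 14.0, 16.0]
print(cohens_d(group_b, group_a))  # 1.0
```

Library routines exist for this as well (several variants, e.g., with a small-sample correction), so the sketch is only meant to show where the "standard deviations apart" interpretation comes from.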
Pearson's r
• Indicates the correlation between two variables (e.g., number of bugs vs length of the code)

• Pearson's r can vary in magnitude from −1 to 1:

• −1: perfect negative linear relation

• 1: perfect positive linear relation

• 0: no linear relation between the two variables

• The effect size is low if the value of r is around 0.1, medium if r is around 0.3, and large if r is greater than 0.5
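Pearson's r is a short formula (covariance of the paired samples divided by the product of their spreads); a stdlib sketch, exercised on made-up module-length/bug-count data chosen to hit the two extremes of the range:

```python
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: module length (LOC) vs number of bugs found.
length = [100, 200, 300, 400]
bugs = [2, 4, 6, 8]                     # perfectly increasing -> r = 1
print(pearson_r(length, bugs))
print(pearson_r(length, [8, 6, 4, 2]))  # perfectly decreasing -> r = -1
```

Real data will of course land strictly between the extremes, which is where the low/medium/large thresholds above apply.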
What about Type II Errors?
• In all our evaluations, we assumed that the population conforms to the NULL hypothesis, but what if we make a Type II error (we fail to reject the NULL hypothesis when the actual population rejects it)?

• Well, in these cases, we should also establish a value, normally called β, which is the probability of accepting the NULL hypothesis although it is FALSE

• If the NULL hypothesis is FALSE, this means that my real population follows the alternative hypothesis
Type II Errors
[Diagram: two overlapping distributions of the test statistic over the possible values x (density on the y axis), one if H0 were true and one if H1 were true]
To have a smaller α I have to push the decision threshold to the right… α now is really small, but β gets larger!
β is the probability of accepting the NULL hypothesis when it is FALSE
α is the probability of rejecting the NULL hypothesis when it is TRUE
The Hard Truth
• Whenever you try to minimise Type I errors, you end up increasing the chance of Type II errors

• In practice, we mostly look at REJECTING null hypotheses, so we generally focus on Type I errors and alpha values

• Why do we look at rejecting the NULL? (intuitive explanation)
• We are using just one sample to reason about an entire population, so we can REJECT a hypothesis, or FAIL to REJECT it, but never accept it

• Accepting the alternative hypothesis would imply repeating the experiment many more times with different samples taken from my actual population and showing that the test statistic follows the distribution of the alternative hypothesis

• Additional intuition: it is easier to disprove "all swans are white" (I need to find only one black swan) than to prove it (I need to check all possible swans)
Summary of Concepts
• When you perform an experiment you have to keep in mind the following key concepts:

• Level of significance α: tells me how much risk I can take; normally set to 0.05, a moderate risk; it is set at the beginning of the experiment

• Test statistic: a value depending on the type of test that I perform; it serves to understand how rare my sample would be in a population in which the NULL hypothesis is TRUE; it is computed from my experimental data; the number alone does not say much

• p-value: indicates the probability, assuming the NULL hypothesis is TRUE, of observing data at least as extreme as mine; it is computed from my experimental data; it needs to be compared with α; if it is lower than α, I am happy

• Effect size: indicates how large the difference between two treatments is, or how strong the correlation between the independent and dependent variable is; it depends on the chosen test; tables exist to evaluate the effect size
Graphical Summary
[Diagram] Data from the Experiment go into a Test, producing a Test Statistic and a p-value. If p-value <= α (the significance level set beforehand), I reject the NULL Hypothesis. The effect size computation produces the Effect Size, which is checked against an Effect Size Table to judge whether it is a Small Effect or a Large Effect.
Statistical Tests
Alessio Ferrari, ISTI-CNR, Pisa, Italy

alessio.ferrari@isti.cnr.it
cf. Wohlin et al. https://doi.org/10.1007/978-3-642-29044-2
Summary from Previous Lecture
[Diagram] Every experiment produces a test statistic (a numerical summary of the data). Imagine performing a set of experiments on a population in which the NULL hypothesis is true: the test statistics follow a distribution (# of samples over test statistic values), centred on the value that the test statistic has when the sample confirms the NULL hypothesis EXACTLY. The point on the x axis is my test statistic value, derived from my data; the Statistical Test gives the p-value as the blue area beyond it.
Statistical Tests
• A statistical test is a means to establish a test statistic, i.e., a single value derived from the data of my
experiment

• Several tests exist, and each test is appropriate for a specific type of experiment

• Two categories of tests exist:

• Parametric tests: tests that make some assumptions on the population's distribution, e.g., normality,
or homogeneous variances of the samples

• Nonparametric tests: tests that do not make assumptions on the population's distribution. For most
parametric tests, a nonparametric alternative exists

• Parametric tests have more statistical power (a concept that we did not explore); roughly, they are
more likely to lead to the rejection of the NULL hypothesis when it is FALSE (they lead to lower p-values
when NULL is false, and hence reduce Type II errors). You cannot use them for nominal or ordinal data.

• Nonparametric tests are more robust: they are valid for a larger set of cases, as they do not make
strict assumptions on the data. You can use them for nominal and ordinal data, or when the assumptions
of the parametric tests do not hold

• You do not know the population, so, in order to use a parametric test, you first have to test how likely it is
that your data satisfy the assumptions of the test you are going to apply; if they do not, use a
nonparametric alternative (cf. https://help.xlstat.com/s/article/which-statistical-test-should-you-use?language=en_US)
Normality Test (does not apply to nominal or ordinal data)
• Many parametric statistical tests assume that your data are normally distributed
(strictly, that the distribution of the sample mean is normal, so I should consider the
population… in general, if you have more than 30 samples you are safe)

• To check this, you apply a normality test to your data, for example Shapiro-Wilk
(several others exist)
• The null hypothesis of this test is H0 = the population is normally distributed.
• Thus, if the p-value is less than the chosen α level, the NULL hypothesis is
rejected and there is evidence that the tested data are NOT normally distributed.
Here you want the p-value to be LARGER than α,
as your NULL hypothesis is the one that you want to support!
Hence, THE LARGER the p-value, the BETTER!
There are also ways to transform your data if they are not normally distributed,
but be careful, because then the interpretation of the results is not straightforward
(check whether non-normality is due to the presence of outliers)
cf. https://bit.ly/2wJAl9l
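The check above can be sketched in a few lines with scipy's Shapiro-Wilk implementation. This is only an illustration: the data are the bugs/min values used later in the lecture, reused here as an invented sample.

```python
# Sketch: run a Shapiro-Wilk normality check before choosing a
# parametric test. Sample values are illustrative only.
from scipy import stats

speeds = [3, 3, 4, 3, 5, 7, 2, 1, 2, 8, 9, 1, 7]  # bugs/min (invented sample)

stat, p = stats.shapiro(speeds)
alpha = 0.05
if p > alpha:
    # No evidence against normality: a parametric test may be used
    print(f"p = {p:.3f} > {alpha}: parametric test (e.g., t-test) is an option")
else:
    # Data look non-normal: prefer a nonparametric alternative
    print(f"p = {p:.3f} <= {alpha}: prefer a nonparametric test")
```

Remember that, as the slide says, here a LARGE p-value is the desirable outcome, since the null hypothesis (normality) is the one you want to keep.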
Parametric and Non-parametric Tests (Remark)
• Parametric tests are all those tests that make some assumptions
on your data (normality, above all). To use a parametric test you
first need to check that the assumptions of the parametric test
hold for your data

• Non-parametric tests are alternative tests to use when the
normality test (or any other assumption check) fails OR when you are
dealing with categorical or ordinal data

• Sometimes non-parametric tests have assumptions too!
(check carefully which are the assumptions of non-parametric
tests, e.g., cf. https://www.isixsigma.com/tools-templates/hypothesis-testing/nonparametric-distribution-free-not-assumption-free/ )
Selecting the right test
HOWTO
• In the following, a diagram will be shown to guide you in the selection of the right
test, assuming that you have only ONE DEPENDENT VARIABLE — as in most of
the experiments with a manageable design in SE

• The selection of the test depends on
• The type of dependent variable (nominal, ordinal, interval/ratio)

• The type of hypothesis (difference or relationship/association)

• The number of treatments

• The type of design (single group of subjects vs two groups)

• The number of independent variables
You will not memorise the diagram, but you should know how to follow it.
I will not explain how each test works; you only need to know which one to use.
In this lecture a test is a BLACK box that produces two numbers: the test
statistic and the p-value.
Type of Dependent Variable (assuming ONE dependent variable):
• Nominal (labels): number of independent variables?
  - Zero (only the dependent variable) → Chi-square Goodness of Fit
  - One or more → Chi-square Test of Independence
• Ordinal (ordered labels): type of hypothesis?
  - Relationship → Spearman's Rho
  - Difference: type of design?
    · Different groups of subjects → Mann-Whitney U test
    · Single group of subjects → Wilcoxon signed-rank test
• Interval/Ratio (numbers) → see the next diagram
Interval/Ratio (numbers): type of hypothesis?
• Relationship → Pearson's R (nonparametric alternative: Spearman's Rho)
• Difference: number of independent variables?
  - Zero (comparison against an expected mean): population standard deviation?
    · known → Z-test
    · unknown → T-test (single sample)
  - One: type of design?
    · Single group of subjects (repeated measures): two treatments →
      T-test, paired (nonparametric alternative: Wilcoxon signed-rank test);
      more than two treatments → One-way ANOVA
    · Different groups of subjects (independent measures): two treatments →
      T-test, unpaired (nonparametric alternative: Mann-Whitney U test);
      more than two treatments → One-way ANOVA
  - More than one → Factorial ANOVA
cf. https://www.socscistatistics.com
The list of tests is NOT exhaustive (cf. https://www.socscistatistics.com).
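The selection diagram can be encoded as a small lookup function. This is a sketch: the function name, argument names, and string encodings are my own, not part of the slides, and the mapping is as non-exhaustive as the diagram itself.

```python
# Minimal sketch of the test-selection diagram, for ONE dependent variable.
# Names and argument encodings are invented for illustration.
def select_test(dv_type, hypothesis=None, design=None, treatments=2, n_ivs=1):
    if dv_type == "nominal":
        return ("Chi-square Goodness of Fit" if n_ivs == 0
                else "Chi-square Test of Independence")
    if dv_type == "ordinal":
        if hypothesis == "relationship":
            return "Spearman's Rho"
        return ("Wilcoxon signed-rank test" if design == "single group"
                else "Mann-Whitney U test")
    # interval/ratio dependent variable
    if hypothesis == "relationship":
        return "Pearson's R"
    if n_ivs == 0:
        return "Z-test (sd known) / single-sample T-test (sd unknown)"
    if n_ivs > 1:
        return "Factorial ANOVA"
    if treatments > 2:
        return "One-way ANOVA"
    return ("T-test (paired)" if design == "single group"
            else "T-test (unpaired)")

print(select_test("ordinal", "difference", design="single group"))
# Wilcoxon signed-rank test
print(select_test("interval", "difference", design="two groups"))
# T-test (unpaired)
```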
Chi-square Goodness of Fit (nominal DV, zero independent variables).
e.g., IV: none; DV: type of defect.
To which extent does the proportion of defects of a certain type match the
expected proportion?
(IV = independent variable, DV = dependent variable)
Chi-square Test of Independence (nominal DV, one or more independent variables).
e.g., IV: code author; DV: defect type.
Is there a link between defect type and code author?
Chi-Square Test of Independence (Example)
• RQ: Is there a link between defect type and code author?

• H0: There is no relationship between defect type and code author
[Contingency table: type of defect × author; e.g., the cell counting "Null pointer"
defects in Homer's code]
Chi-square = 56.32, p < 0.00001 → H0 is REJECTED
Cramér's V should be used to check the Effect Size (check Wikipedia)!
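The chi-square statistic and Cramér's V can be computed by hand. The contingency table below is invented for illustration (the slide's actual counts are not reproduced here); author and defect labels are hypothetical.

```python
# Sketch: chi-square statistic and Cramér's V on an invented 2x2
# contingency table (defect type x author).
observed = [[30, 10],   # author Homer: null-pointer, off-by-one (invented)
            [10, 30]]   # author Marge: null-pointer, off-by-one (invented)

rows = [sum(r) for r in observed]          # row totals
cols = [sum(c) for c in zip(*observed)]    # column totals
n = sum(rows)

# Expected count under independence: row total * column total / n
chi2 = sum((observed[i][j] - rows[i] * cols[j] / n) ** 2
           / (rows[i] * cols[j] / n)
           for i in range(len(rows)) for j in range(len(cols)))

# Cramér's V: effect size for the chi-square test of independence
k = min(len(rows), len(cols))
cramers_v = (chi2 / (n * (k - 1))) ** 0.5

print(chi2, cramers_v)  # 20.0 0.5
```

With these made-up counts the statistic is 20.0 and Cramér's V is 0.5, a large association; the p-value would then come from the chi-square distribution with (rows−1)·(cols−1) degrees of freedom.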
Mann-Whitney U test (ordinal DV, difference, different groups of subjects).
e.g., IV: level of experience (two levels); DV: degree of project success.
Is there a difference in the degree of project success between novices and experts?
Wilcoxon signed-rank test (ordinal DV, difference, single group of subjects).
e.g., IV: time of the day (morning, afternoon); DV: level of performance.
Is there a difference in the developers' performance between morning and afternoon?
Spearman's Rho (ordinal DV, relationship).
e.g., IV: motivation; DV: degree of project success.
Is there a relationship between the motivation of a person and the degree of
project success?
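For data without ties, Spearman's rho reduces to a simple formula on the ranks: rho = 1 − 6·Σd² / (n(n² − 1)), where d is the rank difference per subject. The sketch below uses invented ordinal scores.

```python
# Sketch: Spearman's rho by hand, for data WITHOUT ties.
# Motivation and success scores are invented ordinal values.
motivation = [1, 2, 3, 4, 5]
success    = [2, 1, 4, 3, 5]

def ranks(xs):
    # rank 1 = smallest value (assumes no ties)
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

rx, ry = ranks(motivation), ranks(success)
n = len(rx)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho = 1 - 6 * d2 / (n * (n * n - 1))
print(rho)  # 0.8
```

With ties, the general rank-correlation formula (Pearson's R on the ranks) must be used instead; a statistics library handles that case for you.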
Pearson's R (interval/ratio DV, relationship).
e.g., IV: review duration; DV: number of defects identified.
Is there a relationship between review duration and number of defects identified?
T-test, single sample (interval/ratio DV, difference, zero independent variables,
standard deviation unknown; Z-test when the standard deviation is known).
e.g., IV: none; DV: number of defects per code module.
Is there a difference between the number of defects identified in the modules
and the expected mean value?
T-test, paired (interval/ratio DV, difference, single group of subjects, two
treatments).
e.g., IV: tool; DV: speed in finding bugs.
Does the tool improve the users' speed in finding bugs?
(Is there a difference in terms of speed WITH and WITHOUT the tool?)
Paired T-test (Example)
• I have a new tool to support bug identification in code review, and I want to
understand whether it is effective or not

• RQ: Does the tool improve the users' speed of finding bugs?
• Independent Variable: tool (YES/NO) — two treatments (TOOL/NO-TOOL)

• Dependent Variable: speed = number of bugs found/minute

• H0: the speed with the tool is lower than or equal to the speed without the tool
• Design: I have 13 users and ONE code file to review; I will let them first do the
bug search WITHOUT the tool (treatment NO-TOOL), and then do the search WITH the
tool (treatment TOOL). Then, I will compare the speed of each user in the two tasks,
to see if they improve.
a.k.a. repeated-measures t-test, paired-samples t-test,
matched-pairs t-test and matched-samples t-test
What's wrong with this design?
Learning bias: if I use the same file to be reviewed, students will have learned
which bugs are in the file, and in the second treatment (TOOL) they will be faster
regardless of the tool!
Paired T-test (Corrected Example)
• I have a new tool to support bug identification in code review, and I want to
understand whether it is effective or not

• RQ: Does the tool improve the users' speed of finding bugs?
• Independent Variable: tool (YES/NO) — two treatments

• Dependent Variable: speed = number of bugs found/minute

• H0: the speed with the tool is lower than or equal to the speed without the tool
• Design: I have 13 users and ONE code file to review; I will let
them first do the bug search WITH the tool (treatment TOOL), and
THEN do the search WITHOUT the tool (treatment NO-TOOL). Then, I
will compare the speed of each student in the two tasks.

Now the learning bias would be in favour of the NO-TOOL treatment;
if I am still able to reject the null hypothesis, I can be quite confident that
the tool increases the speed.
Is ONE code file sufficient?
Paired T-test (Corrected Example)
• Design: I have 13 users and TWO equivalent code files
to review (files X and Y); I will let them first do the bug
search WITH the tool on file X (treatment TOOL), and
THEN do the search WITHOUT the tool on file Y
(treatment NO-TOOL). Then, I will compare the speed of
each student in the two tasks.

• With TWO equivalent code files, I am more confident that
the first treatment does not influence the second treatment.
But what if the task lasts too long, and the students get tired in the second task?
The effect of fatigue needs to be considered, so I need to run the two
treatments on two separate days (or allow sufficient time between tasks).
Paired T-test
• H0: the speed with the tool is lower than or equal to the speed
without the tool (one-tailed hypothesis)

bugs/min by user:

USER   NO-TOOL   TOOL
u0        3        6
u1        3        6
u2        4        5
u3        3        8
u4        5        3
u5        7        5
u6        2        6
u7        1        5
u8        2        3
u9        8        9
u10       9       11
u11       1        4
u12       7        9

t = 3.24, p-value = 0.00354

CURIOSITY: What calculations are made to find the t-value (the test statistic)?
Computing the t-test statistic (paired case)
The t-test statistic is based on the differences between the two measures
(TOOL − NO-TOOL, for each user). With M the mean of the differences, SS the sum of
squared deviations of the differences from M, and n the number of pairs, the test
statistic for the paired t-test is:

t = (M − μ) / sqrt( SS / ((n − 1) · n) )

where μ is the expected difference if H0 is true (hence no difference, μ = 0).
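The computation can be reproduced in plain Python with the data from the example. This is a sketch of the formula above; in practice a statistics library would also give the p-value.

```python
# Paired t computation with the slide's data (bugs/min per user).
no_tool = [3, 3, 4, 3, 5, 7, 2, 1, 2, 8, 9, 1, 7]
tool    = [6, 6, 5, 8, 3, 5, 6, 5, 3, 9, 11, 4, 9]

diffs = [t - n for t, n in zip(tool, no_tool)]
n = len(diffs)
m = sum(diffs) / n                       # mean difference M
ss = sum((d - m) ** 2 for d in diffs)    # sum of squared deviations SS
mu = 0                                   # expected difference under H0

t_stat = (m - mu) / (ss / ((n - 1) * n)) ** 0.5
print(round(t_stat, 2))  # 3.24
```

The value matches the t = 3.24 reported on the slide; looking it up in a t distribution with n − 1 = 12 degrees of freedom gives the one-tailed p-value of 0.00354.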
One-way ANOVA, repeated measures (interval/ratio DV, difference, single group of
subjects, more than two treatments).
e.g., IV: tool; DV: speed in finding bugs.
What is the difference between tools A, B, C and D in terms of the bug-detection
speed achieved by users?
T-test, unpaired (interval/ratio DV, difference, different groups of subjects, two
treatments).
e.g., IV: tool; DV: speed in finding bugs.
Does the tool improve the users' speed in finding bugs?
(Is there a difference in terms of speed between the group WITH and the group
WITHOUT the tool?)
Unpaired T-test (Example)
• RQ: Does the tool improve the users' speed of finding bugs?

• I want to completely get rid of the learning bias and of the fatigue effect, and I have a
sufficient number of users (26 instead of 13)

• I change the design by having two groups: I randomly allocate subjects and assign each
subject to one of the treatments (TOOL, NO-TOOL)

• I have to assess that there is no difference in the initial competence of the users. To this
end, I can run a pre-test, which allows me to check that the subjects in the two groups
have the same (average) degree of competence in finding bugs.

• Otherwise, I can provide sound arguments to justify that ALL the subjects have the same
degree of competence (e.g., the subjects are students who come from the same course and are
all novices… hence my results are valid solely for this category of users)

• Note that the two groups need to be balanced, but you do not need exactly the same
number of people in the two groups (e.g., 25 people can be divided into groups of 13
and 12 subjects)
a.k.a. independent-measures t-test, unpaired-samples t-test
The problem is the same as for the paired T-test!
Unpaired T-test (Example)

USER   NO-TOOL        USER   TOOL
u0        3           u13       6
u1        3           u14       6
u2        4           u15       5
u3        3           u16       8
u4        5           u17       3
u5        7           u18       5
u6        2           u19       6
u7        1           u20       5
u8        2           u21       3
u9        8           u22       9
u10       9           u23      11
u11       1           u24       4
u12       7           u25       9

t-value = -1.89889, p-value = .034833

Note that the t-value is different from the t-value of the paired case, although
the numbers in the tables are THE SAME (but coming from different subjects)!

CURIOSITY: What calculations are made to find this t-value (the test statistic)?
Computing the t-test statistic (unpaired case)
With Mx and My the means of the two groups (NO-TOOL and TOOL), SSx and SSy the sums
of squared deviations within each group, and nx, ny the group sizes, the test
statistic for the unpaired t-test is:

t = (Mx − My) / sqrt( ((SSx + SSy) / (nx + ny − 2)) · (1/nx + 1/ny) )
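Again, the computation can be reproduced in plain Python with the same numbers, now assigned to two independent groups of users.

```python
# Unpaired t computation with the slide's data (two independent groups).
no_tool = [3, 3, 4, 3, 5, 7, 2, 1, 2, 8, 9, 1, 7]     # users u0-u12
tool    = [6, 6, 5, 8, 3, 5, 6, 5, 3, 9, 11, 4, 9]    # users u13-u25

def mean_ss(xs):
    # mean and sum of squared deviations from the mean
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs)

mx, ssx = mean_ss(no_tool)
my, ssy = mean_ss(tool)
nx, ny = len(no_tool), len(tool)

pooled_var = (ssx + ssy) / (nx + ny - 2)
t_stat = (mx - my) / (pooled_var * (1 / nx + 1 / ny)) ** 0.5
print(round(t_stat, 5))  # -1.89889
```

The value matches the slide's t = -1.89889, with nx + ny − 2 = 24 degrees of freedom for the p-value lookup.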
What about the Effect Size?
• In this case, my hypothesis is about a difference, therefore I will use Cohen's d
[Formula and the group statistics for NO-TOOL and TOOL shown on the slide]
d = (6.15 - 4.23) / 6.701138 = 0.286519
I have a SMALL to MEDIUM effect size (see the effect-size table from some slides ago…)
One-way ANOVA, independent measures (interval/ratio DV, difference, different groups
of subjects, more than two treatments).
e.g., which is the difference between tools A, B and C in terms of the bug-detection
speed achieved by users? (Same question as for repeated measures, but I use a
different design with different people.)
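The F statistic behind one-way ANOVA (independent measures) is the ratio of between-group to within-group mean squares. The sketch below uses tiny invented speed values, chosen only to keep the arithmetic readable.

```python
# Sketch: one-way ANOVA F statistic, independent measures.
# Group values are invented bug-detection speeds for three tools.
groups = {
    "A": [1, 2, 3],
    "B": [2, 3, 4],
    "C": [3, 4, 5],
}

all_vals = [v for g in groups.values() for v in g]
grand_mean = sum(all_vals) / len(all_vals)

# Between-groups sum of squares: group sizes times squared deviation
# of each group mean from the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())
# Within-groups sum of squares: squared deviations from each group mean
ss_within = sum((v - sum(g) / len(g)) ** 2
                for g in groups.values() for v in g)

df_between = len(groups) - 1
df_within = len(all_vals) - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f_stat)  # 3.0
```

A large F means the variation between tools is large compared with the variation among users of the same tool; the p-value comes from the F distribution with (df_between, df_within) degrees of freedom.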
Factorial ANOVA (interval/ratio DV, difference, more than one independent variable).
e.g., what is the influence of different tools and of experience on the bug-detection
speed? (I consider not only the tool, but also the experience as an independent
variable.)
Factorial ANOVA (Example)
• Let's imagine we have two tools A and B to support bug detection; I want to see which one is better, but I also
want to see whether there is some difference between people with different degrees of experience in bug
detection

• RQ: What is the influence of different tools and of experience on bug detection speed?

• Here I want to see which of the two factors (users' experience and type of tool, my independent variables) has
more impact on bug detection speed

• I have three NULL hypotheses this time:

• H0-1: The speed does not depend on the type of adopted tool
• H0-2: The speed does not depend on the level of experience of the user

• H0-3: The speed does not depend on the interaction between type of adopted tool and level of experience
• Design
• User experience has 3 levels: low, medium, high

• Type of tool has 2 levels: tool A, tool B (in principle, I should also have NO tool…)

• Therefore, I have 3 x 2 = 6 possible situations (i.e., people with low experience using tool A, others using tool
B, etc.), and I have to split my subjects into 6 groups
Factorial ANOVA (Example)

Data:
User  Exp.     Tool  Speed
1     low      A     12
2     low      B      4
3     low      A      7
4     low      B      3
5     medium   A      9
6     medium   B     12
7     medium   A     16
8     medium   B     23
9     high     A     23
10    high     B     16
11    high     A     14
12    high     B     12
…     …        …     …

ANOVA Results (the F-value is the test statistic for ANOVA):

Factor       Mean Square  F-value  p-value
Exp.         2664         147.51   <0.001   → experience is significant (reject H0-2)
Tool         29.4         1.62     0.207    → tool is not significant (cannot reject H0-1)
Exp. × Tool  83.85        4.64     0.014    → the interaction of the two factors is significant (reject H0-3)
How to Select the Right Test
• Follow the diagram

• Use the wizard at https://www.socscistatistics.com/tests/what_stats_test_wizard.aspx

• Use the exhaustive table at https://stats.idre.ucla.edu/other/mult-pkg/whatstat/
which also contains R code and code for other tools

• To find non-parametric alternatives: https://help.xlstat.com/s/article/which-statistical-test-should-you-use?language=en_US

• Always remember to check that the test assumptions hold
• It takes time to acquire confidence with experiment design, so DO NOT BE SCARED
How To Select the Right Test
Overview of parametric/non-parametric tests for different designs
(Table 10.3 in Wohlin et al.):

Design                                              Parametric      Non-parametric
One factor, one treatment                           -               Chi-2, Binomial test
One factor, two treatments, completely randomized   t-test, F-test  Mann-Whitney, Chi-2
One factor, two treatments, paired comparison       Paired t-test   Wilcoxon, Sign test
One factor, more than two treatments                ANOVA           Kruskal-Wallis, Chi-2
More than one factor                                ANOVA (a)       -

(a) This test is not described in the book; refer instead to, for example,
Marascuilo and Serlin [119] and Montgomery [125]

Factor = number of independent variables
Treatments = possible values of the independent variables
cf. Wohlin et al. https://doi.org/10.1007/978-3-642-29044-2
Fundamental tests
Threats To Validity for Controlled Experiments
• Construct Validity: to which extent do the measured variables represent
what I intended to estimate? Did I operationalise my research questions
in the proper manner? Did I use an appropriate design?
• Internal Validity: are there any confounding factors that may have
influenced the outcome of the experiments? Did I control all the
variables?

• External Validity: for which values of the controlled variables are the
results valid? To which extent can the results be considered general?
• (Statistical) Conclusion Validity: to which extent are my findings
credible? Have I used the appropriate statistical tests? Did I check their
assumptions? Have I sampled the population in the appropriate way?
Have I used reliable measurement procedures (low measurement error)?
Internal Validity
• Factors jeopardising internal validity are, e.g.:

• History: did time impact the treatments? (e.g., people participating at
different times of the day, or treatments performed on different days)

• Maturation: did subjects learn throughout the experiment? Did time during
the experiment affect the performance? (e.g., people can get bored or tired)

• Experimental mortality: how many subjects left the experiment, and how
did this affect the treatment groups? Are the remaining subjects the most
motivated?
• Researcher bias: in which way could the researcher influence the
outcomes? (e.g., the presence of the researcher influences the participants)

• Experimental context: to which extent does the experimental context
influence the behaviour of subjects?
cf. https://web.pdx.edu/~stipakb/download/PA555/ResearchDesign.html
External Validity
• Factors jeopardising external validity are, e.g.:

• Selection bias: are the selected subjects really
random, and are they randomly assigned to treatments?

• Representativeness: to what extent does the experiment
represent a real context? To what extent was I able to
properly represent all the realistic combinations of the
control variables? To what extent was I able to select
representative people and representative situations?
Construct Validity
• Factors jeopardising construct validity are, e.g.:

• Hypothesis guessing: does knowing the expected result
influence the behaviour of the participants?

• Bias in experimental design: were my operationalisation and
design correct?

• Subjective measures: to what extent are the subjective
measures reliable?
Conclusion Validity
• Factors jeopardising conclusion validity are, e.g.:

• Low statistical power: power is the probability of
correctly rejecting the NULL hypothesis when it is FALSE; I
may fail to reject the NULL hypothesis if I have low
statistical power; low statistical power occurs when I have
few samples and a small effect size.

• Violated assumptions: remember that all tests have
assumptions to check.

• Unreliable measures of the variables: a large amount of
measurement error.
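The link between sample size and power can be illustrated with a minimal Monte Carlo sketch (stdlib only; the effect size, sample sizes, number of trials and critical value below are illustrative assumptions, not prescriptions):

```python
# Power = probability of rejecting H0 when the alternative is true.
# We simulate many experiments with a true (small) effect and count rejections.
import random
import statistics

def two_sample_t(a, b):
    """Welch's t statistic for two independent samples."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / ((va / na + vb / nb) ** 0.5)

def estimated_power(n, effect=0.3, trials=2000, crit=2.0, seed=1):
    """Fraction of simulated experiments where |t| exceeds the critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treated = [rng.gauss(effect, 1.0) for _ in range(n)]  # true effect exists
        if abs(two_sample_t(treated, control)) > crit:
            rejections += 1
    return rejections / trials

# With few samples the (real) effect is rarely detected; with many it usually is.
print(estimated_power(10))   # low power: H0 is almost never rejected
print(estimated_power(200))  # high power: H0 is usually rejected
```

With 10 subjects per group the simulated experiments detect the effect only rarely, i.e., a real difference would usually go unreported: exactly the "low statistical power" threat above.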
Preparing, Executing and
Reporting Experiments
Theory
Hypothesis and
Variable Deļ¬nition
Research Design
Research Question
Deļ¬ne Measures for
Variables
Recruit Participants
/ Select Artifacts
PREPARATION EXECUTION
Collect Data
Analyse Data
Report Answers
Internal Validity
External Validity
Construct &
Conclusion Validity
Construct
Validity
REPORTING
Discuss
Table 11.1 Proposed reporting structure for experiment reports, by Jedlitschka and Pfahl [86]
Sections/subsections: Contents
Title, authorship
Structured abstract: Summarizes the paper under headings of background or context,
objectives or aims, method, results, and conclusions
Motivation: Sets the scope of the work and encourages readers to read the rest of the paper
Problem statement: Reports what the problem is, where it occurs, and who observes it
Research objectives: Defines the experiment using the formalized style used in GQM
Context: Reports environmental factors such as settings and locations
Related work: How the current study relates to other research
Experimental design: Describes the outcome of the experimental planning stage
Goals, hypotheses and variables: Presents the refined research objectives
Design: Defines the type of experimental design
Subjects: Defines the methods used for subject sampling and group allocation
Objects: Defines what experimental objects were used
Instrumentation: Defines any guidelines and measurement instruments used
Data collection procedure: Defines the experimental schedule, timing and data collection procedures
Analysis procedure: Specifies the mathematical analysis model to be used
Evaluation of validity: Describes the validity of materials, procedures to ensure participants
keep to the experimental method, and methods to ensure the
reliability and validity of data collection methods and tools
Execution: Describes how the experimental plan was implemented
Sample: Description of the sample characteristics
Preparation: How the experimental groups were formed and trained
Reporting Experiments (1)
cf. Wohlin et al. https://doi.org/10.1007/978-3-642-29044-2
Reporting Experiments (2)
Analysis procedure Speciļ¬es the mathematical analysis model to be used
Evaluation of validity Describes the validity of materials, procedures to ensure participants
keep to the experimental method, and methods to ensure the
reliability and validity of data collection methods and tools
Execution Describes how the experimental plan was implemented
Sample Description of the sample characteristics
Preparation How the experimental groups were formed and trained
Data collection
performed
How data collection took place and any deviations from plan
Validity procedure How the validity process was followed and any deviation from plan
Analysis Summarizes the collected data and describes how it was analyzed
Descriptive statistics Presentation of the data using descriptive statistics
Data set reduction Describes any reduction of the data set e.g. removal of outliers
Hypothesis testing Describes how the data was evaluated and how the analysis model was
validated
Interpretation Interprets the ļ¬ndings from the Analysis section
Evaluation of results
and implications
Explains the results
Limitations of study Discusses threats to validity
Inferences How the results generalize given the ļ¬ndings and limitations
Lesson learnt Descriptions of what went well and what did not during the course of
the experiment
Conclusions and
future work
Presents a summary of the study
Relation to existing
evidence
Describes the contribution of the study in the context of earlier
experiments
Impact Identiļ¬es the most important ļ¬ndings
Limitations Identiļ¬es main limitations of approach i.e. circumstances when the
expected beneļ¬ts will not be delivered
Future work Suggestions for other experiments to further investigate
Acknowledgements Identiļ¬es any contributors who do not fulļ¬ll authorship criteria
References Lists all cited literature
Appendices Includes raw data and/or detailed analyses which might help others to
cf. Wohlin et al. https://doi.org/10.1007/978-3-642-29044-2
Quasi-Experiments
What about Quasi-
Experiments?
• In experiments, I randomly assign subjects to treatments;

• In quasi-experiments, the assignment is based on some choices of
the designer (e.g., the Factorial ANOVA example, in which I have more
than one level of experience);

• Note that a quasi-experiment does not always allow us to convincingly
establish causal relationships (e.g., different degrees of experience
may be related to other factors that may have influenced the outcome);

• When I use a group of students from a certain class for my research, I
am neither performing an experiment nor a quasi-experiment, but a
case study, as I am focusing on a specific environment and I selected
the subjects opportunistically.
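The random assignment that distinguishes a true experiment from a quasi-experiment can be sketched in a few lines (the subject IDs and seed are hypothetical):

```python
# Randomly assign 20 subjects to two treatment groups of equal size.
# A fixed seed makes the assignment reproducible when reporting the study.
import random

subjects = [f"S{i}" for i in range(1, 21)]
rng = random.Random(42)
rng.shuffle(subjects)                      # random order, not designer's choice
group_a, group_b = subjects[:10], subjects[10:]

print(group_a)
print(group_b)
```

If group membership were instead decided by a pre-existing attribute (e.g., experience), the study would be a quasi-experiment, and that attribute could confound the comparison.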
Summary
• Controlled Experiments in SE are a research strategy mostly oriented to test the
impact of some treatment (method, tool) on a certain dependent variable (e.g.,
speed, bugs, success, happiness)

• They are based on Hypothesis testing, which implies showing that the
experimental data REJECT the NULL hypothesis (i.e., no impact on the dependent
variable)

• Hypothesis testing uses Statistical tests to decide whether the NULL can be
REJECTED

• The selection of the statistical test depends on the Experimental design (look at
https://stats.idre.ucla.edu/other/mult-pkg/whatstat/)

• When I perform a statistical test, I hope to obtain small p-values and a large
effect size

• Remember to analyse and report Threats to Validity

More Related Content

What's hot

Power BI Architecture
Power BI ArchitecturePower BI Architecture
Power BI ArchitectureArthur Graus
Ā 
Calculated Fields in Tableau
Calculated Fields in TableauCalculated Fields in Tableau
Calculated Fields in TableauKanika Nagpal
Ā 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
Ā 
Research Methods: Experimental Design I (Single Factor)
Research Methods: Experimental Design I (Single Factor)Research Methods: Experimental Design I (Single Factor)
Research Methods: Experimental Design I (Single Factor)Brian Piper
Ā 
Tableau slideshare
Tableau slideshareTableau slideshare
Tableau slideshareSakshi Jain
Ā 
SKILLWISE-SSIS DESIGN PATTERN FOR DATA WAREHOUSING
SKILLWISE-SSIS DESIGN PATTERN FOR DATA WAREHOUSINGSKILLWISE-SSIS DESIGN PATTERN FOR DATA WAREHOUSING
SKILLWISE-SSIS DESIGN PATTERN FOR DATA WAREHOUSINGSkillwise Group
Ā 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningSpark Summit
Ā 
Data Visualisation & Analytics with Tableau (Beginner) - by Maria Koumandraki
Data Visualisation & Analytics with Tableau (Beginner) - by Maria KoumandrakiData Visualisation & Analytics with Tableau (Beginner) - by Maria Koumandraki
Data Visualisation & Analytics with Tableau (Beginner) - by Maria KoumandrakiOutreach Digital
Ā 
Battle of the Stream Processing Titans ā€“ Flink versus RisingWave
Battle of the Stream Processing Titans ā€“ Flink versus RisingWaveBattle of the Stream Processing Titans ā€“ Flink versus RisingWave
Battle of the Stream Processing Titans ā€“ Flink versus RisingWaveYingjun Wu
Ā 
рŠµŠ»ŃŃ†Ń–Š¹Š½Š° Š°Š»Š³ŠµŠ±Ń€Š° Š»ŠµŠŗцŠøя
рŠµŠ»ŃŃ†Ń–Š¹Š½Š° Š°Š»Š³ŠµŠ±Ń€Š° Š»ŠµŠŗцŠøярŠµŠ»ŃŃ†Ń–Š¹Š½Š° Š°Š»Š³ŠµŠ±Ń€Š° Š»ŠµŠŗцŠøя
рŠµŠ»ŃŃ†Ń–Š¹Š½Š° Š°Š»Š³ŠµŠ±Ń€Š° Š»ŠµŠŗцŠøяvika_kopoty
Ā 
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanDatabricks
Ā 
Sparkler - Spark Crawler
Sparkler - Spark Crawler Sparkler - Spark Crawler
Sparkler - Spark Crawler Thamme Gowda
Ā 
Models for g x e analysis
Models for g x e analysisModels for g x e analysis
Models for g x e analysisICRISAT
Ā 

What's hot (16)

Power BI Architecture
Power BI ArchitecturePower BI Architecture
Power BI Architecture
Ā 
Calculated Fields in Tableau
Calculated Fields in TableauCalculated Fields in Tableau
Calculated Fields in Tableau
Ā 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Ā 
Research Methods: Experimental Design I (Single Factor)
Research Methods: Experimental Design I (Single Factor)Research Methods: Experimental Design I (Single Factor)
Research Methods: Experimental Design I (Single Factor)
Ā 
Tableau slideshare
Tableau slideshareTableau slideshare
Tableau slideshare
Ā 
SKILLWISE-SSIS DESIGN PATTERN FOR DATA WAREHOUSING
SKILLWISE-SSIS DESIGN PATTERN FOR DATA WAREHOUSINGSKILLWISE-SSIS DESIGN PATTERN FOR DATA WAREHOUSING
SKILLWISE-SSIS DESIGN PATTERN FOR DATA WAREHOUSING
Ā 
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic RepartitioningHandling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Handling Data Skew Adaptively In Spark Using Dynamic Repartitioning
Ā 
My tableau
My tableauMy tableau
My tableau
Ā 
Data Visualisation & Analytics with Tableau (Beginner) - by Maria Koumandraki
Data Visualisation & Analytics with Tableau (Beginner) - by Maria KoumandrakiData Visualisation & Analytics with Tableau (Beginner) - by Maria Koumandraki
Data Visualisation & Analytics with Tableau (Beginner) - by Maria Koumandraki
Ā 
Battle of the Stream Processing Titans ā€“ Flink versus RisingWave
Battle of the Stream Processing Titans ā€“ Flink versus RisingWaveBattle of the Stream Processing Titans ā€“ Flink versus RisingWave
Battle of the Stream Processing Titans ā€“ Flink versus RisingWave
Ā 
рŠµŠ»ŃŃ†Ń–Š¹Š½Š° Š°Š»Š³ŠµŠ±Ń€Š° Š»ŠµŠŗцŠøя
рŠµŠ»ŃŃ†Ń–Š¹Š½Š° Š°Š»Š³ŠµŠ±Ń€Š° Š»ŠµŠŗцŠøярŠµŠ»ŃŃ†Ń–Š¹Š½Š° Š°Š»Š³ŠµŠ±Ń€Š° Š»ŠµŠŗцŠøя
рŠµŠ»ŃŃ†Ń–Š¹Š½Š° Š°Š»Š³ŠµŠ±Ń€Š° Š»ŠµŠŗцŠøя
Ā 
Normalization
NormalizationNormalization
Normalization
Ā 
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Ā 
Sparkler - Spark Crawler
Sparkler - Spark Crawler Sparkler - Spark Crawler
Sparkler - Spark Crawler
Ā 
Avro
AvroAvro
Avro
Ā 
Models for g x e analysis
Models for g x e analysisModels for g x e analysis
Models for g x e analysis
Ā 

Similar to Controlled experiments, Hypothesis Testing, Test Selection, Threats to Validity

Towards Automated A/B Testing
Towards Automated A/B TestingTowards Automated A/B Testing
Towards Automated A/B TestingGiordano Tamburrelli
Ā 
Chapter 1-Object Oriented Software Engineering.pptx
Chapter 1-Object Oriented Software Engineering.pptxChapter 1-Object Oriented Software Engineering.pptx
Chapter 1-Object Oriented Software Engineering.pptxaroraritik30
Ā 
Software testing software engineering.pdf
Software testing software engineering.pdfSoftware testing software engineering.pdf
Software testing software engineering.pdfvaibhavshukla3003
Ā 
Unit testing
Unit testingUnit testing
Unit testingmedsherb
Ā 
Manual Tester Interview Questions(1).pdf
Manual Tester Interview Questions(1).pdfManual Tester Interview Questions(1).pdf
Manual Tester Interview Questions(1).pdfSupriyaDongare
Ā 
Testing, fixing, and proving with contracts
Testing, fixing, and proving with contractsTesting, fixing, and proving with contracts
Testing, fixing, and proving with contractsCarlo A. Furia
Ā 
FutureOfTesting2008
FutureOfTesting2008FutureOfTesting2008
FutureOfTesting2008vipulkocher
Ā 
Chapter 10 Testing and Quality Assurance1Unders.docx
Chapter 10 Testing and Quality Assurance1Unders.docxChapter 10 Testing and Quality Assurance1Unders.docx
Chapter 10 Testing and Quality Assurance1Unders.docxketurahhazelhurst
Ā 
Automation in the Bug Flow - Machine Learning for Triaging and Tracing
Automation in the Bug Flow - Machine Learning for Triaging and TracingAutomation in the Bug Flow - Machine Learning for Triaging and Tracing
Automation in the Bug Flow - Machine Learning for Triaging and TracingMarkus Borg
Ā 
Softwaretestingtoolsanditstaxonomy 131204003332-phpapp01
Softwaretestingtoolsanditstaxonomy 131204003332-phpapp01Softwaretestingtoolsanditstaxonomy 131204003332-phpapp01
Softwaretestingtoolsanditstaxonomy 131204003332-phpapp01Aravindharamanan S
Ā 
Software testing tools and its taxonomy
Software testing tools and its taxonomySoftware testing tools and its taxonomy
Software testing tools and its taxonomyHimanshu
Ā 
SOFTWARE TESTING.pptx
SOFTWARE TESTING.pptxSOFTWARE TESTING.pptx
SOFTWARE TESTING.pptxssrpr
Ā 
Software Development and Quality
Software Development and QualitySoftware Development and Quality
Software Development and QualityHerwig Habenbacher
Ā 
Types of Software Testing
Types of Software TestingTypes of Software Testing
Types of Software TestingNishant Worah
Ā 
Manual Testing Interview Questions & Answers.docx
Manual Testing Interview Questions & Answers.docxManual Testing Interview Questions & Answers.docx
Manual Testing Interview Questions & Answers.docxssuser305f65
Ā 
Static white box testing lecture 12
Static white box testing lecture 12Static white box testing lecture 12
Static white box testing lecture 12Abdul Basit
Ā 
White box testing
White box testingWhite box testing
White box testingAbdul Basit
Ā 
Software testing introduction
Software testing introductionSoftware testing introduction
Software testing introductionSriman Eshwar
Ā 

Similar to Controlled experiments, Hypothesis Testing, Test Selection, Threats to Validity (20)

Towards Automated A/B Testing
Towards Automated A/B TestingTowards Automated A/B Testing
Towards Automated A/B Testing
Ā 
Chapter 1-Object Oriented Software Engineering.pptx
Chapter 1-Object Oriented Software Engineering.pptxChapter 1-Object Oriented Software Engineering.pptx
Chapter 1-Object Oriented Software Engineering.pptx
Ā 
Software testing software engineering.pdf
Software testing software engineering.pdfSoftware testing software engineering.pdf
Software testing software engineering.pdf
Ā 
Unit testing
Unit testingUnit testing
Unit testing
Ā 
L software testing
L   software testingL   software testing
L software testing
Ā 
Manual Tester Interview Questions(1).pdf
Manual Tester Interview Questions(1).pdfManual Tester Interview Questions(1).pdf
Manual Tester Interview Questions(1).pdf
Ā 
Testing, fixing, and proving with contracts
Testing, fixing, and proving with contractsTesting, fixing, and proving with contracts
Testing, fixing, and proving with contracts
Ā 
FutureOfTesting2008
FutureOfTesting2008FutureOfTesting2008
FutureOfTesting2008
Ā 
Chapter 10 Testing and Quality Assurance1Unders.docx
Chapter 10 Testing and Quality Assurance1Unders.docxChapter 10 Testing and Quality Assurance1Unders.docx
Chapter 10 Testing and Quality Assurance1Unders.docx
Ā 
Automation in the Bug Flow - Machine Learning for Triaging and Tracing
Automation in the Bug Flow - Machine Learning for Triaging and TracingAutomation in the Bug Flow - Machine Learning for Triaging and Tracing
Automation in the Bug Flow - Machine Learning for Triaging and Tracing
Ā 
Softwaretestingtoolsanditstaxonomy 131204003332-phpapp01
Softwaretestingtoolsanditstaxonomy 131204003332-phpapp01Softwaretestingtoolsanditstaxonomy 131204003332-phpapp01
Softwaretestingtoolsanditstaxonomy 131204003332-phpapp01
Ā 
Software testing tools and its taxonomy
Software testing tools and its taxonomySoftware testing tools and its taxonomy
Software testing tools and its taxonomy
Ā 
SOFTWARE TESTING.pptx
SOFTWARE TESTING.pptxSOFTWARE TESTING.pptx
SOFTWARE TESTING.pptx
Ā 
Software Development and Quality
Software Development and QualitySoftware Development and Quality
Software Development and Quality
Ā 
Types of Software Testing
Types of Software TestingTypes of Software Testing
Types of Software Testing
Ā 
Manual Testing Interview Questions & Answers.docx
Manual Testing Interview Questions & Answers.docxManual Testing Interview Questions & Answers.docx
Manual Testing Interview Questions & Answers.docx
Ā 
Static white box testing lecture 12
Static white box testing lecture 12Static white box testing lecture 12
Static white box testing lecture 12
Ā 
White box testing
White box testingWhite box testing
White box testing
Ā 
Testing
TestingTesting
Testing
Ā 
Software testing introduction
Software testing introductionSoftware testing introduction
Software testing introduction
Ā 

More from alessio_ferrari

Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...alessio_ferrari
Ā 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studiesalessio_ferrari
Ā 
Case Study Research in Software Engineering
Case Study Research in Software EngineeringCase Study Research in Software Engineering
Case Study Research in Software Engineeringalessio_ferrari
Ā 
Survey Research In Empirical Software Engineering
Survey Research In Empirical Software EngineeringSurvey Research In Empirical Software Engineering
Survey Research In Empirical Software Engineeringalessio_ferrari
Ā 
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...alessio_ferrari
Ā 
Requirements Engineering: focus on Natural Language Processing, Lecture 2
Requirements Engineering: focus on Natural Language Processing, Lecture 2Requirements Engineering: focus on Natural Language Processing, Lecture 2
Requirements Engineering: focus on Natural Language Processing, Lecture 2alessio_ferrari
Ā 
Requirements Engineering: focus on Natural Language Processing, Lecture 1
Requirements Engineering: focus on Natural Language Processing, Lecture 1Requirements Engineering: focus on Natural Language Processing, Lecture 1
Requirements Engineering: focus on Natural Language Processing, Lecture 1alessio_ferrari
Ā 
Ambiguity in Software Engineering
Ambiguity in Software EngineeringAmbiguity in Software Engineering
Ambiguity in Software Engineeringalessio_ferrari
Ā 
Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overviewalessio_ferrari
Ā 
Natural Language Processing (NLP) for Requirements Engineering (RE): an Overview
Natural Language Processing (NLP) for Requirements Engineering (RE): an OverviewNatural Language Processing (NLP) for Requirements Engineering (RE): an Overview
Natural Language Processing (NLP) for Requirements Engineering (RE): an Overviewalessio_ferrari
Ā 

More from alessio_ferrari (10)

Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...
Ā 
Systematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping StudiesSystematic Literature Reviews and Systematic Mapping Studies
Systematic Literature Reviews and Systematic Mapping Studies
Ā 
Case Study Research in Software Engineering
Case Study Research in Software EngineeringCase Study Research in Software Engineering
Case Study Research in Software Engineering
Ā 
Survey Research In Empirical Software Engineering
Survey Research In Empirical Software EngineeringSurvey Research In Empirical Software Engineering
Survey Research In Empirical Software Engineering
Ā 
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Qualitative Studies in Software Engineering - Interviews, Observation, Ground...
Ā 
Requirements Engineering: focus on Natural Language Processing, Lecture 2
Requirements Engineering: focus on Natural Language Processing, Lecture 2Requirements Engineering: focus on Natural Language Processing, Lecture 2
Requirements Engineering: focus on Natural Language Processing, Lecture 2
Ā 
Requirements Engineering: focus on Natural Language Processing, Lecture 1
Requirements Engineering: focus on Natural Language Processing, Lecture 1Requirements Engineering: focus on Natural Language Processing, Lecture 1
Requirements Engineering: focus on Natural Language Processing, Lecture 1
Ā 
Ambiguity in Software Engineering
Ambiguity in Software EngineeringAmbiguity in Software Engineering
Ambiguity in Software Engineering
Ā 
Empirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an OverviewEmpirical Methods in Software Engineering - an Overview
Empirical Methods in Software Engineering - an Overview
Ā 
Natural Language Processing (NLP) for Requirements Engineering (RE): an Overview
Natural Language Processing (NLP) for Requirements Engineering (RE): an OverviewNatural Language Processing (NLP) for Requirements Engineering (RE): an Overview
Natural Language Processing (NLP) for Requirements Engineering (RE): an Overview
Ā 

Recently uploaded

ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
Ā 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
Ā 
Hį»ŒC Tį»T TIįŗ¾NG ANH 11 THEO CHĘÆĘ NG TRƌNH GLOBAL SUCCESS ĐƁP ƁN CHI TIįŗ¾T - Cįŗ¢ NĂ...
Hį»ŒC Tį»T TIįŗ¾NG ANH 11 THEO CHĘÆĘ NG TRƌNH GLOBAL SUCCESS ĐƁP ƁN CHI TIįŗ¾T - Cįŗ¢ NĂ...Hį»ŒC Tį»T TIįŗ¾NG ANH 11 THEO CHĘÆĘ NG TRƌNH GLOBAL SUCCESS ĐƁP ƁN CHI TIįŗ¾T - Cįŗ¢ NĂ...
Hį»ŒC Tį»T TIįŗ¾NG ANH 11 THEO CHĘÆĘ NG TRƌNH GLOBAL SUCCESS ĐƁP ƁN CHI TIįŗ¾T - Cįŗ¢ NĂ...Nguyen Thanh Tu Collection
Ā 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
Ā 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptxSherlyMaeNeri
Ā 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPCeline George
Ā 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptxmary850239
Ā 
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxINTRODUCTION TO CATHOLIC CHRISTOLOGY.pptx
INTRODUCTION TO CATHOLIC CHRISTOLOGY.pptxHumphrey A BeƱa
Ā 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...SeƔn Kennedy
Ā 
Science 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxScience 7 Quarter 4 Module 2: Natural Resources.pptx
Science 7 Quarter 4 Module 2: Natural Resources.pptxMaryGraceBautista27
Ā 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
Ā 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Celine George
Ā 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
Ā 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
Ā 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
Ā 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
Ā 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
Ā 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
Ā 

Recently uploaded (20)

ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
Ā 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
Ā 
Hį»ŒC Tį»T TIįŗ¾NG ANH 11 THEO CHĘÆĘ NG TRƌNH GLOBAL SUCCESS ĐƁP ƁN CHI TIįŗ¾T - Cįŗ¢ NĂ...
Hį»ŒC Tį»T TIįŗ¾NG ANH 11 THEO CHĘÆĘ NG TRƌNH GLOBAL SUCCESS ĐƁP ƁN CHI TIįŗ¾T - Cįŗ¢ NĂ...Hį»ŒC Tį»T TIįŗ¾NG ANH 11 THEO CHĘÆĘ NG TRƌNH GLOBAL SUCCESS ĐƁP ƁN CHI TIįŗ¾T - Cįŗ¢ NĂ...
Hį»ŒC Tį»T TIįŗ¾NG ANH 11 THEO CHĘÆĘ NG TRƌNH GLOBAL SUCCESS ĐƁP ƁN CHI TIįŗ¾T - Cįŗ¢ NĂ...
Ā 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
Ā 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
Ā 
What is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERPWhat is Model Inheritance in Odoo 17 ERP
What is Model Inheritance in Odoo 17 ERP
Ā 
4.16.24 21st Century Movements for Black Lives.pptx
4.16.24 21st Century Movements for Black Lives.pptx4.16.24 21st Century Movements for Black Lives.pptx

Controlled experiments, Hypothesis Testing, Test Selection, Threats to Validity

  • 1. Controlled Experiments in Software Engineering cf. Pfleeger, 1995 https://doi.org/10.1007/BF02249052 cf. S. Easterbrook http://www.cs.toronto.edu/~sme/CSC2130/04-experiments.pdf Alessio Ferrari, ISTI-CNR, Pisa, Italy alessio.ferrari@isti.cnr.it
  • 2. Controlled Experiments aka Laboratory Experiments aka Experiment The ABC of Software Engineering Research In Vitro Experiment The GOAL is a Precise Measure of Behaviour
  • 3. Typical Examples • With software subjects: Tools A and B are automatic testing tools, and I want to compare them (no need to involve people) • With human subjects: Method M is a manual strategy for finding bugs. How effective is it for experts? How effective is it for novices? • With human and software subjects: • Tool T is an interactive testing tool, and I want to see whether it is more appropriate for novices or for experts • Tools A and B are interactive testing tools, and I want to compare them (I have to involve people) • Tools A and B are interactive testing tools, and I want to see which one is more appropriate for novices and which one for experts • Tool A and method M are two approaches for finding bugs, and I want to see which one is better
  • 4. Controlled Experiments and Theories Theory Observation Induction Hypothesis Deduction Test Theory Abduction Deduction DEDUCTIVE APPROACH
  • 5. Controlled Experiments: Process PREPARATION EXECUTION REPORTING Theory Hypothesis and Variable Definition Research Design Research Question Define Measures for Variables Recruit Participants / Select Artifacts Collect Data Analyse Data Report Answers Internal Validity External Validity Construct & Conclusion Validity Construct Validity Discuss The process normally starts from a Theory and discusses/modifies it in relation to the results Typically QUANTITATIVE
  • 7. Controlled Experiments: Elements Test Data from Experiment Test Statistic p-value Effect Size Effect size computation Significance α Hypothesis Treatments Analyse Data Collect Data independent variables dependent variables controlled variables Variable Measurements Data from Experiment Design Hypothesis ✅ ❓
  • 8. Controlled Experiments: Elements Test Data from Experiment Test Statistic p-value Effect Size Effect size computation Significance α Hypothesis Treatments Analyse Data Collect Data independent variables dependent variables controlled variables Variable Measurements Data from Experiment Design This part requires your creativity Hypothesis ✅ ❓
  • 9. Controlled Experiments: Elements Test Data from Experiment Test Statistic p-value Effect Size Effect size computation Significance α Hypothesis Treatments Analyse Data Collect Data independent variables dependent variables controlled variables Variable Measurements Data from Experiment Design This part requires your creativity This part is mostly automated (but you need to understand it!) Hypothesis ✅ ❓
  • 10. Controlled Experiment • "Experimental investigation of a testable hypothesis, in which conditions are set up to isolate the variables of interest (independent variables) and test how they affect certain measurable outcomes (the dependent variables)" INDEPENDENT variables (e.g., testing tool) DEPENDENT variables (e.g., number of bugs) aka FACTORS Each combination of values of the independent variables is a TREATMENT TREATMENTS Treatment 1 (e.g., testing tool A) Treatment 2 (e.g., testing tool B) cf. S. Easterbrook http://www.cs.toronto.edu/~sme/CSC2130/04-experiments.pdf
  • 11. Controlled Experiment • "Experimental investigation of a testable hypothesis, in which conditions are set up to isolate the variables of interest (independent variables) and test how they affect certain measurable outcomes (the dependent variables)" INDEPENDENT variables (e.g., testing tool) DEPENDENT variables (e.g., number of bugs) aka FACTORS Each combination of values of the independent variables is a TREATMENT TREATMENTS Treatment 1 (e.g., testing tool A) Treatment 2 (e.g., testing tool B) cf. S. Easterbrook http://www.cs.toronto.edu/~sme/CSC2130/04-experiments.pdf To ISOLATE the independent variables, the other variables need to be CONTROLLED (e.g., variables concerning the code samples on which the test is performed)
  • 12. Controlled Experiments equivalent for each treatment homogeneous general INDEPENDENT variables (e.g., testing tool) DEPENDENT variables (e.g., number of bugs) TREATMENTS Treatment 1 (e.g., testing tool A) Treatment 2 (e.g., testing tool B) CONTROLLED variables (e.g., sample length, type of language, complexity) representative related to human subjects related to objects
  • 13. Controlled Experiments equivalent for each treatment homogeneous general INDEPENDENT variables (e.g., testing tool) DEPENDENT variables (e.g., number of bugs) TREATMENTS Treatment 1 (e.g., testing tool A) Treatment 2 (e.g., testing tool B) CONTROLLED variables (e.g., sample length, type of language, complexity) Controlled variables when human subjects are involved may concern the experience of developers, age, etc. representative related to human subjects related to objects
  • 14. Definitions • Hypothesis: the statement I want to test with the experiment • Derived from a research question (e.g., What is the difference between A and B in terms of bug detection capability?) • Includes variables that represent constructs of interest (e.g., tools, methods, actors, number of bugs) • Concerns the measurable impact that a certain variation on some construct can have on other constructs (e.g., Tool A finds more bugs than tool B; Tool A finds fewer or the same number of bugs as tool B) • I normally have a NULL and an Alternative hypothesis; the one I will test is the NULL hypothesis, but the one I am interested in is the Alternative one (we'll see this later)
  • 15. Definitions • Independent Variables (INPUT): operationalisation of constructs that I want to isolate, and whose values I want to manipulate (e.g., the tool, the expertise of actors) • Treatments: combinations of values for the independent variables (tool A, tool B: 1 variable, two treatments; tool A and experts, tool A and novices, tool B and experts, tool B and novices: 2 variables, 4 treatments) • Dependent Variables (OUTPUT): operationalisation of constructs that I want to measure based on the manipulation of the independent variables (e.g., number of bugs) • Controlled Variables: attributes* of human subjects or objects that I need to control to mask or prevent their impact on the dependent variables (e.g., I have to test on some code that is sufficiently general, and equivalent for all cases) * = operationalisation of constructs
  • 16. Example: Software • Objective: I want to understand which is the better testing tool between two available choices, A and B • The independent variable is already identified: the tool (one factor) • Treatments are also straightforward: tool A and B (two treatments) • I still need the dependent variable: I have to detail what I mean by better. Better in terms of speed? Better in terms of bugs found? Both! Ok, I already have two dependent variables, which I can define as: • "effectiveness" = number of bugs found/total number of bugs • "efficiency" = running time/number of bugs found • Now I have to identify the controlled variables: what can impact effectiveness and efficiency, besides the type of tool? The user? Maybe not, if the tool is fully automatic; The language of the code? Well, I want to focus only on C code; The chosen code? Well yes, but which attributes of the chosen code? • number of bugs in the code module • length of the module • complexity of the module • domain of the code • ....
  • 17. Example: Software • Objective: I want to understand which is the better testing tool between two available choices, A and B • The independent variable is already identified: the tool (one factor) • Treatments are also straightforward: tool A and B (two treatments) • I still need the dependent variable: I have to detail what I mean by better. Better in terms of speed? Better in terms of bugs found? Both! Ok, I already have two dependent variables, which I can define as: • "effectiveness" = number of bugs found/total number of bugs • "efficiency" = running time/number of bugs found • Now I have to identify the controlled variables: what can impact effectiveness and efficiency, besides the type of tool? The user? Maybe not, if the tool is fully automatic; The language of the code? Well, I want to focus only on C code; The chosen code? Well yes, but which attributes of the chosen code? • number of bugs in the code module • length of the module • complexity of the module • domain of the code • .... I have to create a code sample that has sufficient variation in all of the controlled variables
  • 18. Example: Software • Objective: I want to understand which is the better testing tool between two available choices, A and B • The independent variable is already identified: the tool (one factor) • Treatments are also straightforward: tool A and B (two treatments) • I still need the dependent variable: I have to detail what I mean by better. Better in terms of speed? Better in terms of bugs found? Both! Ok, I already have two dependent variables, which I can define as: • "effectiveness" = number of bugs found/total number of bugs • "efficiency" = running time/number of bugs found • Now I have to identify the controlled variables: what can impact effectiveness and efficiency, besides the type of tool? The user? Maybe not, if the tool is fully automatic; The language of the code? Well, I want to focus only on C code; The chosen code? Well yes, but which attributes of the chosen code? • number of bugs in the code module • length of the module • complexity of the module • domain of the code • .... I have to create a code sample that has sufficient variation in all of the controlled variables If I cannot vary a certain variable, I have to fix it (e.g., C code, domain) and make this choice explicit, as it limits my scope of interest
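The two dependent variables defined in the slides above can be made concrete with a few lines of code. The sketch below is not part of the original deck: it assumes a hypothetical scenario where each tool reports how many of a known set of seeded bugs it found and how long it ran; all numbers are invented for illustration.

```python
# Hypothetical measurements for two automatic testing tools run on the
# same code base. TOTAL_BUGS and all figures below are invented.
TOTAL_BUGS = 40  # bugs known to be present in the code sample


def effectiveness(bugs_found: int, total_bugs: int = TOTAL_BUGS) -> float:
    """Slide definition: number of bugs found / total number of bugs."""
    return bugs_found / total_bugs


def efficiency(running_time_s: float, bugs_found: int) -> float:
    """Slide definition: running time / number of bugs found (lower is better)."""
    return running_time_s / bugs_found


# Tool A: 30 bugs in 120 s; Tool B: 24 bugs in 60 s.
print(effectiveness(30), efficiency(120.0, 30))  # 0.75 4.0
print(effectiveness(24), efficiency(60.0, 24))   # 0.6 2.5
```

Note how the two measures can disagree: in this made-up example Tool A is more effective, but Tool B is more efficient, which is exactly why the slide insists on defining what "better" means before running the experiment.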
  • 19. Example: Software and Humans • Objective: I want to see if the experience of the user affects the effectiveness of a certain testing tool • The dependent variable is already identified: the effectiveness (bugs found/total bugs) • I have to identify the independent variables: they should concern the experience of the user; how can I measure it? Years of experience in testing? Score from other colleagues? Well, normally it is better to select one independent variable only, otherwise I need too many treatments and I may not find enough participants! Ok, but what should I compare? 1, 2, 3, 4, 5 etc. years? That is also a lot of treatments; will I find enough people? I have to separate years of experience into levels. How do I select the levels? I have to make some assumptions based on existing literature, or I can make a decision that can be defended • I decide on two levels, and I partition into two treatments (i.e., two homogeneous groups of people) • from 0 to 1 years: novices • more than 5 years: experts • Now I have to identify the controlled variables: what can impact my outcomes besides the experience of users? Well, age, gender, all demographic variables...and of course, the code on which the tool is applied (previous variables) • I have to make some choices: I should fix a representative code base, use the same for all subjects, make sure none of them know the code in advance, and control demographic variables • Therefore, for each treatment, I have a group with comparable experience (novice OR expert) but variations in terms of age, gender, and other demographic variables
  • 20. Controlled Experiments: 🙂 and ☹ • 🙂 Advantages: • It is SCIENCE, with NUMBERS • Can be applied to identify cause-effect relationships for specific, well-defined variables • ☹ Disadvantages: • Applicable to well-defined problems in which you can clearly define and isolate variables • Hard to apply if you cannot simulate the right conditions in the lab (confounding variables may be too many to be controlled) • The reality of SE has several contextual factors that may make the experiment not realistic • It may be hard and costly to recruit adequate subjects (developers have to develop, managers need to manage...often, students are used as proxies) • Design is time consuming and can get very complicated, very easily (which implies that it is also difficult to analyse the results and have actual control)
  • 21. Hypothesis Testing cf. Sharma, 2015 https://bit.ly/2wTf7VX I will provide enough information for you to understand the principles, but to REALLY understand you will need more resources I will use the word MAGIC when some concepts need to be assumed, or some values are simply produced by common tools Alessio Ferrari, ISTI-CNR, Pisa, Italy alessio.ferrari@isti.cnr.it
  • 22. Hypothesis • A hypothesis is a statistically testable statement derived from a theory (and, in practice, from a research question) • A hypothesis is a predictive statement concerning the impact of some independent variable on some dependent variable • When we do hypothesis testing, our goal is to refute the negation of the theory • H0, the NULL hypothesis: the theory does not apply • Usually expressed as "There is no effect [...]": changes of the independent variable do not affect the dependent variable • It is assumed to be TRUE, unless there is evidence from the data that allows us to REJECT the NULL hypothesis (for this, you need statistical tests) • H1, the ALTERNATIVE hypothesis: the theory predicts... • If H0 is rejected, this is evidence that H1 can be correct
  • 23. Example • H0: The experience of the developer does not affect the average time to find bugs • H0: Average-Time-Novices = Average-Time-Experts • H1: The experience of the developer affects the average time to find bugs • H1: Average-Time-Novices ≠ Average-Time-Experts Suppose I have two groups, novices and experts We speak about a Two-tailed hypothesis to be tested (later you will understand why) Suppose I have a method M or tool T for finding bugs
  • 24. Example • H0: The experience of the developer does not affect the average time to find bugs • H0: Average-Time-Novices = Average-Time-Experts • H1: The experience of the developer affects the average time to find bugs • H1: Average-Time-Novices ≠ Average-Time-Experts Suppose I have two groups, novices and experts We speak about a Two-tailed hypothesis to be tested (later you will understand why) What if I want to know WHO is QUICKER? This formulation does not say anything about that... Suppose I have a method M or tool T for finding bugs
  • 25. Example • But I can use another formulation, with exactly the same experiment: two groups, novices and experts, and I measure the average time to find bugs • H0: The average time to find bugs of novices is less than or equal to that of experts • H0: Average-Time-Novices <= Average-Time-Experts • H1: The average time to find bugs of novices is greater than that of experts • H1: Average-Time-Novices > Average-Time-Experts We speak about a One-tailed hypothesis to be tested
  • 26. Test Statistic • Hypothesis tests normally take all my sample data and convert them into a single value, which is called the test statistic • The test statistic is just a number, but its value can tell me whether the NULL hypothesis can be REJECTED or not • Depending on the test that I have to do, I will have different test statistics Test Data from Experiment Test Statistic time novice 1 time expert 1 time novice 2 time expert 2 e.g., unpaired t-test -0.38 e.g., t-value Compare the means of two independent samples cf. https://bit.ly/39LLOU5
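The unpaired t-test named on the slide can be run in a few lines. This is a minimal sketch, assuming SciPy is available; the timing data (minutes to find a bug for 12 novices and 12 experts) are randomly generated stand-ins, not real measurements.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Invented data: minutes to find a bug, 12 novices vs 12 experts.
novices = rng.normal(loc=30, scale=5, size=12)
experts = rng.normal(loc=22, scale=5, size=12)

# Unpaired (independent-samples) t-test: compares the two group means
# and collapses all the data into a single test statistic plus a p-value.
t_value, p_value = stats.ttest_ind(novices, experts)
print(f"t = {t_value:.2f}, p = {p_value:.4f}")
```

The t-value alone does not say much; as the following slides explain, it must be located on the distribution it would follow if H0 were true, which is what the p-value summarises.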
  • 27. Probability Distribution of the Test Statistic • The assumption is that the NULL hypothesis is TRUE • Given a population in which the NULL hypothesis is true, imagine repeating my experiment multiple times and computing the test statistic • The test statistic will follow a certain distribution: which one? MAGIC, e.g., the Student t-distribution If H0 is TRUE, most of the times I repeat the experiment the test statistic will be around here Number of samples with value x Set of possible values x of the test statistic If H0 is TRUE, it is unlikely that my test statistic will be here (or in the left tail) e.g., a t-value = 0 indicates that my data confirm H0 precisely The distribution is centred on the value that the test statistic has when the data of my experiment confirm exactly the NULL hypothesis
  • 28. Probability Distribution of the Test Statistic • The assumption is that the NULL hypothesis is TRUE • Given a population in which the NULL hypothesis is true, imagine repeating my experiment multiple times and computing the test statistic • The test statistic will follow a certain distribution: which one? MAGIC, e.g., the Student t-distribution If H0 is TRUE, most of the times I repeat the experiment the test statistic will be around here Number of samples with value x Set of possible values x of the test statistic If H0 is TRUE, it is unlikely that my test statistic will be here (or in the left tail) e.g., a t-value = 0 indicates that my data confirm H0 precisely If my test statistic falls around the tails I can REJECT H0 ...and this is my hope! The distribution is centred on the value that the test statistic has when the data of my experiment confirm exactly the NULL hypothesis
  • 29. • Our final goal is to evaluate whether our test statistic value, obtained from our experiment, is so rare that it justifies rejecting the NULL hypothesis for the entire population based on our sample data • What can I do if I do not know the entire distribution of my test statistic? It can be inferred based on the statistics of the sampled data and the hypothesis I want to test... • ...in this context we will assume that some MAGIC occurs and we know the distribution of the test statistic
  • 30. Critical Regions test statistic # of samples I want the test statistic of my experiment to fall on the tails of the distribution Critical Region = acceptable values to reject NULL Critical Region = acceptable values to reject NULL The acceptable values identify a red area in the distribution The area is the risk of rejecting the NULL when TRUE Before the experiment, I set the Critical Regions (Rejection Regions)
  • 31. Level of Significance and Confidence • The significance level indicates the risk of rejecting a NULL hypothesis when it is true; it is denoted by α • 0.01, 0.05, 0.1: these are the typical values for α • (1 − α) is the confidence level; it indicates how confident I want to be about the result of my test • 0.99, 0.95, 0.9: typical values for (1 − α) Alpha sets the standard for how extreme the data MUST BE before we can reject the null hypothesis. The p-value indicates how extreme the data ARE (later).
  • 32. Significance and Confidence test statistic Before any experiment I set the significance level, and the corresponding confidence level Critical Region = acceptable values of test statistic to reject NULL Critical Region = acceptable values of test statistic to reject NULL Confidence Level (1−α) Significance Level α
  • 33. Risk of Rejecting the NULL Hypothesis when TRUE Risk Level Significance α Confidence Level (1−α) Intuitive Meaning Catastrophic 0.001 0.999 More than 100 million Euros (Large loss of life, e.g. nuclear disaster) Critical 0.01 0.99 Less than 100 million Euros (A few lives lost, e.g., accident) Important 0.05 0.95 Less than 100 thousand Euros (No lives lost, some injuries) Moderate 0.10 0.90 Less than 500 Euros (no injuries)
  • 35. Risk of Rejecting the NULL Hypothesis when TRUE Risk Level Significance α Confidence Level (1−α) Intuitive Meaning Catastrophic 0.001 0.999 More than 100 million Euros (Large loss of life, e.g. nuclear disaster) Critical 0.01 0.99 Less than 100 million Euros (A few lives lost, e.g., accident) Important 0.05 0.95 Less than 100 thousand Euros (No lives lost, some injuries) Moderate 0.10 0.90 Less than 500 Euros (no injuries) In software engineering, we normally use these values
  • 36. Type I and Type II Errors REAL Population Fail to Reject Reject NULL is True No Error my theory is FALSE (1 − α) Type I Error (Incorrectly Reject the NULL hypothesis) α NULL is False Type II Error (Incorrectly Fail to Reject the NULL hypothesis) β No Error my theory is TRUE (1 − β) Type I 🤥 my (alternative) hypothesis is wrong, but I support it anyway Type II 🄺 my (alternative) hypothesis is correct, but I rejected it We normally focus on minimising Type I errors
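The claim that α is exactly the Type I error rate can be checked empirically. The sketch below, assuming SciPy is available, simulates many experiments in which H0 is true by construction (both groups come from the same population) and counts how often a t-test wrongly rejects; the sample sizes and seed are arbitrary choices.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
ALPHA = 0.05
trials = 2000
rejections = 0

# Simulate many experiments where H0 is TRUE (identical populations)
# and count how often we incorrectly reject it (a Type I error).
for _ in range(trials):
    a = rng.normal(0, 1, 20)
    b = rng.normal(0, 1, 20)
    _, p = stats.ttest_ind(a, b)
    rejections += p <= ALPHA

print(rejections / trials)  # close to ALPHA = 0.05 by construction
```

Lowering α reduces this false-rejection rate, but, as the next slides show, it simultaneously inflates β, the Type II error rate.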
  • 37. Two-tailed Test Average-Time-Novices = Average-Time-Experts Average-Time-Novices ≠ Average-Time-Experts Acceptance region: confidence level (1−α) = 0.95 Rejection Region: significance level (α/2 = 0.025 or 2.5%) Rejection Region: significance level (α/2 = 0.025 or 2.5%) the value of α = 0.05 is split between the tails • H0: The experience of the developer does not affect the average time to find bugs α is the risk of rejecting NULL when true the value of α/2 is this area
  • 38. One-tailed Test (Left) Average-Time-Novices >= Average-Time-Experts Average-Time-Novices < Average-Time-Experts Acceptance region: confidence level (1−α) = 0.95 Rejection Region: significance level (α = 0.05 or 5%) the value of α = 0.05 is all in one tail • H0: The average time to find bugs of novices is greater than or equal to that of experts the value of α is this area
  • 39. One-tailed Test (Right) Average-Time-Novices <= Average-Time-Experts Average-Time-Novices > Average-Time-Experts Acceptance region: confidence level (1−α) = 0.95 Rejection Region: significance level (α = 0.05 or 5%) the value of α = 0.05 is all in one tail • H0: The average time to find bugs of novices is less than or equal to that of experts the value of α is this area
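The two-tailed and one-tailed formulations above correspond to the `alternative` parameter of SciPy's t-test. A minimal sketch with invented timing data (the group sizes and means are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
novices = rng.normal(30, 5, 12)   # invented timing data (minutes)
experts = rng.normal(22, 5, 12)

# Two-tailed: H0 is Average-Time-Novices = Average-Time-Experts
_, p_two = stats.ttest_ind(novices, experts, alternative="two-sided")
# One-tailed (right): H0 is Average-Time-Novices <= Average-Time-Experts
_, p_right = stats.ttest_ind(novices, experts, alternative="greater")
print(p_two, p_right)
```

When the observed difference points in the hypothesized direction, the one-tailed p-value is half the two-tailed one, mirroring the slides: α sits entirely in one tail instead of being split between two.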
  • 40. p-value Test Data from Experiment Test Statistic time novice 1 time expert 1 time novice 2 time expert 2 e.g., unpaired t-test -0.38 e.g., t-value p-value Another number produced by the test LOW values (0.001) are GOOD, HIGH values (0.3) are BAD
  • 41. p-value and α (one-tailed) p-value is this blue area This point is MY test statistic value, derived from MY data α is the red plus the blue area cf. https://statisticsbyjim.com/hypothesis-testing/hypothesis-tests-significance-levels-alpha-p-values/
  • 42. p-value and α (two-tailed) p-value/2 is this blue area This point on the x axis is my test statistic value, derived from my data α/2 is the red plus the blue area cf. https://statisticsbyjim.com/hypothesis-testing/hypothesis-tests-significance-levels-alpha-p-values/ For two-tailed tests, α and p are the sum of the areas in the two tails; both α and p are shared between the tails cf. https://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests-significance-levels-alpha-and-p-values-in-statistics α/2 is the red plus the blue area p-value/2 is this blue area
  • 43. p-value • 1) The p-value indicates the believability of the devil's advocate case that the NULL hypothesis is TRUE given the sample data • 2) The p-value is the probability of observing a test statistic that is at least as extreme as your test statistic, when you assume that the NULL hypothesis is true • 3) The p-value indicates to which extent the result may be due to a random variation within your data, which makes them different from the actual population • If the p-value is "very low", then the NULL hypothesis is REJECTED, in favour of the alternative hypothesis; otherwise I Fail to REJECT • The meaning of "very low" depends on the selected value of significance α • p-value <= α: I fall in the REJECTION region, H0 is rejected • p-value > α: I fall in the ACCEPTANCE region, I fail to reject H0 Different intuitive ways to understand it
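The decision rule at the bottom of the slide is mechanical once α is fixed, and can be written down directly. A minimal sketch (the helper name `decide` is mine, not from the deck):

```python
ALPHA = 0.05  # significance level, fixed BEFORE running the experiment


def decide(p_value: float, alpha: float = ALPHA) -> str:
    """Slide rule: p-value <= alpha -> reject H0; otherwise fail to reject H0."""
    return "reject H0" if p_value <= alpha else "fail to reject H0"


print(decide(0.001))  # reject H0
print(decide(0.30))   # fail to reject H0
```

Note the asymmetry in the outputs: a high p-value never lets us "accept H0", only fail to reject it, a point the deck returns to in the slide on Type II errors.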
  • 44. Effect Size Test Data from Experiment Test Statistic time novice 1 time expert 1 time novice 2 time expert 2 e.g., unpaired t-test -0.38 e.g., t-value p-value Effect Size A statistically significant effect does not necessarily mean a big effect cf. https://en.wikipedia.org/wiki/Effect_size Effect size measures how big the effect is Effect size computation e.g., Cohen's d e.g., d = 2 cf. https://www.simplypsychology.org/effect-size.html
  • 45. Effect Size • Effect size is a quantitative measure of the magnitude of the treatment effect (e.g., HOW MUCH better is my tool?) • Effect sizes measure either: • the sizes of associations/relationships between variables (HOW MUCH is experience correlated with development speed?) • the sizes of differences between group means (HOW MUCH is the difference between tools A and B?) • There are different ways to measure effect size; the most common are Cohen's d (for differences) and the Pearson r correlation (for associations/relationships), but the choice may also depend on the type of data (categorical vs numeric) and on the type of samples (paired vs unpaired) Check Wikipedia to know the most appropriate for your case: cf. https://en.wikipedia.org/wiki/Effect_size cf. Lakens, 2013 https://doi.org/10.3389/fpsyg.2013.00863
  • 46. Cohen's d • The difference between the means divided by the standard deviation of the population from which the data were sampled. But how can we know the standard deviation of the population? The same MAGIC as before • A d of 1 indicates the two groups differ by 1 standard deviation, a d of 2 indicates they differ by 2 standard deviations, and so on. This is how you interpret the values of d that you obtain https://en.wikipedia.org/wiki/Effect_size
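One common way to estimate the unknown population standard deviation mentioned on the slide is to pool the two sample standard deviations. This is a minimal sketch of that variant of Cohen's d; the three-point samples are invented so the arithmetic can be checked by hand.

```python
import math


def cohens_d(group_a, group_b):
    """Cohen's d with the pooled sample standard deviation as the denominator."""
    na, nb = len(group_a), len(group_b)
    ma = sum(group_a) / na
    mb = sum(group_b) / nb
    # Unbiased sample variances (divide by n - 1).
    va = sum((x - ma) ** 2 for x in group_a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in group_b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd


# Invented example: the group means differ by exactly one pooled SD.
print(round(cohens_d([10, 12, 14], [12, 14, 16]), 2))  # -1.0
```

Consistent with the slide's reading, |d| = 1 here means the two groups differ by one standard deviation; the sign only records which group's mean is larger.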
  • 47. Pearson's r • Indicates the correlation between variables (e.g., number of bugs vs length of the code) • Pearson's r can vary in magnitude from −1 to 1: • −1: perfect negative linear relation • 1: perfect positive linear relation • 0: no linear relation between the two variables • The effect size is low if the value of r varies around 0.1, medium if r varies around 0.3, and large if r varies around 0.5 or more
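The bugs-versus-code-length example on the slide can be sketched directly, assuming SciPy is available; the five (LOC, bugs) pairs below are invented to produce a visibly strong positive correlation.

```python
from scipy import stats

# Invented data: module length (LOC) vs number of bugs found in it.
loc = [100, 200, 300, 400, 500]
bugs = [2, 4, 5, 9, 10]

# Pearson's r measures the strength of the LINEAR relation; the second
# return value is the p-value of the associated significance test.
r, p = stats.pearsonr(loc, bugs)
print(f"r = {r:.2f}, p = {p:.3f}")  # r close to 1: strong positive relation
```

By the slide's rule of thumb, an r this far above 0.5 counts as a large effect, regardless of what the p-value says about significance.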
  • 48. What about Type II Errors? • In all our evaluations, we assumed that the population was confirming the NULL hypothesis, but what if we make a Type II error (we fail to reject the NULL hypothesis, when the actual population rejects it)? • Well, in these cases, we should also establish a value, normally called β, which is the probability of accepting the NULL hypothesis although it is FALSE • If the NULL hypothesis is FALSE, this means that my real population follows the alternative hypothesis
  • 49. Type II Errors Set of possible values x of my test statistic Number of samples with value x (Density) Distribution if H0 were true Distribution if H1 were true
  • 50. Type II Errors Set of possible values x of my test statistic Number of samples with value x (Density) Distribution if H0 were true Distribution if H1 were true α To have a smaller α I have to push the bar to the right...
  • 51. Type II Errors Set of possible values x of my test statistic Number of samples with value x (Density) Distribution if H0 were true Distribution if H1 were true β α α now is really small, but β gets larger! β is the probability of accepting the NULL hypothesis when it is FALSE α is the probability of rejecting the NULL hypothesis when it is TRUE
  • 52. The Hard Truth • Whenever you try to minimise Type I errors, you end up increasing the chance of Type II errors • In practice, we mostly look at REJECTING null hypotheses, so we generally focus on Type I errors, and alpha values • Why do we look at rejecting the NULL? (intuitive explanation) • We are using just one sample to reason about an entire population, so we can REJECT a hypothesis, or FAIL to REJECT, but never accept • Accepting the alternative hypothesis would imply repeating the experiment many more times with different samples taken from my actual population and showing that the test statistic follows the distribution of the alternative hypothesis • Additional intuition: it is easier to disprove "all swans are white" (I need to find only one black swan) than to prove it (I need to check all possible swans)
  • 53. Summary of Concepts • When you perform an experiment you have to keep in mind the following key concepts: • Level of significance α: tells me how much risk I can take, normally set to 0.05, a moderate risk; it is set at the beginning of the experiment • Test statistic: a value depending on the type of test that I make; it serves to understand how rare my sample would be in a population in which the NULL hypothesis is TRUE; it is produced based on my experimental data; the number alone does not say much • p-value: indicates how extreme my data are under the assumption that the NULL hypothesis is TRUE; it is produced based on my experimental data; it needs to be compared with α; if lower than α, I am happy • Effect size: indicates how large the difference between two treatments is, or how strong the correlation between the independent and dependent variable is; depends on the chosen test; tables exist to evaluate the effect size
  • 54. Graphical Summary Test Data from Experiment Test Statistic p-value Effect Size Effect size computation Significance α p-value <= α Effect Size Table Small Effect Large Effect Reject NULL Hypothesis
  • 55. Statistical Tests Alessio Ferrari, ISTI-CNR, Pisa, Italy alessio.ferrari@isti.cnr.it cf. Wohlin et al. https://doi.org/10.1007/978-3-642-29044-2
  • 58. š›¼ is this area Summary from Previous Lecture Distribution of test statistic when samples come from a population where NULL is true NULL Hypothesis This point is my test statistic value, derived from my data Statistical Test p-value is this blue area Centred in the value that test statistic has when the sample conļ¬rms EXACTLY the NULL hypothesis test statistic # of samples Every experiment produces a test statistic (numerical summary of the data) I imagine to perform a set of experiments with a population in which NULL is true
  • 59. Statistical Tests • A statistical test is a means to compute a test statistic, i.e., a single value derived from the data of my experiment • Several tests exist, and each test is appropriate for a specific type of experiment • Two categories of tests exist: • Parametric tests: tests that make some assumptions on the population's distribution, e.g., normality, or homogeneous variances across samples • Nonparametric tests: tests that do not make assumptions on the population's distribution; for most parametric tests, a nonparametric alternative exists • Parametric tests have more statistical power (a concept that we did not explore); roughly, they are more likely to lead to the rejection of the NULL hypothesis when it is FALSE (they lead to lower p-values when NULL is false, and hence reduce Type II errors). You cannot use them for nominal or ordinal data. • Nonparametric tests are more robust, as they are valid for a larger set of cases, since they do not make strict assumptions on the data. You can use them for nominal and ordinal data, or when the assumptions of the parametric tests do not hold • You do not know the population, so, in order to use parametric tests, you first have to test how likely it is that your data follow the assumptions of the test that you are going to make; if they do not, use a nonparametric alternative (cf. https://help.xlstat.com/s/article/which-statistical-test-should-you-use?language=en_US)
  • 60. Normality Test (does not apply to nominal or ordinal data) • Many parametric statistical tests assume that your data are normally distributed (more precisely, that the distribution of the sample mean is normal, so I should consider the population… in general, if you have more than 30 samples you are safe) • To check this, you apply a normality test to your data, for example Shapiro-Wilk (several others exist) • The null hypothesis of this test is H0 = the population is normally distributed • Thus, if the p-value is less than the chosen α level, the NULL hypothesis is rejected and there is evidence that the data tested are NOT normally distributed. Here you want the p-value to be LARGER than α, as your NULL hypothesis is the one that you want to support! Hence, the LARGER the p-value, the BETTER! There are also ways to transform your data if they are not normally distributed, but be careful, because then the interpretation of the results is not straightforward (check whether non-normality is due to the presence of outliers) cf. https://bit.ly/2wJAl9l
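As a minimal sketch of how such a normality check can be run in practice, assuming Python with SciPy is available (the sample data and variable names are purely illustrative, not from the lecture):

```python
import random
from scipy.stats import shapiro  # Shapiro-Wilk normality test

random.seed(42)  # fixed seed only to make the illustration reproducible
# Hypothetical sample, e.g., speeds (bugs/min) measured in an experiment
speeds = [random.gauss(5.0, 1.5) for _ in range(40)]

stat, p_value = shapiro(speeds)  # H0: the population is normally distributed
alpha = 0.05
if p_value > alpha:
    print(f"p = {p_value:.3f} > {alpha}: cannot reject normality, parametric tests are an option")
else:
    print(f"p = {p_value:.3f} <= {alpha}: data look non-normal, prefer a nonparametric alternative")
```

Note that, unlike in the other tests of this lecture, here a LARGE p-value is the desirable outcome.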
  • 61. Parametric and Non-parametric Tests (Remark) • Parametric tests are all those tests that make some assumptions on your data (normality, above all). To use a parametric test you first need to check that the assumptions of the parametric test hold for your data • Non-parametric tests are alternative tests to use when the normality test (or any other assumption check) fails OR when you are dealing with categorical or ordinal data • Sometimes non-parametric tests have assumptions too! (check carefully which are the assumptions of non-parametric tests, e.g., cf. https://www.isixsigma.com/tools-templates/hypothesis-testing/nonparametric-distribution-free-not-assumption-free/ )
  • 65. Selecting the Right Test HOWTO • In the following, a diagram will be shown to guide you in the selection of the right test, assuming that you have only ONE DEPENDENT VARIABLE — as in most of the experiments with a manageable design in SE • The selection of the test depends on: • The type of dependent variable (nominal, ordinal, interval/ratio) • The type of hypothesis (difference or relationship/association) • The number of treatments • The type of design (single group of subjects vs two groups) • The number of independent variables • You will not memorise the diagram, but you should know how to follow it • I will not explain how each test works; you only need to know which one to use • In this lecture a test is a BLACK box that produces two numbers: test statistic and p-value
  • 67. Type of Dependent Variable (I assume to have one dependent variable) • Nominal (labels) → Number of Ind. Variables: Zero (only the dependent variable) → Chi-square Goodness of fit; One or more → Chi-square Test of Independence • Ordinal (ordered labels) → Type of hypothesis: Relationship → Spearman's Rho; Difference → Type of design: Different groups of subjects → Mann-Whitney U test; Single group of subjects → Wilcoxon signed-rank test • Interval/Ratio (numbers) → see next diagram
  • 69. Interval/Ratio (numbers) — the list of tests is NOT exhaustive, cf. https://www.socscistatistics.com • Type of hypothesis: Relationship → Spearman's Rho or Pearson's R • Difference → Number of Ind. Variables: Zero → Standard Deviation known → Z-test; unknown → T-test (single sample) • One or more → Type of design: Different groups of subjects (independent measures) → Treatments: Two → T-test (unpaired) or Mann-Whitney U test; More than two → One-way ANOVA • Single group of subjects (repeated measures) → Treatments: Two → T-test (paired) or Wilcoxon signed-rank test; More than two → One-way ANOVA • Independent Variables: One → One-way ANOVA; More than One → Factorial ANOVA
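The main branches of the decision diagram can be mirrored as a small lookup function, purely as an aid to follow them. This is a simplified sketch (it covers the two-treatment branches only and omits the ANOVA cases); the function name and the string encodings of the branches are illustrative:

```python
def select_test(dv_type, hypothesis=None, design=None, n_ivs=0):
    """Follow the test-selection diagram (one dependent variable).

    dv_type:    'nominal', 'ordinal' or 'interval/ratio'
    hypothesis: 'difference' or 'relationship'
    design:     'independent' (different groups) or 'repeated' (single group)
    n_ivs:      number of independent variables
    """
    if dv_type == "nominal":
        return ("Chi-square Goodness of fit" if n_ivs == 0
                else "Chi-square Test of Independence")
    if dv_type == "ordinal":
        if hypothesis == "relationship":
            return "Spearman's Rho"
        return ("Mann-Whitney U test" if design == "independent"
                else "Wilcoxon signed-rank test")
    if dv_type == "interval/ratio":
        if hypothesis == "relationship":
            return "Pearson's R (or Spearman's Rho)"
        return ("T-test (unpaired) / Mann-Whitney U test" if design == "independent"
                else "T-test (paired) / Wilcoxon signed-rank test")
    raise ValueError("unknown dependent variable type")

print(select_test("ordinal", "difference", "independent"))  # Mann-Whitney U test
```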
  • 70. Chi-square Goodness of fit. Example: IV: none; DV: type of defect (nominal). To which extent does the proportion of defects of a certain type match the expected proportion? (IV = independent variable, DV = dependent variable)
  • 71. Chi-square Test of Independence. Example: IV: code author; DV: defect type (nominal). Is there a link between defect type and code author?
  • 74. Chi-Square Test of Independence (Example) • RQ: Is there a link between defect type and code author? • H0: There is no relationship between defect type and code author • The contingency table crosses type of defect with author (e.g., "null pointer" defects in Homer's code) • Result: Chi-square = 56.32, p < 0.00001 → H0 is REJECTED • Cramér's V should be used to check the Effect Size (check Wikipedia)!
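A stdlib-only sketch of how the chi-square statistic of a contingency table is computed. The counts below are hypothetical (the slide's actual table is not reproduced here); the p-value uses the closed form of the chi-square survival function, which is available when the degrees of freedom are even:

```python
import math

# Hypothetical contingency table: rows = defect types, columns = authors.
# These counts are illustrative only, not the slide's data.
observed = [[30, 5, 5],   # null pointer
            [5, 30, 5],   # off-by-one
            [5, 5, 30]]   # memory leak

rows, cols = len(observed), len(observed[0])
row_tot = [sum(r) for r in observed]
col_tot = [sum(r[j] for r in observed) for j in range(cols)]
grand = sum(row_tot)

# Chi-square statistic: sum over all cells of (O - E)^2 / E,
# where E = row_total * col_total / grand_total
chi2 = sum((observed[i][j] - row_tot[i] * col_tot[j] / grand) ** 2
           / (row_tot[i] * col_tot[j] / grand)
           for i in range(rows) for j in range(cols))

df = (rows - 1) * (cols - 1)  # here df = 4 (even)
# For even df: P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
p_value = math.exp(-chi2 / 2) * sum((chi2 / 2) ** i / math.factorial(i)
                                    for i in range(df // 2))

print(f"chi2 = {chi2:.2f}, p = {p_value:.2e}")  # strong association: reject H0
```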
  • 75. Mann-Whitney U test. Example: IV: level of experience (two levels: novices, experts); DV: degree of project success (ordinal); different groups of subjects (independent measures). Is there a difference in the degree of project success between novices and experts?
  • 76. Wilcoxon signed-rank test. Example: IV: time of the day (morning, afternoon); DV: level of performance (ordinal); single group of subjects (repeated measures). Is there a difference in the performance of the developers between morning and afternoon?
  • 77. Spearman's Rho. Example: IV: motivation; DV: degree of project success (ordinal). Is there a relationship between the motivation of a person and the degree of project success?
  • 78. Pearson's R (or Spearman's Rho): dependent variable is interval/ratio, relationship hypothesis. Example: IV: review duration; DV: number of defects identified. Is there a relationship between review duration and number of defects identified?
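Pearson's R can be computed in a few lines of stdlib Python; the duration/defect values below are hypothetical, chosen only to illustrate a strong positive relationship:

```python
import math

# Hypothetical data: review duration (minutes) vs. defects identified
duration = [1, 2, 3, 4, 5]
defects  = [2, 4, 5, 4, 8]

n = len(duration)
mx = sum(duration) / n
my = sum(defects) / n

# Pearson's r: sum of cross-deviations divided by the product of the
# root sums of squared deviations
cov = sum((x - mx) * (y - my) for x, y in zip(duration, defects))
r = cov / math.sqrt(sum((x - mx) ** 2 for x in duration)
                    * sum((y - my) ** 2 for y in defects))

print(f"r = {r:.3f}")  # close to +1: strong positive linear relationship
```

Spearman's Rho is obtained with the same formula applied to the ranks of the data instead of the raw values.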
  • 79. Z-test / T-test (single sample). Example: IV: none; DV: number of defects per code module. Is there a difference between the number of defects identified in the modules and the expected mean value? (Z-test if the standard deviation is known, single-sample T-test if unknown)
  • 80. T-test (paired). Example: IV: tool; DV: speed in finding bugs; single group of subjects (repeated measures), two treatments. Does the tool improve the users' speed in finding bugs? (Is there a difference in terms of speed WITH and WITHOUT the tool?)
  • 83. Paired T-test (Example) — a.k.a. repeated-measures t-test, paired-samples t-test, matched-pairs t-test, matched-samples t-test • I have a new tool to support bug identification in code review, and I want to understand whether it is effective or not • RQ: Does the tool improve the users' speed of finding bugs? • Independent variable: tool (YES/NO) — two treatments (TOOL/NO-TOOL) • Dependent variable: speed = number of bugs found/minute • H0: the speed with the tool is lower than or equal to the speed without the tool • Design: I have 13 users and ONE code file to review; I will let them first do the bug search WITHOUT the tool (treatment NO-TOOL), and then do the search WITH the tool (treatment TOOL). Then I will compare the speed of each user in the two tasks, to see if they improve. • What's wrong with this design? • Learning bias: if I use the same file for both tasks, the users will have learned where the bugs are, so in the second (TOOL) treatment they will be faster because of learning, not necessarily because of the tool!
  • 86. Paired T-test (Corrected Example) • I have a new tool to support bug identification in code review, and I want to understand whether it is effective or not • RQ: Does the tool improve the users' speed of finding bugs? • Independent variable: tool (YES/NO) — two treatments • Dependent variable: speed = number of bugs found/minute • H0: the speed with the tool is lower than or equal to the speed without the tool • Design: I have 13 users and ONE code file to review; I will let them first do the bug search WITH the tool (treatment TOOL), and THEN do the search WITHOUT the tool (treatment NO-TOOL). Then I will compare the speed of each user in the two tasks. • Now the learning bias would be in favour of the NO-TOOL treatment; if I am able to reject the hypothesis, I can be quite confident that the tool increases the speed • Is ONE code file sufficient?
  • 88. Paired T-test (Corrected Example) • Design: I have 13 users and TWO equivalent code files to review (file X and Y); I will let them first do the bug search WITH the tool on file X (treatment TOOL), and THEN do the search WITHOUT the tool on file Y (treatment NO-TOOL). Then I will compare the speed of each user in the two tasks. • With TWO equivalent code files, I am more confident that the first treatment does not influence the second treatment • But what if the task lasts too long, and the users get tired in the second task? The effect of fatigue needs to be considered, so I need to run the two treatments on two separate days (or allow sufficient time between tasks)
  • 90. Paired T-test • H0: the speed with the tool is lower than or equal to the speed without the tool (one-tailed hypothesis) • Data (bugs/min per user), USER | NO-TOOL | TOOL: u0 3 6; u1 3 6; u2 4 5; u3 3 8; u4 5 3; u5 7 5; u6 2 6; u7 1 5; u8 2 3; u9 8 9; u10 9 11; u11 1 4; u12 7 9 • Result: t = 3.24, p-value = 0.00354 • CURIOSITY: What calculations are made to find the t-value (the test statistic)?
  • 91. Computing the t-test statistic (paired case) • The paired t-test statistic is based on the difference between the two measures: for each user, compute Difference = TOOL − NO-TOOL; let M be the mean of the differences and SS the sum of the squared deviations Dev² = (Difference − M)² • μ is the expected difference if H0 is true (hence no difference, μ = 0) • The formula of the test statistic is: t = (M − μ) / sqrt(SS / ((n − 1) n))
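The computation can be sketched in a few lines of stdlib Python on the data of the example (the t value matches the slide; obtaining the p-value would additionally require the CDF of the t distribution, e.g., via SciPy):

```python
import math

# bugs/min per user, from the paired t-test example
no_tool = [3, 3, 4, 3, 5, 7, 2, 1, 2, 8, 9, 1, 7]
tool    = [6, 6, 5, 8, 3, 5, 6, 5, 3, 9, 11, 4, 9]

n = len(no_tool)
diffs = [t_ - n_ for t_, n_ in zip(tool, no_tool)]  # per-user difference
M = sum(diffs) / n                                  # mean difference
SS = sum((d - M) ** 2 for d in diffs)               # sum of squared deviations
mu = 0                                              # expected difference under H0

t_stat = (M - mu) / math.sqrt(SS / ((n - 1) * n))
print(f"t = {t_stat:.2f}")  # t = 3.24, as in the slide
```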
  • 92. One-way ANOVA (repeated measures). Example: IV: tool (A, B, C, D: more than two treatments); DV: speed in finding bugs; single group of subjects. Is there a difference between tools A, B, C, D in terms of the speed of bug detection achieved by users?
  • 93. T-test (unpaired). Example: IV: tool; DV: speed in finding bugs; different groups of subjects (independent measures), two treatments. Does the tool improve the users' speed in finding bugs? (Is there a difference in terms of speed WITH and WITHOUT the tool?)
  • 94. Unpaired T-test (Example) — a.k.a. independent-measures t-test, unpaired-samples t-test • RQ: Does the tool improve the users' speed of finding bugs? (the problem is the same as for the paired T-test!) • I want to completely get rid of the learning bias and of the fatigue effect, and I have a sufficient number of users (26 instead of 13) • I change the design by having two groups: I randomly allocate the subjects, assigning each subject to one of the treatments (TOOL, NO-TOOL) • I have to assess that there is no difference in the initial competence of the users. To this end, I can do a pre-test, which can allow me to verify that the subjects in the two groups have the same (average) degree of competence in finding bugs. • Otherwise, I can provide sound arguments to justify that ALL the subjects have the same degree of competence (e.g., all participants are students that come from the same course, and are all novices… hence my results are valid solely for this category of users) • Note that the two groups need to be balanced, but you do not need to have the exact same number of people in the two groups (e.g., 25 people can be divided into groups of 13 and 12 subjects)
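The random allocation step can be sketched with the stdlib `random` module (the subject identifiers are illustrative; in a real experiment the seed would not be fixed, it is fixed here only for reproducibility):

```python
import random

# Hypothetical pool of 26 subjects to be split between the two treatments
subjects = [f"u{i}" for i in range(26)]

random.seed(7)            # fixed seed only for a reproducible illustration
random.shuffle(subjects)  # random allocation counters selection bias

half = len(subjects) // 2
group_no_tool = subjects[:half]   # treatment NO-TOOL
group_tool    = subjects[half:]   # treatment TOOL

print(len(group_no_tool), len(group_tool))  # 13 13
```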
  • 96. Unpaired T-test (Example) • Data (bugs/min): group NO-TOOL: u0 3, u1 3, u2 4, u3 3, u4 5, u5 7, u6 2, u7 1, u8 2, u9 8, u10 9, u11 1, u12 7; group TOOL: u13 6, u14 6, u15 5, u16 8, u17 3, u18 5, u19 6, u20 5, u21 3, u22 9, u23 11, u24 4, u25 9 • Result: t-value = -1.89889, p-value = .034833 • Note that the t-value is different with respect to the t-value of the paired case, although the numbers in the tables are THE SAME (but coming from different subjects)! • CURIOSITY: What calculations are made to find this t-value (the test statistic)?
  • 97. Computing the t-test statistic (unpaired case) • For each group, compute the deviations from the group mean, (x − Mx) and (y − My), and their squares; let SSx and SSy be the sums of squared deviations and nx, ny the group sizes • The pooled variance is s²p = (SSx + SSy) / (nx + ny − 2), and the test statistic is: t = (Mx − My) / sqrt(s²p (1/nx + 1/ny))
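A stdlib-only sketch of the same computation on the data of the unpaired example; it also derives Cohen's d (mean difference divided by the pooled standard deviation), which is used for the effect size in the next slide:

```python
import math

no_tool = [3, 3, 4, 3, 5, 7, 2, 1, 2, 8, 9, 1, 7]   # group NO-TOOL
tool    = [6, 6, 5, 8, 3, 5, 6, 5, 3, 9, 11, 4, 9]  # group TOOL

nx, ny = len(no_tool), len(tool)
mx = sum(no_tool) / nx
my = sum(tool) / ny

ss_x = sum((x - mx) ** 2 for x in no_tool)  # sums of squared deviations
ss_y = sum((y - my) ** 2 for y in tool)

pooled_var = (ss_x + ss_y) / (nx + ny - 2)  # pooled variance
t_stat = (mx - my) / math.sqrt(pooled_var * (1 / nx + 1 / ny))

cohens_d = (my - mx) / math.sqrt(pooled_var)  # effect size (pooled std dev)
print(f"t = {t_stat:.5f}, d = {cohens_d:.2f}")  # t = -1.89889
```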
  • 99. What about the Effect Size? • In this case, I have a difference hypothesis, therefore I will use Cohen's d • The formula for Cohen's d is: d = (M1 − M2) / s, where s is the pooled standard deviation of the two groups, s = sqrt(((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2)) • With M_TOOL = 6.15, M_NO-TOOL = 4.23 and s ≈ 2.58: d = (6.15 − 4.23) ⁄ 2.58 ≈ 0.74 • I have a MEDIUM to LARGE effect size (see the table from some slides ago…)
  • 100. One-way ANOVA (independent measures). Example: IV: tool (A, B, C: more than two treatments); DV: speed in finding bugs; different groups of subjects. Which is the difference between tools A, B and C in terms of the speed of bug detection achieved by users? (same question as for repeated measures, but with a different design, with different people per group)
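The F statistic of a one-way ANOVA (independent measures) compares the variability between group means with the variability within groups. A stdlib-only sketch, with hypothetical bug-detection speeds for the three tools:

```python
# Hypothetical bug-detection speeds (bugs/min), one independent group per tool.
# F = MS_between / MS_within; a large F suggests the group means differ.
groups = {
    "A": [5, 6, 7],
    "B": [8, 9, 10],
    "C": [12, 13, 14],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = sum(all_values) / len(all_values)

# Between-groups sum of squares: group sizes times squared mean offsets
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())
# Within-groups sum of squares: deviations from each group's own mean
ss_within = sum((v - sum(g) / len(g)) ** 2
                for g in groups.values() for v in g)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)
f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f"F = {f_stat:.1f}")  # F = 37.0
```

The p-value is then obtained from the F distribution with (df_between, df_within) degrees of freedom, e.g., via a statistical library.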
  • 101. Factorial ANOVA. Example: IVs: tool and experience; DV: bug detection speed. What is the influence of different tools and of experience on the bug detection speed? (I consider not only the tool, but also the experience, as independent variables)
  • 102. Factorial ANOVA (Example) • Let's imagine we have two tools A and B to support bug detection; I want to see which one is better, but I also want to see whether there is some difference between people with different degrees of experience in bug detection • RQ: What is the influence of different tools and of experience on bug detection speed? • Here I want to see which of the two factors (users' experience and type of tool, my independent variables) has more impact on bug detection speed • I have three NULL hypotheses this time: • H0-1: The speed does not depend on the type of adopted tool • H0-2: The speed does not depend on the level of experience of the user • H0-3: The speed does not depend on the interaction between type of adopted tool and level of experience • Design: • User experience has 3 levels: low, medium, high • Type of tool has 2 levels: tool A, tool B (in principle, I should also have NO tool…) • Therefore, I have 3 × 2 = 6 possible situations (i.e., people with low experience using tool A, others using tool B, etc.), and I have to split my subjects into 6 groups
  • 103. Factorial ANOVA (Example) • Data (User, Exp., Tool, Speed): 1 low A 12; 2 low B 4; 3 low A 7; 4 low B 3; 5 medium A 9; 6 medium B 12; 7 medium A 16; 8 medium B 23; 9 high A 23; 10 high B 16; 11 high A 14; 12 high B 12; … • ANOVA Results (Mean Square, F-value, p-value): Exp.: 2664, 147.51, <0.001 — the experience is significant (reject H0-2); Tool: 29.4, 1.62, 0.207 — the tool is not significant (cannot reject H0-1); Exp. × Tool: 83.85, 4.64, 0.014 — the interaction of the two factors is significant (reject H0-3) • F-value is the test statistic for ANOVA
  • 104. How to Select the Right Test • Follow the diagram • Use the wizard at https://www.socscistatistics.com/tests/what_stats_test_wizard.aspx • Use the exhaustive table at https://stats.idre.ucla.edu/other/mult-pkg/whatstat/ which also contains R code and code for other tools • To find non-parametric alternatives: https://help.xlstat.com/s/article/which-statistical-test-should-you-use?language=en_US • Always remember to check that the test assumptions hold • It takes time to acquire confidence with experiment design, so DO NOT BE SCARED
  • 105. How to Select the Right Test • Overview of parametric/non-parametric tests for different designs (Table 10.3, cf. Wohlin et al. https://doi.org/10.1007/978-3-642-29044-2): • One factor, one treatment — non-parametric: Chi-2, Binomial test • One factor, two treatments, completely randomized design — parametric: t-test, F-test; non-parametric: Mann-Whitney, Chi-2 • One factor, two treatments, paired comparison — parametric: Paired t-test; non-parametric: Wilcoxon, Sign test • One factor, more than two treatments — parametric: ANOVA; non-parametric: Kruskal-Wallis, Chi-2 • More than one factor — parametric: ANOVA • Factor = number of independent variables; Treatments = possible values of the independent variables • These are the fundamental tests
  • 106. Threats To Validity for Controlled Experiments
  • 107. Threats to Validity for Controlled Experiments • Construct Validity: to what extent do the measured variables represent what I intended to estimate? Did I operationalise my research questions in the proper manner? Did I use an appropriate design? • Internal Validity: are there any confounding factors that may have influenced the outcome of the experiments? Did I control all the variables? • External Validity: for which values of the controlled variables are the results valid? To what extent can the results be considered general? • (Statistical) Conclusion Validity: to what extent are my findings credible? Have I used the appropriate statistical tests? Did I check the assumptions? Have I sampled the population in the appropriate way? Have I used reliable measurement procedures (low measurement error)?
  • 108. Internal Validity • Factors jeopardising internal validity are, e.g.: • History: did time impact the treatments? (e.g., people participating at different times of the day, or treatments performed on different days) • Maturation: did subjects learn throughout the experiment? Did time during the experiment affect their performance? (e.g., people can get bored or tired) • Experimental mortality: how many subjects left the experiment, and how did this affect the treatment groups? Are the remaining subjects the most motivated? • Researcher bias: in which way could the researcher influence the outcomes? (e.g., the presence of the researcher influences the participants) • Experimental context: to what extent does the experimental context influence the behaviour of subjects? cf. https://web.pdx.edu/~stipakb/download/PA555/ResearchDesign.html
  • 109. External Validity • Factors jeopardising external validity are, e.g.: • Selection bias: are the selected subjects really random, and are they randomly assigned to treatments? • Representativeness: to what extent does the experiment represent a real context? To what extent was I able to properly represent all the realistic combinations of the control variables? To what extent was I able to select representative people? To what extent was I able to select representative situations?
  • 110. Construct Validity • Factors jeopardising construct validity are: • Hypothesis guessing: does knowing the expected result influence the behaviour of the participants? • Bias in experimental design: were my operationalisation and design correct? • Subjective measures: to what extent are the subjective measures reliable?
  • 111. Conclusion Validity • Factors jeopardising conclusion validity are: • Low statistical power: power is the probability of correctly rejecting the NULL hypothesis when it is FALSE; I may fail to reject the NULL hypothesis if I have low statistical power; low statistical power occurs when I have few samples and a small effect size. • Violated assumptions: remember that all tests have assumptions to check • Unreliable measures of the variables: a large amount of measurement error
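To make "low statistical power" concrete, power can be computed directly for simple tests. The sketch below (my own illustration, not from the slides) approximates the power of a two-sided, two-sample z-test with unit variance, using only the standard library; the approximation ignores the negligible rejection probability in the opposite tail.

```python
from math import sqrt
from statistics import NormalDist

def power_two_sample_z(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample z-test (sigma = 1)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    shift = effect_size * sqrt(n_per_group / 2)   # standardized mean shift under H1
    return NormalDist().cdf(shift - z_crit)       # P(reject H0 | H1 is true)

# A medium effect (0.5 SD) with 10 subjects per group is badly underpowered;
# the same effect with 50 subjects per group is much more likely to be detected.
print(round(power_two_sample_z(0.5, 10), 2))  # ~0.20
print(round(power_two_sample_z(0.5, 50), 2))  # ~0.71
```

This illustrates the slide's point: with few samples and a small effect size, the probability of correctly rejecting a false NULL hypothesis can be far below the conventional 0.8 target, so a non-significant result says little.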
  • 112. Preparing, Executing and Reporting Experiments [process diagram] PREPARATION: Theory → Research Question → Hypothesis and Variable Definition → Research Design → Define Measures for Variables → Recruit Participants / Select Artifacts; EXECUTION: Collect Data → Analyse Data; REPORTING: Report Answers → Discuss; Construct, Internal, External and Conclusion Validity are considered throughout
  • 113. Reporting Experiments (1)

Table 11.1 Proposed reporting structure for experiment reports, by Jedlitschka and Pfahl [86]:

Title, authorship
Structured abstract: Summarizes the paper under headings of background or context, objectives or aims, method, results, and conclusions
Motivation: Sets the scope of the work and encourages readers to read the rest of the paper
  Problem statement: Reports what the problem is, where it occurs, and who observes it
  Research objectives: Defines the experiment using the formalized style used in GQM
  Context: Reports environmental factors such as settings and locations
Related work: How the current study relates to other research
Experimental design: Describes the outcome of the experimental planning stage
  Goals, hypotheses and variables: Presents the refined research objectives
  Design: Defines the type of experimental design
  Subjects: Defines the methods used for subject sampling and group allocation
  Objects: Defines what experimental objects were used
  Instrumentation: Defines any guidelines and measurement instruments used
  Data collection procedure: Defines the experimental schedule, timing and data collection procedures
  Analysis procedure: Specifies the mathematical analysis model to be used
  Evaluation of validity: Describes the validity of materials, procedures to ensure participants keep to the experimental method, and methods to ensure the reliability and validity of data collection methods and tools
Execution: Describes how the experimental plan was implemented
  Sample: Description of the sample characteristics
  Preparation: How the experimental groups were formed and trained

cf. Wohlin et al. https://doi.org/10.1007/978-3-642-29044-2
  • 114. Reporting Experiments (2)

  Data collection performed: How data collection took place and any deviations from plan
  Validity procedure: How the validity process was followed and any deviations from plan
Analysis: Summarizes the collected data and describes how it was analyzed
  Descriptive statistics: Presentation of the data using descriptive statistics
  Data set reduction: Describes any reduction of the data set, e.g. removal of outliers
  Hypothesis testing: Describes how the data was evaluated and how the analysis model was validated
Interpretation: Interprets the findings from the Analysis section
  Evaluation of results and implications: Explains the results
  Limitations of study: Discusses threats to validity
  Inferences: How the results generalize given the findings and limitations
  Lessons learnt: Descriptions of what went well and what did not during the course of the experiment
Conclusions and future work: Presents a summary of the study
  Relation to existing evidence: Describes the contribution of the study in the context of earlier experiments
  Impact: Identifies the most important findings
  Limitations: Identifies main limitations of the approach, i.e. circumstances when the expected benefits will not be delivered
  Future work: Suggestions for other experiments to further investigate
Acknowledgements: Identifies any contributors who do not fulfill authorship criteria
References: Lists all cited literature
Appendices: Includes raw data and/or detailed analyses which might help others to …

cf. Wohlin et al. https://doi.org/10.1007/978-3-642-29044-2
  • 116. What about Quasi-Experiments? • In experiments I randomly assign subjects to treatments • In quasi-experiments the assignment is based on some choices of the designer (e.g., the Factorial ANOVA example, in which I have more than one level of experience) • Note that a quasi-experiment does not always allow one to convincingly establish causal relationships (e.g., different degrees of experience may be related to other factors that may have influenced the outcome) • When I use a group of students from a certain class for my research, I am neither performing an experiment nor a quasi-experiment, but a case study, as I am focusing on a specific environment and I selected the subjects opportunistically
  • 117. Summary • Controlled Experiments in SE are a research strategy mostly oriented to testing the impact of some treatment (method, tool) on a certain dependent variable (e.g., speed, bugs, success, happiness) • They are based on hypothesis testing, which implies showing that the experimental data REJECT the NULL hypothesis (i.e., no impact on the dependent variable) • Hypothesis testing uses statistical tests to decide whether the NULL can be REJECTED • The selection of the statistical test depends on the experimental design (look at https://stats.idre.ucla.edu/other/mult-pkg/whatstat/) • When I perform a statistical test, I hope to obtain small p-values and a large effect size • Remember to analyse and report Threats to Validity
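Effect size complements the p-value in the summary's last point: it measures how large the difference is, not just whether it is detectable. As an illustration (my own, not from the slides), Cohen's d can be computed in a few lines on the Tool A and Tool B speeds from the twelve rows of the Factorial ANOVA example:

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) \
                 / (len(a) + len(b) - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

speed_a = [12, 7, 9, 16, 23, 14]  # Tool A speeds from the Factorial ANOVA example
speed_b = [4, 3, 12, 23, 16, 12]  # Tool B speeds from the same example
print(round(cohens_d(speed_a, speed_b), 2))  # 0.28
```

By Cohen's conventional thresholds (0.2 small, 0.5 medium, 0.8 large), 0.28 is a small effect: consistent with the ANOVA table above, where Tool was not significant.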