Complete lecture on controlled experiments in software engineering. It presents practical guidelines for conducting controlled experiments and describes the concepts of dependent, independent, and controlled variables, significance, and p-value. It also explains how to select the appropriate statistical test for a hypothesis, and gives examples of data for different typical tests.
Finally, it discusses threats to validity in controlled experiments and gives indications for reporting.
Find the video lectures here: https://www.youtube.com/playlist?list=PLSKM4VZcJjV-P3fFJYMu2OhlTjEr9Bjl0
Controlled experiments, Hypothesis Testing, Test Selection, Threats to Validity
1. Controlled Experiments in Software Engineering
cf. Pfleeger, 1995 https://doi.org/10.1007/BF02249052
cf. S. Easterbrook http://www.cs.toronto.edu/~sme/CSC2130/04-experiments.pdf
Alessio Ferrari, ISTI-CNR, Pisa, Italy
alessio.ferrari@isti.cnr.it
2. Controlled Experiments
aka Laboratory Experiments
aka Experiment
The ABC of Software Engineering Research: In Vitro Experiment
The GOAL is a Precise Measure of Behaviour
3. Typical Examples
• With software subjects: Tools A and B are automatic tools for testing; I want to compare them (no need to involve people)
• With human subjects: Method M is a manual strategy for finding bugs. How effective is it for experts? How effective is it for novices?
• With human and software subjects:
• Tool T is an interactive tool for testing; I want to see whether it is more appropriate for novices or for experts
• Tools A and B are interactive tools for testing; I want to compare them (I have to involve people)
• Tools A and B are interactive tools for testing; I want to see which one is more appropriate for novices and which one for experts
• Tool A and method M are two approaches for finding bugs; I want to see which one is better
4. Controlled Experiments and Theories
[Diagram: the research cycle. Theory leads, by Deduction, to a Hypothesis, which is put to the Test; Observation leads, by Induction (or Abduction), back to Theory. Controlled experiments follow the DEDUCTIVE APPROACH.]
5. Controlled Experiments: Process
PREPARATION: Theory → Research Question → Hypothesis and Variable Definition → Research Design → Define Measures for Variables
EXECUTION: Recruit Participants / Select Artifacts → Collect Data
REPORTING: Analyse Data → Report Answers → Discuss
Validity concerns accompany the whole process: Internal Validity, External Validity, Construct & Conclusion Validity.
The process normally starts from a Theory and discusses/modifies it in relation to the results.
Typically QUANTITATIVE
9. Controlled Experiments: Elements
Hypothesis → Design: independent variables, dependent variables, controlled variables; the combinations of independent-variable values define the Treatments. This part requires your creativity.
Collect Data: Variable Measurements → Data from Experiment.
Analyse Data: the Test turns the Data from the Experiment into a Test Statistic and a p-value (compared against the Significance α); an Effect size computation yields the Effect Size. This part is mostly automated (but you need to understand it!).
11. Controlled Experiment
• "Experimental investigation of a testable hypothesis, in which conditions are set up to isolate the variables of interest (independent variables) and test how they affect certain measurable outcomes (the dependent variables)"
INDEPENDENT variables, aka FACTORS (e.g., testing tool) → DEPENDENT variables (e.g., number of bugs)
Each combination of values of the independent variables is a TREATMENT: e.g., Treatment 1 (testing tool A), Treatment 2 (testing tool B).
To ISOLATE the independent variables, the other variables need to be CONTROLLED (e.g., variables concerning the code samples on which the test is performed).
cf. S. Easterbrook http://www.cs.toronto.edu/~sme/CSC2130/04-experiments.pdf
13. Controlled Experiments
INDEPENDENT variables (e.g., testing tool) → TREATMENTS: Treatment 1 (e.g., testing tool A), Treatment 2 (e.g., testing tool B) → DEPENDENT variables (e.g., number of bugs)
CONTROLLED variables (e.g., sample length, type of language, complexity): those related to objects should be general, representative, and equivalent for each treatment; those related to human subjects should be homogeneous.
Controlled variables when human subjects are involved may concern experience of developers, age, etc.
14. Definitions
• Hypothesis: the statement I want to test with the experiment
• Derived from a research question (e.g., What is the difference between A and B in terms of bug detection capability?)
• Includes variables that represent constructs of interest (e.g., tools, methods, actors, number of bugs)
• Concerns the measurable impact that a certain variation of some construct can have on other constructs (e.g., Tool A finds more bugs than tool B; Tool A finds fewer or the same number of bugs as tool B)
• I normally have a NULL and an Alternative hypothesis; the one I will test is the NULL hypothesis, but the one I am interested in is the Alternative one (we'll see this later)
15. Definitions
• Independent Variables (INPUT): operationalisation of constructs that I want to isolate, and whose values I want to manipulate (e.g., the tool, the expertise of actors)
• Treatments: combinations of values for the independent variables (tool A, tool B → 1 variable, two treatments; tool A and experts, tool A and novices, tool B and experts, tool B and novices → 2 variables, 4 treatments)
• Dependent Variables (OUTPUT): operationalisation of constructs that I want to measure based on the manipulation of the independent variables (e.g., number of bugs)
• Controlled Variables: attributes* of human subjects or objects that I need to control to mask or prevent their impact on the dependent variables (e.g., I have to test on some code that is sufficiently general, and equivalent for all cases)
* = operationalisation of constructs
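The notion of treatments as combinations of independent-variable values can be made concrete with a short sketch (the factor names and levels are illustrative, not from a real experiment):

```python
from itertools import product

# Each independent variable (factor) has a set of levels.
factors = {
    "tool": ["A", "B"],
    "experience": ["novice", "expert"],
}

# A treatment is one combination of levels, one value per factor.
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]

# 2 factors with 2 levels each -> 4 treatments,
# e.g. {"tool": "A", "experience": "novice"}
```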
18. Example: Software
• Objective: I want to understand which is the better testing tool among two available choices, A and B
• The independent variable is already identified: the tool (one factor)
• Treatments are also straightforward: tool A and tool B (two treatments)
• I am missing the dependent variable: I have to detail what I mean by better. Better in terms of speed? Better in terms of bugs found? Both! Ok, I already have two dependent variables, which I can define as:
• "effectiveness" = number of bugs found / total number of bugs
• "efficiency" = running time / number of bugs found
• Now I have to identify the controlled variables: what can impact effectiveness and efficiency, besides the type of tool? The user? Maybe not, if the tool is fully automatic. The language of the code? Well, I want to focus only on C code. The chosen code? Well yes, but which attributes of the chosen code?
• number of bugs in the code module
• length of the module
• complexity of the module
• domain of the code
• …
I have to create a code sample that has sufficient variation in all of the controlled variables.
If I cannot vary a certain variable, I have to fix it (e.g., C code, domain) and make this choice explicit, as it limits my scope of interest.
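A minimal sketch of the two dependent-variable measures defined above (the numbers are invented for illustration):

```python
def effectiveness(bugs_found, total_bugs):
    """Fraction of the known bugs that the tool detected (higher is better)."""
    return bugs_found / total_bugs

def efficiency(running_time_s, bugs_found):
    """Running time spent per bug found (lower is better)."""
    return running_time_s / bugs_found

# Hypothetical run of tool A on one code module:
eff = effectiveness(bugs_found=8, total_bugs=10)      # 0.8
cost = efficiency(running_time_s=120, bugs_found=8)   # 15.0 seconds per bug
```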
19. Example: Software and Humans
• Objective: I want to see if the experience of the user affects the effectiveness of a certain testing tool
• The dependent variable is already identified: the effectiveness (bugs found / total bugs)
• I have to identify the independent variables: they should concern the experience of the user. How can I measure it? Years of experience in testing? Score from other colleagues? Well, normally it is better to select one independent variable only, otherwise I need too many treatments and I may not find enough participants! Ok, but what should I compare? 1, 2, 3, 4, 5, etc. years? That is also a lot of treatments; will I find enough people? I have to separate years of experience into levels. How do I select the levels? I have to make some assumptions based on existing literature, or I can take a decision that can be defended
• I decide on two levels, and I partition into two treatments (i.e., two homogeneous groups of people):
• from 0 to 1 years: novices
• more than 5 years: experts
• Now I have to identify the controlled variables: what can impact my outcomes besides the experience of users? Well, age, gender, all demographic variables… and of course, the code on which the tool is applied (previous variables)
• I have to make some choices: I should fix a representative code base, use the same one for all subjects, make sure none of them knows the code in advance, and control demographic variables
• Therefore, for each treatment, I have a group with comparable experience (novice OR expert) but variation in terms of age, gender, and other demographic variables
20. Controlled Experiments: 👍 and 👎
• 👍 Advantages:
• It is SCIENCE, with NUMBERS
• Can be applied to identify cause-effect relationships for specific, well-defined variables
• 👎 Disadvantages:
• Applicable only to well-defined problems in which you can clearly define and isolate variables
• Hard to apply if you cannot simulate the right conditions in the lab (confounding variables may be too many to be controlled)
• The reality of SE has several contextual factors that may make the experiment unrealistic
• It may be hard and costly to recruit adequate subjects (developers have to develop, managers need to manage… often, students are used as proxies)
• Design is time consuming and can get very complicated very easily (which implies that it is also difficult to analyse the results and retain actual control)
21. Hypothesis Testing
cf. Sharma, 2015 https://bit.ly/2wTf7VX
I will provide enough information for you to understand the principles, but to REALLY understand you will need more resources.
I will use the word MAGIC when some concepts need to be taken on faith, or when some measures can be obtained directly from common tools.
Alessio Ferrari, ISTI-CNR, Pisa, Italy
alessio.ferrari@isti.cnr.it
22. Hypothesis
• A hypothesis is a statistically testable statement derived from a theory (and, in practice, from a research question)
• A hypothesis is a predictive statement concerning the impact of some independent variable on some dependent variable
• When we do hypothesis testing, our goal is to refute the negation of the theory
• H0, the NULL hypothesis: The theory does not apply
• Usually expressed as "There is no effect […]": changes of the independent variable do not affect the dependent variable
• It is assumed to be TRUE, unless there is evidence from the data that allows us to REJECT the NULL hypothesis (for this, you need statistical tests)
• H1, the ALTERNATIVE hypothesis: The theory predicts…
• If H0 is rejected, this is evidence that H1 can be correct
24. Example
I imagine I have a method M or tool T for finding bugs, and two groups, novices and experts.
• H0: The experience of the developer does not affect the average time to find bugs
• H0: Average-Time-Novices = Average-Time-Experts
• H1: The experience of the developer affects the average time to find bugs
• H1: Average-Time-Novices ≠ Average-Time-Experts
We speak of a Two-tailed hypothesis to be tested (later you will understand why).
What if I want to know WHO is QUICKER? This formulation does not say anything about that…
25. Example
• But I can find another formulation, with exactly the same experiment: two groups, novices and experts, and I measure the average time to find bugs
• H0: The average time to find bugs of novices is less than or equal to that of experts
• H0: Average-Time-Novices <= Average-Time-Experts
• H1: The average time to find bugs of novices is greater than that of experts
• H1: Average-Time-Novices > Average-Time-Experts
We speak of a One-tailed hypothesis to be tested
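Assuming Python with SciPy, the two formulations above map directly onto the `alternative` parameter of `scipy.stats.ttest_ind` (the timing data are invented for illustration):

```python
from scipy import stats

# Hypothetical minutes needed to find a seeded bug (invented data).
novices = [30, 35, 28, 40, 33, 37]
experts = [22, 25, 27, 20, 24, 26]

# Two-tailed: H0 is Average-Time-Novices = Average-Time-Experts.
t_two, p_two = stats.ttest_ind(novices, experts)

# One-tailed (right): H0 is Average-Time-Novices <= Average-Time-Experts,
# H1 is Average-Time-Novices > Average-Time-Experts.
t_one, p_one = stats.ttest_ind(novices, experts, alternative="greater")

# The test statistic is identical in both cases; only the p-value changes:
# for a positive t, the one-tailed p-value is half the two-tailed one.
```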
26. Test Statistic
• Hypothesis tests normally take all my sample data and convert them into a single value, which is called the test statistic
• The test statistic is just a number, but its value can tell me whether the NULL hypothesis can be REJECTED or not
• Depending on the test that I have to perform, I will have different test statistics
Data from Experiment (time novice 1, time expert 1, time novice 2, time expert 2, …) → Test (e.g., unpaired t-test, which compares the means of two independent samples) → Test Statistic (e.g., t-value = -0.38)
cf. https://bit.ly/39LLOU5
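As a sanity check on the MAGIC, the unpaired (Student's) t statistic can be computed by hand from the pooled variance and compared with SciPy's result (same invented timing data as before):

```python
import math
from statistics import mean, variance
from scipy import stats

def unpaired_t(a, b):
    """Student's t statistic for two independent samples (pooled variance)."""
    na, nb = len(a), len(b)
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

novices = [30, 35, 28, 40, 33, 37]
experts = [22, 25, 27, 20, 24, 26]

t_manual = unpaired_t(novices, experts)
t_scipy, _ = stats.ttest_ind(novices, experts, equal_var=True)
# The two values coincide.
```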
28. Probability Distribution of the Test Statistic
• The assumption is that the NULL hypothesis is TRUE
• Given a population in which the NULL hypothesis is true, I imagine repeating my experiment multiple times and computing the test statistic
• The test statistic will follow a certain distribution. Which one? MAGIC, e.g., Student's t-distribution
[Plot: number of samples with value x vs the possible values x of the test statistic. The distribution is centred on the value that the test statistic has when the data of my experiment confirm exactly the NULL hypothesis, e.g., a t-value = 0 indicates that my data confirm H0 precisely. If H0 is TRUE, most of the times I repeat the experiment the test statistic will fall near the centre; it is unlikely to fall in either tail. If my test statistic falls near the tails, I can REJECT H0… and this is my hope!]
29. • Our final goal is to evaluate whether the test statistic value obtained from our experiment is so rare that it justifies rejecting the NULL hypothesis for the entire population, based on our sample data
• How can I proceed if I do not know the entire distribution of my test statistic? It can be inferred from the statistics of the sampled data and the hypothesis I want to test…
• …in this context we will assume that some MAGIC occurs and we know the distribution of the test statistic
30. Critical Regions
[Plot: distribution of the test statistic (# of samples vs test statistic value), with the two tails marked in red.]
I want the test statistic of my experiment to fall in the tails of the distribution. The Critical Regions (Rejection Regions) are the values acceptable for rejecting the NULL; they identify a red area in the distribution. This area is the risk of rejecting the NULL when it is TRUE. Before the experiment, I set the Critical Regions.
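Once the distribution of the test statistic is known, the critical regions can be computed explicitly; a sketch assuming a t-distribution and SciPy (the degrees of freedom are illustrative):

```python
from scipy import stats

alpha = 0.05
df = 10  # degrees of freedom, e.g. n1 + n2 - 2 for an unpaired t-test

# Two-tailed test: alpha is split between the two tails,
# so each critical region starts at the (1 - alpha/2) quantile.
t_crit = stats.t.ppf(1 - alpha / 2, df)

def in_rejection_region(t_value):
    """Reject H0 when the test statistic falls in either tail."""
    return abs(t_value) > t_crit
```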
31. Level of Significance and Confidence
• The significance level indicates the risk of rejecting a NULL hypothesis when it is true; it is denoted by α
• 0.01, 0.05, 0.1: these are the typical values for α
• (1 − α) is the confidence level; it indicates how confident I want to be about the result of my test
• 0.99, 0.95, 0.9: typical values for (1 − α)
Alpha sets the standard for how extreme the data MUST BE before we can reject the null hypothesis. The p-value indicates how extreme the data ARE (later).
32. Significance and Confidence
[Plot: distribution of the test statistic, with a central area of Confidence Level (1 − α) and, in the tails, the Critical Regions: the acceptable values of the test statistic for rejecting the NULL, whose total area is the Significance Level α.]
Before any experiment I set the significance level, and the corresponding confidence level.
35. Risk of Rejecting the NULL Hypothesis when TRUE

Risk Level   | Significance α | Confidence Level (1 − α) | Intuitive Meaning
Catastrophic | 0.001          | 0.999                    | More than 100 million Euros (large loss of life, e.g., nuclear disaster)
Critical     | 0.01           | 0.99                     | Less than 100 million Euros (a few lives lost, e.g., accident)
Important    | 0.05           | 0.95                     | Less than 100 thousand Euros (no lives lost, some injuries)
Moderate     | 0.10           | 0.90                     | Less than 500 Euros (no injuries)

In software engineering, we normally use these values (α = 0.05 or 0.10).
36. Type I and Type II Errors

REAL Population | Fail to Reject                                                     | Reject
NULL is True    | No Error: my theory is FALSE (1 − α)                               | Type I Error: incorrectly reject the NULL hypothesis (α)
NULL is False   | Type II Error: incorrectly fail to reject the NULL hypothesis (β)  | No Error: my theory is TRUE (1 − β)

Type I: my (alternative) hypothesis is wrong, but I support it anyway
Type II: my (alternative) hypothesis is correct, but I rejected it
We normally focus on minimising Type I errors
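The meaning of α and β can be checked by simulation: repeat the experiment many times on populations where H0 is true (respectively false) and count the wrong decisions. A sketch assuming NumPy and SciPy; the effect size and sample size are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, runs = 0.05, 20, 2000

# Type I error rate: both groups are drawn from the SAME population
# (H0 is true), so every rejection is a false positive.
type1 = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue <= alpha
    for _ in range(runs)
) / runs

# Type II error rate: the group means really differ by one standard
# deviation (H0 is false), so every failure to reject is a miss (beta).
type2 = sum(
    stats.ttest_ind(rng.normal(0, 1, n), rng.normal(1, 1, n)).pvalue > alpha
    for _ in range(runs)
) / runs

# type1 hovers around alpha; type2 depends on effect and sample size.
```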
37. Two-tailed Test
• H0: The experience of the developer does not affect the average time to find bugs (Average-Time-Novices = Average-Time-Experts); H1: Average-Time-Novices ≠ Average-Time-Experts
[Plot: a central Acceptance region with confidence level (1 − α) = 0.95, and a Rejection Region in each tail, each of area α/2 = 0.025 (2.5%).]
The value of α = 0.05 is split between the tails; α is the risk of rejecting the NULL when true.
38. One-tailed Test (Left)
• H0: The average time to find bugs of novices is greater than or equal to that of experts (Average-Time-Novices >= Average-Time-Experts); H1: Average-Time-Novices < Average-Time-Experts
[Plot: an Acceptance region with confidence level (1 − α) = 0.95, and a single Rejection Region of area α = 0.05 (5%) in the left tail.]
The value of α = 0.05 is all in one tail.
39. One-tailed Test (Right)
• H0: The average time to find bugs of novices is less than or equal to that of experts (Average-Time-Novices <= Average-Time-Experts); H1: Average-Time-Novices > Average-Time-Experts
[Plot: an Acceptance region with confidence level (1 − α) = 0.95, and a single Rejection Region of area α = 0.05 (5%) in the right tail.]
The value of α = 0.05 is all in one tail.
40. p-value
Data from Experiment (time novice 1, time expert 1, time novice 2, time expert 2, …) → Test (e.g., unpaired t-test) → Test Statistic (e.g., t-value = -0.38) and p-value.
The p-value is another number produced by the test: LOW values (e.g., 0.001) are GOOD, HIGH values (e.g., 0.3) are BAD.
41. p-value and α (one-tailed)
[Plot: the test-statistic distribution; MY test statistic value, derived from MY data, is a point on the x axis. The p-value is the (blue) tail area beyond that point; α is the red area plus the blue area.]
cf. https://statisticsbyjim.com/hypothesis-testing/hypothesis-tests-significance-levels-alpha-p-values/
42. p-value and α (two-tailed)
[Plot: the test-statistic distribution; my test statistic value, derived from my data, is a point on the x axis. In each tail, p-value/2 is the blue area and α/2 is the red area plus the blue area.]
For two-tailed tests, α and p are the sum of the areas in the two tails; both α and p are shared between the tails.
cf. https://statisticsbyjim.com/hypothesis-testing/hypothesis-tests-significance-levels-alpha-p-values/
cf. https://blog.minitab.com/blog/adventures-in-statistics-2/understanding-hypothesis-tests-significance-levels-alpha-and-p-values-in-statistics
43. p-value
Three different intuitive ways to understand it:
• 1) The p-value indicates the believability of the devil's advocate case that the NULL hypothesis is TRUE given the sample data
• 2) The p-value is the probability of observing a test statistic that is at least as extreme as your test statistic, when you assume that the NULL hypothesis is true
• 3) The p-value indicates to what extent the result may be due to random variation within your data, which makes them different from the actual population
• If the p-value is "very low", then the NULL hypothesis is REJECTED in favour of the alternative hypothesis; otherwise I fail to REJECT
• The meaning of "very low" depends on the selected significance level α
• p-value <= α: I fall in the REJECTION region, H0 is rejected
• p-value > α: I fall in the ACCEPTANCE region, I fail to reject H0
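The decision rule in the last two bullets is mechanical; as a tiny sketch:

```python
def decide(p_value, alpha=0.05):
    """Decision rule of a hypothesis test: compare the p-value with alpha."""
    if p_value <= alpha:
        return "reject H0"         # the result is statistically significant
    return "fail to reject H0"     # NOT the same as accepting H0

decide(0.001)  # "reject H0"
decide(0.3)    # "fail to reject H0"
```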
44. Effect Size
Data from Experiment → Test (e.g., unpaired t-test) → Test Statistic (e.g., t-value = -0.38) and p-value; in addition, an Effect size computation (e.g., Cohen's d) produces the Effect Size (e.g., d = 2).
A statistically significant effect does not necessarily mean a big effect: the effect size measures how big the effect is.
cf. https://en.wikipedia.org/wiki/Effect_size
cf. https://www.simplypsychology.org/effect-size.html
45. Effect Size
• Effect size is a quantitative measure of the magnitude of the treatment effect (e.g., HOW MUCH better is my tool?)
• Effect sizes measure either:
• the size of associations/relationships between variables (HOW MUCH is experience correlated with development speed?)
• the size of differences between group means (HOW MUCH is the difference between tool A and B?)
• There are different ways to measure effect size; the most common are Cohen's d (for differences) and the Pearson r correlation (for associations/relationships), but the choice may also depend on the type of data (categorical vs numeric) and on the type of samples (paired vs unpaired)
Check Wikipedia to find the most appropriate for your case:
cf. https://en.wikipedia.org/wiki/Effect_size
cf. Lakens, 2013 https://doi.org/10.3389/fpsyg.2013.00863
46. Cohen's d
• Difference between the means divided by the standard deviation of the population from which the data were sampled. But how can we know the standard deviation of the population? The same MAGIC as before
• A d of 1 indicates the two groups differ by 1 standard deviation, a d of 2 indicates they differ by 2 standard deviations, and so on. This is how you interpret the values of d that you obtain
https://en.wikipedia.org/wiki/Effect_size
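A minimal implementation of Cohen's d, using the pooled sample standard deviation as a stand-in for the unknown population value (the data are the invented novice/expert times used earlier):

```python
import math
from statistics import mean, variance

def cohens_d(a, b):
    """Cohen's d: difference of means in units of the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)

novices = [30, 35, 28, 40, 33, 37]
experts = [22, 25, 27, 20, 24, 26]
d = cohens_d(novices, experts)  # ~2.7: the groups differ by almost 3 SDs
```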
47. Pearson's r
• Indicates the correlation between variables (e.g., number of bugs vs length of the code)
• Pearson's r can vary in magnitude from −1 to 1:
• −1: perfect negative linear relation
• 1: perfect positive linear relation
• 0: no linear relation between the two variables
• The effect size is low if the value of r is around 0.1, medium if r is around 0.3, and large if r is above 0.5
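With SciPy, Pearson's r is a single call; the module lengths and bug counts below are invented to show a strong positive correlation:

```python
from scipy import stats

# Hypothetical measurements: module length (LOC) vs bugs found in it.
length = [100, 250, 300, 420, 500, 610]
bugs = [2, 4, 5, 7, 9, 11]

r, p_value = stats.pearsonr(length, bugs)
# r close to +1: strong positive linear relation (a large effect, > 0.5)
```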
48. What about Type II Errors?
• In all our evaluations, we assumed that the population conforms to the NULL hypothesis; but what if we make a Type II error (we fail to reject the NULL hypothesis when the actual population rejects it)?
• Well, in these cases, we should also establish a value, normally called β, which is the probability of accepting the NULL hypothesis although it is FALSE
• If the NULL hypothesis is FALSE, this means that my real population follows the alternative hypothesis
51. Type II Errors
[Plot: two overlapping distributions of the test statistic over the set of its possible values x: the distribution if H0 were true and the distribution if H1 were true, with the decision threshold between them marking the areas α and β.]
To have a smaller α I have to push the threshold to the right… α then becomes really small, but β gets larger!
β is the probability of accepting the NULL hypothesis when it is FALSE.
α is the probability of rejecting the NULL hypothesis when it is TRUE.
52. The Hard Truth
• Whenever you try to minimise Type I errors, you end up increasing the chance of Type II errors
• In practice, we mostly look at REJECTING null hypotheses, so we generally focus on Type I errors and alpha values
• Why do we look at rejecting the NULL? (intuitive explanation)
• We are using just one sample to reason about an entire population, so we can REJECT a hypothesis, or FAIL to REJECT, but never accept one
• Accepting the alternative hypothesis would imply repeating the experiment many more times with different samples taken from my actual population and showing that the test statistic follows the distribution of the alternative hypothesis
• Additional intuition: it is easier to disprove "all swans are white" (I need to find only one black swan) than to prove it (I would need to check all possible swans)
53. Summary of Concepts
• When you perform an experiment you have to keep in mind the following key concepts:
• Level of significance α: tells me how much risk I can take; normally set to 0.05, a moderate risk; it is set at the beginning of the experiment
• Test statistic: a value depending on the type of test that I perform; it serves to understand how rare my sample is in a population in which the NULL hypothesis is TRUE; it is produced from my experimental data; the number alone does not say much
• p-value: indicates the probability of rejecting the NULL hypothesis when it is actually TRUE; it is produced from my experimental data; it needs to be compared with α; if it is lower than α, I am happy
• Effect size: indicates how large the difference between two treatments is, or how strong the correlation between independent and dependent variable is; it depends on the chosen test; tables exist to evaluate the effect size
58. Summary from Previous Lecture
Every experiment produces a test statistic (a numerical summary of the data). I imagine performing a set of experiments on a population in which the NULL is true: the test statistic then follows a distribution (# of samples vs test statistic value), centred on the value that the test statistic has when the sample confirms EXACTLY the NULL hypothesis.
[Plot: the distribution of the test statistic when samples come from a population where the NULL is true; my test statistic value, derived from my data via the Statistical Test, is a point on the x axis; α is the tail area beyond the critical value, and the p-value is the (blue) area beyond my test statistic.]
59. Statistical Tests
• A statistical test is a means to establish a test statistic, i.e., a single value derived from the data of my experiment
• Several tests exist, and each test is appropriate for a specific type of experiment
• Two categories of tests exist:
• Parametric Tests: tests that make some assumptions about the population's distribution, e.g., normality, or homogeneous variances of the samples
• Nonparametric Tests: tests that do not make assumptions about the population's distribution. For most of the parametric tests, a nonparametric alternative exists
• Parametric tests have more statistical power (a concept that we did not explore); roughly, they are more likely to lead to the rejection of the NULL hypothesis when it is FALSE (they lead to lower p-values when the NULL is false, and hence reduce Type II errors). You cannot use them for nominal or ordinal data.
• Nonparametric tests are more robust, as they are valid for a larger set of cases, since they do not make strict assumptions about the data. You can use them for nominal and ordinal data, or when the assumptions of the parametric tests do not hold
• You do not know the population, so, in order to use parametric tests, you first have to test how likely it is that your data follow the assumptions of the test that you are going to apply; if they do not follow the assumptions, then use a nonparametric alternative (cf. https://help.xlstat.com/s/article/which-statistical-test-should-you-use?language=en_US)
60. Normality Test (does not apply to nominal or ordinal data)
• Many parametric statistical tests assume that your data are normally distributed (strictly, that the sampling distribution of the mean is normal; in general, if you have more than 30 samples you are safe)
• To check this, apply a normality test to your data, for example Shapiro-Wilk (several others exist)
• The null hypothesis of this test is H0 = the population is normally distributed.
• Thus, if the p-value is less than the chosen α level, the NULL hypothesis is rejected and there is evidence that the tested data are NOT normally distributed.
Here you want the p-value to be LARGER than α, as your NULL hypothesis is the one that you want to support! Hence, THE LARGER the p-value, the BETTER!
There are also ways to transform your data if they are not normally distributed, but be careful, because then the interpretation of the results is not straightforward (check whether non-normality is due to the presence of outliers).
cf. https://bit.ly/2wJAl9l
61. Parametric and Non-parametric Tests (Remark)
• Parametric tests are all those tests that make some assumptions on your data (normality, above all). To use a parametric test you first need to check that its assumptions hold for your data
• Non-parametric tests are the alternatives to use when the normality test (or any other assumption) fails, OR when you are dealing with categorical or ordinal data
• Sometimes non-parametric tests have assumptions too! (check carefully which are the assumptions of non-parametric tests, e.g., cf. https://www.isixsigma.com/tools-templates/hypothesis-testing/nonparametric-distribution-free-not-assumption-free/ )
65. Selecting the right test HOWTO
• In the following, a diagram will be shown to guide you in the selection of the right test, assuming that you have only ONE DEPENDENT VARIABLE, as in most experiments with a manageable design in SE
• The selection of the test depends on:
• The type of dependent variable (nominal, ordinal, interval/ratio)
• The type of hypothesis (difference or relationship/association)
• The number of treatments
• The type of design (single group of subjects vs two groups)
• The number of independent variables
You will not memorise the diagram, but you should know how to follow it.
I will not explain how each test works; you only need to know which one to use.
In this lecture a test is a BLACK box that produces two numbers: test statistic and p-value.
66. Type of Dependent Variable
(decision diagram, assuming ONE dependent variable)
• Nominal DV (labels) → number of independent variables:
  • Zero (only the dependent variable) → Chi-square Goodness of Fit
  • One or more → Chi-square Test of Independence
• Ordinal DV (ordered labels) → type of hypothesis:
  • Relationship → Spearman's Rho
  • Difference → type of design:
    • Single group of subjects → Wilcoxon signed-rank test
    • Different groups of subjects → Mann-Whitney U test
• Interval/Ratio DV (numbers) → see the next diagram
68. Dependent Variable is Interval/Ratio (numbers)
(decision diagram; the list of tests is NOT exhaustive; cf. https://www.socscistatistics.com)
• Relationship hypothesis → Spearman's Rho / Pearson's R
• Difference hypothesis → number of independent variables:
  • Zero → population standard deviation:
    • known → Z-test
    • unknown → T-test (single sample)
  • One or more → type of design:
    • Single group of subjects (repeated measures) → treatments:
      • Two → T-test (paired) / Wilcoxon signed-rank test
      • More than two → One-way ANOVA
    • Different groups of subjects (independent measures) → treatments:
      • Two → T-test (unpaired) / Mann-Whitney U test
      • More than two → number of independent variables:
        • One → One-way ANOVA
        • More than one → Factorial ANOVA
70. Type of Dependent Variable (Example)
IV = independent variable, DV = dependent variable
e.g., IV: none; DV: type of defect (nominal). To what extent does the proportion of defects of a certain type match the expected proportion? → Chi-square Goodness of Fit
71. Type of Dependent Variable (Example)
e.g., IV: code author; DV: defect type (nominal). Is there a link between defect type and code author? → Chi-square Test of Independence
74. Chi-Square Test of Independence (Example)
• RQ: Is there a link between defect type and code author?
• H0: There is no relationship between defect type and code author
• Data: a contingency table of type of defect vs author (e.g., one cell counts the "null pointer" defects in Homer's code)
• Result: Chi-square = 56.32, p < 0.00001 → H0 is REJECTED
• Cramér's V should be used to check the Effect Size (check Wikipedia)!
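Both the chi-square statistic and Cramér's V can be computed directly from the contingency table. A minimal pure-Python sketch follows; the 2×2 author × defect-type table is hypothetical (the slide's own table is not reproduced here):

```python
from math import sqrt

def chi2_independence(table):
    """Chi-square statistic for a contingency table (rows x columns)."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / n   # expected count under H0
            chi2 += (obs - exp) ** 2 / exp
    return chi2

def cramers_v(table):
    """Cramer's V effect size: sqrt(chi2 / (n * (min(r, c) - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return sqrt(chi2_independence(table) / (n * k))

# Hypothetical table: rows = two authors, columns = two defect types
table = [[20, 5], [5, 20]]
print(round(chi2_independence(table), 2), round(cramers_v(table), 2))  # 18.0 0.6
```

With real data you would then look up the p-value for the chi-square value with (rows − 1)(columns − 1) degrees of freedom.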
75. Type of Dependent Variable (Example)
e.g., IV: level of experience (two levels: novices, experts); DV: degree of project success (ordinal). Is there a difference in the degree of project success between novices and experts? Different groups of subjects (independent measures) → Mann-Whitney U test
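Used as a black box, the Mann-Whitney U test only needs the ranks of the pooled observations. The following pure-Python sketch computes the U statistic for hypothetical degree-of-success scores with no ties (real implementations also handle ties via average ranks):

```python
def mann_whitney_u(x, y):
    """Mann-Whitney U statistic for two independent samples (no ties)."""
    pooled = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(pooled)}  # rank 1 = smallest
    r1 = sum(rank[v] for v in x)                     # rank sum of group x
    u1 = r1 - len(x) * (len(x) + 1) / 2
    u2 = len(x) * len(y) - u1
    return min(u1, u2)

# Hypothetical degree-of-success scores: novices vs experts
novices = [12, 14, 17, 21]
experts = [15, 19, 23, 26]
print(mann_whitney_u(novices, experts))  # 3.0
```

The smaller U is then compared against critical-value tables (or a normal approximation for larger samples) to obtain the p-value.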
76. Type of Dependent Variable (Example)
e.g., IV: time of the day (morning, afternoon); DV: level of performance (ordinal). Is there a difference in the performance of the developers between morning and afternoon? Single group of subjects (repeated measures) → Wilcoxon signed-rank test
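The Wilcoxon signed-rank statistic works on the per-subject differences between the two conditions. A minimal pure-Python sketch with hypothetical morning/afternoon scores (chosen so there are no zero differences or ties, which real implementations must handle):

```python
def wilcoxon_w(before, after):
    """Wilcoxon signed-rank W: the smaller of the positive and negative
    rank sums of the per-subject differences (no zeros/ties handled)."""
    diffs = [a - b for b, a in zip(before, after)]
    ranked = sorted(diffs, key=abs)                  # rank by |difference|
    w_pos = sum(i + 1 for i, d in enumerate(ranked) if d > 0)
    w_neg = sum(i + 1 for i, d in enumerate(ranked) if d < 0)
    return min(w_pos, w_neg)

# Hypothetical performance scores of the same subjects, morning vs afternoon
morning   = [10, 14, 9, 16, 12]
afternoon = [11, 12, 12, 20, 17]
print(wilcoxon_w(morning, afternoon))  # 2
```

W is then compared against critical-value tables for the given number of non-zero differences.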
77. Type of Dependent Variable (Example)
e.g., IV: motivation; DV: degree of project success (ordinal). Is there a relationship between the motivation of a person and the degree of project success? → Spearman's Rho
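Spearman's rho only uses the ranks of the two variables. A minimal pure-Python sketch using the rank-difference formula, with hypothetical motivation/success scores and no ties (with ties the general Pearson-on-ranks formula is used instead):

```python
def spearman_rho(x, y):
    """Spearman's rho via rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)),
    valid when there are no ties in either variable."""
    n = len(x)
    rank_x = {v: i + 1 for i, v in enumerate(sorted(x))}
    rank_y = {v: i + 1 for i, v in enumerate(sorted(y))}
    d2 = sum((rank_x[a] - rank_y[b]) ** 2 for a, b in zip(x, y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical motivation and project-success scores for five subjects
motivation = [3, 7, 5, 9, 1]
success    = [4, 8, 10, 6, 2]
print(spearman_rho(motivation, success))  # 0.6
```

A rho near +1 or −1 indicates a strong monotonic association; a significance test on rho then gives the p-value.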
78. Dependent Variable is Interval/Ratio (Example)
e.g., IV: review duration; DV: number of defects identified. Is there a relationship between review duration and number of defects identified? → Pearson's R (or Spearman's Rho)
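Pearson's R measures the strength of a linear relationship. A minimal pure-Python sketch with hypothetical review durations and defect counts:

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of the spreads."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ssx * ssy)

# Hypothetical review durations (minutes) and defects identified
duration = [10, 20, 30, 40, 50]
defects  = [1, 3, 2, 5, 4]
print(round(pearson_r(duration, defects), 2))  # 0.8
```

R ranges from −1 to +1; its significance is tested with a t distribution with n − 2 degrees of freedom.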
79. Dependent Variable is Interval/Ratio (Example)
e.g., IV: none; DV: number of defects per code module. Is there a difference between the number of defects identified in the modules and the expected mean value? → T-test (single sample), or Z-test if the population standard deviation is known
80. Dependent Variable is Interval/Ratio (Example)
e.g., IV: tool (WITH/WITHOUT); DV: speed in finding bugs. Does the tool improve the users' speed in finding bugs, i.e., is there a difference in speed WITH and WITHOUT the tool? Single group of subjects, two treatments → T-test (paired)
83. Paired T-test (Example)
a.k.a. repeated-measures t-test, paired samples t-test, matched pairs t-test, matched samples t-test
• I have a new tool to support bug identification in code review, and I want to understand whether it is effective or not
• RQ: Does the tool improve the users' speed of finding bugs?
• Independent Variable: tool (YES/NO), two treatments (TOOL/NO-TOOL)
• Dependent Variable: speed = number of bugs found/minute
• H0: the speed with the tool is lower than or equal to the speed without the tool (i.e., the tool does not improve speed)
• Design: I have 13 users and ONE code file to review. I will let them first do the bug search WITHOUT the tool (treatment NO-TOOL), and then do the search WITH the tool (treatment TOOL). Then, I will compare the speed of each user in the two tasks, to see if they improve.
What's wrong with this design?
Learning Bias: if I use the same file in both tasks, by the second task the subjects will have learned where the bugs are, so the second treatment (TOOL) will look faster regardless of the tool!
86. Paired T-test (Corrected Example)
• I have a new tool to support bug identification in code review, and I want to understand whether it is effective or not
• RQ: Does the tool improve the users' speed of finding bugs?
• Independent Variable: tool (YES/NO), two treatments
• Dependent Variable: speed = number of bugs found/minute
• H0: the speed with the tool is lower than or equal to the speed without the tool
• Design: I have 13 users and ONE code file to review. I will let them first do the bug search WITH the tool (treatment TOOL), and THEN do the search WITHOUT the tool (treatment NO-TOOL). Then, I will compare the speed of each subject in the two tasks.
Now the learning bias works in favour of the NO-TOOL treatment; if I am still able to reject the hypothesis, I can be quite confident that the tool increases the speed.
Is ONE code file sufficient?
88. Paired T-test (Corrected Example)
• Design: I have 13 users and TWO equivalent code files to review (files X and Y). I will let them first do the bug search WITH the tool on file X (treatment TOOL), and THEN do the search WITHOUT the tool on file Y (treatment NO-TOOL). Then, I will compare the speed of each subject in the two tasks.
• With TWO equivalent code files, I am more confident that the first treatment does not influence the second treatment.
But what if the task lasts too long, and the subjects get tired in the second task? The effect of fatigue needs to be considered, so I need to run the two treatments on two separate days (or allow sufficient time between tasks).
90. Paired T-test
• H0: the speed with the tool is lower than or equal to the speed without the tool (one-tailed hypothesis)

Bugs/min by user:

USER  NO-TOOL  TOOL
u0      3       6
u1      3       6
u2      4       5
u3      3       8
u4      5       3
u5      7       5
u6      2       6
u7      1       5
u8      2       3
u9      8       9
u10     9      11
u11     1       4
u12     7       9

t = 3.24, p-value = 0.00354

CURIOSITY: What calculations are made to find the t-value (the test statistic)?
91. Computing the t-test statistic (paired case)
The paired t-test statistic is based on the per-subject difference between the two measures (TOOL − NO-TOOL). Compute each difference, the mean of the differences M, each deviation (difference − M), and its square; SS is the sum of the squared deviations. μ is the expected difference if H0 is true (hence no difference, μ = 0). With N subjects, the test statistic is:

t = (M − μ) / √( SS / (N(N − 1)) )
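The formula above can be checked with a short pure-Python sketch (no statistics library needed); the data are the bugs/min values from the table, and the result matches the t = 3.24 reported on the slide:

```python
from math import sqrt

def paired_t(before, after, mu=0.0):
    """Paired t statistic: t = (M - mu) / sqrt(SS / (N(N-1))),
    where M is the mean of the per-subject differences and SS the
    sum of squared deviations of the differences from M."""
    n = len(before)
    diffs = [a - b for b, a in zip(before, after)]
    m = sum(diffs) / n                        # mean difference M
    ss = sum((d - m) ** 2 for d in diffs)     # sum of squared deviations
    return (m - mu) / sqrt(ss / (n * (n - 1)))

# Bugs/min per user, from the slide's table
no_tool = [3, 3, 4, 3, 5, 7, 2, 1, 2, 8, 9, 1, 7]
tool    = [6, 6, 5, 8, 3, 5, 6, 5, 3, 9, 11, 4, 9]

t = paired_t(no_tool, tool)
print(round(t, 2))  # 3.24
```

The p-value is then obtained from the t distribution with N − 1 = 12 degrees of freedom.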
92. Dependent Variable is Interval/Ratio (Example)
e.g., IV: tool (A, B, C, D); DV: speed in finding bugs. What is the difference between tools A, B, C and D in terms of the speed of bug detection achieved by users? Single group of subjects (repeated measures), more than two treatments → One-way ANOVA
93. Dependent Variable is Interval/Ratio (Example)
e.g., IV: tool (WITH/WITHOUT); DV: speed in finding bugs. Does the tool improve the users' speed in finding bugs, i.e., is there a difference in speed WITH and WITHOUT the tool? Different groups of subjects (independent measures), two treatments → T-test (unpaired)
94. Unpaired T-test (Example)
a.k.a. independent-measures t-test, unpaired samples t-test
• RQ: Does the tool improve the users' speed of finding bugs? (the research question and hypothesis are the same as for the paired T-test; only the design changes)
• I want to completely get rid of the learning bias and of the fatigue effect, and I have a sufficient number of users (26 instead of 13)
• I change the design by having two groups: I randomly allocate the subjects and assign each group to one of the treatments (TOOL, NO-TOOL)
• I have to assess that there is no difference in the initial competence of the users. To this end, I can run a pre-test, which allows me to check that the subjects in the two groups have the same (average) degree of competence in finding bugs.
• Otherwise, I can provide sound arguments to justify that ALL the subjects have the same degree of competence (e.g., all the subjects are students from the same course and all novices; hence my results are valid solely for this category of users)
• Note that the two groups need to be balanced, but you do not need exactly the same number of people in each group (e.g., 25 people can be divided into groups of 13 and 12 subjects)
96. Unpaired T-test (Example)

USER  NO-TOOL     USER  TOOL
u0      3         u13     6
u1      3         u14     6
u2      4         u15     5
u3      3         u16     8
u4      5         u17     3
u5      7         u18     5
u6      2         u19     6
u7      1         u20     5
u8      2         u21     3
u9      8         u22     9
u10     9         u23    11
u11     1         u24     4
u12     7         u25     9

t-value = -1.89889, p-value = .034833

Note that the t-value is different from the t-value of the paired case, although the numbers in the tables are THE SAME (but coming from different subjects)!

CURIOSITY: What calculations are made to find this t-value (the test statistic)?
97. Computing the t-test statistic (unpaired case)
For each group, compute the mean (Mx for NO-TOOL, My for TOOL), each deviation from the group mean (x − Mx, y − My), and its square; SSx and SSy are the sums of the squared deviations. With nx and ny subjects per group, the pooled variance is s² = (SSx + SSy) / (nx + ny − 2), and the test statistic is:

t = (Mx − My) / √( s² (1/nx + 1/ny) )
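As for the paired case, the formula can be checked with a short pure-Python sketch on the slide's data; it reproduces the reported t-value of about −1.899:

```python
from math import sqrt

def unpaired_t(x, y):
    """Unpaired (independent samples) t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ssx = sum((v - mx) ** 2 for v in x)       # sum of squared deviations
    ssy = sum((v - my) ** 2 for v in y)
    s2 = (ssx + ssy) / (nx + ny - 2)          # pooled variance
    return (mx - my) / sqrt(s2 * (1 / nx + 1 / ny))

# Same numbers as the paired example, but from different subjects
no_tool = [3, 3, 4, 3, 5, 7, 2, 1, 2, 8, 9, 1, 7]
tool    = [6, 6, 5, 8, 3, 5, 6, 5, 3, 9, 11, 4, 9]

t = unpaired_t(no_tool, tool)
print(round(t, 3))  # -1.899
```

The p-value comes from the t distribution with nx + ny − 2 = 24 degrees of freedom; note how the same numbers give a weaker result than in the paired design, which removes between-subject variability.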
99. What about the Effect Size?
• In this case, my hypothesis is about a difference, therefore I will use Cohen's d, computed from the two group means (NO-TOOL: 4.23, TOOL: 6.15) and the spread of the two samples:
d = (6.15 - 4.23) / 6.701138 = 0.286519
I have a SMALL to MEDIUM effect size (see the table from some slides ago)
100. Dependent Variable is Interval/Ratio (Example)
e.g., IV: tool (A, B, C); DV: speed of bug detection. What is the difference between tools A, B and C in terms of the speed of bug detection achieved by users? (same question as for repeated measures, but with a different design involving different people) Different groups of subjects (independent measures), more than two treatments, one independent variable → One-way ANOVA
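The F statistic behind one-way ANOVA is the ratio of between-group to within-group variability. A minimal pure-Python sketch with hypothetical bugs/min values for three tools, each used by a different group of users:

```python
def one_way_anova_f(groups):
    """One-way ANOVA F statistic: between-group mean square
    divided by within-group mean square."""
    k = len(groups)                                   # number of groups
    n = sum(len(g) for g in groups)                   # total observations
    grand = sum(sum(g) for g in groups) / n           # grand mean
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within

# Hypothetical speeds (bugs/min) for tools A, B, C, different users per tool
f = one_way_anova_f([[1, 2, 3], [2, 3, 4], [6, 7, 8]])
print(f)  # 21.0
```

The p-value is obtained from the F distribution with (k − 1, n − k) degrees of freedom; a large F means the group means differ more than the within-group noise would explain.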
101. Dependent Variable is Interval/Ratio (Example)
e.g., IV: tool AND experience; DV: bug detection speed. What is the influence of different tools and of experience on bug detection speed? (I consider not only the tool, but also the experience as an independent variable) More than one independent variable → Factorial ANOVA
102. Factorial ANOVA (Example)
• Let's imagine we have two tools A and B to support bug detection; I want to see which one is better, but I also want to see whether there is some difference between people with different degrees of experience in bug detection
• RQ: What is the influence of different tools and of experience on bug detection speed?
• Here I want to see which of the two factors (users' experience and type of tool, my independent variables) has more impact on bug detection speed
• I have three NULL hypotheses this time:
• H0-1: The speed does not depend on the type of adopted tool
• H0-2: The speed does not depend on the level of experience of the user
• H0-3: The speed does not depend on the interaction between the type of adopted tool and the level of experience
• Design:
• User experience has 3 levels: low, medium, high
• Type of tool has 2 levels: tool A, tool B (in principle, I should also have NO tool)
• Therefore, I have 3 x 2 = 6 possible combinations (i.e., people with low experience using tool A, people with low experience using tool B, etc.), and I have to split my subjects into 6 groups
103. Factorial ANOVA (Example)

Data (excerpt):

User  Exp.    Tool  Speed
1     low     A     12
2     low     B     4
3     low     A     7
4     low     B     3
5     medium  A     9
6     medium  B     12
7     medium  A     16
8     medium  B     23
9     high    A     23
10    high    B     16
11    high    A     14
12    high    B     12
...   ...     ...   ...

ANOVA Results (the F-value is the test statistic for ANOVA):

Factor       Mean Square  F-value  p-value
Exp.         2664         147.51   <0.001
Tool         29.4         1.62     0.207
Exp. x Tool  83.85        4.64     0.014

The experience is significant (reject H0-2); the tool is not significant (cannot reject H0-1); the interaction of the two factors is significant (reject H0-3).
104. How to Select the Right Test
• Follow the diagram
• Use the wizard at https://www.socscistatistics.com/tests/what_stats_test_wizard.aspx
• Use the exhaustive table at https://stats.idre.ucla.edu/other/mult-pkg/whatstat/ which also contains R code and code for other tools
• To find non-parametric alternatives: https://help.xlstat.com/s/article/which-statistical-test-should-you-use?language=en_US
• Always remember to check that the test assumptions hold
• It takes time to acquire confidence with experiment design, so DO NOT BE SCARED
105. How To Select the Right Test
Table 10.3 from Wohlin et al. (https://doi.org/10.1007/978-3-642-29044-2): overview of parametric/non-parametric tests for different designs

Design                                                   | Parametric     | Non-parametric
One factor, one treatment                                | -              | Chi-2, Binomial test
One factor, two treatments, completely randomized design | t-test, F-test | Mann-Whitney, Chi-2
One factor, two treatments, paired comparison            | Paired t-test  | Wilcoxon, Sign test
One factor, more than two treatments                     | ANOVA          | Kruskal-Wallis, Chi-2
More than one factor                                     | ANOVA          | -
(the more-than-one-factor case is not described in that book; refer instead to, for example, Marascuilo and Serlin [119] and Montgomery [125])

Factor = number of independent variables
Treatments = possible values of the independent variables
These are the fundamental tests.
107. Threats to Validity for Controlled Experiments
• Construct Validity: to what extent do the measured variables represent what I intended to measure? Did I operationalise my research questions in the proper manner? Did I use an appropriate design?
• Internal Validity: are there any confounding factors that may have influenced the outcome of the experiments? Did I control all the variables?
• External Validity: for which values of the controlled variables are the results valid? To what extent can the results be considered general?
• (Statistical) Conclusion Validity: to what extent are my findings credible? Have I used the appropriate statistical tests? Did I check their assumptions? Have I sampled the population in the appropriate way? Have I used reliable measurement procedures (low measurement error)?
108. Internal Validity
• Factors jeopardising internal validity are, e.g.:
• History: did time impact the treatments? (e.g., people participating at different times of the day, or treatments performed on different days)
• Maturation: did subjects learn throughout the experiment? Did the time spent in the experiment affect performance? (e.g., people can get bored or tired)
• Experimental mortality: how many subjects left the experiment, and how did this affect the treatment groups? Are the remaining subjects the most motivated ones?
• Researcher bias: in which ways could the researcher influence the outcomes? (e.g., the presence of the researcher influences the participants)
• Experimental context: to what extent does the experimental context influence the behaviour of the subjects?
cf. https://web.pdx.edu/~stipakb/download/PA555/ResearchDesign.html
109. External Validity
• Factors jeopardising external validity are, e.g.:
• Selection bias: are the selected subjects really random, and are they randomly assigned to treatments?
• Representativeness: to what extent does the experiment represent a real context? To what extent was I able to properly represent all the realistic combinations of the control variables? To what extent was I able to select representative people and representative situations?
110. Construct Validity
• Factors jeopardising construct validity are:
• Hypothesis guessing: does knowing the expected result
influence the behaviour of the participants?
• Bias in experimental design: were my operationalisation and
design correct?
• Subjective measures: to what extent are the subjective
measures reliable?
111. Conclusion Validity
• Factors jeopardising conclusion validity are:
• Low statistical power: power is the probability of
correctly rejecting the NULL hypothesis when it is FALSE; I
may fail to reject the NULL hypothesis if I have low
statistical power; low statistical power occurs when I have
few samples and a small effect size.
• Violated assumptions: remember that all tests have
assumptions to check
• Unreliable measures of the variables: a large amount of
measurement error
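To get a feel for how sample size and effect size drive statistical power, here is a rough normal-approximation of the power of a two-sided, two-sample t-test, using only the Python standard library (a sketch for building intuition, not a replacement for a proper power analysis tool):

```python
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-sample t-test via the
    normal approximation: the probability of rejecting the NULL
    hypothesis when the true standardized effect (Cohen's d) is
    `effect_size` and each group has `n_per_group` subjects."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)             # e.g. 1.96 for alpha=0.05
    ncp = effect_size * (n_per_group / 2) ** 0.5  # non-centrality parameter
    return (1 - z.cdf(z_crit - ncp)) + z.cdf(-z_crit - ncp)
```

With a medium effect (d = 0.5), 10 subjects per group yield power of only about 0.2, while 64 per group reach roughly the conventional 0.8 threshold.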
112. Preparing, Executing and Reporting Experiments
[Flow diagram] PREPARATION: Theory → Research Question → Hypothesis and
Variable Definition → Research Design → Define Measures for Variables →
Recruit Participants / Select Artifacts. EXECUTION: Collect Data →
Analyse Data. REPORTING: Report Answers → Discuss. (Validity concerns are
annotated across the phases: Internal, External, Construct and
Conclusion Validity.)
113. Reporting Experiments (1)
Table 11.1 Proposed reporting structure for experiment reports, by Jedlitschka and Pfahl [86]
Sections/subsections: Contents
Title, authorship
Structured abstract: Summarizes the paper under headings of background or
context, objectives or aims, method, results, and conclusions
Motivation: Sets the scope of the work and encourages readers to read the
rest of the paper
Problem statement: Reports what the problem is, where it occurs, and who
observes it
Research objectives: Defines the experiment using the formalized style used
in GQM
Context: Reports environmental factors such as settings and locations
Related work: How the current study relates to other research
Experimental design: Describes the outcome of the experimental planning stage
Goals, hypotheses and variables: Presents the refined research objectives
Design: Defines the type of experimental design
Subjects: Defines the methods used for subject sampling and group allocation
Objects: Defines what experimental objects were used
Instrumentation: Defines any guidelines and measurement instruments used
Data collection procedure: Defines the experimental schedule, timing and
data collection procedures
Analysis procedure: Specifies the mathematical analysis model to be used
Evaluation of validity: Describes the validity of materials, procedures to
ensure participants keep to the experimental method, and methods to ensure
the reliability and validity of data collection methods and tools
Execution: Describes how the experimental plan was implemented
Sample: Description of the sample characteristics
Preparation: How the experimental groups were formed and trained
cf. Wohlin et al. https://doi.org/10.1007/978-3-642-29044-2
114. Reporting Experiments (2)
Data collection performed: How data collection took place and any
deviations from the plan
Validity procedure: How the validity process was followed and any
deviations from the plan
Analysis: Summarizes the collected data and describes how it was analyzed
Descriptive statistics: Presentation of the data using descriptive statistics
Data set reduction: Describes any reduction of the data set, e.g., removal
of outliers
Hypothesis testing: Describes how the data was evaluated and how the
analysis model was validated
Interpretation: Interprets the findings from the Analysis section
Evaluation of results and implications: Explains the results
Limitations of study: Discusses threats to validity
Inferences: How the results generalize given the findings and limitations
Lessons learnt: Descriptions of what went well and what did not during the
course of the experiment
Conclusions and future work: Presents a summary of the study
Relation to existing evidence: Describes the contribution of the study in
the context of earlier experiments
Impact: Identifies the most important findings
Limitations: Identifies main limitations of the approach, i.e.,
circumstances when the expected benefits will not be delivered
Future work: Suggestions for other experiments to further investigate
Acknowledgements: Identifies any contributors who do not fulfill authorship
criteria
References: Lists all cited literature
Appendices: Includes raw data and/or detailed analyses which might help
others
cf. Wohlin et al. https://doi.org/10.1007/978-3-642-29044-2
116. What about Quasi-Experiments?
• In experiments, I randomly assign subjects to treatments;
• In quasi-experiments, the assignment is based on some choices of
the designer (e.g., the Factorial ANOVA example, in which I have more
than one level of experience)
• Note that a quasi-experiment does not always allow one to convincingly
establish causal relationships (e.g., different degrees of experience
may be related to other factors that have influenced the outcome)
• When I use a group of students from a certain class for my research, I
am neither performing an experiment nor a quasi-experiment, but a
case study, as I am focusing on a specific environment and I selected
the subjects opportunistically
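The random assignment that distinguishes a true experiment from a quasi-experiment can be sketched in a few lines of Python (names are illustrative, not from the lecture):

```python
import random

def random_assignment(subjects, treatments, seed=42):
    """Randomly assign subjects to treatment groups of near-equal size.
    Randomization breaks the link between subject characteristics and
    treatment, which quasi-experimental assignment cannot guarantee."""
    rng = random.Random(seed)   # fixed seed only for reproducibility
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, s in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(s)
    return groups
```

For example, 20 subjects assigned to "Tool A" vs "Tool B" yield two random groups of 10 each.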
117. Summary
• Controlled Experiments in SE are a research strategy mostly aimed at testing the
impact of some treatment (a method, a tool) on a certain dependent variable (e.g.,
speed, bugs, success, happiness)
• They are based on Hypothesis testing, which implies showing that the
experimental data REJECT the NULL hypothesis (i.e., the hypothesis of no impact
on the dependent variable)
• Hypothesis testing uses Statistical tests to decide whether the NULL can be
REJECTED
• The selection of the statistical test depends on the Experimental design (look at
https://stats.idre.ucla.edu/other/mult-pkg/whatstat/)
• When I perform a statistical test, I hope to obtain small p-values and a large
effect size
• Remember to analyse and report Threats to Validity
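The hypothesis-testing loop summarized above can be made concrete with a simple two-sample permutation test in plain Python: it computes a p-value for the NULL hypothesis that two groups of measurements (e.g., bugs found with Tool A vs Tool B) come from the same distribution. This is a didactic sketch; in practice the test must match the experimental design, as discussed above.

```python
import random

def permutation_test(a, b, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means.
    Returns a p-value: small values are evidence against the NULL
    hypothesis that groups a and b come from the same distribution."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)                        # relabel under the NULL
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)            # add-one smoothing
```

If the two groups clearly differ, the returned p-value is small and the NULL can be REJECTED; if they overlap heavily, the p-value stays large and nothing can be concluded.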