Topic Set Size Design and
Power Analysis in Practice
Tetsuya Sakai
@tetsuyasakai
tetsuyasakai@acm.org
Waseda University
ICTIR 2016 Tutorial: September 13, 2016, Delaware.
This half-day tutorial will teach you
• How to determine the number of topics when building
a new test collection (prerequisite: you already have
some pilot data from which you can construct a topic-
by-run score matrix). You will kind of know how it
works.
• How to check whether a reported experiment is
overpowered/underpowered and decide on a better
sample size for a future experiment.
Before attending the tutorial, please download the following to your laptop:
- Sample topic-by-run matrix:
https://waseda.box.com/20topics3runs
- Excel topic set size design tools:
http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeTTEST.xlsx
http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx
http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx
[OPTIONAL]
- (Install R first and then) R scripts for power analysis:
https://waseda.box.com/SIGIR2016PACK
TUTORIAL OUTLINE
1. Significance testing basics and limitations
1.1 Preliminaries
1.2 How the t-test works
1.3 T-test with Excel and R (hands-on)
1.4 How ANOVA works
1.5 ANOVA with Excel and R (hands-on)
1.6 What's wrong with significance tests?
1.7 Significance tests in the IR literature, or lack thereof
2. Using the Excel topic set size design tools
2.1 Topic set sizes in IR
2.2 Topic set size design
<30min coffee break>
2.3 With paired t-tests (hands-on)
2.4 With one-way ANOVA (hands-on)
2.5 With confidence intervals (hands-on)
2.6 Estimating the variance (hands-on)
2.7 How much pilot data do we need?
3. Using the R power analysis scripts
3.1 Power analysis
3.2 With paired t-tests (hands-on)
3.3 With unpaired t-tests (hands-on)
3.4 With one-way ANOVA (hands-on)
3.5 With two-way ANOVA without replication (hands-on)
3.6 With two-way ANOVA (hands-on)
3.7 Overpowered and underpowered experiments in IR
4. Summary, a few additional remarks, and Q&A
(Session durations: 30min / 70min / 20min / 50min / 10min.)
Appendix
1.1 Preliminaries (1)
• In IR experiments, we often compare sample means to
guess if the population means are different.
• We often employ parametric tests (assume specific
population distributions with parameters)
- paired and unpaired t-tests (comparing m=2 means)
- ANOVA (comparing m (>2) means)
one-way, two-way, two-way without replication
(Figure: an EXAMPLE topic-by-run score matrix with n topics and m systems; from each system's sample mean over the n topics we ask: are the two population means equal? Are the m population means equal?)
1.1 Preliminaries (2)
• H0 (null hypothesis): the tentative assumption that all population means are equal.
• Test statistic t0: what you compute from the observed data; under H0, it should obey a known distribution (e.g. the t distribution).
• p-value: the probability of observing what you have observed (or something more extreme), assuming H0 is true.
1.1 Preliminaries (3)
Reject H0 if p-value <= α. (Figure: the null distribution of the test statistic t0, with the critical value t(φ; α) cutting off α/2 in each tail.)

                                    | Accept H0              | Reject H0
H0 is true (systems are equivalent) | Correct conclusion (1-α) | Type I error (α)
H0 is false (systems are different) | Type II error (β)      | Correct conclusion (1-β)

Statistical power (1-β): the ability to detect real differences.
1.1 Preliminaries (4)
(Same Type I/Type II error table as in 1.1 (3); statistical power is the ability to detect real differences.)
Cohen's five-eighty convention: α = 5%, 1-β = 80% (β = 20%), i.e. Type I errors are treated as four times as serious as Type II errors. The ratio may be set depending on the specific situation.
For a continuous random variable x and its probability density function f(x) (how likely x is to take a particular value), the expectation of a function g(x) (including g(x) = x) is given by:
E[g(x)] = ∫ g(x) f(x) dx
Population mean: μ = E[x], the central position of x as it is observed an infinite number of times.
Population variance: σ² = V[x] = E[(x - μ)²], how x varies around the population mean.
Population standard deviation: σ = √V[x].
1.1 Preliminaries (5)
A normal distribution with population parameters (μ, σ²) is denoted by N(μ, σ²). Its probability density function is:
f(x) = (1/√(2πσ²)) exp(-(x - μ)²/(2σ²))
Properties of a normal distribution: E[x] = μ, V[x] = σ².
(Figure: the pdf of a normal distribution with μ = 100, σ = 20.)
1.1 Preliminaries (6)
Standardisation: if x obeys N(μ, σ²), then z = (x - μ)/σ obeys N(0, 1), the standard normal distribution (population mean 0, population standard deviation 1).
1.1 Preliminaries (7)
1.1 Preliminaries (8)
For random variables x and y, a function f(x, y) that satisfies f(x, y) >= 0 and ∫∫ f(x, y) dx dy = 1 is called a joint probability density function, whereas the marginal probability density functions are defined as:
g(x) = ∫ f(x, y) dy,  h(y) = ∫ f(x, y) dx
If f(x, y) = g(x) h(y) holds for any (x, y), x and y are said to be independent.
Reproductive property: if x1, …, xk are independent and obey N(μ1, σ1²), …, N(μk, σk²), then Σi ai xi obeys N(Σi ai μi, Σi ai² σi²). Adding normally distributed variables still gives you a normal distribution, with the population means and population variances combining as above.
1.1 Preliminaries (9)
If x1, …, xn are independent and obey N(μi, σi²), then Σi ai xi obeys N(Σi ai μi, Σi ai² σi²).
Corollary: if we let ai = 1/n, μi = μ, σi = σ, then the sample mean x̄ = (1/n) Σi xi obeys N(μ, σ²/n), and therefore z = (x̄ - μ)/(σ/√n) obeys N(0, 1).
1.1 Preliminaries (10)
Sample mean: x̄ = (1/n) Σi xi
Sum of squares: S = Σi (xi - x̄)²
Sample variance: V = S/(n - 1)
Sample standard deviation: s = √V
If x1, …, xn are independent and obey N(μ, σ²), then E[V] = σ² holds: the sample variance V is an unbiased estimator of the population variance σ².
1.1 Preliminaries (11)
cf. 2.5 (3):
s is NOT an unbiased estimator of the population standard deviation
If x1, …, xn are independent and identically distributed with population mean μ and population variance σ² (not necessarily normal), then:
• Law of large numbers: as n approaches infinity, x̄ approaches μ. It's a good thing to observe lots of data to estimate the population mean.
• Central Limit Theorem: provided that n is large, the distribution of x̄ can be approximated by N(μ, σ²/n). If you have lots of observations, then the sample mean can be regarded as normally distributed even if we don't know much about the individual random variables {xi}.
1.1 Preliminaries (12)
If u1, …, uk are independent and obey N(0, 1), then the probability distribution that the following random variable obeys is called a chi-square distribution with φ = k degrees of freedom, denoted by χ²(φ):
χ² = u1² + ⋯ + uk²
The pdf of the above distribution is given by:
f(x) = x^(φ/2 - 1) e^(-x/2) / (2^(φ/2) Γ(φ/2))  for x > 0  (Γ: the gamma function)
1.1 Preliminaries (13)
If x obeys χ²(φ), then E[x] = φ and V[x] = 2φ.
If x1, …, xn are independent and obey N(μ, σ²), then:
(a) Σi ((xi - μ)/σ)² obeys χ²(n).
(b) x̄ and V are independent.
(c) S/σ² = (n - 1)V/σ² obeys χ²(n - 1).
1.1 Preliminaries (14)
(a) is a corollary from the previous slide, since (xi - μ)/σ obeys N(0, 1); for (b) and (c), see [Nagata03] p.57 and p.58.
1.1 Preliminaries (15)
If u obeys N(0, 1), w obeys χ²(φ), and they are independent, the probability distribution that the following random variable obeys is called a t distribution with φ degrees of freedom, denoted by t(φ):
t = u / √(w/φ)
IMPORTANT PROPERTY: if x1, …, xn are independent and obey N(μ, σ²), then
t = (x̄ - μ) / √(V/n)
obeys t(n - 1), where x̄ and V are the sample mean and sample variance as defined in 1.1 (11).
1.1 Preliminaries (16)
If w1 obeys χ²(φ1), w2 obeys χ²(φ2), and they are independent, the probability distribution that the following random variable obeys is called an F distribution with (φ1, φ2) degrees of freedom, denoted by F(φ1, φ2):
F = (w1/φ1) / (w2/φ2)
IMPORTANT PROPERTY: if x11, …, x1n1 obey N(μ1, σ1²), x21, …, x2n2 obey N(μ2, σ2²), and they are all independent, then
F = (V1/σ1²) / (V2/σ2²)
obeys F(n1 - 1, n2 - 1), where V1 and V2 are the two sample variances.
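These distributions are used throughout the tutorial. As a minimal R sketch (base R functions; the example values anticipate the worked examples later in this deck), critical values and tail probabilities can be obtained as follows:

qnorm(1 - 0.05/2)          # two-sided critical z value for alpha = 0.05: 1.960
qt(1 - 0.05/2, df = 19)    # two-sided critical t value t(19; 0.05): 2.093
qchisq(1 - 0.05, df = 19)  # upper 5% point of chi-square(19)
qf(1 - 0.05, df1 = 2, df2 = 57)                    # critical F value F(2, 57; 0.05)
pf(3.3015, df1 = 2, df2 = 57, lower.tail = FALSE)  # p-value for F0 = 3.3015: 0.044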
1.2 How the t-test works (1) paired t-test
Comparing Systems X and Y over n topics with (say) mean nDCG: what does this sample tell us about the populations?
ASSUMPTIONS:
x1, …, xn (the per-topic scores of X) are independent and obey N(μX, σX²);
y1, …, yn (the per-topic scores of Y) are independent and obey N(μY, σY²).
Under these assumptions:
1.2 How the t-test works (2) paired t-test
In Slide 1.1 (9), let a1 = 1, a2 = -1.
⇒ the per-topic difference dj = xj - yj obeys N(μX - μY, σd²)
⇒ the mean difference d̄ = (1/n) Σj dj obeys N(μX - μY, σd²/n)
⇒ z = (d̄ - (μX - μY)) / (σd/√n) obeys N(0, 1)
The sample variance of the differences, Vd = Σj (dj - d̄)²/(n - 1), is an unbiased estimator of σd². Replacing σd² with Vd turns z into a statistic that obeys the t distribution with n-1 degrees of freedom, which is basically like the standard normal distribution (see also 1.1 (15)).
1.2 How the t-test works (3) paired t-test
From 1.1 (10), z = (d̄ - (μX - μY))/(σd/√n) obeys N(0, 1). We don't know the population variance, so we use the sample variance instead (1.1 (7), 1.1 (11)):
t = (d̄ - (μX - μY)) / √(Vd/n) obeys t(n - 1).
Hypotheses (two-sided test):
H0: μX = μY (same population means: X and Y are equally effective)
H1: μX ≠ μY
Since t obeys t(n - 1) under our assumptions, if we further assume H0 (μX - μY = 0), then
t0 = d̄ / √(Vd/n) obeys t(n - 1).
1.2 How the t-test works (4) paired t-test
(Figure: the null distribution t(n - 1), centred at 0, with the observed test statistic t0 marked.)
1.2 How the t-test works (5) paired t-test
Under H0, t0 obeys t(n - 1).
So if |t0| >= t(n - 1; α), something highly unlikely has happened: we assumed H0, but that must have been wrong. Reject H0! H1 is probably true, with 100(1-α)% confidence.
(α: the significance criterion. Figure: the critical t values ±t(n - 1; α) cut off α/2 in each tail of the null distribution.)
1.2 How the t-test works (6) paired t-test
Using Excel to do a paired t-test:
- Reject H0 if |t0| >= TINV(α, n-1) = T.INV.2T(α, n-1).
- P-value = TDIST(|t0|, n-1, 2) = T.DIST.2T(|t0|, n-1).
(Blue areas under the curve: the probability of observing the data at hand or something more extreme, if H0 is true.)
1.2 How the t-test works (7) confidence intervals
From 1.2 (3), t = (d̄ - (μX - μY))/√(Vd/n) obeys t(n - 1)
⇒ Pr[-t(n - 1; α) <= t <= t(n - 1; α)] = 1 - α.
(Figure: the central 100(1-α)% of the t(n - 1) distribution, with α/2 in each tail beyond ±t(n - 1; α).)
1.2 How the t-test works (8) confidence intervals
From 1.2 (3),
⇒ Pr[-t(n - 1; α) <= (d̄ - (μX - μY))/√(Vd/n) <= t(n - 1; α)] = 1 - α
⇒ Pr[d̄ - MOE <= μX - μY <= d̄ + MOE] = 1 - α,
where MOE = t(n - 1; α) √(Vd/n) is the margin of error.
So the 95% CI for the difference in means (α = 0.05) is given by [d̄ - MOE, d̄ + MOE].
Different samples yield different CIs; 95% of the CIs will capture the true difference in means.
1.2 How the t-test works (9) unpaired t-test
Comparing Systems X and Y, based on a sample x11, …, x1n1 of size n1 for X and another sample x21, …, x2n2 of size n2 for Y.
ASSUMPTIONS: the above observations are all independent, obey N(μ1, σ1²) and N(μ2, σ2²) respectively, and furthermore σ1² = σ2² (= σ²): homoscedasticity (equal variance). The t-test is quite robust to violations of this assumption, though [Sakai16SIGIRshort].
1.2 How the t-test works (10) unpaired t-test
Under the assumptions, it is known that
t = (x̄1 - x̄2 - (μ1 - μ2)) / √(V (1/n1 + 1/n2)) obeys t(n1 + n2 - 2),
where V = (S1 + S2)/(n1 + n2 - 2) is the pooled variance (S1, S2: the two sums of squares). cf. 1.2 (15).
1.2 How the t-test works (11) unpaired t-test
Hypotheses (two-sided test):
H0: μ1 = μ2 (same population means: X and Y are equally effective)
H1: μ1 ≠ μ2
1.2 How the t-test works (12) unpaired t-test
Since t obeys t(n1 + n2 - 2) under our assumptions, if we further assume H0 (μ1 - μ2 = 0), then
t0 = (x̄1 - x̄2) / √(V (1/n1 + 1/n2)) obeys t(n1 + n2 - 2).
(Figure: the null distribution, centred at 0, with the observed test statistic t0 marked.)
1.2 How the t-test works (13) unpaired t-test
Under H0, t0 obeys t(φ), where φ = n1 + n2 - 2.
So if |t0| >= t(φ; α), something highly unlikely has happened: we assumed H0, but that must have been wrong. Reject H0! H1 is probably true, with 100(1-α)% confidence. (α: the significance level.)
1.2 How the t-test works (14) unpaired t-test
Using Excel to do an unpaired t-test:
- Reject H0 if |t0| >= TINV(α, φ) = T.INV.2T(α, φ).
- P-value = TDIST(|t0|, φ, 2) = T.DIST.2T(|t0|, φ).
(Blue areas under the curve: the probability of observing the data at hand or something more extreme, if H0 is true.)
1.2 How the t-test works (15) unpaired t-test
• Unpaired (i.e., two-sample) t-tests:
- Student's t-test: equal-variance assumption.
- Welch's t-test: no equal-variance assumption, but involves approximations. Use this if (1) the two sample sizes are very different AND (2) the two sample variances are very different [Sakai16SIGIRshort].
The Welch t-statistic and its degrees of freedom:
t0 = (x̄1 - x̄2) / √(V1/n1 + V2/n2)
φ* = (V1/n1 + V2/n2)² / ( (V1/n1)²/(n1 - 1) + (V2/n2)²/(n2 - 1) )
1.2 How the t-test works (15) effect sizes
An effect size here is a difference measured in standard deviation units.
Paired data [Sakai14SIGIRForum]: effect size d = d̄ / √Vd.
Unpaired data: effect size d = (x̄1 - x̄2) / √V, where V is the pooled variance (cf. Hedges' g, Glass's Δ).
WARNING: Different books define "Cohen's d" differently. [Okubo12]
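As a minimal R sketch of these two definitions (the helper names are mine, not from the tutorial):

cohens_d_paired <- function(x, y) {
  d <- x - y
  mean(d) / sd(d)        # mean difference in units of the sd of the differences
}
cohens_d_unpaired <- function(x, y) {
  n1 <- length(x); n2 <- length(y)
  V <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)  # pooled variance
  (mean(x) - mean(y)) / sqrt(V)
}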
1.3 T-test with Excel and R (hands-on) (1)
- Sample topic-by-run matrix:
https://waseda.box.com/20topics3runs
The easiest way to obtain the p-values:
Paired t-test:
= TTEST(A1:A20,B1:B20,2,1) = 0.2058
Unpaired, Student’s t-test:
= TTEST(A1:A20,B1:B20,2,2) = 0.5300
Unpaired, Welch’s t-test:
= TTEST(A1:A20,B1:B20,2,3) = 0.5302
0.4695 0.3732 0.3575
0.2813 0.3783 0.2435
0.3914 0.3868 0.3167
0.6884 0.5896 0.6024
0.6121 0.4725 0.4766
0.3266 0.233 0.2429
0.5605 0.4328 0.4066
0.5916 0.5073 0.4707
0.4385 0.3889 0.3384
0.5821 0.5551 0.4597
0.2871 0.3274 0.2769
0.5186 0.5066 0.4066
0.5188 0.5198 0.3859
0.5019 0.4981 0.4568
0.4702 0.3878 0.3437
0.329 0.4387 0.2649
0.4758 0.4946 0.4045
0.3028 0.34 0.3253
0.3752 0.4895 0.3205
0.2796 0.2335 0.224
(The matrix: 20 topics × runs A, B, C. In TTEST, the third argument 2 requests a two-sided test; the fourth argument selects the test type: 1 = paired, 2 = equal-variance, 3 = Welch.)
But this makes you treat the t-test as a black box.
To obtain the test statistic, degrees of freedom etc., let’s do it “by hand”...
1.3 T-test with Excel and R (hands-on) (2)
Columns A-C hold the run scores; column D holds the per-topic differences: D1 = A1-B1, and so on.
d̄ = AVERAGE(D1:D20) = 0.022375
Vd = DEVSQ(D1:D20)/(20-1) = 0.005834
Paired t-test:
t0 = d̄ / √(Vd/20) = 1.3101
P-value = T.DIST.2T(|t0|, 19) = 0.2058.
0.4695 0.3732 0.3575 0.0963
0.2813 0.3783 0.2435 -0.097
0.3914 0.3868 0.3167 0.0046
0.6884 0.5896 0.6024 0.0988
0.6121 0.4725 0.4766 0.1396
0.3266 0.233 0.2429 0.0936
0.5605 0.4328 0.4066 0.1277
0.5916 0.5073 0.4707 0.0843
0.4385 0.3889 0.3384 0.0496
0.5821 0.5551 0.4597 0.027
0.2871 0.3274 0.2769 -0.0403
0.5186 0.5066 0.4066 0.012
0.5188 0.5198 0.3859 -0.001
0.5019 0.4981 0.4568 0.0038
0.4702 0.3878 0.3437 0.0824
0.329 0.4387 0.2649 -0.1097
0.4758 0.4946 0.4045 -0.0188
0.3028 0.34 0.3253 -0.0372
0.3752 0.4895 0.3205 -0.1143
0.2796 0.2335 0.224 0.0461
1.3 T-test with Excel and R (hands-on) (3)
Unpaired, Student's t-test:
x̄1 - x̄2 = AVERAGE(A1:A20) - AVERAGE(B1:B20) = 0.022375
S1 = DEVSQ(A1:A20) = 0.291139
S2 = DEVSQ(B1:B20) = 0.182445
V = (S1 + S2)/(20 + 20 - 2) = 0.012463
t0 = (x̄1 - x̄2) / √(V (1/20 + 1/20)) = 0.6338
P-value = T.DIST.2T(|t0|, 38) = 0.5300.
(Same 20-topic matrix as before.)
1.3 T-test with Excel and R (hands-on) (4)
Unpaired, Welch's t-test:
V1 = DEVSQ(A1:A20)/(20-1) = 0.015323
V2 = DEVSQ(B1:B20)/(20-1) = 0.009602
t0 = (x̄1 - x̄2) / √(V1/20 + V2/20) = 0.6338
φ* = (V1/20 + V2/20)² / ( (V1/20)²/19 + (V2/20)²/19 ) = 36.0985
P-value = T.DIST.2T(|t0|, φ*) = 0.5302.
(Same 20-topic matrix as before.)
1.3 T-test with Excel and R (hands-on) (5)
1.3 T-test with Excel and R (hands-on) (6)
Compare with the Excel results.
1.3 T-test with Excel and R (hands-on) (7)
Also try:
R uses Welch as the default!
Compare with the Excel results.
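The R screenshots are not reproduced in this transcript. A minimal equivalent sketch, assuming the sample matrix has been saved as a whitespace-separated text file (file name assumed):

# Read the 20-topic x 3-run matrix and run the three t-tests on runs A and B.
scores <- read.table("20topics3runs.txt", col.names = c("A", "B", "C"))
t.test(scores$A, scores$B, paired = TRUE)      # paired t-test: p = 0.2058
t.test(scores$A, scores$B, var.equal = TRUE)   # Student's t-test: p = 0.5300
t.test(scores$A, scores$B)                     # Welch's t-test (R default): p = 0.5302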
1.5 How ANOVA works (1)
ANOVA can ask "are ALL systems equally effective?" when there are m (>2) systems. In this tutorial, let's first consider the following two simplest types of ANOVA.
One-way ANOVA with an equal number of replicates (generalises the unpaired t-test):
System 1: x11, x12, …, x1n
System 2: x21, x22, …, x2n
System 3: x31, x32, …, x3n
Two-way ANOVA without replication (generalises the paired t-test): the scores form a topic-by-system matrix, with topics 1, …, n as columns and systems as rows. If the j-th score of every system corresponds to the same topic j, this design should be preferred over one-way ANOVA.
1.5 How ANOVA works (2) one-way ANOVA
xij: the score of the i-th system (i = 1, …, m) for topic j (j = 1, …, n).
ASSUMPTIONS: the xij are independent and obey N(μi, σ²), or, equivalently, xij = μi + εij with εij obeying N(0, σ²). The common σ² is the homoscedasticity (equal variance) assumption.
Let μ = (1/m) Σi μi (population grand mean) and ai = μi - μ (i-th system effect). Then it is easy to show that Σi ai = 0.
Hypotheses:
H0: a1 = ⋯ = am = 0 (ALL population means are equal)
H1: at least one of the system effects is non-zero.
1.5 How ANOVA works (3) one-way ANOVA
Let x̄ = (1/mn) Σi Σj xij (sample grand mean) and x̄i = (1/n) Σj xij (sample system mean). Note that
xij - x̄ = (x̄i - x̄) + (xij - x̄i):
the diff between a score and the grand mean decomposes into the diff between the system mean and the grand mean, plus the diff between the score and the system mean.
Similarly, ST = SA + SE holds, where:
1.5 How ANOVA works (4) one-way ANOVA
ST = Σi Σj (xij - x̄)² (total variation)
SA = n Σi (x̄i - x̄)² (between-system variation)
SE = Σi Σj (xij - x̄i)² (within-system variation)
The degrees of freedom (how accurate is each sum of squares?) satisfy φT = mn - 1 = φA + φE, where φA = m - 1 and φE = m(n - 1).
Under the i.i.d. and normality assumptions on the εij:
(a) SE/σ² obeys χ²(φE) (cf. 1.1 (14)(c)) ⇒ E[SE/φE] = σ² (cf. 1.1 (10)).
(b) Under H0 (all ai = 0), SA/σ² obeys χ²(φA), and SA and SE are independent (cf. 1.1 (14)(c)).
1.5 How ANOVA works (5) one-way ANOVA
ST = SA + SE, φT = φA + φE.
Under H0, both SA/σ² and SE/σ² obey chi-square distributions and are independent, so
F0 = VA/VE = (SA/φA)/(SE/φE) obeys F(φA, φE).
Is the between-system variation large compared to the within-system variation?
1.5 How ANOVA works (6) one-way ANOVA
(Figure: the null F(φA, φE) distributions with φA = m - 1 and φE = m(n - 1), for m = 3, m = 5, and m = 20 with n = 10; cf. 1.1 (16).)
Hypotheses:
H0: a1 = ⋯ = am = 0
H1: at least one of the system effects is non-zero.
1.5 How ANOVA works (7) one-way ANOVA
Test statistic: F0 = VA/VE (SE from 1.5 (4)).
Reject H0 if F0 >= F(φA, φE; α). (Figure: the critical F value F(φA, φE; α) cuts off the upper α of the null distribution.)
The one-way ANOVA table:
Source          | Sum of squares | Degrees of freedom | Mean squares             | F0
Between-system  | SA             | φA = m - 1         | VA = SA/φA = SA/(m-1)    | VA/VE = m(n-1)SA / ((m-1)SE)
Within-system   | SE             | φE = m(n - 1)      | VE = SE/φE = SE/(m(n-1)) |
Total           | ST             | φT = mn - 1        |                          |
1.5 How ANOVA works (8) one-way ANOVA
- Reject H0 if F0 >= F(φA, φE; α) = F.INV.RT(α, φA, φE)
- P-value = F.DIST.RT(F0, φA, φE)
If n varies across the m systems, let φE = (total #observations) - m.
Effect sizes for one-way ANOVA [Okubo12]: how much of the total variance can be accounted for by the between-system variance?
Population effect size: η². Simplest estimator of the above from a sample: η̂² = SA/ST.
1.5 How ANOVA works (9) one-way ANOVA
A more accurate estimator is given in [Okubo12, Sakai14SIGIRForum].
1.5 How ANOVA works (10) two-way ANOVA w/o replication
The data form a topic-by-system matrix: xij is the score of system i (i = 1, …, m) on topic j (j = 1, …, n).
ASSUMPTIONS: xij = μ + ai + bj + εij, where the εij are independent and obey N(0, σ²) (homoscedasticity); the system and topic effects are additive and linearly related to xij.
Sample grand mean: x̄ = (1/mn) Σi Σj xij; sample system mean: x̄i· = (1/n) Σj xij; sample topic mean: x̄·j = (1/m) Σi xij.
1.5 How ANOVA works (11) two-way ANOVA w/o replication
Hypotheses for the system effects: H0: a1 = ⋯ = am = 0 vs. H1: at least one differs.
Hypotheses for the topic effects: H0: b1 = ⋯ = bn = 0 vs. H1: at least one differs.
Note that
xij - x̄ = (x̄i· - x̄) + (x̄·j - x̄) + (xij - x̄i· - x̄·j + x̄):
the diff between a score and the grand mean decomposes into the diff between the system mean and the grand mean, the diff between the topic mean and the grand mean, and the rest. (The green part of the one-way decomposition in 1.5 (3) is split further.)
1.5 How ANOVA works (12) two-way ANOVA w/o replication
Similarly, ST = SA + SB + SE holds, where:
ST = Σi Σj (xij - x̄)² (total variation)
SA = n Σi (x̄i· - x̄)² (between-system variation)
SB = m Σj (x̄·j - x̄)² (between-topic variation)
SE = Σi Σj (xij - x̄i· - x̄·j + x̄)² (residual)
The residual SE is the within-system variance for one-way ANOVA in 1.5 (4) with the between-topic variation removed.
ST = SA + SB + SE, φT = φA + φB + φE.
Hypotheses for the system effects: H0: all ai = 0 vs. H1: at least one differs. Under H0, F0 = VA/VE obeys F(φA, φE).
Hypotheses for the topic effects: H0: all bj = 0 vs. H1: at least one differs. Under H0, F0 = VB/VE obeys F(φB, φE).
1.5 How ANOVA works (13) two-way ANOVA w/o replication
Here φA = m - 1, φB = n - 1, φE = (m - 1)(n - 1).
(Figure: the null F distributions for m = 3, m = 5, and m = 20 with n = 10.)
Hypotheses (for system effects): H0: all ai = 0; H1: at least one of the system effects is non-zero.
Test statistic: F0 = VA/VE, with φA = m - 1 and φE = (m - 1)(n - 1).
Reject H0 if F0 >= F(φA, φE; α). (Figure: the critical F value cuts off the upper α of the null distribution.)
1.5 How ANOVA works (14) two-way ANOVA w/o replication
For topic effects, use SB and φB instead of SA and φA (SE from 1.5 (12)). The ANOVA table:
Source          | Sum of squares | Degrees of freedom    | Mean squares | F0
Between-system  | SA             | φA = m - 1            | VA = SA/φA   | VA/VE = (n-1)SA/SE
Between-topic   | SB             | φB = n - 1            | VB = SB/φB   | VB/VE = (m-1)SB/SE
Residual        | SE             | φE = (m - 1)(n - 1)   | VE = SE/φE   |
Total           | ST             | φT = mn - 1           |              |
1.5 How ANOVA works (15) two-way ANOVA w/o replication
For system effects:
- Reject H0 if F0 >= F(φA, φE; α) = F.INV.RT(α, φA, φE)
- P-value = F.DIST.RT(F0, φA, φE)
ST = SA + SB + SAxB + SE
1.5 How ANOVA works (16) two-way ANOVA
φT = φA + φB + φAxB + φE
• Two factors, A (m levels) and B (n levels); each cell (i, j) contains r observations xij1, …, xijr (total #observations N = mnr).
• The interaction between A and B is considered.
Not discussed in detail in this tutorial, as this design is rare in system-based evaluation.
(Figure: interaction plot of score against A level for B levels 1 and 2; with an interaction, the score seems high only if the A level is high AND the B level is high; with no interaction, the curves are parallel.)
Source   | Sum of squares | Degrees of freedom    | Mean squares      | F0
A        | SA             | φA = m - 1            | VA = SA/φA        | VA/VE   (P-value = F.DIST.RT(F0, φA, φE))
B        | SB             | φB = n - 1            | VB = SB/φB        | VB/VE   (P-value = F.DIST.RT(F0, φB, φE))
AxB      | SAxB           | φAxB = (m - 1)(n - 1) | VAxB = SAxB/φAxB  | VAxB/VE (P-value = F.DIST.RT(F0, φAxB, φE))
Residual | SE             | φE = mn(r - 1)        | VE = SE/φE        |
Total    | ST             | φT = mnr - 1          |                   |
1.5 How ANOVA works (17) two-way ANOVA
ST = SA + SB + SAxB + SE, φT = φA + φB + φAxB + φE.
Definitions of SAxB and SE for two-way ANOVA can be found in textbooks.
Effect sizes for two-way ANOVA with and without replication [Okubo12]: how much of the total variance does the between-system variance account for?
Population effect sizes: η², and partial η², where the variances we're not interested in are removed from the denominator (more accurate).
1.5 How ANOVA works (18)
Simplest estimators of the above from a sample: η̂² = SA/ST and partial η̂² = SA/(SA + SE), where
without replication: ST = SA + SB + SE
with replication: ST = SA + SB + SAB + SE.
More accurate estimators are given in [Okubo12, Sakai14SIGIRForum].
1.6 ANOVA with Excel and R (1)
one-way ANOVA
• ST = DEVSQ(A1:C20) = 0.726229
• SE = DEVSQ(A1:A20) + DEVSQ(B1:B20) + DEVSQ(C1:C20) = 0.650834
• SA = ST - SE = 0.075395
(Data: the same 20-topic × 3-run matrix, columns A, B, C.)
1.6 ANOVA with Excel and R (2)
one-way ANOVA
Source          | Sum of squares | Degrees of freedom | Mean squares           | F0
Between-system  | SA = 0.075395  | φA = m-1 = 2       | VA = SA/φA = 0.037697  | VA/VE = 3.3015
Within-system   | SE = 0.650834  | φE = m(n-1) = 57   | VE = SE/φE = 0.011418  |
Total           | ST = 0.726229  |                    |                        |
P-value = F.DIST.RT(F0, φA, φE) = 0.0440
1.6 ANOVA with Excel and R (3)
one-way ANOVA
(Same matrix: the data that we used for the t-test.)
1.6 ANOVA with Excel and R (4)
one-way ANOVA
(Same matrix; the slide shows the R session as a screenshot.)
Compare with the Excel results.
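A minimal sketch of the corresponding R session (reusing the scores data frame from 1.3), which reproduces the Excel numbers:

# One-way ANOVA in R: reshape to long format, one row per (system, topic) score.
long <- data.frame(score  = c(scores$A, scores$B, scores$C),
                   system = factor(rep(c("A", "B", "C"), each = 20)))
summary(aov(score ~ system, data = long))   # F0 = 3.30 on (2, 57) df, p = 0.044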
1.6 ANOVA with Excel and R (5)
two-way ANOVA w/o replication
• ST = DEVSQ(A1:C20) = 0.726229 (cf. 1.6 (1))
• SA = 20 × ((0.4501 - 0.4146)² + (0.4277 - 0.4146)² + (0.3662 - 0.4146)²) = 0.075395, from the three system means and the grand mean 0.4146
• SB = 0.579826, computed analogously from the 20 topic means
• SE = ST - SA - SB = 0.071008
(Same 20-topic × 3-run matrix as in 1.3.)
1.6 ANOVA with Excel and R (6)
two-way ANOVA w/o replication
Source          | Sum of squares | Degrees of freedom     | Mean squares           | F0
Between-system  | SA = 0.075395  | φA = m-1 = 2           | VA = SA/φA = 0.037697  | VA/VE = 20.1737
Between-topic   | SB = 0.579826  | φB = n-1 = 19          | VB = SB/φB = 0.030517  | VB/VE = 16.3312
Residual        | SE = 0.071008  | φE = (m-1)(n-1) = 38   | VE = SE/φE = 0.001869  |
Total           | ST = 0.726229  |                        |                        |
P-value (system) = F.DIST.RT(F0, φA, φE) = 1.070E-06
P-value (topic) = F.DIST.RT(F0, φB, φE) = 8.173E-13
1.6 ANOVA with Excel and R (7)
two-way ANOVA w/o replication
(Slides (7)-(9) show the R session on the same matrix as screenshots, which are not reproduced here.)
Compare with the Excel results.
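A minimal sketch of the corresponding R session (reusing the long data frame from the one-way sketch above):

# Two-way ANOVA without replication: add the topic as a blocking factor.
long$topic <- factor(rep(1:20, times = 3))
summary(aov(score ~ system + topic, data = long))
# system: F0 = 20.17, p = 1.07e-06;  topic: F0 = 16.33, p = 8.17e-13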
1.7 What's wrong with significance tests? (1)
[Johnson99]
• Deming (1975) commented that the reason students have problems
understanding hypothesis tests is that they may be trying to think.
• Carver (1978) recommended that statistical significance testing
should be eliminated; it is not only useless, it is also harmful because
it is interpreted to mean something else.
• Cohen (1994:997) noted that statistical testing of the null hypothesis
"does not tell us what we want to know, and we so much want to
know what we want to know that, out of desperation, we
nevertheless believe that it does!"
1.7 What's wrong with significance tests? (2)
• We want to know P(H|D), but classical significance testing only gives us something like P(D|H) (H: hypothesis, D: data). (Alternative: Bayesian statistics etc.)
• Reporting α (e.g. 0.05) instead of the actual p-values leads to dichotomous thinking ("significant or not?").
• Even if p-values are reported, p-values reflect not only the effect size (the magnitude of the actual difference) but also the sample size:
p-value = f(sample_size, effect_size)
large effect size ⇒ small p-value
large sample size ⇒ small p-value
So anything can be made statistically significant by using lots of data (cf. 1.2 (15)).
1.7 What's wrong with significance tests? (3)
[Sakai14SIGIRForum]
So what should we do?
Whenever using a classical significance test, report not only p-values,
but also effect sizes and confidence intervals.
(Effect size: the difference between two systems measured in standard deviation units.)
1.7 What's wrong with significance tests? (4)
[Sakai14SIGIRForum]
So what should we do?
Whenever using a classical significance test, report not only p-values,
but also effect sizes and confidence intervals.
Difference between
two systems
measured in standard
deviation units
Actually, if you want p-values for every system pair, you can
apply randomised Tukey HSD
[Carterette12,Sakai14PROMISE] WITHOUT doing ANOVA.
More accurate
estimators of
and
cf. 1.5 (18)
1.7 What's wrong with significance tests? (5)
Randomised Tukey HSD test for m>=2 systems
http://research.nii.ac.jp/ntcir/tools/discpower-en.html
• Input: a topic-by-run score matrix.
• Can be used to compute
p-values for 2 or more systems.
• Unlike classical tests, it does not
rely on assumptions such as normality.
• It is a kind of multiple comparison
procedure (free from the familywise
error rate problem).
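The Discpower tool linked above is the implementation to use in practice; the following is only an illustrative R sketch of the idea (x is assumed to be an n-topic × m-run score matrix):

# Randomised Tukey HSD: compare each observed pairwise difference of run means
# against the null distribution of the LARGEST range of run means obtained by
# permuting the scores within each topic.
rand_tukey_hsd <- function(x, B = 10000) {
  obs <- outer(colMeans(x), colMeans(x), "-")    # observed pairwise differences
  maxrange <- replicate(B, {
    perm <- t(apply(x, 1, sample))               # shuffle scores within each topic
    mu <- colMeans(perm)
    max(mu) - min(mu)
  })
  m <- ncol(x)
  p <- matrix(1, m, m)
  for (i in 1:m) for (j in 1:m)
    p[i, j] <- mean(maxrange >= abs(obs[i, j]))  # familywise-adjusted p-values
  p
}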
1.8 Significance tests in the IR literature, or lack thereof (1)-(5) [Sakai16SIGIR]
(These five slides present survey results from [Sakai16SIGIR] on how significance tests are, or are not, reported in the IR literature; the tables and figures are not reproduced here.)
2.1 Topic set sizes in IR (1) [Sakai16IRJ]
According to Sparck Jones and Van Rijsbergen [SparckJones75],
fewer than 75 topics “are of no real value”;
250 topics “are minimally acceptable”;
more than 1000 topics “are needed for some purposes”
because “real collections are large”; “statistically significant results are
desirable” and “scaling up must be studied.”
2.1 Topic set sizes in IR (2) [Sakai16IRJ]
In 1979, in a report that considered the number of relevance
assessments required from a statistical viewpoint, Gilbert and Sparck
Jones remarked [Gilbert79]:
“Since there is some doubt about the feasibility of getting 1000
requests, or the convenience of such a large set for future experiments,
we consider 500 requests.”
2.1 Topic set sizes in IR (3)
The default topic set size at TREC: 50.
Exceptions include the million query track that created 1800+ topics
[Carterette08] but creating a “reusable” test collection was not the
objective of the track.
Early TREC ad hoc tasks and topics [Voorhees05, p.24]:
Round  | Documents   | Topics
TREC-1 | disks 1 + 2 | 51-100
TREC-2 | disks 1 + 2 | 101-150
TREC-3 | disks 1 + 2 | 151-200
TREC-4 | disks 2 + 3 | 201-250
TREC-5 | disks 2 + 4 | 251-300
TREC-6 | disks 4 + 5 | 301-350
TREC-7 | disks 4 + 5 | 351-400
TREC-8 | disks 4 + 5 | 401-450
2.1 Topic set sizes in IR (4) [Sakai16IRJ]
In 2009, Voorhees conducted an experiment where she randomly
split 100 TREC topics in half to count discrepancies in statistically
significant results, and concluded that
“Fifty-topic sets are clearly too small to have confidence in a
conclusion when using a measure as unstable as P(10). Even for
stable measures, researchers should remain skeptical of conclusions
demonstrated on only a single test collection.” [Voorhees09]
(Diagram: the 100 TREC-7+8 topics, with TREC 2004 robust track systems, are randomly split into two 50-topic sets; on one half the paired t-test says System A > B, while on the other it says System A < B: a conflict.)
But if randomised Tukey HSD (i.e. a multiple comparison procedure) is used for filtering system pairs, discrepancies across test collections almost never occur [Sakai16ICTIR].
2.1 Topic set sizes in IR (5)
At CIKM 2008, [Webber08] pointed out that the topic set size should be
determined based on the required statistical power.
(Recall the Type I/Type II error table from 1.1 (3): statistical power is the ability to detect real differences.)
2.1 Topic set sizes in IR (6)
The approach of [Webber08]:
• Incremental test collection building – adding topics with relevance
assessments one by one until the desired power is achieved;
• Considered the t-test without addressing the familywise error rate
problem;
• Estimated the variance of score deltas using non-standard methods;
We want a more straightforward answer to “How many topics should I create?”
In addition to the t-test, we can consider one-way ANOVA and confidence intervals as the basis.
Residual variances from ANOVA are unbiased estimators of the within-system variances.
2.2 Topic set size design (1) [Sakai16IRJ]
• Provides answers to the following question:
“I’m building a new test collection. How many topics should I create?”
• A prerequisite: a small topic-by-run score matrix based on pilot data,
for estimating within-system variances.
• Three approaches (with easy-to-use Excel tools), based on:
(1) paired t-test power
(2) one-way ANOVA power
(3) confidence interval width upperbound.
2.2 Topic set size design (2) [Sakai16IRJ]
Test collection designs should evolve based on past data.
(Diagram: a pilot topic-by-run score matrix with n0 topics and m runs; about 25 topics with runs from a few teams is probably sufficient [Sakai16EVIA]. Estimate n1 for TREC 201X based on its within-system variance estimate; the TREC 201X matrix then yields a more accurate estimate, from which n2 topics are chosen for TREC 201(X+1), and so on.)
2.2 Topic set size design (3) [Sakai16IRJ]
Method | Input required
Paired t-test | α (Type I error probability); β (Type II error probability); minDt (minimum detectable difference: whenever the diff between two systems is this much or larger, we want to guarantee 100(1-β)% power); σ̂d²: variance estimate for the score delta.
One-way ANOVA | α; β; m (number of systems); minD (minimum detectable range: whenever the diff between the best and worst systems is this much or larger, we want to guarantee 100(1-β)% power); σ̂²: estimate of the within-system variance under the homoscedasticity assumption.
Confidence intervals | α; δ (CI width upperbound: you want the CI for the diff between any system pair to be this much or smaller); σ̂d²: variance estimate for the score delta.
2.2 Topic set size design (4) [Sakai16IRJ]
In practice, you can deduce t-test-based and CI-based results from ANOVA-based results: ANOVA-based results for m = 2 can be used instead of t-test-based results, and ANOVA-based results for m = 10 can be used instead of CI-based results.
Caveat: the ANOVA-based tool can only handle (α, β) = (0.05, 0.20), (0.01, 0.20), (0.05, 0.10), (0.01, 0.10).
2.3 Paired t-tests (1)
Example situation: you plan to compare a system pair with the paired t-test at α = 5%. You plan to use nDCG as the primary evaluation measure, and want to guarantee 80% power whenever the diff between the two systems is >= minDt. You know from pilot data that the variance of the nDCG delta is around σ̂d². What is the required number of topics n?
Method | Input required
Paired t-test | α (Type I error probability); β (Type II error probability); minDt (minimum detectable difference: whenever the diff between two systems is this much or larger, we want to guarantee 100(1-β)% power); σ̂d²: variance estimate for the score delta.
2.3 Paired t-tests (2)
Notations (some slightly different from Part 1):
t: a random variable that obeys t(φ), where φ = n - 1;
t(φ; α): the two-sided critical t value for significance criterion α, i.e. Pr[|t| >= t(φ; α)] = α; in Excel, t(φ; α) = T.INV.2T(α, φ).
2.3 Paired t-tests (3)
Under our assumptions, t = (d̄ - (μX - μY))/√(Vd/n) obeys t(n - 1) [1.2 (3)].
In a t-test, we let t0 = d̄/√(Vd/n) and consider Pr[|t0| >= t(n - 1; α)]. Due to the t-test procedure, regardless of what distribution t0 obeys, this is the probability of rejecting H0.
2.3 Paired t-tests (4)
Regardless of what distribution t0 obeys, the probability of rejecting H0 is
Pr[|t0| >= t(n - 1; α)] ... (a)
If H0 is true, then t0 obeys t(n - 1), and (a) is exactly α: rejecting the correct hypothesis H0 (that's how t(n - 1; α) was defined).
Alternatively, if H1 is true, the distribution that t0 obeys is known as a noncentral t distribution with φ degrees of freedom, and (a) is exactly the power, (1-β): rejecting the incorrect hypothesis H0.
(Recall the Type I/Type II error table from 1.1 (3).)
2.3 Paired t-tests (5)
(Figure: under H0 (μX = μY), t0 obeys a (central) t distribution and (a) is exactly α; under H1 (μX ≠ μY), t0 obeys a noncentral t distribution and (a) is exactly 1-β.)
2.3 Paired t-tests (6)
If H1 is true, the distribution that t0 obeys is known as a noncentral t distribution with φ degrees of freedom. The noncentral t distribution in fact has another parameter, called the noncentrality parameter:
λt = √n Δt, where Δt = (μX - μY)/σd
is the population effect size, and σd² is the population variance of the score differences (see 1.2 (2)).
2.3 Paired t-tests (7)
If H1 is true, t0 obeys the noncentral t distribution with φ degrees of freedom and noncentrality parameter λt, and
Power = Pr[|t0| >= t(n - 1; α)] ... (a)
is exactly (1-β). We want to compute (a), but the computation involving the noncentral t distribution is too complex...
2.3 Paired t-tests (8)
Fortunately, a good approximation is available [Nagata03]. Let t' be a random variable that obeys a noncentral t distribution with parameters (φ, λt), and u be a random variable that obeys a standard normal distribution.
2.3 Paired t-tests (9)
Theorem A' (Appendix) approximates the power as follows, where K = t(φ; α):
Power = Pr[|t'| >= K] ≈ Pr[u >= (K(1 - 1/(4φ)) - λt) / √(1 + K²/(2φ))] ... (a')
2.3 Paired t-tests (10)
Power = 1 - β ≈ Pr[u >= (K(1 - 1/(4φ)) - λt) / √(1 + K²/(2φ))] ... (a')
Now we know how to compute the power given (α, Δt, n). But we want to compute n given (α, β, Δt).
2.3 Paired t-tests (11)
Starting again with
Power = Pr[|t'| >= K],
we apply Theorem A (Appendix), a cruder normal approximation to the noncentral t distribution than Theorem A'.
2.3 Paired t-tests (12)
Theorem A expresses the two-sided rejection probability as a sum of an upper-tail and a lower-tail term. If λt > 0, the lower-tail term can be ignored (λt < 0 will lead to the same final result).
2.3 Paired t-tests (13)
Setting Power = 1 - β and letting z(1-β) denote the one-sided z value for probability 1-β, solving for the noncentrality parameter gives
λt ≈ z(α/2) - z(1-β).
(cf. This is rougher than Theorem A'.)
2.3 Paired t-tests (14)
When λt > 0 or λt < 0 (i.e. H1 is true), the two-sided critical t value is thus paired with a one-sided z value; similarly, a corresponding relation holds when λt = 0 (i.e. H0 is true).
2.3 Paired t-tests (15)
Theorem A'' and Theorem B (Appendix) supply the remaining approximations, relating critical t values to z values and adding a correction term.
2.3 Paired t-tests (16)
Let λt = √n Δt, and recall that λt ≈ z(α/2) - z(1-β) when H1 is true (Δt ≠ 0). Substituting these (with the correction term from Theorems A''/B) gives:
n ≈ ((z(α/2) - z(1-β)) / Δt)² + z(α/2)²/2
Given (α, β, minΔt), the minimal sample size n can be approximated by letting Δt = minΔt. But this involved a lot of approximations, so we need to go back to (a') and check that n actually achieves 100(1-β)% power.
2.3 Paired t-tests (17)
EXAMPLE: α = 0.05, β = 0.20, minimum detectable effect size minΔt = 0.50 (i.e. half a standard deviation of the diff), regardless of the evaluation measure:
→ n ≈ ((1.960 - (-0.842))/0.50)² + 1.960²/2 = 33.3
(z(α/2) = z(0.025) = NORM.S.INV(1-0.025) = 1.960; z(1-β) = z(0.80) = -0.842)
2.3 Paired t-tests (18)
So if we let n = 33, the achieved power according to (a') is 0.795 ... which doesn't quite achieve 80%!
2.3 Paired t-tests (19)
If we let n = 34, the achieved power according to (a') is 0.808 ... so n = 34 is what we need!
Don’t worry,
http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeTTEST.xlsx
will do this for you! Use the “From effect size” sheet and fill out the
orange cells.
2.3 Paired t-tests (20)
n=34 is what you
want!
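If you prefer R, base R's power.t.test performs the exact noncentral-t version of this calculation and agrees with the Excel tool:

# Smallest n for a paired t-test with alpha = 0.05, power = 0.80, minDelta_t = 0.5.
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
             type = "paired", alternative = "two.sided")  # n = 33.4, so use 34 topics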
2.3 Paired t-tests (21) [Sakai16IRJ]
(Table: topic set sizes for typical requirements based on effect sizes; not reproduced here, see [Sakai16IRJ].)
2.3 Paired t-tests (22)
In practice, you might want to specify a minimum detectable diff (minDt) in (say) nDCG instead of minΔt for guaranteeing 100(1-β)% power.
Given minDt and σ̂d², minΔt = minDt/σ̂d, so n can be obtained as before.
A conservative estimate for the delta variance would be σ̂d² = 2σ̂², where σ̂² is a within-system variance estimate obtained under a homoscedasticity assumption. See 2.6.
2.3 Paired t-tests (23)
EXAMPLE: for nDCG, α = 0.05, β = 0.20, minDt = 0.1 (i.e. one-tenth of nDCG's score range), σ̂d² = 0.50 (from some pilot data)
→ use the "From the absolute diff" sheet:
n=395 is what you
want!
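The same calculation can be mimicked in R by converting minDt into an effect size (a sketch under the slide's inputs; power.t.test only needs the ratio delta/sd = minDt/σ̂d):

power.t.test(delta = 0.1, sd = sqrt(0.5), sig.level = 0.05, power = 0.80,
             type = "paired")   # n is just over 394, so use 395 topics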
Method | Input required
One-way ANOVA | α (Type I error probability); β (Type II error probability); m (number of systems); minD (minimum detectable range: whenever the diff between the best and worst systems is this much or larger, we want to guarantee 100(1-β)% power); σ̂²: estimate of the within-system variance under the homoscedasticity assumption.
Example situation: you plan to compare m systems with one-way ANOVA at α = 5%. You plan to use nDCG as the primary evaluation measure, and want to guarantee 80% power whenever the diff between the best and the worst systems is >= minD. You know from pilot data that the within-system variance for nDCG is around σ̂². What is the required number of topics n?
2.4 One-way ANOVA (1)
(Figure: the population means of the m systems; D is the diff between the best and the worst, and we require minD <= D.)
2.4 One-way ANOVA (2)
Notations (some slightly different from Part 1):
F: a random variable that obeys an F distribution with (φA, φE) degrees of freedom, where φA = m - 1 and φE = m(n - 1);
F(φA, φE; α): the critical F value for significance criterion α, i.e. Pr[F >= F(φA, φE; α)] = α; in Excel, F(φA, φE; α) = F.INV.RT(α, φA, φE).
2.4 One-way ANOVA (3)
Due to the one-way ANOVA procedure, regardless of what distribution F0 obeys, the probability of rejecting H0 is:
Pr[F0 >= F(φA, φE; α)] ... (c)
If H0 is true, then F0 obeys F(φA, φE), and (c) is exactly α (that's how F(φA, φE; α) is defined).
Alternatively, if H1 is true, the distribution that F0 obeys is known as a noncentral F distribution with (φA, φE) degrees of freedom, and (c) is exactly the power, (1-β).
(Recall the Type I/Type II error table from 1.1 (3).)
2.4 One-way ANOVA (4)
(Figure: under H0, F0 obeys a (central) F distribution and (c) is exactly α; under H1, F0 obeys a noncentral F distribution and (c) is exactly 1-β.)
2.4 One-way ANOVA (5)
If H1 is true, the distribution that F0 obeys is known as a noncentral F distribution with (φA, φE) degrees of freedom. The noncentral F distribution in fact has another parameter, called the noncentrality parameter:
λ = n Σi ai² / σ²,
which measures the total system effects in variance units (σ²: the within-system variance under homoscedasticity).
2.4 One-way ANOVA (6)
If H1 is true, F0 obeys the noncentral F distribution, denoted F'(φA, φE, λ), and
Power = Pr[F0 >= F(φA, φE; α)] ... (c)
is exactly (1-β). Theorem C (Appendix) provides a computable approximation of this probability ... (c').
2.4 One-way ANOVA (7)
Let us ensure that when Δ ≠ 0 (i.e. H1 is true: at least one system is different), we guarantee 100(1-β)% power whenever the difference D between the best and worst systems is minD or larger (the minimum detectable range).
(Figure: the m population means, with minD <= D.)
2.4 One-way ANOVA (8)
Define minΔ from the worst case for a given range minD: the best and worst systems at ±minD/2 and all other systems in the middle, so that Σi ai² = minD²/2. Then
λ >= n minD² / (2σ²)
holds (Theorem D, Appendix). minD does not uniquely determine Δ, but minΔ can be used as the worst-case Δ.
2.4 One-way ANOVA (9)
The worst-case sample size is obtained by solving λ* = n minD²/(2σ²) for n, where λ* is the noncentrality parameter required for F'(φA, φE, λ) to achieve the target power. λ* can be approximated via λ for noncentral chi-square distributions [Nagata03], and linear approximations of λ* are available (Theorem E, Appendix) for (α, β) = (0.01, 0.10), (0.01, 0.20), (0.05, 0.10), and (0.05, 0.20): hence the caveat in 2.2 (4).
2.4 One-way ANOVA (10)
Given (α, β, minD, m, σ̂²), the minimal sample size n can thus be approximated as
n ≈ 2 λ* σ̂² / minD².
But this involved a lot of approximations, so we need to go back to (c') and check that n actually achieves 100(1-β)% power:
Power = Pr[F0 >= F(φA, φE; α)], with F0 obeying F'(φA, φE, λ) ... (c')
2.4 One-way ANOVA (11)
EXAMPLE: α = 0.05, β = 0.20, minD = 0.5, m = 3, σ̂² = 0.5² = 0.25.
→ So let n = 19 ⇒ λ = 19 × 0.5² / (2 × 0.25) = 9.5.
Hence from (c') we get power = 0.791 ... which doesn't quite achieve 80%!
2.4 One-way ANOVA (12)
EXAMPLE (continued): α = 0.05, β = 0.20, minD = 0.5, m = 3, σ̂² = 0.25.
→ Try n = 20 ⇒ λ = 20 × 0.5² / (2 × 0.25) = 10.
From (c') we get power = 0.813 ... so n = 20 is what we need!
2.4 One-way ANOVA (13)
Don’t worry,
http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx
will do this for you! Use the appropriate sheet for a given (α, β) and fill
out the orange cells.
n=20 is what you
want!
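The power check can also be scripted in R with the noncentral F distribution (a sketch; the function name is mine). With the worst-case configuration, i.e. two systems at the extremes of minD and the rest in the middle, it reproduces the numbers above:

anova_power <- function(n, m, minD, sigma2, alpha = 0.05) {
  mu <- c(-minD/2, rep(0, m - 2), minD/2)        # worst-case system means
  lambda <- n * sum((mu - mean(mu))^2) / sigma2  # noncentrality parameter
  phiA <- m - 1; phiE <- m * (n - 1)
  pf(qf(1 - alpha, phiA, phiE), phiA, phiE, ncp = lambda, lower.tail = FALSE)
}
anova_power(19, m = 3, minD = 0.5, sigma2 = 0.25)   # about 0.79: not enough
anova_power(20, m = 3, minD = 0.5, sigma2 = 0.25)   # about 0.81: n = 20 suffices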
2.5 Confidence Intervals (1)
Method | Input required
Confidence intervals | α (Type I error probability); δ (CI width upperbound: you want the CI for the diff between any system pair to be this much or smaller); σ̂d²: variance estimate for the score delta.
Example situation: you plan to compare a system pair by means of a 95% CI for the difference in nDCG. You want to guarantee that the CI width for any system pair is δ or smaller. You know from pilot data that the variance of the nDCG delta is around σ̂d². What is the required number of topics n?
2.5 Confidence Intervals (2) cf. 1.2 (8)
The 100(1-α)% CI for a difference in means (paired data) is given by [d̄ - MOE, d̄ + MOE], where MOE = t(n - 1; α) √(Vd/n).
Let's consider a sample size n which guarantees that the CI width (= 2 MOE) for any difference will be no larger than δ. But since MOE contains the random variable Vd, let's consider the above requirement using an expectation:
E[2 t(n - 1; α) √(Vd/n)] <= δ.
2.5 Confidence Intervals (3)
Now, it is known that E[s] = c(n) σd, where s = √Vd is the sample standard deviation, σd is the population standard deviation, and c(n) = √(2/(n - 1)) Γ(n/2) / Γ((n - 1)/2) (Γ: the gamma function; see Theorem A). So s is NOT an unbiased estimator of the population standard deviation (cf. 1.1 (11)). We therefore want to find the smallest n that satisfies:
t(n - 1; α) c(n) σ̂d / √n <= δ/2 ... (d)
We want to find the smallest n that satisfies (d). To obtain an initial n, instead of the t-based requirement, consider the case where the variance is known: 2 z(α/2) σ̂d / √n <= δ.
2.5 Confidence Intervals (4)
Thus, let n' = (2 z(α/2) σ̂d / δ)² and start with n = ⌈n'⌉. Increment n until (d) is satisfied.
EXAMPLE: α = 0.05, δ = 0.5, σ̂d² = 0.5 (from some pilot data)
→ n' = (2 × 1.960 × √0.5 / 0.5)² = 30.7
2.5 Confidence Intervals (5)
Comparing the LHS of (d) with δ/2 = 0.25:
Try n = 31 → LHS = 0.257 > 0.25
n = 32 → LHS = 0.253 > 0.25
n = 33 → LHS = 0.249 < 0.25
n=33 is what you
want!
2.5 Confidence Intervals (6)
Don’t worry,
http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx
will do this for you! Just fill out the orange cells.
n=33 is what you
want!
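The same search in R (a sketch; the expected margin of error uses the gamma-function identity from 2.5 (3)):

expected_moe <- function(n, sigma, alpha = 0.05) {
  e_s <- sigma * sqrt(2/(n - 1)) * gamma(n/2) / gamma((n - 1)/2)  # E[s]
  qt(1 - alpha/2, n - 1) * e_s / sqrt(n)
}
sigma <- sqrt(0.5); delta <- 0.5
n <- ceiling((2 * qnorm(1 - 0.05/2) * sigma / delta)^2)  # known-variance start: 31
while (expected_moe(n, sigma) > delta/2) n <- n + 1
n   # 33, matching the Excel tool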
2.6 Estimating the variance (1)
We need a variance estimate σ̂² for topic set size design based on one-way ANOVA, and a variance estimate for the score delta for that based on the paired t-test or CI.
From a pilot topic-by-run score matrix, obtain σ̂² = VE as a by-product of one-way ANOVA (use two-way ANOVA w/o replication for tighter estimates).
Then, if possible, pool multiple such estimates to enhance accuracy (pooled estimate).
• SE = DEVSQ(A1:A20) + DEVSQ(B1:B20) + DEVSQ(C1:C20) = 0.650834
• φE = m(n-1) = 3 × (20-1) = 57
• σ̂² = VE = SE / φE = 0.011
Sample topic-by-run matrix (20 topics; runs A, B, C):
A      B      C
0.4695 0.3732 0.3575
0.2813 0.3783 0.2435
0.3914 0.3868 0.3167
0.6884 0.5896 0.6024
0.6121 0.4725 0.4766
0.3266 0.233  0.2429
0.5605 0.4328 0.4066
0.5916 0.5073 0.4707
0.4385 0.3889 0.3384
0.5821 0.5551 0.4597
0.2871 0.3274 0.2769
0.5186 0.5066 0.4066
0.5188 0.5198 0.3859
0.5019 0.4981 0.4568
0.4702 0.3878 0.3437
0.329  0.4387 0.2649
0.4758 0.4946 0.4045
0.3028 0.34   0.3253
0.3752 0.4895 0.3205
0.2796 0.2335 0.224
2.6 Estimating the variance (2)
If there is no other topic-by-run matrix available, use this VE as σ̂² (cf. 1.6 (1), 1.6 (2)).
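For the hands-on, a minimal R sketch of the same computation (assumption: scores.txt is a hypothetical file holding the 20×3 matrix above, one row per topic; this mirrors the Excel DEVSQ recipe, not a distributed script):

x <- read.table("scores.txt")    # hypothetical file; columns = runs A, B, C
SE <- sum(apply(x, 2, function(s) sum((s - mean(s))^2)))   # DEVSQ per run, summed
m <- ncol(x); n <- nrow(x)
phiE <- m * (n - 1)              # 3 * (20-1) = 57
VE <- SE / phiE                  # variance estimate: about 0.011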
2.7 How much pilot data do we need? (1) [Sakai16EVIA]
[Diagram: pilot data = 100 topics × 44 runs from 16 teams, scored with the official NTCIR-12 STC qrels based on 16 teams (union of contributions from 16 teams) → variance estimates (best estimates available)]
Can we obtain a reliable σ̂² even from a few teams and a small number of topics?
2.7 How much pilot data do we need? (2) [Sakai16EVIA] Can we obtain a reliable σ̂² even from a few teams and a small number of topics?
[Diagram: leaving out k teams (k=1,...,15). For k=1: pilot data = 100 topics × runs from 15 teams → new variance estimates; try leave-1-out 10 times]
2.7 How much pilot data do we need? (3) [Sakai16EVIA] Can we obtain a reliable σ̂² even from a few teams and a small number of topics?
[Diagram: for k=15: pilot data = 100 topics × runs from 1 team → new variance estimates; try leave-15-out 10 times]
2.7 How much pilot data do we need? (4) [Sakai16EVIA] Can we obtain a reliable σ̂² even from a few teams and a small number of topics?
[Diagram: removing topics (100 → 90 → 75 → 50 → 25 → 10) from the 44 runs from 16 teams with the official NTCIR-12 STC qrels → variance estimates at each topic set size (best estimates available at 100 topics)]
2.7 How much pilot data do we need? (5) [Sakai16EVIA] Can we obtain a reliable σ̂² even from a few teams and a small number of topics?
[Diagram: removing topics (100 → 90 → 75 → 50 → 25 → 10) from the runs from 15 teams with the leave-k-out qrels (k=1,...,15) → variance estimates at each topic set size]
2.7 How much pilot data do we need? (6) [Sakai16EVIA]
[Result plots: starting with n’=100 topics vs. starting with n’=10 topics]
About 25 topics with a few teams seems sufficient, provided that a reasonably stable measure is used.
3.1 Power analysis (1) [Ellis10, pp.56-57]
1. Effect size describes the degree to which the phenomenon is
present in the population;
2. Sample size determines the amount of sampling error inherent in a
result;
3. Significance criterion α defines the risk of committing a Type I error;
4. Power (1-β) refers to the chosen or implied Type II error rate.
“The four power parameters are related, meaning that the value of any
parameter can be determined from the other three.”
We had a quick look at how the computations can be done in Part 2.
3.1 Power analysis (2) [Toyoda09]
If a paper reports
- The parametric significance test type (paired/unpaired t-test, one-way
ANOVA, two-way ANOVA w and w/o replication)
- either p-value or test statistic (t-value or F-value)
- actual sample size
we can easily compute the sample effect size.
Then, using the pwr library of R (https://cran.r-project.org/web/packages/pwr/pwr.pdf), we can compute
- the achieved power (1-β) of the experiment
- the future sample size for achieving a given (α, β).
cf. 1.7 (2)
3.1 Power analysis (3) [Sakai16SIGIR]
My R power analysis scripts, adapted from [Toyoda09] with Professor
Toyoda’s kind permission, are available at
https://waseda.box.com/SIGIR2016PACK
- Works with paired/unpaired t-test, one-way ANOVA, two-way ANOVA
w and w/o replication.
- SIGIR2016PACK also contains an Excel file from [Sakai16SIGIR]
(manual analysis of 1055 papers from SIGIR+TOIS 2006-2015).
3.2 With paired t-tests (1)
future.sample.pairedt arguments:
- t statistic (t)
- sample size (n)
- two-sided/one-sided (default: two-sided)
- α (default: 0.05)
- desired power (1-β) (default: 0.80)
OUTPUT:
- effect size
- achieved power
- future sample size n’
cf. 1.2 (15). Calls power.t.test.
3.2 With paired t-tests (2)
A paper from SIGIR 2012 reports
“t(27)=0.953 with (two-sided) paired t-test”
⇒ t = 0.953, n = 28 (φ = n-1 = 27)
Line 270 in the raw Excel file from [Sakai16SIGIR]
Very low power (15.1%). For this kind of effect, we need a much larger sample if we want 80% power.
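A minimal R sketch of this check (an assumption on my part about what future.sample.pairedt does internally, based on the slide’s description; power.t.test is in base R):

t0 <- 0.953; n <- 28              # values reported by the SIGIR 2012 paper
d <- t0 / sqrt(n)                 # sample effect size for paired data (cf. 1.2 (15))
power.t.test(n = n, delta = d, sd = 1, type = "paired")$power     # achieved power: about 0.15
power.t.test(power = 0.80, delta = d, sd = 1, type = "paired")$n  # future sample size: about 244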
3.3 With unpaired t-tests (1)
future.sample.unpairedt arguments:
- t statistic (t)
- sample sizes (n1, n2)
- two-sided/one-sided (default: two-sided)
- α (default: 0.05)
- desired power (1-β) (default: 0.80)
OUTPUT:
- effect size
- achieved power
- future sample size n’ per group
cf. 1.2 (15). Calls pwr.t2n.test.
3.3 With unpaired t-tests (2)
A paper from SIGIR 2007 reports:
“t(188403) = 2.81, n1 = 150610, n2 = 37795 with (two-sided) two-sample t-test”
φ = n1 + n2 -2 = 188403
Line 714 in the raw Excel file from [Sakai16SIGIR]
An appropriate level of power; n1 = n2 = 60066 would be the typical setting for 80% power.
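A minimal R sketch of the unpaired case (again an assumption about the script’s internals, based on the slide’s description), using the pwr library:

library(pwr)
t0 <- 2.81; n1 <- 150610; n2 <- 37795        # values reported by the SIGIR 2007 paper
d <- t0 * sqrt(1 / n1 + 1 / n2)              # sample effect size (Cohen's d, unpaired)
pwr.t2n.test(n1 = n1, n2 = n2, d = d)$power  # achieved power: about 0.80
pwr.t.test(d = d, power = 0.80)$n            # future sample size per group: about 60066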
3.4 With one-way ANOVA (1)
future.sample.1wayanova arguments:
- F statistic (F, i.e. FA)
- #groups (systems) compared (m)
- #observations (topics) per group (n)
- α (default: 0.05)
- desired power (1-β) (default: 0.80)
OUTPUT:
- effect size
- achieved power
- future sample size per group n’
φA = m-1, φE = m(n-1). Calls pwr.anova.test.
cf. 1.5 (9): the F statistic compares the between-system variation against the within-system variation.
3.4 With one-way ANOVA (2) φA = m-1, φE = m(n-1)
A paper from SIGIR 2008 reports:
“m=3 groups, n=12 subjects per group,
F(2, 33)=1.284 with (one-way) ANOVA”
(φA = m-1 = 2, φE = m(n-1) = 3*(12-1) = 33)
Line 616 in the raw Excel file from [Sakai16SIGIR]
Very low power (27.9%). For this kind of effect, we need more subjects if we want 80% power.
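A minimal R sketch of this check (an assumption about the script’s internals, based on the slide’s description):

library(pwr)
F0 <- 1.284; m <- 3; n <- 12                  # values reported by the SIGIR 2008 paper
f <- sqrt((m - 1) * F0 / (m * (n - 1)))       # sample effect size: f^2 = phiA*F0/phiE
pwr.anova.test(k = m, n = n, f = f)$power     # achieved power: about 0.28
pwr.anova.test(k = m, f = f, power = 0.80)$n  # future sample size per group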
3.5 With two-way ANOVA without replication (1)
future.sample.2waynorep arguments: same as future.sample.1wayanova.
OUTPUT:
- effect size
- achieved power
- future sample size per group n’
φA = m-1, φE = (m-1)(n-1) (a little different from 1.5 (18)).
Calls pwr.f2.test, which requires the squared effect size f_p^2 (p stands for partial: the effect of B has been removed).
3.5 With two-way ANOVA without replication (2)
A paper from SIGIR 2015 reports:
“m=4 groups,
F(3, 48)=0.63 with a repeated-measures ANOVA”
⇒ m = φA +1 = 4, φE = (m-1)(n-1) = 48, n = 17 per group
Line 22 in the raw Excel file from [Sakai16SIGIR]
(A repeated-measures ANOVA follows the same procedure as two-way ANOVA w/o replication: the second factor, e.g. topics, is regarded as repeated observations.)
Very low power (18.3%). For this kind of effect, we need more subjects if we want 80% power (similar results for other ANOVA results in the same paper).
3.6 With two-way ANOVA (1)
future.sample.2wayanova2 arguments:
- F statistics (FA, FB, FAB)
- #groups compared (m)
- #cells per group (n)
- #total observations (N=mnr)
- α (default: 0.05)
- desired power (1-β) (default: 0.80)
OUTPUT:
- effect size
- achieved power
- Total sample size N’
φA = m-1, φB = n-1, φAB = (m-1)(n-1), φE = mn(r-1).
Calls pwr.anova.test (version 2 of the script). The effect size for A is partial (p stands for partial: the effects of B and AB have been removed); similarly for B and AB.
3.6 With two-way ANOVA (2)
A paper from SIGIR 2014 reports:
“m=2, n=2, two-way ANOVA,
A: F(1, 960)=24.00,
B: F(1, 960)=24.89,
AxB: F(1, 960)=10.03”
φA = m-1 = 1, φB = n-1 = 1,
φAxB = (m-1)(n-1)=1,
φE = mn(r-1) = 960
⇒ r= 960/4+1 = 241,
N = mnr = 964
Line 121 in the raw Excel file from [Sakai16SIGIR]
Very high power: smaller sample sizes suffice.
(For each factor, the effective per-group sample size is φE/(φA+1) + 1 = 960/(1+1) + 1 = 481 [Cohen88, p.365].)
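A minimal R sketch for factor A alone (assumed internals, per the slide’s description; B and AB are handled analogously):

library(pwr)
FA <- 24.00; phiA <- 1; phiE <- 960     # values reported by the SIGIR 2014 paper
f <- sqrt(phiA * FA / phiE)             # partial effect size for A: about 0.158
n.eff <- phiE / (phiA + 1) + 1          # effective sample size per group = 481 [Cohen88, p.365]
pwr.anova.test(k = phiA + 1, n = n.eff, f = f)$power   # achieved power: near 1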
3.7 Overpowered and underpowered experiments in IR (1)
[Sakai16SIGIR]
SSR = sample size ratio = actual sample size / recommended future sample size
SSR is extremely large ⇔ extremely overpowered
SSR is extremely small ⇔ extremely underpowered
133 SIGIR+TOIS papers from the past decade (2006-2015) were
examined using the R power analysis tools.
(106 with t-tests; 27 with ANOVAs)
3.7 Overpowered and underpowered experiments in IR (2)
[Sakai16SIGIR]
A paper on personalisation from a search engine company (paired t-test)
t=16.00, n=5,352,460, effect size=0.007, achieved power=1
recommended future sample size=164,107
Effect size very small (though this may translate into substantial profit for a
company)
3.7 Overpowered and underpowered experiments in IR (3)
[Sakai16SIGIR]
User experiments, paired t-test
t=0.95, n=28,
effect size=0.180,
achieved power=0.152
future sample size=244
(similar results for other t-test
results in the same paper)
3.7 Overpowered and underpowered experiments in IR (4)
[Sakai16SIGIR]
3.7 Overpowered and underpowered experiments in IR (5)
[Sakai16SIGIR]
Experiments with a commercial social media
application data (one-way ANOVA)
F=243.42, m=3,
sample size per group=2551,
effect size fhat=2.252, achieved power=1,
recommended future sample size per group=52
3.7 Overpowered and underpowered experiments in IR (6)
[Sakai16SIGIR]
User experiments, two-way
ANOVA w/o replication
F=0.63, m=4,
sample size per group=17,
effect size fhat^2 = 0.039,
achieved power=0.183,
recommended future sample
size per group=75
(similar results for other
ANOVA results in the same
paper)
3.7 Overpowered and underpowered experiments in IR (7)
[Sakai16SIGIR]
Now you know
• How to determine the number of topics when building
a new test collection using a topic-by-run matrix from
pilot data and a simple Excel tool. And you kind of
know how it works!
• How to check whether a reported experiment is
overpowered/underpowered and decide on a better
sample size for a future experiment using simple R
scripts.
What now?
• Be aware of the limitations of classical significance testing. But while
we are still using classical tests, report effect sizes, p-values etc. for
collective wisdom [Sakai14SIGIRforum,Sakai16SIGIR]. And use topic
set size design and power analysis! Some guidance is better than
none!
• My personal wish is that the classical significance tests will soon be
replaced by Bayesian tests, so we can discuss P(H|D) instead of
P(D|H) for various H’s, not just “equality of means” etc.
Using score standardisation can give you smaller topic set sizes in topic set size design.
Please have a look at [Sakai16ICTIR].
Thank you for staying with me until the end!
Questions?
Acknowledgements
This tutorial is rather heavily based
on what I learnt from Professor
Yasushi Nagata’s and Professor
Hideki Toyoda’s books (written in
Japanese).
I thank Professor Nagata (Waseda
University) for his valuable advice
and Professor Toyoda (Waseda
University) for letting me modify
his R code and distribute it.
If there are any errors in this
tutorial, I am solely responsible.
References
[Carterette08] Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., and Allan, J.: Evaluation over
Thousands of Queries, ACM SIGIR 2008.
[Carterette12] Carterette, B.: Multiple Testing in Statistical Analysis of Systems-Based Information
Retrieval Experiments, ACM TOIS 30(1), 2012.
[Cohen88] Cohen. J.: Statistical Power Analysis for the Behavioral Sciences (Second Edition),
Psychology Press, 1988.
[Ellis10] Ellis, P. D.: The Essential Guide to Effect Sizes, Cambridge, 2010.
[Gilbert79] Gilbert, H. and Sparck Jones, K.S.: Statistical Bases of Relevance Assessment for the ‘IDEAL’ Information Retrieval Test Collection, Computer Laboratory, University of Cambridge, 1979.
[Johnson99] Johnson, D. H.: The Insignificance of Statistical Significance Testing, Journal of Wildlife
Management, 63(3), 1999.
[Nagata03] Nagata, Y.: How to Design the Sample Size (In Japanese), Asakura Shoten, 2003.
[Okubo12] Okubo, G. and Okada, K.: Psychological Statistics to Tell Your Story: Effect Size,
Confidence Interval, and Power (in Japanese), Keisho Shobo, 2012.
References
[Sakai14SIGIRforum] Sakai, T.: Statistical Reform in Information Retrieval?, SIGIR Forum, 48(1), 2014.
http://sigir.org/files/forum/2014J/2014J_sigirforum_Article_TetsuyaSakai.pdf
[Sakai16EVIA] Sakai, T. and Shang, L.: On Estimating Variances for Topic Set Size
Design, EVIA 2016.
[Sakai16ICTIR] Sakai, T.: A Simple and Effective Approach to Score Standardisation,
ACM ICTIR 2016.
[Sakai16IRJ] Sakai, T.: Topic Set Size Design, Information Retrieval Journal, 19(3),
2016. [OPEN ACCESS] http://link.springer.com/content/pdf/10.1007%2Fs10791-
015-9273-z.pdf
[Sakai16SIGIR] Sakai, T.: Statistical Significance, Power, and Sample Sizes: A
Systematic Review of SIGIR and TOIS, 2006-2015, ACM SIGIR 2016.
[Sakai16SIGIRshort] Sakai, T.: Two Sample T-tests for IR Evaluation: Student or
Welch?, ACM SIGIR 2016.
References
[SparckJones75] Sparck Jones, K.S. and Van Rijsbergen, C.J.: Report on the Need for and Provision of an ‘Ideal’ Information Retrieval Test Collection, Computer Laboratory, University of Cambridge, 1975.
[Toyoda09] Toyoda, H.: Introduction to Statistical Power Analysis: A Tutorial with R (in Japanese), Tokyo Tosho, 2009.
[Voorhees05] Voorhees, E. M. and Harman, D. K.: TREC: Experiment and
Evaluation in Information Retrieval, The MIT Press, 2005.
[Voorhees09] Voorhees, E. M.: Topic Set Size Redux, ACM SIGIR 2009.
[Webber08] Webber, W., Moffat, A., and Zobel, J.: Statistical Power in
Retrieval Experimentation, ACM CIKM 2008.
Appendix (everything adapted from [Nagata03])
• Definition: noncentral t distribution
• Definition: noncentral chi-square distribution
• Definition: noncentral F distribution
• Theorem A: normal approximation of a noncentral t distribution
• Theorem A’: corollary of A
• Theorem A’’: corollary of A (approximating a z value using a t value)
• Theorem B: approximating a t value using a z value
• Theorem C: normal approximation of a noncentral F distribution
• Theorem D: inequality for system effects
• Theorem E: approximating a noncentral F distribution with a noncentral chi-square distribution
Definition: noncentral t distribution
Let z ~ N(λ, 1^2) and W ~ χ^2(φ), where the two random variables are independent.
The probability distribution of the following random variable is called a noncentral t distribution with φ degrees of freedom and a noncentrality parameter λ, denoted by t’(φ, λ):
t’ = z / sqrt(W / φ).
When λ=0, it is reduced to the central t distribution with φ degrees of freedom, t(φ).
Definition: noncentral chi-square distribution
Let x1, ..., xk ~ N(μ1, 1^2), ..., N(μk, 1^2), where the random variables are independent.
The probability distribution of the following random variable is called a noncentral chi-square distribution with φ = k degrees of freedom and a noncentrality parameter λ, denoted by χ’^2(φ, λ):
χ’^2 = Σi xi^2, where λ = Σi μi^2.
When λ=0, it is reduced to the central chi-square distribution with φ degrees of freedom, χ^2(φ).
Definition: noncentral F distribution
Let W1 ~ χ’^2(φ1, λ) (noncentral chi-square distribution) and W2 ~ χ^2(φ2) (central chi-square distribution), where the two random variables are independent.
The probability distribution of the following random variable is called a noncentral F distribution with (φ1, φ2) degrees of freedom and a noncentrality parameter λ, denoted by F’(φ1, φ2; λ):
F’ = (W1 / φ1) / (W2 / φ2).
When λ=0, it is reduced to the central F distribution with (φ1, φ2) degrees of freedom, F(φ1, φ2).
Theorem A: normal approximation of a noncentral t distribution
Let t’ obey a noncentral t distribution t’(φ, λ). Then the distribution of t’ can be approximated by a normal distribution whose mean and variance are expressed in terms of λ and the constant
c(φ) = sqrt(2/φ) Γ((φ+1)/2) / Γ(φ/2) (Γ: gamma function).
Brief derivation given in [Sakai16IRJ Appendix 1].
Theorem A’: corollary of A
Let t’ obey t’(φ, λ). The corollary follows by substituting the relevant quantities into Theorem A.
Brief derivation given in [Sakai16IRJ Appendix 1].
Theorem A’’: corollary of A (approximating a one-sided z value using a two-sided t value)
PROOF: In Theorem A, when λ=0, t = t’ obeys a (central) t distribution; substitute accordingly.
[Chart: the approximation plotted against φ = 1, ..., 96 together with the exact one-sided z value, for 2P = α = 0.05; verified with Excel]
Theorem B: approximating a t value using a z value
A two-sided t value is approximated using a one-sided z value. This is a special case of Johnson and Welch’s theorem on the noncentral t statistic [Nagata03].
[Chart: the approximation plotted against φ = 1, ..., 96 together with the exact two-sided t value, for P = α = 0.05; verified with Excel]
Theorem C: normal approximation of a noncentral F distribution
Let F’ obey a noncentral F distribution F’(φ1, φ2; λ). Then the distribution of F’ can be approximated by a normal distribution.
Brief derivation given in [Sakai16IRJ Appendix 2].
Theorem D: inequality for system effects
For system effects a1, ..., am with Σi ai = 0, let D = maxi ai - mini ai.
Then Σi ai^2 ≥ D^2 / 2
(since Σi ai^2 ≥ amax^2 + amin^2 ≥ (amax - amin)^2 / 2).
The equality holds when one ai = D/2, another ai = -D/2, and ai = 0 for all others.
Proof in [Sakai16IRJ footnote 19].
Theorem E: approximating a noncentral F distribution with a noncentral chi-square distribution
Let F’ obey F’(φA, φE; λ). Letting φE ≒ ∞, the distribution of φA F’ can be approximated by the noncentral chi-square distribution χ’^2(φA, λ).
Accordingly, the F value for probability P is approximated by the corresponding chi-square value for probability P divided by φA.

What's hot

Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Sherri Gunder
 
Topic Set Size Design with the Evaluation Measures for Short Text Conversation
Topic Set Size Design with the Evaluation Measures for Short Text ConversationTopic Set Size Design with the Evaluation Measures for Short Text Conversation
Topic Set Size Design with the Evaluation Measures for Short Text Conversation
Tetsuya Sakai
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation source
butest
 

What's hot (20)

Admission in India
Admission in IndiaAdmission in India
Admission in India
 
Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1) Lecture 8: Machine Learning in Practice (1)
Lecture 8: Machine Learning in Practice (1)
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and Algorithms
 
Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)Lecture 2: Preliminaries (Understanding and Preprocessing data)
Lecture 2: Preliminaries (Understanding and Preprocessing data)
 
Two-sample Hypothesis Tests
Two-sample Hypothesis Tests Two-sample Hypothesis Tests
Two-sample Hypothesis Tests
 
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
Ch17 lab r_verdu103: Entry level statistics exercise (descriptives)
 
Big Data Analysis
Big Data AnalysisBig Data Analysis
Big Data Analysis
 
Nber slides11 lecture2
Nber slides11 lecture2Nber slides11 lecture2
Nber slides11 lecture2
 
evaluation and credibility-Part 2
evaluation and credibility-Part 2evaluation and credibility-Part 2
evaluation and credibility-Part 2
 
Topic Set Size Design with the Evaluation Measures for Short Text Conversation
Topic Set Size Design with the Evaluation Measures for Short Text ConversationTopic Set Size Design with the Evaluation Measures for Short Text Conversation
Topic Set Size Design with the Evaluation Measures for Short Text Conversation
 
Resampling methods
Resampling methodsResampling methods
Resampling methods
 
Intro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data ScientistsIntro to Machine Learning for non-Data Scientists
Intro to Machine Learning for non-Data Scientists
 
Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)Lecture 3b: Decision Trees (1 part)
Lecture 3b: Decision Trees (1 part)
 
Download presentation source
Download presentation sourceDownload presentation source
Download presentation source
 
4 pye unidad1 3 repaso 2 semestre m
4 pye unidad1  3  repaso 2 semestre m4 pye unidad1  3  repaso 2 semestre m
4 pye unidad1 3 repaso 2 semestre m
 
Introduction to the t Statistic
Introduction to the t StatisticIntroduction to the t Statistic
Introduction to the t Statistic
 
The t Test for Two Independent Samples
The t Test for Two Independent SamplesThe t Test for Two Independent Samples
The t Test for Two Independent Samples
 
Statistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling DistributionStatistical Inference Part II: Types of Sampling Distribution
Statistical Inference Part II: Types of Sampling Distribution
 
Machine Learning and Data Mining
Machine Learning and Data MiningMachine Learning and Data Mining
Machine Learning and Data Mining
 
Cross-validation aggregation for forecasting
Cross-validation aggregation for forecastingCross-validation aggregation for forecasting
Cross-validation aggregation for forecasting
 

Similar to ICTIR2016tutorial

Week 5 Lecture 14 The Chi Square TestQuite often, patterns of .docx
Week 5 Lecture 14 The Chi Square TestQuite often, patterns of .docxWeek 5 Lecture 14 The Chi Square TestQuite often, patterns of .docx
Week 5 Lecture 14 The Chi Square TestQuite often, patterns of .docx
cockekeshia
 
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docx
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docxWeek 5 Lecture 14 The Chi Square Test Quite often, pat.docx
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docx
cockekeshia
 
Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01
Henock Beyene
 
Two Sample Tests
Two Sample TestsTwo Sample Tests
Two Sample Tests
sanketd1983
 

Similar to ICTIR2016tutorial (20)

sigir2018tutorial
sigir2018tutorialsigir2018tutorial
sigir2018tutorial
 
Week 5 Lecture 14 The Chi Square TestQuite often, patterns of .docx
Week 5 Lecture 14 The Chi Square TestQuite often, patterns of .docxWeek 5 Lecture 14 The Chi Square TestQuite often, patterns of .docx
Week 5 Lecture 14 The Chi Square TestQuite often, patterns of .docx
 
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docx
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docxWeek 5 Lecture 14 The Chi Square Test Quite often, pat.docx
Week 5 Lecture 14 The Chi Square Test Quite often, pat.docx
 
Analyzing experimental research data
Analyzing experimental research dataAnalyzing experimental research data
Analyzing experimental research data
 
Aron chpt 8 ed
Aron chpt 8 edAron chpt 8 ed
Aron chpt 8 ed
 
Aron chpt 8 ed
Aron chpt 8 edAron chpt 8 ed
Aron chpt 8 ed
 
SPSS statistics - get help using SPSS
SPSS statistics - get help using SPSSSPSS statistics - get help using SPSS
SPSS statistics - get help using SPSS
 
Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01Spsshelp 100608163328-phpapp01
Spsshelp 100608163328-phpapp01
 
Analyzing experimental research data
Analyzing experimental research dataAnalyzing experimental research data
Analyzing experimental research data
 
Two Sample Tests
Two Sample TestsTwo Sample Tests
Two Sample Tests
 
An Introduction to SPSS
An Introduction to SPSSAn Introduction to SPSS
An Introduction to SPSS
 
Introduction to simulating data to improve your research
Introduction to simulating data to improve your researchIntroduction to simulating data to improve your research
Introduction to simulating data to improve your research
 
ders 5 hypothesis testing.pptx
ders 5 hypothesis testing.pptxders 5 hypothesis testing.pptx
ders 5 hypothesis testing.pptx
 
Introduction to Analysis of Variance
Introduction to Analysis of VarianceIntroduction to Analysis of Variance
Introduction to Analysis of Variance
 
Day 12 t test for dependent samples and single samples pdf
Day 12 t test for dependent samples and single samples pdfDay 12 t test for dependent samples and single samples pdf
Day 12 t test for dependent samples and single samples pdf
 
Introduction to spss
Introduction to spssIntroduction to spss
Introduction to spss
 
Basic of Statistical Inference Part-V: Types of Hypothesis Test (Parametric)
Basic of Statistical Inference Part-V: Types of Hypothesis Test (Parametric) Basic of Statistical Inference Part-V: Types of Hypothesis Test (Parametric)
Basic of Statistical Inference Part-V: Types of Hypothesis Test (Parametric)
 
Day 3 SPSS
Day 3 SPSSDay 3 SPSS
Day 3 SPSS
 
Advanced statistics Lesson 1
Advanced statistics Lesson 1Advanced statistics Lesson 1
Advanced statistics Lesson 1
 
T test
T test T test
T test
 

More from Tetsuya Sakai

More from Tetsuya Sakai (20)

NTCIR15WWW3overview
NTCIR15WWW3overviewNTCIR15WWW3overview
NTCIR15WWW3overview
 
sigir2020
sigir2020sigir2020
sigir2020
 
ipsjifat201909
ipsjifat201909ipsjifat201909
ipsjifat201909
 
sigir2019
sigir2019sigir2019
sigir2019
 
assia2019
assia2019assia2019
assia2019
 
ntcir14centre-overview
ntcir14centre-overviewntcir14centre-overview
ntcir14centre-overview
 
evia2019
evia2019evia2019
evia2019
 
Evia2017unanimity
Evia2017unanimityEvia2017unanimity
Evia2017unanimity
 
Evia2017assessors
Evia2017assessorsEvia2017assessors
Evia2017assessors
 
Evia2017dialogues
Evia2017dialoguesEvia2017dialogues
Evia2017dialogues
 
Evia2017wcw
Evia2017wcwEvia2017wcw
Evia2017wcw
 
sigir2017bayesian
sigir2017bayesiansigir2017bayesian
sigir2017bayesian
 
NL20161222invited
NL20161222invitedNL20161222invited
NL20161222invited
 
AIRS2016
AIRS2016AIRS2016
AIRS2016
 
Nl201609
Nl201609Nl201609
Nl201609
 
SIGIR2016
SIGIR2016SIGIR2016
SIGIR2016
 
On Estimating Variances for Topic Set Size Design
On Estimating Variances for Topic Set Size DesignOn Estimating Variances for Topic Set Size Design
On Estimating Variances for Topic Set Size Design
 
assia2015sakai
assia2015sakaiassia2015sakai
assia2015sakai
 
Short Text Conversation@NTCIR-12 Kickoff
Short Text Conversation@NTCIR-12 KickoffShort Text Conversation@NTCIR-12 Kickoff
Short Text Conversation@NTCIR-12 Kickoff
 
NTCIR-12 task proposal: Short Text Conversation (STC)
NTCIR-12 task proposal: Short Text Conversation (STC)NTCIR-12 task proposal: Short Text Conversation (STC)
NTCIR-12 task proposal: Short Text Conversation (STC)
 

Recently uploaded

Recently uploaded (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

ICTIR2016tutorial

  • 1. Topic Set Size Design and Power Analysis in Practice Tetsuya Sakai @tetsuyasakai tetsuyasakai@acm.org Waseda University ICTIR 2016 Tutorial: September 13, 2016, Delaware.
  • 2. This half-day tutorial will teach you • How to determine the number of topics when building a new test collection (prerequisite: you already have some pilot data from which you can construct a topic- by-run score matrix). You will kind of know how it works. • How to check whether a reported experiment is overpowered/underpowered and decide on a better sample size for a future experiment.
  • 3. Before attending the tutorial, please download on your laptop - Sample topic-by-run matrix: https://waseda.box.com/20topics3runs - Excel topic set size design tools: http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeTTEST.xlsx http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx [OPTIONAL] - (Install R first and then) R scripts for power analysis: https://waseda.box.com/SIGIR2016PACK
  • 4. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
  • 5. 1.1 Preliminaries (1) • In IR experiments, we often compare sample means to guess if the population means are different. • We often employ parametric tests (assume specific population distributions with parameters) - paired and unpaired t-tests (comparing m=2 means) - ANOVA (comparing m (>2) means) one-way, two-way, two-way without replication Are the two population means equal? Are the m population means equal? scores EXAMPLE n topics m systems Sample mean for a system
  • 6. 1.1 Preliminaries (2) • H0: tentative assumption that all population means are equal • test statistic: what you compute from observed data – under H0, this should obey a known distribution (e.g. t-distribution) • p-value: probability of observing what you have observed (or something more extreme) assuming H0 is true Null hypothesis test statistic t0
  • 7. 1.1 Preliminaries (3) Reject H0 if p-value <= α test statistic t0 t(φ; α) Accept H0 Reject H0 H0 is true systems are equivalent Correct conclusion (1-α) Type I error α H0 is false systems are different Type II error β Correct conclusion (1-β) α/2 α/2 Statistical power: ability to detect real differences
  • 8. 1.1 Preliminaries (4) Accept H0 Reject H0 H0 is true systems are equivalent Correct conclusion (1-α) Type I error α H0 is false systems are different Type II error β Correct conclusion (1-β) Statistical power: ability to detect real differencesCohen’s five-eighty convention: α=5%, 1-β=80% (β=20%) Type I errors 4 times as serious as Type II errors The ratio may be set depending on specific situations
  • 9. For a continuous random variable x and its probability density function f(x), the expectation of a function g(x) (including g(x)=x) is given by: How likely x will take a particular value Population mean Population variance Population standard deviation The central position of x as it is observed an infinite number of times How x varies from the population mean 1.1 Preliminaries (5)
  • 10. A normal distribution with population parameters is denoted by . Properties of a normal distribution : Probability density function of a normal distribution μ = 100 σ = 20 1.1 Preliminaries (6)
  • 11. If x obeys , then obeys . Standardisation Population mean: 0 Population standard deviation: 1 Standard normal distribution 1.1 Preliminaries (7)
  • 12. 1.1 Preliminaries (8) For random variables x, y, a function that satisfies the following is called a joint probability density function: Whereas, marginal probability density functions are defined as: If the following holds for any (x,y), x and y are said to be independent.
  • 13. If are independent and obey then obeys Reproductive property: Adding normally distributed variables still gives you a normal distribution Population mean Population variance 1.1 Preliminaries (9)
  • 14. If are independent and obey then obeys obeys and therefore obeys . Corollary: If we let ai = 1/n, μi = μ, σi = σ ... 1.1 Preliminaries (10) Sample mean
  • 15. Sample mean Sum of squares Sample variance Sample standard deviation If are independent and obey , then holds. Sample variance V is an unbiased estimator of the population variance 1.1 Preliminaries (11) cf. 2.5 (3): s is NOT an unbiased estimator of the population standard deviation
  • 16. If are independent and then: • Law of large numbers As n approaches infinity, approaches . • Central Limit Theorem Provided that n is large, the distribution of can be approximated by . It’s a good thing to observe lots of data to estimate the population mean. If you have lots of observations, then the sample mean can be regarded as normally distributed even if we don’t know much about individual random variables {xi} 1.1 Preliminaries (12) Not necessarily normal
  • 17. If are independent and obey then the probability distribution that the following random variable obeys is called a chi-square distribution with φ = k degrees of freedom: The pdf of the above distribution is given by: Gamma function Denoted by 1.1 Preliminaries (13)
  • 18. If obeys then . If are independent and obey then: (a) obeys . (b) and are independent. (c) obeys . 1.1 Preliminaries (14) Corollary from previous slide since (xi – μ)/σ obeys [Nagata03] p.57 [Nagata03] p.58
  • 19. 1.1 Preliminaries (15) If and they are independent, the probability distribution that the following random variable obeys is called a t distribution with φ degrees of freedom, denoted by t(φ). IMPORTANT PROPERTY: If and are independent, then: obeys Sample mean and sample variance as defined in 1.1 (11)
  • 20. 1.1 Preliminaries (16) If and they are independent, the probability distribution that the following random variable obeys is called an F distribution with degrees of freedom, denoted by . IMPORTANT PROPERTY: If and they are all independent, then:
  • 21. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
  • 22. 1.2 How the t-test works (1) paired t-test What does this sample tell us about the populations?
  • 23. Comparing Systems X and Y with n topics with (say) Mean nDCG over n topics ASSUMPTIONS: are independent and obey are independent and obey . Under these assumptions: 1.2 How the t-test works (2) paired t-test In Slide 1.1 (9), let a1 = 1, a2 = -1.
  • 24. ⇒ ⇒ ⇒ is an unbiased estimator of : t distribution with n-1 degrees of freedom, which is basically like the standard normal distribution (See also 1.1 (15)) 1.2 How the t-test works (3) paired t-test 1.1 (10) 1.1 (7) We don’t know the population variance so use a sample variance instead. 1.1 (11)
  • 25. Since under our assumptions, if we further assume , then . Hypotheses: Same population means: X and Y are equally effective Two-sided test 1.2 How the t-test works (4) paired t-test 0 test statistic t0
  • 26. Hypotheses: 1.2 How the t-test works (5) paired t-test test statistic t0critical t value t(n-1; α) α/2 α/2 Under , . 0 So if , something highly unlikely has happened. We assumed but that must have been wrong! Reject ! is probably true, with 100(1-α)% confidence. α: significance criterion
  • 27. 1.2 How the t-test works (6) paired t-test test statistic t0critical t value t(n-1; α) α/2 α/2 0 Using Excel to do a t-test: - Reject if = TINV(α, n-1) = T.INV.2T(α, n-1). - P-value = TDIST(|t0|, n-1, 2) = T.DIST.2T(|t0|, n-1). Blue areas under the curve: probability of observing the data at hand or something more extreme, if H0 is true
  • 28. 1.2 How the t-test works (7) confidence intervals From 1.2 (3), ⇒ critical t value t(n-1; α) α/2 α/2 0 t obeys t(n-1)
  • 29. 1.2 How the t-test works (8) confidence intervals From 1.2 (3), ⇒ ⇒ where . So 95% CI for the difference in means is given by: Margin of Eerror Different samples yield different CIs. 95% of the CIs will capture the true difference in means.
  • 30. 1.2 How the t-test works (9) unpaired t-test
  • 31. Comparing Systems X and Y, based on a sample of size n1 for X and another sample of size n2 for Y. ASSUMPTIONS: the above observations are all independent and and furthermore Homoscedasticity (equal variance) but the t-test is quite robust to the assumption violation [Sakai16SIGIRshort] 1.2 How the t-test works (10) unpaired t-test cf. 1.2 (15)
  • 32. Under the assumptions, it is known that where Pooled variance 1.2 How the t-test works (11) unpaired t-test
  • 33. Hypotheses: Since under our assumptions, if we further assume , then 1.2 How the t-test works (12) unpaired t-test Same population means: X and Y are equally effective Two-sided test 0 test statistic t0
  • 34. Hypotheses: 1.2 How the t-test works (13) unpaired t-test Under , . test statistic t0critical t value t(n-1; α) α/2 α/2 0 α: significance level So if , something highly unlikely has happened. We assumed but that must have been wrong! Reject ! is probably true, with 100(1-α)% confidence.
  • 35. test statistic t0critical t value t(n-1; α) α/2 α/2 0 Using Excel to do a t-test: - Reject if = TINV(α, φ) = T.INV.2T(α, φ). - P-value = TDIST(|t0|, φ, 2) = T.DIST.2T(|t0|, φ). Blue areas under the curve: probability of observing the data at hand or something more extreme, if H0 is true 1.2 How the t-test works (14) unpaired t-test
  • 36. 1.2 How the t-test works (15) unpaired t-test • Unpaired (i.e., two-sample) t-tests: - Student’s t-test: equal variance assumption - Welch’s t-test: no equal variance assumption, but involves approximations – use this if (1) two sample sizes are very different AND (2) two sample variances are very different [Sakai16SIGIRshort]. The Welch t-statistic and the degrees of freedom:
  • 37. Difference measured in standard deviation units Paired data [Sakai14SIGIRForm] : Unpaired data: WARNING: Different books define “Cohen’s d” differently. [Okubo12] 1.2 How the t-test works (15) effect sizes effect size Pooled variance effect size cf. Hedges’ g, Glass’s Δ
  • 38. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
  • 39. 1.3 T-test with Excel and R (hands-on) (1) - Sample topic-by-run matrix: https://waseda.box.com/20topics3runs The easiest way to obtain the p-values: Paired t-test: = TTEST(A1:A20,B1:B20,2,1) = 0.2058 Unpaired, Student’s t-test: = TTEST(A1:A20,B1:B20,2,2) = 0.5300 Unpaired, Welch’s t-test: = TTEST(A1:A20,B1:B20,2,3) = 0.5302 0.4695 0.3732 0.3575 0.2813 0.3783 0.2435 0.3914 0.3868 0.3167 0.6884 0.5896 0.6024 0.6121 0.4725 0.4766 0.3266 0.233 0.2429 0.5605 0.4328 0.4066 0.5916 0.5073 0.4707 0.4385 0.3889 0.3384 0.5821 0.5551 0.4597 0.2871 0.3274 0.2769 0.5186 0.5066 0.4066 0.5188 0.5198 0.3859 0.5019 0.4981 0.4568 0.4702 0.3878 0.3437 0.329 0.4387 0.2649 0.4758 0.4946 0.4045 0.3028 0.34 0.3253 0.3752 0.4895 0.3205 0.2796 0.2335 0.224 Runs A, B, C 20 topics two-sided But this makes you treat the t-test as a black box. To obtain the test statistic, degrees of freedom etc., let’s do it “by hand”...
  • 40. 1.3 T-test with Excel and R (hands-on) (2) A B C D =A1-B1 = AVERAGE(D1:D20) = 0.022375 = DEVSQ(D1:D20)/(20-1) = 0.005834 Paired t-test = 1.3101 P-value = T.DIST.2T(|t0|, 19) = 0.2058. 0.4695 0.3732 0.3575 0.0963 0.2813 0.3783 0.2435 -0.097 0.3914 0.3868 0.3167 0.0046 0.6884 0.5896 0.6024 0.0988 0.6121 0.4725 0.4766 0.1396 0.3266 0.233 0.2429 0.0936 0.5605 0.4328 0.4066 0.1277 0.5916 0.5073 0.4707 0.0843 0.4385 0.3889 0.3384 0.0496 0.5821 0.5551 0.4597 0.027 0.2871 0.3274 0.2769 -0.0403 0.5186 0.5066 0.4066 0.012 0.5188 0.5198 0.3859 -0.001 0.5019 0.4981 0.4568 0.0038 0.4702 0.3878 0.3437 0.0824 0.329 0.4387 0.2649 -0.1097 0.4758 0.4946 0.4045 -0.0188 0.3028 0.34 0.3253 -0.0372 0.3752 0.4895 0.3205 -0.1143 0.2796 0.2335 0.224 0.0461
  • 41. 1.3 T-test with Excel and R (hands-on) (3) A B C =A1-B1 = AVERAGE(A1:A20)-AVERAGE(B1:B20) = 0.022375 = DEVSQ(A1:A20) = 0.291139 Unpaired, Student’s t-test = 0.012463 P-value = T.DIST.2T(|t0|, 38) = 0.5300. 0.4695 0.3732 0.3575 0.2813 0.3783 0.2435 0.3914 0.3868 0.3167 0.6884 0.5896 0.6024 0.6121 0.4725 0.4766 0.3266 0.233 0.2429 0.5605 0.4328 0.4066 0.5916 0.5073 0.4707 0.4385 0.3889 0.3384 0.5821 0.5551 0.4597 0.2871 0.3274 0.2769 0.5186 0.5066 0.4066 0.5188 0.5198 0.3859 0.5019 0.4981 0.4568 0.4702 0.3878 0.3437 0.329 0.4387 0.2649 0.4758 0.4946 0.4045 0.3028 0.34 0.3253 0.3752 0.4895 0.3205 0.2796 0.2335 0.224 = DEVSQ(B1:B20) = 0.182445 = 0.6338
  • 42. 1.3 T-test with Excel and R (hands-on) (4) A B C =A1-B1 Unpaired, Welch’s t-test = 0.015323 P-value = T.DIST.2T(|t0|, φ*) = 0.5302. 0.4695 0.3732 0.3575 0.2813 0.3783 0.2435 0.3914 0.3868 0.3167 0.6884 0.5896 0.6024 0.6121 0.4725 0.4766 0.3266 0.233 0.2429 0.5605 0.4328 0.4066 0.5916 0.5073 0.4707 0.4385 0.3889 0.3384 0.5821 0.5551 0.4597 0.2871 0.3274 0.2769 0.5186 0.5066 0.4066 0.5188 0.5198 0.3859 0.5019 0.4981 0.4568 0.4702 0.3878 0.3437 0.329 0.4387 0.2649 0.4758 0.4946 0.4045 0.3028 0.34 0.3253 0.3752 0.4895 0.3205 0.2796 0.2335 0.224 = 0.6338 = 0.009602 = 36.0985
  • 43. 1.3 T-test with Excel and R (hands-on) (5)
  • 44. 1.3 T-test with Excel and R (hands-on) (6) Compare with the Excel results.
  • 45. 1.3 T-test with Excel and R (hands-on) (7) Also try: R uses Welch as the default! Compare with the Excel results.
  • 46. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 47. 1.4 How ANOVA works (1) ANOVA can ask: "Are ALL systems equally effective?" when there are m (>2) systems. In this tutorial, let's first consider the following two simplest types of ANOVA.
One-way ANOVA with equal number of replicates (generalises the unpaired t-test):
  System 1: x11, x12, ..., x1n
  System 2: x21, x22, ..., x2n
  System 3: x31, x32, ..., x3n
Two-way ANOVA without replication (generalises the paired t-test):
  Topic:    1   2  ...  n
  System 1: x1, x2, ..., xn
  System 2: y1, y2, ..., yn
  System 3: z1, z2, ..., zn
(If xi corresponds to yi and zi, this should be preferred over one-way ANOVA.)
• 48. 1.4 How ANOVA works (2) one-way ANOVA. Let xij (i=1,...,m; j=1,...,n) be the score of the i-th system for topic j. ASSUMPTIONS: the xij are independent and xij obeys N(μi, σ^2); or, equivalently, xij = μi + eij and eij obeys N(0, σ^2). The common σ^2 is the homoscedasticity (equal variance) assumption. Let μ = (1/m) SUM_i μi (population grand mean) and ai = μi - μ (i-th system effect). Then it is easy to show that SUM_i ai = 0.
• 49. 1.4 How ANOVA works (3) one-way ANOVA. Hypotheses: H0: a1 = a2 = ... = am = 0 (ALL population means are equal); H1: At least one of the system effects is non-zero. Let xbar = (1/mn) SUM_i SUM_j xij (sample grand mean) and xbar_i = (1/n) SUM_j xij (sample system mean). Note that xij - xbar = (xbar_i - xbar) + (xij - xbar_i): (diff between score and grand mean) = (diff between system mean and grand mean) + (diff between score and system mean).
• 50. 1.4 How ANOVA works (4) one-way ANOVA. Squaring and summing the above decomposition, ST = SA + SE holds, where ST = SUM_i SUM_j (xij - xbar)^2 (total variation), SA = n SUM_i (xbar_i - xbar)^2 (between-system variation), and SE = SUM_i SUM_j (xij - xbar_i)^2 (within-system variation).
• 51. 1.4 How ANOVA works (5) one-way ANOVA. ST = SA + SE. Degrees of freedom: how accurate is the sum of squares? Under the i.i.d. and normality assumptions on the eij, (a) SE/σ^2 obeys a chi-square distribution with φE = m(n-1) degrees of freedom [1.1 (14)(c)]; (b) under H0 (all ai = 0), SA/σ^2 obeys a chi-square distribution with φA = m-1 degrees of freedom [1.1 (10)]. Note that φT = mn-1 = φA + φE.
• 52. 1.4 How ANOVA works (6) one-way ANOVA. ST = SA + SE; φT = φA + φE, with φA = m-1 and φE = m(n-1). Under H0, the test statistic F0 = VA/VE = (SA/φA) / (SE/φE) obeys an F distribution F(φA, φE) [1.1 (16)]. Intuition: is the between-system variation large compared to the within-system variation?
• 53. 1.4 How ANOVA works (7) one-way ANOVA. Hypotheses: H0: a1 = ... = am = 0; H1: At least one of the system effects is non-zero. Test statistic: F0 = VA/VE (SA and SE from 1.4 (4)). Reject H0 if F0 >= F(φA,φE;α), the critical F value that cuts off probability α in the right tail, where φA = m-1 and φE = m(n-1). [The slide plots F(φA,φE) densities for (m, n) = (3, 10), (5, 10), (20, 10).]
• 54. 1.4 How ANOVA works (8) one-way ANOVA.
Source          Sum of squares  Degrees of freedom  Mean squares     F0
Between system  SA              φA = m-1            VA = SA/φA       VA/VE = m(n-1)SA / ((m-1)SE)
Within system   SE              φE = m(n-1)         VE = SE/φE
Total           ST              φT = mn-1           -
- Reject H0 if F0 >= F(φA,φE;α) = F.INV.RT(α,φA,φE)
- P-value = F.DIST.RT(F0,φA,φE)
If n varies across the m systems, let φE = (total #observations) - m.
• 55. 1.4 How ANOVA works (9) one-way ANOVA. Effect sizes for one-way ANOVA [Okubo12]: how much of the total variance can be accounted for by the between-system variance? The simplest estimator of the population effect size from a sample is ηhat^2 = SA/ST; a more accurate estimator, which corrects for the degrees of freedom, is given in [Okubo12, Sakai14SIGIRforum].
• 56. 1.4 How ANOVA works (10) two-way ANOVA w/o replication. The data form an m-by-n system-by-topic table of scores xij. ASSUMPTIONS: the errors eij are independent and N(0, σ^2) (homoscedasticity); system and topic effects are additive: xij = μ + ai + bj + eij, with SUM_i ai = 0 and SUM_j bj = 0. Sample grand mean: xbar = (1/mn) SUM_i SUM_j xij. Sample system mean: xbar_i. = (1/n) SUM_j xij. Sample topic mean: xbar_.j = (1/m) SUM_i xij.
• 57. 1.4 How ANOVA works (11) two-way ANOVA w/o replication. Hypothesis for the system effects: H0: a1 = ... = am = 0; H1: at least one differs. Hypothesis for the topic effects: H0: b1 = ... = bn = 0; H1: at least one differs. Note that xij - xbar = (xbar_i. - xbar) + (xbar_.j - xbar) + (xij - xbar_i. - xbar_.j + xbar): (diff between score and grand mean) = (diff between system mean and grand mean) + (diff between topic mean and grand mean) + (the rest). Compare with the decomposition for one-way ANOVA in 1.4 (3).
• 58. 1.4 How ANOVA works (12) two-way ANOVA w/o replication. Similarly, ST = SA + SB + SE holds, where ST = SUM_i SUM_j (xij - xbar)^2 (total variation), SA = n SUM_i (xbar_i. - xbar)^2 (between-system variation), SB = m SUM_j (xbar_.j - xbar)^2 (between-topic variation), and SE = SUM_i SUM_j (xij - xbar_i. - xbar_.j + xbar)^2 (residual). That is, the within-system variation for one-way ANOVA in 1.4 (4) has been split into SB and SE.
• 59. 1.4 How ANOVA works (13) two-way ANOVA w/o replication. ST = SA + SB + SE; φT = φA + φB + φE, where φA = m-1, φB = n-1, φE = (m-1)(n-1). Hypotheses for the system effects: H0: a1 = ... = am = 0; H1: at least one differs. Under H0, F0 = VA/VE obeys F(φA, φE). Hypotheses for the topic effects: H0: b1 = ... = bn = 0; H1: at least one differs. Under H0, F0 = VB/VE obeys F(φB, φE).
• 60. 1.4 How ANOVA works (14) two-way ANOVA w/o replication. Hypotheses (for system effects): H0: a1 = ... = am = 0; H1: At least one of the system effects is non-zero. Test statistic: F0 = VA/VE (SE from 1.4 (12)). Reject H0 if F0 >= F(φA,φE;α), where φA = m-1 and φE = (m-1)(n-1). For topic effects, use SB and φB instead of SA and φA. [The slide plots F densities for (m, n) = (3, 10), (5, 10), (20, 10).]
• 61. 1.4 How ANOVA works (15) two-way ANOVA w/o replication.
Source          Sum of squares  Degrees of freedom   Mean squares     F0
Between system  SA              φA = m-1             VA = SA/φA       VA/VE = (n-1)SA/SE
Between topic   SB              φB = n-1             VB = SB/φB       VB/VE = (m-1)SB/SE
Residual        SE              φE = (m-1)(n-1)      VE = SE/φE
Total           ST              φT = mn-1
For system effects:
- Reject H0 if F0 >= F(φA,φE;α) = F.INV.RT(α,φA,φE)
- P-value = F.DIST.RT(F0,φA,φE)
• 62. 1.4 How ANOVA works (16) two-way ANOVA. Two factors A and B; each cell (i, j) contains r observations xij1, ..., xijr (total #observations N = mnr); the interaction between A and B is considered: ST = SA + SB + SAxB + SE and φT = φA + φB + φAxB + φE. [The slide illustrates an interaction: the score seems high if the A level is high AND the B level is high, versus no interaction.] Not discussed in detail in this tutorial, as this design is rare in system-based evaluation.
• 63. 1.4 How ANOVA works (17) two-way ANOVA. ST = SA + SB + SAxB + SE; φT = φA + φB + φAxB + φE.
Source    Sum of squares  Degrees of freedom   Mean squares        F0
A         SA              φA = m-1             VA = SA/φA          VA/VE; P-value = F.DIST.RT(F0, φA, φE)
B         SB              φB = n-1             VB = SB/φB          VB/VE; P-value = F.DIST.RT(F0, φB, φE)
AxB       SAxB            φAxB = (m-1)(n-1)    VAxB = SAxB/φAxB    VAxB/VE; P-value = F.DIST.RT(F0, φAxB, φE)
Residual  SE              φE = mn(r-1)         VE = SE/φE
Total     ST              φT = mnr-1
Definitions of SAxB and SE for two-way ANOVA can be found in textbooks.
• 64. 1.4 How ANOVA works (18) Effect sizes for two-way ANOVA with and w/o replication [Okubo12]: how much of the total variance does the between-system variance account for? Without replication: ST = SA + SB + SE; with replication: ST = SA + SB + SAxB + SE. The simplest estimators from a sample are partial: the variances we are not interested in (SB, and SAxB if present) are removed from the denominator, e.g. ηhat_p^2 = SA/(SA + SE). More accurate estimators in [Okubo12, Sakai14SIGIRforum].
  • 65. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 66. 1.5 ANOVA with Excel and R (1) one-way ANOVA. Using the same 20-topic matrix for runs A, B, C [1.3 (1)]:
• ST = DEVSQ(A1:C20) = 0.726229
• SE = DEVSQ(A1:A20) + DEVSQ(B1:B20) + DEVSQ(C1:C20) = 0.650834
• SA = ST - SE = 0.075395
• 67. 1.5 ANOVA with Excel and R (2) one-way ANOVA [matrix as in 1.3 (1)].
Source          Sum of squares  Degrees of freedom  Mean squares              F0
Between system  SA = 0.075395   φA = m-1 = 2        VA = SA/φA = 0.037697     VA/VE = 3.3015
Within system   SE = 0.650834   φE = m(n-1) = 57    VE = SE/φE = 0.011418
Total           ST = 0.726229
P-value = F.DIST.RT(F0, φA, φE) = 0.0440
• 68. 1.5 ANOVA with Excel and R (3) one-way ANOVA. Data that we used for the t-test [matrix as in 1.3 (1)].
• 69. 1.5 ANOVA with Excel and R (4) one-way ANOVA [matrix as in 1.3 (1)]. Compare with the Excel results.
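What the R hands-on does, as a minimal sketch (file and column names assumed as in 1.3 (1)):

  m <- read.table("20topics3runs.txt", col.names = c("A", "B", "C"))
  # One row per (topic, system) observation, i.e. the long format aov() expects.
  long <- data.frame(score  = c(m$A, m$B, m$C),
                     system = factor(rep(c("A", "B", "C"), each = 20)))
  # One-way ANOVA: SA = 0.0754 (df 2), SE = 0.6508 (df 57), F0 = 3.30, p = 0.044.
  summary(aov(score ~ system, data = long))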
• 70. 1.5 ANOVA with Excel and R (5) two-way ANOVA w/o replication [matrix as in 1.3 (1)].
• ST = DEVSQ(A1:C20) = 0.726229 (cf. 1.5 (1))
• SA = 20*((0.4501-0.4146)^2 + (0.4277-0.4146)^2 + (0.3662-0.4146)^2) = 0.075395, using the run means (0.4501, 0.4277, 0.3662) and the grand mean 0.4146
• SB = 0.579826 (computed from the 20 topic means in the same way)
• SE = ST - SA - SB = 0.071008
• 71. 1.5 ANOVA with Excel and R (6) two-way ANOVA w/o replication [matrix as in 1.3 (1)].
Source          Sum of squares  Degrees of freedom       Mean squares              F0
Between system  SA = 0.075395   φA = m-1 = 2             VA = SA/φA = 0.037697     VA/VE = 20.1737
Between topic   SB = 0.579826   φB = n-1 = 19            VB = SB/φB = 0.030517     VB/VE = 16.3312
Residual        SE = 0.071008   φE = (m-1)(n-1) = 38     VE = SE/φE = 0.001869
Total           ST = 0.726229
P-value (system) = F.DIST.RT(F0, φA, φE) = 1.070E-06
P-value (topic)  = F.DIST.RT(F0, φB, φE) = 8.173E-13
• 72. 1.5 ANOVA with Excel and R (7) two-way ANOVA w/o replication [matrix as in 1.3 (1)].
• 73. 1.5 ANOVA with Excel and R (8) two-way ANOVA w/o replication [matrix as in 1.3 (1)].
• 74. 1.5 ANOVA with Excel and R (9) two-way ANOVA w/o replication [matrix as in 1.3 (1)]. Compare with the Excel results.
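Again as a minimal sketch: adding a topic factor to the model turns the one-way analysis into two-way ANOVA without replication:

  m <- read.table("20topics3runs.txt", col.names = c("A", "B", "C"))
  long <- data.frame(score  = c(m$A, m$B, m$C),
                     system = factor(rep(c("A", "B", "C"), each = 20)),
                     topic  = factor(rep(1:20, times = 3)))
  # System: F(2,38) = 20.17, p = 1.07e-06; topic: F(19,38) = 16.33, p = 8.17e-13.
  summary(aov(score ~ system + topic, data = long))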
  • 75. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 76. 1.6 What's wrong with significance tests? (1) [Johnson99] • Deming (1975) commented that the reason students have problems understanding hypothesis tests is that they may be trying to think. • Carver (1978) recommended that statistical significance testing should be eliminated; it is not only useless, it is also harmful because it is interpreted to mean something else. • Cohen (1994:997) noted that statistical testing of the null hypothesis "does not tell us what we want to know, and we so much want to know what we want to know that, out of desperation, we nevertheless believe that it does!"
• 77. 1.6 What's wrong with significance tests? (2) • We want to know P(H|D), but classical significance testing only gives us something like P(D|H) (H: hypothesis, D: data). (Alternative: Bayesian statistics etc.) • Reporting α (e.g. 0.05) instead of the actual p-values leads to dichotomous thinking ("Significant or not?"). • Even if p-values are reported, p-values reflect not only the effect size (magnitude of the actual difference) but also the sample size: p-value = f( sample_size, effect_size ); large effect size ⇒ small p-value; large sample size ⇒ small p-value [1.2 (15)]. Anything can be made statistically significant by using lots of data.
• 78. 1.6 What's wrong with significance tests? (3) [Sakai14SIGIRForum] So what should we do? Whenever using a classical significance test, report not only p-values, but also effect sizes and confidence intervals. (An effect size here: the difference between two systems measured in standard deviation units.)
• 79. 1.6 What's wrong with significance tests? (4) [Sakai14SIGIRForum] So what should we do? Whenever using a classical significance test, report not only p-values, but also effect sizes and confidence intervals (more accurate estimators of the ANOVA effect sizes: cf. 1.4 (18)). Actually, if you want p-values for every system pair, you can apply randomised Tukey HSD [Carterette12,Sakai14PROMISE] WITHOUT doing ANOVA.
• 80. 1.6 What's wrong with significance tests? (5) Randomised Tukey HSD test for m>=2 systems: http://research.nii.ac.jp/ntcir/tools/discpower-en.html • Input: a topic-by-run score matrix. • Can be used to compute p-values for 2 or more systems. • Unlike classical tests, it does not rely on assumptions such as normality. • It is a kind of multiple comparison procedure (free from the familywise error rate problem).
  • 81. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 82. 1.7 Significance tests in the IR literature, or lack thereof (1) [Sakai16SIGIR]
• 83. 1.7 Significance tests in the IR literature, or lack thereof (2) [Sakai16SIGIR]
• 84. 1.7 Significance tests in the IR literature, or lack thereof (3) [Sakai16SIGIR]
• 85. 1.7 Significance tests in the IR literature, or lack thereof (4) [Sakai16SIGIR]
• 86. 1.7 Significance tests in the IR literature, or lack thereof (5) [Sakai16SIGIR]
  • 87. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
  • 88. 2.1 Topic set sizes in IR (1) [Sakai16IRJ] According to Sparck Jones and Van Rijsbergen [SparckJones75], fewer than 75 topics “are of no real value”; 250 topics “are minimally acceptable”; more than 1000 topics “are needed for some purposes” because “real collections are large”; “statistically significant results are desirable” and “scaling up must be studied.”
  • 89. 2.1 Topic set sizes in IR (2) [Sakai16IRJ] In 1979, in a report that considered the number of relevance assessments required from a statistical viewpoint, Gilbert and Sparck Jones remarked [Gilbert79]: “Since there is some doubt about the feasibility of getting 1000 requests, or the convenience of such a large set for future experiments, we consider 500 requests.”
• 90. 2.1 Topic set sizes in IR (3) The default topic set size at TREC: 50. Exceptions include the million query track that created 1800+ topics [Carterette08], but creating a "reusable" test collection was not the objective of that track. Early TREC ad hoc tasks and topics [Voorhees05, p.24]:
Round   Documents    Topics
TREC-1  disks 1 + 2  51-100
TREC-2  disks 1 + 2  101-150
TREC-3  disks 1 + 2  151-200
TREC-4  disks 2 + 3  201-250
TREC-5  disks 2 + 4  251-300
TREC-6  disks 4 + 5  301-350
TREC-7  disks 4 + 5  351-400
TREC-8  disks 4 + 5  401-450
• 91. 2.1 Topic set sizes in IR (4) [Sakai16IRJ] In 2009, Voorhees conducted an experiment where she randomly split 100 TREC topics in half to count discrepancies in statistically significant results (TREC-7 + 8 topics with TREC 2004 robust track systems: on one 50-topic half, the paired t-test may say System A > B, while on the other half it says A < B), and concluded that "Fifty-topic sets are clearly too small to have confidence in a conclusion when using a measure as unstable as P(10). Even for stable measures, researchers should remain skeptical of conclusions demonstrated on only a single test collection." [Voorhees09] But if randomised Tukey HSD (i.e. a multiple comparison procedure) is used for filtering system pairs, discrepancies across test collections almost never occur [Sakai16ICTIR].
  • 92. 2.1 Topic set sizes in IR (5) At CIKM 2008, [Webber08] pointed out that the topic set size should be determined based on the required statistical power. Accept H0 Reject H0 H0 is true systems are equivalent Correct conclusion (1-α) Type I error α H0 is false systems are different Type II error β Correct conclusion (1-β) Statistical power: ability to detect real differences
  • 93. 2.1 Topic set sizes in IR (6) The approach of [Webber08]: • Incremental test collection building – adding topics with relevance assessments one by one until the desired power is achieved; • Considered the t-test without addressing the familywise error rate problem; • Estimated the variance of score deltas using non-standard methods; We want a more straightforward answer to “How many topics should I create?” In addition to the t-test, we can consider one-way ANOVA and confidence intervals as the basis. Residual variances from ANOVA are unbiased estimators of the within-system variances.
  • 94. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
  • 95. 2.2 Topic set size design (1) [Sakai16IRJ] • Provides answers to the following question: “I’m building a new test collection. How many topics should I create?” • A prerequisite: a small topic-by-run score matrix based on pilot data, for estimating within-system variances. • Three approaches (with easy-to-use Excel tools), based on: (1) paired t-test power (2) one-way ANOVA power (3) confidence interval width upperbound.
• 96. 2.2 Topic set size design (2) [Sakai16IRJ] Test collection designs should evolve based on past data. [The slide shows a pipeline across TREC rounds:] a topic-by-run score matrix from pilot data (n0 topics, m runs) gives a within-system variance estimate, from which n1 is estimated for TREC 201X; the TREC 201X matrix (n1 topics) then gives a more accurate variance estimate, from which n2 is estimated for TREC 201(X+1); and so on. About 25 topics with runs from a few teams are probably sufficient as pilot data [Sakai16EVIA].
• 97. 2.2 Topic set size design (3) [Sakai16IRJ]
Method: Paired t-test. Input required: α (Type I error probability), β (Type II error probability), minDt (minimum detectable difference: whenever the diff between two systems is this much or larger, we want to guarantee 100(1-β)% power), and a variance estimate for the score delta.
Method: one-way ANOVA. Input required: α, β, m (number of systems), minD (minimum detectable range: whenever the diff between the best and worst systems is this much or larger, we want to guarantee 100(1-β)% power), and an estimate of the within-system variance under the homoscedasticity assumption.
Method: Confidence intervals. Input required: α, δ (CI width upperbound: you want the CI for the diff between any system pair to be this much or smaller), and a variance estimate for the score delta.
• 98. 2.2 Topic set size design (4) [Sakai16IRJ] In practice, you can deduce t-test-based and CI-based results from ANOVA-based results: ANOVA-based results for m=2 can be used instead of t-test-based results, and ANOVA-based results for m=10 can be used instead of CI-based results. Caveat: the ANOVA-based tool can only handle (α, β) = (0.05, 0.20), (0.01, 0.20), (0.05, 0.10), (0.01, 0.10).
  • 99. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 101. 2.3 Paired t-tests (1) Example situation: You plan to compare a system pair with the paired t-test with α=5%. You plan to use nDCG as a primary evaluation measure, and want to guarantee 80% power whenever the diff between two systems >= minDt. You know from pilot data that the variance of the nDCG delta is around σhat_dt^2. What is the required number of topics n? (Input required: α, β, minDt, and the variance estimate for the score delta; see 2.2 (3).)
• 102. 2.3 Paired t-tests (2) Notations (some slightly different from Part 1): t is a random variable that obeys t(φ), where φ = n-1; t(φ; α) is the two-sided critical t value for significance criterion α, i.e. the value that puts probability α/2 in each tail: t(φ; α) = T.INV.2T(α, φ).
• 103. 2.3 Paired t-tests (3) Under our assumptions, the mean delta dbar obeys N(μd, σd^2/n). In a t-test, we let H0: μd = 0 and consider the statistic t0 = dbar / SQRT(Vd/n), where Vd is the sample variance of the deltas. Due to the t-test procedure, regardless of what t0 obeys, the probability of rejecting H0 is Pr( |t0| >= t(φ; α) ).
• 104. 2.3 Paired t-tests (4) Regardless of what t0 obeys, the probability of rejecting H0 is Pr( |t0| >= t(φ; α) ) ... (a). If H0 is true, then t0 obeys t(n-1) and (a) is exactly α (that's how t(φ; α) was defined): rejecting here means rejecting the correct hypothesis H0. Alternatively, if H1 is true, the distribution that t0 obeys is known as a noncentral t distribution with φ degrees of freedom, and (a) is exactly the power, (1-β): rejecting here means rejecting the incorrect hypothesis H0.
• 105. 2.3 Paired t-tests (5) In the 2x2 table of test outcomes: when H0 is true (μd = 0, systems are equivalent), t0 obeys a (central) t distribution; when H0 is false (μd ≠ 0, systems are different), t0 obeys a noncentral t distribution, and (a) is the power (1-β), the ability to detect real differences.
• 106. 2.3 Paired t-tests (6) If H1 is true (μd ≠ 0), the distribution that t0 obeys is known as a noncentral t distribution with φ degrees of freedom. It in fact has another parameter called the noncentrality parameter λt = SQRT(n)*Δt, where Δt = μd/σd is the population effect size and σd^2 is the population variance of the score differences (see 1.2 (2)).
  • 107. 2.3 Paired t-tests (7) If H1 is true, the distribution that t0 obeys is known as a noncentral t distribution with φ degrees of freedom and a noncentrality parameter λt, and (a) is exactly the power, (1-β). We want to compute (a) , but the computation involving the noncentral t distribution is too complex... ... (a) Power =
• 108. 2.3 Paired t-tests (8) Fortunately, a good approximation is available [Nagata03] (Appendix Theorem A'). Let t' be a random variable that obeys a noncentral t distribution with (φ, λt), and let u be a random variable that obeys a standard normal distribution. Then Pr( t' >= w ) ≈ Pr( u >= ( w(1 - 1/(4φ)) - λt ) / SQRT( 1 + w^2/(2φ) ) ).
• 109. 2.3 Paired t-tests (9) Applying Theorem A' to (a) with w = t(φ; α): Power ≈ Pr( u >= K ) ... (a'), where K = ( t(φ; α)(1 - 1/(4φ)) - λt ) / SQRT( 1 + t(φ; α)^2/(2φ) ) (the opposite tail is negligible).
• 110. 2.3 Paired t-tests (10) Via (a'), we now know how to compute the power 1-β given (α, Δt, n). But what we want is the reverse: to compute n given (α, β, Δt).
• 111. 2.3 Paired t-tests (11) Now we want to compute n given (α, β, Δt). Starting again with: Power = Pr( t0 >= t(φ; α) ) + Pr( t0 <= -t(φ; α) ) (Appendix Theorem A).
• 112. 2.3 Paired t-tests (12) If λt > 0, the left-tail term Pr( t0 <= -t(φ; α) ) is negligible and can be ignored (λt < 0 will lead to the same final result by symmetry), so Power ≈ Pr( t0 >= t(φ; α) ).
• 113. 2.3 Paired t-tests (13) Approximating the noncentral t by a standard normal shifted by λt (rougher than Theorem A'), Power ≈ Pr( u >= t(φ; α) - λt ). Let z_p denote the one-sided z value for probability p, i.e. Pr( u >= z_p ) = p (so z_{0.80} = -0.842). Setting the power to 1-β gives t(φ; α) - λt = z_{1-β}.
• 114. 2.3 Paired t-tests (14) When λt > 0 or λt < 0 (i.e. H1 is true), we therefore need λt = t(φ; α) - z_{1-β}, which combines a two-sided t value with a one-sided z value. Similarly, when λt = 0 (i.e. H0 is true), the rejection probability is just α.
• 115. 2.3 Paired t-tests (15) The remaining step is to replace the critical t value t(φ; α) by an expansion in terms of standard normal critical values (Appendix Theorem A'' and Theorem B), which yields the correction term z_{α/2}^2/2 in the sample size formula below.
• 116. 2.3 Paired t-tests (16) Let Δt = μd/σd (≠ 0 when H1 is true) and recall that λt = SQRT(n)*Δt. Substituting these into the above gives n ≈ ( (z_{α/2} - z_{1-β}) / Δt )^2 + z_{α/2}^2 / 2.
• 117. 2.3 Paired t-tests (17) Given (α, β, minΔt), the minimal sample size n can be approximated as n ≈ ( (z_{α/2} - z_{1-β}) / minΔt )^2 + z_{α/2}^2 / 2 by letting Δt = minΔt (the minimum detectable effect size). But this involved a lot of approximations, so we need to go back to (a') and check that n actually achieves 100(1-β)% power.
• 118. 2.3 Paired t-tests (18) EXAMPLE: α=0.05, β=0.20, detectable effect size (regardless of evaluation measure) minΔt = 0.50 (i.e. half a std deviation of the diff) → n ≈ ( (1.960 - (-0.842)) / 0.50 )^2 + 1.960^2/2 = 33.3 (z_{α/2} = z_{0.025} = NORM.S.INV(1-0.025) = 1.960; z_{1-β} = z_{0.80} = -0.842). So if we let n=33, the achieved power according to (a') = 0.795 ... doesn't quite achieve 80%!
• 119. 2.3 Paired t-tests (19) EXAMPLE (continued): α=0.05, β=0.20, minΔt = 0.50. If we let n=34, the achieved power according to (a') = 0.808 ... so n=34 is what we need!
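For those who prefer R, a minimal sketch of the same procedure (the power check here uses R's exact noncentral t computation rather than approximation (a'), so the values may differ very slightly):

  alpha <- 0.05; beta <- 0.20; minDelta <- 0.50  # minimum detectable effect size
  z.a <- qnorm(1 - alpha/2)                      # 1.960
  z.b <- qnorm(1 - beta)                         # 0.842 (= -z_{1-beta} above)
  n <- ceiling(((z.a + z.b) / minDelta)^2 + z.a^2 / 2)   # 33.3 -> 34
  # Check the achieved power at n-1 and n:
  power.t.test(n = n - 1, delta = minDelta, sd = 1, sig.level = alpha,
               type = "paired")$power            # n=33: about 0.795
  power.t.test(n = n,     delta = minDelta, sd = 1, sig.level = alpha,
               type = "paired")$power            # n=34: about 0.808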
  • 120. Don’t worry, http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeTTEST.xlsx will do this for you! Use the “From effect size” sheet and fill out the orange cells. 2.3 Paired t-tests (20) n=34 is what you want!
• 121. 2.3 Paired t-tests (21) [Sakai16IRJ] [Table: topic set sizes for typical requirements based on effect sizes.]
• 122. 2.3 Paired t-tests (22) In practice, you might want to specify a minimum detectable diff (minDt) in (say) nDCG instead of minΔt for guaranteeing 100(1-β)% power. Given minDt and a variance estimate σhat_dt^2 for the score delta, minΔt = minDt / σhat_dt, so n can be obtained as before. A conservative estimate for the delta variance would be σhat_dt^2 = 2*σhat^2, where σhat^2 is a within-system variance estimate obtained under a homoscedasticity assumption (conservative, because positively correlated per-topic scores make the actual delta variance smaller). See 2.6.
• 123. 2.3 Paired t-tests (23) EXAMPLE: For nDCG, α=0.05, β=0.20, minDt = 0.1 (i.e., one-tenth of nDCG's score range), σhat_dt^2 = 0.50 (from some pilot data) → minΔt = 0.1/SQRT(0.50) = 0.141. Use the "From the absolute diff" sheet: n=395 is what you want!
  • 124. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 125. 2.4 One-way ANOVA (1) Example situation: You plan to compare m systems with one-way ANOVA with α=5%. You plan to use nDCG as a primary evaluation measure, and want to guarantee 80% power whenever the diff D between the best and the worst systems >= minD. You know from pilot data that the within-system variance for nDCG is around σhat^2. What is the required number of topics n? (Input required: α, β, m, minD, and the within-system variance estimate under the homoscedasticity assumption; see 2.2 (3).)
• 126. 2.4 One-way ANOVA (2) Notations (some slightly different from Part 1): F is a random variable that obeys an F distribution with (φA, φE) degrees of freedom, where φA = m-1 and φE = m(n-1); F(φA, φE; α) is the critical F value for significance criterion α, i.e. the value that cuts off probability α in the right tail: F(φA, φE; α) = F.INV.RT(α, φA, φE).
• 127. 2.4 One-way ANOVA (3) Due to the one-way ANOVA procedure, regardless of what F0 obeys, the probability of rejecting H0 is Pr( F0 >= F(φA, φE; α) ) ... (c). If H0 is true, then F0 obeys F(φA, φE) and (c) is exactly α (that's how F(φA, φE; α) is defined). Alternatively, if H1 is true, the distribution that F0 obeys is known as a noncentral F distribution with (φA, φE) degrees of freedom, and (c) is exactly the power, (1-β).
• 128. 2.4 One-way ANOVA (4) In the 2x2 table of test outcomes: when H0 is true (systems are equivalent), F0 obeys a (central) F distribution; when H0 is false (systems are different), F0 obeys a noncentral F distribution, and (c) is the power (1-β), the ability to detect real differences.
• 129. 2.4 One-way ANOVA (5) If H1 is true, F0 obeys a noncentral F distribution with (φA, φE) degrees of freedom. It in fact has another parameter called the noncentrality parameter λ = n * SUM_i ai^2 / σ^2, which measures the total system effects in variance units (σ^2 being the within-system variance under homoscedasticity).
• 130. 2.4 One-way ANOVA (6) If H1 is true, F0 obeys a noncentral F distribution, denoted F'(φA, φE, λ), and (c) is exactly the power: Power = Pr( F0 >= F(φA, φE; α) ) ... (c'). The noncentral F probability can be approximated via Appendix Theorem C.
• 131. 2.4 One-way ANOVA (7) H1 says: at least one system effect ai is non-zero. Let us ensure that, when Δ = SUM_i ai^2 / σ^2 ≠ 0 (i.e., H1 is true), we guarantee 100(1-β)% power whenever the difference D between the best and worst systems is minD or larger (minimum detectable range).
• 132. 2.4 One-way ANOVA (8) Define minΔ = minD^2 / (2*σ^2). Then Δ >= minΔ holds whenever D >= minD (Appendix Theorem D). minD does not uniquely determine Δ, but minΔ can be used as the worst-case Δ (two systems at the extremes, the rest in the middle).
• 133. 2.4 One-way ANOVA (9) The worst-case sample size: n = λ(φA, φE; α, β) / minΔ, where λ(φA, φE; α, β) is the noncentrality parameter required for F'(φA, φE, λ) to achieve power 1-β. This can be approximated by the corresponding λ for noncentral chi-square distributions, for which linear approximations (in φA) are available for (α, β) = (0.01, 0.10), (0.01, 0.20), (0.05, 0.10), (0.05, 0.20) [Nagata03] (Appendix Theorem E).
• 134. 2.4 One-way ANOVA (10) Given (α, β, minD, m, σhat^2), the minimal sample size n can thus be approximated as n ≈ λ(φA, φE; α, β) * 2*σhat^2 / minD^2. But this involved a lot of approximations, so we need to go back to (c') and check that n actually achieves 100(1-β)% power: Power = Pr( F0 >= F(φA, φE; α) ) ... (c').
• 135. 2.4 One-way ANOVA (11) EXAMPLE: α=0.05, β=0.20, minD=0.5, m=3, σhat^2 = 0.5^2 → minΔ = 0.5^2/(2*0.25) = 0.5, so let n=19 ⇒ λ = 19*0.5 = 9.5. Hence from (c') we get power = 0.791 ... doesn't quite achieve 80%!
• 136. 2.4 One-way ANOVA (12) EXAMPLE (continued): α=0.05, β=0.20, minD=0.5, m=3, σhat^2 = 0.5^2 → Try n=20 ⇒ λ = 20*0.5 = 10. From (c') we get power = 0.813 ... so n=20 is what we need!
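The pwr package can reproduce this reasoning: under the worst case above, Cohen's f satisfies f^2 = SUM_i ai^2/(m*σ^2) = minD^2/(2*m*σ^2). A minimal sketch:

  library(pwr)
  m <- 3; minD <- 0.5; var.hat <- 0.5^2
  f <- sqrt(minD^2 / (2 * m * var.hat))    # sqrt(1/6) = 0.408
  # Solves for a (fractional) per-group n; rounding up gives n = 20.
  pwr.anova.test(k = m, f = f, sig.level = 0.05, power = 0.80)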
  • 137. 2.4 One-way ANOVA (13) Don’t worry, http://www.f.waseda.jp/tetsuya/CIKM2014/samplesizeANOVA.xlsx will do this for you! Use the appropriate sheet for a given (α, β) and fill out the orange cells. : n=20 is what you want!
  • 138. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 139. 2.5 Confidence Intervals (1) Example situation: You plan to compare a system pair by means of the 95% CI for the difference in nDCG. You want to guarantee that the CI width for any system pair is δ or smaller. You know from pilot data that the variance of the nDCG delta is around σhat_dt^2. What is the required number of topics n? (Input required: α, δ, and the variance estimate for the score delta; see 2.2 (3).)
• 140. 2.5 Confidence Intervals (2) The 100(1-α)% CI for a difference in means (paired data) is given by dbar ± MOE, where MOE = t(φ; α) * SQRT(V/n) and V is the sample variance of the deltas (cf. 1.2 (8)). Let's consider a sample size n which guarantees that the CI width (= 2*MOE) for any difference will be no larger than δ. But since MOE contains a random variable V, let's state the above requirement using an expectation: E[ 2*MOE ] <= δ.
• 141. 2.5 Confidence Intervals (3) Now, it is known that the expectation of the sample standard deviation SQRT(V) is E[ SQRT(V) ] = σ * SQRT(2/(n-1)) * Γ(n/2) / Γ((n-1)/2), where σ is the population standard deviation and Γ is the gamma function (see Theorem A; cf. 1.1 (11)). So we want to find the smallest n that satisfies t(n-1; α) * E[ SQRT(V) ] / SQRT(n) <= δ/2.
• 142. 2.5 Confidence Intervals (4) We want to find the smallest n that satisfies t(n-1; α) * E[ SQRT(V) ] / SQRT(n) <= δ/2 ... (d). To obtain an initial n, instead of the t-based MOE, consider the z-based one where the variance is known: z_{α/2} * σhat_dt / SQRT(n) <= δ/2. Thus, let n' = (2 * z_{α/2} / δ)^2 * σhat_dt^2 and start with that. Increment n' until (d) is satisfied.
• 143. 2.5 Confidence Intervals (5) EXAMPLE: α=0.05, δ=0.5, σhat_dt^2 = 0.5 (from some pilot data). Initial n' = (2*1.960/0.5)^2 * 0.5 = 30.7. Checking (d), whose RHS is δ/2 = 0.25: n=31 → LHS=0.257 > 0.25; n=32 → LHS=0.253 > 0.25; n=33 → LHS=0.249 < 0.25. So n=33 is what you want!
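A minimal R sketch of this iteration (lgamma is used to evaluate the gamma-function ratio safely):

  alpha <- 0.05; delta <- 0.5; var.dt <- 0.5   # pilot variance of the deltas
  sigma <- sqrt(var.dt)
  expected.moe <- function(n) {                # LHS of (d)
    ev <- sigma * sqrt(2 / (n - 1)) * exp(lgamma(n / 2) - lgamma((n - 1) / 2))
    qt(1 - alpha / 2, n - 1) * ev / sqrt(n)
  }
  n <- ceiling((2 * qnorm(1 - alpha / 2) / delta)^2 * var.dt)  # initial n = 31
  while (expected.moe(n) > delta / 2) n <- n + 1
  n                                            # 33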
  • 144. 2.5 Confidence Intervals (6) Don’t worry, http://www.f.waseda.jp/tetsuya/FIT2014/samplesizeCI.xlsx will do this for you! Just fill out the orange cells. n=33 is what you want!
  • 145. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 146. 2.6 Estimating the variance (1) We need σhat^2 for topic set size design based on one-way ANOVA, and σhat_dt^2 = 2*σhat^2 for that based on the paired t-test or CI. From a pilot topic-by-run score matrix, obtain σhat^2 = VE, the residual mean square: a by-product of one-way ANOVA (use two-way ANOVA w/o replication for tighter estimates). Then, if possible, pool multiple estimates to enhance accuracy: the pooled estimate over the available matrices is σhat^2 = SUM_k SE(k) / SUM_k φE(k).
• 147. 2.6 Estimating the variance (2) For the 20-topic matrix of runs A, B, C [1.3 (1)]:
• SE = DEVSQ(A1:A20) + DEVSQ(B1:B20) + DEVSQ(C1:C20) = 0.650834 (cf. 1.5 (1))
• φE = m(n-1) = 3*(20-1) = 57
• σhat^2 = VE = SE / φE = 0.0114 (cf. 1.5 (2))
If there is no other topic-by-run matrix available, use this as σhat^2.
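Equivalently in R, a minimal sketch (file and column names assumed as in 1.3 (1)):

  m <- read.table("20topics3runs.txt", col.names = c("A", "B", "C"))
  devsq <- function(x) sum((x - mean(x))^2)    # Excel's DEVSQ
  SE   <- devsq(m$A) + devsq(m$B) + devsq(m$C) # 0.650834
  phiE <- 3 * (20 - 1)                         # m(n-1) = 57
  SE / phiE                                    # sigma-hat^2 = 0.0114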
  • 148. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 149. 2.7 How much pilot data do we need? (1) [Sakai16EVIA] Setup: the official NTCIR-12 STC qrels, based on 16 teams (union of contributions from 16 teams), with 100 topics and 44 runs from the 16 teams as the pilot data, give the best variance estimates available. Question: can we obtain a reliable σhat^2 even from a few teams and a small number of topics?
• 150. 2.7 How much pilot data do we need? (2) [Sakai16EVIA] Leaving out k teams (k=1,...,15): with k=1, the pilot data are 100 topics and the runs from the remaining 15 teams; new variance estimates are computed, trying leave-1-out 10 times.
• 151. 2.7 How much pilot data do we need? (3) [Sakai16EVIA] With k=15, the pilot data are 100 topics and the runs from a single team; again, new variance estimates are computed, trying leave-15-out 10 times.
• 152. 2.7 How much pilot data do we need? (4) [Sakai16EVIA] Removing topics instead: starting from the official NTCIR-12 STC qrels (100 topics, 44 runs from 16 teams, the best estimates available), variance estimates are recomputed as the topic set shrinks: 100 → 90 → 75 → 50 → 25 → 10 topics.
• 153. 2.7 How much pilot data do we need? (5) [Sakai16EVIA] Finally, both are combined: leave-k-out qrels (e.g. k=1, runs from 15 teams) with the topic set shrinking 100 → 90 → 75 → 50 → 25 → 10.
  • 154. Starting with n’=100 topics Starting with n’=10 topics 2.7 How much pilot data do we need? (6) [Sakai16EVIA] About 25 topics with a few teams seems sufficient, provided that a reasonably stable measure is used.
  • 155. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
  • 156. 3.1 Power analysis (1) [Ellis10, pp.56-57] 1. Effect size describes the degree to which the phenomenon is present in the population; 2. Sample size determines the amount of sampling error inherent in a result; 3. Significance criterion α defines the risk of committing a Type I error; 4. power (1-β) refers to the chosen or implied Type II error rate. “The four power parameters are related, meaning that the value of any parameter can be determined from the other three.” We had a quick look at how the computations can be done in Part 2.
• 157. 3.1 Power analysis (2) [Toyoda09] If a paper reports: - the parametric significance test type (paired/unpaired t-test, one-way ANOVA, two-way ANOVA w and w/o replication); - either the p-value or the test statistic (t-value or F-value); - the actual sample size, then we can easily compute the sample effect size (cf. 1.6 (2)). Then, using the R library pwr (https://cran.r-project.org/web/packages/pwr/pwr.pdf), we can compute: - the achieved power (1-β) of the experiment; - the future sample size for achieving a given (α, β).
  • 158. 3.1 Power analysis (3) [Sakai16SIGIR] My R power analysis scripts, adapted from [Toyoda09] with Professor Toyoda’s kind permission, are available at https://waseda.box.com/SIGIR2016PACK - Works with paired/unpaired t-test, one-way ANOVA, two-way ANOVA w and w/o replication. - SIGIR2016PACK also contains an Excel file from [Sakai16SIGIR] (manual analysis of 1055 papers from SIGIR+TOIS 2006-2015).
  • 159. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 160. 3.2 With paired t-tests (1) future.sample.pairedt arguments: - t statistic (t); - sample size (n); - two-sided/one-sided (default: two-sided); - α (default: 0.05); - desired power (1-β) (default: 0.80). OUTPUT: - effect size (cf. 1.2 (15)); - achieved power; - future sample size n'. Internally calls power.t.test.
• 161. 3.2 With paired t-tests (2) A paper from SIGIR 2012 reports "t(27)=0.953 with (two-sided) paired t-test" ⇒ t = 0.953, n = 28 (φ = n-1 = 27); see Line 270 in the raw Excel file from [Sakai16SIGIR]. Result: very low power (15.1%). For this kind of effect, we need a much larger sample if we want 80% power.
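The script itself is in SIGIR2016PACK; as a rough sketch of what it computes, the pwr package alone gives essentially the same numbers:

  library(pwr)
  t <- 0.953; n <- 28
  d <- t / sqrt(n)                             # sample effect size = 0.180
  pwr.t.test(n = n, d = d, sig.level = 0.05,
             type = "paired")$power            # achieved power: about 0.15
  pwr.t.test(d = d, sig.level = 0.05, power = 0.80,
             type = "paired")$n                # future sample size: about 244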
  • 162. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 163. 3.3 With unpaired t-tests (1) future.sample.unpairedt arguments: - t statistic (t); - sample sizes (n1, n2); - two-sided/one-sided (default: two-sided); - α (default: 0.05); - desired power (1-β) (default: 0.80). OUTPUT: - effect size (cf. 1.2 (15)); - achieved power; - future sample size n' per group. Internally calls pwr.t2n.test.
• 164. 3.3 With unpaired t-tests (2) A paper from SIGIR 2007 reports: "t(188403) = 2.81, n1 = 150610, n2 = 37795 with (two-sided) two-sample t-test" (φ = n1 + n2 - 2 = 188403); see Line 714 in the raw Excel file from [Sakai16SIGIR]. Result: an appropriate level of power; n1 = n2 = 60066 would be the typical setting for 80% power.
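A rough sketch of the same computation with the pwr package directly:

  library(pwr)
  t <- 2.81; n1 <- 150610; n2 <- 37795
  d <- t * sqrt(1 / n1 + 1 / n2)               # sample effect size: about 0.016
  pwr.t2n.test(n1 = n1, n2 = n2, d = d)$power  # about 0.80: appropriately powered
  pwr.t.test(d = d, power = 0.80)$n            # about 60066 per group if n1 = n2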
  • 165. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 166. 3.4 With one-way ANOVA (1) future.sample.1wayanova arguments: - F statistic (F, i.e. FA, which compares the between-system variation against the within-system variation); - #groups (systems) compared (m); - #observations (topics) per group (n); - α (default: 0.05); - desired power (1-β) (default: 0.80). OUTPUT: - effect size (cf. 1.4 (9)); - achieved power; - future sample size per group n'. Here φA = m-1 and φE = m(n-1). Internally calls pwr.anova.test.
• 167. 3.4 With one-way ANOVA (2) A paper from SIGIR 2008 reports: "m=3 groups, n=12 subjects per group, F(2, 33)=1.284 with (one-way) ANOVA" (φA = m-1 = 2, φE = m(n-1) = 3*(12-1) = 33); see Line 616 in the raw Excel file from [Sakai16SIGIR]. Result: very low power (27.9%). For this kind of effect, we need more subjects if we want 80% power.
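A rough sketch with the pwr package directly, recovering Cohen's f from the reported F statistic (f^2 = F*φA/φE):

  library(pwr)
  F0 <- 1.284; m <- 3; n <- 12
  f <- sqrt(F0 * (m - 1) / (m * (n - 1)))      # fhat: about 0.28
  pwr.anova.test(k = m, n = n, f = f)$power    # achieved power: about 0.28
  pwr.anova.test(k = m, f = f, power = 0.80)$n # future n per group: about 42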
  • 168. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 169. 3.5 With two-way ANOVA without replication (1) future.sample.2waynorep arguments: same as future.sample.1wayanova. OUTPUT: - effect size (a little different from 1.4 (18)); - achieved power; - future sample size per group n'. Here φA = m-1 and φE = (m-1)(n-1). Internally calls pwr.f2.test, which requires the squared effect size fhat_p^2 (p stands for partial: the effect of B has been removed).
• 170. 3.5 With two-way ANOVA without replication (2) A paper from SIGIR 2015 reports: "m=4 groups, F(3, 48)=0.63 with a repeated-measures ANOVA" ⇒ m = φA + 1 = 4, φE = (m-1)(n-1) = 48, n = 17 per group; see Line 22 in the raw Excel file from [Sakai16SIGIR]. A repeated-measures ANOVA follows the same procedure as two-way ANOVA w/o replication (the second factor, e.g. topics, is regarded as repeated observations). Result: very low power (18.3%). For this kind of effect, we need more subjects if we want 80% power.
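A rough sketch with pwr.f2.test directly (f^2 = F*φA/φE; the actual script then converts the required error degrees of freedom into the per-group sample size it reports):

  library(pwr)
  F0 <- 0.63; phiA <- 3; phiE <- 48
  f2 <- F0 * phiA / phiE                         # fhat^2: about 0.039
  pwr.f2.test(u = phiA, v = phiE, f2 = f2)$power # achieved power: about 0.18
  pwr.f2.test(u = phiA, f2 = f2, power = 0.80)$v # required error df for 80% power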
  • 171. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 172. 3.6 With two-way ANOVA (1) future.sample.2wayanova2 (Version 2) arguments: - F statistics (FA, FB, FAB); - #groups compared (m); - #cells per group (n); - #total observations (N = mnr); - α (default: 0.05); - desired power (1-β) (default: 0.80). OUTPUT: - effect size; - achieved power; - total sample size N'. Here φA = m-1, φB = n-1, φAB = (m-1)(n-1), φE = mn(r-1). Internally calls pwr.anova.test, using the partial squared effect size (p stands for partial: the effects of B and AxB have been removed); and similarly for B and AxB.
• 173. 3.6 With two-way ANOVA (2) A paper from SIGIR 2014 reports: "m=2, n=2, two-way ANOVA, A: F(1, 960)=24.00, B: F(1, 960)=24.89, AxB: F(1, 960)=10.03" (φA = m-1 = 1, φB = n-1 = 1, φAxB = (m-1)(n-1) = 1, φE = mn(r-1) = 960 ⇒ r = 960/4+1 = 241, N = mnr = 964); see Line 121 in the raw Excel file from [Sakai16SIGIR]. The per-factor sample size fed to the power computation is φE/(φA+1) + 1 = 960/(1+1) + 1 = 481 [Cohen88, p.365]. Result: very high power; smaller sample sizes suffice.
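A rough sketch for the A main effect with pwr.f2.test (B and AxB are handled the same way with their own F statistics and degrees of freedom):

  library(pwr)
  FA <- 24.00; phiA <- 1; phiE <- 960
  f2A <- FA * phiA / phiE                          # partial fhat^2 = 0.025
  pwr.f2.test(u = phiA, v = phiE, f2 = f2A)$power  # essentially 1: very high power
  pwr.f2.test(u = phiA, f2 = f2A, power = 0.80)$v  # about 313 error df would suffice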
  • 174. TUTORIAL OUTLINE 1. Significance testing basics and limitations 1.1 Preliminaries 1.2 How the t-test works 1.3 T-test with Excel and R (hands-on) 1.4 How ANOVA works 1.5 ANOVA with Excel and R (hands-on) 1.6 What's wrong with significance tests? 1.7 Significance tests in the IR literature, or lack thereof 2. Using the Excel topic set size design tools 2.1 Topic set sizes in IR 2.2 Topic set size design <30min coffee break> 2.3 With paired t-tests (hands-on) 2.4 With one-way ANOVA (hands-on) 2.5 With confidence intervals (hands-on) 2.6 Estimating the variance (hands-on) 2.7 How much pilot data do we need? 3. Using the R power analysis scripts 3.1 Power analysis 3.2 With paired t-tests (hands-on) 3.3 With unpaired t-tests (hands-on) 3.4 With one-way ANOVA (hands-on) 3.5 With two-way ANOVA without replication (hands-on) 3.6 With two-way ANOVA (hands-on) 3.7 Overpowered and underpowered experiments in IR 4. Summary, a few additional remarks, and Q&A 30min 70min 20min 50min 10min Appendix
• 175. 3.7 Overpowered and underpowered experiments in IR (1) [Sakai16SIGIR]
SSR = sample size ratio = actual sample size / recommended sample size for a future experiment.
Extremely large SSR ⇔ extremely overpowered; extremely small SSR ⇔ extremely underpowered.
133 SIGIR+TOIS papers from the past decade (2006-2015) were examined using the R power analysis tools (106 with t-tests; 27 with ANOVAs). Case studies follow; a toy SSR computation is sketched below.
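Just to make the definition concrete: SSR is a plain ratio. The helper name below is mine, and the numbers are taken from the two case-study slides that follow, not additional data.

# SSR = actual sample size / recommended future sample size
ssr <- function(actual, recommended) actual / recommended

ssr(5352460, 164107)  # ≈ 32.6: extremely overpowered (slide 177)
ssr(28, 244)          # ≈ 0.11: underpowered (slide 178)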
  • 176. 3.7 Overpowered and underpowered experiments in IR (2) [Sakai16SIGIR]
• 177. 3.7 Overpowered and underpowered experiments in IR (3) [Sakai16SIGIR]
A paper on personalisation from a search engine company (paired t-test):
t=16.00, n=5,352,460, effect size=0.007, achieved power=1, recommended future sample size=164,107.
The effect size is very small (though it may translate into substantial profit for a company).
• 178. 3.7 Overpowered and underpowered experiments in IR (4) [Sakai16SIGIR]
User experiments, paired t-test:
t=0.95, n=28, effect size=0.180, achieved power=0.152, recommended future sample size=244.
(Similar results for the other t-test results in the same paper.)
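Both paired-t case studies can be roughly sanity-checked with pwr.t.test from the pwr package, taking d = t/√n as the paired effect size. This is a hedged stand-in for the tutorial’s scripts: pwr’s answers land near, but not exactly on, the slides’ recommended sizes, which were computed with the tutorial’s own noncentral-t routines.

library(pwr)

# Slide 177 (overpowered): d = 16.00 / sqrt(5352460) ≈ 0.007.
pwr.t.test(n = 5352460, d = 0.007, type = "paired")$power  # ≈ 1
pwr.t.test(d = 0.007, power = 0.80, type = "paired")$n     # ≈ 160,000 (slide: 164,107)

# Slide 178 (underpowered): d = 0.95 / sqrt(28) ≈ 0.180.
pwr.t.test(n = 28, d = 0.180, type = "paired")$power       # ≈ 0.15
pwr.t.test(d = 0.180, power = 0.80, type = "paired")$n     # ≈ 244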
  • 179. 3.7 Overpowered and underpowered experiments in IR (5) [Sakai16SIGIR]
• 180. 3.7 Overpowered and underpowered experiments in IR (6) [Sakai16SIGIR]
Experiments with commercial social media application data (one-way ANOVA):
F=243.42, m=3, sample size per group=2,551, effect size fhat=0.252, achieved power=1, recommended future sample size per group=52.
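A hedged check of this case with pwr.anova.test; the effect-size estimate fhat = sqrt(φA・F/φE), with φA = 2 and φE = 3×2550 = 7650, is my assumption (it reproduces the slide’s 0.252).

library(pwr)

F.A <- 243.42; m <- 3; n <- 2551
phi.A <- m - 1; phi.E <- m * (n - 1)
f.hat <- sqrt(phi.A * F.A / phi.E)                 # ≈ 0.252

pwr.anova.test(k = m, n = n, f = f.hat)$power      # ≈ 1
pwr.anova.test(k = m, f = f.hat, power = 0.80)$n   # ≈ 52 per group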
• 181. 3.7 Overpowered and underpowered experiments in IR (7) [Sakai16SIGIR]
User experiments, two-way ANOVA w/o replication:
F=0.63, m=4, sample size per group=17, effect size fhat^2 = 0.039, achieved power=0.183, recommended future sample size per group=75.
(Similar results for the other ANOVA results in the same paper.)
• 182. TUTORIAL OUTLINE (section-divider slide; next: 4. Summary, a few additional remarks, and Q&A)
• 183. Now you know
• How to determine the number of topics when building a new test collection, using a topic-by-run score matrix constructed from pilot data and a simple Excel tool. And you kind of know how it works!
• How to check whether a reported experiment is overpowered/underpowered and decide on a better sample size for a future experiment, using simple R scripts.
• 184. What now?
• Be aware of the limitations of classical significance testing. But while we are still using classical tests, report effect sizes, p-values etc. so that they can feed into our collective wisdom [Sakai14SIGIRforum, Sakai16SIGIR]. And use topic set size design and power analysis: some guidance is better than none!
• My personal wish is that classical significance tests will soon be replaced by Bayesian tests, so that we can discuss P(H|D) instead of P(D|H) for various H’s, not just “equality of means” etc.
• Score standardisation can give you smaller topic set sizes in topic set size design; please have a look at [Sakai16ICTIR].
  • 185. Thank you for staying with me until the end! Questions?
  • 186. Acknowledgements This tutorial is rather heavily based on what I learnt from Professor Yasushi Nagata’s and Professor Hideki Toyoda’s books (written in Japanese). I thank Professor Nagata (Waseda University) for his valuable advice and Professor Toyoda (Waseda University) for letting me modify his R code and distribute it. If there are any errors in this tutorial, I am solely responsible.
• 187. References
[Carterette08] Carterette, B., Pavlu, V., Kanoulas, E., Aslam, J. A., and Allan, J.: Evaluation over Thousands of Queries, ACM SIGIR 2008.
[Carterette12] Carterette, B.: Multiple Testing in Statistical Analysis of Systems-Based Information Retrieval Experiments, ACM TOIS 30(1), 2012.
[Cohen88] Cohen, J.: Statistical Power Analysis for the Behavioral Sciences (Second Edition), Psychology Press, 1988.
[Ellis10] Ellis, P. D.: The Essential Guide to Effect Sizes, Cambridge University Press, 2010.
[Gilbert79] Gilbert, H. and Sparck Jones, K.: Statistical Bases of Relevance Assessment for the ‘IDEAL’ Information Retrieval Test Collection, Computer Laboratory, University of Cambridge, 1979.
[Johnson99] Johnson, D. H.: The Insignificance of Statistical Significance Testing, Journal of Wildlife Management, 63(3), 1999.
[Nagata03] Nagata, Y.: How to Design the Sample Size (in Japanese), Asakura Shoten, 2003.
[Okubo12] Okubo, G. and Okada, K.: Psychological Statistics to Tell Your Story: Effect Size, Confidence Interval, and Power (in Japanese), Keisho Shobo, 2012.
• 188. References
[Sakai14SIGIRforum] Sakai, T.: Statistical Reform in Information Retrieval?, SIGIR Forum, 48(1), 2014. http://sigir.org/files/forum/2014J/2014J_sigirforum_Article_TetsuyaSakai.pdf
[Sakai16EVIA] Sakai, T. and Shang, L.: On Estimating Variances for Topic Set Size Design, EVIA 2016.
[Sakai16ICTIR] Sakai, T.: A Simple and Effective Approach to Score Standardisation, ACM ICTIR 2016.
[Sakai16IRJ] Sakai, T.: Topic Set Size Design, Information Retrieval Journal, 19(3), 2016. [OPEN ACCESS] http://link.springer.com/content/pdf/10.1007%2Fs10791-015-9273-z.pdf
[Sakai16SIGIR] Sakai, T.: Statistical Significance, Power, and Sample Sizes: A Systematic Review of SIGIR and TOIS, 2006-2015, ACM SIGIR 2016.
[Sakai16SIGIRshort] Sakai, T.: Two Sample T-tests for IR Evaluation: Student or Welch?, ACM SIGIR 2016.
• 189. References
[SparckJones75] Sparck Jones, K. and van Rijsbergen, C. J.: Report on the Need for and Provision of an ‘Ideal’ Information Retrieval Test Collection, Computer Laboratory, University of Cambridge, 1975.
[Toyoda09] Toyoda, H.: Introduction to Statistical Power Analysis: A Tutorial with R (in Japanese), Tokyo Tosho, 2009.
[Voorhees05] Voorhees, E. M. and Harman, D. K.: TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, 2005.
[Voorhees09] Voorhees, E. M.: Topic Set Size Redux, ACM SIGIR 2009.
[Webber08] Webber, W., Moffat, A., and Zobel, J.: Statistical Power in Retrieval Experimentation, ACM CIKM 2008.
• 190. Appendix (everything adapted from [Nagata03])
• Definition: noncentral t distribution
• Definition: noncentral chi-square distribution
• Definition: noncentral F distribution
• Theorem A: normal approximation of a noncentral t distribution
• Theorem A’: corollary of A
• Theorem A’’: corollary of A (approximating a z value using a t value)
• Theorem B: approximating a t value using a z value
• Theorem C: normal approximation of a noncentral F distribution
• Theorem D: inequality for system effects
• Theorem E: approximating a noncentral F distribution with a noncentral chi-square distribution
• 191. Definition: noncentral t distribution
Let Z ~ N(λ, 1²) and W ~ χ²(φ), where the two random variables are independent. The probability distribution of the following random variable is called a noncentral t distribution with φ degrees of freedom and a noncentrality parameter λ:
t’ = Z / √(W/φ) .
When λ=0, it is reduced to the central t distribution with φ degrees of freedom, t(φ). Denoted by t’(φ, λ).
• 192. Definition: noncentral chi-square distribution
Let Zi ~ N(μi, 1²) (i = 1, …, k), where the random variables are independent. The probability distribution of the following random variable is called a noncentral chi-square distribution with φ=k degrees of freedom and a noncentrality parameter λ:
χ’² = Σi Zi² , where λ = Σi μi² .
When λ=0, it is reduced to the central chi-square distribution with φ degrees of freedom, χ²(φ). Denoted by χ’²(φ, λ).
• 193. Definition: noncentral F distribution
Let W1 ~ χ’²(φ1, λ) (noncentral chi-square distribution) and W2 ~ χ²(φ2) (central chi-square distribution), where the two random variables are independent. The probability distribution of the following random variable is called a noncentral F distribution with (φ1, φ2) degrees of freedom and a noncentrality parameter λ:
F’ = (W1/φ1) / (W2/φ2) .
When λ=0, it is reduced to the central F distribution with (φ1, φ2) degrees of freedom, F(φ1, φ2). Denoted by F’(φ1, φ2; λ).
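All three noncentral distributions are built into base R via the ncp argument, which is what the power computations in Section 3 ultimately rest on. A quick illustration (the numeric values below are my own examples):

# Base R exposes the noncentral t, chi-square and F CDFs via ncp.
phi <- 48; lambda <- 2.05

pt(2.0, df = phi, ncp = lambda)            # noncentral t CDF, t'(φ, λ)
pchisq(3.0, df = 3, ncp = lambda)          # noncentral chi-square CDF
pf(2.8, df1 = 3, df2 = phi, ncp = lambda)  # noncentral F CDF

# E.g. the achieved power on slide 170 is one minus a noncentral F CDF,
# with ncp = f2 * (u + v + 1) as in pwr.f2.test:
1 - pf(qf(0.95, 3, 48), 3, 48, ncp = 0.039375 * (3 + 48 + 1))  # ≈ 0.18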
• 194. Theorem A: normal approximation of a noncentral t distribution
[The formal statement consists of equation images that did not survive text extraction: a noncentral t variable t’(φ, λ) is approximated by a normal variable, with a constant defined via the Gamma function.]
Brief derivation given in [Sakai16IRJ, Appendix 1].
• 195. Theorem A’: corollary of A
[The statement and proof are equation images that did not survive text extraction; the proof substitutes particular values into Theorem A.]
Brief derivation given in [Sakai16IRJ, Appendix 1].
• 196. Theorem A’’: corollary of A (approximating a one-sided z value using a two-sided t value)
PROOF: In Theorem A, when λ=0, t = t’ obeys a (central) t distribution. [The remaining substitution was an equation image that did not survive text extraction.]
[Chart: the approximated z value plotted against φ = 1, …, 96 for 2P = α = 0.05.] Verified with Excel.
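The slide’s own formula is lost in this text version; a standard approximation of exactly this type, which I am assuming (not quoting) here, is z_P ≈ t(1 − 1/(4φ)) / √(1 + t²/(2φ)). It can be checked in R much as the slide checks it in Excel:

# Assumed (not verbatim from the slide): a standard z-from-t approximation.
phi <- 1:96
t2P <- qt(1 - 0.05/2, df = phi)                 # two-sided t value, 2P = 0.05
z.approx <- t2P * (1 - 1/(4*phi)) / sqrt(1 + t2P^2/(2*phi))
plot(phi, z.approx, type = "l", ylim = c(1, 3)) # rises towards 1.96
abline(h = qnorm(1 - 0.025), lty = 2)           # exact one-sided z = 1.9600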
• 197. Theorem B: approximating a two-sided t value using a one-sided z value
This is a special case of Johnson and Welch’s theorem on the noncentral t statistic [Nagata03]. [The formula itself was an equation image that did not survive text extraction.]
[Chart: the approximated t value plotted against φ = 1, …, 96 for P = α = 0.05.] Verified with Excel.
• 198. Theorem C: normal approximation of a noncentral F distribution
[The formal statement consists of equation images that did not survive text extraction: a noncentral F variable is approximated by a normal variable.]
Brief derivation given in [Sakai16IRJ, Appendix 2].
• 199. Theorem D: inequality for system effects
For system effects a1, …, am satisfying Σi ai = 0, let D = maxi ai − mini ai. Then Σi ai² ≥ D²/2.
The equality holds when one ai = D/2, another ai = −D/2, and ai = 0 for all the others.
Proof in [Sakai16IRJ, footnote 19].
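A tiny numeric sanity check of the inequality (my own illustration, with D = 0.1):

# Equality case: two effects at ±D/2, the rest 0.
a <- c(0.05, -0.05, 0, 0)
sum(a^2)            # 0.005, equal to D^2/2 = 0.1^2 / 2
# Another zero-sum configuration with the same range D is strictly larger:
b <- c(0.05, -0.05, 0.02, -0.02)
sum(b^2)            # 0.0058 > D^2/2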
• 200. Theorem E: approximating a noncentral F distribution with a noncentral chi-square distribution
Letting φE ≒ ∞, φ1・F’(φ1, φE; λ) is approximately distributed as χ’²(φ1, λ); correspondingly, the F value for probability P is approximated by the chi-square value for probability P divided by φ1. [The exact equations on this slide were images that did not survive text extraction.]
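A quick check in R of the φE ≒ ∞ limit for the central case (λ = 0); note that R accepts Inf for the denominator degrees of freedom:

phi1 <- 3
qf(0.95, df1 = phi1, df2 = Inf)    # 2.6049...
qchisq(0.95, df = phi1) / phi1     # 2.6049... (the same value)
# With a large but finite error df, the two are already very close:
qf(0.95, df1 = phi1, df2 = 960)    # ≈ 2.61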