The constructive role of the replication crisis teaches a lot about: (1) non-fallacious uses of statistical tests, (2) the rationale for the role of probability in tests, and (3) how to reformulate tests.
D. G. Mayo: The Replication Crisis and Its Constructive Role in the Philosophy of Statistics
1. The Replication Crisis and Its
Constructive Role in the
Philosophy of Statistics
Deborah G Mayo
November 3, 2018
2. What’s the constructive role of the
replication crisis?
• High profile failures of replication have resulted in
much soul-searching among statisticians
• Why do I say it has (or should have) a very
constructive role in philosophy of statistics?
3. What’s failed replication?
• Results found statistically significant are
not found significant by an independent
group, using new subjects, stricter
protocols and preregistration
4. Paradox of Replication
• Crisis of Replication: it’s too difficult to
replicate the small P-values others found
when we use preregistered protocols
• Leading to the complaint: It’s too easy to
get low P-values
5. That it’s too easy when you abuse or cheat teaches
a lot about:
I. Non-fallacious uses of statistical tests
II. Rationale for the role of probability in tests
III. How to reformulate tests
6. Most findings are false?
“Several methodologists have pointed out that the high
rate of nonreplication of research discoveries is a
consequence of the convenient, yet ill-founded strategy
of claiming conclusive research findings solely on the
basis of a single study assessed by formal statistical
significance, typically for a p-value less than 0.05.” …
“It can be proven that most claimed research findings are
false.” (Ioannidis 2005, 0696)
8. I. Non-fallacious tests
“[W]e need, not an isolated record, but a reliable
method of procedure. In relation to the test of
significance, we may say that a phenomenon is
experimentally demonstrable when we know how to
conduct an experiment which will rarely fail to give
us a statistically significant result.” (Fisher 1947, 14)
9. Fisher’s Simple Significance Test
“…to test the conformity of the particular data
under analysis with H0 in some respect:
…we find a function T = t(y) of the data, to be
called the test statistic, such that
• the larger the value of T the more inconsistent
are the data with H0;
• The random variable T = t(Y) has a
(numerically) known probability distribution
when H0 is true.
…the p-value corresponding to any tobs as
p = p(t) = Pr(T ≥ tobs; H0)”
(Mayo and Cox 2006, 81)
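To make the definition concrete, here is a minimal Python sketch of the computation Pr(T ≥ tobs; H0), assuming a test statistic that is standard Normal under H0 (the value tobs = 2.0 is illustrative; scipy is an assumed dependency):

    # p-value for a simple significance test whose statistic T = t(Y)
    # has a standard Normal distribution under H0.
    from scipy.stats import norm

    t_obs = 2.0                  # illustrative observed value of T = t(y)
    p_value = norm.sf(t_obs)     # Pr(T >= t_obs; H0), the upper tail
    print(f"p = {p_value:.4f}")  # ~0.0228: the larger T, the smaller p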
10. Testing Reasoning
• If even larger differences than tobs occur fairly
frequently under H0 (i.e., P-value is not small),
there’s scarcely evidence of incompatibility
with H0
• Small P-value indicates some underlying
discrepancy from H0 because very probably you
would have seen a less impressive difference
than tobs were H0 true.
• This still isn’t evidence of a genuine statistical
effect H1, let alone a scientific conclusion H*
Stat-Sub (statistical-substantive) fallacy: H ⇒ H*
11. Fallacy of rejection
• H* makes claims that haven’t been probed by the
statistical test
• The moves from experimental interventions to H*
don’t get enough attention, but your statistical
account should block them.
12. Neyman-Pearson (N-P) tests:
Null and alternative hypotheses H0 and H1
that are exhaustive:
H0: μ ≤ 0 vs. H1: μ > 0
• So this fallacy of rejection H1 ⇒ H* is impossible
• Rejecting H0 only indicates statistical alternatives
H1 (how discrepant from null)
13. Despite philosophical debates
between Fisher & N-P
• They both fall under tools for “appraising and
bounding the probabilities (under respective
hypotheses) of seriously misleading interpretations
of data” (Birnbaum 1970, 1033)–error probabilities
• I place all under the rubric of error statistics
• Confidence intervals, N-P and Fisherian tests,
resampling, randomization.
14. N-P and Fisher showed error
control is lost with selective
reporting
Sufficient finagling—cherry-picking, P-hacking,
significance seeking, multiple testing, look
elsewhere—may practically guarantee a preferred
claim H gets support, even if it’s unwarranted by
evidence
15. Minimal principle for evidence
If the test had little or no capability of finding
flaws with H (even if H is incorrect), then
agreement between data x0 and H provides
poor (or no) evidence for H
Such a test fails a minimal requirement for
evidence (severity principle)
• Holds outside of formal tests, to estimation,
prediction.
16. II. Key to revising roles of error
probabilities
• What bothers you with selective reporting,
cherry picking, stopping when the data look
good (biasing selection effects)?
• Not problems about long-runs—
17. We cannot say the case at hand has done a
good job of avoiding the sources of
misinterpreting data
18. 21 Word Solution: Report Sampling
Plan in Methods Section
• Replication researchers (re)discovered that data-
dependent hypotheses are a major source of
spurious significance levels.
“We report how we determined our sample size, all
data exclusions (if any), all manipulations, and all
measures in the study.”
(Simmons, Nelson, and Simonsohn 2012, 4)
19. Fishing for significance
(nominal vs. actual)
Suppose that twenty sets of differences have
been examined, that one difference seems large
enough to test and that this difference turns out
to be ‘significant at the 5 percent level.’ ….The
actual level of significance is not 5 percent,
but 64 percent! (Selvin 1970, 104)
(Morrison & Henkel’s Significance Test controversy
1970!)
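Selvin’s 64 percent is easy to check under the simplifying assumption that the twenty tests are independent, each run at the .05 level with all nulls true; a short Python sketch (numpy assumed):

    # Nominal vs. actual significance level when the "best" of
    # twenty independent .05-level tests is selected for report.
    import numpy as np

    alpha, k = 0.05, 20
    print(1 - (1 - alpha) ** k)      # 0.64: Pr(at least one "significant")

    rng = np.random.default_rng(0)
    p = rng.uniform(size=(100_000, k))     # p-values are U(0,1) under H0
    print((p.min(axis=1) < alpha).mean())  # ~0.64 by simulation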
20. Spurious P-Value
• He reports: Such results would be difficult to
achieve under the assumption of H0
• When in fact such results are common under
the assumption of H0
• Calls for adjusting the P-value to reflect the
actual error probability
21. Yet some accounts of evidence object
“Two problems that plague frequentist inference:
multiple comparisons and multiple looks, or…data
dredging and peeking at the data. The frequentist
solution to both problems involves adjusting the P-
value…
But adjusting the measure of evidence because
of considerations that have nothing to do with
the data defies scientific sense” (Goodman 1999,
1010)
(To his credit, he’s open about this; he heads the Meta-Research
Innovation Center at Stanford)
22. Likelihood Principle (LP)
A pivotal disagreement in the philosophy of statistics
wars:
In classical Bayesian and likelihoodist accounts, the
import of the data is via the ratios of likelihoods of
hypotheses
Pr(x0;H0)/Pr(x0;H1)
Condition on fixed data x0, hypotheses vary
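As a concrete toy case, the likelihood ratio for fixed binomial data can be computed directly; the hypotheses θ = 0.5 and θ = 0.7 below are illustrative, not from the slides (scipy assumed):

    # Likelihood ratio Pr(x0; H0)/Pr(x0; H1) for fixed data x0:
    # 7 successes in 10 Bernoulli trials.
    from scipy.stats import binom

    x0, n = 7, 10
    lik_H0 = binom.pmf(x0, n, 0.5)        # Pr(x0; H0: theta = 0.5)
    lik_H1 = binom.pmf(x0, n, 0.7)        # Pr(x0; H1: theta = 0.7)
    print(f"LR = {lik_H0 / lik_H1:.3f}")  # ~0.44 < 1: x0 favors H1

Under the LP this ratio carries the full import of x0; how the sample was collected (fixed n, optional stopping) drops out.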
23. Hacking (1965)
• “Law of Likelihood”: x supports hypothesis H0
less well than H1 if
Pr(x;H0) < Pr(x;H1)
(abandoned in 1980)
• “there always is such a rival hypothesis viz., that
things just had to turn out the way they actually
did” (Barnard 1972, 129).
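Barnard’s point can be made numerical: a “just-so” rival that makes the observed sequence certain has likelihood 1, so by the Law of Likelihood it outdoes any chance hypothesis. A minimal sketch (the data are illustrative):

    # The "just-so" rival: things had to turn out exactly as they did.
    seq = [1, 0, 1, 1, 0, 1, 1, 1]   # illustrative coin-flip record

    lik_fair = 0.5 ** len(seq)       # Pr(seq; theta = 0.5) = 1/256
    lik_just_so = 1.0                # Pr(seq; "it had to be so") = 1
    print(lik_fair < lik_just_so)    # True: the rival always wins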
25. All error probabilities violate LP
(even without selection effects):
Sampling distributions, significance levels, power, all
depend on something more [than the likelihood
function]–something that is irrelevant in Bayesian
inference–namely the sample space
(Lindley 1971, 436)
The LP implies…the irrelevance of predesignation,
of whether a hypothesis was thought of beforehand
or was introduced to explain known effects
(Rosenkrantz 1977, 122)
26. How might intuitively unwarranted
inferences be blocked (without error
probabilities)?
Give a high prior probability to H0: no effect, in a
Bayesian analysis
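A hedged sketch of how that block works, with illustrative numbers only (the .9 prior and the likelihoods are not from the slides):

    # Posterior of a "no effect" null under a high spike prior.
    prior_H0 = 0.9                 # high prior probability on H0
    lik_H0, lik_H1 = 0.05, 0.25    # Pr(x0; H0), Pr(x0; H1): illustrative

    post_H0 = (prior_H0 * lik_H0) / (prior_H0 * lik_H0
                                     + (1 - prior_H0) * lik_H1)
    print(f"Pr(H0 | x0) = {post_H0:.2f}")  # 0.64: H0 survives the data

Even data rare under H0 (here, probability .05) leave the null more probable than not once the prior is high enough.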
27. Harold Jeffreys
“If mere improbability of the observations, given the
hypothesis, was the criterion, any hypothesis
whatever would be rejected. Everybody rejects the
conclusion” (Jeffreys 1939/1961, 385).
Add one of two things: error probabilities of the
method, or prior probabilities in the hypotheses
28. Problems with appealing to priors
to block inferences based on
selection effects
• It still wouldn’t show what researchers had
done wrong—battle of beliefs
• The believability of data-dredged hypotheses
is what makes them so seductive
• An additional source of flexibility: priors as well as
biasing selection effects
29. No help with our key problem
• How to distinguish the warrant for a single
hypothesis H with different methods
(e.g., one has biasing selection effects, another,
pre-registered results and precautions)?
• Since there’s a single H, its prior would be the
same
30. Criticisms of P-hackers lose force
• In accounts that downplay error probabilities, the
researcher deserving criticism is given a life-raft:
31. Bem’s “Feeling the Future” 2011:
ESP?
• Daryl Bem (2011): subjects do better than chance
at predicting the (erotic) picture shown in the
future
• Some locate the start of the Replication Crisis
with Bem
• Bem admits data dredging
• Bayesian critics resort to a default Bayesian prior
on a (point) null hypothesis
32. Bem’s Response
“Whenever the null hypothesis is sharply defined but
the prior distribution on the alternative hypothesis is
diffused over a wide range of values, as it is [here] it
boosts the probability that any observed data will be
higher under the null hypothesis than under the
alternative.
This is known as the Lindley-Jeffreys paradox: A
frequentist [can always] be contradicted by a
…Bayesian analysis that concludes that the same data
are more likely under the null.” (Bem et al. 2011, 717)
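Bem’s point can be reproduced numerically: holding the data fixed, the more diffuse the prior on the alternative, the more the Bayes factor favors the sharp null. A sketch under an illustrative Normal model (the values are mine, not Bem’s; scipy assumed):

    # Lindley-Jeffreys effect: sharp null vs. increasingly diffuse
    # prior on the alternative, same data.
    from scipy.stats import norm

    m, se = 0.2, 0.1                        # observed mean, standard error
    lik_H0 = norm.pdf(m, loc=0, scale=se)   # sharp null: mu = 0

    for tau in (0.5, 2.0, 10.0):            # prior sd on mu under H1
        marg_H1 = norm.pdf(m, loc=0, scale=(se**2 + tau**2) ** 0.5)
        print(f"tau = {tau:4.1f}: BF_01 = {lik_H0 / marg_H1:.2f}")
    # BF_01 rises from ~0.7 to ~13.5 as tau grows: the same
    # "significant" data come to favor the null.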
33. III. Reformulate Tests: P-values don’t
give an effect size
Severity function: SEV(Test T, data x, claim C)
• Tests are reformulated in terms of a discrepancy γ
from H0
• Instead of a binary cut-off (significant or not), the
particular outcome is used to infer discrepancies
that are or are not warranted
34. 1-sided Normal test:
H0: μ ≤ 0 vs. H1: μ > 0 (let σ = 1, n = 100)
Reject H0 whenever M ≥ 2SE: M ≥ 0.2
M is the sample mean (significance level = .025)
Let M = .2, so I reject H0.
1SE = σ/√n = .1
What can you infer?
35. Some ask: Does this mean I can infer μ = .3?
• Inferences not in terms of points, but μ > 0 + γ
• Do we have evidence for μ > .3?
No.
• 84% of the time, M would have been larger than it is
even if μ = .3: SEV(μ > .3) is low (.16)
Pr(M < .2; μ = .3) = .16
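The .16 can be checked directly; a minimal Python sketch of this severity computation (scipy assumed):

    # SEV(mu > 0.3) with observed M = 0.2 and SE = 0.1:
    # Pr(M < 0.2; mu = 0.3) = Pr(Z < -1) = 0.16.
    from scipy.stats import norm

    M, se, mu1 = 0.2, 0.1, 0.3
    sev = norm.cdf(M, loc=mu1, scale=se)   # Pr(M < 0.2; mu = 0.3)
    print(f"SEV(mu > {mu1}) = {sev:.2f}")  # 0.16: poor warrant for mu > .3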
37. Improves on confidence intervals
which inherit problems of N-P
tests
• We do not fix a single confidence level,
• The evidential warrant for different points
in any interval is distinguished
• Go beyond a “performance goal”
38. Quick sum-up
• Main source of hand-wringing stems from
biasing selection effects
• These alter error probabilities of methods
• They don’t alter evidence in accounts that
obey the Likelihood Principle
• To a follower of the LP, the error
statistician is considering “imaginary data”
and “intentions”
39. • To the severe tester, the LP precludes a key way to
block spurious results:
What’s the value of preregistered reports?
It’s that your appraisal is altered once you consider
the probability that some hypothesis, stopping
point, …or other could have led to a false positive
• Constructive role of replication crisis:
Biasing selection effects impinge on error
probabilities
Error probabilities impinge on well-testedness
40. • Can block inferences without appeal to error
probabilities: background beliefs (probabilism)
• Gives a life-raft to the P-hacker and cherry
picker; puts blame in the wrong place
• Significance tests are a small part of error
statistics, need reformulation and a new
rationale
• Error probabilities used to assess how well-
probed claims are (probativism)
42. References
• Barnard, G. (1972). ‘The Logic of Statistical Inference (Review of “The Logic of
Statistical Inference” by Ian Hacking)’, British Journal for the Philosophy of Science
23(2), 123–32.
• Bem, D. J. 2011. “Feeling the Future: Experimental Evidence for Anomalous
Retroactive Influences on Cognition and Affect”, Journal of Personality and Social
Psychology 100(3), 407-425.
• Bem, D. J., Utts, J., and Johnson, W. 2011. “Must Psychologists Change the Way
They Analyze Their Data?”, Journal of Personality and Social Psychology 101(4),
716-719.
• Birnbaum, A. 1970. “Statistical Methods in Scientific Inference (letter to the
Editor).” Nature 225 (5237) (March 14): 1033.
• Fisher, R. A. 1947. The Design of Experiments 4th ed., Edinburgh: Oliver and Boyd.
• Goodman, S. N. 1999. “Toward evidence-based medical statistics. 2: The Bayes
factor”, Annals of Internal Medicine 130: 1005–1013.
• Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University
Press.
• Hacking, I. (1980). ‘The Theory of Probable Inference: Neyman, Peirce and
Braithwaite’, in Mellor, D. (ed.), Science, Belief and Behavior: Essays in Honour of
R. B. Braithwaite, Cambridge: Cambridge University Press, pp. 141–60.
• Ioannidis, J. (2005). “Why Most Published Research Findings are False”, PLoS
Medicine 2(8), 0696–0701.
• Jeffreys, H. ([1939]/ 1961). Theory of Probability. Oxford: Oxford University
Press.
43. • Lindley, D. V. 1971. “The Estimation of Many Parameters.” In Foundations of
Statistical Inference, edited by V. P. Godambe and D. A. Sprott, 435–455. Toronto:
Holt, Rinehart and Winston.
• Mayo, D. G. 1996. Error and the Growth of Experimental Knowledge. Science and
Its Conceptual Foundation. Chicago: University of Chicago Press.
• Mayo, D. G. 2018. Statistical Inference as Severe Testing: How to Get Beyond the
Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. and Cox, D. R. (2006). “Frequentist Statistics as a Theory of Inductive
Inference” in Rojo, J. (ed.) The Second Erich L. Lehmann Symposium: Optimality,
2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical
Statistics: 77–97.
• Mayo, D. G., and A. Spanos. 2006. “Severe Testing as a Basic Concept in a
Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of
Science 57 (2) (June 1): 323–357.
• Mayo, D. G., and A. Spanos. 2011. “Error Statistics.” In Philosophy of
Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster,
7:152–198. Handbook of the Philosophy of Science. The Netherlands:
Elsevier.
• Morrison, D. E., and R. E. Henkel, ed. 1970. The Significance Test
Controversy: A Reader. Chicago: Aldine De Gruyter.
• Pearson, E. S. & Neyman, J. (1930). “On the problem of two samples”, Joint
Statistical Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of
Calif. Press). First published in Bul. Acad. Pol.Sci. 73-96.
44. • Rosenkrantz, R. 1977. Inference, Method and Decision: Towards a Bayesian
Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
• Savage, L. J. 1962. The Foundations of Statistical Inference: A Discussion.
London: Methuen.
• Selvin, H. 1970. “A critique of tests of significance in survey research”. In The
Significance Test Controversy, edited by D. Morrison and R. Henkel, 94–106.
Chicago: Aldine De Gruyter.
• Simmons, J., Nelson, L., and Simonsohn, U. (2012). “A 21 Word Solution”,
Dialogue: The Official Newsletter of the Society for Personality and Social
Psychology, 26(2), 4–7.
• Wagenmakers, E-J., 2007. “A Practical Solution to the Pervasive Problems of P
values”, Psychonomic Bulletin & Review 14(5): 779-804.
45. SEV(μ > μ1) = Pr(M < .2; μ = .3)
= Pr(Z < −1) = .16,
where Z = (.2 − .3)/.1 = −1
46. Severity for Test T+:
SEV(T+, d(x0), claim C)
Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0, known σ;
discrepancy parameter γ; μ1 = μ0 + γ; d0 = d(x0),
the observed value of the test statistic d(X) = √n(M − μ0)/σ
SIR (Severity Interpretation with low P-values):
• (a) (high): If there’s a very low probability that so
large a d0 would have resulted if μ were no greater
than μ1, then d0 indicates μ > μ1: SEV(μ > μ1) is
high.
• (b) (low): If there is a fairly high probability that d0
would have been larger than it is, even if μ = μ1, then
d0 is not a good indication μ > μ1: SEV(μ > μ1) is low.
47. SIN (Severity Interpretation for
Negative Results)
• (a): (high) If there is a very high probability
that d0 would have been larger than it is, were
μ > μ1, then μ ≤ μ1 passes the test with high
severity: SEV(μ ≤ μ1) is high.
• (b): (low) If there is a low probability that d0
would have been larger than it is, even if μ >
μ1, then μ ≤ μ1 passes with low severity:
SEV(μ ≤ μ1) is low.
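Under this Normal model, SIR and SIN each reduce to a single tail-area computation; here is a hedged sketch of a generic severity function (the name sev and its interface are mine, not Mayo’s; scipy assumed):

    # Severity for the one-sided Normal test T+ (known sigma).
    # SIR: SEV(mu > mu1)  = Pr(M < M_obs; mu = mu1)
    # SIN: SEV(mu <= mu1) = Pr(M > M_obs; mu = mu1)
    from scipy.stats import norm

    def sev(M_obs, mu1, sigma, n, claim=">"):
        """Severity for the claim mu > mu1 (or mu <= mu1)."""
        se = sigma / n ** 0.5
        if claim == ">":
            return norm.cdf(M_obs, loc=mu1, scale=se)
        return norm.sf(M_obs, loc=mu1, scale=se)

    print(sev(0.2, 0.3, sigma=1, n=100))   # 0.16, as on slides 35/45
    print(sev(0.2, 0.0, sigma=1, n=100))   # 0.98: good indication mu > 0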
48. Jimmy Savage on the LP:
“According to Bayes' theorem,…. if y is the
datum of some other experiment, and if it
happens that P(x|µ) and P(y|µ) are
proportional functions of µ (that is,
constant multiples of each other), then
each of the two data x and y have exactly
the same thing to say about the values of
µ…” (Savage 1962, p. 17)
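Savage’s claim is easy to verify numerically: proportional likelihoods yield identical posteriors whatever the prior. A sketch with a binomial-style likelihood kernel on a grid (the flat prior and the constant 3.0 are illustrative; numpy assumed):

    # Proportional likelihoods => identical posteriors (Savage's point).
    import numpy as np

    mu = np.linspace(0.01, 0.99, 99)        # grid over the parameter
    prior = np.ones_like(mu) / mu.size      # flat prior, illustrative

    lik_x = mu**7 * (1 - mu)**3             # likelihood from datum x
    lik_y = 3.0 * lik_x                     # datum y: proportional in mu

    post = lambda lik: prior * lik / np.sum(prior * lik)
    print(np.allclose(post(lik_x), post(lik_y)))  # True: same posterior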