Evidence
as Passing a Severe Test
(How it Gets You Beyond the Statistics Wars)
Deborah G Mayo
Dept of Philosophy, Virginia Tech
CUNY Graduate Center Philosophy Colloquium
April 28th, 2021
In a conversation with Sir David Cox:
COX: Deborah, in some fields foundations do not
seem very important, but we both think foundations of
statistical inference are important; why do you think
that is?
MAYO: …in statistics …we invariably cross into
philosophical questions about empirical knowledge,
evidence and inductive inference.
(“A Statistical Scientist Meets a Philosopher of
Science” 2011)
Role of probability: performance or
probabilism?
(Frequentist vs. Bayesian)
• Statistical Inference
• Unifications and Eclecticism
• Long-standing battles still simmer below the
surface (agreement on numbers)
Statistical inference as severe testing
• Brush the dust off pivotal debates in relation to
today’s statistical crisis in science
• We set sail with a simple tool: If little or nothing
has been done to rule out flaws in inferring
claim C, then you don’t have evidence for it
• Sufficiently general to apply to any methods
now in use
• You needn’t accept this philosophy to use it to
excavate the statistics wars
A philosophical excursion
“Taking the severity principle, along with the aim
that we desire to find things out… let’s set sail on a
philosophical excursion to illuminate statistical
inference.” --a special interest cruise
• And at the same time revisit classic problems:
induction, falsification, demarcation of science
Most findings are false?
“Several methodologists have pointed out that the high
rate of nonreplication of research discoveries is a
consequence of the convenient, yet ill-founded strategy
of claiming conclusive research findings solely on the
basis of a single study assessed by formal statistical
significance, typically for a p-value less than 0.05. …
It can be proven that most claimed research findings are
false.” (John Ioannidis 2005, 0696)
R.A. Fisher
“[W]e need, not an isolated record, but a reliable
method of procedure. In relation to the test of
significance, we may say that a phenomenon is
experimentally demonstrable when we know how to
conduct an experiment which will rarely fail to give
us a statistically significant result.” (Fisher 1947, 14)
Simple significance tests (Fisher)
“p-value. …to test the conformity of the particular data
under analysis with H0 in some respect:
…we find a function T = t(y) of the data, the test
statistic, such that
• the larger the value of T the more inconsistent are
the data with H0;
• T = t(Y) has a known probability distribution
when H0 is true.
…the p-value corresponding to any t_obs as
p = p(t_obs) = Pr(T ≥ t_obs; H0)”
(Mayo and Cox 2006, 81)
Testing reasoning
• If even larger differences than t_obs occur fairly
frequently under H0 (i.e., P-value is not small),
there’s scarcely evidence of incompatibility
with H0
• Small P-value indicates some underlying
discrepancy from H0 because very probably
you would have seen a less impressive
difference than t_obs were H0 true.
• This still isn’t evidence of a genuine statistical
effect H1, let alone a scientific conclusion H*
Stat-Sub fallacy H => H*
Fallacy of rejection
• H* makes claims that haven’t been probed by the
statistical test
• The moves from experimental interventions to H*
don’t get enough attention–but your statistical
account should block them
Neyman-Pearson (N-P) tests:
Null and alternative hypotheses H0, H1
that are exhaustive*
H0: μ ≤ 0 vs. H1: μ > 0
“no effect” vs. “some positive effect”
• So this fallacy of rejection (H1 => H*) is blocked
• Rejecting H0 only indicates statistical alternatives
H1 (how discrepant from null)
*(introduces Type II error, and power)
Both Fisher and N-P methods: it’s
easy to lie with statistics with
biasing selection effects
• Sufficient finagling—cherry-picking, significance
seeking, multiple testing, post-data subgroups,
trying and trying again—may practically
guarantee a preferred claim H gets support,
even if it’s unwarranted by evidence
Severity Requirement:
If the test had little or no capability of finding flaws
with H (even if H is incorrect), then agreement
between data x0 and H provides poor (or no)
evidence for H
• Such a test fails a minimal requirement for a
stringent or severe test
• A claim passes severely only if it has been
subjected to and passes a test that would
have, with high probability, found it flawed or
specifiably false (if it is).
• This probability is the severity with which it
has passed the test, and is a measure of
evidential warrant
A claim is warranted to the extent
it passes severely
This alters the role of probability:
Probabilism. To assign a degree of probability,
confirmation, support or belief in a hypothesis,
given data x0 (absolute or comparative)
(e.g., Bayesian, likelihoodist, Fisher (at times))
Performance. Ensure long-run reliability of
methods, coverage probabilities (frequentist,
behavioristic Neyman-Pearson, Fisher (at times))
• Neither “probabilism” nor “performance” directly
captures assessing error probing capacity
• Good long-run performance is a necessary, not
a sufficient, condition for severity
Key to solving a major
philosophical problem for
frequentists
• Why is good performance relevant for
inference in the case at hand?
• What bothers you about selective reporting,
cherry-picking, stopping when the data look
good, P-hacking?
• These are not problems about long runs—
• We cannot say the case at hand has done
a good job of avoiding the sources of
misinterpreting data
• Performance is relevant when it teaches
us about the capabilities of our methods
• Basis of severe testing philosophy
A claim C is not warranted _______
• Probabilism: unless C is true or probable (gets
a probability boost, made comparatively firmer)
• Performance: unless it stems from a method
with low long-run error
• Probativism (severe testing) unless something
(a fair amount) has been done to probe ways we
can be wrong about C
Severe Tests
Informal example: To test if I’ve gained weight
between the start of the pandemic and now, I use a
series of well-calibrated and stable scales, both at
the start and now.
All show an over 4 lb gain; none shows a difference
in weighing a known object (EGEK). I’m forced to infer:
H: I’ve gained at least 4 pounds
• Giving the properties of the weighing methods is
akin to giving the properties of statistical tests
(performance).
• No one claims the justification is merely long run,
telling us nothing about my weight.
• We argue to the source of the readings from the
high capacity to reveal if any scales were wrong
The severe tester is assumed to be in
a context of wanting to find things out
• I could insist all the scales are wrong—they work fine
with weighing known objects—but this would prevent
correctly finding out about weight….. (rigged
alternative)
• What sort of extraordinary circumstance could cause
them all to go astray just when we do not know the
weight of the test object?
Statistical Inference and Sexy Science
Even large scale theories connect with data only
by intermediate hypotheses and models.
Next month marks 102 years since May 29, 1919:
Testing GTR
On Einstein's theory of gravitation, light passing near
the sun is deflected by an angle λ, reaching 1.75”,
for light just grazing the sun.
Only detectable during a total eclipse, which “by
strange good fortune” would occur on May 29, 1919
(Eddington [1920] 1987, p. 113).
Two key stages of inquiry
i. is there a deflection effect of the amount
predicted by Einstein as against Newton
(0.87")?
ii. is it "attributable to the sun's gravitational field"
as described in Einstein's hypothesis?
Eclipse photos of stars (eclipse plate) compared to
their positions photographed at night when the effect
of the sun is absent (the night plate)–a control.
The technique was known to astronomers from
determining stellar parallax, “for which much greater
accuracy is required” (Eddington [1920] 1987, pp. 115–16).
The problem in (i) is reduced to a statistical one: the
observed mean deflections (from sets of
photographs) are normally distributed around the
predicted mean deflection μ.
H0: μ ≤ 0.87 vs. H1: μ > 0.87
H1 includes the Einsteinian value of 1.75.
Two expeditions: to Sobral, North Brazil, and Principe,
Gulf of Guinea (West Africa)
A year of checking instrumental and other errors…
Sobral: μ = 1.98" ± 0.18"
Principe: μ = 1.61" ± 0.45"
(in probable errors, 0.12 and 0.30 respectively; 1
probable error is 0.68 standard errors (SE))
“It is usual to allow a margin of safety of about twice
the probable error on either side of the mean.” [~1.4
SE]. The Principe plates are just sufficient to rule out
the ‘half-deflection’; the Sobral plates exclude it
(Eddington [1920] 1987, p. 118).
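As a rough, illustrative check of the margin-of-safety reasoning (1 probable error is more precisely 0.6745 standard errors, which the slide rounds to 0.68):

```python
# Back-of-envelope check of Eddington's reasoning (illustrative only).
NEWTON_HALF_DEFLECTION = 0.87  # arcseconds

def se_from_probable_error(pe: float) -> float:
    # 1 probable error = 0.6745 standard errors for a Normal distribution
    return pe / 0.6745

for site, mean, pe in [("Sobral", 1.98, 0.12), ("Principe", 1.61, 0.30)]:
    z = (mean - NEWTON_HALF_DEFLECTION) / se_from_probable_error(pe)
    print(f"{site}: {z:.1f} SE above the half-deflection")
# Sobral sits ~6 SE above 0.87" (clearly excluded); Principe ~1.7 SE,
# just beyond the ~1.4 SE margin of safety
```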
(ii) Is the effect "attributable to the sun's
gravitational field”? (Can’t assume H*)
Using the known eclipse effect to explain it while
saving Newton from falsification is unproblematic–if
each conjecture is severely tested.
Sir Oliver Lodge’s “ether effect” was one of many
(e.g., shadow, corona).
Were any other cause to exist that produced a
considerable fraction of the deflection effect that
alone would falsify the Einstein hypothesis (which
asserts that all of the 1.75" are due to gravity)
(Jeffreys 1919, p. 138).
Each Newton-saving hypothesis collapsed on the
basis of a one-two punch:
1. the magnitude of effect that could have been
due to the conjectured factor is far too small to
account for the eclipse effect; and
2. if large enough to account for the eclipse effect,
it would have false or contradictory implications
elsewhere.
The Newton-saving factors might have been
plausible but they were unable to pass severe tests.
Saving Newton this way would be bad science.
More Severe Tests of GTR in the 1970s
• Radio interferometry data from quasars (quasi-stellar
radio sources) are more capable of uncovering
errors, and discriminating values of the deflection
than the crude eclipse tests.
• The Einstein deflection effect “passed” the test, but
even then, they couldn’t infer all of GTR severely.
• “The [Einstein] law is firmly based on experiment,
even the complete abandonment of the theory would
scarcely affect it” (Eddington [1920] 1987, p. 126)
Popper, GTR and Severity
[T]he impressive thing about [the 1919 tests of
Einstein’s theory of gravity] is the risk involved in a
prediction of this kind. … The theory is incompatible
with certain possible results of observation–in fact
with results which everybody before Einstein would
have expected. This is quite different from [Freud
and Adlerian psychology] (Popper 1962, p. 36)
The problem with Freudian and Adlerian
psychology
• Any observed behavior – jumping in the water to
save a child, or failing to save her – can be
accounted for by Adlerian inferiority complexes, or
Freudian theories of sublimation or Oedipal
complexes (Popper 1962, p. 35).
• I’d modify Popper: it needn’t be the flexibility of
the theory but of the overall inquiry: research
question, auxiliaries, and interpretive rules.
• The flexibility isn’t picked up on in logics of
induction
Popper denies that severity can be formalized by
any confirmation logics or logics of induction
“the probability of a statement . . . simply does not
express an appraisal of the severity of the tests a
theory has passed, or of the manner in which it has
passed these tests” (pp. 394– 5).
Wars between Popper and logics of
induction, relevant for today’s
statistics wars: Alan Musgrave
“According to modern logical empiricist orthodoxy, in
deciding whether hypothesis h is confirmed by
evidence e, . . . we must consider only the statements h
and e, and the logical relations [C(h,e)] between them.
It is quite irrelevant whether e was known first and h
proposed to explain it, or whether e resulted from
testing predictions drawn from h”.
(Alan Musgrave 1974, p. 2)
Likelihood Principle (LP)
In logics of induction, like probabilist accounts (as
I’m using the term) the import of the data is via the
ratios of likelihoods of hypotheses
Pr(x0;H0)/Pr(x0;H1)
The data x0 are fixed, while the hypotheses vary
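A toy instance (the numbers are mine, not the slide's) of how fixed data x0 enter only through the likelihood ratio:

```python
# Toy likelihood-ratio computation: Pr(x0;H0)/Pr(x0;H1) for a binomial
# experiment (the hypotheses and counts are illustrative choices).
from math import comb

def binom_likelihood(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1 - p)**(n - k)

k, n = 12, 20  # 12 successes in 20 Bernoulli trials
lr = binom_likelihood(k, n, 0.5) / binom_likelihood(k, n, 0.6)
# lr < 1: these data fit H1: p = 0.6 better than H0: p = 0.5, and on the
# LP this import is the same however the hypotheses came to be chosen
```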
Comparative Logic of Support
• Ian Hacking (1965) “Law of Likelihood”:
x supports hypothesis H0 less well than H1 if
Pr(x;H0) < Pr(x;H1)
(rejected in 1980)
• Any hypothesis that perfectly fits the data is
maximally likely
• “there always is such a rival hypothesis viz., that
things just had to turn out the way they actually
did” (Barnard 1972, 129).
N-P error probabilities and
Popper’s methodological probabilities
• Pr(H0 is less well supported than H1; H0) is high
for some H1 or other
“In order to fix a limit between ‘small’ and ‘large’
values of [the likelihood ratio] we must know how
often such values appear when we deal with a
true hypothesis.” (Pearson and Neyman 1967,
106)
Fishing for significance
(nominal vs. actual)
Suppose that twenty sets of differences have
been examined, that one difference seems large
enough to test and that this difference turns out
to be ‘significant at the 5 percent level.’ ….The
actual level of significance is not 5 percent,
but 64 percent! (Selvin 1970, 104)
(Morrison & Henkel’s Significance Test Controversy
1970!)
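Selvin's arithmetic is easy to reproduce: with 20 independent tests of true nulls at the nominal 0.05 level,

```python
# Probability that at least one of k independent tests reaches nominal
# alpha significance when every null hypothesis is true.
alpha, k = 0.05, 20
actual_level = 1 - (1 - alpha) ** k
print(round(actual_level, 2))  # 0.64 — the "64 percent" quoted above
```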
Spurious P-Value
The data dredger reports: such results would be
difficult to achieve under the assumption of H0,
when in fact such results are common under the
assumption of H0
• There are many more ways to be wrong with
biasing selection effects
• Need to adjust P-values or at least report the
multiple testing
Some accounts of evidence object:
“Two problems that plague frequentist inference:
multiple comparisons and multiple looks, or…data
dredging and peeking at the data. The frequentist
solution to both problems involves adjusting the
P-value…
But adjusting the measure of evidence because
of considerations that have nothing to do with
the data defies scientific sense” (Goodman 1999,
1010)
(Co-director, with Ioannidis, of the Meta-Research
Innovation Center at Stanford)
All error probabilities
violate the LP:
Sampling distributions, significance levels, power,
all depend on something more [than the likelihood
function]–something that is irrelevant in Bayesian
inference–namely the sample space
(Lindley 1971, 436)
The LP implies…the irrelevance of predesignation,
of whether a hypothesis was thought of before
hand or was introduced to explain known effects
(Rosenkrantz 1977, 122)
Many “reforms” offered as
alternative to significance tests
follow the LP
• “Bayes factors can be used in the complete absence
of a sampling plan…” (Bayarri, Benjamin, Berger,
Sellke 2016, 100)
• “It seems very strange that a frequentist could not
analyze a given set of data… if the stopping rule is
not given… Data should be able to speak for itself.”
(Berger and Wolpert 1988, 78; authors of The
Likelihood Principle)
At odds with fraud-busters:
21 Word Solution
“We report how we determined our sample size,
and data exclusions (if any), all manipulations, and
all measures in the study” (Simmons, Nelson, and
Simonsohn 2012, 4).
• Replication researchers find that selection effects–
data-dependent hypotheses, fishing, and stopping
rules–are a major source of failed replication
Inferences based on biasing
selection effects might be blocked
with Bayesian prior probabilities
(without error probabilities)?
• Supplement with subjective beliefs: What do I
believe? As opposed to What is the evidence?
(Royall 1997)
• Likelihoodists + prior probabilities
Problems with appealing to priors to
block inferences based on
selection effects
• Doesn’t show what researchers had done wrong—
battle of beliefs
• The believability of data-dredged hypotheses is
what makes them so seductive
• An additional source of flexibility: priors combined
with biasing selection effects
No help with the severe tester’s
key problem
• How to distinguish the warrant for a single
hypothesis H with different methods
(e.g., one has biasing selection effects,
another, pre-registered results and
precautions)?
• Since there’s a single H, its prior would be the
same
Most Bayesians (last decade) use
“default” priors: unification
• ‘Eliciting’ subjective priors is too difficult; scientists are
reluctant for subjective beliefs to overshadow data
“[V]irtually never would different experts give prior
distributions that even overlapped” (J. Berger 2006,
392)
• Default priors are supposed to prevent prior beliefs
from influencing the posteriors–data dominant
How should we interpret them?
• “The priors are not to be considered expressions of
uncertainty, ignorance, or degree of belief.
Conventional priors may not even be probabilities…”
(Cox and Mayo 2010, 299)
• No agreement on rival systems for default/non-subjective
priors
(invariance, maximum entropy, maximizing missing
information, matching)
Criticisms of Data-Dredgers Lose
Force
• Wanting to promote an account that downplays
error probabilities, reformers hand the researcher
deserving criticism a life-raft
• One of the ironies of today’s reforms
Bem’s “Feeling the Future” 2011:
ESP?
• Daryl Bem (2011): subjects do better than chance
at predicting the (erotic) picture shown in the future
• Some locate the start of the Replication Crisis with
Bem
• Bem admits data dredging
• Bayesian critics resort to a default Bayesian prior
on a (point) null hypothesis
Bem’s Response
“Whenever the null hypothesis is sharply defined but
the prior distribution on the alternative hypothesis is
diffused over a wide range of values, as it is [here] it
boosts the probability that any observed data will be
higher under the null hypothesis than under the
alternative.
This is known as the Lindley-Jeffreys paradox*: A
frequentist [can always] be contradicted by a
…Bayesian analysis that concludes that the same data
are more likely under the null.” (Bem et al. 2011, 717)
*Bayes-Fisher disagreement
Many of Today’s Statistics wars
trace to P-values vs posteriors
• The posterior probability Pr(H0|x) can be large while
the P-value is small (2-sided test, spike and smear)
• To the Bayesian, the P-value exaggerates the
evidence against H0
• To the significance tester: the Bayesian is biasing
results to favor H0
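A minimal numerical sketch of the spike-and-smear disagreement (the 0.5 spike, the N(0, τ²) smear with τ = 1, and n = 1000 are my illustrative choices, not from the slides):

```python
# Spike-and-smear: a result just significant at two-sided P ≈ 0.05
# can still give H0 a high posterior probability.
from math import exp, pi, sqrt

def normal_pdf(x: float, mu: float, sd: float) -> float:
    return exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * sqrt(2 * pi))

n, sigma, tau = 1000, 1.0, 1.0
se = sigma / sqrt(n)
xbar = 1.96 * se  # just significant: two-sided P ≈ 0.05

m0 = normal_pdf(xbar, 0.0, se)                    # marginal under H0: mu = 0
m1 = normal_pdf(xbar, 0.0, sqrt(se**2 + tau**2))  # marginal under the smear
posterior_h0 = 0.5 * m0 / (0.5 * m0 + 0.5 * m1)
# posterior_h0 ≈ 0.82: H0 gets a high posterior from the very data the
# significance tester counts as evidence against it
```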
Some Bayesians reject probabilism
(Gelman: Falsificationist Bayesian;
Shalizi: error statistician)
• “[C]rucial parts of Bayesian data analysis, such as
model checking, can be understood as ‘error
probes’ in Mayo’s sense” which might be seen as
using modern statistics to implement the
Popperian criteria of severe tests. (Andrew
Gelman and Cosma Shalizi 2013, 10).
• Last part of SIST: (Probabilist) Foundations Lost,
(Probative) Foundations Found
Severity directs a reformulation of
tests
Severity function: SEV(Test T, data x, claim C)
• Tests are reformulated in terms of a discrepancy γ
from H0
• Instead of a binary cut-off (significant or not) the
particular outcome is used to infer discrepancies
that are and are not warranted
• Poorly warranted claims must be reported
Using Severity to Avoid Fallacies:
Fallacy of Rejection: Large n
problem
• Fixing the P-value while increasing the sample size n,
the cut-off gets smaller
• We reach a point where x is closer to the null than to
various alternatives
• Many would lower the P-value requirement as n
increases; severity reasoning can always avoid inferring
a discrepancy beyond what’s warranted:
Severity tells us:
• an α-significant difference indicates less of a
discrepancy from the null if it results from a larger (n1)
rather than a smaller (n2) sample size (n1 > n2)
• What’s more indicative of a large effect (fire): a fire
alarm that goes off with burnt toast, or one that
doesn’t go off unless the house is fully ablaze?
• [The larger sample size is like the alarm that goes off
with burnt toast]
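The burnt-toast comparison can be made concrete with a small computation; the severity benchmark 0.84 and the two sample sizes are my illustrative choices:

```python
# Hold the attained significance fixed at P = 0.025 (xbar = 1.96 SE) and
# ask what discrepancy mu_1 is warranted with severity ≈ 0.84 (xbar one
# SE beyond mu_1): the warranted discrepancy shrinks as n grows.
from math import sqrt

sigma = 1.0
for n in (100, 10_000):
    se = sigma / sqrt(n)
    xbar = 1.96 * se          # same nominal significance at both n
    mu_1 = xbar - se          # SEV(mu > mu_1) ≈ 0.84
    print(n, round(mu_1, 4))  # larger n warrants only a smaller mu_1
```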
What About Fallacies of
Non-Significant Results?
• They don’t warrant a 0 discrepancy
• There are discrepancies the test had little
probability of detecting
• Using severity reasoning: rule out discrepancies
that very probably would have resulted in larger
differences than observed; set upper bounds
• If you very probably would have observed a
larger value of the test statistic (a smaller P-value)
were μ = μ1, then the data indicate that μ < μ1:
SEV(μ < μ1) is high
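This upper-bound reasoning can be sketched for a one-sided Normal test of H0: μ ≤ 0 with known σ (the function name and numbers are illustrative, not from the slides):

```python
# Sketch of SEV(mu < mu1) after a non-significant result.
from math import erf, sqrt

def normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def sev_mu_less_than(mu1: float, xbar: float, sigma: float, n: int) -> float:
    # SEV(mu < mu1): probability of a larger observed mean than xbar,
    # were mu actually equal to mu1
    return 1.0 - normal_cdf((xbar - mu1) / (sigma / sqrt(n)))

# Non-significant result: xbar = 0.1 with sigma = 1, n = 100 (SE = 0.1)
sev = sev_mu_less_than(0.3, xbar=0.1, sigma=1.0, n=100)
# ≈ 0.977: were mu >= 0.3, a larger xbar would very probably have
# occurred, so the data warrant the upper bound mu < 0.3
```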
Confidence Intervals Are Also
Reinterpreted
Duality between tests and intervals: values within the
(1 − α) CI are non-rejectable at the α level
• Too dichotomous: in/out, plausible/not plausible
• Fixed confidence levels (need several
benchmarks)
• Justified in terms of long-run coverage
(performance), if interpreted correctly
Duality of Tests and CIs
(estimating μ in a Normal Distribution)
μ > M0 – 1.96σ/√n CI-lower
μ < M0 + 1.96σ/√n CI-upper
M0 : the observed sample mean
CI-lower: the value of Îź that M0 is statistically
significantly greater than at P= 0.025
CI-upper: the value of Îź that M0 is statistically
significantly lower than at P= 0.025
• You could get a CI by asking for these values,
and learn indicated effect sizes with tests
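Spelled out in code (σ known, 1.96 for the 95% two-sided interval; the numbers are illustrative):

```python
# CI bounds from the duality above: CI-lower/CI-upper are the mu values
# that M0 is just significantly greater/less than at P = 0.025.
from math import sqrt

def ci_95(m0: float, sigma: float, n: int) -> tuple[float, float]:
    se = sigma / sqrt(n)
    return m0 - 1.96 * se, m0 + 1.96 * se  # (CI-lower, CI-upper)

lo, hi = ci_95(m0=0.4, sigma=1.0, n=100)  # SE = 0.1
# lo = 0.204, hi = 0.596
```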
We get an inferential rationale absent from CIs
CI Estimator:
CI-lower < μ < CI-upper
Because it came from a procedure with good
coverage probability
Severe Tester:
μ > CI-lower because with high probability (.975) we
would have observed a smaller M0 if μ ≤ CI-lower
μ < CI-upper because with high probability (.975)
we would have observed a larger M0 if μ ≥ CI-upper
FEV: Frequentist Principle of Evidence; Mayo and
Cox (2006); SEV: Mayo 1991, Mayo and Spanos
(2006)
FEV/SEV: A small P-value indicates a discrepancy γ from H0 if,
and only if, there is a high probability the test would have
resulted in a larger P-value were a discrepancy as large as γ
absent.
FEV/SEV: A moderate P-value indicates the absence of a
discrepancy γ from H0 only if there is a high probability
the test would have given a worse fit with H0 (i.e., a
smaller P-value) were a discrepancy γ present.
Sum-up
• I begin with a minimal requirement for evidence: data
are evidence for C only if C has been subjected to, and
passes, a test it probably would have failed if C were false
• Biasing selection effects make it easy to find
impressive-looking effects erroneously
• They alter a method’s error probing capacities
• They may not alter evidence (in traditional
probabilisms): Likelihood Principle (LP)
• To the LP holder: to consider what could have
happened but didn’t is to consider “imaginary data”
• To the severe tester, probabilists are robbed of a
main way to block spurious results
• Severity principles direct the reinterpretation of
significance tests and other methods
• Probabilists may block inferences without appeal to
error probabilities: high prior to H0 (no effect) can
result in a high posterior probability to H0
• Gives a life-raft to the P-hacker and cherry picker;
puts blame in the wrong place
• Piecemeal statistical inferences (or informal
counterparts) link data to scientific claims at multiple
levels
• A silver lining to distinguishing highly probable and
highly probed–can use different methods for different
contexts
• Some Bayesians may find their foundations in error
statistics
• Last excursion: (probabilist) foundations lost;
(probative) foundations found
References
• Barnard, G. (1972). ‘The Logic of Statistical Inference (Review of “The Logic of
Statistical Inference” by Ian Hacking)’, British Journal for the Philosophy of Science
23(2), 123–32.
• Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and
Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses."
Journal of Mathematical Psychology 72: 90-103.
• Bem, D. (2011). “Feeling the Future: Experimental Evidence for Anomalous
Retroactive Influences on Cognition and Affect”, Journal of Personality and Social
Psychology 100(3), 407-425.
• Bem, D., Utts, J., and Johnson, W. (2011). “Must Psychologists Change the Way
They Analyze Their Data?”, Journal of Personality and Social Psychology 101(4),
716-719.
• Berger, J. O. (2006). “The Case for Objective Bayesian Analysis.” Bayesian
Analysis 1 (3): 385–402.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6
Lecture Notes-Monograph Series. Hayward, CA: Institute of Mathematical
Statistics.
• Cox, D. R., and Mayo, D. G. (2010). “Objectivity and Conditionality in
Frequentist Inference.” In Error and Inference: Recent Exchanges on
Experimental Reasoning, Reliability, and the Objectivity and Rationality of
Science, edited by Deborah G. Mayo and Aris Spanos, 276–304. Cambridge:
Cambridge University Press.
• Cox, D. and Mayo, D. (2011). “A Statistical Scientist Meets a Philosopher of
Science: A Conversation between Sir David Cox and Deborah Mayo”, in
Rationality, Markets and Morals (RMM) 2, 103–14.
• Eddington, A. ([1920]1987). Space, Time and Gravitation: An Outline of the
General Relativity Theory, Cambridge Science Classics Series. Cambridge:
Cambridge University Press.
• Fisher, R. A. (1947). The Design of Experiments 4th ed., Edinburgh: Oliver and
Boyd.
• Gelman, A. and Shalizi, C. (2013). “Philosophy and the Practice of Bayesian
Statistics” and “Rejoinder’” British Journal of Mathematical and Statistical
Psychology 66(1): 8–38; 76-80.
• Goodman, S. N. (1999). “Toward Evidence-based Medical Statistics. 2: The Bayes
Factor,” Annals of Internal Medicine 130: 1005–13.
• Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University
Press.
• Hacking, I. (1980). ‘The Theory of Probable Inference: Neyman, Peirce and
Braithwaite’, in Mellor, D. (ed.), Science, Belief and Behavior: Essays in Honour of
R. B. Braithwaite, Cambridge: Cambridge University Press, pp. 141–60.
• Ioannidis, J. (2005). “Why Most Published Research Findings are False”, PLoS
Medicine 2(8), 0696–0701.
• Jeffreys, H. (1919). ‘Contribution to Discussion on the Theory of Relativity’, and
‘On the Crucial Test of Einstein’s Theory of Gravitation’, Monthly Notices of the
Royal Astronomical Society 80, 96–118; 138–54.
• Lindley, D. V. (1971). “The Estimation of Many Parameters.” in Godambe, V. and
Sprott, D. (eds.), Foundations of Statistical Inference 435–455. Toronto: Holt,
Rinehart and Winston.
• Lodge, O. (1919). ‘Contribution to “Discussion on the Theory of Relativity”’,
Monthly Notices of the Royal Astronomical Society 80, 106–9.
• Mayo, D. (1991). ‘Novel Evidence and Severe Tests’, Philosophy of Science 58(4),
523–52.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science
and Its Conceptual Foundation. Chicago: University of Chicago Press.
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond
the Statistics Wars, Cambridge: Cambridge University Press.
• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive
Inference” in Rojo, J. (ed.) The Second Erich L. Lehmann Symposium: Optimality,
2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical
Statistics: 247-275.
• Mayo, D. G., and A. Spanos. (2006). “Severe Testing as a Basic Concept in a
Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of
Science 57(2): 323–57.
• Mayo, D. G., and A. Spanos (2011). “Error Statistics.” In Philosophy of Statistics,
edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198.
Handbook of the Philosophy of Science. The Netherlands: Elsevier.
• Morrison, D. E., and R. E. Henkel, (eds.), (1970). The Significance Test
Controversy: A Reader. Chicago: Aldine De Gruyter.
• Musgrave, A. (1974). ‘Logical versus Historical Theories of Confirmation’, The
British Journal for the Philosophy of Science 25(1), 1–23.
• Pearson, E. S. & Neyman, J. (1967). “On the problem of two samples”, Joint
Statistical Papers by J. Neyman & E.S. Pearson, 99-115 (Berkeley: U. of Calif.
Press). First published 1930 in Bul. Acad. Pol.Sci. 73-96.
• Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific
Knowledge. New York: Basic Books.
• Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian
Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
• Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Boca Raton FL:
Chapman and Hall, CRC press.
• Selvin, H. (1970). “A Critique of Tests of Significance in Survey Research,” in The
Significance Test Controversy, edited by D. Morrison and R. Henkel, 94–106.
Chicago: Aldine De Gruyter.
• Simmons, J. Nelson, L. and Simonsohn, U. (2012) “A 21 word solution”, Dialogue:
The Official Newsletter of the Society for Personality and Social Psychology 26(2),
4–7.
 
D. G. Mayo: Your data-driven claims must still be probed severely
D. G. Mayo: Your data-driven claims must still be probed severelyD. G. Mayo: Your data-driven claims must still be probed severely
D. G. Mayo: Your data-driven claims must still be probed severely
 
Phil6334 day#4slidesfeb13
Phil6334 day#4slidesfeb13Phil6334 day#4slidesfeb13
Phil6334 day#4slidesfeb13
 
D. Mayo: Putting the brakes on the breakthrough: An informal look at the argu...
D. Mayo: Putting the brakes on the breakthrough: An informal look at the argu...D. Mayo: Putting the brakes on the breakthrough: An informal look at the argu...
D. Mayo: Putting the brakes on the breakthrough: An informal look at the argu...
 
Statistical Flukes, the Higgs Discovery, and 5 Sigma
Statistical Flukes, the Higgs Discovery, and 5 Sigma Statistical Flukes, the Higgs Discovery, and 5 Sigma
Statistical Flukes, the Higgs Discovery, and 5 Sigma
 
Byrd statistical considerations of the histomorphometric test protocol (1)
Byrd statistical considerations of the histomorphometric test protocol (1)Byrd statistical considerations of the histomorphometric test protocol (1)
Byrd statistical considerations of the histomorphometric test protocol (1)
 
April 3 2014 slides mayo
April 3 2014 slides mayoApril 3 2014 slides mayo
April 3 2014 slides mayo
 
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
Frequentist Statistics as a Theory of Inductive Inference (2/27/14)
 
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
Surrogate Science: How Fisher, Neyman-Pearson, and Bayes Were Transformed int...
 
Controversy Over the Significance Test Controversy
Controversy Over the Significance Test ControversyControversy Over the Significance Test Controversy
Controversy Over the Significance Test Controversy
 
Final mayo's aps_talk
Final mayo's aps_talkFinal mayo's aps_talk
Final mayo's aps_talk
 
Mayo &amp; parker spsp 2016 june 16
Mayo &amp; parker   spsp 2016 june 16Mayo &amp; parker   spsp 2016 june 16
Mayo &amp; parker spsp 2016 june 16
 
Feb21 mayobostonpaper
Feb21 mayobostonpaperFeb21 mayobostonpaper
Feb21 mayobostonpaper
 
Gelman psych crisis_2
Gelman psych crisis_2Gelman psych crisis_2
Gelman psych crisis_2
 
D. Lakens: Preregistration as a Tool to Evaluate the Severity of a Test
D. Lakens: Preregistration  as a Tool to Evaluate the Severity of a TestD. Lakens: Preregistration  as a Tool to Evaluate the Severity of a Test
D. Lakens: Preregistration as a Tool to Evaluate the Severity of a Test
 
Mayod@psa 21(na)
Mayod@psa 21(na)Mayod@psa 21(na)
Mayod@psa 21(na)
 
D. Mayo: Replication Research Under an Error Statistical Philosophy
D. Mayo: Replication Research Under an Error Statistical Philosophy D. Mayo: Replication Research Under an Error Statistical Philosophy
D. Mayo: Replication Research Under an Error Statistical Philosophy
 
Mayo: 2nd half “Frequentist Statistics as a Theory of Inductive Inference” (S...
Mayo: 2nd half “Frequentist Statistics as a Theory of Inductive Inference” (S...Mayo: 2nd half “Frequentist Statistics as a Theory of Inductive Inference” (S...
Mayo: 2nd half “Frequentist Statistics as a Theory of Inductive Inference” (S...
 

  • 1. Evidence as Passing a Severe Test (How it Gets You Beyond the Statistics Wars) Deborah G Mayo Dept of Philosophy, Virginia Tech CUNY Graduate Center Philosophy Colloquium April 28th, 2021 0
  • 2. In a conversation with Sir David Cox: COX: Deborah, in some fields foundations do not seem very important, but we both think foundations of statistical inference are important; why do you think that is? MAYO: …in statistics …we invariably cross into philosophical questions about empirical knowledge, evidence and inductive inference. (“A Statistical Scientist Meets a Philosopher of Science” 2011) 1
  • 3. Role of probability: performance or probabilism? (Frequentist vs. Bayesian) • Statistical Inference • Unifications and Eclecticism • Long-standing battles still simmer below the surface (agreement on numbers) 2
  • 4. Statistical inference as severe testing • Brush the dust off pivotal debates in relation to today’s statistical crisis in science • We set sail with a simple tool: If little or nothing has been done to rule out flaws in inferring claim C, then you don’t have evidence for it • Sufficiently general to apply to any methods now in use • You needn’t accept this philosophy to use it to excavate the statistics wars 3
  • 5. A philosophical excursion “Taking the severity principle, along with the aim that we desire to find things out… let’s set sail on a philosophical excursion to illuminate statistical inference.” --a special interest cruise (pix/animations out) • And at the same time revisit classic problems: induction, falsification, demarcation of science 4
  • 6. Most findings are false? “Several methodologists have pointed out that the high rate of nonreplication of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. … It can be proven that most claimed research findings are false.” (John Ioannidis 2005, 0696) 5
  • 7. R.A. Fisher “[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result.” (Fisher 1947, 14) 6
  • 8. Simple significance tests (Fisher) “p-value. …to test the conformity of the particular data under analysis with H0 in some respect: …we find a function T = t(y) of the data, the test statistic, such that • the larger the value of T the more inconsistent are the data with H0; • T = t(Y) has a known probability distribution when H0 is true. …the p-value corresponding to any tobs as p = p(tobs) = Pr(T ≥ tobs; H0)” (Mayo and Cox 2006, 81) 7
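The Mayo–Cox p-value definition can be sketched numerically. A minimal illustration, assuming the test statistic T is standard normal under H0 (the statistic and the observed values below are hypothetical, not from the talk):

```python
from math import erf, sqrt

def p_value(t_obs: float) -> float:
    """p = Pr(T >= t_obs; H0), with T ~ N(0, 1) under H0."""
    # Standard normal survival function via the error function.
    return 1 - 0.5 * (1 + erf(t_obs / sqrt(2)))

# The larger t_obs, the more inconsistent the data with H0,
# and the smaller the p-value.
print(p_value(0.0))   # 0.5
print(p_value(1.96))  # ~0.025
```

Any known null distribution would do; the normal case just makes the monotone link between T and p easy to see.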
  • 9. Testing reasoning • If even larger differences than tobs occur fairly frequently under H0 (i.e., P-value is not small), there’s scarcely evidence of incompatibility with H0 • Small P-value indicates some underlying discrepancy from H0 because very probably you would have seen a less impressive difference than tobs were H0 true. • This still isn’t evidence of a genuine statistical effect H1, let alone a scientific conclusion H* Stat-Sub fallacy H => H* 8
  • 10. Fallacy of rejection • H* makes claims that haven’t been probed by the statistical test • The moves from experimental interventions to H* don’t get enough attention–but your statistical account should block them 9
  • 11. Neyman-Pearson (N-P) tests: Null and alternative hypotheses H0, H1 that are exhaustive* H0: μ ≤ 0 vs. H1: μ > 0 “no effect” vs. “some positive effect” • So this fallacy of rejection H1 => H* is blocked • Rejecting H0 only indicates statistical alternatives H1 (how discrepant from the null) *(introduces Type II error, and power) 10
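The Type II error and power introduced by N-P tests can be sketched for this one-sided test on a Normal mean. A minimal sketch; the sample size, σ = 1, and the α ≈ 0.05 cutoff are illustrative assumptions:

```python
from math import erf, sqrt

def phi(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def power(mu1: float, n: int, sigma: float = 1.0, z_alpha: float = 1.645) -> float:
    """Power of the one-sided z-test of H0: mu <= 0 vs H1: mu > 0
    against the alternative mu = mu1; reject when the standardized
    sample mean exceeds z_alpha (alpha ~ 0.05)."""
    return 1 - phi(z_alpha - mu1 * sqrt(n) / sigma)

# At the null boundary, "power" is just the Type I error rate;
# it grows with the discrepancy mu1 (and with n).
print(round(power(0.0, n=25), 2))  # 0.05
print(round(power(0.5, n=25), 2))  # ~0.8
```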
  • 12. Both Fisher and N-P methods: it’s easy to lie with statistics with biasing selection effects • Sufficient finagling—cherry-picking, significance seeking, multiple testing, post-data subgroups, trying and trying again—may practically guarantee a preferred claim H gets support, even if it’s unwarranted by evidence 11
  • 13. Severity Requirement: If the test had little or no capability of finding flaws with H (even if H is incorrect), then agreement between data x0 and H provides poor (or no) evidence for H • Such a test fails a minimal requirement for a stringent or severe test 12
  • 14. 13 • A claim passes severely only if it has been subjected to and passes a test that would have, with high probability, found it flawed or specifiably false (if it is). • This probability is the severity with which it has passed the test, and is a measure of evidential warrant A claim is warranted to the extent it passes severely
  • 15. This alters the role of probability: Probabilism. To assign a degree of probability, confirmation, support or belief in a hypothesis, given data x0 (absolute or comparative) (e.g., Bayesian, likelihoodist, Fisher (at times)) Performance. Ensure long-run reliability of methods, coverage probabilities (frequentist, behavioristic Neyman-Pearson, Fisher (at times)) 14
  • 16. • Neither “probabilism” nor “performance” directly captures assessing error probing capacity • Good long-run performance is a necessary, not a sufficient, condition for severity 15
  • 17. Key to solving a major philosophical problem for frequentists • Why is good performance relevant for inference in the case at hand? • What bothers you with selective reporting, cherry picking, stopping when the data look good, P-hacking • Not problems about long-runs— 16
  • 18. • We cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data • Performance is relevant when it teaches us about the capabilities of our methods • Basis of severe testing philosophy 17
  • 19. A claim C is not warranted _______ • Probabilism: unless C is true or probable (gets a probability boost, made comparatively firmer) • Performance: unless it stems from a method with low long-run error • Probativism (severe testing) unless something (a fair amount) has been done to probe ways we can be wrong about C 18
  • 20. Severe Tests Informal example: To test if I’ve gained weight between the start of the pandemic and now, I use a series of well-calibrated and stable scales, both at the start and now. All show an over 4 lb gain; none shows a difference in weighing EGEK (a known, stable object). I’m forced to infer: H: I’ve gained at least 4 pounds 19
  • 21. 20 • Giving the properties of the weighing methods is akin to the properties of statistical tests (performance). • No one claims the justification is merely long run and can say nothing about my weight. • We argue about the source of the readings from the high capacity to reveal if any scales were wrong
  • 22. 21 The severe tester is assumed to be in a context of wanting to find things out • I could insist all the scales are wrong—they work fine with weighing known objects—but this would prevent correctly finding out about weight….. (rigged alternative) • What sort of extraordinary circumstance could cause them all to go astray just when we do not know the weight of the test object?
  • 23. Statistical Inference and Sexy Science 22 Even large scale theories connect with data only by intermediate hypotheses and models.
  • 24. Next month, 102 years ago: May 29, 1919: Testing GTR On Einstein's theory of gravitation, light passing near the sun is deflected by an angle λ, reaching 1.75”, for light just grazing the sun. Only detectable during a total eclipse, which “by strange good fortune” would occur on May 29, 1919 (Eddington [1920] 1987, p. 113). 23
  • 25. Two key stages of inquiry i. is there a deflection effect of the amount predicted by Einstein as against Newton (0.87")? ii. is it "attributable to the sun's gravitational field" as described in Einstein's hypothesis? 24
  • 26. 25 Eclipse photos of stars (eclipse plate) compared to their positions photographed at night when the effect of the sun is absent (the night plate)–a control. The technique was known to astronomers from determining stellar parallax, “for which much greater accuracy is required” (Eddington 1920, pp. 115–16).
  • 27. 26 The problem in (i) is reduced to a statistical one: the observed mean deflections (from sets of photographs) are normally distributed around the predicted mean deflection μ. H0: μ ≤ 0.87 vs. H1: μ > 0.87; H1 includes the Einsteinian value of 1.75. Two expeditions: to Sobral, North Brazil, and Principe, Gulf of Guinea (West Africa)
  • 28. 27 A year of checking instrumental and other errors… Sobral: μ = 1.98" ± 0.18". Principe: μ = 1.61" ± 0.45". (Probable errors 0.12 and 0.30 respectively; 1 probable error ≈ 0.68 standard errors (SE).) “It is usual to allow a margin of safety of about twice the probable error on either side of the mean.” [~1.4 SE]. The Principe plates are just sufficient to rule out the ‘half-deflection’; the Sobral plates exclude it (Eddington 1920, p. 118).
  • 29. 28 (ii) Is the effect "attributable to the sun's gravitational field”? (Can’t assume H*) Using the known eclipse effect to explain it while saving Newton from falsification is unproblematic–if each conjecture is severely tested. Sir Oliver Lodge’s “ether effect” was one of many (e.g., shadow, corona). Were any other cause to exist that produced a considerable fraction of the deflection effect that alone would falsify the Einstein hypothesis (which asserts that all of the 1.75" are due to gravity) (Jeffreys 1919, p. 138).
  • 30. 29 Each Newton-saving hypothesis collapsed on the basis of a one-two punch: 1. the magnitude of effect that could have been due to the conjectured factor is far too small to account for the eclipse effect; and 2. if large enough to account for the eclipse effect, it would have false or contradictory implications elsewhere. The Newton-saving factors might have been plausible but they were unable to pass severe tests. Saving Newton this way would be bad science.
  • 31. 30 More Severe Tests of GTR in the 1970s • Radio interferometry data from quasars (quasi-stellar radio sources) are more capable of uncovering errors, and discriminating values of the deflection than the crude eclipse tests. • The Einstein deflection effect “passed” the test, but even then, they couldn’t infer all of GTR severely. • The [Einstein] law is firmly based on experiment, even the complete abandonment of the theory would scarcely affect it. (Eddington 1920, p. 126)
  • 32. 31 Popper, GTR and Severity [T]he impressive thing about [the 1919 tests of Einstein’s theory of gravity] is the risk involved in a prediction of this kind. … The theory is incompatible with certain possible results of observation–in fact with results which everybody before Einstein would have expected. This is quite different from [Freud and Adlerian psychology] (Popper 1962, p. 36)
  • 33. 32 The problem with Freudian and Adlerian psychology • Any observed behavior – jumping in the water to save a child, or failing to save her – can be accounted for by Adlerian inferiority complexes, or Freudian theories of sublimation or Oedipal complexes (Popper 1962, p. 35). • I’d modify Popper: it needn’t be the flexibility of the theory but of the overall inquiry: research question, auxiliaries, and interpretive rules. • The flexibility isn’t picked up on in logics of induction
  • 34. 33 Popper denies that severity can be formalized by any confirmation logics or logics of induction “the probability of a statement . . . simply does not express an appraisal of the severity of the tests a theory has passed, or of the manner in which it has passed these tests” (pp. 394– 5).
  • 35. 34 Wars between Popper vs logics of Induction relevant for today’s statistics wars: Alan Musgrave “According to modern logical empiricist orthodoxy, in deciding whether hypothesis h is confirmed by evidence e, . . . we must consider only the statements h and e, and the logical relations [C(h,e)] between them. It is quite irrelevant whether e was known first and h proposed to explain it, or whether e resulted from testing predictions drawn from h”. (Alan Musgrave 1974, p. 2)
  • 36. Likelihood Principle (LP) In logics of induction, like probabilist accounts (as I’m using the term) the import of the data is via the ratios of likelihoods of hypotheses Pr(x0;H0)/Pr(x0;H1) The data x0 are fixed, while the hypotheses vary 35
  • 37. Comparative Logic of Support • Ian Hacking (1965) “Law of Likelihood”: x supports hypothesis H0 less well than H1 if Pr(x;H0) < Pr(x;H1) (rejects in 1980) • Any hypothesis that perfectly fits the data is maximally likely • “there always is such a rival hypothesis viz., that things just had to turn out the way they actually did” (Barnard 1972, 129). 36
  • 38. N-P error probabilities and Popper’s methodological probabilities • Pr(H0 is less well supported than H1;H0 ) is high for some H1 or other “In order to fix a limit between ‘small’ and ‘large’ values of [the likelihood ratio] we must know how often such values appear when we deal with a true hypothesis.” (Pearson and Neyman 1967, 106) 37
  • 39. Fishing for significance (nominal vs. actual) Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be ‘significant at the 5 percent level.’ ….The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104) (Morrison & Henkel’s Significance Test Controversy 1970!) 38
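Selvin's figure follows from elementary probability if the twenty comparisons are assumed independent with every null true (an idealization):

```python
# Chance that at least one of 20 independent tests comes out
# "significant at the 5 percent level" when every null is true:
actual_level = 1 - (1 - 0.05) ** 20
print(round(actual_level, 2))  # 0.64 -- Selvin's "64 percent"
```

Reporting only the one "significant" difference hides the other nineteen tries, which is why the nominal 5 percent level is misleading.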
  • 40. Spurious P-Value The data-dredger reports: Such results would be difficult to achieve under the assumption of H0 When in fact such results are common under the assumption of H0 • There are many more ways to be wrong with biasing selection effects • Need to adjust P-values or at least report the multiple testing 39
  • 41. Some accounts of evidence object: “Two problems that plague frequentist inference: multiple comparisons and multiple looks, or…data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value… But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense” (Goodman 1999, 1010) (Co-director, with Ioannidis, of the Meta-Research Innovation Center at Stanford) 40
  • 42. All error probabilities violate the LP : Sampling distributions, significance levels, power, all depend on something more [than the likelihood function]–something that is irrelevant in Bayesian inference–namely the sample space (Lindley 1971, 436) The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of before hand or was introduced to explain known effects (Rosenkrantz 1977, 122) 41
  • 43. Many “reforms” offered as alternative to significance tests follow the LP • “Bayes factors can be used in the complete absence of a sampling plan…” (Bayarri, Benjamin, Berger, Sellke 2016, 100) • It seems very strange that a frequentist could not analyze a given set of data…if the stopping rule is not given….Data should be able to speak for itself. (Berger and Wolpert 1988, 78 authors of the Likelihood Principle) 42
  • 44. At odds with fraud-busters: 21 Word Solution “We report how we determined our sample size, and data exclusions (if any), all manipulations, and all measures in the study” (Simmons, Nelson, and Simonsohn 2012, 4). • Replication researchers find that selection effects– data-dependent hypotheses, fishing, and stopping rules–are a major source of failed replication 43
  • 45. Inferences based on biasing selection effects might be blocked with Bayesian prior probabilities (without error probabilities)? • Supplement with subjective beliefs: What do I believe? As opposed to What is the evidence? (Royall 1997) • Likelihoodists + prior probabilities 44
  • 46. Problems with appealing to priors to block inferences based on selection effects • Doesn’t show what researchers had done wrong—battle of beliefs • The believability of data-dredged hypotheses is what makes them so seductive • An additional source of flexibility: priors plus biasing selection effects 45
  • 47. No help with the severe tester’s key problem • How to distinguish the warrant for a single hypothesis H with different methods (e.g., one has biasing selection effects, another, pre-registered results and precautions)? • Since there’s a single H, its prior would be the same 46
  • 48. Most Bayesians (last decade) use “default” priors: unification • ‘Eliciting’ subjective priors too difficult, scientists reluctant for subjective beliefs to overshadow data “[V]irtually never would different experts give prior distributions that even overlapped” (J. Berger 2006, 392) • Default priors are supposed to prevent prior beliefs from influencing the posteriors–data dominant 47
  • 49. How should we interpret them? • “The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…” (Cox and Mayo 2010, 299) • No agreement on rival systems for default/non-subjective priors (invariance, maximum entropy, maximizing missing information, matching) 48
  • 50. Criticisms of Data-Dredgers Lose Force • Wanting to promote an account that downplays error probabilities, the researcher deserving criticism is given a life-raft • One of the ironies of today’s reforms 49
  • 51. Bem’s “Feeling the Future” 2011: ESP? • Daryl Bem (2011): subjects do better than chance at predicting the (erotic) picture shown in the future • Some locate the start of the Replication Crisis with Bem • Bem admits data dredging • Bayesian critics resort to a default Bayesian prior on a (point) null hypothesis 50
  • 52. Bem’s Response “Whenever the null hypothesis is sharply defined but the prior distribution on the alternative hypothesis is diffused over a wide range of values, as it is [here] it boosts the probability that any observed data will be higher under the null hypothesis than under the alternative. This is known as the Lindley-Jeffreys paradox*: A frequentist [can always] be contradicted by a …Bayesian analysis that concludes that the same data are more likely under the null.” (Bem et al. 2011, 717) *Bayes-Fisher disagreement 51
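Bem's point can be illustrated with the textbook conjugate-normal setup (this is a generic sketch, not Bem's or his critics' actual analysis): H0: μ = 0 is sharp, while H1 spreads μ over N(0, τ²). The prior scale τ = 1 and the sample sizes are illustrative assumptions:

```python
from math import exp, sqrt

def bf_null(z: float, n: int, tau: float = 1.0, sigma: float = 1.0) -> float:
    """Bayes factor in favor of H0: mu = 0 against H1: mu ~ N(0, tau^2),
    where z is the standardized sample mean, xbar / (sigma / sqrt(n))."""
    r = n * tau**2 / sigma**2
    return sqrt(1 + r) * exp(-(z**2 / 2) * (r / (1 + r)))

# Hold z fixed at 1.96 (p ~ 0.05, two-sided) and let n grow:
# the same "significant" data increasingly favor the sharp null.
for n in (10, 100, 10_000):
    print(n, round(bf_null(1.96, n), 2))
```

The growth of the Bayes factor with n at fixed z is exactly the Jeffreys–Lindley disagreement the slide describes.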
  • 53. Many of Today’s Statistics wars trace to P-values vs posteriors • The posterior probability Pr(H0|x) can be large while the P-value is small (2-sided test, spike and smear) • To the Bayesian, the P-value exaggerates the evidence against H0 • To the significance tester: the Bayesian is biasing results to favor H0 52
  • 54. Some Bayesians reject probabilism (Gelman: Falsificationist Bayesian; Shalizi: error statistician) • “[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense” which might be seen as using modern statistics to implement the Popperian criteria of severe tests. (Andrew Gelman and Cosma Shalizi 2013, 10). • Last part of SIST: (Probabilist) Foundations Lost, (Probative) Foundations Found 53
  • 55. Severity directs a reformulation of tests Severity function: SEV(Test T, data x, claim C) • Tests are reformulated in terms of a discrepancy γ from H0 • Instead of a binary cut-off (significant or not) the particular outcome is used to infer discrepancies that are and are not warranted • Poorly warranted claims must be reported 54
  • 56. Using Severity to Avoid Fallacies: Fallacy of Rejection: Large n problem • Fixing the P-value, increasing sample size n, the cut-off gets smaller • Get to a point where x is closer to the null than various alternatives • Many would lower the P-value requirement as n increases; one can always avoid inferring a discrepancy beyond what’s warranted: 55
  • 57. Severity tells us: • an α-significant difference indicates less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2) • What’s more indicative of a large effect (fire), a fire alarm that goes off with burnt toast or one that doesn’t go off unless the house is fully ablaze? • [The larger sample size is like the one that goes off with burnt toast] 56
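The burnt-toast point can be made numerically with a severity calculation for a just-significant result (σ = 1, α = 0.025, and the benchmark discrepancy μ1 = 0.1 are illustrative assumptions, not values from the talk):

```python
from math import erf, sqrt

def phi(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def sev_greater(mu1: float, xbar: float, n: int, sigma: float = 1.0) -> float:
    """SEV(mu > mu1): probability of a smaller observed mean
    than xbar, were mu equal to mu1."""
    return phi((xbar - mu1) * sqrt(n) / sigma)

# The just-significant mean at alpha = 0.025 is 1.96*sigma/sqrt(n).
for n in (100, 10_000):
    xbar = 1.96 / sqrt(n)
    print(n, round(sev_greater(0.1, xbar, n), 3))
# The same alpha-significant result warrants mu > 0.1 at n = 100
# but indicates less of a discrepancy at n = 10,000.
```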
  • 58. What About Fallacies of Non-Significant Results? • They don’t warrant 0 discrepancy • There are discrepancies the test had little probability of detecting • Using severity reasoning: rule out discrepancies that very probably would have resulted in larger differences than observed (set upper bounds) • If you very probably would have observed a larger value of the test statistic (smaller P-value) were μ = μ1, then the data indicate that μ < μ1: SEV(μ < μ1) is high 57
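That upper-bound reasoning can be sketched the same way (the observed mean, n, σ, and the benchmarks μ1 are hypothetical values for illustration):

```python
from math import erf, sqrt

def phi(z: float) -> float:
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

def sev_less(mu1: float, xbar: float, n: int, sigma: float = 1.0) -> float:
    """SEV(mu < mu1): probability the test would have produced a
    larger observed mean than xbar, were mu equal to mu1."""
    return 1 - phi((xbar - mu1) * sqrt(n) / sigma)

# A non-significant observed mean of 0.5 with n = 25, sigma = 1:
print(round(sev_less(1.0, xbar=0.5, n=25), 3))  # ~0.99: mu < 1.0 well warranted
print(round(sev_less(0.6, xbar=0.5, n=25), 3))  # ~0.69: mu < 0.6 poorly warranted
```

So the non-significant result rules out discrepancies near 1.0 with high severity, but it is no warrant for "no effect at all."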
  • 59. Confidence Intervals Are Also Reinterpreted Duality between tests and intervals: values within the (1 − α) CI are non-rejectable at the α level • Too dichotomous: in/out, plausible/not plausible • Fixed confidence levels (need several benchmarks) • Justified in terms of long-run coverage (performance)–if interpreted correctly 58
  • 60. 59 Duality of Tests and CIs (estimating μ in a Normal distribution) μ > M0 – 1.96σ/√n (CI-lower) μ < M0 + 1.96σ/√n (CI-upper) M0: the observed sample mean CI-lower: the value of μ that M0 is statistically significantly greater than at P = 0.025 CI-upper: the value of μ that M0 is statistically significantly lower than at P = 0.025 → You could get a CI by asking for these values, and learn indicated effect sizes with tests
  • 61. 60 We get an inferential rationale absent from CIs CI Estimator: CI-lower < μ < CI-upper because it came from a procedure with good coverage probability Severe Tester: μ > CI-lower because with high probability (.975) we would have observed a smaller M0 if μ ≤ CI-lower; μ < CI-upper because with high probability (.975) we would have observed a larger M0 if μ ≥ CI-upper
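The duality and its severity reading can be sketched together (the observed mean M0 = 0.2, n = 100, and σ = 1 are illustrative assumptions):

```python
from math import sqrt

def ci_95(m0: float, n: int, sigma: float = 1.0) -> tuple[float, float]:
    """0.95 CI for mu: the values m0 is not statistically
    significantly different from at P = 0.025 on each side."""
    half = 1.96 * sigma / sqrt(n)
    return m0 - half, m0 + half

lower, upper = ci_95(m0=0.2, n=100)
print(round(lower, 3), round(upper, 3))  # 0.004 0.396
# Severity reading: mu > lower, because were mu <= lower, so large
# an M0 would occur with probability at most 0.025; and symmetrically
# mu < upper, because were mu >= upper, so small an M0 would be rare.
```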
  • 62. FEV: Frequentist Principle of Evidence; Mayo and Cox (2006); SEV: Mayo (1991), Mayo and Spanos (2006) FEV/SEV A small P-value indicates discrepancy γ from H0, if and only if, there is a high probability the test would have resulted in a larger P-value were a discrepancy as large as γ absent. FEV/SEV A moderate P-value indicates the absence of a discrepancy γ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller P-value) were a discrepancy γ present. 61
  • 63. Sum-up • I begin with a minimal requirement for evidence: data are evidence for C only if C has been subjected to and passes a test it probably would have failed were C false • Biasing selection effects make it easy to find impressive-looking effects erroneously • They alter a method’s error probing capacities • They may not alter evidence (in traditional probabilisms): Likelihood Principle (LP) • To the LP holder: to consider what could have happened but didn’t is to consider “imaginary data” 62
  • 64. • To the severe tester, probabilists are robbed of a main way to block spurious results • Severity principles direct the reinterpretation of significance tests and other methods • Probabilists may block inferences without appeal to error probabilities: a high prior on H0 (no effect) can result in a high posterior probability for H0 • Gives a life-raft to the P-hacker and cherry picker; puts the blame in the wrong place • Piecemeal statistical inferences (or informal counterparts) link data to scientific claims at multiple levels 63
  • 65. • A silver lining to distinguishing highly probable and highly probed–can use different methods for different contexts • Some Bayesians may find their foundations in error statistics • Last excursion: (probabilist) foundations lost; (probative) foundations found 64
  • 68. References • Barnard, G. (1972). ‘The Logic of Statistical Inference (Review of “The Logic of Statistical Inference” by Ian Hacking)’, British Journal for the Philosophy of Science 23(2), 123–32. • Bayarri, M., Benjamin, D., Berger, J., Sellke, T. (2016). “Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses”, Journal of Mathematical Psychology 72: 90–103. • Bem, D. J. (2011). “Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect”, Journal of Personality and Social Psychology 100(3), 407–425. • Bem, D. J., Utts, J., and Johnson, W. (2011). “Must Psychologists Change the Way They Analyze Their Data?”, Journal of Personality and Social Psychology 101(4), 716–719. • Berger, J. O. (2006). “The Case for Objective Bayesian Analysis.” Bayesian Analysis 1(3): 385–402. • Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. Vol. 6 Lecture Notes-Monograph Series. Hayward, CA: Institute of Mathematical Statistics. • Cox, D. R., and Mayo, D. G. (2010). “Objectivity and Conditionality in Frequentist Inference.” In Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, edited by Deborah G. Mayo and Aris Spanos, 276–304. Cambridge: Cambridge University Press. 67
  • 69. • Cox, D. and Mayo, D. (2011). “A Statistical Scientist Meets a Philosopher of Science: A Conversation between Sir David Cox and Deborah Mayo”, in Rationality, Markets and Morals (RMM) 2, 103–14. • Eddington, A. ([1920] 1987). Space, Time and Gravitation: An Outline of the General Relativity Theory, Cambridge Science Classics Series. Cambridge: Cambridge University Press. • Fisher, R. A. (1947). The Design of Experiments, 4th ed., Edinburgh: Oliver and Boyd. • Gelman, A. and Shalizi, C. (2013). “Philosophy and the Practice of Bayesian Statistics” and “Rejoinder”, British Journal of Mathematical and Statistical Psychology 66(1): 8–38; 76–80. • Goodman, S. N. (1999). “Toward Evidence-Based Medical Statistics. 2: The Bayes Factor”, Annals of Internal Medicine 130, 1005–1013. • Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University Press. • Hacking, I. (1980). ‘The Theory of Probable Inference: Neyman, Peirce and Braithwaite’, in Mellor, D. (ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, Cambridge: Cambridge University Press, pp. 141–60. • Ioannidis, J. (2005). “Why Most Published Research Findings are False”, PLoS Medicine 2(8), 0696–0701. 68
  • 70. • Jeffreys, H. (1919). ‘Contribution to Discussion on the Theory of Relativity’, and ‘On the Crucial Test of Einstein’s Theory of Gravitation’, Monthly Notices of the Royal Astronomical Society 80, 96–118; 138–54. • Lindley, D. V. (1971). “The Estimation of Many Parameters.” in Godambe, V. and Sprott, D. (eds.), Foundations of Statistical Inference 435–455. Toronto: Holt, Rinehart and Winston. • Lodge, O. (1919). ‘Contribution to “Discussion on the Theory of Relativity”’, Monthly Notices of the Royal Astronomical Society 80, 106–9. • Mayo, D. (1991). ‘Novel Evidence and Severe Tests’, Philosophy of Science 58(4), 523–52. • Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundation. Chicago: University of Chicago Press. • Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars, Cambridge: Cambridge University Press. • Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive Inference” in Rojo, J. (ed.) The Second Erich L. Lehmann Symposium: Optimality, 2006, Lecture Notes-Monograph Series, Volume 49, Institute of Mathematical Statistics: 247-275. • Mayo, D. G., and A. Spanos. (2006). “Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction.” British Journal for the Philosophy of Science 57 (2) (June 1): 323–357. 69
  • 71. 70 • Mayo, D. G., and A. Spanos (2011). “Error Statistics.” In Philosophy of Statistics, edited by Prasanta S. Bandyopadhyay and Malcolm R. Forster, 7:152–198. Handbook of the Philosophy of Science. The Netherlands: Elsevier. • Morrison, D. E., and R. E. Henkel (eds.) (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter. • Musgrave, A. (1974). ‘Logical versus Historical Theories of Confirmation’, The British Journal for the Philosophy of Science 25(1), 1–23. • Pearson, E. S. & Neyman, J. (1967). “On the problem of two samples”, Joint Statistical Papers by J. Neyman & E. S. Pearson, 99–115 (Berkeley: U. of Calif. Press). First published 1930 in Bul. Acad. Pol. Sci., 73–96. • Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books. • Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel. • Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Boca Raton, FL: Chapman and Hall, CRC Press. • Selvin, H. (1970). “A critique of tests of significance in survey research”, in The Significance Test Controversy, edited by D. Morrison and R. Henkel, 94–106. Chicago: Aldine De Gruyter. • Simmons, J., Nelson, L. and Simonsohn, U. (2012). “A 21 word solution”, Dialogue: The Official Newsletter of the Society for Personality and Social Psychology 26(2), 4–7.

Editor's Notes

  1. ----- Meeting Notes (5/17/19 21:00) -----

1. Cox-Mayo conversation

2. A central issue in today’s statistics wars is the role of probability: Should probability enter to ensure we won’t reach mistaken interpretations of data too often in the long run of experience? Or to capture degrees of belief about claims? (performance or probabilism) The field has been marked by disagreements between competing tribes of frequentists and Bayesians that have been so contentious that everyone wants to believe we are long past them. We now enjoy unifications and reconciliations between rival schools, it will be said, and practitioners are eclectic, prepared to use whatever method “works.” The truth is, long-standing battles still simmer below the surface of today’s debates about scientific integrity, irreproducibility, and questionable research practices. Reluctance to reopen wounds from old battles has allowed them to fester. The reconciliations and unifications have been revealed to have serious problems, and there’s little agreement on which to use or how to interpret them. As for eclecticism, it’s often not clear what is even meant by “works.” The presumption that all we need is an agreement on numbers–never mind if they’re measuring different things–leads to statistical schizophrenia. I say we need to brush the dust off the pivotal debates, and consider them anew, in relation to today’s problems.

3. Statistical Inference as Severe Testing: What’s behind the constant drumbeat today that science is in crisis? The problem is that high-powered methods can make it easy to uncover impressive-looking findings even if they are false: spurious correlations and other errors have not been severely probed. We set sail with a simple tool: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test.
In the severe testing view, probability arises in scientific contexts to assess and control how capable methods are at uncovering and avoiding erroneous interpretations of data. That’s what it means to view statistical inference as severe testing. In saying we may view statistical inference as severe testing, I’m not saying statistical inference is always about formal statistical testing. The concept of severe testing is sufficiently general to apply to any of the methods now in use, whether for exploration, estimation, or prediction. You needn’t accept the severe testing view in order to employ it as a tool for getting beyond the statistics wars. It’s a tool for excavation, and for keeping us afloat in the marshes and quicksand that often mark today’s controversies.

4. A philosophical excursion: Taking the severity principle, along with the aim that we desire to find things out without being obstructed in this goal, let’s set sail on a philosophical excursion to illuminate statistical inference.
  2. What’s behind the constant drum beat today that science is in crisis? The problem is that high powered methods can make it easy to uncover impressive-looking findings even if they are false: spurious correlations and other errors have not been severely probed. We set sail with a simple tool: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test. In the severe testing view, probability arises in scientific contexts to assess and control how capable methods are at uncovering and avoiding erroneous interpretations of data. That’s what it means to view statistical inference as severe testing. In saying we may view statistical inference as severe testing, I’m not saying statistical inference is always about formal statistical testing. The concept of severe testing is sufficiently general to apply to any of the methods now in use, whether for exploration, estimation, or prediction. Regardless of the type of claim, you don’t have evidence for it if nothing has been done to have found it flawed.
  3. ----- Meeting Notes (11/24/18 20:22) ----- Problem with unifications; third problem
  4. With Bayes factors, one can make the null hypothesis comparatively more probable than a chosen alternative
  5. (source of Bayes/Fisher disagreement)