D. G. Mayo April 28, 2021 presentation to the CUNY Graduate Center Philosophy Colloquium "Evidence as Passing a Severe Test (How it Gets You Beyond the Statistics Wars)"
1. Evidence
as Passing a Severe Test
(How it Gets You Beyond the Statistics Wars)
Deborah G Mayo
Dept of Philosophy, Virginia Tech
CUNY Graduate Center Philosophy Colloquium
April 28th, 2021
2. In a conversation with Sir David Cox:
COX: Deborah, in some fields foundations do not
seem very important, but we both think foundations of
statistical inference are important; why do you think
that is?
MAYO: …in statistics…we invariably cross into
philosophical questions about empirical knowledge,
evidence and inductive inference.
("A Statistical Scientist Meets a Philosopher of Science" 2011)
3. Role of probability: performance or
probabilism?
(Frequentist vs. Bayesian)
• Statistical Inference
• Unifications and Eclecticism
• Long-standing battles still simmer below the surface (agreement on numbers)
4. Statistical inference as severe testing
• Brush the dust off pivotal debates in relation to today's statistical crisis in science
• We set sail with a simple tool: If little or nothing has been done to rule out flaws in inferring claim C, then you don't have evidence for it
• Sufficiently general to apply to any methods now in use
• You needn't accept this philosophy to use it to excavate the statistics wars
5. A philosophical excursion
"Taking the severity principle, along with the aim that we desire to find things out… let's set sail on a philosophical excursion to illuminate statistical inference." -- a special interest cruise
• And at the same time revisit classic problems: induction, falsification, demarcation of science
6. Most findings are false?
"Several methodologists have pointed out that the high rate of nonreplication of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. … It can be proven that most claimed research findings are false." (John Ioannidis 2005, 0696)
7. R.A. Fisher
"[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result." (Fisher 1947, 14)
8. Simple significance tests (Fisher)
"p-value. …to test the conformity of the particular data under analysis with H0 in some respect:
…we find a function T = t(y) of the data, the test statistic, such that
• the larger the value of T the more inconsistent are the data with H0;
• T = t(Y) has a known probability distribution when H0 is true.
…the p-value corresponding to any tobs as
p = p(t) = Pr(T ≥ tobs; H0)"
(Mayo and Cox 2006, 81)
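To make the definition concrete, here is a minimal sketch (my illustration, not from the talk) of computing a one-sided p-value, assuming a test statistic that is standard Normal under H0:

```python
from scipy import stats

def p_value(t_obs: float) -> float:
    """One-sided p-value Pr(T >= t_obs; H0), assuming the test
    statistic T is standard Normal when H0 is true."""
    return stats.norm.sf(t_obs)  # survival function: 1 - CDF

print(p_value(2.0))  # ~0.023: larger T, smaller p, more inconsistency with H0
```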
9. Testing reasoning
• If even larger differences than tobs occur fairly frequently under H0 (i.e., P-value is not small), there's scarcely evidence of incompatibility with H0
• A small P-value indicates some underlying discrepancy from H0, because very probably you would have seen a less impressive difference than tobs were H0 true.
• This still isn't evidence of a genuine statistical effect H1, let alone a scientific conclusion H*
Stat-Sub fallacy: H => H*
10. Fallacy of rejection
• H* makes claims that haven't been probed by the statistical test
• The moves from experimental interventions to H* don't get enough attention, but your statistical account should block them
11. Neyman-Pearson (N-P) tests:
Null and alternative hypotheses H0, H1 that are exhaustive*
H0: μ ≤ 0 vs. H1: μ > 0
"no effect" vs. "some positive effect"
• So the fallacy of rejection H1 => H* is blocked
• Rejecting H0 only indicates the statistical alternative H1 (how discrepant from the null)
*(introduces Type II error, and power)
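For illustration only, a sketch of the Type II error/power calculation this introduces, under the simple assumption of Normal data with known σ (all names and numbers hypothetical):

```python
import numpy as np
from scipy import stats

def power(mu1: float, n: int, sigma: float = 1.0, alpha: float = 0.05) -> float:
    """Power of the one-sided test of H0: mu <= 0 vs H1: mu > 0 against
    the alternative mu = mu1, for Normal data with known sigma."""
    z_alpha = stats.norm.isf(alpha)        # rejection cut-off for Z
    shift = mu1 * np.sqrt(n) / sigma       # standardized discrepancy
    return stats.norm.sf(z_alpha - shift)  # Pr(reject H0; mu = mu1)

print(power(mu1=0.5, n=25))  # ~0.80; Type II error ~0.20 at mu1 = 0.5
```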
12. Both Fisher and N-P methods: it's easy to lie with statistics with biasing selection effects
• Sufficient finagling (cherry-picking, significance seeking, multiple testing, post-data subgroups, trying and trying again) may practically guarantee a preferred claim H gets support, even if it's unwarranted by evidence
13. Severity Requirement:
If the test had little or no capability of finding flaws
with H (even if H is incorrect), then agreement
between data x0 and H provides poor (or no)
evidence for H
• Such a test fails a minimal requirement for a stringent or severe test
14.
• A claim passes severely only if it has been subjected to and passes a test that would have, with high probability, found it flawed or specifiably false (if it is).
• This probability is the severity with which it has passed the test, and is a measure of evidential warrant
A claim is warranted to the extent it passes severely
15. This alters the role of probability:
Probabilism. To assign a degree of probability,
confirmation, support or belief in a hypothesis,
given data x0 (absolute or comparative)
(e.g., Bayesian, likelihoodist, Fisher (at times))
Performance. Ensure long-run reliability of
methods, coverage probabilities (frequentist,
behavioristic Neyman-Pearson, Fisher (at times))
16. • Neither "probabilism" nor "performance" directly captures assessing error probing capacity
• Good long-run performance is a necessary, not a sufficient, condition for severity
17. Key to solving a major
philosophical problem for
frequentists
• Why is good performance relevant for inference in the case at hand?
• What bothers you with selective reporting, cherry picking, stopping when the data look good, P-hacking?
• These are not problems about long runs
18. • We cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data
• Performance is relevant when it teaches us about the capabilities of our methods
• Basis of severe testing philosophy
19. A claim C is not warranted _______
• Probabilism: unless C is true or probable (gets a probability boost, made comparatively firmer)
• Performance: unless it stems from a method with low long-run error
• Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C
20. Severe Tests
Informal example: To test if I've gained weight between the start of the pandemic and now, I use a series of well-calibrated and stable scales, both at the start and now.
All show an over 4 lb gain, none shows a difference in weighing EGEK; I'm forced to infer:
H: I've gained at least 4 pounds
21.
• Giving the properties of the weighing methods is akin to giving the properties of statistical tests (performance).
• No one claims the justification is merely long run and can say nothing about my weight.
• We argue about the source of the readings from the high capacity to reveal if any scales were wrong.
22.
The severe tester is assumed to be in a context of wanting to find things out
• I could insist all the scales are wrong (they work fine with weighing known objects), but this would prevent correctly finding out about weight… (rigged alternative)
• What sort of extraordinary circumstance could cause them all to go astray just when we do not know the weight of the test object?
23. Statistical Inference and Sexy Science
Even large-scale theories connect with data only by intermediate hypotheses and models.
24. Next month, 102 years ago: May 29, 1919: Testing GTR
On Einstein's theory of gravitation, light passing near the sun is deflected by an angle λ, reaching 1.75″ for light just grazing the sun.
Only detectable during a total eclipse, which "by strange good fortune" would occur on May 29, 1919 (Eddington [1920] 1987, p. 113).
25. Two key stages of inquiry
i. Is there a deflection effect of the amount predicted by Einstein as against Newton (0.87″)?
ii. Is it "attributable to the sun's gravitational field" as described in Einstein's hypothesis?
26.
Eclipse photos of stars (eclipse plate) compared to their positions photographed at night when the effect of the sun is absent (the night plate), a control.
Technique was known to astronomers from determining stellar parallax, "for which much greater accuracy is required" (Eddington [1920] 1987, pp. 115-16).
27.
The problem in (i) is reduced to a statistical one: the observed mean deflections (from sets of photographs) are normally distributed around the predicted mean deflection μ.
H0: μ ≤ 0.87 vs. H1: μ > 0.87
H1 includes the Einsteinian value of 1.75″.
Two expeditions: to Sobral, North Brazil, and Principe, Gulf of Guinea (West Africa)
28.
A year of checking instrumental and other errors…
Sobral: μ = 1.98″ ± 0.18″.
Principe: μ = 1.61″ ± 0.45″.
(in probable errors, 0.12 and 0.30 respectively; 1 probable error is 0.68 standard errors, SE)
"It is usual to allow a margin of safety of about twice the probable error on either side of the mean." [~1.4 SE]. The Principe plates are just sufficient to rule out the "half-deflection"; the Sobral plates exclude it (Eddington [1920] 1987, p. 118).
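A back-of-the-envelope check of the margin-of-safety reasoning (my illustration, not Eddington's calculation), treating the reported probable errors as exact:

```python
# Margin of safety: about twice the probable error on either side.
# Does mean - 2 * (probable error) still exceed the Newtonian
# "half-deflection" of 0.87"?
for site, mean, prob_err in [("Sobral", 1.98, 0.12), ("Principe", 1.61, 0.30)]:
    lower = mean - 2 * prob_err
    print(site, round(lower, 2), lower > 0.87)
# Sobral: 1.74 (well above 0.87); Principe: 1.01 (just above 0.87)
```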
29.
(ii) Is the effect "attributable to the sun's gravitational field"? (Can't assume H*)
Using the known eclipse effect to explain it while saving Newton from falsification is unproblematic, if each conjecture is severely tested.
Sir Oliver Lodge's "ether effect" was one of many (e.g., shadow, corona).
Were any other cause to exist that produced a considerable fraction of the deflection effect, that alone would falsify the Einstein hypothesis (which asserts that all of the 1.75″ are due to gravity) (Jeffreys 1919, p. 138).
30.
Each Newton-saving hypothesis collapsed on the
basis of a one-two punch:
1. the magnitude of effect that could have been
due to the conjectured factor is far too small to
account for the eclipse effect; and
2. if large enough to account for the eclipse effect,
it would have false or contradictory implications
elsewhere.
The Newton-saving factors might have been
plausible but they were unable to pass severe tests.
Saving Newton this way would be bad science.
31.
More Severe Tests of GTR in the 1970s
• Radio interferometry data from quasars (quasi-stellar radio sources) are more capable of uncovering errors, and discriminating values of the deflection, than the crude eclipse tests.
• The Einstein deflection effect "passed" the test, but even then, they couldn't infer all of GTR severely.
• "The [Einstein] law is firmly based on experiment; even the complete abandonment of the theory would scarcely affect it." (Eddington [1920] 1987, p. 126)
32.
Popper, GTR and Severity
"[T]he impressive thing about [the 1919 tests of Einstein's theory of gravity] is the risk involved in a prediction of this kind. … The theory is incompatible with certain possible results of observation, in fact with results which everybody before Einstein would have expected. This is quite different from [Freud and Adlerian psychology]." (Popper 1962, p. 36)
33.
The problem with Freudian and Adlerian psychology
• Any observed behavior, jumping in the water to save a child or failing to save her, can be accounted for by Adlerian inferiority complexes, or Freudian theories of sublimation or Oedipal complexes (Popper 1962, p. 35).
• I'd modify Popper: it needn't be the flexibility of the theory but of the overall inquiry: research question, auxiliaries, and interpretive rules.
• The flexibility isn't picked up on in logics of induction
34.
Popper denies that severity can be formalized by any confirmation logics or logics of induction
"the probability of a statement . . . simply does not express an appraisal of the severity of the tests a theory has passed, or of the manner in which it has passed these tests" (pp. 394–5).
35.
Wars between Popper and logics of induction relevant for today's statistics wars: Alan Musgrave
"According to modern logical empiricist orthodoxy, in deciding whether hypothesis h is confirmed by evidence e, . . . we must consider only the statements h and e, and the logical relations [C(h,e)] between them. It is quite irrelevant whether e was known first and h proposed to explain it, or whether e resulted from testing predictions drawn from h." (Alan Musgrave 1974, p. 2)
36. Likelihood Principle (LP)
In logics of induction, like probabilist accounts (as I'm using the term), the import of the data is via the ratios of likelihoods of hypotheses
Pr(x0; H0)/Pr(x0; H1)
The data x0 are fixed, while the hypotheses vary
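A minimal sketch of such a likelihood-ratio comparison, assuming a single Normal observation (the hypothesized means and values are illustrative):

```python
from scipy import stats

def likelihood_ratio(x0: float, mu0: float = 0.0, mu1: float = 1.0,
                     sigma: float = 1.0) -> float:
    """Pr(x0; H0)/Pr(x0; H1) for one Normal observation: the data x0
    are held fixed while the hypothesized means vary."""
    return stats.norm.pdf(x0, mu0, sigma) / stats.norm.pdf(x0, mu1, sigma)

print(likelihood_ratio(0.8))  # ~0.74 < 1: x0 = 0.8 fits H1 (mu = 1) better
```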
37. Comparative Logic of Support
• Ian Hacking (1965) "Law of Likelihood": x supports hypothesis H0 less well than H1 if Pr(x; H0) < Pr(x; H1)
(rejected by Hacking in 1980)
• Any hypothesis that perfectly fits the data is maximally likely
• "there always is such a rival hypothesis viz., that things just had to turn out the way they actually did" (Barnard 1972, 129).
38. N-P error probabilities and Popper's methodological probabilities
• Pr(H0 is less well supported than H1; H0) is high for some H1 or other
"In order to fix a limit between 'small' and 'large' values of [the likelihood ratio] we must know how often such values appear when we deal with a true hypothesis." (Pearson and Neyman 1967, 106)
39. Fishing for significance
(nominal vs. actual)
Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be "significant at the 5 percent level." … The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)
(Morrison & Henkel's Significance Test Controversy 1970!)
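Selvin's 64 percent is the probability of at least one nominally significant result among the twenty comparisons; a one-line check, assuming the tests are independent:

```python
# Pr(at least one of 20 independent tests reaches nominal 0.05
# significance when every null is true) = 1 - 0.95^20
actual_level = 1 - 0.95 ** 20
print(round(actual_level, 2))  # 0.64: the actual, not nominal, level
```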
40. Spurious P-Value
The data-dredger reports: such results would be difficult to achieve under the assumption of H0
When in fact such results are common under the assumption of H0
• There are many more ways to be wrong with biasing selection effects
• Need to adjust P-values or at least report the multiple testing
41. Some accounts of evidence object:
"Two problems that plague frequentist inference: multiple comparisons and multiple looks, or…data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…
But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense" (Goodman 1999, 1010)
(Co-director, with Ioannidis, of the Meta-Research Innovation Center at Stanford)
42. All error probabilities violate the LP:
Sampling distributions, significance levels, power, all depend on something more [than the likelihood function], something that is irrelevant in Bayesian inference, namely the sample space (Lindley 1971, 436)
The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects (Rosenkrantz 1977, 122)
43. Many "reforms" offered as alternatives to significance tests follow the LP
• "Bayes factors can be used in the complete absence of a sampling plan…" (Bayarri, Benjamin, Berger, and Sellke 2016, 100)
• "It seems very strange that a frequentist could not analyze a given set of data…if the stopping rule is not given….Data should be able to speak for itself." (Berger and Wolpert 1988, 78; authors of The Likelihood Principle)
44. At odds with fraud-busters:
21 Word Solution
"We report how we determined our sample size, and data exclusions (if any), all manipulations, and all measures in the study" (Simmons, Nelson, and Simonsohn 2012, 4).
• Replication researchers find that selection effects (data-dependent hypotheses, fishing, and stopping rules) are a major source of failed replication
45. Inferences based on biasing
selection effects might be blocked
with Bayesian prior probabilities
(without error probabilities)?
• Supplement with subjective beliefs: What do I believe? As opposed to: What is the evidence? (Royall 1997)
• Likelihoodists + prior probabilities
46. Problems with appealing to priors to
block inferences based on
selection effects
• Doesn't show what researchers had done wrong: a battle of beliefs
• The believability of data-dredged hypotheses is what makes them so seductive
• An additional source of flexibility: priors and biasing selection effects
47. No help with the severe tester's key problem
• How to distinguish the warrant for a single hypothesis H with different methods (e.g., one has biasing selection effects; another, pre-registered results and precautions)?
• Since there's a single H, its prior would be the same
48. Most Bayesians (last decade) use "default" priors: unification
• "Eliciting" subjective priors is too difficult; scientists are reluctant for subjective beliefs to overshadow data
"[V]irtually never would different experts give prior distributions that even overlapped" (J. Berger 2006, 392)
• Default priors are supposed to prevent prior beliefs from influencing the posteriors: data dominant
49. How should we interpret them?
• "The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…" (Cox and Mayo 2010, 299)
• No agreement on rival systems for default/non-subjective priors (invariance, maximum entropy, maximizing missing information, matching)
50. Criticisms of Data-Dredgers Lose
Force
• Wanting to promote an account that downplays error probabilities, critics give the researcher deserving criticism a life-raft
• One of the ironies of today's reforms
51. Bem's "Feeling the Future" 2011: ESP?
• Daryl Bem (2011): subjects do better than chance at predicting the (erotic) picture shown in the future
• Some locate the start of the Replication Crisis with Bem
• Bem admits data dredging
• Bayesian critics resort to a default Bayesian prior to a (point) null hypothesis
52. Bem's Response
"Whenever the null hypothesis is sharply defined but the prior distribution on the alternative hypothesis is diffused over a wide range of values, as it is [here], it boosts the probability that any observed data will be higher under the null hypothesis than under the alternative.
This is known as the Lindley-Jeffreys paradox*: A frequentist [can always] be contradicted by a …Bayesian analysis that concludes that the same data are more likely under the null." (Bem et al. 2011, 717)
*Bayes-Fisher disagreement
53. Many of today's statistics wars trace to P-values vs posteriors
• The posterior probability Pr(H0|x) can be large while the P-value is small (2-sided test, spike and smear)
• To the Bayesian, the P-value exaggerates the evidence against H0
• To the significance tester: the Bayesian is biasing results to favor H0
54. Some Bayesians reject probabilism (Gelman: falsificationist Bayesian; Shalizi: error statistician)
• "[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as 'error probes' in Mayo's sense", which might be seen as using modern statistics to implement the Popperian criteria of severe tests (Andrew Gelman and Cosma Shalizi 2013, 10).
• Last part of SIST: (Probabilist) Foundations Lost, (Probative) Foundations Found
55. Severity directs a reformulation of
tests
Severity function: SEV(Test T, data x, claim C)
• Tests are reformulated in terms of a discrepancy γ from H0
• Instead of a binary cut-off (significant or not), the particular outcome is used to infer discrepancies that are and are not warranted
• Poorly warranted claims must be reported
56. Using Severity to Avoid Fallacies:
Fallacy of Rejection: Large n
problem
• Fixing the P-value, increasing sample size n, the cut-off gets smaller
• Get to a point where x is closer to the null than various alternatives
• Many would lower the P-value requirement as n increases; severity can always avoid inferring a discrepancy beyond what's warranted:
57. Severity tells us:
• an α-significant difference indicates less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)
• What's more indicative of a large effect (fire): a fire alarm that goes off with burnt toast or one that doesn't go off unless the house is fully ablaze?
• [The larger sample size is like the alarm that goes off with burnt toast]
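A sketch of the point (my illustration, assuming a one-sided test of H0: μ ≤ 0 with Normal data and known σ): the sample mean that just reaches α-significance shrinks as n grows, so the same "significant" verdict indicates a smaller discrepancy.

```python
import numpy as np
from scipy import stats

def just_significant_mean(n: int, sigma: float = 1.0, alpha: float = 0.05) -> float:
    """Sample mean that just reaches alpha-significance in a one-sided
    test of H0: mu <= 0 (Normal data, known sigma)."""
    return stats.norm.isf(alpha) * sigma / np.sqrt(n)

print(just_significant_mean(n=25))    # ~0.33: the "house ablaze" alarm
print(just_significant_mean(n=2500))  # ~0.03: the "burnt toast" alarm
```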
58. What About Fallacies of Non-Significant Results?
• They don't warrant 0 discrepancy
• There are discrepancies the test had little probability of detecting
• Using severity reasoning: rule out discrepancies that very probably would have resulted in larger differences than observed; set upper bounds
• If you very probably would have observed a larger value of the test statistic (smaller P-value) were μ = μ1, then the data indicate that μ < μ1:
SEV(μ < μ1) is high
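A sketch of this upper-bound reasoning under the same Normal, known-σ assumptions (the numbers are hypothetical): SEV(μ < μ1) is the probability of a larger difference than observed, were μ = μ1.

```python
import numpy as np
from scipy import stats

def sev_upper(m_obs: float, mu1: float, n: int, sigma: float = 1.0) -> float:
    """Severity for the claim mu < mu1 given observed mean m_obs:
    Pr(M > m_obs; mu = mu1), Normal data with known sigma."""
    se = sigma / np.sqrt(n)
    return stats.norm.sf((m_obs - mu1) / se)

# Non-significant mean 0.1 (n = 100, sigma = 1):
print(sev_upper(0.1, mu1=0.3, n=100))   # ~0.98: mu < 0.3 passes severely
print(sev_upper(0.1, mu1=0.12, n=100))  # ~0.58: mu < 0.12 is poorly probed
```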
59. Confidence Intervals Are Also Re-interpreted
Duality between tests and intervals: values within the (1 - α) CI are non-rejectable at the α level
• Too dichotomous: in/out, plausible/not plausible
• Fixed confidence levels (need several benchmarks)
• Justified in terms of long-run coverage (performance), if interpreted correctly
60.
Duality of Tests and CIs (estimating μ in a Normal distribution)
μ > M0 - 1.96σ/√n (CI-lower)
μ < M0 + 1.96σ/√n (CI-upper)
M0: the observed sample mean
CI-lower: the value of μ that M0 is statistically significantly greater than at P = 0.025
CI-upper: the value of μ that M0 is statistically significantly lower than at P = 0.025
• You could get a CI by asking for these values, and learn indicated effect sizes with tests
61.
We get an inferential rationale absent from CIs
CI Estimator: CI-lower < μ < CI-upper, because it came from a procedure with good coverage probability
Severe Tester:
μ > CI-lower because with high probability (.975) we would have observed a smaller M0 if μ ≤ CI-lower
μ < CI-upper because with high probability (.975) we would have observed a larger M0 if μ ≥ CI-upper
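A sketch of the duality and the severity rationale together, again assuming Normal data with known σ (all numbers illustrative):

```python
import numpy as np
from scipy import stats

m0, n, sigma = 0.5, 100, 1.0                   # observed mean, n, known sigma
se = sigma / np.sqrt(n)
lower, upper = m0 - 1.96 * se, m0 + 1.96 * se  # 95% CI bounds

# Severe tester's rationale for each bound:
print(stats.norm.cdf((m0 - lower) / se))  # ~0.975: Pr(smaller M0; mu = CI-lower)
print(stats.norm.sf((m0 - upper) / se))   # ~0.975: Pr(larger M0; mu = CI-upper)
```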
62. FEV: Frequentist Principle of Evidence; Mayo and
Cox (2006); SEV: Mayo 1991, Mayo and Spanos
(2006)
FEV/SEV: A small P-value indicates discrepancy γ from H0 if and only if there is a high probability the test would have resulted in a larger P-value were a discrepancy as large as γ absent.
FEV/SEV: A moderate P-value indicates the absence of a discrepancy γ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller P-value) were a discrepancy γ present.
63. Sum-up
• I begin with a minimal requirement for evidence: data are evidence for C only if C has been subjected to and passes a test it probably would have failed if false
• Biasing selection effects make it easy to find impressive-looking effects erroneously
• They alter a method's error-probing capacities
• They may not alter evidence (in traditional probabilisms): Likelihood Principle (LP)
• To the LP holder: to consider what could have happened but didn't is to consider "imaginary data"
64. • To the severe tester, probabilists are robbed of a main way to block spurious results
• Severity principles direct the reinterpretation of significance tests and other methods
• Probabilists may block inferences without appeal to error probabilities: a high prior on H0 (no effect) can result in a high posterior probability on H0
• Gives a life-raft to the P-hacker and cherry picker; puts blame in the wrong place
• Piecemeal statistical inferences (or informal counterparts) link data to scientific claims at multiple levels
65. • A silver lining to distinguishing highly probable and highly probed: we can use different methods for different contexts
• Some Bayesians may find their foundations in error statistics
• Last excursion: (probabilist) foundations lost; (probative) foundations found
68. References
• Barnard, G. (1972). "The Logic of Statistical Inference (Review of 'The Logic of Statistical Inference' by Ian Hacking)", British Journal for the Philosophy of Science 23(2), 123–32.
• Bayarri, M., Benjamin, D., Berger, J., and Sellke, T. (2016). "Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses", Journal of Mathematical Psychology 72, 90–103.
• Bem, D. (2011). "Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect", Journal of Personality and Social Psychology 100(3), 407–425.
• Bem, D., Utts, J., and Johnson, W. (2011). "Must Psychologists Change the Way They Analyze Their Data?", Journal of Personality and Social Psychology 101(4), 716–719.
• Berger, J. O. (2006). "The Case for Objective Bayesian Analysis", Bayesian Analysis 1(3), 385–402.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed., Lecture Notes-Monograph Series Vol. 6. Hayward, CA: Institute of Mathematical Statistics.
• Cox, D. R. and Mayo, D. G. (2010). "Objectivity and Conditionality in Frequentist Inference", in Mayo, D. G. and Spanos, A. (eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, 276–304. Cambridge: Cambridge University Press.
69. • Cox, D. and Mayo, D. (2011). "A Statistical Scientist Meets a Philosopher of Science: A Conversation between Sir David Cox and Deborah Mayo", Rationality, Markets and Morals (RMM) 2, 103–14.
• Eddington, A. ([1920] 1987). Space, Time and Gravitation: An Outline of the General Relativity Theory, Cambridge Science Classics Series. Cambridge: Cambridge University Press.
• Fisher, R. A. (1947). The Design of Experiments, 4th ed. Edinburgh: Oliver and Boyd.
• Gelman, A. and Shalizi, C. (2013). "Philosophy and the Practice of Bayesian Statistics" and "Rejoinder", British Journal of Mathematical and Statistical Psychology 66(1), 8–38; 76–80.
• Goodman, S. N. (1999). "Toward Evidence-Based Medical Statistics. 2: The Bayes Factor", Annals of Internal Medicine 130, 1005–1013.
• Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University Press.
• Hacking, I. (1980). "The Theory of Probable Inference: Neyman, Peirce and Braithwaite", in Mellor, D. (ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, 141–60. Cambridge: Cambridge University Press.
• Ioannidis, J. (2005). "Why Most Published Research Findings are False", PLoS Medicine 2(8), 0696–0701.
70. • Jeffreys, H. (1919). "Contribution to Discussion on the Theory of Relativity" and "On the Crucial Test of Einstein's Theory of Gravitation", Monthly Notices of the Royal Astronomical Society 80, 96–118; 138–54.
• Lindley, D. V. (1971). "The Estimation of Many Parameters", in Godambe, V. and Sprott, D. (eds.), Foundations of Statistical Inference, 435–455. Toronto: Holt, Rinehart and Winston.
• Lodge, O. (1919). "Contribution to 'Discussion on the Theory of Relativity'", Monthly Notices of the Royal Astronomical Society 80, 106–9.
• Mayo, D. (1991). "Novel Evidence and Severe Tests", Philosophy of Science 58(4), 523–52.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundations. Chicago: University of Chicago Press.
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.
• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive Inference", in Rojo, J. (ed.), The Second Erich L. Lehmann Symposium: Optimality, Lecture Notes-Monograph Series Vol. 49, 247–275. Institute of Mathematical Statistics.
• Mayo, D. G. and Spanos, A. (2006). "Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction", British Journal for the Philosophy of Science 57(2), 323–357.
71. • Mayo, D. G. and Spanos, A. (2011). "Error Statistics", in Bandyopadhyay, P. S. and Forster, M. R. (eds.), Philosophy of Statistics, Handbook of the Philosophy of Science Vol. 7, 152–198. The Netherlands: Elsevier.
• Morrison, D. E. and Henkel, R. E. (eds.) (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.
• Musgrave, A. (1974). "Logical versus Historical Theories of Confirmation", The British Journal for the Philosophy of Science 25(1), 1–23.
• Pearson, E. S. and Neyman, J. (1967). "On the Problem of Two Samples", in Joint Statistical Papers by J. Neyman and E. S. Pearson, 99–115. Berkeley: University of California Press. (First published 1930 in Bull. Acad. Pol. Sci., 73–96.)
• Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.
• Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
• Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Boca Raton, FL: Chapman and Hall/CRC Press.
• Selvin, H. (1970). "A Critique of Tests of Significance in Survey Research", in Morrison, D. and Henkel, R. (eds.), The Significance Test Controversy, 94–106. Chicago: Aldine De Gruyter.
• Simmons, J., Nelson, L., and Simonsohn, U. (2012). "A 21 Word Solution", Dialogue: The Official Newsletter of the Society for Personality and Social Psychology 26(2), 4–7.
Editor's Notes
----- Meeting Notes (5/17/19 21:00) -----
1. Cox-Mayo conversation
2. A central issue in today's statistics wars is the role of probability: Should probability enter to ensure we won't reach mistaken interpretations of data too often in the long run of experience? Or to capture degrees of belief about claims? (performance or probabilism)
The field has been marked by disagreements between competing tribes of frequentists and Bayesians that have been so contentious that everyone wants to believe we are long past them.
We now enjoy unifications and reconciliations between rival schools, it will be said, and practitioners are eclectic, prepared to use whatever method "works."
The truth is, long-standing battles still simmer below the surface of today's debates about scientific integrity, irreproducibility, and questionable research practices.
Reluctance to reopen wounds from old battles has allowed them to fester.
The reconciliations and unifications have been revealed to have serious problems, and there's little agreement on which to use or how to interpret them.
As for eclecticism, it's often not clear what is even meant by "works."
The presumption that all we need is an agreement on numbers, never mind if they're measuring different things, leads to statistical schizophrenia.
I say we need to brush the dust off the pivotal debates, and consider them anew, in relation to today's problems.
3. Statistical Inference as Severe Testing:
What's behind the constant drum beat today that science is in crisis?
The problem is that high-powered methods can make it easy to uncover impressive-looking findings even if they are false: spurious correlations and other errors have not been severely probed.
We set sail with a simple tool: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test.
In the severe testing view, probability arises in scientific contexts to assess and control how capable methods are at uncovering and avoiding erroneous interpretations of data.
That's what it means to view statistical inference as severe testing.
In saying we may view statistical inference as severe testing, I'm not saying statistical inference is always about formal statistical testing.
The concept of severe testing is sufficiently general to apply to any of the methods now in use, whether for exploration, estimation, or prediction.
You needn't accept the severe testing view in order to employ it as a tool for getting beyond the statistics wars.
It's a tool for excavation, and for keeping us afloat in the marshes and quicksand that often mark today's controversies.
4. A philosophical excursion
Taking the severity principle, along with the aim that we desire to find things out without being obstructed in this goal, let's set sail on a philosophical excursion to illuminate statistical inference.
Regardless of the type of claim, you don't have evidence for it if nothing has been done to have found it flawed.
(problem with unifications)
----- Meeting Notes (11/24/18 20:22) -----
Third problem: Bayes factors can be used to make the null hypothesis comparatively more probable than a chosen alternative.