D. G. Mayo April 28, 2021 presentation to the CUNY Graduate Center Philosophy Colloquium "Evidence as Passing a Severe Test (How it Gets You Beyond the Statistics Wars)"
1. Evidence
as Passing a Severe Test
(How it Gets You Beyond the Statistics Wars)
Deborah G Mayo
Dept of Philosophy, Virginia Tech
CUNY Graduate Center Philosophy Colloquium
April 28th, 2021
2. In a conversation with Sir David Cox:
COX: Deborah, in some fields foundations do not
seem very important, but we both think foundations of
statistical inference are important; why do you think
that is?
MAYO: …in statistics…we invariably cross into
philosophical questions about empirical knowledge,
evidence and inductive inference.
("A Statistical Scientist Meets a Philosopher of Science" 2011)
3. Role of probability: performance or
probabilism?
(Frequentist vs. Bayesian)
• Statistical Inference
• Unifications and Eclecticism
• Long-standing battles still simmer below the surface (agreement on numbers)
4. Statistical inference as severe testing
• Brush the dust off pivotal debates in relation to today's statistical crisis in science
• We set sail with a simple tool: If little or nothing has been done to rule out flaws in inferring claim C, then you don't have evidence for it
• Sufficiently general to apply to any methods now in use
• You needn't accept this philosophy to use it to excavate the statistics wars
5. A philosophical excursion
"Taking the severity principle, along with the aim that we desire to find things out… let's set sail on a philosophical excursion to illuminate statistical inference." -- a special interest cruise
• And at the same time revisit classic problems: induction, falsification, demarcation of science
6. Most findings are false?
"Several methodologists have pointed out that the high rate of nonreplication of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. … It can be proven that most claimed research findings are false." (John Ioannidis 2005, 0696)
7. R.A. Fisher
"[W]e need, not an isolated record, but a reliable method of procedure. In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result." (Fisher 1947, 14)
8. Simple significance tests (Fisher)
"p-value. …to test the conformity of the particular data under analysis with H0 in some respect:
…we find a function T = t(y) of the data, the test statistic, such that
• the larger the value of T the more inconsistent are the data with H0;
• T = t(Y) has a known probability distribution when H0 is true.
…the p-value corresponding to any tobs as
p = p(t) = Pr(T ≥ tobs; H0)"
(Mayo and Cox 2006, 81)
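To make the definition concrete, here is a minimal sketch (my illustration, not from the talk) of computing a one-sided p-value, assuming a test statistic that is standard Normal under H0:

```python
from scipy import stats

def p_value(t_obs: float) -> float:
    """One-sided p-value Pr(T >= t_obs; H0), assuming the test
    statistic T is standard Normal when H0 is true."""
    return stats.norm.sf(t_obs)  # survival function: 1 - CDF

print(p_value(2.0))  # ~0.023: larger T, smaller p, more inconsistency with H0
```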
9. Testing reasoning
• If even larger differences than tobs occur fairly frequently under H0 (i.e., P-value is not small), there's scarcely evidence of incompatibility with H0
• A small P-value indicates some underlying discrepancy from H0, because very probably you would have seen a less impressive difference than tobs were H0 true.
• This still isn't evidence of a genuine statistical effect H1, let alone a scientific conclusion H*
Stat-Sub fallacy: H => H*
10. Fallacy of rejection
• H* makes claims that haven't been probed by the statistical test
• The moves from experimental interventions to H* don't get enough attention, but your statistical account should block them
11. Neyman-Pearson (N-P) tests:
Null and alternative hypotheses H0, H1 that are exhaustive*
H0: μ ≤ 0 vs. H1: μ > 0
"no effect" vs. "some positive effect"
• So the fallacy of rejection H1 => H* is blocked
• Rejecting H0 only indicates the statistical alternative H1 (how discrepant from the null)
*(introduces Type II error, and power)
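For illustration only, a sketch of the Type II error/power calculation this introduces, under the simple assumption of Normal data with known σ (all names and numbers hypothetical):

```python
import numpy as np
from scipy import stats

def power(mu1: float, n: int, sigma: float = 1.0, alpha: float = 0.05) -> float:
    """Power of the one-sided test of H0: mu <= 0 vs H1: mu > 0 against
    the alternative mu = mu1, for Normal data with known sigma."""
    z_alpha = stats.norm.isf(alpha)        # rejection cut-off for Z
    shift = mu1 * np.sqrt(n) / sigma       # standardized discrepancy
    return stats.norm.sf(z_alpha - shift)  # Pr(reject H0; mu = mu1)

print(power(mu1=0.5, n=25))  # ~0.80; Type II error ~0.20 at mu1 = 0.5
```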
12. Both Fisher and N-P methods: it's easy to lie with statistics with biasing selection effects
• Sufficient finagling (cherry-picking, significance seeking, multiple testing, post-data subgroups, trying and trying again) may practically guarantee a preferred claim H gets support, even if it's unwarranted by evidence
13. Severity Requirement:
If the test had little or no capability of finding flaws
with H (even if H is incorrect), then agreement
between data x0 and H provides poor (or no)
evidence for H
• Such a test fails a minimal requirement for a stringent or severe test
14.
• A claim passes severely only if it has been subjected to and passes a test that would have, with high probability, found it flawed or specifiably false (if it is).
• This probability is the severity with which it has passed the test, and is a measure of evidential warrant
A claim is warranted to the extent it passes severely
15. This alters the role of probability:
Probabilism. To assign a degree of probability,
confirmation, support or belief in a hypothesis,
given data x0 (absolute or comparative)
(e.g., Bayesian, likelihoodist, Fisher (at times))
Performance. Ensure long-run reliability of
methods, coverage probabilities (frequentist,
behavioristic Neyman-Pearson, Fisher (at times))
16. • Neither "probabilism" nor "performance" directly captures assessing error probing capacity
• Good long-run performance is a necessary, not a sufficient, condition for severity
17. Key to solving a major
philosophical problem for
frequentists
• Why is good performance relevant for inference in the case at hand?
• What bothers you with selective reporting, cherry picking, stopping when the data look good, P-hacking?
• These are not problems about long runs
18. • We cannot say the case at hand has done a good job of avoiding the sources of misinterpreting data
• Performance is relevant when it teaches us about the capabilities of our methods
• Basis of severe testing philosophy
19. A claim C is not warranted _______
• Probabilism: unless C is true or probable (gets a probability boost, made comparatively firmer)
• Performance: unless it stems from a method with low long-run error
• Probativism (severe testing): unless something (a fair amount) has been done to probe ways we can be wrong about C
20. Severe Tests
Informal example: To test if I've gained weight between the start of the pandemic and now, I use a series of well-calibrated and stable scales, both at the start and now.
All show an over 4 lb gain, none shows a difference in weighing EGEK; I'm forced to infer:
H: I've gained at least 4 pounds
21.
• Giving the properties of the weighing methods is akin to giving the properties of statistical tests (performance).
• No one claims the justification is merely long run and can say nothing about my weight.
• We argue about the source of the readings from the high capacity to reveal if any scales were wrong.
22.
The severe tester is assumed to be in a context of wanting to find things out
• I could insist all the scales are wrong (they work fine with weighing known objects), but this would prevent correctly finding out about weight… (rigged alternative)
• What sort of extraordinary circumstance could cause them all to go astray just when we do not know the weight of the test object?
23. Statistical Inference and Sexy Science
Even large-scale theories connect with data only by intermediate hypotheses and models.
24. Next month, 102 years ago: May 29, 1919: Testing GTR
On Einstein's theory of gravitation, light passing near the sun is deflected by an angle λ, reaching 1.75″ for light just grazing the sun.
Only detectable during a total eclipse, which "by strange good fortune" would occur on May 29, 1919 (Eddington [1920] 1987, p. 113).
25. Two key stages of inquiry
i. Is there a deflection effect of the amount predicted by Einstein as against Newton (0.87″)?
ii. Is it "attributable to the sun's gravitational field" as described in Einstein's hypothesis?
26.
Eclipse photos of stars (eclipse plate) compared to their positions photographed at night when the effect of the sun is absent (the night plate), a control.
Technique was known to astronomers from determining stellar parallax, "for which much greater accuracy is required" (Eddington [1920] 1987, pp. 115-16).
27.
The problem in (i) is reduced to a statistical one: the observed mean deflections (from sets of photographs) are normally distributed around the predicted mean deflection μ.
H0: μ ≤ 0.87 vs. H1: μ > 0.87
H1 includes the Einsteinian value of 1.75″.
Two expeditions: to Sobral, North Brazil, and Principe, Gulf of Guinea (West Africa)
28.
A year of checking instrumental and other errors…
Sobral: μ = 1.98″ ± 0.18″.
Principe: μ = 1.61″ ± 0.45″.
(in probable errors, 0.12 and 0.30 respectively; 1 probable error is 0.68 standard errors, SE)
"It is usual to allow a margin of safety of about twice the probable error on either side of the mean." [~1.4 SE]. The Principe plates are just sufficient to rule out the "half-deflection"; the Sobral plates exclude it (Eddington [1920] 1987, p. 118).
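A back-of-the-envelope check of the margin-of-safety reasoning (my illustration, not Eddington's calculation), treating the reported probable errors as exact:

```python
# Margin of safety: about twice the probable error on either side.
# Does mean - 2 * (probable error) still exceed the Newtonian
# "half-deflection" of 0.87"?
for site, mean, prob_err in [("Sobral", 1.98, 0.12), ("Principe", 1.61, 0.30)]:
    lower = mean - 2 * prob_err
    print(site, round(lower, 2), lower > 0.87)
# Sobral: 1.74 (well above 0.87); Principe: 1.01 (just above 0.87)
```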
29.
(ii) Is the effect "attributable to the sun's gravitational field"? (Can't assume H*)
Using the known eclipse effect to explain it while saving Newton from falsification is unproblematic, if each conjecture is severely tested.
Sir Oliver Lodge's "ether effect" was one of many (e.g., shadow, corona).
Were any other cause to exist that produced a considerable fraction of the deflection effect, that alone would falsify the Einstein hypothesis (which asserts that all of the 1.75″ are due to gravity) (Jeffreys 1919, p. 138).
30.
Each Newton-saving hypothesis collapsed on the
basis of a one-two punch:
1. the magnitude of effect that could have been
due to the conjectured factor is far too small to
account for the eclipse effect; and
2. if large enough to account for the eclipse effect,
it would have false or contradictory implications
elsewhere.
The Newton-saving factors might have been
plausible but they were unable to pass severe tests.
Saving Newton this way would be bad science.
31.
More Severe Tests of GTR in the 1970s
• Radio interferometry data from quasars (quasi-stellar radio sources) are more capable of uncovering errors, and discriminating values of the deflection, than the crude eclipse tests.
• The Einstein deflection effect "passed" the test, but even then, they couldn't infer all of GTR severely.
• "The [Einstein] law is firmly based on experiment; even the complete abandonment of the theory would scarcely affect it." (Eddington [1920] 1987, p. 126)
32.
Popper, GTR and Severity
"[T]he impressive thing about [the 1919 tests of Einstein's theory of gravity] is the risk involved in a prediction of this kind. … The theory is incompatible with certain possible results of observation, in fact with results which everybody before Einstein would have expected. This is quite different from [Freud and Adlerian psychology]." (Popper 1962, p. 36)
33.
The problem with Freudian and Adlerian psychology
• Any observed behavior, jumping in the water to save a child or failing to save her, can be accounted for by Adlerian inferiority complexes, or Freudian theories of sublimation or Oedipal complexes (Popper 1962, p. 35).
• I'd modify Popper: it needn't be the flexibility of the theory but of the overall inquiry: research question, auxiliaries, and interpretive rules.
• The flexibility isn't picked up on in logics of induction
34.
Popper denies that severity can be formalized by any confirmation logics or logics of induction
"the probability of a statement . . . simply does not express an appraisal of the severity of the tests a theory has passed, or of the manner in which it has passed these tests" (pp. 394–5).
35.
Wars between Popper and logics of induction relevant for today's statistics wars: Alan Musgrave
"According to modern logical empiricist orthodoxy, in deciding whether hypothesis h is confirmed by evidence e, . . . we must consider only the statements h and e, and the logical relations [C(h,e)] between them. It is quite irrelevant whether e was known first and h proposed to explain it, or whether e resulted from testing predictions drawn from h." (Alan Musgrave 1974, p. 2)
36. Likelihood Principle (LP)
In logics of induction, like probabilist accounts (as I'm using the term), the import of the data is via the ratios of likelihoods of hypotheses
Pr(x0; H0)/Pr(x0; H1)
The data x0 are fixed, while the hypotheses vary
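A minimal sketch of such a likelihood-ratio comparison, assuming a single Normal observation (the hypothesized means and values are illustrative):

```python
from scipy import stats

def likelihood_ratio(x0: float, mu0: float = 0.0, mu1: float = 1.0,
                     sigma: float = 1.0) -> float:
    """Pr(x0; H0)/Pr(x0; H1) for one Normal observation: the data x0
    are held fixed while the hypothesized means vary."""
    return stats.norm.pdf(x0, mu0, sigma) / stats.norm.pdf(x0, mu1, sigma)

print(likelihood_ratio(0.8))  # ~0.74 < 1: x0 = 0.8 fits H1 (mu = 1) better
```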
37. Comparative Logic of Support
• Ian Hacking (1965) "Law of Likelihood": x supports hypothesis H0 less well than H1 if Pr(x; H0) < Pr(x; H1)
(rejected by Hacking in 1980)
• Any hypothesis that perfectly fits the data is maximally likely
• "there always is such a rival hypothesis viz., that things just had to turn out the way they actually did" (Barnard 1972, 129).
38. N-P error probabilities and Popper's methodological probabilities
• Pr(H0 is less well supported than H1; H0) is high for some H1 or other
"In order to fix a limit between 'small' and 'large' values of [the likelihood ratio] we must know how often such values appear when we deal with a true hypothesis." (Pearson and Neyman 1967, 106)
39. Fishing for significance
(nominal vs. actual)
Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be "significant at the 5 percent level." … The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)
(Morrison & Henkel's Significance Test Controversy 1970!)
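Selvin's 64 percent is the probability of at least one nominally significant result among the twenty comparisons; a one-line check, assuming the tests are independent:

```python
# Pr(at least one of 20 independent tests reaches nominal 0.05
# significance when every null is true) = 1 - 0.95^20
actual_level = 1 - 0.95 ** 20
print(round(actual_level, 2))  # 0.64: the actual, not nominal, level
```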
40. Spurious P-Value
The data-dredger reports: such results would be difficult to achieve under the assumption of H0
When in fact such results are common under the assumption of H0
• There are many more ways to be wrong with biasing selection effects
• Need to adjust P-values or at least report the multiple testing
41. Some accounts of evidence object:
"Two problems that plague frequentist inference: multiple comparisons and multiple looks, or…data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P-value…
But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense" (Goodman 1999, 1010)
(Co-director, with Ioannidis, of the Meta-Research Innovation Center at Stanford)
42. All error probabilities violate the LP:
Sampling distributions, significance levels, power, all depend on something more [than the likelihood function], something that is irrelevant in Bayesian inference, namely the sample space (Lindley 1971, 436)
The LP implies…the irrelevance of predesignation, of whether a hypothesis was thought of beforehand or was introduced to explain known effects (Rosenkrantz 1977, 122)
43. Many "reforms" offered as alternatives to significance tests follow the LP
• "Bayes factors can be used in the complete absence of a sampling plan…" (Bayarri, Benjamin, Berger, and Sellke 2016, 100)
• "It seems very strange that a frequentist could not analyze a given set of data…if the stopping rule is not given….Data should be able to speak for itself." (Berger and Wolpert 1988, 78; authors of The Likelihood Principle)
44. At odds with fraud-busters:
21 Word Solution
"We report how we determined our sample size, and data exclusions (if any), all manipulations, and all measures in the study" (Simmons, Nelson, and Simonsohn 2012, 4).
• Replication researchers find that selection effects (data-dependent hypotheses, fishing, and stopping rules) are a major source of failed replication
45. Inferences based on biasing
selection effects might be blocked
with Bayesian prior probabilities
(without error probabilities)?
• Supplement with subjective beliefs: What do I believe? As opposed to: What is the evidence? (Royall 1997)
• Likelihoodists + prior probabilities
46. Problems with appealing to priors to
block inferences based on
selection effects
• Doesn't show what researchers had done wrong: a battle of beliefs
• The believability of data-dredged hypotheses is what makes them so seductive
• An additional source of flexibility: priors and biasing selection effects
47. No help with the severe tester's key problem
• How to distinguish the warrant for a single hypothesis H with different methods (e.g., one has biasing selection effects; another, pre-registered results and precautions)?
• Since there's a single H, its prior would be the same
48. Most Bayesians (last decade) use "default" priors: unification
• "Eliciting" subjective priors is too difficult; scientists are reluctant for subjective beliefs to overshadow data
"[V]irtually never would different experts give prior distributions that even overlapped" (J. Berger 2006, 392)
• Default priors are supposed to prevent prior beliefs from influencing the posteriors: data dominant
49. How should we interpret them?
• "The priors are not to be considered expressions of uncertainty, ignorance, or degree of belief. Conventional priors may not even be probabilities…" (Cox and Mayo 2010, 299)
• No agreement on rival systems for default/non-subjective priors (invariance, maximum entropy, maximizing missing information, matching)
50. Criticisms of Data-Dredgers Lose
Force
• Wanting to promote an account that downplays error probabilities, critics give the researcher deserving criticism a life-raft
• One of the ironies of today's reforms
51. Bem's "Feeling the Future" 2011: ESP?
• Daryl Bem (2011): subjects do better than chance at predicting the (erotic) picture shown in the future
• Some locate the start of the Replication Crisis with Bem
• Bem admits data dredging
• Bayesian critics resort to a default Bayesian prior to a (point) null hypothesis
52. Bem's Response
"Whenever the null hypothesis is sharply defined but the prior distribution on the alternative hypothesis is diffused over a wide range of values, as it is [here], it boosts the probability that any observed data will be higher under the null hypothesis than under the alternative.
This is known as the Lindley-Jeffreys paradox*: A frequentist [can always] be contradicted by a …Bayesian analysis that concludes that the same data are more likely under the null." (Bem et al. 2011, 717)
*Bayes-Fisher disagreement
53. Many of today's statistics wars trace to P-values vs posteriors
• The posterior probability Pr(H0|x) can be large while the P-value is small (2-sided test, spike and smear)
• To the Bayesian, the P-value exaggerates the evidence against H0
• To the significance tester: the Bayesian is biasing results to favor H0
54. Some Bayesians reject probabilism (Gelman: falsificationist Bayesian; Shalizi: error statistician)
• "[C]rucial parts of Bayesian data analysis, such as model checking, can be understood as 'error probes' in Mayo's sense", which might be seen as using modern statistics to implement the Popperian criteria of severe tests (Andrew Gelman and Cosma Shalizi 2013, 10).
• Last part of SIST: (Probabilist) Foundations Lost, (Probative) Foundations Found
55. Severity directs a reformulation of
tests
Severity function: SEV(Test T, data x, claim C)
• Tests are reformulated in terms of a discrepancy γ from H0
• Instead of a binary cut-off (significant or not), the particular outcome is used to infer discrepancies that are and are not warranted
• Poorly warranted claims must be reported
56. Using Severity to Avoid Fallacies:
Fallacy of Rejection: Large n
problem
• Fixing the P-value, increasing sample size n, the cut-off gets smaller
• Get to a point where x is closer to the null than various alternatives
• Many would lower the P-value requirement as n increases; severity can always avoid inferring a discrepancy beyond what's warranted:
57. Severity tells us:
• an α-significant difference indicates less of a discrepancy from the null if it results from a larger (n1) rather than a smaller (n2) sample size (n1 > n2)
• What's more indicative of a large effect (fire): a fire alarm that goes off with burnt toast or one that doesn't go off unless the house is fully ablaze?
• [The larger sample size is like the alarm that goes off with burnt toast]
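A sketch of the point (my illustration, assuming a one-sided test of H0: μ ≤ 0 with Normal data and known σ): the sample mean that just reaches α-significance shrinks as n grows, so the same "significant" verdict indicates a smaller discrepancy.

```python
import numpy as np
from scipy import stats

def just_significant_mean(n: int, sigma: float = 1.0, alpha: float = 0.05) -> float:
    """Sample mean that just reaches alpha-significance in a one-sided
    test of H0: mu <= 0 (Normal data, known sigma)."""
    return stats.norm.isf(alpha) * sigma / np.sqrt(n)

print(just_significant_mean(n=25))    # ~0.33: the "house ablaze" alarm
print(just_significant_mean(n=2500))  # ~0.03: the "burnt toast" alarm
```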
58. What About Fallacies of Non-Significant Results?
• They don't warrant 0 discrepancy
• There are discrepancies the test had little probability of detecting
• Using severity reasoning: rule out discrepancies that very probably would have resulted in larger differences than observed; set upper bounds
• If you very probably would have observed a larger value of the test statistic (smaller P-value) were μ = μ1, then the data indicate that μ < μ1:
SEV(μ < μ1) is high
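A sketch of this upper-bound reasoning under the same Normal, known-σ assumptions (the numbers are hypothetical): SEV(μ < μ1) is the probability of a larger difference than observed, were μ = μ1.

```python
import numpy as np
from scipy import stats

def sev_upper(m_obs: float, mu1: float, n: int, sigma: float = 1.0) -> float:
    """Severity for the claim mu < mu1 given observed mean m_obs:
    Pr(M > m_obs; mu = mu1), Normal data with known sigma."""
    se = sigma / np.sqrt(n)
    return stats.norm.sf((m_obs - mu1) / se)

# Non-significant mean 0.1 (n = 100, sigma = 1):
print(sev_upper(0.1, mu1=0.3, n=100))   # ~0.98: mu < 0.3 passes severely
print(sev_upper(0.1, mu1=0.12, n=100))  # ~0.58: mu < 0.12 is poorly probed
```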
59. Confidence Intervals Are Also Re-interpreted
Duality between tests and intervals: values within the (1 - α) CI are non-rejectable at the α level
• Too dichotomous: in/out, plausible/not plausible
• Fixed confidence levels (need several benchmarks)
• Justified in terms of long-run coverage (performance), if interpreted correctly
60.
Duality of Tests and CIs (estimating μ in a Normal distribution)
μ > M0 - 1.96σ/√n (CI-lower)
μ < M0 + 1.96σ/√n (CI-upper)
M0: the observed sample mean
CI-lower: the value of μ that M0 is statistically significantly greater than at P = 0.025
CI-upper: the value of μ that M0 is statistically significantly lower than at P = 0.025
• You could get a CI by asking for these values, and learn indicated effect sizes with tests
61.
We get an inferential rationale absent from CIs
CI Estimator: CI-lower < μ < CI-upper, because it came from a procedure with good coverage probability
Severe Tester:
μ > CI-lower because with high probability (.975) we would have observed a smaller M0 if μ ≤ CI-lower
μ < CI-upper because with high probability (.975) we would have observed a larger M0 if μ ≥ CI-upper
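A sketch of the duality and the severity rationale together, again assuming Normal data with known σ (all numbers illustrative):

```python
import numpy as np
from scipy import stats

m0, n, sigma = 0.5, 100, 1.0                   # observed mean, n, known sigma
se = sigma / np.sqrt(n)
lower, upper = m0 - 1.96 * se, m0 + 1.96 * se  # 95% CI bounds

# Severe tester's rationale for each bound:
print(stats.norm.cdf((m0 - lower) / se))  # ~0.975: Pr(smaller M0; mu = CI-lower)
print(stats.norm.sf((m0 - upper) / se))   # ~0.975: Pr(larger M0; mu = CI-upper)
```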
62. FEV: Frequentist Principle of Evidence; Mayo and
Cox (2006); SEV: Mayo 1991, Mayo and Spanos
(2006)
FEV/SEV: A small P-value indicates discrepancy γ from H0 if and only if there is a high probability the test would have resulted in a larger P-value were a discrepancy as large as γ absent.
FEV/SEV: A moderate P-value indicates the absence of a discrepancy γ from H0 only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller P-value) were a discrepancy γ present.
63. Sum-up
• I begin with a minimal requirement for evidence: data are evidence for C only if C has been subjected to and passes a test it probably would have failed if false
• Biasing selection effects make it easy to find impressive-looking effects erroneously
• They alter a method's error-probing capacities
• They may not alter evidence (in traditional probabilisms): Likelihood Principle (LP)
• To the LP holder: to consider what could have happened but didn't is to consider "imaginary data"
64. • To the severe tester, probabilists are robbed of a main way to block spurious results
• Severity principles direct the reinterpretation of significance tests and other methods
• Probabilists may block inferences without appeal to error probabilities: a high prior on H0 (no effect) can result in a high posterior probability on H0
• Gives a life-raft to the P-hacker and cherry picker; puts blame in the wrong place
• Piecemeal statistical inferences (or informal counterparts) link data to scientific claims at multiple levels
65. • A silver lining to distinguishing highly probable and highly probed: we can use different methods for different contexts
• Some Bayesians may find their foundations in error statistics
• Last excursion: (probabilist) foundations lost; (probative) foundations found
68. References
• Barnard, G. (1972). "The Logic of Statistical Inference (Review of 'The Logic of Statistical Inference' by Ian Hacking)", British Journal for the Philosophy of Science 23(2), 123–32.
• Bayarri, M., Benjamin, D., Berger, J., and Sellke, T. (2016). "Rejection Odds and Rejection Ratios: A Proposal for Statistical Practice in Testing Hypotheses", Journal of Mathematical Psychology 72, 90–103.
• Bem, D. (2011). "Feeling the Future: Experimental Evidence for Anomalous Retroactive Influences on Cognition and Affect", Journal of Personality and Social Psychology 100(3), 407–425.
• Bem, D., Utts, J., and Johnson, W. (2011). "Must Psychologists Change the Way They Analyze Their Data?", Journal of Personality and Social Psychology 101(4), 716–719.
• Berger, J. O. (2006). "The Case for Objective Bayesian Analysis", Bayesian Analysis 1(3), 385–402.
• Berger, J. O. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed., Lecture Notes-Monograph Series Vol. 6. Hayward, CA: Institute of Mathematical Statistics.
• Cox, D. R. and Mayo, D. G. (2010). "Objectivity and Conditionality in Frequentist Inference", in Mayo, D. G. and Spanos, A. (eds.), Error and Inference: Recent Exchanges on Experimental Reasoning, Reliability, and the Objectivity and Rationality of Science, 276–304. Cambridge: Cambridge University Press.
69. • Cox, D. and Mayo, D. (2011). "A Statistical Scientist Meets a Philosopher of Science: A Conversation between Sir David Cox and Deborah Mayo", Rationality, Markets and Morals (RMM) 2, 103–14.
• Eddington, A. ([1920] 1987). Space, Time and Gravitation: An Outline of the General Relativity Theory, Cambridge Science Classics Series. Cambridge: Cambridge University Press.
• Fisher, R. A. (1947). The Design of Experiments, 4th ed. Edinburgh: Oliver and Boyd.
• Gelman, A. and Shalizi, C. (2013). "Philosophy and the Practice of Bayesian Statistics" and "Rejoinder", British Journal of Mathematical and Statistical Psychology 66(1), 8–38; 76–80.
• Goodman, S. N. (1999). "Toward Evidence-Based Medical Statistics. 2: The Bayes Factor", Annals of Internal Medicine 130, 1005–1013.
• Hacking, I. (1965). Logic of Statistical Inference. Cambridge: Cambridge University Press.
• Hacking, I. (1980). "The Theory of Probable Inference: Neyman, Peirce and Braithwaite", in Mellor, D. (ed.), Science, Belief and Behavior: Essays in Honour of R. B. Braithwaite, 141–60. Cambridge: Cambridge University Press.
• Ioannidis, J. (2005). "Why Most Published Research Findings are False", PLoS Medicine 2(8), 0696–0701.
70. • Jeffreys, H. (1919). "Contribution to Discussion on the Theory of Relativity" and "On the Crucial Test of Einstein's Theory of Gravitation", Monthly Notices of the Royal Astronomical Society 80, 96–118; 138–54.
• Lindley, D. V. (1971). "The Estimation of Many Parameters", in Godambe, V. and Sprott, D. (eds.), Foundations of Statistical Inference, 435–455. Toronto: Holt, Rinehart and Winston.
• Lodge, O. (1919). "Contribution to 'Discussion on the Theory of Relativity'", Monthly Notices of the Royal Astronomical Society 80, 106–9.
• Mayo, D. (1991). "Novel Evidence and Severe Tests", Philosophy of Science 58(4), 523–52.
• Mayo, D. G. (1996). Error and the Growth of Experimental Knowledge. Science and Its Conceptual Foundations. Chicago: University of Chicago Press.
• Mayo, D. G. (2018). Statistical Inference as Severe Testing: How to Get Beyond the Statistics Wars. Cambridge: Cambridge University Press.
• Mayo, D. G. and Cox, D. R. (2006). "Frequentist Statistics as a Theory of Inductive Inference", in Rojo, J. (ed.), The Second Erich L. Lehmann Symposium: Optimality, Lecture Notes-Monograph Series Vol. 49, 247–275. Institute of Mathematical Statistics.
• Mayo, D. G. and Spanos, A. (2006). "Severe Testing as a Basic Concept in a Neyman–Pearson Philosophy of Induction", British Journal for the Philosophy of Science 57(2), 323–357.
71. • Mayo, D. G. and Spanos, A. (2011). "Error Statistics", in Bandyopadhyay, P. S. and Forster, M. R. (eds.), Philosophy of Statistics, Handbook of the Philosophy of Science Vol. 7, 152–198. The Netherlands: Elsevier.
• Morrison, D. E. and Henkel, R. E. (eds.) (1970). The Significance Test Controversy: A Reader. Chicago: Aldine De Gruyter.
• Musgrave, A. (1974). "Logical versus Historical Theories of Confirmation", The British Journal for the Philosophy of Science 25(1), 1–23.
• Pearson, E. S. and Neyman, J. (1967). "On the Problem of Two Samples", in Joint Statistical Papers by J. Neyman and E. S. Pearson, 99–115. Berkeley: University of California Press. (First published 1930 in Bull. Acad. Pol. Sci., 73–96.)
• Popper, K. (1962). Conjectures and Refutations: The Growth of Scientific Knowledge. New York: Basic Books.
• Rosenkrantz, R. (1977). Inference, Method and Decision: Towards a Bayesian Philosophy of Science. Dordrecht, The Netherlands: D. Reidel.
• Royall, R. (1997). Statistical Evidence: A Likelihood Paradigm. Boca Raton, FL: Chapman and Hall/CRC Press.
• Selvin, H. (1970). "A Critique of Tests of Significance in Survey Research", in Morrison, D. and Henkel, R. (eds.), The Significance Test Controversy, 94–106. Chicago: Aldine De Gruyter.
• Simmons, J., Nelson, L., and Simonsohn, U. (2012). "A 21 Word Solution", Dialogue: The Official Newsletter of the Society for Personality and Social Psychology 26(2), 4–7.
Editor's Notes
----- Meeting Notes (5/17/19 21:00) -----
1. Cox-Mayo conversation
2. A central issue in today's statistics wars is the role of probability: Should probability enter to ensure we won't reach mistaken interpretations of data too often in the long run of experience? Or to capture degrees of belief about claims? (performance or probabilism)
The field has been marked by disagreements between competing tribes of frequentists and Bayesians that have been so contentious that everyone wants to believe we are long past them.
We now enjoy unifications and reconciliations between rival schools, it will be said, and practitioners are eclectic, prepared to use whatever method "works."
The truth is, long-standing battles still simmer below the surface of today's debates about scientific integrity, irreproducibility, and questionable research practices.
Reluctance to reopen wounds from old battles has allowed them to fester.
The reconciliations and unifications have been revealed to have serious problems, and there's little agreement on which to use or how to interpret them.
As for eclecticism, it's often not clear what is even meant by "works."
The presumption that all we need is an agreement on numbers, never mind if they're measuring different things, leads to statistical schizophrenia.
I say we need to brush the dust off the pivotal debates, and consider them anew, in relation to today's problems.
3. Statistical Inference as Severe Testing:
What's behind the constant drum beat today that science is in crisis?
The problem is that high-powered methods can make it easy to uncover impressive-looking findings even if they are false: spurious correlations and other errors have not been severely probed.
We set sail with a simple tool: If little or nothing has been done to rule out flaws in inferring a claim, then it has not passed a severe test.
In the severe testing view, probability arises in scientific contexts to assess and control how capable methods are at uncovering and avoiding erroneous interpretations of data.
That's what it means to view statistical inference as severe testing.
In saying we may view statistical inference as severe testing, I'm not saying statistical inference is always about formal statistical testing.
The concept of severe testing is sufficiently general to apply to any of the methods now in use, whether for exploration, estimation, or prediction.
You needn't accept the severe testing view in order to employ it as a tool for getting beyond the statistics wars.
It's a tool for excavation, and for keeping us afloat in the marshes and quicksand that often mark today's controversies.
4. A philosophical excursion
Taking the severity principle, along with the aim that we desire to find things out without being obstructed in this goal, let's set sail on a philosophical excursion to illuminate statistical inference.
Regardless of the type of claim, you don't have evidence for it if nothing has been done to have found it flawed.
(problem with unifications)
----- Meeting Notes (11/24/18 20:22) -----
Third problem: Bayes factors can be used to make the null hypothesis comparatively more probable than a chosen alternative.