What is the significance of the p-value when reporting statistical analysis? Is there an alternative to Fisher's approach, and if so, what is it? These are some of the issues addressed here.
2. Context
While reporting results, researchers generally stop with p-value significance.
Should researchers go beyond the p-value?
What are the limitations of the p-value?
What reporting is expected by the American Psychological Association?
CONTENT
3. WHAT ARE THE IMPORTANT OUTCOMES TO BE REPORTED IN HUMAN RESOURCES RESEARCH?
IS STATISTICAL SIGNIFICANCE EQUAL TO PRACTICAL SIGNIFICANCE?
WHY ARE P-VALUES CRITICIZED BY RESEARCHERS?
4. WILL THE HISTORY OF STATISTICAL SIGNIFICANCE OFFER SOME CLARITY?
WHAT ARE INFLATED ERROR RATES, STATISTICAL POWER, AND EFFECT SIZE?
HOW TO USE G*POWER FOR TESTING STATISTICAL SIGNIFICANCE?
HOW SHOULD A HUMAN RESOURCES PROFESSIONAL REPORT HIS OR HER FINDINGS?
5. The Cult of Statistical Significance
Statistical significance is, we argue, a diversion from the proper objects of scientific study.
Significance, reduced to its narrow and statistical meaning only — "p < .05" — has little to do with a defensible notion of scientific inference.
Its arbitrary, mechanical illogic, though currently sanctioned by science and its bureaucracies of reproduction, is causing a loss of jobs, justice, profit, and even life.
The Cult of Statistical Significance, Stephen T. Ziliak and Deirdre N. McCloskey, 2008.
6. Reporting Results
For the reader to appreciate the magnitude or importance of a study's findings, it is almost always necessary to include some measure of effect size in the Results section.
Whenever possible, provide a confidence interval for each effect size reported to indicate the precision of estimation of the effect size.
Effect sizes may be expressed in the original units and are often most easily understood when reported in original units.
The general principle to be followed, however, is to provide the reader with enough information to assess the magnitude of the observed effect.
Publication Manual of the American Psychological Association, p. 34.
8. Florence Nightingale: an English social reformer and statistician and, in modern terms, a data visualizer.
9. Ronald A. Fisher (1890-1962) introduced p-values and the level of significance to evaluate evidence.
Tea experiment: a woman claimed to be able to tell by tasting whether the tea or the milk had been poured into a cup first.
An experiment was performed: eight cups of tea were prepared and given to her in random order to identify. Four had the milk poured first, and four had the tea poured first.
The lady tasted each one and gave her opinion. If she identifies all the cups correctly by guessing alone, the probability of this happening is 1/70 ≈ 0.014.
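The 1/70 figure comes from the number of ways to choose which four of the eight cups are "milk first". A minimal Python sketch (an illustration added here, not part of the original slides) checks the combinatorics:

```python
# The lady must pick which 4 of the 8 cups had milk poured first.
# There are C(8, 4) equally likely ways to do this; only one is correct.
from math import comb

ways = comb(8, 4)        # number of possible selections = 70
p_by_chance = 1 / ways   # probability of a perfect answer by pure guessing
print(ways, round(p_by_chance, 3))  # 70 and about 0.014
```

Since 0.014 is below the conventional .05 threshold, a perfect performance would be considered evidence against "she is merely guessing".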
13. Neyman and Pearson Approach
They believed scientific statements should be split into hypotheses that may be tested. They called these the experimental hypothesis (or alternative hypothesis) and the null hypothesis.
The hypothesis, or a prediction from the theory, will generally claim an "effect".
If we have designed an experimental (alternative) hypothesis, why is there a need for a null hypothesis?
We cannot prove the alternative hypothesis by statistics. What we do instead is try to reject the null hypothesis, so that we have support (not proof of causation) for our alternative hypothesis.
14. Neyman and Pearson: Identification of Errors
Type 1 Error:
• We conclude that there is an effect when there is no effect. The probability of this error is generally set to .05.
Type 2 Error:
• We conclude that there is no effect when there is an effect.
15. Neyman and Pearson: Identification of Errors
While the Type 1 error is intuitive and to a large extent appreciated by researchers, the Type 2 error requires more attention in reporting.
Important effects occurring in the real world should not be missed; the probability of missing them is known as the β-level.
Cohen (1988) suggested the maximum acceptable probability for this error to be .20. Translated into practical terms: in 20 out of 100 samples we will fail to detect the effect even though it exists.
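Statistical power is the complement of the β-level (power = 1 − β), so Cohen's β of .20 corresponds to a power of .80. A minimal simulation sketch in Python (the effect size, group size, and replication count below are illustrative assumptions, not values from the slides) shows how power can be estimated without specialized software such as G*Power:

```python
# Estimate the power of a two-sample t-test by simulation.
# Assumed setup: Cohen's d = 0.5, 64 observations per group, alpha = .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
effect_size = 0.5    # standardized mean difference (assumed)
n_per_group = 64     # per-group sample size (assumed)
n_sims = 2000

rejections = 0
for _ in range(n_sims):
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(effect_size, 1.0, n_per_group)
    if stats.ttest_ind(treatment, control).pvalue < 0.05:
        rejections += 1

power = rejections / n_sims   # fraction of experiments that detect the effect
print(round(power, 2))        # close to .80, i.e. beta close to Cohen's .20
```

With this configuration the estimated power lands near .80, matching the conventional pairing of α = .05 with β = .20.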
16. Is failing to reject the null hypothesis equal to accepting the null hypothesis?
Failing to reject the null hypothesis is not equal to accepting the null hypothesis.
The null hypothesis is never accepted, and failing to find an effect is not the same thing as showing that there is no effect.
It is incorrect to claim evidence of no treatment effect, or of no difference, when we fail to reject the null hypothesis. We may only say that the results are inconclusive.
17. The Fisher and Neyman-Pearson theories are complements, not competitors.
There are differences between the two approaches; however, they are complementary.
Lehmann, E. L. (1993). The Fisher, Neyman-Pearson theories of testing hypotheses: one theory or two? Journal of the American Statistical Association, 88(424), 1242-1249.
18. Null Hypothesis Significance Testing (NHST)
We assume the null hypothesis is true (there is no effect).
We fit a model representing the alternative hypothesis to the experimental data and assess how well it fits.
We compute the p-value: the probability of obtaining the observed data, or something more extreme, if the null hypothesis is true.
If this probability is very small, such as .05 or less, we gain confidence in the alternative (experimental) hypothesis.
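The NHST steps above can be sketched with a two-sample t-test in Python. The scores below are made-up illustration values, not data from the slides:

```python
# NHST sketch: the null hypothesis is "no difference in group means".
from scipy import stats

group_a = [1, 2, 3, 4, 5]    # hypothetical control scores
group_b = [6, 7, 8, 9, 10]   # hypothetical treatment scores

# The t-test measures how poorly the "no effect" model fits these data.
result = stats.ttest_ind(group_a, group_b)

# The p-value: probability of data at least this extreme under the null.
# Here it falls well below .05, so the null hypothesis is rejected.
print(result.pvalue)
```

Note that, per slide 16, a large p-value here would not have meant "the groups are equal", only that the result was inconclusive.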
19. What is a Significance Test?
It is the process of comparing the p-value obtained from the sample data to a predetermined level of significance decided by the researcher.
The level of significance is the probability of committing a Type I error: concluding there is an effect or difference in the sample when none exists.
20. What is a Significance Test?
The conventional level of significance used in most studies is .05, which corresponds to rejecting the null hypothesis incorrectly in approximately 1 out of every 20 experiments.
The p-value does not give information on whether the hypothesis is true or false. The p-value gives the probability of observing the sample data, or something more extreme, assuming the null hypothesis is true.
What is "extreme"? Let us do an experiment.
21. Coin Experiment
Here is a coin, and we assume that a fair coin will give heads or tails with equal probability.
I have this coin and no way of telling whether it is fair or biased.
Let us test the hypothesis.
H0: The coin is fair.
Now let us toss the coin. I get a head.
P(H) = 1/2; I have no evidence to suspect the coin.
22. Coin Experiment
I toss the coin again and this time I get another head.
The probability of this sequence is 1/2 × 1/2 = 1/4.
I toss the coin again. I get a head.
The probability of the sequence is 1/4 × 1/2 = 1/8 = 0.125.
I toss the coin again. I get a head.
The probability of the sequence is 1/8 × 1/2 = 1/16 = 0.0625.
I start suspecting my null hypothesis.
23. Coin Experiment
I toss the coin again. I get a head.
The probability of this sequence is 1/16 × 1/2 = 1/32 = 0.03125.
After the fifth toss, there is little evidence to support that the null hypothesis is true.
This is considered extreme, and the null hypothesis is rejected.
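The coin experiment can be phrased as a one-sided binomial test. A short Python sketch (added for illustration), assuming 5 heads in 5 tosses of a coin hypothesized to be fair:

```python
# One-sided binomial test: probability of at least 5 heads in 5 tosses
# of a fair coin (p = 1/2).
from scipy.stats import binomtest

result = binomtest(k=5, n=5, p=0.5, alternative="greater")
print(result.pvalue)   # (1/2)^5 = 0.03125, matching the slide
```

Since 0.03125 < .05, the run of five heads counts as "extreme" at the conventional significance level, and H0 (the coin is fair) is rejected.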
25. What is a p-value of .05, and why must it be interpreted carefully?
The level of significance (p < .05) is a subjective quantity arrived at by the researcher (it may differ between researchers), and this choice determines the result of a significance test.
This clearly suggests that it is a subjective procedure. Therefore, to believe that a significance test is an objective measure of scientific evidence may not be correct.
26. What happens if the p-values are .03 and .00005?
Let us assume that two studies are conducted by two researchers in similar areas.
Researcher A's result gives p = .03 and researcher B's result gives p = .00005.
27. What happens if the p-values are .03 and .00005?
In both cases, the p-value is less than the level of significance of .05 predetermined by the researchers. Therefore both researchers conclude that the null hypothesis is rejected.
Researcher A's inference and researcher B's inference are treated the same: under a bare reject/fail-to-reject report, the p-value of .00005 counts for no more than the p-value of .03.
28. p-value and Sample Size
The p-value and statistical significance are influenced by sample size.
If the sample size is very large, the p-value will tend to be very small for any nonzero effect. An increasingly large sample size yields an ever smaller p-value; thus, a large enough sample leads to a statistically significant result regardless of scientific importance.
A statistically significant effect based on a small sample is therefore more impressive than a statistically significant effect based on a large sample.
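This dependence on n can be made concrete. For a one-sample t-test with standardized effect d, the test statistic is t = d√n, so the same small effect crosses the .05 threshold once n is large enough. A Python sketch (the effect size d = 0.2 and the sample sizes are assumed illustration values):

```python
# Two-sided p-value for a one-sample t-test with standardized effect d:
# t = d * sqrt(n), compared against a t distribution with n - 1 df.
from math import sqrt
from scipy import stats

def p_value(d, n):
    t = d * sqrt(n)                      # t statistic grows with sqrt(n)
    return 2 * stats.t.sf(t, df=n - 1)   # two-sided tail probability

p_small = p_value(0.2, 25)    # small sample: not significant
p_large = p_value(0.2, 400)   # large sample: highly significant, same effect
print(round(p_small, 3), p_large)
```

The effect size (d = 0.2) is identical in both cases; only the sample size changed, which is exactly the slide's point about significance versus scientific importance.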
29. Inflated Error Rates or Multiple Comparisons
Let us imagine we use the .05 level of significance in a study. The probability of no Type 1 error is .95. This holds for a single test in one study.
In reality, however, we conduct multiple tests. If we conduct two tests, the overall probability of no Type 1 error is .95 × .95 = .9025.
Then the probability of at least one Type 1 error is 1 − .9025 = .0975, or 9.75%.
30. Inflated Error Rates or Multiple Comparisons
Thus, across the tests the Type 1 error rate increases; this is known as the experiment-wise error rate. The general formula for k tests is 1 − (.95)^k.
This may be controlled by using the Bonferroni correction, p_crit = α/k, where k is the number of comparisons.
However, it reduces statistical power.
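Both formulas above are easy to tabulate. A short Python sketch (added for illustration) prints the experiment-wise (familywise) error rate and the Bonferroni-corrected threshold for a few values of k:

```python
# Familywise error rate 1 - (1 - alpha)^k, and Bonferroni threshold alpha/k.
alpha = 0.05
for k in (1, 2, 5, 10):
    fwer = 1 - (1 - alpha) ** k   # chance of at least one Type 1 error
    p_crit = alpha / k            # per-test threshold after correction
    print(k, round(fwer, 4), p_crit)
```

For k = 2 this reproduces the slide's .0975; by k = 10 the uncorrected familywise rate already exceeds 40%, which is why some correction is needed.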
31. Important Remarks
There are two kinds of variation in any experiment:
• Systematic variation is the variation that can be explained by the model.
• Unsystematic variation is the variation that cannot be explained by the model we have designed; that is, variation not attributable to the effect we are testing.
32. Important Remarks
In effect we are computing:
• Test statistic = variance explained by the model / variance not explained by the model = effect/error.
This ratio of effect to error is called the test statistic (t, F, and chi-square are examples), or a signal-to-noise ratio. If effect/error is 1 or more, the effect is at least as large as the error, but this alone does not make it significant.
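The effect/error ratio can be computed by hand for a t statistic and checked against a library implementation. A Python sketch with made-up data (not from the slides):

```python
# The t statistic as signal (mean difference) over noise (standard error),
# computed manually and verified against scipy's unpooled t-test.
from math import sqrt
from scipy import stats

a = [1, 2, 3, 4, 5]   # hypothetical group A (mean 3)
b = [3, 4, 5, 6, 7]   # hypothetical group B (mean 5)

effect = sum(b) / len(b) - sum(a) / len(a)             # mean difference = 2.0
var_a = sum((x - 3) ** 2 for x in a) / (len(a) - 1)    # sample variance = 2.5
var_b = sum((x - 5) ** 2 for x in b) / (len(b) - 1)    # sample variance = 2.5
error = sqrt(var_a / len(a) + var_b / len(b))          # standard error = 1.0

t_by_hand = effect / error                 # signal / noise = 2.0
t_scipy = stats.ttest_ind(b, a).statistic  # same value from scipy
print(t_by_hand, t_scipy)
```

Here effect/error = 2.0, so the effect is twice the error, yet with only 5 observations per group the two-sided p-value sits just above .05, illustrating the closing caveat that a ratio above 1 need not be significant.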