2. Bias
• Once we have collected the data, we represent them by way of a model.
Let us assume a linear model.
• This may be written as y(outcome) = a1x1 + a2x2 + a3x3 + … + anxn + error
• The outcome is expressed as a set of predictor variables, each
multiplied by a coefficient (the parameters, the a's in the equation),
which tell us about the relationship between that predictor and the
outcome variable.
• The prediction will not be perfect: there will be an error, because we
are using sample data to predict the outcome variable.
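The linear model above can be sketched numerically. This is a minimal illustration with synthetic data (the coefficients, sample size, and noise level are assumptions, not from the notes): the parameters a are estimated by ordinary least squares, and the residual error shows that the prediction is not perfect.

```python
import numpy as np

# Synthetic data for illustration (an assumption): two predictors x1, x2.
rng = np.random.default_rng(0)
n = 100
X = rng.normal(size=(n, 2))                      # predictor variables
true_a = np.array([2.0, -1.5])                   # true coefficients a1, a2
y = X @ true_a + rng.normal(scale=0.5, size=n)   # outcome plus random error

# Least-squares estimates of the parameters a in y = a1*x1 + a2*x2 + error.
a_hat, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
errors = y - X @ a_hat                           # the model's prediction errors

print(a_hat)          # estimates near the true coefficients
print(errors.std())   # nonzero: the prediction is never perfect
```

Because the data are a sample, the estimated a's are close to, but not exactly, the true values, and the residual error never vanishes.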
3. The contexts for bias
• Things that bias the parameter estimates
• Things that bias standard errors and confidence intervals
• Things that bias test statistics and p-values.
• These biases are related: if the test statistics are biased, the
confidence intervals will be biased, and biased confidence intervals in
turn bias the test statistics.
• If the test statistics are biased, the results will be biased, so we
need to identify and eliminate these biases as far as possible.
4. Assumptions that lead to bias
1. Presence of outliers
2. Additivity and linearity
3. Normality
4. Homoscedasticity or homogeneity of variance
5. Independence
5. Outliers
• The presence of outliers will bias the data.
• For example, if the class average mark is 60 with a standard deviation
of 10 marks, a few students scoring 0 or 100 may bias the data.
• Outliers need to be identified and removed or replaced so that the
data are better represented. An outlier generally affects the mean of
the data as well as the sum of squared errors. The sum of squares is
used to compute the standard deviation, which in turn is used to
estimate the standard error. The standard error is used for confidence
intervals around the parameter estimates. Thus an outlier has a domino
effect on the results.
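The domino effect above can be sketched with the class-marks example. The marks below are synthetic values (an assumption) centred near 60; adding a single score of 0 shifts the mean, inflates the standard deviation, and therefore inflates the standard error that feeds the confidence intervals.

```python
import numpy as np

# Synthetic class marks centred near 60 (an assumption for illustration).
marks = np.array([55, 58, 60, 62, 63, 65, 57, 61, 59, 60], dtype=float)
with_outlier = np.append(marks, 0.0)   # one student scores 0

print(marks.mean(), marks.std(ddof=1))               # mean near 60, modest SD
print(with_outlier.mean(), with_outlier.std(ddof=1)) # mean drops, SD inflates

# The inflated SD propagates to the standard error (SD / sqrt(n)),
# which widens confidence intervals around the parameter estimates.
se_clean = marks.std(ddof=1) / np.sqrt(marks.size)
se_outlier = with_outlier.std(ddof=1) / np.sqrt(with_outlier.size)
print(se_clean, se_outlier)
```

One extreme value is enough to change every downstream quantity, which is why outliers must be handled before the model is interpreted.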
6. Additivity and Linearity
• The assumption is that the outcome variable is linearly related to all
predictors; that is, the relationship can be summed up as a straight
line.
• If there are several predictors, as in the equation
y(outcome) = a1x1 + a2x2 + a3x3 + … + anxn + error,
their combined effect is described by adding their individual effects
together. Only then can the model be described accurately by this
equation.
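Additivity can be sketched directly: each predictor contributes its own term, and the combined effect is just the sum of those terms. The coefficient and predictor values below are assumptions chosen for illustration.

```python
import numpy as np

# Illustrative coefficients a1..a3 and predictor values x1..x3 (assumptions).
a = np.array([2.0, -1.0, 0.5])
x = np.array([3.0, 4.0, 2.0])

contributions = a * x          # effect of each predictor taken on its own
y_hat = contributions.sum()    # additivity: combined effect = sum of effects

# This is exactly the linear-model prediction a1*x1 + a2*x2 + a3*x3.
print(contributions)           # [ 6. -4.  1.]
print(y_hat)                   # 3.0
```

If the true relationship involved interactions or curves, this simple sum would no longer describe the data accurately, and the model's estimates would be biased.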
7. Assumption of Normality
• There is a mistaken belief that the assumption of normality means the data
themselves need to be normally distributed. This misconception stems from the
fact that if the data are normally distributed, then the errors in the model,
as well as the sampling distribution, are also normally distributed.
• By the central limit theorem, there are many situations in which we can
assume normality regardless of the shape of the sample data.
• The assumption of normality matters when you construct confidence intervals
around the parameters of the model, or compute significance tests relating to
those parameters, and then only in small samples.
• As long as the sample size is fairly large and outliers are taken into
account, the assumption of normality is not a pressing concern.
• Lumley, T., Diehr, P., Emerson, S., & Chen, L. (2002). The importance of
the normality assumption in large public health data sets. Annual Review of
Public Health, 23(1), 151-169.
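The central limit theorem point above can be sketched by simulation. Using synthetic data from a heavily skewed (exponential) distribution, which is an assumption for illustration, the distribution of sample means is far closer to symmetric than the raw data, even though no individual observation is normal.

```python
import numpy as np

# Synthetic skewed data (an assumption): exponential, skewness about 2.
rng = np.random.default_rng(1)
raw = rng.exponential(scale=1.0, size=10_000)

# Means of many samples of size 50 from the same skewed distribution.
means = rng.exponential(scale=1.0, size=(5_000, 50)).mean(axis=1)

def skewness(x):
    # Simple moment-based skewness estimate.
    x = np.asarray(x)
    return ((x - x.mean()) ** 3).mean() / x.std() ** 3

print(skewness(raw))    # large: the raw data are strongly skewed
print(skewness(means))  # small: the sampling distribution is near-normal
```

This is why, with fairly large samples, confidence intervals and significance tests remain trustworthy even when the sample data are visibly non-normal.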