2. Measures of Relationship
• The mean, median, mode, range and
standard deviation are univariate statistics: each
describes only one variable at a time.
• Two variables are described in terms
of their relationship.
• The most common bivariate descriptive
statistics include cross-tab tables,
correlation and regression.
• A cross-tab table is the same as a contingency
table.
3. Concept of Probability
• A probability is a number that reflects the chance
or likelihood that a particular event will occur.
• Probabilities can be expressed as proportions that
range from 0 to 1, and they can also be expressed
as percentages ranging from 0% to 100%.
• A probability of 0 indicates that there is no
chance that a particular event will occur, whereas
a probability of 1 indicates that an event is
certain to occur.
• A probability of 0.45 (45%) indicates that there
are 45 chances out of 100 of the event occurring.
4. Concept of Probability
• The concept of probability can be illustrated
in the context of a study of obesity in
children 5-10 years of age who are seeking
medical care at a particular pediatric
practice.
• The population (sampling frame) includes all
children who were seen in the practice in
the past 12 months and is summarized in
the table.
5. Concept of Probability
• Unconditional Probability: When one child is
selected at random, every child has the same
probability of being selected, 1/N, where N = the
population size. Thus, the probability that
any particular child is selected is 1/5,290 ≈ 0.0002.
        Age (years)
        5     6     7     8     9     10    Total
Boys    432   379   501   410   420   418   2,560
Girls   408   513   412   436   461   500   2,730
Total   840   892   913   846   881   918   5,290
6. Concept of Probability
• Conditional Probability: The probability of an
event within a chosen subset of the population,
such as the probability that a girl is 9 years old.
Among the 2,730 girls, 461 are aged 9, so the
probability is 461/2,730 = 0.169 (16.9%).
        Age (years)
        5     6     7     8     9     10    Total
Boys    432   379   501   410   420   418   2,560
Girls   408   513   412   436   461   500   2,730
Total   840   892   913   846   881   918   5,290
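The two calculations above can be reproduced with a minimal Python sketch. The counts are copied from the table; the variable and function names are illustrative assumptions, not part of the slides.

```python
# Counts of children seen in the practice, by sex and age (from the table).
ages = [5, 6, 7, 8, 9, 10]
counts = {
    "Boys": [432, 379, 501, 410, 420, 418],
    "Girls": [408, 513, 412, 436, 461, 500],
}

# Population size N is the total over both rows.
population = sum(sum(row) for row in counts.values())

# Unconditional probability that one particular child is selected: 1/N.
p_any_child = 1 / population


def conditional(sex, age):
    """Probability of a given age within one sex, e.g. P(age 9 | girl)."""
    row = counts[sex]
    return row[ages.index(age)] / sum(row)


p_age9_given_girl = conditional("Girls", 9)

print(f"N = {population}")
print(f"P(any one child)  = {p_any_child:.4f}")
print(f"P(age 9 | girl)   = {p_age9_given_girl:.3f}")
```

The conditional probability divides by the size of the subset (2,730 girls), not by the whole population.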
7. Normal Probability Curve (Z Score)
Properties
• It is also called the normal distribution.
• It is based on the distribution of the data, with
probability measured as area under the curve.
• It is a bell-shaped curve.
• At its centre point, Mean = Median =
Mode. (X̄ = M = Z)
8. Normal Probability Curve (Z Score)
Properties
• When the Mean, Median and Mode are equal at
the centre of the curve, this value is denoted “µ” (mu).
• The curve extends to infinity on the left
side as well as the right side.
• The total area under the normal curve is taken as “1”.
• 1 is the maximum probability.
• Probability is the measure of the likelihood that
an event will occur in a Random Experiment.
• Probability is quantified as a number between 0
and 1, where, loosely speaking, 0 indicates
impossibility and 1 indicates certainty.
9. Normal Probability Curve (Z Score)
Properties
• It is also called the Gaussian or normal curve.
• The shape of the curve depends on the mean and SD.
• If the SD is high, the width increases and the
height decreases, and vice versa.
• When the mean is 0 and the SD is 1, the curve is said to
be the standard normal curve.
• Probabilities for the normal distribution are
calculated using the normal probability model.
10. Normal Probability Curve (Z Score)
Properties
• Distributions that are normal or Gaussian have
the following characteristics:
• Approximately 68% (68.27%) of the values fall
within one standard deviation of the mean (in
either direction)
• Approximately 95% (95.45%) of the values fall
within two standard deviations of the mean
(in either direction)
• Approximately 99.7% (99.73%) of the values fall
within three standard deviations of the mean
(in either direction)
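These three percentages can be checked numerically. The sketch below assumes only Python's standard `math.erf`; the helper name `area_within` is illustrative.

```python
import math


def area_within(k):
    """Area under a normal curve within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {area_within(k) * 100:.2f}%")
```

The result does not depend on the particular mean or SD, which is why the rule holds for every normal distribution.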
11. Normal Probability Curve (Z Score)
Properties
• If we have a normally distributed variable and
know the population mean (μ) and the standard
deviation (σ), then we can compute the probability
of particular values based on this equation for the
normal probability model.
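For reference, the normal probability model is usually written as the density below. This is the standard textbook form and is assumed here to be the equation the slide refers to. Probabilities are areas under this curve.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(x-\mu)^{2}}{2\sigma^{2}}\right),
\qquad
P(a < X < b) = \int_{a}^{b} f(x)\,dx
```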
12. Normal Probability Curve (Z Score)
Example
• Consider body mass index (BMI) in a
population of 60 year old males in whom
BMI is normally distributed and has a mean
value = 29 and a standard deviation = 6.
The standard deviation gives us a measure
of how spread out the observations are.
13. Normal Probability Curve (Z Score)
Example
• The mean (μ = 29) is in the center of the
distribution, and the horizontal axis is scaled in
increments of the standard deviation (σ = 6) and
the distribution essentially ranges from μ - 3 σ to
μ + 3σ.
• It is possible to have BMI values below 11 or above
47, but extreme values occur very infrequently.
14. Normal Probability Curve (Z Score)
Example
• To compute probabilities from normal
distributions, we will compute areas under the
curve.
• The total area under the curve is 1.
• Here the mean is equal to median, so half
(50%) of the area under the curve is above the
mean and half is below, so Pr(BMI < 29)=0.50.
• Consequently, if we select a man at random
from this population and ask what the
probability is that his BMI is less than 29, the
answer is 0.50 or 50%, since 50% of the area
under the curve is below the value BMI = 29.
15. Normal Probability Curve (Z Score)
Example
• What is the probability that a 60 year old
male has BMI less than 35?
• The probability is displayed graphically and
represented by the area under the curve to
the left of the value 35 in the figure below.
16. Normal Probability Curve (Z Score)
Example
• Note that BMI = 35 is 1 standard deviation above
the mean.
• For the normal distribution we know that
approximately 68% of the area under the curve
lies between the mean plus or minus one standard
deviation.
17. Normal Probability Curve (Z Score)
Example
• Therefore, 68% of the area under the curve lies
between 23 and 35.
• We also know that the normal distribution is
symmetric about the mean, therefore P(29 < X <
35) = P(23 < X < 29) = 0.34.
• Consequently, P(X < 35) = 0.5 + 0.34 = 0.84 or
84%.
18. Normal Probability Curve (Z Score)
Example
• This can also be calculated using the formula
• Z = (X − µ) / σ
• where μ is the mean and σ is the standard
deviation of the variable X.
• In order to compute P(X < 30) we convert
X = 30 to its corresponding Z score.
• Z = (30 − 29)/6 = 1/6 = 0.17. The Z table gives
the area between 0 and 0.17 as 0.0675, so
P(X < 30) = 0.0675 + 0.5 = 0.5675 = 56.75%.
• Z-table (Right of Curve or Left) - Statistics
How To.pdf
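As a cross-check on the Z table lookups, the sketch below computes the same areas for the BMI example (μ = 29, σ = 6). The helper names `z_score` and `phi` are illustrative assumptions; `phi` is the standard normal cumulative area built from `math.erf`.

```python
import math


def z_score(x, mu, sigma):
    """Convert a raw value to a Z score: Z = (X - mu) / sigma."""
    return (x - mu) / sigma


def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


mu, sigma = 29, 6

# P(BMI < 35): 35 is exactly one SD above the mean, so Z = 1.
p_below_35 = phi(z_score(35, mu, sigma))

# P(BMI < 30): Z is rounded to two decimals, as a printed Z table requires.
z_30 = round(z_score(30, mu, sigma), 2)
p_below_30 = phi(z_30)

print(f"P(BMI < 35) = {p_below_35:.2f}")
print(f"Z for BMI 30 = {z_30}, P(BMI < 30) = {p_below_30:.4f}")
```

Without rounding Z to 0.17, the second probability comes out slightly lower, because 1/6 is a little less than 0.17.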
19. Normal Probability Curve (Z Score)
Example
• The mean height of 500 students is 165 cm and
the SD is 6 cm. Assuming that heights are normally
distributed, find how many students will have a
height between 155 and 175 cm. (Z = (X − µ) / σ)
• Z = (155 − 165)/6 = −10/6 = −1.67
• Z = (175 − 165)/6 = 10/6 = 1.67
• The required area under the standard normal curve
lies between Z = −1.67 and Z = 1.67.
• = (area between Z = −1.67 and 0) + (area between
Z = 0 and 1.67)
• = (0.9525 − 0.5) + 0.4525 = 0.4525 + 0.4525 = 0.9050 =
90.5%. So 0.9050 × 500 = 452.5, i.e. about 452 students
have a height between 155 cm and 175 cm.
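The same count can be sketched with `statistics.NormalDist` from Python's standard library (Python 3.8 or later), which avoids the table lookup. The variable names are illustrative.

```python
from statistics import NormalDist

# Heights assumed normal with mean 165 cm and SD 6 cm, as in the example.
heights = NormalDist(mu=165, sigma=6)
n_students = 500

# Share of students between 155 cm and 175 cm = area between the two values.
share = heights.cdf(175) - heights.cdf(155)
expected = share * n_students

print(f"share between 155 and 175 cm: {share:.4f}")
print(f"expected number of students:  {expected:.1f}")
```

Because Z is not rounded to 1.67 here, the share is slightly below 0.9050, but the expected count still rounds down to 452 students.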
20. Importance of Normal Probability Curve
• Data obtained from biological measurements
approximately follow normal distribution.
• Binomial and Poisson distributions can be
approximated by the normal distribution.
• A binomial distribution arises from a fixed number of
trials, each with only two possible results (e.g. tossing a coin).
• A Poisson distribution describes counts of events over a
very large number of opportunities (e.g. printing mistakes in a book).
• In case of large samples it can be used to study
the descriptive statistics such as mean, SD etc.
• Used to find confidence limits of the population
parameters.
• It is the basis of test of significance.
21. Correlation
• The mean, median, mode, range and
standard deviation are univariate statistics: each
describes only one variable at a time.
• Two variables are described in terms
of their relationship.
• The most common bivariate descriptive
statistics include cross-tab tables,
correlation and regression.
• A cross-tab table is the same as a contingency
table.
22. Correlation Coefficient
• The relationship between two quantitative
variables is called correlation.
• The extent/degree/intensity of the relationship
between two variables is expressed in terms
of the correlation coefficient, which ranges from −1
to 1.
• It shows only the relationship between the variables, not
influence or a cause and effect
relationship.
23. Types of Correlation Coefficient
• Based on the direction of change:
a. Perfect Positive Correlation: X is directly
proportional to Y. Both rise and fall in the same
proportion. E.g. designation and salary. r = 1.
b. Perfect Negative Correlation: X and Y are inversely
proportional. r = −1. E.g. insulin and blood sugar.
c. Moderately Positive Correlation: A positive
correlation that is less than perfect (0 < r < 1).
d. Moderately Negative Correlation: A negative
correlation that is less than perfect (−1 < r < 0).
e. No Correlation: No relationship. r = 0. E.g. smoking and
type of housing.
24. Types of Correlation Coefficient
• Based on the number of variables:
a. Simple: Only two variables.
b. Multiple: More than two variables.
c. Partial: More than two variables, but the
correlation is studied for only two variables
while keeping the third variable constant.
E.g. x = yield, y = fertilizer, z = amount of
rainfall.
Simple = r(xy), r(yz), r(xz)
Multiple = r(xyz)
Partial = r(xy)z
25. Types of Correlation Coefficient
• Based on Linearity;
a. Linear: If a change in one variable bears a constant
amount of change, or a fixed pattern of change, in
another variable, then the correlation is said to be linear.
26. Types of Correlation Coefficient
• Based on Linearity;
b. Non Linear: Correlation is said to be non linear if the
ratio of change is not constant. In other words, when all
the points on the scatter diagram tend to lie near a smooth
curve, the correlation is said to be non linear (curvilinear).
27. Methods of Correlation Coefficient
• Karl Pearson’s method of correlation
• Spearman’s rank correlation.
• Scatter Plot/graph/scatter diagram method.
28. Karl Pearson’s method of correlation
• Karl Pearson’s product-moment correlation
coefficient (or simply, Pearson’s correlation
coefficient) is a measure of the strength of a linear
association between two variables and is denoted
by r or rxy (x and y being the two variables involved).
• It attempts to draw a line of best fit through the
data of two variables, and the value of the Pearson
correlation coefficient, r, indicates how far away
the data points are from this line of best fit.
• It does not consider whether a variable is
dependent or independent. It treats both
variables equally.
29. Properties of Pearson’s method
• r is unit-less. Thus, we may use it to compare
association between totally different bivariate
distributions as well.
• The value of r always lies between +1 and −1.
Depending on its exact value, we see the
following degrees of association between the
variables.
• A value greater than 0 indicates a positive
association i.e. as the value of one variable
increases, so does the value of the other variable.
• A value less than 0 indicates a negative association
i.e. as the value of one variable increases, the value
of the other variable decreases.
30. Interpretation of Pearson’s method
Strength of Association   Negative r      Positive r
Weak                      -0.1 to -0.3    0.1 to 0.3
Average                   -0.3 to -0.5    0.3 to 0.5
Strong                    -0.5 to -1      0.5 to 1
Perfect                   -1              +1
The coefficient of correlation is “zero” when
the variables X and Y are independent.
31. Assumptions of Pearson’s method
• The relationship between the variables
is “linear”, which means that when the two
variables are plotted, the points fall on or
close to a straight line.
• The pairs of observations are independent of each other.
• The coefficient of correlation measures not
only the magnitude of the correlation but also
its direction. For example, r = −0.67
shows that the correlation is negative, because the
sign is “−”, and that its magnitude is 0.67.
32. Karl Pearson’s method of correlation
• It can be calculated using the formula
• In case of grouped data “x” and “y” can be
taken as the mid value of the class interval.
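The worked example on the following slides is consistent with the deviation form of the coefficient, which can be written as below. Here x̄ and ȳ denote the means of x and y.

```latex
r = \frac{\sum (x-\bar{x})(y-\bar{y})}
         {\sqrt{\sum (x-\bar{x})^{2}\,\sum (y-\bar{y})^{2}}}
```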
33. Pearson’s method
• Compute the correlation coefficient from the
following data:
• Create the table.
• Find the mean of “x” and “y”.
Weight in kg   60    70    80    90
Cholesterol    120   130   140   150
34. Pearson’s method
Mean of x = 300/4 = 75; mean of y = 540/4 = 135.
x     y     x − x̄   y − ȳ   (x − x̄)(y − ȳ)
60    120   −15     −15     225
70    130   −5      −5      25
80    140   5       5       25
90    150   15      15      225
Σx = 300, Σy = 540, Σ(x − x̄)(y − ȳ) = 500
35. Pearson’s method
(x − x̄)²   (y − ȳ)²
225        225
25         25
25         25
225        225
Σ(x − x̄)² = 500, Σ(y − ȳ)² = 500
r = 500 / √(500 × 500) = 500 / √250,000 = 500/500 = 1
Hence there is a perfect positive correlation between the
weight and cholesterol level of the patients.
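The table-based calculation can be sketched in Python as follows. The function name `pearson_r` is an illustrative assumption; it follows the deviation-from-the-mean steps used in the worked example.

```python
import math


def pearson_r(xs, ys):
    """Pearson's r from deviations about the means of xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


weight = [60, 70, 80, 90]
cholesterol = [120, 130, 140, 150]

r = pearson_r(weight, cholesterol)
print(f"r = {r}")
```

Every point in this data set lies exactly on a straight line, which is why r reaches its maximum value.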
36. Pearson’s method
• (Homework) Compute the correlation
coefficient from the following data:
Age              30    40    50    60    70
Blood pressure   120   130   140   150   160
37. Merits and Demerits of Pearson’s method
Merits;
• It summarizes the correlation in a single number and, if
plotted on a graph with a straight line, it shows the
direction too.
Demerits:
• The correlation coefficient always assumes a
linear relationship, regardless of whether that
assumption is correct or not.
• The value of the coefficient is unduly affected
by extreme values.
• It cannot be used for ordinal data.
• It is a time-consuming method.
38. Spearman’s Rank Correlation Coefficient
• It is a method of finding correlation between
two variables by taking their ranks.
• It is used for qualitative data that can be ranked.
• It can be used when the actual magnitude of the
characteristics under consideration is not
known, but the relative position or rank of the
magnitude is known.
• It is the nonparametric version of the Pearson
correlation coefficient.
• The data must be ordinal, or interval or ratio
data converted to ranks.
39. Spearman’s Rank Correlation Coefficient
• Spearman’s coefficient returns a value from −1 to 1, where:
+1 = a perfect positive correlation between ranks
−1 = a perfect negative correlation between ranks
0 = no correlation between ranks.
• It is denoted by ρ (“rho”).
• There are two cases for calculating rank
correlation:
• A. No tie in the allotted ranks.
• B. There is a tie for two or more values/ranks in
either “x” or “y” or both.
40. Spearman’s Rank Correlation Coefficient
• Case 1: No tie in the allotted ranks: In this case
none of the values/ranks of x and y are
repeated.
• In this case ρ can be calculated using the
formula;
• D/d = difference in the ranks of data set of ‘x’
and ‘y’ (d = Rx − Ry)
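The worked example on the following slides is consistent with the usual rank-correlation formula for untied ranks, written below. Here n is the number of pairs and d is the rank difference defined above.

```latex
\rho = 1 - \frac{6\sum d^{2}}{n\,(n^{2}-1)}
```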
41. Spearman’s Rank Correlation Coefficient
• Calculate the rank correlation of the following
marks obtained by five nursing students in
anatomy and FON.
• Here the data should not be rearranged in
ascending or descending order; only the
ranks are assigned in ascending or
descending order. Each pair of marks belongs to
one student.
• Prepare a table to calculate Σd².
Anatomy   85   81   77   68   53
FON       78   70   72   62   67
42. Spearman’s Rank Correlation Coefficient
• ρ = 1 − (6 × 4) / (5 × (25 − 1)) = 1 − 24/120 = 0.8. The
marks of the two subjects are positively, though
not perfectly, correlated.
x    y    Rx   Ry   d = Rx − Ry   d²
85   78   1    1    0             0
81   70   2    3    −1            1
77   72   3    2    1             1
68   62   4    5    −1            1
53   67   5    4    1             1
Σd² = 4
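The ranking and the Σd² calculation for the anatomy and FON marks can be sketched as below. The function names are illustrative assumptions, and the ranking helper only handles data without ties, as in Case 1.

```python
def ranks_descending(values):
    """Rank 1 goes to the largest value. Assumes no repeated values."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]


def spearman_no_ties(xs, ys):
    """rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)) for untied ranks."""
    n = len(xs)
    rx = ranks_descending(xs)
    ry = ranks_descending(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


anatomy = [85, 81, 77, 68, 53]
fon = [78, 70, 72, 62, 67]

print("Rx:", ranks_descending(anatomy))
print("Ry:", ranks_descending(fon))
print("rho =", round(spearman_no_ties(anatomy, fon), 2))
```

Ranking in ascending order instead would give the same ρ, as long as both series are ranked in the same direction.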
43. Spearman’s Rank Correlation Coefficient
• Example: Calculate the correlation for the
following set of data. Given are the
temperatures (degrees Celsius) of Jammu and
Katra on different days.
Jammu   20   28   25   23   22   30   31
Katra   15   26   17   19   21   24   27
44. Spearman’s Rank Correlation Coefficient
• Case 2: There is a tie in the allotted ranks: In this case
a value (and hence its rank) is repeated in either x or y or
both x and y.
• In this case ρ is calculated with the same formula, adding
a correction factor (CF) to Σd²:
ρ = 1 − 6(Σd² + ΣCF) / (n(n² − 1))
• CF is the correction factor. It has to be calculated
for each repeated rank and the values added together.
The CF for a value repeated m times is
CF = m(m² − 1)/12
• D/d = difference in the ranks of data set of ‘x’
and ‘y’ (d = Rx − Ry)
45. Spearman’s Rank Correlation Coefficient
• Calculate the rank correlation of the following
marks obtained by eight nursing students in
MSN and OBG.
• Here in MSN (x) the value 68 is repeated twice
and in OBG (y) the value 70 is repeated
thrice.
• In the first series CF = 2 × (4 − 1)/12 = 0.5
• In the second series CF = 3 × (9 − 1)/12 = 2
MSN   60   81   72   68   53   75   85   68
OBG   78   70   72   62   67   70   70   61
47. Spearman’s Rank Correlation Coefficient
• ρ = 1 − 6 × (40 + 0.5 + 2) / (8 × (64 − 1)) = 1 − 255/504
= 1 − 0.51 = 0.49. The marks of the two
subjects have a moderate positive correlation.
• Homework: Calculate the correlation for the
following set of data:
X   10   15   14   25   14   14
Y   6    25   12   18   25   40
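The tie-corrected calculation for the MSN and OBG marks can be sketched as below. The function names are illustrative assumptions. The sketch assumes the common convention that tied values share the average of the ranks they occupy, and that the correction factors are added to Σd² before multiplying by 6. Under that convention, ranking the marks listed for this example gives Σd² = 66.5 rather than the 40 used in the slide's calculation, so both values are shown.

```python
def average_ranks(values):
    """Rank 1 = largest value; tied values share the mean of their positions."""
    order = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = order.index(v) + 1      # first position occupied by v
        count = order.count(v)          # how many times v is repeated
        ranks.append(first + (count - 1) / 2)
    return ranks


def correction_factor(m):
    """CF = m(m^2 - 1)/12 for a value repeated m times."""
    return m * (m ** 2 - 1) / 12


def spearman_with_ties(sum_d2, n, tie_sizes):
    """rho = 1 - 6(sum_d2 + sum of CFs) / (n(n^2 - 1))."""
    total_cf = sum(correction_factor(m) for m in tie_sizes)
    return 1 - 6 * (sum_d2 + total_cf) / (n * (n ** 2 - 1))


msn = [60, 81, 72, 68, 53, 75, 85, 68]
obg = [78, 70, 72, 62, 67, 70, 70, 61]

# Using the slide's value of sum(d^2) = 40 with ties of size 2 and 3.
rho_slide = spearman_with_ties(sum_d2=40, n=8, tie_sizes=[2, 3])

# Recomputing sum(d^2) from the listed marks with average ranks.
rx, ry = average_ranks(msn), average_ranks(obg)
sum_d2_from_marks = sum((a - b) ** 2 for a, b in zip(rx, ry))
rho_from_marks = spearman_with_ties(sum_d2_from_marks, 8, [2, 3])

print(f"rho with sum_d2 = 40:   {rho_slide:.2f}")
print(f"sum_d2 from the marks:  {sum_d2_from_marks}")
print(f"rho from the marks:     {rho_from_marks:.2f}")
```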
48. Merits and Demerits of Spearman’s method
Merits
• This method can be used as a measure of the degree
of association between qualitative data.
• This method is very simple and easily
understandable.
• It can be used when the actual data are given or
when only the ranks of the data are given.
Demerits
• We cannot calculate the rank coefficient for a
frequency distribution, i.e. grouped data.
• When a large number of observations are given,
the calculation becomes tedious.
49. Scatter Diagram Method
• Scatter diagrams are convenient
graphical tools to study the correlation
between two random variables.
• A scatter diagram is a graph on
which the data points corresponding to the
variables of interest are plotted.
• Judging by the shape of the pattern that the
data points form on the graph, we
can assess the association between the
two variables, and can then apply the most
suitable correlation analysis technique.
50. Scatter Diagram Method: Use
• Quickly confirm a hypothesis that two
variables are correlated.
• Provide a graphical representation of the
strength of the relationship between two
variables.
• It can also help in exploring a possible cause and
effect relationship, by showing whether
manipulation of the independent variable (cause)
is accompanied by a change in the
dependent variable (effect).
51. Steps to make Scatter Diagram
• Step 1: On graph paper or plain paper, draw an
“L”, where the horizontal part of the “L” is the x axis and
the vertical part of the “L” is the y axis.
• Step 2: Mark the scale units in even multiples, such
as 10, 20, 30, 40, so as to have a uniform scale.
• Step 3: Place the independent (cause) variable on the
horizontal axis (from left to right) and the dependent
(effect) variable on the vertical axis (from bottom to top).
• Plot each data point where its x value and its y
value intersect.
• The points on the graph generally look scattered,
hence the name scatter plot.
• Interpret the data and find the relationship.
52. Interpretation of Scatter Diagram
• It suggests the degree and the direction of the
correlation.
• The greater the scatter of the plotted points on
the chart, the weaker the relationship.
• The more closely the points come to a straight
line rising from the lower left corner to the upper right
corner, the closer the correlation is to perfectly
positive. (r = +1)
• On the other hand, if all the points lie on a line
falling from the upper left corner to the lower
right corner, the correlation is said to be
perfectly negative. (r = −1)
53. Interpretation of Scatter Diagram
• If the points are widely scattered
on the graph, it indicates very little
relationship (weak positive or weak negative).
• If the plotted points lie on the diagram in a
disorganized manner, it shows the absence of
correlation.
54. Merits and Demerits of Scatter Diagram
Merits
• It is a simple, non-mathematical method to
study correlation.
• It is easily understood and a rough idea can be quickly
formed.
• It is not influenced by the extreme values of x
and y.
Demerits
• It cannot establish the exact degree of correlation.
• It cannot always be treated as a measure of the
degree of correlation, since it is not mathematical
and hence less reliable.
55. Regression
• Regression was introduced by Francis
Galton in the field of biometry.
• Regression analysis is a reliable method of
identifying which variables have an impact on a
topic of interest.
• Dependent Variable: This is the main factor
that you’re trying to understand or predict.
• Independent Variables: These are the
factors that you hypothesize have an impact
on your dependent variable.
56. Regression
• Regression is done by deriving a suitable
equation on the basis of available bivariate
data.
• This equation is called Regression equation
and its geometrical representation is called
Regression curve.
• The regression equation requires the
Regression coefficient.
• The method of calculating regression
coefficient (b/b1) is described below.
57. Regression Analysis
• Regression analysis attempts to establish
the nature of the relationship between the
variables, i.e. to study the functional
relationship between the variables and
thereby provide a mechanism for prediction
or forecasting.
• It is a mathematical model which describes
the relationship between the dependent variable
(y) and the independent variable (x), and which
allows unknown values of ‘y’ to be estimated
from known values of ‘x’ through
the equation y = a + bx.
58. Properties of Regression Coefficient
• It is denoted by b.
• Between two variables (x and y), two
regression coefficients can be obtained.
One is obtained when we consider x as
independent and y as dependent, and the
other when the roles are reversed.
• The regression coefficient of y on x is
written byx and that of x on y is written
bxy.
• The square root of the product of the two
regression coefficients (b = byx and b1 = bxy) gives the
correlation coefficient: r = ±√(byx × bxy), with the
same sign as the two coefficients.
59. Regression Equations
• There will be two lines/two equations of
regression.
• 1. Regression Equation of y on x.
• 2. Regression equation of x on y.
60. Regression Equation of y on x.
• It is y = a + bx, where y = dependent variable,
x = independent variable and a and b are
constants.
• It is also to be noted that b = byx (regression
coefficient of y on x)
• b = (Σxy − n x̄ ȳ) / (Σx² − n x̄²)
• a = ȳ − b x̄
61. Regression Equation of x on y.
• It is x = a1 + b1y, where x = dependent
variable, y = independent variable and a1 and
b1 are constants.
• It is also to be noted that b1 = bxy
(regression coefficient of x on y)
• b1 = (Σxy − n x̄ ȳ) / (Σy² − n ȳ²)
• a1 = x̄ − b1 ȳ
62. Types of Regression
• Simple linear regression: The
relationship between a scalar response
(dependent variable) and a single
explanatory (independent) variable.
• Multiple linear regression: More than one
explanatory variable.
• Multivariate linear regression: Multiple
correlated dependent variables are
predicted, rather than a single scalar
variable.
63. Types of Regression
• Positive regression: A positive sign indicates
that as the predictor variable increases, the
response variable also increases.
• Negative regression: A negative sign
indicates that as the predictor variable
increases, the response variable decreases.
• Linear and nonlinear Regression: A model is
linear when each term is either a constant or
the product of a parameter and a predictor
variable. It is non linear if the equation does
not meet the linear criteria.
64. Regression Analysis
• Fit a regression equation of B.P on age based
on the following data and estimate the
probable B.P for a subject aged 55.
• n = 5
• x̄ = Σx/n = 250/5 = 50
• ȳ = Σy/n = 700/5 = 140
• The regression equation to be fitted is y =
a + bx, where y is B.P and x is the age.
Age   30    40    50    60    70
B.P   120   130   140   150   160
65. Regression Equation of y on x.
• Find b and a using the given formula.
• b = (Σxy − n x̄ ȳ) / (Σx² − n x̄²)
• a = ȳ − b x̄
67. Regression Equation of y on x.
• b = (36000 − 5 × 50 × 140) / (13500 − 5 × 50²)
• b = (36000 − 35000) / (13500 − 12500)
• b = 1000/1000 = 1
• a = ȳ − b x̄
• a = 140 − 1 × 50 = 90
• So the fitted regression equation is y = 90 + 1x.
• For age 55: B.P = 90 + 1 × 55 = 90 + 55 = 145 mm Hg.
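The fit of B.P on age can be reproduced with a short sketch that applies the same formulas for b and a. The function name `fit_y_on_x` is an illustrative assumption.

```python
def fit_y_on_x(xs, ys):
    """Return (a, b) for the regression line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    # b = (sum(xy) - n*mean_x*mean_y) / (sum(x^2) - n*mean_x^2)
    b = (sum_xy - n * mean_x * mean_y) / (sum_x2 - n * mean_x ** 2)
    # a = mean_y - b*mean_x
    a = mean_y - b * mean_x
    return a, b


age = [30, 40, 50, 60, 70]
bp = [120, 130, 140, 150, 160]

a, b = fit_y_on_x(age, bp)
predicted_bp_at_55 = a + b * 55

print(f"y = {a} + {b}x")
print(f"estimated B.P at age 55 = {predicted_bp_at_55} mm Hg")
```

The estimate is obtained by substituting the age of interest, 55, for x in the fitted equation.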
68. Regression Analysis: Example 2
• Fit the two regression lines for the
following data.
• n = 5
• x̄ = Σx/n = 150/5 = 30
• ȳ = Σy/n = 350/5 = 70
• The regression equations to be fitted are y =
a + bx and x = a1 + b1y.
X   10   20   30   40   50
Y   30   50   70   90   110
69. Regression Equation of y on x.
• Find b and a using the given formula.
• b = (Σxy − n x̄ ȳ) / (Σx² − n x̄²)
• a = ȳ − b x̄
71. Regression Equation of y on x.
• b = (12500 − 5 × 30 × 70) / (5500 − 5 × 30²)
• b = (12500 − 10500) / (5500 − 4500)
• b = 2000/1000 = 2
• a = ȳ − b x̄
• a = 70 − 2 × 30 = 70 − 60 = 10
• So the fitted regression equation is y = 10 +
2x.
72. Regression Equation of x on y.
• Find b1 and a1 using the formula.
• b1 = (Σxy − n x̄ ȳ) / (Σy² − n ȳ²)
• a1 = x̄ − b1 ȳ
73. Regression Equation of x on y.
• b1 = (12500 − 5 × 30 × 70) / (28500 − 5 × 70²)
• b1 = (12500 − 10500) / (28500 − 24500)
• b1 = 2000/4000 = 0.5
• a1 = x̄ − b1 ȳ
• a1 = 30 − 0.5 × 70 = 30 − 35 = −5
• So the fitted regression equation is x = −5 +
0.5y.
74. Properties
• The square root of the product of the two
regression coefficients is the correlation
coefficient. In the given example:
• b = byx = 2
• b1 = bxy = 0.5
• r = √(2 × 0.5) = √1 = 1
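Both regression lines of Example 2, and the correlation coefficient recovered from their slopes, can be sketched as below. The helper name `fit_line` is an illustrative assumption; giving r the sign of the regression coefficients is the usual convention and is stated here as an assumption.

```python
import math


def fit_line(us, vs):
    """Regress v on u and return (intercept, slope) for v = intercept + slope*u."""
    n = len(us)
    mean_u = sum(us) / n
    mean_v = sum(vs) / n
    sum_uv = sum(u * v for u, v in zip(us, vs))
    sum_u2 = sum(u * u for u in us)
    slope = (sum_uv - n * mean_u * mean_v) / (sum_u2 - n * mean_u ** 2)
    intercept = mean_v - slope * mean_u
    return intercept, slope


x = [10, 20, 30, 40, 50]
y = [30, 50, 70, 90, 110]

a, b = fit_line(x, y)      # y on x: y = a + b*x, so b = byx
a1, b1 = fit_line(y, x)    # x on y: x = a1 + b1*y, so b1 = bxy

# r is the square root of byx * bxy, carrying the sign of the coefficients.
r = math.copysign(math.sqrt(b * b1), b)

print(f"y on x: y = {a} + {b}x")
print(f"x on y: x = {a1} + {b1}y")
print(f"r = {r}")
```

Swapping the roles of the two lists is all that is needed to obtain the second line, since the formulas for b1 and a1 mirror those for b and a.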
75. Coefficient of Variation
• The coefficient of variation (CV) expresses the
standard deviation as a percentage of the mean:
CV = (SD / mean) × 100.
• To compare the variability of two or more series,
we can use the coefficient of variation.
• A series of data for which the coefficient of
variation is large is more
variable, i.e. less stable or less uniform.
• If the coefficient of variation is small, it indicates
that the group is less variable, i.e. more
stable or more uniform.
76. Coefficient of Variation
• Find the CV for the following data. ( 13, 35, 56,
58, 35, 60 )
• Mean = 42.8
• SD = 18.5
• CV = 18.5/42.8 = 0.43 (43%)
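The figures above can be reproduced with Python's `statistics` module. The sketch assumes the SD on the slide is the sample standard deviation (divisor n − 1), since that is what gives 18.5 for this data; the function name is illustrative.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV = sample SD / mean, returned as a proportion."""
    return stdev(values) / mean(values)


data = [13, 35, 56, 58, 35, 60]

cv = coefficient_of_variation(data)

print(f"mean = {mean(data):.1f}")
print(f"SD   = {stdev(data):.1f}")
print(f"CV   = {cv:.2f} ({cv * 100:.0f}%)")
```

The same function can be applied to each of two series, and the series with the smaller CV is the more consistent one.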
77. Coefficient of Variation: Example
• To compare their efficacy, 2 sleep-producing
drugs were tested independently on 7
patients. The following data give the
amount of sleep (in hours) the patients had
after taking the drugs.
• Compare the efficacy of the two drugs on
the basis of the coefficient of variation.
Drug A   6   2   4   5   3   2   1
Drug B   3   6   7   2   6   3   7