This is a jumbo slide deck on Business Statistics for NET-JRF and SLET in Commerce, according to the updated NTA syllabus. It is prepared from the reader's point of view, gives basic clarity on the contents, covers all topics of the Business Statistics syllabus, and aims to bring the collected material under one roof in detail, with at least one worked problem from each category wherever problems are applicable. Presented at the Student's Academic Programme held at Sri H. D. Devegowda Government First Grade College, Holenarasipura, by Sundar B. N., Assistant Professor of Commerce at Government First Grade College for Women, Holenarasipura, Hassan District.
1. “LAKSHYA: NET-JRF & KSET”
STUDENT’S ACADEMIC DEVELOPMENT
PROGRAMME
AT
SRI H. D. DEVEGOWDA G.F.G.COLLEGE,
PADUVALAHIPPE
2. Business Statistics in
NET-JRF/KSET
Presented by
Sundar B. N.
Assistant Professor
I am the wisest man alive, for I
know one thing, and that is that I
know nothing.
- attributed to Socrates, in Plato's Apology
3. Business Statistics and Research Methods
Measures of central tendency
Measures of dispersion
Measures of Skewness
Correlation and regression of two variables
Probability: Approaches to probability; Bayes’ theorem
Probability distributions: Binomial, Poisson and normal distributions
Research: Concept and types; Research designs
Data: Collection and classification of data
Sampling and estimation: Concepts; Methods of sampling - probability and
nonprobability methods; Sampling distribution; Central limit theorem; Standard
error; Statistical estimation
Hypothesis testing: z-test; t-test; ANOVA; Chi-square test; Mann-Whitney test
(U-test); Kruskal-Wallis test (H-test); Rank correlation test
Report writing
4. STATISTICS
Measures of central tendency
Measures of dispersion
Measures of Skewness
Correlation and regression of two variables
Probability: Approaches to probability; Bayes’ theorem
Probability distributions: Binomial, Poisson and normal
distributions
Hypothesis testing: z-test; t-test; ANOVA; Chi-square test;
Mann-Whitney test (U-test); Kruskal-Wallis test (H-test);
Rank correlation test
Report writing
5. Meaning of Research
Research is an investigative process of finding a reliable solution to a
problem through the systematic selection, collection, analysis and
interpretation of data relating to the problem.
7. Analysis of Data
Data analysis is the process of systematically applying statistical and
logical techniques to describe and illustrate, condense and recap,
and evaluate data.
Technically speaking, processing implies editing,
coding, classification and tabulation of collected
data so that they are amenable to analysis.
8. Statistics
The word “Statistics” seems to have been derived from the Latin word ‘Status’, the Italian
word ‘Statista’ or the German word ‘Statistik’.
But according to the observations of John Graunt (1620-1674), the word
‘Statistics’ is of Italian origin, derived from the word ‘Stato’; a
‘statista’ is a person who deals with the affairs of the state.
That is, initially kings, monarchs and governments used it to collect
information related to the population, agricultural land, wealth, etc. of the
state. Their aim was simply to get an idea of the manpower of the
state, the force needed for the purpose of a war, and the taxes to be imposed to
meet the financial needs of the state. So, initially it was used by
kings, monarchs and governments for the administrative requirements of the
state. That is why its origin lies in statecraft (the art of managing state
affairs).
On the basis of evidence from papyrus manuscripts and ancient monuments in
pharaonic temples, it is assumed that the first census in the world was carried out
in Egypt in 3050 BC. Yet China’s census data from around 2000 BC are considered
the oldest surviving census data in the world.
9. Statistics in India
In the 3rd century BC the “Arthashastra” came into existence, written by one of the greatest geniuses of
political administration, Kautilya. In it, he described the details related to the conduct of population,
agriculture and economic censuses. An efficient system of collecting official and administrative statistics
was in use during the reign of Chandragupta Maurya (324-300 BC) under the guidance of Kautilya.
Many subjects, such as the taxation policy of the state, governance and administration, public finance and the
duties of a king, are also discussed in this celebrated Arthashastra.
Another piece of evidence that statistics was in use during Emperor Akbar’s reign (1556-1605) is in the form of
the “Ain-I-Akbari”, written by Abul Fazl, one of the nine gems of Akbar. Raja Todar Mal, Akbar’s finance
minister and another of the nine gems, kept very good records of land and
revenue and developed a very systematic revenue collection system in the kingdom of Akbar by
using his expertise and the recorded data. The revenue collection system developed by Raja Todar Mal
was so systematic that it became a model for future Mughals and, later on, for the British.
The British Government, after the transfer of power from the East India Company, started a publication
entitled ‘Statistical Abstract of British India’ as a regular annual feature in 1868, in which all the useful
statistical information relating to the local administrations of all the British provinces was provided. In
between, some census reports appeared based on particular areas, but not at the national
level. The first attempt to get detailed information on the whole population of India was made between
1867 and 1872. The first decennial census was undertaken on 17 February 1881 by W. W. Plowden,
the first Census Commissioner of India. Since then a census has been carried out every 10 years
in India; the 2011 census was the 15th census of India.
The credit for establishing Statistics as a discipline in India goes to Prasanta Chandra Mahalanobis (P.C.
Mahalanobis). He was a professor of physics at Presidency College, Kolkata. During his study at
Cambridge he got a chance to go through the work of Karl Pearson and R. A. Fisher. Continuing his
interest in Statistics, he established a statistical laboratory in Presidency College, Kolkata. On 17
December 1931, this statistical laboratory was given the name Indian Statistical Institute (ISI).
The first postgraduate course in Statistics was started by Calcutta University in 1941, while the first
undergraduate course in Statistics was started by Presidency College, Kolkata.
10. DEFINITION OF STATISTICS
“Statistics is the science of counting.” – A.L. Bowley
“Statistics is the science of averages.” – A.L. Bowley
Statistics is “the science of the measurement of the social organism, regarded as a whole, in all
its manifestations.” – A.L. Bowley
“Statistics are the numerical statements of facts in any department of enquiry placed in relation to
each other.” – A.L. Bowley
“By statistics we mean quantitative data affected to a marked extent by multiplicity of causes.” –
Yule and Kendall
“Science of estimates and probabilities.” – Boddington
“The method of judging collective natural or social phenomena from the results obtained by the
analysis of an enumeration or collection of estimates.” – W.I. King
“Statistics is the science which deals with the collection, classification and tabulation of numerical
facts as the basis for the explanation, description and comparison of phenomena.” – Lovitt
“The science which deals with the collection, tabulation, analysis and interpretation of numerical
data.” – Croxton and Cowden
From the above definitions, Statistics may be comprehended as:
“Statistics is a branch of science which deals with the collection, classification,
tabulation, analysis and interpretation of data.”
11. DATA
Data play the role of raw material for any statistical
investigation and may be defined in a single sentence as:
“The values of different objects collected in a survey, or the
recorded values of an experiment over a time period, taken
together constitute what we call data in Statistics.”
Each value in the data is known as an observation. Statistical
data may be classified, on the basis of the characteristic, the nature
of the characteristic, the level of measurement, the time component and
the way of obtaining them, as follows:
12. Types of Data
Based on the characteristic
Qualitative Data
Quantitative Data
Based on nature of the characteristic
Discrete data
Continuous data
Based on the level of measurement
Nominal Data
Ordinal Data
Interval Data
Ratio Data
Based on the Time Component
Time Series data
Cross Sectional data
Based on the ways of obtaining the data
Primary Data
Secondary Data
13. Quantitative Data
As the name quantitative itself suggests, it is related to quantity. In fact, data are
said to be quantitative if a numerical quantity (which exactly measures the
characteristic under study) is associated with each observation.
Generally, interval or ratio scales are used as the scale of measurement in the case of
quantitative data. Characteristics such as weight, height, age, length, area, volume,
money, temperature, humidity and size generally give quantitative data.
For example,
(i) Weights in kilogram (say) of students of a class.
(ii)Height in centimetre (say) of the candidates appearing in a direct recruitment of
Indian army organised by a particular cantonment.
(iii)Ages of females at the time of marriages celebrated over a period of a week in
Delhi.
(iv)Length (in cm) of different tables in a showroom of furniture.
14. Qualitative Data
As the name qualitative itself suggests, it is related to the quality of an object/thing. It is
obvious that quality cannot be measured numerically in exact terms. Thus, if the
characteristic/attribute under study is such that it is measured only on the basis of presence or
absence, then the data thus obtained are known as qualitative data.
Generally, nominal and ordinal scales are used as the scale of measurement in the case of qualitative
data. Characteristics such as gender, marital status, qualification, colour, religion,
satisfaction, types of trees, beauty and honesty generally give qualitative data.
For example,
i. If the characteristic under study is gender then objects can be divided into two categories,
male and female.
ii. If the characteristic under study is marital status, then objects can be divided into four
categories: married, unmarried, divorced and widowed.
iii. If the characteristic under study is qualification (say) ‘matriculation’ then objects can be
divided into two categories as ‘Matriculation passed’ and ‘not passed’.
iv. If the characteristic under study is ‘colour’ then the objects can be divided into a number of
categories Violet, Indigo, Blue, Green, Yellow, Orange and Red.
15. Discrete Data
If the nature of the characteristic under study is such that the values of observations are
at most countable between two certain limits, then the corresponding data are known as
discrete data.
For example,
(i) The number of books on the shelf of an almirah in a library forms discrete data, because
the number of books may be 0 or 1 or 2 or 3, …. The number of books cannot take
real values such as 0.8, 1.32, 1.53245, etc.
(ii)If there are 30 students in a class, then the number of students present in a lecture
forms discrete data, because the number of present students may be 1 or 2 or 3 or 4
or … or 30, but cannot take real values between 0 and 30 such as 1.8675, 22.56,
29.95, etc.
(iii)Number of children in a family in a locality forms discrete data. Because number of
children in a family may be 0 or 1 or 2 or 3 or 4 or…. But number of children
cannot take any real values such as 2.3, 3.75, etc.
(iv)Number of mistakes on a particular page of a book. Obviously number of
mistakes may be 0 or 1 or 2 or 3…. But cannot be 6.74, 3.9832, etc.
16. Continuous Data
Data are said to be continuous if the measurement of the observations of
a characteristic under study may be any real value between two
certain limits.
For example,
(i)Data obtained by measuring the heights of the students of a class of
say 30 students form continuous data, because if minimum and
maximum heights are 152cm and 175 cm then heights of the students
may take any possible values between 152 cm and 175 cm. For
example, it may be 152.2375 cm, 160.31326… cm, etc.
(ii)Data obtained by measuring weights of the students of a class also
form continuous data because weights of students may be 48.25796…
kg, 50.275kg, 42.314314314…kg, etc.
17. Time Series Data
Collection of data is done to solve a purpose in hand. The purpose may have its connection with time,
geographical location or both. If the purpose of data collection has its connection with time then it is
known as time series data. That is, in time series data, time is one of the main variables and the data
collected usually at regular interval of time related to the characteristic(s) under study show how
characteristic(s) changes over the time.
For example, quarterly profit of a company for last eight quarters, yearly production of a crop in India for last
six years, yearly expenditure of a family on different items for last five years, weekly rate of inflation for
last ten weeks, etc. all form time series data.
If the purpose of the data collection has its connection with geographical location then it is known as Spatial
Data. For example,
(i) Price of petrol in Delhi, Haryana, Punjab, Chandigarh at a particular time.
(ii) Number of runs scored by a batsman in different matches in a one day series in different stadiums.
If the purpose of the data collection has its connection with both time and geographical location then it is
known as Spatio-Temporal Data.
For example, data related to the population of different states of India in 2001 and 2011 are Spatio-Temporal
Data.
In time series data, spatial data and spatio-temporal data, the concept of frequency has no
significance, and hence such data are known as non-frequency data.
For instance, in the example discussed in the case of time series data, an expenditure of Rs 40000 on food in 2006 is
itself important; here its frequency, say 3 (repeated three times), does not make any sense.
Now consider the marks of 40 students in a class out of 10 (say). Here more than one student may
score the same marks in the test. Suppose 5 of the 40 students score 10 out of 10;
then the mark 10 has frequency 5. This type of data, where frequency is meaningful, is known as
frequency data.
18. Cross Sectional Data
Sometimes we are interested in knowing how a characteristic (such
as income or expenditure, population, votes in an election, etc.) under
study at one point in time is distributed over different subjects (such
as families, countries, political parties, etc.). This type of data, which
is collected at one point in time, is known as cross-sectional data.
For example, annual income of different families of a locality, survey of
consumer’s expenditure conducted by a research scholar, opinion
polls conducted by an agency, salaries of all employees of an institute,
etc.
19. Primary Data
Data which are collected by an investigator, agency or institution for a specific purpose, where
these people are the first to use the data, are called primary data. That is, these data are
originally collected by these people and they are the first to use them.
For example, suppose a research scholar wants to know the mean age of students of M.Sc.
Chemistry of a particular university. If he collects the data on the age of each student
of M.Sc. Chemistry of that university by contacting each student personally, the
data so obtained are an example of primary data for that research
scholar.
There are a number of methods of collection of primary data depending upon many factors such
as geographical area of the field, money available, time period, accuracy needed, literacy of
the respondents/informants, etc.
Here we will discuss only following commonly used methods.
(1) Direct Personal Investigation Method
(2) Telephone Method
(3) Indirect Oral Interviews Method
(4) Local Correspondents Method
(5) Mailed Questionnaires Method
(6) Schedules Method
Let us discuss these methods one by one with some examples, merits and demerits.
20. SECONDARY DATA
The discussion in the previous section shows that collection of primary data requires a lot of time, money,
manpower, etc. Sometimes some or all of these resources are not sufficient to go for the collection of
primary data. Also, in some situations it may not be feasible to collect primary data easily. To overcome
these difficulties, there is another way of obtaining data, known as secondary data. The data
obtained/gathered by an investigator, agency or institution from a source which already exists are
called secondary data. That is, these data were originally collected by some investigator, agency or
institution, have been used by them at least once, and are now going to be used at least a second
time. Already existing data in different sources may be in published or unpublished form, so the sources of
secondary data can broadly be classified under the following two heads.
(1) Published Sources
When an institution or organisation publishes its own collected data (primary data) in public domain either in
printed form or in electronic form then these data are said to be secondary data in published form and the
source where these data are available is known as published source of the secondary data of the
corresponding institution or organisation. Some of the published sources of secondary data are given
below:
International Publications
Government Publications in India
Published Reports of Commissions and Committees
Research Publications
Reports of Trade and Industry Associations
Published Printed Sources
Published Electronic Sources
21. SECONDARY DATA(2)
(2)Unpublished Sources- Information collected as data, or data
observed through one's own experience, by an individual or an
organisation, which is in unpublished form, is known as an
unpublished source of secondary data.
(i) Records and statistics maintained by different institutions
or organisations whether they are government or non-
government
(ii)Unpublished projects works, field works or some other
research related works submitted by students in their
corresponding institutes
(iii)Records of Central Bureau of Investigation
(iv)Personal diaries, etc.
22. MEASUREMENT SCALES
Two words “counting” and “measurement” are very frequently used by everybody. For
example, if you want to know the number of pages in a note book, you can easily
count them. Also, if you want to know the height of a man, you can easily measure
it. But, in Statistics, the acts of counting and measurement are divided into four levels of
measurement scales, known as:
(1) Nominal Scale
In Latin, ‘Nomen’ means name. The word nominal has come from this Latin word, i.e.
‘Nomen’. Therefore, under nominal scale we divide the objects under study into
two or more categories by giving them unique names. The classification of objects
into at least two or more categories is done in such a way that:
(a) Each object falls in only one category, i.e. each object falls in a unique
category; it either belongs to a category or not. Mathematically, we may use the
symbols “=” or “≠” according as an object falls in a category or not.
(b) The number of categories must be sufficient to include all objects, i.e. there should be no
scope for even a single object not falling in any of the categories. That is, in
statistical language, the categories must be mutually exclusive and exhaustive. Generally,
the nominal scale is used when we want to categorize data based on
characteristics such as gender, race, region, religion, etc.
23. (2) Ordinal Scale
We have seen that order does not make any sense in the nominal scale. As the name ordinal itself
suggests, other than the names or codes given to the different categories, this scale also provides an
order among the categories. That is, we can place the objects in a series based on the orders or ranks
given by using the ordinal scale. But here we cannot find the actual difference between two categories.
Generally ordinal scale is used when we want to measure the attitude scores towards the level of liking,
satisfaction, preference, etc. Different designation in an institute can also be measured by using
ordinal scale. For example
Suppose a school boy is asked to list the names of three ice-cream flavours according to his preference,
and he lists them in the following order:
Vanilla > Strawberry > Tutti-frutti
This indicates that he likes vanilla more than strawberry, and strawberry more than
tutti-frutti. But the actual difference between his liking for vanilla and strawberry
cannot be measured.
Under the Sixth Pay Commission, teachers of colleges and universities are designated as Assistant Professor,
Associate Professor and Professor. The rank of Professor is higher than that of Associate Professor,
and the designation of Associate Professor is higher than that of Assistant Professor. But you cannot find the
actual difference between Professor and Associate Professor, or Professor and Assistant Professor,
or Associate Professor and Assistant Professor. This is because one teacher in a designation might
have served a certain number of years and done good quality research work, while another
teacher in the same designation might have served a lesser number of years and done
unsatisfactory research work. So the actual difference between one designation and another
cannot be found: one teacher may be very near to the next higher designation and another may be far from it.
24. (3) Interval Scale
If I = [4, 9] then the length of this interval is 9 − 4 = 5, i.e. the difference between 4 and 9 is 5;
we can find the difference between any two points of the interval. For example, for the points 7 and 7.3,
the difference is 0.3. Thus we see that the property of difference holds in the case
of intervals. Similarly, the third level of measurement, the interval scale, possesses the
property of difference, which was not satisfied in the case of the nominal and ordinal scales.
Nominal scale gives only names to the different categories, ordinal scale moving one step
further also provides the concept of order between the categories and interval scale
moving one step ahead to ordinal scale also provides the characteristic of the difference
between any two categories.
The interval scale is used when we want to measure years/historical time/calendar time,
temperature (except on the Kelvin scale), sea level, marks in tests where there is
negative marking, etc. Mathematically, this scale admits the operations + and − in addition
to >, <, = and ≠.
Let us consider an example:
The measurement of the time of a historical event comes under the interval scale because there is
no fixed origin of time (i.e. no natural ‘0’ year). The ‘0’ year differs from calendar to calendar and
from society/country to society/country; e.g. the Hindu, Muslim and Hebrew calendars have
different origins of time, i.e. the ‘0’ year is not uniquely defined. In Indian history also, we may find
dates in BC (Before Christ).
25. (4) Ratio Scale
Ratio scale is the highest level of measurement because nominal scale gives only names
to the different categories, ordinal scale provides orders between categories other
than names, interval scale provides the facility of difference between categories
other than names and orders but ratio scale other than names, orders and
characteristic of difference also provides natural zero (absolute zero). In ratio
measurement scale values of characteristic cannot be negative.
The ratio scale is used when we want to measure temperature in Kelvin, weight, height,
length, age, mass, time, plane angle, etc. The ratio scale admits × and ÷ in addition
to +, −, >, <, = and ≠. But be careful never to take ‘0’ in the denominator while
finding ratios.
For example, 4/0 is meaningless.
let us consider some examples,
Measurement of temperature on the Kelvin scale comes under the ratio scale because it has an
absolute zero, which is equivalent to −273.15 °C. This fixed origin allows
us to make statements like “50 K (read as 50 kelvin) is 5 times as hot
as 10 K”.
Both the height (in cm) and the age (in days) of students of M.Sc. Statistics of a particular
university satisfy all the requirements of a ratio scale, because height and age both
cannot be negative (i.e. both have an absolute zero).
26. Permissible Statistical Tools in measurement scales
Measurement Scale | Permissible Statistical Tools | Logic/Reason
Nominal Scale | Mode, chi-square test and run test | Here counting is the only permissible operation.
Ordinal Scale | Median and all positional averages like quartiles, deciles and percentiles; Spearman’s rank correlation | Here, other than counting, the order relation (less than or greater than) also exists.
Interval Scale | Mean, S.D., t-test, F-test, ANOVA, simple, multiple and moment correlations, regression | Here counting, order and difference operations hold.
Ratio Scale | Geometric mean (G.M.), Harmonic mean (H.M.), Coefficient of variation | Here counting, order, difference and a natural zero exist.
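As a quick illustration of the table above, here is a minimal Python sketch using only the standard `statistics` module; all the sample values are invented purely for illustration:

```python
import statistics

# Nominal data: only counting is meaningful -> mode
genders = ["male", "female", "female", "male", "female"]
nominal_summary = statistics.mode(genders)        # most frequent category

# Ordinal data: order also exists -> median (a positional average)
satisfaction = [1, 2, 2, 3, 5]                    # ranks: 1 = low ... 5 = high
ordinal_summary = statistics.median(satisfaction)

# Interval data: differences are meaningful -> mean and standard deviation
temperatures_c = [18.5, 21.0, 19.5, 22.0, 20.0]   # Celsius has no absolute zero
interval_mean = statistics.mean(temperatures_c)
interval_sd = statistics.stdev(temperatures_c)

# Ratio data: a natural zero exists -> geometric/harmonic mean are also valid
weights_kg = [54, 56, 70, 45, 50]
ratio_gm = statistics.geometric_mean(weights_kg)
ratio_hm = statistics.harmonic_mean(weights_kg)

print(nominal_summary, ordinal_summary, interval_mean)
```

Note that each scale permits every operation of the scales below it, which is why the mean is valid for interval and ratio data but not for ordinal data.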
27. Types of Data Analysis
1. Descriptive Statistics - provide an overview of the attributes
of a data set. These include measurements of central tendency
(frequency, histograms, mean, median, & mode) and dispersion
(range, variance & standard deviation)
2. Inferential Statistics - provide measures of how well your
data support your hypothesis and of whether your results are
generalizable beyond what was tested (significance tests)
28. Types of Data Analysis
Descriptive
Measures of central tendency
Measures of dispersion
Measures of Skewness
Correlation and regression of
two variables
Inferential
Parametric tests-
Hypothesis testing: z-test; t-test;
ANOVA(1 Way);
Chi–square test;
Non-Parametric tests-
Mann-Whitney test (U-test);
Kruskal-Wallis test (H-test);
Rank correlation test
29. Measures of central tendency- According to Professor Bowley, averages are
“statistical constants which enable us to comprehend in a single effort the significance
of the whole”. They throw light on how the values are concentrated in the central part
of the distribution; for this reason they are also called measures
of central tendency. An average is a single value which is considered the most
representative of a given set of data. Measures of central tendency show the tendency
of data to cluster around some central value.
Significance of the Measure of Central Tendency
The following are two main reasons for studying an average:
1. To get a single representative
A measure of central tendency enables us to get a single value from the mass of data and
also provides an idea about the entire data. For example, it is impossible to remember
the height measurements of all the students in a class. But if the average height is
obtained, we get a single value that represents the entire class.
2. To facilitate comparison
Measures of central tendency enable us to compare two or more than two populations
by reducing the mass of data in one single figure. The comparison can be made
either at the same time or over a period of time. For example, if a subject has been
taught in two or more classes, then by obtaining the average marks of those classes
a comparison can be made.
30. Properties of a Good Average
1. It should be simple to understand Since we use measures of central tendency to simplify the
complexity of data, an average should be easily understandable, otherwise its use is bound to be very
limited.
2. It should be easy to calculate An average not only should be easy to understand but also should be simple
to compute, so that it can be used as widely as possible.
3. It should be rigidly defined A measure of central tendency should be defined properly so that it has an
appropriate interpretation. It should also have an algebraic formula so that if different people compute the
average from same figures, they get the same answer.
4. It should be amenable to algebraic manipulations A measure of central tendency should be amenable to
algebraic manipulation. If there are two sets of data and the individual averages are available for both
sets, then one should be able to find the average of the combined set as well.
5. It should be least affected by sampling fluctuations We should prefer a tool which has a sampling stability.
In other words, if we select 10 different groups of observations from same population and compute the
average of each group, then we should expect to get approximately the same values. There may be little
difference because of the sampling fluctuation only.
6. It should be based on all the observations If any measure of central tendency is used to analyse the data, it
is desirable that each and every observation is used for its calculation.
7. It should be possible to calculate even for open-end class intervals
A measure of central tendency should able to be calculated for the data with open end classes.
8. It should not be affected by extremely small or extremely large observations
It is assumed that each and every observation influences the value of the average. If one or two very small or
very large observations unduly affect the average, i.e. either increase or decrease its value largely, then the
average cannot be considered a good average.
31. Different Measures of central tendency
1) Arithmetic Mean
2) Weighted Mean
3) Geometric Mean
4) Harmonic Mean
5) Median
6) Mode
Partition Values
1) Quartiles
2) Deciles
3) Percentiles
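The partition values listed above can be illustrated with Python's standard `statistics.quantiles`; the marks below are invented purely for illustration. With the default 'exclusive' method, Q2, D5 and P50 all coincide with the median:

```python
import statistics

marks = [12, 15, 18, 20, 22, 25, 28, 30, 35, 40, 45]

# Quartiles divide the ordered data into 4 equal parts (Q1, Q2, Q3)
quartiles = statistics.quantiles(marks, n=4)

# Deciles divide the data into 10 equal parts (D1 ... D9)
deciles = statistics.quantiles(marks, n=10)

# Percentiles divide the data into 100 equal parts (P1 ... P99)
percentiles = statistics.quantiles(marks, n=100)

# Q2, D5 and P50 are all just the median
assert quartiles[1] == deciles[4] == percentiles[49] == statistics.median(marks)
print(quartiles)
```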
32. Arithmetic Mean Arithmetic mean (also called mean) is defined as the
sum of all the observations divided by the number of observations.
Arithmetic mean fulfils most of the properties of a good average except
the last two. It is particularly useful when we are dealing with a sample
as it is least affected by sampling fluctuations. It is the most popular
average and should always be our first choice unless there is a strong
reason for not using it.
Example: calculate the mean of the weights of five students: 54, 56, 70, 45, 50 (in kg).
The sum of the given values is 54 + 56 + 70 + 45 + 50 = 275, and 275/5 = 55.
Therefore, the average weight of the students is 55 kg.
Merits
1. It utilizes all the observations;
2. It is rigidly defined;
3. It is easy to understand and
compute; and
4. It can be used for further
mathematical treatments.
Demerits
1. It is badly affected by extremely
small or extremely large values;
2. It cannot be calculated for open
end class intervals; and
3. It is generally not preferred for
highly skewed distributions.
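The worked example above (weights 54, 56, 70, 45, 50 kg) can be verified with a one-line computation, sketched here in Python:

```python
# Arithmetic mean = sum of all observations / number of observations
weights = [54, 56, 70, 45, 50]
mean_weight = sum(weights) / len(weights)   # 275 / 5
print(mean_weight)  # 55.0
```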
33. WEIGHTED MEAN
Weight here refers to the importance of a value in a distribution. A
simple logic is that a number is as important in the distribution as the
number of times it appears. So, the frequency of a number can also be
its weight. But there may be other situations where we have to
determine the weight based on some other reasons.
For example, the number of innings in which runs were made may be
considered as weight because runs (50 or 100 or 200) show their
importance. Calculating the weighted mean of scores of several
innings of a player, we may take the strength of the opponent (as
judged by the proportion of matches lost by a team against the
opponent) as the corresponding weight. The higher the proportion, the
stronger the opponent, and hence the greater the weight.
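A weighted mean divides the sum of value-times-weight products by the sum of the weights. A minimal Python sketch, with scores and weights invented purely for illustration:

```python
# Weighted mean: each score is weighted by the (assumed) strength of the opponent.
scores = [50, 100, 200]
weights = [0.2, 0.3, 0.5]   # stronger opponent -> larger weight (invented values)

# weighted mean = sum(w * x) / sum(w)
weighted_mean = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
print(weighted_mean)
```

When all weights are equal, this reduces to the ordinary arithmetic mean.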
34. MEDIAN
Median is that value of the variable which divides the whole distribution into two equal parts. Here, it may be
noted that the data should be arranged in ascending or descending order of magnitude. When the number of
observations is odd then the median is the middle value of the data. For even number of observations, there
will be two middle values, so we take the arithmetic mean of these two middle values. The numbers of
observations below and above the median are the same. The median is not affected by extremely large or extremely
small values (as it corresponds to the middle value) and is also not affected by open-end class intervals. In
such situations, it is preferable in comparison to mean. It is also useful when the distribution is skewed
(asymmetric).
Find median of following observations: 6, 4, 3, 7, 8
First we arrange the given data in ascending order as 3, 4, 6, 7, 8
Since, the number of observations i.e. 5, is odd, so median would be the middle value that is 6.
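The odd/even rule described above can be sketched as:

```python
# Median: sort the data, then take the middle value (odd n)
# or the mean of the two middle values (even n).
data = [6, 4, 3, 7, 8]
s = sorted(data)                     # [3, 4, 6, 7, 8]
n = len(s)
if n % 2 == 1:
    median = s[n // 2]
else:
    median = (s[n // 2 - 1] + s[n // 2]) / 2
print(median)  # 6
```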
Merits
1. It is rigidly defined;
2. It is easy to understand and compute;
3. It is not affected by extremely small or extremely
large values; and
4. It can be calculated even for open end classes (like
“less than 10” or “50 and above”).
Demerits
1. In case of even number of observations we get
only an estimate of the median by taking the
mean of the two middle values. We don’t get its
exact value;
2. It does not utilize all the observations. The median
of 1, 2, 3 is 2. If the observation 3 is replaced by
any number higher than or equal to 2 and if the
number 1 is replaced by any number lower than
or equal to 2, the median value will be
unaffected. This means 1 and 3 are not being
utilized;
3. It is not amenable to algebraic treatment; and
4. It is affected by sampling fluctuations.
35. MODE
The most frequent observation in a distribution is known as the mode. In other words,
the mode is that observation in a distribution which has the maximum frequency. For
example, when we say that the average size of shoes sold in a shop is 7, it is the modal
size, the size sold most frequently.
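A minimal sketch of finding the mode, using hypothetical shoe-size sales data:

```python
from collections import Counter

# Hypothetical shoe sizes sold in a day; the mode is the size
# with the maximum frequency.
sizes = [7, 8, 7, 6, 7, 9, 8, 7]
mode = Counter(sizes).most_common(1)[0][0]
print(mode)  # 7
```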
Merits
1. Mode is the easiest average to
understand and also easy to calculate;
2. It is not affected by extreme values;
3. It can be calculated for open end
classes;
4. It requires only that the modal
class, the pre-modal class and
the post-modal class are of equal
width; and
5. Mode can be calculated even if the
other classes are of unequal width.
Demerits
1. It is not rigidly defined. A
distribution can have more than one
mode;
2. It is not utilizing all the observations;
3. It is not amenable to algebraic
treatment; and
4. It is greatly affected by sampling
fluctuations.
36. Relationship between Mean, Median
and Mode
For a symmetrical distribution the mean, median and
mode coincide. But if the distribution is moderately
asymmetrical, there is an empirical relationship
between them. The relationship is
Mean – Mode = 3 (Mean – Median)
Mode = 3 Median – 2 Mean
Note: Using this formula, we can calculate
mean/median/mode if other two of them are known.
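The empirical relation can be applied directly; here is a sketch with assumed values of mean and median:

```python
# Empirical relation for a moderately skewed distribution:
#   Mode = 3 * Median - 2 * Mean   (assumed illustrative values)
mean, median = 30.0, 28.0
mode = 3 * median - 2 * mean
print(mode)  # 24.0
```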
37. GEOMETRIC MEAN
The geometric mean (GM) of n observations is defined as the n-th
root of the product of the n observations. It is useful for averaging ratios or proportions. It
is the ideal average for calculating index numbers (index numbers are economic barometers
which reflect the change in prices or commodity consumption in the current period with
respect to some base period taken as standard). It fails to give the correct average if an
observation is zero or negative.
Merits
1. It is rigidly defined;
2. It utilizes all the observations;
3. It is amenable to algebraic treatment
(the reader should verify that if GM1
and GM2 are the geometric means of two
series, Series 1 of size n and Series 2
of size m respectively, then the geometric
mean of the combined series is given
by
log GM = (n log GM1 + m log GM2) / (n + m));
4. It gives more weight to small items;
and
5. It is not affected greatly by sampling
fluctuations.
Demerits
1. Difficult to understand and calculate;
and
2. It becomes imaginary for an odd
number of negative observations and
becomes zero if a single
observation is zero.
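A sketch of the definition (nth root of the product), applied to hypothetical growth ratios:

```python
import math

# GM of n observations = nth root of their product; suitable for
# averaging ratios (hypothetical year-on-year growth factors below).
ratios = [1.10, 1.20, 1.05]
gm = math.prod(ratios) ** (1 / len(ratios))
print(round(gm, 4))
```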
38. HARMONIC MEAN
The harmonic mean (HM) is defined as the reciprocal (inverse) of the
arithmetic mean of the reciprocals of the observations of a set.
Equivalently, it is the value obtained when the number of values in the
data set is divided by the sum of their reciprocals.
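A sketch of the definition; the classic use case is averaging speeds over equal distances (assumed figures):

```python
# HM = n / (sum of reciprocals). Average speed over two equal
# stretches driven at 40 km/h and 60 km/h (assumed values).
speeds = [40, 60]
hm = len(speeds) / sum(1 / v for v in speeds)
print(round(hm, 2))  # 48.0
```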
Merits
1. It is rigidly defined;
2. It utilizes all the
observations;
3. It is amenable to algebraic
treatment; and
4. It gives greater importance
to small items.
Demerits
1. Difficult to understand and
compute.
39. PARTITION VALUES
Partition values are those values of variable which divide the distribution into a certain number of equal parts.
Here it may be noted that the data should be arranged in ascending or descending order of magnitude.
Commonly used partition values are quartiles, deciles and percentiles. For example, quartiles divide the data
into four equal parts. Similarly, deciles and percentiles divide the distribution into ten and hundred equal
parts, respectively.
Quartiles
Quartiles divide the whole distribution into four equal parts. There are three quartiles: the 1st quartile,
denoted Q1, the 2nd quartile, denoted Q2, and the 3rd quartile, denoted Q3. One-fourth of the
data lies below Q1, one-half of the data below Q2 and three-fourths of the data below Q3.
Here, it may be noted that the data should be arranged in ascending or descending order of
magnitude.
Deciles
Deciles divide the whole distribution into ten equal parts. There are nine deciles: D1, D2, ..., D9 are
known as the 1st decile, 2nd decile, ..., 9th decile respectively, and the ith decile has
(iN/10) observations below it. Here, it may be noted that the data should be arranged in ascending or
descending order of magnitude.
Percentiles
Percentiles divide the whole distribution into 100 equal parts. There are ninety-nine percentiles: P1,
P2, ..., P99 are known as the 1st percentile, 2nd percentile, ..., 99th percentile, and the ith percentile
has (iN/100) observations below it. Here, it may be noted that the data should be arranged in
ascending or descending order of magnitude.
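With Python's standard library (statistics.quantiles, available from Python 3.8), the three quartiles of a small illustrative data set can be obtained as follows:

```python
import statistics

# Eleven sorted observations (illustrative); quantiles(n=4) returns
# the three cut points Q1, Q2, Q3 that split the data into quarters.
data = [2, 4, 5, 7, 8, 10, 12, 15, 18, 20, 21]
q1, q2, q3 = statistics.quantiles(data, n=4)
print(q1, q2, q3)  # 5.0 10.0 18.0
```

With n=10 or n=100, the same function yields deciles and percentiles.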
40. MEASURES OF DISPERSION
Different measures of central tendency give a value around which the data are concentrated, but they
give no idea about the nature of the scatter or spread. For example, the observations 10, 30 and
50 have mean 30, while the observations 28, 30, 32 also have mean 30. Both distributions
are spread around 30, but the variability among units is greater in the first
than in the second. In other words, there is greater variability or dispersion in the first set of
observations than in the other. A measure of dispersion is calculated to get an idea of
the variability in the data.
According to Spiegel, the degree to which numerical data tend to spread about an average value
is called the variation or dispersion of the data. There are two basic kinds of measures of
dispersion: (i) absolute measures and (ii) relative measures. The absolute measures of
dispersion are used to measure the variability of a given data expressed in the same unit,
while the relative measures are used to compare the variability of two or more sets of
observations. Following are the different measures of dispersion:
1. Range
2. Quartile Deviation
3. Mean Deviation
4. Standard Deviation and Variance
41. Properties of Good Measure of
Dispersion
The properties of a good measure of dispersion are similar to the
properties of a good measure of average. So, a good measure of
dispersion should possess the following properties:
1. It should be simple to understand;
2. It should be easy to compute;
3. It should be rigidly defined;
4. It should be based on each and every observation of the data;
5. It should be amenable to further algebraic treatment;
6. It should have sampling stability; and
7. It should not be unduly affected by extreme observations
42. RANGE
Range is the simplest measure of dispersion. It is defined as the difference between the
maximum value of the variable and the minimum value of the variable in the
distribution. Its merit lies in its simplicity. The demerit is that it is a crude measure
because it is using only the maximum and the minimum observations of variable.
However, it still finds applications in Order Statistics and Statistical Quality
Control.
R=X max-X min
where, X max : Maximum value of variable and
X min : Minimum value of variable
Find the range of the distribution 6, 8, 2, 10, 15, 5, 1, 13.
For the given distribution, the maximum value of variable is 15 and the minimum value
of variable is 1. Hence range = 15 -1 = 14.
Merits of Range
1. It is the simplest to understand;
2. It can be visually obtained since one can detect the largest and the smallest
observations easily and can take the difference without involving much calculations;
and
3. Though it is crude, it has useful applications in areas like order statistics and statistical quality control.
43. QUARTILE DEVIATION
As you have already studied about quartile that Q1 and Q3 are the first
quartile and the third quartile respectively. (Q3 – Q1) gives the inter
quartile range. The semi inter quartile range which is also known as
Quartile Deviation (QD) is given by
Quartile Deviation (QD) = (Q3 – Q1) / 2
The relative measure of QD, known as the Coefficient of QD, is defined as
Coefficient of QD = (Q3 – Q1) / (Q3 + Q1)
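With illustrative data, QD and its coefficient can be sketched as:

```python
import statistics

# Quartile deviation and its coefficient for illustrative data.
data = [2, 4, 5, 7, 8, 10, 12, 15, 18, 20, 21]
q1, _, q3 = statistics.quantiles(data, n=4)   # Q1 = 5.0, Q3 = 18.0
qd = (q3 - q1) / 2
coeff_qd = (q3 - q1) / (q3 + q1)
print(qd, round(coeff_qd, 4))  # 6.5 0.5652
```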
44. MEAN DEVIATION
Mean deviation is defined as average of the sum of the absolute values of deviation
from any arbitrary value viz. mean, median, mode, etc. It is often suggested to
calculate it from the median because it gives least value when measured from the
median.
The deviation of an observation xi from the assumed mean A is defined as (xi – A).
Therefore, the mean deviation can be defined as
MD = Σ |xi – A| / n
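A sketch of the mean deviation taken about the median (the point about which MD is least):

```python
import statistics

# Mean deviation about the median for the data 3, 4, 6, 7, 8.
data = [3, 4, 6, 7, 8]
med = statistics.median(data)                     # 6
md = sum(abs(x - med) for x in data) / len(data)  # (3+2+0+1+2)/5
print(md)  # 1.6
```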
Merits of Mean Deviation
1. It utilizes all the observations;
2. It is easy to understand and calculate; and
3. It is not much affected by extreme values.
Demerits of Mean Deviation
1. Negative deviations are straightaway made positive;
2. It is not amenable to algebraic treatment; and
3. It can not be calculated for open end classes
45. VARIANCE
In the previous section, we have seen that while calculating the mean deviation, negative deviations are
straightaway made positive. To overcome this drawback we move towards the next measure of dispersion
called variance. Variance is the average of the squared deviations of the values from the mean. Squaring the
deviations is a better technique to get rid of negative deviations.
Variance is defined as
σ² = Σ (xi – x̄)² / n
and for a frequency distribution, the formula is
σ² = Σ fi (xi – x̄)² / Σ fi
It should be noted that the sum of squared deviations is least when the deviations are measured from the mean.
This means Σ (xi – A)² is least when A = mean.
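The variance, and its positive square root the standard deviation, can be sketched on the earlier data set:

```python
import math

# Population variance and standard deviation of 28, 30, 32.
data = [28, 30, 32]
mean = sum(data) / len(data)                               # 30.0
variance = sum((x - mean) ** 2 for x in data) / len(data)  # (4+0+4)/3
sd = math.sqrt(variance)
print(round(variance, 4), round(sd, 4))  # 2.6667 1.633
```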
Merits of Variance
1. It is rigidly defined;
2. It utilizes all the observations;
3. Amenable to algebraic treatment;
4. Squaring is a better technique to get rid of negative deviations; and
5. It is the most popular measure of dispersion.
Demerits of Variance
1. In cases where the mean is not a suitable average, variance may not be the appropriate measure of
dispersion, for example when open-end classes are present. In such cases quartile deviation may be used;
2. Although easy to understand, its calculation may require a calculator or a computer; and
3. Its unit is the square of the unit of the variable, due to which it is more difficult to judge the magnitude of
dispersion than with the standard deviation.
46. Standard Deviation
Standard deviation (SD) is defined as the positive square root of the variance:
σ = √( Σ (xi – x̄)² / n )
Merits of Standard Deviation
1. It is rigidly defined;
2. It utilizes all the observations;
3. It is amenable to algebraic treatment;
4. Squaring is a better technique to get rid of negative deviations; and
5. It is the most popular measure of dispersion.
Demerits of Standard Deviation
1. In cases where mean is not a suitable average, standard deviation may not be the appropriate
measure of dispersion like when open end classes are present. In such cases quartile deviation
may be used;
2. It is not unit free; and
3. Although it is easy to understand, its calculation may require a calculator or a computer.
47. SKEWNESS
We have talked about average and dispersion. They give the location and scale of the
distribution.
In addition to measures of central tendency and dispersion, we also need to have an idea about
the shape of the distribution. Measure of Skewness gives the direction and the magnitude of the
lack of symmetry whereas the kurtosis gives the idea of flatness.
Lack of symmetry is called skewness for a frequency distribution. If the distribution is not
symmetric, the frequencies will not be uniformly distributed about the centre of the distribution.
CONCEPT OF SKEWNESS
Skewness means lack of symmetry. In mathematics, a figure is called symmetric if there exists a
point in it through which a perpendicular drawn to the X-axis divides the figure into two
congruent parts, identical in all respects, i.e. one part can be superimposed on the other as
mirror images of each other. In statistics, a distribution is called symmetric if the mean, median and mode
coincide; otherwise, the distribution is asymmetric. If the right tail is longer, we get a
positively skewed distribution, for which mean > median > mode, while if the left tail is longer, we
get a negatively skewed distribution, for which mean < median < mode.
The example of the Symmetrical curve, Positive skewed curve and Negative skewed curve are
given in the next slide
49. VARIOUS MEASURES OF SKEWNESS
Measures of skewness help us to know to what degree and in which direction (positive or
negative) the frequency distribution has a departure from symmetry. Although positive or
negative skewness can be detected graphically depending on whether the right tail or the left
tail is longer but, we don’t get idea of the magnitude. Besides, borderline cases between
symmetry and asymmetry may be difficult to detect graphically. Hence some statistical
measures are required to find the magnitude of lack of symmetry.
A good measure of skewness should possess three criteria:
1. It should be a unit-free number so that the shapes of different distributions, so far as symmetry
is concerned, can be compared even if the units of the underlying variables are different;
2. If the distribution is symmetric, the value of the measure should be zero. Similarly, the
measure should give positive or negative values according as the distribution has positive or
negative skewness respectively; and
3. As we move from extreme negative skewness to extreme positive skewness, the value of the
measure should vary accordingly.
Measures of skewness can be both absolute as well as relative. Since in a symmetrical
distribution the mean, median and mode are identical, the more the mean moves away from the
mode, the larger the asymmetry or skewness. An absolute measure of skewness cannot be
used for purposes of comparison because the same amount of skewness has different
meanings in a distribution with small variation and in a distribution with large variation.
50. Absolute Measures of Skewness
Following are the absolute measures of skewness:
1. Skewness (Sk) = Mean – Median
2. Skewness (Sk) = Mean – Mode
3. Skewness (Sk) = (Q3 - Q2) - (Q2 - Q1)
For comparing two series, we do not calculate these absolute measures; we
calculate the relative measures, which are called coefficients of
skewness. Coefficients of skewness are pure numbers, independent of the
units of measurement.
51. Relative Measures of Skewness
In order to make a valid comparison between the skewness of two or more
distributions, we have to eliminate the disturbing influence of
variation. Such elimination can be done by dividing the absolute
skewness by the standard deviation.
The following are the important methods of measuring relative
skewness:
Karl Pearson’s coefficient of skewness:
Sk(P) = (Mean – Mode) / SD, or equivalently
Sk(P) = 3 (Mean – Median) / SD
Bowley’s coefficient of skewness:
Sk(B) = (Q3 + Q1 – 2 Median) / (Q3 – Q1)
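A sketch of Karl Pearson's median-based coefficient on hypothetical, mildly right-skewed data:

```python
import statistics

# Pearson's coefficient Sk = 3 * (mean - median) / SD.
# Hypothetical right-skewed data: a long tail to the right.
data = [2, 3, 3, 4, 5, 6, 8, 9, 12]
mean = statistics.mean(data)
median = statistics.median(data)
sd = statistics.pstdev(data)
sk = 3 * (mean - median) / sd
print(round(sk, 4))   # positive, indicating positive skewness
```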
52. CORRELATION CONCEPT
In many practical applications, we might come across the situation where observations are
available on two or more variables. The following examples will illustrate the situations clearly:
1. Heights and weights of persons of a certain group;
2. Sales revenue and advertising expenditure in business; and
3. Time spent on study and marks obtained by students in exam.
If data are available for two variables, say X and Y, it is called bivariate distribution.
Let us consider the example of sales revenue and expenditure on advertising in business. A
natural question arises in mind that is there any connection between sales revenue and
expenditure on advertising? Does sales revenue increase or decrease as expenditure on
advertising increases or decreases?
If we see the example of time spent on study and marks obtained by students, a natural question
appears whether marks increase or decrease as time spent on study increase or decrease.
In all these situations, we try to find out relation between two variables and correlation answers
the question, if there is any relationship between one variable and another.
When two variables are related in such a way that change in the value of one variable
affects the value of another variable, then variables are said to be correlated or there is
correlation between these two variables.
53. TYPES OF CORRELATION
1. Positive Correlation
Correlation between two variables is said to be positive if the values of the variables
deviate in the same direction i.e. if the values of one variable increase (or decrease)
then
the values of other variable also increase (or decrease).
Some examples of positive correlation are correlation between
1. Heights and weights of group of persons;
2. House hold income and expenditure;
3. Amount of rainfall and yield of crops; and
4. Expenditure on advertising and sales revenue
In the last example, it is observed that as the expenditure on advertising increases, sales
revenue also increases. Thus, the change is in the same direction. Hence the
correlation is positive.
In remaining three examples, usually value of the second variable increases (or
decreases) as the value of the first variable increases (or decreases).
54. 2. Negative Correlation
Correlation between two variables is said to be negative if the values of variables
deviate in opposite direction i.e. if the values of one variable increase (or decrease) then
the values of other variable decrease (or increase).
Some examples of negative correlations are correlation between
1. Volume and pressure of perfect gas;
2. Price and demand of goods;
3. Literacy and poverty in a country; and
4. Time spent on watching TV and marks obtained by students in examination.
In the first example pressure decreases as the volume increases or pressure increases as
the volume decreases. Thus the change is in opposite direction.
Therefore, the correlation between volume and pressure is negative.
In remaining three examples also, values of the second variable change in the opposite
direction of the change in the values of first variable.
55. SCATTER DIAGRAM
A scatter diagram is a statistical tool for assessing the possibility of correlation between a
dependent variable and an independent variable. A scatter diagram does not tell us the exact
relationship between two variables, but it indicates whether they are correlated or not.
Let (Xi, Yi), i = 1, 2, ..., n be the bivariate distribution. If the values of the dependent variable Y
are plotted against the corresponding values of the independent variable X in the XY plane, the resulting
diagram of dots is called a scatter diagram or dot diagram. It is to be noted that a scatter diagram
is not suitable for a very large number of observations.
Interpretation from Scatter Diagram
If dots are in the shape of a line and line rises from left bottom to the right top (Fig.1), then
correlation is said to be perfect positive.
56. If dots in the scatter diagram are in the shape of a line and line moves from left top to
right bottom (Fig. 2), then correlation is perfect negative.
If dots show some trend and trend is upward rising from left bottom to right top (Fig.3)
correlation is positive.
57. If dots show some trend and trend is downward from left top to the right bottom (Fig.4) correlation is said to be negative.
If dots of scatter diagram do not show any trend (Fig. 5) there is no correlation between the variables.
58. COEFFICIENT OF CORRELATION
A scatter diagram tells us whether variables are correlated or not, but it does not indicate the extent
to which they are correlated. The coefficient of correlation gives an exact idea of the extent to
which they are correlated.
If X and Y are two random variables, then the correlation coefficient between X and Y is denoted by
r and defined as
r = Cov(X, Y) / (σX σY)
The coefficient of correlation measures the intensity or degree of linear relationship between two
variables. It was given by the British biometrician Karl Pearson (1857-1936).
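A sketch of the defining formula r = Cov(X, Y) / (σX σY), computed on assumed paired observations:

```python
import math

# Pearson's correlation coefficient from the definition
# r = cov(x, y) / (sd(x) * sd(y)); x and y are assumed sample data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
r = cov / (sx * sy)
print(round(r, 4))  # 0.7746
```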
59. Assumptions for Correlation Coefficient
1. Assumption of Linearity: The variables used to compute the correlation coefficient must be linearly
related. The linearity of the variables can be checked through a scatter diagram.
2. Assumption of Normality: Both variables under study should follow a normal distribution. They
should not be skewed in either the positive or the negative direction.
3. Assumption of Cause and Effect Relationship: There should be a cause and effect relationship
between the two variables, for example, heights and weights of children, demand and supply
of goods, etc. When there is no cause and effect relationship between the variables, the
correlation coefficient should be zero. If it is non-zero, then the correlation is termed a chance
correlation or spurious correlation.
For example, correlation coefficient between:
1. Weight and income of a person over periods of time; and
2. Rainfall and literacy in a state over periods of time.
60. LINEAR REGRESSION
Prediction or estimation is one of the major problems in most human activities. Predictions
of the future production of a crop, consumption, the price of a good, sales, income,
profit, etc. are very important in the business world. Similarly, predictions of population,
consumption of agricultural products, rainfall, revenue, etc. are of great importance to the
government of any country for effective planning.
If two variables are correlated significantly, then it is possible to predict or estimate the values of
one variable from the other. This leads us to very important concept of regression analysis. In
fact, regression analysis is a statistical technique which is used to investigate the relationship
between variables. The effect of a price increase on demand, the effect of a change in the money
supply on the inflation rate, and the effect of a change in advertising expenditure on sales and
profit in business are examples where investigators or researchers try to establish cause
and effect relationships. To handle these types of situations, investigators collect data on
the variables of interest and apply regression methods to estimate the quantitative effect of the
causal variables upon the variable that they influence.
Regression analysis describes how the independent variable(s) is (are) related to the dependent
variable i.e. regression analysis measures the average relationship between independent
variables and dependent variable. The literal meaning of regression is “stepping back towards
the average”, a term first used by the British biometrician Sir Francis Galton (1822-1911)
in his study of the heights of parents and their offspring.
Regression analysis is a mathematical measure of the average relationship between two or more
variables.
61. Types of variables in regression analysis
Independent variable The variable which is used for prediction is called
independent variable. It is also known as regressor or predictor or explanatory
variable.
Dependent variable The variable whose value is predicted by the independent
variable is called dependent variable. It is also known as regressed or explained
variable.
If scatter diagram shows some relationship between independent variable X and
dependent variable Y, then the scatter diagram will be more or less concentrated
round a curve, which may be called the curve of regression.
When the curve is a straight line, it is known as line of regression and the regression is
said to be linear regression.
If the relationship between dependent and independent variables is not a straight line
but curve of any other type then regression is known as nonlinear regression.
Regression can also be classified according to number of variables being used. If only
two variables are being used this is considered as simple regression whereas the
involvement of more than two variables in regression is categorized as multiple
regression.
62. Formula of Linear Regression
The regression line of y on x is
y – ȳ = byx (x – x̄), where byx = r (σy / σx),
and the regression line of x on y is
x – x̄ = bxy (y – ȳ), where bxy = r (σx / σy).
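Both regression coefficients can be sketched on assumed paired data; note that byx · bxy = r², a useful check:

```python
# Regression coefficients from definitions:
#   b_yx = cov(x, y) / var(x)   (regression of y on x)
#   b_xy = cov(x, y) / var(y)   (regression of x on y)
x = [1, 2, 3, 4, 5]          # assumed independent variable
y = [2, 4, 5, 4, 5]          # assumed dependent variable
n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
b_yx = cov / (sum((a - mx) ** 2 for a in x) / n)
b_xy = cov / (sum((b - my) ** 2 for b in y) / n)
print(b_yx, b_xy)  # 0.6 1.0
```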
63. DISTINCTION BETWEEN CORRELATION AND
REGRESSION
Both correlation and regression have important role in relationship study but
there are some distinctions between them which can be described as follow:
(i) Correlation studies the linear relationship between two variables while
regression analysis is a mathematical measure of the average relationship
between two or more variables.
(ii) Correlation has limited application because it gives the strength of linear
relationship while the purpose of regression is to "predict" the value of the
dependent variable for the given values of one or more independent
variables.
(iii) Correlation makes no distinction between independent and dependent
variables while linear regression does it, i.e. correlation does not consider
the concept of dependent and independent variables while in regression
analysis one variable is considered as dependent variable and other(s) is/are
as independent variable(s).
64. CONCEPT OF HYPOTHESIS TESTING
In our day-to-day life, we see different commercial advertisements on television, in
newspapers, magazines, etc., such as
(i) The refrigerator of certain brand saves up to 20% electric bill,
(ii)The motorcycle of certain brand gives 60 km/liter mileage,
(iii)A detergent of certain brand produces the cleanest wash,
(iv)Ninety nine out of hundred dentists recommend brand A toothpaste for their
patients to save the teeth against cavity, etc.
Now, the question may arise in our mind “can such types of claims be verified
statistically?” Fortunately, in many cases the answer is “yes”.
The technique of testing such claims or statements or assumptions is known as
testing of hypothesis. The truth or falsity of a claim or statement is never known
unless we examine the entire population. But practically this is not possible in most
situations, so we take a random sample from the population under study and use the
information contained in this sample to decide whether the claim is true or
false.
65. CONCEPT OF HYPOTHESIS TESTING CONTD.
In our day-to-day life, we see different commercial advertisements on television, in newspapers,
magazines, etc., and if someone is interested in testing such claims or statements, then we
come across the problem of testing of hypothesis.
For example, (i) a motorcycle customer wants to test whether the claim that a certain
brand of motorcycle gives an average mileage of 60 km/litre is true or false;
(ii) a banana trader wants to test whether the average weight of a banana from Kerala is more
than 200 gm;
(iii) a doctor wants to test whether a new medicine is really more effective for controlling blood pressure
than the old medicine;
(iv) an economist wants to test whether the variability in incomes differs between two populations;
(v) a psychologist wants to test whether the proportion of literates in two groups of people is the
same, etc.
In all the cases discussed above, the decision maker is interested in making inference about the
population parameter(s). However, he/she is not interested in estimating the value of parameter(s)
but he/she is interested in testing a claim or statement or assumption about the value of population
parameter(s). Such claim or statement is postulated in terms of hypothesis.
In statistics, a hypothesis is a statement or a claim or an assumption about the value of a population
parameter (e.g., mean, median, variance, proportion, etc.).
Similarly, in case of two or more populations a hypothesis is comparative statement or a claim or an
assumption about the values of population parameters. (e.g., means of two populations are equal,
variance of one population is greater than other, etc.). The plural of hypothesis is hypotheses.
66. GENERAL PROCEDURE OF TESTING A HYPOTHESIS
Testing of hypothesis is a statistical tool in great demand across many disciplines and professions. It
is a step-by-step procedure, as you will see in the next three units through a large number of
examples. The aim of this section is just to give you a flavour of that sequence, which involves the
following steps:
Step I: First of all, we have to set up the null hypothesis H0 and the alternative hypothesis H1. Suppose
we want to test the hypothesised / claimed / assumed value θ0 of a parameter θ. Then we can take
the null and alternative hypotheses as
H0: θ = θ0 against H1: θ ≠ θ0 (two-tailed), or against H1: θ > θ0 or H1: θ < θ0 (one-tailed).
Step II: After setting the null and alternative hypotheses, we establish a criterion for rejection or
non-rejection of the null hypothesis, that is, we decide the level of significance (α) at which we want
to test our hypothesis. Generally, it is taken as 5% or 1% (α = 0.05 or 0.01).
Case I: If the alternative hypothesis is right-sided such as H1: θ > θ0 or H1: θ1 > θ2 then
the entire critical or rejection region of size α lies on right tail of the probability curve of
sampling distribution of the test statistic as shown
67. Case II: If the alternative hypothesis is left-sided such as H1: θ < θ0 or H1: θ1 < θ2 then the entire critical or rejection
region of size α lies on left tail of the probability curve of sampling distribution of the test statistic as shown
Case III: If the alternative hypothesis is two sided such as H1: θ ≠ θ0 or H1: θ1 ≠ θ2 then critical or rejection regions
of size α/2 lies on both tails of the probability curve of sampling distribution of the test statistic as shown
68. GENERAL PROCEDURE OF TESTING A HYPOTHESIS(3)
Step III: The third step is to choose an appropriate test statistic under H0 for testing the null
hypothesis.
After that, specify the sampling distribution of the test statistic, preferably in a standard form like Z
(standard normal), chi-square, t, F or any other distribution well known in the literature.
Step IV: Calculate the value of the test statistic described in Step III on the basis of observed sample
observations.
Step V: Obtain the critical (or cut-off) value(s) in the sampling distribution of the test statistic and construct
the rejection (critical) region of size α. Generally, critical values for various levels of significance are
presented in the form of tables for the standard sampling distributions of test statistics, such as the
Z-table, the chi-square table, the t-table, etc.
Step VI: After that, compare the calculated value of the test statistic obtained in Step IV with the critical
value(s) obtained in Step V and locate the position of the calculated test statistic, that is, determine
whether it lies in the rejection region or the non-rejection region.
Step VII: In testing of hypothesis, we ultimately have to reach a conclusion. It is done as explained below:
(i) If the calculated value of the test statistic lies in the rejection region at the chosen level of significance,
then we reject the null hypothesis. It means that the sample data provide sufficient evidence against the
null hypothesis and there is a significant difference between the hypothesised value and the observed
value of the parameter.
(ii) If the calculated value of the test statistic lies in the non-rejection region at the chosen level of
significance, then we do not reject the null hypothesis. It means that the sample data fail to provide
sufficient evidence against the null hypothesis, and the difference between the hypothesised value and the
observed value of the parameter is due to sampling fluctuation.
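The steps above can be sketched for a one-sample z-test of the motorcycle mileage claim; all numbers below are assumed purely for illustration:

```python
import math

# H0: mu = 60 km/l vs H1: mu != 60 (two-tailed), alpha = 0.05.
# Assumed: n = 36 motorcycles, sample mean 58.5 km/l, known sigma = 4.
mu0, xbar, sigma, n = 60, 58.5, 4, 36
z = (xbar - mu0) / (sigma / math.sqrt(n))   # test statistic
critical = 1.96                             # two-tailed cut-off at alpha = 0.05
reject_h0 = abs(z) > critical
print(round(z, 2), reject_h0)  # -2.25 True
```

Here the calculated statistic falls in the rejection region, so H0 would be rejected at the 5% level.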
69. TYPE-I AND TYPE-II ERRORS
Type I Errors
• A Type I error occurs when the sample data appear to show a treatment effect
when, in fact, there is none.
• In this case the researcher will reject the null hypothesis and falsely conclude that
the treatment has an effect.
• Type I errors are caused by unusual, unrepresentative samples. Just by chance the
researcher selects an extreme sample with the result that the sample falls in the
critical region even though the treatment has no effect.
• The hypothesis test is structured so that Type I errors are very unlikely; specifically,
the probability of a Type I error is equal to the alpha level.
Type II Errors
• A Type II error occurs when the sample does not appear to have been affected by
the treatment when, in fact, the treatment does have an effect.
• In this case, the researcher will fail to reject the null hypothesis and falsely conclude
that the treatment does not have an effect.
• Type II errors are commonly the result of a very small treatment effect. Although
the treatment does have an effect, it is not large enough to show up in the research
study.
71. Difference between Statistic and Parameter
Statistic
A statistic is a measure which describes a fraction (sample) of the population.
Numerical value: variable and known.
Statistical notation:
s = sample standard deviation
x = data elements
n = size of sample
r = correlation coefficient
Parameter
A parameter is a measure which describes the entire population.
Numerical value: fixed and unknown.
Statistical notation:
μ = population mean
σ = population standard deviation
P = population proportion
X = data elements
N = size of population
ρ = correlation coefficient
72. Parametric Statistical Tests
Parametric statistics is a branch of statistics which assumes that sample data come from
a population that follows a probability distribution, typically the normal distribution. When the assumptions
are correct, parametric methods produce more accurate and precise estimates.
Assumptions
The scores must be independent (in other words, the selection of any particular score
must not bias the chance of any other case being included).
The observations must be drawn from normally distributed populations.
The selected population is representative of the general population.
The data are on an interval or ratio scale.
The populations (if comparing two or more groups) must have the same variances.
Types of Parametric test
1. Z- test.
2. T-test.
3. ANOVA.
4. F-test.
5. Chi-Square test.
73. Z-test
The Z-test was given by Fisher. A Z-test is a type of hypothesis test, or
statistical test.
It is used for testing the mean of a population against a standard, or for
comparing the means of two populations, with large samples (n > 30).
When we can run a Z-test
Your sample size is greater than 30.
Data points should be independent of each other.
Your data should be randomly selected from a population, where each
item has an equal chance of being selected.
Data should follow a normal distribution.
The standard deviation of the population is known.
There are two types of z-test:
a. One-sample z-test.
b. Two-sample z-test.
74. One-sample z-test
In a one-sample z-test we compare the mean calculated from a single sample of
scores with a hypothesized population mean, when the population standard deviation is known.
Ex.: The manager of a candy manufacturer wants to know whether
the mean weight of a batch of candy boxes is equal to the target value
of 10 pounds known from historical data.
75. Two-sample z-test
When testing for differences between two groups, we can imagine two separate
situations, e.g. comparing the means or the proportions of two populations. In a two-sample
z-test, both populations are independent.
Ex: 1. Comparing the average engineering salaries of men versus women.
2. Comparing the fraction of defectives from two production lines.
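A minimal sketch of a two-sample z-test for proportions, in the spirit of the fraction-defectives example; the defect counts below are hypothetical values assumed for illustration:

```python
import math

# Hypothetical defect counts: line A has 20 defectives out of 200 items,
# line B has 10 defectives out of 200 items.
x1, n1 = 20, 200
x2, n2 = 10, 200

p1, p2 = x1 / n1, x2 / n2          # sample proportions: 0.10 and 0.05
p_pool = (x1 + x2) / (n1 + n2)     # pooled proportion under H0: p1 = p2
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
print(round(z, 2))   # |z| < 1.96, so not significant at the 5% level
```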
76. T-test
It was derived by W. S. Gosset in 1908, who published under the pen name
“Student”; hence it is also called Student’s t-test. A t-test indicates whether or not the difference
between two group means is statistically significant.
Assumptions:
Samples must be random and independent.
Samples are small (n < 30).
The population standard deviation is not known.
The population is normally distributed.
There are two ways to calculate the t-test:
a. Unpaired t-test (independent samples).
b. Paired t-test.
77. Unpaired t-test:
If there is no link between the data, use the unpaired t-test: two separate
sets of independent samples are obtained, one from each of the two populations being
compared.
Ex: 1. Comparing the heights of girls and boys.
2. Comparing two stress-reduction interventions,
where one group practiced mindfulness meditation while the other learned
yoga.
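A minimal sketch of the unpaired (pooled-variance) t statistic; the two small samples are hypothetical values assumed for illustration:

```python
import math
import statistics

# Hypothetical scores for two independent groups
group_a = [5, 7, 9]
group_b = [2, 4, 6]

n1, n2 = len(group_a), len(group_b)
m1, m2 = statistics.mean(group_a), statistics.mean(group_b)
v1, v2 = statistics.variance(group_a), statistics.variance(group_b)

# pooled sample variance with df = n1 + n2 - 2
sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
df = n1 + n2 - 2
print(round(t, 2), df)   # compare |t| with the t-table value for df = 4
```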
78. Paired t-test
A paired t-test consists of a sample of matched pairs of similar units, or one group of units
that has been tested twice (a “repeated measures” t-test). If there is some link
between the data (e.g. before and after), use the paired t-test.
Ex: 1. Subjects are tested prior to a treatment, say for high blood pressure, and the
same subjects are tested again after treatment with a blood-pressure-lowering
medication.
2. A test on a person or group before and after training.
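The paired t statistic can be sketched as follows; the before/after blood-pressure readings are hypothetical values assumed for illustration:

```python
import math
import statistics

# Hypothetical readings before and after a blood-pressure treatment
before = [120, 130, 125, 140, 135]
after  = [115, 126, 121, 135, 130]

diffs = [b - a for b, a in zip(before, after)]   # [5, 4, 4, 5, 5]
n = len(diffs)
d_bar = statistics.mean(diffs)    # mean difference = 4.6
s_d = statistics.stdev(diffs)     # sample SD of the differences
t = d_bar / (s_d / math.sqrt(n))
print(round(t, 1))   # compare with the t-table value for df = n - 1 = 4
```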
79. ANOVA (Analysis of Variance)
It was developed by Fisher in the 1920s. ANOVA is a collection of statistical models used to analyze the
differences between group means; it compares multiple groups at one time. It is an advanced technique
for testing differences among all of the means of experimental treatments, which is not possible
with a t-test.
Assumptions:
All populations have the same standard deviation.
Individuals in the population are selected randomly.
Samples are independent.
The populations must be normally distributed.
There are two ways to calculate ANOVA:
One-way ANOVA: a one-way ANOVA compares three or more unmatched groups when the data are
categorized in one way.
Ex: You might study the effect of tea on weight loss across three groups: green tea, black
tea, no tea.
Two-way ANOVA
The two-way ANOVA technique is used when the data are classified on the basis of two
factors; it analyzes two independent variables and one dependent
variable.
Ex: Agricultural output may be classified on the basis of different varieties
of seed and also on the basis of different varieties of fertilizer used.
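A one-way ANOVA for the tea example can be computed by hand as a sketch; the weight-loss scores are hypothetical values assumed for illustration:

```python
import statistics

# Hypothetical weight-loss scores for three tea groups
groups = {
    "green tea": [1, 2, 3],
    "black tea": [2, 3, 4],
    "no tea":    [4, 5, 6],
}

all_scores = [x for g in groups.values() for x in g]
grand_mean = statistics.mean(all_scores)
k, n_total = len(groups), len(all_scores)

# between-groups and within-groups sums of squares
ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                 for g in groups.values())
ss_within = sum((x - statistics.mean(g)) ** 2
                for g in groups.values() for x in g)

ms_between = ss_between / (k - 1)       # df_between = k - 1 = 2
ms_within = ss_within / (n_total - k)   # df_within = N - k = 6
f_stat = ms_between / ms_within
print(round(f_stat, 2))   # compare with the F-table value for (2, 6) df
```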
80. Chi-Square test
It is a test that measures how expectations compare to
actual observed data. It is used to investigate
whether the distributions of categorical variables differ
from one another.
It was developed by Karl Pearson. The chi-square test is also used as a
parametric test for comparing variances.
It is denoted by χ².
Formula: χ² = Σ (Oᵢ − Eᵢ)² / Eᵢ
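As a sketch of the variance use mentioned above, the chi-square statistic for a single variance is χ² = (n − 1)s²/σ₀²; the sample values below are hypothetical assumptions for illustration:

```python
# Hypothetical data: test whether a process variance equals sigma0^2 = 4,
# given a sample of n = 10 with sample variance s^2 = 8.
n = 10
s2 = 8.0         # sample variance
sigma0_2 = 4.0   # hypothesized population variance

chi2 = (n - 1) * s2 / sigma0_2
df = n - 1
print(chi2, df)  # compare with the chi-square table value for df = 9
```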
81. Non-parametric statistical tests
Non-parametric statistics is a branch of statistics. It refers to statistical methods in
which the data are not required to fit a normal distribution. Non-parametric statistics often uses ordinal
data, meaning it relies not on the numbers themselves but on a ranking or order of sorts.
For example: a survey conveying consumer preferences ranging from “like” to “dislike” would be
considered ordinal data.
Non-parametric statistics does not assume that data are drawn from a normal distribution.
Instead, the shape of the distribution is estimated from the data under this form of statistical measurement,
covering descriptive statistics, statistical tests, inferential statistics and models. No assumption about
sample size is made.
This type of statistics can be used without the mean, sample size, standard deviation or
estimation of any other parameters.
Non-parametric tests are called “distribution-free” tests since they make no assumptions
regarding the population distribution. These tests may be applied to ranked data. They are easier to
explain and easier to understand, but one should not forget that they are usually less
efficient/powerful, as they are based on fewer assumptions. A non-parametric test is always valid, but
not always efficient.
Types of Non-parametric statistics test
Rank sum test
Chi-square test
Spearman’s rank correlation
82. Rank sum test
The rank sum tests are:
U test (Wilcoxon-Mann-Whitney test)
H test (Kruskal-Wallis test)
U test: It is a non-parametric test. This test
determines whether two independent samples have
been drawn from the same population. It requires data that
can be ranked, i.e., ordered from lowest to highest
(ordinal data).
83. U test
For example, the values of one sample are 53, 38, 69, 57, 46 and the values of another sample are 44, 40, 61, 53, 32. We assign ranks to all observations, adopting a low-to-high ranking process as if the items belonged to a single sample.
Value (ascending order) Rank
32 1
38 2
40 3
44 4
46 5
53 6.5
53 6.5
57 8
61 9
69 10
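The ranking above, together with the rank sums, gives the U statistic; a sketch that reproduces the slide's numbers (the tie at 53 receives the average rank 6.5):

```python
# The two samples from the slide
sample1 = [53, 38, 69, 57, 46]
sample2 = [44, 40, 61, 53, 32]

pooled = sorted(sample1 + sample2)
# average rank for each value (handles the tie at 53 -> rank 6.5)
rank = {v: sum(i + 1 for i, x in enumerate(pooled) if x == v) / pooled.count(v)
        for v in set(pooled)}

r1 = sum(rank[v] for v in sample1)   # rank sum of sample 1 = 31.5
n1, n2 = len(sample1), len(sample2)
u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
u2 = n1 * n2 - u1
u = min(u1, u2)                      # the U statistic
print(r1, u)
```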
84. Kruskal-Wallis H test
H test: The Kruskal-Wallis H test (also called the “one-
way ANOVA on ranks”) is a rank-based non-parametric
test that can be used to determine whether there are statistically
significant differences between two or more groups of an
independent variable on a continuous or ordinal dependent
variable.
For example: an H test to understand whether exam performance,
measured on a continuous scale from 0-100, differs based
on test anxiety level (i.e., the dependent variable would be
“exam performance” and the independent variable would be
“test anxiety level”, which has three independent groups:
students with “low”, “medium” and “high” test anxiety
levels).
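A sketch of the H statistic for the exam-anxiety example; the three groups of scores are hypothetical values assumed for illustration:

```python
# Hypothetical exam scores for three test-anxiety groups;
# H = 12 / (N (N + 1)) * sum(R_j^2 / n_j) - 3 (N + 1)
groups = {
    "low":    [85, 90, 95],
    "medium": [70, 75, 80],
    "high":   [55, 60, 65],
}

pooled = sorted(x for g in groups.values() for x in g)
rank = {v: pooled.index(v) + 1 for v in pooled}   # no ties in this data
n_total = len(pooled)

h = (12 / (n_total * (n_total + 1))
     * sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups.values())
     - 3 * (n_total + 1))
print(round(h, 1))   # compare with the chi-square table for k - 1 = 2 df
```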
85. Chi-square test
The chi-square test is also used as a non-parametric test, mainly when
dealing with nominal variables. The chi-square test has two main methods.
Goodness of fit: goodness of fit refers to whether a significant
difference exists between an observed number and an expected number
of responses, people or other objects.
For example: suppose that we flip a coin 20 times and record the
frequency of occurrence of heads and tails. Then we should expect 10
heads and 10 tails.
Suppose our coin-flipping experiment yielded 12 heads and 8 tails:
our expected frequencies are (10, 10) and our observed frequencies are (12, 8).
Independence: the test of independence examines differences between the
frequencies of occurrence in two or more categories across two or more
groups.
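The coin example can be worked through directly with the chi-square formula:

```python
# Goodness-of-fit for the coin example: observed (12, 8) heads and tails
# against expected (10, 10), using chi^2 = sum((O - E)^2 / E)
observed = [12, 8]
expected = [10, 10]

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi2)   # 0.8, well below the 5% table value of 3.841 for df = 1,
              # so the coin shows no significant departure from fairness
```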
86. Spearman’s rank correlation test - This method gives a measure of association
that is based on the ranks of the observations and not on the numerical values of the
data. It was developed by Charles Spearman in the early 1900s, and as such it is
also known as Spearman’s rank correlation coefficient.
English (marks) Maths (marks) Rank (English) Rank (Maths) Difference of ranks
56 66 9 4 5
75 70 3 2 1
45 40 10 10 0
71 60 4 7 3
62 65 6 5 1
64 56 5 9 4
58 59 8 8 0
80 77 1 1 0
76 67 2 3 1
61 63 7 6 1
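Using the ranks from the marks table above, Spearman's coefficient follows from rₛ = 1 − 6Σd²/(n(n² − 1)):

```python
# Ranks from the English/Maths marks table
rank_english = [9, 3, 10, 4, 6, 5, 8, 1, 2, 7]
rank_maths   = [4, 2, 10, 7, 5, 9, 8, 1, 3, 6]

n = len(rank_english)
d_squared = sum((re - rm) ** 2 for re, rm in zip(rank_english, rank_maths))
r_s = 1 - 6 * d_squared / (n * (n ** 2 - 1))
print(d_squared, round(r_s, 2))   # sum(d^2) = 54, r_s is about 0.67
```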
87. PROBABILITY
In our daily lives, we face many situations when we are unable to forecast the future with complete certainty. That is,
in many decisions, the uncertainty is faced. Need to cope up with the uncertainty leads to the study and use of the
probability theory. The first attempt to give quantitative measure of probability was made by Galileo (1564-1642),
an
Italian mathematician, when he was answering the following question on the request of his patron, the Grand Duke
of
Tuscany, who wanted to improve his performance at the gambling tables: “With three dice a total of 9 and 10 can
each be produced by six different combinations, and yet experience shows that the number 10 is oftener thrown than
the number 9?” To the mind of his patron the cases were (1, 2, 6), (1, 3, 5), (1, 4, 4), (2, 2, 5), (2, 3, 4), (3,3, 3) for 9
and (1, 3, 6), (1, 4, 5), (2, 2, 6), (2, 3, 5), (2, 4, 4), (3, 3, 4) for 10 and hence he was thinking that why they do not
occur equally frequently, i.e., why are their chances not the same? Galileo made a careful analysis of all the cases
which can occur, and he showed that out of the 216 possible cases 27 are favourable to the appearance of the number
10 since permutations of (1, 3, 6) are (1, 3, 6), (1, 6, 3), (3, 1, 6), (3, 6, 1), (6, 1, 3), (6, 3, 1) i.e. number of
permutations of (1, 3, 6) is 6; similarly, the number of permutations of (1, 4, 5), (2, 2, 6), (2, 3, 5), (2, 4, 4), (3, 3, 4)
is 6, 3, 6, 3, 3 respectively and hence the total number of cases come out to be 6 + 6 + 3 + 6 + 3 + 3 = 27 whereas the
number of favourable cases for getting a total of 9 on three dice is 6 + 6 + 3 + 3 + 6 + 1 = 25. This is why
10 is thrown more often than 9. But the first foundation was laid by the two mathematicians Pascal
(1623-62) and Fermat (1601-65) due to a gambler's dispute in 1654 which led to the creation of a mathematical
theory of probability by them. Later, important contributions were made by various researchers including Huyghens
(1629 - 1695), Jacob Bernoulli (1654-1705), Laplace (1749-1827), Abraham De Moivre (1667-1754), and Markov
(1856-1922). Thomas Bayes (died in 1761, at the age of 59) gave an important technical result known as Bayes’
theorem, published after his death in 1763, using which probabilities can be revised on the basis of some new
information. Thereafter, the probability, an important branch of Statistics, is being used worldwide.
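Galileo's enumeration of the 216 cases can be reproduced as a short sketch:

```python
from itertools import product

# All 6^3 = 216 equally likely outcomes of three dice, counting how many
# give a total of 9 and how many give a total of 10
totals = [a + b + c for a, b, c in product(range(1, 7), repeat=3)]

ways_9 = totals.count(9)
ways_10 = totals.count(10)
print(ways_9, ways_10)   # 25 ways for 9, 27 ways for 10,
                         # so 10 is thrown more often than 9
```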
104. Probability Distribution
There are two types of probability distribution:
1) Discrete probability distribution - the set of all
possible values is at most a finite or a countably
infinite number of possible values
Poisson distribution
Binomial distribution
2) Continuous probability distribution - takes on
values at every point over a given interval
Normal (Gaussian) distribution
105. Normal (Gaussian) Distribution
• The normal distribution is a descriptive model
that describes real world situations.
• It is defined as a continuous frequency distribution of infinite range (can take any
values not just integers as in the case of binomial and Poisson distribution).
• This is the most important probability distribution in statistics and an important tool
in the analysis of epidemiological data and management science.
Characteristics of Normal Distribution
• It links frequency distribution to probability distribution
• Has a Bell Shape Curve and is Symmetric
• It is Symmetric around the mean:
Two halves of the curve are the same (mirror images)
• Hence Mean = Median
• The total area under the curve is 1 (or 100%)
• Normal Distribution has the same shape as Standard Normal Distribution.
• In a Standard Normal Distribution:
The mean (μ ) = 0 and
Standard deviation (σ) =1
106. Normal (Gaussian) Distribution (2)
Z Score (Standard Score)
• Z = (X − μ) / σ
• Z indicates how many standard deviations away
from the mean the point X lies.
• The Z score is calculated to 2 decimal places.
Tables
Areas under the standard normal curve
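The table areas can be computed from the standard normal CDF, which has a closed form in terms of the error function; a sketch:

```python
import math

# Standard normal CDF: Phi(z) = (1 + erf(z / sqrt(2))) / 2.
# Z-tables are tabulations of this function.
def phi(z):
    """Area under the standard normal curve to the left of z."""
    return (1 + math.erf(z / math.sqrt(2))) / 2

# area within one standard deviation of the mean
within_1sd = phi(1) - phi(-1)
print(round(phi(1.96), 3), round(within_1sd, 3))   # 0.975 and 0.683
```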
108. Normal (Gaussian) Distribution (4)
Distinguishing Features
• The mean ± 1 standard deviation covers about 68.3% of the area under the
curve
• The mean ± 2 standard deviations covers about 95.4% of the area under the
curve
• The mean ± 3 standard deviations covers about 99.7% of the area under the
curve
Application/Uses of Normal Distribution
• Its application goes beyond describing distributions
• It is used by researchers and modelers.
• The major use of normal distribution is the role it plays in
statistical inference.
• The z score along with the t –score, chi-square and F-statistics is
important in hypothesis testing.
• It helps managers/management make decisions.
109. Binomial Distribution
A widely known discrete distribution constructed by determining the probabilities of X
successes in n trials.
Assumptions of the Binomial Distribution
• The experiment involves n identical trials
• Each trial has only two possible outcomes: success and failure
• Each trial is independent of the previous trials
• The terms p and q remain constant throughout the experiment
• p is the probability of a success on any one trial
• q = (1-p) is the probability of a failure on any one trial
• In the n trials X is the number of successes possible where X is a whole number
between 0 and n.
• Applications
• Sampling with replacement
• Sampling without replacement causes p to change but if the sample size n < 5%
N, the independence assumption is not a great concern.
110. Binomial Distribution Formula
• Probability function:
P(X) = [n! / (X! (n − X)!)] · p^X · q^(n−X), for 0 ≤ X ≤ n
• Mean value:
μ = n·p
• Variance and standard deviation:
σ² = n·p·q
σ = √(n·p·q)
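The formulas above can be sketched directly; the example of n = 10 trials with p = 0.5 is an assumption for illustration:

```python
import math

# Binomial probability function: P(X = x) = C(n, x) * p^x * q^(n - x)
def binom_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.5
mean = n * p                # mu = n p = 5.0
variance = n * p * (1 - p)  # sigma^2 = n p q = 2.5
print(mean, variance, round(binom_pmf(5, n, p), 3))   # P(X = 5) is about 0.246
```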
111. Poisson Distribution
The French mathematician Siméon Denis Poisson proposed the Poisson
distribution. The Poisson distribution is popular for modelling
the number of times an event occurs in an interval of time or space. It
is a discrete probability distribution that expresses the probability of a
given number of events occurring in a fixed interval of time or space
if these events occur with a known constant rate and independently of
the time since the last event.
The Poisson distribution may be useful to model events such as
• The number of meteorites greater than 1 meter diameter that strike
Earth in a year
• The number of patients arriving in an emergency room between 10
and 11 pm
• The number of photons hitting a detector in a particular time interval
• The number of mistakes committed per page
112. Poisson Distribution
Assumptions of the Poisson Distribution
• Describes discrete occurrences over a continuum or
interval
• A discrete distribution
• Describes rare events
• Each occurrence is independent of any other
occurrence.
• The number of occurrences in each interval can vary
from zero to infinity.
• The expected number of occurrences must hold
constant throughout the experiment.
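A sketch of the Poisson probability function, using the mistakes-per-page example with an assumed rate of λ = 3 mistakes per page:

```python
import math

# Poisson probability function: P(X = k) = e^(-lam) * lam^k / k!
def poisson_pmf(k, lam):
    """Probability of exactly k occurrences at constant rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 3                   # assumed average mistakes per page
p2 = poisson_pmf(2, lam)  # probability of exactly 2 mistakes on a page
print(round(p2, 3))       # about 0.224
```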
This is the diagram of a normal distribution curve, or z distribution. Note the bell shape of the curve and that its tails do not touch the horizontal axis. As mentioned earlier, the area under the curve equals 1, or 100%; therefore each half of the distribution, from the center (that is, from the mean), is equal to 50%. Thus, the area from the mean up to +1 standard deviation is about 34.1%, the area between +1 and +2 standard deviations is about 13.6%, the area between +2 and +3 standard deviations is about 2.1%, and the area above +3 standard deviations is about 0.1%. Since the other half is a mirror image, the proportion of area between the mean and −1 standard deviation is the same as between the mean and +1 standard deviation, i.e., 34.1%; similarly for −2 and +2 standard deviations, and so forth.