Statistical Analysis and Hypothesis Testing

Chapter 2
Statistical Data Analysis

Introduction
● Data science is an interdisciplinary field that requires a strong understanding of mathematics, statistical reasoning, and computer science.
● Statistics is the science of collecting, analyzing, and interpreting data.
● The data is usually numerical and available in large quantities.
● Statistics serves as a foundation for working with data and its analysis in data science.
● It provides tools and methods to find structure in data and to give deeper insight into it.
● Data scientists use a combination of statistical formulae and computer algorithms to notice patterns and trends within data.

Steps for processing data
1. Identify the important features in the data
2. Find relationships between features
3. Convert the features into the required format
4. Normalize and scale the data
5. Identify the distribution and nature of the data
6. Perform adjustments to the data
7. Identify the right mathematical approach
8. Verify the results using different accuracy measures

Roles of statistics in Data Science
Data Exploration
Data Cleaning
Data Transformation
Data Visualization
Finding Similarity/Dissimilarity
Model Selection and Evaluation
Hypothesis Testing
Statistical Modeling
Probability Distribution and Estimation

Types of Statistics
● Descriptive Statistics
○ Measures of Frequency
○ Measures of Central Tendency
○ Measures of Dispersion
● Inferential Statistics
○ Parameter Estimation
○ Hypothesis Testing

Descriptive Statistics
● Descriptive statistics provides ways of describing, presenting, summarizing, and organizing data.
● It summarizes a large amount of data and presents it in a simple and understandable form.
● The summary is computed from a sample of the population using measures such as the mean, median, and standard deviation.

Types of Descriptive Statistics
● Measures of Frequency
● Measures of Central Tendency
○ Mean
○ Mode
○ Median
● Measures of Dispersion
○ Range
○ Interquartile Range
○ Standard Deviation

Measures of Frequency
● Frequency is a basic statistical quantity in data science.
● It is the number of times a value occurs in the data.
● In a dataset, it measures how often a particular value of a feature occurs.
● The frequency distribution can be tabulated as a frequency table or chart.
Example: Twenty students were asked how many hours they worked per day. Their responses, in hours, are as follows: 5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3.

DATA VALUE    FREQUENCY
2             3
3             5
4             3
5             6
6             2
7             1
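The frequency table above can be reproduced in a few lines of Python. This is a minimal sketch that assumes the twenty responses are single-digit hours (2 to 7), as the table implies; the variable names are illustrative only.

```python
from collections import Counter

# Hours worked per day reported by the twenty students.
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(hours)

print("DATA VALUE  FREQUENCY")
for value in sorted(frequency):
    print(f"{value:<11} {frequency[value]}")
```

The counts add up to 20, the number of students, which is a quick sanity check on any frequency table.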

Measures of Central Tendency
● An important goal of statistical analysis is to find one value that describes the characteristics of the entire set of data.
● This single value is referred to as a measure of central tendency: it describes a whole set of data with a single value that represents the center of its distribution.
● Measures of central tendency are also known as summary statistics and are used to represent the center point of the data.

Mean
● The mean is the most common and effective numeric measure of the center of a set of data.
● It is the sum of all the observations divided by the sample size.
● The types of mean are:
○ Arithmetic mean
○ Harmonic mean
○ Geometric mean

Arithmetic mean
● It is obtained by adding all the values and then dividing the sum by the total number of values.
● Let x1, x2, x3, ..., xN be a set of N values or observations. The arithmetic mean of this set of values is:
$\bar{x} = \frac{x_1 + x_2 + \dots + x_N}{N} = \frac{1}{N} \sum_{i=1}^{N} x_i$

● Suppose the marks obtained by 10 students in a quiz are 8, 3, 7, 6, 9, 10, 5, 7, 8, 5.
● The arithmetic mean is
(8 + 3 + 7 + 6 + 9 + 10 + 5 + 7 + 8 + 5) / 10 = 68 / 10 = 6.8
● The arithmetic mean can be calculated using the mean() function from the NumPy library.
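As a short illustration of the NumPy call mentioned above, the following sketch computes the mean of the ten quiz marks; the variable names are illustrative only.

```python
import numpy as np

# Quiz marks of the 10 students from the example.
marks = np.array([8, 3, 7, 6, 9, 10, 5, 7, 8, 5])

# np.mean adds all values and divides by the number of values.
mean_marks = np.mean(marks)
print("Arithmetic mean:", mean_marks)
```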

Harmonic Mean
● The harmonic mean is the reciprocal of the average of the reciprocals of the values in a series. For n values it is
$HM = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \frac{1}{x_3} + \dots + \frac{1}{x_n}}$
● Example: x = (6, 3, 1, 5, 2)
● HM = ?

Geometric Mean
● A geometric mean is an average that shows the central tendency of a set of numbers by using the product of their values: it is the n-th root of the product of n values.
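The harmonic and geometric means can be checked with Python's standard statistics module. The sketch below reuses the series x = (6, 3, 1, 5, 2) from the harmonic mean example; applying the geometric mean to the same series is an assumption made here purely for illustration, and statistics.geometric_mean requires Python 3.8 or later.

```python
import statistics

x = [6, 3, 1, 5, 2]

# Harmonic mean from the formula: n divided by the sum of the reciprocals.
hm_manual = len(x) / sum(1 / v for v in x)
hm_library = statistics.harmonic_mean(x)

# Geometric mean: the n-th root of the product of the n values.
gm_library = statistics.geometric_mean(x)

print("Harmonic mean (formula):", round(hm_manual, 4))
print("Harmonic mean (library):", round(hm_library, 4))
print("Geometric mean:", round(gm_library, 4))
```

Both means require positive values; a zero or negative entry makes the reciprocal or the root undefined.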

Median
● It is the middle value of the sorted data.
● It is the value that separates the higher half of a data set from the lower half.
● It splits the data in half and is also called the 50th percentile.
● If the number of elements in the data set is odd, the middle element is the median.
● If the number of elements in the data set is even, the median is the average of the two central elements.
● Advantages
○ Less affected by outliers and skewed data than the mean
○ Appropriate for skewed data
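The odd and even cases described above can be sketched with statistics.median from the standard library. The two lists are reused from other examples in this chapter purely as sample input.

```python
import statistics

odd_data = [4, 9, 11, 12, 17, 5, 8, 12, 14]   # 9 values: the middle element is the median
even_data = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]   # 10 values: average of the two central elements

# statistics.median sorts the data internally before picking the middle.
median_odd = statistics.median(odd_data)
median_even = statistics.median(even_data)

print("Median (odd count):", median_odd)
print("Median (even count):", median_even)
```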

Mode
● It is the value that occurs most frequently in a dataset.
● Several different values can share the maximum frequency, which results in more than one mode.
● A dataset with one mode is called unimodal.
● A dataset with two modes is called bimodal.
● A dataset with three modes is called trimodal.

● Advantages
○ Can be used for categorical values
○ Determined for qualitative and quantitative values
○ Not affected by extreme values
● Disadvantages
○ Not based on all values
○ The mode cannot be clearly defined for a multimodal series
○ Not applicable for further statistical analysis and algebraic calculation
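A brief sketch of finding modes in Python follows. It uses statistics.multimode (Python 3.8+), which returns every value that shares the highest frequency; the list of colors is a hypothetical categorical example, not data from this chapter.

```python
import statistics

# Numeric data: hours worked per day from the frequency example (unimodal).
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

# Hypothetical categorical data with two equally frequent values (bimodal).
colors = ["red", "blue", "red", "green", "blue"]

# multimode returns a list of all modes, in order of first appearance.
hours_modes = statistics.multimode(hours)
color_modes = statistics.multimode(colors)

print("Modes of hours:", hours_modes)
print("Modes of colors:", color_modes)
```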

Measures of Dispersion
● Dispersion is the extent to which values in a distribution differ from the average of the distribution.
● Measures of central tendency alone are not sufficient to describe the data.
● Measures of dispersion help us to know the degree of variability in the data and provide a better understanding of it.
● They assess the dispersion, or spread, of numeric data.
● The measures are:
○ Range
○ Quantiles
○ Quartiles
○ Percentiles
○ Interquartile range

Range
● It is the simplest measure of dispersion. Let x1, x2, ..., xn be a set of observations for some numeric attribute X.
● The range of the set is the difference between the largest (max()) and the smallest (min()) values.
● Range = max − min

Standard Deviation
● It is a measure of how much the data values deviate from the mean value
● $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
● Exercise: Find the SD for 4, 9, 11, 12, 17, 5, 8, 12, 14.
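One way to check the exercise above is to apply the formula step by step. The sketch below assumes the population form of the standard deviation (dividing by n, as in the formula on this slide) and compares the result with statistics.pstdev; the variable names are illustrative.

```python
import math
import statistics

data = [4, 9, 11, 12, 17, 5, 8, 12, 14]
n = len(data)

# Step 1: the mean of the data.
mean = sum(data) / n

# Step 2: the average of the squared deviations from the mean (the variance).
variance = sum((x - mean) ** 2 for x in data) / n

# Step 3: the standard deviation is the square root of the variance.
sd = math.sqrt(variance)

print("Mean:", round(mean, 3))
print("Variance:", round(variance, 3))
print("Standard deviation:", round(sd, 3))
print("Library check (pstdev):", round(statistics.pstdev(data), 3))
```

If the sample form (dividing by n − 1) is wanted instead, statistics.stdev gives that value.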

Variance
● Variance measures how far a data set is spread out. It is mathematically defined as the average of the squared differences from the mean.
● $\text{Variance} = (\text{standard deviation})^2 = \sigma^2$

Interquartile Range
● The interquartile range (IQR) is a measure of variation that describes how spread out the data is.
● It is a measure of variability based on splitting the data into quartiles.
● Quartiles divide the ordered data into four equal parts, demarcated by the three quartiles Q1, Q2, and Q3.
● The interquartile range is the difference between the third and first quartiles: IQR = Q3 − Q1.
● Consider the following data:
2, 3, 4, 7, 10, 15, 22, 26, 27, 30, 32
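The quartiles of the data above can be computed with NumPy. This is a sketch using np.percentile with its default linear interpolation, which is an assumption: other textbook conventions (for example, taking the medians of the lower and upper halves) can give slightly different values for Q1 and Q3.

```python
import numpy as np

data = np.array([2, 3, 4, 7, 10, 15, 22, 26, 27, 30, 32])

# The 25th, 50th and 75th percentiles are the quartiles Q1, Q2 and Q3.
q1, q2, q3 = np.percentile(data, [25, 50, 75])

# The interquartile range is the spread of the middle half of the data.
iqr = q3 - q1

print("Q1 =", q1, "Q2 =", q2, "Q3 =", q3)
print("IQR =", iqr)
```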

Inferential Statistics
● Inferential statistics draws inferences and makes predictions about a population based on a sample of data chosen from that population.
● The sample is considered representative of the entire universe or population.
● Statistical inference mainly deals with two different kinds of problems:
○ Hypothesis testing
○ Estimation of parameter values

Hypothesis testing
● Hypothesis testing is mainly used to determine whether there is sufficient evidence in a data sample to conclude that a particular condition holds for an entire population.
● There are two hypotheses:
○ Null hypothesis
○ Alternative hypothesis
● The null hypothesis states that there is no difference between groups or no relationship between variables.
● The alternative hypothesis states that there is a difference between groups or a relationship between the variables being studied (for example, one variable has an effect on the other).

Steps for Hypothesis Testing
● State the null and alternative hypotheses
● Select an appropriate significance level and check the assumptions of the chosen test
● Analyze the data by computing the appropriate test statistic

Example of Hypothesis
● For example, suppose a biologist believes that a certain fertilizer will cause
plants to grow more during a one-month period than they normally do, which
is currently 20 inches. To test this, she applies the fertilizer to each of the
plants in her laboratory for one month.
● She then performs a hypothesis test using the following hypotheses:
● H0: μ = 20 inches (the fertilizer will have no effect on the mean plant growth)
● HA: μ > 20 inches (the fertilizer will cause mean plant growth to increase)

● For example, suppose a doctor believes that a new drug is able to reduce
blood pressure in obese patients. To test this, he may measure the blood
pressure of 40 patients before and after using the new drug for one month.
● He then performs a hypothesis test using the following hypotheses:
● H0: μafter = μbefore (the mean blood pressure is the same before and after using
the drug)
● HA: μafter < μbefore (the mean blood pressure is less after using the drug)

Parametric hypothesis tests
Information about the population distribution is assumed to be known and can be used for statistical inference.
Steps for a parametric test
Step 1: State the null and alternative hypotheses
Step 2: Choose the level of significance
Step 3: Identify the type of parametric test to be conducted
Step 4: Find the critical value that separates the acceptance and rejection regions
Step 5: Compute the test statistic from the sample
Step 6: Compare the computed value with the critical value to decide whether the null hypothesis is rejected or not

Terms related with Parametric test
1. Acceptance and critical regions:
The set of all possible values of the test statistic can be divided into two mutually exclusive groups:
● Acceptance region: the set of values that appear to be consistent with the null hypothesis
● Rejection (critical) region: the set of values that are unlikely to occur if the null hypothesis is true

One tailed test and Two tailed Test
If the claim is stated with an equal sign, the test is two-tailed.
If the claim is stated with a greater-than or less-than sign, the test is one-tailed.
Case 1: A government school states that the dropout rate of female students between ages 12 and 18 years is 28% (two-tailed).
Case 2: A government school states that the dropout rate of female students between ages 12 and 18 years is greater than 28% (one-tailed).
Case 3: A government school states that the dropout rate of female students between ages 12 and 18 years is less than 28% (one-tailed).

Significance Level
It is denoted by α.
It is the probability of rejecting the null hypothesis even though it is true.
For example, a significance level of 0.03 indicates a 3% risk of concluding that a difference exists when there is no actual difference.
Typical values of the significance level are 0.01, 0.05, and 0.1.

Calculated probability
The calculated probability (p-value) is the probability, when the null hypothesis is true, of obtaining a test statistic at least as extreme as the one actually observed.
Examples of one-sample parametric tests
Z-test
T-test
Chi-Square

Types of Hypothesis Testing
Hypothesis Test
● Parametric Test
○ One Sample: Z-Test, T-test, Chi-Square Test
○ Two Sample
   Independent Samples: Z-Test, Two group test
   Paired Samples: Paired-Test
● NonParametric Test
○ One Sample
○ Two Sample

Z -Test
● This test is used for comparing the mean of a sample to some hypothesized mean of a given population.
● The z statistic for one sample is
$z = \frac{\bar{X} - \mu_{H_0}}{\sigma_p / \sqrt{n}}$
where $\bar{X}$ is the sample mean, $\mu_{H_0}$ is the hypothesized population mean, $\sigma_p$ is the population standard deviation, and $n$ is the sample size.

Example
● A sample of 500 female students has a mean height of 5.4 feet. The task is to find whether it can reasonably be regarded as a sample from a large population with a mean height of 5.6 feet and a standard deviation of 1.45 feet. Use a 5% level of significance.
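The example above can be worked through with the standard library alone. The sketch below assumes a two-tailed test (H0: μ = 5.6 against HA: μ ≠ 5.6), since the question only asks whether the sample could come from that population; the variable names are illustrative.

```python
import math
from statistics import NormalDist

sample_mean = 5.4       # mean height of the 500 students, in feet
population_mean = 5.6   # hypothesized population mean, in feet
population_sd = 1.45    # population standard deviation, in feet
n = 500                 # sample size
alpha = 0.05            # level of significance

# One-sample z statistic: (sample mean - hypothesized mean) / standard error.
z = (sample_mean - population_mean) / (population_sd / math.sqrt(n))

# Two-tailed critical value and p-value from the standard normal distribution.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

# Reject H0 when the statistic falls in the rejection region.
reject_null = abs(z) > z_critical

print(f"z = {z:.3f}, critical value = +/-{z_critical:.3f}, p-value = {p_value:.4f}")
print("Reject H0" if reject_null else "Fail to reject H0")
```

Under these assumptions the statistic lies outside the acceptance region, so the sample would not be regarded as coming from a population with mean 5.6 feet.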

T-test
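● The one-sample t-test is mainly used for determining whether the mean of a sample is statistically different from a known or hypothesized mean of a given population.
● The test variable needs to be continuous.
● The t statistic is
$t = \frac{\bar{X} - \mu_{H_0}}{\sigma_s / \sqrt{n}}$
where $\sigma_s$ is the sample standard deviation.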
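A small sketch of the t statistic follows. The growth measurements are hypothetical values invented for illustration (loosely following the fertilizer example, with a hypothesized mean of 20 inches), and the critical value 1.895 is the one-tailed t-table entry for 7 degrees of freedom at the 5% level.

```python
import math
import statistics

# Hypothetical one-month growth (inches) of 8 fertilized plants.
growth = [21.5, 19.8, 22.1, 20.4, 23.0, 21.2, 20.9, 22.4]
hypothesized_mean = 20.0
n = len(growth)

sample_mean = statistics.mean(growth)
sample_sd = statistics.stdev(growth)   # sample standard deviation (divides by n - 1)

# One-sample t statistic: (sample mean - hypothesized mean) / standard error.
t_stat = (sample_mean - hypothesized_mean) / (sample_sd / math.sqrt(n))

# One-tailed critical value for alpha = 0.05 and n - 1 = 7 degrees of freedom.
t_critical = 1.895
reject_null = t_stat > t_critical

print(f"t = {t_stat:.3f}, critical value = {t_critical}")
print("Reject H0" if reject_null else "Fail to reject H0")
```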

Chi Square test
● A chi-square test is a test of statistical significance for categorical variables.
● It is used to measure the difference between the observed and expected data.
● It can also be used to check for an association between categorical variables in the data.
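To illustrate how observed and expected data are compared, the sketch below computes the chi-square statistic, the sum of (observed − expected)² / expected over all categories. The counts are hypothetical, and 7.815 is the chi-square table value for 3 degrees of freedom at the 5% level.

```python
# Hypothetical observed counts for four categories (100 observations in total).
observed = [18, 22, 20, 40]

# Expected counts if all four categories were equally likely.
expected = [25, 25, 25, 25]

# Chi-square statistic: sum of squared differences scaled by the expected counts.
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Critical value for alpha = 0.05 with (4 - 1) = 3 degrees of freedom.
critical_value = 7.815
reject_null = chi_square > critical_value

print("Chi-square statistic:", round(chi_square, 2))
print("Reject H0" if reject_null else "Fail to reject H0")
```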

ANOVA
● Analysis of Variance (ANOVA) is an extension of the t-test. It is used to check whether the means of two or more groups are significantly different from each other.

Two sample parametric tests
● Independent samples z-test
This test is carried out on two normally distributed but independent populations to compare the means of their samples.
The population variances of both samples are already known.
Each sample should contain more than 30 observations.
$z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S_1^2}{n_1} + \frac{S_2^2}{n_2}}}$
where $\bar{X}_1$ and $\bar{X}_2$ are the sample means, $S_1$ is the standard deviation of sample 1, $S_2$ is the standard deviation of sample 2, and $n_1$ and $n_2$ are the sample sizes.

Independent sample t-test
● This test is carried out to test the statistical difference between
1. The means of two groups
2. The means of two interventions
3. The means of two change scores

Paired Sample t-test
● This test is carried out to compare two population means using two samples in which each observation in one sample can be paired with an observation in the other.
● It is usually used for before-and-after observations on the same subjects.

Non Parametric Hypothesis test
● Information about the population is unknown, and hence no assumption can be made regarding the population.
● It is more suitable for data measured on qualitative scales (nominal or ordinal).
● It covers techniques that do not rely on the data belonging to any particular distribution.
● The distribution of the data can be skewed, and the population variance can be non-homogeneous.
● One-sample non-parametric tests
○ One-factor Chi-Square
○ Binomial
○ Wilcoxon Signed-Rank Test
● Two independent samples
○ Mann-Whitney Test
○ Kolmogorov-Smirnov Test
● Two paired samples
○ Sign
○ Chi-Square
○ Wilcoxon Signed-Rank

Estimation of Parameter values
● In statistics, estimation or inference refers to the task of drawing conclusions about a population based on information provided by a sample.
● This can be done in two ways:
○ Point estimate
○ Interval estimate
● Point estimation considers only a single value of a statistic.
● Because a point estimate is based on a single random sample, its value will vary when different random samples are drawn from the same population.
● A few of the standard point estimation methods are:
○ Maximum Likelihood Estimator
○ Minimum-Variance Mean-Unbiased Estimator
○ Minimum Mean Squared Error
○ Best Linear Unbiased Estimator

Interval Estimate
It considers two values between which the population parameter is likely to lie.
The two values

Measuring Data Similarity and Dissimilarity
● A similarity measure is a way of measuring how closely data samples are related to each other.
● A dissimilarity measure tells how distinct the data objects are.
● Similarity measures are expressed as numerical values.
● The value gets higher when the data samples are more alike.
● Zero means low similarity and one means very similar.
● Data structures
○ The data matrix
○ The dissimilarity matrix
● Object dissimilarity can be computed for objects described by nominal attributes, binary attributes, numerical attributes, and ordinal attributes.

Proximity measures for Nominal Attributes
● Nominal means relating to names. The values of a nominal attribute are symbols or names of things.
● Let M be the total number of states of a nominal attribute. Then the states can be numbered from 1 to M.
● Let m be the number of attributes for which objects i and j are in the same state, and p the total number of attributes. Then the dissimilarity can be calculated as
d(i, j) = (p − m) / p
and the similarity as
s(i, j) = 1 − d(i, j)
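The two formulas above translate directly into code. In this sketch the function name and the two example objects (described by color, shape, and size) are hypothetical; only the formulas d(i, j) = (p − m) / p and s(i, j) = 1 − d(i, j) come from the text.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Return d(i, j) = (p - m) / p for two objects with nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("both objects must have the same number of attributes")
    p = len(obj_i)                                        # total number of attributes
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)    # attributes in the same state
    return (p - m) / p


# Hypothetical objects described by (color, shape, size).
object_i = ("red", "round", "small")
object_j = ("red", "square", "small")

d = nominal_dissimilarity(object_i, object_j)
s = 1 - d

print("Dissimilarity d(i, j):", round(d, 3))
print("Similarity s(i, j):", round(s, 3))
```

Here the objects agree on two of three attributes, so the dissimilarity is 1/3 and the similarity is 2/3.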

Proximity measures for Numeric data
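● Euclidean distance: the distance between two points (x1, y1) and (x2, y2) is $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$.
● Manhattan distance: the distance between two points (x1, y1) and (x2, y2) is $|x_1 - x_2| + |y_1 - y_2|$.
● Minkowski distance: for two points X = (X1, ..., Xn) and Y = (Y1, ..., Yn), $d = \left( |X_1 - Y_1|^p + |X_2 - Y_2|^p + \dots + |X_n - Y_n|^p \right)^{1/p}$.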
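The three distances are related: the Minkowski distance reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2. The sketch below shows this with a small helper; the function name and the two sample points are illustrative choices.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two points of equal dimension."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of coordinates")
    # Sum the p-th powers of the absolute coordinate differences, then take the p-th root.
    return sum(abs(u - v) ** p for u, v in zip(a, b)) ** (1 / p)


point1 = (1, 2, 3)
point2 = (1, 1, 1)

manhattan = minkowski_distance(point1, point2, 1)   # p = 1
euclidean = minkowski_distance(point1, point2, 2)   # p = 2

print("Manhattan distance:", manhattan)
print("Euclidean distance:", round(euclidean, 4))
```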

● SET A
1. Write a Python program to find the maximum and minimum value of a given flattened array.
import numpy as np
ar = np.array([[0, 1], [2, 3]])
# collapse the 2-D array into one dimension
flat = ar.flatten()
print("Original flattened array:")
print(flat)
print("-----------------")
print("Maximum value of the above flattened array:")
print(np.max(flat))
print("Minimum value of the above flattened array:")
print(np.min(flat))

2. Write a Python program to compute the Euclidean distance between two data points in a dataset. [Hint: Use the linalg.norm function from NumPy]
import numpy as np
point1 = np.array((1, 2, 3))
point2 = np.array((1, 1, 1))
# calculate the Euclidean distance as the norm of the difference vector
dist = np.linalg.norm(point1 - point2)
# print the Euclidean distance
print(dist)

3. Create one dataframe of data values. Find out the mean, range and IQR for this data.
import pandas as pd
df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=["Apple", "Orange", "Banana", "Pear"],
                  index=["Basket1", "Basket2", "Basket3", "Basket4",
                         "Basket5", "Basket6"])
print("\n----------- Calculate Mean -----------\n")
print(df.mean())
print("-----Maximum Value-------")
a = df.max()
print(a)
print("-----Minimum Value-------")
b = df.min()
print(b)
# range per column: maximum minus minimum
r = a - b
print("-------Range-------")
print(r)
# IQR per column: third quartile minus first quartile
iqr = df.quantile(0.75) - df.quantile(0.25)
print("-------IQR-------")
print(iqr)

4. Find the sum of Manhattan distances between all pairs of given points.
Return the sum of distances between all pairs of points.
def distancesum(x, y, n):
    total = 0
    # for each point, add its distance
    # to each of the remaining points
    for i in range(n):
        for j in range(i + 1, n):
            total += (abs(x[i] - x[j]) +
                      abs(y[i] - y[j]))
    return total
# Driver code
x = [-1, 1, 3, 2]
y = [5, 6, 5, 3]
n = len(x)
print(distancesum(x, y, n))

5. Write a NumPy program to compute the histogram of nums against the
bins.
import numpy as np
import matplotlib.pyplot as plt
nums = np.array([0.5, 0.7, 1.0, 1.2, 1.3, 2.1])
bins = np.array([0, 1, 2, 3])
print("nums: ",nums)
print("bins: ",bins)
print("Result:", np.histogram(nums, bins))
plt.hist(nums, bins=bins)
plt.show()

6. Create a dataframe for students' information such as name, graduation percentage and age.
# Display the average age of students and the average graduation percentage.
# Also describe all basic statistics of the data. (Hint: use describe()).
import pandas as pd
stud_data = {"name": ["Akanksha", "Diya", "Komal", "James", "Emily", "Jonas"],
             "grade": [78, 69, 65, 90, 45, 89],
             "age": [21, 23, 22, 19, 20, 18]}
df = pd.DataFrame(stud_data)
print(df)
print("------average of graduation percentage-------")
mean_grade = df["grade"].mean()
print(mean_grade)
print("------average age-------")
mean_age = df["age"].mean()
print(mean_age)
print("------Describe basic statistics of data-------")
print(df.describe())

Concept of outlier
● An outlier is an observation that lies an
abnormal distance from other values in a
random sample from a population.
● Outlier detection is the process of finding
data objects with behaviors that are
different from expectation
● They can be caused by measurement or
execution errors.

There are eight main causes of outliers.
● Incorrect data entry by humans
● Codes used instead of values
● Sampling errors, or data has been extracted from the wrong place or mixed
with other data
● Unexpected distribution of variables
● Measurement errors caused by the application or system
● Experimental errors in extracting the data or planning errors
● Intentional dummy outliers inserted to test the detection methods
● Natural deviations in data, not actually an error, that may indicate fraud or some other anomaly you are trying to detect

Global Outlier
● Global outliers are also called point outliers. Global
outliers are taken as the simplest form of outliers.
● When data points deviate from all the rest of the data
points in a given data set, it is known as the global outlier.
● In most cases, outlier detection procedures are aimed at finding global outliers. In the figure, the green data point is the global outlier.

Contextual Outlier
● Contextual outliers are also known as Conditional
outliers. These types of outliers happen if a data
object deviates from the other data points because of
any specific condition in a given data set.
● As we know, there are two types of attributes of
objects of data: contextual attributes and behavioral
attributes.
● Contextual outlier analysis enables the users to
examine outliers in different contexts and conditions,
which can be useful in various applications.
● For example, A temperature reading of 45 degrees
Celsius may behave as an outlier in a rainy season.
Still, it will behave like a normal data point in the
context of a summer season. In the given diagram, a
green dot representing the low-temperature value in
June is a contextual outlier since the same value in
December is not an outlier.

Collective Outlier
● Collective outliers are groups of data
points that collectively deviate significantly
from the overall distribution of a dataset.
● Collective outliers may not be outliers
when considered individually, but as a
group, they exhibit unusual behavior.
● Detecting and interpreting collective
outliers can be more complex than
individual outliers, as the focus is on group
behavior rather than individual data
points.

Outlier detection Method
● Supervised
● Semi Supervised
● Unsupervised

Supervised methods
● Supervised methods model data normality and abnormality.
● Domain experts examine and label a sample of the underlying data.
● Outlier detection can then be modeled as a classification problem. The task is to learn a classifier that can recognize outliers.
● The labeled sample is used for training and testing.
● In some applications, the experts may label just the normal objects, and any other objects not matching the model of normal objects are reported as outliers.

Unsupervised methods
● In many applications, objects labeled as "normal" or "outlier" are not available.
● Therefore, an unsupervised learning approach has to be used.
● Unsupervised outlier detection methods make an implicit assumption that the normal objects are somewhat "clustered."
● An unsupervised outlier detection method expects that normal objects follow a pattern far more frequently than outliers.
● Normal objects do not have to fall into one group sharing high similarity. Instead, they can form several groups, where each group has distinct features.

Semi-Supervised Methods
● In several applications, although obtaining some labeled instances is possible, the number of such labeled instances is small.
● We may encounter cases where only a small set of the normal and outlier objects are labeled, but most of the data are unlabeled.
● Semi-supervised outlier detection methods were developed to tackle such cases.
● Semi-supervised outlier detection methods can be regarded as applications of semi-supervised learning approaches. For example, when some labeled normal objects are available, they can be used together with nearby unlabeled objects to train a model of normal objects. The model of normal objects is then used to identify outliers: objects that do not fit the model of normal objects are classified as outliers.

Statistical Method
● These are also known as model-based methods.
● Simply starting with a visual analysis of univariate data, using box plots, scatter plots, whisker plots, etc., can help in finding the extreme values in the data.
● Assuming a normal distribution, calculate the z-score, which is the number of standard deviations (σ) a data point lies from the sample's mean.
● Another way is to use the interquartile range (IQR) as a criterion and treat as outliers the points that lie more than 1.5 times the IQR below the first quartile or above the third quartile.
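Both statistical criteria described above can be sketched with NumPy. The readings are hypothetical values invented for illustration, and the z-score cutoff of 3 is a commonly used threshold assumed here rather than one stated in the text.

```python
import numpy as np

# Hypothetical readings: 19 typical values and one extreme value.
readings = np.array([10, 12, 11, 13, 12, 11, 10, 12, 13, 14,
                     11, 12, 13, 10, 12, 11, 13, 12, 11, 95])

# z-score method: number of standard deviations each point lies from the mean.
z_scores = (readings - readings.mean()) / readings.std()
z_outliers = readings[np.abs(z_scores) > 3]   # assumed cutoff of 3

# IQR method: points more than 1.5 * IQR below Q1 or above Q3.
q1, q3 = np.percentile(readings, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = readings[(readings < lower_fence) | (readings > upper_fence)]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```

With very small samples the z-score of a single extreme point is bounded, so a cutoff of 3 may never trigger; the IQR rule is often the more practical choice there.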

Proximity Methods
● They assume that an object is an outlier if the nearest neighbors of the object are far away in feature space.
● That is, the proximity of the object to its neighbors deviates significantly from the proximity of most of the other objects to their neighbors in the same data set.
● Proximity-based methods are classified into two types:
○ Distance-based methods judge a data point based on the distance(s) to its neighbors.
○ Density-based methods determine the degree of outlierness of each data instance based on its local density.
