Chapter -2
Statistical Data Analysis
Introduction
● Data science is an interdisciplinary field that requires a strong
understanding of mathematics, statistical reasoning and computer science.
● Statistics is the science of collecting, analyzing and interpreting data.
● The data is usually numerical and available in large quantities.
● Statistics serves as the foundation for working with data and analyzing it in
data science.
● It provides tools and methods to find structure in data and to give deeper insight
into it.
● Data scientists use a combination of statistical formulae and computer
algorithms to detect patterns and trends within data.
Steps for processing data
1. Identifying the important features in the data
2. Finding relationships between features
3. Converting the features into the required format
4. Normalizing and scaling the data
5. Identifying the distribution and nature of the data
6. Making adjustments to the data
7. Identifying the right mathematical approach
8. Verifying the results using different accuracy measures
Roles of statistics in Data Science
Data Exploration
Data Cleaning
Data Transformation
Data Visualization
Finding Similarity/Dissimilarity
Model Selection and Evaluation
Hypothesis Testing
Statistical Modeling
Probability Distribution and Estimation
Types of Statistics
● Descriptive Statistics
○ Measures of Frequency
○ Measures of Central Tendency
○ Measures of Dispersion
● Inferential Statistics
○ Parameter Estimation
○ Hypothesis Testing
Descriptive Statistics
● Descriptive statistics provides ways of describing, presenting, summarizing and organizing
the data.
● It summarizes a large amount of data and
presents it in a simple and understandable form.
● The summarization is done on a sample of the population using
measures such as the mean, median and standard deviation.
Types of Descriptive Statistics
● Measures of Frequency
● Measures of Central Tendency
○ Mean
○ Median
○ Mode
● Measures of Dispersion
○ Range
○ Interquartile Range
○ Standard Deviation
Measures of Frequency
● Frequency is a basic statistical quantity in data science.
● It is the number of times a data value occurs.
● For a dataset, it tells how often a particular value of a feature occurs.
● The frequency distribution can be tabulated as a frequency table or chart.
Twenty students were asked how many hours they worked per day. Their responses, in hours,
are as follows: 5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3.
DATA VALUE    FREQUENCY
2             3
3             5
4             3
5             6
6             2
7             1
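A minimal sketch of how the frequency table above could be built in Python. It uses collections.Counter from the standard library on the twenty responses (values 2 to 7); the variable names are illustrative.

```python
from collections import Counter

# Hours worked per day reported by the twenty students
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

# Count how many times each data value occurs
freq = Counter(hours)

print("DATA VALUE  FREQUENCY")
for value in sorted(freq):
    print(f"{value:<11} {freq[value]}")
```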
Measures of Central Tendency
● An important goal of statistical analysis is to find one value that
describes the characteristics of the entire set of data.
● This single value is referred to as the central tendency: it describes a whole
set of data with a single value that represents the center of its distribution.
● Measures of central tendency are also known as summary statistics and are used
to represent the center point of the data.
Mean
● The mean is the most common and effective numeric measure of the center of a set of
data.
● It is the sum of all the observations divided by the sample size.
● The types of mean are:
○ Arithmetic mean
○ Harmonic mean
○ Geometric mean
Arithmetic mean
● It is obtained by adding all the values and then dividing the sum by the total
number of values.
● Let x1, x2, x3, …, xn be a set of n values or observations. The arithmetic
mean of this set of values is:
x̄ = (x1 + x2 + … + xn) / n
● Suppose the marks obtained by 10 students in a quiz are 8, 3, 7, 6, 9, 10, 5, 7, 8, 5.
● We can calculate
x̄ = (8 + 3 + 7 + 6 + 9 + 10 + 5 + 7 + 8 + 5) / 10 = 68 / 10 = 6.8
● The arithmetic mean can be calculated using the mean() function from the NumPy
library.
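As a short illustration of the quiz example, the sketch below computes the arithmetic mean both directly from the definition and with NumPy's mean() function. It assumes NumPy is installed; the variable names are illustrative.

```python
import numpy as np

# Marks obtained by the 10 students in the quiz
marks = np.array([8, 3, 7, 6, 9, 10, 5, 7, 8, 5])

# Definition: sum of the values divided by the number of values
manual_mean = marks.sum() / len(marks)

# Same result using the NumPy library function
numpy_mean = np.mean(marks)

print(manual_mean, numpy_mean)
```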
Harmonic Mean
● The harmonic mean is the reciprocal of the
average of the reciprocals of the terms in a series. The formula to determine the
harmonic mean is HM = n / [1/x1 + 1/x2 + 1/x3 + ... + 1/xn].
● Example x=(6,3,1,5,2)
● HM= ?
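One way to check the harmonic mean exercise is sketched below. It applies the formula n / [1/x1 + ... + 1/xn] directly and compares the result with statistics.harmonic_mean from the Python standard library.

```python
from statistics import harmonic_mean

x = [6, 3, 1, 5, 2]

# Formula from the slide: n divided by the sum of the reciprocals
hm_manual = len(x) / sum(1 / value for value in x)

# Standard-library implementation, used here as a cross-check
hm_library = harmonic_mean(x)

print(round(hm_manual, 4), round(hm_library, 4))
```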
Geometric Mean
● The geometric mean is an average that shows the central tendency of
a set of numbers by using the product of their values.
● For n positive values it is the n-th root of their product: GM = (x1 × x2 × … × xn)^(1/n).
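A small sketch of the geometric mean, assuming a hypothetical set of positive values (not taken from the slides). It computes the n-th root of the product and cross-checks it against statistics.geometric_mean (Python 3.8 or later).

```python
from math import prod
from statistics import geometric_mean

# Hypothetical positive values chosen for illustration only
values = [2, 4, 8]

# n-th root of the product of the n values
gm_manual = prod(values) ** (1 / len(values))

# Standard-library implementation, used here as a cross-check
gm_library = geometric_mean(values)

print(round(gm_manual, 4), round(gm_library, 4))
```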
Median
● It is the middle value of the sorted data.
● It is the value that separates the higher half of a data set from the lower half.
● It splits the data in half and is also called the 50th percentile.
● If the number of elements in the data set is odd, the middle element is the
median.
● If the number of elements in the data set is even, the median is the average of the two
central elements.
● Advantages
○ Less affected by outliers and skewed data than the mean
○ Appropriate for skewed data
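The odd and even cases, and the robustness to outliers, can be sketched with statistics.median. The small data sets below are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from statistics import mean, median

odd_data = [9, 3, 6, 10, 5]       # sorted: 3, 5, 6, 9, 10 -> middle element
even_data = [9, 3, 6, 5]          # sorted: 3, 5, 6, 9 -> average of 5 and 6
with_outlier = [9, 3, 6, 100, 5]  # same as odd_data, but 10 replaced by 100

median_odd = median(odd_data)
median_even = median(even_data)

# The outlier pulls the mean up sharply, while the median does not move
print(mean(odd_data), median_odd)
print(mean(with_outlier), median(with_outlier))
```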
Mode
● It is the value that occurs most frequently in a dataset.
● It is possible for several different values to share the maximum frequency,
which results in more than one mode.
● A dataset with one mode is called unimodal.
● A dataset with two modes is called bimodal.
● A dataset with three modes is called trimodal.
● Advantages
○ Can be used for categorical values
○ Can be determined for both qualitative and quantitative values
○ Not affected by extreme values
● Disadvantages
○ Not based on all values
○ Cannot be clearly defined for a multimodal series
○ Not suitable for further statistical analysis and algebraic calculation
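A brief sketch of finding the mode with the Python standard library. The hours data is the twenty-student example from the frequency slide; the bimodal and categorical lists are hypothetical.

```python
from statistics import mode, multimode

# Hours worked per day (frequency example): 5 occurs most often
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]
single_mode = mode(hours)

# Hypothetical bimodal data: two values share the maximum frequency
bimodal = multimode([1, 1, 2, 2, 3])

# The mode also works for categorical (qualitative) values
colour_mode = mode(["red", "blue", "red", "green"])

print(single_mode, bimodal, colour_mode)
```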
Measures of Dispersion
● Dispersion is the extent to which values in a distribution differ from the average of the
distribution.
● Measures of central tendency alone are not sufficient to describe the data.
● Measures of dispersion help us to know the degree of variability in the data and
provide a better understanding of it.
● They assess the dispersion or spread
of numeric data.
● The measures are:
○ Range
○ Quantiles
○ Quartiles
○ Percentiles
○ Interquartile range
Range
● It is the simplest measure of dispersion. Let x1, x2, …, xn be a set of observations
for some numeric attribute X.
● The range of the set is the difference between the largest (max()) and the
smallest (min()) values.
● Range = max − min
Standard Deviation
● It is a measure of how much the data values deviate from the mean value.
● σ = √( Σ (x − x̄)² / n )
● Find the SD for 4, 9, 11, 12, 17, 5, 8, 12, 14.
Variance
● Variance measures how far a data set is spread out. It is mathematically
defined as the average of the squared differences from the mean.
● Variance = (Standard deviation)² = σ²
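The standard deviation exercise can be checked with a short sketch. It follows the population formula (dividing by n) and cross-checks against statistics.pvariance and statistics.pstdev. A sample standard deviation (dividing by n − 1) would give a slightly larger value.

```python
from statistics import mean, pstdev, pvariance

data = [4, 9, 11, 12, 17, 5, 8, 12, 14]

# Average of the squared differences from the mean (population variance)
mu = mean(data)
variance = sum((x - mu) ** 2 for x in data) / len(data)

# Standard deviation is the square root of the variance
sd = variance ** 0.5

print(round(variance, 4), round(sd, 4))
print(round(pvariance(data), 4), round(pstdev(data), 4))  # library cross-check
```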
Interquartile Range
● The interquartile range (IQR) is a measure of variation that describes how spread out
the data is.
● It is a measure of variability based on splitting the data
into quartiles.
● Quartiles divide the ordered data into four equal parts, which are demarcated
by the three quartiles Q1, Q2 and Q3.
● The interquartile range is the difference between the third and
first quartiles: IQR = Q3 − Q1.
● Consider the following data:
2, 3, 4, 7, 10, 15, 22, 26, 27, 30, 32
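A sketch of computing the quartiles and IQR for this data with statistics.quantiles. Note that quartile conventions differ: the default "exclusive" method matches the hand calculation that takes the medians of the lower and upper halves, while the "inclusive" method interpolates differently and gives another answer. Which convention the slides intend is an assumption here.

```python
from statistics import quantiles

data = [2, 3, 4, 7, 10, 15, 22, 26, 27, 30, 32]

# Default method="exclusive": Q1, Q2 (median), Q3
q1, q2, q3 = quantiles(data, n=4)
iqr = q3 - q1
print(q1, q2, q3, iqr)

# The "inclusive" method uses a different interpolation rule
q1_inc, _, q3_inc = quantiles(data, n=4, method="inclusive")
iqr_inc = q3_inc - q1_inc
print(q1_inc, q3_inc, iqr_inc)
```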
Inferential Statistics
● Inferential statistics draws inferences and makes predictions about a population
based on a sample of data chosen from the population in question.
● The sample is considered representative of the entire universe or
population.
● Statistical inference mainly deals with two different kinds of problems:
○ Hypothesis testing
○ Estimation of parameter values
Hypothesis testing
● Hypothesis testing is mainly used to determine whether there is sufficient
evidence in a data sample to conclude that a particular condition holds for an
entire population.
● There are two hypotheses:
○ Null hypothesis
○ Alternative hypothesis
● The null hypothesis states that there is no difference between
groups or no relationship between variables.
● The alternative hypothesis states that there is a difference between groups or a relationship between the
variables being studied (for example, one variable has an effect on the other).
Steps for Hypothesis Testing
● State the null and alternative hypotheses
● Select the appropriate significance level and check the specified test
assumptions
● Analyze the data by computing the appropriate statistical test
Example of Hypothesis
● For example, suppose a biologist believes that a certain fertilizer will cause
plants to grow more during a one-month period than they normally do, which
is currently 20 inches. To test this, she applies the fertilizer to each of the
plants in her laboratory for one month.
● She then performs a hypothesis test using the following hypotheses:
● H0: μ = 20 inches (the fertilizer will have no effect on the mean plant growth)
● HA: μ > 20 inches (the fertilizer will cause mean plant growth to increase)
● For example, suppose a doctor believes that a new drug is able to reduce
blood pressure in obese patients. To test this, he may measure the blood
pressure of 40 patients before and after using the new drug for one month.
● He then performs a hypothesis test using the following hypotheses:
● H0: μafter = μbefore (the mean blood pressure is the same before and after using
the drug)
● HA: μafter < μbefore (the mean blood pressure is less after using the drug)
Parametric hypothesis tests
Information about the population distribution is assumed to be known and can be used for statistical inference.
Steps for a parametric test
Step 1: State the null and alternative hypotheses
Step 2: Choose the level of significance
Step 3: Identify the type of parametric test to be conducted
Step 4: Find the critical value that separates the acceptance and rejection regions
Step 5: Compute the value of the test statistic from the sample
Step 6: Compare the obtained value with the critical value to decide whether the null hypothesis is
rejected or not
Terms related with Parametric test
1. Acceptance and critical regions:
The set of all possible values of the test statistic can be divided into two mutually exclusive groups:
● Acceptance region: the set of values that appear to be consistent with the null
hypothesis
● Rejection (critical) region: the set of values that are unlikely to occur if the null
hypothesis is true
One tailed test and Two tailed Test
If the hypothesis stated in the problem involves an equal sign, it is a two-tailed test.
If it involves a greater-than or less-than sign, it is a one-tailed test.
Case 1: A government school states that the dropout rate of female students between the ages of 12 and 18
years is 28% (two-tailed).
Case 2: A government school states that the dropout rate of female
students between the ages of 12 and 18 years is greater than 28% (one-tailed).
Case 3: A government school states that the dropout rate of female
students between the ages of 12 and 18 years is less than 28% (one-tailed).
Significance Level
It is denoted by α.
It is the probability of rejecting the null hypothesis even though it is true.
For example, a significance level of 0.03 indicates that a 3% risk is being taken
of concluding that a difference exists when there is no difference.
Typical values of the significance level are 0.01, 0.05 and 0.1.
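To connect the significance level with the acceptance and rejection regions, the sketch below computes critical values for the typical levels of α. It assumes the test statistic follows a standard normal distribution (as in a z-test) and uses statistics.NormalDist from the standard library; other tests would need their own distribution tables.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

critical = {}
for alpha in (0.01, 0.05, 0.10):
    one_tailed = std_normal.inv_cdf(1 - alpha)      # all of alpha in one tail
    two_tailed = std_normal.inv_cdf(1 - alpha / 2)  # alpha split across both tails
    critical[alpha] = (one_tailed, two_tailed)
    print(f"alpha={alpha}: one-tailed {one_tailed:.3f}, two-tailed ±{two_tailed:.3f}")
```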
Calculated probability
The calculated probability (p-value) is the probability, when the null hypothesis is true, that the
statistical summary will be at least as extreme as the actual observed result.
Examples of one-sample parametric tests
○ Z-test
○ T-test
○ Chi-Square
Types of Hypothesis Testing
● Parametric Test
○ One Sample: Z-Test, T-Test, Chi-Square Test
○ Two Sample
  Independent Samples: Z-Test, Two-group test
  Paired Samples: Paired test
● Non-Parametric Test
○ One Sample
○ Two Sample
Z -Test
● This test is used for comparing the mean of a sample to some hypothesized mean
of a given population.
● The test statistic for a one-sample z-test is
z = (X̄ − µH0) / (σp / √n)
where X̄ = sample mean,
µH0 = hypothesized population mean,
σp = population standard deviation,
n = sample size
Example
● A sample of 500 female students has a mean height of 5.4 feet. The
task is to decide whether it can reasonably be regarded as a sample from a large
population with a mean height of 5.6 feet and a standard deviation of 1.45
feet. Let us use a 5% level of significance to solve the problem.
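The height example can be worked through with a short sketch. It assumes a two-tailed test (H0: µ = 5.6 against HA: µ ≠ 5.6), which the slide does not state explicitly, and uses statistics.NormalDist for the critical value and p-value.

```python
from math import sqrt
from statistics import NormalDist

n = 500             # sample size
sample_mean = 5.4   # mean height of the sample (feet)
pop_mean = 5.6      # hypothesized population mean (feet)
pop_sd = 1.45       # population standard deviation (feet)
alpha = 0.05        # significance level

# z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
z = (sample_mean - pop_mean) / (pop_sd / sqrt(n))

# Two-tailed critical value and p-value under the standard normal distribution
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * NormalDist().cdf(-abs(z))

reject_h0 = abs(z) > z_crit
print(f"z = {z:.3f}, critical = ±{z_crit:.3f}, p = {p_value:.4f}, reject H0: {reject_h0}")
```

Under these assumptions |z| exceeds the critical value, so the sample would not be regarded as drawn from a population with mean height 5.6 feet.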
T-test
● The one-sample t-test is mainly used for determining whether the mean of a
sample is statistically different from a known or hypothesized mean of a given
population.
● The test variable needs to be continuous.
● The test statistic is
t = (X̄ − µH0) / (σs / √n)
where σs is the sample standard deviation.
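A minimal sketch of a one-sample t-test, loosely following the fertilizer example (H0: µ = 20 inches, HA: µ > 20 inches). The ten growth measurements are hypothetical and invented for illustration, and the critical value is read from a standard t-table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical one-month growth (inches) of 10 fertilized plants; not from the slides
growth = [21.5, 20.8, 22.1, 19.9, 21.0, 20.4, 22.3, 21.7, 20.9, 21.4]
mu_h0 = 20.0  # hypothesized mean growth under H0

n = len(growth)
sample_mean = mean(growth)
sample_sd = stdev(growth)  # sample standard deviation (divides by n - 1)

# t = (sample mean - hypothesized mean) / (s / sqrt(n))
t_stat = (sample_mean - mu_h0) / (sample_sd / sqrt(n))

# One-tailed critical value for alpha = 0.05 with n - 1 = 9 degrees of freedom,
# taken from a standard t-table
t_crit = 1.833
reject_h0 = t_stat > t_crit
print(f"t = {t_stat:.3f}, critical = {t_crit}, reject H0: {reject_h0}")
```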
Chi Square test
● A chi-square test is a test of statistical significance for categorical variables.
● It is used to find the difference between the observed and expected data.
● It is also used to test for association between categorical variables in our data.
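The comparison of observed and expected data can be sketched as a chi-square goodness-of-fit calculation, χ² = Σ (O − E)² / E. The die-roll counts below are hypothetical, and the critical value is taken from a standard chi-square table.

```python
# Hypothetical counts from 60 rolls of a die (faces 1 to 6); not from the slides
observed = [8, 12, 9, 11, 14, 6]

# Under H0 (a fair die) every face is expected equally often
expected = [sum(observed) / len(observed)] * len(observed)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Critical value for alpha = 0.05 with 6 - 1 = 5 degrees of freedom,
# taken from a standard chi-square table
chi_crit = 11.070
reject_h0 = chi_square > chi_crit
print(f"chi-square = {chi_square:.2f}, critical = {chi_crit}, reject H0: {reject_h0}")
```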
ANOVA
● Analysis of Variance (ANOVA) is an extension of the t-test. It is used to check whether
the means of two or more groups are significantly different from each other.
Two sample parametric tests
● Independent samples z-test
This test is carried out on two normally distributed but independent populations
for comparing the means of the samples.
The population variances of both the samples are already known.
The size of each sample should be larger than 30.
For H0: µ1 = µ2, the test statistic is z = (X̄1 − X̄2) / √(S1²/n1 + S2²/n2)
where S1 is the standard deviation of sample 1,
S2 is the standard deviation of sample 2,
and n1, n2 are the sample sizes.
Independent sample t-test
● This test is carried out to test the statistical difference between
1. The means of two groups
2. The means of two interventions
3. The means of two change scores
Paired Sample t-test
● It is carried out to compare two population means using two samples in
which each observation in one sample can be paired with an observation in the
other sample.
● This test is usually used for before-and-after observations on the
same subjects.
Non Parametric Hypothesis test
● Information about the population is unknown, and hence no assumption can be made regarding the
population.
● It is more suitable for data measured on qualitative scales
(nominal or ordinal).
● It covers techniques that do not rely on the data belonging to any particular distribution.
● The distribution of the data can be skewed, and the population variance can be non-homogeneous.
● One-sample non-parametric tests
○ One-factor Chi-Square
○ Binomial
○ Wilcoxon Signed Rank Test
● Two independent samples
○ Mann-Whitney Test
○ Kolmogorov-Smirnov Test
● Two paired samples
○ Sign
○ Chi-Square
○ Wilcoxon Signed Rank
Estimation of Parameter values
● In statistics, estimation or inference refers to the task of drawing conclusions
about a population based on information provided by a sample from the population.
● This can be done in two ways:
○ Point estimate
○ Interval estimate
● Point estimation gives only a single value of a statistic.
● Because a point estimate is based on a single random sample, its value will vary when
different random samples are drawn from the same population.
● A few of the standard point estimation methods are:
○ Maximum Likelihood Estimator
○ Minimum-Variance Mean-Unbiased Estimator
○ Minimum Mean Squared Error
○ Best Linear Unbiased Estimator
Interval Estimate
It considers two
values between which the population parameter is likely to lie.
The two values
Measuring Data Similarity and Dissimilarity
● A similarity measure is a way of measuring how closely data samples are related
to each other.
● A dissimilarity measure tells how distinct the data objects are.
● Similarity measures are expressed as numerical values.
● The value gets higher when the data samples are more alike.
● Similarity is often scaled between zero and one: zero means low similarity and one means very similar.
● Data structures
○ The data matrix
○ The dissimilarity matrix
● Object dissimilarity can be computed for objects described by nominal
attributes, binary attributes, numerical attributes and ordinal attributes.
Proximity measures for Nominal Attributes
● Nominal means relating to names. The values of a nominal attribute
are symbols or names of things.
● Let M be the total number of states of a nominal attribute. Then the states can be
numbered from 1 to M.
● Let m be the number of attributes for which objects i and j are in the same state, and
p the total number of attributes. Then the dissimilarity can be calculated as
d(i, j) = (p − m) / p
and the similarity as
s(i, j) = 1 − d(i, j)
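The nominal dissimilarity formula d(i, j) = (p − m) / p can be sketched as a small function. The function name and the two example objects (described by colour, shape and size) are hypothetical.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """Return d(i, j) = (p - m) / p for two objects with nominal attributes."""
    if len(obj_i) != len(obj_j):
        raise ValueError("both objects must have the same number of attributes")
    p = len(obj_i)                                  # total number of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))   # attributes in the same state
    return (p - m) / p

# Hypothetical objects described by (colour, shape, size)
object_i = ("red", "round", "small")
object_j = ("red", "square", "large")

d = nominal_dissimilarity(object_i, object_j)
s = 1 - d  # similarity
print(round(d, 3), round(s, 3))
```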
Proximity measures for Numeric data
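● Euclidean distance: the Euclidean distance between two points (x1, y1) and (x2, y2) is
d = √[(x2 − x1)² + (y2 − y1)²]
● Manhattan distance: the Manhattan distance between two points (x1,
y1) and (x2, y2) is given by |x1 − x2| + |y1 − y2|.
● Minkowski distance: for two n-dimensional points X = (X1, …, Xn) and Y = (Y1, …, Yn) and order p, it is
(|X1 − Y1|^p + |X2 − Y2|^p + … + |Xn − Yn|^p)^(1/p)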
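A short sketch showing that the Minkowski distance generalizes the other two: order p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. The function name is illustrative, and the two points are chosen only as an example.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p between two equal-length points."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

point1 = (1, 2, 3)
point2 = (1, 1, 1)

manhattan = minkowski_distance(point1, point2, 1)  # |0| + |1| + |2|
euclidean = minkowski_distance(point1, point2, 2)  # sqrt(0 + 1 + 4)
print(manhattan, round(euclidean, 4))
```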
● SET A
1. Write a Python program to find the maximum and minimum value of a given
flattened array.
import numpy as np
ar = np.array([[0, 1], [2, 3]])
flat = ar.flatten()  # flatten the 2-D array into a 1-D array
print("Original Flattened Array")
print(flat)
print("-----------------")
print("Maximum value of the above flattened array:")
print(np.max(flat))
print("Minimum value of the above flattened array:")
print(np.min(flat))
2. Write a Python program to compute the Euclidean distance between two data
points in a dataset. [Hint: Use the linalg.norm function from NumPy]
import numpy as np
point1 = np.array((1, 2, 3))
point2 = np.array((1, 1, 1))
# calculate the Euclidean distance as the norm of the difference vector
dist = np.linalg.norm(point1 - point2)
# print the Euclidean distance
print(dist)
3. Create a dataframe of data values. Find the mean, range and IQR for
this data.
import pandas as pd
df = pd.DataFrame([[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
                   [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
                  columns=["Apple", "Orange", "Banana", "Pear"],
                  index=["Basket1", "Basket2", "Basket3", "Basket4",
                         "Basket5", "Basket6"])
print("\n----------- Calculate Mean -----------\n")
print(df.mean())
print("-----Maximum Value-------")
a = df.max()
print(a)
print("-----Minimum Value-------")
b = df.min()
print(b)
r = a - b  # range of each column
print("-------Range-------")
print(r)
# IQR of each column: third quartile minus first quartile
iqr = df.quantile(0.75) - df.quantile(0.25)
print("-------IQR-------")
print(iqr)
4. Find the sum of the Manhattan distances between all pairs of given points.
Return the sum of the distances between all pairs of points.
def distancesum(x, y, n):
    total = 0
    # for each point, add its distance
    # to each of the remaining points
    for i in range(n):
        for j in range(i + 1, n):
            total += (abs(x[i] - x[j]) +
                      abs(y[i] - y[j]))
    return total

# Driver code
x = [-1, 1, 3, 2]
y = [5, 6, 5, 3]
n = len(x)
print(distancesum(x, y, n))
5. Write a NumPy program to compute the histogram of nums against the
bins.
import numpy as np
import matplotlib.pyplot as plt
nums = np.array([0.5, 0.7, 1.0, 1.2, 1.3, 2.1])
bins = np.array([0, 1, 2, 3])
print("nums: ",nums)
print("bins: ",bins)
print("Result:", np.histogram(nums, bins))
plt.hist(nums, bins=bins)
plt.show()
6. Create a dataframe for students' information such as name, graduation
percentage and age.
# Display the average age of students and the average graduation percentage.
# Also describe all basic statistics of the data. (Hint: use describe()).
import pandas as pd
stud_data = {"name": ["Akanksha", "Diya", "Komal", "James", "Emily", "Jonas"],
             "grade": [78, 69, 65, 90, 45, 89],
             "age": [21, 23, 22, 19, 20, 18]}
df = pd.DataFrame(stud_data)
print(df)
print("------average of graduation percentage-------")
mean_grade = df["grade"].mean()
print(mean_grade)
print("------average age of students-------")
mean_age = df["age"].mean()
print(mean_age)
print("------Describe basic statistics of data-------")
print(df.describe())
Concept of outlier
● An outlier is an observation that lies an
abnormal distance from other values in a
random sample from a population.
● Outlier detection is the process of finding
data objects whose behavior
differs from what is expected.
● Outliers can be caused by measurement or
execution errors.
There are eight main causes of outliers.
● Incorrect data entry by humans
● Codes used instead of values
● Sampling errors, or data has been extracted from the wrong place or mixed
with other data
● Unexpected distribution of variables
● Measurement errors caused by the application or system
● Experimental errors in extracting the data or planning errors
● Intentional dummy outliers inserted to test the detection methods
● Natural deviations in the data, not actually errors, that may indicate fraud or
some other anomaly you are trying to detect
Global Outlier
● Global outliers are also called point outliers. Global
outliers are taken as the simplest form of outliers.
● When data points deviate from all the rest of the data
points in a given data set, it is known as the global outlier.
● In most cases, all the outlier detection procedures are
targeted to determine the global outliers. The green data
point is the global outlier.
Contextual Outlier
● Contextual outliers are also known as Conditional
outliers. These types of outliers happen if a data
object deviates from the other data points because of
any specific condition in a given data set.
● As we know, there are two types of attributes of
objects of data: contextual attributes and behavioral
attributes.
● Contextual outlier analysis enables the users to
examine outliers in different contexts and conditions,
which can be useful in various applications.
● For example, A temperature reading of 45 degrees
Celsius may behave as an outlier in a rainy season.
Still, it will behave like a normal data point in the
context of a summer season. In the given diagram, a
green dot representing the low-temperature value in
June is a contextual outlier since the same value in
December is not an outlier.
Collective Outlier
● Collective outliers are groups of data
points that collectively deviate significantly
from the overall distribution of a dataset.
● Collective outliers may not be outliers
when considered individually, but as a
group, they exhibit unusual behavior.
● Detecting and interpreting collective
outliers can be more complex than
individual outliers, as the focus is on group
behavior rather than individual data
points.
Outlier detection Method
● Supervised
● Semi Supervised
● Unsupervised
Supervised methods
● Supervised methods model data normality and abnormality.
● Domain experts examine and label a sample of the underlying data.
● Outlier detection can then be modeled as a classification problem. The task is to
learn a classifier that can recognize outliers.
● The sample can be used for training and testing.
● In some applications, the experts may label just the normal objects, and any
other objects not matching the model of normal objects are reported as
outliers.
Unsupervised methods
● In some application scenarios, objects labeled as “normal” or “outlier” are
not available.
● Therefore, an unsupervised learning approach has to be used.
● Unsupervised outlier detection methods make an implicit assumption:
the normal objects are somewhat “clustered.”
● An unsupervised outlier detection method expects that normal objects follow a
pattern far more frequently than outliers.
● Normal objects do not have to fall into one group sharing high similarity.
Instead, they can form several groups, where each group has distinct
features.
Semi-Supervised Methods
● In several applications, although obtaining some labeled instances is possible,
the number of such labeled instances is small.
● We may encounter cases where only a small set of the normal and outlier
objects are labeled, while the remaining data are unlabeled.
● Semi-supervised outlier detection methods were developed to tackle such
scenarios.
● Semi-supervised outlier detection methods can be regarded as applications
of semi-supervised learning approaches. For example, when some labeled
normal objects are available, we can use them, together with unlabeled objects that are
nearby, to train a model for normal objects. The model of normal objects is then
used to detect outliers: those objects that do not fit the model of normal
objects are classified as outliers.
Statistical Method
● These are also known as model-based methods.
● Simply starting with a visual analysis of the univariate data by using boxplots,
scatter plots, whisker plots, etc., can help in finding the extreme values in the
data.
● Assuming a normal distribution, calculate the z-score, which is the number of
standard deviations (σ) a data point lies from the sample's mean.
● Another way is to use the interquartile range (IQR) as a criterion,
treating as outliers the points that lie more than 1.5 times the IQR below the first or above the third
quartile.
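Both statistical rules can be sketched on a small data set. The readings below are hypothetical, and the z-score cut-off of 3 is a common convention assumed here rather than a value given in the slides. Quartiles are computed with statistics.quantiles using its default method.

```python
from statistics import mean, pstdev, quantiles

# Hypothetical readings with one suspicious value; not from the slides
data = [10, 12, 11, 13, 12, 11, 10, 12, 11, 13, 12, 95]

# z-score rule: flag points more than 3 standard deviations from the mean
mu, sigma = mean(data), pstdev(data)
z_outliers = [x for x in data if abs(x - mu) / sigma > 3]

# IQR rule: flag points beyond 1.5 * IQR below Q1 or above Q3
q1, _, q3 = quantiles(data, n=4)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(z_outliers, iqr_outliers)
```

On very small samples the z-score rule can miss obvious outliers, because the outlier itself inflates the standard deviation; the IQR rule is less sensitive to this.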
Proximity Methods
● They assume that an object is an outlier if its nearest neighbors
are far away in feature space.
● That is, the proximity of the object to its neighbors significantly deviates from the
proximity of most of the other objects to their neighbors in the same data set.
● Proximity-based methods are classified into two types:
○ Distance-based methods judge a data point based on the distance(s) to its neighbors.
○ Density-based methods determine the degree of outlierness of each data instance based
on its local density.
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computationsit20ad004
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 

Recently uploaded (20)

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /WhatsappsBeautiful Sapna Vip  Call Girls Hauz Khas 9711199012 Call /Whatsapps
Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service BhilaiLow Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
Low Rate Call Girls Bhilai Anika 8250192130 Independent Escort Service Bhilai
 
Data Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health ClassificationData Science Project: Advancements in Fetal Health Classification
Data Science Project: Advancements in Fetal Health Classification
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
Full night 🥵 Call Girls Delhi New Friends Colony {9711199171} Sanya Reddy ✌️o...
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Data Warehouse , Data Cube Computation
Data Warehouse   , Data Cube ComputationData Warehouse   , Data Cube Computation
Data Warehouse , Data Cube Computation
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 

Statistical Analysis and Hypothesis Testing

  • 3. Introduction ● Data science is an interdisciplinary field that requires a strong understanding of mathematics, statistical reasoning, and computer science. ● Statistics is the science of collecting, analyzing, and interpreting data. ● The data is usually numerical and available in large quantities. ● Statistics serves as a foundation for dealing with data and its analysis in data science. ● It provides tools and methods to find structure in data and to give deeper insight into it. ● Data scientists use a combination of statistical formulae and computer algorithms to notice patterns and trends within data.
  • 4. Steps for processing data 1. Identify the important features in the data 2. Find relationships between features 3. Convert the features into the required format 4. Normalize and scale the data 5. Identify the distribution and nature of the data 6. Perform adjustments to the data 7. Identify the right mathematical approach 8. Verify the results using different accuracy measures
  • 5. Roles of statistics in Data Science ● Data Exploration ● Data Cleaning ● Data Transformation ● Data Visualization ● Finding Similarity/Dissimilarity ● Model Selection and Evaluation ● Hypothesis Testing ● Statistical Modeling ● Probability Distribution and Estimation
  • 6. Types of Statistics ● Descriptive statistics: measures of frequency, measures of central tendency, measures of dispersion ● Inferential statistics: parameter estimation, hypothesis testing
  • 7. Descriptive Statistics ● Provides ways of describing, presenting, summarizing, and organizing data. ● Descriptive statistics summarizes large amounts of data and presents them in a simple and understandable form. ● The summarization is done from a sample of the population using measures such as the mean, median, and standard deviation.
  • 8. Types of Descriptive Statistics ● Measures of frequency ● Measures of central tendency: mean, median, mode ● Measures of dispersion: range, interquartile range, standard deviation
  • 9. Measures of Frequency ● Frequency is a basic statistical quantity in data science. ● It is the number of times a value occurs in the data. ● In a dataset, it shows how often a particular value of a feature occurs. ● The frequency distribution can be tabulated as a frequency chart. ● Example: Twenty students were asked how many hours they worked per day. Their responses, in hours, are as follows: 5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3.
  • 10. Frequency table
      DATA VALUE   FREQUENCY
      2            3
      3            5
      4            3
      5            6
      6            2
      7            1
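The frequency table can be rebuilt programmatically. The sketch below is a minimal illustration that assumes the twenty responses are the single-digit hour values implied by the table; the variable names are chosen here for illustration only.

```python
from collections import Counter

# Hours worked per day reported by the twenty students
# (assumed single-digit values, consistent with the frequency table).
hours = [5, 6, 3, 3, 2, 4, 7, 5, 2, 3, 5, 6, 5, 4, 4, 3, 5, 2, 5, 3]

# Counter maps each distinct value to the number of times it occurs.
frequency = Counter(hours)

print("DATA VALUE  FREQUENCY")
for value in sorted(frequency):
    print(f"{value:<11} {frequency[value]}")
```

The counts add up to 20, the number of students, which is a quick sanity check on any frequency table.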
  • 11. Measures of Central Tendency ● An important goal of statistical analysis is to find one value that describes the characteristics of the entire set of data. ● This single value is referred to as a measure of central tendency: it describes a whole set of data with a single value that represents the center of its distribution. ● Measures of central tendency are also known as summary statistics and are used to represent the center point.
  • 12. Mean ● The most common and effective numeric measure of the center of a set of data. ● It is the sum of all the observations divided by the sample size. ● Types of mean: arithmetic mean, harmonic mean, geometric mean
  • 13. Arithmetic mean ● It is obtained by adding all the values and then dividing the sum by the total number of values. ● Let $x_1, x_2, \ldots, x_N$ be a set of N values or observations. The arithmetic mean of this set of values is: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$
  • 14. ● Suppose the marks obtained by 10 students in a quiz are 8, 3, 7, 6, 9, 10, 5, 7, 8, 5. ● We can calculate (8+3+7+6+9+10+5+7+8+5) / 10 = 6.8 ● The arithmetic mean can be calculated by using the mean() function from the NumPy library.
  • 15. Harmonic Mean ● The harmonic mean is the reciprocal of the average of the reciprocals of the terms in a series. The formula is $HM = \frac{n}{1/x_1 + 1/x_2 + \cdots + 1/x_n}$. ● Example: x = (6, 3, 1, 5, 2) ● HM = ?
  • 16. Geometric Mean ● A geometric mean is a mean or average which shows the central tendency of a set of numbers by using the product of their values.
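The three means can be compared side by side. The sketch below uses Python's standard `statistics` module rather than NumPy (an assumption made to keep the example dependency-free); it reuses the quiz marks and the harmonic-mean example values from the slides. The geometric mean is taken here as the n-th root of the product of the n values.

```python
import statistics

marks = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]   # quiz marks from the slides
x = [6, 3, 1, 5, 2]                       # harmonic-mean example values

# Arithmetic mean: sum of values divided by their count.
am = statistics.mean(marks)

# Harmonic mean: n divided by the sum of reciprocals.
hm = statistics.harmonic_mean(x)

# Geometric mean: n-th root of the product of the values.
gm = statistics.geometric_mean(x)

print(f"arithmetic mean of marks: {am}")
print(f"harmonic mean of x:       {hm:.4f}")
print(f"geometric mean of x:      {gm:.4f}")
```

For the example values, the sum of reciprocals is 1/6 + 1/3 + 1 + 1/5 + 1/2 = 2.2, so HM = 5 / 2.2 ≈ 2.27.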
  • 17. Median ● It is the middle value of the data. ● It is the value that separates the higher half of a data set from the lower half. ● It splits the data in half and is also called the 50th percentile. ● If the number of elements in the data set is odd, the middle element is the median. ● If the number of elements is even, the median is the average of the two central elements. ● Advantages ○ Less affected by outliers and skewed data than the mean ○ Appropriate for skewed data
  • 18. Mode ● It is the value that occurs most frequently in a dataset. ● Several different values can share the maximum frequency, which results in more than one mode. ● A dataset with one mode is called unimodal. ● A dataset with two modes is called bimodal. ● A dataset with three modes is called trimodal.
  • 19. ● Advantages ○ Can be used for categorical values ○ Can be determined for qualitative and quantitative values ○ Not affected by extreme values ● Disadvantages ○ Not based on all values ○ The mode cannot be clearly defined for a multimodal series ○ Not suitable for further statistical analysis and algebraic calculation
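As a small illustration of the median and mode, the sketch below applies the standard `statistics` module to the quiz marks used earlier for the arithmetic mean. The choice of this dataset is an assumption made for illustration; it happens to have three values tied for the highest frequency.

```python
import statistics

marks = [8, 3, 7, 6, 9, 10, 5, 7, 8, 5]

# Even number of elements: the median is the average of the two central values.
median = statistics.median(marks)

# multimode returns every value that shares the maximum frequency.
modes = statistics.multimode(marks)

print("sorted marks:", sorted(marks))
print("median:", median)
print("modes:", sorted(modes), "-> number of modes:", len(modes))
```

Because 5, 7, and 8 each occur twice, this dataset is trimodal, which shows why a single mode is not always well defined.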
  • 20. Measures of Dispersion ● Dispersion is the extent to which values in a distribution differ from the average of the distribution. ● Measures of central tendency alone are not sufficient to describe the data. ● Measures of dispersion help us to know the degree of variability in the data and provide a better understanding of it. ● They assess the dispersion or spread of numeric data. ● The measures are: ○ Range ○ Quantiles ○ Quartiles ○ Percentiles ○ Interquartile range
  • 21. Range ● It is the simplest measure of dispersion. Let $x_1, x_2, \ldots, x_n$ be a set of observations for some numeric attribute X. ● The range of the set is the difference between the largest (max()) and the smallest (min()) values. ● Range = max − min
  • 22. Standard Deviation ● It is a measure of how much the data values deviate from the mean value. ● $\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$ ● Exercise: find the SD for 4, 9, 11, 12, 17, 5, 8, 12, 14.
  • 23. Variance ● Variance measures how far a data set is spread out. It is mathematically defined as the average of the squared differences from the mean. ● Variance = (standard deviation)² = $\sigma^2$
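The standard deviation exercise can be checked with a few lines of code. This is a minimal sketch using the population formulas (division by n), matching the formula on the slide; `statistics.pvariance` and `statistics.pstdev` are the standard-library functions for that convention.

```python
import statistics

data = [4, 9, 11, 12, 17, 5, 8, 12, 14]

mean = statistics.mean(data)

# Population variance: average of the squared differences from the mean.
variance = statistics.pvariance(data)

# Population standard deviation: square root of the variance.
sd = statistics.pstdev(data)

print(f"mean     = {mean:.4f}")
print(f"variance = {variance:.4f}")
print(f"sd       = {sd:.4f}")
```

If the data were treated as a sample used to estimate a population value, `statistics.variance` and `statistics.stdev` (division by n − 1) would be the usual choice instead.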
  • 24. Interquartile Range ● The interquartile range is a measure of variation that describes how spread out the data is. ● It is a measure of variability based on splitting the data into quartiles. ● The interquartile range is the difference between the third and first quartiles: IQR = Q3 − Q1. ● Quartiles divide the ordered data into four equal parts, demarcated by the three quartiles Q1, Q2, Q3. ● Consider the following data: 2, 3, 4, 7, 10, 15, 22, 26, 27, 30, 32
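For the data on this slide, the quartiles and the interquartile range can be computed as sketched below. The example uses `statistics.quantiles` with its default "exclusive" method; this choice of quantile convention is an assumption, and other conventions (for instance NumPy's default linear interpolation) can give slightly different quartiles.

```python
import statistics

data = [2, 3, 4, 7, 10, 15, 22, 26, 27, 30, 32]

# n=4 splits the ordered data into four parts and returns Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1
value_range = max(data) - min(data)

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
print(f"IQR = {iqr}")
print(f"range = {value_range}")
```

With this convention Q2 coincides with the median of the data, and the IQR covers the middle half of the observations.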
  • 25. Inferential Statistics ● Inferential statistics draws inferences and predictions about a population based on data sampled from that population. ● The sample is considered representative of the entire universe or population. ● Statistical inference mainly deals with two kinds of problems: ○ Hypothesis testing ○ Estimation of parameter values
  • 26. Hypothesis testing ● Hypothesis testing is mainly used to determine whether there is sufficient evidence in a data sample to conclude that a particular condition holds for an entire population. ● There are two hypotheses: ○ Null hypothesis ○ Alternative hypothesis ● The null hypothesis states that there is no difference between groups or no relationship between variables. ● The alternative hypothesis states that there is a relationship between the two variables being studied (one variable has an effect on the other).
  • 27. Steps for Hypothesis Testing ● State the null and alternative hypotheses ● Select the appropriate significance level and check the specified test assumptions ● Analyze the data by computing the appropriate statistical tests
  • 28. Example of Hypothesis ● Suppose a biologist believes that a certain fertilizer will cause plants to grow more during a one-month period than they normally do, which is currently 20 inches. To test this, she applies the fertilizer to each of the plants in her laboratory for one month. ● She then performs a hypothesis test using the following hypotheses: ● H0: μ = 20 inches (the fertilizer will have no effect on the mean plant growth) ● HA: μ > 20 inches (the fertilizer will cause mean plant growth to increase)
  • 29. ● As another example, suppose a doctor believes that a new drug is able to reduce blood pressure in obese patients. To test this, he may measure the blood pressure of 40 patients before and after using the new drug for one month. ● He then performs a hypothesis test using the following hypotheses: ● H0: μ_after = μ_before (the mean blood pressure is the same before and after using the drug) ● HA: μ_after < μ_before (the mean blood pressure is lower after using the drug)
  • 30. Parametric hypothesis tests ● Information about the population is completely known and can be used for statistical inference. ● Steps for a parametric test: Step 1: State the null and alternative hypotheses. Step 2: Choose the level of significance. Step 3: Identify the type of parametric test to be conducted. Step 4: Find the critical value that separates the acceptance and rejection regions. Step 5: Using the sample, compute the obtained value of the test statistic. Step 6: Compare the obtained value with the critical value to decide whether the null hypothesis is rejected or retained.
  • 31. Terms related to parametric tests 1. Acceptance and critical regions: the set of all possible values of the test statistic can be divided into two mutually exclusive groups: ● Acceptance region: the set of values that appear to be consistent with the null hypothesis ● Rejection region: the values that are unlikely to occur if the null hypothesis is true
  • 32. One-Tailed and Two-Tailed Tests ● If the stated claim uses an equal sign, it is a two-tailed test. ● If the claim uses a greater-than or less-than sign, it is a one-tailed test. ● Case 1: A government school states that the dropout rate of female students between ages 12 and 18 is 28% (two-tailed). ● Case 2: A government school states that the dropout rate of female students between ages 12 and 18 is greater than 28% (one-tailed). ● Case 3: A government school states that the dropout rate of female students between ages 12 and 18 is less than 28% (one-tailed).
  • 33. Significance Level ● It is denoted by α. ● It is the probability of rejecting the null hypothesis even though it is true. ● For example, a significance level of 0.03 indicates a 3% risk of concluding that a difference exists when there is no difference. ● Typical values of the significance level are 0.01, 0.05, and 0.1.
  • 34. Calculated probability ● The calculated probability (p-value) is the probability, when the null hypothesis is true, that the statistical summary will be at least as extreme as the actually observed result. ● Examples of one-sample parametric tests: Z-test, T-test, Chi-square
  • 35. Types of Hypothesis Testing ● Parametric test ○ One sample: Z-test, T-test, Chi-square test ○ Two sample: independent samples (Z-test, two-group test), paired samples (paired test) ● Nonparametric test ○ One sample ○ Two sample
  • 36. Z-Test ● This test is used for comparing the mean of a sample to some hypothesized mean of a given population. ● The test statistic for a one-sample z-test is $z = \frac{\bar{X} - \mu_{H_0}}{\sigma_p / \sqrt{n}}$, where $\bar{X}$ is the sample mean, $\mu_{H_0}$ is the hypothesized population mean, $\sigma_p$ is the population standard deviation, and n is the sample size.
  • 37. Example ● A sample of 500 female students has a mean height of 5.4 feet. The task is to find whether it can reasonably be regarded as a sample from a large population with a mean height of 5.6 feet and a standard deviation of 1.45 feet. Use a 5% level of significance to solve the problem.
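The height example can be worked through numerically. The sketch below is one way to do it under the assumption of a two-tailed test (H0: μ = 5.6 against HA: μ ≠ 5.6), using `statistics.NormalDist` from the standard library for the critical value and p-value.

```python
import math
from statistics import NormalDist

n = 500             # sample size
sample_mean = 5.4   # observed mean height (feet)
mu0 = 5.6           # hypothesized population mean
sigma = 1.45        # population standard deviation
alpha = 0.05        # significance level

# z = (sample mean - hypothesized mean) / (sigma / sqrt(n))
standard_error = sigma / math.sqrt(n)
z = (sample_mean - mu0) / standard_error

# Two-tailed critical value and p-value from the standard normal distribution.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
p_value = 2 * (1 - NormalDist().cdf(abs(z)))

reject_h0 = abs(z) > z_crit
print(f"z = {z:.3f}, critical value = ±{z_crit:.3f}, p-value = {p_value:.4f}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

Since |z| ≈ 3.08 exceeds the critical value of about 1.96, the sample falls in the rejection region at the 5% level, so it is unlikely to come from a population with mean height 5.6 feet.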
  • 38. T-test ● The one-sample t-test is mainly used for determining whether the mean of a sample is statistically different from a known or hypothesized mean of a given population. ● The test variable needs to be continuous. ● $t = \frac{\bar{X} - \mu_{H_0}}{\sigma_s / \sqrt{n}}$, where $\sigma_s$ is the sample standard deviation.
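A one-sample t statistic can be computed by hand as sketched below. The growth measurements are hypothetical values invented for illustration, framed after the fertilizer example (H0: μ = 20, HA: μ > 20); the critical value 2.132 is the standard one-tailed table value for 4 degrees of freedom at α = 0.05.

```python
import math
import statistics

# Hypothetical plant growth measurements in inches (illustrative only).
growth = [21, 22, 20, 23, 24]
mu0 = 20  # hypothesized mean growth under H0

n = len(growth)
sample_mean = statistics.mean(growth)
sample_sd = statistics.stdev(growth)   # sample standard deviation (n - 1)

# t = (sample mean - hypothesized mean) / (sample sd / sqrt(n))
t = (sample_mean - mu0) / (sample_sd / math.sqrt(n))
dof = n - 1

# One-tailed critical value for alpha = 0.05 with 4 degrees of freedom,
# taken from a standard t table.
t_crit = 2.132

print(f"t = {t:.3f} with {dof} degrees of freedom")
print("reject H0" if t > t_crit else "fail to reject H0")
```

Unlike the z-test, the t-test estimates the standard deviation from the sample itself, which is why the critical value depends on the degrees of freedom.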
  • 39. Chi-Square test ● A chi-square test is a test of statistical significance for categorical variables. ● It is used to measure the difference between the observed and expected data. ● It is also used to test for association between categorical variables in the data.
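The comparison of observed and expected counts can be sketched as follows. The 2×2 contingency table is hypothetical, and the expected counts are computed under the assumption that the two categorical variables are independent; 3.841 is the standard chi-square table value for 1 degree of freedom at α = 0.05.

```python
# Hypothetical 2x2 contingency table of observed counts (illustrative only).
observed = [
    [20, 30],
    [30, 20],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / grand total.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(observed[0]) - 1)
chi2_crit = 3.841  # table value for alpha = 0.05, 1 degree of freedom

print(f"chi-square = {chi2:.3f}, degrees of freedom = {dof}")
print("reject H0 (variables associated)" if chi2 > chi2_crit
      else "fail to reject H0")
```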
  • 40. ANOVA ● Analysis of Variance (ANOVA) is an extension of the t-test. It is used to check whether the means of two or more groups are significantly different from each other.
  • 41. Two-sample parametric tests ● Independent samples z-test: this test is carried out on two normally distributed but independent populations to compare the means of the samples. The population variances of both samples are already known. The sample sizes should be larger than 30. ● S1 is the standard deviation of sample 1 and S2 is the standard deviation of sample 2.
  • 42. Independent sample t-test ● This test is carried out to test the statistical difference between 1. The means of two groups 2. The means of two interventions 3. The means of two change scores
  • 43. Paired Sample t-test ● It is carried out to compare two population means using two samples in which observations in one sample can be paired with observations in the other. ● This test is usually used for before-and-after observations on the same subjects.
  • 44. Nonparametric hypothesis tests ● Information about the population is unknown, and hence no assumption can be made regarding the population. ● They are more suitable for data represented on qualitative scales (nominal or ordinal). ● They cover techniques that do not rely on the data belonging to any particular distribution. ● The distribution of the data can be skewed, and the population variance can be non-homogeneous. ● One-sample nonparametric tests: one-factor chi-square, binomial, Wilcoxon signed-rank test ● Two independent samples: Mann-Whitney test, Kolmogorov-Smirnov test ● Two paired samples: sign test, chi-square, Wilcoxon signed-rank test
  • 45. Estimation of Parameter values ● In statistics, estimation or inference refers to the task of drawing conclusions about a population based on information from a sample of that population. ● This can be done in two ways: point estimate and interval estimate. ● Point estimation considers only a single value of a statistic. ● Because a point estimate is based on a single random sample, its value will vary when different random samples are drawn from the same population. ● A few of the standard point estimation methods are: ○ Maximum likelihood estimator ○ Minimum-variance mean-unbiased estimator ○ Minimum mean squared error ○ Best linear unbiased estimator
  • 46. Interval Estimate ● It considers two values between which the population parameter is likely to lie. The two values
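One common form of interval estimate is a confidence interval for a population mean. The sketch below is an illustration only: it assumes a 95% confidence level and a known population standard deviation, and reuses the numbers from the earlier height example (n = 500, sample mean 5.4 feet, σ = 1.45 feet).

```python
import math
from statistics import NormalDist

n = 500
sample_mean = 5.4   # point estimate of the population mean
sigma = 1.45        # population standard deviation, assumed known
confidence = 0.95   # assumed confidence level

# Margin of error: z critical value times the standard error of the mean.
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
margin = z_crit * sigma / math.sqrt(n)

lower = sample_mean - margin
upper = sample_mean + margin

print(f"point estimate:    {sample_mean}")
print(f"interval estimate: ({lower:.3f}, {upper:.3f})")
```

The point estimate is the single value 5.4, while the interval estimate gives a lower and an upper value around it. The hypothesized mean of 5.6 feet lies outside this interval, which agrees with rejecting it in a two-tailed test at the 5% level.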
  • 47. Measuring Data Similarity and Dissimilarity ● A similarity measure is a way of measuring how related or close data samples are to each other. ● A dissimilarity measure tells how distinct the data objects are. ● Similarity measures are expressed as numerical values. ● The value gets higher when the data samples are more alike (zero means low similarity and one means very similar). ● Data structures: ○ The data matrix ○ The dissimilarity matrix ● Object dissimilarity can be computed for objects described by nominal, binary, numerical, or ordinal attributes.
  • 48. Proximity measures for Nominal Attributes ● Nominal means relating to names. The values of a nominal attribute are symbols or names of things. ● Let M be the total number of states of a nominal attribute. The states can then be numbered from 1 to M. ● Let m be the number of attributes for which objects i and j are in the same state, and p the total number of attributes. The dissimilarity is $d(i,j) = \frac{p - m}{p}$ and the similarity is $s(i,j) = 1 - d(i,j)$.
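The nominal dissimilarity formula is simple to apply in code. The two objects below and their attribute values are hypothetical, chosen only to illustrate the counting of matches m out of p attributes.

```python
def nominal_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for two objects with nominal attributes."""
    p = len(obj_i)                                    # total number of attributes
    m = sum(a == b for a, b in zip(obj_i, obj_j))     # attributes in the same state
    return (p - m) / p


# Hypothetical objects described by (colour, shape, size).
obj_i = ("red", "circle", "small")
obj_j = ("red", "square", "small")

d = nominal_dissimilarity(obj_i, obj_j)
s = 1 - d

print(f"dissimilarity d(i, j) = {d:.3f}")
print(f"similarity    s(i, j) = {s:.3f}")
```

Here the objects match on two of three attributes, so d = 1/3 and s = 2/3.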
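  • 49. Proximity measures for Numeric data ● Euclidean distance: the distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ is $d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$. ● Manhattan distance: the distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ is $|x_1 - x_2| + |y_1 - y_2|$. ● Minkowski distance: for two points X and Y with coordinates $X_i$ and $Y_i$, the distance of order p is $\left( \sum_{i=1}^{n} |X_i - Y_i|^p \right)^{1/p}$.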
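The Minkowski distance generalizes the other two: order p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance. The function below is a minimal sketch; its name and the two example points are chosen here for illustration.

```python
def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length points."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


point1 = (1, 2, 3)
point2 = (1, 1, 1)

manhattan = minkowski_distance(point1, point2, p=1)   # |0| + |1| + |2|
euclidean = minkowski_distance(point1, point2, p=2)   # sqrt(0 + 1 + 4)

print(f"Manhattan (p=1): {manhattan}")
print(f"Euclidean (p=2): {euclidean:.4f}")
```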
  • 50. ● SET A 1. Write a Python program to find the maximum and minimum value of a given flattened array.
      import numpy as np

      ar = np.array([[0, 1], [2, 3]])
      # flatten() turns the 2-D array into a 1-D array
      flat = ar.flatten()
      print("Original Flattened Array")
      print(flat)
      print("-----------------")
      print("Maximum value of the above flattened array:")
      print(np.max(flat))
      print("Minimum value of the above flattened array:")
      print(np.min(flat))
  • 51. 2. Write a Python program to compute the Euclidean distance between two data points in a dataset. [Hint: use the linalg.norm function from NumPy]
      import numpy as np

      point1 = np.array((1, 2, 3))
      point2 = np.array((1, 1, 1))
      # Euclidean distance is the norm of the difference vector
      dist = np.linalg.norm(point1 - point2)
      print(dist)
  • 52. 3. Create one dataframe of data values. Find the mean, range, and IQR for this data.
      import pandas as pd

      df = pd.DataFrame(
          [[10, 20, 30, 40], [7, 14, 21, 28], [55, 15, 8, 12],
           [15, 14, 1, 8], [7, 1, 1, 8], [5, 4, 9, 2]],
          columns=["Apple", "Orange", "Banana", "Pear"],
          index=["Basket1", "Basket2", "Basket3", "Basket4", "Basket5", "Basket6"])
      print("\n----------- Calculate Mean -----------\n")
      print(df.mean())
      print("-----Maximum Value-------")
      a = df.max()
      print(a)
      print("-----Minimum Value-------")
      b = df.min()
      print(b)
      # Range per column: maximum minus minimum
      r = a - b
      print("-------Range-------")
      print(r)
      # IQR per column: third quartile minus first quartile
      iqr = df.quantile(0.75) - df.quantile(0.25)
      print("-------IQR-------")
      print(iqr)
  • 53. 4. Find the sum of the Manhattan distances between all pairs of the given points.
      def distancesum(x, y, n):
          total = 0
          # for each point, add its distance to every later point
          for i in range(n):
              for j in range(i + 1, n):
                  total += abs(x[i] - x[j]) + abs(y[i] - y[j])
          return total

      # Driver code
      x = [-1, 1, 3, 2]
      y = [5, 6, 5, 3]
      n = len(x)
      print(distancesum(x, y, n))
  • 54. 5. Write a NumPy program to compute the histogram of nums against the bins.
      import numpy as np
      import matplotlib.pyplot as plt

      nums = np.array([0.5, 0.7, 1.0, 1.2, 1.3, 2.1])
      bins = np.array([0, 1, 2, 3])
      print("nums: ", nums)
      print("bins: ", bins)
      # np.histogram returns the counts per bin and the bin edges
      print("Result:", np.histogram(nums, bins))
      plt.hist(nums, bins=bins)
      plt.show()
  • 55. 6. Create a dataframe for students' information such as name, graduation percentage, and age. Display the average age of students and the average graduation percentage. Also describe all basic statistics of the data. (Hint: use describe())
      import pandas as pd

      stud_data = {"name": ["Akanksha", "Diya", "Komal", "James", "Emily", "Jonas"],
                   "grade": [78, 69, 65, 90, 45, 89],
                   "age": [21, 23, 22, 19, 20, 18]}
      df = pd.DataFrame(stud_data)
      print(df)
      print("------average of graduation percentage-------")
      mean_grade = df["grade"].mean()
      print(mean_grade)
      print("------average age-------")
      mean_age = df["age"].mean()
      print(mean_age)
      print("------Describe basic statistics of data-------")
      print(df.describe())
  • 56. Concept of outlier ● An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. ● Outlier detection is the process of finding data objects whose behavior differs from expectation. ● Outliers can be caused by measurement or execution errors.
  • 57. There are eight main causes of outliers. ● Incorrect data entry by humans ● Codes used instead of values ● Sampling errors, or data extracted from the wrong place or mixed with other data ● Unexpected distribution of variables ● Measurement errors caused by the application or system ● Experimental errors in extracting the data, or planning errors ● Intentional dummy outliers inserted to test the detection methods ● Natural deviations in the data that are not actually errors but may indicate fraud or some other anomaly you are trying to detect
  • 59. Global Outlier ● Global outliers are also called point outliers. They are the simplest form of outlier. ● A data point that deviates from all the rest of the data points in a given data set is known as a global outlier. ● In most cases, outlier detection procedures are aimed at finding global outliers. The green data point is the global outlier.
  • 60. Contextual Outlier ● Contextual outliers are also known as conditional outliers. They occur when a data object deviates from the other data points because of a specific condition in a given data set. ● Data objects have two types of attributes: contextual attributes and behavioral attributes. ● Contextual outlier analysis enables users to examine outliers in different contexts and conditions, which can be useful in various applications. ● For example, a temperature reading of 45 degrees Celsius may be an outlier in the rainy season, yet a normal data point in the context of the summer season. In the given diagram, the green dot representing a low temperature value in June is a contextual outlier, since the same value in December is not an outlier.
  • 61. Collective Outlier ● Collective outliers are groups of data points that collectively deviate significantly from the overall distribution of a dataset. ● Collective outliers may not be outliers when considered individually, but as a group they exhibit unusual behavior. ● Detecting and interpreting collective outliers can be more complex than for individual outliers, as the focus is on group behavior rather than individual data points.
  • 62. Outlier detection methods ● Supervised ● Semi-supervised ● Unsupervised
  • 63. Supervised methods ● Supervised methods model data normality and abnormality. ● Domain experts examine and label a sample of the underlying data. ● Outlier detection can then be modeled as a classification problem. The task is to learn a classifier that can recognize outliers. ● The labeled sample can be used for training and testing. ● In some applications, the experts may label just the normal objects, and any other objects not matching the model of normal objects are reported as outliers.
  • 64. Unsupervised methods ● In some applications, objects labeled as "normal" or "outlier" are not available. ● Therefore, an unsupervised learning approach has to be used. ● Unsupervised outlier detection methods make an implicit assumption: the normal objects are somewhat "clustered." ● An unsupervised outlier detection method expects that normal objects follow a pattern far more frequently than outliers. ● Normal objects do not have to fall into one group sharing high similarity. Instead, they can form several groups, where each group has distinct features.
  • 65. Semi-Supervised Methods ● In several applications, although obtaining some labeled instances is possible, the number of such labeled instances is small. ● We may encounter cases where only a small set of the normal and outlier objects are labeled, while other data are unlabeled. ● Semi-supervised outlier detection methods were developed to tackle such scenarios. ● Semi-supervised outlier detection methods can be regarded as applications of semi-supervised learning approaches. For example, when some labeled normal objects are available, they can be used together with nearby unlabeled objects to train a model for normal objects. The model of normal objects is then used to identify outliers: objects that do not fit the model of normal objects are classified as outliers.
  • 66. Statistical Method ● These are also known as model-based methods. ● Simply starting with a visual analysis of univariate data using box plots, scatter plots, whisker plots, etc., can help in finding the extreme values in the data. ● Assuming a normal distribution, calculate the z-score, which is the number of standard deviations (σ) by which the data point lies from the sample's mean. ● Another way is to use the interquartile range (IQR) as a criterion and treat points lying more than 1.5 × IQR below the first quartile or above the third quartile as outliers.
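Both statistical criteria can be sketched in a few lines. The dataset and the z-score cut-off of 2.0 below are assumptions made for illustration (3.0 is a common cut-off for larger samples, but a single extreme point in a very small sample cannot reach it); the quartiles use the default method of `statistics.quantiles`.

```python
import statistics

# Hypothetical measurements with one suspicious value (illustrative only).
data = [10, 12, 11, 13, 12, 11, 10, 50]

# --- z-score criterion ---
mean = statistics.mean(data)
sd = statistics.pstdev(data)
z_threshold = 2.0  # assumed cut-off for this small sample
z_outliers = [x for x in data if abs((x - mean) / sd) > z_threshold]

# --- IQR criterion ---
q1, _, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
iqr_outliers = [x for x in data if x < low_fence or x > high_fence]

print("z-score outliers:", z_outliers)
print(f"IQR fences: ({low_fence}, {high_fence})")
print("IQR outliers:", iqr_outliers)
```

The IQR criterion is less sensitive to the outlier itself, because the extreme value inflates the mean and standard deviation used by the z-score but barely moves the quartiles.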
  • 67. Proximity Methods ● They assume that an object is an outlier if its nearest neighbors are far away in feature space. ● The proximity of the object to its neighbors deviates significantly from the proximity of most other objects to their neighbors in the same data set. ● Proximity-based methods are classified into two types: ○ Distance-based methods judge a data point based on the distance(s) to its neighbors. ○ Density-based methods determine the degree of outlierness of each data instance based on its local density.
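A distance-based method can be illustrated with a simple k-nearest-neighbor score. The sketch below is one possible scoring scheme, not a definitive algorithm: the points, the choice k = 2, and the use of the average neighbor distance as the score are all assumptions made for illustration.

```python
import math

# Hypothetical 2-D points: four close together and one far away.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8)]
k = 2  # number of nearest neighbors to consider (assumed)


def knn_score(point, others, k):
    """Average Euclidean distance from point to its k nearest neighbors."""
    distances = sorted(math.dist(point, other) for other in others)
    return sum(distances[:k]) / k


# Score every point against all the other points.
scores = {
    p: knn_score(p, [q for q in points if q != p], k)
    for p in points
}

for p, score in scores.items():
    print(f"{p}: {score:.3f}")

# The point whose neighbors are farthest away is the strongest outlier candidate.
outlier = max(scores, key=scores.get)
print("most outlying point:", outlier)
```

A density-based method would instead compare each point's local density with that of its neighbors, which helps when the normal data form clusters of different tightness.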