This presentation was intended for employees of Dubai Municipality. It covers how to use SPSS and other statistical data analysis tools, such as Excel and Minitab, and presents basic statistical concepts and definitions.
4. COURSE CONTENTS
• Introduction to Statistics
• Statistical Concepts
• Data Collection Methods
• Statistical Analysis
• Introduction to Statistical Software: SPSS and Minitab
• Data Preparation and Analysis using MS Excel, SPSS, & Minitab
• Statistical Report Writing
• Practical Examples and Applications
5. COURSE LEARNING OBJECTIVES
Professionals in every field should understand the importance of data and how to handle it, as well as the scientific methods used to extract the indicators needed for decision-making, so that they can help decision makers identify the various alternatives and evaluate them.
7. DEFINITION OF STATISTICS
Statistics is the science of collecting, organizing, presenting,
analyzing, and interpreting numerical data to assist in making
more effective decisions.
8. STAGES OF STATISTICS
"Statistics is a way to get information from data."
Collect → Organize (data preparation) → Present → Analyze (descriptive analysis) → Interpret (inferential analysis) → Make valid conclusions & take correct decisions
13. Variables
Variables carry descriptive information and are either qualitative or quantitative.
• Qualitative: sex, nationality, type of crime, occupation, religion, marital status, literacy, etc.
• Quantitative:
  - Discrete (counted): number of children in a family, number of accidents per week, students' marks, etc.
  - Continuous (measured): age, distance, height, weight, etc.
14. VARIABLES & UNITS
Selected characteristics of all the full-time employees of Mountain Aviation, Inc.: July 1, 1994

Employee | Race      | Sex    | Job Title      | Years of Service | Annual Salary
Abel     | Caucasian | Male   | Pilot          | 2                | $34,000
Cruz     | Caucasian | Male   | Chief mechanic | 10               | 60,000
Dunn     | Western   | Male   | Chief pilot    | 23               | 70,000
Hill     | Western   | Female | Secretary      | 5                | 14,000
King     | Caucasian | Male   | Janitor        | 8                | 17,000
Otis     | Caucasian | Male   | Grounds keeper | 10               | 20,000
West     | Western   | Male   | Mechanic       | 2                | 36,000
Wolf     | Caucasian | Female | Pilot          | 7                | 36,000
Zorn     | Caucasian | Female | Mechanic       | 7                | 40,000

In this table:
• each employee is an elementary unit, and the list of all nine employees is the frame;
• Race, Sex, and Job Title are qualitative variables; Years of Service and Annual Salary are quantitative variables;
• all observations of one variable (e.g., all employee sexes) form a population; the set of all employees is the population of employees;
• a single observation is a datum; any subset of a population (e.g., some of the employee salaries) is a sample.
15. QUANTITATIVE VARIABLES
Measurement levels, from weakest to strongest:
• Nominal: attributes are only named; no order
• Ordinal: attributes can be rank-ordered; distance between values is not meaningful
• Interval: distance is meaningful
• Ratio: has an absolute zero
Quantitative data are also either discrete or continuous.
20. WHAT IS DATA?
Data is a collection of facts, such as values or measurements. Examples of data include prices, exam scores, exports and imports, labor figures, and so on.
21. DATA VS. INFORMATION VS. STATISTICS

Data               | Information                              | Statistics
20 kg, 25 kg       | 5 individuals in the 20-to-25-kg range   | Mean weight = 22.5 kg
28 kg, 30 kg, etc. | 15 individuals in the 26-to-30-kg range  | Median weight = 28 kg
22. SOURCES OF DATA
• Internal: the organization's internal records
• External: published and unpublished external sources
24. OTHER DATA COLLECTION METHODS
• Interviews (face-to-face, telephone)
• Focus groups
• Ethnographies, oral history, & case studies
• Schedule through enumerators
• Documents & records
25. QUESTIONNAIRE VS. SCHEDULE

Questionnaire                                                      | Schedule
It is not confirmed whether the expected respondent filled in the answers | Respondent identity is known
Very slow                                                          | Information collected on time
No personal contact                                                | Direct personal contact
Used only when the respondent is educated & cooperative            | Information can be collected from illiterate people
Wider distribution of samples possible                             | Difficult for wider distribution
More incomplete and false information                              | Relatively more complete and correct information
Information validity depends on the quality of the questionnaire   | Information validity depends on the honesty & competence of the enumerator
Physical appearance should be attractive                           | Physical appearance not necessary
Observation cannot be used                                         | Observation can be used by the enumerator
26. FRAMING A QUESTIONNAIRE OR A
SCHEDULE
Cover Letter
Number of Questions
Nature of Questions
Questions should be simple
Arrangement of Questions
Information collected is usable
Avoid mathematical questions
27. TYPES OF QUESTIONS
• Closed-ended questions: Yes/No, multiple choice, scaled (e.g., Likert)
• Open-ended questions: no predefined options or categories
• Matrix questions: closed-ended, but arranged one under the other
• Contingency questions: answered only when the respondent provides a particular response
28. LIKERT SCALE
• 5-point scale: 1 = Not at all Satisfied, 2 = Not Satisfied, 3 = Neutral, 4 = Satisfied, 5 = Very Satisfied
• 7-point scale: 1 = Entirely Disagree, 2 = Mostly Disagree, 3 = Somewhat Disagree, 4 = Neither Agree nor Disagree, 5 = Somewhat Agree, 6 = Mostly Agree, 7 = Entirely Agree
• 9-point scale: endpoints labeled 1 = Disagree to 9 = Agree
33. DATA ANALYSIS
Univariate analysis: performed to study a single variable, using descriptive statistics:
• Tables & graphs
• Summary measures
• One-sample tests
• Normality tests
• Normal probability plot
Bivariate analysis: performed to study the relationship between two variables, using statistical analysis:
• Comparisons
• Contingency tables
• Scatter plot
• Correlation
• Regression
34. DESCRIPTIVE STATISTICS
Tables & graphs: frequency table, frequency histogram, bar & column chart, time-series line graph, pie chart, stem-&-leaf diagram, boxplot.
Summary measures: measures of location, measures of variability, measures of shape, proportion.
35. SUMMARY MEASURES
• Measures of central tendency (location): mean (average), median (middle value), mode (most frequent value)
• Measures of dispersion (spread): range (highest value - lowest value), variance, standard deviation, coefficient of variation
• Measures of shape: skewness, kurtosis
• Proportion: frequency of observations in a particular category as a fraction of all observations
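The summary measures above can all be computed directly. The course itself uses Excel, SPSS, and Minitab; the sketch below uses only Python's standard-library `statistics` module purely for illustration, on a hypothetical sample.

```python
import statistics

data = [12, 15, 15, 18, 20, 22, 25]  # hypothetical sample

mean = statistics.mean(data)          # average
median = statistics.median(data)      # middle value
mode = statistics.mode(data)          # most frequent value
data_range = max(data) - min(data)    # highest value - lowest value
variance = statistics.variance(data)  # sample variance
stdev = statistics.stdev(data)        # sample standard deviation
cv = stdev / mean                     # coefficient of variation

print(mean, median, mode, data_range)
```

Note that `variance` and `stdev` here are the sample (n-1) versions, which is also what SPSS reports by default.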
36. TABLES & GRAPHS
[Examples shown: frequency table, frequency histogram, time-series line graph, pie chart, stem-&-leaf diagram, boxplot]
43. DATA REPRESENTATION SUMMARY

Variable                  | Data type        | Graphically represented by                                                        | Central tendency   | Dispersion
Qualitative (Categorical) | Nominal          | Pie chart; column or bar chart                                                    | Mode               | Range
Qualitative (Categorical) | Ordinal          | Column or bar chart                                                               | Mode, Median       | Range
Quantitative (Scale)      | Interval & Ratio | Frequency histogram; boxplot; stem-&-leaf diagram; time-series line graph         | Mode, Median, Mean | Range, Variance, Standard deviation, Coefficient of variation

Effective representation of data depends on choosing the proper graphical tool and summary measures.
45. CENTRAL LIMIT THEOREM
The central limit theorem, in its shortest form, states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, regardless of the shape of the population distribution.
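The theorem is easy to see in a small simulation: draw repeated samples from a decidedly non-normal (uniform) population and watch the sample means concentrate around the population mean. An illustrative sketch only, using Python's standard library:

```python
import random
import statistics

random.seed(42)  # reproducible

population_mean = 0.5  # mean of Uniform(0, 1)
sample_size = 50
n_samples = 2000

# one sample mean per draw of `sample_size` uniform values
sample_means = [
    statistics.mean(random.random() for _ in range(sample_size))
    for _ in range(n_samples)
]

grand_mean = statistics.mean(sample_means)  # should be near 0.5
spread = statistics.stdev(sample_means)     # near sigma/sqrt(n) = 0.2887/7.07
print(grand_mean, spread)
```

The spread of the sample means shrinks like sigma divided by the square root of the sample size, which is why larger samples give more precise estimates.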
46. TEST OF NORMALITY
Since many of the most common statistical tests rely on the normality of a sample or population, it is often useful to test whether the underlying distribution is normal, or at least symmetric. This can be done via the following approaches:
• Review the distribution graphically (histograms & normal probability plots)
• Analyze the skewness & kurtosis
• Employ statistical tests (e.g., Chi-square)
48. DATA ANALYSIS
Univariate analysis: performed to study a single variable, using descriptive statistics:
• Tables & graphs
• Summary measures
• One-sample tests
• Normality tests
• Normal probability plot
Bivariate analysis: performed to study the relationship between two variables, using statistical analysis:
• Comparisons
• Contingency tables
• Scatter plot
• Correlation
• Regression
49. STATISTICAL TESTS
Independent samples, parametric tests:
• 1 sample: t-test
• 2 samples: t-test
• K samples: ANOVA
Independent samples, nonparametric tests:
• 1 sample: Binomial, Chi-square, Kolmogorov-Smirnov
• 2 samples: Mann-Whitney
• K samples: Kruskal-Wallis
Related samples, parametric tests:
• 2 samples: t-test
• K samples: repeated-measures ANOVA
Related samples, nonparametric tests:
• 2 samples: Wilcoxon
• K samples: Friedman
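As a taste of what sits behind one entry in the parametric branch, the 2-sample t statistic can be computed by hand. This sketch uses Welch's form, which does not assume equal variances; SPSS and Minitab report both the pooled and Welch versions, so this is only an illustration of the formula, not a replacement for the software.

```python
import math
import statistics

def welch_t(a, b):
    # t = (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

group_a = [1, 2, 3, 4]   # hypothetical independent samples
group_b = [2, 3, 4, 5]
t = welch_t(group_a, group_b)
print(t)
```

The statistical software then compares this statistic against a t distribution to produce the p-value.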
50. CONTINGENCY TABLES
A two-way table is a useful tool for examining relationships between categorical variables. The entries in the cells of a two-way table can be frequency counts or relative frequencies.
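The frequency counts in a two-way table can be tested for independence with a chi-square statistic. This minimal sketch computes it from first principles (expected count = row total × column total ÷ grand total) on hypothetical counts; in practice SPSS's Crosstabs procedure does this for you.

```python
def chi_square(table):
    # table: list of rows of observed frequency counts
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# hypothetical counts: rows = sex, columns = stated preference
table = [[20, 30],
         [30, 20]]
print(chi_square(table))  # 4.0
```

A statistic of 0 means the observed counts exactly match what independence predicts; larger values are stronger evidence of a relationship.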
51. SCATTER PLOT
A scatter (XY) plot has points that show the relationship between two sets of data. In this example, each dot shows one person's weight versus their height. The pattern of points can show a positive association, a negative association, or no association.
52. CORRELATION
When two sets of data are strongly linked together we say they have a high correlation.
• Correlation is positive when the values increase together.
• Correlation is negative when one value decreases as the other increases.
Correlation can take a value from -1 to 1:
• 1 is a perfect positive correlation
• 0 is no correlation (the values don't seem linked at all)
• -1 is a perfect negative correlation
The value shows how strong the correlation is (not how steep the line is), and whether it is positive or negative.
53. LINEAR REGRESSION
In a cause-and-effect relationship, the independent variable is the cause, and the dependent variable is the effect.
60. WHAT NEXT?
Data ready in Excel → Import data to SPSS or Minitab → Explore data using descriptive statistics → Perform hypothesis testing if needed → Write your report
62. HOW TO WRITE A RESEARCH REPORT?
• Cover page: title, date, researcher name, supervisor name
• Contents & executive summary
• Introduction: data source, data analysis technique, main research aim
• Data analysis: descriptive, research questions
• Conclusion
65. THANK YOU
Have a successful career and a wonderful life full of joy, productivity, & happiness.
Marwa Abo-Amra
Email: analyst.amra@gmail.com
Blog: analystamra.blogspot.com
Editor's Notes
Population vs. Sample
The study of statistics revolves around the study of data sets. This section describes two important types of data sets - populations and samples. A population includes each element from the set of observations that can be made. A sample consists only of observations drawn from the population.
Variable vs. Elementary Unit
In statistics, a variable has two defining characteristics:
A variable is an attribute that describes a person, place, thing, or idea (the elementary unit).
The value of the variable can "vary" from one entity to another.
Independent and dependent variables
Variables are properties or characteristics of some event, object, or person that can take on different values or amounts (as opposed to constants such as π that do not vary). When conducting research, experimenters often manipulate variables. For example, an experimenter might compare the effectiveness of four types of antidepressants. In this case, the variable is "type of antidepressant." When a variable is manipulated by an experimenter, it is called an independent variable. The experiment seeks to determine the effect of the independent variable on relief from depression. In this example, relief from depression is called a dependent variable. In general, the independent variable is manipulated by the experimenter and its effects on the dependent variable are measured.
Levels of an Independent Variable
If an experiment compares an experimental treatment with a control treatment, then the independent variable (type of treatment) has two levels: experimental and control. If an experiment were comparing five types of diets, then the independent variable (type of diet) would have 5 levels. In general, the number of levels of an independent variable is the number of experimental conditions.
Qualitative and Quantitative Variables
An important distinction between variables is between qualitative variables and quantitative variables. Qualitative variables are those that express a qualitative attribute such as hair color, eye color, religion, favorite movie, gender, and so on. The values of a qualitative variable do not imply a numerical ordering. Values of the variable "religion" differ qualitatively; no ordering of religions is implied. Qualitative variables are sometimes referred to as categorical variables. Quantitative variables are those variables that are measured in terms of numbers. Some examples of quantitative variables are height, weight, and shoe size.
Discrete and Continuous Variables
Variables such as number of children in a household are called discrete variables since the possible scores are discrete points on the scale. For example, a household could have three children or six children, but not 4.53 children. Other variables such as "time to respond to a question" are continuous variables since the scale is continuous and not made up of discrete steps. The response time could be 1.64 seconds, or it could be 1.64237123922121 seconds. Of course, the practicalities of measurement preclude most measured variables from being truly continuous.
This table contains a statistical frame and the multivariate data set derived from it. The table illustrates the meaning of a number of basic statistical concepts. Thus, the first column from the left lists 9 elementary units that jointly constitute the frame (shaded). The headings of the second to the sixth column show characteristics of the elementary units that are called variables and that can be qualitative (race, sex, job title) or quantitative (years of service, annual salary). All possible observations about a given variable constitute a statistical population; the shaded entries in the third and the sixth column from the left are two examples of populations; any single observation is a datum; any subset of a population or of the frame is a sample.
Types of Scales
Before we can conduct a statistical analysis, we need to measure our dependent variable. Exactly how the measurement is carried out depends on the type of variable involved in the analysis. Different types are measured differently. To measure the time taken to respond to a stimulus, you might use a stop watch. Stop watches are of no use, of course, when it comes to measuring someone's attitude towards a political candidate. A rating scale is more appropriate in this case (with labels like "very favorable," "somewhat favorable," etc.). For a dependent variable such as "favorite color," you can simply note the color-word (like "red") that the subject offers.
Although procedures for measurement differ in many ways, they can be classified using a few fundamental categories. In a given category, all of the procedures share some properties that are important for you to know about. The categories are called "scale types," or just "scales," and are described in this section.
Nominal scales
When measuring using a nominal scale, one simply names or categorizes responses. Gender, handedness, favorite color, and religion are examples of variables measured on a nominal scale. The essential point about nominal scales is that they do not imply any ordering among the responses. For example, when classifying people according to their favorite color, there is no sense in which green is placed "ahead of" blue. Responses are merely categorized. Nominal scales embody the lowest level of measurement.
Ordinal scales
A researcher wishing to measure consumers' satisfaction with their microwave ovens might ask them to specify their feelings as either "very dissatisfied," "somewhat dissatisfied," "somewhat satisfied," or "very satisfied." The items in this scale are ordered, ranging from least to most satisfied. This is what distinguishes ordinal from nominal scales. Unlike nominal scales, ordinal scales allow comparisons of the degree to which two subjects possess the dependent variable. For example, our satisfaction ordering makes it meaningful to assert that one person is more satisfied than another with their microwave ovens. Such an assertion reflects the first person's use of a verbal label that comes later in the list than the label chosen by the second person.
On the other hand, ordinal scales fail to capture important information that will be present in the other scales we examine. In particular, the difference between two levels of an ordinal scale cannot be assumed to be the same as the difference between two other levels. In our satisfaction scale, for example, the difference between the responses "very dissatisfied" and "somewhat dissatisfied" is probably not equivalent to the difference between "somewhat dissatisfied" and "somewhat satisfied." Nothing in our measurement procedure allows us to determine whether the two differences reflect the same difference in psychological satisfaction. Statisticians express this point by saying that the differences between adjacent scale values do not necessarily represent equal intervals on the underlying scale giving rise to the measurements. (In our case, the underlying scale is the true feeling of satisfaction, which we are trying to measure.)
What if the researcher had measured satisfaction by asking consumers to indicate their level of satisfaction by choosing a number from one to four? Would the difference between the responses of one and two necessarily reflect the same difference in satisfaction as the difference between the responses two and three? The answer is No. Changing the response format to numbers does not change the meaning of the scale. We still are in no position to assert that the mental step from 1 to 2 (for example) is the same as the mental step from 3 to 4.
Interval scales
Interval scales are numerical scales in which intervals have the same interpretation throughout. As an example, consider the Fahrenheit scale of temperature. The difference between 30 degrees and 40 degrees represents the same temperature difference as the difference between 80 degrees and 90 degrees. This is because each 10-degree interval has the same physical meaning (in terms of the kinetic energy of molecules).
Interval scales are not perfect, however. In particular, they do not have a true zero point even if one of the scaled values happens to carry the name "zero." The Fahrenheit scale illustrates the issue. Zero degrees Fahrenheit does not represent the complete absence of temperature (the absence of any molecular kinetic energy). In reality, the label "zero" is applied to its temperature for quite accidental reasons connected to the history of temperature measurement. Since an interval scale has no true zero point, it does not make sense to compute ratios of temperatures. For example, there is no sense in which the ratio of 40 to 20 degrees Fahrenheit is the same as the ratio of 100 to 50 degrees; no interesting physical property is preserved across the two ratios. After all, if the "zero" label were applied at the temperature that Fahrenheit happens to label as 10 degrees, the two ratios would instead be 30 to 10 and 90 to 40, no longer the same! For this reason, it does not make sense to say that 80 degrees is "twice as hot" as 40 degrees. Such a claim would depend on an arbitrary decision about where to "start" the temperature scale, namely, what temperature to call zero (whereas the claim is intended to make a more fundamental assertion about the underlying physical reality).
Ratio scales
The ratio scale of measurement is the most informative scale. It is an interval scale with the additional property that its zero position indicates the absence of the quantity being measured. You can think of a ratio scale as the three earlier scales rolled up in one. Like a nominal scale, it provides a name or category for each object (the numbers serve as labels). Like an ordinal scale, the objects are ordered (in terms of the ordering of the numbers). Like an interval scale, the same difference at two places on the scale has the same meaning. And in addition, the same ratio at two places on the scale also carries the same meaning.
The Fahrenheit scale for temperature has an arbitrary zero point and is therefore not a ratio scale. However, zero on the Kelvin scale is absolute zero. This makes the Kelvin scale a ratio scale. For example, if one temperature is twice as high as another as measured on the Kelvin scale, then it has twice the kinetic energy of the other temperature.
Another example of a ratio scale is the amount of money you have in your pocket right now (25 cents, 55 cents, etc.). Money is measured on a ratio scale because, in addition to having the properties of an interval scale, it has a true zero point: if you have zero money, this implies the absence of money. Since money has a true zero point, it makes sense to say that someone with 50 cents has twice as much money as someone with 25 cents (or that Bill Gates has a million times more money than you do).
When we think of the term "population," we usually think of people in our town, region, state or country and their respective characteristics such as gender, age, marital status, ethnic membership, religion and so forth. In statistics the term "population" takes on a slightly different meaning. The "population" in statistics includes all members of a defined group that we are studying or collecting information on for data driven decisions.
A part of the population is called a sample. It is a proportion of the population, a slice of it, a part of it and all its characteristics. A sample is a scientifically drawn group that actually possesses the same characteristics as the population â if it is drawn randomly.
A measurable characteristic of a population is called a parameter; but a measurable characteristic of a sample is called a statistic.
Sampling is the process by which inference is made to the whole by examining a part.
Purpose of sampling
To provide various types of statistical information of a qualitative or quantitative nature about the whole by examining a few selected units.
It is cheaper than census method.
What is Simple Random Sampling?
Simple random sampling refers to a sampling method that has the following properties.
The population consists of N objects.
The sample consists of n objects.
All possible samples of n objects are equally likely to occur.
An important benefit of simple random sampling is that it allows researchers to use statistical methods to analyze sample results. For example, given a simple random sample, researchers can use statistical methods to define a confidence interval around a sample mean. Statistical analysis is not appropriate when non-random sampling methods are used.
There are many ways to obtain a simple random sample. One way would be the lottery method. Each of the N population members is assigned a unique number. The numbers are placed in a bowl and thoroughly mixed. Then, a blind-folded researcher selects n numbers. Population members having the selected numbers are included in the sample.
Before one can present and interpret information, there has to be a process of gathering and sorting data. Just as trees are the raw material from which paper is produced, so too, can data be viewed as the raw material from which information is obtained.
Once data have been collected and processed, they are ready to be organized into information. Indeed, it is hard to imagine reasons for collecting data other than to provide information. This information leads to knowledge about issues, and helps individuals and groups make informed decisions.
In practice, informed decision-making can save countries millions of dollars (for example, through accurate targeting of government spending). It can also lead to life saving breakthroughs in medicine, and can help conserve the earth's natural environment.
Information is data that have been recorded, classified, organized, related, or interpreted within a framework so that meaning emerges.
Statistics represent a common method of presenting information. In general, statistics relate to numerical data, and can refer to the science of dealing with the numerical data itself. Above all, statistics aim to provide useful information by means of numbers.
Therefore, a good definition of statistics is "a type of information obtained through mathematical operations on numerical data".
Collection of data is the first step in any statistical investigation of a phenomenon.
Sources of Data
Data are generally classified into the following two groups:
1. Internal Data
Internal data come from internal sources related with the functioning of an organization or firm where records regarding purchase, production, sales, profits etc. are kept on a regular basis. Various Government departments, like Railways, Communications, Education etc. also generate internal data which are useful for their proper internal functioning. However, the internal data can be either insufficient or inappropriate for the statistical inquiry into a phenomenon. In that situation we need external data.
2. External Data
The External data are collected and published by external agencies. This type of data can be obtained from primary source or secondary source. Thus, the external data can further be classified as: Primary and Secondary Data.
Primary data are original and firsthand information. Data are termed primary when the reference is to data collected for the first time by the investigator. For example, the Meteorological department regularly collects data on various aspects of the weather and climate such as amount of rainfall, humidity, minimum and maximum temperature of a certain place. These constitute primary data. Similarly, the data in a population census obtained by the office of the Registrar General and Census Commissioner are primary in nature. On the other hand, data are termed secondary when collected from records or data already available. In other words, secondary data are data which have already been collected by a source other than the present investigator. For example, population census data are primary for the office of the Registrar General and Census Commissioner, whereas for other organizations or individuals who use such data, they are secondary. Thus, data which are primary in one hand become secondary in the hands of others.
To derive conclusions from data, we need to know how the data were collected; that is, we need to know the method(s) of data collection.
Methods of Data Collection
There are four main methods of data collection.
1. Census. A census is a study that obtains data from every member of a population. In most studies, a census is not practical, because of the cost and/or time required.
2. Sample survey. A sample survey is a study that obtains data from a subset of a population, in order to estimate population attributes.
3. Experiment. An experiment is a controlled study in which the researcher attempts to understand cause-and-effect relationships. The study is "controlled" in the sense that the researcher controls (1) how subjects are assigned to groups and (2) which treatments each group receives.
In the analysis phase, the researcher compares group scores on some dependent variable. Based on the analysis, the researcher draws a conclusion about whether the treatment ( independent variable) had a causal effect on the dependent variable.
4. Observational study. Like experiments, observational studies attempt to understand cause-and-effect relationships. However, unlike experiments, the researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives.
Data Collection Methods: Pros and Cons
Each method of data collection has advantages and disadvantages.
1. Resources. When the population is large, a sample survey has a big resource advantage over a census. A well-designed sample survey can provide very precise estimates of population parameters - quicker, cheaper, and with less manpower than a census.
2. Generalizability. Generalizability refers to the appropriateness of applying findings from a study to a larger population. Generalizability requires random selection. If participants in a study are randomly selected from a larger population, it is appropriate to generalize study results to the larger population; if not, it is not appropriate to generalize.
Observational studies do not feature random selection; so generalizing from the results of an observational study to a larger population can be a problem.
3. Causal inference. Cause-and-effect relationships can be teased out when subjects are randomly assigned to groups. Therefore, experiments, which allow the researcher to control assignment of subjects to treatment groups, are the best method for investigating causal relationships.
Interviews
Interviews can be conducted face-to-face or by telephone. They can range from in-depth, semi-structured to unstructured depending on the information being sought.
Face to face interviews are advantageous since:
detailed questions can be asked
further probing can be done to provide rich data
literacy requirements of participants are not an issue
non verbal data can be collected through observation
complex and unknown issues can be explored
response rates are usually higher than for self-administered questionnaires.
Disadvantages of face to face interviews include:
they can be expensive and time consuming
training of interviewers is necessary to reduce interviewer bias and to ensure the interviews are administered in a standardized way
they are prone to interviewer bias and interpreter bias (if interpreters are used)
sensitive issues may be challenging.
Telephone interviews, according to Bowling, yield data just as accurate as face-to-face interviews.
Telephone interviews are advantageous as they:
are cheaper and faster than face to face interviews to conduct
use less resources than face to face interviews
allow the interviewer to clarify questions
do not require literacy skills.
Disadvantages of telephone interviews include:
having to make repeated calls as calls may not be answered the first time
potential bias towards those who are at home, if call-backs are not made
only suitable for short surveys
only accessible to the population with a telephone
not appropriate for exploring sensitive issues.
Focus groups
Focus groups or group discussions are useful to further explore a topic, providing a broader understanding of why the target group may behave or think in a particular way, and assist in determining the reason for attitudes and beliefs. They are conducted with a small sample of the target group and are used to stimulate discussion and gain greater insights.
Focus groups and group discussions are advantageous as they:
are useful when exploring cultural values and health beliefs
can be used to examine how and why people think in a particular way and how it influences their beliefs and values
can be used to explore complex issues
can be used to develop hypotheses for further research
do not require participants to be literate.
Disadvantages of focus groups include:
lack of privacy/anonymity
having to carefully balance the group to ensure they are culturally and gender appropriate (i.e. gender may be an issue)
potential for the risk of "group think" (not allowing for other attitudes, beliefs etc.)
potential for group to be dominated by one or two people
group leader needs to be skilled at conducting focus groups, dealing with conflict, drawing out passive participants and creating a relaxed, welcoming environment
are time consuming to conduct and can be difficult and time consuming to analyze.
Ethnographies, Oral History, & Case Studies
Involves studying a single phenomenon
Examines people in their natural settings
Uses a combination of techniques such as observation, interviews, and surveys
Ethnography is a more holistic approach to evaluation
Researcher can become a confounding variable
Schedule through enumerators
Initially let us make a distinction between a questionnaire and a schedule. A questionnaire is a set of questions the answers to which are recorded by the informants themselves, whereas in a schedule the answers are recorded by an investigator or an enumerator on the informant's behalf.
In this method the investigators or enumerators approach the informants with a prepared questionnaire and record the replies to the questions. This method is generally used in censuses and large-scale surveys. In the case of a census, investigators visit every member of the population in their assigned zones, while in the case of a sample survey they collect information only from those members who have been selected in the sample.
Documents & Records
This method consists of examining existing data in the form of databases, meeting minutes, reports, attendance logs, financial records, newsletters, etc. This can be an inexpensive way to gather information, but may be an incomplete data source.
Substantial description and documentation, often referred to as "thick description", can be used to further explore a subject. This process provides a thorough description of the "study participants, context and procedures, the purpose of the intervention and its transferability". Thick description also includes the complexities experienced in addition to the commonalities found, which assists in maintaining data integrity.
The use of documentation provides an ongoing record of activities. This can be records of informal feedback and reflections through journals, diaries or progress reports. The challenge of documentation is that it requires an ongoing commitment to regularly document thoughts and activities throughout the evaluation process.
Great care and caution have to be exercised in drafting a questionnaire or a schedule, as this is the basis for collecting information in an investigation. Apart from care and caution, a good deal of expertise and experience of the phenomenon under investigation is required in its preparation. Though there are no hard and fast rules for drafting a questionnaire, the following points should still be given due consideration. These points are:
1. Covering letter: in this letter the investigator should introduce himself and make the objectives of the survey clear to the informant. The informant should also be assured that the information he provides will be kept confidential.
2. Number of questions: the number of questions in the questionnaire should be as small as possible. This saves time and is convenient to both the enumerator and the respondent.
3. Nature of questions: delicate questions should be put with great care. Indirect questions should often be framed to obtain the answers to such questions.
4. The questions should be simple: the questions should be clear, concise, short-answer, and unambiguous. They should be related to the phenomenon under investigation.
5. Arrangement of questions: there should be a natural and logical order to the questions in a questionnaire. For example, it is not logical to ask a man about his income before asking him about his occupation.
6. Information collected is usable: make sure that the information collected through the questions will actually be usable in the analysis.
7. Avoid mathematical questions: as far as possible, questions involving mathematical calculations should be avoided. It is better to use multiple-choice questions (with four or five alternatives) or simple alternative questions (Yes/No type).
8. Attractive layout of the questionnaire: the look of the questionnaire should be attractive and the questions should be suitably spaced for proper answering.
1. Contingency questions - A question that is answered only if the respondent gives a particular response to a previous question. This avoids asking questions of people that do not apply to them (for example, asking men if they have ever been pregnant).
2. Matrix questions - Identical response categories are assigned to multiple questions. The questions are placed one under the other, forming a matrix with response categories along the top and a list of questions down the side. This is an efficient use of page space and respondents' time.
3. Closed ended questions - Respondents' answers are limited to a fixed set of responses. Most scales are closed ended. Other types of closed ended questions include:
Yes/no questions - The respondent answers with a "yes" or a "no".
Multiple choice - The respondent has several options from which to choose.
Scaled questions - Responses are graded on a continuum (example: rate the appearance of the product on a scale from 1 to 10, with 10 being the most preferred appearance). Examples of types of scales include the Likert scale, the semantic differential scale, and the rank-order scale.
4. Open ended questions - No options or predefined categories are suggested. The respondent supplies their own answer without being constrained by a fixed set of possible responses. Examples of types of open ended questions include:
Completely unstructured questions- openly ask the opinion or view of the respondent
Word association questions - the participant states the first word that comes to mind as a series of words is presented
Thematic Apperception Test - a picture is presented to the respondent, which he explains from his own point of view
Sentence, story or picture completion - the respondent continues an incomplete sentence or story, or writes in empty conversation balloons in a picture
A Likert scale is a psychometric scale commonly involved in research that employs questionnaires. Likert-type or frequency scales use fixed choice response formats and are designed to measure attitudes or opinions. These ordinal scales measure levels of agreement/disagreement. A Likert-type scale assumes that the strength/intensity of experience is linear, i.e. on a continuum from strongly agree to strongly disagree, and makes the assumption that attitudes can be measured. Respondents may be offered a choice of five to seven or even nine pre-coded responses with the neutral point being neither agree nor disagree. In its final form, the Likert scale is a five (or seven) point scale which is used to allow the individual to express how much they agree or disagree with a particular statement.
Scoring & Analysis
The Likert scale is strictly an ordinal scale, but if well presented it may nevertheless approximate an interval-level measurement. This can be beneficial since, if it were treated purely as an ordinal scale, some valuable information could be lost if the "distance" between Likert items were not available for consideration. The important idea here is that the appropriate type of analysis depends on how the Likert scale has been presented.
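As a hedged sketch (the response wording and the 5-point coding below are assumptions, not from any particular survey), Likert responses can be coded numerically and then summarized:

```python
from statistics import mean, median

# Assumed 5-point coding for a Likert item
# (strongly disagree = 1 ... strongly agree = 5)
CODES = {
    "strongly disagree": 1,
    "disagree": 2,
    "neither agree nor disagree": 3,
    "agree": 4,
    "strongly agree": 5,
}

def score_responses(responses):
    """Map raw Likert responses to their numeric codes."""
    return [CODES[r] for r in responses]

responses = ["agree", "agree", "neither agree nor disagree",
             "strongly agree", "disagree"]
scores = score_responses(responses)

# The median respects the ordinal nature of the scale;
# the mean implicitly treats it as interval-level data.
print(median(scores), mean(scores))
```

Whether reporting the median or the mean is appropriate depends, as the text notes, on whether the scale is treated as ordinal or as approximately interval.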
Semantic differential
The semantic differential is a scale used for measuring the meaning of things and concepts. There are two aspects of meaning: denotative and connotative. The semantic differential measures connotative meaning.
Rank-order Scale
Rank order scaling questions allow a certain set of brands or products to be ranked based upon a specific attribute or characteristic.
When analyzing data, both descriptive and inferential statistics are used to analyze results and draw conclusions. So what are descriptive and inferential statistics? And what are their differences?
Descriptive Statistics
Descriptive statistics are numbers that are used to summarize and describe data. If we are analyzing birth certificates, for example, a descriptive statistic might be the percentage of certificates issued in Dubai, or the average age of the mother. Any other number we choose to compute also counts as a descriptive statistic for the data from which the statistic is computed. Several descriptive statistics are often used at one time to give a full picture of the data.
Descriptive statistics are just descriptive. They do not involve generalizing beyond the data at hand. Generalizing from our data to another set of cases is the business of inferential statistics.
Descriptive statistics are useful and serviceable if you do not need to extend your results to any larger group. However, much of social sciences tend to include studies that give us âuniversalâ truths about segments of the population, such as all parents, all women, all victims, etc.
Inferential Statistics
Inferential statistics is concerned with making predictions or inferences about a population from observations and analyses of a sample. That is, we can take the results of an analysis using a sample and can generalize it to the larger population that the sample represents. In order to do this, however, it is imperative that the sample is representative of the group to which it is being generalized.
There are two ways of presenting data:
1. Tables and Graphs
Frequency table consists of absolute, relative, and cumulative frequency distributions.
Frequency histogram is a graphical portrayal of an absolute or relative frequency distribution for continuous quantitative data in such a way that absolute or relative class frequencies are represented by rectangular areas in the graph.
Bar & Column charts: a series of horizontal or vertical bars, the lengths of which are proportional to the values to be depicted.
Time-series Line graphs: the graphical portrayal, by a continuous line, of data that are linked with time.
Pie charts: a portrayal of divisions of some aggregate by a segmented circle in such a way that the sector areas are proportional to the sizes of the divisions in question.
Stem-and-leaf diagrams: an unusual type of diagram that combines the features of an ordered array of numbers and a frequency histogram.
Box-and-whisker diagram (Boxplot): a type of graph used to display patterns of quantitative data.
2. Summary Measures
Summary measures of central tendency (or location) are values around which observations tend to cluster and that describe the location of what in some sense might be called the âcenterâ of a data set.
Summary measures of dispersion (or variability) are numbers that indicate the spread or scatter of observations; they show the extent to which individual values in a data set differ from one another and, hence, differ from their central location.
Summary measures of shape are numbers that indicate either the degree of asymmetry or the degree of peakedness in a frequency distribution.
Measures of Central Tendency
The Mean
The arithmetic mean is the most common measure of central tendency. It is simply the sum of the numbers divided by the number of numbers. The symbol "μ" is used for the mean of a population. The symbol "x̄" is used for the mean of a sample. The formula for μ is shown below:
μ = ΣX / N
where ΣX is the sum of all the numbers in the population and N is the number of numbers in the population.
The formula for x̄ is essentially identical:
x̄ = Σx / n
where Σx is the sum of all the numbers in the sample and n is the number of numbers in the sample.
As an example, the mean of the numbers 1, 2, 3, 6, 8 is 20/5 = 4 regardless of whether the numbers constitute the entire population or just a sample from the population.
The Median
The median is also a frequently used measure of central tendency. The median is the midpoint of a distribution: the same number of scores is above the median as below it.
Computation of the Median: When there is an odd number of numbers, the median is simply the middle number. For example, the median of 2, 4, and 7 is 4. When there is an even number of numbers, the median is the mean of the two middle numbers. Thus, the median of the numbers 2, 4, 7, 12 is (4+7)/2 = 5.5.
The Mode
The mode is the most frequently occurring value.
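As an illustrative sketch, Python's standard statistics module (one tool among the Excel, SPSS, and Minitab options the course covers) computes all three measures directly; the numbers reuse the examples from the text:

```python
from statistics import mean, median, mode

data = [1, 2, 3, 6, 8]

# Mean: sum of the numbers divided by the number of numbers (20/5 = 4)
print(mean(data))

# Median: middle value of the ordered data (odd count)
print(median(data))

# Median with an even count: mean of the two middle numbers, (4 + 7)/2
print(median([2, 4, 7, 12]))

# Mode: the most frequently occurring value
print(mode([2, 2, 3, 5]))
```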
Measures of Variability
Variability refers to how "spread out" a group of scores is. There are four frequently used measures of variability: the range, the interquartile range, the variance, and the standard deviation.
The Range
The range is the simplest measure of variability to calculate, and one you have probably encountered many times in your life. The range is simply the highest score minus the lowest score.
The Interquartile Range: The interquartile range (IQR) is the range of the middle 50% of the scores in a distribution. It is computed as follows:
IQR = 75th percentile - 25th percentile
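As a sketch, the IQR can be computed with Python's statistics.quantiles; the dataset is invented, and the "inclusive" method matches the common linear-interpolation percentile definition (other conventions give slightly different cut points):

```python
from statistics import quantiles

data = list(range(1, 11))  # the scores 1 through 10

# n=4 requests quartiles; "inclusive" interpolates between data points
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # IQR = 75th percentile - 25th percentile
print(q1, q3, iqr)
```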
The Variance
Variability can also be defined in terms of how close the scores in the distribution are to the middle of the distribution. Using the mean as the measure of the middle of the distribution, the variance is defined as the average squared difference of the scores from the mean.
The formula for the variance is:
σ² = Σ(x - μ)² / N
where σ² is the variance, μ is the mean, and N is the number of observations.
If the variance in a sample is used to estimate the variance in a population, then the previous formula underestimates the variance and the following formula should be used:
s² = Σ(x - x̄)² / (n - 1)
where s² is the estimate of the variance and x̄ is the sample mean. Note that x̄ is the mean of a sample taken from a population with a mean of μ. Since, in practice, the variance is usually computed for a sample, this formula is most often used.
Let's take a concrete example. Assume the scores 1, 2, 4, and 5 were sampled from a larger population. To estimate the variance in the population you would compute s² as follows:
x̄ = (1 + 2 + 4 + 5)/4 = 12/4 = 3
s² = [(1 - 3)² + (2 - 3)² + (4 - 3)² + (5 - 3)²]/(4 - 1) = (4 + 1 + 1 + 4)/3 = 10/3 ≈ 3.333
The Standard Deviation
The standard deviation is simply the square root of the variance.
The Coefficient of Variation: an indicator of relative dispersion. It is calculated as the ratio of the standard deviation to the mean. It is usually expressed as a percentage and can be used to compare two or more sets of data measured in different units.
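The worked example above (scores 1, 2, 4, 5) can be checked with Python's standard statistics module, which distinguishes the population formula (divide by N) from the sample formula (divide by n - 1); the coefficient of variation is added as a final step:

```python
from statistics import mean, pvariance, pstdev, variance, stdev

scores = [1, 2, 4, 5]

print(pvariance(scores))  # population formula, divide by N: 10/4 = 2.5
print(variance(scores))   # sample formula, divide by n - 1: 10/3 ≈ 3.333
print(stdev(scores))      # standard deviation = square root of the variance

# Coefficient of variation: standard deviation relative to the mean,
# expressed as a percentage
cv = stdev(scores) / mean(scores) * 100
print(round(cv, 1))
```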
Measures of Shape
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.
Proportion
A number that describes the frequency of observations in a particular category as a fraction of all observations made.
After collecting data, the most important task is the effective presentation of data. This task is particularly crucial when the data collection is large. No human mind is capable of grasping the meaning of any considerable quantity of data unless their mass is somehow reduced to relatively few convenient categories or is condensed with the help of some kind of visual aid.
The first step in drawing a frequency distribution is to construct a frequency table. A frequency table is a way of organizing the data by listing every possible score (including those not actually obtained in the sample) as a column of numbers and the frequency of occurrence of each score as another. Computing the frequency of a score is simply a matter of counting the number of times that score appears in the set of data.
The frequency of a particular data value is the number of times the data value occurs. For example, if four students have a score of 80 in mathematics, then the score of 80 is said to have a frequency of 4.
A frequency table is constructed by arranging collected data values in ascending order of magnitude with their corresponding frequencies.
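As a sketch with invented marks, such a frequency table can be built with the standard library's Counter:

```python
from collections import Counter

# Hypothetical mathematics scores; 80 occurs four times,
# so it has a frequency of 4
scores = [80, 75, 80, 90, 75, 80, 85, 80]
freq = Counter(scores)

# List the data values in ascending order with their frequencies
for value in sorted(freq):
    print(value, freq[value])
```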
The information contained in the frequency table may be transformed to a graphical or pictorial form. No information is gained or lost in this transformation, but the human information processing system often finds the graphical or pictorial presentation easier to comprehend.
A histogram is drawn by plotting the scores (midpoints) on the X-axis and the frequencies on the Y-axis. A bar is drawn for each score value, the width of the bar corresponding to the real limits of the interval and the height corresponding to the frequency of the occurrence of the score value.
Bar charts can be used to illustrate the frequencies of different categories. If the data are nominal (categorical) in form, the graph is similar to a histogram except that the bars do not touch, forming a bar chart.
A line graph is a bar graph with the tops of the bars represented by points joined by lines (the rest of the bar is suppressed).
Line graphs are appropriate only when both the X- and Y-axes display ordered (rather than qualitative) variables. Although bar graphs can also be used in this situation, line graphs are generally better at comparing changes over time.
Pie Chart is a special chart that uses "pie slices" to show relative sizes of data.
A stem and leaf display is a graphical method of displaying data. It is particularly useful when your data are not too numerous. One purpose of a stem and leaf display is to clarify the shape of the distribution. There is a variation of stem and leaf displays that is useful for comparing distributions.
Whether your data can be suitably represented by a stem and leaf graph depends on whether they can be rounded without loss of important information.
Box plots are useful for identifying outliers and for comparing distributions. There are several steps in constructing a box plot. The first relies on the 25th, 50th, and 75th percentiles in the distribution of scores. For a data set, we draw a box extending from the 25th percentile to the 75th percentile. The 50th percentile is drawn inside the box.
Therefore,
the bottom of each box is the 25th percentile,
the top is the 75th percentile,
and the line in the middle is the 50th percentile.
Continuing with the box plots, we put "whiskers" above and below each box to give additional information about the spread of the data. Whiskers are vertical lines that end in a horizontal stroke. Whiskers are drawn from the upper and lower hinges to the upper and lower adjacent values.
Although we don't draw whiskers all the way to outside or far out values, we still wish to represent them in our box plots. This is achieved by adding additional marks beyond the whiskers. Specifically, outside values are indicated by small "o's" and far out values are indicated by asterisks (*).
Box plots provide basic information about a distribution. For example, a distribution with a positive skew would have a longer whisker in the positive direction than in the negative direction. A larger mean than median would also indicate a positive skew. Box plots are good at portraying extreme values and are especially good at showing differences between distributions. However, many of the details of a distribution are not revealed in a box plot, and to examine these details one should create a histogram and/or a stem and leaf display.
The normal distribution is the most important and most widely used distribution in statistics. It is sometimes called the "bell curve," although the tonal qualities of such a bell would be less than pleasing. It is also called the "Gaussian curve" after the mathematician Carl Friedrich Gauss. Eight features of normal distributions are listed below.
Normal distributions are symmetric around their mean.
The mean, median, and mode of a normal distribution are equal.
The area under the normal curve is equal to 1.0.
Normal distributions are denser in the center and less dense in the tails.
Normal distributions are defined by two parameters, the mean (Îź) and the standard deviation (Ď).
68% of the area of a normal distribution is within one standard deviation of the mean.
Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.
Approximately 99.7% of the area of a normal distribution is within three standard deviations of the mean.
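These area figures can be checked numerically; Python's statistics.NormalDist (available since Python 3.8) is used here purely as an illustration:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, standard deviation 1

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

print(round(area_within(1), 4))  # about 68% within one standard deviation
print(round(area_within(2), 4))  # about 95% within two
print(round(area_within(3), 4))  # about 99.7% within three
```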
The central limit theorem states that the sampling distribution of any statistic will be normal or nearly normal, if the sample size is large enough. Generally, a sample size is considered "large enough" if any of the following conditions apply.
The population distribution is normal.
The sample distribution is roughly symmetric, unimodal, without outliers, and the sample size is 15 or less.
The sample distribution is moderately skewed, unimodal, without outliers, and sample size is between 16 and 30.
The sample size is greater than 30, without outliers.
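A small simulation sketches the theorem (the sample size, number of samples, and seed are arbitrary choices): means of repeated samples from a uniform(0, 1) population, which is itself far from normal, cluster around the population mean of 0.5 with a spread close to σ/√n:

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch is reproducible

# Draw 1000 samples of size 30 from a uniform(0, 1) population (mean 0.5)
sample_means = [mean(random.random() for _ in range(30))
                for _ in range(1000)]

# The sampling distribution of the mean is centered near 0.5; its spread
# is roughly sigma/sqrt(n) = 0.2887/sqrt(30), about 0.053
print(round(mean(sample_means), 2))
print(round(stdev(sample_means), 3))
```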
An assessment of the normality of data is a prerequisite for many statistical tests because normal data is an underlying assumption in parametric testing. There are two main methods of assessing normality: graphically and numerically.
Methods of assessing normality
SPSS Statistics allows you to run all of these procedures within the Explore... command. The Explore... command can be used on its own if you are testing normality in one group, or with your dataset split into two or more groups.
SPSS produces a table that presents two well-known tests of normality, namely the Kolmogorov-Smirnov Test and the Shapiro-Wilk Test. The Shapiro-Wilk Test is more appropriate for small sample sizes (< 50), but can also handle sample sizes as large as 2000.
Normal Q-Q Plot
In order to determine normality graphically, we can use the output of a normal Q-Q Plot. If the data are normally distributed, the data points will be close to the diagonal line. If the data points stray from the line in an obvious non-linear fashion, the data are not normally distributed.
Confidence Interval
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data.
If independent samples are taken repeatedly from the same population, and a confidence interval calculated for each sample, then a certain percentage (confidence level) of the intervals will include the unknown population parameter. Confidence intervals are usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9% (or whatever) confidence intervals for the unknown parameter.
The width of the confidence interval gives us some idea about how uncertain we are about the unknown parameter.
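As a hedged sketch with invented measurements, a 95% confidence interval for a mean can be computed using the normal critical value; for a sample this small a t critical value would strictly be more appropriate:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]  # hypothetical data
n = len(data)
x_bar = mean(data)
se = stdev(data) / sqrt(n)  # standard error of the mean

# Critical value for a 95% interval: about 1.96
z = NormalDist().inv_cdf(0.975)

lower, upper = x_bar - z * se, x_bar + z * se
print(round(lower, 2), round(upper, 2))
```

A wider interval (e.g. 99%) uses a larger critical value, which directly illustrates how higher confidence trades off against a less precise estimate.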
Hypothesis Testing
Setting up and testing hypotheses is an essential part of statistical inference. In order to formulate such a test, usually some theory has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved, for example, claiming that a new drug is better than the current drug for treatment of the same symptoms.
In each problem considered, the question of interest is simplified into two competing claims / hypotheses between which we have a choice; the null hypothesis, denoted H0, against the alternative hypothesis, denoted H1. These two competing claims / hypotheses are not however treated on an equal basis: special consideration is given to the null hypothesis.
We have two common situations:
1. The experiment has been carried out in an attempt to disprove or reject a particular hypothesis, the null hypothesis, thus we give that one priority so it cannot be rejected unless the evidence against it is sufficiently strong. For example,
H0: there is no difference in taste between coke and diet coke
against
H1: there is a difference.
2. If one of the two hypotheses is 'simpler', we give it priority so that a more 'complicated' theory is not adopted unless there is sufficient evidence against the simpler one. For example, it is 'simpler' to claim that there is no difference in flavor between coke and diet coke than it is to say that there is a difference.
The hypotheses are often statements about population parameters like expected value and variance; for example H0 might be that the expected value of the height of ten year old boys in the Scottish population is not different from that of ten year old girls. A hypothesis might also be a statement about the distributional form of a characteristic of interest, for example that the height of ten year old boys is normally distributed within the Scottish population.
The outcome of a hypothesis test is "Reject H0 in favor of H1" or "Do not reject H0".
Hypothesis tests may be performed on contingency tables in order to decide whether or not effects are present. Effects in a contingency table are defined as relationships between the row and column variables; that is, are the levels of the row variable differentially distributed over levels of the column variables. Significance in this hypothesis test means that interpretation of the cell frequencies is warranted. Non-significance means that any differences in cell frequencies could be explained by chance. Hypothesis tests on contingency tables are based on a statistic called Chi-square.
REVIEW OF CONTINGENCY TABLES
Frequency tables of two variables presented simultaneously are called contingency tables. Contingency tables are constructed by listing all the levels of one variable as rows in a table and the levels of the other variable as columns, then finding the joint or cell frequency for each cell. The cell frequencies are then summed across both rows and columns. The sums are placed in the margins, the values of which are called marginal frequencies. The lower right hand corner value contains the sum of either the row or column marginal frequencies, which both must be equal to N.
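The Chi-square statistic compares each observed cell frequency with the frequency expected if the row and column variables were independent (expected = row total × column total / N). A minimal sketch with invented counts:

```python
def chi_square(table):
    """Chi-square statistic for a two-way contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: rows = two groups, columns = outcome yes/no
table = [[10, 20],
         [30, 40]]
print(round(chi_square(table), 3))
```

The statistic would then be compared against a Chi-square distribution with (rows - 1) × (columns - 1) degrees of freedom to decide significance.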
How to Read a Scatterplot
A scatterplot consists of an X axis (the horizontal axis), a Y axis (the vertical axis), and a series of dots. Each dot on the scatterplot represents one observation from a data set. The position of the dot on the scatterplot represents its X and Y values.
Correlation coefficients measure the strength of association between two variables. The most common correlation coefficient, called the Pearson product-moment correlation coefficient, measures the strength of the linear association between variables.
The sign and the absolute value of a Pearson correlation coefficient describe the direction and the magnitude of the relationship between two variables.
The value of a correlation coefficient ranges between -1 and 1.
The greater the absolute value of a correlation coefficient, the stronger the linear relationship.
The strongest linear relationship is indicated by a correlation coefficient of -1 or 1.
The weakest linear relationship is indicated by a correlation coefficient equal to 0.
A positive correlation means that if one variable gets bigger, the other variable tends to get bigger.
A negative correlation means that if one variable gets bigger, the other variable tends to get smaller.
Keep in mind that the Pearson correlation coefficient only measures linear relationships. Therefore, a correlation of 0 does not mean zero relationship between two variables; rather, it means zero linear relationship. (It is possible for two variables to have zero linear relationship and a strong curvilinear relationship at the same time.)
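The Pearson coefficient can be computed directly from its definition; the data below are invented and chosen so that perfect positive and negative linear relationships give +1 and -1:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(round(pearson_r(x, [2, 4, 6, 8, 10]), 6))   # perfect positive: ~ +1
print(round(pearson_r(x, [10, 8, 6, 4, 2]), 6))   # perfect negative: ~ -1
```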
In a cause and effect relationship, the independent variable is the cause, and the dependent variable is the effect. Least squares linear regression is a method for predicting the value of a dependent variable Y, based on the value of an independent variable X.
In this section, we focus on the case where there is only one independent variable. This is called simple regression (as opposed to multiple regression, which handles two or more independent variables).
Prerequisites for Regression
Simple linear regression is appropriate when the following conditions are satisfied.
The dependent variable Y has a linear relationship to the independent variable X. To check this, make sure that the XY scatterplot is linear and that the residual plot shows a random pattern.
For each value of X, the probability distribution of Y has the same standard deviation Ď. When this condition is satisfied, the variability of the residuals will be relatively constant across all values of X, which is easily checked in a residual plot.
For any given value of X,
The Y values are independent, as indicated by a random pattern on the residual plot.
The Y values are roughly normally distributed (i.e., symmetric and unimodal). A little skewness is ok if the sample size is large. A histogram or a dotplot will show the shape of the distribution.
The Least Squares Regression Line
Linear regression finds the straight line, called the least squares regression line or LSRL, that best represents observations in a bivariate data set. Suppose Y is a dependent variable, and X is an independent variable. The population regression line is:
Y = β0 + β1X
where β0 is a constant, β1 is the regression coefficient, X is the value of the independent variable, and Y is the value of the dependent variable.
Given a random sample of observations, the population regression line is estimated by:
ŷ = b0 + b1x
where b0 is a constant, b1 is the regression coefficient, x is the value of the independent variable, and ŷ is the predicted value of the dependent variable.
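The sample estimates have closed forms: b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b0 = ȳ - b1·x̄. A sketch with invented data:

```python
def least_squares(xs, ys):
    """Slope b1 and intercept b0 of the least squares regression line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4, 5]      # hypothetical independent variable
y = [2, 3, 5, 4, 6]      # hypothetical dependent variable
b0, b1 = least_squares(x, y)
print(round(b0, 2), round(b1, 2))   # intercept ~ 1.3, slope ~ 0.9

# Predicted values from the fitted line
predicted = [b0 + b1 * xi for xi in x]
```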
The Coefficient of Determination
The coefficient of determination (denoted by R²) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.
The coefficient of determination ranges from 0 to 1.
An R² of 0 means that the dependent variable cannot be predicted from the independent variable.
An R² of 1 means the dependent variable can be predicted without error from the independent variable.
An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.10 means that 10 percent of the variance in Y is predictable from X; an R² of 0.20 means that 20 percent is predictable; and so on.
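In simple regression, R² equals the square of the Pearson correlation coefficient. The sketch below (invented data) computes it both ways and gets the same value:

```python
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Route 1: square of the Pearson correlation coefficient
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
r = cov / sqrt(sum((xi - mx) ** 2 for xi in x)
               * sum((yi - my) ** 2 for yi in y))
print(round(r ** 2, 2))

# Route 2: one minus residual variation over total variation
b1 = cov / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - my) ** 2 for yi in y)
print(round(1 - ss_res / ss_tot, 2))
```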
Standard Error
The standard error about the regression line (often denoted by SE) is a measure of the average amount by which the regression equation over- or under-predicts. The higher the coefficient of determination, the lower the standard error, and the more accurate predictions are likely to be.