This presentation was intended for employees of Dubai Municipality. It covers how to use SPSS and other statistical data analysis tools, such as Excel and Minitab, and presents basic statistical concepts and definitions.
4. COURSE CONTENTS
• Introduction to Statistics
• Statistical Concepts
• Data Collection Methods
• Statistical Analysis
• Introduction to Statistical Software: SPSS and Minitab
• Data Preparation and Analysis using MS Excel, SPSS, & Minitab
• Statistical Report Writing
• Practical Examples and Applications
5. COURSE LEARNING OBJECTIVES
Professionals in every field should understand the importance of data and how to handle it, as well as the scientific methods used to extract the indicators needed for decision-making, so that they can help decision makers identify the various alternatives and evaluate them.
7. DEFINITION OF STATISTICS
Statistics is the science of collecting, organizing, presenting,
analyzing, and interpreting numerical data to assist in making
more effective decisions.
8. STAGES OF STATISTICS
"Statistics is a way to get information from data."
Collect → Organize (data preparation) → Present → Analyze (descriptive analysis) → Interpret (inferential analysis) → Make valid conclusions & take correct decisions
13. Variables
Variables carry descriptive information and are either qualitative or quantitative.
• Qualitative: sex, nationality, type of crime, occupation, religion, marital status, literacy, etc.
• Quantitative:
  - Discrete (counted): number of children in a family, number of accidents per week, students' marks, etc.
  - Continuous (measured): age, distance, height, weight, etc.
14. VARIABLES & UNITS
Selected characteristics of all the full-time employees of Mountain Aviation, Inc.: July 1, 1994

Employee | Race      | Sex    | Job Title      | Years of Service | Annual Salary
Abel     | Caucasian | Male   | Pilot          | 2                | $34,000
Cruz     | Caucasian | Male   | Chief mechanic | 10               | 60,000
Dunn     | Western   | Male   | Chief pilot    | 23               | 70,000
Hill     | Western   | Female | Secretary      | 5                | 14,000
King     | Caucasian | Male   | Janitor        | 8                | 17,000
Otis     | Caucasian | Male   | Grounds keeper | 10               | 20,000
West     | Western   | Male   | Mechanic       | 2                | 36,000
Wolf     | Caucasian | Female | Pilot          | 7                | 36,000
Zorn     | Caucasian | Female | Mechanic       | 7                | 40,000

In this table:
• each employee is an elementary unit, and the list of all nine employees is the frame;
• Race, Sex, and Job Title are qualitative variables; Years of Service and Annual Salary are quantitative variables;
• all observations of one variable (e.g., all employee sexes) form a population; the set of all employees is the population of employees;
• a single observation is a datum; any subset of a population (e.g., some of the employee salaries) is a sample.
15. QUANTITATIVE VARIABLES
Measurement levels, from weakest to strongest:
• Nominal: attributes are only named; no order
• Ordinal: attributes can be rank-ordered; distance between values is not meaningful
• Interval: distance is meaningful
• Ratio: has an absolute zero
Quantitative data are also either discrete or continuous.
20. WHAT IS DATA?
Data is a collection of facts, such as values or measurements. Examples of data include prices, exam scores, exports and imports, labor figures, and so on.
21. DATA VS. INFORMATION VS. STATISTICS

Data               | Information                              | Statistics
20 kg, 25 kg       | 5 individuals in the 20-to-25-kg range   | Mean weight = 22.5 kg
28 kg, 30 kg, etc. | 15 individuals in the 26-to-30-kg range  | Median weight = 28 kg
22. SOURCES OF DATA
• Internal: the organization's internal records
• External: published and unpublished external sources
24. OTHER DATA COLLECTION METHODS
• Interviews (face-to-face, telephone)
• Focus groups
• Ethnographies, oral history, & case studies
• Schedule through enumerators
• Documents & records
25. QUESTIONNAIRE VS. SCHEDULE

Questionnaire                                                      | Schedule
It is not confirmed whether the expected respondent filled in the answers | Respondent identity is known
Very slow                                                          | Information collected on time
No personal contact                                                | Direct personal contact
Used only when the respondent is educated & cooperative            | Information can be collected from illiterate people
Wider distribution of samples possible                             | Difficult for wider distribution
More incomplete and false information                              | Relatively more complete and correct information
Information validity depends on the quality of the questionnaire   | Information validity depends on the honesty & competence of the enumerator
Physical appearance should be attractive                           | Physical appearance not necessary
Observation cannot be used                                         | Observation can be used by the enumerator
26. FRAMING A QUESTIONNAIRE OR A
SCHEDULE
Cover Letter
Number of Questions
Nature of Questions
Questions should be simple
Arrangement of Questions
Information collected is usable
Avoid mathematical questions
27. TYPES OF QUESTIONS
• Closed-ended questions: Yes/No, multiple choice, scaled (e.g., Likert)
• Open-ended questions: no predefined options or categories
• Matrix questions: closed-ended, but arranged one under the other
• Contingency questions: answered only when the respondent provides a particular response
28. LIKERT SCALE
• 5-point scale: 1 = Not at all Satisfied, 2 = Not Satisfied, 3 = Neutral, 4 = Satisfied, 5 = Very Satisfied
• 7-point scale: 1 = Entirely Disagree, 2 = Mostly Disagree, 3 = Somewhat Disagree, 4 = Neither Agree nor Disagree, 5 = Somewhat Agree, 6 = Mostly Agree, 7 = Entirely Agree
• 9-point scale: endpoints labeled 1 = Disagree to 9 = Agree
33. DATA ANALYSIS
Univariate analysis: performed to study a single variable, using descriptive statistics:
• Tables & graphs
• Summary measures
• One-sample tests
• Normality tests
• Normal probability plot
Bivariate analysis: performed to study the relationship between two variables, using statistical analysis:
• Comparisons
• Contingency tables
• Scatter plot
• Correlation
• Regression
34. DESCRIPTIVE STATISTICS
Tables & graphs: frequency table, frequency histogram, bar & column chart, time-series line graph, pie chart, stem-&-leaf diagram, boxplot.
Summary measures: measures of location, measures of variability, measures of shape, proportion.
35. SUMMARY MEASURES
• Measures of central tendency (location): mean (average), median (middle value), mode (most frequent value)
• Measures of dispersion (spread): range (highest value - lowest value), variance, standard deviation, coefficient of variation
• Measures of shape: skewness, kurtosis
• Proportion: frequency of observations in a particular category as a fraction of all observations
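The summary measures above can all be computed directly. The course itself uses Excel, SPSS, and Minitab; the sketch below uses only Python's standard-library `statistics` module purely for illustration, on a hypothetical sample.

```python
import statistics

data = [12, 15, 15, 18, 20, 22, 25]  # hypothetical sample

mean = statistics.mean(data)          # average
median = statistics.median(data)      # middle value
mode = statistics.mode(data)          # most frequent value
data_range = max(data) - min(data)    # highest value - lowest value
variance = statistics.variance(data)  # sample variance
stdev = statistics.stdev(data)        # sample standard deviation
cv = stdev / mean                     # coefficient of variation

print(mean, median, mode, data_range)
```

Note that `variance` and `stdev` here are the sample (n-1) versions, which is also what SPSS reports by default.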
36. TABLES & GRAPHS
[Examples shown: frequency table, frequency histogram, time-series line graph, pie chart, stem-&-leaf diagram, boxplot]
43. DATA REPRESENTATION SUMMARY

Variable                  | Data type        | Graphically represented by                                                        | Central tendency   | Dispersion
Qualitative (Categorical) | Nominal          | Pie chart; column or bar chart                                                    | Mode               | Range
Qualitative (Categorical) | Ordinal          | Column or bar chart                                                               | Mode, Median       | Range
Quantitative (Scale)      | Interval & Ratio | Frequency histogram; boxplot; stem-&-leaf diagram; time-series line graph         | Mode, Median, Mean | Range, Variance, Standard deviation, Coefficient of variation

Effective representation of data depends on choosing the proper graphical tool and summary measures.
45. CENTRAL LIMIT THEOREM
The central limit theorem, in its shortest form, states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, regardless of the shape of the population distribution.
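The theorem is easy to see in a small simulation: draw repeated samples from a decidedly non-normal (uniform) population and watch the sample means concentrate around the population mean. An illustrative sketch only, using Python's standard library:

```python
import random
import statistics

random.seed(42)  # reproducible

population_mean = 0.5  # mean of Uniform(0, 1)
sample_size = 50
n_samples = 2000

# one sample mean per draw of `sample_size` uniform values
sample_means = [
    statistics.mean(random.random() for _ in range(sample_size))
    for _ in range(n_samples)
]

grand_mean = statistics.mean(sample_means)  # should be near 0.5
spread = statistics.stdev(sample_means)     # near sigma/sqrt(n) = 0.2887/7.07
print(grand_mean, spread)
```

The spread of the sample means shrinks like sigma divided by the square root of the sample size, which is why larger samples give more precise estimates.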
46. TEST OF NORMALITY
Since many of the most common statistical tests rely on the normality of a sample or population, it is often useful to test whether the underlying distribution is normal, or at least symmetric. This can be done via the following approaches:
• Review the distribution graphically (histograms & normal probability plots)
• Analyze the skewness & kurtosis
• Employ statistical tests (e.g., Chi-square)
48. DATA ANALYSIS
Univariate analysis: performed to study a single variable, using descriptive statistics:
• Tables & graphs
• Summary measures
• One-sample tests
• Normality tests
• Normal probability plot
Bivariate analysis: performed to study the relationship between two variables, using statistical analysis:
• Comparisons
• Contingency tables
• Scatter plot
• Correlation
• Regression
49. STATISTICAL TESTS
Independent samples, parametric tests:
• 1 sample: t-test
• 2 samples: t-test
• K samples: ANOVA
Independent samples, nonparametric tests:
• 1 sample: Binomial, Chi-square, Kolmogorov-Smirnov
• 2 samples: Mann-Whitney
• K samples: Kruskal-Wallis
Related samples, parametric tests:
• 2 samples: t-test
• K samples: repeated-measures ANOVA
Related samples, nonparametric tests:
• 2 samples: Wilcoxon
• K samples: Friedman
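As a taste of what sits behind one entry in the parametric branch, the 2-sample t statistic can be computed by hand. This sketch uses Welch's form, which does not assume equal variances; SPSS and Minitab report both the pooled and Welch versions, so this is only an illustration of the formula, not a replacement for the software.

```python
import math
import statistics

def welch_t(a, b):
    # t = (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b)
    va, vb = statistics.variance(a), statistics.variance(b)
    se = math.sqrt(va / len(a) + vb / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

group_a = [1, 2, 3, 4]   # hypothetical independent samples
group_b = [2, 3, 4, 5]
t = welch_t(group_a, group_b)
print(t)
```

The statistical software then compares this statistic against a t distribution to produce the p-value.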
50. CONTINGENCY TABLES
A two-way table is a useful tool for examining relationships between categorical variables. The entries in the cells of a two-way table can be frequency counts or relative frequencies.
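The frequency counts in a two-way table can be tested for independence with a chi-square statistic. This minimal sketch computes it from first principles (expected count = row total × column total ÷ grand total) on hypothetical counts; in practice SPSS's Crosstabs procedure does this for you.

```python
def chi_square(table):
    # table: list of rows of observed frequency counts
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# hypothetical counts: rows = sex, columns = stated preference
table = [[20, 30],
         [30, 20]]
print(chi_square(table))  # 4.0
```

A statistic of 0 means the observed counts exactly match what independence predicts; larger values are stronger evidence of a relationship.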
51. SCATTER PLOT
A scatter (XY) plot has points that show the relationship between two sets of data. In this example, each dot shows one person's weight versus their height. The pattern of points can show a positive association, a negative association, or no association.
52. CORRELATION
When two sets of data are strongly linked together we say they have a high correlation.
• Correlation is positive when the values increase together.
• Correlation is negative when one value decreases as the other increases.
Correlation can take a value from -1 to 1:
• 1 is a perfect positive correlation
• 0 is no correlation (the values don't seem linked at all)
• -1 is a perfect negative correlation
The value shows how strong the correlation is (not how steep the line is), and whether it is positive or negative.
53. LINEAR REGRESSION
In a cause-and-effect relationship, the independent variable is the cause, and the dependent variable is the effect.
60. WHAT NEXT?
Data ready in Excel → Import data to SPSS or Minitab → Explore data using descriptive statistics → Perform hypothesis testing if needed → Write your report
62. HOW TO WRITE A RESEARCH REPORT?
• Cover page: title, date, researcher name, supervisor name
• Contents & executive summary
• Introduction: data source, data analysis technique, main research aim
• Data analysis: descriptive, research questions
• Conclusion
65. THANK YOU
Have a successful career and a wonderful life full of joy, productivity, & happiness.
Marwa Abo-Amra
Email: analyst.amra@gmail.com
Blog: analystamra.blogspot.com
Editor's Notes
Population vs. Sample
The study of statistics revolves around the study of data sets. This section describes two important types of data sets - populations and samples. A population includes each element from the set of observations that can be made. A sample consists only of observations drawn from the population.
Variable vs. Elementary Unit
In statistics, a variable has two defining characteristics:
A variable is an attribute that describes a person, place, thing, or idea (the elementary unit).
The value of the variable can "vary" from one entity to another.
Independent and dependent variables
Variables are properties or characteristics of some event, object, or person that can take on different values or amounts (as opposed to constants such as π that do not vary). When conducting research, experimenters often manipulate variables. For example, an experimenter might compare the effectiveness of four types of antidepressants. In this case, the variable is "type of antidepressant." When a variable is manipulated by an experimenter, it is called an independent variable. The experiment seeks to determine the effect of the independent variable on relief from depression. In this example, relief from depression is called a dependent variable. In general, the independent variable is manipulated by the experimenter and its effects on the dependent variable are measured.
Levels of an Independent Variable
If an experiment compares an experimental treatment with a control treatment, then the independent variable (type of treatment) has two levels: experimental and control. If an experiment were comparing five types of diets, then the independent variable (type of diet) would have 5 levels. In general, the number of levels of an independent variable is the number of experimental conditions.
Qualitative and Quantitative Variables
An important distinction between variables is between qualitative variables and quantitative variables. Qualitative variables are those that express a qualitative attribute such as hair color, eye color, religion, favorite movie, gender, and so on. The values of a qualitative variable do not imply a numerical ordering. Values of the variable "religion" differ qualitatively; no ordering of religions is implied. Qualitative variables are sometimes referred to as categorical variables. Quantitative variables are those variables that are measured in terms of numbers. Some examples of quantitative variables are height, weight, and shoe size.
Discrete and Continuous Variables
Variables such as number of children in a household are called discrete variables since the possible scores are discrete points on the scale. For example, a household could have three children or six children, but not 4.53 children. Other variables such as "time to respond to a question" are continuous variables since the scale is continuous and not made up of discrete steps. The response time could be 1.64 seconds, or it could be 1.64237123922121 seconds. Of course, the practicalities of measurement preclude most measured variables from being truly continuous.
This table contains a statistical frame and the multivariate data set derived from it. The table illustrates the meaning of a number of basic statistical concepts. Thus, the first column from the left lists 9 elementary units that jointly constitute the frame (shaded). The headings of the second to the sixth column show characteristics of the elementary units that are called variables and that can be qualitative (race, sex, job title) or quantitative (years of service, annual salary). All possible observations about a given variable constitute a statistical population; the shaded entries in the third and the sixth column from the left are two examples of populations; any single observation is a datum; any subset of a population or of the frame is a sample.
Types of Scales
Before we can conduct a statistical analysis, we need to measure our dependent variable. Exactly how the measurement is carried out depends on the type of variable involved in the analysis. Different types are measured differently. To measure the time taken to respond to a stimulus, you might use a stop watch. Stop watches are of no use, of course, when it comes to measuring someone's attitude towards a political candidate. A rating scale is more appropriate in this case (with labels like "very favorable," "somewhat favorable," etc.). For a dependent variable such as "favorite color," you can simply note the color-word (like "red") that the subject offers.
Although procedures for measurement differ in many ways, they can be classified using a few fundamental categories. In a given category, all of the procedures share some properties that are important for you to know about. The categories are called "scale types," or just "scales," and are described in this section.
Nominal scales
When measuring using a nominal scale, one simply names or categorizes responses. Gender, handedness, favorite color, and religion are examples of variables measured on a nominal scale. The essential point about nominal scales is that they do not imply any ordering among the responses. For example, when classifying people according to their favorite color, there is no sense in which green is placed "ahead of" blue. Responses are merely categorized. Nominal scales embody the lowest level of measurement.
Ordinal scales
A researcher wishing to measure consumers' satisfaction with their microwave ovens might ask them to specify their feelings as either "very dissatisfied," "somewhat dissatisfied," "somewhat satisfied," or "very satisfied." The items in this scale are ordered, ranging from least to most satisfied. This is what distinguishes ordinal from nominal scales. Unlike nominal scales, ordinal scales allow comparisons of the degree to which two subjects possess the dependent variable. For example, our satisfaction ordering makes it meaningful to assert that one person is more satisfied than another with their microwave ovens. Such an assertion reflects the first person's use of a verbal label that comes later in the list than the label chosen by the second person.
On the other hand, ordinal scales fail to capture important information that will be present in the other scales we examine. In particular, the difference between two levels of an ordinal scale cannot be assumed to be the same as the difference between two other levels. In our satisfaction scale, for example, the difference between the responses "very dissatisfied" and "somewhat dissatisfied" is probably not equivalent to the difference between "somewhat dissatisfied" and "somewhat satisfied." Nothing in our measurement procedure allows us to determine whether the two differences reflect the same difference in psychological satisfaction. Statisticians express this point by saying that the differences between adjacent scale values do not necessarily represent equal intervals on the underlying scale giving rise to the measurements. (In our case, the underlying scale is the true feeling of satisfaction, which we are trying to measure.)
What if the researcher had measured satisfaction by asking consumers to indicate their level of satisfaction by choosing a number from one to four? Would the difference between the responses of one and two necessarily reflect the same difference in satisfaction as the difference between the responses two and three? The answer is No. Changing the response format to numbers does not change the meaning of the scale. We still are in no position to assert that the mental step from 1 to 2 (for example) is the same as the mental step from 3 to 4.
Interval scales
Interval scales are numerical scales in which intervals have the same interpretation throughout. As an example, consider the Fahrenheit scale of temperature. The difference between 30 degrees and 40 degrees represents the same temperature difference as the difference between 80 degrees and 90 degrees. This is because each 10-degree interval has the same physical meaning (in terms of the kinetic energy of molecules).
Interval scales are not perfect, however. In particular, they do not have a true zero point even if one of the scaled values happens to carry the name "zero." The Fahrenheit scale illustrates the issue. Zero degrees Fahrenheit does not represent the complete absence of temperature (the absence of any molecular kinetic energy). In reality, the label "zero" is applied to its temperature for quite accidental reasons connected to the history of temperature measurement. Since an interval scale has no true zero point, it does not make sense to compute ratios of temperatures. For example, there is no sense in which the ratio of 40 to 20 degrees Fahrenheit is the same as the ratio of 100 to 50 degrees; no interesting physical property is preserved across the two ratios. After all, if the "zero" label were applied at the temperature that Fahrenheit happens to label as 10 degrees, the two ratios would instead be 30 to 10 and 90 to 40, no longer the same! For this reason, it does not make sense to say that 80 degrees is "twice as hot" as 40 degrees. Such a claim would depend on an arbitrary decision about where to "start" the temperature scale, namely, what temperature to call zero (whereas the claim is intended to make a more fundamental assertion about the underlying physical reality).
Ratio scales
The ratio scale of measurement is the most informative scale. It is an interval scale with the additional property that its zero position indicates the absence of the quantity being measured. You can think of a ratio scale as the three earlier scales rolled up in one. Like a nominal scale, it provides a name or category for each object (the numbers serve as labels). Like an ordinal scale, the objects are ordered (in terms of the ordering of the numbers). Like an interval scale, the same difference at two places on the scale has the same meaning. And in addition, the same ratio at two places on the scale also carries the same meaning.
The Fahrenheit scale for temperature has an arbitrary zero point and is therefore not a ratio scale. However, zero on the Kelvin scale is absolute zero. This makes the Kelvin scale a ratio scale. For example, if one temperature is twice as high as another as measured on the Kelvin scale, then it has twice the kinetic energy of the other temperature.
Another example of a ratio scale is the amount of money you have in your pocket right now (25 cents, 55 cents, etc.). Money is measured on a ratio scale because, in addition to having the properties of an interval scale, it has a true zero point: if you have zero money, this implies the absence of money. Since money has a true zero point, it makes sense to say that someone with 50 cents has twice as much money as someone with 25 cents (or that Bill Gates has a million times more money than you do).
When we think of the term "population," we usually think of people in our town, region, state or country and their respective characteristics such as gender, age, marital status, ethnic membership, religion and so forth. In statistics the term "population" takes on a slightly different meaning. The "population" in statistics includes all members of a defined group that we are studying or collecting information on for data driven decisions.
A part of the population is called a sample. It is a proportion of the population, a slice of it, a part of it and all its characteristics. A sample is a scientifically drawn group that actually possesses the same characteristics as the population â if it is drawn randomly.
A measurable characteristic of a population is called a parameter; but a measurable characteristic of a sample is called a statistic.
Sampling is the process by which inference is made to the whole by examining a part.
Purpose of sampling
To provide various types of statistical information of a qualitative or quantitative nature about the whole by examining a few selected units.
It is cheaper than census method.
What is Simple Random Sampling?
Simple random sampling refers to a sampling method that has the following properties.
The population consists of N objects.
The sample consists of n objects.
All possible samples of n objects are equally likely to occur.
An important benefit of simple random sampling is that it allows researchers to use statistical methods to analyze sample results. For example, given a simple random sample, researchers can use statistical methods to define a confidence interval around a sample mean. Statistical analysis is not appropriate when non-random sampling methods are used.
There are many ways to obtain a simple random sample. One way would be the lottery method. Each of the N population members is assigned a unique number. The numbers are placed in a bowl and thoroughly mixed. Then, a blind-folded researcher selects n numbers. Population members having the selected numbers are included in the sample.
Before one can present and interpret information, there has to be a process of gathering and sorting data. Just as trees are the raw material from which paper is produced, so too, can data be viewed as the raw material from which information is obtained.
Once data have been collected and processed, they are ready to be organized into information. Indeed, it is hard to imagine reasons for collecting data other than to provide information. This information leads to knowledge about issues, and helps individuals and groups make informed decisions.
In practice, informed decision-making can save countries millions of dollars (for example, through accurate targeting of government spending). It can also lead to life saving breakthroughs in medicine, and can help conserve the earth's natural environment.
Information is data that have been recorded, classified, organized, related, or interpreted within a framework so that meaning emerges.
Statistics represent a common method of presenting information. In general, statistics relate to numerical data, and can refer to the science of dealing with the numerical data itself. Above all, statistics aim to provide useful information by means of numbers.
Therefore, a good definition of statistics is "a type of information obtained through mathematical operations on numerical data".
Collection of data is the first step in any statistical investigation of a phenomenon.
Sources of Data
Data are generally classified into the following two groups:
1. Internal Data
Internal data come from internal sources related with the functioning of an organization or firm where records regarding purchase, production, sales, profits etc. are kept on a regular basis. Various Government departments, like Railways, Communications, Education etc. also generate internal data which are useful for their proper internal functioning. However, the internal data can be either insufficient or inappropriate for the statistical inquiry into a phenomenon. In that situation we need external data.
2. External Data
The External data are collected and published by external agencies. This type of data can be obtained from primary source or secondary source. Thus, the external data can further be classified as: Primary and Secondary Data.
Primary data are original and firsthand information. Data are termed primary when the reference is to data collected for the first time by the investigator. For example, the Meteorological department regularly collects data on various aspects of the weather and climate such as amount of rainfall, humidity, minimum and maximum temperature of a certain place. These constitute primary data. Similarly, the data in a population census obtained by the office of the Registrar General and Census Commissioner are primary in nature. On the other hand, data are termed secondary when collected from records or data already available. In other words, secondary data are data which have already been collected by a source other than the present investigator. For example, population census data are primary for the office of the Registrar General and Census Commissioner, whereas for other organizations or individuals who use such data, they are secondary. Thus, data which are primary in one hand become secondary in the hands of others.
To derive conclusions from data, we need to know how the data were collected; that is, we need to know the method(s) of data collection.
Methods of Data Collection
There are four main methods of data collection.
1. Census. A census is a study that obtains data from every member of a population. In most studies, a census is not practical, because of the cost and/or time required.
2. Sample survey. A sample survey is a study that obtains data from a subset of a population, in order to estimate population attributes.
3. Experiment. An experiment is a controlled study in which the researcher attempts to understand cause-and-effect relationships. The study is "controlled" in the sense that the researcher controls (1) how subjects are assigned to groups and (2) which treatments each group receives.
In the analysis phase, the researcher compares group scores on some dependent variable. Based on the analysis, the researcher draws a conclusion about whether the treatment ( independent variable) had a causal effect on the dependent variable.
4. Observational study. Like experiments, observational studies attempt to understand cause-and-effect relationships. However, unlike experiments, the researcher is not able to control (1) how subjects are assigned to groups and/or (2) which treatments each group receives.
Data Collection Methods: Pros and Cons
Each method of data collection has advantages and disadvantages.
1. Resources. When the population is large, a sample survey has a big resource advantage over a census. A well-designed sample survey can provide very precise estimates of population parameters - quicker, cheaper, and with less manpower than a census.
2. Generalizability. Generalizability refers to the appropriateness of applying findings from a study to a larger population. Generalizability requires random selection. If participants in a study are randomly selected from a larger population, it is appropriate to generalize study results to the larger population; if not, it is not appropriate to generalize.
Observational studies do not feature random selection; so generalizing from the results of an observational study to a larger population can be a problem.
3. Causal inference. Cause-and-effect relationships can be teased out when subjects are randomly assigned to groups. Therefore, experiments, which allow the researcher to control assignment of subjects to treatment groups, are the best method for investigating causal relationships.
Interviews
Interviews can be conducted face-to-face or by telephone. They can range from in-depth, semi-structured to unstructured depending on the information being sought.
Face to face interviews are advantageous since:
detailed questions can be asked
further probing can be done to provide rich data
literacy requirements of participants are not an issue
non verbal data can be collected through observation
complex and unknown issues can be explored
response rates are usually higher than for self-administered questionnaires.
Disadvantages of face to face interviews include:
they can be expensive and time consuming
training of interviewers is necessary to reduce interviewer bias and to ensure the interviews are administered in a standardized way
they are prone to interviewer bias and interpreter bias (if interpreters are used)
sensitive issues may be challenging.
Telephone interviews, according to Bowling, yield data just as accurate as face-to-face interviews.
Telephone interviews are advantageous as they:
are cheaper and faster than face to face interviews to conduct
use less resources than face to face interviews
allow the interviewer to clarify questions
do not require literacy skills.
Disadvantages of telephone interviews include:
having to make repeated calls as calls may not be answered the first time
potential bias towards those who are at home, if call-backs are not made
only suitable for short surveys
only accessible to the population with a telephone
not appropriate for exploring sensitive issues.
Focus groups
Focus groups or group discussions are useful to further explore a topic, providing a broader understanding of why the target group may behave or think in a particular way, and assist in determining the reason for attitudes and beliefs. They are conducted with a small sample of the target group and are used to stimulate discussion and gain greater insights.
Focus groups and group discussions are advantageous as they:
are useful when exploring cultural values and health beliefs
can be used to examine how and why people think in a particular way and how it influences their beliefs and values
can be used to explore complex issues
can be used to develop hypotheses for further research
do not require participants to be literate.
Disadvantages of focus groups include:
lack of privacy/anonymity
having to carefully balance the group to ensure they are culturally and gender appropriate (i.e. gender may be an issue)
potential for the risk of "group think" (not allowing for other attitudes, beliefs etc.)
potential for group to be dominated by one or two people
group leader needs to be skilled at conducting focus groups, dealing with conflict, drawing out passive participants and creating a relaxed, welcoming environment
are time consuming to conduct and can be difficult and time consuming to analyze.
Ethnographies, Oral History, & Case Studies
Involves studying a single phenomenon
Examines people in their natural settings
Uses a combination of techniques such as observation, interviews, and surveys
Ethnography is a more holistic approach to evaluation
Researcher can become a confounding variable
Schedule through enumerators
Initially let us make a distinction between a questionnaire and a schedule. A questionnaire is a set of questions the answers to which are recorded by the informants themselves, whereas in a schedule the answers are recorded by an investigator or an enumerator on the informant's behalf.
In this method the investigators or enumerators approach the informants with a prepared questionnaire and record the replies to the questions. This method is generally used in censuses and large-scale surveys. In the case of a census, investigators visit every member of the population in their assigned zones, while in the case of a sample survey they collect information only from those members who have been selected in the sample.
Documents & Records
This method consists of examining existing data in the form of databases, meeting minutes, reports, attendance logs, financial records, newsletters, etc. This can be an inexpensive way to gather information, but may be an incomplete data source.
Substantial description and documentation, often referred to as "thick description", can be used to further explore a subject. This process provides a thorough description of the "study participants, context and procedures, the purpose of the intervention and its transferability". Thick description also includes the complexities experienced in addition to the commonalities found, which assists in maintaining data integrity.
The use of documentation provides an ongoing record of activities. This can be records of informal feedback and reflections through journals, diaries or progress reports. The challenge of documentation is that it requires an ongoing commitment to regularly document thoughts and activities throughout the evaluation process.
Great care and caution have to be exercised in drafting a questionnaire or a schedule, as this is the basis for collecting information in an investigation. Apart from care and caution, a good deal of expertise and experience of the phenomenon under investigation is required in its preparation. Though there are no hard and fast rules for drafting a questionnaire, the following points should still be given due consideration. These points are:
1. Covering letter: in this letter the investigator should introduce himself and make the objectives of the survey clear to the informant. The informant should also be assured that the information he provides will be kept confidential.
2. Number of questions: the number of questions in the questionnaire should be as small as possible. This saves time and is convenient to both the enumerator and the respondent.
3. Nature of questions: delicate questions should be put with great care. Indirect questions should often be framed to obtain the answers to such questions.
4. The questions should be simple: the questions should be clear, concise, short-answer, and unambiguous. They should be related to the phenomenon under investigation.
5. Arrangement of questions: there should be a natural and logical order to the questions in a questionnaire. For example, it is not logical to ask a man about his income before asking him about his occupation.
6. Information collected is usable: make sure that the information collected through the questions will actually be usable in the analysis.
7. Avoid mathematical questions: as far as possible, questions involving mathematical calculations should be avoided. It is better to use multiple-choice questions (with four or five alternatives) or simple alternative questions (Yes/No type).
8. Attractive layout of the questionnaire: the look of the questionnaire should be attractive and the questions should be suitably spaced for proper answering.
1. Contingency questions - A question that is answered only if the respondent gives a particular response to a previous question. This avoids asking questions of people that do not apply to them (for example, asking men if they have ever been pregnant).
2. Matrix questions - Identical response categories are assigned to multiple questions. The questions are placed one under the other, forming a matrix with response categories along the top and a list of questions down the side. This is an efficient use of page space and respondents' time.
3. Closed ended questions - Respondents' answers are limited to a fixed set of responses. Most scales are closed ended. Other types of closed ended questions include:
Yes/no questions - The respondent answers with a "yes" or a "no".
Multiple choice - The respondent has several options from which to choose.
Scaled questions - Responses are graded on a continuum (example: rate the appearance of the product on a scale from 1 to 10, with 10 being the most preferred appearance). Examples of types of scales include the Likert scale, the semantic differential scale, and the rank-order scale.
4. Open ended questions - No options or predefined categories are suggested. The respondent supplies their own answer without being constrained by a fixed set of possible responses. Examples of types of open ended questions include:
Completely unstructured questions- openly ask the opinion or view of the respondent
Word association questions - the participant states the first word that comes to mind as a series of words is presented
Thematic Apperception Test - a picture is presented to the respondent, which he explains from his own point of view
Sentence, story or picture completion - the respondent continues an incomplete sentence or story, or writes in empty conversation balloons in a picture
A Likert scale is a psychometric scale commonly involved in research that employs questionnaires. Likert-type or frequency scales use fixed choice response formats and are designed to measure attitudes or opinions. These ordinal scales measure levels of agreement/disagreement. A Likert-type scale assumes that the strength/intensity of experience is linear, i.e. on a continuum from strongly agree to strongly disagree, and makes the assumption that attitudes can be measured. Respondents may be offered a choice of five to seven or even nine pre-coded responses with the neutral point being neither agree nor disagree. In its final form, the Likert scale is a five (or seven) point scale which is used to allow the individual to express how much they agree or disagree with a particular statement.
Scoring & Analysis
The Likert scale is strictly an ordinal scale, but if well presented it may nevertheless approximate an interval-level measurement. This can be beneficial since, if it were treated purely as an ordinal scale, some valuable information could be lost if the "distance" between Likert items were not available for consideration. The important idea here is that the appropriate type of analysis depends on how the Likert scale has been presented.
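As a hedged sketch (the response wording and the 5-point coding below are assumptions, not from any particular survey), Likert responses can be coded numerically and then summarized:

```python
from statistics import mean, median

# Assumed 5-point coding for a Likert item
# (strongly disagree = 1 ... strongly agree = 5)
CODES = {
    "strongly disagree": 1,
    "disagree": 2,
    "neither agree nor disagree": 3,
    "agree": 4,
    "strongly agree": 5,
}

def score_responses(responses):
    """Map raw Likert responses to their numeric codes."""
    return [CODES[r] for r in responses]

responses = ["agree", "agree", "neither agree nor disagree",
             "strongly agree", "disagree"]
scores = score_responses(responses)

# The median respects the ordinal nature of the scale;
# the mean implicitly treats it as interval-level data.
print(median(scores), mean(scores))
```

Whether reporting the median or the mean is appropriate depends, as the text notes, on whether the scale is treated as ordinal or as approximately interval.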
Semantic differential
The semantic differential is a scale used for measuring the meaning of things and concepts. There are two aspects of meaning: denotative and connotative. The semantic differential measures connotative meaning.
Rank-order Scale
Rank order scaling questions allow a certain set of brands or products to be ranked based upon a specific attribute or characteristic.
When analyzing data, both descriptive and inferential statistics are used to analyze results and draw conclusions. So what are descriptive and inferential statistics? And what are their differences?
Descriptive Statistics
Descriptive statistics are numbers that are used to summarize and describe data. If we are analyzing birth certificates, for example, a descriptive statistic might be the percentage of certificates issued in Dubai, or the average age of the mother. Any other number we choose to compute also counts as a descriptive statistic for the data from which the statistic is computed. Several descriptive statistics are often used at one time to give a full picture of the data.
Descriptive statistics are just descriptive. They do not involve generalizing beyond the data at hand. Generalizing from our data to another set of cases is the business of inferential statistics.
Descriptive statistics are useful and serviceable if you do not need to extend your results to any larger group. However, much of social sciences tend to include studies that give us âuniversalâ truths about segments of the population, such as all parents, all women, all victims, etc.
Inferential Statistics
Inferential statistics is concerned with making predictions or inferences about a population from observations and analyses of a sample. That is, we can take the results of an analysis using a sample and can generalize it to the larger population that the sample represents. In order to do this, however, it is imperative that the sample is representative of the group to which it is being generalized.
There are two ways of presenting data:
1. Tables and Graphs
Frequency table consists of absolute, relative, and cumulative frequency distributions.
Frequency histogram is a graphical portrayal of an absolute or relative frequency distribution for continuous quantitative data in such a way that absolute or relative class frequencies are represented by rectangular areas in the graph.
Bar & Column charts: a series of horizontal or vertical bars, the lengths of which are proportional to the values to be depicted.
Time-series Line graphs: the graphical portrayal, by a continuous line, of data that are linked with time.
Pie charts: a portrayal of divisions of some aggregate by a segmented circle in such a way that the sector areas are proportional to the sizes of the divisions in question.
Stem-and-leaf diagrams: an unusual type of diagram that combines the features of an ordered array of numbers and a frequency histogram.
Box-and-whisker diagram (Boxplot): a type of graph used to display patterns of quantitative data.
2. Summary Measures
Summary measures of central tendency (or location) are values around which observations tend to cluster and that describe the location of what in some sense might be called the âcenterâ of a data set.
Summary measures of dispersion (or variability) are numbers that indicate the spread or scatter of observations; they show the extent to which individual values in a data set differ from one another and, hence, differ from their central location.
Summary measures of shape are numbers that indicate either the degree of asymmetry or the degree of peakedness in a frequency distribution.
Measures of Central Tendency
The Mean
The arithmetic mean is the most common measure of central tendency. It is simply the sum of the numbers divided by the number of numbers. The symbol "μ" is used for the mean of a population. The symbol "x̄" is used for the mean of a sample. The formula for μ is shown below:
μ = ΣX / N
where ΣX is the sum of all the numbers in the population and N is the number of numbers in the population.
The formula for x̄ is essentially identical:
x̄ = Σx / n
where Σx is the sum of all the numbers in the sample and n is the number of numbers in the sample.
As an example, the mean of the numbers 1, 2, 3, 6, 8 is 20/5 = 4 regardless of whether the numbers constitute the entire population or just a sample from the population.
The Median
The median is also a frequently used measure of central tendency. The median is the midpoint of a distribution: the same number of scores is above the median as below it.
Computation of the Median: When there is an odd number of numbers, the median is simply the middle number. For example, the median of 2, 4, and 7 is 4. When there is an even number of numbers, the median is the mean of the two middle numbers. Thus, the median of the numbers 2, 4, 7, 12 is (4+7)/2 = 5.5.
The Mode
The mode is the most frequently occurring value.
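As an illustrative sketch, Python's standard statistics module (one tool among the Excel, SPSS, and Minitab options the course covers) computes all three measures directly; the numbers reuse the examples from the text:

```python
from statistics import mean, median, mode

data = [1, 2, 3, 6, 8]

# Mean: sum of the numbers divided by the number of numbers (20/5 = 4)
print(mean(data))

# Median: middle value of the ordered data (odd count)
print(median(data))

# Median with an even count: mean of the two middle numbers, (4 + 7)/2
print(median([2, 4, 7, 12]))

# Mode: the most frequently occurring value
print(mode([2, 2, 3, 5]))
```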
Measures of Variability
Variability refers to how "spread out" a group of scores is. There are four frequently used measures of variability: the range, the interquartile range, the variance, and the standard deviation.
The Range
The range is the simplest measure of variability to calculate, and one you have probably encountered many times in your life. The range is simply the highest score minus the lowest score.
The Interquartile Range: The interquartile range (IQR) is the range of the middle 50% of the scores in a distribution. It is computed as follows:
IQR = 75th percentile - 25th percentile
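As a sketch, the IQR can be computed with Python's statistics.quantiles; the dataset is invented, and the "inclusive" method matches the common linear-interpolation percentile definition (other conventions give slightly different cut points):

```python
from statistics import quantiles

data = list(range(1, 11))  # the scores 1 through 10

# n=4 requests quartiles; "inclusive" interpolates between data points
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # IQR = 75th percentile - 25th percentile
print(q1, q3, iqr)
```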
The Variance
Variability can also be defined in terms of how close the scores in the distribution are to the middle of the distribution. Using the mean as the measure of the middle of the distribution, the variance is defined as the average squared difference of the scores from the mean.
The formula for the variance is:
σ² = Σ(x - μ)² / N
where σ² is the variance, μ is the mean, and N is the number of observations.
If the variance in a sample is used to estimate the variance in a population, then the previous formula underestimates the variance and the following formula should be used:
s² = Σ(x - x̄)² / (n - 1)
where s² is the estimate of the variance and x̄ is the sample mean. Note that x̄ is the mean of a sample taken from a population with a mean of μ. Since, in practice, the variance is usually computed for a sample, this formula is most often used.
Let's take a concrete example. Assume the scores 1, 2, 4, and 5 were sampled from a larger population. To estimate the variance in the population you would compute s² as follows:
x̄ = (1 + 2 + 4 + 5)/4 = 12/4 = 3
s² = [(1 - 3)² + (2 - 3)² + (4 - 3)² + (5 - 3)²]/(4 - 1) = (4 + 1 + 1 + 4)/3 = 10/3 ≈ 3.333
The Standard Deviation
The standard deviation is simply the square root of the variance.
The Coefficient of Variation: an indicator of relative dispersion. It is calculated as the ratio of the standard deviation to the mean. It is usually expressed as a percentage and can be used to compare two or more sets of data measured in different units.
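The worked example above (scores 1, 2, 4, 5) can be checked with Python's standard statistics module, which distinguishes the population formula (divide by N) from the sample formula (divide by n - 1); the coefficient of variation is added as a final step:

```python
from statistics import mean, pvariance, pstdev, variance, stdev

scores = [1, 2, 4, 5]

print(pvariance(scores))  # population formula, divide by N: 10/4 = 2.5
print(variance(scores))   # sample formula, divide by n - 1: 10/3 ≈ 3.333
print(stdev(scores))      # standard deviation = square root of the variance

# Coefficient of variation: standard deviation relative to the mean,
# expressed as a percentage
cv = stdev(scores) / mean(scores) * 100
print(round(cv, 1))
```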
Measures of Shape
Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point.
Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean rather than a sharp peak. A uniform distribution would be the extreme case.
Proportion
A number that describes the frequency of observations in a particular category as a fraction of all observations made.
After collecting data, the most important task is the effective presentation of data. This task is particularly crucial when the data collection is large. No human mind is capable of grasping the meaning of any considerable quantity of data unless their mass is somehow reduced to relatively few convenient categories or is condensed with the help of some kind of visual aid.
The first step in drawing a frequency distribution is to construct a frequency table. A frequency table is a way of organizing the data by listing every possible score (including those not actually obtained in the sample) as a column of numbers and the frequency of occurrence of each score as another. Computing the frequency of a score is simply a matter of counting the number of times that score appears in the set of data.
The frequency of a particular data value is the number of times the data value occurs. For example, if four students have a score of 80 in mathematics, then the score of 80 is said to have a frequency of 4.
A frequency table is constructed by arranging collected data values in ascending order of magnitude with their corresponding frequencies.
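As a sketch with invented marks, such a frequency table can be built with the standard library's Counter:

```python
from collections import Counter

# Hypothetical mathematics scores; 80 occurs four times,
# so it has a frequency of 4
scores = [80, 75, 80, 90, 75, 80, 85, 80]
freq = Counter(scores)

# List the data values in ascending order with their frequencies
for value in sorted(freq):
    print(value, freq[value])
```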
The information contained in the frequency table may be transformed to a graphical or pictorial form. No information is gained or lost in this transformation, but the human information processing system often finds the graphical or pictorial presentation easier to comprehend.
A histogram is drawn by plotting the scores (midpoints) on the X-axis and the frequencies on the Y-axis. A bar is drawn for each score value, the width of the bar corresponding to the real limits of the interval and the height corresponding to the frequency of the occurrence of the score value.
Bar charts can be used to illustrate the frequencies of different categories. If the data are nominal (categorical) in form, the graph is similar to a histogram except that the bars do not touch, forming a bar chart.
A line graph is a bar graph with the tops of the bars represented by points joined by lines (the rest of the bar is suppressed).
Line graphs are appropriate only when both the X- and Y-axes display ordered (rather than qualitative) variables. Although bar graphs can also be used in this situation, line graphs are generally better at comparing changes over time.
Pie Chart is a special chart that uses "pie slices" to show relative sizes of data.
A stem and leaf display is a graphical method of displaying data. It is particularly useful when your data are not too numerous. One purpose of a stem and leaf display is to clarify the shape of the distribution. There is a variation of stem and leaf displays that is useful for comparing distributions.
Whether your data can be suitably represented by a stem and leaf graph depends on whether they can be rounded without loss of important information.
Box plots are useful for identifying outliers and for comparing distributions. There are several steps in constructing a box plot. The first relies on the 25th, 50th, and 75th percentiles in the distribution of scores. For a data set, we draw a box extending from the 25th percentile to the 75th percentile. The 50th percentile is drawn inside the box.
Therefore,
the bottom of each box is the 25th percentile,
the top is the 75th percentile,
and the line in the middle is the 50th percentile.
Continuing with the box plots, we put "whiskers" above and below each box to give additional information about the spread of the data. Whiskers are vertical lines that end in a horizontal stroke. Whiskers are drawn from the upper and lower hinges to the upper and lower adjacent values.
Although we don't draw whiskers all the way to outside or far out values, we still wish to represent them in our box plots. This is achieved by adding additional marks beyond the whiskers. Specifically, outside values are indicated by small "o's" and far out values are indicated by asterisks (*).
Box plots provide basic information about a distribution. For example, a distribution with a positive skew would have a longer whisker in the positive direction than in the negative direction. A larger mean than median would also indicate a positive skew. Box plots are good at portraying extreme values and are especially good at showing differences between distributions. However, many of the details of a distribution are not revealed in a box plot, and to examine these details one should create a histogram and/or a stem and leaf display.
The normal distribution is the most important and most widely used distribution in statistics. It is sometimes called the "bell curve," although the tonal qualities of such a bell would be less than pleasing. It is also called the "Gaussian curve" after the mathematician Carl Friedrich Gauss. Eight features of normal distributions are listed below.
Normal distributions are symmetric around their mean.
The mean, median, and mode of a normal distribution are equal.
The area under the normal curve is equal to 1.0.
Normal distributions are denser in the center and less dense in the tails.
Normal distributions are defined by two parameters, the mean (Îź) and the standard deviation (Ď).
68% of the area of a normal distribution is within one standard deviation of the mean.
Approximately 95% of the area of a normal distribution is within two standard deviations of the mean.
Approximately 99.7% of the area of a normal distribution is within three standard deviations of the mean.
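These area figures can be checked numerically; Python's statistics.NormalDist (available since Python 3.8) is used here purely as an illustration:

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, standard deviation 1

def area_within(k):
    """Area under the normal curve within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

print(round(area_within(1), 4))  # about 68% within one standard deviation
print(round(area_within(2), 4))  # about 95% within two
print(round(area_within(3), 4))  # about 99.7% within three
```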
The central limit theorem states that the sampling distribution of any statistic will be normal or nearly normal, if the sample size is large enough. Generally, a sample size is considered "large enough" if any of the following conditions apply.
The population distribution is normal.
The sample distribution is roughly symmetric, unimodal, without outliers, and the sample size is 15 or less.
The sample distribution is moderately skewed, unimodal, without outliers, and sample size is between 16 and 30.
The sample size is greater than 30, without outliers.
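A small simulation sketches the theorem (the sample size, number of samples, and seed are arbitrary choices): means of repeated samples from a uniform(0, 1) population, which is itself far from normal, cluster around the population mean of 0.5 with a spread close to σ/√n:

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch is reproducible

# Draw 1000 samples of size 30 from a uniform(0, 1) population (mean 0.5)
sample_means = [mean(random.random() for _ in range(30))
                for _ in range(1000)]

# The sampling distribution of the mean is centered near 0.5; its spread
# is roughly sigma/sqrt(n) = 0.2887/sqrt(30), about 0.053
print(round(mean(sample_means), 2))
print(round(stdev(sample_means), 3))
```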
An assessment of the normality of data is a prerequisite for many statistical tests because normal data is an underlying assumption in parametric testing. There are two main methods of assessing normality: graphically and numerically.
Methods of assessing normality
SPSS Statistics allows you to run all of these procedures within the Explore... command. The Explore... command can be used on its own if you are testing normality in one group, or with your dataset split into two or more groups.
SPSS produces a table that presents two well-known tests of normality, namely the Kolmogorov-Smirnov Test and the Shapiro-Wilk Test. The Shapiro-Wilk Test is more appropriate for small sample sizes (< 50), but can also handle sample sizes as large as 2000.
Normal Q-Q Plot
In order to determine normality graphically, we can use the output of a normal Q-Q Plot. If the data are normally distributed, the data points will be close to the diagonal line. If the data points stray from the line in an obvious non-linear fashion, the data are not normally distributed.
Confidence Interval
A confidence interval gives an estimated range of values which is likely to include an unknown population parameter, the estimated range being calculated from a given set of sample data.
If independent samples are taken repeatedly from the same population, and a confidence interval calculated for each sample, then a certain percentage (confidence level) of the intervals will include the unknown population parameter. Confidence intervals are usually calculated so that this percentage is 95%, but we can produce 90%, 99%, 99.9% (or whatever) confidence intervals for the unknown parameter.
The width of the confidence interval gives us some idea about how uncertain we are about the unknown parameter.
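As a hedged sketch with invented measurements, a 95% confidence interval for a mean can be computed using the normal critical value; for a sample this small a t critical value would strictly be more appropriate:

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7]  # hypothetical data
n = len(data)
x_bar = mean(data)
se = stdev(data) / sqrt(n)  # standard error of the mean

# Critical value for a 95% interval: about 1.96
z = NormalDist().inv_cdf(0.975)

lower, upper = x_bar - z * se, x_bar + z * se
print(round(lower, 2), round(upper, 2))
```

A wider interval (e.g. 99%) uses a larger critical value, which directly illustrates how higher confidence trades off against a less precise estimate.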
Hypothesis Testing
Setting up and testing hypotheses is an essential part of statistical inference. In order to formulate such a test, usually some theory has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved, for example, claiming that a new drug is better than the current drug for treatment of the same symptoms.
In each problem considered, the question of interest is simplified into two competing claims / hypotheses between which we have a choice; the null hypothesis, denoted H0, against the alternative hypothesis, denoted H1. These two competing claims / hypotheses are not however treated on an equal basis: special consideration is given to the null hypothesis.
We have two common situations:
1. The experiment has been carried out in an attempt to disprove or reject a particular hypothesis, the null hypothesis, thus we give that one priority so it cannot be rejected unless the evidence against it is sufficiently strong. For example,
H0: there is no difference in taste between coke and diet coke
against
H1: there is a difference.
2. If one of the two hypotheses is 'simpler', we give it priority so that a more 'complicated' theory is not adopted unless there is sufficient evidence against the simpler one. For example, it is 'simpler' to claim that there is no difference in flavor between coke and diet coke than it is to say that there is a difference.
The hypotheses are often statements about population parameters like expected value and variance; for example H0 might be that the expected value of the height of ten year old boys in the Scottish population is not different from that of ten year old girls. A hypothesis might also be a statement about the distributional form of a characteristic of interest, for example that the height of ten year old boys is normally distributed within the Scottish population.
The outcome of a hypothesis test is "Reject H0 in favor of H1" or "Do not reject H0".
Hypothesis tests may be performed on contingency tables in order to decide whether or not effects are present. Effects in a contingency table are defined as relationships between the row and column variables; that is, are the levels of the row variable differentially distributed over levels of the column variables. Significance in this hypothesis test means that interpretation of the cell frequencies is warranted. Non-significance means that any differences in cell frequencies could be explained by chance. Hypothesis tests on contingency tables are based on a statistic called Chi-square.
REVIEW OF CONTINGENCY TABLES
Frequency tables of two variables presented simultaneously are called contingency tables. Contingency tables are constructed by listing all the levels of one variable as rows in a table and the levels of the other variable as columns, then finding the joint or cell frequency for each cell. The cell frequencies are then summed across both rows and columns. The sums are placed in the margins, the values of which are called marginal frequencies. The lower right hand corner value contains the sum of either the row or column marginal frequencies, which both must be equal to N.
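The Chi-square statistic compares each observed cell frequency with the frequency expected if the row and column variables were independent (expected = row total × column total / N). A minimal sketch with invented counts:

```python
def chi_square(table):
    """Chi-square statistic for a two-way contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table: rows = two groups, columns = outcome yes/no
table = [[10, 20],
         [30, 40]]
print(round(chi_square(table), 3))
```

The statistic would then be compared against a Chi-square distribution with (rows - 1) × (columns - 1) degrees of freedom to decide significance.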
How to Read a Scatterplot
A scatterplot consists of an X axis (the horizontal axis), a Y axis (the vertical axis), and a series of dots. Each dot on the scatterplot represents one observation from a data set. The position of the dot on the scatterplot represents its X and Y values.
Correlation coefficients measure the strength of association between two variables. The most common correlation coefficient, called the Pearson product-moment correlation coefficient, measures the strength of the linear association between variables.
The sign and the absolute value of a Pearson correlation coefficient describe the direction and the magnitude of the relationship between two variables.
The value of a correlation coefficient ranges between -1 and 1.
The greater the absolute value of a correlation coefficient, the stronger the linear relationship.
The strongest linear relationship is indicated by a correlation coefficient of -1 or 1.
The weakest linear relationship is indicated by a correlation coefficient equal to 0.
A positive correlation means that if one variable gets bigger, the other variable tends to get bigger.
A negative correlation means that if one variable gets bigger, the other variable tends to get smaller.
Keep in mind that the Pearson correlation coefficient only measures linear relationships. Therefore, a correlation of 0 does not mean zero relationship between two variables; rather, it means zero linear relationship. (It is possible for two variables to have zero linear relationship and a strong curvilinear relationship at the same time.)
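The Pearson coefficient can be computed directly from its definition; the data below are invented and chosen so that perfect positive and negative linear relationships give +1 and -1:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

x = [1, 2, 3, 4, 5]
print(round(pearson_r(x, [2, 4, 6, 8, 10]), 6))   # perfect positive: ~ +1
print(round(pearson_r(x, [10, 8, 6, 4, 2]), 6))   # perfect negative: ~ -1
```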
In a cause and effect relationship, the independent variable is the cause, and the dependent variable is the effect. Least squares linear regression is a method for predicting the value of a dependent variable Y, based on the value of an independent variable X.
In this section, we focus on the case where there is only one independent variable. This is called simple regression (as opposed to multiple regression, which handles two or more independent variables).
Prerequisites for Regression
Simple linear regression is appropriate when the following conditions are satisfied.
The dependent variable Y has a linear relationship to the independent variable X. To check this, make sure that the XY scatterplot is linear and that the residual plot shows a random pattern.
For each value of X, the probability distribution of Y has the same standard deviation Ď. When this condition is satisfied, the variability of the residuals will be relatively constant across all values of X, which is easily checked in a residual plot.
For any given value of X,
The Y values are independent, as indicated by a random pattern on the residual plot.
The Y values are roughly normally distributed (i.e., symmetric and unimodal). A little skewness is ok if the sample size is large. A histogram or a dotplot will show the shape of the distribution.
The Least Squares Regression Line
Linear regression finds the straight line, called the least squares regression line or LSRL, that best represents observations in a bivariate data set. Suppose Y is a dependent variable, and X is an independent variable. The population regression line is:
Y = β0 + β1X
where β0 is a constant, β1 is the regression coefficient, X is the value of the independent variable, and Y is the value of the dependent variable.
Given a random sample of observations, the population regression line is estimated by:
ŷ = b0 + b1x
where b0 is a constant, b1 is the regression coefficient, x is the value of the independent variable, and ŷ is the predicted value of the dependent variable.
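The sample estimates have closed forms: b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b0 = ȳ - b1·x̄. A sketch with invented data:

```python
def least_squares(xs, ys):
    """Slope b1 and intercept b0 of the least squares regression line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4, 5]      # hypothetical independent variable
y = [2, 3, 5, 4, 6]      # hypothetical dependent variable
b0, b1 = least_squares(x, y)
print(round(b0, 2), round(b1, 2))   # intercept ~ 1.3, slope ~ 0.9

# Predicted values from the fitted line
predicted = [b0 + b1 * xi for xi in x]
```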
The Coefficient of Determination
The coefficient of determination (denoted by R²) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.
The coefficient of determination ranges from 0 to 1.
An R² of 0 means that the dependent variable cannot be predicted from the independent variable.
An R² of 1 means the dependent variable can be predicted without error from the independent variable.
An R² between 0 and 1 indicates the extent to which the dependent variable is predictable. An R² of 0.10 means that 10 percent of the variance in Y is predictable from X; an R² of 0.20 means that 20 percent is predictable; and so on.
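In simple regression, R² equals the square of the Pearson correlation coefficient. The sketch below (invented data) computes it both ways and gets the same value:

```python
from math import sqrt

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]
n = len(x)
mx, my = sum(x) / n, sum(y) / n

# Route 1: square of the Pearson correlation coefficient
cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
r = cov / sqrt(sum((xi - mx) ** 2 for xi in x)
               * sum((yi - my) ** 2 for yi in y))
print(round(r ** 2, 2))

# Route 2: one minus residual variation over total variation
b1 = cov / sum((xi - mx) ** 2 for xi in x)
b0 = my - b1 * mx
ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
ss_tot = sum((yi - my) ** 2 for yi in y)
print(round(1 - ss_res / ss_tot, 2))
```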
Standard Error
The standard error about the regression line (often denoted by SE) is a measure of the average amount by which the regression equation over- or under-predicts. The higher the coefficient of determination, the lower the standard error, and the more accurate predictions are likely to be.