Decoding Patterns: Customer Churn Prediction Data Analysis Project
Various statistical software's in data analysis.
1. Role of Various Statistical
Software in Data Analysis
1
Dr. S. Selva Mani
First year post graduate student
Department of public health dentistry
Ragas dental college & hospital
3. INTRODUCTION
The emergence of statistical software has undoubtedly
contributed enormously to the development in research
studies in this 21st century.
The high premium placed on ICT by human beings,
researchers and organizations has undoubtedly made it a
major drive of every nation.
Statistical Software (SS) is a vital tool for research
analysis, data validation and findings
3
4. HISTORY
4
‘Over the course of history, different forms of data analysis
methods have been in existence. Initially, it was paper and pen and
later the advent of which computer has helped invention of punching
machines and later upgraded to simple calculator and complex
scientific calculator.
Nevertheless, pundits have revealed that statistical software is a
software program that makes the calculation and presentation of
statistics relatively easy.
Statistical software allows researchers to avoid routine
mathematical mistakes and produce accurate figures in their research
if they input all data correctly.
Development of statistical software allows academic researchers to
conduct more quantitative studies easily.
5. HISTORY 5
Many researchers, professionals, scientists and business
managers also can clearly present accurately prediction of
the future using statistical software.
Many proprietary and freeware statistical software
packages are available that are suitable for different
statistical analysis, depending on the user's needs:
Some of the Proprietary software are
SPSS, STATA, EVIEW, MINITAB, BMDP etc., and
Few among Open or free Statistical software are
R, EPI-INFO, CS-PRO, to mention but few.
6. FEATURES OF STATISTICAL
SOFTWARE 6
Statistical software has some common characteristics that
make it reliable and suitable for data analysis:
1. Data editor is in rows and columns which make it very
easy to enter numeric data.
2. There is availability of menu bar comprises drop-down
menu, quick analysis as well as brief user manual.
3. Statistical level of measurement is put into consideration
in data entry
4. They follow the initial steps in research project
7. FEATURES OF STATISTICAL SOFTWARE7
All data should be numeric, although it may not be all variables it is
not desirable to use letter or word (String variable) as data. This can be
achieved by recoding the letter or word (string data) into desirable
numeric and labeled appropriately.
Data exploration can be done to check for errors and other accuracy.
(a) Getting your data ready to enter into the software.
(b) Defining and labeling variable
(c) Entering data appropriately with each row containing each case
and each column as variable.
(d) Data checking and cleaning is possible
9. COMMON QUANTITAIVE DATA ANALYSIS
STATISTICAL SOFTWARE AND THEIR
APPLICATION
9
SPSS
STATA
SAS
MATLAB
MINITAB
EVIEW
Mathematica
10. COMMON QUALITATIVE DATA
ANALYSIS STATISTICAL SOFTWARE AND
THEIR APPLICATION
10
Atlas ti 6.0(www.atlasti.com)
Hyper RESEARCH 2.8(www.researchware.com)
Max QDA (www.maxqda.com )
The Ethnograph 5.08
QSRN6(www.qsrinternational.com)
QSR Nvivo (www.qsrinternational.com)
Weft QDA(www.pressure.to/qda)
Open code 3.4 (www8.umu.se)
11. MS EXCEL
11
MS-EXCEL -Microsoft Excel 2010 is one of the most popular
software applications worldwide and is part of the Microsoft Office
2010 productivity suite.
You can use Excel to analyze data,
for example, in accounts, budgets, billing and many other areas.
Excel allows you to explore the menu bar and the different tasks that
can be done with it.
You can work on sample spreadsheets doing basic math, adding and
deleting columns and rows, and preparing the worksheet for printing.
12. MS EXCEL
12
You can run your data visually to show trends,
patterns and comparisons between the data in a chart,
table or other template, and Excel performs most
general statistical analyses but weak in regression,
logistic regression, survival analysis, analysis of
variance, factor analysis, multivariate analysis.
13. G POWER 13
G*Power is a free-to use software used to calculate statistical
power.
The program offers the ability to calculate power for a wide variety
of statistical tests including t-tests, F-tests, and chi-square-tests, among
others.
Additionally, the user must determine which of the many contexts
this test is being used, such as a one-way ANOVA versus a multi-way
ANOVA.
G*Power has a built-in tool for determining effect size if it cannot
be estimated from prior literature or is not easily calculable.
16. Calculating sample size to estimate sensitivity and
specificity of a diagnostic tool with precision
16
CAGE is a screening tool for alcoholism validated for adult (age more
than 18years )population .A researcher wishes to estimate the sensitivity
of the tool in teenage population of age 15-19 years. The sensitivity and
specificity done in the same age group from the other population was
found to be 75%and 85% respectively .How much samples should be
collected to estimate the parameters With 95%confidence interval at
10%of margin error.
Sensitivity =75%(0.75) precision d=10%(0.10)
Specificity =85%(0.85) z value for C.I( Z=1.96)
Sensitivity =(z1-α)2*se*(1-se)/(d)2
Specificity =(z1-α)2*sp*(1-sp)/(d)2
17. N master
17nMaster is a software that incorporates sample size
calculation(STATA, EpiInfo, nQuery, etc.), in terms of contents, each of
use and the cost.
There is a growing recognition on the importance of sample size
calculation and power analysis in research.
When planning a study design, it is very imperative to consider how
many participants are needed, therefore, sample size must be
premeditated carefully to ensure that the research time, patient effort
and support costs invested are not wasted.
There are many standard statistical software and stand alone packages
available in public domain (internet) for performing sample
size/power calculations.
18. EPI INFO 18
Epi Info is public domain statistical software for epidemiology
developed by Centers for Disease Control and Prevention (CDC) in
Atlanta, Georgia (USA).
Epi Info has been in existence for over 20 years and is currently
available for Microsoft windows.
The program allows for electronic survey creation, data entry, and
analysis.
19. EPI INFO 19
Within the analysis module, analytic routines include t-tests,
ANOVA, nonparametric statistics, cross tabulations and stratification
with estimates of odds ratios, risk ratios, and risk differences,
logistic regression (conditional and unconditional), survival analysis
(Kaplan Meier and Cox proportional hazard), and analysis of
complex survey data.
The software is in the public domain, free, and can be downloaded
from http://www.cdc.gov/epiinfo
20. SPSS (STATISTICAL PACKAGE FOR THE
SOCIAL SCIENCES ) 20
SPSS is most widely used in social science disciplines and courses.
SPSS is the oldest software programs developed and made available
in 1960s and has been redeveloped over the years, the latest version is
SPSS 20.0 which was produced in August 2013.
Many sociologists, psychologists and social workers use this program
to enter their research data and formulate results.
Although social science uses SPSS more widely than other fields,
many find it easy to navigate with SPSS because it is a package that
many beginners enjoy due to its very easy to use nature.
21. SPSS 21SPSS has a "point and click" interface that allows you to use pull
down menus to select commands that you wish to perform.
Odusina (2011) disclosed that working with SPSS demand some
background knowledge of statistics. There are slight variations in the
difference version of SPSS e.g. version 10, 11, 12, 13, 14, 15, 16, etc.
SPSS assists the user in describing data, testing hypotheses and
looking for a correlation or relationship between one or more
variables.
SPSS is very suitable for most regression analysis and different
kinds of ANOVA (regression, logistic regression, survival analysis,
analysis of variance, factor analysis, multivariate analysis but not
suitable for time series analysis and multilevel regression analysis).
22. SPSS: USAGE
• Data can be typed in, copied and pasted from a text or Excel file,
imported through menu-driven prompts, or read in from a ASCII file
using Syntax editor
• New variables can be added to worksheet or created using formulas
• Datasets contain raw data and only one dataset can be active at a
time
• Can create and save macros and/or commands
• Output window displays output, including graphs
• Output can be copied and pasted into other documents
• Dataset and Outputs are saved separately
• Optional syntax window can read and run commands and can also
be saved separately
22
29. SPSS: PROS AND CONS
Pros
Commonly used in industry,
especially those that utilize
survey data
Easy-to-use menu-driven
software
Output and graphics are
clear and well-organized
Separate “Data” and
“Variable” tabs in data
worksheet make it easy to
switch from raw data to
variable information (labels,
codes, variable type, etc
CONS
Limited options for analyses
Can only analyze one data set at a
time
Not as much help available as some
other packages
Not as useful for:
Simulations
Data mining
Survival analysis
Sample size calculation/power
analysis
Advanced or complex modeling
Design of experiments
29
30. PSPP
30
The PSPP project (originally called "Fiasco") was born at
the end of the 1990s as a free software replacement for
SPSS, which was a data management and analysis tool, at
the time produced by SPSS Inc.
The nature of SPSS's proprietary licensing and the
presence of digital restrictions management motivated the
author to write an alternative which later became
functionally identical, but with permission for everyone to
copy, modify and share.
The latest version was released in April, 2014.
31. PSPP 31
This software provides a basic set of capabilities: frequencies, cross-
tabs comparison of means (t-test and one way ANOVA); linear
regression, logistic regression, and re-ordering data, non-parametric
tests, chi-square analysis and more.
A range of statistical graphs can be produced, such as histograms, pie
chart, Screen plots and np-chart.
PSPP can import Numeric and Open Document spreadsheets, posture
databases, comma-separated values .It can export files in the SPSS
'portable' and 'system' file formats.
33. STATA
33
STATA is a powerful statistical package with smart data-
management facilities, a wide array of up-to-date statistical
techniques, and an excellent system for producing publication-quality
graphs.
Stata latest version was produced June 24, 2013 which is a fast and
easy to use data management package.
Stata is available for Windows, Unix, and Mac computers. The
standard version is called Stata/IC (or Intercooled Stata) and can
handle up to 2,047 variables.
34. STATA 34
There is a special edition called Stata/SE that can handle up to
32,766 variables (and also allows longer string variables and larger
matrices), and a version for multicore/multiprocessor computers
called Stata/MP, which has the same limits but is substantially
faster.
STATA performs most general statistical analyses (regression,
logistic regression, survival analysis, analysis of variance, factor
analysis, multivariate analysis and time series analysis).
35. STATA: USAGE
• Data can be typed in, read in through code, copied and pasted
from a text or Excel file, or imported and converted from other
files (such as a .txt, etc.)
• Command window is used to write and run commands
• Review window displays previous analysis, which can be selected
to run again
• Project window displays all input and output, including graphs
• Store and edit data in the Data Editor, which can be saved on its
own
• Log will copy and automatically save the project for you (must
start and close log before and after the analyses you want to save)
35
41. STATA: PROS AND CONS
Pros:
Somewhat common in both
industry and academia
Somewhat flexible and
customizable
Contains up-to-date advanced
methods
Quality graphics
Cons:
Scripting programming
language
Can only analyze one data set
at a time
Does not work as well with
large data sets
Not as useful for:
Quality assessment and
improvement
Design of experiments
Optimization
41
42. STATA: OVERVIEW
• Utilizes both menu-driven selections and scripting commands
• Multiple versions available depending on needs (commercial,
educational, etc.)
• Extensive help documentation and technical support
• Contains both basic and advanced statistical methods
• Not case-sensitive language
• Common fields:
Economics
Sociology
Political science
42
43. MINITAB
43
MINITAB is statistical software used by educators,
students, scientists, business associates and researchers to
provide statistical software in a multitude of areas.
MINITAB, developed around 1990, and remains one of
the oldest statistical software programs available.
MINITAB has compatibility with PC, Macintosh, Linux
and all other major platforms.
As one of the easiest statistical software programs to use,
MINITAB remains a popular choice with those new
statistical software.
44. MINITAB 44
With drop-down menus and dialog boxes describing how
and what to do next, MINITAB persists as a popular choice
for teaching students about statistics and data analysis.
MINITAB primarily has a user base of educators using the
program to show students research methods and analysis in
college and graduate-level courses.
MINITAB performs most general statistical analyses
(regression, logistic regression, survival analysis, analysis of
variance, factor analysis, but has its weaknesses in general
linear model (GLM) and Multilevel regression).
45. MINITAB: USAGE
• Data can be typed in, copied and pasted from a text or Excel file,
or imported through menu-driven prompts
• New variables can be added to worksheet or created using
formulas
• Worksheets contain raw data and only one worksheet can be active
at a time
• Can create and save macros and/or commands
• Session window displays output
• Graphs and other visual charts are shown in individual windows
• Project manager contains outline that helps you to jump to
particular output
• Worksheet can be saved separately, but saving whole project will
save both worksheet and output
45
51. MINITAB: PROS AND CONS
Pros:
Commonly used in industry
and some academic settings
Easy-to-use menu-driven
software
Clear output and graphics with
some interactive features
Has an “Assistant” feature that
includes flow-charts and takes
users step-by-step to analyze
data properly
Used in most undergraduate
statistics courses; there are
example data sets included in
software
Cons:
Limited options for analyses
Can only analyze one data set
at a time
Does not work as well with
large data sets
Not as much help available as
some other packages
Not as useful for:
Simulations
Data warehousing
Multivariate analysis
Nonparametric methods
51
52. Minitab: Overview
• Menu-driven statistical software, but does have scripting language
available for typing commands or creating macros
• Used in most Six Sigma courses and workshops
• Help documentation located in software as well as online
• Used by many analysts to quantitatively make decisions
• Common fields:
Social science
Marketing
Education
Sociology
Manufacturing
52
53. Statistical Analysis System (SAS) 53
SAS - which has its latest version produced in December 2011 is a
package that many "power users" like because of its power and
programmability.
SAS is one of the packages that are difficult to learn.
To use SAS, you must write SAS programs that manipulate your
data and perform your data analyses.
If you make a mistake in a SAS program, it can be hard to see where
the errors occurred or how to correct it.
However, it can take a long time to learn and understand data
management in SAS than many other packages like SPSS or STATA
with simpler commands line.
54. SAS: USAGE
• Data can be read in through a command or imported through menu-
driven prompts
• Variables and functions can be created and renamed
• Multiple data sets can be handled at once and are stored in various
workspaces (“libraries”)
• Four types of commands: DATA step (read & edit data); Procedure
steps (run built-in functions); macros (create and run own function);
ODS statements (set output settings, styles, etc.)
• Editor window is used to write and save commands
• Log window reads commands and displays any errors or comments
• Output window displays some output created by commands
• Results viewer window displays most output, including graphs
• Can save only commands, only data, or whole project
54
61. SAS: PROS AND CONS
Pros:
Widely used in both
industry and academia
High-performance
architecture that supports
computationally-intensive
algorithms
Flexible and customizable
analyses and graphics
Cons:
Scripting programming
language
Expensive
Some versions are not
100% compatible
Not as useful for:
Simple analysis and
manipulation
61
62. SAS: Overview
• Major statistical software in many industries
• Multiple add-ons and extensions available, including integration of
SQL programming language and integration with JMP
• Extensive online help manuals and forums
• Used by many statisticians and computer scientists for data mining,
data analysis, and development of statistical methodology
• Offers various certifications, which many employers value highly
• Common fields:
Statistical science Agriculture Computer science
Quantitative finance Engineering
Sociology
Manufacturing
Pharmaceutical science
62
63. R & MATLAB 63
R & MATLAB – Stanford (2014) identified R and MATLAB as
the richest statistical systems by far.
They contain an impressive amount of libraries, which is growing
each day.
Even if a much desired specific model is not part of the standard
functionality, you can implement it yourself, because R and Matlab
are really programming languages with relatively simple syntaxes.
As "languages" they allow you to express any idea.
The question is whether you are a good writer or not.
64. R & MATLAB 64
On the flip side, Matlab has much better graphics, which you will
not be ashamed to put in a paper or a presentation.
MATLAB and R perform most general statistical analyses
(regression, logistic regression, survival analysis, analysis of variance,
factor analysis, multivariate analysis).
The greatest strengths of both are probably in its ANOVA, mixed
model analysis and users creative freedom in analysis
In terms of modern applied statistics tools, R libraries are
somewhat richer than those of Matlab.
Also R is free software.
65. R: USAGE
• Data can be read in through code or created
• Variables and functions can be created and renamed
• Multiple data sets can be handled at once
• Editor window is used to write and save commands
• Console window reads commands and displays output, which is
best saved by copying and pasting into a word processing document
• Graphs are outputted in separate window, which is overwritten for
each new graph unless otherwise indicated in commands
• Workspaces can be saved, meaning data sets and variables do not
need to be recreated (especially useful if data creation and
manipulation take a long time to run)
65
66. R: EXAMPLES
• Read in data set from a text file
• Create a variable
• Find online help
• Run a t-test
• Create a histogram
66
72. R: PROS AND CONS
Pros:
Widely used in both industry
and academia
Flexible and customizable
analyses and graphics
Great for:
Data manipulation,
editing, and coding
Data mining
Simulations
Survival analysis
Linear and nonlinear
model
Cons:
Scripting programming
language
Mediocre graphics
Not as useful for:
Graphical analysis
Data summary
Exploratory analysis
Quality assessment and
improvement
Design of experiments
72
73. R: Overview
• Free, open-source software; similar to S-plus
• Multiple add-ons and extensions available, including integration with
LaTeX ( a word processor) via R Studio, and Excel via RExcel
• Extensive online help manuals and forums
• Used by many statisticians and computer scientists for data mining,
data analysis, and development of statistical methodology
• Case-sensitive language
• Common fields:
Statistical science
Computational biology
Computer science
73
74. Econometric Views (EViews)
74
EViews is a statistical package for window, used mainly
for time-series oriented econometrics analysis.
It was developed by Quantitative Micro Software
(QMS) and now a part of IHS.
Version 1.0 was released in March 1994, and has
replaced MicroTSP. The current version of EViews is 8.0,
released in March 2013.
EViews can be used for general statistical analysis and
econometric analyses, such as cross-section and panel data
analysis and time series estimation and forecasting.
75. Econometric Views (EViews) 75
EViews relies heavily on a proprietary and
undocumented file format for data storage.
However, for input and output it supports
numerous formats, including databank format, Ms-
Excel format, SPSS/PSPP, DAP/SAS, STATA,
RATS, and TSP.
EViews can access ODBC databases .
79. Coding can be done manually.
Manual method of coding in qualitative data analysis is rightly
considered as labour - intensive, time-consuming and outdated.
Coding can also be done using Qualitative data analysis software
such as NVivo, Atlas ti 6.0, HyperRESEARCH 2.8, Max
QDA(Qualitative Data Analysis) software and others.
When choosing software for qualitative data analysis you need to
consider a wide range of factors such as the type and amount of
data you need to analyse, time required to master the software and
cost considerations.
79SOFTWARES
80. ATLAS TI 6.0 80
The purpose of ATLAS.ti is to help researchers uncover and systematically analyze
complex phenomena hidden in unstructured data(text, multimedia, geospatial).
The program provides tools that let the user locate, code, and annotate findings in
primary data material, to weigh and evaluate their importance, and to visualize the
often complex relations between them.
ATLAS.ti is used by researchers and practitioners in a wide variety of fields
including anthropology, arts, architecture, communication, criminology, economics,
educational sciences, engineering, ethnological studies, management studies, market
research, quality management, psychology and sociology.
81. Software can support analyses BUT does not “do”
analyses for you.
The Ethnograph 5.08
QSR N6 (www.qsrinternational.com)
Weft QDA (www.pressure.to/qda)
Open code 3.4 (www8.umu.se)
81
Ritchie J, Lewis J. Qualitative research practice. London: Sage Publications
82. HYPER RESEARCH 82
HyperRESEARCH is used by qualitative researchers in areas such as
health care, legal, sociology, anthropology, music, geography, geology,
education, theology, philosophy, history, market research, focus group
analysis and most other fields using qualitative research approaches.
HyperRESEARCH enables coding and retrieval of source material,
theory building, and analyses of your data. With its multimedia
capabilities, HyperRESEARCH allows you to work with text, graphics,
audio, and video sources.
HyperRESEARCH has been in use by qualitative researchers since it
was first introduced in 1991.
83. FEATURES
83Easy-to-use interface: offers the same intuitive case-based interface
on all supported platforms.
It provides a turnkey solution by employing a built-in help system
and several tutorials with step-by-step instructions.
With flexible organization of codes from any source to any case,
support for code frequencies and other code statistics,
HyperRESEARCH is well suited for mix-method approaches to
qualitative research.
The HyperRESEARCH study file format and all of the various
media types supported will work across operating systems.
Multi-Media capabilities: allows you to work with text, graphics,
audio, and video source material in many popular formats.
84. MAXQDA 84
MAXQDA is a software program designed for computer-
assisted qualitative and mixed methods data, text and multimedia
analysis in academic, scientific, and business institutions. It is
being developed and distributed by VERBI Software based in
Berlin, Germany.
MAXQDA is designed for the use in qualitative, quantitative
and mixed methods research. The emphasis on going beyond
qualitative research can be observed in the extensive attributes
function (called variables in the programme itself) and the ability
of the programme to deal relatively quickly with larger numbers of
interviews.
85. FEATURES
85Import of text documents, tables, audio, video, images,
twitter tweets, surveys
Data is stored in project file '(*.mx18)
Reading, editing and coding data
Settings links from one part of a document to another
Visualization options (number of codes in different
documents etc.)
Group Comparison
Analyse code combinations
Import and export demographic information (variables)
from and to SPSS and Excel
Import of web pages with the free browser add-on
MAXQDA
Analyse of responses to survey questions
86. CONCLUSION 86
It is the opinion of the researcher, after analyzing the field
data to a very large extent, that statistical software has
contributed immensely to social research especially in the
area of demographic and data analysis.
This was achieved by using a scientific approach to solve
fundamental problems in research which is data analysis.
Although some other factors contributed to the quality of
research work such as literature review, methodology and
findings, it is quite clear that the impact of statistical
software packages on analysis and findings of research
cannot be over estimated.
87. REFERENCES
87 Karen (2013) Choosing a Statistical Software Package or Two
Source:http://www.theanalysisfactor.com/choosing-statistical-software/ VISITED ON 7TH
April, 2014
McGann, C. (2009). Role of Statistical Software . eHow Contribution accessed by 3rd April,
2014
McCullough, B.D. (1999). "Econometric software reliability: EViews, LIMDEP, SHAZAM
and TSP". JournalofAppliedEconometrics 14 (2):191–202. HYPERLINK
"http://en.wikipedia.org/wiki/Digital_object_identifier" o "Digital object identifier" doi :
Odusina E.K. (2011)
Stanford P. (2013) Statistical & Financial Consulting
http://stanfordphd.com/Statistical_Software.html visited 6th April, 2014
STATA PRESS site (2014) Stata history and usage.visited 28th April, 2014.
Ann Lewins and Christina Silver: Using Software in Qualitative Research: A Step-by-Step
Guide, 2nd edition, 2014, Sage Publications, Los Angeles, London, New Delhi, Singapore
88. Evaluation : 88
1. Which year did IBM take over SPSS??
2. Expand BMDP &CS-PRO
3. Which software is most popular ??
4. What is the latest version of MINITAB??
5. Is PSPP has any acronym ??