White Paper
IBM Analytics
Foundational Methodology
for Data Science
In the domain of data science, solving problems and answering
questions through data analysis is standard practice. Often,
data scientists construct a model to predict outcomes or
discover underlying patterns, with the goal of gaining insights.
Organizations can then use these insights to take actions that
ideally improve future outcomes.
There are numerous rapidly evolving technologies for
analyzing data and building models. In a remarkably short
time, they have progressed from desktops to massively
parallel warehouses with huge data volumes and in-database
analytic functionality in relational databases and Apache
Hadoop. Text analytics on unstructured or semi-structured
data is becoming increasingly important as a way to
incorporate sentiment and other useful information from
text into predictive models, often leading to significant
improvements in model quality and accuracy.
Emerging analytics approaches seek to automate many of the
steps in model building and application, making machine-
learning technology more accessible to those who lack deep
quantitative skills. Also, in contrast to the “top-down” approach
of first defining the business problem and then analyzing
the data to find a solution, some data scientists may use a
“bottom-up” approach. With the latter, the data scientist looks
into large volumes of data to see what business goal might be
suggested by the data and then tackles that problem. Since
most problems are addressed in a top-down manner, the
methodology in this paper reflects that view.
A 10-stage data science methodology that
spans technologies and approaches
As data analytics capabilities become more accessible and
prevalent, data scientists need a foundational methodology
capable of providing a guiding strategy, regardless of the
technologies, data volumes or approaches involved (see
Figure 1). This methodology bears some similarities to
recognized methodologies for data mining (references 1-5),
but it emphasizes
several of the new practices in data science such as the use of
very large data volumes, the incorporation of text analytics into
predictive modeling and the automation of some processes.
The methodology consists of 10 stages that form an iterative
process for using data to uncover insights. Each stage plays a
vital role in the context of the overall methodology.
What is a methodology?
A methodology is a general strategy that guides the
processes and activities within a given domain.
Methodology does not depend on particular
technologies or tools, nor is it a set of techniques
or recipes. Rather, a methodology provides the data
scientist with a framework for how to proceed with
whatever methods, processes and heuristics will be
used to obtain answers or results.
Stage 1: Business understanding
Every project starts with business understanding. The business
sponsors who need the analytic solution play the most critical
role in this stage by defining the problem, project objectives
and solution requirements from a business perspective. This
first stage lays the foundation for a successful resolution of the
business problem. To help ensure the project’s success, the
sponsors should be involved throughout the project to provide
domain expertise, review intermediate findings and ensure the
work remains on track to generate the intended solution.
Figure 1. Foundational Methodology for Data Science.
Stage 2: Analytic approach
Once the business problem has been clearly stated, the
data scientist can define the analytic approach to solving
the problem. This stage entails expressing the problem
in the context of statistical and machine-learning techniques,
so the organization can identify the most suitable ones for
the desired outcome. For example, if the goal is to predict
a response such as “yes” or “no,” then the analytic approach
could be defined as building, testing and implementing a
classification model.
[Figure 1 stages, in order: Business understanding, Analytic approach, Data requirements, Data collection, Data understanding, Data preparation, Modeling, Evaluation, Deployment, Feedback.]
Stage 3: Data requirements
The chosen analytic approach determines the data
requirements. Specifically, the analytic methods to be used
require certain data content, formats and representations,
guided by domain knowledge.
Stage 4: Data collection
In the initial data collection stage, data scientists identify and
gather the available data resources (structured, unstructured
and semi-structured) relevant to the problem domain.
Typically, they must choose whether to make additional
investments to obtain less-accessible data elements. It
may be best to defer the investment decision until more is
known about the data and the model. If there are gaps in
data collection, the data scientist may have to revise the data
requirements accordingly and collect new and/or more data.
While data sampling and subsetting are still important,
today’s high-performance platforms and in-database analytic
functionality let data scientists use much larger data sets
containing much or even all of the available data. By
incorporating more data, predictive models may be better
able to represent rare events such as disease incidence or
system failure.
Stage 5: Data understanding
After the original data collection, data scientists typically
use descriptive statistics and visualization techniques to
understand the data content, assess data quality and discover
initial insights about the data. Additional data collection may
be necessary to fill gaps.
Stage 6: Data preparation
This stage encompasses all activities to construct the data
set that will be used in the subsequent modeling stage. Data
preparation activities include data cleaning (dealing with
missing or invalid values, eliminating duplicates, formatting
properly), combining data from multiple sources (files, tables,
platforms) and transforming data into more useful variables.
In a process called feature engineering, data scientists can create
additional explanatory variables, also referred to as predictors
or features, through a combination of domain knowledge
and existing structured variables. When text data is available,
such as customer call center logs or physicians’ notes in
unstructured or semi-structured form, text analytics is useful in
deriving new structured variables to enrich the set of predictors
and improve model accuracy.
Data preparation is usually the most time-consuming step in
a data science project. In many domains, some data preparation
steps are common across different problems. Automating
certain data preparation steps in advance may accelerate the
process by minimizing ad hoc preparation time. With today’s
high-performance, massively parallel systems and analytic
functionality residing where the data is stored, data scientists can
more easily and rapidly prepare data using very large data sets.
Stage 7: Modeling
Starting with the first version of the prepared data set, the
modeling stage focuses on developing predictive or descriptive
models according to the previously defined analytic approach.
With predictive models, data scientists use a training set
(historical data in which the outcome of interest is known)
to build the model. The modeling process is typically highly
iterative as organizations gain intermediate insights, leading to
refinements in data preparation and model specification. For
a given technique, data scientists may try multiple algorithms
with their respective parameters to find the best model for the
available variables.
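As an illustration only, the data preparation and modeling stages can be sketched in a few lines of Python. Everything here is an assumption made for the example rather than part of the methodology: the customer records, the variable names (visits, total_spend, spend_per_visit), median imputation for missing values, and the choice of a one-variable threshold rule as the candidate model. The sketch removes duplicates, fills a missing value, engineers one feature, and then tries every variable and threshold on the training set to keep the most accurate rule.

```python
from statistics import median

# Hypothetical raw records: (customer_id, visits, total_spend, responded).
# None marks a missing value; the last row duplicates the first.
raw = [
    (1, 4, 120.0, "yes"),
    (2, None, 80.0, "no"),
    (3, 5, 200.0, "yes"),
    (4, 2, 40.0, "no"),
    (5, 5, 150.0, "yes"),
    (6, 1, 30.0, "no"),
    (1, 4, 120.0, "yes"),
]


def prepare(records):
    """Data preparation: drop duplicates, impute missing visits, add a feature."""
    unique = list(dict.fromkeys(records))  # keeps the first copy, preserves order
    fill = median(v for _, v, _, _ in unique if v is not None)
    prepared = []
    for _, visits, spend, responded in unique:
        visits = fill if visits is None else visits
        prepared.append({
            "visits": visits,
            "total_spend": spend,
            "spend_per_visit": spend / visits,  # engineered feature
            "label": 1 if responded == "yes" else 0,
        })
    return prepared


def rule_accuracy(rows, feature, threshold):
    """Accuracy of the rule 'predict a response when feature > threshold'."""
    hits = sum((row[feature] > threshold) == bool(row["label"]) for row in rows)
    return hits / len(rows)


def fit_best_rule(rows, features):
    """Modeling: try every feature/threshold pair and keep the most accurate."""
    best = None
    for feature in features:
        for threshold in sorted({row[feature] for row in rows}):
            acc = rule_accuracy(rows, feature, threshold)
            if best is None or acc > best[2]:
                best = (feature, threshold, acc)
    return best


training_set = prepare(raw)
feature, threshold, accuracy = fit_best_rule(
    training_set, ["visits", "total_spend", "spend_per_visit"]
)
print(f"best rule: {feature} > {threshold} (training accuracy {accuracy:.2f})")
```

Accuracy measured on the training set tends to be optimistic, which is why the chosen model still has to be checked against data it has not seen.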
Stage 8: Evaluation
During model development and before deployment, the
data scientist evaluates the model to understand its quality
and ensure that it properly and fully addresses the business
problem. Model evaluation entails computing various
diagnostic measures and other outputs such as tables and
graphs, enabling the data scientist to interpret the model’s
quality and its efficacy in solving the problem. For a predictive
model, data scientists use a testing set, which is independent of
the training set but follows the same probability distribution
and has a known outcome. The testing set is used to evaluate
the model so it can be refined as needed. Sometimes the final
model is also applied to a validation set for a final assessment.
In addition, data scientists may apply statistical significance
tests to the model as further proof of its quality. This additional
proof may be instrumental in justifying model implementation
or taking actions when the stakes are high—such as an
expensive supplemental medical protocol or a critical airplane
flight system.
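The diagnostic measures mentioned above can be illustrated with a short Python sketch. The testing data, the 0/1 labels, the fitted rule (predict a response when spend exceeds 80) and the choice of accuracy, precision and recall as measures are all assumptions made for the example; a real project would choose measures that fit the business problem.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for 0/1 labels."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    tn = sum(a == 0 and p == 0 for a, p in pairs)
    fp = sum(a == 0 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == 0 for a, p in pairs)
    return tp, tn, fp, fn


def diagnostics(actual, predicted):
    """Summarize model quality on a set with known outcomes."""
    tp, tn, fp, fn = confusion_counts(actual, predicted)
    return {
        "accuracy": (tp + tn) / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical testing set: known outcomes the model never saw in training.
test_spend = [95.0, 60.0, 210.0, 85.0, 20.0, 130.0, 70.0, 90.0]
test_actual = [1, 0, 1, 0, 0, 1, 1, 0]

# Hypothetical fitted model: predict a response when spend exceeds 80.
test_predicted = [1 if spend > 80.0 else 0 for spend in test_spend]

report = diagnostics(test_actual, test_predicted)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

Reporting several measures side by side helps show whether a model that looks accurate overall is still missing the cases the business cares about.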
Stage 9: Deployment
Once a satisfactory model has been developed and is approved
by the business sponsors, it is deployed into the production
environment or a comparable test environment. Usually it
is deployed in a limited way until its performance has been
fully evaluated. Deployment may be as simple as generating a
report with recommendations, or as involved as embedding the
model in a complex workflow and scoring process managed by
a custom application. Deploying a model into an operational
business process usually involves additional groups, skills and
technologies from within the enterprise. For example, a sales
group may deploy a response propensity model through a
campaign management process created by a development team
and administered by a marketing group.
Stage 10: Feedback
By collecting results from the implemented model, the
organization gets feedback on the model’s performance and
its impact on the environment in which it was deployed.
For example, feedback could take the form of response rates
to a promotional campaign targeting a group of customers
identified by the model as high-potential responders. Analyzing
this feedback enables data scientists to refine the model to
improve its accuracy and usefulness. They can automate
some or all of the feedback-gathering and model assessment,
refinement and redeployment steps to speed up the process of
model refreshing for better outcomes.
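One way to automate part of this feedback loop is sketched below in Python. The campaign outcomes, the expected response rate of 40 percent and the 5-point tolerance are hypothetical values chosen for the example; the methodology itself does not prescribe a threshold or a specific check.

```python
def response_rate(outcomes):
    """Share of contacted customers who responded (1) rather than not (0)."""
    return sum(outcomes) / len(outcomes)


def needs_refresh(expected_rate, observed_outcomes, tolerance=0.05):
    """Flag the model for refinement when the observed response rate falls
    more than `tolerance` below the rate expected at evaluation time."""
    return response_rate(observed_outcomes) < expected_rate - tolerance


# Hypothetical campaign results for customers the model scored as
# high-potential responders: 1 = responded, 0 = did not respond.
campaign_outcomes = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

observed = response_rate(campaign_outcomes)
refresh = needs_refresh(0.40, campaign_outcomes)
print(f"observed response rate: {observed:.2f}, refresh needed: {refresh}")
```

A check like this could run after each campaign, so that a drop in performance triggers the refinement and redeployment steps instead of going unnoticed.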
Providing ongoing value to
the organization
The flow of the methodology illustrates the iterative nature
of the problem-solving process. As data scientists learn
more about the data and the modeling, they frequently
return to a previous stage to make adjustments. Models are
not created once, deployed and left in place as is; instead,
through feedback, refinement and redeployment, models are
continually improved and adapted to evolving conditions. In
this way, both the model and the work behind it can provide
continuous value to the organization for as long as the solution
is needed.
© Copyright IBM Corporation 2015
IBM Analytics
Route 100
Somers, NY 10589
Produced in the United States of America
June 2015
IBM, the IBM logo, and ibm.com are trademarks of International Business
Machines Corp., registered in many jurisdictions worldwide. Other product
and service names might be trademarks of IBM or other companies. A current
list of IBM trademarks is available on the web at “Copyright and trademark
information” at ibm.com/legal/copytrade.shtml
This document is current as of the initial date of publication and may be
changed by IBM at any time. Not all offerings are available in every
country in which IBM operates.
THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS
IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED,
INCLUDING WITHOUT ANY WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE
AND ANY WARRANTY OR CONDITION OF NON-
INFRINGEMENT. IBM products are warranted according to the terms
and conditions of the agreements under which they are provided.
1. Brachman, R. & Anand, T., “The process of knowledge discovery in databases,” in Fayyad, U. et al., eds., Advances in knowledge discovery and data mining, AAAI Press, 1996 (pp. 37-57).
2. SAS Institute, http://en.wikipedia.org/wiki/SEMMA, www.sas.com/en_us/software/analytics/enterprise-miner.html, www.sas.com/en_gb/software/small-midsize-business/desktop-data-mining.html
3. Wikipedia, “Cross Industry Standard Process for Data Mining,” http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining, http://the-modeling-agency.com/crisp-dm.pdf
4. Ballard, C., Rollins, J., Ramos, J., Perkins, A., Hale, R., Dorneich, A., Milner, E., and Chodagam, J., Dynamic Warehousing: Data Mining Made Easy, IBM Redbook SG24-7418-00 (Sep. 2007), pp. 9-26.
5. Piatetsky, G., “CRISP-DM, still the top methodology for analytics, data mining, or data science projects,” Oct. 28, 2014, www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html
IMW14828-USEN-00
For more information
A new course on the Foundational Data Science Methodology
is available through Big Data University. The free online
course is available at:
http://bigdatauniversity.com/bdu-wp/bdu-course/data-science-methodology
For working examples of how this methodology has been
implemented in actual use cases, visit:
• http://ibm.co/1SUhxFm
• http://ibm.co/1IazTvG
Acknowledgements
Thanks to Michael Haide, Michael Wurst, Ph.D., Brandon
MacKenzie and Gregory Rodd for their helpful comments
and to Jo A. Ramos for his role in the development of this
methodology over our years of collaboration.
About the author
John B. Rollins, Ph.D., is a data scientist in the IBM Analytics
organization. His background is in engineering, data mining and
econometrics across many industries. He holds seven patents
and has authored a best-selling engineering textbook and many
technical papers. He holds doctoral degrees in petroleum
engineering and economics from Texas A&M University.

More Related Content

What's hot

Bab 6 (staddas ukuran keruncingan)
Bab 6 (staddas ukuran keruncingan)Bab 6 (staddas ukuran keruncingan)
Bab 6 (staddas ukuran keruncingan)fatria anggita
 
Algoritma pencarian lintasan jalur terpendek
Algoritma pencarian lintasan jalur terpendekAlgoritma pencarian lintasan jalur terpendek
Algoritma pencarian lintasan jalur terpendekLaili Wahyunita
 
Bab 4 operasi-operasi dasar pengolahan citra dijital
Bab 4 operasi-operasi dasar pengolahan citra dijitalBab 4 operasi-operasi dasar pengolahan citra dijital
Bab 4 operasi-operasi dasar pengolahan citra dijitalSyafrizal
 
Transformasi Linear ( Aljabar Linear Elementer )
Transformasi Linear ( Aljabar Linear Elementer )Transformasi Linear ( Aljabar Linear Elementer )
Transformasi Linear ( Aljabar Linear Elementer )Kelinci Coklat
 
Analisis Algoritma - Strategi Algoritma Dynamic Programming
Analisis Algoritma - Strategi Algoritma Dynamic ProgrammingAnalisis Algoritma - Strategi Algoritma Dynamic Programming
Analisis Algoritma - Strategi Algoritma Dynamic ProgrammingAdam Mukharil Bachtiar
 
Metode Transportasi (Masalah dalam Metode Transportasi)
Metode Transportasi (Masalah dalam Metode Transportasi)Metode Transportasi (Masalah dalam Metode Transportasi)
Metode Transportasi (Masalah dalam Metode Transportasi)hazhiyah
 
Desain dan analisis algoritma
Desain dan analisis algoritmaDesain dan analisis algoritma
Desain dan analisis algoritmaDiki Rosandy
 
Uji proporsi satu populasi dan dua populasi
Uji proporsi satu populasi dan dua populasiUji proporsi satu populasi dan dua populasi
Uji proporsi satu populasi dan dua populasiRosmaiyadi Snt
 
Bab iv pemusatan dan penyebaran data
Bab iv pemusatan dan penyebaran dataBab iv pemusatan dan penyebaran data
Bab iv pemusatan dan penyebaran datalinda_rosalina
 
Analisis Sensitivitas
Analisis SensitivitasAnalisis Sensitivitas
Analisis SensitivitasAde Nurlaila
 
Proses Data Mining
Proses Data MiningProses Data Mining
Proses Data Miningdedidarwis
 
nilai eigen dan vektor eigen
nilai eigen dan vektor eigennilai eigen dan vektor eigen
nilai eigen dan vektor eigenelmabb
 
Contoh soal dan penyelesaian metode biseksi
Contoh soal dan penyelesaian metode biseksiContoh soal dan penyelesaian metode biseksi
Contoh soal dan penyelesaian metode biseksimuhamadaulia3
 
Algoritma dan Struktur Data - Binary Search
Algoritma dan Struktur Data - Binary SearchAlgoritma dan Struktur Data - Binary Search
Algoritma dan Struktur Data - Binary SearchKuliahKita
 
Matematika Diskrit - 07 teori bilangan - 04
Matematika Diskrit - 07 teori bilangan - 04Matematika Diskrit - 07 teori bilangan - 04
Matematika Diskrit - 07 teori bilangan - 04KuliahKita
 
Data mining 5 klasifikasi decision tree dan random forest
Data mining 5   klasifikasi decision tree dan random forestData mining 5   klasifikasi decision tree dan random forest
Data mining 5 klasifikasi decision tree dan random forestIrwansyahSaputra1
 

What's hot (20)

Bab 6 (staddas ukuran keruncingan)
Bab 6 (staddas ukuran keruncingan)Bab 6 (staddas ukuran keruncingan)
Bab 6 (staddas ukuran keruncingan)
 
Algoritma pencarian lintasan jalur terpendek
Algoritma pencarian lintasan jalur terpendekAlgoritma pencarian lintasan jalur terpendek
Algoritma pencarian lintasan jalur terpendek
 
Bab 4 operasi-operasi dasar pengolahan citra dijital
Bab 4 operasi-operasi dasar pengolahan citra dijitalBab 4 operasi-operasi dasar pengolahan citra dijital
Bab 4 operasi-operasi dasar pengolahan citra dijital
 
Transformasi Linear ( Aljabar Linear Elementer )
Transformasi Linear ( Aljabar Linear Elementer )Transformasi Linear ( Aljabar Linear Elementer )
Transformasi Linear ( Aljabar Linear Elementer )
 
Rancangan Faktorial 2k
Rancangan Faktorial 2kRancangan Faktorial 2k
Rancangan Faktorial 2k
 
Analisis Algoritma - Strategi Algoritma Dynamic Programming
Analisis Algoritma - Strategi Algoritma Dynamic ProgrammingAnalisis Algoritma - Strategi Algoritma Dynamic Programming
Analisis Algoritma - Strategi Algoritma Dynamic Programming
 
Metode Transportasi (Masalah dalam Metode Transportasi)
Metode Transportasi (Masalah dalam Metode Transportasi)Metode Transportasi (Masalah dalam Metode Transportasi)
Metode Transportasi (Masalah dalam Metode Transportasi)
 
Model dan Simulasi
Model dan SimulasiModel dan Simulasi
Model dan Simulasi
 
Desain dan analisis algoritma
Desain dan analisis algoritmaDesain dan analisis algoritma
Desain dan analisis algoritma
 
Uji proporsi satu populasi dan dua populasi
Uji proporsi satu populasi dan dua populasiUji proporsi satu populasi dan dua populasi
Uji proporsi satu populasi dan dua populasi
 
Bab iv pemusatan dan penyebaran data
Bab iv pemusatan dan penyebaran dataBab iv pemusatan dan penyebaran data
Bab iv pemusatan dan penyebaran data
 
Analisis Sensitivitas
Analisis SensitivitasAnalisis Sensitivitas
Analisis Sensitivitas
 
Uas riset operasi (kevin surya)
Uas riset operasi (kevin surya)Uas riset operasi (kevin surya)
Uas riset operasi (kevin surya)
 
Proses Data Mining
Proses Data MiningProses Data Mining
Proses Data Mining
 
nilai eigen dan vektor eigen
nilai eigen dan vektor eigennilai eigen dan vektor eigen
nilai eigen dan vektor eigen
 
Contoh soal dan penyelesaian metode biseksi
Contoh soal dan penyelesaian metode biseksiContoh soal dan penyelesaian metode biseksi
Contoh soal dan penyelesaian metode biseksi
 
Bab 3-pros stok
Bab 3-pros stokBab 3-pros stok
Bab 3-pros stok
 
Algoritma dan Struktur Data - Binary Search
Algoritma dan Struktur Data - Binary SearchAlgoritma dan Struktur Data - Binary Search
Algoritma dan Struktur Data - Binary Search
 
Matematika Diskrit - 07 teori bilangan - 04
Matematika Diskrit - 07 teori bilangan - 04Matematika Diskrit - 07 teori bilangan - 04
Matematika Diskrit - 07 teori bilangan - 04
 
Data mining 5 klasifikasi decision tree dan random forest
Data mining 5   klasifikasi decision tree dan random forestData mining 5   klasifikasi decision tree dan random forest
Data mining 5 klasifikasi decision tree dan random forest
 

Viewers also liked

The data science handbook pre release (1)
The data science handbook   pre release (1)The data science handbook   pre release (1)
The data science handbook pre release (1)Lakshmi Prasanna
 
Step by Step guide to executing an analytics project
Step by Step guide to executing an analytics projectStep by Step guide to executing an analytics project
Step by Step guide to executing an analytics projectRamkumar Ravichandran
 
Fortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkFortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkBas Geerdink
 
Switching From Web Development to Data Science
Switching From Web Development to Data ScienceSwitching From Web Development to Data Science
Switching From Web Development to Data ScienceKarlijn Willems
 
Learn Data Science in 8 (Easy) Steps
Learn Data Science in 8 (Easy) StepsLearn Data Science in 8 (Easy) Steps
Learn Data Science in 8 (Easy) StepsKarlijn Willems
 
CRISP-DM: a data science project methodology
CRISP-DM: a data science project methodologyCRISP-DM: a data science project methodology
CRISP-DM: a data science project methodologySergey Shelpuk
 
Python For Data Science Cheat Sheet
Python For Data Science Cheat SheetPython For Data Science Cheat Sheet
Python For Data Science Cheat SheetKarlijn Willems
 

Viewers also liked (8)

The data science handbook pre release (1)
The data science handbook   pre release (1)The data science handbook   pre release (1)
The data science handbook pre release (1)
 
Step by Step guide to executing an analytics project
Step by Step guide to executing an analytics projectStep by Step guide to executing an analytics project
Step by Step guide to executing an analytics project
 
Agile Data Science
Agile Data ScienceAgile Data Science
Agile Data Science
 
Fortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache SparkFortune Teller API - Doing Data Science with Apache Spark
Fortune Teller API - Doing Data Science with Apache Spark
 
Switching From Web Development to Data Science
Switching From Web Development to Data ScienceSwitching From Web Development to Data Science
Switching From Web Development to Data Science
 
Learn Data Science in 8 (Easy) Steps
Learn Data Science in 8 (Easy) StepsLearn Data Science in 8 (Easy) Steps
Learn Data Science in 8 (Easy) Steps
 
CRISP-DM: a data science project methodology
CRISP-DM: a data science project methodologyCRISP-DM: a data science project methodology
CRISP-DM: a data science project methodology
 
Python For Data Science Cheat Sheet
Python For Data Science Cheat SheetPython For Data Science Cheat Sheet
Python For Data Science Cheat Sheet
 

Similar to Foundational Methodology for Data Science

MODULE 1_Introduction to Data analytics and life cycle..pptx
MODULE 1_Introduction to Data analytics and life cycle..pptxMODULE 1_Introduction to Data analytics and life cycle..pptx
MODULE 1_Introduction to Data analytics and life cycle..pptxnikshaikh786
 
Data Science course in Hyderabad .
Data Science course in Hyderabad            .Data Science course in Hyderabad            .
Data Science course in Hyderabad .rajasrichalamala3zen
 
Data Science course in Hyderabad .
Data Science course in Hyderabad         .Data Science course in Hyderabad         .
Data Science course in Hyderabad .rajasrichalamala3zen
 
data science course training in Hyderabad
data science course training in Hyderabaddata science course training in Hyderabad
data science course training in Hyderabadmadhupriya3zen
 
data science course training in Hyderabad
data science course training in Hyderabaddata science course training in Hyderabad
data science course training in Hyderabadmadhupriya3zen
 
best data science course institutes in Hyderabad
best data science course institutes in Hyderabadbest data science course institutes in Hyderabad
best data science course institutes in Hyderabadrajasrichalamala3zen
 
data science course in Hyderabad data science course in Hyderabad
data science course in Hyderabad data science course in Hyderabaddata science course in Hyderabad data science course in Hyderabad
data science course in Hyderabad data science course in Hyderabadakhilamadupativibhin
 
Unit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptx
Unit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptxUnit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptx
Unit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptxtesfkeb
 
Selection of Articles Using Data Analytics for Behavioral Dissertation Resear...
Selection of Articles Using Data Analytics for Behavioral Dissertation Resear...Selection of Articles Using Data Analytics for Behavioral Dissertation Resear...
Selection of Articles Using Data Analytics for Behavioral Dissertation Resear...PhD Assistance
 
Data Science.pdf
Data Science.pdfData Science.pdf
Data Science.pdfWinduGata3
 
IT6010-BUSINESS-INTELLIGENCE-Question-Bank_watermark.pdf
IT6010-BUSINESS-INTELLIGENCE-Question-Bank_watermark.pdfIT6010-BUSINESS-INTELLIGENCE-Question-Bank_watermark.pdf
IT6010-BUSINESS-INTELLIGENCE-Question-Bank_watermark.pdfHemaSenthil5
 
Data Mining methodology
 Data Mining methodology  Data Mining methodology
Data Mining methodology rebeccatho
 
How can a data scientist expert solve real world problems?
How can a data scientist expert solve real world problems? How can a data scientist expert solve real world problems?
How can a data scientist expert solve real world problems? priyanka rajput
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxShanmugasundaram M
 
Credit card fraud detection using python machine learning
Credit card fraud detection using python machine learningCredit card fraud detection using python machine learning
Credit card fraud detection using python machine learningSandeep Garg
 
Top 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfTop 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfShaikSikindar1
 

Similar to Foundational Methodology for Data Science (20)

MODULE 1_Introduction to Data analytics and life cycle..pptx
MODULE 1_Introduction to Data analytics and life cycle..pptxMODULE 1_Introduction to Data analytics and life cycle..pptx
MODULE 1_Introduction to Data analytics and life cycle..pptx
 
Unit 2
Unit 2Unit 2
Unit 2
 
Data Science course in Hyderabad .
Data Science course in Hyderabad            .Data Science course in Hyderabad            .
Data Science course in Hyderabad .
 
Data Science course in Hyderabad .
Data Science course in Hyderabad         .Data Science course in Hyderabad         .
Data Science course in Hyderabad .
 
data science course training in Hyderabad
data science course training in Hyderabaddata science course training in Hyderabad
data science course training in Hyderabad
 
data science course training in Hyderabad
data science course training in Hyderabaddata science course training in Hyderabad
data science course training in Hyderabad
 
best data science course institutes in Hyderabad
best data science course institutes in Hyderabadbest data science course institutes in Hyderabad
best data science course institutes in Hyderabad
 
data science.pptx
data science.pptxdata science.pptx
data science.pptx
 
data science course in Hyderabad data science course in Hyderabad
data science course in Hyderabad data science course in Hyderabaddata science course in Hyderabad data science course in Hyderabad
data science course in Hyderabad data science course in Hyderabad
 
Unit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptx
Unit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptxUnit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptx
Unit_8_Data_processing,_analysis_and_presentation_and_Application (1).pptx
 
Selection of Articles Using Data Analytics for Behavioral Dissertation Resear...
Selection of Articles Using Data Analytics for Behavioral Dissertation Resear...Selection of Articles Using Data Analytics for Behavioral Dissertation Resear...
Selection of Articles Using Data Analytics for Behavioral Dissertation Resear...
 
Data Science.pdf
Data Science.pdfData Science.pdf
Data Science.pdf
 
IT6010-BUSINESS-INTELLIGENCE-Question-Bank_watermark.pdf
IT6010-BUSINESS-INTELLIGENCE-Question-Bank_watermark.pdfIT6010-BUSINESS-INTELLIGENCE-Question-Bank_watermark.pdf
IT6010-BUSINESS-INTELLIGENCE-Question-Bank_watermark.pdf
 
Data Mining methodology
 Data Mining methodology  Data Mining methodology
Data Mining methodology
 
How can a data scientist expert solve real world problems?
How can a data scientist expert solve real world problems? How can a data scientist expert solve real world problems?
How can a data scientist expert solve real world problems?
 
Data Analytics Life Cycle
Data Analytics Life CycleData Analytics Life Cycle
Data Analytics Life Cycle
 
KDD assignmnt data.docx
KDD assignmnt data.docxKDD assignmnt data.docx
KDD assignmnt data.docx
 
Self Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docxSelf Study Business Approach to DS_01022022.docx
Self Study Business Approach to DS_01022022.docx
 
Credit card fraud detection using python machine learning
Credit card fraud detection using python machine learningCredit card fraud detection using python machine learning
Credit card fraud detection using python machine learning
 
Top 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdfTop 30 Data Analyst Interview Questions.pdf
Top 30 Data Analyst Interview Questions.pdf
 

Foundational Methodology for Data Science

  • 1. White Paper IBM Analytics Foundational Methodology for Data Science
  • 2. 2 Foundational Methodology for Data Science In the domain of data science, solving problems and answering questions through data analysis is standard practice. Often, data scientists construct a model to predict outcomes or discover underlying patterns, with the goal of gaining insights. Organizations can then use these insights to take actions that ideally improve future outcomes. There are numerous rapidly evolving technologies for analyzing data and building models. In a remarkably short time, they have progressed from desktops to massively parallel warehouses with huge data volumes and in-database analytic functionality in relational databases and Apache Hadoop. Text analytics on unstructured or semi-structured data is becoming increasingly important as a way to incorporate sentiment and other useful information from text into predictive models, often leading to significant improvements in model quality and accuracy. Emerging analytics approaches seek to automate many of the steps in model building and application, making machine- learning technology more accessible to those who lack deep quantitative skills. Also, in contrast to the “top-down” approach of first defining the business problem and then analyzing the data to find a solution, some data scientists may use a “bottom-up” approach. With the latter, the data scientist looks into large volumes of data to see what business goal might be suggested by the data and then tackles that problem. Since most problems are addressed in a top-down manner, the methodology in this paper reflects that view. A 10-stage data science methodology that spans technologies and approaches As data analytics capabilities become more accessible and prevalent, data scientists need a foundational methodology capable of providing a guiding strategy, regardless of the technologies, data volumes or approaches involved (see Figure 1). 
This methodology bears some similarities to recognized methodologies1-5 for data mining, but it emphasizes several of the new practices in data science such as the use of very large data volumes, the incorporation of text analytics into predictive modeling and the automation of some processes. The methodology consists of 10 stages that form an iterative process for using data to uncover insights. Each stage plays a vital role in the context of the overall methodology. What is a methodology? A methodology is a general strategy that guides the processes and activities within a given domain. Methodology does not depend on particular technologies or tools, nor is it a set of techniques or recipes. Rather, a methodology provides the data scientist with a framework for how to proceed with whatever methods, processes and heuristics will be used to obtain answers or results.
  • 3. Stage 1: Business understanding

Every project starts with business understanding. The business sponsors who need the analytic solution play the most critical role in this stage by defining the problem, project objectives and solution requirements from a business perspective. This first stage lays the foundation for a successful resolution of the business problem. To help guarantee the project’s success, the sponsors should be involved throughout the project to provide domain expertise, review intermediate findings and ensure the work remains on track to generate the intended solution.

[Figure 1. Foundational Methodology for Data Science. Stages shown: Business understanding, Analytic approach, Data requirements, Data collection, Data understanding, Data preparation, Modeling, Evaluation, Deployment, Feedback.]

Stage 2: Analytic approach

Once the business problem has been clearly stated, the data scientist can define the analytic approach to solving the problem. This stage entails expressing the problem in the context of statistical and machine-learning techniques, so the organization can identify the most suitable ones for the desired outcome. For example, if the goal is to predict a response such as “yes” or “no,” then the analytic approach could be defined as building, testing and implementing a classification model.
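Stage 2's example of predicting a “yes” or “no” response can be made concrete with a small sketch. The code below is an illustration only, not the paper's method: it assumes a single numeric feature, uses a simple threshold rule (a decision stump) as the classification model, and the customer data, function names and feature are all hypothetical. It shows the "building" and "testing" steps on separate data sets.

```python
def fit_stump(rows):
    """Build step: choose the threshold on one numeric feature that classifies
    the most training rows correctly.

    rows is a list of (feature_value, label) pairs with label "yes" or "no".
    The learned rule is: predict "yes" when feature_value >= threshold.
    """
    candidates = sorted({x for x, _ in rows})
    best_threshold, best_correct = candidates[0], -1
    for t in candidates:
        correct = sum((x >= t) == (label == "yes") for x, label in rows)
        if correct > best_correct:  # ties keep the lower threshold
            best_threshold, best_correct = t, correct
    return best_threshold


def predict(threshold, x):
    """Apply the learned rule to a single feature value."""
    return "yes" if x >= threshold else "no"


def accuracy(threshold, rows):
    """Test step: share of rows whose predicted label matches the known label."""
    hits = sum(predict(threshold, x) == label for x, label in rows)
    return hits / len(rows)


# Hypothetical data: number of prior purchases and whether the customer responded.
train = [(0, "no"), (1, "no"), (2, "no"), (3, "yes"),
         (4, "no"), (5, "yes"), (6, "yes"), (8, "yes")]
test = [(1, "no"), (3, "no"), (4, "yes"), (7, "yes")]

threshold = fit_stump(train)
test_accuracy = accuracy(threshold, test)
print(f"threshold={threshold}, test accuracy={test_accuracy:.2f}")
```

In practice the model family would be chosen from the techniques suited to the business problem; the stump is used here only because it keeps the build and test steps visible in a few lines.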
  • 4. Stage 3: Data requirements

The chosen analytic approach determines the data requirements. Specifically, the analytic methods to be used require certain data content, formats and representations, guided by domain knowledge.

Stage 4: Data collection

In the initial data collection stage, data scientists identify and gather the available data resources (structured, unstructured and semi-structured) relevant to the problem domain. Typically, they must choose whether to make additional investments to obtain less-accessible data elements. It may be best to defer the investment decision until more is known about the data and the model. If there are gaps in data collection, the data scientist may have to revise the data requirements accordingly and collect new and/or more data.

While data sampling and subsetting are still important, today’s high-performance platforms and in-database analytic functionality let data scientists use much larger data sets containing much or even all of the available data. By incorporating more data, predictive models may be better able to represent rare events such as disease incidence or system failure.

Stage 5: Data understanding

After the original data collection, data scientists typically use descriptive statistics and visualization techniques to understand the data content, assess data quality and discover initial insights about the data. Additional data collection may be necessary to fill gaps.

Stage 6: Data preparation

This stage encompasses all activities to construct the data set that will be used in the subsequent modeling stage. Data preparation activities include data cleaning (dealing with missing or invalid values, eliminating duplicates, formatting properly), combining data from multiple sources (files, tables, platforms) and transforming data into more useful variables.

In a process called feature engineering, data scientists can create additional explanatory variables, also referred to as predictors or features, through a combination of domain knowledge and existing structured variables. When text data is available, such as customer call center logs or physicians’ notes in unstructured or semi-structured form, text analytics is useful in deriving new structured variables to enrich the set of predictors and improve model accuracy.

Data preparation is usually the most time-consuming step in a data science project. In many domains, some data preparation steps are common across different problems. Automating certain data preparation steps in advance may accelerate the process by minimizing ad hoc preparation time. With today’s high-performance, massively parallel systems and analytic functionality residing where the data is stored, data scientists can more easily and rapidly prepare data using very large data sets.

Stage 7: Modeling

Starting with the first version of the prepared data set, the modeling stage focuses on developing predictive or descriptive models according to the previously defined analytic approach. With predictive models, data scientists use a training set (historical data in which the outcome of interest is known) to build the model. The modeling process is typically highly iterative as organizations gain intermediate insights, leading to refinements in data preparation and model specification. For a given technique, data scientists may try multiple algorithms with their respective parameters to find the best model for the available variables.

  • 5. Stage 8: Evaluation

During model development and before deployment, the data scientist evaluates the model to understand its quality and ensure that it properly and fully addresses the business problem. Model evaluation entails computing various diagnostic measures and other outputs such as tables and graphs, enabling the data scientist to interpret the model’s quality and its efficacy in solving the problem.

For a predictive model, data scientists use a testing set, which is independent of the training set but follows the same probability distribution and has a known outcome. The testing set is used to evaluate the model so it can be refined as needed. Sometimes the final model is applied also to a validation set for a final assessment.

In addition, data scientists may assign statistical significance tests to the model as further proof of its quality. This additional proof may be instrumental in justifying model implementation or taking actions when the stakes are high, such as an expensive supplemental medical protocol or a critical airplane flight system.

Stage 9: Deployment

Once a satisfactory model has been developed and is approved by the business sponsors, it is deployed into the production environment or a comparable test environment. Usually it is deployed in a limited way until its performance has been fully evaluated. Deployment may be as simple as generating a report with recommendations, or as involved as embedding the model in a complex workflow and scoring process managed by a custom application. Deploying a model into an operational business process usually involves additional groups, skills and technologies from within the enterprise.
For example, a sales group may deploy a response propensity model through a campaign management process created by a development team and administered by a marketing group.

Stage 10: Feedback

By collecting results from the implemented model, the organization gets feedback on the model’s performance and its impact on the environment in which it was deployed. For example, feedback could take the form of response rates to a promotional campaign targeting a group of customers identified by the model as high-potential responders. Analyzing this feedback enables data scientists to refine the model to improve its accuracy and usefulness. They can automate some or all of the feedback-gathering and model assessment, refinement and redeployment steps to speed up the process of model refreshing for better outcomes.

Providing ongoing value to the organization

The flow of the methodology illustrates the iterative nature of the problem-solving process. As data scientists learn more about the data and the modeling, they frequently return to a previous stage to make adjustments. Models are not created once, deployed and left in place as is; instead, through feedback, refinement and redeployment, models are continually improved and adapted to evolving conditions. In this way, both the model and the work behind it can provide continuous value to the organization for as long as the solution is needed.
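The feedback loop described in Stage 10 can be sketched in a few lines. This is a rough illustration under stated assumptions, not a procedure taken from the paper: the campaign outcomes are made up, the comparison against a randomly contacted baseline group is one possible design, and the function names and the `min_lift` threshold of 1.5 are hypothetical.

```python
def response_rate(outcomes):
    """Fraction of contacted customers who responded; outcomes is a list of booleans."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def lift(targeted, baseline):
    """Response rate of model-targeted customers relative to a baseline group.

    Returns None when the baseline has no responders, because the ratio is undefined.
    """
    base = response_rate(baseline)
    return response_rate(targeted) / base if base > 0 else None


def needs_refinement(targeted, baseline, min_lift=1.5):
    """Flag the model for refinement when the observed lift falls below min_lift.

    min_lift is an illustrative threshold, not a value from the paper.
    """
    observed = lift(targeted, baseline)
    return observed is not None and observed < min_lift


# Hypothetical campaign results: True means the customer responded.
launch_targeted = [True] * 12 + [False] * 88  # 12 of 100 targeted customers responded
launch_baseline = [True] * 5 + [False] * 95   # 5 of 100 randomly contacted customers responded
later_targeted = [True] * 6 + [False] * 94    # a later campaign scored by the same model
later_baseline = [True] * 5 + [False] * 95

for name, targeted, baseline in [
    ("launch", launch_targeted, launch_baseline),
    ("later", later_targeted, later_baseline),
]:
    print(name, round(lift(targeted, baseline), 2), needs_refinement(targeted, baseline))
```

A check like this could run on a schedule after each campaign, which is one way to automate part of the feedback-gathering and model assessment steps; the refinement and redeployment themselves would return to the earlier stages.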
  • 6. For more information

A new course on the Foundational Data Science Methodology is available through Big Data University. The free online course is available at: http://bigdatauniversity.com/bdu-wp/bdu-course/data-science-methodology

For working examples of how this methodology has been implemented in actual use cases, visit:
• http://ibm.co/1SUhxFm
• http://ibm.co/1IazTvG

Acknowledgements

Thanks to Michael Haide, Michael Wurst, Ph.D., Brandon MacKenzie and Gregory Rodd for their helpful comments and to Jo A. Ramos for his role in the development of this methodology over our years of collaboration.

About the author

John B. Rollins, Ph.D., is a data scientist in the IBM Analytics organization. His background is in engineering, data mining and econometrics across many industries. He holds seven patents and has authored a best-selling engineering textbook and many technical papers. He holds doctoral degrees in petroleum engineering and economics from Texas A&M University.

© Copyright IBM Corporation 2015

IBM Analytics, Route 100, Somers, NY 10589

Produced in the United States of America, June 2015

IBM, the IBM logo, and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml

This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.

THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.

1 Brachman, R., Anand, T., “The process of knowledge discovery in databases,” in Fayyad, U. et al., eds., Advances in knowledge discovery and data mining, AAAI Press, 1996 (pp. 37-57)
2 SAS Institute, http://en.wikipedia.org/wiki/SEMMA, www.sas.com/en_us/software/analytics/enterprise-miner.html, www.sas.com/en_gb/software/small-midsize-business/desktop-data-mining.html
3 Wikipedia, “Cross Industry Standard Process for Data Mining,” http://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining, http://the-modeling-agency.com/crisp-dm.pdf
4 Ballard, C., Rollins, J., Ramos, J., Perkins, A., Hale, R., Dorneich, A., Milner, E., and Chodagam, J., Dynamic Warehousing: Data Mining Made Easy, IBM Redbook SG24-7418-00 (Sep. 2007), pp. 9-26
5 Piatetsky, G., “CRISP-DM, still the top methodology for analytics, data mining, or data science projects,” Oct. 28, 2014, www.kdnuggets.com/2014/10/crisp-dm-top-methodology-analytics-data-mining-data-science-projects.html

IMW14828-USEN-00