Mrs. Dipali Meher
Modern College of Arts, Science and Commerce,
Ganeshkhind, Pune 411016
Data Mining : An Introduction
From Then till Now…
Bayes' Theorem (1763)
Regression (1805)
Turing (1936)
Neural Networks (1943)
Evolutionary Computation (1965)
Databases (1970)
Genetic Algorithms (1975)
KDD (1989)
Support Vector Machine (1992)
Data Science (2001)
Moneyball (2003)
Big Data
DBMS → RDBMS → Distributed DBMS → Data Mining
Data mining deals with the discovery of hidden knowledge, unexpected patterns, and new rules from large data sets.
Examples of information extracted using a query language
• List customers who use a credit card to purchase more than Rs. 10,000 worth of groceries
• List patients who had at least one heart attack
• List students who had at least one backlog
• List employees who have taken home loans
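The first request above is an ordinary, well-formed query. A minimal sketch using Python's built-in sqlite3 module is shown here; the table name `purchases`, its columns, and the sample rows are hypothetical and exist only to make the example runnable.

```python
import sqlite3

# Hypothetical in-memory table of purchases (not part of the original slides).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (customer TEXT, payment TEXT, category TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?, ?)",
    [
        ("Asha", "credit card", "groceries", 7000),
        ("Asha", "credit card", "groceries", 5000),
        ("Ravi", "cash", "groceries", 15000),
        ("Meera", "credit card", "groceries", 4000),
        ("Meera", "credit card", "electronics", 30000),
    ],
)

# Select ... From ... Where: the question and the shape of the answer
# are both known before the query runs.
rows = conn.execute(
    "SELECT customer FROM purchases "
    "WHERE payment = 'credit card' AND category = 'groceries' "
    "GROUP BY customer HAVING SUM(amount) > 10000 "
    "ORDER BY customer"
)
big_spenders = [row[0] for row in rows]
print(big_spenders)
```

The query returns only rows that already exist in the table; it does not discover a profile or a pattern.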
Examples of what data mining is used for
• Develop a general profile of credit card customers
• Determine patients whose lifestyle makes them prone to a heart attack in the near future
• Differentiate poor credit risk customers from good credit card customers
• Differentiate students who have backlogs in their academics
• Determine employees who have taken a loan for any purpose
Data mining differs from usual query processing in many ways:

Aspect | Query Processing | Data Mining
Query | Well formed, as Select… From… Where… | Not well formed; what is found is usually hidden
Data | Data from online transaction processing systems, generally in table format | Data integrated from various sources; huge amounts of data
Output | A subset of the database | Not only a subset but also analyzed results, expressed in terms of patterns
• Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
Data mining: a misnomer?
• Alternative names
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
Knowledge discovery in databases (KDD) is a multistep process of finding useful information and patterns in data, while data mining is one of the steps in KDD: the use of algorithms to extract patterns.
Steps of KDD
1. Selection
Data extraction: obtaining data from heterogeneous data sources such as databases, data warehouses, the World Wide Web, or other information repositories.
2. Preprocessing
Data cleaning: incomplete, noisy, and inconsistent data is cleaned. Missing data may be ignored or predicted; erroneous data may be deleted or corrected.
3. Transformation
Data integration: combines data from multiple sources into a coherent store. Data can be encoded in common formats, normalized, and reduced.
4. Data mining
Apply algorithms to the transformed data and extract patterns.
5. Pattern interpretation/evaluation
Pattern evaluation: evaluate the interestingness of the resulting patterns, or apply interestingness measures to filter the discovered patterns.
Knowledge presentation: present the mined knowledge; visualization techniques can be used.
Knowledge Discovery Process
KDD is the nontrivial extraction of implicit, previously unknown, and potentially useful knowledge from data.
Selection → Preprocessing → Transformation → Data Mining → Pattern Interpretation and Evaluation
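The five KDD steps can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the record layout (`city`, `amount`), the function names, the toy "pattern" (mean normalized amount per city), and the 0.5 interestingness threshold are all hypothetical and do not come from the slides.

```python
def select(sources):
    """Selection: gather records from several sources into one list."""
    return [record for source in sources for record in source]

def preprocess(records):
    """Preprocessing: ignore records whose amount is missing (None)."""
    return [r for r in records if r["amount"] is not None]

def transform(records):
    """Transformation: min-max normalize 'amount' into the range [0, 1]."""
    amounts = [r["amount"] for r in records]
    low, high = min(amounts), max(amounts)
    span = (high - low) or 1  # avoid division by zero if all values are equal
    return [{**r, "amount": (r["amount"] - low) / span} for r in records]

def mine(records):
    """Data mining: a toy pattern, the mean normalized amount per city."""
    by_city = {}
    for r in records:
        by_city.setdefault(r["city"], []).append(r["amount"])
    return {city: sum(v) / len(v) for city, v in by_city.items()}

def evaluate(patterns, min_value=0.5):
    """Evaluation: keep only patterns that pass an interestingness threshold."""
    return {city: value for city, value in patterns.items() if value >= min_value}

# Two hypothetical heterogeneous sources.
database = [{"city": "Pune", "amount": 100}, {"city": "Mumbai", "amount": None}]
web_log = [{"city": "Pune", "amount": 300}, {"city": "Mumbai", "amount": 100}]

patterns = evaluate(mine(transform(preprocess(select([database, web_log])))))
print(patterns)
```

Each function stands in for a whole family of techniques; in practice every step is far more involved than a one-line list comprehension.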
Visualization techniques
[Figure: Graph 1, sample charts over categories a to g]
• Graphical: bar charts, pie charts, histograms
• Icon-based: using colors and figures as icons
• Hierarchical: hierarchically dividing the display area
• Geometric: box plot, scatter plot
• Pixel-based: data shown as colored pixels
• Hybrid: a combination of the above approaches
Why Data Mining? Potential Applications
• Data analysis and decision support
  • Market analysis and management
    • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  • Risk analysis and management
    • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  • Fraud detection and detection of unusual patterns (outliers)
• Other applications
  • Text mining (news groups, email, documents) and Web mining
  • Stream data mining
  • Bioinformatics and bio-data analysis
Data mining algorithms: all algorithms attempt to fit a model that is closest to the data being examined.
The model is based on the analysis of the attributes of a training data set.
The model is then evaluated using a test data set.
A data model can be:
• Predictive: the model makes predictions about data values using results found from available data. It thus uses historical data to make predictions.
• Descriptive: the model identifies patterns or relationships in data. It finds the properties of existing data and does not predict new properties.
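The fit-then-evaluate cycle can be sketched with a very small predictive model. This is an assumption-laden illustration, not a method from the slides: the model is a single cut-off halfway between the two class means, and the marks, labels, and function names are hypothetical.

```python
def fit_threshold(train):
    """Fit: learn a cut-off halfway between the mean mark of each class."""
    passed = [x for x, label in train if label == "pass"]
    failed = [x for x, label in train if label == "fail"]
    return (sum(passed) / len(passed) + sum(failed) / len(failed)) / 2

def predict(threshold, x):
    """Predict a label for an unseen mark using the learned cut-off."""
    return "pass" if x >= threshold else "fail"

def accuracy(threshold, test):
    """Evaluate: fraction of test examples whose label is predicted correctly."""
    hits = sum(predict(threshold, x) == label for x, label in test)
    return hits / len(test)

# Historical (training) data and held-out (test) data, both hypothetical.
train = [(35, "fail"), (42, "fail"), (58, "pass"), (71, "pass")]
test = [(30, "fail"), (65, "pass"), (50, "fail")]

threshold = fit_threshold(train)
test_accuracy = accuracy(threshold, test)
print(threshold, test_accuracy)
```

Keeping the test set separate from the training set is what lets the evaluation say something about data the model has not seen.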
Data Mining
• Predictive: Classification, Regression, Time Series Analysis, Prediction
• Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery
Classification: maps data into predefined groups or classes.
It uses supervised learning.
The algorithm uses a learning phase to build a classifier from a training data set containing data attributes and their associated class labels.
Example: the result of a student, that is, the class in which the student's result will fall…
Pattern recognition is a type of classification in which an input pattern is classified into one of several classes based on its similarity to predefined classes.
Example: identifying terrorists among passengers. They are identified by basic patterns such as the distance between the eyes, size, and shape. These patterns are then compared with entries in the data to see whether any match is found.
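Classifying by similarity to predefined classes can be sketched as a nearest-prototype lookup. This is one simple way to do it, assuming each class is represented by a single prototype vector and that similarity means Euclidean distance; the class names and the two-feature vectors (marks in two subjects) are hypothetical.

```python
import math

def nearest_class(sample, prototypes):
    """Return the label whose prototype vector is closest to the sample."""
    return min(prototypes, key=lambda label: math.dist(sample, prototypes[label]))

# Hypothetical prototypes: typical marks in two subjects for each result class.
prototypes = {
    "first class": (75.0, 80.0),
    "second class": (55.0, 52.0),
    "fail": (25.0, 30.0),
}

label = nearest_class((58.0, 60.0), prototypes)
print(label)
```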
Grade Useful Heat Value(kcal/kg)
A >6200
B 5601 - 6200
C 4941 - 5600
D 4201 - 4940
E 3361 - 4200
F 2401 - 3360
G 1301 - 2400
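The grade table above can be read as a rule-based classifier into predefined classes. A minimal sketch, assuming integer heat values so that ">6200" means 6201 or more; the function and constant names are hypothetical, and values below 1301 are not covered by the table, so the sketch returns None for them.

```python
# Lower bound (kcal/kg) of each grade, taken from the table above.
GRADE_LOWER_BOUNDS = [
    ("A", 6201),
    ("B", 5601),
    ("C", 4941),
    ("D", 4201),
    ("E", 3361),
    ("F", 2401),
    ("G", 1301),
]

def grade_for(useful_heat_value):
    """Map an integer useful heat value to its grade, or None if below the table."""
    for grade, lower_bound in GRADE_LOWER_BOUNDS:
        if useful_heat_value >= lower_bound:
            return grade
    return None

print(grade_for(5000))
```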
Regression: maps data into a real-valued prediction variable.
The algorithm tries to find the best function (linear or non-linear) that fits the training data. It assumes that the target data fits some function.
Example: a college professor determines his retirement plan based on current savings and income. If the professor wants to save more, he must alter his expenses, guided by a simple linear regression formula.
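Simple linear regression fits a line y = slope * x + intercept by least squares. A minimal sketch follows; the income and savings figures are hypothetical and chosen only so the fitted line is easy to check by hand.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

income = [40, 50, 60, 70]   # hypothetical yearly income
savings = [4, 6, 8, 10]     # hypothetical yearly savings

slope, intercept = fit_line(income, savings)
predicted_savings = slope * 80 + intercept  # prediction for an income of 80
print(slope, intercept, predicted_savings)
```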
Time series analysis: the value of an attribute is examined as it varies over time.
It can be used to determine similarities, classify behavior, or predict future values.
Example: share market
Prediction: predicts future values using regression, time series analysis, or other approaches.
Examples:
Flood prediction for a river based on water level, amount of rain, time, and humidity. Sensors placed at different locations in the river area monitor flood conditions so that floods can be predicted.
Weather analysis
Pollution analysis
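Predicting a future value from a time series can be sketched with a moving average, one of the simplest possible baselines. This is not a flood model; the readings, the window size, and the function name are hypothetical.

```python
def moving_average_forecast(values, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    recent = values[-window:]
    return sum(recent) / len(recent)

water_level = [2.0, 2.5, 3.0, 4.0, 5.0]  # hypothetical sensor readings in metres
forecast = moving_average_forecast(water_level)
print(forecast)
```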
Clustering: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.
Unsupervised learning: no predefined classes.
Interpretability and usability: results should be comprehensible and usable; a domain expert is required.
Example: students are clustered on various attributes such as academic performance, area in which they live, age, height, weight, body mass index, and extracurricular activities.
Clusters do not have a specific size or shape.
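Grouping similar objects without predefined classes can be sketched with k-means, used here as one common clustering algorithm rather than the one the slides prescribe. The sketch is restricted to one-dimensional data; the ages, starting centers, and function name are hypothetical.

```python
def kmeans_1d(values, centers, iterations=10):
    """A very small k-means for one-dimensional data."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to its cluster mean (unchanged if empty).
        centers = [
            sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)
        ]
    return centers, clusters

ages = [18, 19, 20, 34, 35, 36]  # hypothetical student ages
centers, clusters = kmeans_1d(ages, centers=[18, 36])
print(centers, clusters)
```

No class labels are supplied; the two groups emerge from the data alone, which is what distinguishes clustering from classification.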
[Figure: Outlier]
Summarization: maps data into subsets with simple descriptions. It extracts or derives representative, summary-type information.
Example: a summary of student results, which gives the number of students who appeared for the exam, passed, and failed, broken down by class.
Association rules: discover relationships among data. Used in market basket analysis to find items frequently purchased together.
Example: a person buying sugar in the mall also buys milk. Items that people buy together are kept together.
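The strength of a rule such as "sugar implies milk" is commonly measured by support and confidence. A minimal sketch follows; the baskets and function names are hypothetical.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets, one set of items per transaction.
baskets = [
    {"sugar", "milk"},
    {"sugar", "milk", "bread"},
    {"sugar"},
    {"bread"},
]

rule_support = support(baskets, {"sugar", "milk"})
rule_confidence = confidence(baskets, {"sugar"}, {"milk"})
print(rule_support, rule_confidence)
```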
Sequence discovery: discovers sequential patterns in data, such as the order in which items are purchased or data is accessed.
Example: when a customer purchases a TV set, the sales manager assumes that the customer will also buy some CDs and a music system.
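Unlike association rules, sequence discovery cares about order. A minimal sketch that counts, for each customer, which item was bought before which; the purchase histories and the function name are hypothetical, and real sequential-pattern algorithms are considerably more elaborate.

```python
from collections import Counter
from itertools import combinations

def ordered_pair_counts(sequences):
    """Count in how many sequences item a is purchased before item b."""
    counts = Counter()
    for sequence in sequences:
        # combinations() preserves position order, so a always precedes b.
        pairs = {(a, b) for a, b in combinations(sequence, 2) if a != b}
        counts.update(pairs)
    return counts

# Hypothetical purchase histories, one list per customer, in time order.
histories = [
    ["TV", "CD"],
    ["TV", "music system", "CD"],
    ["CD", "TV"],
]

counts = ordered_pair_counts(histories)
print(counts[("TV", "CD")], counts[("CD", "TV")])
```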
Influence from many disciplines
Data mining draws on:
• Artificial Intelligence
• Information Technology
• Database Technology
• Machine Learning
• Pattern Recognition
• Statistics
• Algorithm
• Visualization
• Mathematical Modeling
Depending on the data mining approach, techniques from other disciplines may be applied, such as:
•Information Retrieval
•Artificial Intelligence
•Neural networks
•Fuzzy set theory
•Knowledge representation
•Logic programming
•High performance computing
Data Mining issues
• Human interaction: interfaces with both domain and technical experts are required to analyze the output and draw the correct inferences after the data mining step.
• Overfitting: occurs when the model fits the current data exactly but does not fit future data. It may be caused by a training data set that is wrong or not representative.
• Outliers: the model may be distorted by the presence of outliers.
• Interpretation of results: experts are required because the results can be hard to interpret.
• Visualization of results: visualization helps to display the analyzed data, but it becomes problematic for multi-dimensional data.
Data Mining issues continued…
• Large datasets: scalability problems may arise, as algorithms do not scale well to massive real-world datasets. Sampling and parallelization are effective tools for this problem.
• High dimensionality: a conventional database may contain many different attributes, not all of which are relevant. Irrelevant attributes may increase complexity and reduce efficiency. This is known as the dimensionality curse. Data reduction can be applied to reduce the dimensionality.
• Multimedia data: found in GIS databases; conventional data mining algorithms prove ineffective on it.
• Missing data: it is not always possible to ignore missing data, but during preprocessing, data mining algorithms can be used to replace missing data with estimates.
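Replacing missing data with estimates can be sketched with mean imputation, one of the simplest estimators and only one of many possible choices. Missing values are assumed to be represented as None; the function name and sample values are hypothetical.

```python
def impute_mean(values):
    """Replace each missing value (None) with the mean of the known values."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]

readings = [10, None, 20, 30]  # hypothetical attribute with one missing value
filled = impute_mean(readings)
print(filled)
```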
Data Mining issues continued…
• Irrelevant data: data can be reduced by removing irrelevant data.
• Noisy data: invalid or incorrect data leads to poor-quality data mining.
• Changing data: data warehouses contain non-volatile data. When dynamic data is loaded, algorithms are reapplied to check that they still work correctly.
• Integration: KDD requests are often treated as one-time needs, but data mining functions are now being integrated into traditional database systems.
• Applications: making effective use of the output of a mining algorithm is a greater challenge than the complexity of the algorithm itself.
Data Mining Metrics
How can the effectiveness of the data mining process be measured?
- The KDD process is expensive. The return on investment is the saving due to decisions made using the results.
- It is difficult to measure and quantify.
Social Implications of Data Mining
It has two sides, like a coin:
Data mining can be used to improve customer service and satisfaction.
Data mining can be used to infringe on one's right to privacy.
Omnipresent, invisible data mining affects everyone.
Data mining should follow certain guidelines:
Purpose specification and use limitation
Openness
Security safeguards
Individual participation
Privacy-preserving data mining
- Secure multiparty computation
- Data obscuration
Applications of Data Mining
Security: finding terrorists using classification techniques
Weather: predicting weather and pollution
Finance: share market
E-commerce: market basket analysis
Education: student result preparation
Banking: analysis of customers applying for loans
Research: data analysis
Fraud detection
Marketing: targeting customers
Molecular biology
Astronomy
Health: detecting disease in people

More Related Content

What's hot (20)

01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.01 Data Mining: Concepts and Techniques, 2nd ed.
01 Data Mining: Concepts and Techniques, 2nd ed.
 
Data analytics
Data analyticsData analytics
Data analytics
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data visualization
Data visualizationData visualization
Data visualization
 
Data mining
Data miningData mining
Data mining
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Data mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, ClassificationData mining , Knowledge Discovery Process, Classification
Data mining , Knowledge Discovery Process, Classification
 
Data mining
Data mining Data mining
Data mining
 
Data Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data AnalysisData Visualization in Exploratory Data Analysis
Data Visualization in Exploratory Data Analysis
 
Data mining tasks
Data mining tasksData mining tasks
Data mining tasks
 
Data Visualization in Data Science
Data Visualization in Data ScienceData Visualization in Data Science
Data Visualization in Data Science
 
Data mining
Data miningData mining
Data mining
 
Chapter 1 big data
Chapter 1 big dataChapter 1 big data
Chapter 1 big data
 
3 Data Mining Tasks
3  Data Mining Tasks3  Data Mining Tasks
3 Data Mining Tasks
 
web mining
web miningweb mining
web mining
 
Data mining
Data miningData mining
Data mining
 
Data Mining & Applications
Data Mining & ApplicationsData Mining & Applications
Data Mining & Applications
 
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis Data mining: Concepts and Techniques, Chapter12 outlier Analysis
Data mining: Concepts and Techniques, Chapter12 outlier Analysis
 
Introduction to Data Science and Analytics
Introduction to Data Science and AnalyticsIntroduction to Data Science and Analytics
Introduction to Data Science and Analytics
 
Data mining
Data mining Data mining
Data mining
 

Similar to Data mining an introduction

Data mining introduction
Data mining introductionData mining introduction
Data mining introductionBasma Gamal
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerIJERA Editor
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptxHarsha Patel
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slidestafosepsdfasg
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CSThanveen
 
DM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdfDM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdfssuserb933d8
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data miningUjjawal
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and workAmr Abd El Latief
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesFellowBuddy.com
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data scienceMahir Haque
 

Similar to Data mining an introduction (20)

Data mining introduction
Data mining introductionData mining introduction
Data mining introduction
 
data mining
data miningdata mining
data mining
 
Unit 1.pptx
Unit 1.pptxUnit 1.pptx
Unit 1.pptx
 
Seminar Presentation
Seminar PresentationSeminar Presentation
Seminar Presentation
 
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using RapidminerStudy and Analysis of K-Means Clustering Algorithm Using Rapidminer
Study and Analysis of K-Means Clustering Algorithm Using Rapidminer
 
Unit-V-Introduction to Data Mining.pptx
Unit-V-Introduction to  Data Mining.pptxUnit-V-Introduction to  Data Mining.pptx
Unit-V-Introduction to Data Mining.pptx
 
Data mining
Data miningData mining
Data mining
 
Introduction to Data Mining
Introduction to Data MiningIntroduction to Data Mining
Introduction to Data Mining
 
2 introductory slides
2 introductory slides2 introductory slides
2 introductory slides
 
Chapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data MiningChapter 1: Introduction to Data Mining
Chapter 1: Introduction to Data Mining
 
Additional themes of data mining for Msc CS
Additional themes of data mining for Msc CSAdditional themes of data mining for Msc CS
Additional themes of data mining for Msc CS
 
Unit i
Unit iUnit i
Unit i
 
DM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdfDM-Unit-1-Part 1-R.pdf
DM-Unit-1-Part 1-R.pdf
 
Introduction to data mining
Introduction to data miningIntroduction to data mining
Introduction to data mining
 
Data mining concepts and work
Data mining concepts and workData mining concepts and work
Data mining concepts and work
 
Data Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture NotesData Mining & Data Warehousing Lecture Notes
Data Mining & Data Warehousing Lecture Notes
 
Talk
TalkTalk
Talk
 
Data Mining and Knowledge
Data Mining and KnowledgeData Mining and Knowledge
Data Mining and Knowledge
 
Dma unit 1
Dma unit   1Dma unit   1
Dma unit 1
 
Introduction to data science
Introduction to data scienceIntroduction to data science
Introduction to data science
 

More from Dr-Dipali Meher

More from Dr-Dipali Meher (17)

Database Security Methods, DAC, MAC,View
Database Security Methods, DAC, MAC,ViewDatabase Security Methods, DAC, MAC,View
Database Security Methods, DAC, MAC,View
 
Version Stamps in NOSQL Databases
Version Stamps in NOSQL DatabasesVersion Stamps in NOSQL Databases
Version Stamps in NOSQL Databases
 
DataPreprocessing.pptx
DataPreprocessing.pptxDataPreprocessing.pptx
DataPreprocessing.pptx
 
Literature Review
Literature ReviewLiterature Review
Literature Review
 
Research Problem
Research ProblemResearch Problem
Research Problem
 
Formulation of Research Design
Formulation of Research DesignFormulation of Research Design
Formulation of Research Design
 
Types of Research
Types of ResearchTypes of Research
Types of Research
 
Research Methodology-Intorduction
Research Methodology-IntorductionResearch Methodology-Intorduction
Research Methodology-Intorduction
 
Introduction to Research
Introduction to ResearchIntroduction to Research
Introduction to Research
 
Neo4j session
Neo4j sessionNeo4j session
Neo4j session
 
Introduction to NoSQL
Introduction to NoSQLIntroduction to NoSQL
Introduction to NoSQL
 
Consistency in NoSQL
Consistency in NoSQLConsistency in NoSQL
Consistency in NoSQL
 
Data models in NoSQL
Data models in NoSQLData models in NoSQL
Data models in NoSQL
 
Schema migrations in no sql
Schema migrations in no sqlSchema migrations in no sql
Schema migrations in no sql
 
Polyglot Persistence
Polyglot Persistence Polyglot Persistence
Polyglot Persistence
 
Naive bayesian classification
Naive bayesian classificationNaive bayesian classification
Naive bayesian classification
 
Function Pointer
Function PointerFunction Pointer
Function Pointer
 

Recently uploaded

AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptxmary850239
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Jisc
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfSpandanaRallapalli
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxthorishapillay1
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)lakshayb543
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSJoshuaGantuangco2
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxiammrhaywood
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfphamnguyenenglishnb
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYKayeClaireEstoconing
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management SystemChristalin Nelson
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxCarlos105
 

Recently uploaded (20)

AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...Procuring digital preservation CAN be quick and painless with our new dynamic...
Procuring digital preservation CAN be quick and painless with our new dynamic...
 
ACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdfACC 2024 Chronicles. Cardiology. Exam.pdf
ACC 2024 Chronicles. Cardiology. Exam.pdf
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Proudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptxProudly South Africa powerpoint Thorisha.pptx
Proudly South Africa powerpoint Thorisha.pptx
 
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
Visit to a blind student's school🧑‍🦯🧑‍🦯(community medicine)
 
Raw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptxRaw materials used in Herbal Cosmetics.pptx
Raw materials used in Herbal Cosmetics.pptx
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptxECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
ECONOMIC CONTEXT - PAPER 1 Q3: NEWSPAPERS.pptx
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdfAMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
AMERICAN LANGUAGE HUB_Level2_Student'sBook_Answerkey.pdf
 
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
Model Call Girl in Tilak Nagar Delhi reach out to us at 🔝9953056974🔝
 
Transaction Management in Database Management System
Transaction Management in Database Management SystemTransaction Management in Database Management System
Transaction Management in Database Management System
 
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
call girls in Kamla Market (DELHI) 🔝 >༒9953330565🔝 genuine Escort Service 🔝✔️✔️
 
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptxBarangay Council for the Protection of Children (BCPC) Orientation.pptx
Barangay Council for the Protection of Children (BCPC) Orientation.pptx
 

Data mining an introduction

  • 1. 1 Mrs. Dipali Meher Modern College of Arts, Science and Commerce, Ganeshkhind, Pune 411016 Data Mining : An Introduction
  • 2. 2 Bayes Thm(1763) Regression(1805) KDD(1989) Support Vector Machine(1992) Data Science(2001) Moneyball(2003) Turing(1963) Neural Networks(1943) Evolutionary Computation(1965) Databases(1970) Genetic Algorithms(1975) Big Data From Then till Now…..
  • 4. 4 Data Mining deals with the discovery of hidden Knowledge , unexpected pattern and new rules from large data sets
  • 5. 5 Examples of Information extracted using query language  List customers who use credit card to purchase more than Rs. 10000 worth groceries  List patients who had at least one heart attack  List students who had at least one backlog  List employees who have taken home loans
  • 6. 6 Examples of what data mining is used for  Develop a general profile of credit card customers  Determine patients whose lifestyle is prone to getting a heart attack in near future  Differentiate poor credit risk customers from good credit card customers  Differentiate students who had one backlogs in their academic  Determine employees who have taken loan for any purpose
  • 7. Data Mining differs from usual query processing in many ways Query Processing Data Mining Query Wel formed as Select… From… Where…… Query is not well formed. What is found out that is usually hidden Data Data from online transaction processing systems generally in table formats Data is integrated from various sources. Huge amount of data Output Subset of databases Not only subset but also in analyzed and in terms of patterns 7
  • 8. 8 Data mining (knowledge discovery from data) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data Data mining: a misnomer? Alternative names Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  • 9. •Data mining (knowledge discovery from data) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data Data mining: a misnomer? •Alternative names Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc. 9
  • 10. Knowledge discovery in databases (KDD)-is a multistep process of finding useful information and patterns in data while Data Mining is one of the steps in KDD of using algorithms for extraction of patterns Steps Of KDD 1. Selection- Data Extraction -Obtaining Data from heterogeneous data sources - Databases, Data warehouses, World wide web or other information repositories 2. Preprocessing- Data Cleaning- Incomplete , noisy, inconsistent data to be cleaned- Missing data may be ignored or predicted, erroneous data may be deleted or corrected 10
  • 11. 3. Transformation- Data Integration- Combines data from multiple sources into a coherent store -Data can be encoded in common formats, normalized, reduced 4. Data mining – Apply algorithms to transformed data an extract patterns 5. Pattern Interpretation/evaluation - Pattern Evaluation- Evaluate the interestingness of resulting patterns or apply interestingness measures to filter out discovered patterns Knowledge presentation- present the mined knowledge- visualization techniques can be used 11
  • 12. Transformation KDD is the nontrivial extraction of implicit previously unknown and potentially useful knowledge from data Knowledge Discovery Process Preprocessing Data Mining Pattern Interpretation and evaluation Selection 12
  • 13. [Figure: "Graph 1", bar chart over categories a to g] Graphical: bar charts, pie charts, histograms. Icon-based: using colors and figures as icons.
  • 14. Hierarchical: hierarchically dividing the display area. Geometric: box plot, scatter plot.
  • 15. Pixel-based: data shown as colored pixels. Hybrid: a combination of the above approaches.
  • 16. Why data mining? Potential applications. Data analysis and decision support: market analysis and management (target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation); risk analysis and management (forecasting, customer retention, improved underwriting, quality control, competitive analysis); fraud detection and detection of unusual patterns (outliers). Other applications: text mining (news groups, email, documents) and Web mining; stream data mining; bioinformatics and bio-data analysis.
  • 17. Data mining algorithms: all algorithms attempt to fit a model closest to the data being examined. The model is based on analysis of the attributes of a training data set and is then evaluated using a test data set. A model can be predictive or descriptive. A predictive model makes predictions about data values using results found from available data; it thus uses historical data to make predictions. A descriptive model identifies patterns or relationships in data; it explores the properties of existing data rather than predicting new properties.
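The split between a training data set (used to build the model) and a test data set (used to evaluate it) can be sketched as follows. This is an illustrative toy, not a method from the slides: the one-attribute threshold model, the `(marks, passed)` pairs and the function names are all assumptions.

```python
from statistics import mean


def fit_threshold(train):
    """Build the model: a cut-off halfway between the two class means."""
    pos = [x for x, label in train if label == 1]
    neg = [x for x, label in train if label == 0]
    return (mean(pos) + mean(neg)) / 2


def predict(threshold, x):
    """Predict class 1 when the attribute reaches the learned cut-off."""
    return 1 if x >= threshold else 0


def accuracy(threshold, test):
    """Evaluate the model: fraction of test records predicted correctly."""
    hits = sum(predict(threshold, x) == label for x, label in test)
    return hits / len(test)


# Hypothetical data: (marks, passed) pairs.
train = [(20, 0), (30, 0), (60, 1), (70, 1)]
test = [(25, 0), (40, 0), (44, 1), (80, 1)]

threshold = fit_threshold(train)       # learned from the training set only
test_accuracy = accuracy(threshold, test)  # measured on unseen records
```

The model never sees the test records while it is being built, which is what makes the test accuracy a fair estimate of how it behaves on new data.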
  • 18. Data mining models. Predictive: classification, regression, time series analysis, prediction. Descriptive: clustering, summarization, association rules, sequence discovery.
  • 19. Classification maps data into predefined groups or classes. It uses supervised learning: in a learning phase the algorithm builds a classifier from a training data set containing data attributes and their associated class labels. Example: predicting the class in which a student's result will fall. Pattern recognition is a type of classification in which an input pattern is assigned to one of several classes based on its similarity to predefined classes. Example: identifying terrorists among passengers. Each passenger is described by basic patterns such as the distance between the eyes and the size and shape of facial features, and these patterns are compared with entries in a database to see whether any match is found.
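Matching an input pattern against predefined patterns by similarity can be sketched as a nearest-match lookup. This is a simplified illustration only: the two measurements per pattern, the stored entries, the Euclidean distance and the `max_distance` cut-off are all hypothetical.

```python
import math

# Hypothetical stored patterns: (eye distance, face width) in centimetres.
KNOWN_PATTERNS = {
    "entry_a": (6.2, 14.0),
    "entry_b": (5.6, 13.1),
}


def best_match(pattern, known=KNOWN_PATTERNS, max_distance=0.5):
    """Return the closest stored entry, or None if nothing is similar enough."""
    name, stored = min(known.items(),
                       key=lambda item: math.dist(pattern, item[1]))
    return name if math.dist(pattern, stored) <= max_distance else None
```

Here `best_match((6.1, 14.1))` finds the nearby stored entry, while a pattern far from every stored entry yields no match.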
  • 21. Grade by Useful Heat Value (kcal/kg): A: >6200; B: 5601-6200; C: 4941-5600; D: 4201-4940; E: 3361-4200; F: 2401-3360; G: 1301-2400.
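The grade table on slide 21 is itself a small classifier: it maps a measured value into one of several predefined classes. A minimal sketch, assuming whole-number heat values as in the table (the function name and the `None` result for values below the lowest band are illustrative choices):

```python
# Lower bound (kcal/kg) of grades B to G, taken from the slide 21 table.
GRADE_FLOORS = [
    (5601, "B"),
    (4941, "C"),
    (4201, "D"),
    (3361, "E"),
    (2401, "F"),
    (1301, "G"),
]


def grade(useful_heat_value):
    """Map a Useful Heat Value to its predefined grade."""
    if useful_heat_value > 6200:
        return "A"
    for floor, label in GRADE_FLOORS:
        if useful_heat_value >= floor:
            return label
    return None  # below the lowest band listed in the table
```

For example, `grade(5000)` falls in the 4941-5600 band and so returns `"C"`.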
  • 22. Regression maps data into a real-valued prediction variable. The algorithm tries to find the best function (linear or non-linear) that fits the training data, and assumes that the target data fits some such function. Example: a college professor plans his retirement based on current savings and income. Using a simple linear regression formula he can project future savings, and if he wants to save more he must alter his expenses.
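Simple linear regression fits a line y = slope * x + intercept to the training data by least squares. The sketch below uses the standard closed-form formulas; the yearly savings figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical savings (in thousands) at the end of years 1 to 4.
years = [1, 2, 3, 4]
savings = [12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(years, savings)
projected_year_10 = slope * 10 + intercept  # extrapolate the fitted line
```

The projection is only as good as the assumption that past behavior continues along the same line.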
  • 23. Time series analysis: the value of an attribute is examined as it varies over time. It can be used to determine similarities, classify behavior or predict future values. Example: share market prices.
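One of the simplest ways to smooth a time series and predict its next value is a moving average. This is just one possible technique, chosen here for brevity; the prices and the window size are hypothetical.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [sum(values[i - window:i]) / window
            for i in range(window, len(values) + 1)]


def forecast_next(values, window):
    """Naive forecast: the mean of the most recent `window` values."""
    return sum(values[-window:]) / window


# Hypothetical daily closing prices of one share.
prices = [100, 102, 104, 103, 105]

smoothed = moving_average(prices, window=3)
next_price = forecast_next(prices, window=3)
```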
  • 24. Prediction predicts future values using regression, time series analysis or other approaches. Example: flood prediction for a river based on water level, rainfall amount, time and humidity. Sensors placed at different locations in the river area monitor flood conditions so that floods can be predicted. Other examples: weather analysis, pollution analysis.
  • 25. Clustering: finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters. It is unsupervised learning: there are no predefined classes. Interpretability and usability: results should be comprehensible and usable, so a domain expert is required. Example: students are clustered on attributes such as academic performance, area in which they live, age, height, weight, body mass index and extracurricular activities. Clusters do not have a specific size or shape.
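As an illustration of grouping without predefined classes, here is a minimal one-dimensional k-means sketch. K-means is only one of many clustering algorithms and is an assumption here, as are the ages, the starting centroids and the fixed iteration count.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group numbers around centroids; requires iterations >= 1."""
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical ages of students, with two arbitrary starting centroids.
ages = [18, 19, 20, 40, 41, 42]
centers, groups = kmeans_1d(ages, centroids=[18, 42])
```

No labels are supplied; the two groups emerge purely from how close the values are to each other.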
  • 27. Summarization maps data into subsets with simple descriptions. It extracts or derives representative summary information. Example: a summary of student results which gives the number of students who appeared for the exam, passed and failed, and the count in each result class.
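The student result summary can be sketched with a simple counter. The marks and the class cut-offs (60, 50, 40) are hypothetical and stand in for whatever rules an institution actually applies.

```python
from collections import Counter


def result_class(marks):
    """Assign a result class using hypothetical cut-offs."""
    if marks >= 60:
        return "first class"
    if marks >= 50:
        return "second class"
    if marks >= 40:
        return "pass class"
    return "fail"


def summarize(marks_list):
    """Reduce individual marks to a short descriptive summary."""
    classes = Counter(result_class(m) for m in marks_list)
    failed = classes["fail"]
    return {
        "appeared": len(marks_list),
        "passed": len(marks_list) - failed,
        "failed": failed,
        "by_class": dict(classes),
    }


summary = summarize([72, 55, 38, 61, 45])
```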
  • 28. Association rules discover relationships among data. They are used in market basket analysis to find items frequently purchased together. Example: a person buying sugar in the mall also buys milk. Items that people buy together can then be kept together.
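How strongly "sugar implies milk" holds is usually measured by support (how often the items occur together) and confidence (how often the consequent appears given the antecedent). A minimal sketch with made-up baskets:

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of both sides together divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical market baskets.
baskets = [
    {"sugar", "milk"},
    {"sugar", "milk", "bread"},
    {"sugar", "tea"},
    {"bread", "milk"},
]

rule_support = support(baskets, {"sugar", "milk"})
rule_confidence = confidence(baskets, {"sugar"}, {"milk"})
```

A rule is normally reported only if both measures exceed thresholds chosen by the analyst.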
  • 29. Sequence discovery finds sequential patterns in data, such as the order in which items are purchased or data is accessed. Example: when a customer purchases a TV set, the sales manager expects that the customer will later also buy CDs and a music system.
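Unlike an association rule, a sequential pattern depends on order. The sketch below counts the fraction of customers who bought one item and some time later another; the purchase histories and the function name are hypothetical.

```python
def sequence_support(sequences, first, then):
    """Fraction of sequences in which `then` occurs after `first`."""
    def occurs(seq):
        if first not in seq:
            return False
        start = seq.index(first)  # earliest purchase of `first`
        return then in seq[start + 1:]
    return sum(occurs(s) for s in sequences) / len(sequences)


# Hypothetical purchase histories, oldest purchase first.
histories = [
    ["tv", "cd", "music system"],
    ["cd", "tv"],
    ["tv", "music system"],
    ["radio"],
]

tv_then_cd = sequence_support(histories, "tv", "cd")
tv_then_music = sequence_support(histories, "tv", "music system")
```

The second history contains both a TV and a CD, but in the wrong order, so it does not count toward "TV then CD".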
  • 30. Influence from many disciplines on data mining: artificial intelligence, information technology, database technology, machine learning, pattern recognition, statistics, algorithms, visualization, mathematical modeling.
  • 31. Depending on the data mining approach, techniques from other disciplines may be applied, such as: information retrieval, artificial intelligence, neural networks, fuzzy set theory, knowledge representation, logic programming, high performance computing.
  • 32. Data mining issues. Human interaction: interfaces with both domain and technical experts are required to analyze the output and draw the correct inferences after the data mining step. Overfitting: occurs when the model fits the current data exactly but does not fit future data, for example when the training data set is small or unrepresentative. Outliers: the model may be distorted by the presence of outliers. Interpretation of results: experts are required because the output can be hard to interpret. Visualization of results: visualization helps to display analyzed data, but it becomes problematic for multi-dimensional data.
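The distorting effect of an outlier is easy to see on the simplest possible "model", an average. In this sketch (all numbers hypothetical) one extreme value drags the mean far from the typical values, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical monthly incomes (in thousands), then the same with one outlier.
incomes = [30, 32, 35, 31, 33]
with_outlier = incomes + [500]

mean_before = mean(incomes)          # typical value without the outlier
mean_after = mean(with_outlier)      # pulled strongly toward the outlier
median_before = median(incomes)
median_after = median(with_outlier)  # shifts only slightly
```

This is one reason outliers are often detected and handled during preprocessing, or why more robust statistics are preferred.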
  • 33. Data mining issues, continued. Large datasets: scalability problems may arise because many algorithms do not scale well to massive real-world datasets; sampling and parallelization are effective tools for this problem. High dimensionality: a conventional database may contain many attributes, not all of which are relevant; the irrelevant ones increase complexity and reduce efficiency. This is known as the dimensionality curse, and data reduction can be applied to reduce the number of dimensions. Multimedia data: multimedia data, as found in GIS databases, can make conventional data mining algorithms ineffective. Missing data: it is not always possible to ignore missing data, but during preprocessing missing values can be replaced with estimates.
  • 34. Data mining issues, continued. Irrelevant data: data can be reduced by removing irrelevant data. Noisy data: invalid or incorrect data leads to poor-quality data mining. Changing data: data warehouses contain non-volatile data, whereas dynamic data changes; when new data is loaded, algorithms must be reapplied to check that they still work correctly. Integration: KDD requests are often treated as one-time needs, but data mining functions are now being integrated into traditional database systems. Applications: making effective use of the output of a mining algorithm is a greater challenge than the complexity of the algorithm itself.
  • 35. Data mining metrics: how can the effectiveness of the data mining process be measured? The KDD process is expensive; the return on investment is the saving from decisions made using its results, which is difficult to measure and quantify. Social implications of data mining: there are two sides to the coin. Data mining can be used to improve customer service and satisfaction, but it can also be used to intrude on one's right to privacy. Data mining is omnipresent and invisible, and it affects everyone.
  • 36. Data mining should follow certain guidelines: purpose specification and use limitation; openness; security safeguards; individual participation. Privacy-preserving data mining: secure multiparty computation; data obscuration.
  • 37. Applications of data mining. Security: finding terrorists using classification techniques. Weather: predicting weather and pollution. Finance: share market. E-commerce: market basket analysis. Education: student result preparation. Banking: analysis of customers applying for loans. Research: data analysis. Fraud detection. Marketing: targeting customers. Molecular biology. Astronomy. Health: detecting disease in people.
  • 38. Books for reference: Data Mining: Introductory and Advanced Topics by Margaret H. Dunham and Sridhar, Pearson Education, ISBN 81-7758-785-4. Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, ISBN 81-312-0535-5.