Data Preprocessing
Content
• What is data preprocessing, and why preprocess the data?
• Data cleaning
• Data integration
• Data transformation
• Data reduction

PAAS Group
Data preprocessing is a data mining technique that
transforms raw data into an understandable
format.

Why preprocess the data?

Data Preprocessing
• Data in the real world is:
– incomplete: lacking attribute values or certain attributes of
interest
– noisy: containing errors or outliers
– inconsistent: containing discrepancies, i.e., a lack of
compatibility or agreement between two or more facts

• No quality data, no quality mining results!
– Quality decisions must be based on quality data
– Data warehouse needs consistent integration of
quality data

Measures of Data Quality
• Accuracy
• Completeness
• Consistency
• Timeliness
• Believability
• Value added
• Interpretability
• Accessibility

Data preprocessing techniques
• Data Cleaning
• Data Integration
• Data Transformation
• Data Reduction

Major Tasks in Data Preprocessing
• Data cleaning
– Fill in missing values, smooth noisy data, identify or remove outliers, and
resolve inconsistencies

• Data integration
– Integration of multiple databases, data cubes, or files

• Data transformation
– Normalization and aggregation

• Data reduction
– Obtains a representation that is reduced in volume but produces the same or similar
analytical results

Data Cleaning
“Data cleaning attempts to fill in missing
values, smooth out noise while identifying
outliers, and correct inconsistencies in
real-world data.”

Data Cleaning - Missing Values
• Ignore the tuple
• Fill in the missing value manually
• Use a global constant
• Use the attribute mean
• Use the most probable value (decision tree, Bayesian formalism)
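A minimal sketch of three of the strategies above: ignoring the tuple, filling with a global constant, and filling with the attribute mean. The records, the `income` attribute, and the helper names are hypothetical and serve only as illustration; predicting the most probable value with a decision tree or Bayesian method is not shown.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"name": "A", "income": 30000},
    {"name": "B", "income": None},
    {"name": "C", "income": 50000},
    {"name": "D", "income": 40000},
]

def ignore_tuples(rows, attr):
    """Drop every tuple whose value for attr is missing."""
    return [r for r in rows if r[attr] is not None]

def fill_with_constant(rows, attr, constant):
    """Replace missing values of attr with one global constant."""
    return [{**r, attr: constant} if r[attr] is None else dict(r) for r in rows]

def fill_with_mean(rows, attr):
    """Replace missing values of attr with the mean of the observed values.

    Raises statistics.StatisticsError if no value of attr is observed.
    """
    observed = [r[attr] for r in rows if r[attr] is not None]
    return fill_with_constant(rows, attr, mean(observed))

dropped = ignore_tuples(records, "income")
flagged = fill_with_constant(records, "income", "unknown")
imputed = fill_with_mean(records, "income")
```

Each helper returns a new list, so the raw records stay available for comparison.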

Data Cleaning - Noisy Data
• Binning
• Clustering
• Combined computer and human inspection
• Regression
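As an illustration of binning, the sketch below smooths values by bin means using equal-frequency bins. The slides do not specify a binning variant, so this choice, the function names, and the sample prices are assumptions.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_bin_means(values, bin_size):
    """Replace every value by the mean of its bin (result is in sorted order)."""
    smoothed = []
    for bin_values in equal_frequency_bins(values, bin_size):
        bin_mean = sum(bin_values) / len(bin_values)
        smoothed.extend([bin_mean] * len(bin_values))
    return smoothed

# Hypothetical price data.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed_prices = smooth_by_bin_means(prices, bin_size=3)
```

With bins of three values, the bins are [4, 8, 15], [21, 21, 24], and [25, 28, 34], so each value is replaced by 9, 22, or 29 respectively.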

Data Cleaning - Inconsistent Data
• Manually, using external references
• Knowledge engineering tools

A Few Important Terms
• Discrepancy Detection
– Human Error
– Data Decay
– Deliberate Errors

• Metadata
• Unique Rules
• Null Rules
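The sketch below assumes the usual textbook reading of the last two terms: a unique rule says each value of an attribute must differ from all others, and a null rule says how missing values are marked. The marker set, attribute names, and customer rows are hypothetical.

```python
from collections import Counter

# Hypothetical null rule: these markers all mean "value missing".
NULL_MARKERS = {None, "", "?", "N/A"}

def unique_rule_violations(rows, attr):
    """Return the non-missing values of attr that occur more than once."""
    counts = Counter(r[attr] for r in rows if r[attr] not in NULL_MARKERS)
    return sorted(value for value, n in counts.items() if n > 1)

def null_rule_matches(rows, attr):
    """Return the indexes of rows whose attr holds a missing-value marker."""
    return [i for i, r in enumerate(rows) if r[attr] in NULL_MARKERS]

customers = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 2, "email": "b@example.com"},
    {"id": 3, "email": "?"},
]
duplicate_ids = unique_rule_violations(customers, "id")
missing_emails = null_rule_matches(customers, "email")
```

Checks like these are one way to automate part of discrepancy detection; deciding whether a flagged value is a human error, decayed data, or a deliberate error still needs inspection.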

Data Integration
“Data integration combines data
from multiple sources into a coherent data
store (data warehouse).”

Data Integration - Issues
• Entity identification problem
• Redundancy
• Tuple duplication
• Detecting data value conflicts
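A small sketch of how tuple duplication and data value conflicts might surface when two sources are merged on a shared key. It assumes the entity identification problem is already solved, i.e., that `cust_id` refers to the same entity in both sources; the sources, attribute names, and the `integrate` helper are hypothetical.

```python
def integrate(sources, key):
    """Merge rows from several sources, keyed on `key`.

    Returns (merged, conflicts). Exact duplicate tuples collapse silently;
    differing values for the same key and attribute are reported, and the
    value seen first is kept.
    """
    merged = {}
    conflicts = []
    for rows in sources:
        for row in rows:
            current = merged.setdefault(row[key], {})
            for attr, value in row.items():
                if attr in current and current[attr] != value:
                    conflicts.append((row[key], attr, current[attr], value))
                else:
                    current[attr] = value
    return merged, conflicts

crm = [{"cust_id": 1, "city": "Pune"}, {"cust_id": 2, "city": "Delhi"}]
billing = [{"cust_id": 1, "city": "Pune"}, {"cust_id": 2, "city": "New Delhi"}]
merged, conflicts = integrate([crm, billing], key="cust_id")
```

Customer 1 appears in both sources with identical values and is stored once; customer 2 is reported as a conflict because the two sources disagree on `city`.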

Data Transformation
“Transforming or consolidating data into a
form suitable for mining is known as data
transformation.”
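Normalization is listed among the major tasks as one such transformation. The slides do not say which method is meant, so the sketch below shows two common choices, min-max scaling and z-score standardization, as assumptions; the income values are hypothetical.

```python
from statistics import mean, pstdev

def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map them to new_min to avoid dividing by zero.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min) for v in values]

def z_score_normalize(values):
    """Center values on their mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

incomes = [12000, 35000, 54000, 98000]  # hypothetical attribute values
scaled = min_max_normalize(incomes)
standardized = z_score_normalize(incomes)
```

Min-max scaling keeps the original ordering inside a fixed range, while z-scores are less dependent on the exact minimum and maximum.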

Handling Redundant Data in Data Integration
• Redundant data often occur when multiple databases
are integrated
– The same attribute may have different names in
different databases
– One attribute may be a “derived” attribute in
another table, e.g., annual revenue
Handling Redundant Data in Data Integration
• Redundant data can often be detected by
correlation analysis
• Careful integration of the data from multiple sources
may help reduce or avoid redundancies and
inconsistencies and improve mining speed and quality
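One common form of correlation analysis for numeric attributes is the Pearson correlation coefficient; the slides do not name a specific measure, so that choice is an assumption. In this sketch a hypothetical `annual_revenue` attribute is derived from `monthly_revenue`, which makes it redundant.

```python
from math import sqrt

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equally long numeric attributes."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two attributes with the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant attribute")
    return cov / sqrt(var_x * var_y)

monthly_revenue = [10, 12, 9, 15, 11]                  # hypothetical values
annual_revenue = [12 * m for m in monthly_revenue]      # derived attribute
r = pearson_correlation(monthly_revenue, annual_revenue)
```

A coefficient close to +1 or -1 suggests that one attribute carries little information beyond the other, so one of them is a candidate for removal.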
Data Transformation
• Smoothing: remove noise from data
• Aggregation: summarization, data cube construction
• Generalization: concept hierarchy climbing
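Aggregation and generalization can be sketched as follows, assuming quarterly sales are summarized per year and cities are generalized to countries by climbing one level of a concept hierarchy. The sales figures and the hierarchy are hypothetical.

```python
from collections import defaultdict

# Hypothetical quarterly sales as (year, quarter, amount).
sales = [
    ("2009", "Q1", 100), ("2009", "Q2", 200),
    ("2009", "Q3", 300), ("2009", "Q4", 400),
    ("2010", "Q1", 150), ("2010", "Q2", 250),
]

# Hypothetical concept hierarchy: city -> country.
CITY_TO_COUNTRY = {"Pune": "India", "Delhi": "India", "Lyon": "France"}

def aggregate_by_year(rows):
    """Aggregation: summarize quarterly amounts into yearly totals."""
    totals = defaultdict(int)
    for year, _quarter, amount in rows:
        totals[year] += amount
    return dict(totals)

def generalize(value, hierarchy):
    """Generalization: replace a value by its parent concept, if one is known."""
    return hierarchy.get(value, value)

yearly_sales = aggregate_by_year(sales)
countries = [generalize(city, CITY_TO_COUNTRY) for city in ["Pune", "Lyon", "Oslo"]]
```

Values without an entry in the hierarchy, such as "Oslo" here, are left unchanged.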

Data Reduction
“Data reduction techniques are applied to obtain a
reduced representation of the data set that is much
smaller in volume, yet closely maintains the integrity
of base data.”

Data Reduction - Strategies
• Data cube aggregation
• Dimension reduction
• Data compression
• Numerosity reduction
• Discretization and concept hierarchy generation

Example of Decision Tree Induction
Initial attribute set:
{A1, A2, A3, A4, A5, A6}
[Figure: decision tree that tests attributes A4, A1, and A6; its leaves are labeled Class 1 and Class 2]
Reduced attribute set: {A1, A4, A6}
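The reduction step in this example can be sketched as: keep only the attributes that the induced tree actually tests. The tree below is hand-written to mirror the example rather than induced from data, and its branch labels and layout are assumptions.

```python
# Hand-written tree mirroring the example; internal nodes are dicts,
# leaves are class labels. Branch labels "yes"/"no" are hypothetical.
tree = {
    "attribute": "A4",
    "children": {
        "yes": {"attribute": "A1", "children": {"yes": "Class 1", "no": "Class 2"}},
        "no": {"attribute": "A6", "children": {"yes": "Class 1", "no": "Class 2"}},
    },
}

def attributes_used(node):
    """Collect every attribute tested anywhere in the tree."""
    if not isinstance(node, dict):
        return set()  # a leaf tests no attribute
    used = {node["attribute"]}
    for child in node["children"].values():
        used |= attributes_used(child)
    return used

def project(row, attributes):
    """Keep only the selected attributes of one tuple."""
    return {name: value for name, value in row.items() if name in attributes}

initial_attributes = ["A1", "A2", "A3", "A4", "A5", "A6"]
reduced_attributes = sorted(attributes_used(tree))
reduced_row = project({name: 0 for name in initial_attributes}, reduced_attributes)
```

Attributes A2, A3, and A5 never appear in the tree, so they are treated as irrelevant and dropped.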
Histograms
• A popular data reduction
technique
• Divide data into buckets and
store the average (or sum) for each
bucket
• Can be constructed optimally
in one dimension.
• Related to quantization
problems.
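A minimal sketch of a one-dimensional equal-width histogram that stores only a count and a sum per bucket. Equal-width partitioning is an assumption (the slide does not fix a bucketing rule), and the values are hypothetical.

```python
def equal_width_histogram(values, num_buckets):
    """Summarize values as num_buckets equal-width buckets with count and sum."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_buckets
    if width == 0:
        width = 1.0  # all values identical: everything lands in the first bucket
    buckets = [
        {"low": lo + i * width, "high": lo + (i + 1) * width, "count": 0, "sum": 0}
        for i in range(num_buckets)
    ]
    for v in values:
        # The maximum value would fall just past the last bucket, so clamp it.
        index = min(int((v - lo) / width), num_buckets - 1)
        buckets[index]["count"] += 1
        buckets[index]["sum"] += v
    return buckets

values = [0, 2, 5, 9, 10, 14, 15, 18, 20, 21, 25, 30]  # hypothetical data
histogram = equal_width_histogram(values, num_buckets=3)
averages = [b["sum"] / b["count"] for b in histogram if b["count"]]
```

Twelve raw values are reduced to three buckets; the per-bucket average can be recovered from the stored sum and count.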

Clustering
• Partition the data set into clusters and store only the cluster
representations
• Can be very effective if the data is naturally clustered.
• Clusters can be hierarchical and stored in multidimensional index tree structures.
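To illustrate storing only a cluster representation, the sketch below clusters one-dimensional values with a basic k-means loop and keeps a centroid, count, and range per cluster. k-means, the starting centers, and the data are assumptions; the slides do not name a clustering algorithm.

```python
def kmeans_1d(values, centers, iterations=10):
    """Basic 1-D k-means from given starting centers; returns (centers, groups)."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    centers = list(centers)
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group; keep it if the group is empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups

def summarize(groups):
    """Reduced representation: one small record per non-empty cluster."""
    return [
        {"centroid": sum(g) / len(g), "count": len(g), "min": min(g), "max": max(g)}
        for g in groups if g
    ]

values = [1, 2, 3, 10, 11, 12, 50, 51, 52]  # hypothetical, clearly clustered data
centers, groups = kmeans_1d(values, centers=[0, 10, 40])
representation = summarize(groups)
```

Nine values are replaced by three summary records. On data without clear clusters the same summary would lose much more detail.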

Sampling
• Allows a large data set to be represented by a much
smaller random sample (subset) of the data.
• Let a large data set D contain N tuples.
• Methods to reduce data set D:
– Simple random sample without replacement (SRSWOR)
– Simple random sample with replacement (SRSWR)
– Cluster sample
– Stratified sample
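The four methods can be sketched with Python's standard `random` module. The data set, the grouping into clusters, the strata, and the helper names are hypothetical; a seeded generator is used only to make the run repeatable.

```python
import random
from collections import defaultdict

def srswor(data, s, rng):
    """Simple random sample of s tuples without replacement."""
    return rng.sample(data, s)

def srswr(data, s, rng):
    """Simple random sample of s tuples with replacement (repeats possible)."""
    return rng.choices(data, k=s)

def cluster_sample(clusters, m, rng):
    """Pick m whole clusters at random and keep all of their tuples."""
    chosen = rng.sample(clusters, m)
    return [t for cluster in chosen for t in cluster]

def stratified_sample(data, stratum_of, fraction, rng):
    """Sample the same fraction from every stratum (at least one tuple each)."""
    strata = defaultdict(list)
    for t in data:
        strata[stratum_of(t)].append(t)
    sample = []
    for members in strata.values():
        k = max(1, round(fraction * len(members)))
        sample.extend(rng.sample(members, k))
    return sample

rng = random.Random(42)
D = list(range(1, 101))                           # N = 100 tuples
pages = [D[i:i + 20] for i in range(0, 100, 20)]  # hypothetical clusters

without = srswor(D, 10, rng)
with_replacement = srswr(D, 10, rng)
clustered = cluster_sample(pages, 2, rng)
stratified = stratified_sample(
    D, lambda t: "senior" if t > 80 else "young", 0.1, rng
)
```

Stratified sampling keeps small groups represented: the minority stratum still contributes tuples even though it makes up only a fifth of D.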

Sampling
[Figure: SRSWOR (simple random sample without replacement) and SRSWR samples drawn from the raw data]

Sampling
[Figure: cluster/stratified sample drawn from the raw data]
