Record Matching Over Query Results from Multiple Web Databases
Submitted by Tushar Jadhav
4 January 2015, DYPIET
CONTENTS
• INTRODUCTION
• PROBLEM DEFINITION
• CONCEPT
• CONCLUSION
• REFERENCES
Introduction
RECORD MATCHING
• Record matching identifies the records that refer to the same real-world entity across different data sources.
• A database is a collection of a large amount of information.
• Duplicate detection has attracted much attention from several research fields.
• Duplicate detection is studied in databases, data mining, artificial intelligence, and natural language processing (NLP).
• In the Web database scenario, the records to match are highly query-dependent.
Problem Definition
• Most previous work is based on predefined matching rules hand-coded by domain experts.
• Such approaches work well in a traditional database environment.
Contd…
• The main concern is data duplication in Web databases.
• The goal of duplicate detection is to determine the matching status of the record pairs, that is, whether each pair is a duplicate or not.
Concept
Record matching methods use two major steps:
• Identifying a similarity function:
  - Using training examples
  - Decision tree learning
  - Bayesian network
  - SVM
• Matching records:
  - The similarity function is used to calculate the similarity between the candidate pairs.
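The matching step above can be illustrated with a minimal sketch. It assumes records are dictionaries of string fields and uses Python's standard `difflib.SequenceMatcher` as a stand-in field similarity; the slides do not say which similarity function is used, and the field names and sample records are hypothetical.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1] for two field values (case-insensitive)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def similarity_vector(rec1: dict, rec2: dict, fields: list) -> list:
    """Compare a candidate pair field by field; missing fields count as empty."""
    return [field_similarity(rec1.get(f, ""), rec2.get(f, "")) for f in fields]


# Hypothetical records, as two Web databases might return them for one query.
FIELDS = ["title", "author", "year"]
r1 = {"title": "Data Mining Concepts", "author": "J. Han", "year": "2006"}
r2 = {"title": "data mining concepts", "author": "Jiawei Han", "year": "2006"}

vec = similarity_vector(r1, r2, FIELDS)
print(vec)  # one similarity score per field
```

Each candidate pair becomes one vector of per-field scores, which a classifier can then label as duplicate or non-duplicate.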
Supervised Learning
• A computer system learns from data, which represents some "past experiences" of an application domain.

Age_Cust   Has_Job   Own House   Credit Score   Decision
Old        Yes       No          Excellent      Yes
Middle     No        No          Good           No
Young      Yes       No          Fair           Yes
Table: Sample training data used for supervised learning

• Using the above training data, the system is trained to respond to a given set of criteria.
• Once training is completed, the system uses what it has learned to classify new records.
• For example, using the above training data, the system can respond to a new credit card application based on the criteria defined.
• Supervised learning uses predetermined training data. Such an approach is also known as classification.
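As a rough illustration of classification from the sample training data, the sketch below uses a 1-nearest-neighbour rule that counts mismatching attributes. This learner is an assumption chosen for brevity; the slides do not state which learning method is applied to the table, and the new applicant is hypothetical.

```python
# Training rows from the sample table:
# (Age_Cust, Has_Job, Own House, Credit Score) -> Decision
TRAINING = [
    (("Old", "Yes", "No", "Excellent"), "Yes"),
    (("Middle", "No", "No", "Good"), "No"),
    (("Young", "Yes", "No", "Fair"), "Yes"),
]


def mismatches(a, b):
    """Count the attributes on which two applicants differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(applicant):
    """Return the decision of the most similar training row (1-nearest neighbour)."""
    _, label = min(TRAINING, key=lambda row: mismatches(row[0], applicant))
    return label


# Hypothetical new application: differs from the first row only in credit score.
decision = classify(("Old", "Yes", "No", "Good"))
print(decision)
```

With only three training rows the result is purely illustrative; a real system would need far more labelled examples.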
DUPLICATE DETECTION IN UDD
UDD:
• Unsupervised Duplicate Detection
• Implemented to address the problem of record matching in the Web database scenario
• For a given query, it can effectively identify duplicates from the query result records of multiple Web databases.
Contd…
• Two classifiers are used:
  • Weighted component similarity summing (WCSS) classifier:
    - Calculates the similarity between a pair of records.
    - Identifies duplicates without any training.
  • SVM classifier:
    - Uses training data to classify a record pair as duplicate or non-duplicate.
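The idea behind the WCSS classifier can be sketched as a weighted sum of per-field similarities compared against a threshold. The weights and the 0.85 threshold below are assumed values for illustration only; the slides do not specify how UDD chooses them.

```python
def wcss_similarity(sim_vector, weights):
    """Weighted sum of component similarities, normalised by the total weight."""
    if len(sim_vector) != len(weights):
        raise ValueError("sim_vector and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, sim_vector)) / total


def is_duplicate(sim_vector, weights, threshold=0.85):
    """Label a record pair as duplicate when its weighted similarity is high enough."""
    return wcss_similarity(sim_vector, weights) >= threshold


# Assumed weights for three hypothetical fields (title, author, year).
WEIGHTS = [0.5, 0.3, 0.2]

print(is_duplicate([1.0, 0.9, 1.0], WEIGHTS))  # near-identical pair
print(is_duplicate([0.2, 0.1, 1.0], WEIGHTS))  # only the year agrees
```

No labelled training data is needed for this step, which matches the slide's point that WCSS identifies duplicates without training.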
[Figure: UDD Architecture Diagram]
Data Retrieval:
- The data retrieval module consists of an interface that reads the user query, together with the actual data retrieval from the database.
Pre-Processing:
- Data cleaning
- Data is sorted and exactly matching records are deleted.
UDD Algorithm:
- Calculates the similarity vectors of the selected dataset.
- Classifies the data.
Data Presentation:
- Presents the data to the user (unique data, statistics).
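The pre-processing step (sort, then delete exact matches) might look like the following sketch. It assumes records are tuples of field values and that cleaning means lower-casing and collapsing whitespace; both are assumptions, since the slides do not define the cleaning rules.

```python
def normalize(record):
    """Clean one record: lower-case each field and collapse repeated whitespace."""
    return tuple(" ".join(str(value).lower().split()) for value in record)


def preprocess(records):
    """Sort the cleaned records and drop exact matches, which are adjacent after sorting."""
    unique = []
    previous = None
    for rec in sorted(normalize(r) for r in records):
        if rec != previous:
            unique.append(rec)
        previous = rec
    return unique


# Hypothetical query results merged from two Web databases.
results = [
    ("Data Mining", "Han"),
    ("data  mining", "han"),
    ("Databases", "Ullman"),
]
cleaned = preprocess(results)
print(cleaned)
```

Only exact matches are removed here; near-duplicates that survive this step are left for the similarity-vector and classification stages.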
CONCLUSION
• Duplicate detection is an important step in data integration.
• In the Web database scenario, the records to match are highly query-dependent, so predefined matching rules are not applicable.
• Two classifiers, WCSS and SVM, are used.
• UDD is implemented to overcome the problems of detecting duplicates over the query results of multiple Web databases.
REFERENCES
R. Ananthakrishna, S. Chaudhuri, and V. Ganti, "Eliminating Fuzzy Duplicates in Data Warehouses", Proc. 28th Intl Conf. Very Large Data Bases, pp. 586-597, 2002.
R. Baxter, P. Christen, and T. Churches, "A Comparison of Fast Blocking Methods for Record Linkage", Proc. KDD Workshop Data Cleaning, Record Linkage and Object Consolidation, pp. 25-27, 2003.
S. Chaudhuri, V. Ganti, and R. Motwani, "Robust Identification of Fuzzy Duplicates", Proc. 21st IEEE Intl Conf. Data Eng., pp. 865-876, 2005.
