3. Introduction
RECORD MATCHING
Record matching identifies records that refer to the same real-world entity across different data sources.
A database is a collection of large amounts of information.
Duplicate detection has attracted much attention from many research fields.
4 January 2015 DYPIET
4. Contd…
Duplicate detection has attracted much attention from research fields such as databases, data mining, artificial intelligence, and NLP.
In the Web database scenario, the records to match are highly query-dependent.
5. Problem Definition
Most previous work is based on predefined matching rules hand-coded by domain experts.
Such approaches work well in a traditional database environment.
6. Contd…
The main concern is data duplication in Web databases.
The goal of duplicate detection is to determine the matching status of record pairs.
7. Concept
Record matching methods use two major steps:
Identifying a similarity function:
- Using training examples
- Decision tree learning
- Bayesian network
- SVM
Matching records:
- The similarity function is used to calculate the similarity between the candidate pairs.
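As an illustration of the matching step, a minimal similarity function might average token-overlap (Jaccard) similarity across the fields of a candidate pair. The field names and sample records below are illustrative assumptions, not taken from the slides:

```python
def field_similarity(a: str, b: str) -> float:
    """Token-based Jaccard similarity between two field values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def record_similarity(r1: dict, r2: dict, fields: list) -> float:
    """Average the per-field similarities of a candidate record pair."""
    return sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)

# Two query-result records that plausibly describe the same entity.
r1 = {"title": "Data Mining Concepts", "author": "J. Han"}
r2 = {"title": "data mining concepts", "author": "Jiawei Han"}
print(record_similarity(r1, r2, ["title", "author"]))
```

Any of the learned models listed above (decision tree, Bayesian network, SVM) can play the role of this hand-written function once training examples are available.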
8. Supervised Learning
• A computer system learns from data, which represents some “past experiences” of an application domain.

Age_Cust | Has_Job | Own_House | Credit_Score | Decision
Old      | Yes     | No        | Excellent    | Yes
Middle   | No      | No        | Good         | No
Young    | Yes     | No        | Fair         | Yes

Table: Sample training data used for supervised learning

• Using the above training data, the system is trained how to respond to a given set of criteria.
9. Contd…
• Once training is completed, the system uses this data to classify new records.
• For example, using the above training data, the system can respond to a new credit card application based on the criteria defined.
• Supervised learning uses pre-determined training data. Such an approach is also known as classification.
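The learn-then-classify process above can be sketched with a tiny "one rule" (1R) learner that induces the single best-predicting feature from data like the sample table. The fourth training row, the field names, and the 1R method itself are illustrative assumptions, not from the slides:

```python
# Hypothetical training data mirroring the sample table (the 4th row is added
# for illustration so that no single value set predicts the target trivially).
TRAIN = [
    {"age": "Old",    "has_job": True,  "own_house": False, "credit": "Excellent", "approve": True},
    {"age": "Middle", "has_job": False, "own_house": False, "credit": "Good",      "approve": False},
    {"age": "Young",  "has_job": True,  "own_house": False, "credit": "Fair",      "approve": True},
    {"age": "Old",    "has_job": False, "own_house": False, "credit": "Good",      "approve": False},
]

def one_rule(train, target):
    """Pick the single feature whose values best predict the target (1R)."""
    best_feat, best_rule, best_hits = None, None, -1
    features = [k for k in train[0] if k != target]
    for f in features:
        counts = {}
        for row in train:
            votes = counts.setdefault(row[f], {})
            votes[row[target]] = votes.get(row[target], 0) + 1
        # Map each feature value to the majority target label it was seen with.
        rule = {value: max(votes, key=votes.get) for value, votes in counts.items()}
        hits = sum(rule[row[f]] == row[target] for row in train)
        if hits > best_hits:
            best_feat, best_rule, best_hits = f, rule, hits
    return best_feat, best_rule

feature, rule = one_rule(TRAIN, "approve")
# A new credit card application is classified with the induced rule.
new_applicant = {"age": "Young", "has_job": True, "own_house": True, "credit": "Good"}
print(feature, rule[new_applicant[feature]])
```

Real systems would use the decision tree, Bayesian network, or SVM learners named earlier instead of this toy rule, but the train/classify split is the same.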
10. DUPLICATE DETECTION IN UDD
UDD: Unsupervised Duplicate Detection
UDD is implemented to address the problem of record matching in the Web database scenario.
For a given query, it can effectively identify duplicates from the query result records of multiple Web databases.
11. Contd…
Two classifiers are used:
Weighted Component Similarity Summing (WCSS) classifier:
- Calculates the similarity between a pair of records.
- Duplicates are identified without any training.
SVM classifier:
- Uses training data to classify a record as duplicate or non-duplicate.
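A minimal sketch of the WCSS idea: per-field similarities are weighted, summed, and thresholded to label a pair as duplicate, with no training data. The field weights, threshold, and Jaccard field similarity below are illustrative assumptions, not the values used by UDD:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two field values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 1.0

def wcss_score(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities for one candidate pair."""
    return sum(w * jaccard(r1[f], r2[f]) for f, w in weights.items())

def is_duplicate(r1: dict, r2: dict, weights: dict, threshold: float = 0.75) -> bool:
    """No training needed: the pair is labelled a duplicate by thresholding."""
    return wcss_score(r1, r2, weights) >= threshold

weights = {"title": 0.7, "author": 0.3}  # assumed field weights, summing to 1
b1 = {"title": "Introduction to Algorithms", "author": "Cormen Leiserson"}
b2 = {"title": "introduction to algorithms", "author": "T. Cormen"}
print(is_duplicate(b1, b2, weights))
```

In UDD the pairs labelled this way then supply training examples to the SVM classifier, which refines the duplicate/non-duplicate decision.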
13. Contd…
Data Retrieval:
- The data retrieval module consists of an interface to read the user query, along with the actual data retrieval from the database.
Pre-Processing:
- Data cleaning: the data is sorted and exact matching records are deleted.
UDD Algorithm:
- Calculates the similarity vectors of the selected dataset.
- Classification of the data.
Data Presentation:
- Presenting the data to the user (unique data, statistics).
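The pre-processing step above (sort the retrieved records, then delete exact matches) might look like this sketch; the record fields are illustrative:

```python
def clean(records: list) -> list:
    """Sort records by their field values and drop exact duplicates."""
    seen, unique = set(), []
    for rec in sorted(records, key=lambda r: tuple(sorted(r.items()))):
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = [
    {"title": "B", "price": "5"},
    {"title": "A", "price": "3"},
    {"title": "B", "price": "5"},  # exact duplicate, deleted here
]
print(clean(rows))
```

Only near-duplicates (differing spellings, abbreviations, etc.) survive this step; those are what the UDD algorithm's similarity vectors and classifiers then handle.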
14. CONCLUSION
Duplicate detection is an important step in data integration.
In the Web database scenario, the records to match are highly query-dependent, so approaches based on predefined, hand-coded matching rules are not applicable.
Two classifiers, WCSS and SVM, are used.
UDD is implemented to overcome these problems when detecting duplicates over the query results of multiple Web databases.