3. Introduction
RECORD MATCHING
Record matching identifies records that refer to the same real-world entity across different data sources.
A database is a collection of large amounts of information.
Duplicate detection has attracted much attention from many research fields.
4 January 2015 DYPIET
4. Contd…
Duplicate detection has attracted much attention from research fields such as databases, data mining, artificial intelligence, and NLP.
In the Web database scenario, the records to match are highly query-dependent.
5. Problem Definition
Most previous work is based on predefined matching rules hand-coded by domain experts.
Such approaches work well in a traditional database environment.
6. Contd…
The main concern is data duplication in Web databases.
The goal of duplicate detection is to determine the matching status of record pairs.
7. Concept
Record matching methods use two major steps:
Identifying a similarity function:
- Using training examples
- Decision tree learning
- Bayesian network
- SVM
Matching records:
- The similarity function is used to calculate the similarity between the candidate pairs.
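As an illustration of the matching step, a minimal similarity function might average token-overlap (Jaccard) similarity across the fields of a candidate pair. The field names and sample records below are illustrative assumptions, not taken from the slides:

```python
def field_similarity(a: str, b: str) -> float:
    """Token-based Jaccard similarity between two field values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def record_similarity(r1: dict, r2: dict, fields: list) -> float:
    """Average the per-field similarities of a candidate record pair."""
    return sum(field_similarity(r1[f], r2[f]) for f in fields) / len(fields)

# Two query-result records that plausibly describe the same entity.
r1 = {"title": "Data Mining Concepts", "author": "J. Han"}
r2 = {"title": "data mining concepts", "author": "Jiawei Han"}
print(record_similarity(r1, r2, ["title", "author"]))
```

Any of the learned models listed above (decision tree, Bayesian network, SVM) can play the role of this hand-written function once training examples are available.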
8. Supervised Learning
• A computer system learns from data, which represents some “past experiences” of an application domain.

Age_Cust | Has_Job | Own_House | Credit_Score | Decision
Old      | Yes     | No        | Excellent    | Yes
Middle   | No      | No        | Good         | No
Young    | Yes     | No        | Fair         | Yes

Table: Sample training data used for supervised learning

• Using the above training data, the system is trained how to respond to a given set of criteria.
9. Contd…
• Once training is completed, the system uses this data to classify new records.
• For example, using the above training data, the system can respond to a new credit card application based on the criteria defined.
• Supervised learning uses pre-determined training data. Such an approach is also known as classification.
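The learn-then-classify process above can be sketched with a tiny "one rule" (1R) learner that induces the single best-predicting feature from data like the sample table. The fourth training row, the field names, and the 1R method itself are illustrative assumptions, not from the slides:

```python
# Hypothetical training data mirroring the sample table (the 4th row is added
# for illustration so that no single value set predicts the target trivially).
TRAIN = [
    {"age": "Old",    "has_job": True,  "own_house": False, "credit": "Excellent", "approve": True},
    {"age": "Middle", "has_job": False, "own_house": False, "credit": "Good",      "approve": False},
    {"age": "Young",  "has_job": True,  "own_house": False, "credit": "Fair",      "approve": True},
    {"age": "Old",    "has_job": False, "own_house": False, "credit": "Good",      "approve": False},
]

def one_rule(train, target):
    """Pick the single feature whose values best predict the target (1R)."""
    best_feat, best_rule, best_hits = None, None, -1
    features = [k for k in train[0] if k != target]
    for f in features:
        counts = {}
        for row in train:
            votes = counts.setdefault(row[f], {})
            votes[row[target]] = votes.get(row[target], 0) + 1
        # Map each feature value to the majority target label it was seen with.
        rule = {value: max(votes, key=votes.get) for value, votes in counts.items()}
        hits = sum(rule[row[f]] == row[target] for row in train)
        if hits > best_hits:
            best_feat, best_rule, best_hits = f, rule, hits
    return best_feat, best_rule

feature, rule = one_rule(TRAIN, "approve")
# A new credit card application is classified with the induced rule.
new_applicant = {"age": "Young", "has_job": True, "own_house": True, "credit": "Good"}
print(feature, rule[new_applicant[feature]])
```

Real systems would use the decision tree, Bayesian network, or SVM learners named earlier instead of this toy rule, but the train/classify split is the same.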
10. DUPLICATE DETECTION IN UDD
UDD: Unsupervised Duplicate Detection
UDD is implemented to address the problem of record matching in the Web database scenario.
For a given query, it can effectively identify duplicates from the query result records of multiple Web databases.
11. Contd…
Two classifiers are used:
Weighted Component Similarity Summing (WCSS) classifier:
- Calculates the similarity between a pair of records.
- Duplicates are identified without any training.
SVM classifier:
- Uses training data to classify a record as duplicate or non-duplicate.
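A minimal sketch of the WCSS idea: per-field similarities are weighted, summed, and thresholded to label a pair as duplicate, with no training data. The field weights, threshold, and Jaccard field similarity below are illustrative assumptions, not the values used by UDD:

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two field values."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 1.0

def wcss_score(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted sum of per-field similarities for one candidate pair."""
    return sum(w * jaccard(r1[f], r2[f]) for f, w in weights.items())

def is_duplicate(r1: dict, r2: dict, weights: dict, threshold: float = 0.75) -> bool:
    """No training needed: the pair is labelled a duplicate by thresholding."""
    return wcss_score(r1, r2, weights) >= threshold

weights = {"title": 0.7, "author": 0.3}  # assumed field weights, summing to 1
b1 = {"title": "Introduction to Algorithms", "author": "Cormen Leiserson"}
b2 = {"title": "introduction to algorithms", "author": "T. Cormen"}
print(is_duplicate(b1, b2, weights))
```

In UDD the pairs labelled this way then supply training examples to the SVM classifier, which refines the duplicate/non-duplicate decision.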
13. Contd…
Data Retrieval:
- The data retrieval module consists of an interface to read the user query, along with the actual data retrieval from the database.
Pre-Processing:
- Data cleaning: the data is sorted and exact matching records are deleted.
UDD Algorithm:
- Calculates the similarity vectors of the selected dataset.
- Classification of the data.
Data Presentation:
- Presenting the data to the user (unique data, statistics).
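The pre-processing step above (sort the retrieved records, then delete exact matches) might look like this sketch; the record fields are illustrative:

```python
def clean(records: list) -> list:
    """Sort records by their field values and drop exact duplicates."""
    seen, unique = set(), []
    for rec in sorted(records, key=lambda r: tuple(sorted(r.items()))):
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

rows = [
    {"title": "B", "price": "5"},
    {"title": "A", "price": "3"},
    {"title": "B", "price": "5"},  # exact duplicate, deleted here
]
print(clean(rows))
```

Only near-duplicates (differing spellings, abbreviations, etc.) survive this step; those are what the UDD algorithm's similarity vectors and classifiers then handle.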
14. CONCLUSION
Duplicate detection is an important step in data integration.
In the Web database scenario, the records to match are highly query-dependent, so approaches based on predefined, hand-coded matching rules are not applicable.
Two classifiers, WCSS and SVM, are used.
UDD is implemented to overcome these problems when detecting duplicates over the query results of multiple Web databases.