SlideShare a Scribd company logo
1 of 28
Efficient Techniques for Online
        Record Linkage

            Guided By,
            Mrs.k.Sujatha B.E.,M.Tech.,(Ph.D..,)


                       Presented by,
                         D. Angelin chitra,
                        A. Jainambu sariba,
                         S. Sathiya priya,
                     K. Sridevi.


                                                   1
Overview
Introduction
LiteratureReview
Record Linkage
System Design
Modules
Screen Shots
Conclusion



                         2
Introduction
 Databases   frequently contain duplicate fields and
  records that refer to the same real-world entity.
 The data needed to support these decisions are often
  scattered in heterogeneous distributed databases.
 Heterogeneous databases are usually designed and
  managed by different organizations there may be no
  common candidate key for linking the records.
 If the databases use the same set of design standards
  linking can easily be done using the primary key.
 In this project we are developing a software which
  will produce the relevant links from heterogenous
  database with reference to the keyword.



                                                          3
ABSTRACT
   Matching records that refer to the same entity across databases is
    becoming an increasingly important part of many data mining
    projects, as often data from multiple sources needs to be matched in
    order to enrich data or improve its quality.
    Record linkage is the computation of the associations among
    records of multiple databases.
   Matching data from heterogeneous data source has been a real
    problem.
    Statistical record linkage techniques could be used for resolving
    this problem but it causes communication bottleneck in a
    distributed environment.
   A matching tree is used to overcome communication overhead and
    give matching decision as obtained using the conventional linkage
    technique.

                                                                       4
Literature Review

Existing System
 No common candidate key for linking the records in
  Heterogeneous databases.
 It is possible to use common non key attributes to access
  the heterogeneous database, but the result obtained using
  these attributes may not always be accurate.
 When the matching records reside at a remote site,
  existing techniques cannot be directly applied because
  they would involve transferring the entire remote
  relation, thereby incurring a huge communication
  overhead.
 Different ranking algorithm are used for the search
  which may be time consuming.
                                                          5
Disadvantages

 It   has not been work on online.
 Not    cost effective.
 It   cannot be reduce communication overhead.




                                                  6
Proposed System
 An  efficient technique is developed to facilitate record
  linkage decisions in a distributed, online setting.
 A matching tree is developed for attribute acquisition
  based on sequential decision making.
 The proposed techniques reduce the communication
  overhead considerably, and the linkage performance is
  assured to be at the same level as the traditional
  approach.




                                                              7
Advantages

 Reducing     the communication overhead in a distributed
  environment cost effective model.
 It   is cost effective model.




                                                             8
Sequential record linkage




                            9
Principles of matching tree
Input  selection
      Assume that we are at some node of the
 tree and are trying to decide how to branch
 from there. At that point , we would be
 interested in finding the next best attribute to
  be acquired from the set of remaining
 attributes.
Stopping
      The stopping decision is made when no
 realisation of the remaining attributes can
 sufficiently revise the current matching
 probability so that the matching decision
 changes

                                                10
Tree based linkage techniques
◦ Here we develop efficient online record linkage techniques
  based on the matching tree
◦ The first two stages in this process are performed offline,
  using the training data.
◦ Once the matching tree has been built, the online linkage is
  done as the final step.
◦ We can now characterize the different techniques that can
  be employed in the last step.
◦ Given a local enquiry record, the ultimate goal of any
  linkage technique is to identify and fetch all the records
  from the remote site that have a matching probability
◦ The partitioning itself can be done in one of two possible
  ways:
    1) Sequential
    2) Concurrent

                                                                 11
Sequential partitioning
The  set of remote records is partitioned
 recursively, till we obtain the desired
 partition of all the relevant records.
This partitioning can be done in one of
 two ways:
        i). Sequential Attribute Acquisition
        ii). Sequential Identifier Acquisition



                                                 12
Concurrent Partitioning
The   tree is used to formulate a database
 query that selects the relevant remote
 records directly, in one single step.
Once the relevant records are identified,
 all their attribute values are transferred




                                              13
Record Linkage
Record   linkage refers to the task of
 finding records in a data set that refer to the
 same entity across different data sources (e.g.,
 data files, books, websites, databases).
Record linkage is necessary when joining data
 sets based on entities that may or may not
 share a common identifier.
Record linkage has applications in customer
 systems for marketing, relationship
 management, fraud detection, law
 enforcement and government administration


                                                    14
15
Modules

 Multiple Source Data.
 Detecting Overhead By Matching Tree.
 Overheads Eliminated.




                                         16
MODULE DESCRIPTION




                     17
Multiple Source Data
 Databases   is becoming an increasingly important part of
  many data mining projects, as often data from multiple
  sources needs to be matched in order to enrich data or
  improve its quality.
 The data needed to support these decisions are often scattered
  in heterogeneous distributed databases.
 In such cases, it maybe necessary to link records in multiple
  databases so that one can consolidate and use the data
  pertaining to the same real world entity.
 Entity matching is a crucial task for data integration and data
  cleaning. It is the task of identifying entities (objects, data
  instances) referring to the same real-world entity.
 The entity matching problem arises when there is no common
  identifier across the heterogeneous data sources.

                                                                18
Detecting Overhead by Matching
              Tree
To  integrate or link the data stored in
 heterogeneous data sources, a critical
 problem is entity matching, i.e., matching
 records representing corresponding
 entities in the real world, across the
 sources.
In this paper, we describe how this
 method can be applied in entity matching
 rules from heterogeneous databases.

                                              19
Overheads Eliminated
We   develop a matching tree, similar to
 a decision tree, and use it to reduce
 the communication overhead
 significantly.
The online record linkage process
 become more efficient by reducing the
 communication overhead in a
 distributed environment.


                                            20
Screen shots
Login   page




                           21
Registration page




                    22
Search page




              23
Search results




                 24
Conclusion
Record   linkage is an important issue in
 heterogeneous database systems where
 the records representing the same real-
 world entity type are identified using
 different identifiers in different databases.
In the absence of a common identifier, it
 is often difficult to find records in a
 remote database that are similar to a local
 enquiry record.

                                                 25
References
   A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, “Duplicate
                     Record Detection: A Survey,” IEEE Trans.
    Knowledge and Data Eng.,vol. 19, no. 1, pp. 1-16, Jan. 2007.
   B. Tepping, “A Model for Optimum Linkage of Records,” J. Am.
    Statistical Assoc., vol. 63, pp. 1321-1332, 1968.
   C. Batini, M. Lenzerini, and S.B. Navathe, “A Comparative
    Analysis of Methodologies for Database Schema
    Integration,”ACM Computing Surveys, vol. 18, no. 4, pp. 323-
    364, 1986
   D. Dey, “Record Matching in Data Warehouses: A Decision
    Model for Data Consolidation,” Operations Research, vol. 51,
    no. 2, pp. 240-254,
   D. Dey, S. Sarkar, and P. De, “A Distance-Based Approach to
    Entity Reconciliation in Heterogeneous Databases,” IEEE
    Trans.Knowledge and Data Eng., vol. 14, no. 3, pp. 567-582,
    May/June 2002.


                                                                 26
THANK YOU




            27
QUERIES




          28

More Related Content

What's hot

Urika-GD Product Brief Online 5-page
Urika-GD Product Brief Online 5-pageUrika-GD Product Brief Online 5-page
Urika-GD Product Brief Online 5-page
Adnan Khaleel
 
Privacy Preserving Distributed Association Rule Mining Algorithm for Vertical...
Privacy Preserving Distributed Association Rule Mining Algorithm for Vertical...Privacy Preserving Distributed Association Rule Mining Algorithm for Vertical...
Privacy Preserving Distributed Association Rule Mining Algorithm for Vertical...
IJCSIS Research Publications
 
Graph Based Workload Driven Partitioning System by Using MongoDB
Graph Based Workload Driven Partitioning System by Using MongoDBGraph Based Workload Driven Partitioning System by Using MongoDB
Graph Based Workload Driven Partitioning System by Using MongoDB
IJAAS Team
 
Final review m score
Final review m scoreFinal review m score
Final review m score
azhar4010
 

What's hot (19)

DLD_SYNOPSIS
DLD_SYNOPSISDLD_SYNOPSIS
DLD_SYNOPSIS
 
FUZZY FINGERPRINT METHOD FOR DETECTION OF SENSITIVE DATA EXPOSURE
FUZZY FINGERPRINT METHOD FOR DETECTION OF SENSITIVE DATA EXPOSUREFUZZY FINGERPRINT METHOD FOR DETECTION OF SENSITIVE DATA EXPOSURE
FUZZY FINGERPRINT METHOD FOR DETECTION OF SENSITIVE DATA EXPOSURE
 
Genetic Algorithm based Reversible Watermarking Approach for Numeric and Non-...
Genetic Algorithm based Reversible Watermarking Approach for Numeric and Non-...Genetic Algorithm based Reversible Watermarking Approach for Numeric and Non-...
Genetic Algorithm based Reversible Watermarking Approach for Numeric and Non-...
 
Urika-GD Product Brief Online 5-page
Urika-GD Product Brief Online 5-pageUrika-GD Product Brief Online 5-page
Urika-GD Product Brief Online 5-page
 
U0 vqmtq3m tc=
U0 vqmtq3m tc=U0 vqmtq3m tc=
U0 vqmtq3m tc=
 
Repository Federation: Towards Data Interoperability
Repository Federation: Towards Data InteroperabilityRepository Federation: Towards Data Interoperability
Repository Federation: Towards Data Interoperability
 
V01 i010414
V01 i010414V01 i010414
V01 i010414
 
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...
Analysis of Bayes, Neural Network and Tree Classifier of Classification Techn...
 
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data ServicesNISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
NISO Forum, Denver, Sept. 24, 2012: DataCite and Campus Data Services
 
Privacy Preserving Distributed Association Rule Mining Algorithm for Vertical...
Privacy Preserving Distributed Association Rule Mining Algorithm for Vertical...Privacy Preserving Distributed Association Rule Mining Algorithm for Vertical...
Privacy Preserving Distributed Association Rule Mining Algorithm for Vertical...
 
Big Data Repository for Structural Biology: Challenges and Opportunities by P...
Big Data Repository for Structural Biology: Challenges and Opportunities by P...Big Data Repository for Structural Biology: Challenges and Opportunities by P...
Big Data Repository for Structural Biology: Challenges and Opportunities by P...
 
Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management Educating a New Breed of Data Scientists for Scientific Data Management
Educating a New Breed of Data Scientists for Scientific Data Management
 
SEAD Virtual Archive: Building a Federation of Institutional Repositories fo...
 SEAD Virtual Archive: Building a Federation of Institutional Repositories fo... SEAD Virtual Archive: Building a Federation of Institutional Repositories fo...
SEAD Virtual Archive: Building a Federation of Institutional Repositories fo...
 
Secure distributed deduplication systems with improved reliability
Secure distributed deduplication systems with improved reliabilitySecure distributed deduplication systems with improved reliability
Secure distributed deduplication systems with improved reliability
 
Maintaining Data Confidentiality in Association Rule Mining in Distributed En...
Maintaining Data Confidentiality in Association Rule Mining in Distributed En...Maintaining Data Confidentiality in Association Rule Mining in Distributed En...
Maintaining Data Confidentiality in Association Rule Mining in Distributed En...
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
A Review on Resource Discovery Strategies in Grid Computing
A Review on Resource Discovery Strategies in Grid ComputingA Review on Resource Discovery Strategies in Grid Computing
A Review on Resource Discovery Strategies in Grid Computing
 
Graph Based Workload Driven Partitioning System by Using MongoDB
Graph Based Workload Driven Partitioning System by Using MongoDBGraph Based Workload Driven Partitioning System by Using MongoDB
Graph Based Workload Driven Partitioning System by Using MongoDB
 
Final review m score
Final review m scoreFinal review m score
Final review m score
 

Viewers also liked

Prescription event monitoring and record linkage system
Prescription event monitoring and record linkage systemPrescription event monitoring and record linkage system
Prescription event monitoring and record linkage system
Vineetha Menon
 
Central limit theorem
Central limit theoremCentral limit theorem
Central limit theorem
Vijeesh Soman
 

Viewers also liked (16)

Prescription Event Monitoring & Record Linkage Systems
Prescription Event Monitoring & Record Linkage SystemsPrescription Event Monitoring & Record Linkage Systems
Prescription Event Monitoring & Record Linkage Systems
 
Java IEEE 2013 Projects list
Java IEEE 2013 Projects list Java IEEE 2013 Projects list
Java IEEE 2013 Projects list
 
Prescription event monitorig
Prescription event monitorigPrescription event monitorig
Prescription event monitorig
 
Linking data without common identifiers
Linking data without common identifiersLinking data without common identifiers
Linking data without common identifiers
 
Indexing Techniques for Scalable Record Linkage and Deduplication
Indexing Techniques for Scalable Record Linkage and DeduplicationIndexing Techniques for Scalable Record Linkage and Deduplication
Indexing Techniques for Scalable Record Linkage and Deduplication
 
A Case Study in Record Linkage_PVER Conf_May2011
A Case Study in Record Linkage_PVER Conf_May2011A Case Study in Record Linkage_PVER Conf_May2011
A Case Study in Record Linkage_PVER Conf_May2011
 
Prescription event monitoring and record linkage system
Prescription event monitoring and record linkage systemPrescription event monitoring and record linkage system
Prescription event monitoring and record linkage system
 
Spontaneous reporting
Spontaneous reporting Spontaneous reporting
Spontaneous reporting
 
Methods of data collection
Methods of data collectionMethods of data collection
Methods of data collection
 
Central limit theorem
Central limit theoremCentral limit theorem
Central limit theorem
 
Data and information
Data and informationData and information
Data and information
 
Data vs Information vs Knowledge
Data vs Information vs Knowledge Data vs Information vs Knowledge
Data vs Information vs Knowledge
 
Data vs. information
Data vs. informationData vs. information
Data vs. information
 
Data collection presentation
Data collection presentationData collection presentation
Data collection presentation
 
storage devices
storage devicesstorage devices
storage devices
 
Tools of data collection
Tools of data collectionTools of data collection
Tools of data collection
 

Similar to online Record Linkage

Indexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record DeduplicationIndexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record Deduplication
idescitation
 
11.0004www.iiste.org call for paper.on demand quality of web services using r...
11.0004www.iiste.org call for paper.on demand quality of web services using r...11.0004www.iiste.org call for paper.on demand quality of web services using r...
11.0004www.iiste.org call for paper.on demand quality of web services using r...
Alexander Decker
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic Web
Irina Hutanu
 
The FAIR data movement and 22 Feb 2023.pdf
The FAIR data movement and 22 Feb 2023.pdfThe FAIR data movement and 22 Feb 2023.pdf
The FAIR data movement and 22 Feb 2023.pdf
Alan Morrison
 
2013 nas-ehs-data-integration-dc
2013 nas-ehs-data-integration-dc2013 nas-ehs-data-integration-dc
2013 nas-ehs-data-integration-dc
c.titus.brown
 

Similar to online Record Linkage (20)

Implementation of Matching Tree Technique for Online Record Linkage
Implementation of Matching Tree Technique for Online Record LinkageImplementation of Matching Tree Technique for Online Record Linkage
Implementation of Matching Tree Technique for Online Record Linkage
 
DBMS basics
DBMS basicsDBMS basics
DBMS basics
 
Semantic Conflicts and Solutions in Integration of Fuzzy Relational Databases
Semantic Conflicts and Solutions in Integration of Fuzzy Relational DatabasesSemantic Conflicts and Solutions in Integration of Fuzzy Relational Databases
Semantic Conflicts and Solutions in Integration of Fuzzy Relational Databases
 
Iaetsd a survey on one class clustering
Iaetsd a survey on one class clusteringIaetsd a survey on one class clustering
Iaetsd a survey on one class clustering
 
Database Management System
Database Management SystemDatabase Management System
Database Management System
 
Indexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record DeduplicationIndexing based Genetic Programming Approach to Record Deduplication
Indexing based Genetic Programming Approach to Record Deduplication
 
11.0004www.iiste.org call for paper.on demand quality of web services using r...
11.0004www.iiste.org call for paper.on demand quality of web services using r...11.0004www.iiste.org call for paper.on demand quality of web services using r...
11.0004www.iiste.org call for paper.on demand quality of web services using r...
 
4.on demand quality of web services using ranking by multi criteria 31-35
4.on demand quality of web services using ranking by multi criteria 31-354.on demand quality of web services using ranking by multi criteria 31-35
4.on demand quality of web services using ranking by multi criteria 31-35
 
RDBMS to NoSQL. An overview.
RDBMS to NoSQL. An overview.RDBMS to NoSQL. An overview.
RDBMS to NoSQL. An overview.
 
Spe165 t
Spe165 tSpe165 t
Spe165 t
 
Data Integration in Multi-sources Information Systems
Data Integration in Multi-sources Information SystemsData Integration in Multi-sources Information Systems
Data Integration in Multi-sources Information Systems
 
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...
IRJET-Efficient Data Linkage Technique using one Class Clustering Tree for Da...
 
Data models and ro
Data models and roData models and ro
Data models and ro
 
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
Privacy Preserving MFI Based Similarity Measure For Hierarchical Document Clu...
 
NoSql And The Semantic Web
NoSql And The Semantic WebNoSql And The Semantic Web
NoSql And The Semantic Web
 
MC0088 Internal Assignment (SMU)
MC0088 Internal Assignment (SMU)MC0088 Internal Assignment (SMU)
MC0088 Internal Assignment (SMU)
 
The FAIR data movement and 22 Feb 2023.pdf
The FAIR data movement and 22 Feb 2023.pdfThe FAIR data movement and 22 Feb 2023.pdf
The FAIR data movement and 22 Feb 2023.pdf
 
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptxAstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
AstraZeneca at Neo4j GraphSummit London 14Nov23.pptx
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
 
2013 nas-ehs-data-integration-dc
2013 nas-ehs-data-integration-dc2013 nas-ehs-data-integration-dc
2013 nas-ehs-data-integration-dc
 

Recently uploaded

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 

Recently uploaded (20)

Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 

online Record Linkage

  • 1. Efficient Techniques for Online Record Linkage Guided By, Mrs.k.Sujatha B.E.,M.Tech.,(Ph.D..,) Presented by, D. Angelin chitra, A. Jainambu sariba, S. Sathiya priya, K. Sridevi. 1
  • 3. Introduction  Databases frequently contain duplicate fields and records that refer to the same real-world entity.  The data needed to support these decisions are often scattered in heterogeneous distributed databases.  Heterogeneous databases are usually designed and managed by different organizations there may be no common candidate key for linking the records.  If the databases use the same set of design standards linking can easily be done using the primary key.  In this project we are developing a software which will produce the relevant links from heterogenous database with reference to the keyword. 3
  • 4. ABSTRACT  Matching records that refer to the same entity across databases is becoming an increasingly important part of many data mining projects, as often data from multiple sources needs to be matched in order to enrich data or improve its quality.  Record linkage is the computation of the associations among records of multiple databases.  Matching data from heterogeneous data source has been a real problem.  Statistical record linkage techniques could be used for resolving this problem but it causes communication bottleneck in a distributed environment.  A matching tree is used to overcome communication overhead and give matching decision as obtained using the conventional linkage technique. 4
  • 5. Literature Review Existing System  No common candidate key for linking the records in Heterogeneous databases.  It is possible to use common non key attributes to access the heterogeneous database, but the result obtained using these attributes may not always be accurate.  When the matching records reside at a remote site, existing techniques cannot be directly applied because they would involve transferring the entire remote relation, thereby incurring a huge communication overhead.  Different ranking algorithm are used for the search which may be time consuming. 5
  • 6. Disadvantages  It has not been work on online.  Not cost effective.  It cannot be reduce communication overhead. 6
  • 7. Proposed System  An efficient technique is developed to facilitate record linkage decisions in a distributed, online setting.  A matching tree is developed for attribute acquisition based on sequential decision making.  The proposed techniques reduce the communication overhead considerably, and the linkage performance is assured to be at the same level as the traditional approach. 7
  • 8. Advantages  Reducing the communication overhead in a distributed environment cost effective model.  It is cost effective model. 8
  • 10. Principles of matching tree Input selection Assume that we are at some node of the tree and are trying to decide how to branch from there. At that point , we would be interested in finding the next best attribute to be acquired from the set of remaining attributes. Stopping The stopping decision is made when no realisation of the remaining attributes can sufficiently revise the current matching probability so that the matching decision changes 10
  • 11. Tree based linkage techniques ◦ Here we develop efficient online record linkage techniques based on the matching tree ◦ The first two stages in this process are performed offline, using the training data. ◦ Once the matching tree has been built, the online linkage is done as the final step. ◦ We can now characterize the different techniques that can be employed in the last step. ◦ Given a local enquiry record, the ultimate goal of any linkage technique is to identify and fetch all the records from the remote site that have a matching probability ◦ The partitioning itself can be done in one of two possible ways: 1) Sequential 2) Concurrent 11
  • 12. Sequential partitioning The set of remote records is partitioned recursively, till we obtain the desired partition of all the relevant records. This partitioning can be done in one of two ways: i). Sequential Attribute Acquisition ii). Sequential Identifier Acquisition 12
  • 13. Concurrent Partitioning The tree is used to formulate a database query that selects the relevant remote records directly, in one single step. Once the relevant records are identified, all their attribute values are transferred 13
  • 14. Record Linkage Record linkage refers to the task of finding records in a data set that refer to the same entity across different data sources (e.g., data files, books, websites, databases). Record linkage is necessary when joining data sets based on entities that may or may not share a common identifier. Record linkage has applications in customer systems for marketing, relationship management, fraud detection, law enforcement and government administration 14
  • 15. 15
  • 16. Modules  Multiple Source Data.  Detecting Overhead By Matching Tree.  Overheads Eliminated. 16
  • 18. Multiple Source Data  Databases is becoming an increasingly important part of many data mining projects, as often data from multiple sources needs to be matched in order to enrich data or improve its quality.  The data needed to support these decisions are often scattered in heterogeneous distributed databases.  In such cases, it maybe necessary to link records in multiple databases so that one can consolidate and use the data pertaining to the same real world entity.  Entity matching is a crucial task for data integration and data cleaning. It is the task of identifying entities (objects, data instances) referring to the same real-world entity.  The entity matching problem arises when there is no common identifier across the heterogeneous data sources. 18
  • 19. Detecting Overhead by Matching Tree To integrate or link the data stored in heterogeneous data sources, a critical problem is entity matching, i.e., matching records representing corresponding entities in the real world, across the sources. In this paper, we describe how this method can be applied in entity matching rules from heterogeneous databases. 19
  • 20. Overheads Eliminated We develop a matching tree, similar to a decision tree, and use it to reduce the communication overhead significantly. The online record linkage process become more efficient by reducing the communication overhead in a distributed environment. 20
  • 25. Conclusion Record linkage is an important issue in heterogeneous database systems where the records representing the same real- world entity type are identified using different identifiers in different databases. In the absence of a common identifier, it is often difficult to find records in a remote database that are similar to a local enquiry record. 25
  • 26. References  A.K. Elmagarmid, P.G. Ipeirotis, and V.S. Verykios, “Duplicate Record Detection: A Survey,” IEEE Trans. Knowledge and Data Eng.,vol. 19, no. 1, pp. 1-16, Jan. 2007.  B. Tepping, “A Model for Optimum Linkage of Records,” J. Am. Statistical Assoc., vol. 63, pp. 1321-1332, 1968.  C. Batini, M. Lenzerini, and S.B. Navathe, “A Comparative Analysis of Methodologies for Database Schema Integration,”ACM Computing Surveys, vol. 18, no. 4, pp. 323- 364, 1986  D. Dey, “Record Matching in Data Warehouses: A Decision Model for Data Consolidation,” Operations Research, vol. 51, no. 2, pp. 240-254,  D. Dey, S. Sarkar, and P. De, “A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases,” IEEE Trans.Knowledge and Data Eng., vol. 14, no. 3, pp. 567-582, May/June 2002. 26
  • 27. THANK YOU 27
  • 28. QUERIES 28