Deduplication Bouvet BigOne, 2011-04-13 Lars Marius Garshol, <larsga@bouvet.no> http://twitter.com/larsga
Getting started Baby steps
The problem The suppliers table Real-world data is very, very messy
The problem – take 2
  • Diagram labels: Suppliers, Customers, Customers, Customers, Companies; CRM, Billing, ERP
  • Each of these has internal duplicates, plus duplicates across the tables.
  • No easy fix.
But ... what about identifiers?
  • No, there are no system IDs across these tables
  • Yes, there are outside identifiers
      • organization number for companies
      • personal number for people
  • But, these are problematic
      • many records don't have them
      • they are inconsistently formatted
      • sometimes they are misspelled
      • some parts of huge organizations have the same org number, but need to be treated as separate
First attempt at solution
  • I wrote a simple Python script in ~2 hours
  • It does the following:
      • load all records
      • normalize the data: strip extra whitespace, lowercase, remove letters from org codes...
      • use Bayesian inference for matching
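The normalization step could look roughly like the sketch below. The function names are hypothetical, and treating an org code as "digits only" is an assumption (it drops spaces and punctuation as well as letters); the original script is not shown in the slides.

```python
import re


def normalize_text(value: str) -> str:
    """Collapse runs of whitespace, trim the ends, and lowercase."""
    return " ".join(value.split()).lower()


def normalize_org_code(value: str) -> str:
    """Keep only the digits of an organization number.

    Assumption: everything that is not a digit (letters, spaces,
    punctuation) is noise and can be dropped.
    """
    return re.sub(r"\D", "", value)


print(normalize_text("  Acme   Norge  AS "))
print(normalize_org_code("NO 987 654 321 MVA"))
```

Normalizing before comparison means that purely cosmetic differences never reach the matching step.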
Configuration
Matching This sums out to 0.93 probability
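One common way to combine per-field evidence under a naïve Bayes assumption is sketched below. The slide does not show the exact formula or the configured per-field probabilities, so the formula choice, the function names, and the sample values are all assumptions for illustration.

```python
def combine(p1: float, p2: float) -> float:
    """Combine two independent match probabilities (naive Bayes style).

    Inputs are assumed to lie strictly between 0 and 1; the pair (1.0, 0.0)
    would make the denominator zero.
    """
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def match_probability(field_probs):
    """Fold the per-field probabilities together, starting from a neutral 0.5."""
    prob = 0.5
    for p in field_probs:
        prob = combine(prob, p)
    return prob


# Hypothetical per-field values: two fields that each suggest a match.
print(match_probability([0.8, 0.8]))
# One field suggests a match, one suggests a non-match: they cancel out.
print(match_probability([0.9, 0.1]))
```

Values above 0.5 push the combined probability up, values below 0.5 pull it down, and 0.5 leaves it unchanged.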
Problems
  • The functions comparing values are still pretty primitive
  • Performance is abysmal
      • 90 minutes to process 14,500 records
      • performance is O(n²)
      • total number of records is ~2.5 million
      • time to process all records: 1 year 10 months
  • Now what?
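The extrapolation can be sketched as a plain quadratic scaling of the measured run. This assumes running time grows exactly with n² and nothing else changes; the function name is hypothetical. With the figures as stated on the slide, this simple model gives roughly five years. The slide's estimate of 1 year 10 months is what the same formula yields for about 1.5 million records, so one of the two figures may be off or may rest on assumptions not shown.

```python
def extrapolate_quadratic(sample_minutes: float, sample_records: int,
                          total_records: int) -> float:
    """Scale a measured run time to a larger input, assuming O(n^2) cost."""
    scale = (total_records / sample_records) ** 2
    return sample_minutes * scale


MINUTES_PER_YEAR = 60 * 24 * 365

years_2_5m = extrapolate_quadratic(90, 14_500, 2_500_000) / MINUTES_PER_YEAR
years_1_5m = extrapolate_quadratic(90, 14_500, 1_500_000) / MINUTES_PER_YEAR

print(f"2.5 million records: {years_2_5m:.2f} years")
print(f"1.5 million records: {years_1_5m:.2f} years")
```

Either way, the conclusion on the slide stands: comparing every record with every other record is not feasible at this scale.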
An idea
  • Well, we don't necessarily need to compare each record with all others if we have indexes
      • we can look up the records which have matching values
  • Use DBM for the indexes, for example
  • Unfortunately, these only allow exact matching
  • But, we can break up complex values into tokens, and index those
  • Hang on, isn't this rather like a search engine?
  • Bing! Let's try Lucene!
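The token-index idea can be illustrated with a small in-memory inverted index. This is only a sketch of the principle, not the DBM or Lucene setup from the talk; the record data and function names are made up for the example.

```python
from collections import defaultdict


def tokenize(value: str):
    """Split a value into lowercase whitespace-separated tokens."""
    return value.lower().split()


def build_index(records):
    """Map each token to the set of record ids whose value contains it."""
    index = defaultdict(set)
    for rec_id, value in records.items():
        for token in tokenize(value):
            index[token].add(rec_id)
    return index


def candidates(index, value, exclude=None):
    """Return ids of records sharing at least one token with `value`."""
    found = set()
    for token in tokenize(value):
        found |= index.get(token, set())
    found.discard(exclude)
    return found


# Hypothetical sample data.
records = {1: "Acme Norge AS", 2: "ACME Norway", 3: "Bouvet ASA"}
index = build_index(records)
print(candidates(index, records[1], exclude=1))
```

Only the candidates returned by the lookup need the expensive field-by-field comparison, which is what removes the all-pairs loop.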
Lucene-based prototype
  • I whip out Jython and try it
  • New script first builds Lucene index
  • Then searches all records against the index
  • Time to process 14,500 records: 1 minute
  • Now we're talking...
Reality sets in A splash of cold water to the face
Prior art
  • It turns out people have been doing this before
  • They call it
      • entity resolution
      • identity resolution
      • merge/purge
      • deduplication
      • record linkage
      • ...
  • This makes Googling for information an absolute nightmare
Existing tools
  • Several commercial tools
      • they look big and expensive: we skip those
  • Stian found some open source tools
      • Oyster: slow, bad architecture, primitive matching
      • SERF: slow, bad architecture
  • I've later found more, but was not impressed
  • So, it seems we still have to do it ourselves
Finds in the research literature
  • General problem is well-understood
      • "naïve Bayes" is naïve
      • lots of interesting work on value comparisons
      • performance problem 'solved' with "blocking"
          • build a key from parts of the data
          • sort records by key
          • compare each record with m nearest neighbours
          • performance goes from O(n²) to O(n·m)
      • parallel processing widely used
  • Swoosh paper
      • compare and merge should have ICAR¹ properties
      • optimal algorithms for general merge found
      • run-time for 14,000 records ~1.5 hours...
  ¹ Idempotence, commutativity, associativity, representativity
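The blocking steps listed above (build a key, sort, compare with m neighbours) can be sketched as follows. The key definition, field names, and sample records are hypothetical; real blocking keys are chosen per data set.

```python
def blocking_key(record):
    """Hypothetical key: first 3 letters of the name plus first 2 of the zip."""
    return record["name"][:3].lower() + record["zip"][:2]


def candidate_pairs(records, m):
    """Sort by blocking key, then pair each record with its next m neighbours.

    This yields at most n * m pairs instead of n * (n - 1) / 2.
    """
    ordered = sorted(records, key=blocking_key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1:i + 1 + m]:
            pairs.append((rec["id"], other["id"]))
    return pairs


# Made-up sample data.
records = [
    {"id": 1, "name": "Acme AS", "zip": "0150"},
    {"id": 2, "name": "ACME Norge", "zip": "0151"},
    {"id": 3, "name": "Bouvet", "zip": "0250"},
    {"id": 4, "name": "Zeta", "zip": "9000"},
]
print(candidate_pairs(records, m=1))
```

The trade-off is that duplicates whose keys sort far apart (for example, because of a typo in the first letters of the name) are never compared.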
Good research papers
  • Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
    http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
  • Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
  • Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf
DUplicate KillEr Duke
Java deduplication engine
  • Work in progress
      • so far spent only ~20 hours on it
      • only command-line batch client built so far
  • Based on Lucene 3.1
  • Open source (on Google Code)
      • http://code.google.com/p/duke/
  • Blazingly fast
      • 960,000 records in 11 minutes on this laptop
Architecture
  • Diagram: data in, equivalences out
  • Components: SDshare client, SDshare server, RDF frontend, Datastore API, Duke engine, Lucene, H2 database
Architecture #2
  • Diagram: data in, link file out
  • Components: command-line client, CSV frontend, Datastore API, Duke engine, Lucene
  • More frontends: SPARQL, RDF file, ...

Architecture #3
  • Diagram: data in, equivalences out
  • Components: REST interface, X frontend, Datastore API, Duke engine, Lucene, H2 database
Weaknesses
  • Tied to naïve Bayes model
      • research shows more sophisticated models perform better
      • non-trivial to reconcile these with index lookup
  • Value comparison sophistication limited
      • Lucene does support Levenshtein queries (these are slow, though; will be fast in 4.x)
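For reference, Levenshtein distance itself is simple to compute outside the index. The sketch below is a plain dynamic-programming version plus a hypothetical length-normalized similarity; it does not reflect how Lucene or Duke implement it.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical normalization: 1.0 for identical strings, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("acme norge", "acme norway"))
```

A score like this could feed a per-field probability instead of an exact-match test, at the cost of one distance computation per candidate pair.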