3. The problem
- [screenshot: the suppliers table]
- Real-world data is very, very messy
4. The problem – take 2
- [diagram: Suppliers, Customers (×3), and Companies tables spread across the CRM, Billing, and ERP systems]
- Each of these has internal duplicates, plus duplicates across the tables
- No easy fix
5. But ... what about identifiers?
- No, there are no system IDs shared across these tables
- Yes, there are external identifiers
  - organization number for companies
  - personal number for people
- But these are problematic
  - many records don't have them
  - they are inconsistently formatted
  - sometimes they are misspelled
  - some parts of huge organizations share the same org number, but need to be treated as separate
6. First attempt at a solution
- I wrote a simple Python script in ~2 hours
- It does the following:
  - load all records
  - normalize the data (strip extra whitespace, lowercase, remove letters from org codes, ...)
  - use Bayesian inference for matching (see the sketch below)
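Roughly, the matching idea looks like the sketch below (plain Python, not the original script): each field carries a probability that two records are the same given that its values match, and the per-field evidence is combined with Bayes' rule under an independence assumption. The field names and probabilities here are invented for illustration.

```python
import re

def normalize(value):
    """Lowercase and collapse extra whitespace."""
    return re.sub(r'\s+', ' ', value.strip().lower())

def normalize_orgcode(value):
    """Keep only the digits of an organization number."""
    return re.sub(r'[^0-9]', '', value)

# hypothetical per-field probabilities: P(same entity | field values match)
FIELD_PROBABILITY = {'name': 0.8, 'address': 0.7, 'orgcode': 0.95}

def match_probability(rec1, rec2):
    """Combine per-field evidence naive-Bayes style (fields assumed independent)."""
    prob = 0.5  # uninformative prior
    for field, p in FIELD_PROBABILITY.items():
        if rec1.get(field) and rec2.get(field):
            p_field = p if rec1[field] == rec2[field] else 1.0 - p
            # Bayes' rule, one field of evidence at a time
            prob = (prob * p_field) / (prob * p_field + (1 - prob) * (1 - p_field))
    return prob

r1 = {'name': normalize('ACME  Inc'), 'orgcode': normalize_orgcode('NO 987 654 321')}
r2 = {'name': normalize('acme inc'),  'orgcode': normalize_orgcode('987654321')}
print(match_probability(r1, r2))  # high: both fields agree after normalization
```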
9. Problems
- The functions comparing values are still pretty primitive
- Performance is abysmal
  - 90 minutes to process 14,500 records
  - performance is O(n^2)
  - total number of records is ~2.5 million
  - time to process all records: 1 year 10 months
- Now what?
10. An idea
- Well, we don't necessarily need to compare each record with all the others
  - if we have indexes, we can look up the records which have matching values
- Use DBM for the indexes, for example
  - unfortunately, these only allow exact matching
  - but we can break complex values up into tokens, and index those (sketched below)
- Hang on, isn't this rather like a search engine?
- Bing! Let's try Lucene!
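The token-index idea in miniature (plain Python, with made-up records): index each token of a value separately, then use the index to find candidate records sharing at least one token, so most record pairs are never compared at all.

```python
from collections import defaultdict

# hypothetical records: id -> normalized name
records = {
    1: 'acme incorporated oslo',
    2: 'acme inc oslo',
    3: 'nonesuch ltd bergen',
}

index = defaultdict(set)            # token -> ids of records containing it
for rec_id, name in records.items():
    for token in name.split():
        index[token].add(rec_id)

def candidates(name):
    """Return only the records sharing at least one token with 'name'."""
    found = set()
    for token in name.split():
        found |= index[token]
    return found

print(candidates('acme incorporated'))   # {1, 2}: record 3 is never compared
```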
11. Lucene-based prototype
- I whip out Jython and try it
- New script first builds a Lucene index
  - then searches all records against the index (see the sketch below)
- Time to process 14,500 records: 1 minute
- Now we're talking...
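The prototype itself is not shown in the slides; below is a rough Jython-style sketch of the same flow against the Lucene 3.x API: index every record as a Lucene document, then query the index with each record's own values to find candidate duplicates. Field names, the analyzer choice, and the candidate limit are assumptions.

```python
# Jython: assumes the Lucene 3.x jar is on the classpath
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.queryParser import QueryParser
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.store import RAMDirectory
from org.apache.lucene.util import Version

records = [{'ID': '1', 'NAME': 'acme incorporated oslo'},
           {'ID': '2', 'NAME': 'acme inc oslo'}]

directory = RAMDirectory()
analyzer = StandardAnalyzer(Version.LUCENE_31)

# step 1: build the index, one Lucene document per record
writer = IndexWriter(directory, IndexWriterConfig(Version.LUCENE_31, analyzer))
for record in records:
    doc = Document()
    doc.add(Field('ID', record['ID'], Field.Store.YES, Field.Index.NOT_ANALYZED))
    doc.add(Field('NAME', record['NAME'], Field.Store.YES, Field.Index.ANALYZED))
    writer.addDocument(doc)
writer.close()

# step 2: search every record against the index; hits are candidate duplicates
searcher = IndexSearcher(directory)
parser = QueryParser(Version.LUCENE_31, 'NAME', analyzer)
for record in records:
    query = parser.parse(QueryParser.escape(record['NAME']))
    for hit in searcher.search(query, 10).scoreDocs:
        candidate = searcher.doc(hit.doc)
        if candidate.get('ID') != record['ID']:
            print('%s may match %s' % (record['ID'], candidate.get('ID')))
```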
13. Prior art
- It turns out people have been doing this before
- They call it
  - entity resolution
  - identity resolution
  - merge/purge
  - deduplication
  - record linkage
  - ...
- This makes Googling for information an absolute nightmare
14. Existing tools
- Several commercial tools
  - they look big and expensive: we skip those
- Stian found some open source tools
  - Oyster: slow, bad architecture, primitive matching
  - SERF: slow, bad architecture
- I've later found more, but was not impressed
- So it seems we still have to do it ourselves
15. Finds in the research literature
- The general problem is well understood
  - "naïve Bayes" is naïve
  - lots of interesting work on value comparisons
- Performance problem 'solved' with "blocking" (sketched below)
  - build a key from parts of the data
  - sort records by key
  - compare each record with its m nearest neighbours
  - performance goes from O(n^2) to O(n m)
- Parallel processing widely used
- Swoosh paper
  - compare and merge should have ICAR properties (idempotence, commutativity, associativity, reflexivity)
  - optimal algorithms for general merge found
  - run-time for 14,000 records ~1.5 hours...
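A sketch of the blocking idea, in the sorted-neighbourhood form described in the Hernandez & Stolfo paper listed on the next slide. The blocking key and the compare function are placeholders, not anyone's actual configuration.

```python
def blocking_key(record):
    # hypothetical key: first four letters of the name plus the postcode prefix
    return (record.get('name', '')[:4] + record.get('postcode', '')[:2]).lower()

def sorted_neighbourhood(records, m, compare):
    """Sort records by blocking key, then compare each record only with its
    m nearest neighbours in sorted order: O(n m) comparisons instead of O(n^2)."""
    ordered = sorted(records, key=blocking_key)
    pairs = []
    for i, record in enumerate(ordered):
        for neighbour in ordered[i + 1 : i + 1 + m]:
            if compare(record, neighbour):
                pairs.append((record, neighbour))
    return pairs
```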
16. Good research papers
- Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
  http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
- Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
- Swoosh: A Generic Approach to Entity Resolution, Benjelloun, Garcia-Molina et al.
  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf
18. Java deduplication engine
- Work in progress
  - so far spent only ~20 hours on it
  - only a command-line batch client built so far
- Based on Lucene 3.1
- Open source (on Google Code)
  - http://code.google.com/p/duke/
- Blazingly fast
  - 960,000 records in 11 minutes on this laptop
19. Architecture
- [diagram: data in via SDshare client, equivalences out via SDshare server; RDF frontend, Datastore API, Duke engine, Lucene, H2 database]
24. Architecture #3
- [diagram: data in and equivalences out via a REST interface; X frontend, Datastore API, Duke engine, Lucene, H2 database]
25. Weaknesses
- Tied to the naïve Bayes model
  - research shows more sophisticated models perform better
  - non-trivial to reconcile these with index lookup
- Value comparison sophistication is limited
  - Lucene does support Levenshtein queries (example below)
  - these are slow in 3.x, though; they will be fast in 4.x
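For reference, a Levenshtein (fuzzy) lookup against the Lucene 3.x API looks like this, Jython-style. The field name and similarity threshold are arbitrary, and `searcher` is an IndexSearcher like the one in the earlier prototype sketch.

```python
from org.apache.lucene.index import Term
from org.apache.lucene.search import FuzzyQuery

# matches terms in the NAME field that are close to 'akme' in edit distance;
# 0.7 is the minimum similarity threshold used by the 3.x constructor
query = FuzzyQuery(Term('NAME', 'akme'), 0.7)
hits = searcher.search(query, 10)   # searcher built as in the earlier sketch
```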