3. The problem
- [screenshot: the suppliers table]
- Real-world data is very, very messy
4. The problem – take 2
- [diagram: Suppliers, Customers (×3), and Companies tables spread across the CRM, Billing, and ERP systems]
- Each of these has internal duplicates, plus duplicates across the tables
- No easy fix
5. But ... what about identifiers?
- No, there are no system IDs shared across these tables
- Yes, there are external identifiers
  - organization number for companies
  - personal number for people
- But these are problematic
  - many records don't have them
  - they are inconsistently formatted
  - sometimes they are misspelled
  - some parts of huge organizations share the same org number, but need to be treated as separate
6. First attempt at a solution
- I wrote a simple Python script in ~2 hours
- It does the following:
  - load all records
  - normalize the data (strip extra whitespace, lowercase, remove letters from org codes, ...)
  - use Bayesian inference for matching (see the sketch below)
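Roughly, the matching idea looks like the sketch below (plain Python, not the original script): each field carries a probability that two records are the same given that its values match, and the per-field evidence is combined with Bayes' rule under an independence assumption. The field names and probabilities here are invented for illustration.

```python
import re

def normalize(value):
    """Lowercase and collapse extra whitespace."""
    return re.sub(r'\s+', ' ', value.strip().lower())

def normalize_orgcode(value):
    """Keep only the digits of an organization number."""
    return re.sub(r'[^0-9]', '', value)

# hypothetical per-field probabilities: P(same entity | field values match)
FIELD_PROBABILITY = {'name': 0.8, 'address': 0.7, 'orgcode': 0.95}

def match_probability(rec1, rec2):
    """Combine per-field evidence naive-Bayes style (fields assumed independent)."""
    prob = 0.5  # uninformative prior
    for field, p in FIELD_PROBABILITY.items():
        if rec1.get(field) and rec2.get(field):
            p_field = p if rec1[field] == rec2[field] else 1.0 - p
            # Bayes' rule, one field of evidence at a time
            prob = (prob * p_field) / (prob * p_field + (1 - prob) * (1 - p_field))
    return prob

r1 = {'name': normalize('ACME  Inc'), 'orgcode': normalize_orgcode('NO 987 654 321')}
r2 = {'name': normalize('acme inc'),  'orgcode': normalize_orgcode('987654321')}
print(match_probability(r1, r2))  # high: both fields agree after normalization
```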
9. Problems
- The functions comparing values are still pretty primitive
- Performance is abysmal
  - 90 minutes to process 14,500 records
  - performance is O(n^2)
  - total number of records is ~2.5 million
  - time to process all records: 1 year 10 months
- Now what?
10. An idea
- Well, we don't necessarily need to compare each record with all the others
  - if we have indexes, we can look up the records which have matching values
- Use DBM for the indexes, for example
  - unfortunately, these only allow exact matching
  - but we can break complex values up into tokens, and index those (sketched below)
- Hang on, isn't this rather like a search engine?
- Bing! Let's try Lucene!
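The token-index idea in miniature (plain Python, with made-up records): index each token of a value separately, then use the index to find candidate records sharing at least one token, so most record pairs are never compared at all.

```python
from collections import defaultdict

# hypothetical records: id -> normalized name
records = {
    1: 'acme incorporated oslo',
    2: 'acme inc oslo',
    3: 'nonesuch ltd bergen',
}

index = defaultdict(set)            # token -> ids of records containing it
for rec_id, name in records.items():
    for token in name.split():
        index[token].add(rec_id)

def candidates(name):
    """Return only the records sharing at least one token with 'name'."""
    found = set()
    for token in name.split():
        found |= index[token]
    return found

print(candidates('acme incorporated'))   # {1, 2}: record 3 is never compared
```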
11. Lucene-based prototype
- I whip out Jython and try it
- New script first builds a Lucene index
  - then searches all records against the index (see the sketch below)
- Time to process 14,500 records: 1 minute
- Now we're talking...
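The prototype itself is not shown in the slides; below is a rough Jython-style sketch of the same flow against the Lucene 3.x API: index every record as a Lucene document, then query the index with each record's own values to find candidate duplicates. Field names, the analyzer choice, and the candidate limit are assumptions.

```python
# Jython: assumes the Lucene 3.x jar is on the classpath
from org.apache.lucene.analysis.standard import StandardAnalyzer
from org.apache.lucene.document import Document, Field
from org.apache.lucene.index import IndexWriter, IndexWriterConfig
from org.apache.lucene.queryParser import QueryParser
from org.apache.lucene.search import IndexSearcher
from org.apache.lucene.store import RAMDirectory
from org.apache.lucene.util import Version

records = [{'ID': '1', 'NAME': 'acme incorporated oslo'},
           {'ID': '2', 'NAME': 'acme inc oslo'}]

directory = RAMDirectory()
analyzer = StandardAnalyzer(Version.LUCENE_31)

# step 1: build the index, one Lucene document per record
writer = IndexWriter(directory, IndexWriterConfig(Version.LUCENE_31, analyzer))
for record in records:
    doc = Document()
    doc.add(Field('ID', record['ID'], Field.Store.YES, Field.Index.NOT_ANALYZED))
    doc.add(Field('NAME', record['NAME'], Field.Store.YES, Field.Index.ANALYZED))
    writer.addDocument(doc)
writer.close()

# step 2: search every record against the index; hits are candidate duplicates
searcher = IndexSearcher(directory)
parser = QueryParser(Version.LUCENE_31, 'NAME', analyzer)
for record in records:
    query = parser.parse(QueryParser.escape(record['NAME']))
    for hit in searcher.search(query, 10).scoreDocs:
        candidate = searcher.doc(hit.doc)
        if candidate.get('ID') != record['ID']:
            print('%s may match %s' % (record['ID'], candidate.get('ID')))
```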
13. Prior art
- It turns out people have been doing this before
- They call it
  - entity resolution
  - identity resolution
  - merge/purge
  - deduplication
  - record linkage
  - ...
- This makes Googling for information an absolute nightmare
14. Existing tools
- Several commercial tools
  - they look big and expensive: we skip those
- Stian found some open source tools
  - Oyster: slow, bad architecture, primitive matching
  - SERF: slow, bad architecture
- I've later found more, but was not impressed
- So it seems we still have to do it ourselves
15. Finds in the research literature
- The general problem is well understood
  - "naïve Bayes" is naïve
  - lots of interesting work on value comparisons
- Performance problem 'solved' with "blocking" (sketched below)
  - build a key from parts of the data
  - sort records by key
  - compare each record with its m nearest neighbours
  - performance goes from O(n^2) to O(n m)
- Parallel processing widely used
- Swoosh paper
  - compare and merge should have ICAR properties (idempotence, commutativity, associativity, reflexivity)
  - optimal algorithms for general merge found
  - run-time for 14,000 records ~1.5 hours...
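A sketch of the blocking idea, in the sorted-neighbourhood form described in the Hernandez & Stolfo paper listed on the next slide. The blocking key and the compare function are placeholders, not anyone's actual configuration.

```python
def blocking_key(record):
    # hypothetical key: first four letters of the name plus the postcode prefix
    return (record.get('name', '')[:4] + record.get('postcode', '')[:2]).lower()

def sorted_neighbourhood(records, m, compare):
    """Sort records by blocking key, then compare each record only with its
    m nearest neighbours in sorted order: O(n m) comparisons instead of O(n^2)."""
    ordered = sorted(records, key=blocking_key)
    pairs = []
    for i, record in enumerate(ordered):
        for neighbour in ordered[i + 1 : i + 1 + m]:
            if compare(record, neighbour):
                pairs.append((record, neighbour))
    return pairs
```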
16. Good research papers
- Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
  http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
- Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
- Swoosh: A Generic Approach to Entity Resolution, Benjelloun, Garcia-Molina et al.
  http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf
18. Java deduplication engine
- Work in progress
  - so far spent only ~20 hours on it
  - only a command-line batch client built so far
- Based on Lucene 3.1
- Open source (on Google Code)
  - http://code.google.com/p/duke/
- Blazingly fast
  - 960,000 records in 11 minutes on this laptop
19. Architecture
- [diagram: data in via SDshare client, equivalences out via SDshare server; RDF frontend, Datastore API, Duke engine, Lucene, H2 database]
24. Architecture #3
- [diagram: data in and equivalences out via a REST interface; X frontend, Datastore API, Duke engine, Lucene, H2 database]
25. Weaknesses
- Tied to the naïve Bayes model
  - research shows more sophisticated models perform better
  - non-trivial to reconcile these with index lookup
- Value comparison sophistication is limited
  - Lucene does support Levenshtein queries (example below)
  - these are slow in 3.x, though; they will be fast in 4.x
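For reference, a Levenshtein (fuzzy) lookup against the Lucene 3.x API looks like this, Jython-style. The field name and similarity threshold are arbitrary, and `searcher` is an IndexSearcher like the one in the earlier prototype sketch.

```python
from org.apache.lucene.index import Term
from org.apache.lucene.search import FuzzyQuery

# matches terms in the NAME field that are close to 'akme' in edit distance;
# 0.7 is the minimum similarity threshold used by the 3.x constructor
query = FuzzyQuery(Term('NAME', 'akme'), 0.7)
hits = searcher.search(query, 10)   # searcher built as in the earlier sketch
```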