Phonetic Matching with Apache Solr
Markus Günther
Freelance Software Engineer / Architect
mail@mguenther.net | mguenther.net | @markus_guenther
Phonetic matching is concerned with searching for spelling variations in large databases.
Age-old problem
Algorithmic solutions date back to the pre-computer era
Soundex was invented by Russell and Odell in 1912
Compute a phonetic value for a given name
Names that sound the same share the same phonetic value
Variation: American Soundex
Variation: Daitch-Mokotoff (DM) Soundex
Soundex tends to generate many false hits which lowers precision
© 2022 Markus Günther IT-Beratung
American Soundex is a reasonably simple algorithm.
Rules
1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.
2. Replace consonants with digits as suggested by the mapping table.
3. Retain only the first letter for two or more adjacent letters mapped to the same number.
4. Retain only the first letter for two letters mapped to the same number that are separated by h, w, or y.
5. Trim the encoded numbers to a total of three. Pad with 0 if there are fewer than three.
American Soundex is a reasonably simple algorithm.
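public class AmericanSoundex {

    // Soundex digit for each letter A..Z; '0' marks letters that are not encoded.
    private static final String MAPPING = "01230120022455012623010202";

    public static String encode(final String term) {
        final String name = term.toUpperCase();
        // Start with the first letter, padded with zeros.
        final char[] code = { name.charAt(0), '0', '0', '0' };
        char previousDigit = digitOf(code[0]);
        int count = 1;
        for (int i = 1; i < name.length() && count < code.length; i++) {
            final char ch = name.charAt(i);
            // H, W and Y do not separate letters that share the same digit.
            if (ch == 'H' || ch == 'W' || ch == 'Y') continue;
            final char digit = digitOf(ch);
            if (digit != '0' && digit != previousDigit) {
                code[count++] = digit;
            }
            previousDigit = digit;
        }
        return String.valueOf(code);
    }

    // Looks up the digit of a single letter; anything outside A..Z maps to '0'.
    private static char digitOf(final char ch) {
        return (ch >= 'A' && ch <= 'Z') ? MAPPING.charAt(ch - 'A') : '0';
    }
}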
Let's take a look at a couple of examples.
Name Phonetic value
Robert R163
Rupert R163
Rubin R150
Ashcraft A261
Ashcroft A261
American Soundex is not optimized for Eastern European names.
Name Phonetic value
Schwarzenegger S625
Shwarzenegger S625
Schwartsenegger S632
A search application would not be able to find a match with that misspelling.
Daitch-Mokotoff Soundex has a solution for this.
Name Phonetic values
Schwarzenegger 474659, 479465
Shwarzenegger 474659, 479465
Schwartsenegger 479465
Given a pair of names, we have a phonetic match if at least one of their codes matches.
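The matching rule for names with several codes can be sketched in a few lines of Java. This is only an illustration: the class and method names are made up, and the codes are copied from the table above rather than computed by a Daitch-Mokotoff implementation.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch: two names match phonetically if their code sets overlap.
class DmSoundexMatch {

    // True if both names share at least one Daitch-Mokotoff code.
    static boolean matches(Set<String> codesA, Set<String> codesB) {
        return !Collections.disjoint(codesA, codesB);
    }

    // Small helper to build a code set from literals.
    static Set<String> codes(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    public static void main(String[] args) {
        // Codes taken from the table above, not computed here.
        Set<String> schwarzenegger = codes("474659", "479465");
        Set<String> schwartsenegger = codes("479465");
        System.out.println(matches(schwarzenegger, schwartsenegger)); // prints "true"
    }
}
```

Because the comparison is a plain set intersection, a search index can store every code of a name as a separate term and still find the match.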
Soundex focuses on the anlaut (the initial sound), which leads to false positives for short names.
Phonetic value Names
S300 Scott, Seth, Sadie, Satoya, ...
C500 Connie, Cheyenne, Conway, ...
T200 Tasha, Tessa, Tekia, ...
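To see such collisions on your own data, names can be grouped by their Soundex code. The following is a minimal sketch: it bundles a compact American Soundex that follows the rules listed earlier, and the class and method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch: group names by Soundex code to spot collisions.
class SoundexCollisions {

    // Soundex digit for each letter A..Z; '0' marks letters that are not encoded.
    private static final String MAPPING = "01230120022455012623010202";

    static char digitOf(char ch) {
        return (ch >= 'A' && ch <= 'Z') ? MAPPING.charAt(ch - 'A') : '0';
    }

    // Compact American Soundex: first letter plus three digits.
    static String soundex(String term) {
        String name = term.toUpperCase();
        char[] code = { name.charAt(0), '0', '0', '0' };
        char previous = digitOf(code[0]);
        int count = 1;
        for (int i = 1; i < name.length() && count < code.length; i++) {
            char ch = name.charAt(i);
            if (ch == 'H' || ch == 'W' || ch == 'Y') continue;
            char digit = digitOf(ch);
            if (digit != '0' && digit != previous) code[count++] = digit;
            previous = digit;
        }
        return String.valueOf(code);
    }

    // Large groups hint at codes that attract many unrelated names.
    static Map<String, List<String>> groupByCode(String... names) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : names) {
            groups.computeIfAbsent(soundex(name), k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    public static void main(String[] args) {
        // prints {S300=[Scott, Seth, Sadie], T200=[Tasha, Tessa]}
        System.out.println(groupByCode("Scott", "Seth", "Sadie", "Tasha", "Tessa"));
    }
}
```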
This isn't always the case, though.
Phonetic value Names
M622 Marcus, Marcos, Marques, Markus, Marquice, Marquisa, ...
F652 Frank, Francisco, Francis, Franklin, Francois, ...
C150 Chevonne, Chavon, Chavonne, Chivon, Cobin, ...
Beider-Morse Phonetic Matching
Instead of focusing on spelling, Beider-Morse factors in linguistic properties of a language.
Of limited interest for common nouns, adjectives, adverbs and verbs
Good strategy for proper nouns (i.e., names)
History: Started off primarily for matching surnames of Ashkenazic Jews
Example: Consider variations of Schwarz (standard German spelling)
Schwartz (alternate German spelling)
Shwartz, Shvartz, Shvarts (Anglicized spelling)
Szwarc (Polish), Szwartz (blended German-Polish)
Svarc (Hungarian), Chvartz (blended French-German)
Step 1: Identifying the language
BMPM includes about 200 rules for determining the language
Some are general, some need context
Examples | Inferred language(s)
tsch, final mann or witz | German
final and initial cs or zs | Hungarian
cz, cy, initial rz or wl, ... | Polish
ö and ü | German, Hungarian
The language can also be specified explicitly
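A toy version of this step, assuming nothing more than the handful of patterns in the table above, might look as follows. The class name and the way hints are combined (a simple union) are assumptions of this sketch; the real BMPM rule set is far larger and more nuanced.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Toy language guesser based on a few spelling patterns.
class LanguageGuessSketch {

    // Returns the languages hinted at by the name; empty if no toy rule fired.
    static Set<String> guess(String name) {
        String n = name.toLowerCase(Locale.ROOT);
        Set<String> languages = new LinkedHashSet<>();
        if (n.contains("tsch") || n.endsWith("mann") || n.endsWith("witz")) {
            languages.add("german");
        }
        if (n.startsWith("cs") || n.startsWith("zs") || n.endsWith("cs") || n.endsWith("zs")) {
            languages.add("hungarian");
        }
        if (n.contains("cz") || n.contains("cy") || n.startsWith("rz") || n.startsWith("wl")) {
            languages.add("polish");
        }
        // o-umlaut or u-umlaut, written as Unicode escapes
        if (n.contains("\u00f6") || n.contains("\u00fc")) {
            languages.add("german");
            languages.add("hungarian");
        }
        return languages;
    }

    public static void main(String[] args) {
        System.out.println(guess("Mustermann")); // prints [german]
        System.out.println(guess("Kowalczyk"));  // prints [polish]
    }
}
```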
Step 2: Calculating the exact phonetic value
Forms of surnames used by women differ in some languages
Affects Slavic languages (e.g. Polish, Russian) as well as Lithuanian and Latvian
Masculine endings Feminine endings
Suchy Sucha
Novikov Novikova
BMPM replaces feminine endings with masculine ones
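The normalization of endings can be pictured as a small suffix table. The two suffix pairs below are derived only from the examples on this slide and are purely illustrative; they are not the actual BMPM rules, and the class name is made up.

```java
// Sketch: map feminine surname endings to masculine ones.
class FeminineEndingsSketch {

    // Illustrative suffix pairs only, derived from the two examples above.
    private static final String[][] RULES = {
        { "ova", "ov" },  // Novikova -> Novikov
        { "cha", "chy" }, // Sucha -> Suchy
    };

    // Applies the first matching suffix rule; other names are returned unchanged.
    static String toMasculine(String surname) {
        for (String[] rule : RULES) {
            if (surname.endsWith(rule[0])) {
                return surname.substring(0, surname.length() - rule[0].length()) + rule[1];
            }
        }
        return surname;
    }

    public static void main(String[] args) {
        System.out.println(toMasculine("Novikova")); // prints Novikov
        System.out.println(toMasculine("Sucha"));    // prints Suchy
    }
}
```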
Step 2: Calculating the exact phonetic value
1. Replace feminine endings with masculine ones.
2. Identify the exact phonetic value of all letters.
   1. Transcribe letters into a phonetic alphabet.
      Applies language-specific rule set in case of one possible language.
      Applies generic rule set in case of multiple possible languages.
   2. Apply phonetic rules that are common to many languages.
      e.g. final devoicing, regressive assimilation
3. At the end, the algorithm yields the exact phonetic value.
Step 2: What do language-specific rules look like?
BMPM applies roughly 80 mapping rules for German
sch maps to S
s at the start and s between two vowels map to z
w maps to v
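The three German rules above can be tried out with a small transcription sketch. It implements only these three rules (out of roughly 80), treats a, e, i, o, u as vowels, and uses made-up class and method names, so its output is not a real BMPM phonetic value.

```java
import java.util.Locale;

// Sketch: apply three German mapping rules from left to right.
class GermanRulesSketch {

    static boolean isVowel(char ch) {
        return "aeiou".indexOf(ch) >= 0;
    }

    static String transcribe(String name) {
        String s = name.toLowerCase(Locale.ROOT);
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            // Rule: sch maps to S (checked first so the s is not handled alone).
            if (s.startsWith("sch", i)) {
                out.append('S');
                i += 3;
                continue;
            }
            char ch = s.charAt(i);
            if (ch == 's') {
                // Rule: s at the start or between two vowels maps to z.
                boolean atStart = i == 0;
                boolean betweenVowels = i > 0 && i + 1 < s.length()
                        && isVowel(s.charAt(i - 1)) && isVowel(s.charAt(i + 1));
                out.append(atStart || betweenVowels ? 'z' : 's');
            } else if (ch == 'w') {
                // Rule: w maps to v.
                out.append('v');
            } else {
                out.append(ch);
            }
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(transcribe("Schwarz")); // prints Svarz
        System.out.println(transcribe("Rose"));    // prints roze
    }
}
```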
Step 2: What do language-agnostic rules look like?
BMPM uses more than 300 generic rules
a final tz maps to ts
Some generic rules might be applicable to specific languages only
step 1 rules out certain languages
rule is applied if it complies with the remaining possible languages
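One way to picture a generic rule that is restricted to certain languages is a rule object that carries its own language set. This is a sketch under assumptions: the class names are hypothetical, only word endings are handled, and the "polish" restriction in the usage example is invented for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Sketch: a generic rule that may be limited to specific languages.
class GenericRuleSketch {

    static final class EndingRule {
        final String ending;
        final String replacement;
        final Set<String> languages; // empty set means "any language"

        EndingRule(String ending, String replacement, String... languages) {
            this.ending = ending;
            this.replacement = replacement;
            this.languages = new HashSet<>(Arrays.asList(languages));
        }

        // The rule applies if it is unrestricted or shares a language with the
        // candidates that remain after language identification (step 1).
        boolean appliesTo(Set<String> remainingLanguages) {
            return languages.isEmpty() || !Collections.disjoint(languages, remainingLanguages);
        }

        String apply(String word, Set<String> remainingLanguages) {
            if (appliesTo(remainingLanguages) && word.endsWith(ending)) {
                return word.substring(0, word.length() - ending.length()) + replacement;
            }
            return word;
        }
    }

    public static void main(String[] args) {
        Set<String> remaining = new HashSet<>(Arrays.asList("german"));
        EndingRule finalTz = new EndingRule("tz", "ts");               // unrestricted
        EndingRule restricted = new EndingRule("tz", "ts", "polish"); // invented restriction
        System.out.println(finalTz.apply("shvartz", remaining));    // prints shvarts
        System.out.println(restricted.apply("shvartz", remaining)); // prints shvartz
    }
}
```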
Step 3: Calculating the approximate phonetic value
Some sounds can be interchangeable in specific contexts
beginning / end of word
previous / next letter
Language | Example | Sounds alike
Russian | unstressed o is pronounced as a | Mostov, Mastov
German | n before b is close to m | Grinberg, Grimberg
Spanish | phonetic equivalence of n and m | Grinberg, Grimberg
Rules can be language-agnostic or -specific
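As a rough illustration of a context-dependent rule, the sketch below folds n before b into m and then compares the results. It works directly on lowercased spellings instead of real exact phonetic values, and the class and method names are made up.

```java
import java.util.Locale;

// Sketch: one approximate rule, "n before b is close to m".
class ApproxRuleSketch {

    // Rewrites "nb" to "mb" so that both spellings end up identical.
    static String approximate(String value) {
        return value.toLowerCase(Locale.ROOT).replace("nb", "mb");
    }

    static boolean soundsAlike(String a, String b) {
        return approximate(a).equals(approximate(b));
    }

    public static void main(String[] args) {
        System.out.println(soundsAlike("Grinberg", "Grimberg")); // prints true
        System.out.println(soundsAlike("Grinberg", "Grunberg")); // prints false
    }
}
```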
Step 4: Searching for matches
1. BMPM generates the exact and approximate phonetic value for a given name.
2. We have an exact match if two names match on their exact phonetic value.
This might be too aggressive for your use-case.
3. We have an approximate match if two names match on their approximate phonetic value.
Matches done by BMPM are not necessarily commutative.
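The two match levels can be sketched as a comparison of value pairs. This is a simplification: it assumes a single exact and a single approximate value per name, the comparison is symmetric by construction (unlike BMPM matches, which need not be commutative), and the values in the usage example are placeholders rather than real BMPM output.

```java
// Sketch: classify a pair of names as exact, approximate or no match.
class PhoneticMatchSketch {

    enum MatchLevel { EXACT, APPROXIMATE, NONE }

    static final class PhoneticValue {
        final String exact;
        final String approximate;

        PhoneticValue(String exact, String approximate) {
            this.exact = exact;
            this.approximate = approximate;
        }
    }

    // Exact values are compared first; approximate values act as a fallback.
    static MatchLevel compare(PhoneticValue a, PhoneticValue b) {
        if (a.exact.equals(b.exact)) {
            return MatchLevel.EXACT;
        }
        if (a.approximate.equals(b.approximate)) {
            return MatchLevel.APPROXIMATE;
        }
        return MatchLevel.NONE;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        PhoneticValue grinberg = new PhoneticValue("grinberg", "grimberg");
        PhoneticValue grimberg = new PhoneticValue("grimberg", "grimberg");
        System.out.println(compare(grinberg, grimberg)); // prints APPROXIMATE
    }
}
```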
Integration with Apache Solr
Apache Solr supports a variety of phonetic matching algorithms.
Beider-Morse Phonetic Matching
Daitch-Mokotoff Soundex
Double Metaphone
Metaphone
Soundex
Refined Soundex
Caverphone
Cologne Phonetic
NYSIIS
Add a field type that works with the phonetic matching algorithm.
Admissible values for ruleType are: APPROX and EXACT
They correspond to approximate and exact matches, respectively
<fieldType name="phonetic_names" class="solr.TextField"
           positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"></tokenizer>
    <filter class="solr.BeiderMorseFilterFactory"
            nameType="GENERIC"
            ruleType="APPROX"
            concat="true"
            languageSet="auto"></filter>
  </analyzer>
</fieldType>
Add an index field that uses this field type.
You probably already have a name field of sorts for basic name searches.
Use a copyField directive to source name_phonetic from that field.
<field name="name_phonetic"
       type="phonetic_names"
       indexed="true"
       stored="false"
       multiValued="false"></field>
<copyField source="name" dest="name_phonetic"></copyField>
Execute queries against that field.
Query for mustermann
(name_phonetic:mustermann)
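When the query is sent over HTTP, it has to be URL-encoded. The sketch below builds a request URL for Solr's select handler using only the JDK; the base URL and the core name "people" are placeholders for your own installation, and the class name is made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch: build a URL-encoded select request for a phonetic query.
class SolrQueryUrl {

    // baseUrl and core are placeholders, e.g. "http://localhost:8983/solr" and "people".
    static String selectUrl(String baseUrl, String core, String query) {
        try {
            return baseUrl + "/" + core + "/select?q=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is guaranteed to exist on every JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://localhost:8983/solr/people/select?q=%28name_phonetic%3Amustermann%29
        System.out.println(selectUrl("http://localhost:8983/solr", "people", "(name_phonetic:mustermann)"));
    }
}
```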
Evaluation
Let's do a couple of experiments with different parameters for BMPM.
Dataset: Large enterprise naming directory, approx. 340k individual persons
Naive implementation using phonetic matching incl. wildcards and n-gram-backed fields yields approx. 3k results for a popular surname
Queries:
Large result set: q=(name_phonetic:meier)
Small result set: q=(name_phonetic:<some-unique-name>)
Experiment 1: Querying for a popular name
Variant | ruleType | languageSet | Results for q=(name_phonetic:meier)
Naive | - | - | 2997
1 | APPROX | auto | 1279
2 | EXACT | auto | 1228
3 | APPROX | german,english | 1261
4 | EXACT | german,english | 1216
Restricting the language set to the predominant languages of the corpus removes non-intuitive matches
Almost no noticeable difference between APPROX and EXACT with respect to result quality
Few ordering issues: meier almost always ranks before phonetic variations
Experiment 2: Querying for a unique name with spelling variations
Variant | ruleType | languageSet | Correct | Var. 1 | Var. 2
Naive | - | - | 7 | 0 | 30 (non-intuitive)
1 | APPROX | auto | 1 | 5 | 14 (no match, not intuitive)
2 | EXACT | auto | 1 | 0 | 1 (no match)
3 | APPROX | german,english | 7 (top match) | 5 (match) | 25 (no match, intuitive)
4 | EXACT | german,english | 1 | 0 | 1 (no match)
Variant 3: Precision is good, recall could be better (e.g. through one-off corrections)
Adding one-off corrections using Damerau-Levenshtein distance complements BMPM.
Prerequisites
name index field that stores <first name> <middle-initial> <surname>
name index field uses n-grams
Refine the query
Can be applied within phrases as well to allow for displacements
"Mustermann Max" should yield the same results as "Max Mustermann"
(name_phonetic:mustermann) OR (name:mustermann~1)
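To get a feeling for what an edit distance of 1 covers, here is a minimal sketch of the optimal string alignment variant of Damerau-Levenshtein distance. It is meant for experimenting with candidate misspellings, not as a description of how Solr evaluates the ~1 operator internally; the class name is made up.

```java
// Sketch: optimal string alignment distance (restricted Damerau-Levenshtein).
class OneOffDistance {

    // Insertion, deletion, substitution and transposition of two adjacent
    // characters each cost 1.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                // Adjacent transposition, e.g. "er" vs. "re".
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("mustermann", "mustremann")); // prints 1
        System.out.println(distance("mustermann", "musterman"));  // prints 1
    }
}
```

A swapped pair of letters counts as a single edit here, which is why typical typing errors stay within the one-off budget.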
Adding a boost on first name and surname for direct matches.
Influence ordering a bit to always prefer direct matches before phonetic variations.
Prerequisites:
firstname index field that stores <first name> (non-analyzed, lowercased)
surname index field that stores <surname> (non-analyzed, lowercased)
Refine the query
bq=firstname:("mustermann") surname:("mustermann")
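Putting the pieces together, the request parameters for a single search term could be assembled as sketched below. The class and method names are made up, the term is assumed to be a single token without query syntax characters, and defType=edismax is an assumption of this sketch (bq is a parameter of the DisMax family of query parsers).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Sketch: assemble q and bq for a single-term name search.
class NameQueryBuilder {

    static Map<String, String> params(String term) {
        String t = term.toLowerCase(Locale.ROOT);
        Map<String, String> params = new LinkedHashMap<>();
        // Assumption: the extended DisMax parser is used so that bq is honored.
        params.put("defType", "edismax");
        // Phonetic match OR one-off correction on the plain name field.
        params.put("q", "(name_phonetic:" + t + ") OR (name:" + t + "~1)");
        // Boost direct matches on first name or surname.
        params.put("bq", "firstname:(\"" + t + "\") surname:(\"" + t + "\")");
        return params;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : params("Mustermann").entrySet()) {
            System.out.println(e.getKey() + "=" + e.getValue());
        }
    }
}
```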
Tuning BMPM using additional mechanisms yields well-grounded phonetic matches.
What have we done?
Test the effect of BMPM parameterizations on your dataset
Add one-off-corrections to mitigate spelling mistakes that phonetics won't catch
Allow for displacement of max. two terms within a phrase
Boost on first name and surname separately to influence relevance sorting
Tuning BMPM using additional mechanisms yields well-grounded phonetic matches.
Achievements
Good trade-off between precision and recall
usually top match on search for unique names
Result sets are explainable
Relevance ordering feels natural
direct matches, phonetic variations, one-off corrections
Questions?