CANTONESE SPELL CHECKING WITH METAPHONES

full study on http://www.corgitoergosum.net/studies/


Tuesday, 14 June 2011
a study to see whether this kind of spelling suggestion is possible:

query:
query:        Do you mean “ ”?
query:




Unlike more formal romanization methods, the “HK Govt Cantonese pinyin”,
commonly used in street names and identification papers in HK,
is based on English pronunciation...




In other words, two similar-sounding Cantonese words should have
Cantonese pinyin spellings that also sound similar in English




                                       “fung hei yue”




query: “ ”

    look up pronunciation table

pronunciation: “luk ding gei”

    search phrase dictionary for similar-sounding words

suggestion: Do you mean “ ”?
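The three steps on this slide can be sketched roughly as below. This is only an illustration under stated assumptions: the pronunciation table entries, the phrase list, and the names `romanize`, `crude_hash` and `suggest` are all hypothetical, and `crude_hash` is a deliberately crude stand-in for Metaphone (it just keeps consonant letters). The sketch matches on an exact hash only; distance-based matching is a separate step.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical pronunciation table (character -> romanization); illustrative only.
PRONUNCIATION: Dict[str, str] = {"陳": "chan", "大": "tai", "太": "tai", "文": "man"}

# Hypothetical phrase dictionary of known-good spellings.
PHRASES: List[str] = ["陳大文"]


def romanize(text: str, table: Dict[str, str]) -> Optional[str]:
    """Look up every character; return None if any character is unknown."""
    syllables = []
    for ch in text:
        if ch not in table:
            return None
        syllables.append(table[ch])
    return " ".join(syllables)


def crude_hash(romanized: str) -> str:
    """Stand-in for Metaphone (assumption): keep consonant letters, upper-cased."""
    return "".join(c for c in romanized.upper() if c.isalpha() and c not in "AEIOU")


def suggest(
    query: str,
    phrases: List[str],
    table: Dict[str, str],
    sound_hash: Callable[[str], str] = crude_hash,
) -> Optional[str]:
    """Return a known phrase whose sound hash equals the query's, if any."""
    if query in phrases:
        return None  # already a known spelling, nothing to suggest
    pronunciation = romanize(query, table)
    if pronunciation is None:
        return None
    key = sound_hash(pronunciation)
    for phrase in phrases:
        candidate = romanize(phrase, table)
        if candidate is not None and sound_hash(candidate) == key:
            return phrase
    return None


# The misspelled query shares its pronunciation with the dictionary entry.
print(suggest("陳太文", PHRASES, PRONUNCIATION) == "陳大文")  # True
```

Passing the hash function as a parameter keeps the lookup logic independent of whichever phonetic algorithm is plugged in.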


pronunciation table

pinyin   char
ban
ban
ban
ban
ban
ban



using the Metaphone algorithm to compute “sound hashes”

(chan yik suen)   XNYKXN
                  XNYKXN
                  XNYKSN
                  XNLKSN
                  XNYXN
                  XNSNK
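The Metaphone implementation used in the study is not shown on the slide. As a rough illustration of how a sound hash drops vowels and merges similar consonants, the sketch below applies a small, hand-picked subset of Metaphone-style rules. The function `simple_sound_hash` and its rule subset are assumptions made for this example; it is not the full Metaphone algorithm, so its output may differ from a real implementation and from the hashes listed above.

```python
VOWELS = set("aeiou")
FRONT_VOWELS = set("eiy")
# Letters that map straight to a different code letter in this reduced subset.
DIRECT = {"d": "T", "q": "K", "z": "S", "v": "F"}


def simple_sound_hash(romanized: str) -> str:
    """Reduced Metaphone-style key (assumption: NOT the full rule set)."""
    word = "".join(ch for ch in romanized.lower() if ch.isalpha())
    codes = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in VOWELS:
            if i == 0:
                codes.append(ch.upper())  # vowels are kept only at the start
        elif ch == nxt and ch != "c":
            pass  # doubled consonant: skip this copy, encode the next one
        elif ch in ("c", "s") and nxt == "h":
            codes.append("X")  # "ch" and "sh" share one code
            i += 1  # consume the "h" as well
        elif ch == "c":
            codes.append("S" if nxt in FRONT_VOWELS else "K")
        elif ch == "g":
            codes.append("J" if nxt in FRONT_VOWELS else "K")
        elif ch in ("h", "w", "y"):
            if nxt in VOWELS:
                codes.append(ch.upper())  # kept only before a vowel
        elif ch == "x":
            codes.append("KS")
        else:
            codes.append(DIRECT.get(ch, ch.upper()))
        i += 1
    return "".join(codes)


print(simple_sound_hash("shun"))  # XN
print(simple_sound_hash("suen"))  # SN
print(simple_sound_hash("ding"))  # TNK
```

Two spellings of a similar-sounding syllable, such as "shun" and "suen", end up with hashes that differ in a single letter, which is what makes an edit-distance comparison on the hashes meaningful.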


finding similar-sounding words: the Levenshtein distance algorithm

hash         distance from XNYKXN
XNYJNK       3
XNYKSN       1
XNLKSN       2
XNY_XN       1
XN_SNK       4
XNYTKWN      2
XMYTSNK      4
YTYMSYPFT    8
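For reference, a minimal sketch of the Levenshtein distance using the standard dynamic-programming recurrence. The function name `levenshtein` is chosen for this example and is not taken from the study. The underscores in the table appear to mark the position of a deleted letter, so they are dropped here (an assumption).

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-letter insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ca == cb else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


reference = "XNYKXN"
for candidate in ["XNYJNK", "XNYKSN", "XNLKSN", "XNYXN", "XNSNK",
                  "XNYTKWN", "XMYTSNK", "YTYMSYPFT"]:
    print(candidate, levenshtein(reference, candidate))
# Prints distances 3, 1, 2, 1, 4, 2, 4, 8, matching the table.
```

Only two rows of the matrix are kept at a time, which is enough because each cell depends only on the current and the previous row.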

a maximum distance of one seems the safest, judging from this limited test set

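To make the effect of the cut-off concrete, the sketch below filters the candidates by a maximum distance. The distances are copied from the table above; the helper name `within_threshold` is hypothetical, and the underscores are dropped from the hashes.

```python
from typing import Dict, List

# Distances from the reference hash XNYKXN, copied from the table above.
DISTANCES: Dict[str, int] = {
    "XNYJNK": 3, "XNYKSN": 1, "XNLKSN": 2, "XNYXN": 1,
    "XNSNK": 4, "XNYTKWN": 2, "XMYTSNK": 4, "YTYMSYPFT": 8,
}


def within_threshold(distances: Dict[str, int], max_distance: int = 1) -> List[str]:
    """Return the hashes at or below the cut-off, closest first."""
    kept = [h for h, d in distances.items() if d <= max_distance]
    return sorted(kept, key=lambda h: (distances[h], h))


print(within_threshold(DISTANCES, 1))  # ['XNYKSN', 'XNYXN']
print(within_threshold(DISTANCES, 2))  # ['XNYKSN', 'XNYXN', 'XNLKSN', 'XNYTKWN']
```

Raising the cut-off from one to two doubles the number of candidates in this small set, which suggests why the stricter value looks safer.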

some test results

                        query             suggestion




                                         no matches


                                         no matches




some test results

                        query             suggestion


                                         no matches




some test results

                        query             suggestion




limitations

no tonal information
HK Govt pinyin is very loose
doesn’t consider similar-shaped characters, like: , , , , ......




improvement areas

speed
only looks at            of first character now
English is ignored
use libyell dict both for segmentation and for checking common words (n-grams)
Simplified Chinese rewrite


feedback?



