PO Department
PEOPLE OPERATION’S
MONTHLY UPDATE
09/2019
1
CPU- and memory-efficient
spellchecker implementation at TIKI
2
Results for “iphone”
3
Results for “ipohne” without spellchecker
4
Results for “ipohne” with spellchecker
5
General approach
words, result = tokenize(query), []
for w in words:
    candidates = generate_candidates(w)
    best_c, best_score = None, 0.0
    for c in candidates:
        # Score how likely c is the intended word for the observed w
        score = spellchecker_score(w, c)
        if score > best_score:
            best_c, best_score = c, score
    # best_c stays None if no candidate scores above zero
    result.append(best_c)
6
Generate candidates
Generate all possible similar words:
- Need to define a measure of similarity - we use Damerau-Levenshtein distance
- It allows insertions, deletions, substitutions and transpositions of symbols
- We limit the maximum allowed distance depending on the length of the word
- Then simply generate all edits of the 4 possible types (CPU-intensive)
- We will optimize this approach later
Examples of Damerau-Levenshtein distance:
- distance(nguyễn, nguyên) = 1 (one substitution)
- distance(nguyễn, nguyeenx) = 3 (one substitution, two insertions)
- distance(behaivour, behaviour) = 1 (one transposition)
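The brute-force candidate generation and the distance measure above can be sketched in Python as follows. This is a minimal illustration, not the implementation used at Tiki: `edits1`, `osa_distance` and the `alphabet` argument are hypothetical names, and the distance is the restricted (optimal string alignment) variant of Damerau-Levenshtein, which yields the same values for the three examples above. Strings are NFC-normalized so that each accented Vietnamese letter counts as a single symbol.

```python
import unicodedata


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def edits1(word, alphabet):
    """All strings reachable from `word` with one edit of the 4 types."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    substitutes = {l + ch + r[1:] for l, r in splits if r for ch in alphabet}
    inserts = {l + ch + r for l, r in splits for ch in alphabet}
    return deletes | transposes | substitutes | inserts


print(osa_distance("nguyễn", "nguyên"))        # 1
print(osa_distance("nguyễn", "nguyeenx"))      # 3
print(osa_distance("behaivour", "behaviour"))  # 1
print("iphone" in edits1("ipohne", "abcdefghijklmnopqrstuvwxyz"))  # True
```

The number of strings returned by `edits1` grows with both the word length and the alphabet size, which is why this approach is CPU-intensive, especially for larger allowed distances.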
7
Spellchecker score
“Noisy channel” model:
- Bayes' rule: P(c|w) = P(w|c) * P(c) / P(w)
- We need to find the candidate c that maximizes P(c|w)
- This simplifies to maximizing P(w|c) * P(c), because P(w) is the same for all candidates
Probabilities used:
- P(c|w) - probability that c was intended given that w was observed
- P(w|c) - probability that the word w is a misspelling of c - error model
- P(c) - probability of observing c - language model
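A minimal sketch of the noisy-channel scoring, assuming the error model and the language model are available as plain dictionaries. The function name `best_candidate` and all probability values below are made up for illustration and are not taken from the talk.

```python
def best_candidate(observed, candidates, error_model, language_model):
    """Pick the candidate c that maximizes P(w|c) * P(c).

    error_model maps (observed, candidate) -> P(w|c);
    language_model maps candidate -> P(c).
    P(w) is the same for every candidate, so it is left out.
    """
    best, best_score = None, 0.0
    for c in candidates:
        score = error_model.get((observed, c), 0.0) * language_model.get(c, 0.0)
        if score > best_score:
            best, best_score = c, score
    return best, best_score


# Hypothetical probabilities, for illustration only.
error_model = {("ipohne", "iphone"): 0.8, ("ipohne", "phone"): 0.1}
language_model = {"iphone": 0.02, "phone": 0.05}

print(best_candidate("ipohne", ["iphone", "phone"], error_model, language_model))
```

Here "phone" is the more frequent word, but "iphone" still wins because the error model considers it a much more likely source of "ipohne".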
8
Building the language model
N-gram model:
- Building a 2-gram dictionary
- Remove 2-grams below a certain threshold
Used data:
- All product contents on Tiki
- All Tiki search queries for a year
- Some randomly crawled texts from the Vietnamese Web
- Total: 5.5 GB gzipped
9
Building the language model (example)
Data (queries on Tiki):
máy rửa mặt
máy rửa mắt
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy xay sinh tố
máy sấy tóc
...
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy rửa mắt
máy xay sinh tố
máy sấy tóc
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
10
Building the language model (example)
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
Language model:
410 <
410 >
410 máy
410 < máy
205 máy rửa
100 máy sấy
105 máy xay
105 tóc >
100 sấy tóc
5 xay tóc
105 tóc
...
We simply count all single words and word pairs in our counted
queries data and write them into the language model (“<” and “>”
mark the start and end of a query).
This lets us calculate the probability of observing a word
without context, or with a context of one word before or
after it.
11
Building the language model (example)
Language model:
410 <
410 >
410 máy
410 < máy
205 máy rửa
100 máy sấy
105 máy xay
105 tóc >
100 sấy tóc
5 xay tóc
105 tóc
...
Query: máy => “< máy >”
P(máy) = 0.5 * (P(< máy) + P(máy >))
= 0.5 * (410/410+0/410) = 0.5
Query: máy xay tóc
P(xay) = 0.5 * (P(máy xay) + P(xay tóc))
= 0.5 * (105/410+5/105) ~ 0.15
P(sấy) = 0.5 * (P(máy sấy) + P(sấy tóc))
= 0.5 * (100/410+100/105) ~ 0.60
The language model here suggests that the
probability of seeing “sấy” in this context is
higher than the probability of seeing “xay”.
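The counting and the context probability from this example can be reproduced with a short Python sketch. The function names are hypothetical, and the threshold that removes rare 2-grams is omitted for brevity; the counted queries are the ones from the slides.

```python
from collections import Counter

counted_queries = {
    "máy rửa mặt": 200,
    "máy rửa mắt": 5,
    "máy sấy tóc": 100,
    "máy xay tóc": 5,
    "máy xay sinh tố": 100,
}


def build_language_model(queries):
    """Count single words and word pairs; "<" and ">" mark query boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for query, count in queries.items():
        tokens = ["<"] + query.split() + [">"]
        for token in tokens:
            unigrams[token] += count
        for pair in zip(tokens, tokens[1:]):
            bigrams[pair] += count
    return unigrams, bigrams


def context_probability(word, prev, nxt, unigrams, bigrams):
    """Average of the probabilities given the previous and the next word."""
    left = bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0
    right = bigrams[(word, nxt)] / unigrams[nxt] if unigrams[nxt] else 0.0
    return 0.5 * (left + right)


unigrams, bigrams = build_language_model(counted_queries)
print(context_probability("máy", "<", ">", unigrams, bigrams))      # 0.5
print(context_probability("xay", "máy", "tóc", unigrams, bigrams))  # ~0.15
print(context_probability("sấy", "máy", "tóc", unigrams, bigrams))  # ~0.60
```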
12
Building the error model
Automatic extraction of P(w|c):
- Extract triplets (w1, w2, w3) from our text set
- Group triplets by (w1, *, w3) and sort by descending popularity
- Remove groupings below a certain threshold
- Remove samples where the w2 words are too far from each other (by Damerau-Levenshtein distance)
- Remove samples whose popularity is comparable to that of the most popular sample in the grouping
- Write the w2 words from all remaining samples into the error model mapping as triplets of (observed word, intended word, count)
Data used:
- Same as for the language model
13
Building the error model (example)
Data (queries on Tiki):
máy rửa mặt
máy rửa mắt
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy xay sinh tố
máy sấy tóc
...
máy sấy tóc
máy xay tóc
máy xay sinh tố
máy rửa mắt
máy rửa mắt
máy xay sinh tố
máy sấy tóc
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
14
Building the error model (example)
Counted queries:
200 máy rửa mặt
5 máy rửa mắt
100 máy sấy tóc
5 máy xay tóc
100 máy xay sinh tố
Triplets:
205 < máy rửa
200 rửa mặt >
5 rửa mắt >
100 máy sấy tóc
5 máy xay tóc
200 máy rửa mặt
5 máy rửa mắt
105 < máy xay
100 sinh tố >
...
We count all possible triplets from our counted
queries data.
15
Building the error model (example)
Triplets (grouped):
rửa * >
200 rửa mặt >
5 rửa mắt >
máy * tóc
100 máy sấy tóc
5 máy xay tóc
máy * sinh
100 máy xay sinh
sinh * >
100 sinh tố >
...
Error model:
200 mặt mặt
5 mắt mặt
100 sấy sấy
5 xay sấy
100 xay xay
100 tố tố
...
Format:
count
observed_word
intended_word
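The extraction steps can be sketched in Python as below. This is an illustration under assumptions, not the production code: the thresholds `min_group_count`, `max_distance` and `max_ratio` are hypothetical values, the distance is the restricted (optimal string alignment) variant of Damerau-Levenshtein, and mapping the most popular word of each grouping to itself is inferred from the example output above.

```python
import unicodedata
from collections import Counter, defaultdict


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein distance on NFC-normalized strings."""
    a, b = unicodedata.normalize("NFC", a), unicodedata.normalize("NFC", b)
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def build_error_model(queries, min_group_count=10, max_distance=2, max_ratio=0.5):
    """Return a Counter mapping (observed, intended) -> count."""
    # Count (w1, w2, w3) triplets, with "<" and ">" as query boundaries.
    triplets = Counter()
    for query, count in queries.items():
        tokens = ["<"] + query.split() + [">"]
        for triplet in zip(tokens, tokens[1:], tokens[2:]):
            triplets[triplet] += count
    # Group by (w1, *, w3).
    groups = defaultdict(Counter)
    for (w1, w2, w3), count in triplets.items():
        groups[(w1, w3)][w2] += count
    error_model = Counter()
    for middles in groups.values():
        if sum(middles.values()) < min_group_count:
            continue  # grouping is too rare to be trusted
        (intended, top_count), *rest = middles.most_common()
        error_model[(intended, intended)] += top_count
        for observed, count in rest:
            if osa_distance(observed, intended) > max_distance:
                continue  # too different to be a misspelling
            if count > max_ratio * top_count:
                continue  # popularity comparable to the top word
            error_model[(observed, intended)] += count
    return error_model


def error_probability(observed, intended, error_model):
    """Share of `observed` occurrences in the error model mapped to `intended`."""
    total = sum(c for (o, _), c in error_model.items() if o == observed)
    return error_model[(observed, intended)] / total if total else 0.0


counted_queries = {"máy rửa mặt": 200, "máy rửa mắt": 5, "máy sấy tóc": 100,
                   "máy xay tóc": 5, "máy xay sinh tố": 100}
model = build_error_model(counted_queries)
print(model[("mắt", "mặt")], model[("xay", "sấy")], model[("xay", "xay")])  # 5 5 100
print(error_probability("mắt", "mặt", model))  # 1.0
```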
16
Building the error model (example)
Query: kem rửa mắt
P(mắt|mắt) = 0/5 = 0.0 - the number of times “mắt” was
intended when “mắt” was observed in the error model,
divided by the total number of times “mắt” was
observed in the error model.
P(mắt|mặt) = 5/5 = 1.0 - likewise, the number of times
“mặt” was intended when “mắt” was observed in the
error model, divided by the total number of times
“mắt” was observed in the error model.
This means that, according to the error model built
on our data, “mắt” is extremely likely to be
a misspelling of “mặt”.
Error model:
200 mặt mặt
5 mắt mặt
100 sấy sấy
5 xay sấy
100 xay xay
100 tố tố
...
Format:
count
observed_word
intended_word
17
Quality optimizations
Idea:
- The language model matters more when more context is available
- Instead of P(w|c)*P(c), use P(w|c)*pow(P(c),lambda)
- Lambda depends on the length of the available context
Results:
- Using a bigger lambda for longer context => better test result (idea works!)
- For larger N-grams, machine learning is needed to optimize the lambdas
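A small sketch of the weighted score, assuming lambda is looked up by the number of context words available. The function name, the lambda values and the probabilities are hypothetical and chosen only to show how the ranking can change.

```python
def weighted_score(p_error, p_language, context_length, lambdas=(0.5, 1.0, 2.0)):
    """Compute P(w|c) * P(c) ** lambda, with lambda chosen by context length.

    The last lambda is reused for any longer context.
    """
    lam = lambdas[min(context_length, len(lambdas) - 1)]
    return p_error * p_language ** lam


# Candidate A: favored by the error model; candidate B: favored by the language model.
a = {"p_error": 0.9, "p_language": 0.1}
b = {"p_error": 0.3, "p_language": 0.6}

for context_length in (0, 2):
    score_a = weighted_score(a["p_error"], a["p_language"], context_length)
    score_b = weighted_score(b["p_error"], b["p_language"], context_length)
    print(context_length, "A" if score_a > score_b else "B")
```

With no context the error model dominates and A wins; with two context words the larger lambda gives the language model more weight and B wins.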
18
Performance optimizations
Important fact:
It can be shown that if the Damerau-Levenshtein distance between w and c is N, then
w and c can both be reduced to the same string by deleting no more than N single
characters from each of them. Examples below:
distance(iphone, iphobee) = 2 (one insertion, one substitution)
iphone -> iphoe VS iphobee -> iphoee -> iphoe (match!)
distance(iphone, pihoone) = 2 (one transposition, one insertion)
iphone -> ihone VS pihoone -> ihoone -> ihone (match!)
Let’s use this to optimize candidate generation!
19
Performance optimizations
Problem 1 - generating candidates is CPU-intensive:
- Precompute a “deletes” dictionary
- Use only delete operations on both sides (the query word and the dictionary words)
- Need to double-check the distance (a match can be up to 2N apart, but we need at most N)
- Fast, but requires RAM
Problem 2 - the “deletes” dictionary requires a lot of RAM:
- Use different data compression techniques
- Of those we tried, Judy dynamic arrays worked best
- We decreased RAM requirements from 10.5 GB to 2.3 GB
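The “deletes” dictionary idea can be sketched in Python as follows. This is a simplified in-memory illustration with hypothetical function names and a toy dictionary; it uses a plain dict instead of Judy arrays, and the restricted (optimal string alignment) variant of Damerau-Levenshtein for the final distance check. Words are assumed to be NFC-normalized already.

```python
import unicodedata
from collections import defaultdict
from itertools import combinations


def osa_distance(a, b):
    """Restricted Damerau-Levenshtein distance on NFC-normalized strings."""
    a, b = unicodedata.normalize("NFC", a), unicodedata.normalize("NFC", b)
    d = [[max(i, j) if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def deletes(word, max_deletes):
    """The word itself plus all strings with up to max_deletes characters removed."""
    variants = {word}
    for k in range(1, min(max_deletes, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - k):
            variants.add("".join(word[i] for i in keep))
    return variants


def build_index(dictionary, max_distance):
    """Precompute: delete-variant -> set of dictionary words producing it."""
    index = defaultdict(set)
    for word in dictionary:
        for variant in deletes(word, max_distance):
            index[variant].add(word)
    return index


def lookup(query, index, max_distance):
    """Return {dictionary word: distance} for words within max_distance."""
    found = {}
    for variant in deletes(query, max_distance):
        for word in index.get(variant, ()):
            if word not in found:
                # A shared variant only guarantees a distance of up to
                # 2 * max_distance, so the real distance is verified here.
                distance = osa_distance(query, word)
                if distance <= max_distance:
                    found[word] = distance
    return found


index = build_index(["iphone", "ipad", "phone"], max_distance=2)
print(lookup("iphobee", index, 2))  # {'iphone': 2}
```

Only deletions are generated at query time, so the number of variants no longer depends on the alphabet size; the cost moves into the precomputed index, which is what drives the RAM usage.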
20
Testing results
Testing set:
- 5,000 random queries, 10,000 misspelled queries
- Suggestions collected through Google API and then manually checked
- Only one marker per query
Results:
- Slightly (10-12%) worse than Google (acceptable for such RAM requirements)
- An A/B test showed a 3-9% increase in purchases
21
Future plans
Implementation:
- Use 3-gram data (still trying to keep it RAM-optimal)
Testing:
- Use multi-marker test set
- Properly handle cases when spellchecker returns multiple variants
Thank you!
22

High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingrknatarajan
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college projectTonystark477637
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...roncy bisnoi
 

Recently uploaded (20)

CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANVI) Koregaon Park Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur EscortsRussian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
Russian Call Girls in Nagpur Grishma Call 7001035870 Meet With Nagpur Escorts
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
The Most Attractive Pune Call Girls Budhwar Peth 8250192130 Will You Miss Thi...
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
Introduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptxIntroduction to IEEE STANDARDS and its different types.pptx
Introduction to IEEE STANDARDS and its different types.pptx
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and workingUNIT-V FMM.HYDRAULIC TURBINE - Construction and working
UNIT-V FMM.HYDRAULIC TURBINE - Construction and working
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 

Grokking TechTalk #35: Efficient spellchecking

  • 1. PO Department PEOPLE OPERATION’S MONTHLY UPDATE 09/2019
    CPU and memory efficient spellchecker implementation in TIKI
  • 3. Results for “ipohne” without spellchecker
  • 4. Results for “ipohne” with spellchecker
  • 5. General approach
    words, result = (tokenize(query), [])
    for w in words:
        candidates = generate_candidates(w)
        # Assumed fallback: keep the input word if no candidate scores above zero
        best_c, best_score = (w, 0.)
        for c in candidates:
            score = spellchecker_score(w, c)
            if score > best_score:
                best_c, best_score = (c, score)
        result.append(best_c)
  • 6. Generate candidates
    Generate all possible similar words:
    - We need a measure of similarity; we use Damerau-Levenshtein distance
    - It allows insertions, deletions, substitutions and transpositions of symbols
    - We limit the maximum allowed distance depending on the length of the word
    - Then we simply generate all edits of the 4 possible types (CPU-intensive)
    - We will optimize this approach later
    Examples of Damerau-Levenshtein distance:
    - distance(nguyễn, nguyên) = 1 (one substitution)
    - distance(nguyễn, nguyeenx) = 3 (one substitution, two insertions)
    - distance(behaivour, behaviour) = 1 (one transposition)
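The distance and the brute-force edit generation described on this slide can be sketched as follows. This is a minimal illustration, not the talk's implementation: it assumes the "optimal string alignment" variant of Damerau-Levenshtein (the slides do not say which variant is used), and the names `dl_distance`, `edits1` and the `alphabet` parameter are hypothetical.

```python
import unicodedata


def dl_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant, an assumption)."""
    # Normalise so that accented Vietnamese letters are single code points.
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    prev2, prev = [], list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)  # transposition
        prev2, prev = prev, cur
    return prev[len(b)]


def edits1(word: str, alphabet: str) -> set:
    """All strings one edit away from `word`, using the 4 edit types."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    substitutes = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    inserts = {l + c + r for l, r in splits for c in alphabet}
    return (deletes | transposes | substitutes | inserts) - {word}


print(dl_distance("behaivour", "behaviour"))  # 1
print("behaviour" in edits1("behaivour", "abcdefghijklmnopqrstuvwxyz"))  # True
```

The size of `edits1` grows with both the word length and the alphabet size, which is why this direct generation is expensive for a large alphabet such as Vietnamese.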
  • 7. Spellchecker score
    “Noisy channel” model:
    - Bayes' formula: P(c|w) = P(w|c) * P(c) / P(w)
    - We need to find the candidate c which maximizes P(c|w)
    - This simplifies to maximizing P(w|c) * P(c), because P(w) is constant for all candidates
    Probabilities used:
    - P(c|w): probability that c was intended when w was observed
    - P(w|c): probability that the word w is a misspelling of c (error model)
    - P(c): probability of observing c (language model)
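A minimal sketch of the noisy-channel selection described on this slide. The probability tables are made-up numbers chosen only for illustration, and `best_candidate`, `P_ERROR` and `P_LANG` are hypothetical names rather than anything from the talk.

```python
def best_candidate(observed, candidates, p_error, p_lang):
    """Return the candidate c maximizing P(w|c) * P(c); P(w) is dropped as a constant."""
    def score(c):
        return p_error.get((observed, c), 0.0) * p_lang.get(c, 0.0)
    return max(candidates, key=score)


# Illustrative values only (not measured from any data).
P_ERROR = {("ipohne", "iphone"): 0.9, ("ipohne", "phone"): 0.1}  # P(w|c)
P_LANG = {"iphone": 0.02, "phone": 0.05}                          # P(c)

print(best_candidate("ipohne", ["iphone", "phone"], P_ERROR, P_LANG))  # iphone
```

Here "phone" is the more frequent word, but the error model makes "iphone" the more plausible source of "ipohne", so the product favours "iphone".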
  • 8. Building the language model
    N-gram model:
    - Build a 2-gram dictionary
    - Remove 2-grams whose count is below a certain threshold
    Data used:
    - All product contents on Tiki
    - All Tiki search queries for a year
    - Some randomly crawled texts from the Vietnamese Web
    - Total: 5.5 GB gzipped
  • 9. Building the language model (example)
    Data (queries on Tiki):
      máy rửa mặt
      máy rửa mắt
      máy sấy tóc
      máy xay tóc
      máy xay sinh tố
      máy rửa mắt
      máy xay sinh tố
      máy sấy tóc
      ...
      máy sấy tóc
      máy xay tóc
      máy xay sinh tố
      máy rửa mắt
      máy rửa mắt
      máy xay sinh tố
      máy sấy tóc
    Counted queries:
      200  máy rửa mặt
        5  máy rửa mắt
      100  máy sấy tóc
        5  máy xay tóc
      100  máy xay sinh tố
  • 10. Building the language model (example)
    Counted queries:
      200  máy rửa mặt
        5  máy rửa mắt
      100  máy sấy tóc
        5  máy xay tóc
      100  máy xay sinh tố
    Language model:
      410  <
      410  >
      410  máy
      410  < máy
      205  máy rửa
      100  máy sấy
      105  máy xay
      105  tóc >
      100  sấy tóc
        5  xay tóc
      105  tóc
      ...
    We count every single word and every pair of adjacent words in the counted queries (with “<” and “>” marking the start and end of a query) and write the counts into the language model. This lets us calculate the probability of observing a word either without context or with a context of one word before or after it.
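The counting step on this slide can be sketched as below. It is an illustration under assumptions: `build_language_model` and `COUNTED_QUERIES` are hypothetical names, whitespace tokenization is assumed, and the threshold-based pruning mentioned earlier is left out.

```python
from collections import Counter

# Counted queries from the slide: query -> number of occurrences.
COUNTED_QUERIES = {
    "máy rửa mặt": 200,
    "máy rửa mắt": 5,
    "máy sấy tóc": 100,
    "máy xay tóc": 5,
    "máy xay sinh tố": 100,
}


def build_language_model(counted_queries):
    """Count single words and adjacent word pairs, with < and > as query boundaries."""
    unigrams, bigrams = Counter(), Counter()
    for query, count in counted_queries.items():
        tokens = ["<"] + query.split() + [">"]
        for token in tokens:
            unigrams[token] += count
        for left, right in zip(tokens, tokens[1:]):
            bigrams[(left, right)] += count
    return unigrams, bigrams


unigrams, bigrams = build_language_model(COUNTED_QUERIES)
print(unigrams["máy"], bigrams[("máy", "rửa")], bigrams[("sấy", "tóc")])  # 410 205 100
```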
  • 11. Building the language model (example)
    Language model:
      410  <
      410  >
      410  máy
      410  < máy
      205  máy rửa
      100  máy sấy
      105  máy xay
      105  tóc >
      100  sấy tóc
        5  xay tóc
      105  tóc
      ...
    Query: máy => “< máy >”
      P(máy) = 0.5 * (P(< máy) + P(máy >)) = 0.5 * (410/410 + 0/410) = 0.5
    Query: máy xay tóc
      P(xay) = 0.5 * (P(máy xay) + P(xay tóc)) = 0.5 * (105/410 + 5/105) ~ 0.15
      P(sấy) = 0.5 * (P(máy sấy) + P(sấy tóc)) = 0.5 * (100/410 + 100/105) ~ 0.60
    The language model suggests that “sấy” is more probable in this context than “xay”.
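The arithmetic on this slide can be reproduced with a small function. This is a sketch under assumptions: `context_probability` is a hypothetical name, and the rule of dividing each pair count by the count of the neighbouring word is inferred from the slide's own fractions.

```python
from collections import Counter

# Counts copied from the language model on the slide.
UNIGRAMS = Counter({"<": 410, ">": 410, "máy": 410, "tóc": 105})
BIGRAMS = Counter({
    ("<", "máy"): 410, ("máy", "rửa"): 205, ("máy", "sấy"): 100,
    ("máy", "xay"): 105, ("tóc", ">"): 105, ("sấy", "tóc"): 100,
    ("xay", "tóc"): 5,
})


def context_probability(word, left, right, unigrams, bigrams):
    """Average of P(left word) and P(word right), each normalised by the neighbour's count."""
    p_left = bigrams[(left, word)] / unigrams[left] if unigrams[left] else 0.0
    p_right = bigrams[(word, right)] / unigrams[right] if unigrams[right] else 0.0
    return 0.5 * (p_left + p_right)


p_xay = context_probability("xay", "máy", "tóc", UNIGRAMS, BIGRAMS)
p_say = context_probability("sấy", "máy", "tóc", UNIGRAMS, BIGRAMS)
print(p_say > p_xay)  # True
```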
  • 12. Building the error model
    Automatic extraction of P(w|c):
    - Extract triplets (w1, w2, w3) from our text set
    - Group triplets by (w1, *, w3) and sort by descending popularity
    - Remove groupings below a certain threshold
    - Remove samples where the w2 words are too far from each other (using Damerau-Levenshtein distance)
    - Remove samples with popularity comparable to the most popular sample in the grouping
    - Write the w2 words from all remaining samples into the error model mapping as triplets of (observed word, intended word, count)
    Data used:
    - Same as for the language model
  • 13. Building the error model (example)
    Data (queries on Tiki):
      máy rửa mặt
      máy rửa mắt
      máy sấy tóc
      máy xay tóc
      máy xay sinh tố
      máy rửa mắt
      máy xay sinh tố
      máy sấy tóc
      ...
      máy sấy tóc
      máy xay tóc
      máy xay sinh tố
      máy rửa mắt
      máy rửa mắt
      máy xay sinh tố
      máy sấy tóc
    Counted queries:
      200  máy rửa mặt
        5  máy rửa mắt
      100  máy sấy tóc
        5  máy xay tóc
      100  máy xay sinh tố
  • 14. Building the error model (example)
    Counted queries:
      200  máy rửa mặt
        5  máy rửa mắt
      100  máy sấy tóc
        5  máy xay tóc
      100  máy xay sinh tố
    Triplets:
      205  < máy rửa
      200  rửa mặt >
        5  rửa mắt >
      100  máy sấy tóc
        5  máy xay tóc
      200  máy rửa mặt
        5  máy rửa mắt
      105  < máy xay
      100  sinh tố >
      ...
    We count all possible triplets from our counted queries data.
  • 15. Building the error model (example)
    Triplets (grouped):
      rửa * >
        200  rửa mặt >
          5  rửa mắt >
      máy * tóc
        100  máy sấy tóc
          5  máy xay tóc
      máy * sinh
        100  máy xay sinh
      sinh * >
        100  sinh tố >
      ...
    Error model (format: count observed_word intended_word):
      200  mặt mặt
        5  mắt mặt
      100  sấy sấy
        5  xay sấy
      100  xay xay
      100  tố tố
      ...
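The extraction steps can be sketched end to end on the example data. This is an illustration under stated assumptions, not the talk's code: the thresholds `min_group_count`, `max_distance` and `max_ratio` are hypothetical values picked so that the toy data reproduces the slide's error model (`max_distance=2` because “xay” and “sấy” differ by two substitutions), and the distance is the optimal string alignment variant.

```python
import unicodedata
from collections import Counter, defaultdict


def osa_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant, an assumption)."""
    prev2, prev = [], list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def build_error_model(counted_queries, min_group_count=50, max_distance=2, max_ratio=0.1):
    """Return Counter {(observed, intended): count}; all thresholds are hypothetical."""
    groups = defaultdict(Counter)  # (w1, w3) -> counts of the middle word w2
    for query, count in counted_queries.items():
        words = unicodedata.normalize("NFC", query).split()
        tokens = ["<"] + words + [">"]
        for w1, w2, w3 in zip(tokens, tokens[1:], tokens[2:]):
            groups[(w1, w3)][w2] += count
    model = Counter()
    for middles in groups.values():
        if sum(middles.values()) < min_group_count:
            continue  # grouping too rare to trust
        intended, top = middles.most_common(1)[0]
        model[(intended, intended)] += top
        for observed, count in middles.items():
            if observed == intended:
                continue
            if osa_distance(observed, intended) > max_distance:
                continue  # a different word, not a misspelling
            if count > max_ratio * top:
                continue  # comparably popular, likely a valid alternative
            model[(observed, intended)] += count
    return model


COUNTED_QUERIES = {"máy rửa mặt": 200, "máy rửa mắt": 5, "máy sấy tóc": 100,
                   "máy xay tóc": 5, "máy xay sinh tố": 100}
model = build_error_model(COUNTED_QUERIES)
print(model[("mắt", "mặt")], model[("xay", "sấy")])  # 5 5
```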
  • 16. Building the error model (example)
    Query: kem rửa mắt
    P(mắt|mắt) = 0/5 = 0.0: we divide the number of times “mắt” was intended when “mắt” was observed in the error model by the total number of times “mắt” was observed in the error model.
    P(mắt|mặt) = 5/5 = 1.0: again, we divide the number of times “mặt” was intended when “mắt” was observed in the error model by the total number of times “mắt” was observed in the error model.
    This means that, according to the error model built on our data, it is extremely likely for “mắt” to be a misspelling of “mặt”.
    Error model (format: count observed_word intended_word):
      200  mặt mặt
        5  mắt mặt
      100  sấy sấy
        5  xay sấy
      100  xay xay
      100  tố tố
      ...
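The lookup described on this slide can be sketched as a small function over the example error model. `error_probability` and `ERROR_MODEL` are hypothetical names; the normalisation by the total count of the observed word follows the slide's own calculation.

```python
# Error model from the slide: (observed_word, intended_word) -> count.
ERROR_MODEL = {
    ("mặt", "mặt"): 200,
    ("mắt", "mặt"): 5,
    ("sấy", "sấy"): 100,
    ("xay", "sấy"): 5,
    ("xay", "xay"): 100,
    ("tố", "tố"): 100,
}


def error_probability(observed, intended, error_model):
    """Count of (observed, intended) divided by the total count of `observed`."""
    total = sum(n for (obs, _), n in error_model.items() if obs == observed)
    if total == 0:
        return 0.0  # the observed word never appears in the error model
    return error_model.get((observed, intended), 0) / total


print(error_probability("mắt", "mắt", ERROR_MODEL))  # 0.0
print(error_probability("mắt", "mặt", ERROR_MODEL))  # 1.0
```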
  • 17. Quality optimizations
    Idea:
    - The language model is more important when more context is available
    - Instead of P(w|c) * P(c), use P(w|c) * pow(P(c), lambda)
    - Lambda depends on the length of the available context
    Results:
    - Using a bigger lambda for longer context gives a better test result (the idea works!)
    - For bigger N-grams, machine learning is needed to optimize the lambdas
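A small sketch of how the exponent changes the decision, reusing the probabilities from the “máy xay tóc” example. The lambda values 1.0 and 3.0 are hypothetical (the slides do not give the values used), and `weighted_score` and `best` are illustrative names.

```python
def weighted_score(p_error, p_lang, lam):
    """P(w|c) * P(c)**lambda; a larger lambda gives the language model more weight."""
    return p_error * p_lang ** lam


# Observed word w = "xay" in the query "máy xay tóc"; candidates are "xay" and "sấy".
P_ERROR = {"xay": 100 / 105, "sấy": 5 / 105}              # P(w="xay" | c)
P_LANG = {"xay": 0.5 * (105 / 410 + 5 / 105),             # P(c) in this context
          "sấy": 0.5 * (100 / 410 + 100 / 105)}


def best(lam):
    """Candidate with the highest weighted score for a given lambda."""
    return max(P_ERROR, key=lambda c: weighted_score(P_ERROR[c], P_LANG[c], lam))


print(best(1.0))  # xay
print(best(3.0))  # sấy
```

With lambda = 1 the error model dominates and the query is left unchanged; with the larger, hypothetical lambda = 3 the context outweighs it and “sấy” is preferred.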
  • 18. Performance optimizations
    Important fact:
    It is possible to prove that if the Damerau-Levenshtein distance(w, c) = N, then for any w and c there is a sequence of at most N single-character deletions from each word that leads to the same string. Examples:
    distance(iphone, iphobee) = 2 (one insertion, one substitution)
      iphone -> iphoe VS iphobee -> iphoee -> iphoe (match!)
    distance(iphone, pihoone) = 2 (one transposition, one insertion)
      iphone -> ihone VS pihoone -> ihoone -> ihone (match!)
    Let’s use this to optimize candidate generation!
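The two examples can be checked with a short helper that enumerates delete variants. This is a sketch; the function name `deletes` is hypothetical.

```python
def deletes(word, max_deletes):
    """All strings obtained from `word` by deleting 0..max_deletes characters."""
    results = {word}
    frontier = {word}
    for _ in range(max_deletes):
        # Delete one more character from every string produced in the previous round.
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


# Words at distance 2 share at least one variant within 2 deletes of each.
print("iphoe" in deletes("iphone", 2) & deletes("iphobee", 2))  # True
print("ihone" in deletes("iphone", 2) & deletes("pihoone", 2))  # True
```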
  • 19. Performance optimizations
    Problem 1: generating candidates is CPU-intensive
    - Precompute a “deletes” dictionary
    - Use only delete operations from both sides
    - Need to double-check the distance (it can be up to 2N, but we need N)
    - Fast, but requires RAM
    Problem 2: the “deletes” dictionary requires RAM
    - Use different data compression techniques
    - From what we’ve tried, Judy dynamic arrays work the best
    - We decreased RAM requirements from 10.5 GB to 2.3 GB
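A minimal sketch of the precomputed “deletes” dictionary and the distance double-check, using a plain Python dict rather than the Judy arrays mentioned on the slide. The names `build_index` and `lookup`, the tiny vocabulary, and the choice of the optimal string alignment distance are all assumptions made for illustration.

```python
from collections import defaultdict


def deletes(word, max_deletes):
    """All strings obtained from `word` by deleting 0..max_deletes characters."""
    results, frontier = {word}, {word}
    for _ in range(max_deletes):
        frontier = {w[:i] + w[i + 1:] for w in frontier for i in range(len(w))}
        results |= frontier
    return results


def osa_distance(a, b):
    """Damerau-Levenshtein distance (optimal string alignment variant, an assumption)."""
    prev2, prev = [], list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def build_index(vocabulary, n):
    """Precompute: delete variant -> set of vocabulary words that produce it."""
    index = defaultdict(set)
    for word in vocabulary:
        for variant in deletes(word, n):
            index[variant].add(word)
    return index


def lookup(query, index, n):
    """Collect words sharing a delete variant, then keep only those within distance n."""
    raw = set()
    for variant in deletes(query, n):
        raw |= index.get(variant, set())
    # Sharing a variant only guarantees distance <= 2n, so verify each candidate.
    return {c for c in raw if osa_distance(query, c) <= n}


index = build_index(["iphone", "phone", "hones"], 1)
print(lookup("ipohne", index, 1))         # {'iphone'}
print(sorted(lookup("phone", index, 1)))  # ['iphone', 'phone']
```

In the second lookup, “hones” shares the variant “hone” with “phone” but is at distance 2, so the double-check removes it.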
  • 20. Testing results
    Testing set:
    - 5,000 random queries, 10,000 misspelled queries
    - Suggestions collected through Google API and then manually checked
    - Only one marker per query
    Results:
    - Slightly (10-12%) worse than Google (acceptable for such RAM requirements)
    - The A/B test shows a 3-9% increase in purchases
  • 21. Future plans
    Implementation:
    - Use 3-gram data (still trying to keep it RAM-optimal)
    Testing:
    - Use a multi-marker test set
    - Properly handle cases when the spellchecker returns multiple variants