Introduction to japanese tokenizer

Introduction to Japanese Tokenizers
WebHack 2020-11-10
by Wanasit T.

About Me
● Github: @wanasit
○ Text / NLP projects
● Manager, Software Engineer @ Indeed
○ Search Quality (Metadata) team
○ Work on NLP problems for Jobs / Resumes

Disclaimer
1. This talk is NOT related to any of Indeed’s technology
2. I’m not Japanese (or a native speaker)
○ But I built a Japanese tokenizer in my free time

Today's Topics
● NLP and Tokenization (for Japanese)
● Lattice-based Tokenizers (MeCab-style tokenizers)
● How it works
○ Dictionary
○ Tokenization

NLP and Tokenization
● How does a computer represent text?
● String (or Char[ ] or Byte[ ] )
■ "Abc"
■ "Hello World"

"Biden is projected winner in Michigan,
Wisconsin as tense nation watch final tally"
● What’s the topic?
● Who is winning? where?
Source: NBC News

● Tokenization / Segmentation
● The first step in solving NLP problems is usually identifying the words in the string
○ Input: string, char[ ] (or byte[ ])
○ Output: a list of meaningful words (or tokens)

"Biden is projected winner in Michigan, Wisconsin as tense nation watch final tally".split(/\W+/)
> ["Biden", "is", "projected", "winner", "in", ...]

Japanese Tokenization
"バイデン氏がミシガン州勝利、大統領にむけ“王手”"
● No spaces between words, and very little punctuation
● Q: How do you split this into words?
Source: TBS News
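
To make the problem concrete, here is a minimal Python sketch (Python is a choice made for these examples, not something from the talk) that applies the same split-on-non-word-characters idea to the headline. It only breaks at the comma and the quotation marks, so each chunk still contains several words.

```python
import re

headline = "バイデン氏がミシガン州勝利、大統領にむけ“王手”"

# Split on runs of non-word characters, as in the English example.
# In Python 3, \W is Unicode-aware: kana and kanji count as word characters,
# so only the punctuation marks act as separators.
chunks = [c for c in re.split(r"\W+", headline) if c]

print(chunks)
```

The result has three long chunks instead of individual words, which is why Japanese needs a dictionary-driven tokenizer.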

● Use prior Japanese knowledge (Dictionary)
○ が, に, …, 氏, 州, …, バイデン
● Consider the context and combination of characters
● Consider the likelihood
○ e.g. 東京都 => [東京, 都], or [東, 京都]
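
The ambiguity in the 東京都 example can be sketched by listing every split that a dictionary allows. The toy dictionary below is hypothetical and far smaller than a real one; the sketch only enumerates candidates and does not yet choose between them.

```python
def segmentations(text, dictionary):
    """Return every way to split `text` into terms found in `dictionary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep `head` and recursively split whatever follows it.
            for rest in segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


# Toy dictionary (hypothetical; a real one has hundreds of thousands of terms).
toy_dictionary = {"東", "京", "都", "東京", "京都"}

candidates = segmentations("東京都", toy_dictionary)
for candidate in candidates:
    print(candidate)
```

With this dictionary there are three candidates, including both [東京, 都] and [東, 京都]; picking the most likely one is what the costs described later are for.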

Lattice-based Tokenizers
● a.k.a. MeCab-based tokenizers (or Viterbi tokenizers)
● How:
○ From a Dictionary (required)
○ Build a Lattice (or a graph) from surface dictionary terms
○ Run the Viterbi algorithm to find the best connected path

Lattice-Based Tokenizers
● Most tokenizers are re-implementations of MeCab (C/C++) on different platforms:
○ Kuromoji, Sudachi (Java), Kotori (Kotlin)
○ Janome, SudachiPy (Python)
○ Kagome (Go)
○ ...

Non-Lattice-Based Tokenizers
● Is Lattice-based the only approach?
● Mostly yes, but there are also:
○ Juman++, Nagisa (RNN)
○ SentencePiece (unsupervised; used with some BERT-style models)
● Out-of-scope of this presentation

Dictionary
● Lattice-based tokenizers need a dictionary
○ To recognize predefined terms and grammar
● Dictionaries can often be downloaded as plugins, e.g.
○ $ brew install mecab
○ $ brew install mecab-ipadic

Dictionary
● The recommended dictionary for beginners is MeCab’s IPADIC
● Available from this website

Dictionary - Term Table / Lexicon / CSV files
Surface Form | Context ID (left) | Context ID (right) | Cost | Type | Form | Spelling | ...
東京 | 1293 | 1293 | 3003 | 名詞 (place) | - | トウキョウ | ...
京都 | 1293 | 1293 | 2135 | 名詞 (place) | - | キョウト | ...
東京塚 | 1293 | 1293 | 8676 | 名詞 (place) | - | ヒガシキョウヅカ | ...
行く | 992 | 992 | 8852 | 動詞 (v) | 基本形 | イク | ...
行か | 1002 | 1002 | 7754 | 動詞 (v) | 未然形 | イカ | ...
いく | 992 | 992 | 9672 | 動詞 (v) | 基本形 | イク | ...

Dictionary - Term Table
● Surface Form: How the term should appear in the string
● Context ID (left/right): ID used for connecting terms together (see later)
● Cost: How commonly the term is used
○ The higher the cost, the less common or less likely the term
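
As a rough illustration of the term table, the sketch below loads a few rows into records. It assumes a comma-separated layout in which the first four columns are surface form, left context ID, right context ID and cost, with all remaining columns treated as opaque features. The sample rows are simplified from the table above, and the `Term` and `load_terms` names are made up for this example.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Term:
    surface: str
    left_id: int
    right_id: int
    cost: int
    features: list  # everything after the cost column, kept as strings


# Simplified sample rows (assumed layout, values taken from the slide).
SAMPLE_CSV = """\
東京,1293,1293,3003,名詞,トウキョウ
京都,1293,1293,2135,名詞,キョウト
行く,992,992,8852,動詞,イク
"""


def load_terms(csv_text):
    terms = []
    for row in csv.reader(io.StringIO(csv_text)):
        surface, left_id, right_id, cost, *features = row
        terms.append(Term(surface, int(left_id), int(right_id), int(cost), features))
    return terms


terms = load_terms(SAMPLE_CSV)
# Lower cost means a more common term, so sort ascending.
by_cost = [t.surface for t in sorted(terms, key=lambda t: t.cost)]
print(by_cost)
```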

Dictionary - Connection Table / Connection Cost
Context ID (from) | Context ID (to) | Cost
...
992 | 992 | 3003
992 | 993 | 2135
...
992 | 1293 | -1000
992 | 1294 | -1000
...
● Connection cost between types of terms
● The lower the cost, the more likely the connection
● e.g.
○ 992 (v-ru) then 992 (v-ru)
■ Cost = 3003 (unlikely)
○ 992 (v-ru) then 1294 (noun)
■ Cost = -1000 (likely)
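
One way to read the connection table is as a lookup keyed by a pair of context IDs. The sketch below assumes that "from" is the right context ID of the earlier term and "to" is the left context ID of the later term, and that the cost of two adjacent terms is the sum of both term costs and the connection cost. The `pair_cost` helper is hypothetical.

```python
# Connection costs keyed by (right context ID of the previous term,
# left context ID of the next term). Values are the ones shown on the slide.
CONNECTION_COST = {
    (992, 992): 3003,
    (992, 993): 2135,
    (992, 1293): -1000,
    (992, 1294): -1000,
}


def pair_cost(prev_term, next_term):
    """Cost of `next_term` directly following `prev_term`.

    Each term is a (surface, left_id, right_id, cost) tuple.
    """
    _, _, prev_right_id, prev_cost = prev_term
    _, next_left_id, _, next_cost = next_term
    return prev_cost + CONNECTION_COST[(prev_right_id, next_left_id)] + next_cost


iku = ("行く", 992, 992, 8852)
tokyo = ("東京", 1293, 1293, 3003)

print(pair_cost(iku, tokyo))  # 8852 + (-1000) + 3003 = 10855
print(pair_cost(iku, iku))    # 8852 + 3003 + 8852 = 20707
```

The negative connection cost makes verb-then-noun cheaper than the two term costs alone, while verb-then-verb is penalized.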

Term table size:
● Kotori (default) ~380,000 terms (3.7 MB)
● MeCab IPADIC ~400,000 terms (12.2 MB)
● Sudachi - Small ~750,000 terms (39.8 MB)
● Sudachi - Full ~2,800,000 terms (121 MB)
○ Includes terms like: "ヽ(`ー`)ノ"

● What about words not in the table?
○ e.g. "ワナシットタナキットルンアン"
○ “Unknown-Term Extraction” Problem
○ Typically handled with heuristic rules
■ e.g. a run of consecutive katakana is treated as a Noun.
● Out-of-scope of this presentation
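
For a flavor of such a heuristic, the sketch below finds runs of consecutive katakana and treats each run as a candidate unknown noun. The character range and the `guess_unknown_nouns` name are assumptions for illustration, not the rules of any particular tokenizer.

```python
import re

# One or more consecutive katakana characters (including the long-vowel mark).
KATAKANA_RUN = re.compile(r"[ァ-ヶー]+")


def guess_unknown_nouns(text):
    """Return katakana runs, each treated as a candidate unknown noun."""
    return KATAKANA_RUN.findall(text)


print(guess_unknown_nouns("ワナシットタナキットルンアン"))
print(guess_unknown_nouns("バイデン氏がミシガン州勝利"))
```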

Lattice-Based Tokenization
Given:
● The Dictionary
● Input: "東京都に住む"
Tokenizer:
1. Find all terms in the input and build a lattice
2. Find the minimum cost path through the lattice

Step 1: Finding all terms
● For each index i
○ find all dictionary terms that start at position i
● String / Pattern Matching problem
○ Requires an efficient lookup data structure for the dictionary
○ e.g. Trie, Finite-State Transducer
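
Step 1 can be sketched with a simple dictionary-of-dictionaries trie. The toy term list is hypothetical, and real tokenizers typically use more compact structures; this version only shows the lookup of all terms starting at each position.

```python
class Trie:
    def __init__(self):
        self.children = {}
        self.is_term = False

    def insert(self, term):
        node = self
        for ch in term:
            node = node.children.setdefault(ch, Trie())
        node.is_term = True

    def prefixes_at(self, text, start):
        """Return all dictionary terms that begin at text[start]."""
        found = []
        node = self
        for end in range(start, len(text)):
            node = node.children.get(text[end])
            if node is None:
                break  # no dictionary term continues with this character
            if node.is_term:
                found.append(text[start:end + 1])
        return found


# Toy dictionary (hypothetical).
toy_terms = ["東", "東京", "京都", "都", "に", "住む"]
trie = Trie()
for term in toy_terms:
    trie.insert(term)

text = "東京都に住む"
# Lattice candidates: for every start index, the terms beginning there.
lattice = {i: trie.prefixes_at(text, i) for i in range(len(text))}
for i, found in lattice.items():
    print(i, found)
```

Each lookup walks the trie once from the start position, so all terms sharing a prefix are found in a single pass.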

Step 2: Finding minimum cost
● Viterbi Algorithm (Dynamic Programming)
● For each node from left to right
○ Find the minimum cost path leading to that node
○ Reuse the selected path when considering the following nodes
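
Step 2 can be sketched as follows. The costs for 東京 and 京都 come from the slides; every other cost, context ID and connection cost is invented for illustration. Missing connection pairs are assumed to cost 0, and the end-of-sentence connection is ignored to keep the example short.

```python
from collections import namedtuple

Term = namedtuple("Term", "surface left_id right_id cost")

# Toy dictionary. Only the 東京 / 京都 costs and the 1293 / 992 IDs appear in
# the slides; all other numbers below are made up.
NOUN, VERB, SUFFIX, PARTICLE = 1293, 992, 1, 2
TERMS = [
    Term("東京", NOUN, NOUN, 3003),
    Term("京都", NOUN, NOUN, 2135),
    Term("東", NOUN, NOUN, 6000),
    Term("都", SUFFIX, SUFFIX, 4000),
    Term("に", PARTICLE, PARTICLE, 4000),
    Term("住む", VERB, VERB, 7000),
]
CONNECTION = {(NOUN, NOUN): 1000, (NOUN, SUFFIX): -500}  # missing pairs cost 0


def connect(prev_term, term):
    return CONNECTION.get((prev_term.right_id, term.left_id), 0)


def tokenize(text):
    bos = Term("", 0, 0, 0)  # beginning-of-sentence marker
    # ends[i] holds nodes (term, best_total_cost, previous_node) ending at i.
    ends = [[] for _ in range(len(text) + 1)]
    ends[0].append((bos, 0, None))
    for start in range(len(text)):
        if not ends[start]:
            continue  # no path reaches this position
        for term in TERMS:
            if not text.startswith(term.surface, start):
                continue
            # Choose the cheapest predecessor for this node and remember it.
            prev = min(ends[start], key=lambda n: n[1] + connect(n[0], term))
            total = prev[1] + connect(prev[0], term) + term.cost
            ends[start + len(term.surface)].append((term, total, prev))
    if not ends[len(text)]:
        raise ValueError("no path covers the whole input")
    node = min(ends[len(text)], key=lambda n: n[1])
    total_cost = node[1]
    tokens = []
    while node[2] is not None:  # follow back-pointers to the start
        tokens.append(node[0].surface)
        node = node[2]
    return tokens[::-1], total_cost


print(tokenize("東京都に住む"))
```

With these toy numbers the path 東京 / 都 / に / 住む (total 17503) beats 東 / 京都 / に / 住む, because the best cost is stored per node and reused rather than recomputed for every full path.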

Introduction to Japanese Tokenizers
● Introduction to NLP and Tokenization
● Lattice-based tokenizers (MeCab and others)
○ Dictionary
■ Term table, Connection Cost, ...
○ Tokenization Algorithms
■ Pattern Matching, Viterbi Algorithm, ...

Learn more:
● Kotori (on Github), A Japanese tokenizer written in Kotlin
○ Small and performant (fastest among JVM-based tokenizers)
○ Supports multiple dictionary formats
● Article: How Japanese Tokenizers Work (by Wanasit)
● Article: 日本語形態素解析の裏側を覗く！ (by Cookpad Developer)
● Book: 自然言語処理の基礎 (by Manabu Okumura)
