Introduction to Japanese Tokenizers
WebHack 2020-11-10
by Wanasit T.
About Me
● GitHub: @wanasit
○ Text / NLP projects
● Manager, Software Engineer @ Indeed
○ Search Quality (Metadata) team
○ Work on NLP problems for Jobs / Resumes
Disclaimer
1. This talk is not related to any of Indeed’s technology
2. I’m not Japanese (or a native speaker)
○ But I built a Japanese tokenizer in my free time
Today's Topics
● NLP and Tokenization (for Japanese)
● Lattice-based Tokenizers (MeCab-style tokenizers)
● How it works
○ Dictionary
○ Tokenization
NLP and Tokenization
● How does a computer represent text?
● String (or char[] or byte[])
■ "Abc"
■ "Hello World"
NLP and Tokenization
"Biden is projected winner in Michigan,
Wisconsin as tense nation watch final tally"
● What’s the topic?
● Who is winning? where?
Source: NBC News
NLP and Tokenization
● Tokenization / Segmentation
● The first step in solving an NLP problem is usually to identify the words in the string
○ Input: string, char[] (or byte[])
○ Output: a list of meaningful words (or tokens)
NLP and Tokenization
"Biden is projected winner in Michigan, Wisconsin as tense nation watch final tally".split(/\W+/)
> ["Biden", "is", "projected", "winner", "in", ...]
Japanese Tokenization
"バイデン氏がミシガン州勝利、大統領にむけ“王手"
(roughly: "Biden wins Michigan, one move away from the presidency")
● No spaces or other delimiters between words
● Q: How do you split this into words?
Source: TBS News
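To see why the split-on-non-word-characters trick breaks down, here is a minimal Python sketch (Python and the variable names are used for illustration only, they are not from the talk) that applies the same `\W+` split to both headlines.

```python
import re

english = "Biden is projected winner in Michigan, Wisconsin as tense nation watch final tally"
japanese = "バイデン氏がミシガン州勝利、大統領にむけ“王手"

# Splitting on runs of non-word characters recovers the English words.
english_tokens = re.split(r"\W+", english)

# The Japanese headline only breaks at the comma and the quotation mark,
# so each chunk still contains several words glued together.
japanese_chunks = re.split(r"\W+", japanese)

print(english_tokens[:5])
print(japanese_chunks)
```

The Japanese result is a handful of long chunks rather than words, which is the gap a dictionary-based tokenizer has to fill.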
Japanese Tokenization
● Use prior Japanese knowledge (Dictionary)
○ が, に, …, 氏, 州, …, バイデン
● Consider the context and combination of characters
● Consider the likelihood
○ e.g. 東京都 => [東京, 都], or [東, 京都]
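A minimal sketch of this ambiguity, assuming a tiny hypothetical vocabulary (the `VOCABULARY` set and the `segmentations` helper are illustrative and not taken from any real dictionary): enumerating every way to cover 東京都 with known terms yields both candidate splits, which is why likelihood is needed to pick one.

```python
# Hypothetical four-word vocabulary, just enough to show the ambiguity.
VOCABULARY = {"東", "東京", "京都", "都"}

def segmentations(text, vocabulary):
    """Return every way to split `text` into terms found in `vocabulary`."""
    if not text:
        return [[]]  # one way to segment the empty string: no terms
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocabulary:
            # Keep the prefix and segment the remainder recursively.
            for rest in segmentations(text[end:], vocabulary):
                results.append([prefix] + rest)
    return results

candidates = segmentations("東京都", VOCABULARY)
for candidate in candidates:
    print(candidate)
```

Both splits are valid according to the dictionary alone; nothing here says which one is more plausible.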
Lattice-based Tokenizers
● a.k.a. MeCab-based tokenizers (or Viterbi tokenizers)
● How:
○ From a Dictionary (required)
○ Build a Lattice (or a graph) from surface dictionary terms
○ Run Viterbi algorithm to find the best connected path
Lattice-Based Tokenizers
● Most tokenizers are re-implementations of MeCab (C/C++) on different platforms:
○ Kuromoji, Sudachi (Java), Kotori (Kotlin)
○ Janome, SudachiPy (Python)
○ Kagome (Go)
○ ...
Non-Lattice-Based Tokenizers
● Is lattice-based the only approach?
● Mostly yes, but there are also:
○ Juman++, Nagisa (RNN)
○ SentencePiece (unsupervised; used by models such as ALBERT and T5, whereas the original BERT uses WordPiece)
● Out of scope for this presentation
How it works
> Dictionary
Dictionary
● Lattice-based tokenizers need a dictionary
○ To recognize predefined terms and grammar
● Dictionaries can often be downloaded as plugins, e.g.
○ $ brew install mecab
○ $ brew install mecab-ipadic
Dictionary
● The recommended dictionary for beginners is MeCab’s IPADIC
● Available from this website
Dictionary - Term Table / Lexicon / CSV files
Surface Form | Context ID (left) | Context ID (right) | Cost | Type | Form | Spelling | ...
東京 | 1293 | 1293 | 3003 | 名詞 (place) | - | トウキョウ | ...
京都 | 1293 | 1293 | 2135 | 名詞 (place) | - | キョウト | ...
東京塚 | 1293 | 1293 | 8676 | 名詞 (place) | - | ヒガシキョウヅカ | ...
...
行く | 992 | 992 | 8852 | 動詞 (v) | 基本形 | イク | ...
行か | 1002 | 1002 | 7754 | 動詞 (v) | 未然形 | イカ | ...
いく | 992 | 992 | 9672 | 動詞 (v) | 基本形 | イク | ...
Dictionary - Term Table
● Surface Form: how the term appears in the string
● Context ID (left/right): IDs used for connecting terms together (see later)
● Cost: how commonly the term is used
○ The higher the cost, the less common or less likely the term
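As a rough sketch of how such a term table could be loaded, the snippet below parses CSV rows into records. It assumes an IPADIC-style layout in which the first four fields are surface form, left context ID, right context ID, and cost, with everything after that treated as opaque features. The `Term` class, `parse_terms` helper, and the shortened sample rows are hypothetical.

```python
import csv
import io
from dataclasses import dataclass

@dataclass(frozen=True)
class Term:
    surface: str
    left_id: int
    right_id: int
    cost: int
    features: tuple  # remaining columns (type, form, spelling, ...)

def parse_terms(csv_text):
    """Parse term-table rows; assumes the first four columns described above."""
    terms = []
    for row in csv.reader(io.StringIO(csv_text)):
        if not row:
            continue  # skip blank lines
        surface, left_id, right_id, cost, *features = row
        terms.append(Term(surface, int(left_id), int(right_id), int(cost), tuple(features)))
    return terms

# Shortened sample rows; real dictionary files carry more feature columns.
SAMPLE = "東京,1293,1293,3003,名詞,トウキョウ\n京都,1293,1293,2135,名詞,キョウト\n"

terms = parse_terms(SAMPLE)
# Lower cost means more common, so 京都 sorts ahead of 東京 here.
most_common = min(terms, key=lambda term: term.cost)
print(most_common.surface, most_common.cost)
```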
Dictionary - Connection Table / Connection Cost
Context ID (from) | Context ID (to) | Cost
... | ... | ...
992 | 992 | 3003
992 | 993 | 2135
... | ... | ...
992 | 1293 | -1000
992 | 1294 | -1000
... | ... | ...
● Connection cost between types of terms
● The lower the cost, the more likely the connection
● e.g.
● 992 (v-ru) then 992 (v-ru)
○ Cost = 3003 (unlikely)
● 992 (v-ru) then 1294 (noun)
○ Cost = -1000 (likely)
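To make the arithmetic concrete, here is a small sketch that scores a sequence of terms as the sum of term costs plus the connection costs between neighbours. It reuses rows shown on the slides and assumes that "from" is the right context ID of the preceding term and "to" is the left context ID of the following term. The `path_cost` helper is hypothetical, and the start/end-of-sentence connections are ignored for brevity.

```python
# Term table rows from the slides: surface -> (left ID, right ID, cost)
TERMS = {
    "京都": (1293, 1293, 2135),
    "行く": (992, 992, 8852),
}

# Connection table rows from the slides: (from ID, to ID) -> cost
CONNECTION_COSTS = {
    (992, 992): 3003,
    (992, 993): 2135,
    (992, 1293): -1000,
    (992, 1294): -1000,
}

def path_cost(surfaces):
    """Sum term costs and the connection costs between adjacent terms."""
    total = 0
    previous_right_id = None
    for surface in surfaces:
        left_id, right_id, term_cost = TERMS[surface]
        if previous_right_id is not None:
            total += CONNECTION_COSTS[(previous_right_id, left_id)]
        total += term_cost
        previous_right_id = right_id
    return total

# The sequences are only there to exercise the tables, not to be good Japanese.
verb_then_verb = path_cost(["行く", "行く"])  # 8852 + 3003 + 8852
verb_then_noun = path_cost(["行く", "京都"])  # 8852 - 1000 + 2135
print(verb_then_verb, verb_then_noun)
```

The tokenizer prefers whichever path has the lower total, so a cheap connection can outweigh a moderately expensive term.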
Dictionary - Term Table
Term table size:
● Kotori (default) ~380,000 terms (3.7 MB)
● MeCab IPADIC ~400,000 terms (12.2 MB)
● Sudachi - Small ~750,000 terms (39.8 MB)
● Sudachi - Full ~2,800,000 terms (121 MB)
○ Includes terms like: "ヽ(`ー`)ノ"
Dictionary - Term Table
● What about words not in the table?
○ e.g. "ワナシット タナキットルンアン"
○ This is the “unknown-term extraction” problem
○ Typically handled with heuristic rules
■ e.g. a run of consecutive katakana is treated as a noun
● Out of scope for this presentation
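Although a full treatment is out of scope, the katakana heuristic mentioned above can be sketched in a few lines. This is only an illustration under simple assumptions: the `guess_unknown_nouns` helper and the `known` set are hypothetical, and real tokenizers use richer character-class rules.

```python
import re

# Katakana letters (ァ to ヺ) plus the long-vowel mark ー.
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")

def guess_unknown_nouns(text, known_terms):
    """Return katakana runs that are not in the dictionary, guessed to be nouns."""
    return [run for run in KATAKANA_RUN.findall(text) if run not in known_terms]

# Hypothetical dictionary that does not contain the speaker's name.
known = {"東京", "に", "住む", "は"}

print(guess_unknown_nouns("ワナシットは東京に住む", known))
print(guess_unknown_nouns("ワナシット タナキットルンアン", known))
```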
How it works
> Tokenization
Lattice-Based Tokenization
Given:
● The dictionary
● Input: "東京都に住む"
Tokenizer:
1. Find all terms in the input and build a lattice
2. Find the minimum-cost path through the lattice
Step 1: Finding all terms
● For each index i of the input
○ Find all dictionary terms that start at position i
● This is a string / pattern matching problem
○ Requires an efficient lookup data structure for the dictionary
○ e.g. a trie or a finite-state transducer (FST)
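Step 1 can be sketched with a plain dictionary-of-dictionaries trie. This is a minimal illustration, not how MeCab or Kotori store their dictionaries; the vocabulary and the helper names (`build_trie`, `common_prefix_search`, `find_all_terms`) are hypothetical.

```python
END = ""  # sentinel key marking "a term ends here"; never a real character

def build_trie(terms):
    """Build a nested-dict trie with one level per character."""
    root = {}
    for term in terms:
        node = root
        for char in term:
            node = node.setdefault(char, {})
        node[END] = term
    return root

def common_prefix_search(trie, text, start):
    """Return (start, end, surface) for every term beginning at `start`."""
    matches = []
    node = trie
    for end in range(start, len(text)):
        node = node.get(text[end])
        if node is None:
            break  # no dictionary term continues with this character
        if END in node:
            matches.append((start, end + 1, node[END]))
    return matches

def find_all_terms(trie, text):
    """Collect the lattice nodes: all dictionary terms at every index."""
    return [match for i in range(len(text)) for match in common_prefix_search(trie, text, i)]

trie = build_trie(["東", "東京", "京都", "都", "に", "住む"])
lattice_nodes = find_all_terms(trie, "東京都に住む")
for node in lattice_nodes:
    print(node)
```

Each lookup walks the trie once per starting index and stops as soon as no term can match, so there is no need to test every substring against the whole dictionary.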
Step 2: Finding minimum cost
● Viterbi algorithm (dynamic programming)
● For each node, from left to right:
○ Find the minimum-cost path leading to that node
○ Reuse the selected path when considering the following nodes
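The two bullets above can be sketched as follows. Everything concrete here is an assumption made for illustration: the toy `TERMS` and `CONNECTION_COSTS` values are invented, each term has a single context ID, each surface form has one entry, and terms are found by substring lookup rather than a trie.

```python
from dataclasses import dataclass
from typing import Optional

BOUNDARY_ID = 0  # hypothetical context ID for sentence start and end

# Hypothetical toy dictionary: surface -> (context ID, term cost)
TERMS = {
    "東": (1, 6000), "東京": (1, 3003), "京都": (1, 2135),
    "都": (2, 4000), "に": (3, 1000), "住む": (4, 5000),
}
# Hypothetical connection costs; pairs not listed default to 0
CONNECTION_COSTS = {(1, 1): 1000, (1, 2): -500}

@dataclass
class Node:
    surface: str
    context_id: int
    total_cost: int             # cheapest cost of any path ending in this node
    previous: Optional["Node"]  # back-pointer along that cheapest path

def connect(from_id, to_id):
    return CONNECTION_COSTS.get((from_id, to_id), 0)

def tokenize(text):
    # ends_at[i] holds every lattice node whose term ends at index i
    ends_at = [[] for _ in range(len(text) + 1)]
    ends_at[0].append(Node("", BOUNDARY_ID, 0, None))
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            surface = text[start:end]
            if surface not in TERMS or not ends_at[start]:
                continue
            context_id, term_cost = TERMS[surface]
            # Pick the cheapest predecessor once and remember it (the reuse step)
            best = min(ends_at[start],
                       key=lambda n: n.total_cost + connect(n.context_id, context_id))
            total = best.total_cost + connect(best.context_id, context_id) + term_cost
            ends_at[end].append(Node(surface, context_id, total, best))
    if not ends_at[-1]:
        raise ValueError("no path covers the whole input")
    last = min(ends_at[-1],
               key=lambda n: n.total_cost + connect(n.context_id, BOUNDARY_ID))
    total_cost = last.total_cost + connect(last.context_id, BOUNDARY_ID)
    tokens = []
    node = last
    while node.previous is not None:  # follow back-pointers to the start
        tokens.append(node.surface)
        node = node.previous
    return tokens[::-1], total_cost

tokens, cost = tokenize("東京都に住む")
print(tokens, cost)
```

Because each node stores only its best predecessor, the work grows with the number of lattice edges rather than with the number of possible segmentations.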
Summary
Introduction to Japanese Tokenizers
● Introduction to NLP and Tokenization
● Lattice-based tokenizers (MeCab and others)
○ Dictionary
■ Term table, Connection Cost, ...
○ Tokenization Algorithms
■ Pattern Matching, Viterbi Algorithm, ...
Learn more:
● Kotori (on GitHub), a Japanese tokenizer written in Kotlin
○ Small and performant (fastest among JVM-based tokenizers)
○ Supports multiple dictionary formats
● Article: How Japanese Tokenizers Work (by Wanasit)
● Article: 日本語形態素解析の裏側を覗く! ("A Look Behind the Scenes of Japanese Morphological Analysis!", by Cookpad Developer)
● Book: 自然言語処理の基礎 ("Fundamentals of Natural Language Processing", by Manabu Okumura)
