Presented by Christian Moen, Founder and CEO of Atilika Inc. See the conference video: http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012
This talk gives an introduction to searching Japanese text and an overview of the new Japanese search features available out-of-the-box in Lucene and Solr.
Atilika developed a new Japanese morphological analyzer (Kuromoji) in 2010 when they couldn't find any easy-to-use, high-quality morphological analyzer in Java that was good for both search and other Japanese NLP tasks. Kuromoji was built with the goal of donating it to the Apache Software Foundation in order to make Japanese work well for both Lucene and Solr, and is now a standard part of these software packages.
2. About me
• MSc. in computer science, University of Oslo, Norway
• Worked with search at FAST (now Microsoft) for 10 years
• 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
• 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
• Founded アティリカ株式会社 (Atilika Inc.) in 2009
• We help companies innovate using search technologies and good ideas
• We know information retrieval, natural language processing and big data
• We are based in Tokyo, but we have clients everywhere
• Newbie Lucene & Solr Committer
• Mostly been working on Japanese language support (Kuromoji) so far
• Please write to me at christian@atilika.com or cm@apache.org
47. Problems with n-gramming
JR R新 新宿 宿駅 駅の の近 近く ...
● × ● × × × ●
宿駅: change of semantics! It means ‘post town’, ‘relay station’ or ‘stage’
• Does not preserve meaning well and often changes semantics
• Impacts ranking and search precision (many false positives)
• Also generates many terms per document or query
• Impacts index size and performance
• Still sometimes appropriate for certain search applications
• Compliance, e-commerce with special product names, ...
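The false-positive problem can be reproduced with a few lines of code. The following is a minimal sketch of naive character bigramming, not the tokenizer Lucene ships; the function name `char_bigrams` is made up for illustration.

```python
def char_bigrams(text):
    """Return all overlapping two-character windows of text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


sentence = "JR新宿駅の近く"
bigrams = char_bigrams(sentence)
print(bigrams)

# A query for 宿駅 ("post town") is itself a single bigram, so it matches
# a text that is really about Shinjuku station: a false positive.
query_terms = char_bigrams("宿駅")
is_match = all(term in bigrams for term in query_terms)
print(is_match)
```

An eight-character input already produces seven index terms, which is where the index size and performance impact comes from.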
53. Morphological analysis
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?
JR 新宿 駅 の 近く に ビール を 飲み に 行こ う か ?
● ● ● ● ● ● ● ● ● ● ● ● ● ●
• Tokens reflect what a Japanese speaker considers to be words
• Machine-learned statistical approach
• Conditional Random Fields (CRFs) decoded using Viterbi
• Also does part-of-speech tagging, extracts readings for kanji, etc.
• Several statistical models available with high accuracy (F > 0.97)
• Models/dictionaries are available as IPADIC, UniDic, ...
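To give a feel for the decoding step, here is a heavily simplified sketch of Viterbi-style dynamic programming over a word lattice. It is not Kuromoji's implementation: the dictionary and all costs are invented for illustration, and only per-word costs are used, whereas real models such as IPADIC also score the connection between adjacent part-of-speech classes.

```python
import math

# Hypothetical toy dictionary: surface form -> word cost (lower is better).
WORD_COSTS = {
    "JR": 1.0, "新宿": 1.0, "新": 3.0, "宿": 3.0, "宿駅": 4.0,
    "駅": 1.0, "の": 0.5, "近く": 1.0, "近": 3.0, "く": 3.0,
}
UNKNOWN_COST = 10.0  # fallback cost for a single character not in the dictionary


def viterbi_segment(text):
    """Return the minimum-cost segmentation of text under WORD_COSTS."""
    n = len(text)
    best_cost = [math.inf] * (n + 1)  # best_cost[i]: cheapest path ending at offset i
    back = [0] * (n + 1)              # back[i]: start offset of the last word on that path
    best_cost[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word in WORD_COSTS:
                cost = WORD_COSTS[word]
            elif end - start == 1:
                cost = UNKNOWN_COST
            else:
                continue  # multi-character strings must be dictionary words
            if best_cost[start] + cost < best_cost[end]:
                best_cost[end] = best_cost[start] + cost
                back[end] = start
    # Follow the back pointers from the end of the text to recover the words.
    tokens, end = [], n
    while end > 0:
        tokens.append(text[back[end]:end])
        end = back[end]
    return tokens[::-1]


segmented = viterbi_segment("JR新宿駅の近く")
print(segmented)
```

With these toy costs the path through 新宿 and 駅 is cheaper than any path through 宿駅, so the misleading term from the n-gram example never reaches the index.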
62. Japanese in Lucene/Solr
! New feature in Lucene/Solr 3.6
! Available out-of-the-box
! Easy to use with reasonable defaults
! Provides sophisticated Japanese linguistics
! Customisable
73. Feature summary / text_ja analyzer chain
JapaneseTokenizer: Segments Japanese text into tokens with very high accuracy
• Token attributes for part-of-speech, base form, readings, etc.
• Compound segmentation with compound synonyms
• Segmentation is customisable using user dictionaries
JapaneseBaseFormFilter: Adjective and verb lemmatisation (by reduction)
JapanesePartOfSpeechStopFilter: Stop-words removal based on part-of-speech tags. See example/solr/conf/lang/stoptags_ja.txt
CJKWidthFilter: Character width normalisation (fast Unicode NFKC subset)
StopFilter: Stop-words removal. See example/solr/conf/lang/stopwords_ja.txt
JapaneseKatakanaStemFilter: Normalises common katakana spelling variations
LowerCaseFilter: Lowercases
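How such a chain composes can be sketched with plain generators. This is an illustration only, not the Lucene API: the `Token` class, the filter functions and the simplified English part-of-speech tags are all assumptions, and the tokenizer output is written by hand. Three of the filters from the chain are modelled, in the same relative order.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional, Set


@dataclass
class Token:
    surface: str                     # term text that would be indexed
    pos: str                         # simplified part-of-speech tag
    base_form: Optional[str] = None  # dictionary form for inflected tokens


def base_form_filter(tokens: Iterable[Token]) -> Iterator[Token]:
    """Replace an inflected surface form with its dictionary form."""
    for t in tokens:
        yield Token(t.base_form or t.surface, t.pos, t.base_form)


def pos_stop_filter(tokens: Iterable[Token], stop_tags: Set[str]) -> Iterator[Token]:
    """Drop tokens whose part-of-speech tag is listed in stop_tags."""
    for t in tokens:
        if t.pos not in stop_tags:
            yield t


def lower_case_filter(tokens: Iterable[Token]) -> Iterator[Token]:
    """Lowercase the term text."""
    for t in tokens:
        yield Token(t.surface.lower(), t.pos, t.base_form)


# Hand-written stand-in for tokenizer output; the tags are placeholders.
tokenized = [
    Token("JR", "noun"), Token("新宿", "noun"), Token("駅", "noun"),
    Token("の", "particle"), Token("近く", "noun"), Token("に", "particle"),
    Token("ビール", "noun"), Token("を", "particle"),
    Token("飲み", "verb", "飲む"), Token("に", "particle"),
    Token("行こ", "verb", "行く"), Token("う", "auxiliary"),
    Token("か", "particle"), Token("?", "symbol"),
]
STOP_TAGS = {"particle", "auxiliary", "symbol"}

chain = lower_case_filter(pos_stop_filter(base_form_filter(tokenized), STOP_TAGS))
terms = [t.surface for t in chain]
print(terms)
```

Each stage consumes the previous one lazily, so the order of the stages matters: the part-of-speech stop filter must see the tags before any later stage could discard or rewrite them.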
78. Compound nouns
? How do we deal with compound nouns?
Japanese English
関西国際空港 Kansai International Airport
シニアソフトウェアエンジニア Senior Software Engineer
! These are one word in Japanese, so searching for 空港 (airport) doesn’t match
! We need to segment the compounds, too
82. Compound segmentation
関西国際空港 (Kansai International Airport): 関西 (Kansai) 国際 (International) 空港 (Airport)
シニアソフトウェアエンジニア (Senior Software Engineer): シニア (Senior) ソフトウェア (Software) エンジニア (Engineer)
! We are using a heuristic to implement this
86. Compound synonym tokens
Position 1 Position 2 Position 3
関西 国際 空港
関西国際空港
• Segment the compound into its parts
• Good for recall - we can also search and match 空港 (airport)
• We keep the compound itself as a synonym
• Good for precision with an exact hit because of IDF
• Approach benefits both precision and recall for overall good ranking
• JapaneseTokenizer actually returns a graph of tokens
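The token graph can be pictured as a stream in which each token carries a position increment and a position length. The sketch below is a simplified model under that assumption; the tuple layout, the token order and the helper `index_positions` are illustrative and not the tokenizer's actual output format.

```python
from collections import defaultdict

# (term, position increment, position length); values are illustrative.
token_stream = [
    ("関西国際空港", 1, 3),  # the compound, spanning three positions
    ("関西", 0, 1),          # increment 0: starts at the same position as the compound
    ("国際", 1, 1),
    ("空港", 1, 1),
]


def index_positions(stream):
    """Map each term to the (start, end) position spans it occupies."""
    postings = defaultdict(list)
    position = 0
    for term, increment, length in stream:
        position += increment
        postings[term].append((position, position + length))
    return dict(postings)


postings = index_positions(token_stream)
print(postings)
```

A query for 空港 finds its own posting at position 3, which helps recall, while a query for the full compound hits the single rarer term spanning positions 1 to 3, which is what lets IDF reward the exact match.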
88. Character width normalisation
? How do we deal with character widths?
Half-width・半角 Full-width・全角
Lucene Ｌｕｃｅｎｅ
ｶﾀｶﾅ カタカナ
123 １２３
! Use CJKWidthFilter to normalise them (Unicode NFKC subset)
Input text: Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３
CJKWidthFilter: Lucene (half-width) カタカナ (full-width) 123 (half-width)
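The effect is easy to try with the standard library. Note the assumption: Python's `unicodedata.normalize` applies full NFKC, which is a superset of what the filter does, so this sketch only demonstrates the width folding and says nothing about the filter's speed or exact coverage.

```python
import unicodedata

# Full-width Latin letters, half-width katakana, full-width digits.
text = "Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３"
normalised = unicodedata.normalize("NFKC", text)
print(normalised)
```

Latin letters and digits fold to their half-width (ASCII) forms, while katakana folds the other way, to full-width.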
90. Katakana end-vowel stemming
? A common spelling variation in katakana is an end long-vowel sound
English Japanese spelling variations
manager マネージャー マネージャ マネジャー
! We use JapaneseKatakanaStemFilter to normalise/stem the end vowel for long terms
Input text: コピー マネージャー マネージャ マネジャー
JapaneseKatakanaStemFilter: コピー マネージャ マネージャ マネジャ
Meaning: copy manager manager “manager”
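The rule can be sketched as follows. This is an approximation written for illustration, not the filter's source: the function names are made up, and the minimum length of four characters is an assumption chosen so that the short term コピー is left alone, as in the example above.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー
MINIMUM_LENGTH = 4               # assumed threshold for a "long" term


def is_katakana(term):
    """True if every character lies in the Unicode Katakana block."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term):
    """Strip one trailing long-vowel mark from a long katakana term."""
    if (len(term) >= MINIMUM_LENGTH
            and is_katakana(term)
            and term.endswith(PROLONGED_SOUND_MARK)):
        return term[:-1]
    return term


inputs = ["コピー", "マネージャー", "マネージャ", "マネジャー"]
stemmed = [stem_katakana(term) for term in inputs]
print(stemmed)
```

Only the final long-vowel mark is removed, so マネージャ and マネジャ remain distinct terms after stemming.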
94. Lemmatisation
? Japanese adjectives and verbs are highly inflected, how do we deal with that?
Dictionary form: 買う (kau, to buy)
Inflected forms (not exhaustive):
買いなさい 買いませんでしたら 買える 買わせられる
買いなさるな 買いませんでしたり 買おう 買わせる
買いましたら 買いませんなら 買った 買わない
買いましたり 買うだろう 買ったら 買わないだろう
買いまして 買うでしょう 買ったり 買わないで
買いましょう 買うな 買って 買わないでしょう
買います 買うまい 買わせない 買わなかった
買いますまい 買え 買わせます 買わなかったら
買いませば 買えない 買わせません 買わなかったり
買いません 買えば 買わせられない 買わなければ
買いませんで 買えます 買わせられます 買われない
買いませんでした 買えません 買わせられません 買われます
! Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction)
95. User dictionaries
• Own dictionaries can be used for ad hoc segmentation, i.e. to override the default model
• File format is simple and there’s no need to assign weights, etc. before using them
• Example custom dictionary:
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
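The four comma-separated fields in the example are the surface form, its space-separated segmentation, the space-separated readings and a part-of-speech name. Below is a small sketch of a reader for that layout; `UserEntry` and `parse_user_dictionary` are hypothetical names, and the check that segments and readings line up is an assumption about what a sensible loader would enforce.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UserEntry:
    surface: str         # text to match in the input
    segments: List[str]  # how the surface form should be split
    readings: List[str]  # one katakana reading per segment
    pos: str             # part-of-speech name assigned to the segments


def parse_user_dictionary(text):
    """Parse dictionary lines, skipping blank lines and # comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        surface, segmentation, readings, pos = line.split(",")
        entry = UserEntry(surface, segmentation.split(), readings.split(), pos)
        if len(entry.segments) != len(entry.readings):
            raise ValueError(f"segments and readings differ in length: {line}")
        entries.append(entry)
    return entries


SAMPLE = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""

entries = parse_user_dictionary(SAMPLE)
for entry in entries:
    print(entry)
```

The second entry keeps the surface form as a single segment, which is how a custom reading can be attached to a name without changing its segmentation.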
96. Japanese focus in 4.0
• Improvements in JapaneseTokenizer
• Improved search mode for katakana compounds
• Improved unknown word segmentation
• Some performance improvements
• CharFilters for various character normalisations
• Dates and numbers
• Repetition marks (odoriji)
• Japanese spell-checker
• Robert and Koji almost got this into 3.6, but it was postponed because API changes were necessary
97. Acknowledgements
Robert Muir
Thanks for the heavy lifting integrating Kuromoji into Lucene, for always reviewing my patches quickly, and for the friendly help
Michael McCandless
Thanks for streaming Viterbi and synonym compounds!
Uwe Schindler
Thanks for performance improvements + being the policeman
Simon Willnauer
Thanks for doing the Kuromoji code donation process so well
Gaute Lambertsen & Gerry Hocks
Thanks for presentation feedback and being great colleagues