Japanese linguistics
in Apache Lucene™ and Apache Solr™

             May 9th, 2012

             Christian Moen
          christian@atilika.com
About me
•   MSc. in computer science, University of Oslo, Norway
•   Worked with search at FAST (now Microsoft) for 10 years
     •   5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway
     •   5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan
•   Founded アティリカ株式会社 in 2009
     •   We help companies innovate using search technologies and good ideas
     •   We know information retrieval, natural language processing and big data
     •   We are based in Tokyo, but we have clients everywhere
•   Newbie Lucene & Solr Committer
     •   Mostly been working on Japanese language support (Kuromoji) so far
•   Please write to me at christian@atilika.com or cm@apache.org
Today’s topics

•   Japanese 101 - ordering beer and toasting


•   Japanese language processing


•   Japanese features in Lucene/Solr
Japanese 101
ビールください
 bi-ru kudasai

A beer, please
ありがとうございます!
 arigatō gozaimasu!

Thank you very much!
乾杯!
kanpai!

Cheers!
JR新宿駅の近くにビールを飲みに行こうか?
JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka?

  Shall we go for a beer near JR Shinjuku station?
JR新宿駅の近くにビールを飲みに行こうか?

Romaji - ローマ字
・Latin characters (26+)
・Used for proper nouns, etc.

Katakana - カタカナ
・Phonetic script (~50)
・Typically used for loan words

Kanji - 漢字
・Chinese characters (50,000+)
・Used for stems & proper nouns

Hiragana - ひらがな
・Phonetic script (~50)
・Used for inflections & particles
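
The four scripts in the example sentence can be told apart programmatically. The following is a minimal sketch that uses only Unicode character names from Python's standard library. The function names `script_of` and `script_runs` are made up for this illustration, and the name-prefix test is a simplification (it ignores half-width and full-width variants, for example).

```python
import itertools
import unicodedata


def script_of(ch):
    """Classify one character by the prefix of its Unicode name (simplified)."""
    name = unicodedata.name(ch, "")
    if name.startswith("HIRAGANA"):
        return "hiragana"
    if name.startswith("KATAKANA"):
        return "katakana"
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "kanji"
    if name.startswith("LATIN"):
        return "romaji"
    return "other"


def script_runs(text):
    """Group consecutive characters that belong to the same script."""
    return [(script, "".join(chars))
            for script, chars in itertools.groupby(text, key=script_of)]


for script, run in script_runs("JR新宿駅の近くにビールを飲みに行こうか?"):
    print(f"{script:8} {run}")
```

Note that a change of script is not a word boundary: 近く comes out as a kanji run followed by a hiragana run, although it is a single word.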
JR新宿駅の近くにビールを飲みに行こうか?
? What are the words in this sentence?
! Words are implicit in Japanese - there
  is no white space that separates them
JR新宿駅の近くにビールを飲みに行こうか?
? How do we index this for search, then?
! We need to segment text into tokens first
! Two major approaches for segmentation

          1. n-gramming
          2. morphological analysis
            (statistical approach)
n-gramming (n=2)
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?

Sliding a window of n=2 characters over the text, one character at a time, gives:

JR R新 新宿 宿駅 駅の の近 近く ...
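
The sliding window is easy to sketch in code. This is a minimal character n-gram function written for illustration; the name `char_ngrams` is an assumption and this is not the Lucene implementation.

```python
def char_ngrams(text, n=2):
    """Return all overlapping character n-grams of text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "JR新宿駅の近くにビールを飲みに行こうか"
bigrams = char_ngrams(sentence)
print(bigrams[:7])
print(len(sentence), "characters ->", len(bigrams), "bigrams")
```

Every character except the last starts a new term, so the number of terms grows with the number of characters rather than with the number of words.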
Problems with n-gramming
  JR R新 新宿 宿駅 駅の の近 近く ...
  ●  ×   ●    ×    ×    ×    ●

  宿駅: change of semantics! It means ‘post town’, ‘relay station’ or ‘stage’

•   Does not preserve meaning well and often changes semantics
     •   Impacts on ranking - search precision (many false positives)
•   Also generates many terms per document or query
     •   Impacts on index size and performance
•   Still sometimes appropriate for certain search applications
     •   Compliance, e-commerce with special product names, ...
Morphological analysis
JR新宿駅の近くにビールを飲みに行こうか?
Shall we go for a beer near JR Shinjuku station?

JR 新宿 駅 の 近く に ビール を 飲み に 行こ う か ?
●  ●    ●  ●  ●    ●  ●      ●  ●    ●  ●    ●  ●  ●
  •   Tokens reflect what a Japanese speaker considers to be words
  •   Machine-learned statistical approach
       •   Conditional Random Fields (CRFs) decoded using Viterbi
       •   Also does part-of-speech tagging, extracts readings for kanji, etc.
  •   Several statistical models available with high accuracy (F > 0.97)
       •   Models/dictionaries are available as IPADIC, UniDic, ...
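
To give a feel for the Viterbi decoding step, here is a heavily simplified sketch: a minimum-cost search over a lattice of dictionary words. Everything in it is an assumption made for illustration. The lexicon and its costs are invented, and only per-word costs are used, whereas a CRF-trained model such as IPADIC also assigns connection costs between adjacent tokens.

```python
import math

# Toy lexicon: surface form -> word cost (lower means more likely).
# Entries and costs are invented for this example.
LEXICON = {
    "JR": 2.0, "新宿": 2.0, "新": 4.0, "宿": 4.0, "宿駅": 5.0, "駅": 2.0,
    "の": 1.0, "近く": 2.0, "近": 4.0, "く": 4.0, "に": 1.0,
}
UNKNOWN_COST = 10.0  # penalty for a single character missing from the lexicon
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def segment(text):
    """Return the minimum-cost segmentation of text (Viterbi over a word lattice)."""
    n = len(text)
    best_cost = [math.inf] * (n + 1)  # best_cost[i]: cheapest way to cover text[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word on that path
    best_cost[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in LEXICON:
                cost = LEXICON[word]
            elif len(word) == 1:
                cost = UNKNOWN_COST   # fall back to single unknown characters
            else:
                continue
            if best_cost[start] + cost < best_cost[end]:
                best_cost[end] = best_cost[start] + cost
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the best path.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


print(segment("JR新宿駅の近くに"))
```

With these toy costs the path through 新宿 + 駅 is cheaper than the one through 新 + 宿駅, which is the kind of decision the real model makes with learned costs.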
How does this actually work?
Demo
Japanese support in
  Lucene and Solr
Japanese in Lucene/Solr
! New feature in Lucene/Solr 3.6

! Available out-of-the-box

! Easy to use with reasonable defaults

! Provides sophisticated Japanese linguistics

! Customisable
How do we use it?

      ! Use JapaneseAnalyzer



      ! Use field type “text_ja”
        in example schema.xml
Demo
Feature summary / text_ja analyzer chain

JapaneseTokenizer
    Segments Japanese text into tokens with very high accuracy
    •   Token attributes for part-of-speech, base form, readings, etc.
    •   Compound segmentation with compound synonyms
    •   Segmentation is customisable using user dictionaries

JapaneseBaseFormFilter
    Adjective and verb lemmatisation (by reduction)

JapanesePartOfSpeechStopFilter
    Stop-words removal based on part-of-speech tags
    See example/solr/conf/lang/stoptags_ja.txt

CJKWidthFilter
    Character width normalisation (fast Unicode NFKC subset)

StopFilter
    Stop-words removal
    See example/solr/conf/lang/stopwords_ja.txt

JapaneseKatakanaStemFilter
    Normalises common katakana spelling variations

LowerCaseFilter
    Lowercases
Feature details
Compound nouns
? How do we deal with compound nouns?
       Japanese                  English
    関西国際空港              Kansai International Airport
シニアソフトウェアエンジニア           Senior Software Engineer


! These are one word in Japanese, so
  searching for 空港 (airport) doesn’t match

! We need to segment the compounds, too
Compound segmentation

  関西国際空港 (Kansai International Airport)
      → 関西 (Kansai) + 国際 (International) + 空港 (Airport)
  シニアソフトウェアエンジニア (Senior Software Engineer)
      → シニア (Senior) + ソフトウェア (Software) + エンジニア (Engineer)

 ! We are using a heuristic to implement this
Compound synonym tokens
    Position 1        Position 2        Position 3
    関西              国際              空港
    関西国際空港

•   Segment the compounds into their parts
    •   Good for recall - we can also search and match 空港 (airport)
•   We keep the compound itself as a synonym
    •   Good for precision with an exact hit because of IDF
•   Approach benefits both precision and recall for overall good ranking
    •   JapaneseTokenizer actually returns a graph of tokens
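
The idea of indexing the parts and the compound side by side can be sketched as follows. The `Token` class and the helper functions are hypothetical stand-ins written for this example, not Lucene's attribute API; the `length` field plays the role of the number of positions a token spans in the token graph.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    position: int    # first position covered (1-based, as in the table above)
    length: int = 1  # number of positions the token spans


def compound_tokens(compound, parts):
    """Emit the compound as a synonym spanning all positions, plus each part."""
    tokens = [Token(compound, position=1, length=len(parts))]
    tokens += [Token(part, position=i) for i, part in enumerate(parts, start=1)]
    return tokens


def matches(query, tokens):
    """A term query matches if any indexed token has exactly that term."""
    return any(token.term == query for token in tokens)


tokens = compound_tokens("関西国際空港", ["関西", "国際", "空港"])
for token in tokens:
    print(token)
print(matches("空港", tokens), matches("関西国際空港", tokens))
```

A query for 空港 now matches through the part, and a query for the full compound still gets an exact hit on the rarer compound term.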
Character width normalisation
? How do we deal with character widths?
    Half-width・半角      Full-width・全角
    Lucene               Ｌｕｃｅｎｅ
    ｶﾀｶﾅ                 カタカナ
    123                  １２３

! Use CJKWidthFilter to normalise them
  (Unicode NFKC subset)

    Input text        Ｌｕｃｅｎｅ     ｶﾀｶﾅ          １２３
    CJKWidthFilter    Lucene         カタカナ       123
                      half-width     full-width     half-width
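
The same folding can be tried out with Python's standard library. Note the assumption: this sketch applies full Unicode NFKC normalisation, which covers more than width folding, whereas CJKWidthFilter is described above as a fast subset of it. The function name `normalize_width` is made up for the example.

```python
import unicodedata


def normalize_width(text):
    """Fold full-width Latin letters and digits and half-width katakana via NFKC."""
    return unicodedata.normalize("NFKC", text)


print(normalize_width("Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３"))
```

Full-width Latin letters and digits come out half-width, and half-width katakana comes out full-width, matching the table above.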
Katakana end-vowel stemming
  ? A common spelling variation in
    katakana is a end long-vowel sound
       English     Japanese spelling variations
       manager     マネージャー            マネージャ         マネジャー



   ! We JapaneseKatakanaStemFilter to
     normalise/stem end-vowel for long terms

                 Input text コピー     マネージャー        マネージャ      マネジャー
JapaneseKatakanaStemFilter コピー       マネージャ        マネージャ      マネジャ
                            copy       manager     manager   “manager”
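
The behaviour in the table can be approximated with a few lines of code. This is a sketch, not the filter's source: the minimum length of 4 is an assumption chosen so that コピー is left untouched, as in the table, and the names are made up for the example.

```python
PROLONGED_SOUND_MARK = "ー"  # U+30FC
MIN_LENGTH = 4  # assumed threshold; shorter terms such as コピー are left alone


def is_katakana(term):
    """True if every character is in the Unicode Katakana block."""
    return all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term, min_length=MIN_LENGTH):
    """Drop one trailing long-vowel mark from sufficiently long katakana terms."""
    if (len(term) >= min_length and is_katakana(term)
            and term.endswith(PROLONGED_SOUND_MARK)):
        return term[:-1]
    return term


for term in ["コピー", "マネージャー", "マネージャ", "マネジャー"]:
    print(term, "->", stem_katakana(term))
```

Only the trailing mark is touched, so マネジャー becomes マネジャ and still differs from マネージャ.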
Lemmatisation
? Japanese adjectives and verbs are highly
  inflected, how do we deal with that?

    Dictionary form: 買う (kau, to buy)

    Inflected forms (not exhaustive)
    買いなさい    買いませんでしたら    買える    買わせられる
    買いなさるな    買いませんでしたり    買おう    買わせる
    買いましたら    買いませんなら    買った    買わない
    買いましたり    買うだろう    買ったら    買わないだろう
    買いまして    買うでしょう    買ったり    買わないで
    買いましょう    買うな    買って    買わないでしょう
    買います    買うまい    買わせない    買わなかった
    買いますまい    買え    買わせます    買わなかったら
    買いませば    買えない    買わせません    買わなかったり
    買いません    買えば    買わせられない    買わなければ
    買いませんで    買えます    買わせられます    買われない
    買いませんでした    買えません    買わせられません    買われます

 ! Use JapaneseBaseFormFilter to normalise
   inflected adjectives and verbs to dictionary form
   (lemmatisation by reduction)
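
Lemmatisation by reduction can be pictured as replacing each token's surface form with the base form that the tokenizer already looked up in its dictionary. The sketch below assumes this; the `Token` type, the function name and the hand-written analyses of 買った and 買わない are illustrative assumptions, not actual tokenizer output.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    surface: str                     # the text as it appears in the document
    base_form: Optional[str] = None  # dictionary form, if the dictionary has one


def base_form_filter(tokens):
    """Replace each surface form with its base form when one is known."""
    return [token.base_form or token.surface for token in tokens]


# Hand-written analyses, assumed for illustration.
bought = [Token("買っ", "買う"), Token("た")]          # 買った
not_buy = [Token("買わ", "買う"), Token("ない")]       # 買わない

print(base_form_filter(bought))
print(base_form_filter(not_buy))
```

Both inflected forms end up with the same index term 買う, so a query for either one can match the other.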
User dictionaries
•   Your own dictionaries can be used for ad hoc
    segmentation, i.e. to override the default model
•   File format is simple and there’s no need to
    assign weights, etc. before using them
•   Example custom dictionary:
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
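
Reading the example above, each entry has four comma-separated fields: the surface form, its space-separated segmentation, one reading per segment, and a part-of-speech name. The following sketch parses that layout with Python's csv module. The field interpretation is inferred from the example, and the function and key names are made up for illustration.

```python
import csv

USER_DICT = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞

# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""


def parse_user_dictionary(text):
    """Parse entries of the form: surface,segmentation,readings,part-of-speech."""
    lines = [line for line in text.splitlines()
             if line.strip() and not line.startswith("#")]
    entries = []
    for surface, segmentation, readings, pos in csv.reader(lines):
        segments = segmentation.split()
        kana = readings.split()
        if len(segments) != len(kana):
            raise ValueError(f"segment/reading count mismatch for {surface}")
        entries.append({"surface": surface, "segments": segments,
                        "readings": kana, "pos": pos})
    return entries


for entry in parse_user_dictionary(USER_DICT):
    print(entry)
```

The check that segments and readings line up catches the most likely editing mistake in a hand-maintained file.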
Japanese focus in 4.0
•   Improvements in JapaneseTokenizer
     •   Improved search mode for katakana compounds
     •   Improved unknown word segmentation
     •   Some performance improvements
•   CharFilters for various character normalisations
     •   Dates and numbers
     •   Repetition marks (odoriji)
•   Japanese spell-checker
     •   Robert and Koji almost got this into 3.6, but it got
         postponed because of API changes being necessary
Acknowledgements
Robert Muir
Thanks for the heavy lifting integrating Kuromoji into Lucene,
for always reviewing my patches quickly and for the friendly help
Michael McCandless
Thanks for streaming Viterbi and synonym compounds!
Uwe Schindler
Thanks for performance improvements + being the policeman
Simon Willnauer
Thanks for doing the Kuromoji code donation process so well
Gaute Lambertsen & Gerry Hocks
Thanks for presentation feedback and being great colleagues
Q&A
ありがとうございました!
 arigatō gozaimashita!

Thank you very much!

 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloudlucenerevolution
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusterslucenerevolution
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiledlucenerevolution
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs lucenerevolution
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchlucenerevolution
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Stormlucenerevolution
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?lucenerevolution
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APIlucenerevolution
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucenelucenerevolution
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMlucenerevolution
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucenelucenerevolution
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenallucenerevolution
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside downlucenerevolution
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...lucenerevolution
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - finallucenerevolution
 

More from lucenerevolution (20)

Text Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and LuceneText Classification Powered by Apache Mahout and Lucene
Text Classification Powered by Apache Mahout and Lucene
 
State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here! State of the Art Logging. Kibana4Solr is Here!
State of the Art Logging. Kibana4Solr is Here!
 
Search at Twitter
Search at TwitterSearch at Twitter
Search at Twitter
 
Building Client-side Search Applications with Solr
Building Client-side Search Applications with SolrBuilding Client-side Search Applications with Solr
Building Client-side Search Applications with Solr
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Scaling Solr with SolrCloud
Scaling Solr with SolrCloudScaling Solr with SolrCloud
Scaling Solr with SolrCloud
 
Administering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud ClustersAdministering and Monitoring SolrCloud Clusters
Administering and Monitoring SolrCloud Clusters
 
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and ParboiledImplementing a Custom Search Syntax using Solr, Lucene, and Parboiled
Implementing a Custom Search Syntax using Solr, Lucene, and Parboiled
 
Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs Using Solr to Search and Analyze Logs
Using Solr to Search and Analyze Logs
 
Enhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic searchEnhancing relevancy through personalization & semantic search
Enhancing relevancy through personalization & semantic search
 
Real-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and StormReal-time Inverted Search in the Cloud Using Lucene and Storm
Real-time Inverted Search in the Cloud Using Lucene and Storm
 
Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?Solr's Admin UI - Where does the data come from?
Solr's Admin UI - Where does the data come from?
 
Schemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST APISchemaless Solr and the Solr Schema REST API
Schemaless Solr and the Solr Schema REST API
 
High Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with LuceneHigh Performance JSON Search and Relational Faceted Browsing with Lucene
High Performance JSON Search and Relational Faceted Browsing with Lucene
 
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVMText Classification with Lucene/Solr, Apache Hadoop and LibSVM
Text Classification with Lucene/Solr, Apache Hadoop and LibSVM
 
Faceted Search with Lucene
Faceted Search with LuceneFaceted Search with Lucene
Faceted Search with Lucene
 
Recent Additions to Lucene Arsenal
Recent Additions to Lucene ArsenalRecent Additions to Lucene Arsenal
Recent Additions to Lucene Arsenal
 
Turning search upside down
Turning search upside downTurning search upside down
Turning search upside down
 
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
Spellchecking in Trovit: Implementing a Contextual Multi-language Spellchecke...
 
Shrinking the haystack wes caldwell - final
Shrinking the haystack   wes caldwell - finalShrinking the haystack   wes caldwell - final
Shrinking the haystack wes caldwell - final
 

Recently uploaded

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Recently uploaded (20)

Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

Japanese Linguistics in Lucene and Solr

  • 1. Japanese linguistics in Apache Lucene™ and Apache Solr™ May 9th, 2012 Christian Moen christian@atilika.com
  • 2. About me • MSc. in computer science, University of Oslo, Norway • Worked with search at FAST (now Microsoft) for 10 years • 5 years in R&D building FAST Enterprise Search Platform in Oslo, Norway • 5 years in Services doing solution delivery, sales, etc. in Tokyo, Japan • Founded アティリカ株式会社 in 2009 • We help companies innovate using search technologies and good ideas • We know information retrieval, natural language processing and big data • We are based in Tokyo, but we have clients everywhere • Newbie Lucene & Solr Committer • Mostly been working on Japanese language support (Kuromoji) so far • Please write to me at christian@atilika.com or cm@apache.org
  • 4. Today’s topics • Japanese 101 - ordering beer and toasting • Japanese language processing • Japanese features in Lucene/Solr
  • 15. JR新宿駅の近くにビールを飲みに行こうか? JR Shinjuku eki no chikaku ni bi-ru o nomi ni ikō ka? Shall we go for a beer near JR Shinjuku station?
  • 17. Romaji - ローマ字 ・Latin characters (26+) ・Used for proper nouns, etc. JR新宿駅の近くにビールを飲みに行こうか?
  • 18. Katakana - カタカナ ・Phonetic script (~50) ・Typically used for loan words JR新宿駅の近くにビールを飲みに行こうか?
  • 20. JR新宿駅の近くにビールを飲みに行こうか? Hiragana - ひらがな ・Phonetic script (~50) ・Used for inflections & particles
  • 21. JR新宿駅の近くにビールを飲みに行こうか? Romaji - ローマ字 ・Latin characters (26+) ・Used for proper nouns, etc. Katakana - カタカナ ・Phonetic script (~50) ・Typically used for loan words Kanji - 漢字 ・Chinese characters (50,000+) ・Used for stems & proper nouns Hiragana - ひらがな ・Phonetic script (~50) ・Used for inflections & particles
  • 24. JR新宿駅の近くにビールを飲みに行こうか? ? What are the words in this sentence? ! Words are implicit in Japanese - there is no white space that separates them
  • 26. JR新宿駅の近くにビールを飲みに行こうか? ? How do we index this for search, then? ! We need to segment text into tokens first
  • 27. ! Two major approaches for segmentation 1. n-gramming 2. morphological analysis (statistical approach)
  • 35. n-gramming (n=2) JR新宿駅の近くにビールを飲みに行こうか? Shall we go for a beer near JR Shinjuku station? Bigrams (n=2): JR R新 新宿 宿駅 駅の の近 近く ...
  • 47. Problems with n-gramming JR ● R新 × 新宿 ● 宿駅 × (change of semantics! 宿駅 means ‘post town’, ‘relay station’ or ‘stage’) 駅の × の近 × 近く ● ... • Does not preserve meaning well and often changes semantics • Impacts ranking and search precision (many false positives) • Also generates many terms per document or query • Impacts index size and performance • Still sometimes appropriate for certain search applications • Compliance, e-commerce with special product names, ...
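The bigram problem described above can be reproduced with a minimal Python sketch. It uses plain overlapping character bigrams; the helper name `char_ngrams` is illustrative and is not a Lucene API, and Lucene's own n-gram tokenizers may treat Latin runs and punctuation differently.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return the overlapping character n-grams of text (illustrative helper)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sentence = "JR新宿駅の近くにビールを飲みに行こうか?"
bigrams = char_ngrams(sentence)

# The first seven bigrams are the ones shown on the slide:
# JR R新 新宿 宿駅 駅の の近 近く
print(bigrams[:7])

# A query for 宿駅 ("post town") matches, although the sentence is about
# 新宿 + 駅 (Shinjuku station). This is the false positive described above.
print("宿駅" in bigrams)
```

Every character except the last starts a bigram, which is why the number of indexed terms grows with the text length regardless of how many real words it contains.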
  • 53. Morphological analysis JR新宿駅の近くにビールを飲みに行こうか? Shall we go for a beer near JR Shinjuku station? Tokens: JR 新宿 駅 の 近く に ビール を 飲み に 行こ う か ? (14 tokens, all ●) • Tokens reflect what a Japanese speaker considers to be words • Machine-learned statistical approach • Conditional Random Fields (CRFs) decoded using Viterbi • Also does part-of-speech tagging, extracts readings for kanji, etc. • Several statistical models available with high accuracy (F > 0.97) • Models/dictionaries are available as IPADIC, UniDic, ...
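To give a feel for the Viterbi decoding mentioned above, here is a heavily simplified Python sketch. It is not how Kuromoji or any CRF-based analyser is implemented: the dictionary, the costs and the unknown-character penalty are all made up, and the connection costs between adjacent part-of-speech tags that real models use are left out. It only shows the dynamic-programming idea of choosing the lowest-cost path through a lattice of dictionary words.

```python
import math

# Hypothetical toy dictionary: surface form -> cost (lower is more likely).
# Real dictionaries such as IPADIC also carry part-of-speech tags and
# connection costs between adjacent tokens; both are omitted here.
WORD_COSTS = {
    "JR": 2.0, "新宿": 2.0, "駅": 2.5, "宿駅": 6.0, "の": 1.0,
    "近く": 2.0, "近": 4.0, "く": 4.0, "に": 1.0, "新": 4.0, "宿": 4.5,
}
UNKNOWN_COST = 10.0  # assumed penalty for a single unknown character


def segment(text: str) -> list[str]:
    """Return the lowest-cost segmentation of text (Viterbi-style search)."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: cheapest cost to reach offset i
    back = [0] * (n + 1)         # back[i]: start offset of the token ending at i
    best[0] = 0.0
    max_len = max(len(word) for word in WORD_COSTS)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in WORD_COSTS:
                cost = WORD_COSTS[word]
            elif end - start == 1:
                cost = UNKNOWN_COST  # fall back to a one-character token
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Walk the back-pointers from the end of the text to recover the tokens.
    tokens = []
    pos = n
    while pos > 0:
        tokens.append(text[back[pos]:pos])
        pos = back[pos]
    return tokens[::-1]


print(segment("JR新宿駅の近くに"))
```

With these toy costs the path through 新宿 + 駅 is cheaper than the one through 新 + 宿駅, so the segmentation keeps "Shinjuku" and "station" as separate words instead of producing 宿駅.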
  • 54. How does this actually work?
  • 55. Demo
  • 56. Japanese support in Lucene and Solr
  • 62. Japanese in Lucene/Solr ! New feature in Lucene/Solr 3.6 ! Available out-of-the-box ! Easy to use with reasonable defaults ! Provides sophisticated Japanese linguistics ! Customisable
  • 65. How do we use it? ! Use JapaneseAnalyzer ! Use field type “text_ja” in example schema.xml
  • 66. Demo
  • 73. Feature summary / text_ja analyzer chain
      JapaneseTokenizer: Segments Japanese text into tokens with very high accuracy • Token attributes for part-of-speech, base form, readings, etc. • Compound segmentation with compound synonyms • Segmentation is customisable using user dictionaries
      JapaneseBaseFormFilter: Adjective and verb lemmatisation (by reduction)
      JapanesePartOfSpeechStopFilter: Stop-words removal based on part-of-speech tags. See example/solr/conf/lang/stoptags_ja.txt
      CJKWidthFilter: Character width normalisation (fast Unicode NFKC subset)
      StopFilter: Stop-words removal. See example/solr/conf/lang/stopwords_ja.txt
      JapaneseKatakanaStemFilter: Normalises common katakana spelling variations
      LowerCaseFilter: Lowercases
  • 78. Compound nouns ? How do we deal with compound nouns? Japanese / English: 関西国際空港 / Kansai International Airport; シニアソフトウェアエンジニア / Senior Software Engineer ! These are one word in Japanese, so searching for 空港 (airport) doesn’t match ! We need to segment the compounds, too
  • 82. Compound segmentation 関西国際空港 (Kansai International Airport) → 関西 (Kansai) 国際 (International) 空港 (Airport); シニアソフトウェアエンジニア (Senior Software Engineer) → シニア (Senior) ソフトウェア (Software) エンジニア (Engineer) ! We are using a heuristic to implement this
  • 86. Compound synonym tokens
       Position 1 | Position 2 | Position 3
       関西       | 国際       | 空港
       関西国際空港 (compound kept as a synonym)
       • Segment the compounds into their parts
            • Good for recall - we can also search and match 空港 (airport)
       • We keep the compound itself as a synonym
            • Good for precision with an exact hit because of IDF
       • Approach benefits both precision and recall for overall good ranking
       • JapaneseTokenizer actually returns a graph of tokens
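The idea of keeping the compound as a synonym next to its parts can be modelled with a small Python sketch. It is only an illustration of the token layout, not the JapaneseTokenizer implementation: the `Token` class, its `position` and `position_length` fields, and the helper functions are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Token:
    term: str
    position: int         # position where the token starts
    position_length: int  # number of positions the token spans


def compound_tokens(compound, parts, start=1):
    """Emit the whole compound plus one token per part."""
    # The compound starts at the first part's position and spans all parts
    tokens = [Token(compound, start, len(parts))]
    for offset, part in enumerate(parts):
        tokens.append(Token(part, start + offset, 1))
    return tokens


def matches(tokens, query):
    """A query term matches if any indexed token has exactly that term."""
    return any(t.term == query for t in tokens)


tokens = compound_tokens("関西国際空港", ["関西", "国際", "空港"])
```

With this layout a query for 空港 matches through the part token (recall), while a query for 関西国際空港 matches the rarer compound token directly (precision).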
  • 88. Character width normalisation
       • How do we deal with character widths?
            Half-width・半角 | Full-width・全角
            Lucene          | Ｌｕｃｅｎｅ
            ｶﾀｶﾅ            | カタカナ
            123             | １２３
       • Use CJKWidthFilter to normalise them (Unicode NFKC subset)
            • Input text: Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３
            • After CJKWidthFilter: Lucene (half-width) カタカナ (full-width) 123 (half-width)
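The effect of width normalisation can be reproduced with Python's standard library. Note the assumption: this sketch applies full Unicode NFKC, which is a superset of what the slide describes for CJKWidthFilter, so it also changes characters that the filter would leave alone. The function name `normalise_width` is made up for this example.

```python
import unicodedata


def normalise_width(text: str) -> str:
    """Fold full-width Latin/digits to ASCII and half-width katakana to
    full-width katakana, using full NFKC as a stand-in for the filter."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin, half-width katakana, full-width digits
result = normalise_width("Ｌｕｃｅｎｅ ｶﾀｶﾅ １２３")
```

NFKC also composes a half-width kana followed by a half-width voiced sound mark into a single full-width character, for example ﾊﾞ becomes バ.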
  • 90. Katakana end-vowel stemming
       • A common spelling variation in katakana is an end long-vowel sound
            • English: manager
            • Japanese spelling variations: マネージャー, マネージャ, マネジャー
       • We use JapaneseKatakanaStemFilter to normalise/stem the end-vowel for long terms
            • Input text: コピー, マネージャー, マネージャ, マネジャー
            • After JapaneseKatakanaStemFilter: コピー (copy), マネージャ (manager), マネージャ (manager), マネジャ (“manager”)
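A minimal Python sketch of end-vowel stemming follows. It is an illustration rather than the filter's actual code: the function name and the `min_length` threshold of 4 characters are assumptions, chosen so that the short term コピー is left untouched, as in the slide.

```python
PROLONGED_SOUND_MARK = "\u30fc"  # ー


def is_katakana(term: str) -> bool:
    """True if every character is in the Unicode Katakana block."""
    return bool(term) and all("\u30a0" <= ch <= "\u30ff" for ch in term)


def stem_katakana(term: str, min_length: int = 4) -> str:
    """Drop one trailing long-vowel mark from sufficiently long katakana terms.

    min_length=4 is an assumption made for this sketch.
    """
    if len(term) < min_length or not is_katakana(term):
        return term
    if term.endswith(PROLONGED_SOUND_MARK):
        return term[:-1]
    return term


stems = [stem_katakana(t) for t in ["コピー", "マネージャー", "マネージャ", "マネジャー"]]
```

Only the trailing mark is removed, so マネージャー and マネージャ conflate, while マネジャー stems to マネジャ and remains a separate term.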
  • 94. Lemmatisation
       • Japanese adjectives and verbs are highly inflected, how do we deal with that?
       • Dictionary form: 買う (kau, to buy)
       • Inflected forms (not exhaustive): 買いなさい 買いませんでしたら 買える 買わせられる 買いなさるな 買いましたら 買いませんでしたり 買いませんなら 買おう 買った 買わせる 買わない 買いましたり 買うだろう 買ったら 買わないだろう 買いまして 買いましょう 買うでしょう 買うな 買ったり 買って 買わないで 買わないでしょう 買わせない 買います 買うまい 買わなかった 買いますまい 買え 買わせます 買わなかったら 買いませば 買えない 買わせません 買わなかったり 買いません 買えば 買わせられない 買わなければ 買いませんで 買えます 買わせられます 買われない 買いませんでした 買えません 買わせられません 買われます
       • Use JapaneseBaseFormFilter to normalise inflected adjectives and verbs to dictionary form (lemmatisation by reduction)
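Lemmatisation by reduction can be sketched as replacing each token's surface form with its base form when one is available. The sketch below is an assumption-laden illustration in Python: the `Token` class and `base_form_filter` are hypothetical, and the segmentations and base forms are written by hand, whereas in the real chain they would be supplied by the tokenizer as token attributes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Token:
    surface: str
    base_form: Optional[str] = None  # None when the token is not inflected


def base_form_filter(tokens):
    """Replace each inflected surface form with its dictionary form."""
    return [t.base_form or t.surface for t in tokens]


# Hand-written analyses (assumptions for illustration only)
bought = [Token("買い", "買う"), Token("まし", "ます"), Token("た")]  # 買いました
not_buy = [Token("買わ", "買う"), Token("ない")]                      # 買わない

terms_bought = base_form_filter(bought)
terms_not_buy = base_form_filter(not_buy)
```

Both inflected inputs now share the term 買う, so a query in one form can match a document written in another.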
  • 95. User dictionaries
       • Own dictionaries can be used for ad hoc segmentation, i.e. to override default model
       • File format is simple and there’s no need to assign weights, etc. before using them
       • Example custom dictionary:
            # Custom segmentation and POS entry for long entries
            関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
            # Custom reading and POS for former sumo wrestler Asashoryu
            朝青龍,朝青龍,アサショウリュウ,カスタム人名
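The example entries suggest a four-column, comma-separated layout: surface form, space-separated segmentation, space-separated readings, and a part-of-speech label. The Python sketch below parses that layout as read from the slide. The `UserEntry` class and `parse_user_dictionary` function are hypothetical names, and the column interpretation is an assumption based on the two sample lines only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UserEntry:
    surface: str
    segments: List[str]
    readings: List[str]
    pos: str


def parse_user_dictionary(text: str) -> List[UserEntry]:
    """Parse 'surface,segmentation,readings,pos' lines, skipping comments."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        surface, segmentation, readings, pos = line.split(",")
        entries.append(
            UserEntry(surface, segmentation.split(), readings.split(), pos)
        )
    return entries


SAMPLE = """\
# Custom segmentation and POS entry for long entries
関西国際空港,関西 国際 空港,カンサイ コクサイ クウコウ,カスタム名詞
# Custom reading and POS for former sumo wrestler Asashoryu
朝青龍,朝青龍,アサショウリュウ,カスタム人名
"""
entries = parse_user_dictionary(SAMPLE)
```

The first entry forces a three-way segmentation with one reading per segment, while the second keeps 朝青龍 as a single token and only supplies its reading and a custom part-of-speech.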
  • 96. Japanese focus in 4.0
       • Improvements in JapaneseTokenizer
            • Improved search mode for katakana compounds
            • Improved unknown word segmentation
            • Some performance improvements
       • CharFilters for various character normalisations
            • Dates and numbers
            • Repetition marks (odoriji)
       • Japanese spell-checker
            • Robert and Koji almost got this into 3.6, but it was postponed because API changes were necessary
  • 97. Acknowledgements
       • Robert Muir: Thanks for the heavy lifting integrating Kuromoji into Lucene, for always reviewing my patches quickly, and for the friendly help
       • Michael McCandless: Thanks for streaming Viterbi and synonym compounds!
       • Uwe Schindler: Thanks for performance improvements + being the policeman
       • Simon Willnauer: Thanks for doing the Kuromoji code donation process so well
       • Gaute Lambertsen & Gerry Hocks: Thanks for presentation feedback and being great colleagues
  • 98. Q&A