An Introduction to
NLP4L
Natural Language Processing tool for
Apache Lucene
Koji Sekiguchi @kojisays
Founder & CEO, RONDHUIT
Agenda
• What's NLP4L?
• How NLP improves search experience
• Count number of words in Lucene index
• Application: Transliteration
• Future Plans
What's NLP4L?
• GOAL
• Improve Lucene users' search experience
• FEATURES
• Use of a Lucene index as a Corpus Database
• Lucene API front-end written in Scala
• NLP4L provides
• Preprocessors for existing ML tools
• ML algorithms and applications (e.g. Transliteration)
What's Lucene?
Lucene is a high-performance, full-featured text
search engine library written entirely in Java.
Indexing these three documents:
1: Alice ate an apple.
2: Mike likes an orange.
3: An apple is red.
produces an inverted index that maps each term to the documents containing it,
which is then used for searching (e.g. the query "apple" hits documents 1 and 3):
alice 1
an 1, 2, 3
apple 1, 3
ate 1
is 3
likes 2
mike 2
orange 2
red 3
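The indexing and searching flow above can be sketched in plain Scala (a toy model of an inverted index, not Lucene's actual implementation; all names here are illustrative):

```scala
// Toy corpus: document ID -> text
val corpus = Map(
  1 -> "Alice ate an apple.",
  2 -> "Mike likes an orange.",
  3 -> "An apple is red."
)

// Lowercase and split on non-word characters
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

// Indexing: build term -> set of document IDs
val invertedIndex: Map[String, Set[Int]] =
  corpus.toSeq
    .flatMap { case (id, text) => tokenize(text).map(term => (term, id)) }
    .groupBy(_._1)
    .map { case (term, pairs) => term -> pairs.map(_._2).toSet }

// Searching: look the query term up in the index
def search(term: String): Set[Int] =
  invertedIndex.getOrElse(term.toLowerCase, Set.empty)

val hits = search("apple")
// -> Set(1, 3)
```
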
How NLP improves search experience
Evaluation Measures
Search results are evaluated by comparing the result set (the documents
actually returned) with the target set (the documents that should have been
returned). The two sets divide all documents into four groups:
• tp (true positive): returned and relevant
• fp (false positive): returned but not relevant
• fn (false negative): relevant but not returned
• tn (true negative): neither returned nor relevant
precision = tp / (tp + fp)
recall = tp / (tp + fn)
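The two measures can be sketched in a few lines of Scala, computed from a result set and a target set of document IDs (the function name and sets are illustrative):

```scala
// precision = tp / (tp + fp), recall = tp / (tp + fn)
def precisionRecall(result: Set[Int], target: Set[Int]): (Double, Double) = {
  val tp = (result intersect target).size.toDouble
  val precision = if (result.isEmpty) 0.0 else tp / result.size // tp + fp = |result|
  val recall    = if (target.isEmpty) 0.0 else tp / target.size // tp + fn = |target|
  (precision, recall)
}

// e.g. 3 of the 4 returned docs are relevant, out of 6 relevant docs overall
val (p, r) = precisionRecall(Set(1, 2, 3, 4), Set(2, 3, 4, 5, 6, 7))
// p = 0.75, r = 0.5
```
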
Solution
• To improve recall: n-gram, synonym dictionary, etc.
  (e.g. Transliteration)
• To improve precision: facet (filter query)
  (e.g. Named Entity Extraction)
• To improve both recall and precision: Ranking Tuning
Gradual precision improvement
A query q=watch returns a broad result set. Each facet filter narrows the
result toward the target: first filter by Gender=Men's, then filter by
Price=100-150.
Faceting like this works on Structured Documents:
ID | product                                                                  | price | gender
1  | CURREN New Men's Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92  | Men's
2  | Quiksilver The Gamer Watch                                              | 87.99 | Men's
Unstructured Documents
ID | article
1  | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.
2  | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.
Make them Structured
ID | article                                                      | person        | org | loc
1  | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU  | Brussels
2  | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. |               | EU  | UK, Britain
NEE[1] extracts interesting words.
[1] Named Entity Extraction
Manual Tagging using brat
Count number of words in Lucene index
A small Corpus
index_simple.scala indexes a small corpus into a Lucene index:

// text data
val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)
// Lucene index directory
val index = "/tmp/index-simple"
// schema definition
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  val fieldTypes = Map(
    "text" -> FieldType(analyzer, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}
// create Lucene document
def doc(text: String): Document = {
  Document(Set(
    Field("text", text)
  ))
}
// open a writer
val writer = IWriter(index, schema)
// write documents
CORPUS.foreach(text => writer.write(doc(text)))
// close writer
writer.close

As for the code snippets used in my talk, please look at:
https://github.com/NLP4L/meetups/tree/master/20150818
Getting word counts
Given the inverted index built above:
alice 1
an 1, 2, 3
apple 1, 3
ate 1
is 3
likes 2
mike 2
orange 2
red 3

getting_word_counts.scala:
val reader = RawReader(index)
reader.sumTotalTermFreq("text")
// -> 12
reader.field("text").get.terms.size
// -> 9
reader.totalTermFreq("text", "an")
// -> 3
reader.close
Getting top terms
getting_word_counts.scala:
val reader = RawReader(index)
reader.topTermsByDocFreq("text")
reader.topTermsByTotalTermFreq("text")
// ->
// (term, docFreq, totalTermFreq)
// (an,3,3)
// (apple,2,2)
// (likes,1,1)
// (is,1,1)
// (orange,1,1)
// (mike,1,1)
// (ate,1,1)
// (red,1,1)
// (alice,1,1)
reader.close
What's ShingleFilter?
• ShingleFilter = Word n-gram TokenFilter
Input: "Lucene is a popular software"
WhitespaceTokenizer: Lucene/is/a/popular/software
ShingleFilter (N=2): Lucene is/is a/a popular/popular software
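The word bigram ("shingle") output above can be sketched in plain Scala (a toy illustration of the idea, not Lucene's ShingleFilter itself):

```scala
// Whitespace tokenization followed by word n-gram ("shingle") generation
def tokenize(text: String): Seq[String] = text.split("\\s+").toSeq

// Slide a window of n tokens across the stream and join each window
def shingles(tokens: Seq[String], n: Int): Seq[String] =
  tokens.sliding(n).map(_.mkString(" ")).toSeq

val tokens  = tokenize("Lucene is a popular software")
val bigrams = shingles(tokens, 2)
// -> Seq("Lucene is", "is a", "a popular", "popular software")
```
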
Language Model
• An LM represents the fluency of language
• The n-gram model is the most widely used LM
• Calculation example for a 2-gram model:
  P(apple|an) = C(an apple) / C(an)
language_model.scala (1/2):
val index = "/tmp/index-lm"
val CORPUS = Array(
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)
// schema definition: a unigram field ("word") and a word 2-gram field ("word2g")
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  builder.addTokenFilter("shingle", "minShingleSize", "2", "maxShingleSize", "2",
    "outputUnigrams", "false")
  val analyzer2g = builder.build
  val fieldTypes = Map(
    "word" -> FieldType(analyzer, true, true, true, true),
    "word2g" -> FieldType(analyzer2g, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}
// create a language model index
val writer = IWriter(index, schema())
def addDocument(doc: String): Unit = {
  writer.write(Document(Set(
    Field("word", doc),
    Field("word2g", doc)
  )))
}
CORPUS.foreach(addDocument(_))
writer.close()
language_model.scala (2/2):
val reader = RawReader(index)
// P(apple|an) = C(an apple) / C(an)
val count_an_apple = reader.totalTermFreq("word2g", "an apple")
val count_an = reader.totalTermFreq("word", "an")
val prob_apple_an = count_an_apple.toFloat / count_an.toFloat
// P(orange|an) = C(an orange) / C(an)
val count_an_orange = reader.totalTermFreq("word2g", "an orange")
val prob_orange_an = count_an_orange.toFloat / count_an.toFloat
reader.close
Part-of-Speech Tagging
Our corpus for training a Hidden Markov Model:
Alice/NNP ate/VB an/AT apple/NNP ./.
Mike/NNP likes/VB an/AT orange/NNP ./.
An/AT apple/NNP is/VB red/JJ ./.
Tag legend:
NNP Proper noun, singular
VB  Verb
AT  Article
JJ  Adjective
.   period
Hidden Markov Model
The observed series of words is generated by a hidden series of
part-of-speech states.
[HMM state diagram over the states NNP, VB, AT, JJ and "."]
Initial state probabilities: NNP 0.667, AT 0.333
Transition probabilities: AT→NNP 1.0, JJ→. 1.0, NNP→VB 0.6, NNP→. 0.4,
VB→AT 0.667, VB→JJ 0.333
Emission probabilities (word given state):
NNP: alice 0.2, apple 0.4, mike 0.2, orange 0.2
VB:  ate 0.333, is 0.333, likes 0.333
AT:  an 1.0
JJ:  red 1.0
.:   . 1.0
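Decoding the most likely tag sequence from such a model is typically done with the Viterbi algorithm. A minimal sketch using the probabilities shown on the slides (the model values come from the training corpus above; the code itself is illustrative, not NLP4L's actual HmmTagger):

```scala
// HMM estimated from the 3-sentence training corpus
val initial = Map("NNP" -> 0.667, "AT" -> 0.333)
val trans = Map(
  ("NNP", "VB") -> 0.6, ("NNP", ".") -> 0.4,
  ("VB", "AT") -> 0.667, ("VB", "JJ") -> 0.333,
  ("AT", "NNP") -> 1.0, ("JJ", ".") -> 1.0
)
val emit = Map(
  "NNP" -> Map("alice" -> 0.2, "apple" -> 0.4, "mike" -> 0.2, "orange" -> 0.2),
  "VB"  -> Map("ate" -> 0.333, "is" -> 0.333, "likes" -> 0.333),
  "AT"  -> Map("an" -> 1.0),
  "JJ"  -> Map("red" -> 1.0),
  "."   -> Map("." -> 1.0)
)
val states = emit.keys.toSeq

// Viterbi: for each word keep, per state, the best probability and its path
def viterbi(words: Seq[String]): List[String] = {
  // first column: initial prob * emission prob
  var col: Map[String, (Double, List[String])] = states.map { s =>
    s -> (initial.getOrElse(s, 0.0) * emit(s).getOrElse(words.head, 0.0), List(s))
  }.toMap
  for (w <- words.tail) {
    col = states.map { s =>
      val (bestP, bestPath) = col.map { case (prev, (p, path)) =>
        (p * trans.getOrElse((prev, s), 0.0) * emit(s).getOrElse(w, 0.0), s :: path)
      }.maxBy(_._1)
      s -> (bestP, bestPath)
    }.toMap
  }
  col.values.maxBy(_._1)._2.reverse
}

val tags = viterbi("alice likes an apple .".split(" ").toSeq)
// -> List(NNP, VB, AT, NNP, .)
```
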
hmm.scala:
val index = "/tmp/index-hmm"
// text data (they are tagged!)
val CORPUS = Array(
  "Alice/NNP ate/VB an/AT apple/NNP ./.",
  "Mike/NNP likes/VB an/AT orange/NNP ./.",
  "An/AT apple/NNP is/VB red/JJ ./."
)
// write-open Lucene index
val indexer = HmmModelIndexer(index)
// tagged texts are indexed here
CORPUS.foreach{ text =>
  val pairs = text.split("\\s+")
  val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
  indexer.addDocument(doc)
}
indexer.close()
// execute part-of-speech tagging on an unknown text:
// make an HmmModel from the Lucene index,
val model = HmmModel(index)
// get an HmmTagger from the HmmModel,
val tagger = HmmTagger(model)
// and use the HmmTagger to annotate an unknown sentence
tagger.tokens("alice likes an apple .")

NLP4L has hmm_postagger.scala in the examples directory.
It uses the Brown corpus for HMM training.
Application: Transliteration
Transliteration
Transliteration is the process of transcribing letters or words from one
alphabet to another to facilitate comprehension and pronunciation for
non-native speakers. Examples of transliteration from English to Japanese:
computer → コンピューター
server → サーバー
internet → インターネット
mouse → マウス
information → インフォメーション
It helps improve recall
You search for the English word "mouse", and documents containing マウス
(= mouse) are matched and highlighted in Japanese as well.
Training data in NLP4L
train_data/alpha_katakana.txt (English-katakana pairs):
academy,アカデミー
accent,アクセント
access,アクセス
accident,アクシデント
acrobat,アクロバット
action,アクション
adapter,アダプター
africa,アフリカ
airbus,エアバス
alaska,アラスカ
alcohol,アルコール
allergy,アレルギー
train_data/alpha_katakana_aligned.txt (character-aligned pairs):
アaカcaデdeミーmy
アaクcセceンnトt
アaクcセceスss
アaクcシciデdeンnトt
アaクcロroバッbaトt
アaクcショtioンn
アaダdaプpターter
アaフfリriカca
エaアirバbuスs
アaラlaスsカka
アaルlコーcohoルl
アaレlleルrギーgy
Demo: Transliteration
nlp4l> :load examples/trans_katakana_alpha.scala
Input        | Prediction | Right Answer
アルゴリズム | algorism   | algorithm
プログラム   | program    | (OK)
ケミカル     | chaemmical | chemical
ダイニング   | dining     | (OK)
コミッター   | committer  | (OK)
エントリー   | entree     | entry
Gathering loan words
① crawl
② gather Katakana-Alphabet string pairs (e.g. アルゴリズム, algorithm)
③ transliterate: アルゴリズム → algorism
④ calculate the edit distance between the crawled spelling and the prediction
⑤ store the pair of strings if the edit distance is small enough
⑥ output synonyms.txt
Got 1,800+ records of synonym knowledge from jawiki.
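Step ④ relies on edit distance. A minimal Levenshtein distance sketch in Scala (illustrative, not the implementation NLP4L uses):

```scala
// Classic dynamic-programming Levenshtein (edit) distance: the minimum number
// of insertions, deletions and substitutions turning a into b.
def editDistance(a: String, b: String): Int = {
  // dp(i)(j) = distance between the first i chars of a and first j chars of b
  val dp = Array.tabulate(a.length + 1, b.length + 1) { (i, j) =>
    if (i == 0) j else if (j == 0) i else 0
  }
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    dp(i)(j) = math.min(math.min(dp(i - 1)(j) + 1, dp(i)(j - 1) + 1),
                        dp(i - 1)(j - 1) + cost)
  }
  dp(a.length)(b.length)
}

// "algorism" (the transliterator's prediction) is close to the crawled
// spelling "algorithm", so this pair would be kept as a synonym candidate.
val d = editDistance("algorism", "algorithm")
// -> 2
```
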
Future Plans
NLP4L Framework
• A pluggable framework that improves the search experience (mainly for
Lucene-based search systems).
• Reference implementations of plug-ins and corpora provided.
• Uses NLP/ML technologies to output models, dictionaries and indexes.
• Since NLP/ML are not perfect, an interface that enables users to personally
examine output dictionaries is provided as well.
[Architecture diagram]
• Data Source: corpus (text data, Lucene index), query log, access log
• Processing plug-ins: TermExtractor, Transliteration, NEE, Classification,
Document Vectors, Language Detection, Learning to Rank, Personalized Search
• Outputs (with maintenance tooling): dictionaries (suggestion / auto-complete,
"Did you mean?", synonyms.txt, userdic.txt, keyword attachment), model files,
tagged corpus, document vectors
• Outputs feed Solr, ES, Mahout and Spark
Keyword Attachment
• Keyword attachment is a general format that enables the following functions:
• Learning to Rank
• Personalized Search
• Named Entity Extraction
• Document Classification
A keyword is attached to a Lucene document to increase that document's boost.
Learning to Rank
• The program learns, from the access log and other sources, that the score
of document d for a query q should be larger than the normal score(q,d).
https://en.wikipedia.org/wiki/Learning_to_rank
Personalized Search
• The program learns, from the access log and other sources, that the score
of document d for a query q issued by user u should be larger than the normal
score(q,d).
• Since Lucene does not allow specifying score(q,d,u), you have to specify
score(qu,d) instead, i.e. encode the user into the query.
• Limit the data to frequent queries, or divide fields per user, as the
number of q-u combinations can be enormous.
Join and Code with Us!
Contact us at koji at apache dot org for the details.
Demo or Q & A
Thank you!
Functional Programming - Past, Present and FuturePushkar Kulkarni
 
Functional Programming Past Present Future
Functional Programming Past Present FutureFunctional Programming Past Present Future
Functional Programming Past Present FutureIndicThreads
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowDatabricks
 
Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...
Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...
Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...Ontico
 
Basic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsBasic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsRajendran
 
Demystifying Oak Search
Demystifying Oak SearchDemystifying Oak Search
Demystifying Oak SearchJustin Edelson
 
INTRODUCTION TO LISP
INTRODUCTION TO LISPINTRODUCTION TO LISP
INTRODUCTION TO LISPNilt1234
 
Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programaciónSoftware Guru
 
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query PitfallsMongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query PitfallsMongoDB
 
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Spark Summit
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Petr Zapletal
 
Functional Python Webinar from October 22nd, 2014
Functional Python Webinar from October 22nd, 2014Functional Python Webinar from October 22nd, 2014
Functional Python Webinar from October 22nd, 2014Reuven Lerner
 
Performance Tuning and Optimization
Performance Tuning and OptimizationPerformance Tuning and Optimization
Performance Tuning and OptimizationMongoDB
 

Similar to An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015) (20)

Profiling and optimization
Profiling and optimizationProfiling and optimization
Profiling and optimization
 
CS101- Introduction to Computing- Lecture 29
CS101- Introduction to Computing- Lecture 29CS101- Introduction to Computing- Lecture 29
CS101- Introduction to Computing- Lecture 29
 
The openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query LanguageThe openCypher Project - An Open Graph Query Language
The openCypher Project - An Open Graph Query Language
 
Real Time Big Data Management
Real Time Big Data ManagementReal Time Big Data Management
Real Time Big Data Management
 
Full Text Search In PostgreSQL
Full Text Search In PostgreSQLFull Text Search In PostgreSQL
Full Text Search In PostgreSQL
 
Morel, a data-parallel programming language
Morel, a data-parallel programming languageMorel, a data-parallel programming language
Morel, a data-parallel programming language
 
Functional Programming - Past, Present and Future
Functional Programming - Past, Present and FutureFunctional Programming - Past, Present and Future
Functional Programming - Past, Present and Future
 
Functional Programming Past Present Future
Functional Programming Past Present FutureFunctional Programming Past Present Future
Functional Programming Past Present Future
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...
Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...
Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...
 
Basic terminologies & asymptotic notations
Basic terminologies & asymptotic notationsBasic terminologies & asymptotic notations
Basic terminologies & asymptotic notations
 
Demystifying Oak Search
Demystifying Oak SearchDemystifying Oak Search
Demystifying Oak Search
 
EVOLVE'14 | Enhance | Justin Edelson & Darin Kuntze | Demystifying Oak Search
EVOLVE'14 | Enhance | Justin Edelson & Darin Kuntze | Demystifying Oak SearchEVOLVE'14 | Enhance | Justin Edelson & Darin Kuntze | Demystifying Oak Search
EVOLVE'14 | Enhance | Justin Edelson & Darin Kuntze | Demystifying Oak Search
 
INTRODUCTION TO LISP
INTRODUCTION TO LISPINTRODUCTION TO LISP
INTRODUCTION TO LISP
 
Ejercicios de estilo en la programación
Ejercicios de estilo en la programaciónEjercicios de estilo en la programación
Ejercicios de estilo en la programación
 
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query PitfallsMongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
MongoDB.local DC 2018: Tips and Tricks for Avoiding Common Query Pitfalls
 
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
Distributed Real-Time Stream Processing: Why and How: Spark Summit East talk ...
 
Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017Distributed Stream Processing - Spark Summit East 2017
Distributed Stream Processing - Spark Summit East 2017
 
Functional Python Webinar from October 22nd, 2014
Functional Python Webinar from October 22nd, 2014Functional Python Webinar from October 22nd, 2014
Functional Python Webinar from October 22nd, 2014
 
Performance Tuning and Optimization
Performance Tuning and OptimizationPerformance Tuning and Optimization
Performance Tuning and Optimization
 

More from Koji Sekiguchi

20221209-ApacheSolrによるはじめてのセマンティックサーチ.pdf
20221209-ApacheSolrによるはじめてのセマンティックサーチ.pdf20221209-ApacheSolrによるはじめてのセマンティックサーチ.pdf
20221209-ApacheSolrによるはじめてのセマンティックサーチ.pdfKoji Sekiguchi
 
Solr から使う OpenNLP の日本語固有表現抽出
Solr から使う OpenNLP の日本語固有表現抽出Solr から使う OpenNLP の日本語固有表現抽出
Solr から使う OpenNLP の日本語固有表現抽出Koji Sekiguchi
 
Learning-to-Rank meetup Vol. 1
Learning-to-Rank meetup Vol. 1Learning-to-Rank meetup Vol. 1
Learning-to-Rank meetup Vol. 1Koji Sekiguchi
 
Lucene 6819-good-bye-index-time-boost
Lucene 6819-good-bye-index-time-boostLucene 6819-good-bye-index-time-boost
Lucene 6819-good-bye-index-time-boostKoji Sekiguchi
 
NLP4L - 情報検索における性能改善のためのコーパスの活用とランキング学習
NLP4L - 情報検索における性能改善のためのコーパスの活用とランキング学習NLP4L - 情報検索における性能改善のためのコーパスの活用とランキング学習
NLP4L - 情報検索における性能改善のためのコーパスの活用とランキング学習Koji Sekiguchi
 
コーパス学習による Apache Solr の徹底活用
コーパス学習による Apache Solr の徹底活用コーパス学習による Apache Solr の徹底活用
コーパス学習による Apache Solr の徹底活用Koji Sekiguchi
 
情報検索の基礎からデータの徹底活用まで
情報検索の基礎からデータの徹底活用まで情報検索の基礎からデータの徹底活用まで
情報検索の基礎からデータの徹底活用までKoji Sekiguchi
 
LUCENE-5252 NGramSynonymTokenizer
LUCENE-5252 NGramSynonymTokenizerLUCENE-5252 NGramSynonymTokenizer
LUCENE-5252 NGramSynonymTokenizerKoji Sekiguchi
 
情報検索におけるランキング計算の紹介
情報検索におけるランキング計算の紹介情報検索におけるランキング計算の紹介
情報検索におけるランキング計算の紹介Koji Sekiguchi
 
系列パターンマイニングを用いた単語パターン学習とWikipediaからの組織名抽出
系列パターンマイニングを用いた単語パターン学習とWikipediaからの組織名抽出系列パターンマイニングを用いた単語パターン学習とWikipediaからの組織名抽出
系列パターンマイニングを用いた単語パターン学習とWikipediaからの組織名抽出Koji Sekiguchi
 
Luceneインデックスの共起単語分析とSolrによる共起単語サジェスチョン
Luceneインデックスの共起単語分析とSolrによる共起単語サジェスチョンLuceneインデックスの共起単語分析とSolrによる共起単語サジェスチョン
Luceneインデックスの共起単語分析とSolrによる共起単語サジェスチョンKoji Sekiguchi
 
Lucene terms extraction
Lucene terms extractionLucene terms extraction
Lucene terms extractionKoji Sekiguchi
 
Visualize terms network in Lucene index
Visualize terms network in Lucene indexVisualize terms network in Lucene index
Visualize terms network in Lucene indexKoji Sekiguchi
 
WikipediaからのSolr用類義語辞書の自動生成
WikipediaからのSolr用類義語辞書の自動生成WikipediaからのSolr用類義語辞書の自動生成
WikipediaからのSolr用類義語辞書の自動生成Koji Sekiguchi
 
OpenNLP - MEM and Perceptron
OpenNLP - MEM and PerceptronOpenNLP - MEM and Perceptron
OpenNLP - MEM and PerceptronKoji Sekiguchi
 
自然言語処理における機械学習による曖昧性解消入門
自然言語処理における機械学習による曖昧性解消入門自然言語処理における機械学習による曖昧性解消入門
自然言語処理における機械学習による曖昧性解消入門Koji Sekiguchi
 

More from Koji Sekiguchi (20)

20221209-ApacheSolrによるはじめてのセマンティックサーチ.pdf
20221209-ApacheSolrによるはじめてのセマンティックサーチ.pdf20221209-ApacheSolrによるはじめてのセマンティックサーチ.pdf
20221209-ApacheSolrによるはじめてのセマンティックサーチ.pdf
 
Solr から使う OpenNLP の日本語固有表現抽出
Solr から使う OpenNLP の日本語固有表現抽出Solr から使う OpenNLP の日本語固有表現抽出
Solr から使う OpenNLP の日本語固有表現抽出
 
Learning-to-Rank meetup Vol. 1
Learning-to-Rank meetup Vol. 1Learning-to-Rank meetup Vol. 1
Learning-to-Rank meetup Vol. 1
 
Lucene 6819-good-bye-index-time-boost
Lucene 6819-good-bye-index-time-boostLucene 6819-good-bye-index-time-boost
Lucene 6819-good-bye-index-time-boost
 
NLP4L - 情報検索における性能改善のためのコーパスの活用とランキング学習
NLP4L - 情報検索における性能改善のためのコーパスの活用とランキング学習NLP4L - 情報検索における性能改善のためのコーパスの活用とランキング学習
NLP4L - 情報検索における性能改善のためのコーパスの活用とランキング学習
 
Nlp4 l intro-20150513
Nlp4 l intro-20150513Nlp4 l intro-20150513
Nlp4 l intro-20150513
 
コーパス学習による Apache Solr の徹底活用
コーパス学習による Apache Solr の徹底活用コーパス学習による Apache Solr の徹底活用
コーパス学習による Apache Solr の徹底活用
 
情報検索の基礎からデータの徹底活用まで
情報検索の基礎からデータの徹底活用まで情報検索の基礎からデータの徹底活用まで
情報検索の基礎からデータの徹底活用まで
 
LUCENE-5252 NGramSynonymTokenizer
LUCENE-5252 NGramSynonymTokenizerLUCENE-5252 NGramSynonymTokenizer
LUCENE-5252 NGramSynonymTokenizer
 
情報検索におけるランキング計算の紹介
情報検索におけるランキング計算の紹介情報検索におけるランキング計算の紹介
情報検索におけるランキング計算の紹介
 
系列パターンマイニングを用いた単語パターン学習とWikipediaからの組織名抽出
系列パターンマイニングを用いた単語パターン学習とWikipediaからの組織名抽出系列パターンマイニングを用いた単語パターン学習とWikipediaからの組織名抽出
系列パターンマイニングを用いた単語パターン学習とWikipediaからの組織名抽出
 
Luceneインデックスの共起単語分析とSolrによる共起単語サジェスチョン
Luceneインデックスの共起単語分析とSolrによる共起単語サジェスチョンLuceneインデックスの共起単語分析とSolrによる共起単語サジェスチョン
Luceneインデックスの共起単語分析とSolrによる共起単語サジェスチョン
 
Html noise reduction
Html noise reductionHtml noise reduction
Html noise reduction
 
Lucene terms extraction
Lucene terms extractionLucene terms extraction
Lucene terms extraction
 
Visualize terms network in Lucene index
Visualize terms network in Lucene indexVisualize terms network in Lucene index
Visualize terms network in Lucene index
 
WikipediaからのSolr用類義語辞書の自動生成
WikipediaからのSolr用類義語辞書の自動生成WikipediaからのSolr用類義語辞書の自動生成
WikipediaからのSolr用類義語辞書の自動生成
 
HMM viterbi
HMM viterbiHMM viterbi
HMM viterbi
 
NLP x Lucene/Solr
NLP x Lucene/SolrNLP x Lucene/Solr
NLP x Lucene/Solr
 
OpenNLP - MEM and Perceptron
OpenNLP - MEM and PerceptronOpenNLP - MEM and Perceptron
OpenNLP - MEM and Perceptron
 
自然言語処理における機械学習による曖昧性解消入門
自然言語処理における機械学習による曖昧性解消入門自然言語処理における機械学習による曖昧性解消入門
自然言語処理における機械学習による曖昧性解消入門
 

Recently uploaded

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 

Recently uploaded (20)

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 

An Introduction to NLP4L (Scala by the Bay / Big Data Scala 2015)

  • 1. An Introduction to NLP4L: Natural Language Processing tool for Apache Lucene. Koji Sekiguchi @kojisays, Founder & CEO, RONDHUIT
  • 2. Agenda • What's NLP4L? • How NLP improves search experience • Count number of words in Lucene index • Application: Transliteration • Future Plans 2
  • 3. Agenda • What's NLP4L? • How NLP improves search experience • Count number of words in Lucene index • Application: Transliteration • Future Plans 3
  • 5. What's NLP4L? • GOAL • Improve Lucene users' search experience • FEATURES • Use of Lucene index as a Corpus Database • Lucene API Front-end written in Scala • NLP4L provides • Preprocessors for existing ML tools • Provision of ML algorithms and Applications (e.g. Transliteration) 5
  • 6. What's NLP4L? • GOAL • Improve Lucene users' search experience • FEATURES • Use of Lucene index as a Corpus Database • Lucene API Front-end written in Scala • NLP4L provides • Preprocessors for existing ML tools • Provision of ML algorithms and Applications (e.g. Transliteration) 6
  • 7. What's NLP4L? • GOAL • Improve Lucene users' search experience • FEATURES • Use of Lucene index as a Corpus Database • Lucene API Front-end written in Scala • NLP4L provides • Preprocessors for existing ML tools • Provision of ML algorithms and Applications (e.g. Transliteration) 7
  • 8. What's Lucene? Lucene is a high-performance, full-featured text search engine library written entirely in Java. Indexing the documents 1: "Alice ate an apple." 2: "Mike likes an orange." 3: "An apple is red." produces an (inverted) index that maps each term to the documents containing it: alice 1; an 1, 2, 3; apple 1, 3; ate 1; is 3; likes 2; mike 2; orange 2; red 3. Searching for "apple" is then a lookup in this index. 8
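The inverted index on this slide can be sketched in a few lines of plain Scala. This is a toy stand-in, not the Lucene implementation, and the regex split only approximates the "standard" tokenizer plus lowercase filter used later in the deck:

```scala
// The three example documents, keyed by doc id.
val corpus = Map(
  1 -> "Alice ate an apple.",
  2 -> "Mike likes an orange.",
  3 -> "An apple is red."
)

// Approximate "standard" tokenization + lowercasing with a regex split.
def tokenize(text: String): Seq[String] =
  text.toLowerCase.split("[^a-z]+").filter(_.nonEmpty).toSeq

// Invert: term -> set of ids of the documents containing that term.
val invertedIndex: Map[String, Set[Int]] =
  corpus.toSeq
    .flatMap { case (id, text) => tokenize(text).map(term => (term, id)) }
    .groupBy { case (term, _) => term }
    .map { case (term, pairs) => term -> pairs.map(_._2).toSet }

// Searching for "apple" is just a lookup in the inverted index.
println(invertedIndex("apple")) // -> Set(1, 3)
```

Note that searching never scans the documents themselves; that lookup is what makes the inverted index fast.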
  • 9. Agenda • What's NLP4L? • How NLP improves search experience • Count number of words in Lucene index • Application: Transliteration • Future Plans 9
  • 18. Evaluation Measures: target vs. result sets give tp, fp, fn, tn. precision = tp / (tp + fp), recall = tp / (tp + fn) 18
  • 19. Recall vs. Precision: tp, fp, fn, tn. precision = tp / (tp + fp), recall = tp / (tp + fn) 19
  • 20. target vs. result: tp, fp, fn, tn. precision = tp / (tp + fp), recall = tp / (tp + fn) 20
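The precision and recall formulas above can be evaluated directly from the tp/fp/fn counts; a minimal plain-Scala sketch with made-up example counts:

```scala
// precision: what fraction of the returned results are relevant
def precision(tp: Int, fp: Int): Double = tp.toDouble / (tp + fp)

// recall: what fraction of the relevant (target) documents were returned
def recall(tp: Int, fn: Int): Double = tp.toDouble / (tp + fn)

// Example: 8 documents returned, 6 of them relevant (tp = 6, fp = 2),
// and 4 relevant documents were missed (fn = 4).
println(precision(6, 2)) // -> 0.75
println(recall(6, 4))    // -> 0.6
```

Note that tn does not appear in either formula: in search evaluation the set of correctly-ignored documents is huge and uninteresting.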
  • 21. Solution: n-gram, synonym dictionary, etc. (recall); facet (filter query) (precision); Ranking Tuning (recall, precision) 21
  • 22. Solution: n-gram, synonym dictionary, etc. (recall); facet (filter query) (precision); Ranking Tuning (recall, precision) 22
  • 23. Solution: n-gram, synonym dictionary, etc., e.g. Transliteration (recall); facet (filter query) (precision); Ranking Tuning (recall, precision) 23
  • 24. Solution: n-gram, synonym dictionary, etc., e.g. Transliteration (recall); facet (filter query), e.g. Named Entity Extraction (precision); Ranking Tuning (recall, precision) 24
  • 27. target vs. result: filter by Gender=Men's, then filter by Price=100-150: gradual precision improvement 27
  • 28. Structured Documents: ID | product | price | gender. 1 | CURREN New Men's Date Stainless Steel Military Sport Quartz Wrist Watch | 8.92 | Men's. 2 | Suiksilver The Gamer Watch | 87.99 | Men's. 28
  • 29. Unstructured Documents ID article 1 David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. 2 He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. 29
  • 30. Make them Structured: ID | article | person | org | loc. 1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels. 2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | | EU | UK, Britain. NEE[1] extracts interesting words. [1] Named Entity Extraction 30
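The slide's Named Entity Extraction uses trained models; purely to illustrate the idea of pulling candidate entities out of raw text, here is a naive capitalized-phrase heuristic in plain Scala. The regex approach is an illustration only, not how NLP4L or any real NEE toolkit works:

```scala
// Naive heuristic: treat maximal runs of capitalized words as candidate
// entities. A real NEE system uses trained sequence models and would not,
// for example, misfire on sentence-initial words the way this can.
def candidateEntities(text: String): List[String] =
  """\b[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*""".r.findAllIn(text).toList

val article = "David Cameron says he has a mandate to pursue EU reform " +
  "following the Conservatives' general election victory."
println(candidateEntities(article))
// -> List(David Cameron, EU, Conservatives)
```

Even this crude pass recovers the person/org columns of the table above; the point of NEE is doing the same reliably, with types attached, over arbitrary text.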
  • 32. Agenda • What's NLP4L? • How NLP improves search experience • Count number of words in Lucene index • Application: Transliteration • Future Plans 32
  • 33. A small Corpus val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) 33
  • 34. index_simple.scala 34

    val CORPUS = Array(
      "Alice ate an apple.",
      "Mike likes an orange.",
      "An apple is red."
    )
    val index = "/tmp/index-simple"

    def schema(): Schema = {
      val builder = AnalyzerBuilder()
      builder.withTokenizer("standard")
      builder.addTokenFilter("lowercase")
      val analyzer = builder.build
      val fieldTypes = Map(
        "text" -> FieldType(analyzer, true, true, true, true)
      )
      val analyzerDefault = analyzer
      Schema(analyzerDefault, fieldTypes)
    }

    def doc(text: String): Document = {
      Document(Set(
        Field("text", text)
      ))
    }

    val writer = IWriter(index, schema)
    CORPUS.foreach(text => writer.write(doc(text)))
    writer.close
  • 35. val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) val index = "/tmp/index-simple" def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build val fieldTypes = Map( "text" -> FieldType(analyzer, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } def doc(text: String): Document = { Document(Set( Field("text", text) ) ) } val writer = IWriter(index, schema) CORPUS.foreach(text => writer.write(doc(text))) writer.close index_simple.scala 35 text data
  • 36. val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) val index = "/tmp/index-simple" def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build val fieldTypes = Map( "text" -> FieldType(analyzer, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } def doc(text: String): Document = { Document(Set( Field("text", text) ) ) } val writer = IWriter(index, schema) CORPUS.foreach(text => writer.write(doc(text))) writer.close index_simple.scala 36 Lucene index directory
  • 37. val CORPUS = Array( "Alice ate an apple.", "Mike likes an orange.", "An apple is red." ) val index = "/tmp/index-simple" def schema(): Schema = { val builder = AnalyzerBuilder() builder.withTokenizer("standard") builder.addTokenFilter("lowercase") val analyzer = builder.build val fieldTypes = Map( "text" -> FieldType(analyzer, true, true, true, true) ) val analyzerDefault = analyzer Schema(analyzerDefault, fieldTypes) } def doc(text: String): Document = { Document(Set( Field("text", text) ) ) } val writer = IWriter(index, schema) CORPUS.foreach(text => writer.write(doc(text))) writer.close index_simple.scala 37 schema definition
• 38. (same index_simple.scala listing as slide 37): create Lucene document
• 39. (same listing): open a writer
• 40. (same listing): write documents
• 41. (same listing): close writer
• 42. (same listing): The code snippets used in this talk are available at https://github.com/NLP4L/meetups/tree/master/20150818
• 43. Getting word counts: the inverted index
  alice: 1
  an: 1, 2, 3
  apple: 1, 3
  ate: 1
  is: 3
  likes: 2
  mike: 2
  orange: 2
  red: 3
• 44. Getting word counts (getting_word_counts.scala, against the index on slide 43)
  val reader = RawReader(index)
  reader.sumTotalTermFreq("text")      // -> 12
  reader.field("text").get.terms.size  // -> 9
  reader.totalTermFreq("text", "an")   // -> 3
  reader.close
• 47. Getting top terms (getting_word_counts.scala)
  val reader = RawReader(index)
  reader.topTermsByDocFreq("text")
  reader.topTermsByTotalTermFreq("text")
  // -> (term, docFreq, totalTermFreq)
  // (an,3,3) (apple,2,2) (likes,1,1) (is,1,1) (orange,1,1)
  // (mike,1,1) (ate,1,1) (red,1,1) (alice,1,1)
  reader.close
• 48. What's a ShingleFilter?
  • ShingleFilter = word n-gram TokenFilter
  "Lucene is a popular software"
  WhitespaceTokenizer: Lucene / is / a / popular / software
  ShingleFilter (N=2): Lucene is / is a / a popular / popular software
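The effect of word n-gram shingling is easy to reproduce outside Lucene. The sketch below (plain Python, not Lucene's actual ShingleFilter; the function name is made up for illustration) mimics WhitespaceTokenizer followed by ShingleFilter with N=2:

```python
def shingles(text, n=2):
    # Whitespace tokenization, as WhitespaceTokenizer would do
    tokens = text.split()
    # Emit every run of n consecutive tokens joined by a space
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(shingles("Lucene is a popular software"))
# → ['Lucene is', 'is a', 'a popular', 'popular software']
```

With outputUnigrams disabled, as in the language-model schema later in the talk, only these 2-grams are indexed.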
• 49. Language Model
  • A language model (LM) represents the fluency of language
  • The N-gram model is the most widely used LM
  • Calculation example for a 2-gram model: P(w2|w1) = C(w1 w2) / C(w1)
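The 2-gram calculation can be sketched with plain counts over the toy corpus. This is a Python sketch of the math only; NLP4L reads the same counts back from a Lucene index instead, as the language_model.scala slides show:

```python
from collections import Counter

CORPUS = ["alice ate an apple .", "mike likes an orange .", "an apple is red ."]

# Count unigrams and 2-grams over the whitespace-tokenized corpus
unigrams, bigrams = Counter(), Counter()
for sentence in CORPUS:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

# 2-gram model: P(w2 | w1) = C(w1 w2) / C(w1)
def prob(w2, w1):
    return bigrams[(w1, w2)] / unigrams[w1]

print(prob("apple", "an"))   # C(an apple) / C(an) = 2/3
print(prob("orange", "an"))  # C(an orange) / C(an) = 1/3
```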
• 50. language_model.scala 1/2
  val index = "/tmp/index-lm"
  val CORPUS = Array(
    "Alice ate an apple.",
    "Mike likes an orange.",
    "An apple is red."
  )
  def schema(): Schema = {
    val builder = AnalyzerBuilder()
    builder.withTokenizer("standard")
    builder.addTokenFilter("lowercase")
    val analyzer = builder.build
    builder.addTokenFilter("shingle",
      "minShingleSize", "2", "maxShingleSize", "2", "outputUnigrams", "false")
    val analyzer2g = builder.build
    val fieldTypes = Map(
      "word" -> FieldType(analyzer, true, true, true, true),
      "word2g" -> FieldType(analyzer2g, true, true, true, true)
    )
    val analyzerDefault = analyzer
    Schema(analyzerDefault, fieldTypes)
  }
  // create a language model index
  val writer = IWriter(index, schema())
  def addDocument(doc: String): Unit = {
    writer.write(Document(Set(
      Field("word", doc),
      Field("word2g", doc)
    )))
  }
  CORPUS.foreach(addDocument(_))
  writer.close()
• 51. (same language_model.scala listing): schema definition
• 52. language_model.scala 2/2
  val reader = RawReader(index)
  // P(apple|an) = C(an apple) / C(an)
  val count_an_apple = reader.totalTermFreq("word2g", "an apple")
  val count_an = reader.totalTermFreq("word", "an")
  val prob_apple_an = count_an_apple.toFloat / count_an.toFloat
  // P(orange|an) = C(an orange) / C(an)
  val count_an_orange = reader.totalTermFreq("word2g", "an orange")
  val prob_orange_an = count_an_orange.toFloat / count_an.toFloat
  reader.close
• 53. Part-of-Speech Tagging: our corpus for training
  Alice/NNP ate/VB an/AT apple/NNP ./.
  Mike/NNP likes/VB an/AT orange/NNP ./.
  An/AT apple/NNP is/VB red/JJ ./.
  NNP: proper noun, singular / VB: verb / AT: article / JJ: adjective / .: period
• 56. Hidden Markov Model: a series of part-of-speech tags
• 59. HMM state diagram (probabilities estimated from the slide-53 corpus)
  Initial state probabilities: NNP 0.667, AT 0.333 (VB, JJ, .: 0.0)
  Transition probabilities: NNP -> VB 0.6, NNP -> . 0.4 / VB -> AT 0.667, VB -> JJ 0.333 / AT -> NNP 1.0 / JJ -> . 1.0
  Emission probabilities: NNP: alice 0.2, apple 0.4, mike 0.2, orange 0.2 / VB: ate 0.333, is 0.333, likes 0.333 / AT: an 1.0 / JJ: red 1.0 / .: . 1.0
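Given the probabilities in the state diagram, decoding the most likely tag sequence for a new sentence is a Viterbi search. Below is a minimal Python sketch of that decoding (not NLP4L's HmmTagger), with the probabilities hard-coded from the diagram:

```python
# Probabilities estimated from the three tagged training sentences (slide 53)
start = {"NNP": 2/3, "AT": 1/3}
trans = {
    "NNP": {"VB": 0.6, ".": 0.4},
    "VB":  {"AT": 2/3, "JJ": 1/3},
    "AT":  {"NNP": 1.0},
    "JJ":  {".": 1.0},
    ".":   {},
}
emit = {
    "NNP": {"alice": 0.2, "apple": 0.4, "mike": 0.2, "orange": 0.2},
    "VB":  {"ate": 1/3, "is": 1/3, "likes": 1/3},
    "AT":  {"an": 1.0},
    "JJ":  {"red": 1.0},
    ".":   {".": 1.0},
}

def viterbi(words):
    # best[tag] = (probability, best tag path ending in tag)
    best = {t: (start.get(t, 0.0) * emit[t].get(words[0], 0.0), [t]) for t in emit}
    for w in words[1:]:
        nxt = {}
        for t in emit:
            # Pick the predecessor state maximizing path prob * transition prob
            p, path = max((best[s][0] * trans[s].get(t, 0.0), best[s][1]) for s in emit)
            nxt[t] = (p * emit[t].get(w, 0.0), path + [t])
        best = nxt
    return max(best.values())[1]

print(viterbi("alice likes an apple .".split()))
# → ['NNP', 'VB', 'AT', 'NNP', '.']
```

The unseen sentence "alice likes an apple ." never occurs in the corpus, yet the model tags it correctly, which is exactly what the hmm.scala demo on the next slides does through the Lucene index.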
• 60. hmm.scala
  val index = "/tmp/index-hmm"
  val CORPUS = Array(
    "Alice/NNP ate/VB an/AT apple/NNP ./.",
    "Mike/NNP likes/VB an/AT orange/NNP ./.",
    "An/AT apple/NNP is/VB red/JJ ./."
  )
  val indexer = HmmModelIndexer(index)
  CORPUS.foreach{ text =>
    val pairs = text.split("\\s+")
    val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
    indexer.addDocument(doc)
  }
  indexer.close()
  // execute part-of-speech tagging on an unknown text
  val model = HmmModel(index)
  val tagger = HmmTagger(model)
  tagger.tokens("alice likes an apple .")
• 61. (same hmm.scala listing): text data (they are tagged!)
• 62. (same listing): write-open the Lucene index
• 63. (same listing): tagged texts are indexed here
• 64. (same listing): make an HmmModel from the Lucene index
• 65. (same listing): get an HmmTagger from the HmmModel
• 66. (same listing): use the HmmTagger to annotate an unknown sentence
• 67. (same listing): NLP4L ships hmm_postagger.scala in the examples directory; it uses the Brown corpus for HMM training.
• 68. Agenda
  • What's NLP4L?
  • How NLP improves search experience
  • Count number of words in Lucene index
  • Application: Transliteration
  • Future Plans
• 69. Transliteration
  Transliteration is the process of transcribing letters or words from one alphabet to another to facilitate comprehension and pronunciation for non-native speakers.
  Examples of transliteration from English to Japanese:
  computer: コンピューター / server: サーバー / internet: インターネット / mouse: マウス / information: インフォメーション
• 70. It helps improve recall: you search for the English word "mouse"
• 71. It helps improve recall: and you get マウス (= mouse) highlighted in the Japanese text
• 72. Training data in NLP4L
  train_data/alpha_katakana.txt:
  academy,アカデミー / accent,アクセント / access,アクセス / accident,アクシデント / acrobat,アクロバット / action,アクション / adapter,アダプター / africa,アフリカ / airbus,エアバス / alaska,アラスカ / alcohol,アルコール / allergy,アレルギー
  train_data/alpha_katakana_aligned.txt (character-aligned pairs):
  アaカcaデdeミーmy / アaクcセceンnトt / アaクcセceスss / アaクcシciデdeンnトt / アaクcロroバッbaトt / アaクcショtioンn / アaダdaプpターter / アaフfリriカca / エaアirバbuスs / アaラlaスsカka / アaルlコーcohoルl / アaレlleルrギーgy
• 73. Demo: Transliteration
  nlp4l> :load examples/trans_katakana_alpha.scala
  Input / Prediction / Right Answer:
  アルゴリズム / algorism / algorithm
  プログラム / program / (OK)
  ケミカル / chaemmical / chemical
  ダイニング / dining / (OK)
  コミッター / committer / (OK)
  エントリー / entree / entry
• 74. Gathering loan words
  ① crawl
  ② gather katakana-alphabet string pairs (アルゴリズム, algorithm)
  ③ transliterate: アルゴリズム -> algorism
  ④ calculate the edit distance
  ⑤ store the pair of strings if the edit distance is small enough
  ⑥ output synonyms.txt
• 75. (same diagram as slide 74): Got 1,800+ records of synonym knowledge from jawiki (Japanese Wikipedia)
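The "calculate edit distance" step in this pipeline is the standard Levenshtein distance. A self-contained Python sketch (NLP4L's actual implementation and threshold may differ):

```python
def edit_distance(a, b):
    # Levenshtein distance via a rolling one-row dynamic program:
    # dp[j] holds the distance between the processed prefix of a and b[:j]
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # delete ca
                        dp[j - 1] + 1,      # insert cb
                        prev + (ca != cb))  # substitute (free if equal)
            prev = cur
    return dp[-1]

# The pair survives if the predicted spelling is close to the crawled one
print(edit_distance("algorism", "algorithm"))  # → 2
```

With a small threshold, the mispredicted "algorism" still matches the crawled "algorithm", so the pair (アルゴリズム, algorithm) can be written out to synonyms.txt.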
• 76. Agenda
  • What's NLP4L?
  • How NLP improves search experience
  • Count number of words in Lucene index
  • Application: Transliteration
  • Future Plans
• 77. NLP4L Framework
  • A pluggable framework that improves the search experience (mainly for Lucene-based search systems)
  • Reference implementations of plug-ins and corpora are provided
  • Uses NLP/ML technologies to output models, dictionaries, and indexes
  • Since NLP/ML are not perfect, an interface that lets users manually examine the output dictionaries is provided as well
• 81. (architecture diagram)
  Data sources: corpus (text data, Lucene index), query log, access log
  NLP4L components: TermExtractor, Transliteration, NEE, Classification, Document Vectors, Language Detection, Learning to Rank, Personalized Search
  Outputs (with maintenance): dictionaries (suggestion / auto-complete, "Did you mean?", synonyms.txt, userdic.txt, keyword attachment), model files, tagged corpus, document vectors
  Consumers: Solr, ES, Mahout, Spark
• 82. Keyword Attachment
  • Keyword attachment is a general format that enables the following functions:
  • Learning to Rank
  • Personalized Search
  • Named Entity Extraction
  • Document Classification
  (diagram: a keyword attached to a Lucene doc increases that doc's boost)
• 83. Learning to Rank
  • The program learns, from access logs and other sources, that the score of document d for a query q should be larger than the normal score(q,d)
  (diagram: queries q, q, ... pointing at Lucene doc d)
  https://en.wikipedia.org/wiki/Learning_to_rank
• 84. Personalized Search
  • The program learns, from access logs and other sources, that the score of document d for a query q issued by user u should be larger than the normal score(q,d)
  • Since Lucene does not let you define score(q,d,u), you have to specify score(qu,d) instead
  • Because the number of q-u combinations can be enormous, limit the data to high-frequency queries or divide fields per user
  (diagram: Lucene doc d1 matched by q1u1, q2u2; Lucene doc d2 matched by q2u1, q1u2)
  • 85. Join and Code with Us! Contact us at koji at apache dot org for the details. 85
  • 86. Demo or Q & A Thank you! 86