5. What's NLP4L?
• GOAL
• Improve Lucene users' search experience
• FEATURES
• Use of a Lucene index as a corpus database
• Lucene API front-end written in Scala
• NLP4L provides
• Preprocessors for existing ML tools
• ML algorithms and applications (e.g. Transliteration)
5
8. What's Lucene?
Lucene is a high-performance, full-featured text search engine library written entirely in Java.
Documents:
1: Alice ate an apple.
2: Mike likes an orange.
3: An apple is red.
Indexing produces the (inverted) index below; searching (e.g. for "apple") looks the query term up in it.
alice → 1
an → 1, 2, 3
apple → 1, 3
ate → 1
is → 3
likes → 2
mike → 2
orange → 2
red → 3
8
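To make the diagram concrete, here is a minimal Scala sketch of the idea of an inverted index, a map from each term to the IDs of the documents that contain it. It is purely illustrative and is not how Lucene actually stores its postings.

// Toy corpus: document ID -> text
val docs = Map(
  1 -> "Alice ate an apple.",
  2 -> "Mike likes an orange.",
  3 -> "An apple is red."
)
// "indexing": tokenize, lowercase, and invert (term -> set of doc IDs)
val invertedIndex: Map[String, Set[Int]] =
  docs.toSeq
    .flatMap { case (id, text) =>
      text.toLowerCase.split("\\W+").filter(_.nonEmpty).map(term => (term, id))
    }
    .groupBy(_._1)
    .map { case (term, postings) => term -> postings.map(_._2).toSet }
// "searching": look the query term up in the index
println(invertedIndex.getOrElse("apple", Set.empty))  // -> Set(1, 3)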
9. Agenda
• What's NLP4L?
• How NLP improves the search experience
• Count the number of words in a Lucene index
• Application: Transliteration
• Future Plans
9
28. Structured Documents
ID product price gender
1
CURREN New Men s Date Stainless
Steel Military Sport Quartz Wrist Watch
8.92 Men s
2 Suiksilver The Gamer Watch 87.99 Men s
28
29. Unstructured Documents
ID | article
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels.
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants.
29
30. Make them Structured
ID | article | person | org | loc
1 | David Cameron says he has a mandate to pursue EU reform following the Conservatives' general election victory. The Prime Minister will be hoping his majority government will give him extra leverage in Brussels. | David Cameron | EU | Brussels
2 | He wants to renegotiate the terms of the UK's membership ahead of a referendum by the end of 2017. He has said he will campaign for Britain to remain in the EU if he gets the reforms he wants. | | EU | UK, Britain
NEE [1] extracts the interesting words.
[1] Named Entity Extraction
30
32. Agenda
• What's NLP4L?
• How NLP improves the search experience
• Count the number of words in a Lucene index
• Application: Transliteration
• Future Plans
32
33. A small Corpus
val CORPUS = Array(
"Alice ate an apple.",
"Mike likes an orange.",
"An apple is red."
)
33
34. val CORPUS = Array(            // text data
  "Alice ate an apple.",
  "Mike likes an orange.",
  "An apple is red."
)
val index = "/tmp/index-simple"    // Lucene index directory
// schema definition
def schema(): Schema = {
  val builder = AnalyzerBuilder()
  builder.withTokenizer("standard")
  builder.addTokenFilter("lowercase")
  val analyzer = builder.build
  val fieldTypes = Map(
    "text" -> FieldType(analyzer, true, true, true, true)
  )
  val analyzerDefault = analyzer
  Schema(analyzerDefault, fieldTypes)
}
// create a Lucene document
def doc(text: String): Document = {
  Document(Set(
    Field("text", text)
  ))
}
// open a writer
val writer = IWriter(index, schema)
// write documents
CORPUS.foreach(text => writer.write(doc(text)))
// close the writer
writer.close
index_simple.scala
34-42
As for code snippets used in my talk, please look at:
https://github.com/NLP4L/meetups/tree/master/20150818
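Each listing is labelled with its script name (index_simple.scala above); such scripts are run from the NLP4L interactive shell with the :load command, the same way examples/trans_katakana_alpha.scala is loaded in the demo later in this deck. The exact paths of these snippets inside the repository above are not spelled out here.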
44. Getting word counts
alice 1
an 1, 2, 3
apple 1, 3
ate 1
is 3
likes 2
mike 2
orange 2
red 3
val reader = RawReader(index)
reader.sumTotalTermFreq("text")
// -> 12
reader.field("text").get.terms.size
// -> 9
reader.totalTermFreq("text", "an")
// -> 3
reader.close
getting_word_counts.scala
44
47. Getting top terms
alice 1
an 1, 2, 3
apple 1, 3
ate 1
is 3
likes 2
mike 2
orange 2
red 3
val reader = RawReader(index)
reader.topTermsByDocFreq("text")
reader.topTermsByTotalTermFreq("text")
// ->
// (term, docFreq, totalTermFreq)
// (an,3,3)
// (apple,2,2)
// (likes,1,1)
// (is,1,1)
// (orange,1,1)
// (mike,1,1)
// (ate,1,1)
// (red,1,1)
// (alice,1,1)
reader.close
getting_word_counts.scala
47
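In this output, docFreq is the number of documents that contain the term, while totalTermFreq is the total number of occurrences of the term in the field across all documents; in this tiny corpus no term occurs twice in the same sentence, so the two numbers coincide.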
48. What's ShingleFilter?
• ShingleFilter = a word n-gram TokenFilter
"Lucene is a popular software"
WhitespaceTokenizer: Lucene / is / a / popular / software
ShingleFilter (N=2): Lucene is / is a / a popular / popular software
48
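As a plain-Scala illustration of the same idea (this is not the Lucene ShingleFilter itself, just a sliding window over the whitespace tokens):

val tokens = "Lucene is a popular software".split(" ").toList
// word 2-grams: every pair of adjacent tokens joined with a space
val shingles = tokens.sliding(2).map(_.mkString(" ")).toList
// -> List("Lucene is", "is a", "a popular", "popular software")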
49. Language Model
• A language model (LM) represents the fluency of language
• The n-gram model is the most widely used kind of LM
• Calculation example for a 2-gram model:
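A 2-gram (bigram) probability is estimated from counts: P(w2 | w1) = C(w1 w2) / C(w1). With the small corpus above, C(an apple) = 2 and C(an) = 3, so P(apple | an) = 2/3, while C(an orange) = 1 gives P(orange | an) = 1/3; these are exactly the quantities computed from the Lucene index on the following slides.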
49
50. val index = "/tmp/index-lm"
val CORPUS = Array(
"Alice ate an apple.",
"Mike likes an orange.",
"An apple is red."
)
def schema(): Schema = {
val builder = AnalyzerBuilder()
builder.withTokenizer("standard")
builder.addTokenFilter("lowercase")
val analyzer = builder.build
builder.addTokenFilter("shingle", "minShingleSize", "2", "maxShingleSize", "2",
"outputUnigrams", "false")
val analyzer2g = builder.build
val fieldTypes = Map(
"word" -> FieldType(analyzer, true, true, true, true),
"word2g" -> FieldType(analyzer2g, true, true, true, true)
)
val analyzerDefault = analyzer
Schema(analyzerDefault, fieldTypes)
}
// create a language model index
val writer = IWriter(index, schema())
def addDocument(doc: String): Unit = {
writer.write(Document(Set(
Field("word", doc),
Field("word2g", doc)
)))
}
CORPUS.foreach(addDocument(_))
writer.close()
language_model.scala
1/2
50
51. val index = "/tmp/index-lm"
val CORPUS = Array(
"Alice ate an apple.",
"Mike likes an orange.",
"An apple is red."
)
def schema(): Schema = {
val builder = AnalyzerBuilder()
builder.withTokenizer("standard")
builder.addTokenFilter("lowercase")
val analyzer = builder.build
builder.addTokenFilter("shingle", "minShingleSize", "2", "maxShingleSize", "2",
"outputUnigrams", "false")
val analyzer2g = builder.build
val fieldTypes = Map(
"word" -> FieldType(analyzer, true, true, true, true),
"word2g" -> FieldType(analyzer2g, true, true, true, true)
)
val analyzerDefault = analyzer
Schema(analyzerDefault, fieldTypes)
}
// create a language model index
val writer = IWriter(index, schema())
def addDocument(doc: String): Unit = {
writer.write(Document(Set(
Field("word", doc),
Field("word2g", doc)
)))
}
CORPUS.foreach(addDocument(_))
writer.close()
language_model.scala
1/2
51
schema definition
52. val reader = RawReader(index)
// P(apple|an) = C(an apple) / C(an)
val count_an_apple = reader.totalTermFreq("word2g", "an apple")
val count_an = reader.totalTermFreq("word", "an")
val prob_apple_an = count_an_apple.toFloat / count_an.toFloat
// P(orange|an) = C(an orange) / C(an)
val count_an_orange = reader.totalTermFreq("word2g", "an orange")
val prob_orange_an = count_an_orange.toFloat / count_an.toFloat
reader.close
language_model.scala
2/2
52
53. Part-of-Speech Tagging: our corpus for training
Alice/NNP ate/VB an/AT apple/NNP ./.
Mike/NNP likes/VB an/AT orange/NNP ./.
An/AT apple/NNP is/VB red/JJ ./.
NNP  Proper noun, singular
VB   Verb
AT   Article
JJ   Adjective
.    Period
53
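A first-order HMM tagger is trained from such a corpus simply by counting tag-to-tag transitions and tag-to-word emissions. In the three tagged sentences above, for example, NNP is followed by VB three times and by '.' twice, so the maximum-likelihood transition probability P(VB|NNP) = 3/5; NNP covers the words alice, apple, mike, orange, apple, so the emission probability P(apple|NNP) = 2/5. The next slides store this tagged corpus in a Lucene index and build such a model from it.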
60. val index = "/tmp/index-hmm"
val CORPUS = Array(
"Alice/NNP ate/VB an/AT apple/NNP ./.",
"Mike/NNP likes/VB an/AT orange/NNP ./.",
"An/AT apple/NNP is/VB red/JJ ./."
)
val indexer = HmmModelIndexer(index)
CORPUS.foreach{ text =>
val pairs = text.split("s+")
val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
indexer.addDocument(doc)
}
indexer.close()
// execute part-of-speech tagging on an unknown text
val model = HmmModel(index)
val tagger = HmmTagger(model)
tagger.tokens("alice likes an apple .")
hmm.scala
60
61. val index = "/tmp/index-hmm"
val CORPUS = Array(
"Alice/NNP ate/VB an/AT apple/NNP ./.",
"Mike/NNP likes/VB an/AT orange/NNP ./.",
"An/AT apple/NNP is/VB red/JJ ./."
)
val indexer = HmmModelIndexer(index)
CORPUS.foreach{ text =>
val pairs = text.split("s+")
val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
indexer.addDocument(doc)
}
indexer.close()
// execute part-of-speech tagging on an unknown text
val model = HmmModel(index)
val tagger = HmmTagger(model)
tagger.tokens("alice likes an apple .")
hmm.scala
61
text data (they are tagged!)
62. val index = "/tmp/index-hmm"
val CORPUS = Array(
"Alice/NNP ate/VB an/AT apple/NNP ./.",
"Mike/NNP likes/VB an/AT orange/NNP ./.",
"An/AT apple/NNP is/VB red/JJ ./."
)
val indexer = HmmModelIndexer(index)
CORPUS.foreach{ text =>
val pairs = text.split("s+")
val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
indexer.addDocument(doc)
}
indexer.close()
// execute part-of-speech tagging on an unknown text
val model = HmmModel(index)
val tagger = HmmTagger(model)
tagger.tokens("alice likes an apple .")
hmm.scala
62
write-open Lucene index
63. val index = "/tmp/index-hmm"
val CORPUS = Array(
"Alice/NNP ate/VB an/AT apple/NNP ./.",
"Mike/NNP likes/VB an/AT orange/NNP ./.",
"An/AT apple/NNP is/VB red/JJ ./."
)
val indexer = HmmModelIndexer(index)
CORPUS.foreach{ text =>
val pairs = text.split("s+")
val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
indexer.addDocument(doc)
}
indexer.close()
// execute part-of-speech tagging on an unknown text
val model = HmmModel(index)
val tagger = HmmTagger(model)
tagger.tokens("alice likes an apple .")
hmm.scala
63
tagged texts are indexed here
64. val index = "/tmp/index-hmm"
val CORPUS = Array(
"Alice/NNP ate/VB an/AT apple/NNP ./.",
"Mike/NNP likes/VB an/AT orange/NNP ./.",
"An/AT apple/NNP is/VB red/JJ ./."
)
val indexer = HmmModelIndexer(index)
CORPUS.foreach{ text =>
val pairs = text.split("s+")
val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
indexer.addDocument(doc)
}
indexer.close()
// execute part-of-speech tagging on an unknown text
val model = HmmModel(index)
val tagger = HmmTagger(model)
tagger.tokens("alice likes an apple .")
hmm.scala
64
make an HmmModel from Lucene index
65. val index = "/tmp/index-hmm"
val CORPUS = Array(
"Alice/NNP ate/VB an/AT apple/NNP ./.",
"Mike/NNP likes/VB an/AT orange/NNP ./.",
"An/AT apple/NNP is/VB red/JJ ./."
)
val indexer = HmmModelIndexer(index)
CORPUS.foreach{ text =>
val pairs = text.split("s+")
val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
indexer.addDocument(doc)
}
indexer.close()
// execute part-of-speech tagging on an unknown text
val model = HmmModel(index)
val tagger = HmmTagger(model)
tagger.tokens("alice likes an apple .")
hmm.scala
65
get HmmTagger from HmmModel
66. val index = "/tmp/index-hmm"
val CORPUS = Array(
"Alice/NNP ate/VB an/AT apple/NNP ./.",
"Mike/NNP likes/VB an/AT orange/NNP ./.",
"An/AT apple/NNP is/VB red/JJ ./."
)
val indexer = HmmModelIndexer(index)
CORPUS.foreach{ text =>
val pairs = text.split("s+")
val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
indexer.addDocument(doc)
}
indexer.close()
// execute part-of-speech tagging on an unknown text
val model = HmmModel(index)
val tagger = HmmTagger(model)
tagger.tokens("alice likes an apple .")
hmm.scala
66
use HmmTagger to annotate unknown sentence
67. val index = "/tmp/index-hmm"
val CORPUS = Array(
"Alice/NNP ate/VB an/AT apple/NNP ./.",
"Mike/NNP likes/VB an/AT orange/NNP ./.",
"An/AT apple/NNP is/VB red/JJ ./."
)
val indexer = HmmModelIndexer(index)
CORPUS.foreach{ text =>
val pairs = text.split("s+")
val doc = pairs.map{h => h.split("/")}.map{i => (i(0).toLowerCase(), i(1))}
indexer.addDocument(doc)
}
indexer.close()
// execute part-of-speech tagging on an unknown text
val model = HmmModel(index)
val tagger = HmmTagger(model)
tagger.tokens("alice likes an apple .")
hmm.scala
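Every token of the test sentence occurs with exactly one tag in the training data, so the tagger can be expected to label it alice/NNP likes/VB an/AT apple/NNP ./. (the precise shape of the value returned by tokens is not shown here).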
NLP4L ships hmm_postagger.scala in its examples directory.
It uses the Brown corpus for HMM training.
67
68. Agenda
• What's NLP4L?
• How NLP improves the search experience
• Count the number of words in a Lucene index
• Application: Transliteration
• Future Plans
68
69. Transliteration
Transliteration is the process of transcribing letters or words from one
alphabet to another in order to facilitate comprehension and pronunciation
for non-native speakers.
computer コンピューター
server サーバー
internet インターネット
mouse マウス
information インフォメーション
examples of transliteration from English to Japanese
69
73. Demo: Transliteration
Input Prediction Right Answer
アルゴリズム algorism algorithm
プログラム program (OK)
ケミカル chaemmical chemical
ダイニング dining (OK)
コミッター committer (OK)
エントリー entree entry
nlp4l> :load examples/trans_katakana_alpha.scala
73
74. Gathering loan words
① crawl
② gather Katakana-Alphabet string pairs (e.g. アルゴリズム, algorithm)
③ Transliteration: アルゴリズム → algorism
④ calculate the edit distance
⑤ store the pair of strings if the edit distance is small enough
⑥ synonyms.txt
Got 1,800+ records of synonym knowledge from jawiki.
74-75
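Step ④ can be done with a plain Levenshtein (edit) distance; the sketch below only illustrates that step, and the helper name and threshold value are assumptions rather than NLP4L's actual code.

// Minimal Levenshtein (edit) distance between two strings
def editDistance(a: String, b: String): Int = {
  val d = Array.ofDim[Int](a.length + 1, b.length + 1)
  for (i <- 0 to a.length) d(i)(0) = i
  for (j <- 0 to b.length) d(0)(j) = j
  for (i <- 1 to a.length; j <- 1 to b.length) {
    val cost = if (a(i - 1) == b(j - 1)) 0 else 1
    d(i)(j) = math.min(math.min(d(i - 1)(j) + 1, d(i)(j - 1) + 1), d(i - 1)(j - 1) + cost)
  }
  d(a.length)(b.length)
}

val prediction = "algorism"   // output of the transliteration model for アルゴリズム
val crawled    = "algorithm"  // alphabet string found next to アルゴリズム while crawling
// steps ④/⑤: keep the crawled pair only if the prediction is close enough to it
if (editDistance(prediction, crawled) <= 2)
  println(s"アルゴリズム, $crawled")   // this line would go into synonyms.txt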
76. Agenda
• What's NLP4L?
• How NLP improves the search experience
• Count the number of words in a Lucene index
• Application: Transliteration
• Future Plans
76
77. NLP4L Framework
• A pluggable framework that improves the search experience (mainly for Lucene-based search systems).
• Reference implementations of plug-ins, and corpora, are provided.
• Uses NLP/ML technologies to output models, dictionaries and indexes.
• Since NLP/ML are not perfect, an interface that lets users manually examine the output dictionaries is provided as well.
77
81. NLP4L Framework (overview diagram)
• Data Source: Corpus (text data, Lucene index), Query Log, Access Log
• Processing: TermExtractor, Transliteration, NEE, Classification, Document Vectors, Language Detection, Learning to Rank, Personalized Search
• Output: Dictionaries (Suggestion (auto complete), Did you mean?, synonyms.txt, userdic.txt, keyword attachment) with maintenance; Model files; Tagged Corpus; Document Vectors
• Consumers: Solr, ES, Mahout, Spark
81
82. Keyword Attachment
• Keyword attachment is a general mechanism that enables the following functions:
• Learning to Rank
• Personalized Search
• Named Entity Extraction
• Document Classification
(diagram: keywords are attached to a Lucene document, and matching them increases the document's boost)
82
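One possible way to use an attached keyword field at query time is a field boost in Lucene/Solr query syntax; the field names and the boost value below are illustrative assumptions, not something prescribed by NLP4L.

// match on the body text, but strongly prefer documents whose attached
// "keyword" field also matches the user query
val q = "apple"
val luceneQuery = s"text:$q OR keyword:$q^10"
// -> text:apple OR keyword:apple^10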
83. Learning to Rank
• The program learns, from the access log and other sources, that the score of document d for a query q should be larger than the normal score(q,d).
(diagram: queries q, q, … are attached to Lucene doc d)
https://en.wikipedia.org/wiki/Learning_to_rank
83
84. Personalized Search
• The program learns, from the access log and other sources, that the score of document d for a query q issued by user u should be larger than the normal score(q,d).
• Since Lucene does not allow a score(q,d,u), you have to express it as score(qu,d), i.e. combine the query and the user into a single key.
• Limit the data to high-frequency queries, or split fields per user, because the number of q-u combinations can be enormous.
(diagram: doc d1 carries q1u1, q2u2; doc d2 carries q2u1, q1u2)
84
85. Join and Code with Us!
Contact us at
koji at apache dot org
for the details.
85