Apache OpenNLP can be used with Lucene and Solr to tag words with part-of-speech, produce lemmas (words’ base forms), and to extract named entities: people, places, organizations, etc.
1. Using OpenNLP with Solr to improve search relevance and to extract named entities
Steve Rowe
Lucidworks
2. About me
• Previously worked at the Center for Natural Language Processing at Syracuse University
• Sr. Software Engineer at Lucidworks
• Committer on Apache Lucene/Solr project
• Committer on JFlex scanner generator project
3. Apache OpenNLP
• sentence segmentation
• tokenization
• part-of-speech tagging
• lemmatization
• named entity extraction
• phrase chunking
• parsing
• coreference resolution
• machine learning: maximum entropy and perceptron based
• caveat: model licensing: not Apache
4. Expectation Management
• OpenNLP isn’t integrated with Solr in any release
• LUCENE-2899: patches
• TDD (talk driven development)
• No Spanish: OpenNLP doesn’t publish Spanish models for sentence splitting, tokenization, or part-of-speech tagging
• No precision/recall/F-measure/MAP testing
5. LUCENE-2899
• Created: 30/Jan/11 10:44 <- over 5 years old
• Lance Norskog wrote the bulk of the implementation
• I modernized Lance’s patch and added lemmatization support
6. Lemmatization vs. stemming
• Both can be used with search to increase recall
• Lemmas are real words: infinitive verbs, singular nouns
• e.g. Speaking/VBG, spoke/VBD -> speak; stigmata/NNS -> stigma
• Can be produced by algorithm and/or known-item dictionary
• OpenNLP 1.6.1 will include a machine-learned lemmatization implementation
• Caveat: part-of-speech tagging quality is poor over short query text
• Stems are not (necessarily) real words
• e.g. Speaking -> speak, spoke -> spoke, stigmata -> stigmata (Porter stemmer)
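The contrast can be sketched with a toy dictionary lookup (hypothetical data and helper names, not OpenNLP’s API): a lemmatizer keys on the word plus its part-of-speech tag and returns a real base form, while a stemmer applies suffix rules blindly and can leave inflected forms untouched.

```java
import java.util.Map;

public class LemmaVsStem {
    // Toy lemmatization dictionary keyed on word + POS tag (hypothetical entries)
    static final Map<String, String> LEMMAS = Map.of(
        "speaking/VBG", "speak",
        "spoke/VBD", "speak",
        "stigmata/NNS", "stigma");

    // Dictionary-based lemmatization: needs the POS tag to pick the right lemma
    static String lemmatize(String word, String pos) {
        return LEMMAS.getOrDefault(word.toLowerCase() + "/" + pos, word);
    }

    // Crude suffix-stripping stemmer, loosely in the spirit of Porter:
    // it reduces "speaking" to "speak" but leaves "spoke" and "stigmata" alone
    static String stem(String word) {
        String w = word.toLowerCase();
        if (w.endsWith("ing")) return w.substring(0, w.length() - 3);
        if (w.endsWith("ed"))  return w.substring(0, w.length() - 2);
        return w;
    }

    public static void main(String[] args) {
        System.out.println(lemmatize("spoke", "VBD"));    // speak
        System.out.println(stem("spoke"));                // spoke
        System.out.println(lemmatize("stigmata", "NNS")); // stigma
        System.out.println(stem("stigmata"));             // stigmata
    }
}
```

Note how the irregular past tense “spoke” only maps to “speak” via the dictionary; no suffix rule can recover it, which is why lemmatization can improve recall where stemming cannot.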
• produced via algorithm
7. Penn Treebank part of speech tags
https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition/subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
8. Solr OpenNLP integration
• Put jars on classpath
• Add required resources to configset:
• models
• lemmatization dictionary
• Add field type(s) using OpenNLP-based analysis components, then fields using these field types
9. Put jars on classpath
• Add to configset’s solrconfig.xml:
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs"
     regex=".*\.jar" />
<lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib"
     regex="opennlp-.*\.jar" />
10. Add required resources to configset
• Download models from http://opennlp.sourceforge.net/models-1.5/
• Download lemma dictionary from http://ixa2.si.ehu.es/ragerri/lemmatizer-dicts.tgz
conf/
  opennlp/
    en-ner-person.bin
    en-pos-maxent.bin
    en-sent.bin
    en-token.bin
    language-tool-en-lemmatizer.txt
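Putting these resources together, a field type wired to them might look like the following sketch. The factory names follow the LUCENE-2899 patch and are assumptions, since this integration is not in any Solr release at the time of this talk:

```xml
<fieldType name="text_opennlp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Sentence segmentation + tokenization -->
    <tokenizer class="solr.OpenNLPTokenizerFactory"
               sentenceModel="opennlp/en-sent.bin"
               tokenizerModel="opennlp/en-token.bin"/>
    <!-- Part-of-speech tagging; the tags feed the lemmatizer -->
    <filter class="solr.OpenNLPPOSFilterFactory"
            posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <!-- Dictionary-based lemmatization -->
    <filter class="solr.OpenNLPLemmatizerFilterFactory"
            dictionary="opennlp/language-tool-en-lemmatizer.txt"/>
  </analyzer>
</fieldType>
<field name="body_en" type="text_opennlp" indexed="true" stored="true"/>
```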
12. Next steps
• Switch tags from payloads to token “type” attribute
• Make Solr update request processors for named entity extraction, maybe phrase chunker
• Commit/release LUCENE-2899!