If you want to expand your query or documents with synonyms in Apache Lucene, you need a predefined file listing terms that share the same meaning. It is not always easy to find a list of basic synonyms for a language and, even if you find one, it does not necessarily match your contextual domain.
The term “daemon” in the domain of operating system articles is not a synonym of “devil” but it’s closer to the term “process”.
Word2Vec is a two-layer neural network that takes a text as input and outputs a vector representation for each word in the dictionary. Two words with similar meanings are represented by two vectors close to each other.
1. Word2Vec Model To Generate
Synonyms on the Fly in Apache
Lucene
Daniele Antuzi, Software Engineer
Ilaria Petreti, Software Engineer
14th June, Berlin Buzzwords 2022
2. Who We Are
Daniele Antuzi
● R&D Search Software Engineer
● Master's Degree in Computer Science
● Passionate about coding
● Food and sport lover
LinkedIn
d.antuzi@sease.io
3. Who We Are
Ilaria Petreti
● Information Retrieval/Machine Learning Engineer
● Master in Data Science
● Passionate about Data Mining and Machine Learning technologies
● Sports lover (Basketball)
LinkedIn
i.petreti@sease.io
4. ‣ Headquartered in London / distributed
‣ Open Source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch experts
‣ Community Contributors
‣ Active Researchers
‣ London Information Retrieval Meetup
‣ Hot Trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
SEArch SErvices
www.sease.io
8. Synonyms Expansion in Apache Lucene/Solr
STATE OF THE ART: Vocabulary-based Synonym Expansion
SynonymGraphFilter
● Static synonym list: mysynonyms.txt file
● WordNet vocabulary: mysynonyms-wn.txt file
9. Synonyms Expansion in Apache Lucene/Solr
https://solr.apache.org/guide/8_9/filter-descriptions.html#synonym-graph-filter
STATE OF THE ART: Vocabulary-based Synonym Expansion
https://sease.io/2020/03/introducing-weighted-synonyms-in-apache-lucene.html
SynonymGraphFilter + DelimitedBoostFilter
● Weighted synonym list: boostedSynonyms.txt file
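As a hedged illustration (the slides do not show the file itself), a weighted synonyms file for the SynonymGraphFilter + DelimitedBoostFilter combination might look like the following, where each synonym carries an optional `|boost` suffix; the example terms and weights are invented:

```
# hypothetical boostedSynonyms.txt
leopard, big cat|0.8, bagheera|0.9, panthera pardus|0.85
```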
10. Limits of Vocabulary-based Synonym Expansion
1. different domains
2. different languages
3. manual maintenance (additional cost)
4. based on the word's denotation and NOT on its connotation
The term "daemon" in the domain of operating system articles is
not a synonym of "devil" but it's closer to the term "process"
11. Machine Learning Solution
Word2Vec-based Synonym Expansion
Idea and Image Source:
Teofili, T., & Mattmann, C. A. (2019). Deep learning for search. Shelter Island, NY: Manning
Publications Co.
Advantages:
● learns from the data to be indexed
● avoids missing relevant search results
● language agnostic
● no grammar or syntax involved
13. Word2vec
Word2Vec is a neural network-based algorithm for learning word representations
➢ It takes a text corpus as input and outputs a series of vector representations, one for each word in the text, called neural word embeddings
➢ Based on the Distributional Hypothesis
➢ Two semantically similar words are represented by two vectors close to each other in the space
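Since "close to each other" is measured with cosine similarity between embedding vectors, a minimal Java sketch of the measure may help; the three-dimensional toy vectors below are invented (real models use 100+ dimensions):

```java
// A minimal sketch of the similarity measure Word2Vec-based synonym
// expansion relies on: two words are "close" when the cosine of the
// angle between their embedding vectors approaches 1.
public class CosineSimilarity {

    // cosine(a, b) = dot(a, b) / (|a| * |b|)
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Invented toy "embeddings" for illustration only
        float[] daemon  = {0.9f, 0.1f, 0.2f};
        float[] process = {0.8f, 0.2f, 0.3f};
        float[] devil   = {0.1f, 0.9f, 0.1f};
        System.out.printf("daemon~process: %.3f%n", cosine(daemon, process));
        System.out.printf("daemon~devil:   %.3f%n", cosine(daemon, devil));
    }
}
```

With vectors learned from operating-system articles, "daemon" would score high against "process" and low against "devil", which is exactly the contextual behavior a static vocabulary cannot capture.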
14. Word2vec
[Diagram: one-hot input vector (length = vocabulary size) → hidden layer (dimension = embedding size) → softmax output layer (length = vocabulary size), with the input weight matrix highlighted as the word embedding]
● Feedforward neural network
● Input is one-hot encoded (vector length = vocabulary size)
● Hidden layer size => desired embedding size
● Output is also in one-hot encoding form (softmax over the vocabulary)
● The word embeddings are the vectors taken from the input weight matrix of the network
15. Word2vec - CBOW vs Skip-Gram
Source: https://arxiv.org/pdf/1301.3781.pdf
CBOW: uses a context (neighboring words) to predict a target word
Skip-gram: uses a word to predict a target context (neighboring words)
16. Word2vec - windowSize
With a window of 2 words, the sentence "The cat chased the mouse up to the den" produces the following (center, context) word pairs for training:
● center "The": (the, cat), (the, chased)
● center "cat": (cat, the), (cat, chased), (cat, the)
● center "chased": (chased, the), (chased, cat), (chased, the), (chased, mouse)
● center "the": (the, cat), (the, chased), (the, mouse), (the, up)
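The pair-generation rule above can be sketched in plain Java; this is a toy illustration of the windowing scheme, not DL4J's actual implementation:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of how skip-gram training pairs are built from a sentence
// with a configurable windowSize, as in the "The cat chased the
// mouse up to the den" example with windowSize = 2.
public class SkipGramPairs {

    // For every token, emit (center, context) pairs covering the
    // `window` tokens on each side, clipped at sentence boundaries.
    static List<String[]> pairs(String[] tokens, int window) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            for (int j = Math.max(0, i - window);
                 j <= Math.min(tokens.length - 1, i + window); j++) {
                if (j != i) {
                    result.add(new String[]{tokens[i], tokens[j]});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] sentence = "the cat chased the mouse up to the den".split(" ");
        for (String[] p : pairs(sentence, 2)) {
            System.out.println("(" + p[0] + ", " + p[1] + ")");
        }
    }
}
```

Running this reproduces the pairs listed above, e.g. (cat, the), (cat, chased), (cat, the) for the center word "cat".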
17. DeepLearning4J
❏ Open-source, distributed deep-learning library written for Java and Scala
❏ Integrated with Hadoop and Apache Spark
❏ Good developer community
❏ Out-of-the-box implementation of word2vec, based on the skip-gram model
22. Synonym Expansion - 1st prototype (Deeplearning4j)
Pipeline: model storing → model parsing (Deeplearning4j) → synonym expansion
Pros
● Already implemented and tested
Cons
● Too many dependencies
● Search is quite slow (~70ms* per synonym expansion)
*according to our preliminary experiments
Future work: more accurate benchmarks
23. Synonym expansion - How it works
Original word = W
1. Get the vector corresponding to the original term W
2. Search the vectors with the highest cosine similarity
3. Select the sub-set of vectors with the highest cosine similarity with the query vector (e.g. A, T, Z)
4. Return the terms corresponding to the selected vectors as synonyms of W
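The four steps above can be sketched with a brute-force (exact) scan over a tiny in-memory model; the real filter replaces the linear scan with an HNSW graph search, and all words and vectors here are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of synonym expansion over a toy word -> vector model:
// get the query vector, score every other vector by cosine
// similarity, keep the best ones above a threshold.
public class SynonymSearchSketch {

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    static List<String> synonyms(Map<String, float[]> model, String word,
                                 int maxSynonyms, double minSimilarity) {
        float[] queryVector = model.get(word);            // step 1
        if (queryVector == null) return List.of();
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (var e : model.entrySet()) {                  // step 2
            if (e.getKey().equals(word)) continue;
            double sim = cosine(queryVector, e.getValue());
            if (sim >= minSimilarity) {                   // step 3
                scored.add(Map.entry(e.getKey(), sim));
            }
        }
        scored.sort((x, y) -> Double.compare(y.getValue(), x.getValue()));
        List<String> result = new ArrayList<>();
        for (var e : scored.subList(0, Math.min(maxSynonyms, scored.size()))) {
            result.add(e.getKey());                       // step 4
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, float[]> model = Map.of(
            "daemon",  new float[]{0.9f, 0.1f},
            "process", new float[]{0.8f, 0.2f},
            "devil",   new float[]{0.1f, 0.9f});
        System.out.println(synonyms(model, "daemon", 10, 0.7));
    }
}
```

The `maxSynonyms` and `minSimilarity` parameters mirror the `maxSynonymsPerTerm` and `minAcceptedSimilarity` filter parameters described later; a full scan like this is O(vocabulary size) per query, which is why an approximate index such as HNSW is needed in practice.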
25. Hierarchical Navigable Small World (HNSW)
● A Navigable Small World graph is a proximity graph
○ vertices are vectors
○ an edge means that two vectors are close to each other
● Hierarchical layers based on skip lists
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
https://sease.io/2022/01/apache-solr-neural-search.html
[Diagram: Layer 2 → Layer 1 → Layer 0; the search descends from the entry point in the top layer to the nearest neighbor in Layer 0]
Approximate K-nearest neighbor search based on navigable small world graphs with controllable hierarchy
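As a rough illustration (not Lucene's actual implementation), the greedy routine that HNSW runs on each layer can be sketched in Java: start at an entry point and repeatedly move to the neighbor closest to the query until no neighbor improves the distance; HNSW runs this from the top layer down, using each layer's result as the next layer's entry point. The tiny graph below is invented:

```java
import java.util.List;
import java.util.Map;

// Greedy best-first walk on a single proximity-graph layer.
public class GreedyGraphSearch {

    // Euclidean distance between two vectors
    static double dist(float[] a, float[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }

    // Walk from entryPoint toward the query: at each step jump to the
    // neighbor that reduces the distance, stop at a local minimum.
    static int greedySearch(float[][] vectors, Map<Integer, List<Integer>> edges,
                            int entryPoint, float[] query) {
        int current = entryPoint;
        double best = dist(vectors[current], query);
        boolean improved = true;
        while (improved) {
            improved = false;
            for (int neighbour : edges.getOrDefault(current, List.of())) {
                double d = dist(vectors[neighbour], query);
                if (d < best) { best = d; current = neighbour; improved = true; }
            }
        }
        return current;
    }

    public static void main(String[] args) {
        float[][] vectors = {{0, 0}, {1, 0}, {2, 0}, {2, 1}};
        Map<Integer, List<Integer>> edges = Map.of(
            0, List.of(1), 1, List.of(0, 2), 2, List.of(1, 3), 3, List.of(2));
        // Query near vector 3; the greedy walk from node 0 ends at node 3.
        System.out.println(greedySearch(vectors, edges, 0, new float[]{2f, 1.2f}));
    }
}
```

Because the walk only follows edges, it inspects a handful of vertices instead of the whole vocabulary, which is where the ~70ms → ~6ms improvement reported on the next slide comes from.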
26. Word2VecSynonymFilter - HNSW
Pipeline: model storing → model parsing (ad-hoc parser) → synonym expansion (a Graph Searcher expands the token stream by querying the HNSW graph)
Improvements
● Fast search (from ~70ms down to ~6ms* per synonym expansion)
● No additional dependencies
27. Word2VecSynonymFilter - How to use
Word2VecSynonymFilter Configuration Parameters:
❏ Word2Vec model (REQUIRED): file containing the trained model
❏ Word2VecSupportedFormats (default: DL4J): DL4J is currently the only supported format
❏ maxSynonymsPerTerm (default: 10): maximum number of results returned by the synonym search
❏ minAcceptedSimilarity (default: 0.7f): minimum value of cosine similarity between the searched vector and the retrieved ones
❏ similarityAsBoost (default: true): assign the similarity value as a term boost
.addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "<model_file>")
29. LuceneWord2VecModelTrainer
Input:
● Lucene index path: path to the folder containing the index, used to fetch the document values
● Field name: the field to fetch the values from
Output:
● DL4J Word2Vec model file (.zip): contains a dictionary in which each token has a vector attached to it

Command line to train a Word2vec model from a Lucene index:
java -jar build/libs/LuceneWord2VecModelTrainer.jar -p <lucene_index_path> -f <field_name> -o <model_file>
30. LuceneWord2VecModelTrainer
● FieldValuesSentenceIterator class:
reads stored field values from the Lucene index to be used for training the word2vec model
● Model Training
○ Library: DeepLearning4J (DL4J)
○ Algorithm: Skip-gram model
○ Default parameters/hyperparameters
// Iterate over the stored values of the configured field in the Lucene index
SentenceIterator iter = new FieldValuesSentenceIterator(config);
Word2Vec vec = new Word2Vec.Builder()
    .layerSize(100)        // embedding size
    .minWordFrequency(5)   // ignore words seen fewer than 5 times
    .windowSize(5)         // context window
    .iterate(iter)
    .build();
vec.fit();                 // train the model
WordVectorSerializer.writeWord2VecModel(vec, config.getModelFilePath());
31. Our Work
- LuceneWord2VecModelTrainer:
Command-line tool to generate a DL4J Word2Vec model using a specific field of an Apache
Lucene index
Currently in our GitHub repository:
https://github.com/SeaseLtd/LuceneWord2VecModelTrainer
- Word2VecSynonymFilter:
New token filter in Lucene that queries the Word2Vec model on input tokens to get the
weighted list of synonyms of a specific term
Currently in our Lucene fork:
https://github.com/SeaseLtd/lucene/tree/word2vec
33. Example - Index Time
For the experiment we used the WikipediaExtractor to download the documents of the Italian Wikipedia (1,820,000 documents, 3.4 GB) into italian_wikipedia_data:
1. Index the Italian Wikipedia documents
2. Train the model using a specific field of the Lucene index

java -jar LuceneWord2VecModelTrainer.jar -p /sease/word2vec_model/italian_wikipedia_data -f text -o wikipedia-model.zip
[INFO ] 00:44:43.240 [main] ModelGenerator - indexPath = /sease/word2vec_model/italian_wikipedia_data
[INFO ] 00:44:43.244 [main] ModelGenerator - field = text
[INFO ] 00:44:43.244 [main] ModelGenerator - modelFile = wikipedia-model.zip
[INFO ] 03:28:27.653 [main] ModelGenerator - Model trained in 163 min
[INFO ] 03:31:30.708 [main] ModelGenerator - Model file wikipedia-model.zip generated
34. Example - Query Time
Word2vec Searcher
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer(StandardTokenizerFactory.NAME)
.addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "wikipedia-model.zip")
.build();
DirectoryReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
. . .
System.out.println("Enter an italian word : ");
String searchTerm = inputReader.readLine();
Query query = parser.parse(searchTerm);
log.info(query.toString());
. . .
TopDocs docs = searcher.search(query, 10);
35. Example - Query Time
Word2vec Searcher
Enter an italian word : computer
Found synonym microprocessore with similarity 0.8636
Found synonym controller with similarity 0.8663
Found synonym microcomputer with similarity 0.8687
Found synonym desktop with similarity 0.8754
Found synonym notebook with similarity 0.8761
Found synonym hardware with similarity 0.8838
Found synonym software with similarity 0.8960
Found synonym chip with similarity 0.8994
Found synonym mainframe with similarity 0.9054
Synonym(text:chip^0.8994 text:computer text:controller^0.8663 text:desktop^0.8754
text:hardware^0.8838 text:mainframe^0.9054 text:microcomputer^0.868
text:microprocessore^0.8636 text:notebook^0.8761 text:software^0.8960)
found 10 documents in 8 ms
36. Example - Index Time
Synonym expansion at index time
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer(StandardTokenizerFactory.NAME)
.addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "wikipedia-model.zip")
.build();
IndexWriterConfig luceneConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(directory, luceneConfig);
Document doc = new Document();
doc.add(new TextField("value", "computer", Field.Store.YES));
writer.addDocument(doc);
writer.commit();
⚠ Trade-offs:
● Bigger index
● Slower indexing process
● Need to re-index the whole collection if the synonym model changes
37. Example - Index Time
Using Luke to check the index after the synonym expansion, some of the terms stored in the index are not synonyms… can we improve it?

> Word2VecIndexerWithSynonyms.main()
File read successfully
Building the HNSW graph
Created HNSW graph in 2 min
Created document with value: computer
Index created

It took 2 minutes to load 299,853 vectors… can we improve it?
39. Current Limitation
● Model in memory
○ Disaster recovery => longer time to recover
○ Multi-process => multiple models

Future Works - Model Stored into a Lucene Index
How do we plan to solve it?
● Change the "model storage" part to store the model into a Lucene index:
○ no need to load the model and rebuild the HNSW graph on process startup => faster disaster recovery
○ a single model instance => multiple processes access the same model
40. Future Works - Improvements
● Introduce model hyperparameters tuning in our LuceneWord2VecModelTrainer tool
● Synonym expansion using other NLP language models (e.g. BERT)