Word2Vec Model To Generate
Synonyms on the Fly in Apache
Lucene



Daniele Antuzi, Software Engineer
Ilaria Petreti, Software Engineer
14th June, Berlin Buzzwords 2022
Who We Are
Daniele Antuzi
● R&D Search Software Engineer
● Master's Degree in Computer Science
● Passionate about coding
● Food and sport lover
LinkedIn
d.antuzi@sease.io
Who We Are
Ilaria Petreti
● Information Retrieval/Machine Learning Engineer
● Master's in Data Science
● Passionate about Data Mining and Machine Learning technologies
● Sports lover (Basketball)
LinkedIn
i.petreti@sease.io
SEArch SErvices
www.sease.io
‣ Headquartered in London, distributed team
‣ Open Source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch experts
‣ Community Contributors
‣ Active Researchers
‣ London Information Retrieval Meetup
‣ Hot Trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
Agenda
Our Contribution
Word2Vec Algorithm
Synonym Expansion
Example - Index/Query time
Future Works
Synonyms Expansion
Query: “Best places for a walk in the mountains”
Synonyms: hiking, trekking
Goal: expand the query with synonyms to improve RECALL
Synonyms Expansion in Apache Lucene/Solr
STATE OF THE ART: Vocabulary-based Synonym Expansion
SynonymGraphFilter
● Static synonym list: mysynonyms.txt file
● WordNet vocabulary: mysynonyms-wn.txt file
https://solr.apache.org/guide/8_9/filter-descriptions.html#synonym-graph-filter
Synonyms Expansion in Apache Lucene/Solr
STATE OF THE ART: Vocabulary-based Synonym Expansion
SynonymGraphFilter + DelimitedBoostFilter
● Weighted synonym list: boostedSynonyms.txt file
https://sease.io/2020/03/introducing-weighted-synonyms-in-apache-lucene.html
Limits of Vocabulary-based Synonym Expansion
1. Different domains
2. Different languages
3. Additional cost of manual maintenance
4. Based on the word’s denotation and NOT on its connotation
Example: in the domain of operating system articles, the term "daemon" is not a synonym of "devil"; it is closer to the term "process"
Machine Learning Solution
Word2Vec-based Synonym Expansion
Idea and Image Source: Teofili, T., & Mattmann, C. A. (2019). Deep learning for search. Shelter Island, NY: Manning Publications Co.
Advantages:
● learning from the data to be indexed
● avoid missing relevant search results
● language agnostic
● no grammar or syntax involved
Word2vec
Word2Vec is a neural network-based algorithm for learning word representations
➢ It takes a text corpus as input and outputs a series of vector representations, one for each word in the text, called neural word embeddings
➢ Based on the Distributional Hypothesis (words that occur in similar contexts tend to have similar meanings)
➢ Two semantically similar words are represented by two vectors that are close to each other in the space
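"Close to each other" is commonly measured with cosine similarity, the same measure the later slides use for synonym search. The following is a minimal sketch in plain Java; the class name and the 2-dimensional vectors are invented for illustration and do not come from a trained model.

```java
public class CosineSimilarityExample {

    /** Cosine similarity = dot(a, b) / (|a| * |b|); 1 means same direction. */
    public static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same dimension");
        }
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Hypothetical 2-dimensional embeddings, invented for this example
        double[] hiking = {0.9, 0.1};
        double[] trekking = {0.8, 0.2};
        double[] invoice = {-0.1, 0.9};
        System.out.println("hiking vs trekking: " + cosine(hiking, trekking));
        System.out.println("hiking vs invoice:  " + cosine(hiking, invoice));
    }
}
```

With these toy values the first pair scores much higher than the second, which is the behaviour a synonym search relies on.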
Word2vec
Figure: network architecture. Input Vector (1-hot encoding, length of vocabulary) → Input Weight Matrix → Hidden Layer (dimension of embeddings, the word embedding) → Output Layer (Softmax, length of vocabulary)
● Feedforward neural network
● Input is one-hot-encoded
● One hidden layer, whose size is the desired embedding size
● Output is also in one-hot encoding form
● The word embeddings are the weight vectors learned by the network (input weight matrix)
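Because the input is one-hot encoded, multiplying it by the input weight matrix simply selects one row of that matrix, and that row is the embedding of the word. A small sketch of this lookup, assuming a hypothetical vocabulary of 4 words and made-up weights:

```java
import java.util.Arrays;

public class EmbeddingLookupExample {

    /** Builds a one-hot vector: all zeros except a 1 at the word's index. */
    public static double[] oneHot(int wordIndex, int vocabularySize) {
        double[] vector = new double[vocabularySize];
        vector[wordIndex] = 1.0;
        return vector;
    }

    /** Multiplies a 1 x V input by a V x D weight matrix, giving the 1 x D hidden layer. */
    public static double[] hiddenLayer(double[] input, double[][] inputWeights) {
        int dimension = inputWeights[0].length;
        double[] hidden = new double[dimension];
        for (int row = 0; row < input.length; row++) {
            for (int col = 0; col < dimension; col++) {
                hidden[col] += input[row] * inputWeights[row][col];
            }
        }
        return hidden;
    }

    public static void main(String[] args) {
        // Made-up weights: vocabulary of 4 words, embeddings of dimension 2
        double[][] inputWeights = {
            {0.10, 0.20},
            {0.30, 0.40},
            {0.50, 0.60},
            {0.70, 0.80}
        };
        double[] hidden = hiddenLayer(oneHot(1, 4), inputWeights);
        // The hidden layer equals row 1 of the weight matrix
        System.out.println(Arrays.toString(hidden));
    }
}
```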
Word2vec - CBOW vs Skip-Gram
● CBOW: using a context (neighboring words) to predict a target word
● Skip-Gram: using a word to predict a target context (neighboring words)
Source: https://arxiv.org/pdf/1301.3781.pdf
Word2vec - windowSize
Word pairs for training, with a window of 2 words, over the sentence "The cat chased the mouse up to the den":
● Target "The" (1st word): (the, cat), (the, chased)
● Target "cat": (cat, the), (cat, chased), (cat, the)
● Target "chased": (chased, the), (chased, cat), (chased, the), (chased, mouse)
● Target "the" (4th word): (the, cat), (the, chased), (the, mouse), (the, up)
Figure: the sentence with the target word highlighted and its context window (context of the phrase)
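The pair generation on this slide can be sketched in a few lines of plain Java. This is an illustration only: the class and method names are hypothetical, and real word2vec implementations add steps (such as subsampling of frequent words) that are left out here.

```java
import java.util.ArrayList;
import java.util.List;

public class SkipGramPairsExample {

    /** Returns (target, context) pairs for every token, looking windowSize words left and right. */
    public static List<String[]> pairs(String[] tokens, int windowSize) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int from = Math.max(0, i - windowSize);
            int to = Math.min(tokens.length - 1, i + windowSize);
            for (int j = from; j <= to; j++) {
                if (j != i) { // a word is not its own context
                    result.add(new String[] {tokens[i], tokens[j]});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "the cat chased the mouse up to the den".split(" ");
        for (String[] pair : pairs(tokens, 2)) {
            System.out.println("(" + pair[0] + ", " + pair[1] + ")");
        }
    }
}
```

Run on the sentence from the slide with a window of 2, the first pairs printed are the ones listed above for "the", "cat" and "chased".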
DeepLearning4J
❏ Open-source, distributed deep-learning library written for Java and Scala
❏ Integrated with Hadoop and Apache Spark
❏ Good developer community
❏ Out-of-the-box implementation of word2vec, based on the skip-gram model
DeepLearning4J Model Output
DL4J Word2Vec Model Output Example
Token (B64 encoded) + associated Vector:
B64:ZGk= 0.06251079589128494 -0.9980443120002747
B64:ZQ== 0.5112091898918152 -0.8594563603401184
B64:aWw= 0.5138685703277588 -0.8578689694404602
B64:bGE= 0.4818926453590393 -0.8762302398681641
B64:aQ== 0.9747347831726074 -0.22336536645889282
B64:ZGVsbGE= 0.3850429654121399 -0.9228987097740173
B64:cGVy 0.964830219745636 -0.26287391781806946
…
(vectorDimension = 2; excerpt of the syn0.txt file contained in the model zip file)
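Each line shown above can be read as a Base64-encoded token followed by its vector components. The sketch below decodes such a line using only the JDK. The line layout is inferred from the sample lines on the slide, and the class is hypothetical: it is neither DL4J's own reader nor the parser used in the filter.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Syn0LineParserExample {

    private static final String PREFIX = "B64:";

    /** Decodes the first column of a line, e.g. "B64:ZGk=" (assumed layout). */
    public static String parseToken(String line) {
        String encoded = line.trim().split("\\s+")[0];
        if (!encoded.startsWith(PREFIX)) {
            return encoded; // assumption: tokens without the prefix are stored as plain text
        }
        byte[] bytes = Base64.getDecoder().decode(encoded.substring(PREFIX.length()));
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Parses the remaining columns of a line as the vector components. */
    public static float[] parseVector(String line) {
        String[] parts = line.trim().split("\\s+");
        float[] vector = new float[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            vector[i - 1] = Float.parseFloat(parts[i]);
        }
        return vector;
    }

    public static void main(String[] args) {
        String line = "B64:ZGVsbGE= 0.3850429654121399 -0.9228987097740173";
        System.out.println(parseToken(line) + " has " + parseVector(line).length + " dimensions");
    }
}
```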
Word2VecSynonymFilter
Word2VecSynonymFilter - Phases
Model parsing → Model storing → Synonym expansion
1st prototype - deeplearning4j
Deeplearning4j
Pros
● Already implemented and tested
Cons
● Too many dependencies
● Search is quite slow (~70ms* for each synonym expansion)
*according to our preliminary experiments
Future works: more accurate benchmarks
Synonym expansion - How it works
1. Original word = W
2. Getting the vector corresponding to the original term
3. Searching the vectors with highest cosine similarity
4. Select the sub-set of vectors with the highest cosine similarity with the query vector
Figure: the vector of W among the vectors A, T, L, V, Q, B, P, Z; the selected sub-set is A, T, Z
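The four steps can be sketched as an exhaustive (brute-force) search over an in-memory model. This only illustrates the idea and is not the filter's implementation: the class, method and parameter names are hypothetical, and the toy vectors are invented.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BruteForceSynonymsExample {

    public static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns up to maxSynonyms terms whose similarity with term is at least minSimilarity. */
    public static List<Map.Entry<String, Double>> synonyms(
            Map<String, float[]> model, String term, int maxSynonyms, double minSimilarity) {
        List<Map.Entry<String, Double>> candidates = new ArrayList<>();
        float[] query = model.get(term);             // step 2: vector of the original term
        if (query == null) {
            return candidates;                       // unknown term: no expansion
        }
        for (Map.Entry<String, float[]> entry : model.entrySet()) {
            if (entry.getKey().equals(term)) {
                continue;                            // skip the original term itself
            }
            double similarity = cosine(query, entry.getValue()); // step 3
            if (similarity >= minSimilarity) {
                candidates.add(new AbstractMap.SimpleEntry<>(
                        entry.getKey(), Double.valueOf(similarity)));
            }
        }
        // step 4: keep only the most similar vectors
        candidates.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(maxSynonyms, candidates.size())));
    }

    public static void main(String[] args) {
        Map<String, float[]> model = new LinkedHashMap<>();
        model.put("W", new float[] {1.0f, 0.0f});    // invented toy vectors
        model.put("A", new float[] {0.9f, 0.1f});
        model.put("T", new float[] {0.8f, 0.3f});
        model.put("L", new float[] {0.0f, 1.0f});
        model.put("Q", new float[] {-1.0f, 0.0f});
        System.out.println(synonyms(model, "W", 10, 0.7));
    }
}
```

An exhaustive scan compares the query with every vector in the model, so its cost grows linearly with the vocabulary size.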
Word2VecSynonymFilter - Phases
Model parsing → Model storing → Synonym expansion
Lucene already implements K-Nearest-Neighbor search using HNSW
Image from The Big Bang Theory (HBO)
Hierarchical Navigable Small World (HNSW)
Approximate K-nearest neighbor search based on navigable small world graphs with controllable hierarchy
● A Navigable Small World graph is a proximity graph
○ vertices are vectors
○ an edge means that two vectors are close to each other
● Hierarchical layers based on skip lists
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
Figure: the search starts from the entry point in Layer 2 and descends through Layer 1 to the nearest neighbor in Layer 0
https://sease.io/2022/01/apache-solr-neural-search.html
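The layered search can be illustrated with a heavily simplified sketch: in each layer, walk greedily to the neighbour closest to the query, then use the node reached as the entry point of the layer below. This is not Lucene's implementation. The class is hypothetical, the graph is hand-built, and real HNSW keeps a list of several candidates per layer instead of a single node.

```java
import java.util.List;
import java.util.Map;

public class GreedyLayeredSearchExample {

    public static double squaredDistance(float[] a, float[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** Greedy walk in one layer: move to the closest neighbour until none is closer. */
    public static int greedySearch(float[][] vectors, Map<Integer, int[]> layer,
                                   int entryPoint, float[] query) {
        int current = entryPoint;
        double best = squaredDistance(vectors[current], query);
        while (true) {
            int next = current;
            for (int neighbour : layer.getOrDefault(current, new int[0])) {
                double distance = squaredDistance(vectors[neighbour], query);
                if (distance < best) {
                    best = distance;
                    next = neighbour;
                }
            }
            if (next == current) {
                return current; // local minimum reached in this layer
            }
            current = next;
        }
    }

    /** Descends from the top layer (last in the list) to layer 0 (first in the list). */
    public static int search(float[][] vectors, List<Map<Integer, int[]>> layers,
                             int entryPoint, float[] query) {
        int current = entryPoint;
        for (int level = layers.size() - 1; level >= 0; level--) {
            current = greedySearch(vectors, layers.get(level), current, query);
        }
        return current;
    }

    public static void main(String[] args) {
        // Six invented points on a line; node i has coordinates (i, 0)
        float[][] vectors = {{0, 0}, {1, 0}, {2, 0}, {3, 0}, {4, 0}, {5, 0}};
        // Layer 0: short edges between adjacent nodes; layer 1: one long edge 0 <-> 4
        Map<Integer, int[]> layer0 = Map.of(
                0, new int[] {1}, 1, new int[] {0, 2}, 2, new int[] {1, 3},
                3, new int[] {2, 4}, 4, new int[] {3, 5}, 5, new int[] {4});
        Map<Integer, int[]> layer1 = Map.of(0, new int[] {4}, 4, new int[] {0});
        int nearest = search(vectors, List.of(layer0, layer1), 0, new float[] {4.2f, 0f});
        System.out.println("Nearest node: " + nearest);
    }
}
```

In the toy graph the long edge in the upper layer lets the search jump from node 0 to node 4 in one step, which is the "fast retrieval" role of the higher layers; the lower layer then refines the result.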
Word2VecSynonymFilter - HNSW
● Model parsing: stream → Ad-hoc parser
● Model storing: Hnsw Graph
● Synonym expansion: Graph Searcher
Improvements
● Fast search (from ~70ms down to ~6ms* for each synonym expansion)
● No additional dependencies
*according to our preliminary experiments
Future works: more accurate benchmarks
Word2VecSynonymFilter - How to use
Word2VecSynonymFilter Configuration Parameters:
❏ Word2Vec model: REQUIRED. File containing the trained model
❏ Word2VecSupportedFormats: default DL4J. DL4J is currently the only supported format
❏ maxSynonymsPerTerm: default 10. Maximum number of results returned by the synonym search
❏ minAcceptedSimilarity: default 0.7f. Minimum value of cosine similarity between the searched vector and the retrieved ones
❏ similarityAsBoost: default true. Use the similarity value as the boost of the synonym term
.addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "<model_file>")
LuceneWord2VecModelTrainer
Input
● Lucene Index Path: path to the folder containing the index, used to fetch the document values
● Field Name: field to fetch the values from
Output
● DL4J Word2Vec model file (.zip): contains a dictionary in which each token has a vector attached to it
Command line to train a Word2vec model from a Lucene index:
java -jar build/libs/LuceneWord2VecModelTrainer.jar -p <lucene_index_path> -f <field_name> -o <model_file>
LuceneWord2VecModelTrainer
● FieldValuesSentenceIterator class:
to read stored field values from the Lucene index to be used for training the word2vec model
● Model Training
○ Library: DeepLearning4J (DL4J)
○ Algorithm: Skip-gram model
○ Default parameters/hyperparameters
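// DL4J classes used by this snippet
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;

// Reads the stored values of the configured field from the Lucene index
SentenceIterator iter = new FieldValuesSentenceIterator(config);
Word2Vec vec = new Word2Vec.Builder()
    .layerSize(100)        // dimension of the embeddings
    .minWordFrequency(5)   // discard words seen fewer than 5 times
    .windowSize(5)         // size of the context window
    .iterate(iter)
    .build();
vec.fit();                 // train the model
// Write the trained model to the output file
WordVectorSerializer.writeWord2VecModel(vec, config.getModelFilePath());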
Our work
- LuceneWord2VecModelTrainer:
Command line tool to generate a DL4J Word2Vec model using a specific field of an Apache Lucene index
Currently in our GitHub repository:
https://github.com/SeaseLtd/LuceneWord2VecModelTrainer
- Word2VecSynonymFilter:
New token filter in Lucene that queries the Word2Vec model on input tokens to get the
weighted list of synonyms of a specific term
Currently in our Lucene fork:
https://github.com/SeaseLtd/lucene/tree/word2vec
Example - Index Time
For the experiment we used the WikipediaExtractor to download the documents of the Italian Wikipedia (italian_wikipedia_data): 1,820,000 documents (3.4GB)
1. Index the Italian Wikipedia documents
2. Train the model using a specific field of the Lucene index
java -jar LuceneWord2VecModelTrainer.jar -p /sease/word2vec_model/italian_wikipedia_data -f text -o wikipedia-model.zip
[INFO ] 00:44:43.240 [main] ModelGenerator - indexPath = /sease/word2vec_model/italian_wikipedia_data
[INFO ] 00:44:43.244 [main] ModelGenerator - field = text
[INFO ] 00:44:43.244 [main] ModelGenerator - modelFile = wikipedia-model.zip
[INFO ] 03:28:27.653 [main] ModelGenerator - Model trained in 163 min
[INFO ] 03:31:30.708 [main] ModelGenerator - Model file wikipedia-model.zip generated
Example - Query Time
Word2vec Searcher
// Analyzer that expands each token with the synonyms found in the Word2Vec model
Analyzer analyzer = CustomAnalyzer.builder()
    .withTokenizer(StandardTokenizerFactory.NAME)
    .addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "wikipedia-model.zip")
    .build();
DirectoryReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);
. . .
System.out.println("Enter an italian word : ");
String searchTerm = inputReader.readLine();
// Parse the user input into a Lucene query and log it
Query query = parser.parse(searchTerm);
log.info(query.toString());
. . .
// Retrieve the top 10 matching documents
TopDocs docs = searcher.search(query, 10);
Example - Query Time
Word2vec Searcher
Enter an italian word : computer
Found synonym microprocessore with similarity 0.8636
Found synonym controller with similarity 0.8663
Found synonym microcomputer with similarity 0.8687
Found synonym desktop with similarity 0.8754
Found synonym notebook with similarity 0.8761
Found synonym hardware with similarity 0.8838
Found synonym software with similarity 0.8960
Found synonym chip with similarity 0.8994
Found synonym mainframe with similarity 0.9054
Synonym(text:chip^0.8994 text:computer text:controller^0.8663 text:desktop^0.8754
text:hardware^0.8838 text:mainframe^0.9054 text:microcomputer^0.8687
text:microprocessore^0.8636 text:notebook^0.8761 text:software^0.8960)
found 10 documents in 8 ms
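The logged query shows each synonym carrying its similarity as a boost, while the original term has none. The sketch below reproduces only the shape of that string with plain string formatting. It is not Lucene code: the class and method names are hypothetical, and the alphabetical ordering simply mirrors the log above.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class BoostedExpansionExample {

    /** Formats the original term plus its synonyms in the style of the logged query. */
    public static String expand(String field, String term,
                                Map<String, Double> synonyms, boolean similarityAsBoost) {
        // TreeMap keeps the terms in alphabetical order, as in the logged query
        TreeMap<String, String> clauses = new TreeMap<>();
        clauses.put(term, field + ":" + term); // the original term carries no boost
        for (Map.Entry<String, Double> synonym : synonyms.entrySet()) {
            String clause = field + ":" + synonym.getKey();
            if (similarityAsBoost) {
                clause += "^" + String.format(Locale.ROOT, "%.4f", synonym.getValue());
            }
            clauses.put(synonym.getKey(), clause);
        }
        return "Synonym(" + String.join(" ", clauses.values()) + ")";
    }

    public static void main(String[] args) {
        Map<String, Double> synonyms = Map.of("chip", 0.8994, "mainframe", 0.9054);
        System.out.println(expand("text", "computer", synonyms, true));
        System.out.println(expand("text", "computer", synonyms, false));
    }
}
```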
Example - Index Time
Synonym expansion at index time
Analyzer analyzer = CustomAnalyzer.builder()
.withTokenizer(StandardTokenizerFactory.NAME)
.addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "wikipedia-model.zip")
.build();
IndexWriterConfig luceneConfig = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(directory, luceneConfig);
Document doc = new Document();
doc.add(new TextField("value", "computer", Field.Store.YES));
writer.addDocument(doc);
writer.commit();
● Bigger index
● Slower indexing process
● Need to re-index the whole collection if the synonym model changes
Example - Index Time
> Word2VecIndexerWithSynonyms.main()
File read successfully
Building the HNSW graph
Created HNSW graph in 2 min
Created document with value: computer
Index created
It took 2 minutes to load 299,853 vectors … can we improve it? Let’s try …
Using Luke to check the index after the synonym expansion:
Terms stored in the index: some words are not synonyms … can we improve it?
Future Works - Model stored into a Lucene Index
Current limitation
● Model in memory
○ Disaster recovery => longer time to recover
○ Multi process => multiple models
How do we plan to solve it?
● Change the “model storage” part to store the model in a Lucene index:
○ no need to load the model and rebuild the HNSW graph on process startup => faster disaster recovery
○ single model instance => multiple processes access the same model
Future Works - Improvements
● Introduce model hyperparameter tuning in our LuceneWord2VecModelTrainer tool
● Synonym expansion using other NLP language models (e.g. BERT)
Future Works
● Solr/Elasticsearch/OpenSearch integration?
● Introduce multi-term synonyms
Thank you for your attention!

Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to Rank
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
 

Recently uploaded

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGSujit Pal
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 

Recently uploaded (20)

Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 

Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf

  • 1. Word2Vec Model To Generate Synonyms on the Fly in Apache Lucene. Daniele Antuzi, Software Engineer; Ilaria Petreti, Software Engineer. 14th June, Berlin Buzzwords 2022
  • 2. Who We Are: Daniele Antuzi ● R&D Search Software Engineer ● Master's Degree in Computer Science ● Passionate about coding ● Food and sport lover ● d.antuzi@sease.io
  • 3. Who We Are: Ilaria Petreti ● Information Retrieval/Machine Learning Engineer ● Master in Data Science ● Passionate about Data Mining and Machine Learning technologies ● Sports lover (Basketball) ● i.petreti@sease.io
  • 4. Sease (SEArch SErvices), www.sease.io ‣ Headquarters in London/distributed ‣ Open Source Enthusiasts ‣ Apache Lucene/Solr experts ‣ Elasticsearch experts ‣ Community Contributors ‣ Active Researchers ‣ London Information Retrieval Meetup ‣ Hot Trends: Neural Search, Natural Language Processing, Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
  • 5. Agenda Our Contribution Word2Vec Algorithm Synonym Expansion Example - Index/Query time Future Works
  • 7. Synonyms Expansion. Query: “Best places for a walk in the mountains”. Synonyms: hiking, trekking. Goal: to improve RECALL
  • 8. Synonyms Expansion in Apache Lucene/Solr. STATE OF THE ART: Vocabulary-based Synonym Expansion. SynonymGraphFilter ● Static synonym list (mysynonyms.txt file) ● WordNet vocabulary (mysynonyms-wn.txt file)
  • 9. Synonyms Expansion in Apache Lucene/Solr. STATE OF THE ART: Vocabulary-based Synonym Expansion. SynonymGraphFilter + DelimitedBoostFilter ● Weighted synonym list (boostedSynonyms.txt file). References: https://solr.apache.org/guide/8_9/filter-descriptions.html#synonym-graph-filter and https://sease.io/2020/03/introducing-weighted-synonyms-in-apache-lucene.html
  • 10. Limits of Vocabulary-based Synonym Expansion: 1. different domains 2. different languages 3. additional cost of manual maintenance 4. based on the word’s denotation and NOT on its connotation. Example: the term "daemon" in the domain of operating system articles is not a synonym of "devil"; it is closer to the term "process"
  • 11. Machine Learning Solution Word2Vec-based Synonym Expansion Idea and Image Source: Teofili, T., & Mattmann, C. A. (2019). Deep learning for search. Shelter Island, NY: Manning Publications Co. Advantages: ● learning from the data to be indexed ● avoid missing relevant search results ● language agnostic ● no grammar or syntax involved
  • 12. Agenda Our Contribution Word2Vec Algorithm Synonym Expansion Example - Index/Query time Future Works
  • 13. Word2vec: Word2Vec is a neural network-based algorithm for learning word representations ➢ It takes a text corpus as input and outputs a series of vector representations, one for each word in the text, called neural word embeddings ➢ Based on the Distributional Hypothesis ➢ Two semantically similar words are represented by two vectors close to each other in the space
  • 14. Word2vec [Figure: one-hot input vector (length of vocabulary) → input weight matrix → hidden layer (dimension of embeddings) → softmax output layer (length of vocabulary)] ● Feedforward neural network ● Input is one-hot encoded ● One hidden layer, whose size is the desired embedding size ● Output is also in one-hot encoding form ● The word embeddings are the rows of the input weight matrix
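Why the input weight matrix holds the embeddings can be shown with a small sketch: multiplying a one-hot input vector by that matrix simply selects one of its rows. The 4-word vocabulary and the 2-dimensional weights below are made-up values for illustration, not taken from a trained model, and `OneHotLookup` is a hypothetical helper name.

```java
import java.util.Arrays;

final class OneHotLookup {
    // Multiplies a row vector (1 x V) by a matrix (V x D), giving a 1 x D vector.
    static double[] multiply(double[] oneHot, double[][] weights) {
        double[] out = new double[weights[0].length];
        for (int i = 0; i < oneHot.length; i++) {
            for (int j = 0; j < out.length; j++) {
                out[j] += oneHot[i] * weights[i][j];
            }
        }
        return out;
    }

    // Builds the one-hot encoding of the word at the given vocabulary index.
    static double[] oneHot(int index, int vocabularySize) {
        double[] v = new double[vocabularySize];
        v[index] = 1.0;
        return v;
    }

    public static void main(String[] args) {
        // Illustrative input weight matrix: 4 words, embeddings of size 2.
        double[][] inputWeights = {
            {0.1, 0.9}, {0.4, -0.2}, {-0.7, 0.3}, {0.5, 0.5}
        };
        double[] hidden = multiply(oneHot(1, inputWeights.length), inputWeights);
        // The hidden layer equals row 1 of the matrix, i.e. the embedding of word 1.
        System.out.println(Arrays.toString(hidden));
    }
}
```

In practice the multiplication is never carried out; implementations look the row up directly, which is why the embedding table can be exported as one vector per token.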
  • 15. Word2vec - CBOW vs Skip-Gram. CBOW: using a context (neighboring words) to predict a target word. Skip-Gram: using a word to predict a target context (neighboring words). Source: https://arxiv.org/pdf/1301.3781.pdf
  • 16. Word2vec - windowSize. Context of the phrase "The cat chased the mouse up to the den" with a window of 2 words. Word pairs for training: target "The" (1st word): (the, cat) (the, chased); target "cat": (cat, the) (cat, chased) (cat, the); target "chased": (chased, the) (chased, cat) (chased, the) (chased, mouse); target "the" (4th word): (the, cat) (the, chased) (the, mouse) (the, up)
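The word pairs on this slide can be reproduced with a few lines of Java. This is a minimal sketch of fixed-window pair generation only; it assumes lowercased, whitespace-split tokens and leaves out what real word2vec trainers typically add (random window shrinking, subsampling of frequent words). `SkipGramPairs` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

final class SkipGramPairs {
    // Returns (target, context) pairs for every token, looking up to
    // `window` positions to the left and to the right of the target.
    static List<String[]> pairs(String[] tokens, int window) {
        List<String[]> result = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            int from = Math.max(0, i - window);
            int to = Math.min(tokens.length - 1, i + window);
            for (int j = from; j <= to; j++) {
                if (j != i) {
                    result.add(new String[] {tokens[i], tokens[j]});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] tokens = "the cat chased the mouse up to the den".split(" ");
        for (String[] pair : pairs(tokens, 2)) {
            System.out.println("(" + pair[0] + ", " + pair[1] + ")");
        }
    }
}
```

With a window of 2, the first four target words yield exactly the pairs listed on the slide, starting with (the, cat) and (the, chased).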
  • 17. DeepLearning4J ❏ Open-source, distributed deep-learning library written for Java and Scala ❏ Integrated with Hadoop and Apache Spark ❏ Good developer community ❏ Out-of-the-box implementation of word2vec, based on the skip-gram model
  • 18. DeepLearning4J Model Output. DL4J Word2Vec model output example (zip file, syn0.txt), vectorDimension = 2. Each line holds a token (B64 encoded) followed by its associated vector: B64:ZGk= 0.06251079589128494 -0.9980443120002747; B64:ZQ== 0.5112091898918152 -0.8594563603401184; B64:aWw= 0.5138685703277588 -0.8578689694404602; B64:bGE= 0.4818926453590393 -0.8762302398681641; B64:aQ== 0.9747347831726074 -0.22336536645889282; B64:ZGVsbGE= 0.3850429654121399 -0.9228987097740173; B64:cGVy 0.964830219745636 -0.26287391781806946; …
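To make the syn0.txt lines readable, the sketch below decodes one of them: it strips the `B64:` prefix, Base64-decodes the token and parses the remaining numbers as the vector. It assumes the line layout is exactly the one shown on the slide (token, then whitespace-separated components) and ignores any other content the file may carry; `Syn0Line` is a hypothetical helper, not part of DL4J or Lucene.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class Syn0Line {
    final String token;
    final float[] vector;

    Syn0Line(String token, float[] vector) {
        this.token = token;
        this.vector = vector;
    }

    // Parses "B64:<base64 token> <v1> <v2> ..." into a token and its vector.
    static Syn0Line parse(String line) {
        String[] parts = line.trim().split("\\s+");
        String encoded = parts[0];
        String token = encoded.startsWith("B64:")
            ? new String(Base64.getDecoder().decode(encoded.substring(4)),
                         StandardCharsets.UTF_8)
            : encoded; // assumption: tokens without the prefix are stored as plain text
        float[] vector = new float[parts.length - 1];
        for (int i = 1; i < parts.length; i++) {
            vector[i - 1] = Float.parseFloat(parts[i]);
        }
        return new Syn0Line(token, vector);
    }

    public static void main(String[] args) {
        Syn0Line parsed = parse("B64:ZGVsbGE= 0.3850429654121399 -0.9228987097740173");
        System.out.println(parsed.token + " " + Arrays.toString(parsed.vector));
    }
}
```

Decoded this way, the tokens on the slide are common Italian function words ("di", "e", "il", "la", "i", "della", "per"), consistent with a model trained on Italian text.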
  • 19. Agenda Our Contribution Word2Vec Algorithm Synonym Expansion Example - Index/Query time Future Works
  • 22. Synonym expansion - first prototype (deeplearning4j). Model parsing and model storing: Deeplearning4j. Pros ● Already implemented and tested. Cons ● Too many dependencies ● Search is quite slow (~70ms* for each synonym expansion). *According to our preliminary experiments; future work: more accurate benchmarks
  • 23. Synonym expansion - How it works. 1. Original word = W (the model holds vectors for W, A, T, L, V, Q, B, P, Z) 2. Getting the vector corresponding to the original term 3. Searching the vectors with the highest cosine similarity 4. Select the sub-set of vectors with the highest cosine similarity with the query vector: W, A, T, Z
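The steps on this slide can be sketched as an exhaustive search: fetch the vector of the original word, score every other vector by cosine similarity, keep those above a threshold and return the best ones. This is an illustrative brute-force version under assumed names (`BruteForceSynonyms`, a plain `Map` as the model), not the search strategy of the prototype or of the Lucene filter.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BruteForceSynonyms {
    // Cosine similarity between two vectors of the same length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns up to maxSynonyms words whose similarity to `word` is at least
    // minSimilarity, ordered from most to least similar.
    static List<String> synonyms(String word, Map<String, float[]> model,
                                 int maxSynonyms, double minSimilarity) {
        float[] query = model.get(word);
        if (query == null) {
            return new ArrayList<>(); // unknown word: nothing to expand
        }
        Map<String, Double> scores = new HashMap<>();
        for (Map.Entry<String, float[]> entry : model.entrySet()) {
            if (entry.getKey().equals(word)) {
                continue; // skip the original word itself
            }
            double similarity = cosine(query, entry.getValue());
            if (similarity >= minSimilarity) {
                scores.put(entry.getKey(), similarity);
            }
        }
        List<String> ranked = new ArrayList<>(scores.keySet());
        ranked.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        return new ArrayList<>(ranked.subList(0, Math.min(maxSynonyms, ranked.size())));
    }

    public static void main(String[] args) {
        // Toy 2-dimensional model; the values are invented for illustration.
        Map<String, float[]> model = new HashMap<>();
        model.put("W", new float[] {1.0f, 0.0f});
        model.put("A", new float[] {0.9f, 0.1f});
        model.put("T", new float[] {0.8f, 0.3f});
        model.put("Z", new float[] {0.7f, 0.6f});
        model.put("L", new float[] {0.0f, 1.0f});
        model.put("V", new float[] {-1.0f, 0.0f});
        System.out.println(synonyms("W", model, 10, 0.7));
    }
}
```

Scanning every vector costs time linear in the vocabulary size for each expanded term, which is the motivation for an approximate nearest-neighbor index once the vocabulary reaches hundreds of thousands of tokens.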
  • 24. Word2VecSynonymFilter - Phases: Model parsing, Model storing, Synonym expansion. Lucene already implements K-Nearest-Neighbor search using HNSW. Image from The Big Bang Theory (HBO)
  • 25. Hierarchical Navigable Small World (HNSW): approximate K-nearest neighbor search based on navigable small world graphs with controllable hierarchy ● A Navigable Small World graph is a proximity graph ○ vertices are vectors ○ an edge means that two vectors are close to each other ● Hierarchical layers based on skip lists ○ longer edges in higher layers (fast retrieval) ○ shorter edges in lower layers (accuracy) [Figure: Layer 2, Layer 1, Layer 0, with the search going from the entry point to the nearest neighbor] Source: https://sease.io/2022/01/apache-solr-neural-search.html
  • 26. Word2VecSynonymFilter - HNSW. Model parsing: ad-hoc parser (stream). Model storing: HNSW graph. Synonym expansion: graph searcher. Improvements ● Fast search (~6ms* instead of ~70ms for each synonym expansion) ● No additional dependencies. *According to our preliminary experiments; future work: more accurate benchmarks
  • 27. Word2VecSynonymFilter - How to use: .addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "<model_file>") Word2VecSynonymFilter Configuration Parameters: ❏ Word2Vec model: REQUIRED, file containing the trained model ❏ Word2VecSupportedFormats: default DL4J; DL4J is currently the only supported format ❏ maxSynonymsPerTerm: default 10; maximum number of results returned by the synonym search ❏ minAcceptedSimilarity: default 0.7f; minimum value of cosine similarity between the searched vector and the retrieved ones ❏ similarityAsBoost: default true; assign the similarity value as the term boost
  • 29. LuceneWord2VecModelTrainer. Input ● Lucene Index Path: path to the folder containing the index, used to fetch the document values ● Field Name: the field to fetch the values from. Output ● DL4J Word2Vec model file (.zip): contains a dictionary in which each token has a vector attached to it. Command line to train a Word2vec model from a Lucene index: java -jar build/libs/LuceneWord2VecModelTrainer.jar -p <lucene_index_path> -f <field_name> -o <model_file>
  • 30. LuceneWord2VecModelTrainer ● FieldValuesSentenceIterator class: to read stored field values from the Lucene index to be used for training the word2vec model ● Model Training ○ Library: DeepLearning4J (DL4J) ○ Algorithm: Skip-gram model ○ Default parameters/hyperparameters SentenceIterator iter = new FieldValuesSentenceIterator(config); Word2Vec vec = new Word2Vec.Builder() .layerSize(100) .minWordFrequency(5) .windowSize(5) .iterate(iter) .build(); vec.fit(); WordVectorSerializer.writeWord2VecModel(vec, config.getModelFilePath());
  • 31. Our works - LuceneWord2VecModelTrainer: Command line tool to generate a DL4J Word2Vec model using a specific field of an Apache Lucene index. Currently in our GitHub repository: https://github.com/SeaseLtd/LuceneWord2VecModelTrainer - Word2VecSynonymFilter: New token filter in Lucene that queries the Word2Vec model on input tokens to get the weighted list of synonyms of a specific term. Currently in our Lucene fork: https://github.com/SeaseLtd/lucene/tree/word2vec
  • 32. Agenda Our Contribution Word2Vec Algorithm Synonym Expansion Example - Index/Query time Future Works
  • 33. Example - Index Time. For the experiment we used the WikipediaExtractor to download the documents of the Italian Wikipedia (italian_wikipedia_data): 1,820,000 documents (3.4GB). Steps: 1. Index the Italian Wikipedia documents 2. Train the model using a specific field of the Lucene index. Command: java -jar LuceneWord2VecModelTrainer.jar -p /sease/word2vec_model/italian_wikipedia_data -f text -o wikipedia-model.zip Log: [INFO ] 00:44:43.240 [main] ModelGenerator - indexPath = /sease/word2vec_model/italian_wikipedia_data [INFO ] 00:44:43.244 [main] ModelGenerator - field = text [INFO ] 00:44:43.244 [main] ModelGenerator - modelFile = wikipedia-model.zip [INFO ] 03:28:27.653 [main] ModelGenerator - Model trained in 163 min [INFO ] 03:31:30.708 [main] ModelGenerator - Model file wikipedia-model.zip generated
  • 34. Example - Query Time Word2vec Searcher Analyzer analyzer = CustomAnalyzer.builder() .withTokenizer(StandardTokenizerFactory.NAME) .addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "wikipedia-model.zip") .build(); DirectoryReader reader = DirectoryReader.open(directory); IndexSearcher searcher = new IndexSearcher(reader); . . . System.out.println("Enter an italian word : "); String searchTerm = inputReader.readLine(); Query query = parser.parse(searchTerm); log.info(query.toString()); . . . TopDocs docs = searcher.search(query, 10);
  • 35. Example - Query Time Word2vec Searcher Enter an italian word : computer Found synonym microprocessore with similarity 0.8636 Found synonym controller with similarity 0.8663 Found synonym microcomputer with similarity 0.8687 Found synonym desktop with similarity 0.8754 Found synonym notebook with similarity 0.8761 Found synonym hardware with similarity 0.8838 Found synonym software with similarity 0.8960 Found synonym chip with similarity 0.8994 Found synonym mainframe with similarity 0.9054 Synonym(text:chip^0.8994 text:computer text:controller^0.8663 text:desktop^0.8754 text:hardware^0.8838 text:mainframe^0.9054 text:microcomputer^0.868 text:microprocessore^0.8636 text:notebook^0.8761 text:software^0.8960) found 10 documents in 8 ms
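The printed query shows how each synonym carries its cosine similarity as a boost while the original term stays unboosted. The sketch below only mirrors that printed form with plain string formatting; it does not call Lucene, and `BoostedSynonymQuery` and its `describe` method are hypothetical names introduced for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class BoostedSynonymQuery {
    // Renders "Synonym(field:term^boost ...)" with terms in alphabetical order;
    // the original term is listed without a boost.
    static String describe(String field, String original, Map<String, Double> synonyms) {
        TreeMap<String, Double> terms = new TreeMap<>(synonyms);
        terms.put(original, 1.0);
        StringBuilder sb = new StringBuilder("Synonym(");
        boolean first = true;
        for (Map.Entry<String, Double> entry : terms.entrySet()) {
            if (!first) {
                sb.append(' ');
            }
            first = false;
            sb.append(field).append(':').append(entry.getKey());
            if (!entry.getKey().equals(original)) {
                // Similarity used as boost, as with similarityAsBoost = true.
                sb.append('^').append(String.format(Locale.ROOT, "%.4f", entry.getValue()));
            }
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        Map<String, Double> synonyms = Map.of("chip", 0.8994, "mainframe", 0.9054);
        System.out.println(describe("text", "computer", synonyms));
    }
}
```

Because the boost equals the similarity, a document matching only a weaker synonym contributes less to the score than one matching the original term.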
  • 36. Example - Index Time Synonym expansion at index time Analyzer analyzer = CustomAnalyzer.builder() .withTokenizer(StandardTokenizerFactory.NAME) .addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "wikipedia-model.zip") .build(); IndexWriterConfig luceneConfig = new IndexWriterConfig(analyzer); IndexWriter writer = new IndexWriter(directory, luceneConfig); Document doc = new Document(); doc.add(new TextField("value", "computer", Field.Store.YES)); writer.addDocument(doc); writer.commit(); ● Bigger index ● Indexing process slower ● Need to re-index the whole collection if synonym model changes !
  • 37. Example - Index Time. > Word2VecIndexerWithSynonyms.main() File read successfully Building the HNSW graph Created HNSW graph in 2 min Created document with value: computer Index created. It took 2 minutes to load 299,853 vectors … can we improve it? Using Luke to check the index after the synonym expansion (terms stored in the index): some words are not synonyms … can we improve it? Let’s try …
  • 38. Agenda Our Contribution Word2Vec Algorithm Synonym Expansion Example - Index/Query time Future Works
  • 39. Future Works - Model stored into a Lucene Index. Current limitation ● Model in memory ○ Disaster recovery => longer time to recover ○ Multi process => multiple models. How do we plan to solve it? ● Change the “model storage” part to store the model in a Lucene index: ○ no need to load the model and rebuild the HNSW graph on process startup => faster disaster recovery ○ single model instance => multiple processes access the same model
  • 40. Future Works - Improvements ● Introduce model hyperparameter tuning in our LuceneWord2VecModelTrainer tool ● Synonym expansion using other NLP language models (e.g. BERT)
  • 41. Future Works ● Solr/Elasticsearch/OpenSearch integration? ● Introduce multi-term synonyms
  • 42. Thank you for your attention!