Seminars
Let’s Build an Inverted Index: Introduction to Apache Lucene/Solr

Alessandro Benedetti, Software Engineer
Andrea Gazzarini, Software Engineer
28th November 2019
Alessandro Benedetti
▪ R&D Software Engineer
▪ Search Consultant
▪ Director
▪ Master's Degree in Computer Science
▪ Apache Lucene/Solr Enthusiast
▪ Passionate about Semantic, NLP and Machine Learning technologies
▪ Conference Speaker
▪ Beach Volleyball Player & Snowboarder
Andrea Gazzarini, “Gazza”
▪ Software Engineer (1999-)
▪ “Hermit” Software Engineer (2010-)
▪ Passionate about Java & Information Retrieval
▪ Apache Qpid (past) Committer
▪ Husband & Father
▪ Bass Player
Who we are
Search Services
● London Based - Italian made :)
● Open Source Enthusiasts
● Apache Lucene/Solr experts
● Community Contributors
● Active Researchers
● Hot Trends: Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning
Why should you use Open Source?
• State of the Art / very valid technologies
• Community Support
• Vast Documentation
• Code is accessible!
• Customisable
• Mostly free licensing
Why should you contribute to Open Source?
• Share knowledge and ideas
• Improve established technologies
• Become part of a Community
• Not only code - all your skills are relevant!
• Be useful to the world
We only deal with Open Source Information Retrieval … Revenue?
● Training - Beginner/Intermediate/Advanced/Ad Hoc for Information Retrieval, Apache Lucene/Solr, Search Relevance, Learning To Rank…
● Consulting - Open Source Software is ubiquitous. Expertise? Not really
● R&D Projects - Cheaper and more flexible for companies using Open Source
● IR Projects - From the collection of client requirements to software delivery
Clients
Information Retrieval
“Information retrieval (IR) is the activity of
obtaining information system resources relevant to
an information need from a collection of
information resources. Searches can be based on
full-text or other content-based indexing.
Information retrieval is the science of searching for
information in a document, searching for
documents themselves, and also searching for
metadata that describe data, and for databases of
texts, images or sounds.” Wikipedia
Information Need
Corpus
Apache Lucene
• http://lucene.apache.org
• High-performance, scalable information retrieval software *library*
• Adds search capabilities to your applications
• Cohesive and simple interface, which hides a really complex world
• Open Source: Apache Top Level Project
Apache Lucene - Brief History
2019 Lucene 8.3.0 (November)
Apache Solr
• http://lucene.apache.org/solr
• Highly reliable, scalable and fault tolerant search *server*
• A Lucene “serverization” with a lot of additional features
• All services are exposed through an HTTP (REST-like) interface
• Written in Java
• Rich ecosystem for building enterprise-level applications (Plugins, Integrations, Clients)
• Open Source: Apache Top Level Project
“Solr is the popular, blazing-fast, open
source enterprise search platform built on
Apache Lucene™.”
Apache Solr - Brief History
2019 Solr 8.3.0 (November)
The Inverted Index
The Inverted Index is the basic data structure
used by Lucene to provide Search in a corpus of
documents.
From Wikipedia:
“In computer science, an inverted index (also
referred to as postings file or inverted file) is an
index data structure storing a mapping from
content, such as words or numbers, to its locations
in a database file, or in a document or a set of
documents.”
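As a minimal sketch of the mapping described above (terms to the documents that contain them), the following builds a toy inverted index in Python. The lowercase-and-split analysis and the sample documents are assumptions made for illustration; Lucene's actual structures also store positions, frequencies and more.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive analysis (an assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical sample corpus.
docs = {
    1: "Lucene is a search library",
    2: "Solr is a search server",
    3: "Solr is built on Lucene",
}
index = build_inverted_index(docs)
print(index["lucene"])  # [1, 3]
print(index["search"])  # [1, 2]
```

Looking up a term is then a single dictionary access, which is what makes the structure suitable for search.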
The Lucene Document
[Figure: a Document made of fields, each with a Field Name and a Field Value]
• Documents are the unit of information for indexing and search.
• A Document is a set of fields.
• Each field has a name and a value.
The Lucene Inverted Index
• Lucene directory (in memory, on disk, memory mapped)
• Collection of immutable segments (each one a fully working index)
• Each segment is composed of a set of binary files [1]
[1] Lucene File Format Documentation
Indexes evolve by:
1. Creating new segments for newly added documents.
2. Merging existing segments.
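The two ways an index evolves can be sketched with a toy model. The `SegmentedIndex` class and its methods are hypothetical names chosen for this illustration, not Lucene APIs; the point is only that adding documents creates a new segment, while merging rewrites existing segments into one.

```python
class SegmentedIndex:
    """Toy model: an index is a list of immutable segments (term -> doc ids)."""

    def __init__(self):
        self.segments = []

    def add_documents(self, docs):
        # Newly added documents never modify existing segments:
        # they are written into a brand new segment.
        segment = {}
        for doc_id, text in docs.items():
            for term in text.lower().split():
                segment.setdefault(term, set()).add(doc_id)
        # Freeze the postings so the segment cannot change afterwards.
        self.segments.append({t: frozenset(ids) for t, ids in segment.items()})

    def merge(self):
        # Merging rewrites all current segments into a single new segment.
        merged = {}
        for segment in self.segments:
            for term, ids in segment.items():
                merged[term] = merged.get(term, frozenset()) | ids
        self.segments = [merged]

    def search(self, term):
        # A search visits every segment and unions the matches.
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term, frozenset())
        return sorted(hits)


index = SegmentedIndex()
index.add_documents({1: "apache lucene"})
index.add_documents({2: "apache solr"})
print(len(index.segments), index.search("apache"))  # 2 [1, 2]
index.merge()
print(len(index.segments), index.search("apache"))  # 1 [1, 2]
```

Search results are the same before and after the merge; only the number of segments to visit changes.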
Schema Configuration
• Per collection/index
• XML file
• Defines how the inverted index will be built
• Fields/Field Types definition
Schema Configuration
• Define flexible expressions for groups of fields
• Shared attributes for each field instance
• Copy the source content to a destination field
• Allows multiple analysis chains to run on the same content
Field Type
• Defines how the single terms (in the inverted index) will be generated from the content
• Index Time: analysis chain executed when building the index
• Query Time: analysis chain executed when building the query
Text Analysis
• Only text field types (e.g. solr.TextField or subclasses) have an associated text analysis chain
An analyzer can define
• Zero or more CharFilter
• One and only one Tokenizer
• Zero or more TokenFilter
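The shape of such a chain (char filters, then exactly one tokenizer, then token filters) can be sketched in a few lines of Python. All function names here are hypothetical stand-ins for the corresponding Solr components, and the regular expression and stopword set are simplifying assumptions.

```python
import re


def html_strip_char_filter(text):
    # CharFilter: works on raw characters, before tokenization.
    return re.sub(r"<[^>]+>", " ", text)


def whitespace_tokenizer(text):
    # Tokenizer: exactly one per chain, turns characters into tokens.
    return text.split()


def lowercase_filter(tokens):
    # TokenFilter: transforms each token.
    return [t.lower() for t in tokens]


def stop_filter(tokens, stopwords=frozenset({"a", "the", "of"})):
    # TokenFilter: discards tokens found in the stopword set.
    return [t for t in tokens if t not in stopwords]


def analyze(text, char_filters, tokenizer, token_filters):
    """Run zero or more char filters, one tokenizer, zero or more token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens


tokens = analyze(
    "<b>The</b> Tale of Two Cities",
    char_filters=[html_strip_char_filter],
    tokenizer=whitespace_tokenizer,
    token_filters=[lowercase_filter, stop_filter],
)
print(tokens)  # ['tale', 'two', 'cities']
```

The order of the token filters matters: here lowercasing runs first so that "The" matches the lowercase stopword list.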
Char Filters
• A CharFilter is a component that pre-processes input characters.
• CharFilters can be chained like Token Filters and placed in front of a Tokenizer.
• CharFilters can add, change, or remove characters while preserving the original character offsets to support features like highlighting.
Tokenizers
Tokenizers are responsible for breaking field data into lexical units, or tokens.[1]
[1] https://lucene.apache.org/solr/guide/8_3/tokenizers.html
Token Filters
Filters[1] examine a stream of tokens and keep them, transform them or discard them, depending on the filter type being used.
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html
Word Delimiter Filter
• Improve recall
• Dedicated Filters: solr.WordDelimiterGraphFilterFactory [1]
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#word-delimiter-graph-filter
Example:
Default behavior. The whitespace tokenizer is used here to preserve non-alphanumeric characters.
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"/>
  <filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.WordDelimiterGraphFilterFactory"/>
</analyzer>

In: "hot-spot RoboBlaster/9000 100XL"
Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100XL"
Out: "hot", "spot", "Robo", "Blaster", "9000", "100", "XL"
Stopword Filters
• Reduce index size
• Can improve precision (removing terms with low semantic value)
• Can improve recall
• Dedicated Filters: solr.StopFilterFactory, solr.ManagedStopFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#stop-filter
Example:
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
In: "To be or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "what"(4)
Stemmers
• Improve Recall
• Reduce index size
• Dedicated Filters: solr.EnglishMinimalStemFilterFactory, solr.HunspellStemFilterFactory, solr.KStemFilterFactory, solr.PorterStemFilterFactory, solr.SnowballPorterFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#porter-stem-filter
Example:
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
In: "dogs cats"
Tokenizer to Filter: "dogs", "cats"
Out: "dog", "cat"
Synonym Filters[1/2]
• Improve Recall
• Dedicated Filters: solr.SynonymGraphFilterFactory
• Index Time -> affects term distributions, requires re-indexing
• Query Time -> more flexible
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#synonym-graph-filter
Example synonym rules:
couch,sofa,divan
teh => the
huge,ginormous,humungous => large
small => tiny,teeny,weeny
Synonym Filters[2/2]
• Improve Recall
• Dedicated Filters: solr.SynonymGraphFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#synonym-graph-filter
Example:
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
  <filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>

In: "teh small couch"
Tokenizer to Filter: "teh"(1), "small"(2), "couch"(3)
Out: "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)
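The In/Out example above can be reproduced with a small Python sketch of single-term synonym expansion. The helper names `parse_synonym_rules` and `apply_synonyms` are hypothetical, and the sketch deliberately ignores multi-word synonyms, which are the reason the real filter produces a token graph.

```python
def parse_synonym_rules(lines):
    """Return a dict: input term -> list of output terms."""
    mapping = {}
    for line in lines:
        if "=>" in line:
            # Explicit mapping: each left-hand term is replaced by the right-hand terms.
            left, right = line.split("=>")
            outputs = [t.strip() for t in right.split(",")]
            for term in left.split(","):
                mapping[term.strip()] = outputs
        else:
            # Equivalent terms: each term expands to the whole group.
            group = [t.strip() for t in line.split(",")]
            for term in group:
                mapping[term] = group
    return mapping


def apply_synonyms(tokens, mapping):
    """Return (term, position) pairs; synonyms share the original token's position."""
    out = []
    for position, token in enumerate(tokens, start=1):
        for term in mapping.get(token, [token]):
            out.append((term, position))
    return out


rules = parse_synonym_rules([
    "couch,sofa,divan",
    "teh => the",
    "huge,ginormous,humungous => large",
    "small => tiny,teeny,weeny",
])
print(apply_synonyms(["teh", "small", "couch"], rules))
# [('the', 1), ('tiny', 2), ('teeny', 2), ('weeny', 2), ('couch', 3), ('sofa', 3), ('divan', 3)]
```

Emitting synonyms at the same position as the original token is what lets phrase and proximity queries keep working after expansion.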
Keep Word Filter
• Help in Entity tagging
• Dedicated Filters: solr.KeepWordFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#keep-word-filter
Example:
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/>
</analyzer>
In: "Happy, sad or funny"
Tokenizer to Filter: "Happy", "sad", "or", "funny"
Out: "Happy", "funny"
N-Gram Filtering
• Improve Recall
• Ideal for autocompletion
• Dedicated Filters: solr.EdgeNGramFilterFactory, solr.NGramFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#edge-n-gram-filter
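To see why edge n-grams suit autocompletion, here is a minimal Python sketch of the idea. The `edge_ngrams` function is a hypothetical illustration, not the Solr implementation; its defaults mirror the minGramSize="1" and maxGramSize="4" values used in this slide's configuration.

```python
def edge_ngrams(token, min_gram=1, max_gram=4):
    """Prefixes of the token with lengths from min_gram to max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


print(edge_ngrams("four"))   # ['f', 'fo', 'fou', 'four']
print(edge_ngrams("score"))  # ['s', 'sc', 'sco', 'scor']
print(edge_ngrams("to"))     # ['t', 'to']
```

Indexing these prefixes means a partially typed query such as "sco" matches as an ordinary term lookup, with no wildcard needed.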
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="4"/>
</analyzer>
Phonetic Matching
• Improve Recall
• Dedicated Filters: solr.BeiderMorseFilterFactory, solr.DaitchMokotoffSoundexFilterFactory, solr.DoubleMetaphoneFilterFactory, solr.PhoneticFilterFactory
[1] https://lucene.apache.org/solr/guide/8_3/phonetic-matching.html
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"/>
</analyzer>
Common Grams Filter
• Improve Precision
• Useful for phrase queries
[1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#common-grams-filter
Example:
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
In: "the Cat"
Tokenizer to Filter: "the", "Cat"
Out: "the_cat"
Field Attributes
Solr Text Analysis - Hands On!
• Analysis Screen from Solr Admin
• Let’s explore the schema.xml
Indexing
• Using the Solr Cell framework, built on Apache Tika, for ingesting binary or structured files such as Office, Word, PDF, and other proprietary formats. (Recommended for prototyping and exercise)
• Uploading XML files by sending HTTP requests to the Solr server from any environment where such requests can be generated. (Recommended for prototyping and exercise)
• Writing a custom Java application to ingest data through Solr’s Java Client API (which is described in more detail in Client APIs). Using the Java API may be the best choice if you’re working with an application, such as a Content Management System (CMS), that offers a Java API.
Indexing
• Indexing is the procedure of building an index from the input documents
• Transaction log (rotated on hard commits)
• Index built in memory
• Soft commits (visibility)
• Hard commits (durability)
• openSearcher=true (visibility)
• Auto commit
• Merge policy
Lucene Score
In order to measure the relevancy of a given result, Solr (Lucene) assigns it a “score”.
The formula behind the score computation is beyond the scope of this course; however, important things that contribute to that formula are:
• Term Frequency (TF): how many times a given term occurs within a single document
• Document Frequency (DF): how many documents in the dataset contain a given term
• TF-IDF: the product of the term frequency and the inverse document frequency (in its simplest form, TF × 1/DF)
• Field length: how many terms compose a field
• Boosting: functions, or more generally anything that boosts the score computed for a given match. Boosting can be applied at index time (now deprecated) or at query time
Score values cannot be compared across queries, or even with the same query but with a different index.
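A textbook TF-IDF computation, sketched in Python, shows how TF and DF interact. This is a simplified variant using a logarithmic IDF, which is an assumption of this example; it is not the formula Lucene uses, and the tiny corpus is made up.

```python
import math


def tf_idf(term, doc_tokens, corpus):
    """Textbook TF-IDF: term frequency times log-scaled inverse document frequency."""
    tf = doc_tokens.count(term)                          # occurrences in this document
    df = sum(1 for tokens in corpus if term in tokens)   # documents containing the term
    if df == 0:
        return 0.0
    idf = math.log(len(corpus) / df)
    return tf * idf


# Hypothetical corpus of already tokenized documents.
corpus = [
    "solr is a search server".split(),
    "lucene is a search library".split(),
    "solr solr solr".split(),
]
for doc in corpus:
    print(round(tf_idf("solr", doc, corpus), 3), round(tf_idf("library", doc, corpus), 3))
```

Because DF and the corpus size are properties of the index, the same document gets a different score when the index changes, which is one reason scores are not comparable across indexes.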
BM25 Term Scorer
• Origin from Probabilistic Information Retrieval
• Default Similarity from Lucene 6.0 [1]
• 25th iteration in improving TF-IDF
• TF
• IDF
• Document (Field) Length
• Configuration parameters
[1] LUCENE-6789
BM25 Term Scorer - Inverse Document Frequency
IDF Score has very similar behavior
BM25 Term Scorer - Term Frequency
TF Score approaches (k+1) asymptotically; k=1.2 in this example
BM25 Term Scorer - Document Length
Document Length / Avg Document Length
affects how fast we
saturate TF score
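The three BM25 ingredients (IDF, saturating TF, document length normalization) can be sketched in Python using the commonly published form of the formula. The function names are hypothetical, b=0.75 is assumed as the usual default alongside k=1.2, and specific Lucene versions may differ in details such as constant factors.

```python
import math


def bm25_idf(doc_count, doc_freq):
    # Rarer terms (low doc_freq) get a higher weight.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    # Length normalization: longer-than-average documents saturate more slowly.
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    # Saturating TF: grows with freq but never reaches k1 + 1.
    return freq * (k1 + 1) / (freq + norm)


def bm25_term_score(freq, doc_len, avg_doc_len, doc_count, doc_freq, k1=1.2, b=0.75):
    return bm25_idf(doc_count, doc_freq) * bm25_tf(freq, doc_len, avg_doc_len, k1, b)


# TF saturation for a document of average length (k1 + 1 = 2.2 is the ceiling).
for freq in (1, 2, 10, 1000):
    print(freq, round(bm25_tf(freq, doc_len=100, avg_doc_len=100), 3))

# Same term frequency, different document lengths.
print(round(bm25_tf(2, doc_len=50, avg_doc_len=100), 3))
print(round(bm25_tf(2, doc_len=200, avg_doc_len=100), 3))
```

With b=0 the document length has no effect at all, while b=1 applies full length normalization.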
Basic Search
The list is not exhaustive and is not statically defined, because it depends on the query parser.
Some parameters (e.g. filters) accept more than one value.
Queries
Query
• Regulated by Query Parsers
• Calculates scores
• Cached with results order preserved
Filter Query
• Regulated by Query Parsers
• Does not calculate scores
• Cached independently
• Reusable
q=field:value fq=field:value
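As a sketch of how a main query and several filter queries travel in one request, the following builds a request URL with the Python standard library. The host, collection name ("books") and field values are hypothetical, and `build_select_url` is an illustrative helper rather than part of any Solr client.

```python
from urllib.parse import urlencode


def build_select_url(base_url, query, filter_queries=(), **params):
    """Build a /select URL with one q parameter and any number of fq parameters."""
    pairs = [("q", query)]
    # Each filter query is sent as its own fq parameter.
    pairs += [("fq", fq) for fq in filter_queries]
    pairs += sorted(params.items())
    return f"{base_url}/select?{urlencode(pairs)}"


url = build_select_url(
    "http://localhost:8983/solr/books",  # hypothetical host and collection
    "title:cities",
    filter_queries=["author:dickens", "downloads:[1000 TO 2000]"],
    rows=10,
)
print(url)
```

Keeping each restriction in its own fq parameter fits the point made above: filters are cached independently and can be reused across different main queries.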
Query Parsers
• The main responsibility of a query parser is to understand the input query syntax and build a Lucene query
• This is the first component involved in the query execution chain
• If none is specified, a default parser is used (the Solr Standard Query Parser)
• Solr comes with several ready-to-use query parsers
• The query parameter “defType” defines the query parser that will be used in a request
Standard Query Parser
Parameter: Description
• q: Defines a query using standard query syntax. This parameter is mandatory.
• q.op: Specifies the default operator for query expressions, overriding the default operator specified in the Schema. Possible values are "AND" or "OR".
• df: Specifies a default field, overriding the definition of a default field in the Schema.
• sow: Split on whitespace. If set to true, text analysis is invoked separately for each individual whitespace-separated term. Defaults to false (since Solr 7.0): whitespace-separated term sequences are provided to text analysis in one shot, enabling proper function of analysis filters that operate over term sequences, e.g. multi-word synonyms and shingles.
Standard Query Parser
• Phrase Search: q=title:"a tale of two cities"
• Wildcard Search: q=title:c?ti*
• Fuzzy Search: q=title:cties~1
• Proximity Search: q=title:"tale cities"~2
• Range Search: downloads:[1000 TO 2000], author:{Ada TO Carmen}
• Boosted Search: q=tale of two cities^100 bunny
• Constant Score Search: AND subjects:(war stories)^=4
• Boolean Search: (field1:term1) AND (field2:term1)
Date Queries
Queries against fields using the TrieDateField type (typically range queries) should use the appropriate date syntax [1]:
• timestamp:[* TO NOW]
• createdate:[1976-03-06T23:59:59.999Z TO *]
• createdate:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z]
• pubdate:[NOW-1YEAR/DAY TO NOW/DAY+1DAY]
• createdate:[1976-03-06T23:59:59.999Z TO 1976-03-06T23:59:59.999Z+1YEAR]
• createdate:[1976-03-06T23:59:59.999Z/YEAR TO 1976-03-06T23:59:59.999Z]
[1] https://en.wikipedia.org/wiki/ISO_8601
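The absolute timestamps in the range queries above follow the ISO 8601 UTC form ending in "Z". A minimal Python sketch for producing them from timezone-aware datetimes is shown below; `to_solr_date` and `date_range` are hypothetical helper names, and date math expressions such as NOW/DAY are not covered.

```python
from datetime import datetime, timedelta, timezone


def to_solr_date(dt):
    """Format a timezone-aware datetime as an ISO 8601 UTC instant with milliseconds."""
    utc = dt.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"


def date_range(field, start=None, end=None):
    """Build an inclusive range clause; None becomes the open-ended '*'."""
    lower = to_solr_date(start) if start else "*"
    upper = to_solr_date(end) if end else "*"
    return f"{field}:[{lower} TO {upper}]"


start = datetime(1976, 3, 6, 23, 59, 59, 999000, tzinfo=timezone.utc)
print(date_range("createdate", start))
# createdate:[1976-03-06T23:59:59.999Z TO *]

# A local time (UTC+1 here) is converted to UTC before formatting.
utc_plus_one = timezone(timedelta(hours=1))
print(to_solr_date(datetime(2019, 11, 28, 10, 0, tzinfo=utc_plus_one)))
# 2019-11-28T09:00:00.000Z
```

Converting to UTC on the client side avoids ambiguity, since the request itself carries no locale information.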
Timezone
By default, all date math expressions are evaluated relative to the UTC time zone, but the TZ parameter can be specified to override this behaviour.
N.B. Regardless of the locale Solr runs in, only ISO 8601 dates are supported in requests.
Solr Query Debug - Hands On!
• debug=query: return debug information about the query only.
• debug=timing: return debug information about how long the query took to process.
• debug=results: return debug information about the score results (also known as "explain").
Master Thesis: Click Models to Estimate Relevancy Ratings from User Interactions
The main responsibilities of the candidate will be to:
• learn basic concepts of Agile methodologies for software engineering
• learn the details of Search Quality Evaluation
• grasp the fundamentals of click modelling, implicit and explicit relevancy feedback
• design and implement the module in an existing Spring Boot REST service application
• benchmark the solution(s) through a careful quality/performance (time/space) analysis
Master Thesis: Search Quality Evaluation for Continuous Integration Tools
The main responsibilities of the candidate will be to:
• learn basic concepts of Agile methodologies for software engineering
• get familiar with Apache Lucene based search engines (Apache Solr/Elasticsearch)
• learn the details of Search Quality Evaluation
• grasp the fundamentals of Continuous Integration and Continuous Deployment through well established industry level technologies
• design and implement plugins for Jenkins, Atlassian Bamboo and JetBrains TeamCity
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Wakad Call Me 7737669865 Budget Friendly No Advance Booking
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 

Let's Build an Inverted Index: Introduction to Apache Lucene/Solr

  • 1. Seminars Let’s Build an Inverted Index: Introduction to Apache Lucene/Solr
 Alessandro Benedetti, Software Engineer
 
 Andrea Gazzarini, Software Engineer 28th November 2019
  • 2. Seminars ▪ R&D Software Engineer ▪ Search Consultant ▪ Director ▪ Master Degree in Computer Science ▪ Apache Lucene/Solr Enthusiast ▪ Semantic, NLP, Machine Learning Technologies passionate ▪ Conference Speaker ▪ Beach Volleyball Player & Snowboarder Alessandro Benedetti
  • 3. Seminars ▪ Software Engineer (1999-) ▪ “Hermit” Software Engineer (2010-) ▪ Java & Information Retrieval Passionate ▪ Apache Qpid (past) Committer ▪ Husband & Father ▪ Bass Player Andrea Gazzarini, “Gazza”
  • 4. Seminars Search Services
 ● London Based - Italian made :) ● Open Source Enthusiasts ● Apache Lucene/Solr experts ● Community Contributors ● Active Researchers ● Hot Trends : Learning To Rank,
 Document Similarity, Search Quality Evaluation,
 Relevancy Tuning
  • 6. Seminars Why should you use Open Source? • State of the Art / very valid technologies • Community Support • Vast Documentation • Code is accessible! • Customisable • Mostly free licensing
  • 7. Seminars Why should you contribute to Open Source? • Share knowledge and ideas • Improve established technologies • Become part of a Community • Not only code - all your skills are relevant! • Be useful to the world
  • 8. Seminars We only deal with Open Source Information Retrieval … Revenue ? ● Trainings - Beginner/Intermediate/Advanced/Ad Hoc for
 Information Retrieval, Apache Lucene/Solr, Search Relevance, Learning To Rank… ● Consulting - Open Source Software is ubiquitous / Expertise ? Not really
 ● R&D Projects - Cheaper and more flexible for Companies using Open Source
 ● IR Projects - From the collection of Client requirements to the Software delivery
  • 10. Seminars Information Retrieval “Information retrieval (IR) is the activity of obtaining information system resources relevant to an information need from a collection of information resources. Searches can be based on full-text or other content-based indexing. Information retrieval is the science of searching for information in a document, searching for documents themselves, and also searching for metadata that describe data, and for databases of texts, images or sounds.” Wikipedia Information Need Corpus
  • 11. Seminars Apache Lucene • http://lucene.apache.org • High-performance, scalable information retrieval software *library* • Adds search capabilities to your applications • Cohesive and simple interface, which hides a really complex world • Open Source: Apache Top Level Project
  • 12. Seminars Apache Lucene - Brief History 2019 Lucene 8.3.0 (November)
  • 13. Seminars Apache Solr • http://lucene.apache.org/solr • Highly reliable, scalable and fault tolerant search *server* • A Lucene “serverization” with a lot of additional features • All services are exposed through an HTTP (REST-like) interface • Written in Java • Rich ecosystem for building enterprise-level applications (Plugins, Integrations, Clients) • Open Source: Apache Top Level Project “Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.”
  • 15. Seminars Apache Solr - Brief History 2019 Version 8.3.0 (November)
  • 16. Seminars The Inverted Index The Inverted Index is the basic data structure used by Lucene to provide search over a corpus of documents. From Wikipedia: “In computer science, an inverted index (also referred to as postings file or inverted file) is an index data structure storing a mapping from content, such as words or numbers, to its locations in a database file, or in a document or a set of documents.”
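The mapping described in the definition above can be sketched in a few lines of Python. This is a toy illustration, not how Lucene stores its postings: the function name `build_inverted_index`, the sample documents, and the lowercase-and-split analysis are all assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of (doc_id, positions) postings."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive analysis (assumption): lowercase and split on whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return {term: sorted(postings.items()) for term, postings in index.items()}

# Hypothetical three-document corpus.
docs = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "quick quick dog",
}
index = build_inverted_index(docs)
print(index["quick"])  # [(1, [1]), (3, [0, 1])]
print(index["dog"])    # [(2, [2]), (3, [2])]
```

Looking up a term is a single dictionary access, which is why search does not need to scan the documents themselves.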
  • 17. Seminars The Lucene Document (diagram: a Document made of fields, each with a Field Name and a Field Value) • Documents are the unit of information for indexing and search. • A Document is a set of fields. • Each field has a name and a value.
  • 19. Seminars The Lucene Inverted Index • Lucene directory (in memory, on disk, memory mapped) • Collection of immutable segments (each one a fully working index) • Each segment is composed of a set of binary files [1] Indexes evolve by: 1. Creating new segments for newly added documents. 2. Merging existing segments. [1] Lucene File Format Documentation
  • 20. Seminars Schema Configuration • Per collection/index • XML file • Defines how the inverted index will be built • Fields/Field Types definition
  • 21. Seminars Schema Configuration • Define flexible expressions for groups of fields • Shared attributes for each field instance • Copy the source content to a destination field • Allows running multiple analysis chains on the same content
  • 22. Seminars Field Type • Defines how the single terms (in the inverted index) will be generated out of the content • Index Time: analysis chain executed when building the index • Query Time: analysis chain executed when building the query
  • 23. Seminars Text Analysis • Only text field types (e.g. solr.TextField or subclasses) have an associated text analysis chain. An analyzer can define: • Zero or more CharFilters • One and only one Tokenizer • Zero or more TokenFilters
  • 24. Seminars Char Filters • CharFilter is a component that pre-processes input characters. • CharFilters can be chained like Token Filters and placed in front of a Tokenizer. • CharFilters can add, change, or remove characters
 while preserving the original character offsets to support features like highlighting.
  • 25. Seminars Tokenizers Tokenizers are responsible for breaking field data into lexical units, or tokens.[1] [1] https://lucene.apache.org/solr/guide/8_3/tokenizers.html
  • 26. Seminars Token Filters Filters[1] examine a stream of tokens and keep them, transform them or discard them, 
 depending on the filter type being used. [1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html
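The three stages just described (char filters, then exactly one tokenizer, then token filters) can be mimicked with plain Python functions. This is only a conceptual sketch: every function name here is made up for illustration, and unlike real Lucene components it does not track character offsets or token positions.

```python
import re

def html_strip_char_filter(text):
    # CharFilter stage: rewrites raw characters before tokenization.
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):
    # Tokenizer stage: an analyzer has exactly one.
    return text.split()

def lowercase_filter(tokens):
    # TokenFilter that transforms tokens.
    return [t.lower() for t in tokens]

def length_filter(tokens, min_len=2):
    # TokenFilter that discards tokens shorter than min_len.
    return [t for t in tokens if len(t) >= min_len]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run the chain in order: char filters, tokenizer, token filters."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

tokens = analyze(
    "<b>Lucene</b> is a Search library",
    [html_strip_char_filter],
    whitespace_tokenizer,
    [lowercase_filter, length_filter],
)
print(tokens)  # ['lucene', 'is', 'search', 'library']
```

The order matters: swapping two token filters (for example a stemmer and a stopword filter) can change which terms end up in the index.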
  • 27. Seminars Word Delimiters Filter • Improve recall • Dedicated Filters: 
 solr.WordDelimiterGraphFilterFactory [1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#word-delimiter-graph-filter Example: Default behavior. The whitespace tokenizer is used here to preserve non-alphanumeric characters. <analyzer type="index"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterGraphFilterFactory"/> <filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters --> </analyzer> <analyzer type="query"> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterGraphFilterFactory"/> </analyzer>
 In: "hot-spot RoboBlaster/9000 100XL" Tokenizer to Filter: "hot-spot", "RoboBlaster/9000", "100XL" Out: "hot", "spot", "Robo", "Blaster", "9000", "100", "XL"
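To make the In/Out example above concrete, here is a rough Python approximation of the default splitting rules (split on non-alphanumeric characters, on case transitions, and on letter/digit transitions). The regular expression is an assumption chosen to reproduce this one example; it is not the algorithm used by solr.WordDelimiterGraphFilterFactory, which also supports options not modelled here.

```python
import re

# Sub-word pattern (assumption): a capitalized or lowercase run,
# an all-caps run, or a run of digits.
SUBWORD = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")

def word_delimiter(tokens):
    """Split each token into sub-words; delimiters such as '-' and '/' are dropped."""
    out = []
    for token in tokens:
        out.extend(SUBWORD.findall(token))
    return out

# Whitespace tokenization first, as in the slide.
tokens = "hot-spot RoboBlaster/9000 100XL".split()
print(word_delimiter(tokens))
# ['hot', 'spot', 'Robo', 'Blaster', '9000', '100', 'XL']
```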
  • 28. Seminars Stopword Filters • Reduce index size • Can improve precision (removing terms with low semantic value) • Can improve recall • Dedicated Filters: solr.StopFilterFactory, solr.ManagedStopFilterFactory [1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#stop-filter Example: <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/> </analyzer> In: "To be or what?" Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4) Out: "what"(4)
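The numbers in parentheses in the example above are token positions: removing stopwords leaves gaps rather than renumbering the surviving tokens. A minimal sketch, assuming a hypothetical stopword set standing in for the contents of stopwords.txt:

```python
def stop_filter(tokens, stopwords, ignore_case=True):
    """Drop stopwords but keep the original 1-based positions of the survivors."""
    kept = []
    for position, token in enumerate(tokens, start=1):
        key = token.lower() if ignore_case else token
        if key not in stopwords:
            kept.append((token, position))
    return kept

stopwords = {"to", "be", "or"}  # assumed contents of stopwords.txt
print(stop_filter(["To", "be", "or", "what"], stopwords))  # [('what', 4)]
```

Keeping the original positions is what lets phrase and proximity queries still account for the removed words.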
  • 29. Seminars Stemmers • Improve Recall • Reduce index size • Dedicated Filters: solr.EnglishMinimalStemFilterFactory, solr.HunspellStemFilterFactory, solr.KStemFilterFactory,
 solr.PorterStemFilterFactory, solr.SnowballPorterFilterFactory [1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#porter-stem-filter Example: <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.EnglishMinimalStemFilterFactory"/> </analyzer> In: "dogs cats" Tokenizer to Filter: "dogs", "cats" Out: "dog", "cat"
  • 30. Seminars Synonym Filters[1/2] • Improve Recall • Dedicated Filters: solr.SynonymGraphFilterFactory • Index Time -> affect terms distributions, needs re-indexing • Query Time -> more flexible [1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#synonym-graph-filter couch,sofa,divan teh => the huge,ginormous,humungous => large small => tiny,teeny,weeny
  • 31. Seminars Synonym Filters[2/2] • Improve Recall • Dedicated Filters: 
 solr.SynonymGraphFilterFactory [1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#synonym-graph-filter Example: <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/> <filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters --> </analyzer> <analyzer type="query"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/> </analyzer>
 In: "teh small couch" Tokenizer to Filter: "teh"(1), "small"(2), "couch"(3) Out: "the"(1), "tiny"(2), "teeny"(2), "weeny"(2), "couch"(3), "sofa"(3), "divan"(3)
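The rule file from the previous slide and the In/Out example above can be reproduced with a small sketch. It handles single-word rules only and ignores the graph structure that solr.SynonymGraphFilterFactory builds for multi-word synonyms; `parse_synonyms` and `synonym_filter` are hypothetical names for this illustration.

```python
def parse_synonyms(lines):
    """Parse 'a,b,c' (equivalent terms) and 'a,b => x,y' (explicit mapping) rules."""
    mapping = {}
    for line in lines:
        if "=>" in line:
            left, right = line.split("=>")
            targets = [t.strip() for t in right.split(",")]
            for source in left.split(","):
                mapping[source.strip()] = targets
        else:
            group = [t.strip() for t in line.split(",")]
            for term in group:
                mapping[term] = group
    return mapping

def synonym_filter(tokens, mapping):
    """Emit every synonym at the same 1-based position as the original token."""
    out = []
    for position, token in enumerate(tokens, start=1):
        for term in mapping.get(token, [token]):
            out.append((term, position))
    return out

rules = [
    "couch,sofa,divan",
    "teh => the",
    "huge,ginormous,humungous => large",
    "small => tiny,teeny,weeny",
]
result = synonym_filter(["teh", "small", "couch"], parse_synonyms(rules))
print(result)
```

Synonyms share the position of the token they expand, so a phrase query matches whichever variant was used.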
  • 32. Seminars Keep Word Filter • Help in Entity tagging • Dedicated Filters: solr.KeepWordFilterFactory [1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#keep-word-filter Example: <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.KeepWordFilterFactory" words="keepwords.txt" ignoreCase="true"/> </analyzer> In: "Happy, sad or funny" Tokenizer to Filter: "Happy", "sad", "or", "funny" Out: "Happy", "funny"
  • 33. Seminars N-Gram Filtering • Improve Recall • Ideal for autocompletion • Dedicated Filters: solr.EdgeNGramFilterFactory, solr.NGramFilterFactory [1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#edge-n-gram-filter <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="4"/> </analyzer>
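With minGramSize="1" and maxGramSize="4" as configured above, each token is expanded into its leading prefixes, which is what makes prefix-style autocompletion a plain term lookup. A minimal sketch of that expansion (the helper name `edge_ngrams` is an assumption):

```python
def edge_ngrams(token, min_gram=1, max_gram=4):
    """Return the prefixes of token whose length is between min_gram and max_gram."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]

print(edge_ngrams("four"))   # ['f', 'fo', 'fou', 'four']
print(edge_ngrams("score"))  # ['s', 'sc', 'sco', 'scor']
print(edge_ngrams("to"))     # ['t', 'to']
```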
  • 34. Seminars Phonetic Matching • Improve Recall • Dedicated Filters: solr.BeiderMorseFilterFactory, solr.DaitchMokotoffSoundexFilterFactory, solr.DoubleMetaphoneFilterFactory, solr.PhoneticFilterFactory [1] https://lucene.apache.org/solr/guide/8_3/phonetic-matching.html <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.BeiderMorseFilterFactory" nameType="GENERIC" ruleType="APPROX" concat="true" languageSet="auto"> </filter> </analyzer>
  • 35. Seminars Common Grams Filter • Improve Precision • Useful for phrase queries [1] https://lucene.apache.org/solr/guide/8_3/filter-descriptions.html#common-grams-filter Example: <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.CommonGramsFilterFactory" words="stopwords.txt" ignoreCase="true"/> </analyzer> In: "the Cat" Tokenizer to Filter: "the", "Cat" Out: "the_cat"
  • 38. Seminars Solr Text Analysis - Hands On! • Analysis Screen from Solr Admin • Let’s explore the schema.xml
  • 39. Seminars Indexing • Using the Solr Cell framework built on Apache Tika 
 for ingesting binary files or structured files such as Office, Word, PDF, and other proprietary formats.
 (Recommended for prototyping and exercise)
 • Uploading XML files by sending HTTP requests to the Solr server 
 from any environment where such requests can be generated.
 (Recommended for prototyping and exercise)
 • Writing a custom Java application to ingest data through Solr’s Java Client API 
 (which is described in more detail in Client APIs). 
 Using the Java API may be the best choice if you’re working with an application, 
 such as a Content Management System (CMS), that offers a Java API.

  • 40. Seminars Indexing • Indexing is the procedure of building an index from the input documents • Transaction Log (rotating on hard commits) • Index built in memory • Soft commits (visibility) • Hard commits (durability) • openSearcher=true (visibility) • Auto commit • Merge policy
  • 41. Seminars Lucene Score In order to measure the relevancy of a given result, Solr (Lucene) assigns it a “score”. The formula behind the score computation is beyond the scope of this course; however, important things that contribute to that formula are: • Term Frequency (TF): how many times a given term occurs within a single document • Document Frequency (DF): how many documents in the dataset contain a given term • TF/IDF: the product of the term frequency and the inverse document frequency (a value that shrinks as DF grows, e.g. 1/DF) • Field length: how many terms compose a field • Boosting: functions or, in general, anything that boosts the score computed for a given match. Boosting
 can be applied at index time (deprecated now) or at query time. Score values cannot be compared across queries, or even with the same query but with a different index.
  • 42. Seminars BM25 Term Scorer ● Origin from Probabilistic Information Retrieval ● Default Similarity from Lucene 6.0 [1] ● 25th iteration in improving TF-IDF ● TF ● IDF ● Document (Field) Length ● Configuration parameters [1] LUCENE-6789
  • 43. Seminars BM25 Term Scorer - Inverse Document Frequency IDF Score
 has very similar behavior
  • 44. Seminars BM25 Term Scorer - Term Frequency TF Score
 approaches
 asymptotically (k+1)
 
 k=1.2 in this example
  • 45. Seminars BM25 Term Scorer - Document Length Document Length /
 Avg Document Length
 
 affects how fast we saturate TF score
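The three BM25 ingredients from the last slides (IDF, saturating TF, and length normalization) can be put together in a short sketch. It uses the common textbook form of the formula with parameters k1 and b; the default values k1=1.2 and b=0.75 and the function names are assumptions for illustration, and real implementations may differ in constant factors.

```python
import math

def bm25_idf(doc_freq, doc_count):
    """Rarer terms (low doc_freq) get a higher weight."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_tf(freq, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Saturating term-frequency component; tends to k1 + 1 as freq grows."""
    norm = k1 * (1 - b + b * doc_len / avg_doc_len)
    return freq * (k1 + 1) / (freq + norm)

def bm25_term_score(freq, doc_freq, doc_count, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Score contribution of one query term for one document."""
    return bm25_idf(doc_freq, doc_count) * bm25_tf(freq, doc_len, avg_doc_len, k1, b)

# TF saturation for a document of average length: values climb towards k1 + 1 = 2.2.
for freq in (1, 2, 10, 1000):
    print(freq, round(bm25_tf(freq, doc_len=100, avg_doc_len=100), 3))
```

With the same term frequency, a document longer than average gets a lower TF component than a shorter one, which is the effect of the Document Length / Avg Document Length ratio described above.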
  • 46. Seminars Basic Search The list is not exhaustive and is not statically defined, because it depends on the query parser. Some parameters (e.g. filters) accept more than one value:
  • 47. Seminars Queries Query • Regulated by Query Parsers • Calculates scores • Cached with results order preserved Filter Query • Regulated by Query Parsers • Does not calculate scores • Cached independently • Reusable q=field:value fq=field:value
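As a sketch of how the scored query (q) and the unscored, independently cached filter queries (fq) travel in one request, the snippet below only builds the URL and sends nothing. The host, port, and the collection name "books" are hypothetical, as are the field names; the fq parameter is simply repeated once per filter.

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, filters=()):
    """Build a /select URL with one q parameter and one fq parameter per filter."""
    params = [("q", query)]
    params += [("fq", fq) for fq in filters]
    return f"{base_url}/select?{urlencode(params)}"

url = build_select_url(
    "http://localhost:8983/solr/books",  # hypothetical host and collection
    "title:cities",
    filters=["language:en", "downloads:[1000 TO 2000]"],
)
print(url)
```

Keeping restrictive, frequently reused clauses in fq rather than q lets them be cached and reused across different main queries.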
  • 48. Seminars Query Parsers • The main responsibility of the query parser is to understand the input query syntax and build a Lucene query • This is the first component involved in the query execution chain • If it is not specified, then a default parser is used (Solr Standard Query Parser) • Solr comes with several ready-to-use query parsers • The query parameter “defType” defines the query parser that will be used in a request
  • 49. Seminars Standard Query Parser (Parameter: Description)
 • q: Defines a query using standard query syntax. This parameter is mandatory.
 • q.op: Specifies the default operator for query expressions, overriding the default operator specified in the Schema. Possible values are "AND" or "OR".
 • df: Specifies a default field, overriding the definition of a default field in the Schema.
 • sow: Split on whitespace: if set to false, whitespace-separated term sequences will be provided to text analysis in one shot, enabling proper function of analysis filters that operate over term sequences, e.g. multi-word synonyms and shingles. Defaults to true: text analysis is invoked separately for each individual whitespace-separated term.
  • 50. Seminars Standard Query Parser
 • Phrase Search: q=title:"a tale of two cities"
 • Wildcard Search: q=title:c?ti*
 • Fuzzy Search: q=title:cties~1
 • Proximity Search: q=title:"tale cities"~2
 • Range Search: downloads:[1000 TO 2000], author:{Ada TO Carmen}
 • Boosted Search: q=tale of two cities^100 bunny
 • Constant Score Search: AND subjects:(war stories)^=4
 • Boolean Search: (field1:term1) AND (field2:term1)
  • 51. Seminars Date Queries Queries against fields using the TrieDateField type (typically range queries) should use the appropriate date syntax [1]: • timestamp:[* TO NOW] • createdate:[1976-03-06T23:59:59.999Z TO *] • createdate:[1995-12-31T23:59:59.999Z TO 2007-03-06T00:00:00Z] • pubdate:[NOW-1YEAR/DAY TO NOW/DAY+1DAY] • createdate:[1976-03-06T23:59:59.999Z TO 1976-03-06T23:59:59.999Z+1YEAR] • createdate:[1976-03-06T23:59:59.999Z/YEAR TO 1976-03-06T23:59:59.999Z] [1] https://en.wikipedia.org/wiki/ISO_8601 Timezone: by default, all date math expressions are evaluated relative to the UTC time zone, but the TZ parameter can be specified to override this behaviour. N.B. Regardless of the locale in which Solr runs, only ISO-8601 dates are supported in requests.
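Since only ISO-8601 instants in UTC are accepted, client code usually needs a small formatting helper before building range queries like the ones above. The following sketch shows one way to do it in Python; `solr_date` and `date_range` are hypothetical helper names, and date math expressions such as NOW/DAY are not covered.

```python
from datetime import datetime, timezone

def solr_date(dt):
    """Format a timezone-aware datetime as a UTC ISO-8601 instant with milliseconds."""
    utc = dt.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"

def date_range(field, start=None, end=None):
    """Build an inclusive range clause; a missing bound becomes the open-ended '*'."""
    lo = solr_date(start) if start else "*"
    hi = solr_date(end) if end else "*"
    return f"{field}:[{lo} TO {hi}]"

start = datetime(1976, 3, 6, 23, 59, 59, 999000, tzinfo=timezone.utc)
print(date_range("createdate", start))
# createdate:[1976-03-06T23:59:59.999Z TO *]
```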
  • 52. Seminars Solr Query Debug - Hands On! • debug=query: return debug information about the query only. • debug=timing: return debug information about how long the query took to process. • debug=results: return debug information about the score results (also known as "explain").
  • 53. Seminars Master Thesis: Click Models to Estimate Relevancy Ratings from Users Interactions Main responsibility of the candidate will be to:
 • learn basic concepts of Agile methodologies for software engineering
 • learn details of Search Quality Evaluation
 • grasp the fundamentals of click modelling, implicit and explicit relevancy feedback
 • design and implement the module in an existing Spring Boot REST service application 
 • benchmark the solution(s) through a careful quality/performance (time/space) analysis
  • 54. Seminars Master Thesis: Search Quality Evaluation for Continuous Integration Tools Main responsibility of the candidate will be to: 
 
 • learn basic concepts of Agile methodologies for software engineering • get familiar with Apache Lucene based search engines (Apache Solr / Elasticsearch) • learn details of Search Quality Evaluation • grasp the fundamentals of Continuous Integration and Continuous Deployment through well-established, industry-level technologies • design and implement plugins for Jenkins, Atlassian Bamboo and JetBrains TeamCity