Search Quality Evaluation
Tools and Techniques
Alessandro Benedetti, Software Engineer
Andrea Gazzarini, Software Engineer
2nd October 2018
Who we are
▪ Search Consultant
▪ R&D Software Engineer
▪ Master in Computer Science
▪ Apache Lucene/Solr Enthusiast
▪ Passionate about Semantic, NLP and Machine Learning technologies
▪ Beach Volleyball Player & Snowboarder
Alessandro Benedetti
Who we are
▪ Software Engineer (1999-)
▪ “Hermit” Software Engineer (2010-)
▪ Java & Information Retrieval Passionate
▪ Apache Qpid (past) Committer
▪ Husband & Father
▪ Bass Player
Andrea Gazzarini, “Gazza”
Sease
Search Services
● Open Source Enthusiasts
● Apache Lucene/Solr experts
● Community Contributors
● Active Researchers
● Hot Trends : Learning To Rank, Document Similarity,
Measuring Search Quality, Relevancy Tuning
✓ Search Quality Evaluation
‣ Context overview
‣ Correctness
‣ Evaluation Measures
➢ Rated Ranking Evaluator (RRE)
➢ Future Works
➢ Q&A
Agenda
Search engineering is the production of quality
search systems.
Search quality (and in general software quality) is a
huge topic which can be described using internal
and external factors.
In the end, only external factors matter: those that
can be perceived by users and customers. But the
key to reaching optimal levels of those external
factors is the internal ones.
One of the main differences between search quality
and software quality (especially from a correctness
perspective) lies in the ok / ko judgement, which is,
in general, more “deterministic” in the case of
software development.
Context Overview
Search Quality
Internal Factors
External Factors
Correctness
Robustness
Extendibility
Reusability
Efficiency
Timeliness
Modularity
Readability
Maintainability
Testability
Maintainability
Understandability
Reusability
….
Focused on
Primarily focused on
Search Quality Evaluation
Search Quality Evaluation: Correctness
Correctness is the ability of a system to perform its
exact task, as defined by its specification.
The search domain is critical from this perspective
because correctness depends on subjective user
judgements.
For each internal (gray) and external (red) iteration
we need to find a way to measure the correctness.
Evaluation measures for an information retrieval
system are used to assess how well the search results
satisfy the user's query intent.
Correctness
New system Existing system
Here are the requirements
V1.0 has been released
Cool!
a month later…
We have a change request. We found a bug
We need to improve our search
system, users are complaining
about junk in search results.
v0.1
…
v0.9
v1.1
v1.2
v1.3
…
v2.0
v2.0
How can we know where our system is going
between versions, in terms of correctness?
Search Quality Evaluation / Measures
Evaluation measures for an information retrieval
system try to formalise how well a search system
satisfies its user information needs.
Measures are generally split into two categories:
online and offline measures.
In this context we will focus on offline measures.
We will talk about something that can help a search
engineer during their ordinary day (i.e. in those phases
previously called “internal iterations”).
We will also see how the same tool can be used
more broadly, for example contributing to the continuous
integration pipeline or even delivering value to
functional stakeholders.
Evaluation Measures
Online Measures
Offline Measures
Average Precision
Mean Reciprocal Rank
Recall
NDCG
Precision
Click-through rate
F-Measure
Zero result rate
Session abandonment rate
Session success rate
….
….
We are mainly focused here
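To make the offline measures listed above concrete, here is a minimal Java sketch (not part of RRE; class and method names are illustrative) computing two of them, Precision@k and Reciprocal Rank, from a ranked result list and a set of judged relevant document ids:

```java
import java.util.List;
import java.util.Set;

// Illustrative helper, not RRE code: computes two offline measures
// from a ranked list of document ids and the set of judged relevant ids.
public final class OfflineMetrics {

    // Precision@k: fraction of the top k results that are relevant.
    public static double precisionAt(int k, List<String> ranked, Set<String> relevant) {
        if (k <= 0 || ranked.isEmpty()) return 0.0;
        long hits = ranked.stream().limit(k).filter(relevant::contains).count();
        return (double) hits / k;
    }

    // Reciprocal Rank: 1 / position of the first relevant result (0 if none).
    public static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) return 1.0 / (i + 1);
        }
        return 0.0;
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("d3", "d7", "d1", "d9");
        Set<String> relevant = Set.of("d1", "d9");
        System.out.println(precisionAt(3, ranked, relevant));  // 0.333... (1 relevant in top 3)
        System.out.println(reciprocalRank(ranked, relevant));  // 0.333... (first relevant at rank 3)
    }
}
```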
➢ Search Quality Evaluation
✓Rated Ranking Evaluator (RRE)
‣ What is it?
‣ How does it work?
‣ Evaluation Process Input & Output
‣ Challenges
➢ Future Works
➢ Q&A
Agenda
RRE: What is it?
• A set of search quality evaluation tools
• A search quality evaluation framework
• Multi (search) platform
• Written in Java
• It can also be used in non-Java projects
• Licensed under Apache 2.0
• Open to contributions
• Extremely dynamic!
RRE: What is it?
https://github.com/SeaseLtd/rated-ranking-evaluator
RRE: At a glance
• Apache Lucene/Solr London: 2 people, 10 modules, 48,950 lines of code, 2 months
• Today: 2 people, 10 modules, 67,317 lines of code, 5 months
RRE: Ecosystem
The picture illustrates the main modules composing
the RRE ecosystem.
All modules with a dashed border are planned for a
future release.
RRE CLI has a double border because although the
rre-cli module hasn’t been developed, you can run
RRE from a command line using RRE Maven
archetype, which is part of the current release.
As you can see, the current implementation includes
two target search platforms: Apache Solr and
Elasticsearch.
The Search Platform API module provides a search
platform abstraction for plugging in additional
search systems.
RRE Ecosystem
CORE
Plugin
Plugin
Reporting Plugin
Search
Platform
API
RequestHandler
RRE Server
RRE CLI
Plugin
Plugin
Plugin
Archetypes
RRE: Available metrics
These are the RRE built-in metrics which can be
used out of the box.
Most of them are computed at query level
and then aggregated at the upper levels.
However, compound metrics (e.g. MAP or GMAP)
are not explicitly declared or defined, because their
computation doesn't happen at query level: the
aggregation executed on the upper levels
automatically produces these metrics.
For example, the Average Precision computed for
Q1, Q2, Q3, Qn becomes the Mean Average
Precision at Query Group or Topic levels.
Available Metrics
Precision
Recall
Precision at 1 (P@1)
Precision at 2 (P@2)
Precision at 3 (P@3)
Precision at 10 (P@10)
Average Precision (AP)
Reciprocal Rank
Mean Reciprocal Rank
Mean Average Precision (MAP)
Normalised Discounted Cumulative Gain
F-Measure
Compound Metric
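As a sketch of the aggregation described above (not RRE's actual implementation), the Average Precision of each query can be computed per ranked list and then simply averaged to obtain the Mean Average Precision at query group, topic or corpus level:

```java
import java.util.List;
import java.util.Set;

// Illustrative sketch of metric aggregation: per-query Average Precision
// values averaged into MAP at an upper level (query group, topic, corpus).
public final class Aggregation {

    // Average Precision for a single ranked list against the judged relevant ids.
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) return 0.0;
        double sum = 0.0;
        int hits = 0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);   // precision at this relevant hit
            }
        }
        return sum / relevant.size();
    }

    // MAP: the arithmetic mean of the per-query AP values.
    static double meanAveragePrecision(List<Double> perQueryAp) {
        return perQueryAp.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }
}
```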
RRE: Domain Model (1/2)
RRE Domain Model is organized into a composite /
tree-like structure where the relationships between
entities are always 1 to many.
The top level entity is a placeholder representing an
evaluation execution.
Versioned metrics are computed at query level and
then reported, using an aggregation function, at
upper levels.
The benefit of having a composite structure is clear:
we can see a metric value at different levels (e.g. a
query, all queries belonging to a query group, all
queries belonging to a topic or at corpus level)
RRE Domain Model
Evaluation
Corpus
1..*
v1.0
P@10
NDCG
AP
F-MEASURE
….
v1.1
P@10
NDCG
AP
F-MEASURE
….
v1.2
P@10
NDCG
AP
F-MEASURE
….
v1.n
P@10
NDCG
AP
F-MEASURE
….
Topic
Query Group
Query
1..*
1..*
1..*
…
Top level domain entity
Test dataset / collection
Information need
Query variants
Queries
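A rough Java sketch of such a composite structure (illustrative only; the real RRE classes differ), where every non-leaf node derives its metric values by aggregating its children per version:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative composite: Evaluation -> Corpus -> Topic -> QueryGroup -> Query.
// Each node holds metric values per version; parents average over their children.
class Node {
    final String name;
    final List<Node> children = new ArrayList<>();
    // version -> metric name -> value (only populated on leaf nodes, i.e. queries)
    final Map<String, Map<String, Double>> metricsByVersion = new HashMap<>();

    Node(String name) { this.name = name; }

    // Returns the metric value for a version, aggregated by averaging the children.
    double aggregated(String version, String metric) {
        if (children.isEmpty()) {
            return metricsByVersion
                    .getOrDefault(version, Map.of())
                    .getOrDefault(metric, 0.0);
        }
        return children.stream()
                .mapToDouble(child -> child.aggregated(version, metric))
                .average()
                .orElse(0.0);
    }
}
```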
RRE: Domain Model (2/2)
Although the domain model structure is able to
capture complex scenarios, sometimes we want to
model simpler contexts.
In order to avoid verbose and redundant ratings
definitions it's possible to omit some levels.
Specifically, we can be in one of the following cases:
• only queries
• query groups and queries
• topics, query groups and queries
RRE Domain Model
Evaluation
Corpus
1..*
v1.0
P@10
NDCG
AP
F-MEASURE
….
v1.1
P@10
NDCG
AP
F-MEASURE
….
v1.2
P@10
NDCG
AP
F-MEASURE
….
v1.n
P@10
NDCG
AP
F-MEASURE
….
Topic
Query Group
Query
1..*
1..*
1..*
…
= Required
= Optional
RRE: Evaluation process overview (1/2)
Data
Configuration
Ratings
Search Platform
uses a
produces
Evaluation Data
INPUT LAYER EVALUATION LAYER OUTPUT LAYER
JSON
RRE Console
…
used for generating
RRE: Evaluation process overview (2/2)
Runtime Container
RRE Core
For each ratings set
For each dataset
For each topic
For each query group
For each query
Starts the search platform
Stops the search platform
Creates & configure the index
Indexes data
For each version Executes query
Computes metric
outputs the evaluation data
uses the evaluation data
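The flow above can be summarised in the following self-contained Java sketch; all type and method names are hypothetical and do not belong to the actual RRE Core API:

```java
import java.util.List;
import java.util.Map;

// Minimal, self-contained sketch of the evaluation flow above.
// All names are hypothetical; the real RRE Core API is different.
public class EvaluationLoopSketch {

    interface SearchPlatform {
        void start();
        void createAndConfigureIndex(String dataset);
        void index(String dataset);
        List<String> execute(String query, String version);  // returns ranked doc ids
        void stop();
    }

    record QueryGroup(List<String> queries, Map<String, Integer> judgements) {}
    record Topic(List<QueryGroup> queryGroups) {}
    record RatingsSet(List<String> datasets, List<Topic> topics) {}

    static void evaluate(SearchPlatform platform, List<RatingsSet> ratingsSets, List<String> versions) {
        platform.start();                                           // start the search platform
        for (RatingsSet ratings : ratingsSets) {                    // for each ratings set
            for (String dataset : ratings.datasets()) {             // for each dataset
                platform.createAndConfigureIndex(dataset);          // create & configure the index
                platform.index(dataset);                            // index data
                for (Topic topic : ratings.topics()) {              // for each topic
                    for (QueryGroup group : topic.queryGroups()) {  // for each query group
                        for (String query : group.queries()) {      // for each query
                            for (String version : versions) {       // for each configuration version
                                List<String> results = platform.execute(query, version); // execute query
                                // compute each metric here from results + group.judgements()
                                System.out.println(version + " " + query + " -> " + results.size() + " results");
                            }
                        }
                    }
                }
            }
        }
        platform.stop();                                            // stop the search platform
        // The computed metrics are collected into the evaluation data used by the output layer.
    }
}
```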
RRE: Corpora
An evaluation execution can involve more than one
dataset targeting a given search platform.
A dataset consists of representative domain
data; although a compressed dataset can be
provided, it generally has a small/medium size.
Within RRE, corpus, dataset and collection are
synonyms.
Datasets must be located under a configurable
folder. Each dataset is then referenced in one or
more ratings files.
Corpora
RRE: Configuration Sets
The search platform configuration evolves over
time (e.g. change requests, enhancements, bugs)
RRE encourages an incremental approach for
managing the configuration instances. Even for
internal or small iterations, each time we make a
relevant change to the current configuration, it’s
better to clone it and move forward with a new
version.
In this way we’ll end up having the historical
progression of our system, and RRE will be able to
make comparisons.
The evaluation process allows you to define
inclusion / exclusion rules (e.g. include only versions
1.0 and 2.0).
Configuration Sets
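For example, the configuration sets might be laid out as one folder per version (the folder names here are purely illustrative; see the RRE documentation for the expected layout):

```text
configuration_sets/
  v1.0/   # first configuration (e.g. baseline schema + solrconfig)
  v1.1/   # clone of v1.0 plus a relevant change (e.g. a new synonyms file)
  v1.2/   # clone of v1.1 plus another change
```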
RRE: Ratings
Ratings files associate the RRE domain model
entities with relevance judgments. A ratings file
provides the association between queries and
relevant documents.
There must be at least one ratings file (otherwise no
evaluation happens). Usually there's a 1:1
relationship between a ratings file and a dataset.
Judgments, the most important part of this file,
consist of a list of all relevant documents for a
query group.
Each listed document has a corresponding “gain”,
which is the relevance judgement we want to assign
to that document.
Ratings
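A minimal, illustrative ratings file could look like the sketch below; the field names follow the structure described above but may not match the exact RRE schema, so consult the project README for the authoritative format:

```json
{
  "index": "products",
  "id_field": "id",
  "topics": [
    {
      "description": "Fruit searches",
      "query_groups": [
        {
          "name": "strawberries",
          "queries": [
            { "template": "only_q.json", "placeholders": { "$query": "strawberry" } }
          ],
          "relevant_documents": {
            "doc_1": { "gain": 3 },
            "doc_7": { "gain": 2 }
          }
        }
      ]
    }
  ]
}
```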
RRE: Evaluation Output
The RRE Core itself is a library, so it outputs its
result as a plain Java object that must be
used programmatically.
However, when wrapped within a runtime container,
like the Maven plugin, the evaluation object tree is
marshalled into JSON.
Being interoperable, the JSON format can be consumed
by other components to produce different kinds
of output.
An example of such usage is the RRE Apache
Maven Reporting Plugin which can
• output a spreadsheet
• send the evaluation data to a running RRE Server
Evaluation output
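As a sketch of how another component might consume that JSON (the file path and field names below are assumptions; inspect the evaluation output produced by your own build for the actual structure):

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

// Illustrative consumer of the evaluation JSON produced by the Maven plugin.
// The "metrics" / "name" field names are assumptions mirroring the domain model.
public class EvaluationOutputReader {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonNode evaluation = mapper.readTree(new File("target/rre/evaluation.json")); // path is illustrative

        // Walk the tree generically and print every node that carries metric values.
        evaluation.findParents("metrics").forEach(node ->
                System.out.println(node.path("name").asText("?") + " -> " + node.path("metrics")));
    }
}
```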
RRE: Workbook
The RRE domain model (topics, groups and queries)
is on the left, and each metric (in the right section)
has a value for each version / entity pair.
In case the evaluation process includes multiple
datasets, there will be a spreadsheet for each of
them.
This output format is useful when
• you want to have (or keep somewhere) a
snapshot of how the system performed at a
given moment
• the comparison includes a lot of versions
• you want to include all available metrics
Workbook
RRE: RRE Server (1/2)
The RRE console is a SpringBoot/AngularJS
application which shows real-time information about
evaluation results.
Each time a build happens, the RRE reporting
plugin sends the evaluation result to a RESTful
endpoint provided by the RRE Server.
The received data immediately refreshes the web
dashboard.
This is useful during development / tuning
iterations (you don't have to reopen the Excel
report again and again).
RRE Server
The evaluation data, at query / version level, collects the top n search results.
In the web console, under each query, there's a little arrow which allows you to open / hide the section containing those results.
In this way you can immediately grasp the meaning of each metric and how its values change between versions.
In the example above, you can immediately see why there's a loss of precision (first metric) between v1.0 and v1.1, which got fixed in v1.2.
RRE: RRE Server (2/2)
RRE: Iterative development & tuning
Dev, tune & Build
Check evaluation results
We are thinking about how
to fill a third monitor
RRE: Challenges
“I think if we could create a simplified
pass/fail report for the business team,
that would be ideal. So they could
understand the tradeoffs of the new
search.”
“Many search engines process the user
query heavily before it's submitted to the
search engine in whatever DSL is required,
and if you don't retain some idea of the
original query in the system how can you
relate the test results back to user
behaviour?”
Do I have to write all judgments
manually??
How can I use RRE if I have a custom
search platform?
Java is not in my stack
➢ Search Quality Evaluation
➢ Rated Ranking Evaluator
✓ Future Works / Idea
➢ Q&A
Agenda
Future Works: Solr Rank Eval API
The RRE core can be used for implementing a
RequestHandler which will be able to expose a
Ranking Evaluation endpoint.
That would result in the same functionality introduced
in Elasticsearch 6.2 [1] with some differences.
• rich tree data model
• metrics framework
Note that in this case it doesn't make much sense
to provide comparisons between versions.
As part of the same module there could be a
SearchComponent for evaluating a single query
interaction.
[1] https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-rank-eval.html
Rank Eval API
/rank_eval
?q=something&evaluate=true
+
RRE
RequestHandler
+
RRE
SearchComponent
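A hypothetical skeleton of such a RequestHandler is sketched below; this module does not exist yet, and the RRE evaluation wiring is left as pseudo-code in the comments:

```java
import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;

// Hypothetical sketch of a /rank_eval handler built on top of the RRE core.
// Only the Solr plugin surface is shown; the evaluation wiring is pseudo-code.
public class RankEvalRequestHandler extends RequestHandlerBase {

    @Override
    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp) throws Exception {
        String q = req.getParams().get("q");
        // Pseudo-code: run the RRE core against the ratings associated with this query
        // and attach the resulting metric values to the response.
        // Evaluation evaluation = rreCore.evaluate(q, ratings);
        rsp.add("query", q);
        // rsp.add("metrics", evaluation.metrics());
    }

    @Override
    public String getDescription() {
        return "RRE-based ranking evaluation endpoint (sketch)";
    }
}
```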
Future Works: Jenkins Plugin
RRE Maven plugin already produces the evaluation
data in a machine-readable format (JSON) which
can be consumed by another component.
The Maven RRE Report plugin or the RRE Server
are just two examples of such consumers.
RRE can already be executed in a Jenkins CI build
cycle (using the Maven plugin).
By means of a dedicated Jenkins plugin, the
evaluation data could be graphically displayed in the
Jenkins dashboard. It could even be used for
blocking builds that produce bad evaluation results.
Jenkins Plugin
Future Works: Building the input
The main input for RRE is the Ratings file, in JSON format.
Writing a comprehensive JSON file to detail the ratings sets for your search ecosystem can be expensive!
Explicit feedback:
1. Explicit feedback from users' judgements
2. An intuitive UI allows judges to run queries, see documents and rate them
3. The relevance label is explicitly assigned by domain experts
Implicit feedback:
1. Implicit feedback from user interactions (clicks, sales, …)
2. Log to disk / an internal Solr instance for analytics
3. Estimate the <q,d> relevance label based on Click Through Rate, Sales Rate
Users Interactions Logger
Judgement Collector UI
Quality Metrics
Ratings Set
Explicit Feedback
Implicit Feedback
RRE
Judgements Collector
Interactions Logger
Future Works: Learning To Rank
Once the ratings have been collected, can we use them to actively improve the quality metrics?
“Learning to rank is the application of machine
learning, typically supervised, semi-supervised or
reinforcement learning, in the construction of
ranking models for information retrieval
systems.” Wikipedia
Learning To Rank
Users Interactions Logger
Judgement Collector UI
Interactions
Training
Future Works: Training Set Building
Creating a Learning To Rank training set from the collected interactions is not going to be trivial.
It normally requires ad hoc data manipulation depending on the use case…
… but some steps could be automated and made available as a generic, configurable approach:
▪ Null feature sanitisation
▪ Query Id calculation
▪ Query document feature generation
▪ Single/Multi valued categorical feature encoding
Configuration
1. Ad hoc category, artificial values, keep NaN -> depends on the training library to use
2. Optional query-level features to be hashed as a query id
3. Intersect related query-level and document-level categorical features to generate ordinal query-document features
4. Label encoding? One-hot encoding? Binary encoding? [1] Beware of the dummy variable trap
[1] https://www.datacamp.com/community/tutorials/categorical-data
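For instance, hashing the optional query-level features into a query id (step 2) and one-hot encoding a categorical feature (step 4) could be sketched as follows (illustrative helpers, not an RRE module):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;

// Illustrative helpers for building a Learning To Rank training set.
public final class TrainingSetHelpers {

    // Step 2: hash the (optional) query-level features into a stable query id.
    public static String queryId(List<String> queryLevelFeatures) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(String.join("|", queryLevelFeatures).getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(digest);
    }

    // Step 4: one-hot encode a single-valued categorical feature against a fixed vocabulary.
    // Dropping one column (to avoid the dummy variable trap) is left to the caller.
    public static int[] oneHot(String value, List<String> vocabulary) {
        int[] encoded = new int[vocabulary.size()];
        int index = vocabulary.indexOf(value);
        if (index >= 0) encoded[index] = 1;
        return encoded;
    }
}
```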
Future Works: Training Set Building
What about the relevance label for each training vector?
Can we estimate it from the collected interactions?
▪ Interaction type counts
▪ Click Through Rate / Sales Through Rate calculation
▪ Relevance label normalisation
Configuration
1. Impressions? Clicks? Bookmarks? Add to cart? Sales?
2. Define the objective: Clicks/Impressions? Sales/Impressions?
3. Relevance label: 0…4
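A minimal sketch of such an estimate; the CTR threshold and the mapping to the 0…4 scale below are arbitrary choices, not a prescribed formula:

```java
// Illustrative estimate of a 0..4 relevance label for a <query, document> pair
// from logged interactions; the scaling choice here is arbitrary.
public final class RelevanceLabelEstimator {

    public static int labelFromCtr(long clicks, long impressions) {
        if (impressions == 0) return 0;                 // never shown: no evidence
        double ctr = (double) clicks / impressions;     // objective: clicks / impressions
        // Normalise the rate into the 0..4 gain scale used by the ratings file.
        return (int) Math.round(Math.min(1.0, ctr / 0.25) * 4);
    }

    public static void main(String[] args) {
        System.out.println(labelFromCtr(10, 200));  // CTR 0.05 -> label 1
        System.out.println(labelFromCtr(60, 200));  // CTR 0.30 -> label 4
    }
}
```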
Future Works: Learning To Rank Solr Configs
Can the features.json configuration generation be automated?
The features.json is a configuration file necessary
for the Solr Learning To Rank extension to work.
It describes how the features that were used at
training time for the model can be extracted at
query time.
This file is coupled with both the training set
features and the query-time features.
Learning To Rank - Solr features.json
Users Interactions
Logger
Training Set Builder
Configuration
Features.json
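For reference, a Solr LTR features.json is a JSON array of feature definitions; a minimal illustrative example (the field and feature names are made up):

```json
[
  {
    "name": "originalScore",
    "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params": {}
  },
  {
    "name": "productPrice",
    "class": "org.apache.solr.ltr.feature.FieldValueFeature",
    "params": { "field": "price" }
  },
  {
    "name": "isInStock",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "fq": ["{!term f=in_stock}true"] }
  }
]
```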
➢ Search Quality Evaluation
➢ Rated Ranking Evaluator
➢ Future Works
✓ Q&A
Search Quality Evaluation
Tools and techniques
Alessandro Benedetti - Software Engineer
Andrea Gazzarini - Software Engineer
2nd October 2018
Thank you!
More Related Content

What's hot

How to Build your Training Set for a Learning To Rank Project - Haystack
How to Build your Training Set for a Learning To Rank Project - HaystackHow to Build your Training Set for a Learning To Rank Project - Haystack
How to Build your Training Set for a Learning To Rank Project - HaystackSease
 
Rated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: An Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: An Open Source Approach for Search Quality EvaluationAlessandro Benedetti
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusSease
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Sease
 
From Academic Papers To Production : A Learning To Rank Story
From Academic Papers To Production : A Learning To Rank StoryFrom Academic Papers To Production : A Learning To Rank Story
From Academic Papers To Production : A Learning To Rank StoryAlessandro Benedetti
 
Advanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAdvanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAlessandro Benedetti
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrSease
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to RankSease
 
Enterprise Search – How Relevant Is Relevance?
Enterprise Search – How Relevant Is Relevance?Enterprise Search – How Relevant Is Relevance?
Enterprise Search – How Relevant Is Relevance?Sease
 
How the Lucene More Like This Works
How the Lucene More Like This WorksHow the Lucene More Like This Works
How the Lucene More Like This WorksSease
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Lucidworks
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Lucidworks
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrTrey Grainger
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Trey Grainger
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Lucidworks
 
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough dataВладимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough dataMail.ru Group
 

What's hot (17)

How to Build your Training Set for a Learning To Rank Project - Haystack
How to Build your Training Set for a Learning To Rank Project - HaystackHow to Build your Training Set for a Learning To Rank Project - Haystack
How to Build your Training Set for a Learning To Rank Project - Haystack
 
Rated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: An Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
 
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
Evaluating Your Learning to Rank Model: Dos and Don’ts in Offline/Online Eval...
 
From Academic Papers To Production : A Learning To Rank Story
From Academic Papers To Production : A Learning To Rank StoryFrom Academic Papers To Production : A Learning To Rank Story
From Academic Papers To Production : A Learning To Rank Story
 
Advanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache LuceneAdvanced Document Similarity With Apache Lucene
Advanced Document Similarity With Apache Lucene
 
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to Rank
 
Enterprise Search – How Relevant Is Relevance?
Enterprise Search – How Relevant Is Relevance?Enterprise Search – How Relevant Is Relevance?
Enterprise Search – How Relevant Is Relevance?
 
How the Lucene More Like This Works
How the Lucene More Like This WorksHow the Lucene More Like This Works
How the Lucene More Like This Works
 
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
Leveraging Lucene/Solr as a Knowledge Graph and Intent Engine: Presented by T...
 
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
Reflected Intelligence - Lucene/Solr as a self-learning data system: Presente...
 
Self-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache SolrSelf-learned Relevancy with Apache Solr
Self-learned Relevancy with Apache Solr
 
Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...Crowdsourced query augmentation through the semantic discovery of domain spec...
Crowdsourced query augmentation through the semantic discovery of domain spec...
 
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
Implementing Conceptual Search in Solr using LSA and Word2Vec: Presented by S...
 
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough dataВладимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
Владимир Гулин, Mail.Ru Group, Learning to rank using clickthrough data
 
Haystacks slides
Haystacks slidesHaystacks slides
Haystacks slides
 

Similar to Haystack London - Search Quality Evaluation, Tools and Techniques

Rated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: An Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: An Open Source Approach for Search Quality EvaluationAlessandro Benedetti
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationSease
 
Haystack 2019 - Rated Ranking Evaluator: an Open Source Approach for Search Q...
Haystack 2019 - Rated Ranking Evaluator: an Open Source Approach for Search Q...Haystack 2019 - Rated Ranking Evaluator: an Open Source Approach for Search Q...
Haystack 2019 - Rated Ranking Evaluator: an Open Source Approach for Search Q...OpenSource Connections
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSease
 
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State UniversityLSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State Universitydhabalia
 
Web Rec Final Report
Web Rec Final ReportWeb Rec Final Report
Web Rec Final Reportweichen
 
Building multi billion ( dollars, users, documents ) search engines on open ...
Building multi billion ( dollars, users, documents ) search engines  on open ...Building multi billion ( dollars, users, documents ) search engines  on open ...
Building multi billion ( dollars, users, documents ) search engines on open ...Andrei Lopatenko
 
Opinion Driven Decision Support System
Opinion Driven Decision Support SystemOpinion Driven Decision Support System
Opinion Driven Decision Support SystemKavita Ganesan
 
E-Commerce Product Rating Based on Customer Review
E-Commerce Product Rating Based on Customer ReviewE-Commerce Product Rating Based on Customer Review
E-Commerce Product Rating Based on Customer ReviewIRJET Journal
 
Different Methodologies For Testing Web Application Testing
Different Methodologies For Testing Web Application TestingDifferent Methodologies For Testing Web Application Testing
Different Methodologies For Testing Web Application TestingRachel Davis
 
FinalInnovationWeek-Dec12
FinalInnovationWeek-Dec12FinalInnovationWeek-Dec12
FinalInnovationWeek-Dec12Praneet Mhatre
 
Feature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon ReviewsFeature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon ReviewsRavi Kiran Holur Vijay
 
Ijmer 46067276
Ijmer 46067276Ijmer 46067276
Ijmer 46067276IJMER
 
Ijmer 46067276
Ijmer 46067276Ijmer 46067276
Ijmer 46067276IJMER
 
A Review on Sentimental Analysis of Application Reviews
A Review on Sentimental Analysis of Application ReviewsA Review on Sentimental Analysis of Application Reviews
A Review on Sentimental Analysis of Application ReviewsIJMER
 
Online BookStore Recommender Systems Using Collaborative Filtering Algorithm
Online BookStore Recommender Systems Using Collaborative Filtering AlgorithmOnline BookStore Recommender Systems Using Collaborative Filtering Algorithm
Online BookStore Recommender Systems Using Collaborative Filtering AlgorithmBinay Sharma
 
Datapedia Analysis Report
Datapedia Analysis ReportDatapedia Analysis Report
Datapedia Analysis ReportAbanoub Amgad
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comSimon Hughes
 

Similar to Haystack London - Search Quality Evaluation, Tools and Techniques (20)

Rated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: An Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: An Open Source Approach for Search Quality Evaluation
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
 
Haystack 2019 - Rated Ranking Evaluator: an Open Source Approach for Search Q...
Haystack 2019 - Rated Ranking Evaluator: an Open Source Approach for Search Q...Haystack 2019 - Rated Ranking Evaluator: an Open Source Approach for Search Q...
Haystack 2019 - Rated Ranking Evaluator: an Open Source Approach for Search Q...
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State UniversityLSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
 
Web Rec Final Report
Web Rec Final ReportWeb Rec Final Report
Web Rec Final Report
 
Building multi billion ( dollars, users, documents ) search engines on open ...
Building multi billion ( dollars, users, documents ) search engines  on open ...Building multi billion ( dollars, users, documents ) search engines  on open ...
Building multi billion ( dollars, users, documents ) search engines on open ...
 
Opinion Driven Decision Support System
Opinion Driven Decision Support SystemOpinion Driven Decision Support System
Opinion Driven Decision Support System
 
E-Commerce Product Rating Based on Customer Review
E-Commerce Product Rating Based on Customer ReviewE-Commerce Product Rating Based on Customer Review
E-Commerce Product Rating Based on Customer Review
 
Different Methodologies For Testing Web Application Testing
Different Methodologies For Testing Web Application TestingDifferent Methodologies For Testing Web Application Testing
Different Methodologies For Testing Web Application Testing
 
FinalInnovationWeek-Dec12
FinalInnovationWeek-Dec12FinalInnovationWeek-Dec12
FinalInnovationWeek-Dec12
 
Feature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon ReviewsFeature Based Opinion Mining from Amazon Reviews
Feature Based Opinion Mining from Amazon Reviews
 
Innovation week dec12
Innovation week dec12Innovation week dec12
Innovation week dec12
 
Ijmer 46067276
Ijmer 46067276Ijmer 46067276
Ijmer 46067276
 
Ijmer 46067276
Ijmer 46067276Ijmer 46067276
Ijmer 46067276
 
A Review on Sentimental Analysis of Application Reviews
A Review on Sentimental Analysis of Application ReviewsA Review on Sentimental Analysis of Application Reviews
A Review on Sentimental Analysis of Application Reviews
 
Introduction
IntroductionIntroduction
Introduction
 
Online BookStore Recommender Systems Using Collaborative Filtering Algorithm
Online BookStore Recommender Systems Using Collaborative Filtering AlgorithmOnline BookStore Recommender Systems Using Collaborative Filtering Algorithm
Online BookStore Recommender Systems Using Collaborative Filtering Algorithm
 
Datapedia Analysis Report
Datapedia Analysis ReportDatapedia Analysis Report
Datapedia Analysis Report
 
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.comEnhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
Enhancing Enterprise Search with Machine Learning - Simon Hughes, Dice.com
 

Recently uploaded

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPTBoston Institute of Analytics
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 217djon017
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 

Recently uploaded (20)

RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default  Presentation : Data Analysis Project PPTPredictive Analysis for Loan Default  Presentation : Data Analysis Project PPT
Predictive Analysis for Loan Default Presentation : Data Analysis Project PPT
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2Easter Eggs From Star Wars and in cars 1 and 2
Easter Eggs From Star Wars and in cars 1 and 2
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service9654467111 Call Girls In Munirka Hotel And Home Service
9654467111 Call Girls In Munirka Hotel And Home Service
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 

Haystack London - Search Quality Evaluation, Tools and Techniques

  • 1. Search Quality Evaluation Tools and Techniques Alessandro Benedetti, Software Engineer Andrea Gazzarini, Software Engineer 2nd October 2018
  • 2. Who we are ▪ Search Consultant ▪ R&D Software Engineer ▪ Master in Computer Science ▪ Apache Lucene/Solr Enthusiast ▪ Semantic, NLP, Machine Learning Technologies passionate ▪ Beach Volleyball Player & Snowboarder Alessandro Benedetti
  • 3. Who we are ▪ Software Engineer (1999-) ▪ “Hermit” Software Engineer (2010-) ▪ Java & Information Retrieval Passionate ▪ Apache Qpid (past) Committer ▪ Husband & Father ▪ Bass Player Andrea Gazzarini, “Gazza”
  • 4. Sease Search Services ● Open Source Enthusiasts ● Apache Lucene/Solr experts ! Community Contributors ● Active Researchers ● Hot Trends : Learning To Rank, Document Similarity, Measuring Search Quality, Relevancy Tuning
  • 5. ✓ Search Quality Evaluation ‣ Context overview ‣ Correctness ‣ Evaluation Measures ➢ Rated Ranking Evaluator (RRE) ➢ Future Works ➢ Q&A Agenda
  • 6. Search engineering is the production of quality search systems. Search quality (and in general software quality) is a huge topic which can be described using internal and external factors. In the end, only external factors matter, those that can be perceived by users and customers. But the key for getting optimal levels of those external factors are the internal ones. One of the main differences between search and software quality (especially from a correctness perspective) is in the ok / ko judgment, which is, in general, more “deterministic” in case of software development. Context OverviewSearch Quality Internal Factors External Factors Correctness RobustnessExtendibility Reusability Efficiency Timeliness Modularity Readability Maintainability Testability Maintainability Understandability Reusability …. Focused on Primarily focused on Search Quality Evaluation
  • 7. Search Quality Evaluation: Correctness Correctness is the ability of a system to perform its exact task, as defined by its specification. Search domain is critical from this perspective because correctness depends on arbitrary user judgments. For each internal (gray) and external (red) iteration we need to find a way to measure the correctness. Evaluation measures for an information retrieval system are used to assert how well the search results satisfied the user's query intent. Correctness New system Existing system Here are the requirements V1.0 has been Cool! a month later… We have a change request.We found a bug We need to improve our search system, users are complaining about junk in search results. v0.1 … v0.9 v1.1 v1.2 v1.3 … v2.0 v2.0 How can we know where our system is going between versions, in terms of correctness?
  • 8. Search Quality Evaluation / Measures Evaluation measures for an information retrieval system try to formalise how well a search system satisfies its user information needs. Measures are generally split into two categories: online and offline measures. In this context we will focus on offline measures. We will talk about something that can help a search engineer during his ordinary day (i.e. in those phases previously called “internal iterations”) We will also see how the same tool can be used for a broader usage, like contributing in the continuous integration pipeline or even for delivering value to functional stakeholders. Evaluation MeasuresEvaluation Measures Online Measures Offline Measures Average Precision Mean Reciprocal Rank Recall NDCG Precision Click-through rate F-Measure Zero result rate Session abandonment rate Session success rate …. …. We are mainly focused here
  • 9. ➢ Search Quality Evaluation ✓Rated Ranking Evaluator (RRE) ‣ What is it? ‣ How does it work? ‣ Evaluation Process Input & Output ‣ Challenges ➢ Future Works ➢ Q&A Agenda
  • 10. RRE: What is it? • A set of search quality evaluation tools • A search quality evaluation framework • Multi (search) platform • Written in Java • It can be used also in non-Java projects • Licensed under Apache 2.0 • Open to contributions • Extremely dynamic! RRE: What is it? https://github.com/SeaseLtd/rated-ranking-evaluator
  • 11. RRE: At a glance 2________________________________________________________________________________________________________________________________________________________________________________________________________________ ____________ People Apache Lucene/Solr London 10________________________________________________________________________________________________________________________________________________________________________________________________________________ ____________ Modules 10________________________________________________________________________________________________________________________________________________________________________________________________________________ ____________ Modules 48950________________________________________________________________________________________________________________________________________________________________________________________________________________ ____________ Lines of Code 2________________________________________________________________________________________________________________________________________________________________________________________________________________ ____________ Months 2________________________________________________________________________________________________________________________________________________________________________________________________________________ ____________ People 10________________________________________________________________________________________________________________________________________________________________________________________________________________ ____________ Modules 10________________________________________________________________________________________________________________________________________________________________________________________________________________ ____________ Modules 67317________________________________________________________________________________________________________________________________________________________________________________________________________________ ____________ Lines of Code 5________________________________________________________________________________________________________________________________________________________________________________________________________________ ____________ Months
  • 12. RRE: Ecosystem The picture illustrates the main modules composing the RRE ecosystem. All modules with a dashed border are planned for a future release. RRE CLI has a double border because although the rre-cli module hasn’t been developed, you can run RRE from a command line using RRE Maven archetype, which is part of the current release. As you can see, the current implementation includes two target search platforms: Apache Solr and Elasticsearch. The Search Platform API module provide a search platform abstraction for plugging-in additional search systems. RRE Ecosystem CORE Plugin Plugin Reporting Plugin Search Platform API RequestHandler RRE Server RRE CLI Plugin Plugin Plugin Archetypes
  • 13. RRE: Available metrics These are the RRE built-in metrics which can be used out of the box. The most part of them are computed at query level and then aggregated at upper levels. However, compound metrics (e.g. MAP, or GMAP) are not explicitly declared or defined, because the computation doesn’t happen at query level. The result of the aggregation executed on the upper levels will automatically produce these metric. For example, the Average Precision computed for Q1, Q2, Q3, Qn becomes the Mean Average Precision at Query Group or Topic levels. Available Metrics Precision Recall Precision at 1 (P@1) Precision at 2 (P@2) Precision at 3 (P@3) Precision at 10 (P@10) Average Precision (AP) Reciprocal Rank Mean Reciprocal Rank Mean Average Precision (MAP) Normalised Discounted Cumulative Gain F-Measure Compound Metric
  • 14. RRE: Domain Model (1/2) RRE Domain Model is organized into a composite / tree-like structure where the relationships between entities are always 1 to many. The top level entity is a placeholder representing an evaluation execution. Versioned metrics are computed at query level and then reported, using an aggregation function, at upper levels. The benefit of having a composite structure is clear: we can see a metric value at different levels (e.g. a query, all queries belonging to a query group, all queries belonging to a topic or at corpus level) RRE Domain ModelEvaluation Corpus 1..* v1.0 P@10 NDCG AP F-MEASURE …. v1.1 P@10 NDCG AP F-MEASURE …. v1.2 P@10 NDCG AP F-MEASURE …. v1.n P@10 NDCG AP F-MEASURE …. Topic Query Group Query 1..* 1..* 1..* … Top level domain entity Test dataset / collection Information need Query variants Queries
  • 15. RRE: Domain Model (2/2) Although the domain model structure is able to capture complex scenarios, sometimes we want to model simpler contexts. In order to avoid verbose and redundant ratings definitions it’s possibile to omit some level. Specifically we can be in one of the following: • only queries • query groups and queries • topics, query groups and queries RRE Domain ModelEvaluation Corpus 1..* v1.0 P@10 NDCG AP F-MEASURE …. v1.1 P@10 NDCG AP F-MEASURE …. v1.2 P@10 NDCG AP F-MEASURE …. v1.n P@10 NDCG AP F-MEASURE …. Topic Query Group Query 1..* 1..* 1..* … = Required = Optional
  • 16. RRE: Evaluation process overview (1/2) Data Configuration Ratings Search Platform uses a produces Evaluation Data INPUT LAYER EVALUATION LAYER OUTPUT LAYER JSON RRE Console … used for generating
  • 17. RRE: Evaluation process overview (2/2) Runtime Container RRE Core For each ratings set For each dataset For each topic For each query group For each query Starts the search platform Stops the search platform Creates & configure the index Indexes data For each version Executes query Computes metric 2 3 4 5 6 7 8 9 12 13 1 11 outputs the evaluation data 14 uses the evaluation data 15
  • 18. RRE: Corpora An evaluation execution can involve more than one datasets targeting a given search platform. A dataset consists consists of representative domain data; although a compressed dataset can be provided, generally it has a small/medium size. Within RRE, corpus, dataset, collection are synonyms. Datasets must be located under a configurable folder. Each dataset is then referenced in one or more ratings file. Corpora
  • 19. RRE: Configuration Sets The search platform configuration evolves over time (e.g. change requests, enhancements, bugs) RRE encourages an incremental approach for managing the configuration instances. Even for internal or small iterations, each time we make a relevant change to the current configuration, it’s better to clone it and move forward with a new version. In this way we’ll end up having the historical progression of our system, and RRE will be able to make comparisons. The evaluation process allows you to define inclusion / exclusion rules (i.e. include only version 1.0 and 2.0) Configuration Sets
  • 20. RRE: Ratings Ratings files associate the RRE domain model entities with relevance judgments: a ratings file provides the association between queries and relevant documents. There must be at least one ratings file (otherwise no evaluation happens), and usually there’s a 1:1 relationship between a ratings file and a dataset. Judgments, the most important part of this file, consist of a list of all relevant documents for a query group. Each listed document has a corresponding “gain”, which is the relevance judgment we want to assign to that document. Ratings
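A sketch of what a ratings file can look like is shown below. Field names are indicative only (RRE also supports alternative judgment layouts), so treat this as an illustration rather than the exact schema:

```json
{
  "index": "catalog",
  "id_field": "id",
  "topics": [
    {
      "description": "Electric basses",
      "query_groups": [
        {
          "name": "Brand search",
          "queries": [
            { "template": "only_q.json", "placeholders": { "$query": "fender" } }
          ],
          "relevant_documents": {
            "17": { "gain": 3 },
            "42": { "gain": 2 }
          }
        }
      ]
    }
  ]
}
```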
  • 21. RRE: Evaluation Output The RRE core itself is a library, so it outputs its result as a plain Java object that must be used programmatically. However, when wrapped within a runtime container, like the Maven plugin, the evaluation object tree is marshalled in JSON format. Being interoperable, the JSON format can be consumed by other components for producing different kinds of output. An example of such usage is the RRE Apache Maven Reporting Plugin, which can • output a spreadsheet • send the evaluation data to a running RRE Server Evaluation output
  • 22. RRE: Workbook The RRE domain model (topics, groups and queries) is on the left, and each metric (in the right section) has a value for each version / entity pair. If the evaluation process includes multiple datasets, there will be a spreadsheet for each of them. This output format is useful when • you want to keep a snapshot of how the system performed at a given moment • the comparison includes a lot of versions • you want to include all available metrics Workbook
  • 23. RRE: RRE Server (1/2) The RRE console is a Spring Boot/AngularJS application which shows real-time information about evaluation results. Each time a build happens, the RRE reporting plugin sends the evaluation result to a RESTful endpoint provided by the RRE Server; the received data immediately refreshes the web dashboard. This is useful during development / tuning iterations (you don’t have to reopen the spreadsheet report again and again). RRE Server
  • 24. The evaluation data, at query / version level, collects the top n search results. In the web console, under each query, there’s a little arrow which allows you to show / hide the section containing those results. In this way you can immediately grasp the meaning of each metric and how its values change between versions. In the example above, you can immediately see why there’s a loss of precision (first metric) between v1.0 and v1.1, which got fixed in v1.2. RRE: RRE Server (2/2)
  • 25. RRE: Iterative development & tuning Dev, tune & build → check evaluation results. We are thinking about how to fill a third monitor
  • 26. RRE: Challenges “I think if we could create a simplified pass/fail report for the business team, that would be ideal. So they could understand the tradeoffs of the new search.” “Many search engines process the user query heavily before it's submitted to the search engine in whatever DSL is required, and if you don't retain some idea of the original query in the system how can you relate the test results back to user behaviour?” Do I have to write all judgments manually?? How can I use RRE if I have a custom search platform? Java is not in my stack
  • 27. ➢ Search Quality Evaluation ➢ Rated Ranking Evaluator ✓ Future Works / Idea ➢ Q&A Agenda
  • 28. Future Works: Solr Rank Eval API The RRE core can be used for implementing a RequestHandler able to expose a ranking evaluation endpoint. That would provide the same functionality introduced in Elasticsearch 6.2 [1], with some differences: • rich tree data model • metrics framework Note that in this case it doesn’t make much sense to provide comparisons between versions. As part of the same module there could be a SearchComponent for evaluating a single query interaction. [1] https://www.elastic.co/guide/en/elasticsearch/reference/6.2/search-rank-eval.html Rank Eval API /rank_eval ?q=something&evaluate=true + RRE RequestHandler + RRE SearchComponent
  • 29. Future Works: Jenkins Plugin The RRE Maven plugin already produces the evaluation data in a machine-readable format (JSON) which can be consumed by other components; the Maven RRE Report plugin and the RRE Server are just two examples of such consumers. RRE can already be executed in a Jenkins CI build cycle (using the Maven plugin). By means of a dedicated Jenkins plugin, the evaluation data could be graphically displayed in the Jenkins dashboard. It could even be used to block builds which produce bad evaluation results. Jenkins Plugin
  • 30. Future Works: Building the input The main input for RRE is the ratings file, in JSON format. Writing a comprehensive JSON file to detail the ratings sets for your search ecosystem can be expensive! Explicit feedback: 1. Explicit feedback from users’ judgements 2. An intuitive UI allows judges to run queries, see documents and rate them 3. The relevance label is explicitly assigned by domain experts. Implicit feedback: 1. Implicit feedback from users’ interactions (clicks, sales, …) 2. Log to disk / an internal Solr instance for analytics 3. Estimate the <q,d> relevance label based on Click-Through Rate or Sales Rate. [Diagram] Explicit feedback flows through a Judgement Collector UI, implicit feedback through a Users Interactions Logger; both produce a Ratings Set that RRE uses for the quality metrics.
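As a purely hypothetical illustration (this schema is not prescribed by RRE), an interactions logger could persist one record per user event, which can later be aggregated into per-&lt;q,d&gt; statistics:

```json
{
  "timestamp": "2018-10-02T10:15:30Z",
  "queryId": "q-102",
  "rawQuery": "fender jazz bass",
  "docId": "SKU-5519",
  "position": 2,
  "interactionType": "click"
}
```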
  • 31. Future Works: Learning To Rank Once the ratings have been collected, could we use them to actively improve the quality metrics?
“Learning to rank is the application of machine learning, typically supervised, semi-supervised or reinforcement learning, in the construction of ranking models for information retrieval systems.” (Wikipedia)
[Diagram] The Judgement Collector UI and the Users Interactions Logger feed the training of a Learning To Rank model.
  • 32. Future Works: Training Set Building Creating a Learning To Rank training set from the collected interactions is not going to be trivial: it normally requires ad hoc data manipulation depending on the use case…
… but some steps could be automated and made available as a generic, configurable approach (a sketch of two of them follows below):
▪ Null feature sanitisation ▪ Query Id calculation ▪ Query-document feature generation ▪ Single/multi-valued categorical feature encoding
Configuration 1. Ad hoc category, artificial values or keep NaN → depends on the training library in use
2. Optional query-level features to be hashed as the query id
3. Intersect related query- and document-level categorical features to generate ordinal query-document features
4. Label encoding? One-hot encoding? Binary encoding? [1] Beware the dummy variable trap. [1] https://www.datacamp.com/community/tutorials/categorical-data
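A minimal sketch of two of the steps above, assuming nothing about RRE itself: a query id obtained by hashing the (optional) query-level features, and a one-hot encoding that drops one column to avoid the dummy variable trap.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;
import java.util.List;

public class TrainingSetBuildingSketch {

    // Step 2: hash the query-level features into a stable query id
    static String queryId(List<String> queryLevelFeatures) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(String.join("|", queryLevelFeatures)
                                         .getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b & 0xff));
        return hex.toString();
    }

    // Step 4: one-hot encode a single-valued categorical feature;
    // dropping one column avoids the dummy variable trap
    static int[] oneHot(String value, List<String> vocabulary) {
        int[] encoded = new int[vocabulary.size() - 1];
        int index = vocabulary.indexOf(value);
        if (index > 0) encoded[index - 1] = 1; // index 0 is the dropped reference category
        return encoded;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(queryId(List.of("fender", "basses")));
        System.out.println(Arrays.toString(oneHot("red", List.of("black", "red", "sunburst")))); // [1, 0]
    }
}
```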
  • 33. Future Works: Training Set Building What about the relevance label for each training vector?
Can we estimate it from the collected interactions? (A minimal sketch follows below.)
▪ Interaction type counts
▪ Click-Through Rate / Sales-Through Rate calculation
▪ Relevance label normalisation
Configuration 1. Impressions? Clicks? Bookmarks? Add-to-carts? Sales?
2. Define the objective: Clicks/Impressions? Sales/Impressions?
3. Relevance label range: 0…4
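A minimal sketch of how such a label could be derived once the objective is Clicks/Impressions; the bucket thresholds below are arbitrary assumptions and would need tuning on real data, not an RRE feature.

```java
public class RelevanceLabelSketch {

    // Estimates a graded relevance label (0..4) for a <query, document> pair
    // from implicit feedback, using CTR (clicks / impressions) as the objective.
    static int relevanceLabel(long clicks, long impressions) {
        if (impressions == 0) return 0;
        double ctr = (double) clicks / impressions;
        if (ctr >= 0.40) return 4;
        if (ctr >= 0.20) return 3;
        if (ctr >= 0.10) return 2;
        if (ctr >= 0.05) return 1;
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(relevanceLabel(30, 100)); // 3
        System.out.println(relevanceLabel(2, 100));  // 0
    }
}
```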

  • 34. Future Works: Learning To Rank Solr Configs Can the generation of the features.json configuration be automated? features.json is the configuration file required by the Solr Learning To Rank extension: it describes how the features used at training time by the model can be extracted at query time, so it is coupled with both the training-set features and the query-time features. Learning To Rank - Solr features.json [Diagram] Users Interactions Logger → Training Set Builder (+ configuration) → features.json
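For reference, a small features.json in the format used by the Solr LTR plugin; the concrete features are just examples, and the goal of the proposed automation would be to generate entries like these from the training-set configuration.

```json
[
  {
    "name": "originalScore",
    "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
    "params": {}
  },
  {
    "name": "titleMatch",
    "class": "org.apache.solr.ltr.feature.SolrFeature",
    "params": { "q": "{!field f=title}${user_query}" }
  },
  {
    "name": "price",
    "class": "org.apache.solr.ltr.feature.FieldValueFeature",
    "params": { "field": "price" }
  }
]
```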
  • 35. ➢ Search Quality Evaluation ➢ Rated Ranking Evaluator ➢ Future Works ✓ Q&A
  • 36. Search Quality Evaluation Tools and techniques Alessandro Benedetti - Software Engineer Andrea Gazzarini - Software Engineer 2nd October 2018 Thank you!