Made to Measure: Ranking Evaluation using Elasticsearch
1. Daniel Schneiter
Elastic{Meetup} #41, Zürich, April 9, 2019
Original author: Christoph Büscher
Made to Measure: Ranking Evaluation using Elasticsearch
2.
If you cannot measure it, you cannot improve it!
Almost An Actual Quote™ by Lord Kelvin
https://commons.wikimedia.org/wiki/File:Portrait_of_William_Thomson,_Baron_Kelvin.jpg
8.
QUALITY MEASURE
• numeric output
• support for different metrics
• define "quality" in your domain
9.
USER NEEDS
• optimize across a wider range of use cases (aka "information needs")
• think about what the majority of your users want
• collect data to discover what is important for your use case
10.
Prerequisites for Ranking Evaluation
1. Define a set of typical information needs
2. For each search case, rate your documents for those information needs
(either binary relevant/non-relevant or on some graded scale)
3. If full labelling is not feasible, choose a small subset instead
(often the case because the document set is too large)
4. Choose a metric to calculate.
Some good metrics already defined in Information Retrieval research:
• Precision@K, (N)DCG, ERR, Reciprocal Rank etc…
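As a concrete illustration of these prerequisites, a minimal call to Elasticsearch's Rank Evaluation API could look like the sketch below. The index name, query, document IDs and ratings are made up for illustration; the endpoint and overall request structure follow the _rank_eval documentation.

```python
import json
import requests

# Hypothetical example: one "information need" with a few rated documents.
# Index name, query, document IDs and ratings are illustrative only.
rank_eval_request = {
    "requests": [
        {
            "id": "wind_power_query",                       # one information need
            "request": {"query": {"match": {"text": "wind power"}}},
            "ratings": [                                    # binary or graded judgements
                {"_index": "wiki", "_id": "doc1", "rating": 3},
                {"_index": "wiki", "_id": "doc2", "rating": 0},
                {"_index": "wiki", "_id": "doc3", "rating": 1},
            ],
        }
    ],
    "metric": {"precision": {"k": 10, "relevant_rating_threshold": 1}},
}

response = requests.post(
    "http://localhost:9200/wiki/_rank_eval",
    headers={"Content-Type": "application/json"},
    data=json.dumps(rank_eval_request),
)
print(response.json())   # overall score plus a per-query breakdown
```

The response contains one overall score for the chosen metric and details per query, which is what the later slides build on.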
12.
Where Ranking Evaluation can help
Development Tool
• guiding design decisions
• enabling quick iteration
Production
• monitor changes
• spot degradations
Communication
• helps define "search quality" more clearly
• forces stakeholders to "get real" about their expectations
17.
How to get document ratings?
1. Define a set of typical information needs of users
(e.g. analyze logs, ask product management / customer etc…)
2. For each case, get a small set of candidate documents
(e.g. by a very broad query; see the sketch after this list)
3. Rate those documents with respect to the underlying information need
• can initially be done by you or other stakeholders;
later maybe outsourced, e.g. via Mechanical Turk
4. Iterate!
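Step 2 above (collecting a small candidate set with a very broad query) could be sketched roughly as follows; the index name, field and query text are assumptions for illustration, not taken from the demo project.

```python
import json
import requests

# Hypothetical sketch: fetch top candidates for one information need with a broad match
# query, then print them so a human rater can assign relevance grades later.
query = {
    "size": 20,                                   # small candidate set to rate by hand
    "query": {"match": {"text": "wind power"}},   # intentionally broad query
    "_source": ["title"],
}

resp = requests.post(
    "http://localhost:9200/wiki/_search",
    headers={"Content-Type": "application/json"},
    data=json.dumps(query),
).json()

for hit in resp["hits"]["hits"]:
    # Rate each candidate afterwards, e.g. 0 = irrelevant ... 3 = highly relevant
    print(hit["_id"], hit["_source"].get("title"))
```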
18.
Metrics currently available
Metric | Description | Ratings
Precision At K | Set-based metric; ratio of relevant docs in the top K results | binary
Reciprocal Rank (RR) | Positional metric; inverse of the rank of the first relevant document | binary
Discounted Cumulative Gain (DCG) | takes order into account; highly relevant docs score more if they appear earlier in the result list | graded
Expected Reciprocal Rank (ERR) | motivated by the "cascade model" of search; models dependency of results with respect to their predecessors | graded
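For orientation, these metrics map to the "metric" section of a _rank_eval request roughly as sketched below. The metric names follow the Rank Evaluation API documentation; the parameter values are only examples, and exact options may differ between Elasticsearch versions.

```python
# Rough mapping of the metrics above to the "metric" section of a _rank_eval request.
precision_at_k = {"precision": {"k": 10, "relevant_rating_threshold": 1}}
reciprocal_rank = {"mean_reciprocal_rank": {"k": 10, "relevant_rating_threshold": 1}}
dcg = {"dcg": {"k": 10, "normalize": True}}                    # normalize=True gives NDCG
err = {"expected_reciprocal_rank": {"maximum_relevance": 3, "k": 10}}
```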
19.
Precision At K
• In short: “How many good results appear in the first K results”
(e.g. first few pages in UI)
• supports only boolean relevance judgements
• PROS: easy to understand & communicate
• CONS: least stable across different user needs, e.g. the total number of
relevant documents for a query influences precision@k
\text{prec@}k = \frac{\#\{\text{relevant docs in top }k\}}{\#\{\text{all results at }k\}}
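A minimal, self-contained way to compute this outside Elasticsearch, assuming a list of binary relevance labels in result order (my own sketch, not the API's implementation):

```python
def precision_at_k(relevance, k):
    """Precision@k for binary relevance labels (1 = relevant, 0 = not relevant),
    given in the order the documents were returned."""
    top_k = relevance[:k]
    if not top_k:
        return 0.0
    return sum(top_k) / len(top_k)   # divide by the number of results actually in the top k

# Example: 3 of the first 5 results are relevant -> 0.6
print(precision_at_k([1, 0, 1, 1, 0, 1], k=5))
```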
20.
Reciprocal Rank
• supports only boolean relevance judgements
• PROS: easy to understand & communicate
• CONS: limited to cases where the number of good results doesn't matter
• If averaged over a sample of queries Q, it is often called MRR
(mean reciprocal rank):
\text{RR} = \frac{1}{\text{position of first relevant document}}

\text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}
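A small sketch of RR and MRR over binary labels per query (my own illustration, assuming one list of labels per query in result order):

```python
def reciprocal_rank(relevance):
    """1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for position, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a sample of queries Q."""
    return sum(reciprocal_rank(rels) for rels in queries) / len(queries)

# Example: first relevant hit at rank 2 and rank 1 -> MRR = (0.5 + 1.0) / 2 = 0.75
print(mean_reciprocal_rank([[0, 1, 0], [1, 0, 0]]))
```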
21.
Discounted Cumulative Gain (DCG)
• Predecessor: Cumulative Gain (CG)
• sums relevance judgement over top k results
\text{CG} = \sum_{i=1}^{k} rel_i

\text{DCG} = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}

• DCG takes position into account
• divides by log2 at each position
• NDCG (Normalized DCG)
• divides by the "ideal" DCG for a query (IDCG):
\text{NDCG} = \frac{\text{DCG}}{\text{IDCG}}
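A compact sketch of DCG and NDCG for graded labels in result order (for simplicity, the ideal DCG here is computed from the labels of the returned results only, not from all rated documents):

```python
import math

def dcg(relevance, k):
    """Discounted Cumulative Gain over the top k graded relevance labels."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:k], start=1))

def ndcg(relevance, k):
    """Normalized DCG: divide by the DCG of the ideally ordered result list."""
    ideal = dcg(sorted(relevance, reverse=True), k)
    return dcg(relevance, k) / ideal if ideal > 0 else 0.0

# Example with graded judgements (0-3)
print(dcg([3, 2, 3, 0, 1], k=5), ndcg([3, 2, 3, 0, 1], k=5))
```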
22.
Expected Reciprocal Rank (ERR)
• cascade based metric
• supports graded relevance judgements
• model assumes user goes through
result list in order and is satisfied with
the first relevant document
• R_i is the probability that the user stops at position i
• ERR is high when relevant documents appear early
\text{ERR} = \sum_{r=1}^{k} \frac{1}{r} \left( \prod_{i=1}^{r-1} (1 - R_i) \right) R_r

R_i = \frac{2^{rel_i} - 1}{2^{rel_{max}}}

rel_i: relevance at position i
rel_{max}: maximal relevance grade
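A direct translation of this formula into a small sketch (my own illustration of the cascade model, not the API's implementation):

```python
def err(relevance, rel_max, k=None):
    """Expected Reciprocal Rank for graded relevance labels (cascade model)."""
    k = k if k is not None else len(relevance)
    prob_not_stopped_yet = 1.0          # probability the user reaches the current position
    score = 0.0
    for r, rel in enumerate(relevance[:k], start=1):
        stop_prob = (2 ** rel - 1) / (2 ** rel_max)   # R_i from the formula above
        score += prob_not_stopped_yet * stop_prob / r
        prob_not_stopped_yet *= (1 - stop_prob)
    return score

# Example with grades 0-3 and rel_max = 3
print(err([3, 2, 0, 1], rel_max=3))
```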
24.
Demo project and Data
• Demo uses approx. 1,800 documents from the English Wikipedia
• Wikipedia's Discovery department collects and publishes relevance judgements with their Discernatron project
• Bulk data and all query examples available at
https://github.com/cbuescher/rankEvalDemo
26.
Some questions I have for you…
• How do you measure search relevance currently?
• Did you find anything useful about the ranking evaluation approach?
• Feedback about usability of the API
(ping me on GitHub or our Discuss forum: @cbuescher)
27.
Further reading
• Manning, Raghavan & Schütze: Introduction to Information
Retrieval, Cambridge University Press. 2008.
• Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected reciprocal rank for graded relevance. Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), 621.
• Blog: https://www.elastic.co/blog/made-to-measure-how-to-use-the-ranking-evaluation-api-in-elasticsearch
• Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html
• Discuss: https://discuss.elastic.co/c/elasticsearch (cbuescher)
• GitHub: the :Search/Ranking label (cbuescher)