Made to Measure: Ranking Evaluation using Elasticsearch


Elastic{Meetup} #41, Zurich, April 9, 2019
presented by Daniel Schneiter


  1. Made to Measure: Ranking Evaluation using Elasticsearch
     Daniel Schneiter, Elastic{Meetup} #41, Zürich, April 9, 2019
     Original author: Christoph Büscher
  2. "If you cannot measure it, you cannot improve it!"
     AlmostAnActualQuoteTM by Lord Kelvin
     https://commons.wikimedia.org/wiki/File:Portrait_of_William_Thomson,_Baron_Kelvin.jpg
  3. How good is your search?
     Image by Kecko, https://www.flickr.com/photos/kecko/18146364972 (CC BY 2.0)
  4. Image by Muff Wiggler, https://www.flickr.com/photos/muffwiggler/5605240619 (CC BY 2.0)
  5. Ranking Evaluation
     A repeatable way to quickly measure the quality of search results
     over a wide range of user needs
  6. REPEATABILITY
     • Automate: don't make people look at screens
     • No gut-feeling / "management-driven" ad-hoc search ranking
  7. SPEED
     • Fast iterations instead of long waits (e.g. in A/B testing)
  8. MEASURE QUALITY
     • Numeric output
     • Support for different metrics
     • Define "quality" in your domain
  9. USER NEEDS
     • Optimize across a wider range of use cases (aka "information needs")
     • Think about what the majority of your users want
     • Collect data to discover what is important for your use case
  10. Prerequisites for Ranking Evaluation
     1. Define a set of typical information needs.
     2. For each search case, rate your documents for those information needs
        (either binary relevant/non-relevant or on some graded scale).
     3. If full labelling is not feasible, choose a small subset instead
        (often the case because the document set is too large).
     4. Choose a metric to calculate. Some good metrics are already defined in
        Information Retrieval research: Precision@K, (N)DCG, ERR, Reciprocal Rank, etc.
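     As a rough illustration of steps 1 and 2, the judgments can live in a very simple
     structure mapping each information need to graded document ratings. This is only a
     sketch: the JFK document IDs and grades reappear on slide 15, while the second query's
     IDs and grades are made-up placeholders.

     # Hypothetical sketch: graded judgments (0 = irrelevant ... 3 = highly relevant)
     # for two information needs; the second entry uses placeholder IDs and grades.
     judgments = {
         "JFK_query": {
             "query": "JFK",
             "ratings": {"3054546": 3, "5119376": 1},
         },
         "hotel_amsterdam": {
             "query": "hotel amsterdam",
             "ratings": {"7712839": 2, "4418204": 0},
         },
     }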
  11. Search Evaluation Continuum
     [Diagram: evaluation approaches on a continuum (some sort of unit test, QA assisted
     by scripts, user studies, A/B testing, Ranking Evaluation), plotted against speed
     (slow to fast), preparation time (little to lots), and how much they need people
     looking at screens.]
  12. Where Ranking Evaluation can help
     • Development: guiding design decisions; enabling quick iteration
     • Production: monitoring changes; spotting degradations
     • Communication tool: helps define "search quality" more clearly; forces stakeholders
       to "get real" about their expectations
  13. Elasticsearch ‘rank_eval’ API
  14. Ranking Evaluation API

     GET /my_index/_rank_eval
     {
       "metric": {
         "mean_reciprocal_rank": { [...] }
       },
       "templates": [{ [...] }],
       "requests": [{
         "template_id": "my_query_template",
         "ratings": [...],
         "params": {
           "query_string": "hotel amsterdam",
           "field": "text"
         }
         [...]
       }]
     }

     • Introduced in 6.2 (still an experimental API)
     • Joint work between Christoph Büscher (@dalatangi) and Isabel Drost-Fromm (@MaineC)
     • Inputs:
       • a set of search requests ("information needs")
       • document ratings for each request
       • a metric definition; currently available: Precision@K,
         Discounted Cumulative Gain / (N)DCG, Expected Reciprocal Rank / ERR, MRR, …

  15. Ranking Evaluation API Details

     metric:
     "metric": {
       "precision": {
         "relevant_rating_threshold": "2",
         "k": 5
       }
     }

     requests:
     "requests": [{
       "id": "JFK_query",
       "request": { "query": { […] } },
       "ratings": […]
     }, … other use cases …]

     ratings:
     "ratings": [
       { "_id": "3054546", "rating": 3 },
       { "_id": "5119376", "rating": 1 },
       […]
     ]
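     Putting the metric, requests, and ratings sections together, the call can be made with
     any HTTP client. Below is a minimal sketch in Python using the third-party requests
     library, assuming a local node and the index name, field, document IDs, and ratings
     shown on the slides; the match query is my own simplification, not the talk's template.

     import json
     import requests

     # Minimal sketch of a complete _rank_eval call (assumptions: local node,
     # index "my_index" with a "text" field; IDs and ratings as on the slide above).
     body = {
         "metric": {
             "precision": {"relevant_rating_threshold": 2, "k": 5}
         },
         "requests": [
             {
                 "id": "JFK_query",
                 "request": {"query": {"match": {"text": "JFK"}}},
                 "ratings": [
                     {"_index": "my_index", "_id": "3054546", "rating": 3},
                     {"_index": "my_index", "_id": "5119376", "rating": 1},
                 ],
             }
         ],
     }

     resp = requests.get(
         "http://localhost:9200/my_index/_rank_eval",
         headers={"Content-Type": "application/json"},
         data=json.dumps(body),
     )
     print(resp.json())  # response layout is shown on the next slide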
  16. _rank_eval response

     {
       "rank_eval": {
         "metric_score": 0.431,              ← overall score
         "details": {
           "my_query_id1": {                 ← details per query
             "metric_score": 0.6,
             "unrated_docs": [               ← maybe rate those?
               { "_index": "idx", "_id": "1960795" }, [...]
             ],
             "hits": [...],
             "metric_details": {             ← details about the metric
               "precision": {
                 "relevant_docs_retrieved": 6,
                 "docs_retrieved": 10
               }
             }
           },
           "my_query_id2": { [...] }
         }
       }
     }
  17. How to get document ratings?
     1. Define a set of typical information needs of your users
        (e.g. analyze logs, ask product management / customers, etc.).
     2. For each case, get a small set of candidate documents
        (e.g. by a very broad query).
     3. Rate those documents with respect to the underlying information need.
        This can initially be done by you or other stakeholders;
        later maybe outsource it, e.g. via Mechanical Turk.
     4. Iterate!
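     One way to keep this loop going is to pull the unrated_docs out of a _rank_eval
     response (layout as on slide 16) and hand them to a rater. A rough sketch; the CSV
     layout and function name are my own choices, not part of the API.

     import csv

     # Sketch: collect documents that were retrieved but not yet judged
     # ("unrated_docs" per query in a _rank_eval response, see slide 16)
     # and write them to a CSV for manual rating in the next iteration.
     def export_unrated_docs(response, out_path):
         with open(out_path, "w", newline="") as f:
             writer = csv.writer(f)
             writer.writerow(["query_id", "_index", "_id", "rating (fill in)"])
             for query_id, detail in response["rank_eval"]["details"].items():
                 for doc in detail.get("unrated_docs", []):
                     writer.writerow([query_id, doc["_index"], doc["_id"], ""])

     # e.g. export_unrated_docs(resp.json(), "to_rate.csv") with resp from the earlier sketch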
  18. Metrics currently available
     • Precision At K (binary ratings): set-based metric; ratio of relevant docs in the top K results
     • Reciprocal Rank, RR (binary ratings): positional metric; inverse of the rank of the first relevant document
     • Discounted Cumulative Gain, DCG (graded ratings): takes order into account; highly relevant docs score more if they appear earlier in the result list
     • Expected Reciprocal Rank, ERR (graded ratings): motivated by the "cascade model" of search; models dependency of results with respect to their predecessors
  19. Precision At K
     • In short: "How many good results appear in the first K results?"
       (e.g. the first few pages in the UI)
     • Supports only boolean relevance judgements
     • PROS: easy to understand & communicate
     • CONS: least stable across different user needs, e.g. the total number of relevant
       documents for a query influences precision at k

     $\mathrm{prec@}k = \frac{|\{\text{relevant docs in top } k\}|}{|\{\text{all results in top } k\}|}$
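     For intuition, the metric is easy to compute by hand from boolean judgments in ranking
     order. The sketch below is my own illustration of the formula, not the API's code.

     # Sketch: precision@k over boolean relevance judgments in ranking order,
     # e.g. relevant = [True, False, True, True, False].
     def precision_at_k(relevant, k):
         top_k = relevant[:k]
         return sum(top_k) / len(top_k) if top_k else 0.0

     print(precision_at_k([True, False, True, True, False], k=5))  # 3/5 = 0.6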
  20. Reciprocal Rank
     • Supports only boolean relevance judgements
     • PROS: easy to understand & communicate
     • CONS: limited to cases where the amount of good results doesn't matter
     • If averaged over a sample of queries Q, often called MRR (mean reciprocal rank)

     $RR = \frac{1}{\text{rank of the first relevant document}}$
     $MRR = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$
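     A corresponding sketch (again my own illustration): reciprocal rank for a single result
     list, and MRR averaged over several lists.

     # Sketch: reciprocal rank of the first relevant hit, and MRR over queries.
     def reciprocal_rank(relevant):
         for position, is_relevant in enumerate(relevant, start=1):
             if is_relevant:
                 return 1.0 / position
         return 0.0

     def mean_reciprocal_rank(judgment_lists):
         return sum(reciprocal_rank(r) for r in judgment_lists) / len(judgment_lists)

     print(mean_reciprocal_rank([[False, True, False], [True, False]]))  # (1/2 + 1) / 2 = 0.75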
  21. Discounted Cumulative Gain (DCG)
     • Predecessor: Cumulative Gain (CG), which sums relevance judgements over the top k results:
       $CG = \sum_{i=1}^{k} rel_i$
     • DCG takes position into account, dividing by $\log_2$ at each position:
       $DCG = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$
     • NDCG (Normalized DCG) divides by the "ideal" DCG for a query (IDCG):
       $NDCG = \frac{DCG}{IDCG}$
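     A small sketch of the linear-gain form shown above (other common definitions use a
     2^rel - 1 gain instead of rel); graded judgments are given in ranking order.

     import math

     # Sketch: DCG over graded judgments in ranking order, and NDCG normalized
     # by the DCG of the ideal (sorted) ordering.
     def dcg(relevances, k):
         return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances[:k], start=1))

     def ndcg(relevances, k):
         ideal = dcg(sorted(relevances, reverse=True), k)
         return dcg(relevances, k) / ideal if ideal > 0 else 0.0

     print(ndcg([3, 1, 0, 2], k=4))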
  22. Expected Reciprocal Rank (ERR)
     • Cascade-based metric
     • Supports graded relevance judgements
     • The model assumes the user goes through the result list in order
       and is satisfied with the first relevant document
     • $R_i$: probability that the user stops at position i
     • ERR is high when relevant documents appear early

     $ERR = \sum_{r=1}^{k} \frac{1}{r} \left( \prod_{i=1}^{r-1} (1 - R_i) \right) R_r$
     $R_i = \frac{2^{rel_i} - 1}{2^{rel_{\max}}}$
     ($rel_i$: relevance at position i, $rel_{\max}$: maximal relevance grade)
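     A direct transcription of the two formulas above into a sketch; max_grade plays the
     role of rel_max, and the example grades are made up.

     # Sketch: Expected Reciprocal Rank; R_i = (2^rel_i - 1) / 2^rel_max is the
     # probability that the user stops at position i.
     def err(relevances, k, max_grade=3):
         score, prob_not_stopped = 0.0, 1.0
         for r, rel in enumerate(relevances[:k], start=1):
             stop_prob = (2 ** rel - 1) / (2 ** max_grade)
             score += prob_not_stopped * stop_prob / r
             prob_not_stopped *= 1 - stop_prob
         return score

     print(err([3, 1, 0, 2], k=4, max_grade=3))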
  23. DEMO TIME
  24. Demo project and data
     • The demo uses approx. 1800 documents from the English Wikipedia
     • Wikipedia's Discovery department collects and publishes relevance judgements
       with their Discernatron project
     • Bulk data and all query examples are available at
       https://github.com/cbuescher/rankEvalDemo
  25. 25. !25 Q&A
  26. Some questions I have for you…
     • How do you measure search relevance currently?
     • Did you find anything useful about the ranking evaluation approach?
     • Feedback about the usability of the API is welcome
       (ping me on GitHub or our Discuss forum: @cbuescher)
  27. Further reading
     • Manning, Raghavan & Schütze: Introduction to Information Retrieval. Cambridge University Press, 2008.
     • Chapelle, O., Metzler, D., Zhang, Y., & Grinspan, P. (2009). Expected Reciprocal Rank for Graded Relevance. Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09), 621.
     • Blog: https://www.elastic.co/blog/made-to-measure-how-to-use-the-ranking-evaluation-api-in-elasticsearch
     • Docs: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-rank-eval.html
     • Discuss: https://discuss.elastic.co/c/elasticsearch (cbuescher)
     • GitHub: :Search/Ranking label (cbuescher)