Modern search engines have to keep up with the enormous growth in the number of documents and of queries submitted by users. One of the problems to deal with is finding the k most relevant documents for a given query. This operation has to be fast, which is possible only by using specialised techniques.
Block-Max WAND is one of the best-known algorithms for solving this problem without any degradation in ranking effectiveness.
After a brief introduction, in this talk I’m going to show a strategy introduced in “Faster BlockMax WAND with Variable-sized Blocks” (SIGIR 2017) that, applied to BlockMaxWand data, made it possible to speed up the algorithm's execution by almost 2x.
Then I will present another optimisation of the BlockMaxWand algorithm (“Faster BlockMax WAND with Longer Skipping”, ECIR 2019) that reduces the execution time of short queries.
Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Skipping
1. London Information Retrieval Meetup
19 Feb 2019
Elia Porciani, Software Engineer
2. Introduction
● Top-k retrieval and inverted index
● Introduction to early termination techniques
● Block max wand
Faster BlockMax WAND with Variable-sized Blocks
A Mallia, G Ottaviano, E Porciani, N Tonellotto, R Venturini
SIGIR, 2017
Faster BlockMax WAND with Longer Skipping
A Mallia, E Porciani
ECIR, 2019
5. Inverted index compression
We compressed posting lists with
partitioned Elias-Fano.
Giuseppe Ottaviano and Rossano Venturini. Partitioned Elias-Fano indexes. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14.
Inverted list: 1 2 5 7 12 13 14 20
D-gaps:        1 1 3 2 5 1 1 6
Only a few bits are needed to store each item of an inverted list.
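The d-gap transformation shown on this slide can be sketched in a few lines of Python. This only illustrates why the gaps need fewer bits than the raw doc-ids; it is not an implementation of partitioned Elias-Fano, and the function names `to_dgaps` and `from_dgaps` are made up for this example.

```python
def to_dgaps(doc_ids):
    """Turn a strictly increasing list of doc-ids into d-gaps."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)  # distance from the previous doc-id
        previous = doc_id
    return gaps


def from_dgaps(gaps):
    """Rebuild the original doc-ids by taking a running sum of the gaps."""
    doc_ids = []
    current = 0
    for gap in gaps:
        current += gap
        doc_ids.append(current)
    return doc_ids


inverted_list = [1, 2, 5, 7, 12, 13, 14, 20]  # the list from the slide
dgaps = to_dgaps(inverted_list)

# Bits needed for the largest value when every item uses the same fixed width.
bits_raw = max(inverted_list).bit_length()
bits_gap = max(dgaps).bit_length()
print(dgaps, bits_raw, bits_gap)
```

On the slide's list the largest doc-id needs more bits than the largest gap, which is the effect that gap-based and Elias-Fano style encodings exploit.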
7. Tf-Idf
In detail, we use Okapi BM25.
Term frequency: $tf_{ij} = \frac{|n_{ij}|}{|d_j|}$
Inverse document frequency: $idf_i = \frac{|D|}{|\{d : i \in d\}|}$
Posting list of a term:
Doc-id:      1 2 5 7 12 13 14 20
Frequencies: 3 2 1 8 2 4 6 2
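As a worked example of the two ratios above, here is a minimal sketch over a tiny made-up collection. It follows the formulas exactly as they appear on the slide (plain ratios); many tf-idf formulations apply a logarithm to the idf ratio, and the talk itself uses Okapi BM25, which this sketch does not implement. The collection `docs` and all function names are hypothetical.

```python
def term_frequency(occurrences, doc_length):
    """tf_ij: occurrences of term i in document j over the document length."""
    return occurrences / doc_length


def inverse_document_frequency(num_docs, docs_with_term):
    """idf_i: collection size over the number of documents containing term i."""
    return num_docs / docs_with_term


def tf_idf(term, doc_id, docs):
    """Product of the two ratios; 0.0 if the term occurs in no document."""
    docs_with_term = sum(1 for words in docs.values() if term in words)
    if docs_with_term == 0:
        return 0.0
    words = docs[doc_id]
    tf = term_frequency(words.count(term), len(words))
    idf = inverse_document_frequency(len(docs), docs_with_term)
    return tf * idf


# Hypothetical toy collection: doc-id -> list of terms.
docs = {
    1: ["fast", "top", "k", "retrieval"],
    2: ["fast", "fast", "index"],
    3: ["block", "max", "index"],
}

print(tf_idf("fast", 2, docs))  # tf = 2/3, idf = 3/2
print(tf_idf("fast", 1, docs))  # tf = 1/4, idf = 3/2
```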
8. Inverted list iterator operations
next(): find the next document id
nextGEQ(k): find the next document id in the list with id >= k
score(): compute the score of the current document id, considering the associated frequency
The score() function involves floating-point computations.
Iterating over an inverted index is expensive because it is compressed.
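The three operations can be sketched as a small Python class. This is an illustration over uncompressed arrays, so it hides the decompression cost mentioned above; the class name, the `docid()` helper, the `END` sentinel and the placeholder scoring (the raw frequency as a float) are assumptions made for this example, not the API of the actual implementation.

```python
import bisect


class PostingListIterator:
    """Cursor over one posting list stored as two parallel sorted arrays."""

    END = float("inf")  # sentinel doc-id returned once the list is exhausted

    def __init__(self, doc_ids, freqs):
        self.doc_ids = doc_ids
        self.freqs = freqs
        self.pos = 0

    def docid(self):
        """Current doc-id, or END when the cursor is past the last posting."""
        if self.pos < len(self.doc_ids):
            return self.doc_ids[self.pos]
        return self.END

    def next(self):
        """Advance to the next posting and return its doc-id."""
        if self.pos < len(self.doc_ids):
            self.pos += 1
        return self.docid()

    def next_geq(self, k):
        """Advance to the first posting with doc-id >= k (never moves backwards)."""
        self.pos = bisect.bisect_left(self.doc_ids, k, self.pos)
        return self.docid()

    def score(self):
        """Placeholder score: the frequency as a float. Valid only before END."""
        return float(self.freqs[self.pos])


# Doc-ids and frequencies from the Tf-Idf slide.
it = PostingListIterator([1, 2, 5, 7, 12, 13, 14, 20], [3, 2, 1, 8, 2, 4, 6, 2])
print(it.docid(), it.next(), it.next_geq(6), it.score())
```

A binary search stands in here for the skipping that a compressed index would support through its own skip structures.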
10. Early termination techniques
● It is not necessary to compute the score function on all the postings.
● Max score
● Wand
● BlockMaxWand
These algorithms compute the exact top-k documents.
11. Wand
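[Figure: WAND pivot selection. Doc-id axis with the posting lists of terms T1 to T5; threshold θ = 15.8. List max scores: Ms = 5.4, 5.0, 4.2, 4.3, 2.3. Running sums of max scores: 5.4, 14.6, 9.6, 16.9. The pivot list is marked.]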
Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. Efficient query evaluation using a two-level retrieval process. CIKM ’03.
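Pivot selection in WAND can be sketched as follows: sort the lists by their current doc-id, accumulate their max scores, and stop at the first list where the running sum exceeds the threshold. The max scores and the threshold are the ones in the figure; the current doc-ids are invented for this example (the slide text does not give them), as are the function name and the tuple layout. Using a strict comparison against the threshold is also an assumption.

```python
def select_pivot(cursors, threshold):
    """cursors: list of (term, current_docid, max_score), one per query term.

    Returns (term, docid) of the pivot list, or None if even the sum of all
    max scores cannot exceed the threshold.
    """
    ordered = sorted(cursors, key=lambda cursor: cursor[1])  # by current doc-id
    running_sum = 0.0
    for term, docid, max_score in ordered:
        running_sum += max_score
        if running_sum > threshold:
            return term, docid
    return None


# Max scores from the figure; the doc-ids are hypothetical.
cursors = [
    ("T1", 3, 5.4),
    ("T2", 9, 5.0),
    ("T3", 7, 4.2),
    ("T4", 20, 4.3),
    ("T5", 12, 2.3),
]

print(select_pivot(cursors, 15.8))
```

With these doc-ids the running sums are 5.4, 9.6, 14.6 and 16.9, so the fourth list in doc-id order becomes the pivot and no document before its current doc-id can enter the top-k.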
12. Block-Max-Wand
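[Figure: Block-Max WAND. Doc-id axis with the posting lists of terms T1 to T5, each split into blocks annotated with its block max score; threshold θ = 15.8. Block max scores shown: 3.2, 4.2; 4.5, 5.4, 2.1; 2.1, 3.2; 2.1, 2.3, 0.8; 3.5, 1.4; 4.1; 2.7. List max scores: Ms = 5.4, 5.0, 4.2, 4.3, 2.3. Running sums of max scores: 5.4, 14.6, 9.6, 16.9. The pivot list is marked. Block max upper estimation = 10.2.]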
The lower the average approximation error, the better the performance.
Shuai Ding and Torsten Suel. Faster top-k document retrieval using block-max indexes. SIGIR ’11
13. Block Max Wand
1. Select the pivot as in WAND.
2. Compute the sum of the block max contributions (block max sum) for the pivot doc-id.
3. If the block max sum exceeds the threshold:
   1. Fully evaluate the pivot document.
   2. Move the iterator to pivot.docid + 1.
4. Otherwise, move the iterator to the leftmost boundary of the blocks evaluated.
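Steps 2 to 4 can be sketched as a single decision function. Each list is represented as a sequence of blocks, each given by its last doc-id and its block max score; this layout, the function names and the sample data are assumptions for illustration. In the "skip" case the sketch returns the doc-id just after the leftmost block boundary, which is one reading of step 4 rather than a statement about the original code.

```python
import bisect


def block_for(blocks, docid):
    """blocks: list of (last_docid, block_max) sorted by last_docid.

    Returns the block whose doc-id range covers docid, or None past the end.
    """
    last_docids = [last for last, _ in blocks]
    index = bisect.bisect_left(last_docids, docid)
    return blocks[index] if index < len(blocks) else None


def block_max_step(lists, pivot_docid, threshold):
    """Decide whether the pivot document must be fully evaluated."""
    block_max_sum = 0.0
    leftmost_boundary = None
    for blocks in lists:
        block = block_for(blocks, pivot_docid)
        if block is None:
            continue  # this list is exhausted and contributes nothing
        last_docid, block_max = block
        block_max_sum += block_max
        if leftmost_boundary is None or last_docid < leftmost_boundary:
            leftmost_boundary = last_docid
    if block_max_sum > threshold:
        return ("evaluate", pivot_docid)
    if leftmost_boundary is None:
        return ("done", None)
    return ("skip", leftmost_boundary + 1)


# Hypothetical blocks for the three lists up to and including the pivot list.
lists = [
    [(10, 3.2), (30, 4.2)],
    [(8, 4.5), (15, 5.4), (40, 2.1)],
    [(20, 2.1), (50, 3.2)],
]

print(block_max_step(lists, 12, 15.8))  # block max sum 4.2 + 5.4 + 2.1
print(block_max_step(lists, 12, 10.0))
```

The block max sum is a tighter upper bound than the sum of the list max scores, so a pivot that passes the WAND test can still be discarded here without computing any score.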
15. Shortest Path Problem
• V: the postings, sorted by their position in the list
• E: every possible block in the list
• C(i,j): the approximation error of the block
We add a fixed cost F to the cost function C(i,j).
$O(n^2)$
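The shortest-path formulation can be sketched as an exact quadratic dynamic program over block boundaries. Here the approximation error of a block is taken to be the sum, over its postings, of the block max minus the posting's own score; that definition, the per-posting `scores` input and the function name are assumptions made for this example. It is the plain exhaustive version, which visits every possible block.

```python
def optimal_blocks(scores, fixed_cost):
    """Partition scores into consecutive blocks minimising
    sum over blocks of (approximation error + fixed_cost).

    Returns (total_cost, blocks) with blocks as half-open (start, end) ranges.
    """
    n = len(scores)
    best = [0.0] + [float("inf")] * n  # best[j]: cheapest way to cover scores[:j]
    previous = [0] * (n + 1)           # previous[j]: start of the last block
    for i in range(n):
        block_max = 0.0
        block_sum = 0.0
        for j in range(i + 1, n + 1):  # edge (i, j) is the block scores[i:j]
            block_max = max(block_max, scores[j - 1])
            block_sum += scores[j - 1]
            error = block_max * (j - i) - block_sum
            cost = best[i] + error + fixed_cost
            if cost < best[j]:
                best[j] = cost
                previous[j] = i
    blocks = []
    j = n
    while j > 0:  # walk the shortest path backwards
        blocks.append((previous[j], j))
        j = previous[j]
    blocks.reverse()
    return best[n], blocks


# Hypothetical per-posting scores: three low postings followed by two high ones.
total_cost, blocks = optimal_blocks([1.0, 1.0, 1.0, 5.0, 5.0], fixed_cost=1.0)
print(total_cost, blocks)
```

Without the fixed cost, putting every posting in its own block would always give zero error, so F is what keeps the number of blocks down in this sketch.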
17. Experimental analysis
Collection   Size      Size after compression   # lists      # postings   # documents
Gov2         44 GiB    4.4 GiB                  35 million   6 billion    24 million
Clueweb09    120 GiB   15 GiB                   92 million   15 billion   50 million
● Trec2005 and Trec2006 query collections.
● The code is written in C++11, compiled with GCC 5.3.1 at the highest optimization settings, and executed on an 8-core i7-4790K with 32 GiB of RAM running Linux kernel v. 4.4.0.