London Information Retrieval Meetup
19 Feb 2019
Improving Top-K Retrieval Algorithms Using Dynamic Programming and Longer Skipping
Elia Porciani, Software Engineer
19th February 2019
Introduction
●Top-k retrieval and inverted index
●Introduction to early termination techniques
●Block max wand
Faster BlockMax WAND with Variable-sized Blocks
A Mallia, G Ottaviano, E Porciani, N Tonellotto, R Venturini
SIGIR, 2017
Faster BlockMax WAND with Longer Skipping
A Mallia, E Porciani
ECIR, 2019
Queries over search engines
Inverted index
Documents
term1 term2
term3 term4
term5
Inverted index compression
We compressed posting lists with
partitioned Elias-Fano.
Giuseppe Ottaviano and Rossano Venturini. Partitioned Elias-Fano indexes. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14
Inverted list: 1 2 5 7 12 13 14 20
D-gaps: 1 1 3 2 5 1 1 6
Only a few bits are needed to store each item of an inverted list.
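The d-gap transformation shown above can be sketched as follows. This is only an illustration of gap encoding, not of partitioned Elias-Fano itself; the function names `to_dgaps` and `from_dgaps` are hypothetical.

```python
from itertools import accumulate


def to_dgaps(doc_ids):
    """Turn a strictly increasing list of doc-ids into gaps between neighbours."""
    gaps = []
    previous = 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_dgaps(gaps):
    """Rebuild the doc-ids by taking prefix sums of the gaps."""
    return list(accumulate(gaps))


inverted_list = [1, 2, 5, 7, 12, 13, 14, 20]
dgaps = to_dgaps(inverted_list)
print(dgaps)  # [1, 1, 3, 2, 5, 1, 1, 6]
```

The gaps are small numbers even when the doc-ids are large, which is why few bits per item are enough.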
Top-K Retrieval
We are interested only in the top k documents, with k small.
Tf-Idf
In detail, we use Okapi BM25.
Posting list of a term:
Doc-id: 1 2 5 7 12 13 14 20
Frequencies: 3 2 1 8 2 4 6 2
Term frequency: $tf_{ij} = \frac{|n_{ij}|}{|d_j|}$
Inverse document frequency: $idf_i = \frac{|D|}{|\{d : i \in d\}|}$
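A minimal sketch of the two formulas above, multiplied into a single score. It follows the simplified tf and idf as written on the slide, not Okapi BM25; the function names, the document length of 100 and the collection size of 20 are assumptions made only for illustration.

```python
def tf(term_count, doc_length):
    """Term frequency: occurrences of the term divided by the document length."""
    return term_count / doc_length


def idf(num_docs, docs_with_term):
    """Inverse document frequency, in the unscaled form shown on the slide."""
    return num_docs / docs_with_term


def tf_idf(term_count, doc_length, num_docs, docs_with_term):
    return tf(term_count, doc_length) * idf(num_docs, docs_with_term)


# Frequency 3 is the first entry of the example list, which holds 8 postings.
# doc_length=100 and num_docs=20 are made-up values.
score = tf_idf(term_count=3, doc_length=100, num_docs=20, docs_with_term=8)
print(score)
```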
Inverted list iterator operations
next(): find the next document id
nextGEQ(k): find the next document id in the list with id >= k
score(): compute the score of the current document id, taking the associated frequency into account
The score() function involves floating-point computations.
Iterating over an inverted index is expensive because it is compressed.
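The three operations can be sketched over an uncompressed list as below. The class name `PostingIterator`, the `END` sentinel and the `weight * frequency` scoring are assumptions for illustration; a real index would decode compressed blocks and use BM25 in `score()`.

```python
import bisect


class PostingIterator:
    END = float("inf")  # sentinel returned once the list is exhausted

    def __init__(self, doc_ids, freqs, weight=1.0):
        self.doc_ids = doc_ids  # sorted doc-ids
        self.freqs = freqs      # term frequency of each posting
        self.weight = weight    # stand-in for the term weight
        self.pos = 0

    def docid(self):
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else self.END

    def next(self):
        """Move to the next posting and return its doc-id."""
        if self.pos < len(self.doc_ids):
            self.pos += 1
        return self.docid()

    def next_geq(self, k):
        """Move to the first posting at or after the current one with doc-id >= k."""
        self.pos = bisect.bisect_left(self.doc_ids, k, lo=self.pos)
        return self.docid()

    def score(self):
        """Placeholder score of the current posting (not BM25)."""
        return self.weight * self.freqs[self.pos]


it = PostingIterator([1, 2, 5, 7, 12, 13, 14, 20], [3, 2, 1, 8, 2, 4, 6, 2])
print(it.next_geq(6))  # 7
print(it.score())      # 8.0
print(it.next())       # 12
```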
Ranked-Or
[Figure: posting lists of terms T1 to T5 aligned along the doc-id axis]
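As a rough sketch of exhaustive Ranked-Or evaluation, the function below scores every document that appears in at least one list and keeps the k best in a min-heap. The input layout (sorted lists of `(doc_id, score)` pairs with precomputed partial scores) and the name `ranked_or` are assumptions for illustration.

```python
import heapq


def ranked_or(posting_lists, k):
    """Document-at-a-time union of the lists, returning the k best (score, doc_id)."""
    positions = [0] * len(posting_lists)
    heap = []  # min-heap holding the current top-k as (score, doc_id)
    while True:
        # Smallest doc-id among the lists that are not exhausted.
        current = min(
            (pl[p][0] for pl, p in zip(posting_lists, positions) if p < len(pl)),
            default=None,
        )
        if current is None:
            break
        total = 0.0
        for i, pl in enumerate(posting_lists):
            p = positions[i]
            if p < len(pl) and pl[p][0] == current:
                total += pl[p][1]
                positions[i] += 1
        if len(heap) < k:
            heapq.heappush(heap, (total, current))
        elif total > heap[0][0]:
            heapq.heapreplace(heap, (total, current))
    return sorted(heap, reverse=True)


lists = [[(1, 1.0), (3, 2.0)], [(1, 0.5), (2, 4.0)], [(3, 1.5)]]
print(ranked_or(lists, k=2))  # [(4.0, 2), (3.5, 3)]
```

Every posting is scored here, which is the cost that early termination techniques try to avoid.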
Early termination techniques
●It is not necessary to compute the score function on all the postings.
●Max score
●Wand
●BlockMaxWand
These algorithms compute the exact top-k documents.
Wand
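[Figure: WAND pivot selection over the posting lists of T1 to T5 along the doc-id axis, with the pivot list marked. Threshold θ = 15.8. Per-list max scores: Ms = 5.4, 5.0, 4.2, 4.3, 2.3. Running sums of max scores shown: 5.4, 14.6, 9.6, 16.9.]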
Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. Efficient query evaluation using a two-level retrieval process. CIKM ’03
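WAND pivot selection can be sketched as follows: sort the lists by their current doc-id and accumulate max scores until the running sum exceeds the threshold. The max scores and θ = 15.8 are taken from the slide; the current doc-ids, the resulting list order and the name `select_pivot` are hypothetical, chosen so that the running sums come out as 5.4, 9.6, 14.6, 16.9.

```python
def select_pivot(lists, threshold):
    """lists: one (current_doc_id, max_score) pair per term.

    Returns (index in doc-id order, pivot doc-id), or None if no document
    can beat the threshold.
    """
    ordered = sorted(lists, key=lambda entry: entry[0])
    running_sum = 0.0
    for index, (doc_id, max_score) in enumerate(ordered):
        running_sum += max_score
        if running_sum > threshold:
            return index, doc_id
    return None


# (hypothetical current doc-id, max score from the slide)
lists = [(3, 5.4), (9, 5.0), (7, 4.2), (20, 4.3), (12, 2.3)]
print(select_pivot(lists, threshold=15.8))  # (3, 12)
```

Documents before the pivot doc-id cannot reach the threshold, so the preceding lists can be advanced to it without scoring anything in between.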
Block-Max-Wand
[Figure: Block-Max-Wand example over the posting lists of T1 to T5 along the doc-id axis, with the pivot list marked. Threshold θ = 15.8. Block max scores shown: 3.2, 4.2; 4.5, 5.4, 2.1; 2.1, 3.2; 2.1, 2.3, 0.8; 3.5, 1.4; 4.1; 2.7; 2.0. Per-list max scores: Ms = 5.4, 5.0, 4.2, 4.3, 2.3. Running sums of max scores shown: 5.4, 14.6, 9.6, 16.9. Block max upper estimation = 10.2.]
The lower the average approximation error, the better the performance.
Shuai Ding and Torsten Suel. Faster top-k document retrieval using block-max indexes. SIGIR ’11
Block Max Wand
1. Select the pivot as in Wand.
2. Compute the sum of the block max contributions (blockmaxsum) for the pivot doc-id.
3. If the block max sum exceeds the threshold:
   1. Fully evaluate the pivot document.
   2. Move the iterator to pivot.docid + 1.
4. Otherwise, move the iterator to the leftmost boundary of the evaluated blocks.
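Steps 2 to 4 can be sketched as a single decision function. The name `block_max_wand_step` and its inputs are hypothetical, the block values in the example are made up, and returning `min(block_ends) + 1` assumes each boundary is the inclusive last doc-id of its block.

```python
def block_max_wand_step(pivot_doc_id, block_maxes, block_ends, threshold):
    """Decide what to do once the WAND pivot is known.

    block_maxes[i]: max score of the block of list i that contains the pivot doc-id
    block_ends[i]:  last doc-id covered by that block
    """
    if sum(block_maxes) > threshold:
        # The block-level bound is not enough to discard the pivot: score it fully.
        return ("evaluate", pivot_doc_id)
    # No document up to the nearest block boundary can beat the threshold.
    return ("skip_to", min(block_ends) + 1)


# Block max sum 10.2 is below the threshold 15.8, so the pivot is skipped.
print(block_max_wand_step(12, [4.2, 3.9, 2.1], [40, 25, 31], 15.8))  # ('skip_to', 26)
# Block max sum 16.9 exceeds the threshold, so the pivot is evaluated.
print(block_max_wand_step(12, [5.4, 5.0, 4.2, 2.3], [40, 25, 31, 18], 15.8))  # ('evaluate', 12)
```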
Partitioning
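Fixed size blocks vs. variable size blocks
Approximation error of a partition $B$: $\sum_{b \in B} \left( |b| \max_{s \in b} s - \sum_{s \in b} s \right)$
Objective: $\min \sum_{b \in B} \left( |b| \max_{s \in b} s \right) + S$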
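To make the approximation error concrete, the sketch below computes, for each block, the block maximum times the block length minus the sum of the scores in the block. The function names and the example scores are assumptions for illustration.

```python
def approximation_error(scores, boundaries):
    """boundaries: exclusive end index of each block, in increasing order."""
    error = 0.0
    start = 0
    for end in boundaries:
        block = scores[start:end]
        error += max(block) * len(block) - sum(block)
        start = end
    return error


def fixed_size_boundaries(n, block_size):
    """End indices of consecutive blocks of block_size items (the last may be shorter)."""
    return list(range(block_size, n, block_size)) + [n]


scores = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
print(approximation_error(scores, fixed_size_boundaries(len(scores), 2)))  # 4.0
print(approximation_error(scores, [3, 6]))                                 # 0.0
```

With the same number of postings, blocks whose boundaries follow the score distribution can give a much tighter upper bound than fixed-size blocks.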
Shortest Path Problem
• V: postings sorted by their position in the list
• E: every possible block in the list
• C(i,j): the approximation error
We add a fixed cost F to the cost function C(i,j).
Complexity: $O(n^2)$
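The exact quadratic solution can be sketched as a dynamic program over block end positions, which is the shortest path on the graph described above. The name `optimal_partition` and the example values are hypothetical, and the code assumes non-negative scores.

```python
def optimal_partition(scores, fixed_cost):
    """Return (minimum total cost, exclusive end index of each block)."""
    n = len(scores)
    best = [0.0] + [float("inf")] * n  # best[j]: cheapest partition of scores[:j]
    parent = [0] * (n + 1)             # parent[j]: start of the last block in that partition
    for j in range(1, n + 1):
        block_max = 0.0
        block_sum = 0.0
        for i in range(j - 1, -1, -1):  # candidate block covers scores[i:j]
            block_max = max(block_max, scores[i])
            block_sum += scores[i]
            cost = block_max * (j - i) - block_sum + fixed_cost
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                parent[j] = i
    boundaries = []
    j = n
    while j > 0:
        boundaries.append(j)
        j = parent[j]
    return best[n], boundaries[::-1]


print(optimal_partition([1.0, 1.0, 1.0, 5.0, 5.0, 5.0], fixed_cost=1.0))  # (2.0, [3, 6])
```

The fixed cost F penalizes each block, so the optimum does not degenerate into one block per posting.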
Approximation algorithm
● Monotonicity: $C(i, j) \le C(i, j + 1)$ and $C(i, j) \le C(i - 1, j)$. Complexity: $O(n^2) \to O(n \log U)$.
$G_1 = \{(i, j) \in G \mid \exists k .\ C(i, j) \le F(1 + \alpha)^k \le C(i, j + 1)\}$
● Quasi-subadditivity: $C(i, k) + C(k + 1, j) \le C(i, j) + F + 1$. Complexity: $O(n \log U) \to O(n)$.
$G_2 = \{(i, j) \in G_1 \mid C(i, j) \le F/\beta\}$
$sp(G_2) = (1 + \alpha)(1 + \beta)\, sp(G)$
Experimental analysis
Collection   Size      Size after compression   # lists      # postings   # documents
Gov2         44 GiB    4.4 GiB                  35 million   6 billion    24 million
Clueweb09    120 GiB   15 GiB                   92 million   15 billion   50 million
● Trec2005 and Trec2006 query collections.
● The code is written in C++11, compiled with GCC 5.3.1 at the highest optimization settings, and executed on an 8-core i7-4790K with 32 GiB RAM running Linux kernel v. 4.4.0.
Choosing block size
[Figure: two plots with block size on the x-axis]
Block-Max-Wand Compression
Maximum impact
element
Boundary doc-id
Block-Max-Wand Compression (score quantization)
Uniform partitioning
Opt partitioning
Sort
Compression algorithms comparisons
Time in ms (slowdown relative to BMW in parentheses):
            Gov2                          Clueweb09
            Trec2005      Trec2006       Trec2005       Trec2006
Wand        7.06 (1.92x)  12.92 (1.55x)  28.85 (2.25x)  37.55 (1.40x)
MaxScore    6.59 (1.79x)  11.35 (1.36x)  23.58 (1.84x)  32.28 (1.21x)
BMW         3.67          8.33           12.81          26.64

Space in bits per posting (overhead relative to the plain index in parentheses):
                Gov2          Clueweb09
Plain index     6.91          8.36
Wand/MaxScore   7.24 (1.04x)  8.65 (1.03x)
BMW/VBMW        9.14 (1.32x)  10.68 (1.27x)
VBMW c.         8.07 (1.16x)  9.51 (1.13x)

Time in ms (slowdown relative to VBMW in parentheses):
            Gov2                          Clueweb09
            Trec2005      Trec2006       Trec2005       Trec2006
Wand        7.06 (3.34x)  12.92 (2.72x)  28.85 (3.98x)  37.55 (2.55x)
MaxScore    6.59 (3.12x)  11.35 (2.39x)  23.58 (3.25x)  32.28 (2.11x)
BMW         3.67 (1.73x)  8.33 (1.75x)   12.81 (1.77x)  26.64 (1.74x)
VBMW        2.11          4.75           7.25           15.30
VBMW c.     2.35 (1.11x)  5.29 (1.11x)   8.21 (1.13x)   17.00 (1.11x)
Longer skipping
We can do better than skipping to the block boundary.
[Figure: skipping to the block boundary vs. skipping to the longer-skipping (LS) boundary]
Two options:
● Iterate over the blocks at runtime
● Add a pointer per block
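The "iterate over the blocks at runtime" option can be sketched roughly as below: when the block max sum does not exceed the threshold, keep jumping past the nearest block boundary until it does. This is a simplified illustration, not the authors' implementation. The function names and the `(last_doc_id, max_score)` block layout are hypothetical, and a list contributes 0 once the candidate doc-id lies beyond its last block.

```python
def block_max_at(blocks, doc_id):
    """blocks: (last_doc_id, max_score) pairs sorted by last_doc_id.

    Returns the block covering doc_id, or (None, 0.0) past the end of the list.
    """
    for last_doc_id, max_score in blocks:
        if doc_id <= last_doc_id:
            return last_doc_id, max_score
    return None, 0.0


def longer_skip(lists_blocks, start_doc_id, threshold):
    """First doc-id >= start_doc_id whose block max sum exceeds the threshold."""
    doc_id = start_doc_id
    while True:
        total = 0.0
        next_boundary = None
        for blocks in lists_blocks:
            last, max_score = block_max_at(blocks, doc_id)
            total += max_score
            if last is not None and (next_boundary is None or last < next_boundary):
                next_boundary = last
        if total > threshold:
            return doc_id
        if next_boundary is None:
            return None  # all lists exhausted
        doc_id = next_boundary + 1  # jump past the nearest block boundary


lists_blocks = [[(10, 3.0), (20, 9.0)], [(15, 2.0), (30, 8.0)]]
print(longer_skip(lists_blocks, start_doc_id=1, threshold=15.8))  # 16
```

In this example a skip to the first block boundary would stop at doc-id 11, where the block max sum is still only 11.0; the longer skip continues to 16.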
Block-Max-Wand
[Figure: the same Block-Max-Wand example (θ = 15.8, block max upper estimation = 10.2), revisited to illustrate longer skipping]
Shuai Ding and Torsten Suel. Faster top-k document retrieval using block-max indexes. SIGIR ’11
Longer skipping
ClueWeb - Trec2005
             2             3             4             5              6+
VBMW         3.17 (1.45x)  6.39 (1.13x)  8.92 (1.04x)  14.46 (1.00x)  32.04 (1.03x)
VBMW LS      2.18          5.66          8.57          14.44          31.05 (1.04x)
VBMW c.      3.53 (1.31x)  6.97 (1.15x)  9.86 (1.04x)  16.06 (1.00x)  36.26 (1.01x)
VBMW LSP c.  2.68          6.3           9.52          16.07          36.01
Thank you
