B-Tree Lexicon, Min-Heaps
Kira Radinsky
Min-Heap slides are courtesy of Aya Soffer and David Carmel,
IBM Haifa Research Lab
2 November 2010 236621 Search Engine Technology 2
The Lexicon as a B-Tree
• B-Tree: a balanced tree that is optimized for disk I/O, holding key/value
pairs
• Branching is defined by a min-degree parameter t, t > 1
– t is chosen according to the size of a disk block
• Any internal node other than the root has at least t and at most 2t
children; the root has either no children, or at least two and at most 2t
children
• Any internal node with k children also stores k-1 keys which serve as
separator values: separator j is larger than the keys of subtree j and
smaller than the keys of subtree j+1
• Leaf nodes, like all nodes, store at most 2t-1 key/value pairs
– When not the root, store at least t-1 key/value pairs
• Lookup, insertion and deletion operations on a B-Tree visit a number
of nodes linear in its height, i.e. O(log_t n) nodes for n keys
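As a worked bound (the standard textbook argument, not taken from the slides), the logarithmic height follows from the minimum occupancy rules above. The symbols h (height, counted in edges from root to leaf) and n (total number of keys) are introduced only for this sketch. The root holds at least 1 key, every other node at least t-1 keys, and there are at least 2t^(i-1) nodes at depth i ≥ 1:

```latex
n \;\ge\; 1 + (t-1)\sum_{i=1}^{h} 2\,t^{\,i-1}
  \;=\; 1 + 2(t-1)\,\frac{t^{h}-1}{t-1}
  \;=\; 2t^{h} - 1
\quad\Longrightarrow\quad
h \;\le\; \log_t \frac{n+1}{2}
```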
B-Tree Lexicon - Example
• t=2
• Each key is associated with a value that contains a DF and
a pointer to the postings list (dashed line)
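[Figure: B-Tree lexicon with t=2; the number shown with each key in the figure is given in parentheses, dashed pointers to postings lists not reproduced]
Root: gets (1), more (2)
Leaves: and (3), as (1), bad (2) | good (2), is (1), it (2) | the (1), ugly (2)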
B-Tree Lookup
Looking up the value associated with key x:
1. current_node ← root
2. Let k1<k2<…<km be the keys of current_node
3. if x ∈ {k1,k2,…,km}, we’re done: return the associated value
4. else, if current_node is a leaf node, return null
5. else, let j be the smallest index s.t. x<kj (j ← m+1 if x>km);
– current_node ← j’th subtree, and goto 2
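The lookup steps above can be sketched in Python as follows. This is a minimal in-memory sketch, not the course's implementation: the `Node` class, the `btree_lookup` name, and the `(DF, postings offset)` values are assumptions made for illustration, and a real lexicon would read each node from a disk block.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Node:
    keys: List[str]                       # sorted: k1 < k2 < ... < km
    values: List[Any]                     # values[i] belongs to keys[i]
    children: List["Node"] = field(default_factory=list)  # empty for a leaf


def btree_lookup(root: Node, x: str) -> Optional[Any]:
    node = root                                        # step 1
    while True:
        j = bisect_left(node.keys, x)                  # first j with keys[j] >= x
        if j < len(node.keys) and node.keys[j] == x:
            return node.values[j]                      # step 3: key found
        if not node.children:
            return None                                # step 4: leaf, key absent
        node = node.children[j]                        # step 5: descend


# Hypothetical lexicon over the example's keys; values are (DF, postings offset)
# and the offsets are invented for illustration.
leaves = [
    Node(["and", "as", "bad"], [(3, 0), (1, 30), (2, 40)]),
    Node(["good", "is", "it"], [(2, 60), (1, 80), (2, 90)]),
    Node(["the", "ugly"], [(1, 110), (2, 120)]),
]
root = Node(["gets", "more"], [(1, 50), (2, 100)], leaves)

print(btree_lookup(root, "good"))   # value stored with "good"
print(btree_lookup(root, "zebra"))  # None: key is not in the lexicon
```

`bisect_left` performs the search for j inside a node in O(log t) comparisons; a linear scan over the keys would match the slide's wording equally well.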
Top-r Document Selection
Problem definition: Given a set A of scored documents,
select the r documents with the highest scores in A and
return them in decreasing relevance order
• Naïve method: sort the set A by score
– If |A|=M, time complexity is O(M logM)
• Better approach: since typically r<<M, selecting the r
top scores can be done in O(M+r log M) time using a
heap:
1. Heapify the set of M scores (about 2M comparisons) so that the
top score is at the root
2. Repeatedly extract the heap’s root (r times), each time fixing
the heap in O(logM)
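A minimal sketch of this heapify-then-extract approach, assuming the scored documents arrive as `(score, doc_id)` pairs; the function name and the sample data are hypothetical. Python's `heapq` is a min-heap, so scores are negated to put the top score at the root.

```python
import heapq
from typing import List, Tuple


def top_r_max_heap(scored: List[Tuple[float, str]], r: int) -> List[Tuple[float, str]]:
    # Negate scores so the largest score sits at the root of heapq's min-heap.
    heap = [(-score, doc) for score, doc in scored]
    heapq.heapify(heap)                          # O(M)
    result = []
    for _ in range(min(r, len(heap))):           # r extractions, O(log M) each
        neg_score, doc = heapq.heappop(heap)
        result.append((-neg_score, doc))
    return result                                # already in decreasing score order


docs = [(0.2, "d1"), (0.9, "d2"), (0.5, "d3"), (0.7, "d4"), (0.1, "d5")]
print(top_r_max_heap(docs, 3))
```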
The Heap Data Structure - Reminder
• A binary heap is a (mostly full) binary tree with values
stored at all leaves and internal nodes, and an ordering
rule that requires values to be non-decreasing
(alternatively, non-increasing) along each path from a leaf
to the root
– Largest/smallest value is at the root
• Heap implemented in an Array:
– Root at index 1
– For node at index i, left child is at index 2i and right child at index
2i+1
– Thus the parent of the node at index i is at index ⌊i/2⌋
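The index arithmetic can be checked with a short sketch. The helper names are assumptions, and slot 0 of the list is left unused so that the root sits at index 1 as on the slide; the values are the 10-element example array used in this deck.

```python
from typing import List, Optional


def left(i: int) -> int:
    return 2 * i


def right(i: int) -> int:
    return 2 * i + 1


def parent(i: int) -> int:
    return i // 2            # integer division, i.e. floor(i/2)


def is_max_heap(a: List[Optional[int]]) -> bool:
    """Check the max-heap rule on a[1..n]; a[0] is an unused placeholder."""
    n = len(a) - 1
    return all(a[parent(i)] >= a[i] for i in range(2, n + 1))


heap = [None, 23, 17, 15, 17, 8, 2, 13, 4, 14, 5]
print(is_max_heap(heap))     # every node is <= its parent
print(parent(10), left(4), right(4))
```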
Binary Heap Stored in an Array
[Figure: the max-heap below drawn as a binary tree, with 23 at the root]
Index:  1  2  3  4  5  6  7  8  9 10
Value: 23 17 15 17  8  2 13  4 14  5
Extracting the Top Element
• Remove the largest item r times
• Each time:
– Remove the largest item – the root of the heap
– Replace it with the last element of the heap
– Sift the new root down until restoring order
• Example
– Remove item 23 from the root
– Last item in array 5 (at location 10) replaces it
– Reinstate heap order - worst case 5 will be sifted
back down the tree - number of sifts is bounded
by log(size of heap)
Heap Example (cont.)
To restore order at the top level
of tree, item 17, the larger of
the 2 children of root must be
swapped with 5.
This limits the order violation to
the left sub-tree.
[Figure: the heap after 5 replaces the root; in array order: 5 17 15 17 8 2 13 4 14]
The process is repeated until heap order is restored
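The extraction and sift-down described above can be sketched as follows, using the deck's example array. The function names are assumptions, the heap is assumed non-empty, and slot 0 is unused so that indices match the 1-based layout.

```python
from typing import List


def sift_down(a: List[int], i: int, n: int) -> None:
    """Restore max-heap order below index i in the 1-based heap a[1..n]."""
    while 2 * i <= n:
        c = 2 * i                              # left child
        if c + 1 <= n and a[c + 1] > a[c]:
            c += 1                             # right child is the larger one
        if a[i] >= a[c]:
            break                              # order restored
        a[i], a[c] = a[c], a[i]                # swap with the larger child
        i = c


def extract_max(a: List[int]) -> int:
    """Remove and return the root of a non-empty 1-based max-heap (a[0] unused)."""
    top = a[1]
    last = a.pop()                             # last element of the heap
    if len(a) > 1:                             # heap still has elements
        a[1] = last                            # it replaces the root
        sift_down(a, 1, len(a) - 1)
    return top


heap = [0, 23, 17, 15, 17, 8, 2, 13, 4, 14, 5]
print(extract_max(heap))   # 23
print(heap[1:])            # [17, 17, 15, 14, 8, 2, 13, 4, 5]
```

Here 5 is swapped with 17, then with the second 17, then with 14: three swaps, which is the height of this heap.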
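Heap Example (cont.)
[Figure: four tree panels showing 5 sifting down; in array order:]
1. 5 17 15 17 8 2 13 4 14 (5 at the root)
2. 17 5 15 17 8 2 13 4 14 (5 swapped with its larger child, 17)
3. 17 17 15 5 8 2 13 4 14 (5 swapped with its larger child, 17)
4. 17 17 15 14 8 2 13 4 5 (5 swapped with 14; heap order restored)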
Top-r Selection Using a Min-Heap
• The selection problem can be solved by a heap that stores
the smallest item at the root: min-heap
• A min-heap of r items is held instead of a max-heap of M items, so
heap memory drops from O(M) to O(r)
• Process the M scores, storing in the min-heap the r largest
values seen so far
– First r values are heapified in O(r) comparisons
– Replace the smallest value in the min-heap (the rth largest)
whenever a larger value is found
• Sort the r highest values in descending order and return
the corresponding documents – O(r log r)
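A minimal sketch of the min-heap selection above, assuming the scores arrive as a stream of `(score, doc_id)` pairs; the function name and the sample data are hypothetical. `heapq.heapreplace` pops the root and pushes the new item in a single O(log r) step.

```python
import heapq
from itertools import islice
from typing import Iterable, List, Tuple


def top_r_min_heap(scored: Iterable[Tuple[float, str]], r: int) -> List[Tuple[float, str]]:
    if r <= 0:
        return []
    it = iter(scored)
    heap = list(islice(it, r))                 # first r values
    heapq.heapify(heap)                        # O(r)
    for item in it:                            # remaining M-r values
        if item[0] > heap[0][0]:               # beats the current r-th largest
            heapq.heapreplace(heap, item)      # discard the smallest, insert item
    return sorted(heap, reverse=True)          # O(r log r), descending score


docs = [(0.2, "d1"), (0.9, "d2"), (0.5, "d3"), (0.7, "d4"), (0.1, "d5")]
print(top_r_min_heap(docs, 3))
```

Because the input is consumed as an iterator, only the r heap entries are held in memory at any time. The standard library's `heapq.nlargest` offers the same selection as a single call.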
Min-Heap Processing - Illustration
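[Figure: the score sequence split into a processed prefix and an unprocessed suffix; the processed part feeds a min-heap of the r largest items seen so far, and the smallest value is discarded when a larger one arrives]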
Top-r Selection Using a Min-Heap:
Complexity Analysis
• Worst case: the scores are already in increasing order
– Each of the M-r last values is inserted into the heap
– Furthermore, it percolates to the bottom of the heap
– Complexity is O( (M-r)*log(r) )
• Average case – the scores arrive in a permutation of size
M chosen uniformly at random
– The expected number of times one of the M-r last values is
inserted into the heap is ~ r*ln(M/r)
– Each insertion costs O(log(r))
– Complexity of the heap updates is O( r*log(r)*log(M/r) ), in addition
to the O(M) comparisons against the heap's root
• Proof on the board
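A sketch of the standard argument for the expected insertion count follows; it is not necessarily the proof given in class, and it assumes all scores are distinct. For i > r, the i-th value enters the heap exactly when it is among the r largest of the first i values, which under a uniformly random permutation happens with probability r/i. H_k denotes the k-th harmonic number, a symbol introduced only for this sketch.

```latex
\Pr[\text{value } i \text{ enters the heap}] = \frac{r}{i} \quad (i > r),
\qquad
\mathbb{E}[\#\text{insertions}]
  = \sum_{i=r+1}^{M} \frac{r}{i}
  = r\,(H_M - H_r)
  \approx r \ln\frac{M}{r}
```

Multiplying by the O(log r) cost of each insertion gives the O(r log(r) log(M/r)) bound stated above.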
