7. What is Machine Learning?
• Machine learning is a field of computer science that
gives computers the ability to learn without being explicitly
programmed.
• Statistics: collect data -> build model -> predict
18. B-Tree vs. Models
• Task: Predict the offset of value given a key
• Input: key
• Output:
• B-Tree: [pos, pos + pagesize]
• Model: [pos - min_err, pos + max_err]
• How to bound min_err and max_err? No test dataset is needed: the index
is built over the complete data, so both errors can be computed exactly over all keys
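Since the index is built over the complete dataset, min_err and max_err can be computed exactly by running the model over every key. A minimal sketch (the linear model and all names are illustrative, not the paper's implementation):

```python
# Bound a learned index's error by evaluating the model over ALL keys;
# the index "overfits" its own data, so no held-out test set is required.

def fit_linear(keys):
    """Hand-rolled least-squares line mapping key -> position."""
    n = len(keys)
    mk = sum(keys) / n
    mp = (n - 1) / 2                      # mean of positions 0..n-1
    var = sum((k - mk) ** 2 for k in keys)
    slope = sum((k - mk) * (p - mp) for p, k in enumerate(keys)) / var
    intercept = mp - slope * mk
    return lambda key: slope * key + intercept

def error_bounds(model, keys):
    """Exact min_err/max_err of the model over the whole dataset."""
    errs = [pos - model(k) for pos, k in enumerate(keys)]
    return -min(errs), max(errs)

keys = [2, 3, 5, 8, 13, 21, 34, 55]
model = fit_linear(keys)
min_err, max_err = error_bounds(model, keys)
# Every true position now lies in [pred - min_err, pred + max_err].
```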
19. Index as a Function
• A B-Tree and an ML model both fit this curve (the data's CDF), just in different ways
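The curve being fitted is the cumulative distribution function of the keys: for a sorted array of N keys, the position of a key is essentially N * F(key). A small sketch using the empirical CDF (example data is ours):

```python
# For a sorted array, position = N * F(key) - 1, where F is the
# empirical CDF (fraction of keys <= x) of the key distribution.
import bisect

keys = [2, 3, 5, 8, 13, 21, 34, 55]   # sorted dataset
N = len(keys)

def empirical_cdf(x):
    """Fraction of stored keys <= x."""
    return bisect.bisect_right(keys, x) / N
```

An index model approximates F; the better the approximation, the tighter the search range around the predicted position.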
27. Test models
• B-Tree with different page sizes
• very competitive performance
• RMI with 2-stage models using simple grid-search
• 0 to 2 hidden layers
• layer-width ranging from 4 to 32 nodes
• Total time = lookup time + search time
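The two-stage setup above can be sketched with plain linear models (the zero-hidden-layer case from the grid search); the class and parameter names are ours, and this is only an illustration of the recursive-model idea, not the paper's implementation:

```python
# Minimal 2-stage recursive model index (RMI) with linear models only.

def fit_linear(xs, ys):
    """Least-squares line y = a*x + b; constant model if xs are degenerate."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return lambda x: my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    b = my - a * mx
    return lambda x: a * x + b

class TwoStageRMI:
    def __init__(self, keys, num_leaf_models=4):
        self.n, self.m = len(keys), num_leaf_models
        positions = list(range(self.n))
        # Stage 1: one model routes a key to a stage-2 model.
        self.root = fit_linear(keys, positions)
        # Partition the training data by the stage-1 prediction.
        buckets = [([], []) for _ in range(self.m)]
        for k, p in zip(keys, positions):
            xs, ys = buckets[self._route(k)]
            xs.append(k)
            ys.append(p)
        # Stage 2: each leaf refines the prediction on its own slice.
        self.leaves = [fit_linear(xs, ys) if xs else (lambda x: 0.0)
                       for xs, ys in buckets]

    def _route(self, key):
        i = int(self.root(key) * self.m / self.n)
        return min(max(i, 0), self.m - 1)

    def predict(self, key):
        return self.leaves[self._route(key)](key)
```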
31. Conclusion
• Up to 3x faster
• An order-of-magnitude smaller
• Data distribution dependent
32. Inserts and Updates
• The Achilles' heel of learned indexes, because of the potentially high cost
of relearning models
• Introduce additional space in sorted dataset, similar to a B-Tree
• Assume that the inserts follow roughly a similar pattern as the
learned CDF
• What happens if the distribution changes?
• Retrain the model, starting with the affected stage-2 models and escalating to stage 1 if needed
• Delta-index
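A delta-index can be sketched as a small sorted insert buffer on the side of the model-indexed data, merged (with retraining) once it grows past a threshold. Everything below is a hedged illustration, not the paper's design; names and the threshold are ours:

```python
# Delta-index sketch: inserts land in a small sorted buffer; lookups
# consult the main (model-indexed) data first, then the buffer; a merge
# plus retrain runs once the buffer passes a threshold, amortizing the
# cost of model learning.
import bisect

class DeltaIndexed:
    def __init__(self, keys, merge_threshold=4):
        self.main = sorted(keys)
        self.delta = []
        self.threshold = merge_threshold
        self._retrain()

    def _retrain(self):
        # Placeholder: a real implementation would refit the RMI stages
        # over self.main here. We only track a version counter.
        self.version = getattr(self, "version", 0) + 1

    def insert(self, key):
        bisect.insort(self.delta, key)
        if len(self.delta) >= self.threshold:
            self.main = sorted(self.main + self.delta)
            self.delta = []
            self._retrain()

    def contains(self, key):
        i = bisect.bisect_left(self.main, key)
        if i < len(self.main) and self.main[i] == key:
            return True
        j = bisect.bisect_left(self.delta, key)
        return j < len(self.delta) and self.delta[j] == key
```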
34. Point Index
• Hash collisions
• Probing (e.g. linked list)
• Trade-off between time and space
• Learned Hash-map
• learns the key distribution (CDF) for a more uniform slot assignment
• fewer collisions, closer to a one-to-one mapping
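The learned hash idea reduces to scaling the CDF into the slot range: h(key) = floor(F(key) * M). A sketch with the empirical CDF standing in for the model (example data is ours):

```python
# Use the (empirical) CDF as a hash function: scaling F(key) into M
# slots spreads keys in proportion to their density, so collisions drop
# for keys drawn from the learned distribution.
import bisect

def make_cdf_hash(sample_keys, num_slots):
    sample = sorted(sample_keys)
    n = len(sample)
    def h(key):
        rank = bisect.bisect_left(sample, key)   # = n * F(key)
        return min(int(rank * num_slots / n), num_slots - 1)
    return h

keys = list(range(0, 1000, 7))      # keys the "model" was trained on
h = make_cdf_hash(keys, num_slots=len(keys))
slots = [h(k) for k in keys]
# With a perfect CDF and M == N, every key gets a distinct slot.
```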
35. Point Index
• Baseline Hash-map
• only uses 2 multiplications, 3 bit-shifts and 3 XORs
• 2-stage RMI models
• 100k models on the 2nd stage
• without any hidden layers.
• available slots from 75% to 125% of the data size
36. Existence Index
• Most important example: Bloom filters
• Guarantee no false negative
• Potential false positive (FP)
• Targeted FPR = 0.1% then ~14 bits per key
• Targeted FPR = 0.01% then ~19 bits per key
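These sizes follow from the standard optimal-Bloom-filter formula, m/n = -log2(p) / ln 2 bits per key for a target false-positive rate p, with k = log2(1/p) hash functions:

```python
# Optimal Bloom filter sizing for a target false-positive rate p.
import math

def bloom_bits_per_key(p):
    """m/n = -log2(p) / ln 2  (~1.44 * log2(1/p))."""
    return -math.log2(p) / math.log(2)

def bloom_num_hashes(p):
    """Optimal hash-function count k = (m/n) * ln 2 = log2(1/p)."""
    return round(-math.log2(p))
```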
37. Bloom-filters with learned hash-functions
• We denote the set of keys by K and the set of non-keys by U
• Possible sources for the non-key dataset U:
• randomly generated keys
• based on logs of previous queries
• generated by another ML model
• A binary classification task
• NN with Sigmoid activation function
38. Bloom-filters with learned hash-functions
• A classifier alone has a non-zero FPR and FNR
• as the FPR goes down, the FNR will go up
• How to preserve FPR = 0 constraint?
• an overflow Bloom filter over the keys the classifier misses (its false negatives)
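The construction can be sketched as: screen each query with the classifier; any key the classifier scores below its threshold is inserted into a small backup filter, so no key is ever missed. The toy score function and minimal bit-array Bloom filter below are our illustrations, not the paper's code:

```python
# Learned Bloom filter sketch: classifier + overflow Bloom filter built
# over the classifier's false negatives, restoring the FNR = 0 guarantee.
import hashlib

class Bloom:
    """Minimal Bloom filter over a single integer bit array."""
    def __init__(self, num_bits=256, num_hashes=4):
        self.m, self.k = num_bits, num_hashes
        self.bits = 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

class LearnedBloom:
    def __init__(self, keys, score, threshold):
        self.score, self.tau = score, threshold
        self.overflow = Bloom()
        for key in keys:                 # the model's false negatives
            if score(key) < threshold:
                self.overflow.add(key)

    def __contains__(self, key):
        return self.score(key) >= self.tau or key in self.overflow
```

Because the overflow filter stores only the keys the model misses, it can be far smaller than a conventional filter over all of K.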