7. What is Machine Learning?
• Machine learning is a field of computer science that
gives computers the ability to learn without being explicitly
programmed.
• Statistics: collect data -> build model -> predict
18. B-Tree vs. Models
• Task: Predict the offset of value given a key
• Input: key
• Output:
• B-Tree: [pos, pos + pagesize]
• Model: [pos - min_err, pos + max_err]
• How to bound min_err and max_err? No test dataset is needed: the index
is built over the complete data, so both errors can be computed exactly over all keys
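Since the index is built over the complete dataset, min_err and max_err can be computed exactly by running the model over every key. A minimal sketch (the linear model and all names are illustrative, not the paper's implementation):

```python
# Bound a learned index's error by evaluating the model over ALL keys;
# the index "overfits" its own data, so no held-out test set is required.

def fit_linear(keys):
    """Hand-rolled least-squares line mapping key -> position."""
    n = len(keys)
    mk = sum(keys) / n
    mp = (n - 1) / 2                      # mean of positions 0..n-1
    var = sum((k - mk) ** 2 for k in keys)
    slope = sum((k - mk) * (p - mp) for p, k in enumerate(keys)) / var
    intercept = mp - slope * mk
    return lambda key: slope * key + intercept

def error_bounds(model, keys):
    """Exact min_err/max_err of the model over the whole dataset."""
    errs = [pos - model(k) for pos, k in enumerate(keys)]
    return -min(errs), max(errs)

keys = [2, 3, 5, 8, 13, 21, 34, 55]
model = fit_linear(keys)
min_err, max_err = error_bounds(model, keys)
# Every true position now lies in [pred - min_err, pred + max_err].
```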
19. Index as a Function
• A B-Tree and an ML model both fit this curve (the data's CDF), just in different ways
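The curve being fitted is the cumulative distribution function of the keys: for a sorted array of N keys, the position of a key is essentially N * F(key). A small sketch using the empirical CDF (example data is ours):

```python
# For a sorted array, position = N * F(key) - 1, where F is the
# empirical CDF (fraction of keys <= x) of the key distribution.
import bisect

keys = [2, 3, 5, 8, 13, 21, 34, 55]   # sorted dataset
N = len(keys)

def empirical_cdf(x):
    """Fraction of stored keys <= x."""
    return bisect.bisect_right(keys, x) / N
```

An index model approximates F; the better the approximation, the tighter the search range around the predicted position.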
27. Test models
• B-Tree with different page sizes
• very competitive performance
• RMI with 2-stage models using simple grid-search
• 0 to 2 hidden layers
• layer-width ranging from 4 to 32 nodes
• Total time = lookup time + search time
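The two-stage setup above can be sketched with plain linear models (the zero-hidden-layer case from the grid search); the class and parameter names are ours, and this is only an illustration of the recursive-model idea, not the paper's implementation:

```python
# Minimal 2-stage recursive model index (RMI) with linear models only.

def fit_linear(xs, ys):
    """Least-squares line y = a*x + b; constant model if xs are degenerate."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    if var == 0:
        return lambda x: my
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var
    b = my - a * mx
    return lambda x: a * x + b

class TwoStageRMI:
    def __init__(self, keys, num_leaf_models=4):
        self.n, self.m = len(keys), num_leaf_models
        positions = list(range(self.n))
        # Stage 1: one model routes a key to a stage-2 model.
        self.root = fit_linear(keys, positions)
        # Partition the training data by the stage-1 prediction.
        buckets = [([], []) for _ in range(self.m)]
        for k, p in zip(keys, positions):
            xs, ys = buckets[self._route(k)]
            xs.append(k)
            ys.append(p)
        # Stage 2: each leaf refines the prediction on its own slice.
        self.leaves = [fit_linear(xs, ys) if xs else (lambda x: 0.0)
                       for xs, ys in buckets]

    def _route(self, key):
        i = int(self.root(key) * self.m / self.n)
        return min(max(i, 0), self.m - 1)

    def predict(self, key):
        return self.leaves[self._route(key)](key)
```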
31. Conclusion
• Up to 3x faster
• An order-of-magnitude smaller
• Data distribution dependent
32. Inserts and Updates
• The Achilles' heel of learned indexes, because of the potentially high cost
of relearning models
• Introduce additional space in sorted dataset, similar to a B-Tree
• Assume that the inserts follow roughly a similar pattern as the
learned CDF
• What happens if the distribution changes?
• Retrain the model, starting with the affected stage-2 models and escalating to stage 1 if needed
• Delta-index
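A delta-index can be sketched as a small sorted insert buffer on the side of the model-indexed data, merged (with retraining) once it grows past a threshold. Everything below is a hedged illustration, not the paper's design; names and the threshold are ours:

```python
# Delta-index sketch: inserts land in a small sorted buffer; lookups
# consult the main (model-indexed) data first, then the buffer; a merge
# plus retrain runs once the buffer passes a threshold, amortizing the
# cost of model learning.
import bisect

class DeltaIndexed:
    def __init__(self, keys, merge_threshold=4):
        self.main = sorted(keys)
        self.delta = []
        self.threshold = merge_threshold
        self._retrain()

    def _retrain(self):
        # Placeholder: a real implementation would refit the RMI stages
        # over self.main here. We only track a version counter.
        self.version = getattr(self, "version", 0) + 1

    def insert(self, key):
        bisect.insort(self.delta, key)
        if len(self.delta) >= self.threshold:
            self.main = sorted(self.main + self.delta)
            self.delta = []
            self._retrain()

    def contains(self, key):
        i = bisect.bisect_left(self.main, key)
        if i < len(self.main) and self.main[i] == key:
            return True
        j = bisect.bisect_left(self.delta, key)
        return j < len(self.delta) and self.delta[j] == key
```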
34. Point Index
• Hash collisions
• Probing (e.g. linked list)
• Trade-off between time and space
• Learned Hash-map
• learns the key distribution (CDF) for a more uniform slot assignment
• fewer collisions, closer to a one-to-one mapping
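The learned hash idea reduces to scaling the CDF into the slot range: h(key) = floor(F(key) * M). A sketch with the empirical CDF standing in for the model (example data is ours):

```python
# Use the (empirical) CDF as a hash function: scaling F(key) into M
# slots spreads keys in proportion to their density, so collisions drop
# for keys drawn from the learned distribution.
import bisect

def make_cdf_hash(sample_keys, num_slots):
    sample = sorted(sample_keys)
    n = len(sample)
    def h(key):
        rank = bisect.bisect_left(sample, key)   # = n * F(key)
        return min(int(rank * num_slots / n), num_slots - 1)
    return h

keys = list(range(0, 1000, 7))      # keys the "model" was trained on
h = make_cdf_hash(keys, num_slots=len(keys))
slots = [h(k) for k in keys]
# With a perfect CDF and M == N, every key gets a distinct slot.
```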
35. Point Index
• Baseline Hash-map
• only uses 2 multiplications, 3 bit-shifts and 3 XORs
• 2-stage RMI models
• 100k models on the 2nd stage
• without any hidden layers.
• available slots from 75% to 125% of the data size
36. Existence Index
• Most important example: Bloom filters
• Guarantee no false negative
• Potential false positive (FP)
• Targeted FPR = 0.1% then ~14 bits per key
• Targeted FPR = 0.01% then ~19 bits per key
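These sizes follow from the standard optimal-Bloom-filter formula, m/n = -log2(p) / ln 2 bits per key for a target false-positive rate p, with k = log2(1/p) hash functions:

```python
# Optimal Bloom filter sizing for a target false-positive rate p.
import math

def bloom_bits_per_key(p):
    """m/n = -log2(p) / ln 2  (~1.44 * log2(1/p))."""
    return -math.log2(p) / math.log(2)

def bloom_num_hashes(p):
    """Optimal hash-function count k = (m/n) * ln 2 = log2(1/p)."""
    return round(-math.log2(p))
```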
37. Bloom-filters with learned hash-functions
• We denote the set of keys by K and the set of non-keys by U
• Possible sources for the non-key dataset U:
• randomly generated keys
• based on logs of previous queries
• generated by another ML model
• A binary classification task
• NN with Sigmoid activation function
38. Bloom-filters with learned hash-functions
• A classifier alone has a non-zero FPR and FNR
• as the FPR goes down, the FNR will go up
• How to preserve FPR = 0 constraint?
• an overflow Bloom filter over the keys the classifier misses (its false negatives)
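The construction can be sketched as: screen each query with the classifier; any key the classifier scores below its threshold is inserted into a small backup filter, so no key is ever missed. The toy score function and minimal bit-array Bloom filter below are our illustrations, not the paper's code:

```python
# Learned Bloom filter sketch: classifier + overflow Bloom filter built
# over the classifier's false negatives, restoring the FNR = 0 guarantee.
import hashlib

class Bloom:
    """Minimal Bloom filter over a single integer bit array."""
    def __init__(self, num_bits=256, num_hashes=4):
        self.m, self.k = num_bits, num_hashes
        self.bits = 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

class LearnedBloom:
    def __init__(self, keys, score, threshold):
        self.score, self.tau = score, threshold
        self.overflow = Bloom()
        for key in keys:                 # the model's false negatives
            if score(key) < threshold:
                self.overflow.add(key)

    def __contains__(self, key):
        return self.score(key) >= self.tau or key in self.overflow
```

Because the overflow filter stores only the keys the model misses, it can be far smaller than a conventional filter over all of K.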