Bullet (https://github.com/bullet-db) is an open-source, lightweight, scalable, pluggable, multi-tenant query system that lets you query any data flowing through a streaming system without having to store it. Bullet queries look forward in time: they are submitted first, operate on data flowing through the system from the point of submission, and can run forever. Using Sketch-based algorithms, Bullet efficiently supports otherwise intractable Big Data aggregations such as Top K, Count Distinct, and Windowing without a storage layer.
Bullet - Open Source Real-Time Data Query Engine, Michael Natkovich, Director Software Dev Engineering & Nate Speidel, Software Engineer, Oath
1. A REAL TIME DATA QUERY ENGINE
Michael Natkovich & Nate Speidel
2. Allow Myself to Introduce . . . Myself
■ Nate Speidel
● nspeidel@oath.com
● Software Engineer
● 2+ years of solving data problems at Yahoo
3. Allow Myself to Introduce . . . Myself
■ Michael Natkovich
● mln@oath.com
● Director Engineer
● 10+ years of causing data problems at Yahoo
4. Motivation: Cycle of Sadness
■ Instrumentation validation is unbearably slow
● Needs to be seconds not hours
● Needs to be easy to query
● Needs programmatic access
9. Bullet
■ Retrieves data that arrives after query submission
● Look Forward!
■ No persistence layer
■ Light-weight, fast, and scalable
■ UI for Ad-Hoc queries
■ API for programmatic querying
■ Pluggable interface to integrate with streaming data
10. What It’s For
■ Single stream, multiple consumers
■ Ad-hoc interactive usage
■ Programmatic short-lived queries
11. What It’s Not For
■ Repeatable queries
■ Currently no joins
■ Not meant for ETL
13. Querying in Bullet
■ Supports filtering and logical operators on typed data
■ Supports aggregations
● Group By, Count Distincts, Top K, Distributions
● DataSketches based
■ Queries have life spans
● All queries run for a specified duration (or infinitely)
■ Results are Windowed
● Windows can be time or record based
● Raw record or aggregate based
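To make the look-forward idea concrete, here is a minimal Python sketch of a query with a filter and a life span. The class name, its fields, and the record layout are hypothetical and are not Bullet's actual query format; the stream is assumed to be ordered by timestamp.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator

Record = dict  # a typed record is modelled as a plain dict here


@dataclass
class LookForwardQuery:
    """Hypothetical query shape for illustration; not Bullet's real API."""
    predicate: Callable[[Record], bool]  # filter built from logical operators
    submitted_at: float                  # only records at or after this time are seen
    duration: float                      # life span in seconds (float("inf") runs forever)
    max_records: int                     # cap on raw records returned

    def run(self, stream: Iterable[Record]) -> Iterator[Record]:
        emitted = 0
        deadline = self.submitted_at + self.duration
        for record in stream:
            ts = record["timestamp"]
            if ts < self.submitted_at:
                continue  # data from before submission is never seen
            if ts >= deadline or emitted >= self.max_records:
                break     # the query's life span is over
            if self.predicate(record):
                emitted += 1
                yield record


stream = [
    {"timestamp": 0.0, "country": "US", "latency_ms": 120},
    {"timestamp": 2.0, "country": "US", "latency_ms": 480},
    {"timestamp": 3.0, "country": "FR", "latency_ms": 510},
    {"timestamp": 4.0, "country": "US", "latency_ms": 900},
    {"timestamp": 9.0, "country": "US", "latency_ms": 700},
]
query = LookForwardQuery(
    predicate=lambda r: r["country"] == "US" and r["latency_ms"] > 400,
    submitted_at=1.0,
    duration=5.0,
    max_records=10,
)
results = list(query.run(stream))
```

Only the records at timestamps 2.0 and 4.0 are returned: the first record predates submission, the third fails the filter, and the last arrives after the query has expired.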
14. Streaming Aggregations
■ Motivation
● Calculating cardinality
● Getting live latency distributions
● Validate experimentation bucket sizes
■ Aggregations are Hard
● Data skew
● Intermediate results are large and expensive to move
● The longer you run, the more memory you need
● Incremental results can’t be combined
15. Overwhelm Single Combiner
Count Distinct: Naive
1. Read Input
2. Round Robin
3. Extract Field
4. Send to Combiner
5. Count Distincts
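The naive steps above can be simulated in a few lines of Python. This is only an illustrative, single-process sketch (the function and parameter names are made up); it shows why the lone combiner is the bottleneck: it receives one message per input record and must hold every distinct value.

```python
from itertools import cycle


def naive_count_distinct(records, field, num_workers=4):
    # Steps 1-2: read input and spread records round robin over the workers.
    workers = [[] for _ in range(num_workers)]
    for slot, record in zip(cycle(range(num_workers)), records):
        workers[slot].append(record)
    # Steps 3-4: each worker extracts the field and sends every value to one combiner.
    sent_to_combiner = [record[field] for worker in workers for record in worker]
    # Step 5: the single combiner keeps every distinct value in memory.
    distinct = set(sent_to_combiner)
    return len(distinct), len(sent_to_combiner)


records = [{"user": f"u{i % 1000}"} for i in range(5000)]
count, messages = naive_count_distinct(records, "user")
```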
16. Vulnerable to Data Skew
Count Distinct: Typical
1. Read Input
2. Round Robin
3. Extract Field
4. Hash Partition
5. Count Distincts
6. Send Count
7. Combine Counts
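A similar hedged sketch of the hash-partitioned approach, again in plain Python with hypothetical names. Because equal values always land in the same partition, the per-partition counts can be added, but one very frequent value sends most of the traffic to a single partition.

```python
import zlib
from collections import Counter


def partitioned_count_distinct(records, field, num_partitions=4):
    # Steps 1-3: read input and extract the field (round robin workers omitted).
    values = (record[field] for record in records)
    # Step 4: hash partition, so equal values always reach the same counter.
    partitions = [set() for _ in range(num_partitions)]
    load = Counter()  # records received per partition, to expose skew
    for value in values:
        slot = zlib.crc32(str(value).encode()) % num_partitions
        partitions[slot].add(value)
        load[slot] += 1
    # Steps 5-6: each partition counts its own distincts and sends one number.
    counts = [len(p) for p in partitions]
    # Step 7: partitions are disjoint, so the counts can simply be added.
    return sum(counts), load


# One "hot" user produces 90% of the records.
records = [{"user": "hot"}] * 9000 + [{"user": f"u{i}"} for i in range(1000)]
total, load = partitioned_count_distinct(records, "user")
```

The total is correct, but the partition that owns the hot user handles at least 9000 of the 10000 records, which is the data skew problem named in the slide title.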
18. Data Sketches
■ Sketches are a class of stochastic streaming algorithms
■ Provide approximate results (if data is too large)
■ Provable error bounds
■ Fixed memory footprint
■ Mergeable, allowing for parallel processing
19. Data Sketches in Streams
■ Accurate to a Point
● Sketches sized correctly will be 100% accurate
● Error rate shrinks as the size of a Sketch grows (for a Theta Sketch, roughly with the inverse square root of its size)
■ Fixed Memory Ceiling
● Maximum Sketch size is configured in advance
● Memory cost of a query is thus known in advance
■ Allows Non-additive Operations to be Additive
● Sketches can be merged into a single Sketch without overcounting
● Allows tasks to be parallelized and cheaply combined later
● Allows results to be combined across windows of execution
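The properties above can be illustrated with a small K-Minimum-Values (KMV) distinct counter, a simple relative of the Theta Sketch. This is a teaching sketch written for this summary, not the DataSketches library and not Bullet's code; the class name and the choice of hash function are assumptions.

```python
import hashlib
import heapq


class KmvSketch:
    """Keeps only the k smallest hash values seen (illustrative only)."""

    def __init__(self, k=1024):
        self.k = k            # fixed memory ceiling, chosen in advance
        self._heap = []       # max-heap (negated values) of retained hashes
        self._members = set() # same hashes, for duplicate checks

    @staticmethod
    def _hash(item):
        digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") / 2**64  # roughly uniform in [0, 1)

    def update(self, item):
        h = self._hash(item)
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest retained hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def merge(self, other):
        merged = KmvSketch(min(self.k, other.k))
        smallest = sorted(self._members | other._members)[: merged.k]
        merged._members = set(smallest)
        merged._heap = [-h for h in smallest]
        heapq.heapify(merged._heap)
        return merged

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))   # under capacity: exact count
        return (self.k - 1) / -self._heap[0]  # k-th smallest hash sets the scale


a, b, whole = KmvSketch(), KmvSketch(), KmvSketch()
for i in range(15000):            # users 0..14999
    a.update(f"user-{i}")
    whole.update(f"user-{i}")
for i in range(10000, 25000):     # users 10000..24999, overlapping with a
    b.update(f"user-{i}")
    whole.update(f"user-{i}")
merged = a.merge(b)

small = KmvSketch()
for i in range(100):
    small.update(i % 10)          # only 10 distinct values
```

Merging `a` and `b` gives the same estimate as one sketch that saw the whole stream, and the 5000 overlapping users are not counted twice. The `small` sketch never fills up, so its answer is exact.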
20. Bullet’s Use of Data Sketches
Data Sketch             Query Type
Theta Sketch            Count Distinct
Tuple Sketch            Group By
Quantile Sketch         Distributions
Frequent Items Sketch   Top K
21. Windowing
■ A way of breaking up an endless stream into digestible components
■ Typically broken using time or records
■ Needed for incremental results
■ A window is the unit of incrementation
22. Windowing
■ Tumbling Windows*
● Contiguous non-overlapping windows at regular intervals
■ Hopping Windows
● Contiguous (possibly) overlapping windows at regular intervals
■ Sliding Windows*
● Event based windows looking back at regular event intervals
■ Cascading Windows
● Sliding windows that reset at regular intervals too
■ Session Windows
● Sliding windows that reset if distance between events is exceeded
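As a rough illustration of the first two window types, the following Python functions assign a timestamp to its tumbling window and to every hopping window that contains it. The function names are hypothetical, and the sketch assumes windows are half-open intervals [start, end) aligned to time 0.

```python
def tumbling_window(timestamp, size):
    """Return the (start, end) of the single tumbling window holding timestamp."""
    start = (timestamp // size) * size
    return (start, start + size)


def hopping_windows(timestamp, size, hop):
    """Return every (start, end) hopping window holding timestamp, earliest first."""
    windows = []
    start = (timestamp // hop) * hop  # latest window start at or before timestamp
    while start >= 0 and start + size > timestamp:
        windows.append((start, start + size))
        start -= hop
    return windows[::-1]


tumbling = tumbling_window(7, size=5)
hopping = hopping_windows(7, size=10, hop=5)
same_as_tumbling = hopping_windows(7, size=5, hop=5)
```

An event at time 7 falls in exactly one tumbling window but in two overlapping hopping windows. When the hop equals the window size, hopping windows reduce to tumbling windows.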
23. Why Windowing
■ Example: Number of distinct users in the next 60 seconds
■ Option 1: Wait 60 secs to get results
● No feedback :(
■ Option 2: Every 5 secs, get current state until end
● Continuous feedback with same final results
● Stop queries early (sufficient information gleaned, query bad, etc.)
● Quickly iterate queries
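Option 2 can be sketched as a generator that reports the running distinct count every 5 seconds. The function name and event layout are assumptions made for illustration, and a plain set stands in for the Sketch that a real system would use to bound memory.

```python
def windowed_distinct_users(events, duration=60, every=5):
    """Yield (elapsed_seconds, distinct_users_so_far) at the end of each window.

    events: iterable of (seconds_since_submission, user_id), sorted by time.
    """
    seen = set()
    next_emit = every
    for elapsed, user in events:
        if elapsed >= duration:
            break
        while elapsed >= next_emit:  # close every window that has ended
            yield next_emit, len(seen)
            next_emit += every
        seen.add(user)
    while next_emit <= duration:     # flush remaining windows up to the end
        yield next_emit, len(seen)
        next_emit += every


# One event per second for 90 seconds, cycling through 7 users.
events = [(t, f"user-{t % 7}") for t in range(90)]
snapshots = list(windowed_distinct_users(events))
```

The caller gets twelve snapshots, and the last one equals what Option 1 would have returned after the full 60 seconds. Because the result is a generator, a client can stop consuming early once it has seen enough.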
29. Core Design Principles
■ No persistence
● Tradeoff: Query Speed, Low Storage Cost > Repeatability
■ Scale for data and queries
● Each query cost is fixed and negligible, relative to data ingestion
■ Pluggable everything
● Run on top of any stream processor (Spark, Storm, etc.)
● Read from any data source (Kafka, Kinesis, etc.)
● Choose an implementation of the PubSub (Kafka, REST, etc.)
■ Tune everything
● Example: Sketch size vs Sketch accuracy
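One way to picture "pluggable everything" is engine code that depends only on small interfaces. The Python sketch below is purely hypothetical: the interface names, method signatures, and in-memory implementations are invented for illustration and do not reflect Bullet's real interfaces.

```python
from typing import Iterator, Protocol


class DataSource(Protocol):  # hypothetical: could be backed by Kafka, Kinesis, etc.
    def records(self) -> Iterator[dict]: ...


class PubSub(Protocol):      # hypothetical: could be backed by Kafka, REST, etc.
    def publish(self, query_id: str, result: object) -> None: ...


class ListSource:
    """In-memory stand-in for a streaming data source."""

    def __init__(self, rows):
        self.rows = rows

    def records(self):
        return iter(self.rows)


class InMemoryPubSub:
    """In-memory stand-in for the result transport."""

    def __init__(self):
        self.published = {}

    def publish(self, query_id, result):
        self.published[query_id] = result


def run_count(query_id, source: DataSource, pubsub: PubSub, predicate):
    """Engine logic sees only the two interfaces, never a concrete backend."""
    matches = sum(1 for record in source.records() if predicate(record))
    pubsub.publish(query_id, matches)


pubsub = InMemoryPubSub()
source = ListSource([{"ok": True}, {"ok": False}, {"ok": True}])
run_count("q1", source, pubsub, lambda r: r["ok"])
```

Swapping the data source or the PubSub then means supplying another class with the same methods, leaving `run_count` untouched.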
37. In Summary
■ Bullet is a lightweight and cheap stream query engine
■ It offers raw record and OLAP style queries
■ Leverages the power of Data Sketches
■ Only need enough hardware to read data
● Queries are basically free!
■ Abstraction layer that can sit on any Stream Framework
● Implementations available for Storm and Spark
■ Pluggable allowing for consumption from any data source
■ Fully open sourced!!
38. Future Work
■ BQL: SQL-like interface support (already supported in WS)
■ More stream processor support (Flink)
■ All the Windows!
■ More aggregations (Group By Count Distinct)