Dr. Edward (Eddie) Bortnikov (Senior Director of Research) @ Verizon Media:
Ingestion and queries of real-time data in Druid are performed by a core software component named Incremental Index (I^2).
I^2’s scalability determines how quickly ingested data becomes queryable, as well as the operational efficiency of the Druid cluster.
The current I^2 implementation is based on the traditional ordered JDK key-value (KV) map.
We present an experimental I^2 implementation that is based on a novel data structure named OakMap - a scalable thread-safe off-heap KV-map for Big Data applications in Java.
With OakMap, I^2 can ingest data at almost 2x speed while using 30% less RAM.
The project is expected to become GA in 2020.
Scalable Incremental Index for Druid
1. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Scalable Incremental Index
for Druid
Anastasia Braginsky, Liran Funaro, Eran Meir, Edward Bortnikov
Yahoo Research
Just a bit about us
Verizon Media Group (VMG)
Creating what's next in content, advertising, and technology with brands like Yahoo, HuffPost, TechCrunch, etc.
Reaching 900MM consumers.
Very large private cloud for our data. Huge variety of workloads. Open source technologies operated at scale.
Yahoo Research
Working in many research areas to build the breakthroughs that power our products at a global scale.
AI/ML, Computational Advertising, Search, Content Recommendation, User Modeling, Systems, and more.
Scalable Systems Research Team
Part of Yahoo Research Lab, Haifa.
Working on novel algorithms and systems to make our products fast, scalable and reliable.
Steady flow of contributions to open source technologies: HBase, RocksDB, Phoenix, DataSketches, Druid.
Just a bit about Druid at VMG
Used by dozens of products
Content analytics, Ad analytics, User analytics, Network and System analytics, ...
Hosting many petabytes of data
Deployed on Private and Public Cloud infrastructure
>1K hosts
Contributing to Druid code since Yahoo times
Example - integrated DataSketches for real-time aggregations over streaming data.
This Talk is About
Data ingestion in Druid
… and speeding it up with Oak
… a scalable off-heap concurrent key-value map in Java
Data Model
[Figure: data model table. Surviving labels: primary dimension; numeric or string; numeric or sketch (approximate aggregate).]
Physical Storage and Serving
Workload characteristics
Data is mostly write-once, append-only.
Queries mostly focus on the time dimension first.
Data organization
Data is ordered by time (sequence of chunks, in the order of ingestion).
Chunks consist of segments, ordered internally by time.
Druid figures out which segments are required to serve a query.
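The time-first organization above suggests how segment selection can work: keep segments indexed by start time and pick only those that overlap the query interval. The following is a minimal sketch of that idea, not Druid's actual lookup code. The class SegmentTimeline, its methods, and the assumption that segments never overlap each other are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical illustration only; not Druid's actual segment-lookup code.
final class SegmentTimeline {
    // Segment start time (epoch millis) -> segment end time (exclusive).
    // Assumes segments do not overlap each other.
    private final NavigableMap<Long, Long> segments = new TreeMap<>();

    void addSegment(long startMillis, long endMillis) {
        segments.put(startMillis, endMillis);
    }

    // Returns the start times of all segments overlapping [queryStart, queryEnd).
    // Assumes queryStart <= queryEnd.
    List<Long> segmentsFor(long queryStart, long queryEnd) {
        List<Long> result = new ArrayList<>();
        // A segment that starts at or before queryStart may still cover it.
        Map.Entry<Long, Long> first = segments.floorEntry(queryStart);
        long from = (first != null && first.getValue() > queryStart)
                ? first.getKey()
                : queryStart;
        for (Map.Entry<Long, Long> e
                : segments.subMap(from, true, queryEnd, false).entrySet()) {
            result.add(e.getKey());
        }
        return result;
    }
}
```

With segments [0, 100), [100, 200) and [200, 300), a query over [150, 250) touches only the last two, so the first segment is never scanned.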
Druid Architecture
[Figure: Druid architecture diagram; the ingestion path is labeled.]
Real-Time Ingestion
Works with a variety of data feeds
Kafka, Kinesis, Spark, Local FS, …
Real work - Middle Manager (MM) processes
MM runs multiple indexing tasks (Peons). Task = JVM.
Experimental architecture: Indexer (replacing MM/Peons). Task = thread.
Indexing tasks
Generate data segments and flush them to deep storage (HDFS, S3, …)
Serve queries - ingested data is immediately queryable
Performance Caveats
Segment size crucial for system performance
Queries are slow if segments are too big or too small.
Segment granularity is controlled by indexer task settings.
Current default size (in rows): 5M rows/segment
Current recommended size (in bytes): 300-700 MB/segment.
Druid automatically compacts (merges) small segments in the background.
We want relatively big segments
With sketches, 5M rows easily translate to 10-20 GB.
RAM is becoming cheaper (128/256 GB/host are common).
Storage is becoming faster, optimized for big transfers
Scanning many small files is inefficient.
Background compaction of many small files is inefficient.
Incremental Index (I^2)
Core data structure used by indexing tasks
Ordered map of data rows - point updates/point lookups/range queries.
Concurrent (thread safe) - allows parallel reads and writes.
Comes in two forms
Plain (collection of facts) or Rollup (on-the-fly aggregation).
Built on top of the JDK ConcurrentSkipListMap (CSL)
Adapts the Druid row-oriented data model to the KV paradigm.
Key = Timestamp + {Dimension}*. Value = {Metric}*.
How far can it scale?
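To make the KV mapping concrete, here is a minimal sketch of a rollup-style index on top of the JDK ConcurrentSkipListMap. It is a simplified model, not Druid's code. RollupIndexSketch, RowKey, and the single summed metric are assumptions made for illustration.

```java
import java.util.Comparator;
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical, simplified model of a rollup index; all names are illustrative.
final class RollupIndexSketch {
    // Key = timestamp + dimension values.
    static final class RowKey {
        final long timestamp;
        final String[] dims;

        RowKey(long timestamp, String... dims) {
            this.timestamp = timestamp;
            this.dims = dims;
        }
    }

    // Order by timestamp first, then by dimensions; a shorter prefix sorts first.
    static final Comparator<RowKey> ORDER = (a, b) -> {
        int c = Long.compare(a.timestamp, b.timestamp);
        for (int i = 0; c == 0 && i < Math.min(a.dims.length, b.dims.length); i++) {
            c = a.dims[i].compareTo(b.dims[i]);
        }
        return c != 0 ? c : Integer.compare(a.dims.length, b.dims.length);
    };

    // Value = a single metric (a running sum) in this sketch.
    private final ConcurrentNavigableMap<RowKey, Long> facts =
            new ConcurrentSkipListMap<>(ORDER);

    // Rollup: rows with an equal key are aggregated on the fly.
    void add(long timestamp, long metric, String... dims) {
        facts.merge(new RowKey(timestamp, dims), metric, Long::sum);
    }

    // Point lookup; returns null if the row is absent.
    Long get(long timestamp, String... dims) {
        return facts.get(new RowKey(timestamp, dims));
    }

    // Range query: number of rolled-up rows with timestamp in [from, to).
    int rowsBetween(long from, long to) {
        // A key with no dimensions is the smallest key for its timestamp.
        return facts.subMap(new RowKey(from), true, new RowKey(to), false).size();
    }
}
```

Every row here costs several on-heap objects (key, array, strings, boxed value, skiplist nodes), which is where the scaling question starts.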
GC - the Blessing and the Curse
Java Garbage Collection (GC)
Manages objects allocated on the JVM heap.
Awesome for programmer experience (no memory management headache).
… but never really designed for Big Data applications.
GC Perils
Resource waste: steals CPU cycles + RAM headroom.
Poor SLA: GC pauses → software stalls → high tail latencies.
Experiment - GC impact on I^2
10M rows → tens of millions of Java objects.
25% of data ingestion time spent on GC.
2x memory consumed by GC mechanisms to sustain reasonable performance.
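One way to observe this kind of overhead is to read the JVM's collector counters around a bulk insert. The sketch below is not the methodology behind the numbers above; it only shows the standard java.lang.management API. GcShareProbe and the choice of a small long[] value per row are assumptions.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical probe; not the benchmark used for the slide's numbers.
final class GcShareProbe {
    // Accumulated GC time in milliseconds, summed over all collectors.
    static long gcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 if undefined for this collector
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Inserts `rows` small on-heap entries and returns reported GC time
    // divided by the wall-clock time of the insert loop.
    static double gcShareOfInsert(int rows) {
        ConcurrentSkipListMap<Long, long[]> map = new ConcurrentSkipListMap<>();
        long gcBefore = gcMillis();
        long start = System.nanoTime();
        for (long i = 0; i < rows; i++) {
            map.put(i, new long[] {i, i}); // several heap objects per row
        }
        long elapsedMs = Math.max(1, (System.nanoTime() - start) / 1_000_000);
        return (double) (gcMillis() - gcBefore) / elapsedMs;
    }
}
```

The result depends heavily on heap size, collector, and row count, so it is only meaningful when compared across runs with the same JVM settings.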
Source: jelastic.com
Scaling I^2
Goal: eliminate the GC overhead in I^2
Better scaling with data size (up to tens of GB per index).
More efficient use of RAM resources (low overhead).
How: GC-insensitive implementation
Replace the JDK skiplist with an ordered map with off-heap data storage.
… with identical (strong) concurrency guarantees.
… desirably, orders of magnitude fewer metadata objects.
Oak comes to help
Open source, off-heap ordered map by Yahoo.
https://github.com/yahoo/Oak
Source: clipartsign.com
Oak, Introduced
Features
Concurrent ordered map with atomic semantics.
Managed-memory programming experience.
Unmanaged-memory performance.
Design principles
Data (keys and values) stored off-heap.
Metadata (internal search tree) stored on-heap.
Custom low-overhead internal GC.
API
Traditional JDK API (ConcurrentNavigableMap) for backward compatibility.
Zero-copy (ZC) API for the best performance - queries and in-place updates.
Stream API optimized for fast scans.
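The split between off-heap data and on-heap metadata can be illustrated with a toy map. This is not Oak's API or implementation: OffHeapCounters is hypothetical, it keeps keys on-heap (Oak stores them off-heap), and it uses a coarse lock where Oak provides fine-grained concurrency.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical toy map; NOT Oak's API or implementation.
final class OffHeapCounters {
    // Off-heap storage for values; invisible to the Java garbage collector.
    private final ByteBuffer arena;
    // On-heap metadata: key -> byte offset of the value inside the arena.
    private final ConcurrentSkipListMap<Long, Integer> index =
            new ConcurrentSkipListMap<>();
    private int next = 0;

    OffHeapCounters(int capacity) {
        arena = ByteBuffer.allocateDirect(capacity * Long.BYTES);
    }

    // Adds delta to the counter for key, updating the off-heap slot in place.
    synchronized void add(long key, long delta) {
        Integer offset = index.get(key);
        if (offset == null) {
            if (next + Long.BYTES > arena.capacity()) {
                throw new IllegalStateException("arena full");
            }
            offset = next;
            next += Long.BYTES;
            arena.putLong(offset, 0L);
            index.put(key, offset);
        }
        // In-place update: no new Java object is created for the value.
        arena.putLong(offset, arena.getLong(offset) + delta);
    }

    // Reads the value straight from the arena; null if the key is absent.
    synchronized Long get(long key) {
        Integer offset = index.get(key);
        return offset == null ? null : arena.getLong(offset);
    }
}
```

Updating a counter rewrites eight bytes in the direct buffer instead of allocating a new boxed value, which is the effect the in-place update path aims for.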
Oak in a Nutshell
Scaling with Parallelism (11M KV-pairs)
[Figure panels: Put; Get (ZC).]
Scaling with Parallelism (11M KV-pairs)
[Figure panels: Ascending scan (ZC), 10K pairs/scan; Descending scan (ZC), 10K pairs/scan.]
I^2-Oak
I^2 implementation on top of OakMap
Configurable at system level (the legacy I^2 is still the default).
Minor refactoring of the Druid code (I^2 API abstraction).
Implemented as a core part of Druid but could be an extension to reduce friction.
Details
Druid I^2 schema mapped to OakMap keys and values.
Leverages the ZC API for queries and in-place aggregation.
Auxiliary data structures (e.g., string dictionaries) remain on-heap.
Sketch aggregators currently unsupported.
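Mapping a row schema to off-heap keys means serializing the key and comparing keys directly on their bytes. The sketch below shows one possible layout and is not the actual I^2-Oak encoding. KeyCodecSketch, the byte layout, and the arrival-order dictionary ids are all assumptions.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical key codec; not the actual I^2-Oak serialization format.
final class KeyCodecSketch {
    // On-heap string dictionary: dimension value -> integer id,
    // assigned in arrival order. Not thread-safe in this sketch.
    private final Map<String, Integer> dictionary = new HashMap<>();

    // Layout: [timestamp: 8 bytes][dim count: 4 bytes][dim id: 4 bytes]*
    ByteBuffer encode(long timestamp, String... dims) {
        ByteBuffer buf = ByteBuffer.allocateDirect(
                Long.BYTES + Integer.BYTES * (1 + dims.length));
        buf.putLong(timestamp).putInt(dims.length);
        for (String d : dims) {
            Integer id = dictionary.get(d);
            if (id == null) {
                id = dictionary.size();
                dictionary.put(d, id);
            }
            buf.putInt(id);
        }
        buf.flip();
        return buf;
    }

    // Compares two encoded keys without deserializing them into objects:
    // timestamp first, then dimension ids, then dimension count.
    static int compare(ByteBuffer a, ByteBuffer b) {
        int c = Long.compare(a.getLong(0), b.getLong(0));
        if (c != 0) {
            return c;
        }
        int na = a.getInt(8);
        int nb = b.getInt(8);
        for (int i = 0; i < Math.min(na, nb); i++) {
            c = Integer.compare(a.getInt(12 + 4 * i), b.getInt(12 + 4 * i));
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(na, nb);
    }
}
```

Because ids follow arrival order, keys with the same timestamp sort by id rather than alphabetically. That is enough for grouping equal rows, though a real implementation may need a different order.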
Project Status
Code complete. Component- and system-level benchmarks.
Community: Git issue. GA expected mid-2020.
Druid Ingestion - Scaling with Data Size
Ingesting 1M to 7M tuples
Tuple size 1.25KB
30GB available RAM
Druid Ingestion - Scaling with RAM
Ingesting 7M tuples
Tuple size 1.25KB
RAM scaling 25GB to 32GB
Druid Ingestion - RAM overhead
Summary
GC overhead is critical for real-time ingestion performance
Solution - off-heap incremental index
Implementation based on Oak.
Better scalability and RAM efficiency vs the legacy.
Contribution to Druid
Implementation under community review, expected GA mid-2020.