Scalable Incremental Index for Druid

Anastasia Braginsky, Liran Funaro, Eran Meir, Edward Bortnikov
Yahoo Research

Just a bit about us
Verizon Media Group (VMG)
Creating what's next in content, advertising, and technology with brands like Yahoo, HuffPost, TechCrunch, etc.
Reaching 900MM consumers.
Very large private cloud for our data. Huge variety of workloads. Open source technologies operated at scale.
Yahoo Research
Working in many research areas to build the breakthroughs that power our products at a global scale.
AI/ML, Computational Advertising, Search, Content Recommendation, User Modeling, Systems, and more.
Scalable Systems Research Team
Part of Yahoo Research Lab, Haifa.
Working on novel algorithms and systems to make our products fast, scalable and reliable.
Steady flow of contributions to open source technologies: HBase, RocksDB, Phoenix, DataSketches, Druid.

Just a bit about Druid at VMG
Used by dozens of products
Content analytics, Ad analytics, User analytics, Network and System analytics, ...
Hosting many petabytes of data
Deployed on Private and Public Cloud infrastructure
>1K hosts
Contributing to Druid code since Yahoo times
Example - integrated DataSketches for real-time aggregations over streaming data.

This Talk is About
Data ingestion in Druid
… and speeding it up with Oak
… a scalable off-heap concurrent key-value map in Java

Data Model
[Figure: example data table with column annotations "Primary dimension", "Numeric or String", and "Numeric or Sketch (approximate aggregate)".]

Physical Storage and Serving
Workload characteristics
Data is mostly write-once, append-only.
Queries mostly focus on the time dimension first.
Data organization
Data is ordered by time (sequence of chunks, in the order of ingestion).
Chunks consist of segments, ordered internally by time.
Druid figures out which segments are required to serve a query.
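
The segment selection described above can be sketched as an interval-overlap test. This is a minimal illustration, not Druid's actual code: the Segment record, its field names, and the half-open interval convention are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentPruning {
    /** Hypothetical segment descriptor covering the half-open interval [startMillis, endMillis). */
    record Segment(String id, long startMillis, long endMillis) {}

    /** Returns the segments whose interval overlaps the query interval [queryStart, queryEnd). */
    static List<Segment> segmentsFor(List<Segment> all, long queryStart, long queryEnd) {
        List<Segment> hits = new ArrayList<>();
        for (Segment s : all) {
            // Two half-open intervals overlap when each one starts before the other ends.
            if (s.startMillis() < queryEnd && queryStart < s.endMillis()) {
                hits.add(s);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Segment> segments = List.of(
            new Segment("hour-0", 0L, 3_600_000L),
            new Segment("hour-1", 3_600_000L, 7_200_000L),
            new Segment("hour-2", 7_200_000L, 10_800_000L));
        // Query spanning the second half of hour 0 and the first half of hour 1.
        System.out.println(segmentsFor(segments, 1_800_000L, 5_400_000L));
    }
}
```

Because the data is organized by time, a query restricted to a time range only has to touch the segments that pass this test.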

Druid Architecture
[Figure: Druid architecture diagram, with the "ingestion" part labeled.]

Real-Time Ingestion
Works with a variety of data feeds
Kafka, Kinesis, Spark, Local FS, …
Real work - Middle Manager (MM) processes
MM runs multiple indexing tasks (Peons). Task = JVM.
Experimental architecture: Indexer (replacing MM/Peons). Task = thread.
Indexing tasks
Generate data segments and flush them to deep storage (HDFS, S3, …)
Serve queries - ingested data is immediately queryable

Performance Caveats
Segment size crucial for system performance
Queries are slow if segments are too big or too small.
Segment granularity is controlled by indexer task settings.
Current default size (in rows): 5M rows/segment
Current recommended size (in bytes): 300-700 MB/segment.
Druid automatically compacts (merges) small segments in the background.
We want relatively big segments
With sketches, 5M rows easily translates to 10-20 GB.
RAM is becoming cheaper (128/256 GB/host are common).
Storage is becoming faster, optimized for big transfers.
Scanning many small files is inefficient.
Background compactions of many small files are inefficient.
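
A quick back-of-the-envelope check of the numbers above, as a sketch. It assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes); the class and method names are made up for the example.

```java
class SegmentSizeEstimate {
    /** Average bytes per row implied by a segment's total size and row count. */
    static long bytesPerRow(long segmentBytes, long rows) {
        return segmentBytes / rows;
    }

    /** Number of rows of the given average size that fit into a byte budget. */
    static long rowsThatFit(long budgetBytes, long bytesPerRow) {
        return budgetBytes / bytesPerRow;
    }

    public static void main(String[] args) {
        long rows = 5_000_000L;     // default rows per segment, from the slide
        long gb = 1_000_000_000L;   // decimal gigabyte (assumption)
        long mb = 1_000_000L;       // decimal megabyte (assumption)

        long small = bytesPerRow(10 * gb, rows);  // 2000 bytes per row
        long large = bytesPerRow(20 * gb, rows);  // 4000 bytes per row
        System.out.println(small + " to " + large + " bytes per row");

        // Rows that fit in the recommended 300-700 MB segment at those row sizes.
        System.out.println(rowsThatFit(300 * mb, large) + " to "
            + rowsThatFit(700 * mb, small) + " rows");
    }
}
```

Under these assumptions a sketch-heavy row averages 2-4 KB, so the byte-based recommendation corresponds to roughly 75K-350K rows, far below the 5M-row default.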

Incremental Index (I2)
Core data structure used by indexing tasks
Ordered map of data rows - point updates/point lookups/range queries.
Concurrent (thread safe) - allows parallel reads and writes.
Comes in two forms
Plain (collection of facts) or Rollup (on-the-fly aggregation).
Built on top of JDK ConcurrentSkipListMap (CSL)
Adapts the Druid row-oriented data model to the KV paradigm.
Key = Timestamp + {Dimension}*. Value = {Metric}*.
How far can it scale?
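
The key/value mapping and the rollup form can be sketched on top of the JDK skiplist as follows. This is a simplified illustration, not Druid's implementation: RowKey, the single summed metric, and the way dimension values are compared are all assumptions made for the example.

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.LongAdder;

class RollupIndexSketch {
    /** Hypothetical simplified key: timestamp plus the row's dimension values. */
    record RowKey(long timestamp, List<String> dimensions) {}

    // Order by time first, then by dimension values, so a time range is a contiguous key range.
    // Joining with a NUL separator is a simplification; it assumes values never contain NUL.
    static final Comparator<RowKey> ORDER =
        Comparator.comparingLong(RowKey::timestamp)
                  .thenComparing(k -> String.join("\0", k.dimensions()));

    // Value = one summed metric; a real index would keep one aggregator per metric.
    private final ConcurrentSkipListMap<RowKey, LongAdder> facts =
        new ConcurrentSkipListMap<>(ORDER);

    /** Rollup ingestion: rows with an equal key are aggregated into a single entry. */
    void add(long timestamp, List<String> dimensions, long metric) {
        facts.computeIfAbsent(new RowKey(timestamp, dimensions), k -> new LongAdder())
             .add(metric);
    }

    /** Range query: sum of the metric over all keys with timestamp in [from, to). */
    long sum(long from, long to) {
        RowKey lo = new RowKey(from, List.of());  // smallest possible key at 'from'
        RowKey hi = new RowKey(to, List.of());    // smallest possible key at 'to'
        return facts.subMap(lo, true, hi, false).values().stream()
                    .mapToLong(LongAdder::sum).sum();
    }

    /** Number of distinct keys, i.e. rows after rollup. */
    int rowCount() {
        return facts.size();
    }
}
```

Note that every entry here costs several heap objects (key, dimension list, strings, aggregator, skiplist nodes), which is the object count the next slides are concerned with.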

GC - the Blessing and the Curse
Java Garbage Collection (GC)
Manages objects allocated on the JVM heap.
Awesome for programmer experience (no memory management headache).
… but never really designed for Big Data applications.
GC Perils
Resource waste: steals CPU cycles + RAM headroom.
Poor SLA: GC pauses → software stalls → high tail latencies.
Experiment - GC impact on I2
10M rows → tens of millions of Java objects.
25% of data ingestion time spent on GC.
2x memory consumed by GC mechanisms to sustain reasonable performance.
Source: jelastic.com

Scaling I2
Goal: eliminate the GC overhead in I2
Better scaling with data size (up to tens of GB per index).
More efficient use of RAM resources (low overhead).
How: GC-insensitive implementation
Replace the JDK skiplist with an ordered map with off-heap data storage.
… with identical (strong) concurrency guarantees.
… desirably, orders of magnitude fewer metadata objects.
Oak comes to help
Open source, off-heap ordered map by Yahoo.
https://github.com/yahoo/Oak
Source: clipartsign.com

Oak, Introduced
Features
Concurrent ordered map with atomic semantics.
Managed-memory programming experience.
Unmanaged-memory performance.
Design principles
Data (keys and values) stored off-heap.
Metadata (internal search tree) stored on-heap.
Custom low-overhead internal GC.
API
Traditional JDK API (ConcurrentNavigableMap) for backward compatibility.
Zero-copy (ZC) API for the best performance - queries and in-place updates.
Stream API optimized for fast scans.
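
The split between off-heap data and on-heap metadata, and the idea of an in-place update, can be illustrated with plain JDK classes. This sketch does not use Oak's API and is not how Oak is implemented: OffHeapCounters is a hypothetical name, only values (not keys) are kept off-heap, a single lock replaces Oak's fine-grained concurrency, and the per-key skiplist index does not reproduce Oak's compact metadata.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ConcurrentSkipListMap;

class OffHeapCounters {
    // Off-heap slab holding one 8-byte counter per key; its contents are not JVM heap objects.
    private final ByteBuffer slab;
    // On-heap ordered index: key -> byte offset of the key's counter inside the slab.
    private final ConcurrentSkipListMap<Long, Integer> index = new ConcurrentSkipListMap<>();
    private int nextOffset = 0;

    OffHeapCounters(int maxEntries) {
        this.slab = ByteBuffer.allocateDirect(maxEntries * Long.BYTES);
    }

    /** Adds delta to the counter for key, rewriting the off-heap bytes in place. */
    synchronized void add(long key, long delta) {
        Integer offset = index.get(key);
        if (offset == null) {
            if (nextOffset + Long.BYTES > slab.capacity()) {
                throw new IllegalStateException("slab is full");
            }
            offset = nextOffset;
            nextOffset += Long.BYTES;
            slab.putLong(offset, 0L);
            index.put(key, offset);
        }
        // In-place aggregation: no new value object is allocated per update.
        slab.putLong(offset, slab.getLong(offset) + delta);
    }

    /** Reads the counter straight from the slab; returns 0 for an unknown key. */
    synchronized long get(long key) {
        Integer offset = index.get(key);
        return offset == null ? 0L : slab.getLong(offset);
    }
}
```

Reads and updates go directly against the buffer, which is the spirit of a zero-copy API: the value is never materialized as a separate heap object.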

Oak in a Nutshell

Scaling with Parallelism (11M KV-pairs)
[Figure: two benchmark panels, "Put" and "Get (ZC)".]

Scaling with Parallelism (11M KV-pairs)
[Figure: two benchmark panels, "Ascending scan (ZC), 10K pairs/scan" and "Descending scan (ZC), 10K pairs/scan".]

I2-Oak
I2 implementation on top of OakMap
Configurable at system level (the legacy I2 is still the default).
Minor refactoring of the Druid code (I2 API abstraction).
Implemented as a core part of Druid, but could be an extension to reduce friction.
Details
Druid I2 schema mapped to OakMap keys and values.
Leverages the ZC API for queries and in-place aggregation.
Auxiliary data structures (e.g., string dictionaries) remain on-heap.
Sketch aggregators currently unsupported.
Project Status
Code complete. Component- and system-level benchmarks.
Community: Git issue. GA expected mid-2020.

Druid Ingestion - Scaling with Data Size
Ingesting 1M to 7M tuples
Tuple size 1.25KB
30GB available RAM

Druid Ingestion - Scaling with RAM
Ingesting 7M tuples
Tuple size 1.25KB
RAM scaling 25GB to 32GB

Druid Ingestion - RAM overhead

Summary
GC overhead is critical for real-time ingestion performance
Solution - off-heap incremental index
Implementation based on Oak.
Better scalability and RAM efficiency vs. the legacy implementation.
Contribution to Druid
Implementation under community review, expected GA mid-2020.
