Scalable Incremental Index for Druid

Dr. Edward (Eddie) Bortnikov (Senior Director of Research) @ Verizon Media:
Ingestion and queries of real-time data in Druid are performed by a core software component named Incremental Index (I^2).
I^2’s scalability determines how quickly ingested data becomes queryable, as well as the operational efficiency of the Druid cluster.
The current I^2 implementation is based on the traditional ordered JDK key-value (KV) map.
We present an experimental I^2 implementation that is based on a novel data structure named OakMap - a scalable thread-safe off-heap KV-map for Big Data applications in Java.
With OakMap, I^2 can ingest data at almost 2x speed while using 30% less RAM.
The project is expected to become GA in 2020.
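
To make the "ordered JDK KV-map" idea concrete, here is a minimal sketch of a rollup-style index built on the JDK's `ConcurrentSkipListMap`. It is an illustration only, not Druid's code: the names `RollupIndexSketch`, `RowKey`, `add`, and `sumBetween` are hypothetical, and a single `long` metric stands in for the full set of metrics.

```java
import java.util.concurrent.ConcurrentNavigableMap;
import java.util.concurrent.ConcurrentSkipListMap;

/** Hypothetical, simplified rollup index; not Druid's actual incremental index code. */
class RollupIndexSketch {

    /** Key = timestamp + dimension values, ordered by time first. */
    static final class RowKey implements Comparable<RowKey> {
        final long timestamp;
        final String[] dims;

        RowKey(long timestamp, String... dims) {
            this.timestamp = timestamp;
            this.dims = dims;
        }

        @Override
        public int compareTo(RowKey other) {
            int c = Long.compare(timestamp, other.timestamp);
            for (int i = 0; c == 0 && i < Math.min(dims.length, other.dims.length); i++) {
                c = dims[i].compareTo(other.dims[i]);
            }
            // A key with fewer dimensions sorts first, so a key with no
            // dimensions is a lower bound for all rows at its timestamp.
            return c != 0 ? c : Integer.compare(dims.length, other.dims.length);
        }
    }

    // Every distinct key is an on-heap object, as are the boxed values and skip-list nodes.
    private final ConcurrentNavigableMap<RowKey, Long> facts = new ConcurrentSkipListMap<>();

    /** Rollup: rows with the same timestamp and dimensions are summed into one entry. */
    public void add(long timestamp, long metric, String... dims) {
        facts.merge(new RowKey(timestamp, dims), metric, Long::sum);
    }

    /** Range query over the time interval [fromTs, toTs). */
    public long sumBetween(long fromTs, long toTs) {
        long total = 0;
        for (long v : facts.subMap(new RowKey(fromTs), true, new RowKey(toTs), false).values()) {
            total += v;
        }
        return total;
    }

    public int rowCount() {
        return facts.size();
    }

    public static void main(String[] args) {
        RollupIndexSketch index = new RollupIndexSketch();
        index.add(1000L, 3L, "us", "mobile");
        index.add(1000L, 4L, "us", "mobile"); // rolled up into the previous row
        index.add(2000L, 10L, "us", "web");
        System.out.println(index.rowCount() + " rows, sum = " + index.sumBetween(0L, 3000L));
    }
}
```

Each entry in such a map costs several heap objects (key, boxed value, skip-list nodes), which is where the GC pressure discussed in the talk comes from.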

  1. Scalable Incremental Index for Druid. Anastasia Braginsky, Liran Funaro, Eran Meir, Edward Bortnikov (Yahoo Research). Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
  2. Just a bit about us
       • Verizon Media Group (VMG): Creating what's next in content, advertising, and technology with brands like Yahoo, HuffPost, TechCrunch, etc. Reaching 900MM consumers. Very large private cloud for our data. Huge variety of workloads. Open source technologies operated at scale.
       • Yahoo Research: Working in many research areas to build the breakthroughs that power our products at a global scale. AI/ML, Computational Advertising, Search, Content Recommendation, User Modeling, Systems, and more.
       • Scalable Systems Research Team: Part of Yahoo Research Lab, Haifa. Working on novel algorithms and systems to make our products fast, scalable and reliable. Steady flow of contributions to open source technologies: HBase, RocksDB, Phoenix, DataSketches, Druid.
  3. Just a bit about Druid at VMG
       • Used by dozens of products: content analytics, ad analytics, user analytics, network and system analytics, ...
       • Hosting many petabytes of data.
       • Deployed on private and public cloud infrastructure.
       • >1K hosts.
       • Contributing to Druid code since Yahoo times. Example: integrated DataSketches for real-time aggregations over streaming data.
  4. This Talk is About: Data ingestion in Druid … and speeding it up with Oak … a scalable off-heap concurrent key-value map in Java.
  5. Data Model (figure; surviving labels: primary dimension; numeric or sketch (approximate aggregate); numeric or string)
  6. Physical Storage and Serving
       • Workload characteristics: Data is mostly write-once, append-only. Queries mostly focus on the time dimension first.
       • Data organization: Data is ordered by time (sequence of chunks, in the order of ingestion). Chunks consist of segments, ordered internally by time. Druid figures out which segments are required to serve a query.
  7. Druid Architecture (diagram; surviving label: ingestion)
  8. Real-Time Ingestion
       • Works with a variety of data feeds: Kafka, Kinesis, Spark, Local FS, …
       • Real work - Middle Manager (MM) processes: MM runs multiple indexing tasks (Peons). Task = JVM. Experimental architecture: Indexer (replacing MM/Peons). Task = thread.
       • Indexing tasks: Generate data segments and flush them to deep storage (HDFS, S3, …). Serve queries - ingested data is immediately queryable.
  9. Performance Caveats
       • Segment size is crucial for system performance: Queries are slow if segments are too big or too small. Segment granularity is controlled by indexer task settings. Current default size (in rows): 5M rows/segment. Current recommended size (in bytes): 300-700 MB/segment. Druid automatically compacts (merges) small segments in the background.
       • We want relatively big segments: With sketches, 5M rows easily translates to 10-20 GB. RAM is becoming cheaper (128/256 GB/host are common). Storage is becoming faster and is optimized for big transfers. Scanning many small files is inefficient. Background compaction of many small files is inefficient.
  10. Incremental Index (I^2)
       • Core data structure used by indexing tasks: Ordered map of data rows - point updates/point lookups/range queries. Concurrent (thread-safe) - allows parallel reads and writes.
       • Comes in two forms: Plain (collection of facts) or Rollup (on-the-fly aggregation).
       • Built on top of the JDK ConcurrentSkipListMap (CSL): Adapts the Druid row-oriented data model to the KV paradigm. Key = Timestamp + {Dimension}*. Value = {Metric}*.
       • How far can it scale?
  11. GC - the Blessing and the Curse
       • Java Garbage Collection (GC): Manages objects allocated on the JVM heap. Awesome for programmer experience (no memory management headache) … but never really designed for Big Data applications.
       • GC perils: Resource waste: steals CPU cycles + RAM headroom. Poor SLA: GC pauses → software stalls → high tail latencies.
       • Experiment - GC impact on I^2: 10M rows → tens of millions of Java objects. 25% of data ingestion time spent on GC. 2x memory consumed by GC mechanisms to sustain reasonable performance.
       • Image source: jelastic.com
  12. Scaling I^2
       • Goal - eliminate the GC overhead in I^2: Better scaling with data size (up to tens of GB per index). More efficient use of RAM resources (low overhead).
       • How - GC-insensitive implementation: Replace the JDK skiplist with an ordered map with off-heap data storage … with identical (strong) concurrency guarantees … desirably, with orders of magnitude fewer metadata objects.
       • Oak comes to help: Open source, off-heap ordered map by Yahoo. https://github.com/yahoo/Oak
       • Image source: clipartsign.com
  13. Oak, Introduced
       • Features: Concurrent ordered map with atomic semantics. Managed-memory programming experience. Unmanaged-memory performance.
       • Design principles: Data (keys and values) stored off-heap. Metadata (internal search tree) stored on-heap. Custom low-overhead internal GC.
       • API: Traditional JDK API (ConcurrentNavigableMap) for backward compatibility. Zero-copy (ZC) API for the best performance - queries and in-place updates. Stream API optimized for fast scans.
  14. Oak in a Nutshell (diagram)
  15. Scaling with Parallelism (11M KV-pairs) (plots: Put; Get (ZC))
  16. Scaling with Parallelism (11M KV-pairs) (plots: Ascending scan (ZC), 10K pairs/scan; Descending scan (ZC), 10K pairs/scan)
  17. I^2-Oak
       • I^2 implementation on top of OakMap: Configurable at system level (the legacy I^2 is still the default). Minor refactoring of the Druid code (I^2 API abstraction). Implemented as a core part of Druid, but could be an extension to reduce friction.
       • Details: Druid I^2 schema mapped to OakMap keys and values. Leverages the ZC API for queries and in-place aggregation. Auxiliary data structures (e.g., string dictionaries) remain on-heap. Sketch aggregators are currently unsupported.
       • Project status: Code complete. Component- and system-level benchmarks. Community: GitHub issue. GA expected mid-2020.
  18. Druid Ingestion - Scaling with Data Size (plot: ingesting 1M to 7M tuples; tuple size 1.25 KB; 30 GB available RAM)
  19. Druid Ingestion - Scaling with RAM (plot: ingesting 7M tuples; tuple size 1.25 KB; RAM scaling 25 GB to 32 GB)
  20. Druid Ingestion - RAM overhead (plot)
  21. Summary
       • GC overhead is critical for real-time ingestion performance.
       • Solution - off-heap incremental index: Implementation based on Oak. Better scalability and RAM efficiency vs the legacy.
       • Contribution to Druid: Implementation under community review, expected GA mid-2020.
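
The off-heap design principle from the Oak slides (data off-heap, search metadata on-heap, in-place aggregation) can be illustrated with plain JDK classes. The sketch below is an assumption-laden toy, not Oak's implementation or API: `OffHeapMetricsSketch` and its methods are hypothetical names, keys stay on-heap for brevity (Oak stores keys off-heap too), the slab has a fixed capacity, and a single coarse lock replaces Oak's fine-grained concurrency control.

```java
import java.nio.ByteBuffer;
import java.util.TreeMap;

/** Hypothetical toy: metric values live in a direct (off-heap) buffer, the ordered index on-heap. */
class OffHeapMetricsSketch {
    // Direct buffers are allocated outside the Java heap, so the GC does not
    // trace or move the metric payload stored in them.
    private final ByteBuffer slab;
    // On-heap ordered index: key -> byte offset of the 8-byte metric slot.
    private final TreeMap<Long, Integer> index = new TreeMap<>();
    private int nextOffset = 0;

    OffHeapMetricsSketch(int maxRows) {
        slab = ByteBuffer.allocateDirect(maxRows * Long.BYTES);
    }

    /** In-place aggregation: adds delta to the slot of key, creating the slot on first use. */
    public synchronized void aggregate(long key, long delta) {
        Integer offset = index.get(key);
        if (offset == null) {
            if (nextOffset + Long.BYTES > slab.capacity()) {
                throw new IllegalStateException("slab is full");
            }
            offset = nextOffset; // fresh slots are zero-filled by allocateDirect
            nextOffset += Long.BYTES;
            index.put(key, offset);
        }
        slab.putLong(offset, slab.getLong(offset) + delta);
    }

    /** Point lookup; returns dflt when the key is absent. */
    public synchronized long getOrDefault(long key, long dflt) {
        Integer offset = index.get(key);
        return offset == null ? dflt : slab.getLong(offset);
    }

    /** Ordered scan over keys in [fromKey, toKey), reading values straight from the slab. */
    public synchronized long sumRange(long fromKey, long toKey) {
        long total = 0;
        for (int offset : index.subMap(fromKey, true, toKey, false).values()) {
            total += slab.getLong(offset);
        }
        return total;
    }

    public synchronized int rows() {
        return index.size();
    }
}
```

Repeated updates to the same key rewrite eight bytes in place instead of allocating a new boxed value, which is the effect the talk attributes to in-place aggregation; the per-key index entries remain the only GC-visible objects in this toy.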