
A Day in the Life of a Druid Implementor and Druid's Roadmap

Benjamin Hopp (Solutions Architect) @ Imply:
Druid is an emerging standard in the data infrastructure world, designed for high-performance slice-and-dice analytics (“OLAP”-style) on large data sets.
This talk is for you if you’re interested in learning more about pushing Druid’s analytical performance to the limit.
Perhaps you’re already running Druid and are looking to speed up your deployment, or perhaps you aren’t familiar with Druid and are interested in learning the basics. 
Some of the tips in this talk are Druid-specific, but many of them will apply to any operational analytics technology stack.

The most important contributor to a fast analytical setup is getting the data model right. 
The talk centers on the various choices you can make when preparing your data to achieve the best possible query performance.

We’ll look at some general best practices for modeling your data before ingestion, such as OLAP dimensional modeling (called “roll-up” in Druid), data partitioning, and tips for choosing column types and indexes.
We’ll also look at how more can be less: often, storing copies of your data partitioned, sorted, or aggregated in different ways can speed up queries by reducing the amount of computation needed.

We’ll also look at Druid-specific optimizations that take advantage of approximation, where you can trade accuracy for performance and reduced storage.
You’ll be introduced to Druid’s features for approximate counting, set operations, ranking, quantiles, and more.
And we will finish with the latest and greatest Druid news, including details about the latest roadmap and releases.



  1. A Day in the Life of a Druid Architect. Benjamin Hopp, Senior Solutions Architect @ Imply, ben@imply.io
  2. San Francisco Airport Marriott Waterfront. Real-Time Analytics at Scale. https://www.druidsummit.org/
  3. What do I do? Productionalization, implementation, recommendation, education.
  4. Ask a Lot of Questions
     ● What is the use case?
       ○ Is it a good fit for Druid?
     ● Who are the stakeholders?
       ○ End users: running queries
       ○ Data engineers: ingesting data
       ○ Cluster administrators: managing services
     ● How are they using the cluster?
     ● Where is the data coming from?
     ● What are the issues or concerns?
     ● Where does Druid fit in the technology stack?
  5. When to Use Druid
     Search platform:
       ● Real-time ingestion
       ● Flexible schema
       ● Full text search
     OLAP:
       ● Batch ingestion
       ● Efficient storage
       ● Fast analytic queries
     Timeseries database:
       ● Optimized for time-based datasets
       ● Time-based functions
  6. When NOT to Use Druid
     ● OLTP
     ● Individual record updates/deletes
     ● Big join operations
  7. Where Druid Fits In (diagram: raw data from data lakes and message buses flows into Druid for storage and analysis, serving an application)
  8. Cluster Evaluation
  9. Druid Architecture
  10. Pick Your Servers
      Data nodes:
        ● Large-ish
        ● Scale with size of data and query volume
        ● Lots of cores, lots of memory, fast NVMe disk
      Query nodes:
        ● Medium-ish
        ● Scale with concurrency and number of data nodes
        ● Typically CPU bound
      Master nodes:
        ● Small-ish
        ● Coordinator scales with number of segments
        ● Overlord scales with number of supervisors and tasks
  11. Configure for MAXIMUM PERFORMANCE
      Data nodes:
        ● Enable caching
        ● Heap / maxDirectMemory size
        ● druid.processing.buffer.sizeBytes
        ● druid.processing.numMergeBuffers
        ● druid.processing.numThreads
      Query nodes:
        ● Disable caching
        ● Heap / maxDirectMemory size
        ● druid.broker.http.numConnections
        ● druid.processing.numMergeBuffers
        ● druid.processing.numThreads
      Master nodes:
        ● Heap size
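To make this concrete, here is a minimal runtime.properties sketch. The property names come from the slide; every value is a placeholder that must be sized to your hardware, and heap / maxDirectMemory are set separately in jvm.config via -Xmx and -XX:MaxDirectMemorySize.

      # Data (Historical) node: enable caching, size the processing pool
      druid.historical.cache.useCache=true
      druid.historical.cache.populateCache=true
      # numThreads is often set to cores - 1
      druid.processing.numThreads=15
      druid.processing.numMergeBuffers=4
      druid.processing.buffer.sizeBytes=500000000

      # Query (Broker) node: disable caching, widen the connection pool
      druid.broker.cache.useCache=false
      druid.broker.cache.populateCache=false
      druid.broker.http.numConnections=20
      druid.processing.numMergeBuffers=8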
  12. Data Evaluation
  13. Unified Console
  14. Optimize Segment Size
      Ideally 300-700 MB (~5 million rows). To control segment size:
        ● Alter segment granularity
        ● Specify a partition spec
        ● Use automatic compaction
  15. Controlling Segment Size
      ● Number of tasks: keep to the lowest number that supports your maximum ingestion rate.
      ● Segment granularity: increase if there is only one file per segment and it is < 200 MB, e.g. "segmentGranularity": "HOUR"
      ● Max rows per segment: increase if a single segment is < 200 MB, e.g. "maxRowsPerSegment": 5000000
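For orientation, a minimal sketch of where these two settings sit in a native batch ingestion spec (abbreviated; ioConfig and the rest of the dataSchema are omitted, and the values shown are illustrative starting points, not recommendations):

      {
        "type": "index_parallel",
        "spec": {
          "dataSchema": {
            "dataSource": "wikipedia",
            "granularitySpec": {
              "segmentGranularity": "HOUR",
              "queryGranularity": "MINUTE"
            }
          },
          "tuningConfig": {
            "type": "index_parallel",
            "maxRowsPerSegment": 5000000
          }
        }
      }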
  16. Compaction
      ● Combines small segments into larger segments
      ● Useful for late-arriving data
      ● Task submitted to the Overlord:
        { "type": "compact", "dataSource": "wikipedia", "interval": "2017-01-01/2018-01-01" }
  17. Rollup
      ● Pre-aggregation at ingestion time
      ● Saves space, better compression
      ● Query performance boost
  18. Rollup: raw data vs. rolled-up data
      Raw data:
        timestamp            | page          | city | added | deleted
        2011-01-01T00:01:35Z | Justin Bieber | SF   |    10 |       5
        2011-01-01T00:03:45Z | Justin Bieber | SF   |    25 |      37
        2011-01-01T00:05:26Z | Justin Bieber | SF   |    15 |      19
        2011-01-01T00:06:33Z | Ke$ha         | LA   |    30 |      45
        2011-01-01T00:08:51Z | Ke$ha         | LA   |    16 |       8
        2011-01-01T00:09:17Z | Miley Cyrus   | DC   |    75 |      10
        2011-01-01T00:11:25Z | Miley Cyrus   | DC   |    11 |      25
        2011-01-01T00:23:30Z | Miley Cyrus   | DC   |    22 |      12
        2011-01-01T00:49:33Z | Miley Cyrus   | DC   |    90 |      41
      Rolled-up data:
        timestamp            | page          | city | count | sum_added | sum_deleted
        2011-01-01T00:00:00Z | Justin Bieber | SF   |     3 |        50 |          61
        2011-01-01T00:00:00Z | Ke$ha         | LA   |     2 |        46 |          53
        2011-01-01T00:00:00Z | Miley Cyrus   | DC   |     4 |       198 |          88
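A sketch of the dataSchema fragment that would produce the rolled-up table above, using a count aggregator plus longSum aggregators on added and deleted (the names match the table; the rest of the spec is omitted):

      "dataSchema": {
        "dataSource": "wikipedia",
        "dimensionsSpec": { "dimensions": ["page", "city"] },
        "metricsSpec": [
          { "type": "count",   "name": "count" },
          { "type": "longSum", "name": "sum_added",   "fieldName": "added" },
          { "type": "longSum", "name": "sum_deleted", "fieldName": "deleted" }
        ],
        "granularitySpec": { "queryGranularity": "HOUR", "rollup": true }
      }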
  19. Summarize with Data Sketches
      Raw data:
        timestamp            | page          | userid | city | added | deleted
        2011-01-01T00:01:35Z | Justin Bieber | user11 | SF   |    10 |       5
        2011-01-01T00:03:45Z | Justin Bieber | user22 | SF   |    25 |      37
        2011-01-01T00:05:26Z | Justin Bieber | user11 | SF   |    15 |      19
        2011-01-01T00:06:33Z | Ke$ha         | user33 | LA   |    30 |      45
        2011-01-01T00:08:51Z | Ke$ha         | user33 | LA   |    16 |       8
        2011-01-01T00:09:17Z | Miley Cyrus   | user11 | DC   |    75 |      10
        2011-01-01T00:11:25Z | Miley Cyrus   | user44 | DC   |    11 |      25
        2011-01-01T00:23:30Z | Miley Cyrus   | user44 | DC   |    22 |      12
        2011-01-01T00:49:33Z | Miley Cyrus   | user55 | DC   |    90 |      41
      Rolled-up data:
        timestamp            | page          | city | count | sum_added | sum_deleted | userid_sketch
        2011-01-01T00:00:00Z | Justin Bieber | SF   |     3 |        50 |          61 | sketch_obj
        2011-01-01T00:00:00Z | Ke$ha         | LA   |     2 |        46 |          53 | sketch_obj
        2011-01-01T00:00:00Z | Miley Cyrus   | DC   |     4 |       198 |          88 | sketch_obj
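Producing the userid_sketch column takes one more aggregator in the metricsSpec; a minimal sketch, assuming the druid-datasketches extension is loaded:

      { "type": "thetaSketch", "name": "userid_sketch", "fieldName": "userid" }

At query time the Theta sketch gives approximate distinct counts of userid per row group, plus set operations (union, intersection, difference) across groups.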
  20. Choose Column Types Carefully
      ● String columns are indexed (fast filtering), but aggregation and grouping are slower.
      ● Numeric columns are not indexed, but aggregation and grouping are fast.
  21. Partitioning Beyond Time
      ● Druid always partitions by time first
      ● Then decide which dimension to partition on next
      ● Partition on a dimension you often filter on
      ● Improves locality, compression, storage size, and query performance
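As an illustration, secondary partitioning can be requested in the tuningConfig of a parallel native batch job using the single-dimension range partitioning added in 0.17.0 (see the roadmap slides below); the dimension name and row target here are placeholders:

      "tuningConfig": {
        "type": "index_parallel",
        "partitionsSpec": {
          "type": "single_dim",
          "partitionDimension": "page",
          "targetRowsPerSegment": 5000000
        }
      }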
  22. Query Evaluation
  23. Decisions based on data!
  24. Use Druid SQL
      ● Easier to learn, more familiar
      ● Attempts to make intelligent query type choices (timeseries vs. topN vs. groupBy)
      ● Some limitations remain, such as multi-value dimensions; not all aggregations are supported
  25. Explain Plan
      EXPLAIN PLAN FOR
      SELECT channel, SUM(added)
      FROM wikipedia
      WHERE commentLength >= 50
      GROUP BY channel
      ORDER BY SUM(added) DESC
      LIMIT 3
  26. Pick Your Query Carefully
      ● TimeBoundary: returns the min/max timestamp for a given interval
      ● Timeseries: when you don’t want to group by a dimension
      ● TopN: when you want to group by a single dimension
        ○ Approximate if > 1000 dimension values
      ● GroupBy: least performant, most flexible
      ● Scan: for returning streaming raw data
        ○ Perfect ordering not preserved
      ● Select: for returning paginated raw data
      ● Search: returns dimensions that match a text search
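For example, a native topN returning the top 3 pages by sum of added; a minimal sketch with illustrative names and interval:

      {
        "queryType": "topN",
        "dataSource": "wikipedia",
        "intervals": ["2011-01-01/2011-01-02"],
        "granularity": "all",
        "dimension": "page",
        "metric": "sum_added",
        "threshold": 3,
        "aggregations": [
          { "type": "longSum", "name": "sum_added", "fieldName": "added" }
        ]
      }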
  27. Using Lookups
      ● Use lookups for dimensions whose values change, to avoid re-indexing data
      ● Lookups are key/value pairs stored on every node
      ● Loaded from a file or via a JDBC connection to an external database
      ● Lookups are loaded into the Java heap, so large lookups need larger heaps
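The simplest lookup is a static map; a minimal sketch of the definition itself (keys and values are illustrative), which would be registered with the cluster and then referenced from queries, e.g. via the LOOKUP function in Druid SQL with a hypothetical lookup name such as 'user_names':

      {
        "type": "map",
        "map": {
          "user11": "alice",
          "user22": "bob"
        }
      }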
  28. Stay in Touch
      Twitter: @druidio, @implydata
      https://druid.apache.org/ | https://imply.io
      Ben Hopp: Benjamin.hopp@imply.io, LinkedIn: benhopp
  29. Roadmap and Community Update. Ben Hopp, ben@imply.io
  30. Apache Druid 0.17.0
  31. Druid 0.17.0: our first release as a top-level Apache project! Coming soon (really soon).
  32. Druid 0.17.0 Highlights
      ● Native batch: binary inputs & more
        ○ Supports binary formats such as ORC, Parquet, and Avro
        ○ Native batch tasks can now read from HDFS
        ○ Single-dimension range partitioning for parallel native batch
      ● Compaction improvements
        ○ Parallel index task split hints and parallel auto-compaction
        ○ Stateful auto-compaction
      ● Parallel query merge on brokers
        ○ The Broker can now optionally merge query results in parallel using multiple threads
  33. Druid 0.17.0 Highlights
      ● ...and more!
        ○ Improved SQL-compatible null handling
        ○ New Dropwizard emitter supporting counter, gauge, meter, timer, and histogram metric types
        ○ Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in a new sys.supervisors system table
        ○ Fast Historical start, with loading of segments deferred until query time
        ○ New readiness and self-discovery resources
        ○ Task assignment based on MiddleManager categories
        ○ Security updates
  34. Apache Druid 0.16.0
  35. Druid 0.16.0: over 350 new features from 50 contributors! Released September 2019.
  36. Druid 0.16.0 Highlights
      ● Native parallel batch shuffle
        ○ A two-phase shuffle system allows for “perfect rollup” and partitioning on dimensions
      ● Query vectorization, phase one
        ○ Speeds up queries by reducing the number of method calls
      ● Indexer process
        ○ An alternative to the MiddleManager + Peon task execution system that is easier to configure and deploy
      ● Improved web console
        ○ Kafka & Kinesis support!
        ○ Point-and-click reindexing
  40. …and beyond!!
  41. …and beyond!! A selection of items planned for future 2020 Druid releases.
  42. …and beyond!!
      ● SQL joins
        ○ A multi-phase project to add full SQL join support to Druid. Coming up first: sub-queries and lookups
      ● Windowed aggregations
        ○ For example, moving average and cumulative sum aggregations
      ● Dynamic query prioritization & laning
        ○ Mix “heavy” and “light” workloads in the same cluster without heavy workloads blocking light ones
      ● Extended query vectorization support
        ○ Richer support for query vectorization against more query types
  43. Download
      Druid community site (new): https://druid.apache.org/
      Imply distribution: https://imply.io/get-started
  44. Contribute: https://github.com/apache/druid
  45. Stay in Touch
      Follow the Druid project on Twitter: @druidio
      Join the community: http://druid.io/community
      Free training hosted by Imply: https://imply.io/druid-days
