Druid
Rostislav Pashuto
November, 2015
The pattern
● we have to scale: our current storage can no longer support our growth
○ horizontal scaling for data which is doubling, quadrupling, …
○ compression
○ cost-effective, please
● we want near real-time reports
○ sub-second queries
○ multi-tenancy
● we need real-time ingestion
○ insights on events immediately after they occur
● we need something stable and maintained
○ highly available
○ open source solution with active community
Once upon a time
“Over the last twelve months, we tried and failed to achieve scale and speed
with relational databases (Greenplum, InfoBright, MySQL) and NoSQL offerings
(HBase). So instead we did something crazy: we rolled our own database. Druid is
the distributed, in-memory OLAP data store that resulted.” — Eric Tschetter,
April 2011
Started in 2011, open sourced in 2012, and under the Apache 2.0 license since
February 20, 2015.
Druid is a fast, column-oriented, distributed, not-only-in-memory data store
designed for low-latency ingestion, ad-hoc aggregations, and keeping years of
history.
Druid
Pros
● sub-second aggregate operations for most use cases
● real-time streaming and batch ingestion
● denormalized data
● horizontal scalability with linear performance
● active community
Cons
● lack of real joins
● limited query power compared to SQL/MDX
Druid: checklist
You need
● fast aggregations and exploratory analytics
● sub-second queries for near real-time analysis
● a data store with no single point of failure (SPoF)
● to store a lot of events (trillions of events, petabytes of data) which you can define as a
set of dimensions
● to process denormalized data, which is not completely unstructured data
● basic search (regexp included) to be enough for you
Druid in production
Existing production clusters, according to the druid.io whitepaper:
● 3+ trillion events/month
● 3M+ events/sec through Druid's real-time ingestion
● 100+ PB of raw data
● 50+ trillion events
● Thousands of queries per second for applications used by
thousands of users
● Tens of thousands of cores
Case: GumGum
GumGum, a digital marketing platform, reported about
3 billion events per day ingested in real time, roughly 5 TB of new data per day, on:
● Brokers – 2 m4.xlarge (Round-robin DNS)
● Coordinators – 2 c4.large
● Historical (Cold) – 2 m4.2xlarge (1 x 1000GB EBS SSD)
● Historical (Hot) – 4 m4.2xlarge (1 x 250GB EBS SSD)
● Middle Managers – 15 c4.4xlarge (1 x 300GB EBS SSD)
● Overlords – 2 c4.large
● Zookeeper – 3 c4.large
● MySQL – RDS – db.m3.medium
More: http://goo.gl/tKKmw5
Case: GumGum
Production
Netflix
Netflix engineers use Druid to aggregate multiple data streams, ingesting up to two
terabytes per hour, with the ability to query data as it's being ingested. They use Druid to
pinpoint anomalies within their infrastructure, endpoint activity and content flow.
Paypal
The Druid production deployment at PayPal processes a very large volume of data and
is used for internal exploratory analytics by business analytics teams.
Xiaomi
Xiaomi uses Druid as an analytics tool to analyze online advertising data.
More: http://druid.io/druid-powered.html
Sample data
Wikipedia “edit” events
Druid cluster and the flow of data through the cluster
Components
● Realtime Node
● Historical Node
● Broker Node
● Coordinator Node
● Indexing Service
Real-time nodes
Ingest and query event streams. Events indexed via
these nodes are immediately available for querying.
Buffer incoming events to an in-memory index, which is
regularly persisted to disk.
On a periodic basis, persisted indexes are then merged
together before getting handed off.
Queries will hit both the in-memory and persisted
indexes.
Later, during the handoff stage, a real-time node
uploads segments to permanent backup storage called
“deep storage”, typically a distributed file system like S3
or HDFS.
Real-time nodes leverage Zookeeper for coordination
with the rest of the Druid cluster.
One of the largest production Druid clusters is able
to consume raw data at approximately 500 MB/s
(150,000 events/s, or 2 TB/hour).
Historical nodes
Encapsulate the functionality to load and serve the immutable blocks of data (segments)
created by real-time nodes.
Nodes share nothing and only know how to load, drop, and serve immutable segments.
Historical nodes announce their online state and the data they are serving in Zookeeper.
Instructions to load and drop segments are sent over Zookeeper.
Before a historical node downloads a particular segment from deep storage, it first checks a
local cache that maintains information about what segments already exist on the node.
Once processing is complete, the segment is announced in Zookeeper. At this point, the
segment is queryable.
Segments must be loaded in memory before they can be queried; Druid supports memory-
mapped files for this.
During a Zookeeper outage, historical nodes can still respond to queries for the data they
are currently serving, but can no longer load new data or drop outdated data.
Broker nodes
Route queries to historical and real-time
nodes.
Merge partial results from historical and
real-time nodes.
Understand which segments are queryable
and where those segments are located.
On a ZK failure, brokers use the last known
view of the cluster.
Coordinator nodes
In charge of data management and distribution on historical nodes.
Tell historical nodes to load new data, drop outdated data, replicate data, and
move data to balance load.
Undergo a leader-election process that determines a single node to run the
coordinator functionality; the remaining coordinators act as redundant backups.
During ZK downtime, coordinators can no longer send instructions.
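Data management is rule-driven. A sketch of a retention rule chain that keeps the last
month of data with two replicas and drops everything older (the field names follow the
2015-era docs and may differ across Druid versions; the tier name and replica counts
are illustrative). Rules are evaluated top-down, and the first match wins.
[
  { "type": "loadByPeriod", "period": "P1M", "tieredReplicants": { "_default_tier": 2 } },
  { "type": "dropForever" }
]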
Indexing service
Consists of
● Overlord (manages task distribution to middle managers)
● Middle Managers (create peons for running tasks)
● Peons (run a single task in a single JVM)
Creates and destroys segments.
Components overview
Ingestion
Streaming data (does not guarantee exactly-once processing)
● a stream processor (like Apache Samza or Apache Storm)
● Kafka support
You can use the Tranquility library to send event streams to Druid.
Batch Data
● Hadoop-based indexing
● Index task (a sample task spec follows below)
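A minimal batch index task sketch for the Wikipedia sample data, assuming the
2015-era task spec layout (the data source name, columns, and file path here are
illustrative):
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "wikipedia",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "timestampSpec": { "column": "timestamp", "format": "auto" },
          "dimensionsSpec": { "dimensions": ["page", "language", "user"] }
        }
      },
      "metricsSpec": [
        { "type": "count", "name": "count" },
        { "type": "doubleSum", "name": "added", "fieldName": "added" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "intervals": ["2013-08-31/2013-09-01"]
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": { "type": "local", "baseDir": "examples/indexing/", "filter": "wikipedia_data.json" }
    }
  }
}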
Lambda Architecture
The Druid team recommends running a streaming real-time pipeline to run queries over events as they occur, and a
batch pipeline to perform periodic cleanups of the data.
Data Formats
● JSON
● CSV
● A custom delimited form such as TSV
● Protobuf
Multi-value dimensions are supported (see the parse spec sketch below).
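A sketch of a CSV parse spec with one multi-value dimension, "tags", split on "|"
(the column names are illustrative; field names follow the 2015-era docs):
{
  "format": "csv",
  "timestampSpec": { "column": "timestamp", "format": "auto" },
  "columns": ["timestamp", "page", "language", "tags", "added"],
  "listDelimiter": "|",
  "dimensionsSpec": { "dimensions": ["page", "language", "tags"] }
}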
Storage format
● data tables, called data sources, are collections of timestamped events
partitioned into a set of segments
● each segment is typically 5–10 million rows
● storage format is highly optimized for linear scans
Replication
● replication and distribution are done at a segment level
● Druid’s data distribution is segment-based and leverages a highly available
"deep" storage such as S3 or HDFS. Scaling up (or down) does not require
massive copy actions or downtime; in fact, losing any number of historical
nodes does not result in data loss because new historical nodes can always
be brought up by reading data from "deep" storage.
QUERYING
Key differences
Limited support for joins → query-time lookups, which replace a dimension value
with another value.
No official SQL support → 3rd-party drivers.
Immutable dimension data → re-index the specific data segment.
No “>” and “<” support in query filters (available: AND, OR, NOT, regex,
JavaScript (Rhino), IN as lookups, partial match) → a JavaScript filter as a
workaround for “more”/“less”:
{
"type" : "javascript",
"dimension" : "age",
"function" : "function(x) { return(x >= '21' &&
x <= '35') }"
}
Timeseries
{
  "queryType": "timeseries",
  "dataSource": "city",
  "granularity": "day",
  "aggregations": [{ "type": "count", "name": "total" }],
  "intervals": ["2012-01-01T00:00:00.000/2016-01-04T00:00:00.000"],
  "filter": {
    "type": "and",
    "fields": [
      { "type": "javascript", "dimension": "income", "function": "function(x) { return x > 100 }" }
    ]
  }
}
-----------------------------------------------------------------------------------------------------------------------------------------
[
{"timestamp": "2015-06-09T00:00:00.000Z", "result": {"total": 112}},
{"timestamp": "2015-06-10T00:00:00.000Z", "result": {"total": 117}}
]
TopN
TopN is much faster and more resource-efficient than groupBy for this use case: an ordered groupBy over a single dimension.
{
  "queryType": "topN",
  "dataSource": "city",
  "granularity": "day",
  "dimension": "who",
  "threshold": 3,
  "metric": "count",
  "aggregations": [...],
  ...
}
-----------------------------------------------------------------------------------------------------------------------------------------
[
{
"timestamp": "2015-06-09T00:00:00.000Z",
"result": [
{ "who": "Bob", "count": 100 },
{ "who": "Alice", "count": 40 },
{ "who": "Jane", "count": 15 }
]
}
]
GroupBy
● Use "timeseries" to do straight aggregates for some time range.
● Use "topN" for an ordered groupBy over a single dimension.
{
"queryType": "groupBy",
"dataSource": "twitterstream",
"granularity": "all",
"dimensions": ["lang", "utc_offset"],
"aggregations":[
{ "type": "count", "name": "rows"},
{ "type": "doubleSum", "fieldName": "tweets", "name": "tweets"}
],
"filter": { "type": "selector", "dimension": "lang", "value": "en" },
"intervals":["2012-10-01T00:00/2020-01-01T00"]
}
-----------------------------------------------------------------------------------------------------------------------------------------
[{ "version": "v1",
"timestamp": "2012-10-01T00:00:00.000Z",
"event": {
"utc_offset": "-10800",
"tweets": 90,
"lang": "en",
"rows": 81
}
}...
Other
Time Boundary. Returns the earliest and latest data points of a data set.
Segment Metadata. Returns per-segment information: cardinality, byte size, column types, segment intervals, etc.
Data Source Metadata. Returns the timestamp of the latest ingested event.
Search. Returns dimension values that match the search specification.
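The metadata queries are the smallest specs in the API. A time boundary query, for example, needs only a data source ("city" reuses the sample above); segment metadata works the same way with an added "intervals" field:
{
  "queryType": "timeBoundary",
  "dataSource": "city"
}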
Filters
● exact match ("=")
● and
● not
● or
● regex (Java regular expressions)
● JavaScript
● extraction (similar to "in")
● search (capture partial search match)
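Filters compose. A sketch of an "and" filter combining an exact match with a negated regex (the dimension names are illustrative):
{
  "type": "and",
  "fields": [
    { "type": "selector", "dimension": "lang", "value": "en" },
    { "type": "not", "field": { "type": "regex", "dimension": "page", "pattern": "^Talk:" } }
  ]
}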
Aggregations
● count
● min/max
● JavaScript (All JavaScript functions must return numerical values)
● cardinality (by value and by row)
● hyperUnique (uses HyperLogLog to compute the estimated cardinality of a dimension
that has been aggregated as a "hyperUnique" metric at indexing time)
● filtered aggregator (wraps any given aggregator, but only aggregates the values for which the given
dimension filter matches)
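A sketch of an "aggregations" array mixing these; it assumes a "user_unique" hyperUnique metric was defined at indexing time, and the other field names reuse the twitterstream sample:
"aggregations": [
  { "type": "count", "name": "rows" },
  { "type": "hyperUnique", "name": "unique_users", "fieldName": "user_unique" },
  {
    "type": "filtered",
    "filter": { "type": "selector", "dimension": "lang", "value": "en" },
    "aggregator": { "type": "doubleSum", "name": "en_tweets", "fieldName": "tweets" }
  }
]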
Post Aggregations
● arithmetic (applies the provided function to the given fields from left to right)
● field accessor (returns the value produced by the specified aggregator)
● constant (always returns the specified value)
● JavaScript
● hyperUniqueCardinality (wraps a hyperUnique object so that it can be used in post-aggregations)
...
"aggregations" : [
{ "type" : "count", "name" : "rows" },
{ "type" : "doubleSum", "name" : "tot", "fieldName" : "total" }],
"postAggregations" : [{
"type" : "arithmetic",
"name" : "average",
"fn" : "*",
"fields" : [
{ "type" : "arithmetic",
"name" : "div",
"fn" : "/",
"fields" : [
{ "type" : "fieldAccess", "name" : "tot", "fieldName" : "tot" },
{ "type" : "fieldAccess", "name" : "rows", "fieldName" : "rows" }]
},
...
Party
Druid in a party: Spark
Druid is designed to power analytic applications and focuses on the latencies to
ingest data and serve queries over that data. If you were to build an application
where users could arbitrarily explore data, the latencies seen by using Spark will
likely be too slow for an interactive experience.
Druid in a party: SQL on Hadoop
Druid was designed to
1. be an always-on service
2. ingest data in real time
3. handle slice-and-dice style ad-hoc queries
SQL-on-Hadoop engines generally sidestep Map/Reduce, instead querying data directly from HDFS or, in
some cases, other storage systems. Some of these engines (including Impala and Presto) can be
co-located with HDFS data nodes and coordinate with them to achieve data locality for queries. What does
this mean? We can talk about it in terms of three general areas:
1. Queries
2. Data Ingestion
3. Query Flexibility
Resources
● http://druid.io/
● Druid in a nutshell http://static.druid.io/docs/druid.pdf
● Druid API https://github.com/druid-io/druid-api
● Analytic UI http://imply.io/
● 3rd-party SQL interface https://github.com/srikalyc/Sql4D
THANK YOU