When interacting with analytics dashboards, a smooth user experience hinges on two key requirements: quick response time and data freshness. When building fast, interactive BI dashboards over streaming data, organizations often struggle to select a proper serving layer.
Cluster computing frameworks such as Hadoop or Spark work well for storing large volumes of data, but they are not optimized for making it available for queries in real time. Long query latencies also make these systems suboptimal choices for powering interactive dashboards and BI use cases.
This talk presents an open-source real-time data analytics stack using Apache Kafka, Druid, and Superset. The stack combines the low-latency streaming and processing capabilities of Kafka with Druid, which enables immediate exploration and low-latency queries over the ingested data streams. Superset provides the visualization and dashboarding layer and integrates nicely with Druid. In this talk we will discuss why this architecture is well suited to interactive applications over streaming data, present an end-to-end demo of the complete stack, walk through its key features, and discuss performance characteristics from real-world use cases.
Speaker
Nishant Bangarwa, Software Engineer, Hortonworks
In this talk we will discuss an end-to-end stack using open-source technologies and build a dashboard on top of streaming data.
We will discuss the challenges involved and how each component in the stack addresses those challenges.
As a sample problem, we will look at the edit stream provided by Wikipedia.
Whenever a page is edited on Wikipedia, an edit event is generated that contains details about the edit, such as which page was edited.
Let's break this problem down further.
A sample event from the Wikipedia edit stream contains the following fields –
title of the page edited, URL of the page, IP address of the user, and number of characters added/deleted.
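As a rough illustration, a single edit event might look like the following minimal sketch (the field names, the added timestamp, and the values are assumptions for this example, not the exact Wikipedia wire format):

// Illustrative only: field names, the timestamp, and values are assumptions.
import java.util.Map;

public class SampleEditEvent {
    public static void main(String[] args) {
        Map<String, Object> editEvent = Map.of(
                "timestamp", "2017-06-27T21:31:02Z",                // when the edit happened
                "page", "Apache Druid",                              // title of the page edited
                "url", "https://en.wikipedia.org/wiki/Apache_Druid", // URL of the page
                "user_ip", "203.0.113.42",                           // IP address of the user
                "added", 57,                                         // characters added
                "deleted", 12);                                      // characters deleted
        System.out.println(editEvent);
    }
}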
First, we would like to consume the events as they come from Wikipedia.
Second, we would like to enrich each event by doing an IP lookup and adding geolocation info about the user, such as the city and country from which the edit is being made (a sketch of this step follows this list).
Third, we would like to store these streaming events in a data store from which they can be queried and finally visualized on a dashboard.
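A minimal sketch of the enrichment step (step two above), assuming the event shape shown earlier; lookupGeo() is a hypothetical stand-in for a real IP-to-location service such as a GeoIP database:

import java.util.HashMap;
import java.util.Map;

public class EnrichEvent {

    // Hypothetical stand-in: a real pipeline would query a GeoIP database here.
    static Map<String, String> lookupGeo(String ip) {
        return Map.of("city", "San Francisco", "country", "US");
    }

    // Copy the original event and add the geolocation fields to it.
    static Map<String, Object> enrich(Map<String, Object> event) {
        Map<String, Object> enriched = new HashMap<>(event);
        enriched.putAll(lookupGeo((String) event.get("user_ip")));
        return enriched;
    }

    public static void main(String[] args) {
        Map<String, Object> event = Map.of(
                "page", "Apache Druid", "user_ip", "203.0.113.42", "added", 57);
        System.out.println(enrich(event));
    }
}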
So to solve the Wikipedia problem we need four components –
First, a delivery layer that can move events from one place to another reliably, with guaranteed delivery.
Second, an event processing layer that can process events and transform/enrich them (also termed ETL).
Third, a data storage layer that can provide sub-second queries on incoming data streams.
Finally, a visualization layer that allows creating dashboards on top of the data store; users can interact with the dashboards to gain insights from the data.
Producers write events to a message queue, from which consumers fetch them.
In Apache Kafka, each topic is divided into a set of partitions.
Ordering of events is guaranteed within a partition.
Each producer can produce to multiple partitions.
Each message within a partition is identifiable by an offset.
Consumers consume messages from partitions sequentially and are responsible for keeping track of their own offsets.
Having consumers track their own offsets also helps minimize overhead on the brokers.
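A minimal sketch of producing and consuming edit events with the Kafka Java clients, assuming a broker at localhost:9092 and a topic named wikipedia-edits (both names are assumptions for illustration):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EditStreamExample {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Keying by page title sends all edits of one page to the same
        // partition, so their relative order is preserved.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("wikipedia-edits", "Apache Druid",
                    "{\"page\":\"Apache Druid\",\"added\":57,\"deleted\":12}"));
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "edit-dashboard");
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("wikipedia-edits"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                // Each record carries the partition and offset it was read from.
                System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
            }
        }
    }
}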
Local State – data is stored locally in RocksDB; for recovery, each change to the local state is also propagated as an event to a changelog topic in Kafka.
In case of failure or topology restart, the local state is restored from the changelog topic in Kafka.
The changelog topic is periodically compacted to reduce size.
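The local-state-plus-changelog model described above matches Kafka Streams' state stores; as one concrete (assumed) choice of processing layer, here is a minimal Kafka Streams sketch that counts edits per page, assuming edit events keyed by page title on a wikipedia-edits topic (the topic and store names are assumptions):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class EditCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "edit-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> edits = builder.stream("wikipedia-edits");

        // The count is kept in a local RocksDB store named "edits-per-page";
        // every update to it is also written to an internal changelog topic,
        // which is replayed to rebuild the store after a failure or restart.
        KTable<String, Long> editsPerPage =
                edits.groupByKey().count(Materialized.as("edits-per-page"));

        editsPerPage.toStream().to("edit-counts",
                Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}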
Druid Architecture
Column-oriented distributed datastore – data is stored in a columnar format. Many datasets have a large number of dimensions, e.g. hundreds or thousands, but most queries only need 5-10 of those columns; the column-oriented format lets Druid scan only the required columns.
Sub-second query times – Druid uses various techniques such as bitmap indexes for fast filtering of data, memory-mapped files to serve data from memory, data summarization and compression, and query caching, and it has highly optimized algorithms for different query types. Together these allow it to achieve sub-second query times.
Realtime streaming ingestion from almost any ETL pipeline.
Arbitrary slicing and dicing of data – no need to create pre-canned drill downs
Automatic data summarization – during ingestion Druid can summarize (roll up) your data; e.g., if my dashboard only shows events aggregated by HOUR, we can optionally configure Druid to do this pre-aggregation at ingestion time (see the spec sketch after this list).
Approximate algorithms (HyperLogLog, theta sketches) – for fast approximate answers
Scalable to petabytes of data
Highly available
Retention analysis
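As a rough illustration of the hour-level rollup mentioned above, the relevant fragment of a Druid ingestion spec looks roughly like the following (shown here as a Java string for convenience; the exact spec layout depends on the Druid version and ingestion method):

// Sketch of the ingestion-spec fragment that enables hour-level rollup.
// The surrounding dataSchema, ioConfig, and tuningConfig sections are omitted.
public class RollupSpecFragment {
    static final String GRANULARITY_SPEC = """
            "granularitySpec": {
              "segmentGranularity": "DAY",
              "queryGranularity": "HOUR",
              "rollup": true
            }
            """;

    public static void main(String[] args) {
        System.out.println(GRANULARITY_SPEC);
    }
}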
Druid: Segments
Data in Druid is stored in Segment Files.
Partitioned by time
Ideally, segment files are each smaller than 1GB.
If files are large, smaller time partitions are needed.
Druid has the concept of different node types, where each node type is designed and optimized to perform a specific set of tasks.
Realtime Index Tasks / Realtime Nodes –
Handle real-time ingestion; support both pull- and push-based ingestion.
Handle queries – able to serve queries as soon as data is ingested.
Store incoming data in a write-optimized structure on heap, then periodically convert it into read-optimized, time-partitioned immutable segments and persist them to deep storage.
In case you need to do any ETL, like data enrichment or joining multiple streams of data, you can do it in a separate ETL layer and send your massaged data to Druid.
Deep storage can be any distributed filesystem and acts as a permanent backup of the data.
Historical Nodes -
Main workhorses of a Druid cluster
Use memory-mapped files to load immutable segments
Respond to user queries
Now let's see how the data can be queried.
Broker Nodes -
Keep track of which segments are being loaded by each node in the cluster
Scatter a query across multiple historical and realtime nodes and merge the partial results
Caching layer
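A minimal sketch of querying Druid through the broker, assuming Druid SQL is enabled, the broker listens on localhost:8082, and the datasource is named wikipedia (all three are assumptions; adjust to your deployment):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class QueryBroker {
    public static void main(String[] args) throws Exception {
        // Top ten most-edited pages, expressed as Druid SQL.
        String query = "{\"query\": \"SELECT page, COUNT(*) AS edits "
                + "FROM wikipedia GROUP BY page ORDER BY edits DESC LIMIT 10\"}";

        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://localhost:8082/druid/v2/sql"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(query))
                .build();

        // The broker fans the query out to the relevant historical and
        // realtime nodes and merges their partial results.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}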
Now let's discuss another case: when you do not have streaming data but want to ingest batch data into Druid.
Batch ingestion can be done using either a Hadoop MapReduce or Spark job, which converts your data into time-partitioned segments and persists them to deep storage.
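A hedged sketch of kicking off such a batch job by submitting a task to the Druid overlord, assumed here to listen on localhost:8090; the task spec below is only a placeholder skeleton (a real index_hadoop spec also needs dataSchema, ioConfig, and tuningConfig sections):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SubmitBatchTask {
    public static void main(String[] args) throws Exception {
        // Placeholder spec: fill in the "spec" object before submitting for real.
        String taskSpec = "{\"type\": \"index_hadoop\", \"spec\": {}}";

        HttpRequest request = HttpRequest.newBuilder(
                        URI.create("http://localhost:8090/druid/indexer/v1/task"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(taskSpec))
                .build();

        // The overlord responds with the id of the newly created task.
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}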