A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
Erik Freed (Flurry/Yahoo) erikfreed@yahooinc.com
Brian Anderson (Flurry/Yahoo) briananderson@yahooinc.com
Abstract
We present Burst, an analytic query system with a scalable and flexible approach to performing low-latency ad hoc analysis over large complex datasets. The architecture consists of hardware-efficient scan techniques and a language facility to transform an extensible set of ad hoc declarative queries into imperative physical scan plans. These plans are multicast across all nodes/cores of a two-level sharded/distributed ingestion, storage, and execution topology and executed. The first release of this system is the query engine behind the Flurry Explorer product. Here we explore the design details of that system as well as the incremental ingestion pipeline enhancement currently being implemented for the next major release.
NOTE: This copy has had performance numbers updated and is not the same as the one submitted to Tech Pulse.
1. Introduction
The promise of the Flurry Explorer Product is to invite the user into an unstructured interactive discovery session
where they can easily pose arbitrary off-the-cuff and potentially complex questions about end-user behavior. If they
get back answers quickly enough then their next question starts a virtuous cycle of more targeted questions
continuously leading to more specific and valuable results. The first major release of the back end query engine
engineered to fully support this type of exploration was developed in the Flurry Analytics group in Q1 2015 and
delivered as part of a limited beta of the Explorer feature within Flurry Analytics. We successfully utilized a unique
hyper distributed/parallel/concurrent object tree scanning model with a simple daily batched ingestion system for
this limited audience. The next major release of this scanning architecture replaces the batched ingestion system
with a more scalable incremental data ingestion pipeline to expand the reach of Explorer to all Flurry customers.
Here we present the architectural basis and specifics of the previous and upcoming release.
2. Background
For those of us who have spent any time with production scale SQL databases, seeing large table scans being sorted
and joined in a query plan is cause for panic. We can only relax once we find a way to constrain that query and/or
implement heavyweight indices so the query transforms into pure index lookups and partial joins. However, for analytics the use cases are inherently unbounded, personalized, and constantly evolving, while the corpora are typically enormous. This makes adding indices intractable in most cases. These limitations forced us to reevaluate
our previous nemesis, the full table scan. We determined that if we could make the scans efficient enough, distribute
the scans across enough nodes and CPU cores, and develop a query language that could take an arbitrary ad hoc analytic question and transform it into an instance of this hyper parallel/distributed/concurrent scan model, then we would have an attractively simple general-purpose model. We reasoned that this model would scale well not only in
terms of input size and general query complexity, but in terms of feature development time, risk, and effort.
3. Top Level View
The basic components of the Burst ecosystem are:
1. External Datasource(s)
2. Ingestion Subsystem
3. Data Model
4. Sample Store
5. Dataset Store
6. Query Subsystem
The previous release of Burst had a simplified batched ingestion model where the exporting MapReduce jobs wrote
the entire history of a given mobile application’s event stream into new HDFS sequence files on a daily basis. These
datasets were then read into memory on demand as users posed queries. This initial beta pipeline design is being
replaced by the incremental version described in subsequent sections. The rest of the architecture described here is
as currently deployed.
Each of these components (other than the external data sources) is deployed on one or more clusters called a Cell, where each Cell comprises a master node, a failover master node, and a set of worker nodes. Each Cell has its own Apache Kafka [KAFKA], Apache HBase [HBASE], and Apache Spark [SPARK] clusters deployed. The Master
(and failover Master) node contains the master process for each of these systems as well as a Docker [DOCKER]
container populated with all of the Burst-specific JVM service processes. The Worker nodes are populated only by the associated Spark, HBase, and Kafka worker-specific deployments. Burst does not itself deploy anything directly
onto Worker nodes.
4. Data Sources
Burst is inherently schema independent as well as agnostic to the specific technology of the external datasource.
However, the data source must have the following basic characteristics:
1. It must be in a schema that can be expressed in the relationships and datatypes of the Burst Data Model.
2. The external data model can be partitioned into two levels of well defined shards:
a. The first level is composed of a set of Domain instances that each represent a subset of data that is the input to a single query, e.g. for Flurry Explorer, an event stream associated with a single ‘Mobile Application’ or constructed ‘Mobile Application Group’. A query can only be executed against a single Domain at a time.
b. The second level is a strict partitioning across a Domain creating order-independent subsets of Item instances, each of which has a well defined rooted acyclic object model (tree) that can be scanned in a depth-first, preferably time-ordered, traversal. For Flurry Explorer, this is a single ‘Mobile Device’, each of which has a set of time-ordered ‘sessions’, each of which has a set of time-ordered ‘events’, each of which has a set of unordered key-value map ‘parameters’.
3. The external data source physical form can be exported as both a periodic historical batch and a continuous incremental update and fed to the Burst Kafka-based Ingestion API, e.g. for Flurry this is our 2,000-node, six-petabyte, ~50-trillion-event, ever-growing HBase cluster of mobile device events, with custom MapReduce jobs performing both the initializing batch and daily incremental update feeds.
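The two-level Domain/Item partition described above can be sketched as plain types. This is a minimal sketch with hypothetical names and fields, not the actual schema-driven Burst Data Model:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the two-level partition model: a Domain holds the
// Items for one queryable subset (e.g. one mobile application), and each
// Item is a rooted acyclic object tree scanned in a depth-first traversal.
public class PartitionSketch {
    record Event(long time, Map<String, String> parameters) {}
    record Session(long startTime, List<Event> events) {}   // time-ordered events
    record Item(String deviceId, List<Session> sessions) {} // one scannable tree
    record Domain(String appId, List<Item> items) {}        // order-independent Items

    // Depth-first traversal of one Item tree, counting every object visited.
    static long scan(Item item) {
        long visited = 1; // the Item root itself
        for (Session s : item.sessions()) {
            visited += 1;
            for (Event e : s.events()) visited += 1 + e.parameters().size();
        }
        return visited;
    }

    public static void main(String[] args) {
        Item item = new Item("device-1", List.of(
            new Session(0, List.of(new Event(1, Map.of("k", "v"))))));
        System.out.println(scan(item)); // root + session + event + 1 parameter = 4
    }
}
```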
5. Ingestion
The new Burst Ingestion Subsystem design starts with Kafka queues that provide a control-plane (control/administration) and a data-plane (data feeds). The data source is responsible for sending and responding to control-plane messages, as well as feeding the data-plane in response to them. An Apache Spark based process model manages control-plane and data-plane operations. It is responsible for transforming the schema of the external system into an appropriate Burst schema, updating the Sample Store as data arrives.
6. Data Model
The Burst Data Model has the following requirements/features/implementation details:
1. It is schema independent, but schema defined.
2. It is schema versioned, and supports heterogeneous versioned collections.
3. The data model/schema supports type structures, singular and plural structure reference relationships, value
collections, value maps, and atomic data types (boolean, byte, short, int, long, double, string)
4. The data model/schema inherently defines a tree with a well defined root as part of a well defined traversal.
5. Data is encoded in a single byte array where the disk storage encoding is identical to the in-memory format.
6. This encoding is an unrolled depth-first traversal of the object tree as a linear sequence of bytes. The read from disk into memory and the traversal scans proceed in the same exact byte order and thus can take direct advantage of the OS disk mmap semantics with the associated high-performance kernel buffer management and aggressive prefetching. The data can be cached in memory or not, depending on your preferences with respect to repeated queries on identical datasets.[1]
7. All interpretation of atomic data fields is done in-situ within the byte array, on demand, iff any given field is accessed in a query. The data model structures are never deserialized and no ephemeral objects are created. This is similar to columnar storage in that it eliminates much of the cost of accessing unused columns in standard bulk-serializing models, but with a higher degree of inherent simplicity and attendant efficiency. A truly ad hoc system, where it is not known what fields will be accessed at what frequency, if at all, is not an ideal columnar storage candidate.
8. Fetching, in-memory storage, and scans of the data model generate zero JVM objects; they bypass the JVM memory model as well. The byte sequence traversal is scanned using efficient stack-based protocols with data accesses performed via ‘unsafe off-heap’ libraries.[2] The problems associated with large JVM heaps are minimized as none of this memory is actually ‘seen’ by the JVM; the JVM processes have quite small heap sizes.
9. There are various optimizations for immutable encodings, e.g. for value maps we store the keys and the values as twin sorted arrays, using a binary search to look up key values. We also use dictionaries to reduce string storage requirements.
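A minimal sketch of the in-situ access pattern in point 7, using a direct `ByteBuffer` as a stand-in for the off-heap byte array. The record layout here (an 8-byte timestamp followed by a 4-byte code per record) is invented for illustration and is not Burst's actual schema-driven encoding:

```java
import java.nio.ByteBuffer;

// Illustrative in-situ field access: values are read directly out of one
// off-heap byte sequence on demand, with no per-record deserialization and
// no ephemeral objects created during the scan.
public class InSituScan {
    static final int RECORD_BYTES = Long.BYTES + Integer.BYTES; // 12 bytes/record

    // Sum the int field of every record, touching only the bytes we need.
    static long sumCodes(ByteBuffer data, int recordCount) {
        long sum = 0;
        for (int i = 0; i < recordCount; i++) {
            // Skip past the 8-byte timestamp; read the int field in place.
            sum += data.getInt(i * RECORD_BYTES + Long.BYTES);
        }
        return sum;
    }

    public static void main(String[] args) {
        ByteBuffer data = ByteBuffer.allocateDirect(3 * RECORD_BYTES); // off-heap
        for (int i = 0; i < 3; i++) {
            data.putLong(i * RECORD_BYTES, 1000L + i);         // timestamp (never read below)
            data.putInt(i * RECORD_BYTES + Long.BYTES, i + 1); // code field
        }
        System.out.println(sumCodes(data, 3)); // 1 + 2 + 3 = 6
    }
}
```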
[1] Burst may support streaming query processing in a future release.
[2] ‘Unsafe’ refers to a design pattern where Java code is written using the same techniques the Java libraries use to access non-JVM heap memory (e.g. network & disk I/O). It is called unsafe because JVM manufacturers do not offer support for these lower-level libraries, even though they are extensively used and quite reliable.
7. Sample Store
The Burst architecture uses an Apache HBase key-value store to reliably and efficiently store the continuous, largely unordered, incremental feed of assorted Item updates from assorted Domains coming from one or more external data sources. This data is stored in one of a plurality of tables, each called a Province.[3] Each arriving update is a new cell, encoded in the Burst Data Model, in a row keyed by the specific Item, Domain, and Channel[4] in the single Province table where the given Domain is hosted.
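A hedged sketch of how such a row key might be composed. The field widths, ordering, and identifier types here are assumptions for illustration, not the actual Sample Store layout:

```java
import java.nio.ByteBuffer;

// Hypothetical Sample Store row key: the paper states each update lands in a
// row keyed by Item, Domain, and Channel within a Province table. Leading
// with the Domain id groups all of one Domain's rows together for scanning.
public class RowKeySketch {
    static byte[] rowKey(long domainId, long itemId, int channelId) {
        return ByteBuffer.allocate(Long.BYTES * 2 + Integer.BYTES)
                .putLong(domainId)  // first-level partition
                .putLong(itemId)    // second-level partition
                .putInt(channelId)  // ingestion channel artifact
                .array();
    }

    public static void main(String[] args) {
        byte[] key = rowKey(42L, 7L, 1);
        System.out.println(key.length); // 8 + 8 + 4 = 20 bytes
    }
}
```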
8. Dataset Store
For a query to be executed over a Domain, the appropriate rows in the Sample Store and the appropriate update cells for each Item must be scanned and transformed into a Dataset in Brio Data Model encoding. This transformation is called melding and happens locally on each worker node. Each node creates and stores a single partition of the Dataset.[5] These partitions are the most recent ‘view’ of the data as a single byte array cached on local disk (magnetic or solid state). When a query is executed, if the local Worker node has cached the partition, and if it is not considered ‘stale’, then it is read directly from disk and no meld is required. The melding can also customize the dataset by down-sampling Items, along with other forms of object tree filtering, if it is desired to reduce the dataset's size for performance/resource-utilization reasons. It is also possible to have more than one defined and reified custom Dataset ‘view’ per Domain.
Caching
It is vital that the Dataset partitions be loaded into memory quickly and released aggressively in order to manage expensive/limited DRAM resources efficiently. The load of a Dataset partition is a simple mmap() call of a single file as a single byte array into off-heap memory managed directly by the OS. The scan can proceed before the file has been fully read due to the natural OS semantics of paged disk reads with linear-order prefetching. Since there
[3] Provinces are used to subdivide the overall dataset into separate tables so that efficient table operations can be used to manage, move, and clean up data as needed in manageable chunks.
[4] An Ingestion API/Sample Store management artifact.
[5] I.e. without replication or fault tolerance. In the case of worker node failure, these dataset partitions are recreated on whatever replica location is targeted by HBase/Spark for the next query.
are essentially zero on-heap artifacts associated with this load, the release of the byte array has minimal GC implications. In this way, the local disk, especially if it is SSD, acts as a cost-effective second-level DRAM cache.[6]
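In Java, this mmap-style partition load can be sketched with the standard `FileChannel.map` API. This is a simplified stand-in for the actual partition loader: the mapped region lives in off-heap memory managed by the OS, which pages it in lazily with read-ahead, and releasing the buffer has minimal GC impact:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Sketch of an mmap()-based partition load: the cached partition file is
// mapped into off-heap memory in one call and scanned in linear byte order,
// so the OS prefetcher does the heavy lifting and the JVM heap stays small.
public class PartitionLoad {
    static long sumBytes(Path partitionFile) throws IOException {
        try (FileChannel ch = FileChannel.open(partitionFile, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long sum = 0;
            while (buf.hasRemaining()) sum += buf.get(); // stand-in for the real scan
            return sum;
        }
    }

    public static void main(String[] args) throws IOException {
        Path f = Files.createTempFile("partition", ".bin");
        Files.write(f, new byte[] {1, 2, 3, 4});
        System.out.println(sumBytes(f)); // 10
        Files.delete(f);
    }
}
```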
9. Query Engine
The Query Subsystem has an API that consists of a programmer-friendly declarative query language called SILQ which is translated into a machine-friendly imperative query language called GIST. Both of these are textual languages with a well defined grammar and syntax.[7] The details of this are described in [SILQ]. Here we will say that these languages provide a rich and extensible set of aggregation, dimensioning, filtering, and causal/temporal reasoning features. Burst clients form their queries as SILQ, which the SILQ pipeline transforms into GIST. The GIST pipeline transforms those into well defined execution plans that are multicast to worker nodes. The multidimensional result model is gathered and delivered back to the client.
Execution Models
These execution plans contain:
1. Traversal Model: a simple numeric-array-based state machine holding the semantics of what to do where in the object tree traversal
2. Result Schema: the semantics of all aggregations, dimensions, merges, and joins
3. Closures: filters and traversal data model updates in generated and JIT-optimized JVM byte code
4. Routes: log-structured records of graph automata paths
Zap Data Structures
Because of the extreme number of objects visited and the prolific object churn associated with standard data structures, Burst requires specialized data structures, called Zap[8] structures, for inner loops. These are designed to use nothing but simple off-heap blocks of memory, preallocated in per-thread chunks, reused over and over again, and with all needed functions coded using unsafe access patterns. There are just two of these currently:[9]
● Zap Maps: The object tree scan requires a nested overlay of lightweight hash maps with the ability to join[10] with child/peer maps on the fly as the traversal unfolds from parent to child. The ways these nested self-joins can be expressed is an important part of how GIST creates complex ad hoc multidimensional result models. The performance of Zap Maps is a key factor in the overall performance of the system.
[6] If desired, a future version of Burst may support ‘streaming’ semantics where the scan is executed as the data is read from disk and never cached in memory.
[7] Very convenient for unit and system testing!
[8] ‘Zero Allocation Protocol’.
[9] We are working on another structure, a Zap Lexicon, that eliminates the use of standard JVM strings, which are quite noisy from the perspective of JVM object creation.
[10] Something like a cross join.
● Zap Routes: For causal/temporal reasoning we implemented an off-heap log-structured recording structure with a graph automata to discover and capture ‘paths’ through sequences of events. This is how ‘Funnels’ are implemented in the Explorer product.
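A minimal illustration of the zero-allocation idea behind these structures: a map over a single preallocated off-heap block with open addressing, so inner-loop use creates no JVM objects. The slot layout and the nonzero-key convention here are assumptions for the sketch, not actual Zap Map internals:

```java
import java.nio.ByteBuffer;

// Zero-allocation-style map: one preallocated off-heap block holds fixed-size
// (key, value) slots of two longs each, probed linearly. After construction,
// put() and get() allocate nothing on the JVM heap.
public class ZapStyleMap {
    private static final int SLOT_BYTES = 2 * Long.BYTES;
    private final ByteBuffer block; // direct buffers are zero-initialized
    private final int slots;        // must be a power of two

    ZapStyleMap(int slots) {
        this.slots = slots;
        this.block = ByteBuffer.allocateDirect(slots * SLOT_BYTES);
    }

    // Keys must be nonzero: key 0 marks an empty slot. The sketch assumes the
    // map is never filled to capacity (no resize, no full-table guard).
    void put(long key, long value) {
        int i = (int) (key & (slots - 1));
        while (true) {
            long k = block.getLong(i * SLOT_BYTES);
            if (k == 0 || k == key) {
                block.putLong(i * SLOT_BYTES, key);
                block.putLong(i * SLOT_BYTES + Long.BYTES, value);
                return;
            }
            i = (i + 1) & (slots - 1); // linear probing
        }
    }

    long get(long key, long missing) {
        int i = (int) (key & (slots - 1));
        while (true) {
            long k = block.getLong(i * SLOT_BYTES);
            if (k == key) return block.getLong(i * SLOT_BYTES + Long.BYTES);
            if (k == 0) return missing;
            i = (i + 1) & (slots - 1);
        }
    }

    public static void main(String[] args) {
        ZapStyleMap m = new ZapStyleMap(16);
        m.put(3, 30);
        m.put(19, 190); // 19 & 15 == 3: collides with key 3, probes to slot 4
        System.out.println(m.get(3, -1) + " " + m.get(19, -1) + " " + m.get(5, -1));
        // 30 190 -1
    }
}
```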
Concurrency
Because each of the Item instances in a Dataset partition is part of a sequence of individual, order-independent object trees, we refine our concurrency model to a single core/thread dedicated to each traversal. Each of these can be executed in parallel on available cores using a fixed pool model. This makes the hardware happy, as the linear byte array being scanned is read solely by a single core.
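This one-thread-per-Item-tree model can be sketched with a standard fixed thread pool. The traversal body below is a trivial stand-in for the real scan; the point is that each Item's linear byte array is read by exactly one thread, and results are gathered at the end:

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the fixed-pool concurrency model: each Item's encoded byte array
// is traversed by a single thread, with traversals running in parallel.
public class PartitionScan {
    // Stand-in traversal: "scan" one Item's linear encoding by summing it.
    static long scanItem(byte[] itemBytes) {
        long sum = 0;
        for (byte b : itemBytes) sum += b;
        return sum;
    }

    static long scanPartition(List<byte[]> items, int cores) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        try {
            List<Future<Long>> futures = pool.invokeAll(
                items.stream()
                     .map(item -> (Callable<Long>) () -> scanItem(item))
                     .toList());
            long total = 0;
            for (Future<Long> f : futures) total += f.get(); // gather results
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<byte[]> items = List.of(new byte[] {1, 2}, new byte[] {3}, new byte[] {4, 5});
        System.out.println(scanPartition(items, 2)); // 15
    }
}
```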
Spark
Like the Ingestion Subsystem, the Query Subsystem is built on top of Apache Spark[11] with a Spark Executor on each worker node initialized with a Query Kernel that can execute scan plans. The scan traversals are carefully designed to use a minimum of JVM memory and create a minimum of JVM objects. There is essentially no JVM memory overhead to the storage and execution models other than that created by the IPC protocols.
10. Performance
Because of the efficiency of the scanning techniques involved, one can think of Burst as an ‘objects scanned per second’ machine, and so the performance of queries is almost exclusively about how many objects the query needs to visit. As an example, in the Flurry mobile analytics world, queries that only look at the top-level object in the tree (the User or Mobile Device) run much faster than queries that need to visit the sessions associated with that User. At the next level, queries that need to visit the events in each session run slower than ones that only look at sessions. Generally the complexity of the query, in terms of what data is accessed and what results are generated at each object, is not nearly as impactful.

In our 250-node, 6-SATA-spindle, 48-Haswell-hthread cluster, we see a sustained 50 QPS with >1,000 applications in memory. Datasets cold load in <10s and cache load in <1s. Generally we scan about 200K objects/sec/hthread.
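A back-of-the-envelope check of these figures: 250 nodes of 48 hyper-threads each, at roughly 200K objects/sec/hthread, yields an aggregate scan rate of about 2.4 billion objects per second:

```java
// Aggregate cluster scan rate from the per-hthread rate quoted above.
public class ScanRate {
    static long clusterObjectsPerSec(int nodes, int hthreadsPerNode, long objsPerSecPerHthread) {
        // 250 * 48 = 12,000 hthreads; 12,000 * 200,000 = 2.4e9 objects/sec
        return (long) nodes * hthreadsPerNode * objsPerSecPerHthread;
    }

    public static void main(String[] args) {
        System.out.println(clusterObjectsPerSec(250, 48, 200_000L)); // 2400000000
    }
}
```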
11. Future Work
The Burst architecture was designed to be extensible, and the GIST language is implemented on top of a ‘plugin’ abstraction. We have a working first-version plugin of a next generation of SILQ/GIST called HYDRA, which combines both into a single language that is more performant in a few key areas. One is that any number of queries can be combined into a single concurrent scan.[12] We are also well into developing more efficient filtering using code-generated predicates that can be used both by HYDRA and for melding.
12. Conclusions
By rigorously constraining the data to be queried in terms of a two-level partition model, where the first-level partition (Domains) subdivides the entire dataset into individually queryable subsets, and the second-level partition (Items) defines unordered parallel/distributed partitions of sequences of scannable object graphs, and by implementing hyper parallel/distributed/concurrent scans, we can provide a linearly scaling, cost-effective, completely general-purpose, ad hoc, low-latency query engine. The first version is deployed in beta behind the
[11] Burst does not use Spark features extensively; in fact, for the most part it uses Spark as a distributed process manager. The actual Spark execution model is a very simple single-stage scatter/gather model. The implementation abstracts this facility so as to make it easy to move to a different distributed process manager, or to roll our own multicast execution model, such as with JGroups.
[12] This is an important optimization for multiple use cases, including 1) ‘dashboards’, where a mobile application displays an initial UI view with a fixed set of personalized queries; 2) when a dataset is melded, it is critical to provide metadata about that dataset to the query clients in terms of a fixed set of queries, e.g. for the Flurry product the UI needs to display user, session, event, and parameter counts as well as parameter keys and value frequencies to help inform users about formed-query relevance during interactive query sessions.
recently released Explorer Product. The next release introduces an incremental ingestion pipeline allowing this
query system to scale to serve all Flurry Explorer customers.
13. References
● [DREMEL] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis, “Dremel: Interactive Analysis of Web-Scale Datasets”, Proc. of the 36th Int'l Conf on Very Large Data Bases: http://research.google.com/pubs/pub36632.html
● [DRUID] Druid, “Open Source Data Store for Interactive Analytics at Scale”: http://druid.io/
● [BLINK] AmpLab, “Queries with Bounded Errors and Bounded Response Times on Very Large Data”: http://blinkdb.org/
● [DRILL] MapR, “Industry's First Schema-Free SQL Engine for Big Data”: https://www.mapr.com/products/apache-drill
● [TEZ] https://tez.apache.org/
● [PRESTO] https://prestodb.io/
● [SPARK] http://spark.apache.org/
● [DOCKER] https://www.docker.com/
● [HBASE] http://hbase.apache.org/
● [KAFKA] http://kafka.apache.org/
● [SILQ] https://docs.google.com/a/yahooinc.com/document/d/1of2GDtLJuItLdNQxDO7E24D6T8hOGspdKnm8lFnDkM/edit?usp=sharing