A Query Model for Ad Hoc Queries using a Scanning Architecture
Erik Freed, Flurry/Yahoo, erikfreed@yahooinc.com
Brian Anderson, Flurry/Yahoo, briananderson@yahooinc.com
Abstract
Systems like Hadoop, HBase and Hive allowed the world to take huge strides in managing and analyzing large amounts of data. Products like Flurry Analytics make efficient use of large amounts of hardware, using these tools to build statistics for hundreds of thousands of applications. However, these tools require the end user to first set up the relevant analytics queries and then wait days for the results. If the results prompt new questions, or the original query is not quite right, the user must rerun the query and wait again for the results.
We present the Burst system, developed at Flurry to support low-latency, single-pass queries over very large and complex mobile application streams. We have created a data schema and query model that can answer very complex ad-hoc queries over data and is highly parallelizable while maintaining low latency. We implement these scans so that they are time- and space-efficient, using the advanced disk scanning techniques provided by the underlying operating system.
1. Introduction
Flurry gathers mobile application analytics for over 500,000 applications on hundreds of millions of devices. We have currently accumulated petabytes of metrics across a 2,000-node HBase cluster. 1,000-node Hadoop jobs run throughout the day to calculate data for the graphs and displays that appear in the Flurry Developer Portal. Metrics and graphs must be specified in advance by the user of the Developer Portal. If the developer wants to explore the data with new graphs or metrics, they must change the definitions and wait days for the next job run.
Many of these metrics are based on a series of time-ordered, dependent events. For example, funnels define a cohort [COHORT] event that partitions the application's users into groups. Afterwards, funnel events track significant events performed by users in that cohort. These metrics cannot be computed with traditional associative and commutative aggregation functions; they instead require finite state machine functions.
We developed Explorer as an ad-hoc query product that allows the developer to interactively explore their application metrics and get graphs and charts in sub-second time. This allows the user to do iterative deep dives into their application statistics in order to increase the retention and revenue of their application. The Burst system is the back-end storage and query system that supports the Explorer product.
This paper will discuss how Burst has chosen to focus on a scan-only architecture for processing very large amounts of data. We will cover the Burst data and query execution model. The underlying architecture and implementation of Burst are covered in more detail elsewhere [BURST].
2. Background
An ad-hoc query is one where the execution engine cannot predict what form the question will take. In the world of mobile analytics, developers are constantly asking iterative questions about their users and their usage of an application so they can improve adoption and retention and ultimately increase revenue. The answer from one query drives the next, so the turnaround for results must be sub-second. While the developer is sifting through the time-ordered record of events performed by a user in one or more of their applications, they are doing multi-dimensional aggregates as well as temporal and causal analysis in the form of cohort and funnel analysis [COHORT]. Flurry provides analytics as a service to hundreds of thousands of applications, so the Burst system supports hundreds of simultaneous queries by developers analyzing their event data.
page 1 of 6
● References point to other items, but an item can only be referenced once: either as a member of a dataset, in a scalar reference field, or in a vector reference field.¹
● An item is versioned so that the data schema can evolve over time. A dataset, as well as scalar and vector reference fields, can contain items at different versions.
3.2. Query Model
Use cases for mobile analytics are inherently unbounded, personalized, and constantly evolving. Queries range from simple counting aggregations and multi-dimensional aggregations up to complex time-sequence conditionals. Some examples include:
● Count the number of users by day with sessions where they spent more than $5 on in-app purchases
● Count the number of users by day with sessions where they made a bet and then made an in-app purchase to buy more gold before betting again.²
The low-level Burst query execution engine scans the dataset and produces a collection of tuples for each item. A tuple consists of a number of fields:
● An aggregation field is one of the following functions: count, sum, topK³, max, min. These functions are associative and commutative so they can be applied in any order.
● A dimension field is a scalar value that can partition data using group functions: enum, splits, month, day, year, and a menagerie of other time partitioning functions.
As a tuple is created during the scan, it is combined with any existing tuple with matching values in all the dimension fields. There is no ordering of the scan of the items in the dataset; given the associativity and commutativity restrictions, the scan can be done in any order and split up into any number of arbitrary streams that are merged together into a final set of tuples. This result set of tuples is the query result.
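As a rough illustration of this merge property (a sketch under our own simplified types, not the Burst implementation), a result set keyed by dimension values can be built per stream and then combined in any order, because the aggregation is associative and commutative:

```java
// Illustrative tuple-merge sketch: tuples with matching dimension values are
// combined with a sum aggregate; partial results from arbitrary scan streams
// merge to the same final answer regardless of order.
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TupleMergeSketch {
    // The running result set: dimension value -> sum aggregate.
    static Map<String, Long> merge(Map<String, Long> left, Map<String, Long> right) {
        Map<String, Long> out = new HashMap<>(left);
        right.forEach((dim, sum) -> out.merge(dim, sum, Long::sum));
        return out;
    }

    // One scan stream: fold each (dimension, value) tuple into a partial result.
    static Map<String, Long> scan(List<Map.Entry<String, Long>> tuples) {
        Map<String, Long> result = new HashMap<>();
        for (var t : tuples) result.merge(t.getKey(), t.getValue(), Long::sum);
        return result;
    }

    public static void main(String[] args) {
        // Two arbitrary splits of the same tuple stream...
        var streamA = scan(List.of(Map.entry("2016-01-01", 5L), Map.entry("2016-01-02", 3L)));
        var streamB = scan(List.of(Map.entry("2016-01-01", 2L)));
        // ...merge to the same final result in either order.
        System.out.println(merge(streamA, streamB).equals(merge(streamB, streamA))); // true
    }
}
```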
Each item is evaluated by traversing it in a depth-first manner, starting at the root item and visiting each referenced item in a scalar or vector reference field. The evaluation is done by a single thread and is guaranteed to traverse the items by following the item relationships, as defined by the scalar and vector reference fields of an
item, in DFS order. During the traversal, tuple fields are assigned to build partial results for this item. However, while the traversal can be short-circuited, it cannot backtrack to re-examine items already visited.
¹ A tree graph.
² This filter on a time-ordered series of events would require one or more subqueries in a relational system and is typically unimplementable in time-series databases.
³ This is an approximated single-pass topK.
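The traversal contract above (depth-first, single-threaded, short-circuit allowed, no backtracking) can be sketched as follows; the types and names here are ours, not Burst's:

```java
// Forward-only item evaluation sketch: children reached through reference
// fields are visited after their parent, and the visitor may cut a subtree
// short but never revisits a node.
import java.util.List;
import java.util.function.Predicate;

public class TraversalSketch {
    // An item is a node in a tree of objects (invented minimal type).
    record Item(String path, List<Item> children) {}

    // Visits items in DFS order; returning false from `visit` short-circuits
    // the subtree below the current item. There is no backtracking.
    static void traverse(Item item, Predicate<Item> visit) {
        if (!visit.test(item)) return; // short-circuit: skip this subtree
        for (Item child : item.children()) traverse(child, visit);
    }

    public static void main(String[] args) {
        Item user = new Item("user", List.of(
                new Item("user.sessions", List.of(
                        new Item("user.sessions.events", List.of())))));
        StringBuilder order = new StringBuilder();
        traverse(user, i -> { order.append(i.path()).append(' '); return true; });
        System.out.println(order.toString().trim()); // user user.sessions user.sessions.events
    }
}
```

This strictly advancing access pattern is what lets the storage layer exploit prefetching and read-ahead, as discussed below.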
During evaluation, the query has two temporary data structures to help it keep state information about the item:
● A global register can store a single scalar value. It can be set and/or referenced at any point in the traversal.
● A route is defined by a number of steps as well as the valid transitions from one step to another (with optional time constraints). The route has at least one starting step with no transitions into it, and one terminating step with exit transitions.⁴ The route is a finite state machine which logs any valid step transition along with a timestamp. One can assert a step occurrence to the route at any time, but the route object will only record the assertion if a transition from the current route state is allowed. As the route records step transitions, it cuts the log into paths that always begin with a starting step and finish with a terminating step. At any time in the traversal the route can be queried for any currently recorded steps or completed paths.
An evaluation can use multiple instances of each type, which are valid within the scope of a single item evaluation. They allow the engine to record enough state to evaluate complex event queries in a single pass.
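A minimal sketch of the route idea (our own simplification with hypothetical step numbers, not the Burst implementation) looks like this:

```java
// Route sketch: a route records an asserted step only when a transition from
// the current step is allowed, and closes a path when a terminating step is
// reached. Step 1 starts a path; step 3 terminates it.
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class RouteSketch {
    // Allowed transitions: step -> set of valid next steps (invented example).
    static final Map<Integer, Set<Integer>> TRANSITIONS =
            Map.of(1, Set.of(2), 2, Set.of(2, 3));

    final List<List<Integer>> paths = new ArrayList<>(); // completed paths
    List<Integer> current = new ArrayList<>();           // path in progress

    // Assert a step occurrence; it is recorded only if it opens a path or is
    // a valid transition from the most recently recorded step.
    void assertStep(int step) {
        if (current.isEmpty()) {
            if (step == 1) current.add(step); // only a starting step opens a path
        } else if (TRANSITIONS.getOrDefault(current.get(current.size() - 1), Set.of()).contains(step)) {
            current.add(step);
            if (step == 3) { // terminating step closes the path
                paths.add(current);
                current = new ArrayList<>();
            }
        }
    }

    public static void main(String[] args) {
        RouteSketch route = new RouteSketch();
        // Invalid assertions (the leading 3, the 2 after a closed path) are ignored.
        for (int s : new int[]{3, 1, 2, 2, 3, 2, 1, 2, 3}) route.assertStep(s);
        System.out.println(route.paths); // [[1, 2, 2, 3], [1, 2, 3]]
    }
}
```

A production route would also log a timestamp per transition and enforce the optional time constraints; this sketch keeps only the path-cutting behavior.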
Evaluation is defined so that it is always advancing through an item. The underlying storage layout of items in Burst can take advantage of this property in order to make evaluation very fast and efficient [BURST]. Some important sources of speedup are:
1. memory cache-line prefetching in high-end CPUs
2. disk head read-ahead in the disk controller
3. disk-to-memory prefetching in the disk controller and OS
4. single-copy memory mapping support in the OS
Imperative Query Plans
The Burst execution engine scans a dataset using a GIST plan. The GIST plan is an imperative execution plan that specifies the schema of the result tuples and the actions to perform during an item evaluation. Just to give the reader a sense of a GIST plan, the following example calculates the total length of all sessions for all users in Flurry's usual metric event schema:
Gist(Over(1L, 512, "America/Los_Angeles"), NoOptions,
  Declare(
    Gather("user",
      NoDimensions,
      Aggregations(
        Sum[Long]("totalsessiontime")
      )
    )),
  VisitReferenceVector("user.sessions",
    pre = NoPre,
    post = Post { s ⇒
      if (!s.fldIsNull("user.sessions", "duration")) {
        s.aggLongWr("totalsessiontime", s.fldScalarLong("user.sessions", "duration"))
      }
    }))
This doesn't come close to showing the full power of GIST, but a more meaningful analytics query would be quite large. Notice that the plan consists of a number of gather clauses, each with a path identifying a location in the data schema plus a number of optional aggregate and dimension fields. There is always one root gather at the top of the schema as well as optional nested gathers. A gather defines a join point where results are built from the partial results of the children. A gather also has a number of visit declarations, each with a path and a closure. The closure is
⁴ They may even be the same node.
References
● [BURST] Erik Freed and Brian Anderson, “A General Purpose Extensible Scanning Query Architecture for
Ad Hoc Analytics”
● [BLINK] AmpLab, “Queries with Bounded Errors and Bounded Response Times on Very Large Data”:
http://blinkdb.org/
● [CODD] E.F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Communications of the
ACM, 1970: http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf
● [COHORT] Wikipedia, “Cohort Analysis”: https://en.wikipedia.org/wiki/Cohort_analysis
● [DRUID] Druid, “Open Source Data Store for Interactive Analytics at Scale”: http://druid.io/
● [DATE] C. J. Date, An Introduction to Database Systems, O’Reilly, 7th edition, 2000
● [DRILL] MapR, “Industry's First Schema-Free SQL Engine for Big Data”:
https://www.mapr.com/products/apachedrill
● [DREMEL] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis, “Dremel: Interactive Analysis of Web-Scale Datasets”, Proc. of the 36th Int'l Conf on Very Large Data Bases: http://research.google.com/pubs/pub36632.html
● [HIVE] Apache Software Foundation, “Hive: A data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis”: https://hive.apache.org/
● [TSDB] OpenTSDB, “The Scalable Time Series Database”: http://opentsdb.net/
● [REDSHIFT] Amazon Web Services, “Amazon Redshift”: http://aws.amazon.com/redshift/