A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics 
 
Erik Freed, Flurry/Yahoo, erikfreed@yahoo-inc.com
Brian Anderson, Flurry/Yahoo, briananderson@yahoo-inc.com
 
Abstract 
We present Burst, an analytic query system with a scalable and flexible approach to low-latency ad hoc analysis over large, complex datasets. The architecture consists of hardware-efficient scan techniques and a language facility that transforms an extensible set of ad hoc declarative queries into imperative physical scan plans. These plans are multicast to, and executed on, all nodes and cores of a two-level sharded, distributed ingestion, storage, and execution topology. The first release of this system is the query engine behind the Flurry Explorer product. Here we explore the design details of that system as well as the incremental ingestion pipeline enhancement currently being implemented for the next major release.
 
NOTE: This copy has had its performance numbers updated and is not the same as the one submitted to Tech Pulse.
1. Introduction 
The promise of the Flurry Explorer product is to invite the user into an unstructured, interactive discovery session where they can easily pose arbitrary, off-the-cuff, and potentially complex questions about end user behavior. If answers come back quickly enough, each next question starts a virtuous cycle of more targeted questions that lead to more specific and valuable results. The first major release of the back-end query engine engineered to fully support this type of exploration was developed in the Flurry Analytics group in Q1 2015 and delivered as part of a limited beta of the Explorer feature within Flurry Analytics. For this limited audience we used a hyper distributed/parallel/concurrent object tree scanning model with a simple daily batched ingestion system. The next major release of this scanning architecture replaces the batched ingestion system with a more scalable incremental data ingestion pipeline to expand the reach of Explorer to all Flurry customers. Here we present the architectural basis and specifics of the previous and upcoming releases.
2. Background 
For those of us who have spent any time with production-scale SQL databases, seeing large table scans being sorted and joined in a query plan is cause for panic. We can only relax once we find a way to constrain that query and/or implement heavyweight indices so the query transforms into pure index lookups and partial joins. However, for analytics the use cases are inherently unbounded, personalized, and constantly evolving, while the corpora are typically enormous. This makes adding indices intractable in most cases. These limitations forced us to reevaluate our previous nemesis, the full table scan. We determined that if we could make the scans efficient enough, distribute the scans across enough nodes and CPU cores, and develop a query language that could take an arbitrary ad hoc analytic question and transform it into an instance of this hyper parallel-distributed-concurrent scan model, then we would have an attractively simple general purpose model. We reasoned that this model would scale well not only in terms of input size and general query complexity, but also in terms of feature development time, risk, and effort.
 
Explorer Burst: A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics 
3. Top Level View 
 
The basic components of the Burst ecosystem are: 
1. External Datasource(s) 
2. Ingestion Subsystem 
3. Data Model 
4. Sample Store 
5. Dataset Store 
6. Query Subsystem 
 
The previous release of Burst had a simplified batched ingestion model in which the exporting MapReduce jobs wrote the entire history of a given mobile application’s event stream into new HDFS sequence files on a daily basis. These datasets were then read into memory on demand as users posed queries. This initial beta pipeline design is being replaced by the incremental version described in subsequent sections. The rest of the architecture described here is as currently deployed.
 
Each of these components (other than the external data sources) is deployed on one or more clusters called a Cell, where each Cell is composed of a master node, a failover master node, and a set of worker nodes. Each Cell has its own Apache Kafka [KAFKA], Apache HBase [HBASE], and Apache Spark [SPARK] clusters deployed. The Master (and failover Master) node contains the master process for each of these systems as well as a Docker [DOCKER] container populated with all of the Burst-specific JVM service processes. The Worker nodes are populated only by the associated Spark, HBase, and Kafka worker-specific deployments. Burst does not itself deploy anything directly onto Worker nodes.
4. Data Sources 
Burst is inherently schema independent and agnostic to the specific technology of the external data source. However, the data source must have the following basic characteristics:
1. It must be in a schema that can be expressed in the relationships and datatypes of the Burst Data Model.
2. The external data model can be partitioned into two levels of well-defined shards:
a. The first level is composed of a set of Domain instances that each represent a subset of data that is the input to a single query. For Flurry Explorer, this is an event stream associated with a single ‘Mobile Application’ or constructed ‘Mobile Application Group’. A query can only be executed against a single Domain at a time.
b. The second level is a strict partitioning across a Domain, creating order-independent subsets of Item instances that each have a well-defined rooted acyclic object model (tree) that can be scanned in a depth-first, preferably time-ordered, traversal. For Flurry Explorer, this is a single ‘Mobile Device’, each of which has a set of time-ordered ‘sessions’, each of which has a set of time-ordered ‘events’, each of which has a set of unordered key-value map ‘parameters’.
3. The external data source physical form can be exported as both a periodic historical batch and a continuous incremental update and fed to the Burst Kafka-based Ingestion API. For Flurry, this is our ever-growing 2,000 node, six petabyte HBase cluster holding ~50 trillion mobile device events, with custom MapReduce jobs performing both the initializing batch and the daily incremental update feeds.
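
The two-level partitioning described above can be illustrated with a minimal sketch. It is written in Python for brevity (Burst itself is JVM-based), and every class and function name here is a hypothetical stand-in rather than part of Burst: a Domain holds order-independent Items, and each Item (a mobile device in the Flurry case) is a tree walked depth first in time order.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class Event:
    time_ms: int
    name: str
    parameters: Dict[str, str] = field(default_factory=dict)  # unordered key-value map


@dataclass
class Session:
    start_ms: int
    events: List[Event] = field(default_factory=list)


@dataclass
class Device:
    """One Item: the root of a rooted acyclic object tree."""
    device_id: str
    sessions: List[Session] = field(default_factory=list)


@dataclass
class Domain:
    """First-level shard: the input to a single query."""
    name: str
    items: List[Device] = field(default_factory=list)  # order independent


def depth_first(device: Device) -> Iterator[Tuple[str, object]]:
    """Yield (kind, node) pairs in depth-first, time-ordered traversal order."""
    yield ("device", device)
    for session in sorted(device.sessions, key=lambda s: s.start_ms):
        yield ("session", session)
        for event in sorted(session.events, key=lambda e: e.time_ms):
            yield ("event", event)
            # Parameters are unordered in the model; sorting only makes output stable.
            for key in sorted(event.parameters):
                yield ("parameter", (key, event.parameters[key]))
```

Because each Device is self-contained, a query over a Domain can visit its items in any order and on any node, which is the property the rest of the architecture relies on.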
5. Ingestion 
The new Burst Ingestion Subsystem design starts with Kafka queues that provide a control-plane (control/administration) and a data-plane (data feeds). The data source is responsible for sending and responding to control-plane messages, as well as for feeding the data-plane in response to control-plane messages. An Apache Spark based process model manages control-plane and data-plane operations. It is responsible for transforming the schema of the external system into an appropriate Burst schema and for updating the Sample Store as data arrives.
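
As a rough illustration of the control-plane/data-plane split, the sketch below replaces the Kafka queues with in-memory deques. The message type `FEED_REQUEST`, the field names, and both function names are assumptions made for this example; the source does not describe the actual message formats.

```python
from collections import deque


def data_source_poll(control_plane, data_plane, source_rows):
    """External data source: answer each control request by feeding the data plane."""
    while control_plane:
        msg = control_plane.popleft()
        if msg["type"] == "FEED_REQUEST":  # hypothetical message type
            for row in source_rows.get(msg["domain"], []):
                data_plane.append({"domain": msg["domain"], **row})


def ingest(data_plane, sample_store):
    """Ingestion process: map external rows to a Burst-side shape and store them."""
    while data_plane:
        row = data_plane.popleft()
        key = (row["domain"], row["device"])
        # Schema transformation: rename external fields to the (assumed) target schema.
        update = {"time_ms": row["ts"], "event": row["evt"]}
        sample_store.setdefault(key, []).append(update)
```

A caller would place a request on the control deque, let the data source respond, and then run `ingest` to drain the data deque into the store.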
6. Data Model 
The Burst Data Model has the following requirements, features, and implementation details:
1. It is schema independent, but schema defined.
2. It is schema versioned, and supports heterogeneous versioned collections.
3. The data model/schema supports type structures, singular and plural structure reference relationships, value collections, value maps, and atomic data types (boolean, byte, short, int, long, double, string).
4. The data model/schema inherently defines a tree with a well-defined root as part of a well-defined traversal.
5. Data is encoded in a single byte array where the disk storage encoding is identical to the in-memory format.
6. This encoding is an unrolled depth-first traversal of the object tree as a linear sequence of bytes. Reads from disk into memory and traversal scans proceed in exactly the same byte order and thus can take direct advantage of OS disk mmap semantics, with the associated high-performance kernel buffer management and aggressive prefetching. The data can be cached in memory or not, depending on your preferences with respect to repeated queries on identical datasets.[1]
7. All interpretation of atomic data fields is done in situ within the byte array, on demand, if and only if a given field is accessed in a query. The data model structures are never deserialized and no ephemeral objects are created. This is similar to columnar storage in that it eliminates much of the cost of accessing unused columns in standard bulk serializing models, but with a higher degree of inherent simplicity and attendant efficiency. A truly ad hoc system, where it is not known which fields will be accessed at what frequency, if at all, is not an ideal columnar storage candidate.
8. Fetching, in-memory storage, and scans of the data model generate zero JVM objects. They bypass the JVM memory models as well. The byte sequence traversal is scanned using efficient stack-based protocols with data accesses performed via ‘unsafe off heap’ libraries.[2] The problems associated with large JVM heaps are minimized as none of this memory is actually ‘seen’ by the JVM. The JVM processes have quite small heap sizes.
9. There are various optimizations for immutable encodings, e.g. for value maps we store the keys and the values as twin sorted arrays and use a binary search to look up key values. We also use dictionaries to reduce string storage requirements.
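
Items 5 to 7 and 9 can be illustrated with a small sketch. The byte layout below is invented for this example and is not Burst's actual encoding, and Python still allocates small objects while unpacking, so this shows only the access pattern (one linear buffer, fields decoded in place, untouched fields skipped), not the zero-allocation property.

```python
import struct
from bisect import bisect_left

# Hypothetical little-endian layout, unrolled depth first:
#   device  := device_id:int64, session_count:int32, session*
#   session := start_ms:int64,  event_count:int32,   event*
#   event   := event_id:int32,  time_ms:int64
DEVICE = struct.Struct("<qi")
SESSION = struct.Struct("<qi")
EVENT = struct.Struct("<iq")


def encode_device(device_id, sessions):
    """Unroll the tree depth first into one contiguous byte string."""
    out = bytearray(DEVICE.pack(device_id, len(sessions)))
    for start_ms, events in sessions:
        out += SESSION.pack(start_ms, len(events))
        for event_id, time_ms in events:
            out += EVENT.pack(event_id, time_ms)
    return bytes(out)


def count_events(buf, wanted_event_id):
    """Scan the buffer in place, decoding only the fields the query touches."""
    offset = 0
    matches = 0
    _, session_count = DEVICE.unpack_from(buf, offset)
    offset += DEVICE.size
    for _ in range(session_count):
        _, event_count = SESSION.unpack_from(buf, offset)
        offset += SESSION.size
        for _ in range(event_count):
            # Only event_id is decoded; time_ms is skipped by advancing the offset.
            (event_id,) = struct.unpack_from("<i", buf, offset)
            matches += event_id == wanted_event_id
            offset += EVENT.size
    return matches


def map_lookup(sorted_keys, values, key):
    """Value map stored as twin arrays: binary search the keys, index the values."""
    i = bisect_left(sorted_keys, key)
    return values[i] if i < len(sorted_keys) and sorted_keys[i] == key else None
```

The same scan function works whether `buf` is a `bytes` object or a memory-mapped file, which is the point of keeping the disk and memory encodings identical.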
[1] Burst may support streaming query processing in a future release.
[2] ‘Unsafe’ refers to a design pattern where Java code is written using the same techniques the Java libraries use to access non-JVM-heap memory (e.g. network and disc I/O). It is called unsafe because JVM vendors do not offer support for these lower-level libraries, even though they are extensively used and quite reliable.

7. Sample Store 
 
The Burst architecture uses an Apache HBase key-value store to reliably and efficiently store the continuous, largely unordered incremental feed of assorted Item updates from assorted Domains coming from one or more external data sources. This data is stored in one of a plurality of tables, each called a Province.[3] Each arriving update is a new cell, encoded in the Burst Data Model, in a row keyed by the specific Item, Domain, and Channel[4] in the single Province table where the given Domain is hosted.
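
A minimal in-memory model of this layout is sketched below, with dictionaries standing in for HBase tables. The class name, the round-robin assignment of Domains to Provinces, and the ordering of the row key components are all assumptions made for illustration; the source only states that rows are keyed by Item, Domain, and Channel.

```python
from collections import defaultdict


class SampleStore:
    """Toy stand-in for the Sample Store: one dict per Province table."""

    def __init__(self, province_count):
        self.provinces = [defaultdict(list) for _ in range(province_count)]
        self.domain_to_province = {}

    def province_for(self, domain):
        # Hypothetical placement policy: assign Domains to Provinces round robin.
        if domain not in self.domain_to_province:
            index = len(self.domain_to_province) % len(self.provinces)
            self.domain_to_province[domain] = index
        return self.provinces[self.domain_to_province[domain]]

    def put(self, domain, item, channel, encoded_update):
        """Each arriving update becomes a new cell in the row for its key."""
        row_key = (domain, item, channel)  # component order is an assumption
        self.province_for(domain)[row_key].append(encoded_update)

    def rows_for(self, domain):
        """All rows of one Domain, as needed when building a Dataset."""
        table = self.province_for(domain)
        return {key: cells for key, cells in table.items() if key[0] == domain}
```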
8. Dataset Store 
 
For a query to be executed over a Domain, the appropriate rows in the Sample Store and the appropriate update cells for each Item must be scanned and transformed into a Dataset in Brio Data Model encoding. This transformation is called melding and happens locally on each worker node. Each node creates and stores[5] a single partition of the Dataset. These partitions are the most recent ‘view’ of the data as a single byte array cached on local disc (magnetic or solid state). When a query is executed, if the local Worker node has cached the partition and it is not considered ‘stale’, then it is read directly from disc and no meld is required. The melding can also customize the dataset by down-sampling items, along with other forms of object tree filtering, if it is desired to reduce the dataset's size for performance/resource utilization reasons. It is also possible to have more than one defined and reified custom Dataset ‘view’ per Domain.
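
The following is a deliberately simplified sketch of a meld with down-sampling and a staleness check. It assumes each Item's update cells are already encoded byte strings tagged with an arrival sequence number and that a partition is just their concatenation; the real meld logic, its staleness policy, and all names used here are not described in the source and are hypothetical.

```python
import zlib


def keep_item(item_id, sample_rate):
    """Deterministic down-sampling: hash the item id into [0, 1) and compare."""
    return (zlib.crc32(item_id.encode()) % 10_000) / 10_000 < sample_rate


def meld(rows, sample_rate=1.0):
    """Build one partition byte array from {item_id: [(arrival_seq, cell_bytes), ...]}."""
    partition = bytearray()
    for item_id in sorted(rows):
        if not keep_item(item_id, sample_rate):
            continue  # item filtered out of this Dataset view
        for _, cell in sorted(rows[item_id]):  # apply cells in arrival order
            partition += cell
    return bytes(partition)


def is_stale(partition_built_at, latest_update_at):
    """A cached partition is stale if an update arrived after it was built."""
    return latest_update_at > partition_built_at
```

Hashing the item id rather than drawing random numbers means every worker and every re-meld selects the same sample, so repeated queries see a consistent view.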
Caching 
It is vital that the Dataset partitions be loaded into memory quickly and released aggressively in order to manage expensive, limited DRAM resources efficiently. The load of a Dataset partition is a simple mmap() call of a single file as a single byte array into off-heap memory managed directly by the OS. The scan can proceed before the file has been fully read due to the natural OS semantics of paged disc reads with linear-order prefetching. Since there are essentially zero on-heap artifacts associated with this load, the release of the byte array has minimal GC implications. In this way, the local disc, especially if it is SSD, acts as a cost-effective second-level DRAM cache.[6]

[3] Provinces are used to subdivide the overall dataset into separate tables so that efficient table operations can be used to manage, move, and clean up data as needed in manageable chunks.
[4] An Ingestion API/Sample Store management artifact.
[5] i.e. without replication or fault tolerance. In the case of worker node failure, these dataset partitions are recreated on whatever replica location is targeted by HBase/Spark for the next query.
9. Query Engine 
 
The Query Subsystem has an API that consists of a programmer-friendly declarative query language called SILQ, which is translated into a machine-friendly imperative query language called GIST. Both of these are textual languages with a well-defined grammar and syntax.[7] The details are described in [SILQ]. Here we will say only that these languages provide a rich and extensible set of aggregation, dimensioning, filtering, and causal/temporal reasoning features. Burst clients form their queries in SILQ, which the SILQ pipeline transforms into GIST. The GIST pipeline transforms those into well-defined execution plans that are multicast to worker nodes. The multidimensional result model is gathered and delivered back to the client.
Execution Models 
These execution plans contain: 
1. Traversal Model - a simple numeric array based state machine holding the semantics of what to do where in the object tree traversal
2. Result Schema - the semantics of all aggregations, dimensions, merges, and joins
3. Closures - filters and traversal data model updates in generated and JIT-optimized JVM byte code
4. Routes - log-structured record of graph automata paths
Zap Data Structures 
Because of the extreme number of objects visited and the prolific object churn associated with standard data structures, Burst requires specialized data structures, called Zap structures,[8] for inner loops. These are designed to use nothing but simple off-heap blocks of memory, pre-allocated in per-thread chunks and re-used over and over again, with all needed functions coded using unsafe access patterns. There are just two of these currently:[9]
● Zap Maps: The object tree scan requires a nested overlay of lightweight hash maps with the ability to join[10] with child/peer maps on the fly as the traversal unfolds from parent to child. The ways these nested self joins can be expressed are an important part of how GIST creates complex ad hoc multi-dimensional result models. The performance of Zap Maps is a key factor in the overall performance of the system.
● Zap Routes: For causal/temporal reasoning we implemented an off-heap log-structured recording structure with a graph automaton to discover and capture ‘paths’ through sequences of events. This is how ‘Funnels’ are implemented in the Explorer product.

[6] If desired, a future version of Burst may support ‘streaming’ semantics where the scan is executed as the data is read from disc and never cached in memory.
[7] Very convenient for unit and system testing!
[8] ‘Zero Allocation Protocol’
[9] We are working on another structure, a Zap Lexicon, that eliminates the use of standard JVM strings, which are quite noisy from the perspective of JVM object creation.
[10] Something like a cross join.
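
The path capture performed by Zap Routes can be approximated with a very small linear automaton. This sketch assumes a funnel is an ordered list of event ids and that the log is a flat array of (step index, time) pairs; the real structure is off-heap and supports more general graph automata, so the function names and the log layout here are illustrative only.

```python
from array import array


def match_funnel(events, steps):
    """Record each funnel step as it is reached.

    events: time-ordered (time_ms, event_id) pairs for one Item.
    steps:  ordered event ids that define the funnel.
    Returns a flat int64 log of [step_index, time_ms, step_index, time_ms, ...].
    """
    log = array("q")
    state = 0  # index of the next funnel step we are waiting for
    for time_ms, event_id in events:
        if state < len(steps) and event_id == steps[state]:
            log.append(state)
            log.append(time_ms)
            state += 1
    return log


def completed(log, steps):
    """A path completes the funnel when every step has a log entry."""
    return len(log) // 2 == len(steps)
```

Because the log is append-only and the state is a single integer, the matcher can run inside the same depth-first pass that visits the events, with no second scan of the data.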
Concurrency 
Because each of the Item instances in a Dataset partition is part of a sequence of individual, order-independent object trees, we refine our concurrency model to a single core/thread dedicated to each traversal. These traversals can be executed in parallel on available cores using a fixed pool model. This makes the hardware happy, as the linear byte array being scanned is read solely by a single core.
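
A minimal sketch of this fixed pool model follows: each Item's byte array is scanned start to finish by exactly one worker, and the per-item results are merged afterwards. The record format and function names are assumptions, and CPython's global interpreter lock means these threads will not scale the way JVM threads would; the example is meant to show the structure only.

```python
import struct
from concurrent.futures import ThreadPoolExecutor


def scan_item(buf):
    """Single-threaded linear scan of one Item: count positive int32 records."""
    count = 0
    for offset in range(0, len(buf), 4):
        (value,) = struct.unpack_from("<i", buf, offset)
        count += value > 0
    return count


def scan_items_in_parallel(items, workers=4):
    """Give each Item's byte array to one worker of a fixed-size pool, then merge."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(scan_item, items))
```

Since Items are order independent, the merge step is a plain sum here; richer result models would merge per-item result structures in the same way.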
Spark 
Like the Ingestion Subsystem, the Query Subsystem is built on top of Apache Spark,[11] with a Spark Executor on each worker node initialized with a Query Kernel that can execute scan plans. The scan traversals are carefully designed to use a minimum of JVM memory and create a minimum of JVM objects. There is essentially no JVM memory overhead to the storage and execution models other than that created by the IPC protocols.
10. Performance 
Because of the efficiency of the scanning techniques involved, one can think of Burst as an objects-scanned-per-second machine, and so the performance of queries is almost exclusively about how many objects the query needs to visit. As an example, in the Flurry mobile analytics world, queries that only look at the top-level object in the tree (the User or Mobile Device) run much faster than queries that need to visit the sessions associated with that User. At the next level, queries that need to visit the events in each session run slower than ones that only look at sessions. Generally, the complexity of the query in terms of what data is accessed and what results are generated at each object is not nearly as impactful.
In our 250 node, 6 SATA spindle, 48 Haswell hyperthread cluster, we see a sustained 50 QPS with >1,000 applications in memory. Datasets cold load in <10s and cache load in <1s. Generally we scan about 200K objects/sec/hyperthread.
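
The objects-scanned-per-second view suggests a simple back-of-the-envelope estimate. The sketch below uses the quoted rate of about 200K objects/sec/hyperthread and assumes perfectly even distribution with no coordination overhead, which real queries will not achieve; the function name and the example object count are illustrative.

```python
def scan_seconds(objects, hyperthreads, objects_per_sec_per_hyperthread=200_000):
    """Idealized scan time: objects visited divided by aggregate scan rate."""
    return objects / (hyperthreads * objects_per_sec_per_hyperthread)
```

Under these assumptions, 48 hyperthreads scan 48 × 200,000 = 9.6 million objects per second, so a query that must visit 9.6 million objects takes roughly one second on those threads, while a query that stops at the device level and visits a tenth as many objects takes roughly a tenth of the time.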
11. Future Work 
The Burst architecture was designed to be extensible, and the GIST language is implemented on top of a ‘plugin’ abstraction. We have a working first version plugin of a next generation of SILQ/GIST called HYDRA that combines both into a single language that is more performant in a few key areas. One is that you can combine any number of queries into a single concurrent scan.[12] We are also well into developing more efficient filtering using code-generated predicates that can be used both by HYDRA and for melding.
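
The idea of combining several queries into one scan can be sketched as follows. This is not HYDRA syntax or its implementation; it assumes each query is just a named predicate with a counter, and shows that the data is traversed once regardless of how many queries are attached.

```python
def scan_once(events, queries):
    """Evaluate every query during a single pass over the events.

    queries: {name: predicate}, where predicate(event) returns True or False.
    Returns {name: number of matching events}.
    """
    results = {name: 0 for name in queries}
    for event in events:  # the data is visited exactly once
        for name, predicate in queries.items():
            if predicate(event):
                results[name] += 1
    return results
```

Since scan cost is dominated by the number of objects visited, adding a query to an existing scan costs far less than running it as a separate scan, which is why this matters for dashboards and dataset metadata.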
12. Conclusions 
By rigorously constraining the data to be queried in terms of a two-level partition model, where the first-level partition (Domains) subdivides the entire dataset into individually queryable subsets and the second-level partition (Items) defines unordered parallel/distributed partitions of sequences of scannable object graphs, and by implementing hyper parallel-distributed-concurrent scans, we can provide a linearly scaling, cost-effective, completely general purpose, ad hoc, low-latency query engine. The first version is deployed in beta behind the recently released Explorer product. The next release introduces an incremental ingestion pipeline allowing this query system to scale to serve all Flurry Explorer customers.

[11] Burst does not use Spark features extensively; in fact, for the most part it uses Spark as a distributed process manager. The actual Spark execution model is a very simple single-stage scatter/gather model. The implementation abstracts this facility so as to make it easy to move to a different distributed process manager or to roll our own multicast execution model, such as with JGroups.
[12] This is an important optimization for multiple use cases, including: 1) ‘dashboards’, where a mobile application displays an initial UI view with a fixed set of personalized queries; 2) when a dataset is melded, it is critical to provide metadata about that dataset to the query clients in terms of a fixed set of queries, e.g. for the Flurry product the UI needs to display user, session, event, and parameter counts as well as parameter keys and value frequencies to help inform users about formed query relevance during interactive query sessions.
13. References 
● [DREMEL] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis, “Dremel: Interactive Analysis of Web-Scale Datasets”, Proc. of the 36th Int'l Conf. on Very Large Data Bases: http://research.google.com/pubs/pub36632.html
● [DRUID] Druid, “Open Source Data Store for Interactive Analytics at Scale”: http://druid.io/
● [BLINK] AmpLab, “Queries with Bounded Errors and Bounded Response Times on Very Large Data”: http://blinkdb.org/
● [DRILL] MAPR, “Industry's First Schema-Free SQL Engine for Big Data”: https://www.mapr.com/products/apache-drill
● [TEZ] https://tez.apache.org/
● [PRESTO] https://prestodb.io/
● [SPARK] http://spark.apache.org/
● [DOCKER] https://www.docker.com/
● [HBASE] http://hbase.apache.org/
● [KAFKA] http://kafka.apache.org/
● [SILQ] https://docs.google.com/a/yahoo-inc.com/document/d/1of2GDtLJuItLdNQxDO7E24D6T8hOGspd-Knm8lFnDkM/edit?usp=sharing
 
page 7 of 7 

More Related Content

What's hot

Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?IJCSIS Research Publications
 
A Review of Elastic Search: Performance Metrics and challenges
A Review of Elastic Search: Performance Metrics and challengesA Review of Elastic Search: Performance Metrics and challenges
A Review of Elastic Search: Performance Metrics and challengesrahulmonikasharma
 
Log Analysis Engine with Integration of Hadoop and Spark
Log Analysis Engine with Integration of Hadoop and SparkLog Analysis Engine with Integration of Hadoop and Spark
Log Analysis Engine with Integration of Hadoop and SparkIRJET Journal
 
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...
PERFORMANCE EVALUATION OF SOCIAL NETWORK ANALYSIS ALGORITHMS USING DISTRIBUTE...Journal For Research
 
Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...Dynamic and repeatable transformation of existing Thesauri and Authority list...
Dynamic and repeatable transformation of existing Thesauri and Authority list...DESTIN-Informatique.com
 
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...CitiusTech
 
Big_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperBig_SQL_3.0_Whitepaper
Big_SQL_3.0_WhitepaperScott Gray
 
QUERY OPTIMIZATION FOR BIG DATA ANALYTICS
QUERY OPTIMIZATION FOR BIG DATA ANALYTICSQUERY OPTIMIZATION FOR BIG DATA ANALYTICS
QUERY OPTIMIZATION FOR BIG DATA ANALYTICSijcsit
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopIOSR Journals
 
Sparkr sigmod
Sparkr sigmodSparkr sigmod
Sparkr sigmodwaqasm86
 
Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...
Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...
Comparison of Open-Source Data Stream Processing Engines: Spark Streaming, Fl...Darshan Gorasiya
 
Final Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaFinal Report_798 Project_Nithin_Sharmila
Final Report_798 Project_Nithin_SharmilaNithin Kakkireni
 
SplunkLive! Data Models 101
SplunkLive! Data Models 101SplunkLive! Data Models 101
SplunkLive! Data Models 101Splunk
 
Data-Intensive Technologies for Cloud Computing
Data-Intensive Technologies for CloudComputingData-Intensive Technologies for CloudComputing
Data-Intensive Technologies for Cloud Computinghuda2018
 
Hardware enhanced association rule mining
Hardware enhanced association rule miningHardware enhanced association rule mining
Hardware enhanced association rule miningStudsPlanet.com
 
A cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring dataA cloud service architecture for analyzing big monitoring data
A cloud service architecture for analyzing big monitoring dataredpel dot com
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop frameworkTu Pham
 

What's hot (19)

Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
Which NoSQL Database to Combine with Spark for Real Time Big Data Analytics?
 
A Review of Elastic Search: Performance Metrics and challenges