SlideShare a Scribd company logo
1 of 54
Download to read offline
Introduction to Apache Drill
Big Data Bellevue Meetup

@tnachen
Timothy Chen
Motivation
Key Facts
Architecture Overview
About me
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
I Open Source
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Use Case: Marketing Campaign
Jane, a marketing analyst
Determine target segments
Data from different sources
Use Case: Crime Detection
•
•
•
•

Online purchases
Fraud, billing, etc.
Batch-generated overview
Modes
– Explorative
– Alerts
Requirements
•
•
•
•
•

Support for different data sources
Support for different query interfaces
Low-latency/real-time
Ad-hoc queries
Scalable, reliable
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Dremel is a scalable, interactive ad-hoc
query system for analysis of read-only
nested data. By combining multi-level
execution trees and columnar data layout,
it is capable of running aggregation
queries over trillion-row tables in
seconds. The system scales to thousands of
CPUs and petabytes of data, and has
thousands of users at Google.
…

“

“

Google’s Dremel

http://research.google.com/pubs/pub36632.html
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar,
Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases
(2010), pp. 330-339
Google’s Dremel

multi-level execution trees

columnar data layout
Google’s Dremel

nested data + schema

column-striped representation

map nested data to tables
Google’s Dremel
experiments:
datasets & query performance
Apache Drill–key facts
Inspired by Google’s Dremel
Standard SQL 2003 support
Plug-able data sources
Nested data is a first-class citizen
Schema is optional
Community driven, open, 100’s involved
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
High-level Architecture
Principled Query Execution
Source query—what we want to do (analyst
friendly)
Logical Plan— what we want to do (language
agnostic, computer friendly)
Physical Plan—how we want to do it (the best
way we can tell)
Execution Plan—where we want to do it
Principled Query Execution

Source
Query

SQL 2003
DrQL
MongoQL
DSL

Parser

parser API

Logical
Plan

query: [
{
@id: "log",
op: "sequence",
do: [
{
op: "scan",
source: “logs”
},
{
op: "filter",
condition:
"x > 3”
},

Optimizer

Topology
CF
etc.

Physical
Plan

Execution

scanner API
Wire-level Architecture
Each node: Drillbit - maximize data locality
Co-ordination, query planning, execution, etc, are distributed
Any node can act as endpoint for a query—foreman

Drillbit

Drillbit

Drillbit

Drillbit

Storage
Storage
Process
Process

Storage
Storage
Process
Process

Storage
Storage
Process
Process

Storage
Storage
Process
Process

node

node

node

node
Wire-level Architecture
Curator/Zookeeper for ephemeral cluster membership info
Distributed cache (Hazelcast) for metadata, locality
information, etc.
Curator/Zk
Curator/Zk

Drillbit

Drillbit

Drillbit

Drillbit

Distributed
Distributed
Cache
Cache

Distributed
Distributed
Cache
Cache

Distributed
Distributed
Cache
Cache

Distributed
Distributed
Cache
Cache

Storage
Storage
Process
Process

Storage
Storage
Process
Process

Storage
Storage
Process
Process

Storage
Storage
Process
Process

node

node

node

node
Wire-level Architecture
Originating Drillbit acts as foreman: manages query execution,
scheduling, locality information, etc.
Streaming data communication avoiding SerDe
Curator/Zk
Curator/Zk

Drillbit

Drillbit

Drillbit

Drillbit

Distributed
Distributed
Cache
Cache

Distributed
Distributed
Cache
Cache

Distributed
Distributed
Cache
Cache

Distributed
Distributed
Cache
Cache

Storage
Storage
Process
Process

Storage
Storage
Process
Process

Storage
Storage
Process
Process

Storage
Storage
Process
Process

node

node

node

node
Wire-level Architecture
Foreman turns into
root of the multi-level
execution tree, leafs
activate their storage
engine interface.

node

Curator/Zk
Curator/Zk
node

node
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
On the shoulders of giants …
Jackson for JSON SerDe for metadata
Typesafe HOCON for configuration and module management
Netty4 as core RPC engine, protobuf for communication
Vanilla Java, LArray and Netty ByteBuf for off-heap large data
structures
Hazelcast for distributed cache
Netflix Curator on top of Zookeeper for service registry
Optiq for SQL parsing and cost optimization
Parquet (http://parquet.io)/ ORC
Janino for expression compilation
ASM for ByteCode manipulation
Yammer Metrics for metrics
Guava extensively
Carrot HPC for primitive collections
Key features
Full SQL – ANSI SQL 2003
Nested Data as first class citizen
Optional Schema
Extensibility Points …
Extensibility Points
Source query  parser API
Custom operators, UDF  logical plan
Serving tree, CF, topology  physical plan/optimizer
Data sources &formats  scanner API

Source
Query

Parser

Logical
Plan

Optimizer

Physical
Plan

Execution
User Interfaces
API—DrillClient
Encapsulates endpoint discovery
Supports logical and physical plan submission, query
cancellation, query status
Supports streaming return results

JDBC driver, converting JDBC into DrillClient
communication.
REST proxy for DrillClient
User Interfaces
Let’s get our hands
dirty…
Demo
Install
Preparation
Usage

$ wget http://people.apache.org/~jacques/apache-drill-1.0.0m1.rc3/apache-drill-1.0.0-m1-binary-release.tar.gz
$ tar -zxf apache-drill-1.0.0-m1-binary-release.tar.gz

$ export
JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_11.jdk/Contents
/Home
$ export DRILL_LOG_DIR=$PWD
$ ./bin/drillbit.sh start
$ ./bin/sqlline -u jdbc:drill:schema=parquet-local -n admin -p admin
Useful Resources
Getting Started guide
https://github.com/vrtx/incubatordrill/blob/getting_started/docs/getting_started.rst
Demo HowTo
https://cwiki.apache.org/confluence/display/DRILL/
Demo+HowTo
How to build/install Apache Drill on Ubuntu 13.04
http://www.confusedcoders.com/bigdata/apachedrill/how-to-build-apache-drill-on-ubuntu-13-04
Be a part of it!
Status
Heavy development by multiple organizations
(MapR, Pentaho, Microsoft, Thoughtworks,
XingCloud, etc.)
Currently more than 100k LOC
Alpha available via
http://people.apache.org/~jacques/apache-drill1.0.0-m1.rc3/
Kudos to …
Julian Hyde, Pentaho
Lisen Mu, XingCloud
Tim Chen, Microsoft
Chris Merrick, RJMetrics
David Alves, UT Austin
Sree Vaadi, SSS
Srihari Srinivasan,
ThoughtWorks

•
•
•
•
•
•
•
•
•

Ben Becker, MapR
Jacques Nadeau, MapR
Ted Dunning, MapR
Keys Botzum, MapR
Jason Frantz
Ellen Friedman
Chris Wensel, Concurrent
Gera Shegalov, Oracle
Ryan Rawson, Ohm Data

Alexandre Beche, CERN
Jason Altekruse, MapR

http://incubator.apache.org/drill/team.html
Contributing
Contributions appreciated—not only code drops …

Test data & test queries
Use case scenarios (textual/SQL queries)
Documentation
Engage!
Follow @ApacheDrill on Twitter
Sign up at mailing lists (user | dev)

http://incubator.apache.org/drill/mailing-lists.html

Standing G+ hangouts every Tuesday at 18:00 CET
http://j.mp/apache-drill-hangouts

Keep an eye on http://drill-user.org/
Twitter: @tnachen
Email: tnachen@gmail.com

More Related Content

What's hot

Microsoft Azure Storage Basics
Microsoft Azure Storage BasicsMicrosoft Azure Storage Basics
Microsoft Azure Storage BasicsSai Kishore Naidu
 
Docker y azure container service
Docker y azure container serviceDocker y azure container service
Docker y azure container serviceFernando Mejía
 
What is cloud computing
What is cloud computingWhat is cloud computing
What is cloud computingBrian Bullard
 
Shift: Real World Migration from MongoDB to Cassandra
Shift: Real World Migration from MongoDB to CassandraShift: Real World Migration from MongoDB to Cassandra
Shift: Real World Migration from MongoDB to CassandraDataStax
 
Consistency As A Service:Auditing Cloud Consistency
Consistency As A Service:Auditing Cloud ConsistencyConsistency As A Service:Auditing Cloud Consistency
Consistency As A Service:Auditing Cloud ConsistencyLakshmiPriya UdayaKumar
 
Consistency as a Service: Auditing Cloud Consistency
Consistency as a Service: Auditing Cloud ConsistencyConsistency as a Service: Auditing Cloud Consistency
Consistency as a Service: Auditing Cloud ConsistencyPapitha Velumani
 
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...Citus Data
 
Consistency as a service auditing cloud consistency
Consistency as a service  auditing cloud consistencyConsistency as a service  auditing cloud consistency
Consistency as a service auditing cloud consistencyPapitha Velumani
 
Mesos vs kubernetes comparison
Mesos vs kubernetes comparisonMesos vs kubernetes comparison
Mesos vs kubernetes comparisonKrishna-Kumar
 
MMS - Monitoring, backup and management at a single click
MMS - Monitoring, backup and management at a single clickMMS - Monitoring, backup and management at a single click
MMS - Monitoring, backup and management at a single clickMatias Cascallares
 
Managing MySQL Scale Through Consolidation
Managing MySQL Scale Through ConsolidationManaging MySQL Scale Through Consolidation
Managing MySQL Scale Through ConsolidationNetApp
 
Release 8.1 - Breakfast Paris
Release 8.1 - Breakfast ParisRelease 8.1 - Breakfast Paris
Release 8.1 - Breakfast ParisNuxeo
 
What is cloud computing
What is cloud computingWhat is cloud computing
What is cloud computingBrian Bullard
 
Kubernetes Optimization - How We Cut Our Cloud Infrastructure Cost By 40% Usi...
Kubernetes Optimization - How We Cut Our Cloud Infrastructure Cost By 40% Usi...Kubernetes Optimization - How We Cut Our Cloud Infrastructure Cost By 40% Usi...
Kubernetes Optimization - How We Cut Our Cloud Infrastructure Cost By 40% Usi...Magalix Corporation
 
Cloudstack Open source and you
Cloudstack Open source and you Cloudstack Open source and you
Cloudstack Open source and you Brian Bullard
 
Basic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB MeetupBasic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB MeetupJohannes Moser
 
Study and implementation a cloud solution based on
Study and implementation a cloud solution based onStudy and implementation a cloud solution based on
Study and implementation a cloud solution based onDendani Bilal
 
IMC Summit 2016 Breakout - Ken Gibson - The In-Place Working Storage Tier
IMC Summit 2016 Breakout - Ken Gibson - The In-Place Working Storage TierIMC Summit 2016 Breakout - Ken Gibson - The In-Place Working Storage Tier
IMC Summit 2016 Breakout - Ken Gibson - The In-Place Working Storage TierIn-Memory Computing Summit
 

What's hot (20)

Microsoft Azure Storage Basics
Microsoft Azure Storage BasicsMicrosoft Azure Storage Basics
Microsoft Azure Storage Basics
 
Docker y azure container service
Docker y azure container serviceDocker y azure container service
Docker y azure container service
 
What is cloud computing
What is cloud computingWhat is cloud computing
What is cloud computing
 
Shift: Real World Migration from MongoDB to Cassandra
Shift: Real World Migration from MongoDB to CassandraShift: Real World Migration from MongoDB to Cassandra
Shift: Real World Migration from MongoDB to Cassandra
 
Tech Days 2010
Tech  Days 2010Tech  Days 2010
Tech Days 2010
 
Consistency As A Service:Auditing Cloud Consistency
Consistency As A Service:Auditing Cloud ConsistencyConsistency As A Service:Auditing Cloud Consistency
Consistency As A Service:Auditing Cloud Consistency
 
Consistency as a Service: Auditing Cloud Consistency
Consistency as a Service: Auditing Cloud ConsistencyConsistency as a Service: Auditing Cloud Consistency
Consistency as a Service: Auditing Cloud Consistency
 
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...
What Microsoft is doing with Postgres & the Citus Data acquisition | PGConf E...
 
Consistency as a service auditing cloud consistency
Consistency as a service  auditing cloud consistencyConsistency as a service  auditing cloud consistency
Consistency as a service auditing cloud consistency
 
Mesos vs kubernetes comparison
Mesos vs kubernetes comparisonMesos vs kubernetes comparison
Mesos vs kubernetes comparison
 
NoSQL benchmarking
NoSQL benchmarkingNoSQL benchmarking
NoSQL benchmarking
 
MMS - Monitoring, backup and management at a single click
MMS - Monitoring, backup and management at a single clickMMS - Monitoring, backup and management at a single click
MMS - Monitoring, backup and management at a single click
 
Managing MySQL Scale Through Consolidation
Managing MySQL Scale Through ConsolidationManaging MySQL Scale Through Consolidation
Managing MySQL Scale Through Consolidation
 
Release 8.1 - Breakfast Paris
Release 8.1 - Breakfast ParisRelease 8.1 - Breakfast Paris
Release 8.1 - Breakfast Paris
 
What is cloud computing
What is cloud computingWhat is cloud computing
What is cloud computing
 
Kubernetes Optimization - How We Cut Our Cloud Infrastructure Cost By 40% Usi...
Kubernetes Optimization - How We Cut Our Cloud Infrastructure Cost By 40% Usi...Kubernetes Optimization - How We Cut Our Cloud Infrastructure Cost By 40% Usi...
Kubernetes Optimization - How We Cut Our Cloud Infrastructure Cost By 40% Usi...
 
Cloudstack Open source and you
Cloudstack Open source and you Cloudstack Open source and you
Cloudstack Open source and you
 
Basic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB MeetupBasic Introduction to Crate @ ViennaDB Meetup
Basic Introduction to Crate @ ViennaDB Meetup
 
Study and implementation a cloud solution based on
Study and implementation a cloud solution based onStudy and implementation a cloud solution based on
Study and implementation a cloud solution based on
 
IMC Summit 2016 Breakout - Ken Gibson - The In-Place Working Storage Tier
IMC Summit 2016 Breakout - Ken Gibson - The In-Place Working Storage TierIMC Summit 2016 Breakout - Ken Gibson - The In-Place Working Storage Tier
IMC Summit 2016 Breakout - Ken Gibson - The In-Place Working Storage Tier
 

Similar to Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsMapR Technologies
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasMapR Technologies
 
Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill MapR Technologies
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of MetadataJim Dowling
 
Data Science und Machine Learning im Kubernetes-Ökosystem
Data Science und Machine Learning im Kubernetes-ÖkosystemData Science und Machine Learning im Kubernetes-Ökosystem
Data Science und Machine Learning im Kubernetes-Ökosysteminovex GmbH
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Zhenxiao Luo
 
Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010Aditya Varun Chadha
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data ScientistsRichard Garris
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big DataFrank Kienle
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Kevin Mao
 
Real-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven ApplicationsReal-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven ApplicationsVMware Tanzu
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSPhilip Filleul
 
Alex Wade, Digital Library Interoperability
Alex Wade, Digital Library InteroperabilityAlex Wade, Digital Library Interoperability
Alex Wade, Digital Library Interoperabilityparker01
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009Ian Foster
 

Similar to Introduction to Apache Drill - Big Data Bellevue Meetup 20131023 (20)

Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael HausenblasBerlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
Berlin Buzz Words - Apache Drill by Ted Dunning & Michael Hausenblas
 
Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 
Data Science und Machine Learning im Kubernetes-Ökosystem
Data Science und Machine Learning im Kubernetes-ÖkosystemData Science und Machine Learning im Kubernetes-Ökosystem
Data Science und Machine Learning im Kubernetes-Ökosystem
 
Prototype Design of Open Access Institutional Repository
Prototype Design of Open Access Institutional RepositoryPrototype Design of Open Access Institutional Repository
Prototype Design of Open Access Institutional Repository
 
Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019Real time analytics at uber @ strata data 2019
Real time analytics at uber @ strata data 2019
 
Analytics&IoT
Analytics&IoTAnalytics&IoT
Analytics&IoT
 
Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010Naukri Search Team achievements, 2009-2010
Naukri Search Team achievements, 2009-2010
 
Azure Databricks for Data Scientists
Azure Databricks for Data ScientistsAzure Databricks for Data Scientists
Azure Databricks for Data Scientists
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
M7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal HausenblasM7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal Hausenblas
 
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a...
 
RavenDB overview
RavenDB overviewRavenDB overview
RavenDB overview
 
Real-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven ApplicationsReal-time Analytics for Data-Driven Applications
Real-time Analytics for Data-Driven Applications
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
Alex Wade, Digital Library Interoperability
Alex Wade, Digital Library InteroperabilityAlex Wade, Digital Library Interoperability
Alex Wade, Digital Library Interoperability
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 

Recently uploaded

UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPathCommunity
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?SANGHEE SHIN
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServiceRenan Moreira de Oliveira
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.YounusS2
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdfJamie (Taka) Wang
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum ComputingGDSC PJATK
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 

Recently uploaded (20)

UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?Do we need a new standard for visualizing the invisible?
Do we need a new standard for visualizing the invisible?
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.Basic Building Blocks of Internet of Things.
Basic Building Blocks of Internet of Things.
 
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
20200723_insight_release_plan_v6.pdf20200723_insight_release_plan_v6.pdf
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Introduction to Quantum Computing
Introduction to Quantum ComputingIntroduction to Quantum Computing
Introduction to Quantum Computing
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 

Introduction to Apache Drill - Big Data Bellevue Meetup 20131023

  • 1. Introduction to Apache Drill Big Data Bellevue Meetup @tnachen Timothy Chen
  • 22. Use Case: Marketing Campaign Jane, a marketing analyst Determine target segments Data from different sources
  • 23. Use Case: Crime Detection • • • • Online purchases Fraud, billing, etc. Batch-generated overview Modes – Explorative – Alerts
  • 24. Requirements • • • • • Support for different data sources Support for different query interfaces Low-latency/real-time Ad-hoc queries Scalable, reliable
  • 26. Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. … “ “ Google’s Dremel http://research.google.com/pubs/pub36632.html Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339
  • 27. Google’s Dremel multi-level execution trees columnar data layout
  • 28. Google’s Dremel nested data + schema column-striped representation map nested data to tables
  • 30. Apache Drill–key facts Inspired by Google’s Dremel Standard SQL 2003 support Plug-able data sources Nested data is a first-class citizen Schema is optional Community driven, open, 100’s involved
  • 34. Principled Query Execution Source query—what we want to do (analyst friendly) Logical Plan— what we want to do (language agnostic, computer friendly) Physical Plan—how we want to do it (the best way we can tell) Execution Plan—where we want to do it
  • 35. Principled Query Execution Source Query SQL 2003 DrQL MongoQL DSL Parser parser API Logical Plan query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: “logs” }, { op: "filter", condition: "x > 3” }, Optimizer Topology CF etc. Physical Plan Execution scanner API
  • 36. Wire-level Architecture Each node: Drillbit - maximize data locality Co-ordination, query planning, execution, etc, are distributed Any node can act as endpoint for a query—foreman Drillbit Drillbit Drillbit Drillbit Storage Storage Process Process Storage Storage Process Process Storage Storage Process Process Storage Storage Process Process node node node node
  • 37. Wire-level Architecture Curator/Zookeeper for ephemeral cluster membership info Distributed cache (Hazelcast) for metadata, locality information, etc. Curator/Zk Curator/Zk Drillbit Drillbit Drillbit Drillbit Distributed Distributed Cache Cache Distributed Distributed Cache Cache Distributed Distributed Cache Cache Distributed Distributed Cache Cache Storage Storage Process Process Storage Storage Process Process Storage Storage Process Process Storage Storage Process Process node node node node
  • 38. Wire-level Architecture Originating Drillbit acts as foreman: manages query execution, scheduling, locality information, etc. Streaming data communication avoiding SerDe Curator/Zk Curator/Zk Drillbit Drillbit Drillbit Drillbit Distributed Distributed Cache Cache Distributed Distributed Cache Cache Distributed Distributed Cache Cache Distributed Distributed Cache Cache Storage Storage Process Process Storage Storage Process Process Storage Storage Process Process Storage Storage Process Process node node node node
  • 39. Wire-level Architecture Foreman turns into root of the multi-level execution tree, leafs activate their storage engine interface. node Curator/Zk Curator/Zk node node
  • 41. On the shoulders of giants … Jackson for JSON SerDe for metadata Typesafe HOCON for configuration and module management Netty4 as core RPC engine, protobuf for communication Vanilla Java, LArray and Netty ByteBuf for off-heap large data structures Hazelcast for distributed cache Netflix Curator on top of Zookeeper for service registry Optiq for SQL parsing and cost optimization Parquet (http://parquet.io)/ ORC Janino for expression compilation ASM for ByteCode manipulation Yammer Metrics for metrics Guava extensively Carrot HPC for primitive collections
  • 42. Key features Full SQL – ANSI SQL 2003 Nested Data as first class citizen Optional Schema Extensibility Points …
  • 43. Extensibility Points Source query  parser API Custom operators, UDF  logical plan Serving tree, CF, topology  physical plan/optimizer Data sources &formats  scanner API Source Query Parser Logical Plan Optimizer Physical Plan Execution
  • 44. User Interfaces API—DrillClient Encapsulates endpoint discovery Supports logical and physical plan submission, query cancellation, query status Supports streaming return results JDBC driver, converting JDBC into DrillClient communication. REST proxy for DrillClient
  • 46. Let’s get our hands dirty…
  • 47. Demo Install Preparation Usage $ wget http://people.apache.org/~jacques/apache-drill-1.0.0m1.rc3/apache-drill-1.0.0-m1-binary-release.tar.gz $ tar -zxf apache-drill-1.0.0-m1-binary-release.tar.gz $ export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_11.jdk/Contents /Home $ export DRILL_LOG_DIR=$PWD $ ./bin/drillbit.sh start $ ./bin/sqlline -u jdbc:drill:schema=parquet-local -n admin -p admin
  • 48. Useful Resources Getting Started guide https://github.com/vrtx/incubatordrill/blob/getting_started/docs/getting_started.rst Demo HowTo https://cwiki.apache.org/confluence/display/DRILL/ Demo+HowTo How to build/install Apache Drill on Ubuntu 13.04 http://www.confusedcoders.com/bigdata/apachedrill/how-to-build-apache-drill-on-ubuntu-13-04
  • 49. Be a part of it!
  • 50. Status Heavy development by multiple organizations (MapR, Pentaho, Microsoft, Thoughtworks, XingCloud, etc.) Currently more than 100k LOC Alpha available via http://people.apache.org/~jacques/apache-drill1.0.0-m1.rc3/
  • 51. Kudos to … Julian Hyde, Pentaho Lisen Mu, XingCloud Tim Chen, Microsoft Chris Merrick, RJMetrics David Alves, UT Austin Sree Vaadi, SSS Srihari Srinivasan, ThoughtWorks • • • • • • • • • Ben Becker, MapR Jacques Nadeau, MapR Ted Dunning, MapR Keys Botzum, MapR Jason Frantz Ellen Friedman Chris Wensel, Concurrent Gera Shegalov, Oracle Ryan Rawson, Ohm Data Alexandre Beche, CERN Jason Altekruse, MapR http://incubator.apache.org/drill/team.html
  • 52. Contributing Contributions appreciated—not only code drops … Test data & test queries Use case scenarios (textual/SQL queries) Documentation
  • 53. Engage! Follow @ApacheDrill on Twitter Sign up at mailing lists (user | dev) http://incubator.apache.org/drill/mailing-lists.html Standing G+ hangouts every Tuesday at 18:00 CET http://j.mp/apache-drill-hangouts Keep an eye on http://drill-user.org/

Editor's Notes

  1. Greet First time talking in front of meetup
  2. - Search Data pipeline, real-time ingesting social data (FB/Twitter)
  3. - Battle nations, Top grossing #3, Matchmaking, Online real-time PvP
  4. - Open Source PaaS, Metrics and Usage data
  5. - Halo new Big Data pipeline, working on data ingestion, with open source like Kafka
  6. Love open source, with enthusiastic people offering their time and energy Smart people and a great sense of community
  7. - Also some less well known projects such as MMORPG game engine, etc.
  8. And I found Drill!
  9. Interactive / real-time is HOT Batch processing doesn’t serve all our needs anymore
  10. - TC originally article that led me to start contributing
  11. - Drill’s Apache incubator proposal, outlining what Drill is trying to achieve
  12. Pro: Can handle big data, MapReduce abstracts all the distribution and management, flexible code to process Con: Slow Hive query startup requires a long processing time….
  13. Pro: Highly available solutions, handle large writes/reads Con: Not easy to do adhoc SQL like queries
  14. Stream processing is becoming very popular, and new projects are rising up such as Samza based on Kafka. Walmart labs has a project called Muppet that they called Fast data.. Con: Need to definite topology, and cannot do adhoc querying
  15. AWS CloudSearch, Lucene, all these technology performs searches with the basis of an index they maintain, therefore needs preprocessing of data.
  16. What most people do in querying multiple sources is to engineer an ETL pipeline to a common DW, and query a DW for any cross segments data. But obviously ETL implies a delay, we can do better.
  17. Drill is not to replace MapReduce, but to supplement it
  18. Two innovations: handle nested-data column style (column-striped representation) and multi-level execution trees
  19. repetition levels (r) — at what repeated field in the field’s path the value has repeated. definition levels (d) — how many fields in path that could be undefined (because they are optional or repeated) are actually present Only repeated fields increment the repetition level, only non-required fields increment the definition level. Required fields are always defined and do not need a definition level. Non repeated fields do not need a repetition level. An optional field requires one extra bit to store zero if it is NULL and one if it is defined. NULL values do not need to be stored as the definition level captures this information.
  20. Source query - Human (eg DSL) or tool written(eg SQL/ANSI compliant) query Source query is parsed and transformed to produce the logical plan Logical plan: dataflow of what should logically be done Typically, the logical plan lives in memory in the form of Java objects, but also has a textual form The logical query is then transformed and optimized into the physical plan. Optimizer introduces of parallel computation, taking topology into account Optimizer handles columnar data to improve processing speed The physical plan represents the actual structure of computation as it is done by the system How physical and exchange operators should be applied Assignment to particular nodes and cores + actual query execution per node
  21. Drillbits per node, maximize data locality Co-ordination, query planning, optimization, scheduling, execution are distributed By default, Drillbits hold all roles, modules can optionally be disabled. Any node/Drillbit can act as endpoint for particular query.
  22. Zookeeper maintains ephemeral cluster membership information only Small distributed cache utilizing embedded Hazelcast maintains information about individual queue depth, cached query plans, metadata, locality information, etc.
  23. Originating Drillbit acts as foreman, manages all execution for their particular query, scheduling based on priority, queue depth and locality information. Drillbit data communication is streaming and avoids any serialization/deserialization
  24. Red: originating drillbit, is the root of the multi-level execution tree, per query/job Leafs use their storage engine interface to scan respective data source (DB, file, etc.)
  25. Handing over to Ted
  26. Michael?