SlideShare a Scribd company logo
1 of 44
Download to read offline
®
© 2014 MapR Technologies 1
®
© 2014 MapR Technologies
Drilling Into Drill: Flexiperf
Jacques Nadeau, Architect and VP Apache Drill
February 19, 2015
®
© 2014 MapR Technologies 2© 2014 MapR Technologies
®
“Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on-
pretty-much-anything, immediately, and without formality.”
-Andrew Brust, GigaOM Research, Dec 2014
®
© 2014 MapR Technologies 3
Agenda
•  What?
–  SQL like mom made
–  Punk SQL
•  How?
–  Flexibility
–  Performance
•  Who & When
®
© 2014 MapR Technologies 4
SQL Like Mom Made
®
© 2014 MapR Technologies 5
SoH Table Stakes: Warehousing and Business Intelligence
ANSI Syntax
•  SELECT, FROM, WHERE,
JOIN, HAVING, ORDER BY,
WITH, CTAS, OVER*,
ROLLUP*, CUBE*, ALL,
EXISTS, ANY, IN, SOME
•  VarChar, Int, BigInt, Decimal,
VarBinary, Timestamp, Float,
Double, etc.
•  Subqueries, scalar subqueries*,
partition pruning, CTE
Interactive SQL Workloads
•  Data warehouse offload
•  Tableau, ODBC, JDBC
•  TPC-H & TPC-DS-like
workloads
*alpha or imminent
Standard Hadoop Tools
•  Supports Hive SerDes
•  Supports Hive UDFs
•  Supports Hive Metastore
®
© 2014 MapR Technologies 6
Punk SQL
®
© 2014 MapR Technologies 7
Punk SQL: SQL for a Hadoop World
Modern Syntax
•  Path based queries and
wildcards
–  select * from /my/logs/
–  select * from /revenue/*/q2
•  Modern data types
–  Any, Map, Array (JSON)
•  Complex Functions and
Relational Operators
–  FLATTEN, kvgen,
convert_from, convert_to,
repeated_count, etc
New Workloads
•  JSON Sensor analytics
•  Complex data analysis
•  Alternative DSLs
New Ways to Work
•  Query without prep
•  Workspaces without admin
intervention
•  Expose query as MapReduce
•  Expose query as Spark RDD
®
© 2014 MapR Technologies 8© 2014 MapR Technologies
®
How?
®
© 2014 MapR Technologies 9
Flexibility
your tool should
be flexible…
so you don’t have to be
®
© 2014 MapR Technologies 10
Flexibility is a Vision, Usability, a Religion
•  Deployment
•  Data Model
•  Schema
•  Security
•  Access Methods
®
© 2014 MapR Technologies 11
Supporting the Changing Roles of Big Data
Data Dev Circa 2000
1.  Developer comes up with
requirements
2.  DBA defines tables
3.  DBA defines indices
4.  DBA defines FK relationships
5.  Developer stores data
6.  BI builds reports
7.  Analyst views reports
8.  DBA adds materialized views
Data Today
1.  Developer builds app,
defines schema, stores
data
2.  Analyst queries data
3.  Data engineer fixes
performance problems or
fills functionality gaps
®
© 2014 MapR Technologies 12
SoH YSoH X
Deployment: Traditional Hadoop Query Engines
DFS
MySQL
Metadata
Service
DFS DFS
MR MR MR
Gateway
Service
Gateway
Service
Gateway
Service
MySQL
(replica)
MySQL
Metadata
Service
DFS
EXEC
Catalog
Service
State
Service
MySQL
(replica)
DFS
EXEC
DFS
EXEC
Authorization
Service Authorization
Service
®
© 2014 MapR Technologies 13
Challenges
•  Many “special” nodes
•  Each service may need special handling
–  HA process
–  Backup process
•  Each system requires you to set up a database on top of which
you run your database
•  No easy way to run on your laptop for experimentation
–  Some solutions don’t even support non-Linux environments
®
© 2014 MapR Technologies 14
Distributed on Hadoop
Distributed
Single or CLI/Embedded
Drill Deployment is Easy
•  Single Daemon for all purposes
•  No special considerations for
scaling or availability
•  With or without DFS
•  Works with other data systems
(Mongo, Cassanrda & JDBC
coming soon)
•  Runs on Linux, Mac or Windows
•  No separate database
•  JSON everything
•  Access via HTTP, Java, C, JDBC,
ODBC, CLI
Drillbit
Drillbit Drillbit Drillbit
Drillbit Drillbit Drillbit
DFS DFS DFS
®
© 2014 MapR Technologies 15
Drill Provides A Flexible Data Model
HBase
JSON
BSON
CSV
TSV
Parquet
Avro
Schema-lessFixed schema
Flat
Complex
Flexibility
Name! Gender! Age!
Michael! M! 6!
Jennifer! F! 3!
{!
name: {!
first: Michael,!
last: Smith!
},!
hobbies: [ski, soccer],!
district: Los Altos!
}!
{!
name: {!
first: Jennifer,!
last: Gates!
},!
hobbies: [sing],!
preschool: CCLC!
}!
RDBMS/SQL-on-Hadoop table
Apache Drill table
Flexibility
®
© 2014 MapR Technologies 16
Leave Your Data Where it is. Access it Centrally & Uniformly.
•  Drill is storage agnostic
•  Interacts to storage through
plugins
•  Storage plugins expose optimizer
rules
–  Optimizer rules work directly on
logical operation to expose
maximum capabilities
•  Reference multiple Hive, HBase,
MongoDB, DFS, etc systems
®
© 2014 MapR Technologies 17
Leverage that Massive Scalable Redundant Infrastructure
Single Store for Data and Metadata
•  HDFS is already your single canonical store
•  Don’t create a secondary metadata store
Avoid Metadata Management and synchronization
•  Store metadata inline
–  If you can’t, store it next to files
•  Move directories around at will
•  Delete things at will
®
© 2014 MapR Technologies 18
Flexibility in how you describe your data
•  Drill doesn’t require schema, detects file types based on
–  extensions
–  magic bytes (e.g. PAR1)
–  systems settings
•  Query can be planned on any file, anywhere
•  Data types are determined as data arrives
•  Some formats have known schema
–  If they don’t, you can expose them as such through views
–  Views are simply JSON files that define view SQL
®
© 2014 MapR Technologies 19
SELECT timestamp, message
FROM dfs1.logs.`AppServerLogs/2014/Jan/`
WHERE errorLevel > 2
This is a cluster in Apache Drill
-  DFS
-  HBase
-  Hive meta-store
A work-space
-  Typically a
sub-
directory
A table
-  pathnames
-  HBase table
-  Hive table
That means everything looks like a table…
®
© 2014 MapR Technologies 20
Query a File or a Directory Tree
-- Dynamic queries on files
select errorLevel, count(*)

from dfs.logs.`/AppServerLogs/2014/Jan/
part0001.parquet` group by errorLevel;
-- Dynamic queries on entire directory tree
select errorLevel, count(*) as TotalErrors

from dfs.logs.`/AppServerLogs`

group by errorLevel;
®
© 2014 MapR Technologies 21
Directories are Implicit Partitions
-- Direct
SELECT dir0, SUM(amount)
FROM sales

GROUP BY dir1 IN (q1, q3)
-- View
CREATE VIEW sales_q2 as
SELECT dir0 as year, amount
FROM sales
WHERE dir = ‘q2’
sales
├── 2014
│   ├── q1
│   ├── q2
│   ├── q3
│   └── q4
└── 2015
└── q1
®
© 2014 MapR Technologies 22
Access Hard to Reach Data with Nominal Configuration
-- embedded JSON value inside column donutjson inside
column-family cf1 of an hbase table donuts
SELECT
d.name, COUNT(d.fillings)
FROM (
SELECT convert_from(cf1.donutjson, JSON) as d
FROM hbase.donuts);
®
© 2014 MapR Technologies 23
On the fly binary data interpretation
CONVERT_FROM and CONVERT_TO allows access to common
Hadoop encodings:
•  Boolean, byte, byte_be, tinyint, tinyint_be, smallint, smallint_be,
int, int_be, bigint, bigint_be, float, double, int_hadoopv,
bigint_hadoopv, date_epoch_be, date_epoch, time_epoch_be,
time_epoch, utf8, utf16, json
•  E.g. CONVERT_FROM(mydata, ‘int_hadoopv’) => Internal INT
format
•  E.g. CONVERT_FROM(mydata, ‘JSON’) => UTF8 JSON to
Internal complex object
®
© 2014 MapR Technologies 24
Secure your data without an extra service
•  Drill Views
•  Ownership chaining with
configurable delegation TTL
•  Leverages existing HDFS
ACLs
•  Complete security solution
without additional services or
software
MaskedSales.view
owner: Cindy
GrossSales.view
owner: dba
RawSales.parquet
owner: dba
Frank file
view perm
Cindy file
view perm
dba
delegated
read
2-stepownershipchain
SQL by
Frank
®
© 2014 MapR Technologies 25Performance: Getting to Plaid
®
© 2014 MapR Technologies 26
Performance is about data management, not just wall clock
•  How quickly can you incorporate new data
•  What types of prep or ETL you need to do
•  What you can do without having to read data & calculate
statistics
–  Remember, many users never read all their data
•  Time to first result
•  Time to last result
•  Managing Scarce Resources
•  Concurrency
®
© 2014 MapR Technologies 27
Core Drill Architectural Goals
•  Go fast when you don’t know anything
–  And do “the right thing”
•  Go faster when you do know things
®
© 2014 MapR Technologies 28
An Optimistic, Pipelined Purpose-Built DAG Engine
•  Three-Level DAG
•  Major Fragments (phases)
•  Minor Fragments (threads)
•  Operators (in-thread operations)
> explain plan for select * from customer limit 5;
…
00-00 Screen
00-01 SelectionVectorRemover
00-02 Limit(fetch=[5])
00-03 UnionExchange
01-01 ProducerConsumer
01-02 Scan(groupscan=[ParquetGroupScan […
®
© 2014 MapR Technologies 29
Each phase (MajorFragment) gets Parallelized (MinorFragment)
1:0 1:1 1:2 1:3
0:0
®
© 2014 MapR Technologies 30
Parallel Calcite: Extensible Cost-based Optimization
•  Volcano-inspired cost based optimizer
–  CPU, IO, memory, network (data locality)
•  Pluggable rules per storage subsystem
•  Exchanges provide parallelization:
–  Hash, Ordered, Broadcast and Merging
•  Join and aggregation planning
–  Merge Join vs Hash Join
–  Partition-based join vs Broadcast-based join
–  Streaming Aggregation vs Hash Aggregation
•  Relies heavily on the awesome Apache
Calcite project
Query
Optimizer
HBase
Rules
File System
Rules
®
© 2014 MapR Technologies 31
Statistics
•  Different Storage systems have different capabilities
•  All: Basic estimation of row count
–  Allows the first 70% of critical optimization decisions for most Hadoop
workloads
–  Either exact numbers or approximations based on best effort
•  Some file formats
–  Better than basic stats in file, such as Parquet
•  Substantial effort to make certain formats ‘nearly native’
–  Drill adding support for extended attributes for enhanced file-held
statistics
®
© 2014 MapR Technologies 32
Reading Data Quickly, Moving Data Quickly
•  Highly Optimized Native Drill Readers:
–  Vectorized Parquet, Text/CSV, JSON
–  Also works with all Hive supported formats
•  Drill supports partition pruning
–  Adding direct physical property exposure soon for highly optimized
cases
•  Drill parallelizes to maximum level format allows
–  Also balances data locality and maximum parallelization
•  Bespoke Asynchronous Zero-Copy RPC Layer
–  Built specifically for Drill’s internal data format
®
© 2014 MapR Technologies 33
Parquet: The Format for the next decade
Apache Drill & Apache Parquet communities working together
•  Better Ecosystem integration and decentralized metadata
–  Self Description capabilities
–  Ecosystem support for logical data types
•  Enhanced Performance
–  New vectorized reading
–  Enhanced memory pooling and management
–  Indexing*
–  Better metadata positioning (for improved page pruning)*
–  Enhanced vectorization and late materialization reading*
*In progress
®
© 2014 MapR Technologies 34
DISK
Value Vectors & Record Batches: Drill’s In-memory
Columnar Work Units
•  Random access: sort without copy or
restructuring
•  Fully specified in memory shredded complex
data structures
•  Remove serialization or copy overhead at
node boundaries
•  Spool to disk as necessary
•  Interact with data in Java or C without copy
or overhead
Drill Bit
Memory
overflow
uses disk
®
© 2014 MapR Technologies 35
Runtime Compilation and Multi-phased planning
•  Drill does best effort initial query parsing, planning and validation
–  Where Drill doesn’t understand data, it provides support for ANY type,
allowing late type binding.
•  At execution time, individual nodes do secondary pass
–  Schema <--> Query parsing and validation
–  Type casting, coercion and promotion
–  Compilation based on schema requirements
•  As schema changes, Drill supports recompilation of each
operator as necessary
®
© 2014 MapR Technologies 36
Runtime Compilation Pattern
Custom
Bytecode
Optimization
Bytecode
Merging
Janino
compilation
CodeModel
Generated
Code
Precompiled
Bytecode
Templates Loaded Class
UDFs
Source Code
Transformation
Bytecode
Merging
®
© 2014 MapR Technologies 37
Advanced Compilation Techniques
•  Optimization based on observation and assembly
•  Drill does a number of pre-machine-code-compilation
optimizations to ensure efficient execution
•  Some examples:
–  Removal of type and bounds checking
–  Direct micro pointers for in-record-batch references
–  Little endian data formats
–  Bytecode-level scalar replacement
®
© 2014 MapR Technologies 38
Enhanced Interfaces for high-speed Storage and UDF
•  All built-in functions in Drill are simply UDFs
•  UDF interface works directly with
–  compilation engine
–  bytecode rewriting algorithms
•  Storage Interface is a columnar vectorized transfer interface
–  Allows storage system to determine best transformation approach
–  Adding support for deferred materialization
–  Support for multiple target languages with minimal overhead
®
© 2014 MapR Technologies 39
Drill Does Vectorization & Supports Columnar Functions
•  Drill often operates on more than one record at a time
–  Word-sized manipulations
–  SIMD instructions
–  Manually coded algorithms
•  Columnar Functions Improve Many Operations
–  Bitmaps allow lightning fast null-checks and reduction in branching
–  Type demotion to reduce memory and CPU computation overhead
–  Direct Conversions where possible
®
© 2014 MapR Technologies 40
Drill provides advanced query telemetry
•  What happened during query for all three levels of DAG
execution
•  Each profile is stored as JSON file for easy review, sharing and
backup (at end of query execution)
•  Profiles can be analyzed using Drill, allows:
–  easy longitudinal analysis of workload
–  multi-tenancy performance analysis
–  impact of configuration changes to benchmark workloads
®
© 2014 MapR Technologies 41
A Color-coded visual layout and Gant timing chart is provided
®
© 2014 MapR Technologies 42
Memory Efficiency
•  Drill’s in memory representation is designed to minimize memory
overhead
•  Custom implementation of columnar-aware data structures
including hash tables, sort operations, etc.
–  For example, entry overhead for hash table is 8 bytes per value
–  Sort pointer overhead for sort is 2 to 4 bytes per entry
•  Adding support for compressed columnar representation further
to improve compactness
®
© 2014 MapR Technologies 43
Scale and Concurrency
•  Drill’s execution model leverages both local and remote exchanges for
changes in parallelization
–  Muxing Local, Demuxing Local, Broadcast, Hash to Merge, Hash to Random, Ordered
Partition, Single Merge, Union
•  All operations can be parallelized at the thread and node level
•  Thread count and parallelization are influenced by data size, query phasing
and system load
–  Administrators have basic queuing control to manage workload
•  All threads are independent and pipelined, all run in a single process per
node
•  Node <> node communication is multiplexed, push-based with sub-socket
back-pressure support
•  Testing has proceeded up to 150 nodes. Target is 1000 nodes by GA.
–  Drill adapts data transfer size, buffers, muxing and other operations based on query
scale to minimize n2 multiplier effects
®
© 2014 MapR Technologies 44
Who & When
Office Hours
•  Now, table B
Apache: join in!
getdrill.org
0.8 next week
1.0 in month or two
@INTJESUS
@ApacheDrill
@MAPR
GigaOM Research, Sector Roadmap: Hadoop/Data Warehouse Interoperability

More Related Content

What's hot

Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into ProductionMapR Technologies
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill Carol McDonald
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillMapR Technologies
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, Howmcsrivas
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillMapR Technologies
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache DrillMapR Technologies
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Gera Shegalov
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drillJulien Le Dem
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfCharles Givre
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Vince Gonzalez
 
Free Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillFree Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillMapR Technologies
 
Apache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaApache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaDataWorks Summit
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBradford Stephens
 
Rethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillRethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillMapR Technologies
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on HadoopCarol McDonald
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystemsunera pathan
 

What's hot (20)

M7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal HausenblasM7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal Hausenblas
 
Putting Apache Drill into Production
Putting Apache Drill into ProductionPutting Apache Drill into Production
Putting Apache Drill into Production
 
NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill NoSQL HBase schema design and SQL with Apache Drill
NoSQL HBase schema design and SQL with Apache Drill
 
Hadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache DrillHadoop User Group - Status Apache Drill
Hadoop User Group - Status Apache Drill
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 
SQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache DrillSQL-on-Hadoop with Apache Drill
SQL-on-Hadoop with Apache Drill
 
Drilling into Data with Apache Drill
Drilling into Data with Apache DrillDrilling into Data with Apache Drill
Drilling into Data with Apache Drill
 
Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013Apache Drill @ PJUG, Jan 15, 2013
Apache Drill @ PJUG, Jan 15, 2013
 
Sql on everything with drill
Sql on everything with drillSql on everything with drill
Sql on everything with drill
 
Apache drill
Apache drillApache drill
Apache drill
 
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard OfApache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of
 
Apache Spark & Hadoop
Apache Spark & HadoopApache Spark & Hadoop
Apache Spark & Hadoop
 
Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0Working with Delimited Data in Apache Drill 1.6.0
Working with Delimited Data in Apache Drill 1.6.0
 
Free Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillFree Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache Drill
 
Apache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in AlibabaApache Hadoop YARN 3.x in Alibaba
Apache Hadoop YARN 3.x in Alibaba
 
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed ComputingBuilding a Business on Hadoop, HBase, and Open Source Distributed Computing
Building a Business on Hadoop, HBase, and Open Source Distributed Computing
 
Rethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache DrillRethinking SQL for Big Data with Apache Drill
Rethinking SQL for Big Data with Apache Drill
 
Introduction to Spark on Hadoop
Introduction to Spark on HadoopIntroduction to Spark on Hadoop
Introduction to Spark on Hadoop
 
Hadoop And Their Ecosystem
 Hadoop And Their Ecosystem Hadoop And Their Ecosystem
Hadoop And Their Ecosystem
 

Viewers also liked

Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill WorkshopCharles Givre
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / OptiqJulian Hyde
 
SQL for NoSQL and how Apache Calcite can help
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can helpChristian Tzolov
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteJulian Hyde
 
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016Carl Steinbach
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerSpark Summit
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 

Viewers also liked (8)

Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill Workshop
 
Drill / SQL / Optiq
Drill / SQL / OptiqDrill / SQL / Optiq
Drill / SQL / Optiq
 
SQL for NoSQL and how Apache Calcite can help
SQL for NoSQL and how  Apache Calcite can helpSQL for NoSQL and how  Apache Calcite can help
SQL for NoSQL and how Apache Calcite can help
 
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache CalciteCost-based Query Optimization in Apache Phoenix using Apache Calcite
Cost-based Query Optimization in Apache Phoenix using Apache Calcite
 
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
LinkedIn's Logical Data Access Layer for Hadoop -- Strata London 2016
 
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S OptimizerDeep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
Deep Dive Into Catalyst: Apache Spark 2.0'S Optimizer
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 

Similar to Drill into Drill – How Providing Flexibility and Performance is Possible

Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillTomer Shiran
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecturesaipriyacoool
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over YarnInMobi Technology
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Steve Min
 
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)BigDataEverywhere
 
Self-Service Data Exploration with Apache Drill
Self-Service Data Exploration with Apache DrillSelf-Service Data Exploration with Apache Drill
Self-Service Data Exploration with Apache DrillMapR Technologies
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
HdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft PlatformHdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft Platformnvvrajesh
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick viewRajesh Nadipalli
 
Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRData Con LA
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingBikas Saha
 
Sharing resources with non-Hadoop workloads
Sharing resources with non-Hadoop workloadsSharing resources with non-Hadoop workloads
Sharing resources with non-Hadoop workloadsDataWorks Summit
 
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part20812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2Raul Chong
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez Hortonworks
 

Similar to Drill into Drill – How Providing Flexibility and Performance is Possible (20)

Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
2014 08-20-pit-hug
2014 08-20-pit-hug2014 08-20-pit-hug
2014 08-20-pit-hug
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Hadoop ppt on the basics and architecture
Hadoop ppt on the basics and architectureHadoop ppt on the basics and architecture
Hadoop ppt on the basics and architecture
 
Tez Data Processing over Yarn
Tez Data Processing over YarnTez Data Processing over Yarn
Tez Data Processing over Yarn
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
Big Data Everywhere Chicago: Getting Real with the MapR Platform (MapR)
 
Self-Service Data Exploration with Apache Drill
Self-Service Data Exploration with Apache DrillSelf-Service Data Exploration with Apache Drill
Self-Service Data Exploration with Apache Drill
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
HdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft PlatformHdInsight essentials Hadoop on Microsoft Platform
HdInsight essentials Hadoop on Microsoft Platform
 
Hd insight essentials quick view
Hd insight essentials quick viewHd insight essentials quick view
Hd insight essentials quick view
 
Introduction to Hadoop
Introduction to HadoopIntroduction to Hadoop
Introduction to Hadoop
 
Hadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapRHadoop and NoSQL joining forces by Dale Kim of MapR
Hadoop and NoSQL joining forces by Dale Kim of MapR
 
Apache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query ProcessingApache Tez : Accelerating Hadoop Query Processing
Apache Tez : Accelerating Hadoop Query Processing
 
Sharing resources with non-Hadoop workloads
Sharing resources with non-Hadoop workloadsSharing resources with non-Hadoop workloads
Sharing resources with non-Hadoop workloads
 
IBM - Introduction to Cloudant
IBM - Introduction to CloudantIBM - Introduction to Cloudant
IBM - Introduction to Cloudant
 
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part20812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
 
YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez YARN Ready: Integrating to YARN with Tez
YARN Ready: Integrating to YARN with Tez
 

More from MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsMapR Technologies
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMapR Technologies
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action MapR Technologies
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsMapR Technologies
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageMapR Technologies
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionMapR Technologies
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformMapR Technologies
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...MapR Technologies
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareMapR Technologies
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsMapR Technologies
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Technologies
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data AnalyticsMapR Technologies
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsMapR Technologies
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR Technologies
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLMapR Technologies
 

More from MapR Technologies (20)

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscape
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & Evaluation
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your Data
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data Capture
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
 
ML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning LogisticsML Workshop 1: A New Architecture for Machine Learning Logistics
ML Workshop 1: A New Architecture for Machine Learning Logistics
 
Machine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model ManagementMachine Learning Success: The Key to Easier Model Management
Machine Learning Success: The Key to Easier Model Management
 
Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action Data Warehouse Modernization: Accelerating Time-To-Action
Data Warehouse Modernization: Accelerating Time-To-Action
 
Live Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIsLive Tutorial – Streaming Real-Time Events Using Apache APIs
Live Tutorial – Streaming Real-Time Events Using Apache APIs
 
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale StorageBringing Structure, Scalability, and Services to Cloud-Scale Storage
Bringing Structure, Scalability, and Services to Cloud-Scale Storage
 
Live Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn PredictionLive Machine Learning Tutorial: Churn Prediction
Live Machine Learning Tutorial: Churn Prediction
 
An Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data PlatformAn Introduction to the MapR Converged Data Platform
An Introduction to the MapR Converged Data Platform
 
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
How to Leverage the Cloud for Business Solutions | Strata Data Conference Lon...
 
Best Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in HealthcareBest Practices for Data Convergence in Healthcare
Best Practices for Data Convergence in Healthcare
 
Geo-Distributed Big Data and Analytics
Geo-Distributed Big Data and AnalyticsGeo-Distributed Big Data and Analytics
Geo-Distributed Big Data and Analytics
 
MapR Product Update - Spring 2017
MapR Product Update - Spring 2017MapR Product Update - Spring 2017
MapR Product Update - Spring 2017
 
3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics3 Benefits of Multi-Temperature Data Management for Data Analytics
3 Benefits of Multi-Temperature Data Management for Data Analytics
 
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA DeploymentsCisco & MapR bring 3 Superpowers to SAP HANA Deployments
Cisco & MapR bring 3 Superpowers to SAP HANA Deployments
 
MapR and Cisco Make IT Better
MapR and Cisco Make IT BetterMapR and Cisco Make IT Better
MapR and Cisco Make IT Better
 
Evolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQLEvolving from RDBMS to NoSQL + SQL
Evolving from RDBMS to NoSQL + SQL
 

Recently uploaded

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 

Recently uploaded (20)

HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 

Drill into Drill – How Providing Flexibility and Performance is Possible

  • 1. ® © 2014 MapR Technologies 1 ® © 2014 MapR Technologies Drilling Into Drill: Flexiperf Jacques Nadeau, Architect and VP Apache Drill February 19, 2015
  • 2. ® © 2014 MapR Technologies 2© 2014 MapR Technologies ® “Drill isn’t just about SQL-on-Hadoop. It’s about SQL-on- pretty-much-anything, immediately, and without formality.” -Andrew Brust, GigaOM Research, Dec 2014
  • 3. ® © 2014 MapR Technologies 3 Agenda •  What? –  SQL like mom made –  Punk SQL •  How? –  Flexibility –  Performance •  Who & When
  • 4. ® © 2014 MapR Technologies 4 SQL Like Mom Made
  • 5. ® © 2014 MapR Technologies 5 SoH Table Stakes: Warehousing and Business Intelligence ANSI Syntax •  SELECT, FROM, WHERE, JOIN, HAVING, ORDER BY, WITH, CTAS, OVER*, ROLLUP*, CUBE*, ALL, EXISTS, ANY, IN, SOME •  VarChar, Int, BigInt, Decimal, VarBinary, Timestamp, Float, Double, etc. •  Subqueries, scalar subqueries*, partition pruning, CTE Interactive SQL Workloads •  Data warehouse offload •  Tableau, ODBC, JDBC •  TPC-H & TPC-DS-like workloads *alpha or imminent Standard Hadoop Tools •  Supports Hive SerDes •  Supports Hive UDFs •  Supports Hive Metastore
  • 6. ® © 2014 MapR Technologies 6 Punk SQL
  • 7. ® © 2014 MapR Technologies 7 Punk SQL: SQL for a Hadoop World Modern Syntax •  Path based queries and wildcards –  select * from /my/logs/ –  select * from /revenue/*/q2 •  Modern data types –  Any, Map, Array (JSON) •  Complex Functions and Relational Operators –  FLATTEN, kvgen, convert_from, convert_to, repeated_count, etc New Workloads •  JSON Sensor analytics •  Complex data analysis •  Alternative DSLs New Ways to Work •  Query without prep •  Workspaces without admin intervention •  Expose query as MapReduce •  Expose query as Spark RDD
  • 8. ® © 2014 MapR Technologies 8© 2014 MapR Technologies ® How?
  • 9. ® © 2014 MapR Technologies 9 Flexibility your tool should be flexible… so you don’t have to be
  • 10. ® © 2014 MapR Technologies 10 Flexibility is a Vision, Usability, a Religion •  Deployment •  Data Model •  Schema •  Security •  Access Methods
  • 11. ® © 2014 MapR Technologies 11 Supporting the Changing Roles of Big Data Data Dev Circa 2000 1.  Developer comes up with requirements 2.  DBA defines tables 3.  DBA defines indices 4.  DBA defines FK relationships 5.  Developer stores data 6.  BI builds reports 7.  Analyst views reports 8.  DBA adds materialized views Data Today 1.  Developer builds app, defines schema, stores data 2.  Analyst queries data 3.  Data engineer fixes performance problems or fills functionality gaps
  • 12. ® © 2014 MapR Technologies 12 SoH YSoH X Deployment: Traditional Hadoop Query Engines DFS MySQL Metadata Service DFS DFS MR MR MR Gateway Service Gateway Service Gateway Service MySQL (replica) MySQL Metadata Service DFS EXEC Catalog Service State Service MySQL (replica) DFS EXEC DFS EXEC Authorization Service Authorization Service
  • 13. ® © 2014 MapR Technologies 13 Challenges •  Many “special” nodes •  Each service may need special handling –  HA process –  Backup process •  Each system requires you to set up a database on top of which you run your database •  No easy way to run on your laptop for experimentation –  Some solutions don’t even support non-Linux environments
  • 14. ® © 2014 MapR Technologies 14 Distributed on Hadoop Distributed Single or CLI/Embedded Drill Deployment is Easy •  Single Daemon for all purposes •  No special considerations for scaling or availability •  With or without DFS •  Works with other data systems (Mongo, Cassanrda & JDBC coming soon) •  Runs on Linux, Mac or Windows •  No separate database •  JSON everything •  Access via HTTP, Java, C, JDBC, ODBC, CLI Drillbit Drillbit Drillbit Drillbit Drillbit Drillbit Drillbit DFS DFS DFS
  • 15. ® © 2014 MapR Technologies 15 Drill Provides A Flexible Data Model HBase JSON BSON CSV TSV Parquet Avro Schema-lessFixed schema Flat Complex Flexibility Name! Gender! Age! Michael! M! 6! Jennifer! F! 3! {! name: {! first: Michael,! last: Smith! },! hobbies: [ski, soccer],! district: Los Altos! }! {! name: {! first: Jennifer,! last: Gates! },! hobbies: [sing],! preschool: CCLC! }! RDBMS/SQL-on-Hadoop table Apache Drill table Flexibility
  • 16. ® © 2014 MapR Technologies 16 Leave Your Data Where it is. Access it Centrally & Uniformly. •  Drill is storage agnostic •  Interacts to storage through plugins •  Storage plugins expose optimizer rules –  Optimizer rules work directly on logical operation to expose maximum capabilities •  Reference multiple Hive, HBase, MongoDB, DFS, etc systems
  • 17. ® © 2014 MapR Technologies 17 Leverage that Massive Scalable Redundant Infrastructure Single Store for Data and Metadata •  HDFS is already your single canonical store •  Don’t create a secondary metadata store Avoid Metadata Management and synchronization •  Store metadata inline –  If you can’t, store it next to files •  Move directories around at will •  Delete things at will
  • 18. ® © 2014 MapR Technologies 18 Flexibility in how you describe your data •  Drill doesn’t require schema, detects file types based on –  extensions –  magic bytes (e.g. PAR1) –  systems settings •  Query can be planned on any file, anywhere •  Data types are determined as data arrives •  Some formats have known schema –  If they don’t, you can expose them as such through views –  Views are simply JSON files that define view SQL
  • 19. ® © 2014 MapR Technologies 19 SELECT timestamp, message FROM dfs1.logs.`AppServerLogs/2014/Jan/` WHERE errorLevel > 2 This is a cluster in Apache Drill -  DFS -  HBase -  Hive meta-store A work-space -  Typically a sub- directory A table -  pathnames -  HBase table -  Hive table That means everything looks like a table…
  • 20. ® © 2014 MapR Technologies 20 Query a File or a Directory Tree -- Dynamic queries on files select errorLevel, count(*)
 from dfs.logs.`/AppServerLogs/2014/Jan/ part0001.parquet` group by errorLevel; -- Dynamic queries on entire directory tree select errorLevel, count(*) as TotalErrors
 from dfs.logs.`/AppServerLogs`
 group by errorLevel;
  • 21. ® © 2014 MapR Technologies 21 Directories are Implicit Partitions -- Direct SELECT dir0, SUM(amount) FROM sales
 GROUP BY dir1 IN (q1, q3) -- View CREATE VIEW sales_q2 as SELECT dir0 as year, amount FROM sales WHERE dir = ‘q2’ sales ├── 2014 │   ├── q1 │   ├── q2 │   ├── q3 │   └── q4 └── 2015 └── q1
  • 22. ® © 2014 MapR Technologies 22 Access Hard to Reach Data with Nominal Configuration -- embedded JSON value inside column donutjson inside column-family cf1 of an hbase table donuts SELECT d.name, COUNT(d.fillings) FROM ( SELECT convert_from(cf1.donutjson, JSON) as d FROM hbase.donuts);
  • 23. ® © 2014 MapR Technologies 23 On the fly binary data interpretation CONVERT_FROM and CONVERT_TO allows access to common Hadoop encodings: •  Boolean, byte, byte_be, tinyint, tinyint_be, smallint, smallint_be, int, int_be, bigint, bigint_be, float, double, int_hadoopv, bigint_hadoopv, date_epoch_be, date_epoch, time_epoch_be, time_epoch, utf8, utf16, json •  E.g. CONVERT_FROM(mydata, ‘int_hadoopv’) => Internal INT format •  E.g. CONVERT_FROM(mydata, ‘JSON’) => UTF8 JSON to Internal complex object
  • 24. ® © 2014 MapR Technologies 24 Secure your data without an extra service •  Drill Views •  Ownership chaining with configurable delegation TTL •  Leverages existing HDFS ACLs •  Complete security solution without additional services or software MaskedSales.view owner: Cindy GrossSales.view owner: dba RawSales.parquet owner: dba Frank file view perm Cindy file view perm dba delegated read 2-stepownershipchain SQL by Frank
  • 25. ® © 2014 MapR Technologies 25Performance: Getting to Plaid
  • 26. ® © 2014 MapR Technologies 26 Performance is about data management, not just wall clock •  How quickly can you incorporate new data •  What types of prep or ETL you need to do •  What you can do without having to read data & calculate statistics –  Remember, many users never read all their data •  Time to first result •  Time to last result •  Managing Scarce Resources •  Concurrency
  • 27. ® © 2014 MapR Technologies 27 Core Drill Architectural Goals •  Go fast when you don’t know anything –  And do “the right thing” •  Go faster when you do know things
  • 28. ® © 2014 MapR Technologies 28 An Optimistic, Pipelined Purpose-Built DAG Engine •  Three-Level DAG •  Major Fragments (phases) •  Minor Fragments (threads) •  Operators (in-thread operations) > explain plan for select * from customer limit 5; … 00-00 Screen 00-01 SelectionVectorRemover 00-02 Limit(fetch=[5]) 00-03 UnionExchange 01-01 ProducerConsumer 01-02 Scan(groupscan=[ParquetGroupScan […
  • 29. ® © 2014 MapR Technologies 29 Each phase (MajorFragment) gets Parallelized (MinorFragment) 1:0 1:1 1:2 1:3 0:0
  • 30. ® © 2014 MapR Technologies 30 Parallel Calcite: Extensible Cost-based Optimization •  Volcano-inspired cost based optimizer –  CPU, IO, memory, network (data locality) •  Pluggable rules per storage subsystem •  Exchanges provide parallelization: –  Hash, Ordered, Broadcast and Merging •  Join and aggregation planning –  Merge Join vs Hash Join –  Partition-based join vs Broadcast-based join –  Streaming Aggregation vs Hash Aggregation •  Relies heavily on the awesome Apache Calcite project Query Optimizer HBase Rules File System Rules
  • 31. ® © 2014 MapR Technologies 31 Statistics •  Different Storage systems have different capabilities •  All: Basic estimation of row count –  Allows the first 70% of critical optimization decisions for most Hadoop workloads –  Either exact numbers or approximations based on best effort •  Some file formats –  Better than basic stats in file, such as Parquet •  Substantial effort to make certain formats ‘nearly native’ –  Drill adding support for extended attributes for enhanced file-held statistics
  • 32. ® © 2014 MapR Technologies 32 Reading Data Quickly, Moving Data Quickly •  Highly Optimized Native Drill Readers: –  Vectorized Parquet, Text/CSV, JSON –  Also works with all Hive supported formats •  Drill supports partition pruning –  Adding direct physical property exposure soon for highly optimized cases •  Drill parallelizes to maximum level format allows –  Also balances data locality and maximum parallelization •  Bespoke Asynchronous Zero-Copy RPC Layer –  Built specifically for Drill’s internal data format
  • 33. ® © 2014 MapR Technologies 33 Parquet: The Format for the next decade Apache Drill & Apache Parquet communities working together •  Better Ecosystem integration and decentralized metadata –  Self Description capabilities –  Ecosystem support for logical data types •  Enhanced Performance –  New vectorized reading –  Enhanced memory pooling and management –  Indexing* –  Better metadata positioning (for improved page pruning)* –  Enhanced vectorization and late materialization reading* *In progress
  • 34. ® © 2014 MapR Technologies 34 DISK Value Vectors & Record Batches: Drill’s In-memory Columnar Work Units •  Random access: sort without copy or restructuring •  Fully specified in memory shredded complex data structures •  Remove serialization or copy overhead at node boundaries •  Spool to disk as necessary •  Interact with data in Java or C without copy or overhead Drill Bit Memory overflow uses disk
  • 35. ® © 2014 MapR Technologies 35 Runtime Compilation and Multi-phased planning •  Drill does best effort initial query parsing, planning and validation –  Where Drill doesn’t understand data, it provides support for ANY type, allowing late type binding. •  At execution time, individual nodes do secondary pass –  Schema <--> Query parsing and validation –  Type casting, coercion and promotion –  Compilation based on schema requirements •  As schema changes, Drill supports recompilation of each operator as necessary
  • 36. ® © 2014 MapR Technologies 36 Runtime Compilation Pattern Custom Bytecode Optimization Bytecode Merging Janino compilation CodeModel Generated Code Precompiled Bytecode Templates Loaded Class UDFs Source Code Transformation Bytecode Merging
  • 37. ® © 2014 MapR Technologies 37 Advanced Compilation Techniques •  Optimization based on observation and assembly •  Drill does a number of pre-machine-code-compilation optimizations to ensure efficient execution •  Some examples: –  Removal of type and bounds checking –  Direct micro pointers for in-record-batch references –  Little endian data formats –  Bytecode-level scalar replacement
  • 38. ® © 2014 MapR Technologies 38 Enhanced Interfaces for high-speed Storage and UDF •  All built-in functions in Drill are simply UDFs •  UDF interface works directly with –  compilation engine –  bytecode rewriting algorithms •  Storage Interface is a columnar vectorized transfer interface –  Allows storage system to determine best transformation approach –  Adding support for deferred materialization –  Support for multiple target languages with minimal overhead
  • 39. ® © 2014 MapR Technologies 39 Drill Does Vectorization & Supports Columnar Functions •  Drill often operates on more than one record at a time –  Word-sized manipulations –  SIMD instructions –  Manually coded algorithms •  Columnar Functions Improve Many Operations –  Bitmaps allow lightning fast null-checks and reduction in branching –  Type demotion to reduce memory and CPU computation overhead –  Direct Conversions where possible
  • 40. ® © 2014 MapR Technologies 40 Drill provides advanced query telemetry •  What happened during query for all three levels of DAG execution •  Each profile is stored as JSON file for easy review, sharing and backup (at end of query execution) •  Profiles can be analyzed using Drill, allows: –  easy longitudinal analysis of workload –  multi-tenancy performance analysis –  impact of configuration changes to benchmark workloads
  • 41. ® © 2014 MapR Technologies 41 A Color-coded visual layout and Gant timing chart is provided
  • 42. ® © 2014 MapR Technologies 42 Memory Efficiency •  Drill’s in memory representation is designed to minimize memory overhead •  Custom implementation of columnar-aware data structures including hash tables, sort operations, etc. –  For example, entry overhead for hash table is 8 bytes per value –  Sort pointer overhead for sort is 2 to 4 bytes per entry •  Adding support for compressed columnar representation further to improve compactness
  • 43. ® © 2014 MapR Technologies 43 Scale and Concurrency •  Drill’s execution model leverages both local and remote exchanges for changes in parallelization –  Muxing Local, Demuxing Local, Broadcast, Hash to Merge, Hash to Random, Ordered Partition, Single Merge, Union •  All operations can be parallelized at the thread and node level •  Thread count and parallelization are influenced by data size, query phasing and system load –  Administrators have basic queuing control to manage workload •  All threads are independent and pipelined, all run in a single process per node •  Node <> node communication is multiplexed, push-based with sub-socket back-pressure support •  Testing has proceeded up to 150 nodes. Target is 1000 nodes by GA. –  Drill adapts data transfer size, buffers, muxing and other operations based on query scale to minimize n2 multiplier effects
  • 44. ® © 2014 MapR Technologies 44 Who & When Office Hours •  Now, table B Apache: join in! getdrill.org 0.8 next week 1.0 in month or two @INTJESUS @ApacheDrill @MAPR GigaOM Research, Sector Roadmap: Hadoop/Data Warehouse Interoperability