SlideShare a Scribd company logo
1 of 135
HAWQ
Architecture
Alexey Grishchenko
Who I am
Enterprise Architect @ Pivotal
• 7 years in data processing
• 5 years of experience with MPP
• 4 years with Hadoop
• Using HAWQ since the first internal Beta
• Responsible for designing most of the EMEA HAWQ
and Greenplum implementations
• Spark contributor
• http://0x0fff.com
Agenda
• What is HAWQ
Agenda
• What is HAWQ
• Why you need it
Agenda
• What is HAWQ
• Why you need it
• HAWQ Components
Agenda
• What is HAWQ
• Why you need it
• HAWQ Components
• HAWQ Design
Agenda
• What is HAWQ
• Why you need it
• HAWQ Components
• HAWQ Design
• Query execution example
Agenda
• What is HAWQ
• Why you need it
• HAWQ Components
• HAWQ Design
• Query execution example
• Competitive solutions
What is
• Analytical SQL-on-Hadoop engine
What is
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
What is
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
Postgres Greenplum HAWQ
2005
Fork
Postgres 8.0.2
What is
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
Postgres HAWQ
2005
Fork
Postgres 8.0.2
2009
Rebase
Postgres 8.2.15
Greenplum
What is
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
Postgres HAWQ
2005
Fork
Postgres 8.0.2
2009
Rebase
Postgres 8.2.15
2011 Fork
GPDB 4.2.0.0
Greenplum
What is
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
Postgres HAWQ
2005
Fork
Postgres 8.0.2
2009
Rebase
Postgres 8.2.15
2011 Fork
GPDB 4.2.0.0
2013
HAWQ 1.0.0.0
Greenplum
What is
• Analytical SQL-on-Hadoop engine
• HAdoop With Queries
Postgres HAWQ
2005
Fork
Postgres 8.0.2
2009
Rebase
Postgres 8.2.15
2011 Fork
GPDB 4.2.0.0
2013
HAWQ 1.0.0.0
HAWQ 2.0.0.0
Open Source
2015
Greenplum
HAWQ is …
• 1’500’000 C and C++ lines of code
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
• 7’000 Shell scripts LOC
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
• 7’000 Shell scripts LOC
• More than 50 enterprise customers
HAWQ is …
• 1’500’000 C and C++ lines of code
– 200’000 of them in headers only
• 180’000 Python LOC
• 60’000 Java LOC
• 23’000 Makefile LOC
• 7’000 Shell scripts LOC
• More than 50 enterprise customers
– More than 10 of them in EMEA
Apache HAWQ
• Apache HAWQ (incubating) from 09’2015
– http://hawq.incubator.apache.org
– https://github.com/apache/incubator-hawq
• What’s in Open Source
– Sources of HAWQ 2.0 alpha
– HAWQ 2.0 beta is planned for 2015’Q4
– HAWQ 2.0 GA is planned for 2016’Q1
• Community is yet young – come and join!
Why do we need it?
Why do we need it?
• SQL-interface for BI solutions to the Hadoop
data complaint with ANSI SQL-92, -99, -2003
Why do we need it?
• SQL-interface for BI solutions to the Hadoop
data complaint with ANSI SQL-92, -99, -2003
– Example - 5000-line query with a number of
window function generated by Cognos
Why do we need it?
• SQL-interface for BI solutions to the Hadoop
data complaint with ANSI SQL-92, -99, -2003
– Example - 5000-line query with a number of
window function generated by Cognos
• Universal tool for ad hoc analytics on top of
Hadoop data
Why do we need it?
• SQL-interface for BI solutions to the Hadoop
data complaint with ANSI SQL-92, -99, -2003
– Example - 5000-line query with a number of
window function generated by Cognos
• Universal tool for ad hoc analytics on top of
Hadoop data
– Example - parse URL to extract protocol, host
name, port, GET parameters
Why do we need it?
• SQL-interface for BI solutions to the Hadoop
data complaint with ANSI SQL-92, -99, -2003
– Example - 5000-line query with a number of
window function generated by Cognos
• Universal tool for ad hoc analytics on top of
Hadoop data
– Example - parse URL to extract protocol, host
name, port, GET parameters
• Good performance
Why do we need it?
• SQL-interface for BI solutions to the Hadoop
data complaint with ANSI SQL-92, -99, -2003
– Example - 5000-line query with a number of
window function generated by Cognos
• Universal tool for ad hoc analytics on top of
Hadoop data
– Example - parse URL to extract protocol, host
name, port, GET parameters
• Good performance
– How many times the data would hit the HDD during
a single Hive query?
HAWQ Cluster
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
Server 2
ZK JM
Server 6
Datanode
Server N
Datanode
Server 5
Datanode
interconnect
…
HAWQ Cluster
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
Server 2
ZK JM
Server 6
Datanode
Server N
Datanode
Server 5
Datanode
YARN NM YARN NM YARN NM
YARN RM
YARN App
Timeline
interconnect
…
HAWQ Cluster
HAWQ Master
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
HAWQ
Standby
Server 2
ZK JM
HAWQ Segment
Server 6
Datanode
HAWQ Segment
Server N
Datanode
HAWQ Segment
Server 5
Datanode
YARN NM YARN NM YARN NM
YARN RM
YARN App
Timeline
interconnect
…
Master Servers
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
Server 2
ZK JM
HAWQ Segment
Server 6
Datanode
HAWQ Segment
Server N
Datanode
HAWQ Segment
Server 5
Datanode
YARN NM YARN NM YARN NM
YARN RM
YARN App
Timeline
interconnect
…
HAWQ Master
HAWQ
Standby
Master Servers
HAWQ Master
Query Parser
Query
Optimizer
Global
Resource
Manager
Distributed
Transactions
Manager
Query Dispatch
Metadata
Catalog
HAWQ Standby Master
Query Parser
Query
Optimizer
Global Resource
Manager
Distributed
Transactions
Manager
Query Dispatch
Metadata
Catalog
WAL
repl.
HAWQ Master
HAWQ
Standby
Segments
Server 1
SNameNode
Server 4
ZK JM
NameNode
Server 3
ZK JM
Server 2
ZK JM
Server 6
Datanode
Server N
Datanode
Server 5
Datanode
YARN NM YARN NM YARN NM
YARN RM
YARN App
Timeline
interconnect
HAWQ Segment HAWQ SegmentHAWQ Segment …
Segments
HAWQ Segment
Query Executor
libhdfs3
PXF
HDFS Datanode
Local Filesystem
Temporary Data
Directory
Logs
YARN Node Manager
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
– Number of unique values in the field
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
– Number of unique values in the field
– Number of null values in the field
Metadata
• HAWQ metadata structure is similar to
Postgres catalog structure
• Statistics
– Number of rows and pages in the table
– Most common values for each field
– Histogram of values distribution for each field
– Number of unique values in the field
– Number of null values in the field
– Average width of the field in bytes
Statistics
No Statistics
How many rows would produce the join of two
tables?
Statistics
No Statistics
How many rows would produce the join of two
tables?
 From 0 to infinity
Statistics
No Statistics
Row Count
How many rows would produce the join of two
tables?
 From 0 to infinity
How many rows would produce the join of two 1000-
row tables?
Statistics
No Statistics
Row Count
How many rows would produce the join of two
tables?
 From 0 to infinity
How many rows would produce the join of two 1000-
row tables?
 From 0 to 1’000’000
Statistics
No Statistics
Row Count
Histograms and MCV
How many rows would produce the join of two
tables?
 From 0 to infinity
How many rows would produce the join of two 1000-
row tables?
 From 0 to 1’000’000
How many rows would produce the join of two 1000-
row tables, with known field cardinality, values
distribution diagram, number of nulls, most common
values?
Statistics
No Statistics
Row Count
Histograms and MCV
How many rows would produce the join of two
tables?
 From 0 to infinity
How many rows would produce the join of two 1000-
row tables?
 From 0 to 1’000’000
How many rows would produce the join of two 1000-
row tables, with known field cardinality, values
distribution diagram, number of nulls, most common
values?
 ~ From 500 to 1’500
Metadata
• Table structure information
ID Name Num Price
1 Яблоко 10 50
2 Груша 20 80
3 Банан 40 40
4 Апельсин 25 50
5 Киви 5 120
6 Арбуз 20 30
7 Дыня 40 100
8 Ананас 35 90
Metadata
• Table structure information
– Distribution fields
ID Name Num Price
1 Яблоко 10 50
2 Груша 20 80
3 Банан 40 40
4 Апельсин 25 50
5 Киви 5 120
6 Арбуз 20 30
7 Дыня 40 100
8 Ананас 35 90
hash(ID)
Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
ID Name Num Price
1 Яблоко 10 50
2 Груша 20 80
3 Банан 40 40
4 Апельсин 25 50
5 Киви 5 120
6 Арбуз 20 30
7 Дыня 40 100
8 Ананас 35 90
hash(ID)
ID Name Num Price
1 Яблоко 10 50
2 Груша 20 80
3 Банан 40 40
4 Апельсин 25 50
5 Киви 5 120
6 Арбуз 20 30
7 Дыня 40 100
8 Ананас 35 90
Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
ID Name Num Price
1 Яблоко 10 50
2 Груша 20 80
3 Банан 40 40
4 Апельсин 25 50
5 Киви 5 120
6 Арбуз 20 30
7 Дыня 40 100
8 Ананас 35 90
hash(ID)
ID Name Num Price
1 Яблоко 10 50
2 Груша 20 80
3 Банан 40 40
4 Апельсин 25 50
5 Киви 5 120
6 Арбуз 20 30
7 Дыня 40 100
8 Ананас 35 90
Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
• General metadata
– Users and groups
Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
• General metadata
– Users and groups
– Access privileges
Metadata
• Table structure information
– Distribution fields
– Number of hash buckets
– Partitioning (hash, list, range)
• General metadata
– Users and groups
– Access privileges
• Stored procedures
– PL/pgSQL, PL/Java, PL/Python, PL/Perl, PL/R
Query Optimizer
• HAWQ uses cost-based query optimizers
Query Optimizer
• HAWQ uses cost-based query optimizers
• You have two options
– Planner – evolved from the Postgres query
optimizer
– ORCA (Pivotal Query Optimizer) – developed
specifically for HAWQ
Query Optimizer
• HAWQ uses cost-based query optimizers
• You have two options
– Planner – evolved from the Postgres query
optimizer
– ORCA (Pivotal Query Optimizer) – developed
specifically for HAWQ
• Optimizer hints work just like in Postgres
– Enable/disable specific operation
– Change the cost estimations for basic actions
Storage Formats
Which storage format is the most optimal?
Storage Formats
Which storage format is the most optimal?
 It depends on what you mean by “optimal”
Storage Formats
Which storage format is the most optimal?
 It depends on what you mean by “optimal”
– Minimal CPU usage for reading and writing the data
Storage Formats
Which storage format is the most optimal?
 It depends on what you mean by “optimal”
– Minimal CPU usage for reading and writing the data
– Minimal disk space usage
Storage Formats
Which storage format is the most optimal?
 It depends on what you mean by “optimal”
– Minimal CPU usage for reading and writing the data
– Minimal disk space usage
– Minimal time to retrieve record by key
Storage Formats
Which storage format is the most optimal?
 It depends on what you mean by “optimal”
– Minimal CPU usage for reading and writing the data
– Minimal disk space usage
– Minimal time to retrieve record by key
– Minimal time to retrieve subset of columns
– etc.
Storage Formats
• Row-based storage format
– Similar to Postgres heap storage
• No toast
• No ctid, xmin, xmax, cmin, cmax
Storage Formats
• Row-based storage format
– Similar to Postgres heap storage
• No toast
• No ctid, xmin, xmax, cmin, cmax
– Compression
• No compression
• Quicklz
• Zlib levels 1 - 9
Storage Formats
• Apache Parquet
– Mixed row-columnar table store, the data is split
into “row groups” stored in columnar format
Storage Formats
• Apache Parquet
– Mixed row-columnar table store, the data is split
into “row groups” stored in columnar format
– Compression
• No compression
• Snappy
• Gzip levels 1 – 9
Storage Formats
• Apache Parquet
– Mixed row-columnar table store, the data is split
into “row groups” stored in columnar format
– Compression
• No compression
• Snappy
• Gzip levels 1 – 9
– The size of “row group” and page size can be set
for each table separately
Resource Management
• Two main options
– Static resource split – HAWQ and YARN does not
know about each other
Resource Management
• Two main options
– Static resource split – HAWQ and YARN does not
know about each other
– YARN – HAWQ asks YARN Resource Manager for
query execution resources
Resource Management
• Two main options
– Static resource split – HAWQ and YARN does not
know about each other
– YARN – HAWQ asks YARN Resource Manager for
query execution resources
• Flexible cluster utilization
– Query might run on a subset of nodes if it is small
Resource Management
• Two main options
– Static resource split – HAWQ and YARN does not
know about each other
– YARN – HAWQ asks YARN Resource Manager for
query execution resources
• Flexible cluster utilization
– Query might run on a subset of nodes if it is small
– Query might have many executors on each cluster
node to make it run faster
Resource Management
• Two main options
– Static resource split – HAWQ and YARN does not
know about each other
– YARN – HAWQ asks YARN Resource Manager for
query execution resources
• Flexible cluster utilization
– Query might run on a subset of nodes if it is small
– Query might have many executors on each cluster
node to make it run faster
– You can control the parallelism of each query
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory usage limits
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory usage limits
– CPU cores usage limit
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory usage limits
– CPU cores usage limit
– MIN/MAX number of executors across the system
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory usage limits
– CPU cores usage limit
– MIN/MAX number of executors across the system
– MIN/MAX number of executors on each node
Resource Management
• Resource Queue can be set with
– Maximum number of parallel queries
– CPU usage priority
– Memory usage limits
– CPU cores usage limit
– MIN/MAX number of executors across the system
– MIN/MAX number of executors on each node
• Can be set up for user or group
External Data
• PXF
– Framework for external data access
– Easy to extend, many public plugins available
– Official plugins: CSV, SequenceFile, Avro, Hive,
HBase
– Open Source plugins: JSON, Accumulo,
Cassandra, JDBC, Redis, Pipe
External Data
• PXF
– Framework for external data access
– Easy to extend, many public plugins available
– Official plugins: CSV, SequenceFile, Avro, Hive,
HBase
– Open Source plugins: JSON, Accumulo,
Cassandra, JDBC, Redis, Pipe
• HCatalog
– HAWQ can query tables from HCatalog the same
way as HAWQ native tables
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Resource Prepare Execute Result CleanupPlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Resource Prepare Execute Result CleanupPlan
QE
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Resource Prepare Execute Result CleanupPlan
QE
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Resource Prepare Execute Result CleanupPlan
QE
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Resource Prepare Execute Result CleanupPlan
QE
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Resource Prepare Execute Result CleanupPlan
QE ScanBars
b
HashJoinb.name =s.bar
ScanSells
s
Filterb.city ='SanFrancisco'
Projects.beer, s.price
MotionGather
MotionRedist(b.name)
Plan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Prepare Execute Result Cleanup
QE
Resource
Plan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Prepare Execute Result Cleanup
QE
Resource
I need 5 containers
Each with 1 CPU core
and 256 MB RAM
Plan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Prepare Execute Result Cleanup
QE
Resource
I need 5 containers
Each with 1 CPU core
and 256 MB RAM
Server 1: 2 containers
Server 2: 1 container
Server N: 2 containers
Plan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Prepare Execute Result Cleanup
QE
Resource
I need 5 containers
Each with 1 CPU core
and 256 MB RAM
Server 1: 2 containers
Server 2: 1 container
Server N: 2 containers
Plan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Prepare Execute Result Cleanup
QE
Resource
I need 5 containers
Each with 1 CPU core
and 256 MB RAM
Server 1: 2 containers
Server 2: 1 container
Server N: 2 containers
Plan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Prepare Execute Result Cleanup
QE
Resource
I need 5 containers
Each with 1 CPU core
and 256 MB RAM
Server 1: 2 containers
Server 2: 1 container
Server N: 2 containers
QE QE QE QE QE
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Execute Result Cleanup
QE
QE QE QE QE QE
Prepare
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Execute Result Cleanup
QE
QE QE QE QE QE
Prepare
ScanBars
b
HashJoinb.name =s.bar
ScanSells
s
Filterb.city ='SanFrancisco'
Projects.beer, s.price
MotionGather
MotionRedist(b.name)
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Execute Result Cleanup
QE
QE QE QE QE QE
Prepare
ScanBars
b
HashJoinb.name =s.bar
ScanSells
s
Filterb.city ='SanFrancisco'
Projects.beer, s.price
MotionGather
MotionRedist(b.name)
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Result Cleanup
QE
QE QE QE QE QE
Prepare Execute
ScanBars
b
HashJoinb.name =s.bar
ScanSells
s
Filterb.city ='SanFrancisco'
Projects.beer, s.price
MotionGather
MotionRedist(b.name)
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Result Cleanup
QE
QE QE QE QE QE
Prepare Execute
ScanBars
b
HashJoinb.name =s.bar
ScanSells
s
Filterb.city ='SanFrancisco'
Projects.beer, s.price
MotionGather
MotionRedist(b.name)
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Result Cleanup
QE
QE QE QE QE QE
Prepare Execute
ScanBars
b
HashJoinb.name =s.bar
ScanSells
s
Filterb.city ='SanFrancisco'
Projects.beer, s.price
MotionGather
MotionRedist(b.name)
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Cleanup
QE
QE QE QE QE QE
Prepare Execute Result
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Cleanup
QE
QE QE QE QE QE
Prepare Execute Result
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Cleanup
QE
QE QE QE QE QE
Prepare Execute Result
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
QE
QE QE QE QE QE
Prepare Execute Result Cleanup
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
QE
QE QE QE QE QE
Prepare Execute Result Cleanup
Free query resources
Server 1: 2 containers
Server 2: 1 container
Server N: 2 containers
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
QE
QE QE QE QE QE
Prepare Execute Result Cleanup
Free query resources
Server 1: 2 containers
Server 2: 1 container
Server N: 2 containers
OK
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
QE
QE QE QE QE QE
Prepare Execute Result Cleanup
Free query resources
Server 1: 2 containers
Server 2: 1 container
Server N: 2 containers
OK
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
QE
QE QE QE QE QE
Prepare Execute Result Cleanup
Free query resources
Server 1: 2 containers
Server 2: 1 container
Server N: 2 containers
OK
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
QE
QE QE QE QE QE
Prepare Execute Result Cleanup
Free query resources
Server 1: 2 containers
Server 2: 1 container
Server N: 2 containers
OK
ResourcePlan
Query Example
HAWQ Master
Metadata
Transaction Mgr.
Query Parser Query Optimizer
Query Dispatch
Resource Mgr.
NameNode
Server 1
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server 2
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
Server N
Local directory
HAWQ Segment
Postmaster
HDFS Datanode
YARN RMPostmaster
Prepare Execute Result Cleanup
Query Performance
• Data does not hit the disk unless this cannot be
avoided
Query Performance
• Data does not hit the disk unless this cannot be
avoided
• Data is not buffered on the segments unless
this cannot be avoided
Query Performance
• Data does not hit the disk unless this cannot be
avoided
• Data is not buffered on the segments unless
this cannot be avoided
• Data is transferred between the nodes by UDP
Query Performance
• Data does not hit the disk unless this cannot be
avoided
• Data is not buffered on the segments unless
this cannot be avoided
• Data is transferred between the nodes by UDP
• HAWQ has a good cost-based query optimizer
Query Performance
• Data does not hit the disk unless this cannot be
avoided
• Data is not buffered on the segments unless
this cannot be avoided
• Data is transferred between the nodes by UDP
• HAWQ has a good cost-based query optimizer
• C/C++ implementation is more efficient than
Java implementation of competitive solutions
Query Performance
• Data does not hit the disk unless this cannot be
avoided
• Data is not buffered on the segments unless
this cannot be avoided
• Data is transferred between the nodes by UDP
• HAWQ has a good cost-based query optimizer
• C/C++ implementation is more efficient than
Java implementation of competitive solutions
• Query parallelism can be easily tuned
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Built-in Languages
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Built-in Languages
Disk IO
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Built-in Languages
Disk IO
Parallelism
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Built-in Languages
Disk IO
Parallelism
Distributions
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Built-in Languages
Disk IO
Parallelism
Distributions
Stability
Competitive Solutions
Hive SparkSQL Impala HAWQ
Query Optimizer
ANSI SQL
Built-in Languages
Disk IO
Parallelism
Distributions
Stability
Community
Roadmap
• AWS and S3 integration
Roadmap
• AWS and S3 integration
• Mesos integration
Roadmap
• AWS and S3 integration
• Mesos integration
• Better Ambari integration
Roadmap
• AWS and S3 integration
• Mesos integration
• Better Ambari integration
• Cloudera, MapR and IBM Hadoop distributions
native support
Roadmap
• AWS and S3 integration
• Mesos integration
• Better Ambari integration
• Cloudera, MapR and IBM Hadoop distributions
native support
• Make the SQL-on-Hadoop engine ever!
Summary
• Modern SQL-on-Hadoop engine
• For structured data processing and analysis
• Combines the best techniques of competitive
solutions
• Just released to the open source
• Community is very young
Join our community and contribute!
Questions
Apache HAWQ
http://hawq.incubator.apache.org
dev@hawq.incubator.apache.org
user@hawq.incubator.apache.org
Reach me on http://0x0fff.com

More Related Content

What's hot

Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformDatabricks
 
Azure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationAzure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationMatthew W. Bowers
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseJames Serra
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data EngineeringDurga Gadiraju
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
The Case for Graphs in Supply Chains
The Case for Graphs in Supply ChainsThe Case for Graphs in Supply Chains
The Case for Graphs in Supply ChainsNeo4j
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
Introduction of ssis
Introduction of ssisIntroduction of ssis
Introduction of ssisdeepakk073
 
Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Rajesh Kumar
 
TeraStream for ETL
TeraStream for ETLTeraStream for ETL
TeraStream for ETL치민 최
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in DeltaDatabricks
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lakeJames Serra
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)James Serra
 
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...VMware Tanzu
 
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...HostedbyConfluent
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020Timothy McAliley
 

What's hot (20)

Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
Azure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar PresentationAzure Synapse 101 Webinar Presentation
Azure Synapse 101 Webinar Presentation
 
Introducing Azure SQL Data Warehouse
Introducing Azure SQL Data WarehouseIntroducing Azure SQL Data Warehouse
Introducing Azure SQL Data Warehouse
 
Introduction to Data Engineering
Introduction to Data EngineeringIntroduction to Data Engineering
Introduction to Data Engineering
 
Data Engineering Basics
Data Engineering BasicsData Engineering Basics
Data Engineering Basics
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
The Case for Graphs in Supply Chains
The Case for Graphs in Supply ChainsThe Case for Graphs in Supply Chains
The Case for Graphs in Supply Chains
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
Introduction of ssis
Introduction of ssisIntroduction of ssis
Introduction of ssis
 
Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture Azure data analytics platform - A reference architecture
Azure data analytics platform - A reference architecture
 
TeraStream for ETL
TeraStream for ETLTeraStream for ETL
TeraStream for ETL
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Spark Workshop
Spark WorkshopSpark Workshop
Spark Workshop
 
Change Data Feed in Delta
Change Data Feed in DeltaChange Data Feed in Delta
Change Data Feed in Delta
 
Big data architectures and the data lake
Big data architectures and the data lakeBig data architectures and the data lake
Big data architectures and the data lake
 
Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)Azure Synapse Analytics Overview (r1)
Azure Synapse Analytics Overview (r1)
 
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
Achieve Extreme Simplicity and Superior Price/Performance with Greenplum Buil...
 
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
Data Mess to Data Mesh | Jay Kreps, CEO, Confluent | Kafka Summit Americas 20...
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020NOVA SQL User Group - Azure Synapse Analytics Overview -  May 2020
NOVA SQL User Group - Azure Synapse Analytics Overview - May 2020
 

Similar to Apache HAWQ Architecture

Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSAmazon Web Services
 
aip-workshop1-dev-tutorial
aip-workshop1-dev-tutorialaip-workshop1-dev-tutorial
aip-workshop1-dev-tutorialMatthew Vaughn
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platformhadooparchbook
 
Evolve Your Schemas in a Better Way! A Deep Dive into Avro Schema Compatibili...
Evolve Your Schemas in a Better Way! A Deep Dive into Avro Schema Compatibili...Evolve Your Schemas in a Better Way! A Deep Dive into Avro Schema Compatibili...
Evolve Your Schemas in a Better Way! A Deep Dive into Avro Schema Compatibili...HostedbyConfluent
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an examplehadooparchbook
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...Aman Sinha
 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platformhadooparchbook
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platformhadooparchbook
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applicationshadooparchbook
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memoryJulian Hyde
 
Genji: Framework for building resilient near-realtime data pipelines
Genji: Framework for building resilient near-realtime data pipelinesGenji: Framework for building resilient near-realtime data pipelines
Genji: Framework for building resilient near-realtime data pipelinesSwami Sundaramurthy
 
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL ServerThe Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL ServerSarah Dutkiewicz
 
APEX 5 IR: Guts & Performance
APEX 5 IR:  Guts & PerformanceAPEX 5 IR:  Guts & Performance
APEX 5 IR: Guts & PerformanceKaren Cannell
 
APEX 5 IR Guts and Performance
APEX 5 IR Guts and PerformanceAPEX 5 IR Guts and Performance
APEX 5 IR Guts and PerformanceKaren Cannell
 
Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)Brian Brazil
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleSean Chittenden
 
APEX 5 Interactive Reports: Guts and PErformance
APEX 5 Interactive Reports: Guts and PErformanceAPEX 5 Interactive Reports: Guts and PErformance
APEX 5 Interactive Reports: Guts and PErformanceKaren Cannell
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoNathaniel Braun
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...SignalFx
 

Similar to Apache HAWQ Architecture (20)

Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
aip-workshop1-dev-tutorial
aip-workshop1-dev-tutorialaip-workshop1-dev-tutorial
aip-workshop1-dev-tutorial
 
Architecting a Next Generation Data Platform
Architecting a Next Generation Data PlatformArchitecting a Next Generation Data Platform
Architecting a Next Generation Data Platform
 
Evolve Your Schemas in a Better Way! A Deep Dive into Avro Schema Compatibili...
Evolve Your Schemas in a Better Way! A Deep Dive into Avro Schema Compatibili...Evolve Your Schemas in a Better Way! A Deep Dive into Avro Schema Compatibili...
Evolve Your Schemas in a Better Way! A Deep Dive into Avro Schema Compatibili...
 
Hadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an exampleHadoop application architectures - using Customer 360 as an example
Hadoop application architectures - using Customer 360 as an example
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
Accelerating SQL queries in NoSQL Databases using Apache Drill and Secondary ...
 
Architecting a next generation data platform
Architecting a next generation data platformArchitecting a next generation data platform
Architecting a next generation data platform
 
Architecting a next-generation data platform
Architecting a next-generation data platformArchitecting a next-generation data platform
Architecting a next-generation data platform
 
Architectural Patterns for Streaming Applications
Architectural Patterns for Streaming ApplicationsArchitectural Patterns for Streaming Applications
Architectural Patterns for Streaming Applications
 
SQL on everything, in memory
SQL on everything, in memorySQL on everything, in memory
SQL on everything, in memory
 
Genji: Framework for building resilient near-realtime data pipelines
Genji: Framework for building resilient near-realtime data pipelinesGenji: Framework for building resilient near-realtime data pipelines
Genji: Framework for building resilient near-realtime data pipelines
 
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL ServerThe Polyglot Data Scientist - Exploring R, Python, and SQL Server
The Polyglot Data Scientist - Exploring R, Python, and SQL Server
 
APEX 5 IR: Guts & Performance
APEX 5 IR:  Guts & PerformanceAPEX 5 IR:  Guts & Performance
APEX 5 IR: Guts & Performance
 
APEX 5 IR Guts and Performance
APEX 5 IR Guts and PerformanceAPEX 5 IR Guts and Performance
APEX 5 IR Guts and Performance
 
Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)Prometheus lightning talk (Devops Dublin March 2015)
Prometheus lightning talk (Devops Dublin March 2015)
 
Creating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at ScaleCreating PostgreSQL-as-a-Service at Scale
Creating PostgreSQL-as-a-Service at Scale
 
APEX 5 Interactive Reports: Guts and PErformance
APEX 5 Interactive Reports: Guts and PErformanceAPEX 5 Interactive Reports: Guts and PErformance
APEX 5 Interactive Reports: Guts and PErformance
 
OpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ CriteoOpenTSDB for monitoring @ Criteo
OpenTSDB for monitoring @ Criteo
 
Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...Scaling ingest pipelines with high performance computing principles - Rajiv K...
Scaling ingest pipelines with high performance computing principles - Rajiv K...
 

More from Alexey Grishchenko

More from Alexey Grishchenko (6)

MPP vs Hadoop
MPP vs HadoopMPP vs Hadoop
MPP vs Hadoop
 
Greenplum Architecture
Greenplum ArchitectureGreenplum Architecture
Greenplum Architecture
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Modern Data Architecture
Modern Data ArchitectureModern Data Architecture
Modern Data Architecture
 
Архитектура Apache HAWQ Highload++ 2015
Архитектура Apache HAWQ Highload++ 2015Архитектура Apache HAWQ Highload++ 2015
Архитектура Apache HAWQ Highload++ 2015
 
Pivotal hawq internals
Pivotal hawq internalsPivotal hawq internals
Pivotal hawq internals
 

Recently uploaded

High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...kumargunjan9515
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfSayantanBiswas37
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...gajnagarg
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...kumargunjan9515
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...HyderabadDolls
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...SOFTTECHHUB
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...nirzagarg
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themeitharjee
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...gajnagarg
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...Health
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...HyderabadDolls
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...HyderabadDolls
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 

Recently uploaded (20)

High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Computer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdfComputer science Sql cheat sheet.pdf.pdf
Computer science Sql cheat sheet.pdf.pdf
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
Sealdah % High Class Call Girls Kolkata - 450+ Call Girl Cash Payment 8005736...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 

Apache HAWQ Architecture

  • 2. Who I am Enterprise Architect @ Pivotal • 7 years in data processing • 5 years of experience with MPP • 4 years with Hadoop • Using HAWQ since the first internal Beta • Responsible for designing most of the EMEA HAWQ and Greenplum implementations • Spark contributor • http://0x0fff.com
  • 4. Agenda • What is HAWQ • Why you need it
  • 5. Agenda • What is HAWQ • Why you need it • HAWQ Components
  • 6. Agenda • What is HAWQ • Why you need it • HAWQ Components • HAWQ Design
  • 7. Agenda • What is HAWQ • Why you need it • HAWQ Components • HAWQ Design • Query execution example
  • 8. Agenda • What is HAWQ • Why you need it • HAWQ Components • HAWQ Design • Query execution example • Competitive solutions
  • 9. What is • Analytical SQL-on-Hadoop engine
  • 10. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries
  • 11. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres Greenplum HAWQ 2005 Fork Postgres 8.0.2
  • 12. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres HAWQ 2005 Fork Postgres 8.0.2 2009 Rebase Postgres 8.2.15 Greenplum
  • 13. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres HAWQ 2005 Fork Postgres 8.0.2 2009 Rebase Postgres 8.2.15 2011 Fork GPDB 4.2.0.0 Greenplum
  • 14. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres HAWQ 2005 Fork Postgres 8.0.2 2009 Rebase Postgres 8.2.15 2011 Fork GPDB 4.2.0.0 2013 HAWQ 1.0.0.0 Greenplum
  • 15. What is • Analytical SQL-on-Hadoop engine • HAdoop With Queries Postgres HAWQ 2005 Fork Postgres 8.0.2 2009 Rebase Postgres 8.2.15 2011 Fork GPDB 4.2.0.0 2013 HAWQ 1.0.0.0 HAWQ 2.0.0.0 Open Source 2015 Greenplum
  • 16. HAWQ is … • 1’500’000 C and C++ lines of code
  • 17. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only
  • 18. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC
  • 19. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC
  • 20. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC • 23’000 Makefile LOC
  • 21. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC • 23’000 Makefile LOC • 7’000 Shell scripts LOC
  • 22. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC • 23’000 Makefile LOC • 7’000 Shell scripts LOC • More than 50 enterprise customers
  • 23. HAWQ is … • 1’500’000 C and C++ lines of code – 200’000 of them in headers only • 180’000 Python LOC • 60’000 Java LOC • 23’000 Makefile LOC • 7’000 Shell scripts LOC • More than 50 enterprise customers – More than 10 of them in EMEA
  • 24. Apache HAWQ • Apache HAWQ (incubating) from 09’2015 – http://hawq.incubator.apache.org – https://github.com/apache/incubator-hawq • What’s in Open Source – Sources of HAWQ 2.0 alpha – HAWQ 2.0 beta is planned for 2015’Q4 – HAWQ 2.0 GA is planned for 2016’Q1 • Community is yet young – come and join!
  • 25. Why do we need it?
  • 26. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003
  • 27. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos
  • 28. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos • Universal tool for ad hoc analytics on top of Hadoop data
  • 29. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos • Universal tool for ad hoc analytics on top of Hadoop data – Example - parse URL to extract protocol, host name, port, GET parameters
  • 30. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos • Universal tool for ad hoc analytics on top of Hadoop data – Example - parse URL to extract protocol, host name, port, GET parameters • Good performance
  • 31. Why do we need it? • SQL-interface for BI solutions to the Hadoop data complaint with ANSI SQL-92, -99, -2003 – Example - 5000-line query with a number of window function generated by Cognos • Universal tool for ad hoc analytics on top of Hadoop data – Example - parse URL to extract protocol, host name, port, GET parameters • Good performance – How many times the data would hit the HDD during a single Hive query?
  • 32. HAWQ Cluster Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM Server 2 ZK JM Server 6 Datanode Server N Datanode Server 5 Datanode interconnect …
  • 33. HAWQ Cluster Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM Server 2 ZK JM Server 6 Datanode Server N Datanode Server 5 Datanode YARN NM YARN NM YARN NM YARN RM YARN App Timeline interconnect …
  • 34. HAWQ Cluster HAWQ Master Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM HAWQ Standby Server 2 ZK JM HAWQ Segment Server 6 Datanode HAWQ Segment Server N Datanode HAWQ Segment Server 5 Datanode YARN NM YARN NM YARN NM YARN RM YARN App Timeline interconnect …
  • 35. Master Servers Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM Server 2 ZK JM HAWQ Segment Server 6 Datanode HAWQ Segment Server N Datanode HAWQ Segment Server 5 Datanode YARN NM YARN NM YARN NM YARN RM YARN App Timeline interconnect … HAWQ Master HAWQ Standby
  • 36. Master Servers HAWQ Master Query Parser Query Optimizer Global Resource Manager Distributed Transactions Manager Query Dispatch Metadata Catalog HAWQ Standby Master Query Parser Query Optimizer Global Resource Manager Distributed Transactions Manager Query Dispatch Metadata Catalog WAL repl.
  • 37. HAWQ Master HAWQ Standby Segments Server 1 SNameNode Server 4 ZK JM NameNode Server 3 ZK JM Server 2 ZK JM Server 6 Datanode Server N Datanode Server 5 Datanode YARN NM YARN NM YARN NM YARN RM YARN App Timeline interconnect HAWQ Segment HAWQ SegmentHAWQ Segment …
  • 38. Segments HAWQ Segment Query Executor libhdfs3 PXF HDFS Datanode Local Filesystem Temporary Data Directory Logs YARN Node Manager
  • 39. Metadata • HAWQ metadata structure is similar to Postgres catalog structure
  • 40. Metadata • HAWQ metadata structure is similar to Postgres catalog structure • Statistics – Number of rows and pages in the table
  • 41. Metadata • HAWQ metadata structure is similar to Postgres catalog structure • Statistics – Number of rows and pages in the table – Most common values for each field
  • 42. Metadata • HAWQ metadata structure is similar to Postgres catalog structure • Statistics – Number of rows and pages in the table – Most common values for each field – Histogram of values distribution for each field
  • 43. Metadata • HAWQ metadata structure is similar to Postgres catalog structure • Statistics – Number of rows and pages in the table – Most common values for each field – Histogram of values distribution for each field – Number of unique values in the field
  • 44. Metadata • HAWQ metadata structure is similar to Postgres catalog structure • Statistics – Number of rows and pages in the table – Most common values for each field – Histogram of values distribution for each field – Number of unique values in the field – Number of null values in the field
  • 45. Metadata • HAWQ metadata structure is similar to Postgres catalog structure • Statistics – Number of rows and pages in the table – Most common values for each field – Histogram of values distribution for each field – Number of unique values in the field – Number of null values in the field – Average width of the field in bytes
  • 46. Statistics No Statistics How many rows would produce the join of two tables?
  • 47. Statistics No Statistics How many rows would produce the join of two tables?  From 0 to infinity
  • 48. Statistics No Statistics Row Count How many rows would produce the join of two tables?  From 0 to infinity How many rows would produce the join of two 1000- row tables?
  • 49. Statistics No Statistics Row Count How many rows would produce the join of two tables?  From 0 to infinity How many rows would produce the join of two 1000- row tables?  From 0 to 1’000’000
  • 50. Statistics No Statistics Row Count Histograms and MCV How many rows would produce the join of two tables?  From 0 to infinity How many rows would produce the join of two 1000- row tables?  From 0 to 1’000’000 How many rows would produce the join of two 1000- row tables, with known field cardinality, values distribution diagram, number of nulls, most common values?
  • 51. Statistics No Statistics Row Count Histograms and MCV How many rows would produce the join of two tables?  From 0 to infinity How many rows would produce the join of two 1000- row tables?  From 0 to 1’000’000 How many rows would produce the join of two 1000- row tables, with known field cardinality, values distribution diagram, number of nulls, most common values?  ~ From 500 to 1’500
  • 52. Metadata • Table structure information ID Name Num Price 1 Яблоко 10 50 2 Груша 20 80 3 Банан 40 40 4 Апельсин 25 50 5 Киви 5 120 6 Арбуз 20 30 7 Дыня 40 100 8 Ананас 35 90
  • 53. Metadata • Table structure information – Distribution fields ID Name Num Price 1 Яблоко 10 50 2 Груша 20 80 3 Банан 40 40 4 Апельсин 25 50 5 Киви 5 120 6 Арбуз 20 30 7 Дыня 40 100 8 Ананас 35 90 hash(ID)
  • 54. Metadata • Table structure information – Distribution fields – Number of hash buckets ID Name Num Price 1 Яблоко 10 50 2 Груша 20 80 3 Банан 40 40 4 Апельсин 25 50 5 Киви 5 120 6 Арбуз 20 30 7 Дыня 40 100 8 Ананас 35 90 hash(ID) ID Name Num Price 1 Яблоко 10 50 2 Груша 20 80 3 Банан 40 40 4 Апельсин 25 50 5 Киви 5 120 6 Арбуз 20 30 7 Дыня 40 100 8 Ананас 35 90
  • 55. Metadata • Table structure information – Distribution fields – Number of hash buckets – Partitioning (hash, list, range) ID Name Num Price 1 Яблоко 10 50 2 Груша 20 80 3 Банан 40 40 4 Апельсин 25 50 5 Киви 5 120 6 Арбуз 20 30 7 Дыня 40 100 8 Ананас 35 90 hash(ID) ID Name Num Price 1 Яблоко 10 50 2 Груша 20 80 3 Банан 40 40 4 Апельсин 25 50 5 Киви 5 120 6 Арбуз 20 30 7 Дыня 40 100 8 Ананас 35 90
  • 56. Metadata • Table structure information – Distribution fields – Number of hash buckets – Partitioning (hash, list, range) • General metadata – Users and groups
  • 57. Metadata • Table structure information – Distribution fields – Number of hash buckets – Partitioning (hash, list, range) • General metadata – Users and groups – Access privileges
  • 58. Metadata • Table structure information – Distribution fields – Number of hash buckets – Partitioning (hash, list, range) • General metadata – Users and groups – Access privileges • Stored procedures – PL/pgSQL, PL/Java, PL/Python, PL/Perl, PL/R
  • 59. Query Optimizer • HAWQ uses cost-based query optimizers
  • 60. Query Optimizer • HAWQ uses cost-based query optimizers • You have two options – Planner – evolved from the Postgres query optimizer – ORCA (Pivotal Query Optimizer) – developed specifically for HAWQ
  • 61. Query Optimizer • HAWQ uses cost-based query optimizers • You have two options – Planner – evolved from the Postgres query optimizer – ORCA (Pivotal Query Optimizer) – developed specifically for HAWQ • Optimizer hints work just like in Postgres – Enable/disable specific operation – Change the cost estimations for basic actions
  • 62. Storage Formats Which storage format is the most optimal?
  • 63. Storage Formats Which storage format is the most optimal?  It depends on what you mean by “optimal”
  • 64. Storage Formats Which storage format is the most optimal?  It depends on what you mean by “optimal” – Minimal CPU usage for reading and writing the data
  • 65. Storage Formats Which storage format is the most optimal?  It depends on what you mean by “optimal” – Minimal CPU usage for reading and writing the data – Minimal disk space usage
  • 66. Storage Formats Which storage format is the most optimal?  It depends on what you mean by “optimal” – Minimal CPU usage for reading and writing the data – Minimal disk space usage – Minimal time to retrieve record by key
  • 67. Storage Formats Which storage format is the most optimal?  It depends on what you mean by “optimal” – Minimal CPU usage for reading and writing the data – Minimal disk space usage – Minimal time to retrieve record by key – Minimal time to retrieve subset of columns – etc.
  • 68. Storage Formats • Row-based storage format – Similar to Postgres heap storage • No toast • No ctid, xmin, xmax, cmin, cmax
  • 69. Storage Formats • Row-based storage format – Similar to Postgres heap storage • No toast • No ctid, xmin, xmax, cmin, cmax – Compression • No compression • Quicklz • Zlib levels 1 - 9
  • 70. Storage Formats • Apache Parquet – Mixed row-columnar table store, the data is split into “row groups” stored in columnar format
  • 71. Storage Formats • Apache Parquet – Mixed row-columnar table store, the data is split into “row groups” stored in columnar format – Compression • No compression • Snappy • Gzip levels 1 – 9
  • 72. Storage Formats • Apache Parquet – Mixed row-columnar table store, the data is split into “row groups” stored in columnar format – Compression • No compression • Snappy • Gzip levels 1 – 9 – The size of “row group” and page size can be set for each table separately
  • 73. Resource Management • Two main options – Static resource split – HAWQ and YARN does not know about each other
  • 74. Resource Management • Two main options – Static resource split – HAWQ and YARN does not know about each other – YARN – HAWQ asks YARN Resource Manager for query execution resources
  • 75. Resource Management • Two main options – Static resource split – HAWQ and YARN does not know about each other – YARN – HAWQ asks YARN Resource Manager for query execution resources • Flexible cluster utilization – Query might run on a subset of nodes if it is small
  • 76. Resource Management • Two main options – Static resource split – HAWQ and YARN does not know about each other – YARN – HAWQ asks YARN Resource Manager for query execution resources • Flexible cluster utilization – Query might run on a subset of nodes if it is small – Query might have many executors on each cluster node to make it run faster
  • 77. Resource Management • Two main options – Static resource split – HAWQ and YARN does not know about each other – YARN – HAWQ asks YARN Resource Manager for query execution resources • Flexible cluster utilization – Query might run on a subset of nodes if it is small – Query might have many executors on each cluster node to make it run faster – You can control the parallelism of each query
  • 78. Resource Management • Resource Queue can be set with – Maximum number of parallel queries
  • 79. Resource Management • Resource Queue can be set with – Maximum number of parallel queries – CPU usage priority
  • 80. Resource Management • Resource Queue can be set with – Maximum number of parallel queries – CPU usage priority – Memory usage limits
  • 81. Resource Management • Resource Queue can be set with – Maximum number of parallel queries – CPU usage priority – Memory usage limits – CPU cores usage limit
  • 82. Resource Management • Resource Queue can be set with – Maximum number of parallel queries – CPU usage priority – Memory usage limits – CPU cores usage limit – MIN/MAX number of executors across the system
  • 83. Resource Management • Resource Queue can be set with – Maximum number of parallel queries – CPU usage priority – Memory usage limits – CPU cores usage limit – MIN/MAX number of executors across the system – MIN/MAX number of executors on each node
  • 84. Resource Management • Resource Queue can be set with – Maximum number of parallel queries – CPU usage priority – Memory usage limits – CPU cores usage limit – MIN/MAX number of executors across the system – MIN/MAX number of executors on each node • Can be set up for user or group
  • 85. External Data • PXF – Framework for external data access – Easy to extend, many public plugins available – Official plugins: CSV, SequenceFile, Avro, Hive, HBase – Open Source plugins: JSON, Accumulo, Cassandra, JDBC, Redis, Pipe
  • 86. External Data • PXF – Framework for external data access – Easy to extend, many public plugins available – Official plugins: CSV, SequenceFile, Avro, Hive, HBase – Open Source plugins: JSON, Accumulo, Cassandra, JDBC, Redis, Pipe • HCatalog – HAWQ can query tables from HCatalog the same way as HAWQ native tables
  • 87. Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Resource Prepare Execute Result CleanupPlan
  • 88. Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Resource Prepare Execute Result CleanupPlan QE
  • 89. Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Resource Prepare Execute Result CleanupPlan QE
  • 90. Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Resource Prepare Execute Result CleanupPlan QE
  • 91. Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Resource Prepare Execute Result CleanupPlan QE
  • 92. Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Resource Prepare Execute Result CleanupPlan QE ScanBars b HashJoinb.name =s.bar ScanSells s Filterb.city ='SanFrancisco' Projects.beer, s.price MotionGather MotionRedist(b.name)
  • 93. Plan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup QE Resource
  • 94. Plan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup QE Resource I need 5 containers Each with 1 CPU core and 256 MB RAM
  • 95. Plan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup QE Resource I need 5 containers Each with 1 CPU core and 256 MB RAM Server 1: 2 containers Server 2: 1 container Server N: 2 containers
  • 96. Plan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup QE Resource I need 5 containers Each with 1 CPU core and 256 MB RAM Server 1: 2 containers Server 2: 1 container Server N: 2 containers
  • 97. Plan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup QE Resource I need 5 containers Each with 1 CPU core and 256 MB RAM Server 1: 2 containers Server 2: 1 container Server N: 2 containers
  • 98. Plan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup QE Resource I need 5 containers Each with 1 CPU core and 256 MB RAM Server 1: 2 containers Server 2: 1 container Server N: 2 containers QE QE QE QE QE
  • 99. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Execute Result Cleanup QE QE QE QE QE QE Prepare
  • 100. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Execute Result Cleanup QE QE QE QE QE QE Prepare ScanBars b HashJoinb.name =s.bar ScanSells s Filterb.city ='SanFrancisco' Projects.beer, s.price MotionGather MotionRedist(b.name)
  • 101. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Execute Result Cleanup QE QE QE QE QE QE Prepare ScanBars b HashJoinb.name =s.bar ScanSells s Filterb.city ='SanFrancisco' Projects.beer, s.price MotionGather MotionRedist(b.name)
  • 102. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Result Cleanup QE QE QE QE QE QE Prepare Execute ScanBars b HashJoinb.name =s.bar ScanSells s Filterb.city ='SanFrancisco' Projects.beer, s.price MotionGather MotionRedist(b.name)
  • 103. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Result Cleanup QE QE QE QE QE QE Prepare Execute ScanBars b HashJoinb.name =s.bar ScanSells s Filterb.city ='SanFrancisco' Projects.beer, s.price MotionGather MotionRedist(b.name)
  • 104. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Result Cleanup QE QE QE QE QE QE Prepare Execute ScanBars b HashJoinb.name =s.bar ScanSells s Filterb.city ='SanFrancisco' Projects.beer, s.price MotionGather MotionRedist(b.name)
  • 105. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Cleanup QE QE QE QE QE QE Prepare Execute Result
  • 106. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Cleanup QE QE QE QE QE QE Prepare Execute Result
  • 107. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Cleanup QE QE QE QE QE QE Prepare Execute Result
  • 108. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster QE QE QE QE QE QE Prepare Execute Result Cleanup
  • 109. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster QE QE QE QE QE QE Prepare Execute Result Cleanup Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers
  • 110. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster QE QE QE QE QE QE Prepare Execute Result Cleanup Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers OK
  • 111. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster QE QE QE QE QE QE Prepare Execute Result Cleanup Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers OK
  • 112. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster QE QE QE QE QE QE Prepare Execute Result Cleanup Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers OK
  • 113. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster QE QE QE QE QE QE Prepare Execute Result Cleanup Free query resources Server 1: 2 containers Server 2: 1 container Server N: 2 containers OK
  • 114. ResourcePlan Query Example HAWQ Master Metadata Transaction Mgr. Query Parser Query Optimizer Query Dispatch Resource Mgr. NameNode Server 1 Local directory HAWQ Segment Postmaster HDFS Datanode Server 2 Local directory HAWQ Segment Postmaster HDFS Datanode Server N Local directory HAWQ Segment Postmaster HDFS Datanode YARN RMPostmaster Prepare Execute Result Cleanup
  • 115. Query Performance • Data does not hit the disk unless this cannot be avoided
  • 116. Query Performance • Data does not hit the disk unless this cannot be avoided • Data is not buffered on the segments unless this cannot be avoided
  • 117. Query Performance • Data does not hit the disk unless this cannot be avoided • Data is not buffered on the segments unless this cannot be avoided • Data is transferred between the nodes by UDP
  • 118. Query Performance • Data does not hit the disk unless this cannot be avoided • Data is not buffered on the segments unless this cannot be avoided • Data is transferred between the nodes by UDP • HAWQ has a good cost-based query optimizer
  • 119. Query Performance • Data does not hit the disk unless this cannot be avoided • Data is not buffered on the segments unless this cannot be avoided • Data is transferred between the nodes by UDP • HAWQ has a good cost-based query optimizer • C/C++ implementation is more efficient than Java implementation of competitive solutions
  • 120. Query Performance • Data does not hit the disk unless this cannot be avoided • Data is not buffered on the segments unless this cannot be avoided • Data is transferred between the nodes by UDP • HAWQ has a good cost-based query optimizer • C/C++ implementation is more efficient than Java implementation of competitive solutions • Query parallelism can be easily tuned
  • 121. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer
  • 122. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL
  • 123. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL Built-in Languages
  • 124. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL Built-in Languages Disk IO
  • 125. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL Built-in Languages Disk IO Parallelism
  • 126. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL Built-in Languages Disk IO Parallelism Distributions
  • 127. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL Built-in Languages Disk IO Parallelism Distributions Stability
  • 128. Competitive Solutions Hive SparkSQL Impala HAWQ Query Optimizer ANSI SQL Built-in Languages Disk IO Parallelism Distributions Stability Community
  • 129. Roadmap • AWS and S3 integration
  • 130. Roadmap • AWS and S3 integration • Mesos integration
  • 131. Roadmap • AWS and S3 integration • Mesos integration • Better Ambari integration
  • 132. Roadmap • AWS and S3 integration • Mesos integration • Better Ambari integration • Cloudera, MapR and IBM Hadoop distributions native support
  • 133. Roadmap • AWS and S3 integration • Mesos integration • Better Ambari integration • Cloudera, MapR and IBM Hadoop distributions native support • Make the SQL-on-Hadoop engine ever!
  • 134. Summary • Modern SQL-on-Hadoop engine • For structured data processing and analysis • Combines the best techniques of competitive solutions • Just released to the open source • Community is very young Join our community and contribute!