1© Cloudera, Inc. All rights reserved.
Apache Impala 2.5 (Incubating)
Performance improvements overview
Agenda
• What is Impala?
• Impala at Apache
• What is new in Impala 2.5 (CDH 5.7)
• Impala performance update
• Roadmap
• Q&A
SQL-on-Hadoop engines
SQL
Impala
SQL-on-Apache Hadoop – Choosing the right tool for the right job
What is Impala?
• General-purpose SQL engine
• Real-time queries in Apache Hadoop
• General availability (v1.0) since April 2013
• Analytic SQL functionality (v2.0) since October 2014
• Apache Incubator project since December 2015
• Previous release: 2.3 (CDH 5.5), November 2015
• Current release: 2.5 (CDH 5.7), April 2016 (today's topic)
Impala overview
• Query speed over Hadoop that meets or exceeds that of a proprietary analytic DBMS
• General-purpose SQL query engine:
  • Targeted for analytical workloads
  • Supports queries that take from milliseconds to hours
• Runs directly within Hadoop:
  • Reads widely used Hadoop file formats
  • Talks to widely used Hadoop storage managers
  • Runs on the same nodes that run Hadoop processes
• Highly available
• High performance:
  • C++ instead of Java
  • Runtime code generation
Impala Use Cases
• Interactive BI/analytics on more data
• Asking new questions: exploration, ML (Ibis)
• Data processing with tight SLAs
• Queryable archive with full fidelity
Impala at Apache
• Incubator project since December 2015
• Development process slowly moving to ASF infrastructure (see IMPALA-3221)
• Help wanted!
Where to find the Impala community:
• dev@impala.incubator.apache.org
• user@impala.incubator.apache.org
• http://impala.io
• @apacheimpala
New in Impala 2.5
Usability Enhancements
• Admission Control Improvements
• Null-safe join/equals
Performance and Scalability
• Runtime filters
• Improved Cardinality Estimation and Join Ordering
• Query start-up improvements
• Additional codegen and code optimizations
• Decimal arithmetic improvements
• Fast min/max values on partition columns (with query option)
Integrations
• Support for EMC DSSD
New in Impala 2.5: covered today
Performance and Scalability
• Runtime filters
• Improved Cardinality Estimation and Join Ordering
• Query start-up improvements
• Additional codegen and code optimizations
• Decimal arithmetic improvements
• Incremental metadata updates (DDL)
• Fast min/max values on partition columns (with query option)
Impala 2.5 (CDH 5.7) improvements vs Impala 2.3 (CDH 5.5)
• 2.2x speedup for TPC-H
• 1.7x speedup for TPC-H (Nested)
• 4.3x speedup for TPC-DS
Runtime filtering
• General idea: some predicates can only be computed at runtime
• Example: SELECT count(*) FROM date_dim dt, store_sales WHERE dt.d_date_sk = store_sales.ss_sold_date_sk AND dt.d_moy = 12;
• How does Impala execute this query?
Runtime filters
SELECT dt.d_year
      ,item.i_brand brand
      ,sum(ss_ext_sales_price) sum_agg
FROM date_dim dt
    ,store_sales
    ,item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND i_category = "Books"
  AND i_class = "fiction"
  AND dt.d_moy = 12
GROUP BY dt.d_year
        ,item.i_brand
ORDER BY dt.d_year
        ,sum_agg DESC
        ,i_brand
LIMIT 100
Query plan (row counts without runtime filters):
• Scan store_sales: 43 billion rows
• Join #1 (broadcast join with item, 198 rows): 290 million rows out
• Join #2 (broadcast join with date_dim, 6,200 rows): 47 million rows out
• Aggregate
Runtime filters: the opportunity
● The planner doesn’t know which values of ss_sold_date_sk and ss_item_sk will qualify, even with statistics.
● There is an opportunity to save some work: why bother sending all 43 billion rows to the joins?
● Runtime filters compute this predicate at runtime.
Step 1: The planner tells Join #1 to produce a bloom filter for qualifying i_item_sk values and Join #2 to produce a bloom filter for qualifying d_date_sk values.
Step 2: Each join reads all rows from its build side (right input) and computes a filter containing all distinct values of its key: i_item_sk for Join #1, d_date_sk for Join #2.
Step 3: Joins #1 and #2 send their filters to the store_sales scan. The scan eliminates rows that don’t have a match in the bloom filters.
The store_sales scan uses the bloom filter from Join #2 to filter out partitions (ss_sold_date_sk) and the bloom filter from Join #1 to filter out rows that don’t qualify (ss_item_sk).
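The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not Impala's implementation: the `BloomFilter` class, its sizes, the helper names and the toy tables are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def build_filter(build_rows, key):
    """Join build side: record the join-key values that survived the predicates."""
    bf = BloomFilter()
    for row in build_rows:
        bf.add(row[key])
    return bf


def filtered_scan(probe_rows, key, bf):
    """Probe-side scan: drop rows that cannot match before they reach the join."""
    return [row for row in probe_rows if bf.might_contain(row[key])]


# Hypothetical toy data standing in for date_dim and store_sales.
date_dim = [{"d_date_sk": sk, "d_moy": sk % 12 + 1} for sk in range(1, 25)]
store_sales = [{"ss_sold_date_sk": sk % 24 + 1} for sk in range(1000)]

december = [row for row in date_dim if row["d_moy"] == 12]
date_filter = build_filter(december, "d_date_sk")
surviving = filtered_scan(store_sales, "ss_sold_date_sk", date_filter)
```

Because a bloom filter can return false positives, the join still has to run afterwards; the filter only shrinks its input.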
• 914x reduction in the number of rows coming out of the scan: 43 billion -> 47 million
• 6x reduction in the number of rows coming out of the join: 290 million -> 47 million
Runtime filters variation: global filters
SELECT c_email_address
      ,sum(ss_ext_sales_price) sum_agg
FROM store_sales
    ,customer
    ,customer_demographics
WHERE ss_customer_sk = c_customer_sk
  AND cd_demo_sk = c_current_cdemo_sk
  AND cd_gender = 'M'
  AND cd_purchase_estimate = 10000
  AND cd_credit_rating = 'Low Risk'
GROUP BY c_email_address
ORDER BY sum_agg DESC
Query plan (row counts without runtime filters):
• Scan store_sales: 43 billion rows (shuffled)
• Scan customer: 3.8 million rows (shuffled)
• Join #1 (shuffle join of store_sales with customer): 43 billion rows out
• Join #2 (broadcast join with customer_demo, 2,400 rows): 49 million rows out
• Aggregate
Joins #1 and #2 are expensive because the left side of each join has 43 billion rows.
Create a bloom filter from Join #2 on cd_demo_sk and push it down to the customer table scan.
Customer rows reduced by 826x: 3.8 million -> 4,600 rows
Create a bloom filter from Join #1 on c_customer_sk and push it down to the store_sales table scan.
877x reduction in rows: 43 billion -> 49 million rows
set RUNTIME_FILTER_MODE=GLOBAL;
Runtime filters: real-world results
• Runtime filters can be highly effective. Some benchmark queries are more than 30 times faster in Impala 2.5.0.
• As always, it depends on your queries, your schemas and your cluster environment.
• By default, runtime filters are enabled in limited 'local' mode in Impala 2.5.0. They can be enabled fully by setting RUNTIME_FILTER_MODE=GLOBAL.
• Other runtime filter parameters include:
  • RUNTIME_BLOOM_FILTER_SIZE: [1048576]
  • RUNTIME_FILTER_WAIT_TIME_MS: [0]
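To get a feel for why the filter size matters, the textbook bloom filter false-positive estimate can be evaluated for the 1048576-byte default listed above. This is a rough sketch under stated assumptions: it uses the classic formula p ≈ (1 - e^(-kn/m))^k, the number of hash functions (8) is a guess for illustration, and Impala's actual filter layout may behave differently.

```python
import math


def false_positive_rate(size_bytes, num_keys, num_hashes=8):
    """Classic bloom filter estimate: p = (1 - exp(-k * n / m)) ** k.

    num_hashes=8 is an assumption for illustration, not an Impala setting.
    """
    num_bits = size_bytes * 8
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


DEFAULT_SIZE = 1048576  # value shown for RUNTIME_BLOOM_FILTER_SIZE above

# Estimated false-positive rate for small, medium and very large build sides.
rates = {n: false_positive_rate(DEFAULT_SIZE, n) for n in (10_000, 1_000_000, 100_000_000)}
```

Under these assumptions a filter of fixed size stays selective for build sides with up to roughly a million distinct keys and becomes nearly useless far beyond that.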
Improved Cardinality Estimates and Join Order
1. More robust scan cardinality estimation
• Mitigate correlated predicates (exponential backoff), for example:
SELECT *
FROM cars
WHERE cars.make = 'Toyota'
  AND cars.model = 'Camry'
2. Improved join cardinality estimation
• Special treatment of the common case of PK/FK joins
• Detect selective joins by applying the selectivity of build-side predicates to the estimated join cardinality
• TPC-H Q8 impact: >8x speedup (91s in Impala 2.3 -> 11s in Impala 2.5)
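The cars query shows why correlated predicates matter: every Camry is a Toyota, so multiplying the two selectivities underestimates the result size. A minimal sketch of one common form of exponential backoff follows. The exponent schedule (1, 1/2, 1/4, ...) and the selectivity values are assumptions for illustration; the slides do not give Impala's exact formula.

```python
def independent_selectivity(selectivities):
    """Naive estimate: assume predicates are independent and multiply."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result


def backoff_selectivity(selectivities):
    """Exponential backoff: the most selective predicate counts fully,
    each further predicate is dampened by a halving exponent."""
    result = 1.0
    for i, s in enumerate(sorted(selectivities)):
        result *= s ** (1.0 / 2 ** i)
    return result


# Hypothetical selectivities: 10% of cars are Toyotas, 2% are Camrys.
make_sel, model_sel = 0.10, 0.02
naive = independent_selectivity([make_sel, model_sel])
damped = backoff_selectivity([make_sel, model_sel])
```

The damped estimate lands between the naive product and the selectivity of the single most selective predicate.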
Query start-up: performance impact
LLVM Codegen Support in Impala
Operations:
• Hash join
• Aggregation
• Scans: Text, Sequence, Avro
• Expressions in all operators
• Sort
• Top-N
Data Types:
• TINYINT, SMALLINT, INT, BIGINT
• FLOAT, DOUBLE
• BOOLEAN
• STRING, VARCHAR
• DECIMAL
Slide callouts: "New in Impala 2.5", "Extended in Impala 2.5"
Codegen for Order by & Top-N
void* ExprContext::GetValue(Expr* e, TupleRow* row) {
  switch (e->type_.type) {
    case TYPE_BOOLEAN: {
      ..
      ..
    }
    case TYPE_TINYINT: {
      ..
      ..
    }
    case TYPE_INT: {
      ..
      .
int Compare(TupleRow* lhs, TupleRow* rhs) const {
  for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
    void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
    void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
    if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
    if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
    int result = RawValue::Compare(lhs_value, rhs_value,
                                   sort_cols_lhs_[i]->root()->type());
    if (!is_asc_[i]) result = -result;
    if (result != 0) return result;
    // Otherwise, try the next Expr
  }
  return 0; // fully equivalent key
}
Codegen code
int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
  int64_t lhs_value = sort_columns[0]->GetBigIntVal(lhs); // loop unrolled, i = 0
  int64_t rhs_value = sort_columns[0]->GetBigIntVal(rhs);
  int result = lhs_value > rhs_value ? 1 :
               (lhs_value < rhs_value ? -1 : 0);
  if (result != 0) return result;
  // Otherwise, try the next Expr
  return 0; // fully equivalent key
}
• Perfectly unrolls the “for each sort column” loop
• No switching on input type(s)
• Removes branching on ASCENDING/DESCENDING, NULLS FIRST/LAST
10x more efficient code
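The effect of this specialization can be imitated in Python with closures. This is a loose analogy only: Impala emits LLVM IR at runtime, whereas the sketch below merely resolves plan-time decisions once instead of on every comparison. All names are hypothetical.

```python
from functools import cmp_to_key


def generic_compare(lhs, rhs, sort_cols):
    """Interpreted path: re-checks direction and NULL handling on every call."""
    for col, ascending, nulls_first in sort_cols:
        a, b = lhs[col], rhs[col]
        if a is None and b is None:
            continue
        if a is None:
            return -1 if nulls_first else 1
        if b is None:
            return 1 if nulls_first else -1
        result = (a > b) - (a < b)
        if not ascending:
            result = -result
        if result != 0:
            return result
    return 0


def make_specialized_compare(col):
    """Stand-in for codegen: built once per query for a single ascending,
    non-nullable column, so no loop or per-row branching remains."""
    def compare(lhs, rhs):
        a, b = lhs[col], rhs[col]
        return (a > b) - (a < b)
    return compare


rows = [{"k": 3}, {"k": 1}, {"k": 2}]
generic_sorted = sorted(
    rows, key=cmp_to_key(lambda l, r: generic_compare(l, r, [("k", True, True)])))
specialized_sorted = sorted(rows, key=cmp_to_key(make_specialized_compare("k")))
```

Both comparators order the rows identically; the specialized one simply does less work per call.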
Decimal arithmetic and aggregation
Float/Double vs Decimal?
Pros for Float/Double
• Uses less memory.
• Faster because floating point math operations are natively supported by processors. (Note: Decimal uses fixed-point hardware types: int64 and __int128)
• Can represent a larger range of numbers.
Cons for Float/Double
• Precision errors compound during aggregations
• Can’t do math with a wide number of significant digits (123456789.1 * .0000987654321)
• No go for applications requiring high precision and accuracy
What about the performance penalty?
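The precision issue is easy to reproduce outside Impala. The sketch below uses Python's `decimal` module purely as a stand-in for a SQL DECIMAL type; it says nothing about Impala's internal representation.

```python
from decimal import Decimal

# Summing ten dimes: binary floating point accumulates a rounding error,
# while decimal arithmetic stays exact.
float_total = sum([0.1] * 10)
decimal_total = sum([Decimal("0.1")] * 10)

# The multiplication from the slide: the decimal product keeps every digit,
# the double-precision product is rounded to about 16 significant digits.
float_product = 123456789.1 * .0000987654321
exact_product = Decimal("123456789.1") * Decimal("0.0000987654321")
```

In an aggregation over billions of rows, small per-row errors like the one in `float_total` can add up, which is the concern the slide raises.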
Decimal arithmetic and aggregation
SELECT l_returnflag,
       l_linestatus,
       Sum(l_quantity) AS SUM_QTY,
       Sum(l_extendedprice) AS SUM_BASE_PRICE,
       Sum(l_extendedprice * (1 - l_discount)) AS SUM_DISC_PRICE
FROM lineitem
GROUP BY l_returnflag,
         l_linestatus
ORDER BY l_returnflag,
         l_linestatus
3x speedup
● Simplified overflow check for decimal.
● Extended Codegen framework to support aggregations involving decimal.
● Bridged the performance gap between double and decimal.
Distributed Aggregations in Impala
select cust_id, sum(dollars)
from sales group by cust_id;
Dataflow on each node: Scan -> Preagg -> Network -> Merge
• Impala aggregations have two phases:
  • Pre-aggregation phase
  • Merge phase
• The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value.
  • E.g. many sales per customer.
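The two phases can be sketched for the `sum(dollars) group by cust_id` example. This is a single-process simulation with made-up data; function names and the hash-partitioning scheme are assumptions for illustration.

```python
from collections import defaultdict


def pre_aggregate(rows):
    """Phase 1, run on each node: collapse local rows to one partial sum per key."""
    partial = defaultdict(float)
    for cust_id, dollars in rows:
        partial[cust_id] += dollars
    return partial


def shuffle(partials, num_mergers):
    """Network exchange: route each key's partial sums to a single merge node."""
    buckets = [[] for _ in range(num_mergers)]
    for partial in partials:
        for cust_id, subtotal in partial.items():
            buckets[hash(cust_id) % num_mergers].append((cust_id, subtotal))
    return buckets


def merge(bucket):
    """Phase 2: combine the partial sums received for each key."""
    return dict(pre_aggregate(bucket))


# Hypothetical scan output of three nodes: (cust_id, dollars) pairs.
node_scans = [
    [(1, 10.0), (2, 5.0), (1, 2.5)],
    [(2, 1.0), (3, 7.0)],
    [(1, 4.0), (3, 3.0), (3, 1.0)],
]
partials = [pre_aggregate(rows) for rows in node_scans]
buckets = shuffle(partials, num_mergers=2)
result = {}
for bucket in buckets:
    result.update(merge(bucket))

rows_scanned = sum(len(rows) for rows in node_scans)
rows_sent = sum(len(bucket) for bucket in buckets)
```

Here `rows_sent` is smaller than `rows_scanned` because repeated customers were collapsed before the exchange, which is the network saving the slide describes.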
Downsides of Pre-aggregations
select distinct * from sales;
Dataflow on each node: Scan -> Preagg -> Network -> Merge
• Pre-aggregations consume:
  • Memory
  • CPU cycles
• Pre-aggregations are not always effective at reducing network traffic
  • E.g. select distinct for nearly-distinct rows
• Pre-aggregations can spill to disk under memory pressure
  • Disk I/O is bad: better to send rows to the merge aggregation than to disk
Streaming Pre-aggregations in Impala 2.5
select distinct * from sales;
• Reduction factor is dynamically estimated based on the actual data processed
• Pre-aggregation expands memory usage only if the reduction factor is good
• Benefits:
  • Certain aggregations with a low reduction factor see speedups of up to 40%
  • Memory consumption can be reduced by 50% or more
  • Streaming pre-aggregations don’t spill to disk
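A minimal sketch of the streaming idea follows. The capacity, the doubling policy and the `min_reduction` threshold are hypothetical choices made for the example, not Impala's actual heuristics.

```python
def streaming_pre_aggregate(rows, initial_capacity=4, min_reduction=2.0):
    """Return (output_rows, table_size) for the merge phase.

    Keys already in the hash table are aggregated. When the table is full it
    grows only if the observed reduction factor (rows aggregated / groups held)
    is good; otherwise rows with new keys are passed through unaggregated.
    """
    table = {}
    capacity = initial_capacity
    rows_aggregated = 0
    passed_through = []
    for key, value in rows:
        if key in table:
            table[key] += value
            rows_aggregated += 1
        elif len(table) < capacity:
            table[key] = value
            rows_aggregated += 1
        elif rows_aggregated / len(table) >= min_reduction:
            capacity *= 2  # reduction is paying off: allow more memory
            table[key] = value
            rows_aggregated += 1
        else:
            passed_through.append((key, value))  # stream to the merge phase
    return list(table.items()) + passed_through, len(table)


# Nearly distinct input: the table stays small and rows stream through.
distinct_out, distinct_size = streaming_pre_aggregate([(i, 1.0) for i in range(100)])

# Highly repetitive input: the table grows and most rows are collapsed.
repeat_out, repeat_size = streaming_pre_aggregate([(i % 10, 1.0) for i in range(1000)])
```

Either way the merge phase receives rows whose per-key sums are correct; only the amount of memory used and the number of rows sent differ.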
Streaming Pre-aggregations in Impala 2.5
Baseline (finished in 23.13 seconds):
Operator      #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
06:AGGREGATE  1       366.581ms  366.581ms  1        1           72.00 KB   -1.00 B        FINALIZE
05:EXCHANGE   1       149.923us  149.923us  15       1           0          -1.00 B        UNPARTITIONED
02:AGGREGATE  15      243.604ms  248.701ms  15       1           12.00 KB   10.00 MB
04:AGGREGATE  15      8s887ms    9s585ms    450.00M  437.91M     1.53 GB    245.01 MB      FINALIZE
03:EXCHANGE   15      827.770ms  932.785ms  450.00M  437.91M     0          0              HASH(o_orderkey)
01:AGGREGATE  15      9s995ms    11s484ms   450.00M  437.91M     1.64 GB    3.59 GB
00:SCAN HDFS  15      142.192ms  189.179ms  450.00M  450.00M     150.94 MB  88.00 MB       tpch_300_parquet.orders
With streaming pre-aggregation enabled (finished in 14.9 seconds):
Operator      #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
06:AGGREGATE  1       356.667ms  356.667ms  1        1           72.00 KB   -1.00 B        FINALIZE
05:EXCHANGE   1       110.924us  110.924us  15       1           0          -1.00 B        UNPARTITIONED
02:AGGREGATE  15      246.188ms  250.408ms  15       1           12.00 KB   10.00 MB
04:AGGREGATE  15      11s174ms   11s753ms   450.00M  437.91M     1.51 GB    245.01 MB      FINALIZE
03:EXCHANGE   15      750.620ms  805.099ms  450.00M  437.91M     0          0              HASH(o_orderkey)
01:AGGREGATE  15      5s670ms    6s715ms    450.00M  437.91M     153.40 MB  3.59 GB        STREAMING
00:SCAN HDFS  15      151.746ms  201.804ms  450.00M  450.00M     150.95 MB  88.00 MB       tpch_300_parquet.orders
Optimization for partition keys scan
• Use metadata to avoid table accesses for partition key scans:
  • select min(month), max(year) from functional.alltypes;
  • month, year are partition keys of the table
• Enabled by query option OPTIMIZE_PARTITION_KEY_SCANS
• Applicable:
  • min(), max(), ndv() and aggregate functions with the distinct keyword
  • partition keys only
Plan without optimization:
03:AGGREGATE [FINALIZE]
| output: min:merge(month), max:merge(year)
|
02:EXCHANGE [UNPARTITIONED]
|
01:AGGREGATE
| output: min(month), max(year)
|
00:SCAN HDFS [functional.alltypes]
   partitions=24/24 files=24 size=478.45KB
Plan with optimization:
01:AGGREGATE [FINALIZE]
| output: min(month),max(year)
|
00:UNION
   constant-operands=24
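The optimized plan replaces the HDFS scan with 24 constant operands, one per partition. The idea can be sketched as follows; the partition list is a hypothetical stand-in for catalog metadata (two years of twelve months, matching the 24 partitions in the plan), and the function name is invented for the example.

```python
# Hypothetical catalog metadata: one (year, month) key pair per partition.
partitions = [(year, month) for year in (2009, 2010) for month in range(1, 13)]


def min_month_max_year(partition_keys):
    """Answer min(month), max(year) from partition keys alone, without reading any data files."""
    months = {month for _, month in partition_keys}
    years = {year for year, _ in partition_keys}
    return min(months), max(years)


answer = min_month_max_year(partitions)
```

One design consequence worth noting: a partition with no rows would still contribute its key values here, so the metadata-only answer can differ from a real scan, which may be why the behaviour sits behind a query option.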
Competitive benchmark: TPC-DS
Hardware: 21-node cluster, each node with
● 384GB memory, 2 sockets, 12 cores total, Intel Xeon CPU E5-2630L 0 at 2.00GHz
● 12 disk drives at 932GB each (one for the OS, the rest for HDFS)
Comparative Set
● Impala 2.5
○ RUNTIME_FILTER_MODE = 2;
● Spark SQL 1.6
○ Thrift JDBC server used to avoid startup cost
○ --master yarn --deploy-mode client --driver-memory 24G --driver-cores 8 --executor-memory 24G --num-executors 240
Workload
● TPC-DS 15TB stored in Parquet file format (default 256MB block size)
● Unmodified TPC-DS queries: 3, 7, 8, 19, 25, 27, 34, 42, 43, 46, 47, 52, 53, 55, 59, 61, 63, 68, 73, 79, 88, 89, 96, 98
● Caveats:
○ Spark SQL failed to run:
■ Q25: Bad plan
■ Q47: StackOverflowError
■ Q89: StackOverflowError
Competitive benchmark
Query complexity varied from Q3:
SELECT dt.d_year,
       item.i_brand_id brand_id,
       item.i_brand brand,
       Sum(ss_ext_sales_price) sum_agg
FROM date_dim dt,
     store_sales,
     item
WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
  AND store_sales.ss_item_sk = item.i_item_sk
  AND item.i_manufact_id = 436
  AND dt.d_moy = 12
GROUP BY dt.d_year,
         item.i_brand,
         item.i_brand_id
ORDER BY dt.d_year,
         sum_agg DESC,
         brand_id
LIMIT 100;
to Q25 (fact-to-fact joins):
SELECT i_item_id, i_item_desc, s_store_id, s_store_name,
       Stddev_samp(ss_net_profit), Stddev_samp(sr_net_loss),
       Stddev_samp(cs_net_profit) AS catalog_sales_profit
FROM store_sales,
     store_returns,
     catalog_sales,
     date_dim d1,
     date_dim d2,
     date_dim d3,
     store,
     item
WHERE d1.d_moy = 4 AND d1.d_year = 2001 AND d1.d_date_sk = ss_sold_date_sk
  AND i_item_sk = ss_item_sk AND s_store_sk = ss_store_sk
  AND ss_customer_sk = sr_customer_sk AND ss_item_sk = sr_item_sk
  AND ss_ticket_number = sr_ticket_number
  AND sr_returned_date_sk = d2.d_date_sk AND d2.d_moy BETWEEN 4 AND 10
  AND d2.d_year = 2001 AND sr_customer_sk = cs_bill_customer_sk
  AND sr_item_sk = cs_item_sk AND cs_sold_date_sk = d3.d_date_sk
  AND d3.d_moy BETWEEN 4 AND 10 AND d3.d_year = 2001
GROUP BY i_item_id, i_item_desc,
         s_store_id, s_store_name
ORDER BY i_item_id, i_item_desc,
         s_store_id, s_store_name
LIMIT 100;
Competitive benchmark
Impala 2.5 is 11x faster (based on geomean)
Performance Benchmark Takeaways
• Impala unlocks BI usage directly on Hadoop
  • Meets BI low-latency and multi-user requirements
  • Advantage expands when going from a single user to just 10 users
• Spark SQL enables easier Spark application development
  • Enables mixed procedural Spark (Java/Scala) and SQL job development
• Mid-term trends will further favor Impala’s design approach for latency and concurrency
  • More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
  • CPU efficiency will increase in importance
  • Native code enables easy optimizations for CPU instruction sets
Impala and Cloud
• Available today in Impala 2.5:
  • All the same Impala functionality, performance, and third-party integrations
  • Supported across our cloud partners
  • Deployment via Director
  • Modular architecture enables cloud’s decoupled storage and elasticity future
• Available soon in Impala 2.6:
  • Impala read/write to S3 in addition to local HDFS (IMPALA-1878)
  • Dynamically sized runtime filters
  • Parquet scanner optimization
  • Faster joins, aggregations, sorts and decimal arithmetic
  • Rack-aware scheduling
  • Faster code generation
Impala Roadmap
2H 2015 1H 2016 2016
• SQL Support & Usability
  • Nested structures
  • Kudu updates (beta)
• Management & Security
  • Record reader service (beta)
  • Finer-grained security (Sentry)
• Integration
  • Isilon support
  • Python interface (Ibis)
• Performance & Scale
  • Improved predictability under concurrency
• Performance & Scale
  • Continued scalability and concurrency
  • Initial perf/scale improvements
• Management & Security
  • Improved admission control
  • Resource utilization and showback
• SQL Support & Usability
  • Dynamic partitioning
• Performance & Scale
  • >20x performance
  • Multi-threaded joins/aggregations
  • Continued scale work
• Cloud
  • S3 read/write support
• Management & Security
  • Improved YARN integration
  • Automated metadata
• SQL Support & Usability
  • Data type improvements
  • Added SQL extensions
Appendix.
Query start-up improvements
• Before Impala 2.5:
  • Coordinator starts receiving fragments before senders
  • Problem:
    • Serializes startup
    • Startup gets slower as scale and plan complexity grow
• Impala 2.5:
  • Coordinator starts fragments in any order
  • Added wait logic for senders and receivers
Scheduling Small Queries
The query scheduler assigns scan ranges to workers (each running impalad). First it selects an HDFS datanode to read from (A, B or C).
Selection will always start with the same replica to make optimal use of OS buffer caches.
This can lead to hot-spots for some workloads.
Improvement: Pick impalad at random.
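The hot-spot effect and the random alternative can be illustrated with a small simulation. The replica names follow the A, B, C labels above; the two selection functions are simplified stand-ins for the scheduler, not its real logic.

```python
import random
from collections import Counter

REPLICAS = ["A", "B", "C"]


def pick_first(replicas):
    """Baseline behaviour: always start with the same replica (cache friendly)."""
    return replicas[0]


def pick_random(replicas, rng):
    """Alternative: choose a replica uniformly at random to spread the load."""
    return rng.choice(replicas)


rng = random.Random(42)  # fixed seed so the simulation is repeatable
fixed_load = Counter(pick_first(REPLICAS) for _ in range(3000))
random_load = Counter(pick_random(REPLICAS, rng) for _ in range(3000))
```

With the fixed policy every small query lands on replica A; with random selection the same queries spread over all three replicas, at the cost of caching the same blocks on more nodes.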
New Query Option: random_replica
Disabled by default.
set random_replica = 1;
Also has a corresponding query hint:
SELECT AVG(c1) FROM t /* +SCHEDULE_RANDOM_REPLICA */;
Where It Can Help
• Large number of small queries, each with few input tables.
• High load on only one of multiple replicas of a table.
• Queries are CPU bound.
• Benefit: Distribute load more evenly over replicas.
• Tradeoff: Distribution of local reads will increase buffer cache usage.
What’s Next
• Add possibility to prefer remote reads.
• Switch remote impalad selection from round-robin to load-based.
• Add rack-awareness.
Catalog Improvements
• Incrementally update table metadata instead of force-reloading all table metadata during DDL/DML operations
• Reload metadata of only 'dirty' partitions
• Reuse descriptors of HDFS files to avoid loading file/block metadata for files that haven’t been modified
• Significantly reduce the latency of DDL/DML operations that change a small fraction of table metadata (e.g. alter table foo partition (year = 2010) set location 'blah')
Catalog Improvements - Results

More Related Content

What's hot

Kudu Forrester Webinar
Kudu Forrester WebinarKudu Forrester Webinar
Kudu Forrester WebinarCloudera, Inc.
 
Cloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made EasyCloudera Altus: Big Data in the Cloud Made Easy
Cloudera Altus: Big Data in the Cloud Made EasyCloudera, Inc.
 
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
 Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ... Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...
Gartner Data and Analytics Summit: Bringing Self-Service BI & SQL Analytics ...Cloudera, Inc.
 
Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Consolidate your data marts for fast, flexible analytics 5.24.18
Consolidate your data marts for fast, flexible analytics 5.24.18Cloudera, Inc.
 
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...
Part 2: Apache Kudu: Extending the Capabilities of Operational and Analytic D...Cloudera, Inc.
 
Data Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudData Engineering: Elastic, Low-Cost Data Processing in the Cloud
Data Engineering: Elastic, Low-Cost Data Processing in the CloudCloudera, Inc.
 
How Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsHow Data Drives Business at Choice Hotels
How Data Drives Business at Choice HotelsCloudera, Inc.
 
How to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data PlatformHow to Build Multi-disciplinary Analytics Applications on a Shared Data Platform
How to Build Multi-disciplinary Analytics Applications on a Shared Data PlatformCloudera, Inc.
 
The Big Picture: Learned Behaviors in Churn
The Big Picture: Learned Behaviors in ChurnThe Big Picture: Learned Behaviors in Churn
The Big Picture: Learned Behaviors in ChurnCloudera, Inc.
 
Intuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchIntuitive Real-Time Analytics with Search
Intuitive Real-Time Analytics with SearchCloudera, Inc.
 
Making Self-Service BI a Reality in the Enterprise
Making Self-Service BI a Reality in the EnterpriseMaking Self-Service BI a Reality in the Enterprise
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 

Recently uploaded (20)

PREDICTING RIVER WATER QUALITY ppt presentation
PREDICTING  RIVER  WATER QUALITY  ppt presentationPREDICTING  RIVER  WATER QUALITY  ppt presentation
PREDICTING RIVER WATER QUALITY ppt presentation
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 

Apache Impala (incubating) 2.5 Performance Update

  • 1. Apache Impala 2.5 (Incubating): Performance improvements overview (© Cloudera, Inc. All rights reserved.)
  • 2. Agenda
      • What is Impala?
      • Impala at Apache
      • What is new in Impala 2.5 (CDH 5.7)
      • Impala performance update
      • Roadmap
      • Q&A
  • 3. SQL-on-Hadoop engines
      SQL-on-Apache Hadoop: choosing the right tool for the right job
      [Figure: SQL engines on Hadoop; surviving labels: SQL, Impala]
  • 4. What is Impala?
      • General-purpose SQL engine
      • Real-time queries in Apache Hadoop
      • Generally available (v1.0) since April 2013
      • Analytic SQL functionality (v2.0) since October 2014
      • Apache incubator project since December 2015
      • Previous release 2.3 (CDH 5.5) released November 2015
      • Current release 2.5 (CDH 5.7) released April 2016 (today's topic)
  • 5. Impala overview
      • Query speed over Hadoop that meets or exceeds that of a proprietary analytic DBMS
      • General-purpose SQL query engine:
          • Targeted for analytical workloads
          • Supports queries that take from milliseconds to hours
      • Runs directly within Hadoop:
          • Reads widely used Hadoop file formats
          • Talks to widely used Hadoop storage managers
          • Runs on the same nodes that run Hadoop processes
      • Highly available
      • High performance:
          • C++ instead of Java
          • Runtime code generation
  • 6. Impala Use Cases
      • Interactive BI/analytics on more data
      • Asking new questions: exploration, ML (Ibis)
      • Data processing with tight SLAs
      • Queryable archive with full fidelity
  • 7. Impala at Apache
      • Incubator project since December 2015
      • Development process slowly moving to ASF infrastructure (see IMPALA-3221)
      • Help wanted!
      Where to find the Impala community:
      • dev@impala.incubator.apache.org
      • user@impala.incubator.apache.org
      • http://impala.io
      • @apacheimpala
  • 8. New in Impala 2.5
      Usability Enhancements
      • Admission control improvements
      • Null-safe join/equals
      Performance and Scalability
      • Runtime filters
      • Improved cardinality estimation and join ordering
      • Query start-up improvements
      • Additional codegen and code optimizations
      • Decimal arithmetic improvements
      • Fast min/max values on partition columns (with query option)
      Integrations
      • Support for EMC DSSD
  • 9. New in Impala 2.5: Performance and Scalability (covered today)
      • Runtime filters
      • Improved cardinality estimation and join ordering
      • Query start-up improvements
      • Additional codegen and code optimizations
      • Decimal arithmetic improvements
      • Incremental metadata updates (DDL)
      • Fast min/max values on partition columns (with query option)
  • 10. Impala 2.5 (CDH 5.7) improvements vs. Impala 2.3 (CDH 5.5)
      • 2.2x speedup for TPC-H
      • 1.7x speedup for TPC-H (Nested)
      • 4.3x speedup for TPC-DS
  • 11. Runtime filtering
      • General idea: some predicates can only be computed at runtime
      • Example:
          SELECT count(*)
          FROM date_dim dt, store_sales
          WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
            AND dt.d_moy = 12;
      • How does Impala execute this query?
  • 12. Runtime filters
      SELECT dt.d_year, item.i_brand brand, sum(ss_ext_sales_price) sum_agg
      FROM date_dim dt, store_sales, item
      WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
        AND store_sales.ss_item_sk = item.i_item_sk
        AND i_category = "Books"
        AND i_class = "fiction"
        AND dt.d_moy = 12
      GROUP BY dt.d_year, item.i_brand
      ORDER BY dt.d_year, sum_agg DESC, i_brand
      LIMIT 100
      Plan diagram labels:
      • store_sales: 43 billion rows
      • item: 198 rows
      • Broadcast Join #1: 290 million rows
      • date_dim: 6,200 rows
      • Broadcast Join #2
      • Aggregate: 47 million rows
  • 13. Runtime filters: the opportunity (same query and plan as slide 12)
      • The planner doesn't know which values of ss_sold_date_sk and ss_item_sk will qualify, even with statistics.
      • There is an opportunity to save work: why bother sending all 43 billion store_sales rows to the joins?
      • Runtime filters compute this predicate at runtime.
  • 14. Runtime filters (same query and plan as slide 12)
      Step 1: The planner tells Join #1 to produce a Bloom filter for qualifying i_item_sk values, and Join #2 to produce a Bloom filter for qualifying d_date_sk values.
  • 15. Runtime filters (same query and plan as slide 12)
      Step 2: Each join reads all rows from its build side (right input) and computes a filter containing all distinct values of its join key (i_item_sk for Join #1, d_date_sk for Join #2).
  • 16. Runtime filters (same query and plan as slide 12)
      Step 3: Joins #1 and #2 send their filters to the store_sales scan. The scan eliminates rows that don't have a match in the Bloom filters.
  • 17. Runtime filters (same query as slide 12)
      The store_sales scan uses the Bloom filter from Join #2 to filter out partitions (ss_sold_date_sk) and the Bloom filter from Join #1 to filter out rows that don't qualify (ss_item_sk).
      Plan diagram labels with filters applied:
      • store_sales: 47 million rows
      • item: 198 rows
      • Broadcast Join #1: 47 million rows
      • date_dim: 6,200 rows
      • Broadcast Join #2
      • Aggregate: 47 million rows
  • 18. Runtime filters (same query as slide 12, plan with filters applied)
      • 914x reduction in the number of rows coming out of the scan: 43 billion -> 47 million
      • 6x reduction in the number of rows coming out of the join: 290 million -> 47 million
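Steps 1 to 3 can be sketched in a few lines of Python. This is only an illustration of the idea, not Impala's implementation: the `BloomFilter` class, its size and hash scheme, the helper names, and the toy tables are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1 << 16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests of the key.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


def build_runtime_filter(build_rows, key):
    """Step 2: the join reads its build side and records every join-key value."""
    bloom = BloomFilter()
    for row in build_rows:
        bloom.add(row[key])
    return bloom


def filtered_scan(probe_rows, filters):
    """Step 3: the scan drops rows that cannot match any build-side key."""
    for row in probe_rows:
        if all(bloom.might_contain(row[col]) for col, bloom in filters.items()):
            yield row


# Hypothetical toy data standing in for item, date_dim and store_sales.
item = [{"i_item_sk": k} for k in (3, 7)]           # rows that passed the item predicates
date_dim = [{"d_date_sk": k} for k in (100, 101)]   # rows with d_moy = 12
store_sales = [
    {"ss_item_sk": i, "ss_sold_date_sk": d}
    for i in range(10) for d in range(95, 105)
]

filters = {
    "ss_item_sk": build_runtime_filter(item, "i_item_sk"),
    "ss_sold_date_sk": build_runtime_filter(date_dim, "d_date_sk"),
}
survivors = list(filtered_scan(store_sales, filters))
print(len(store_sales), "->", len(survivors))
```

Because a Bloom filter can return false positives, the joins still have to check every surviving row; the filter only reduces how many rows reach them.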
  • 19. Runtime filters variation: Global filters
      SELECT c_email_address, sum(ss_ext_sales_price) sum_agg
      FROM store_sales, customer, customer_demographics
      WHERE ss_customer_sk = c_customer_sk
        AND cd_demo_sk = c_current_cdemo_sk
        AND cd_gender = 'M'
        AND cd_purchase_estimate = 10000
        AND cd_credit_rating = 'Low Risk'
      GROUP BY c_email_address
      ORDER BY sum_agg DESC
      Plan diagram labels:
      • store_sales: 43 billion rows (Shuffle)
      • customer: 3.8 million rows (Shuffle)
      • Shuffle Join #1: 43 billion rows
      • customer_demo: 2,400 rows
      • Broadcast Join #2
      • Aggregate: 49 million rows
  • 20. Runtime filters variation: Global filters (same query and plan as slide 19)
      Joins #1 and #2 are expensive because the left side of each join has 43 billion rows.
  • 21. Runtime filters variation: Global filters (same query and plan as slide 19)
      Create a Bloom filter from Join #2 on cd_demo_sk and push it down to the customer table scan.
  • 22. Runtime filters variation: Global filters (same query as slide 19)
      Customer rows reduced by 826x: 3.8 million -> 4,600 rows. All other plan labels are unchanged from slide 19.
  • 23. Runtime filters variation: Global filters (same query as slide 19; customer scan now returns 4,600 rows)
      Create a Bloom filter from Join #1 on c_customer_sk and push it down to the store_sales table scan.
  • 24. Runtime filters variation: Global filters (same query as slide 19)
      877x reduction in rows: 43 billion -> 49 million rows. Enabled with:
          set RUNTIME_FILTER_MODE=GLOBAL;
      Plan diagram labels with filters applied:
      • store_sales: 49 million rows (Shuffle)
      • customer: 4,600 rows (Shuffle)
      • Shuffle Join #1: 49 million rows
      • customer_demo: 2,400 rows
      • Broadcast Join #2
      • Aggregate: 49 million rows
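The two-hop propagation in the global-filter slides can be sketched as follows. To keep it short, exact Python sets stand in for Bloom filters, the cd_purchase_estimate predicate is left out, and the toy tables are hypothetical; only the column names come from the query.

```python
# Hypothetical toy tables; column names follow the query on the slides.
customer_demographics = [
    {"cd_demo_sk": 1, "cd_gender": "M", "cd_credit_rating": "Low Risk"},
    {"cd_demo_sk": 2, "cd_gender": "F", "cd_credit_rating": "Low Risk"},
    {"cd_demo_sk": 3, "cd_gender": "M", "cd_credit_rating": "High Risk"},
]
customer = [
    {"c_customer_sk": 10, "c_current_cdemo_sk": 1},
    {"c_customer_sk": 11, "c_current_cdemo_sk": 2},
    {"c_customer_sk": 12, "c_current_cdemo_sk": 3},
    {"c_customer_sk": 13, "c_current_cdemo_sk": 1},
]
store_sales = [
    {"ss_customer_sk": sk, "ss_ext_sales_price": 5.0}
    for sk in (10, 11, 12, 13, 10, 12)
]

# Filter 1: built from the qualifying customer_demographics rows
# (build side of Join #2) and applied to the customer scan.
demo_filter = {
    r["cd_demo_sk"]
    for r in customer_demographics
    if r["cd_gender"] == "M" and r["cd_credit_rating"] == "Low Risk"
}
customers_kept = [r for r in customer if r["c_current_cdemo_sk"] in demo_filter]

# Filter 2: built from the already-reduced customer rows (build side of
# Join #1) and applied to the store_sales scan. In a cluster that scan
# runs on other nodes, which is why the filter has to be "global".
customer_filter = {r["c_customer_sk"] for r in customers_kept}
sales_kept = [r for r in store_sales if r["ss_customer_sk"] in customer_filter]

print(len(customer), "->", len(customers_kept))   # 4 -> 2
print(len(store_sales), "->", len(sales_kept))    # 6 -> 3
```

The second filter is only as selective as it is because the first one has already shrunk the customer side, which mirrors the 3.8 million -> 4,600 -> 49 million chain on the slides.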
  • 25. Runtime filters: real-world results
      • Runtime filters can be highly effective. Some benchmark queries are more than 30 times faster in Impala 2.5.0.
      • As always, results depend on your queries, your schemas and your cluster environment.
      • By default, runtime filters are enabled in limited 'local' mode in Impala 2.5.0. They can be enabled fully by setting RUNTIME_FILTER_MODE=GLOBAL.
      • Other runtime filter parameters (value shown in brackets):
          • RUNTIME_BLOOM_FILTER_SIZE: [1048576]
          • RUNTIME_FILTER_WAIT_TIME_MS: [0]
  • 26. Improved Cardinality Estimates and Join Order
      1. More robust scan cardinality estimation
          • Mitigate correlated predicates (exponential backoff). Example:
              SELECT * FROM cars WHERE cars.make = 'Toyota' AND cars.model = 'Camry'
      2. Improved join cardinality estimation
          • Special treatment of the common case of PK/FK joins
          • Detect selective joins by applying the selectivity of build-side predicates to the estimated join cardinality
      • TPC-H Q8 impact: >8x speedup (91s in Impala 2.3 -> 11s in Impala 2.5)
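To see why backoff helps with correlated predicates such as make and model, the sketch below compares the plain independence assumption with one common form of exponential backoff. The slide does not give Impala's exact formula, so the halving exponents (1, 1/2, 1/4, ...) and the selectivity values are assumptions for illustration.

```python
def independent_selectivity(selectivities):
    """Classic estimate: treat predicates as independent and multiply."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result


def backoff_selectivity(selectivities):
    """Backoff estimate: the most selective predicate counts fully, and each
    further predicate is dampened by a halving exponent (1, 1/2, 1/4, ...)."""
    result = 1.0
    for i, s in enumerate(sorted(selectivities)):
        result *= s ** (1.0 / (2 ** i))
    return result


# Hypothetical selectivities for make = 'Toyota' and model = 'Camry'.
sels = [0.10, 0.02]
rows = 1_000_000
print(round(rows * independent_selectivity(sels)))  # 2000
print(round(rows * backoff_selectivity(sels)))      # 6325
```

Every Camry is a Toyota, so multiplying the two selectivities underestimates the result; the dampened estimate stays closer to the selectivity of the model predicate alone (20,000 rows in this toy setup).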
  • 27. Query start-up: performance impact
  • 28. LLVM Codegen Support in Impala
      Operations:
      • Hash join
      • Aggregation
      • Scans: Text, Sequence, Avro
      • Expressions in all operators
      • Sort
      • Top-N
      Data Types:
      • TINYINT, SMALLINT, INT, BIGINT
      • FLOAT, DOUBLE
      • BOOLEAN
      • STRING, VARCHAR
      • DECIMAL
      Legend: "New in Impala 2.5" / "Extended in Impala 2.5" (the slide's color coding is not preserved in this transcript)
  • 29. Codegen for Order by & Top-N
      void* ExprContext::GetValue(Expr* e, TupleRow* row) {
        switch (e->type_.type) {
          case TYPE_BOOLEAN: { .. .. }
          case TYPE_TINYINT: { .. .. }
          case TYPE_INT: { .. .

      int Compare(TupleRow* lhs, TupleRow* rhs) const {
        for (int i = 0; i < sort_cols_lhs_.size(); ++i) {
          void* lhs_value = sort_cols_lhs_[i]->GetValue(lhs);
          void* rhs_value = sort_cols_rhs_[i]->GetValue(rhs);
          if (lhs_value == NULL && rhs_value != NULL) return nulls_first_[i];
          if (lhs_value != NULL && rhs_value == NULL) return -nulls_first_[i];
          int result = RawValue::Compare(lhs_value, rhs_value,
                                         sort_cols_lhs_[i]->root()->type());
          if (!is_asc_[i]) result = -result;
          if (result != 0) return result;
          // Otherwise, try the next Expr
        }
        return 0;  // fully equivalent key
      }
  • 30. Codegen for Order by & Top-N
      Codegen code:
      int CompareCodegened(TupleRow* lhs, TupleRow* rhs) const {
        int64_t lhs_value = sort_columns[i]->GetBigIntVal(lhs);  // i = 0
        int64_t rhs_value = sort_columns[i]->GetBigIntVal(rhs);  // i = 1
        int result = lhs_value > rhs_value ? 1 : (lhs_value < rhs_value ? -1 : 0);
        if (result != 0) return result;
        // Otherwise, try the next Expr
        return 0;  // fully equivalent key
      }
      • Perfectly unrolls the "for each grouping column" loop
      • No switching on input type(s)
      • Removes branching on ASCENDING/DESCENDING, NULLS FIRST/LAST
      Original code: the generic Compare() function shown on slide 29.
  • 31. Codegen for Order by & Top-N
      Same comparison of codegen code and original code as on slide 30. Result: 10x more efficient code.
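The same specialization idea can be tried out in Python, as a rough analogy only: Impala emits LLVM IR, whereas this sketch builds Python source for one fixed sort specification and compiles it with `exec`. The function names, the `(column, is_asc, nulls_first)` tuple layout, and the NOT NULL assumption are all made up for the example.

```python
import functools


def compare_generic(lhs, rhs, sort_spec):
    """Interpreted path: loops over sort columns and branches on every call."""
    for col, is_asc, nulls_first in sort_spec:
        a, b = lhs[col], rhs[col]
        if a is None and b is not None:
            return -1 if nulls_first else 1
        if a is not None and b is None:
            return 1 if nulls_first else -1
        if a is None and b is None:
            continue
        result = (a > b) - (a < b)
        if not is_asc:
            result = -result
        if result != 0:
            return result
    return 0


def make_specialized_compare(sort_spec):
    """'Codegen' stand-in: emit source for one fixed sort spec, compile it once."""
    lines = ["def compare(lhs, rhs):"]
    for col, is_asc, _nulls_first in sort_spec:
        # Assumption for brevity: columns are NOT NULL, so NULL checks vanish.
        first, second = ("lhs", "rhs") if is_asc else ("rhs", "lhs")
        lines.append(f"    a, b = {first}[{col!r}], {second}[{col!r}]")
        lines.append("    if a != b: return 1 if a > b else -1")
    lines.append("    return 0")
    namespace = {}
    exec("\n".join(lines), namespace)
    return namespace["compare"]


# ORDER BY year ASC, sales DESC over hypothetical rows.
spec = [("year", True, True), ("sales", False, True)]
rows = [
    {"year": 2001, "sales": 5},
    {"year": 2000, "sales": 9},
    {"year": 2001, "sales": 7},
]
generic_key = functools.cmp_to_key(lambda l, r: compare_generic(l, r, spec))
sorted_generic = sorted(rows, key=generic_key)
sorted_special = sorted(rows, key=functools.cmp_to_key(make_specialized_compare(spec)))
print(sorted_special)
```

The generated function has no loop, no per-call look-up of the sort direction, and no NULL handling, which is the same kind of work the slide says codegen removes.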
  • 32. Decimal arithmetic and aggregation: Float/Double vs. Decimal?
      Pros for Float/Double
      • Uses less memory.
      • Faster, because floating-point math operations are natively supported by processors. (Note: Decimal uses fixed-point hardware types: int64 and __int128)
      • Can represent a larger range of numbers.
      Cons for Float/Double (a no-go for applications requiring high precision and accuracy)
      • Precision errors compound during aggregations
      • Can't do math with a wide number of significant digits (123456789.1 * .0000987654321)
      What about the performance penalty?
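Both drawbacks of Float/Double are easy to reproduce. The sketch below uses Python's `float` (an IEEE 754 double) and the standard-library `decimal` module as stand-ins for Impala's DOUBLE and DECIMAL types; it illustrates the numeric behaviour only, not Impala's implementation.

```python
from decimal import Decimal

# Precision errors compound during aggregation: summing 0.1 ten times.
float_total = sum([0.1] * 10)
decimal_total = sum([Decimal("0.1")] * 10)
print(float_total == 1.0)             # False
print(decimal_total == Decimal("1"))  # True

# Multiplication with many significant digits (the example from the slide).
float_product = 123456789.1 * 0.0000987654321
decimal_product = Decimal("123456789.1") * Decimal("0.0000987654321")
print(float_product)
print(decimal_product)
```

The decimal product is exact because both operands are stored as scaled integers, which matches the slide's note that Decimal is backed by integer types.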
  • 33. Decimal arithmetic and aggregation
      SELECT l_returnflag,
             l_linestatus,
             Sum(l_quantity) AS SUM_QTY,
             Sum(l_extendedprice) AS SUM_BASE_PRICE,
             Sum(l_extendedprice * (1 - l_discount)) AS SUM_DISC_PRICE
      FROM lineitem
      GROUP BY l_returnflag, l_linestatus
      ORDER BY l_returnflag, l_linestatus
      Result: 3x speedup
      • Simplified overflow check for decimal.
      • Extended the codegen framework to support aggregations involving decimal.
      • Bridged the performance gap between double and decimal.
  • 34. Distributed Aggregations in Impala
      select cust_id, sum(dollars) from sales group by cust_id;
      [Diagram labels: Scan x3, Preagg x3, Network, Merge x3]
      • Impala aggregations have two phases:
          • Pre-aggregation phase
          • Merge phase
      • The pre-aggregation phase greatly reduces network traffic if there are many input rows per grouping value.
          • E.g. many sales per customer.
  • 35. Downsides of Pre-aggregations
      select distinct * from sales;
      [Diagram labels: Scan x3, Preagg x3, Network, Merge x3]
      • Pre-aggregations consume:
          • Memory
          • CPU cycles
      • Pre-aggregations are not always effective at reducing network traffic
          • E.g. select distinct for nearly-distinct rows
      • Pre-aggregations can spill to disk under memory pressure
          • Disk I/O is bad: better to send rows to the merge aggregation than to disk
  • 36. Streaming Pre-aggregations in Impala 2.5
      select distinct * from sales;
      [Diagram labels: Scan x3, Network, Merge x3]
      • Reduction factor is dynamically estimated based on the actual data processed
      • Pre-aggregation expands memory usage only if the reduction factor is good
      • Benefits:
          • Certain aggregations with a low reduction factor see speedups of up to 40%
          • Memory consumption can be reduced by 50% or more
          • Streaming pre-aggregations don't spill to disk
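A minimal sketch of the streaming idea, assuming a simple policy: the pre-aggregation keeps a small hash table and only lets it grow while the observed reduction factor (rows absorbed per group) is good; otherwise rows with new keys are passed through to the merge phase untouched. The function names, `max_groups`, and `min_reduction` are hypothetical, and Impala's real heuristics are not described on the slides.

```python
from collections import defaultdict


def streaming_preagg(rows, max_groups=4, min_reduction=2.0):
    """Pre-aggregate (key, value) rows in a bounded hash table.

    Once the table holds max_groups groups, it only grows further while the
    observed reduction factor (rows absorbed / groups held) is at least
    min_reduction. Otherwise rows with new keys are passed through.
    """
    table = {}
    absorbed = 0  # rows folded into the table so far
    for key, value in rows:
        if key in table:
            table[key] += value
            absorbed += 1
        elif len(table) < max_groups or absorbed / len(table) >= min_reduction:
            table[key] = value  # reduction is good: let the table grow
            absorbed += 1
        else:
            yield key, value    # pass through unaggregated, no spilling
    yield from table.items()


def merge(partials):
    """Merge phase: combine partial results from the pre-aggregation."""
    totals = defaultdict(int)
    for key, value in partials:
        totals[key] += value
    return dict(totals)


sales = [(cust, 1) for cust in range(3)] * 100   # many rows per customer
distinct = [(i, 1) for i in range(300)]          # nearly distinct rows
for name, rows in (("sales", sales), ("distinct", distinct)):
    partials = list(streaming_preagg(rows))
    print(name, len(rows), "->", len(partials), "rows sent to merge")
```

Correctness never depends on the policy, because the merge phase produces the same totals whether a row arrives pre-aggregated or passed through; the policy only trades memory and CPU in the pre-aggregation against network traffic.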
  • 37. Streaming Pre-aggregations in Impala 2.5
      Baseline (finished in 23.13 seconds):
      Operator      #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
      06:AGGREGATE  1       366.581ms  366.581ms  1        1           72.00 KB   -1.00 B        FINALIZE
      05:EXCHANGE   1       149.923us  149.923us  15       1           0          -1.00 B        UNPARTITIONED
      02:AGGREGATE  15      243.604ms  248.701ms  15       1           12.00 KB   10.00 MB
      04:AGGREGATE  15      8s887ms    9s585ms    450.00M  437.91M     1.53 GB    245.01 MB      FINALIZE
      03:EXCHANGE   15      827.770ms  932.785ms  450.00M  437.91M     0          0              HASH(o_orderkey)
      01:AGGREGATE  15      9s995ms    11s484ms   450.00M  437.91M     1.64 GB    3.59 GB
      00:SCAN HDFS  15      142.192ms  189.179ms  450.00M  450.00M     150.94 MB  88.00 MB       tpch_300_parquet.orders

      With streaming pre-aggregation enabled (finished in 14.9 seconds):
      Operator      #Hosts  Avg Time   Max Time   #Rows    Est. #Rows  Peak Mem   Est. Peak Mem  Detail
      06:AGGREGATE  1       356.667ms  356.667ms  1        1           72.00 KB   -1.00 B        FINALIZE
      05:EXCHANGE   1       110.924us  110.924us  15       1           0          -1.00 B        UNPARTITIONED
      02:AGGREGATE  15      246.188ms  250.408ms  15       1           12.00 KB   10.00 MB
      04:AGGREGATE  15      11s174ms   11s753ms   450.00M  437.91M     1.51 GB    245.01 MB      FINALIZE
      03:EXCHANGE   15      750.620ms  805.099ms  450.00M  437.91M     0          0              HASH(o_orderkey)
      01:AGGREGATE  15      5s670ms    6s715ms    450.00M  437.91M     153.40 MB  3.59 GB        STREAMING
      00:SCAN HDFS  15      151.746ms  201.804ms  450.00M  450.00M     150.95 MB  88.00 MB       tpch_300_parquet.orders

      Pre-aggregation (01:AGGREGATE): average time 9s995ms -> 5s670ms, peak memory 1.64 GB -> 153.40 MB.
  • 38. Optimization for partition key scans
      • Use metadata to avoid table accesses for partition key scans:
          • select min(month), max(year) from functional.alltypes;
          • month, year are partition keys of the table
      • Enabled by query option OPTIMIZE_PARTITION_KEY_SCANS
      • Applicable to:
          • min(), max(), ndv() and aggregate functions with the distinct keyword
          • partition keys only
      Plan without optimization:
        03:AGGREGATE [FINALIZE]
        |  output: min:merge(month), max:merge(year)
        |
        02:EXCHANGE [UNPARTITIONED]
        |
        01:AGGREGATE
        |  output: min(month), max(year)
        |
        00:SCAN HDFS [functional.alltypes]
           partitions=24/24 files=24 size=478.45KB
      Plan with optimization:
        01:AGGREGATE [FINALIZE]
        |  output: min(month),max(year)
        |
        00:UNION
           constant-operands=24
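The optimized plan replaces the HDFS scan with a UNION of 24 constant rows, one per partition. The sketch below mimics that with a hypothetical list of partition-key values; the `partitions` list, its year values, and the helper name are assumptions for the example, and no data files are involved.

```python
# Hypothetical catalog metadata: one entry per partition, no data files opened.
partitions = [{"year": y, "month": m} for y in (2009, 2010) for m in range(1, 13)]


def partition_key_aggregates(partitions, requests):
    """Evaluate min/max/ndv over partition keys from metadata alone."""
    funcs = {"min": min, "max": max, "ndv": lambda values: len(set(values))}
    return {
        f"{fn}({col})": funcs[fn](p[col] for p in partitions)
        for fn, col in requests
    }


print(len(partitions))  # 24
print(partition_key_aggregates(partitions, [("min", "month"), ("max", "year")]))
```

This only works for aggregates whose result does not depend on how many rows each partition holds, which is consistent with the slide restricting the optimization to min(), max(), ndv() and distinct aggregates.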
  • 39. Competitive benchmark: TPC-DS
      Hardware: 21-node cluster, each node with:
      • 384GB memory, 2 sockets, 12 total cores, Intel Xeon CPU E5-2630L 0 at 2.00GHz
      • 12 disk drives at 932GB each (one for the OS, the rest for HDFS)
      Comparative set
      • Impala 2.5
          • RUNTIME_FILTER_MODE = 2;
      • Spark SQL 1.6
          • Thrift JDBC server used to avoid startup cost
          • --master yarn --deploy-mode client --driver-memory 24G --driver-cores 8 --executor-memory 24G --num-executors 240
      Workload
      • TPC-DS 15TB stored in Parquet file format (default 256MB block size)
      • Unmodified TPC-DS queries: 3, 7, 8, 19, 25, 27, 34, 42, 43, 46, 47, 52, 53, 55, 59, 61, 63, 68, 73, 79, 88, 89, 96, 98
      • Caveats:
          • Spark SQL failed to run:
              • Q25: Bad plan
              • Q47: StackOverflowError
              • Q89: StackOverflowError
  • 40. Competitive benchmark
      Query complexity varied from Q3:
      SELECT dt.d_year, item.i_brand_id brand_id, item.i_brand brand,
             Sum(ss_ext_sales_price) sum_agg
      FROM date_dim dt, store_sales, item
      WHERE dt.d_date_sk = store_sales.ss_sold_date_sk
        AND store_sales.ss_item_sk = item.i_item_sk
        AND item.i_manufact_id = 436
        AND dt.d_moy = 12
      GROUP BY dt.d_year, item.i_brand, item.i_brand_id
      ORDER BY dt.d_year, sum_agg DESC, brand_id
      LIMIT 100;
      to Q25 (fact-to-fact joins):
      SELECT i_item_id, i_item_desc, s_store_id, s_store_name,
             Stddev_samp(ss_net_profit), Stddev_samp(sr_net_loss),
             Stddev_samp(cs_net_profit) AS catalog_sales_profit
      FROM store_sales, store_returns, catalog_sales,
           date_dim d1, date_dim d2, date_dim d3, store, item
      WHERE d1.d_moy = 4
        AND d1.d_year = 2001
        AND d1.d_date_sk = ss_sold_date_sk
        AND i_item_sk = ss_item_sk
        AND s_store_sk = ss_store_sk
        AND ss_customer_sk = sr_customer_sk
        AND ss_item_sk = sr_item_sk
        AND ss_ticket_number = sr_ticket_number
        AND sr_returned_date_sk = d2.d_date_sk
        AND d2.d_moy BETWEEN 4 AND 10
        AND d2.d_year = 2001
        AND sr_customer_sk = cs_bill_customer_sk
        AND sr_item_sk = cs_item_sk
        AND cs_sold_date_sk = d3.d_date_sk
        AND d3.d_moy BETWEEN 4 AND 10
        AND d3.d_year = 2001
      GROUP BY i_item_id, i_item_desc, s_store_id, s_store_name
      ORDER BY i_item_id, i_item_desc, s_store_id, s_store_name
      LIMIT 100;
  • 41. Competitive benchmark
  • 42. Competitive benchmark
      Impala 2.5 is 11x faster (based on geomean)
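For readers unfamiliar with the metric, a geomean speedup is the geometric mean of the per-query time ratios. The sketch below shows the calculation on made-up timings; the numbers are hypothetical and are not the benchmark's measurements.

```python
import math


def geomean_speedup(baseline_secs, candidate_secs):
    """Geometric mean of per-query speedups (baseline time / candidate time)."""
    ratios = [b / c for b, c in zip(baseline_secs, candidate_secs)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical timings in seconds, not the benchmark's measurements.
other_engine = [100.0, 40.0, 90.0]
impala = [10.0, 10.0, 5.0]
print(round(geomean_speedup(other_engine, impala), 2))
```

Unlike an arithmetic mean of ratios, the geometric mean is not dominated by a single query with an extreme speedup, which is why it is a common way to summarize benchmark suites.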
  • 43. Performance Benchmark Takeaways
      • Impala unlocks BI usage directly on Hadoop
          • Meets BI low-latency and multi-user requirements
          • Advantage expands for single-user vs just 10 users
      • Spark SQL enables easier Spark application development
          • Enables mixed procedural Spark (Java/Scala) and SQL job development
      • Mid-term trends will further favor Impala's design approach for latency and concurrency
          • More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)
          • CPU efficiency will increase in importance
          • Native code enables easy optimizations for CPU instruction sets
  • 44. Impala and Cloud
      • Available today in Impala 2.5:
          • All the same Impala functionality, performance, and third-party integrations
          • Supported across our cloud partners
          • Deployment via Director
          • Modular architecture enables cloud's decoupled storage and elasticity future
      • Available soon in Impala 2.6:
          • Impala read/write to S3 in addition to local HDFS (IMPALA-1878)
          • Dynamically sized runtime filters
          • Parquet scanner optimization
          • Faster joins, aggregations, sorts and decimal arithmetic
          • Rack-aware scheduling
          • Faster code generation
  • 45. Impala Roadmap
      • 2H 2015
          • SQL Support & Usability: Nested structures; Kudu updates (beta)
          • Management & Security: Record reader service (beta); Finer-grained security (Sentry)
          • Integration: Isilon support; Python interface (Ibis)
          • Performance & Scale: Improved predictability under concurrency
      • 1H 2016
          • Performance & Scale: Continued scalability and concurrency; Initial perf/scale improvements
          • Management & Security: Improved admission control; Resource utilization and showback
          • SQL Support & Usability: Dynamic partitioning
      • 2016
          • Performance & Scale: >20x performance; Multi-threaded joins/aggregations; Continued scale work
          • Cloud: S3 read/write support
          • Management & Security: Improved YARN integration; Automated metadata
          • SQL Support & Usability: Data type improvements; Added SQL extensions
  • 46. Appendix
  • 48. Query start-up improvements
      • Before Impala 2.5:
          • Coordinator starts receiver fragments before sender fragments
      • Problem:
          • Serializes startup
          • Startup gets slower as scale and plan complexity grow
      • Impala 2.5:
          • Coordinator starts fragments in any order
          • Added wait logic for senders and receivers
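The idea of starting fragments in any order, with senders waiting for their receivers, can be illustrated with a small threading sketch. This is not Impala's implementation (Impala is C++ and distributed); the `Receiver`, `Sender` and `run_in_any_order` names are hypothetical and only model the wait logic described on the slide.

```python
import threading


class Receiver:
    """Hypothetical receiver fragment: signals once it can accept rows."""

    def __init__(self):
        self.ready = threading.Event()
        self.rows = []

    def start(self):
        self.ready.set()


class Sender:
    """Hypothetical sender fragment: waits for its receiver before sending."""

    def __init__(self, receiver, rows):
        self.receiver = receiver
        self.rows = rows

    def start(self, timeout=5.0):
        # Wait logic: block until the receiver has started, or give up.
        if not self.receiver.ready.wait(timeout):
            raise TimeoutError("receiver did not start in time")
        self.receiver.rows.extend(self.rows)


def run_in_any_order(fragments):
    """Start every fragment concurrently, in whatever order they are given."""
    threads = [threading.Thread(target=f.start) for f in fragments]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


receiver = Receiver()
sender = Sender(receiver, [1, 2, 3])
# The sender is deliberately started first; it waits until the receiver is up.
run_in_any_order([sender, receiver])
```

Because correctness no longer depends on start order, the coordinator in this model does not have to serialize startup.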
  • 49. Scheduling Small Queries
      • Query scheduler assigns scan ranges to workers (running impalad). First it selects an HDFS datanode to read from (diagram: replicas A, B, C).
      • Selection will always start with the same replica to make optimal use of OS buffer caches. This can lead to hot-spots for some workloads.
      • Improvement: Pick impalad at random.
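The difference between always starting with the same replica and picking one at random can be sketched as follows. This is a simplified model, not the Impala scheduler: `pick_replica`, its parameters and the host names are assumptions made for illustration.

```python
import random
from collections import Counter


def pick_replica(replicas, random_replica=False, rng=random):
    """Choose which replica host serves a scan range.

    Default: always the first replica, which keeps the OS buffer cache warm
    on one host. With random_replica=True: a uniform random pick, which
    spreads load over all replicas.
    """
    if not replicas:
        raise ValueError("scan range has no replicas")
    return rng.choice(replicas) if random_replica else replicas[0]


replicas = ["A", "B", "C"]   # hypothetical hosts holding the same block
rng = random.Random(0)       # seeded so the sketch is repeatable

# Many small queries over the same block.
fixed = Counter(pick_replica(replicas) for _ in range(3000))
spread = Counter(pick_replica(replicas, True, rng) for _ in range(3000))
```

In this model `fixed` puts every read on one host (the hot-spot case), while `spread` distributes reads roughly evenly, at the cost of caching the same data on several hosts.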
  • 50. New Query Option: random_replica
      • Disabled by default. To enable it:
          set random_replica = 1;
      • Also has a corresponding query hint:
          SELECT AVG(c1) FROM t /* +SCHEDULE_RANDOM_REPLICA */;
  • 51. Where It Can Help
      • Large number of small queries, each with few input tables.
      • High load on only one of multiple replicas of a table.
      • Queries are CPU bound.
      • Benefit: Distribute load more evenly over replicas.
      • Tradeoff: Distributing local reads across replicas will increase buffer cache usage.
    What’s Next
      • Add possibility to prefer remote reads.
      • Switch remote impalad selection from round-robin to load-based.
      • Add rack-awareness.
  • 52. Catalog Improvements
      • Incrementally update table metadata instead of force-reloading all table metadata during DDL/DML operations
      • Reload metadata of only ‘dirty’ partitions
      • Reuse descriptors of HDFS files to avoid loading file/block metadata for files that haven’t been modified
      • Significantly reduce the latency of DDL/DML operations that change a small fraction of table metadata (e.g. alter table foo partition (year = 2010) set location 'blah')
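A rough sketch of the incremental approach: only partitions marked dirty are visited, and a cached file descriptor is reused unless the file's modification time changed. The data layout, function name and paths below are hypothetical and are not the Impala catalog's actual structures.

```python
def reload_dirty_partitions(partitions, dirty, cache, load_file_metadata):
    """Reload metadata for dirty partitions only; return the number of loads.

    partitions: dict of partition name -> list of (path, mtime)
    dirty: set of partition names touched by the DDL/DML operation
    cache: dict of path -> (mtime, descriptor), updated in place
    load_file_metadata: callable(path) -> descriptor (the expensive step)
    """
    loads = 0
    for name in sorted(dirty):
        for path, mtime in partitions[name]:
            cached = cache.get(path)
            # Reuse the descriptor when the file has not been modified.
            if cached is None or cached[0] != mtime:
                cache[path] = (mtime, load_file_metadata(path))
                loads += 1
    return loads


loaded_paths = []


def fake_loader(path):
    """Stand-in for the expensive file/block metadata fetch."""
    loaded_paths.append(path)
    return "descriptor for " + path


partitions = {
    "year=2009": [("/t/2009/a", 1), ("/t/2009/b", 1)],
    "year=2010": [("/t/2010/a", 1), ("/t/2010/b", 2)],  # b was rewritten
}
cache = {path: (1, "old descriptor")
         for files in partitions.values() for path, _ in files}

loads = reload_dirty_partitions(partitions, {"year=2010"}, cache, fake_loader)
```

Here only one of four files is reloaded; a forced full reload would have fetched metadata for all of them.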
  • 53. Catalog Improvements - Results