Cost-Based Optimizer in Apache Spark 2.2
Ron Hu, Zhenhua Wang (Huawei Technologies, Inc.)
Sameer Agarwal, Wenchen Fan (Databricks Inc.)
Session 1 Topics
• Motivation
• Statistics Collection Framework
• Cost Based Optimizations
• TPC-DS Benchmark and Query Analysis
• Demo
How Spark Executes a Query?
[Diagram: SQL and DataFrame programs are parsed into a Logical Plan; the Optimizer, consulting the Catalog, turns it into a Physical Plan, which the Code Generator compiles down to RDDs. The Optimizer is the focus of today's talk.]
Catalyst Optimizer: An Overview

events = spark.read.json("/logs")
stats = events.join(users)
  .groupBy("loc", "status")
  .avg("duration")
errors = stats.where(stats.status == "ERR")

A query plan is an internal representation of a user's program. The optimizer applies a series of transformations that convert the initial query plan into an optimized plan.

[Diagram: the initial plan (SCAN logs and SCAN users feeding JOIN, then AGG, then FILTER) is transformed into an optimized plan.]
Catalyst Optimizer: An Overview

In Spark, the optimizer's goal is to minimize end-to-end query response time. Two key ideas:
- Prune unnecessary data as early as possible (e.g., filter pushdown, column pruning)
- Minimize per-operator cost (e.g., broadcast vs. shuffle, optimal join order)
Rule-based Optimizer in Spark 2.1
• Most of the Spark SQL optimizer's rules are heuristics.
– PushDownPredicate, ColumnPruning, ConstantFolding, …
• Does NOT consider the cost of each operator
• Does NOT consider selectivity when estimating join relation size
• Therefore:
– Join order is mostly decided by a table's position in the SQL query
– The physical join implementation is decided based on heuristics
An Example (TPC-DS q11 variant)

SELECT customer_id
FROM customer, store_sales, date_dim
WHERE c_customer_sk = ss_customer_sk AND
      ss_sold_date_sk = d_date_sk AND
      c_customer_sk > 1000

[Plan: store_sales is joined with the filtered customer table first; the result is then joined with date_dim.]
An Example (TPC-DS q11 variant)

[Plan with estimated cardinalities: the filter reduces customer from 12 million to 10 million rows; store_sales (3 billion rows) joined with the filtered customer produces a 2.5-billion-row intermediate result, which the final join with date_dim (0.1 million rows) reduces to 500 million rows.]
An Example (TPC-DS q11 variant)

[Plan with a better join order: store_sales (3 billion rows) is first joined with date_dim (0.1 million rows), producing 500 million rows, which are then joined with the filtered customer (10 million rows) for the same 500-million-row result. This plan moves 80% less data and runs 40% faster.]

How do we automatically optimize queries like these?
Cost Based Optimizer (CBO)
• Collect, infer and propagate table/column statistics on source/intermediate data
• Calculate the cost of each operator in terms of the number of output rows, size of output, etc.
• Based on the cost calculation, pick the best query execution plan
Rest of the Talk
• Statistics Collection Framework
– Table/Column Level Statistics Collected
– Cardinality Estimation (Filters, Joins, Aggregates etc.)
• Cost-based Optimizations
– Build Side Selection
– Multi-way Join Re-ordering
• TPC-DS Benchmarks
• Demo
Statistics Collection Framework
and Cost Based Optimizations
Ron Hu
Huawei Technologies
Step 1: Collect, infer and propagate table and column statistics on source and intermediate data
Table Statistics Collected
• Command to collect the statistics of a table:
– Ex: ANALYZE TABLE table-name COMPUTE STATISTICS
• It collects table-level statistics and saves them into the metastore:
– Number of rows
– Table size in bytes
Column Statistics Collected
• Command to collect column-level statistics of individual columns:
– Ex: ANALYZE TABLE table-name COMPUTE STATISTICS FOR COLUMNS column-name1, column-name2, …
• It collects column-level statistics and saves them into the metastore.
– Numeric/Date/Timestamp types: distinct count, max, min, null count, average length (fixed length), max length (fixed length)
– String/Binary types: distinct count, null count, average length, max length
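For concreteness, here is a minimal sketch (Scala, e.g. from spark-shell) of issuing both commands and inspecting the saved statistics; the table and column names are only assumed examples:

// Minimal sketch, assuming a metastore-backed table named customer exists.
spark.sql("ANALYZE TABLE customer COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS c_customer_sk, c_birth_country")
// Table-level statistics (row count, size in bytes) appear in the catalog output:
spark.sql("DESCRIBE EXTENDED customer").show(100, truncate = false)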
Filter Cardinality Estimation
• Between logical expressions: AND, OR, NOT
• Within each logical expression: =, <, <=, >, >=, in, etc.
• Currently supported types in expressions:
– For <, <=, >, >=, <=>: Integer, Double, Date, Timestamp, etc.
– For =, <=>: String, Integer, Double, Date, Timestamp, etc.
• Example: A <= B
– Based on the min/max/distinct-count/null-count values of A and B, decide the relationship between A and B. After evaluating this expression, set the new min/max/distinct count/null count.
– Assume all data is evenly distributed if no histogram information is available.
Filter Operator Example
• Column A (op) literal B
– (op) can be "=", "<", "<=", ">", ">=", "like"
– e.g., "l_orderkey = 3", "l_shipdate <= '1995-03-21'"
– The column's max/min/distinct count/null count should be updated
• Example: Column A < value B
– If B <= A.min: Filtering Factor = 0%, and A's statistics need to change
– If B > A.max: Filtering Factor = 100%, and there is no need to change A's statistics
– Without histograms, suppose data is evenly distributed:
  Filtering Factor = (B.value - A.min) / (A.max - A.min)
  A.min = no change
  A.max = B.value
  A.ndv = A.ndv * Filtering Factor
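As a worked illustration of the interpolation above (a hypothetical sketch; ColStats and the helper are not Spark's internal code):

// Even-distribution estimate for "Column A < literal B".
case class ColStats(min: Double, max: Double, ndv: Long)

def lessThanFactor(a: ColStats, b: Double): Double =
  if (b <= a.min) 0.0                      // no rows qualify
  else if (b > a.max) 1.0                  // all rows qualify; A's statistics are unchanged
  else (b - a.min) / (a.max - a.min)       // interpolate, assuming evenly distributed data

// Example: l_orderkey in [1, 100] with 100 distinct values, predicate l_orderkey < 41:
val f = lessThanFactor(ColStats(1, 100, 100), 41)   // (41 - 1) / (100 - 1) ≈ 0.40
// After the filter: min unchanged, max = 41, ndv ≈ 100 * 0.40 = 40.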
Filter Operator Example
• Column A (op) Column B
– (op) can be "<", "<=", ">", ">="
– We cannot assume the data is evenly distributed, so the empirical filtering factor is set to 1/3
• Example: Column A < Column B
[Diagram: depending on how the ranges of A and B relate, selectivity = 100% (A entirely below B), 0% (A entirely above B), or 33.3% (overlapping ranges).]
Join Cardinality Estimation
• Inner join: the number of rows of "A join B on A.k1 = B.k1" is estimated as:
  num(A ⋈ B) = num(A) * num(B) / max(distinct(A.k1), distinct(B.k1))
– where num(A) is the number of records in table A, and distinct is the number of distinct values of that column.
– The underlying assumption for this formula is that each value of the smaller domain is included in the larger domain.
• We similarly estimate cardinalities for left-outer, right-outer and full-outer joins.
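The formula is easy to check with numbers like those from the earlier q11 example; the helper below is illustrative, not Spark's internal API:

// Inner-join cardinality estimate: num(A) * num(B) / max(ndv(A.k1), ndv(B.k1)).
def joinCardinality(numA: BigInt, numB: BigInt, ndvA: Long, ndvB: Long): BigInt =
  numA * numB / math.max(ndvA, ndvB)

// store_sales (3 billion rows) join customer (12 million rows) on customer_sk,
// with ~12 million distinct keys on each side:
joinCardinality(3000000000L, 12000000L, 12000000L, 12000000L)   // = 3 billion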
Other Operator Estimation
• Project: does not change the row count
• Aggregate: considers the uniqueness of the group-by columns
• Limit, Sample, etc.
Step 2: Cost Estimation and Optimal Plan Selection
Build Side Selection
• For two-way hash joins, we need to choose one operand as the build side and the other as the probe side.
• Choose the lower-cost child as the build side of the hash join.
– Before: the build side was selected based on the original table sizes. ➔ BuildRight
– Now with CBO: the build side is selected based on the estimated cost of the various operators below the join. ➔ BuildLeft

[Diagram: a Join of Filter(t1.value = 200) over Scan t1 (5 billion records, 500 GB; 1 million records, 100 MB after the filter) with Scan t2 (100 million records, 20 GB).]
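A minimal sketch of this decision, assuming size estimates for each child are already available (the names mirror but do not reproduce Spark's internals):

sealed trait BuildSide
case object BuildLeft extends BuildSide
case object BuildRight extends BuildSide

// Pick the side whose estimated output is cheaper to hash into memory.
def chooseBuildSide(leftEstimatedBytes: BigInt, rightEstimatedBytes: BigInt): BuildSide =
  if (leftEstimatedBytes <= rightEstimatedBytes) BuildLeft else BuildRight

// In the example above, CBO estimates the filtered t1 at ~100 MB vs. 20 GB for t2:
chooseBuildSide(BigInt(100L * 1024 * 1024), BigInt(20L * 1024 * 1024 * 1024))   // BuildLeft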
Hash Join Implementation: Broadcast vs. Shuffle

Logical Plan
➢ Equi-join
• Inner Join
• LeftSemi/LeftAnti Join
• LeftOuter/RightOuter Join
➢ Theta-join

Physical Plan
➢ SortMergeJoinExec / BroadcastHashJoinExec / ShuffledHashJoinExec
➢ CartesianProductExec / BroadcastNestedLoopJoinExec

• Broadcast criterion: whether the join side's output size is small (default 10 MB).

[Diagram: a Join of Filter(t1.value = 100) over Scan t1 (5 billion records, 500 GB; only 1,000 records, 100 KB after the filter) with Scan t2 (100 million records, 20 GB); the join side can also be the output of an Aggregate or of another Join.]
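The 10 MB criterion corresponds to the existing SQL configuration spark.sql.autoBroadcastJoinThreshold; a sketch of adjusting it:

// Raise the broadcast threshold from the 10 MB default to 50 MB:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)
// Setting it to -1 disables broadcast hash joins entirely:
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)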
Multi-way Join Reorder
• Reorder the joins using a dynamic programming algorithm (see the sketch below):
1. First we put all items (basic joined nodes) into level 0.
2. Build all two-way joins at level 1 from plans at level 0 (single items).
3. Build all 3-way joins from plans at previous levels (two-way joins and single items).
4. Build all 4-way joins, etc., until we build all n-way joins and pick the best plan among them.
• When building m-way joins, only keep the best plan (optimal sub-solution) for the same set of m items.
– E.g., for 3-way joins of items {A, B, C}, we keep only the best plan among (A ⋈ B) ⋈ C, (A ⋈ C) ⋈ B and (B ⋈ C) ⋈ A.
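A self-contained sketch of this level-by-level dynamic program (the Plan type and cost function are illustrative; Spark's actual implementation also prunes cartesian-product candidates, as in the appendix example):

// Dynamic programming over item sets: only the cheapest plan per set survives.
case class Plan(items: Set[String], cost: Long)

def reorderJoins(items: Seq[String], joinCost: (Plan, Plan) => Long): Plan = {
  val best = collection.mutable.Map[Set[String], Plan]()
  items.foreach(i => best(Set(i)) = Plan(Set(i), 0L))     // level 0: single items
  for (level <- 2 to items.size) {                        // level k: k-way joins
    val existing = best.values.toSeq
    for {
      left  <- existing
      right <- existing
      if left.items.size + right.items.size == level      // combine plans from lower levels
      if (left.items & right.items).isEmpty               // disjoint item sets only
    } {
      val cost = left.cost + right.cost + joinCost(left, right)
      val key  = left.items ++ right.items
      if (best.get(key).forall(cost < _.cost)) best(key) = Plan(key, cost)
    }
  }
  best(items.toSet)                                       // the best n-way plan
}

// e.g., reorderJoins(Seq("A", "B", "C", "D"), (l, r) => ...) returns the cheapest 4-way plan.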
Reference: Selinger et al. Access Path Selection in a Relational Database Management System. In SIGMOD 1979.
Join Cost Formula
• The cost of a plan is the sum of the costs of all intermediate tables.
• Cost = weight * Cost_CPU + (1 - weight) * Cost_IO
– In Spark, we use: weight * cardinality + (1 - weight) * size
– weight is a tuning parameter configured via spark.sql.cbo.joinReorder.card.weight (0.7 by default)
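Plugging in the default weight, the formula reduces to simple arithmetic; a hypothetical sketch:

// Cost of one intermediate table, with the default weight of 0.7:
def planCost(cardinality: Double, sizeInBytes: Double, weight: Double = 0.7): Double =
  weight * cardinality + (1 - weight) * sizeInBytes

// 1 million rows, 100 MB: 0.7 * 1e6 + 0.3 * 104857600 ≈ 3.2e7
planCost(1e6, 100.0 * 1024 * 1024)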
TPC-DS Benchmarks and
Query Analysis
Zhenhua Wang
Huawei Technologies
Session 2 Topics
• Motivation
• Statistics Collection Framework
• Cost Based Optimizations
• TPC-DS Benchmark and Query Analysis
• Demo
Preliminary Performance Test
• Setup:
− TPC-DS at 1 TB (scale factor 1000)
− 4-node cluster (Huawei FusionServer RH2288: 40 cores, 384 GB memory)
− Apache Spark 2.2 RC (dated 5/12/2017)
• Statistics collection:
– A total of 24 tables and 425 columns
– It took 14 minutes to collect statistics for all tables and all columns.
– Fast, because all statistics are computed by integrating with Spark's built-in aggregate functions.
– It should take much less time if we collect statistics only for the columns used in predicates, joins, and group-by clauses.
TPC-DS Query Q11
WITH year_total AS (
SELECT
c_customer_id customer_id,
c_first_name customer_first_name,
c_last_name customer_last_name,
c_preferred_cust_flag customer_preferred_cust_flag,
c_birth_country customer_birth_country,
c_login customer_login,
c_email_address customer_email_address,
d_year dyear,
sum(ss_ext_list_price - ss_ext_discount_amt) year_total,
's' sale_type
FROM customer, store_sales, date_dim
WHERE c_customer_sk = ss_customer_sk
AND ss_sold_date_sk = d_date_sk
GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag,
c_birth_country, c_login, c_email_address, d_year
UNION ALL
SELECT
c_customer_id customer_id,
c_first_name customer_first_name,
c_last_name customer_last_name,
c_preferred_cust_flag customer_preferred_cust_flag,
c_birth_country customer_birth_country,
c_login customer_login,
c_email_address customer_email_address,
d_year dyear,
sum(ws_ext_list_price - ws_ext_discount_amt) year_total,
'w' sale_type
FROM customer, web_sales, date_dim
WHERE c_customer_sk = ws_bill_customer_sk AND ws_sold_date_sk = d_date_sk
GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag,
c_birth_country, c_login, c_email_address, d_year)
SELECT t_s_secyear.customer_preferred_cust_flag
FROM year_total t_s_firstyear
, year_total t_s_secyear
, year_total t_w_firstyear
, year_total t_w_secyear
WHERE t_s_secyear.customer_id = t_s_firstyear.customer_id
AND t_s_firstyear.customer_id = t_w_secyear.customer_id
AND t_s_firstyear.customer_id = t_w_firstyear.customer_id
AND t_s_firstyear.sale_type = 's'
AND t_w_firstyear.sale_type = 'w'
AND t_s_secyear.sale_type = 's'
AND t_w_secyear.sale_type = 'w'
AND t_s_firstyear.dyear = 2001
AND t_s_secyear.dyear = 2001 + 1
AND t_w_firstyear.dyear = 2001
AND t_w_secyear.dyear = 2001 + 1
AND t_s_firstyear.year_total > 0
AND t_w_firstyear.year_total > 0
AND CASE WHEN t_w_firstyear.year_total > 0
THEN t_w_secyear.year_total / t_w_firstyear.year_total
ELSE NULL END
> CASE WHEN t_s_firstyear.year_total > 0
THEN t_s_secyear.year_total / t_s_firstyear.year_total
ELSE NULL END
ORDER BY t_s_secyear.customer_preferred_cust_flag
LIMIT 100
Query Analysis – Q11 CBO OFF

[Plan: without CBO, Join #1 matches store_sales (2.9 billion rows) with customer (12 million rows) first, producing a large join result of 2.7 billion rows that the join with date_dim (73,049 rows) reduces to 534 million. Join #2 likewise matches web_sales (720 million rows) with customer first, producing 719 million rows before the date_dim join reduces it to 144 million. Joins #3 and #4 then combine the branches.]
Query Analysis – Q11 CBO ON

[Plan: with CBO, Join #1 matches store_sales (2.9 billion rows) with date_dim (73,049 rows) first, producing a small join result of 534 million rows that is then joined with customer (12 million rows). Join #2 likewise joins web_sales (720 million rows) with date_dim before customer, yielding 144 million rows. The intermediate results carry 80% less data, for a 1.4x speedup.]
TPC-DS Query Q72
SELECT
i_item_desc,
w_warehouse_name,
d1.d_week_seq,
count(CASE WHEN p_promo_sk IS NULL
THEN 1
ELSE 0 END) no_promo,
count(CASE WHEN p_promo_sk IS NOT NULL
THEN 1
ELSE 0 END) promo,
count(*) total_cnt
FROM catalog_sales
JOIN inventory ON (cs_item_sk = inv_item_sk)
JOIN warehouse ON (w_warehouse_sk = inv_warehouse_sk)
JOIN item ON (i_item_sk = cs_item_sk)
JOIN customer_demographics ON (cs_bill_cdemo_sk = cd_demo_sk)
JOIN household_demographics ON (cs_bill_hdemo_sk = hd_demo_sk)
JOIN date_dim d1 ON (cs_sold_date_sk = d1.d_date_sk)
JOIN date_dim d2 ON (inv_date_sk = d2.d_date_sk)
JOIN date_dim d3 ON (cs_ship_date_sk = d3.d_date_sk)
LEFT OUTER JOIN promotion ON (cs_promo_sk = p_promo_sk)
LEFT OUTER JOIN catalog_returns ON (cr_item_sk = cs_item_sk AND cr_order_number = cs_order_number)
WHERE d1.d_week_seq = d2.d_week_seq
AND inv_quantity_on_hand < cs_quantity
AND d3.d_date > (cast(d1.d_date AS DATE) + interval 5 days)
AND hd_buy_potential = '>10000'
AND d1.d_year = 1999
AND cd_marital_status = 'D'
GROUP BY i_item_desc, w_warehouse_name, d1.d_week_seq
ORDER BY total_cnt DESC, i_item_desc, w_warehouse_name, d_week_seq
LIMIT 100
Query Analysis – Q72 CBO OFF

[Plan: without CBO, Join #1 matches catalog_sales (1.4 billion rows) with inventory (783 million rows) first, and the resulting 223-billion-row intermediate result persists through the joins with warehouse (20 rows) and item (300,000 rows). The remaining joins against customer_demographics (1.9 million rows), household_demographics (7,200 rows) and the three date_dim instances (73,049 rows each) then shrink the result (intermediate sizes include 44.7 million, 9 million, 7.5 million and 1.6 million rows) to a final 8.7 million rows. Really large intermediate results.]
Query Analysis – Q72 CBO ON

[Plan: with CBO, the joins are reordered so that catalog_sales (1.4 billion rows) is first reduced by the dimension tables (date_dim d1/d2/d3: 73,049 rows each, with d1 filtered down to 2,555 rows; household_demographics: 7,200 rows; customer_demographics: 1.9 million rows; item: 300,000 rows; warehouse: 20 rows) before meeting inventory (783 million rows). Intermediate results (e.g., 238 million, 47.6 million and 1 billion rows) are 2 to 3 orders of magnitude smaller, for the same final 8.7 million rows. 8.1x speedup. Much smaller intermediate results!]
TPC-DS Query Performance

[Chart: per-query runtime in seconds (0 to 1750) for TPC-DS queries q1 through q98, with and without CBO.]
TPC-DS Query Speedup
• TPC-DS query speedup ratio with CBO versus without CBO:
– 16 queries show a speedup of more than 30%
– The maximum speedup is 8X
– The geometric mean of the speedups is 2.2X
TPC-DS Query 64
WITH cs_ui AS
(SELECT
cs_item_sk,
sum(cs_ext_list_price) AS sale,
sum(cr_refunded_cash + cr_reversed_charge + cr_store_credit) AS refund
FROM catalog_sales, catalog_returns
WHERE cs_item_sk = cr_item_sk AND cs_order_number = cr_order_number
GROUP BY cs_item_sk
HAVING sum(cs_ext_list_price) > 2 * sum(cr_refunded_cash + cr_reversed_charge + cr_store_credit)),
cross_sales AS
(SELECT
i_product_name product_name, i_item_sk item_sk, s_store_name store_name,
s_zip store_zip, ad1.ca_street_number b_street_number, ad1.ca_street_name b_streen_name,
ad1.ca_city b_city, ad1.ca_zip b_zip, ad2.ca_street_number c_street_number,
ad2.ca_street_name c_street_name, ad2.ca_city c_city, ad2.ca_zip c_zip,
d1.d_year AS syear, d2.d_year AS fsyear, d3.d_year s2year,
count(*) cnt, sum(ss_wholesale_cost) s1, sum(ss_list_price) s2, sum(ss_coupon_amt) s3
FROM store_sales, store_returns, cs_ui, date_dim d1, date_dim d2, date_dim d3,
store, customer, customer_demographics cd1, customer_demographics cd2,
promotion, household_demographics hd1, household_demographics hd2,
customer_address ad1, customer_address ad2, income_band ib1, income_band ib2, item
WHERE ss_store_sk = s_store_sk AND ss_sold_date_sk = d1.d_date_sk AND
ss_customer_sk = c_customer_sk AND ss_cdemo_sk = cd1.cd_demo_sk AND
ss_hdemo_sk = hd1.hd_demo_sk AND ss_addr_sk = ad1.ca_address_sk AND
ss_item_sk = i_item_sk AND ss_item_sk = sr_item_sk AND
ss_ticket_number = sr_ticket_number AND ss_item_sk = cs_ui.cs_item_sk AND
c_current_cdemo_sk = cd2.cd_demo_sk AND c_current_hdemo_sk = hd2.hd_demo_sk AND
c_current_addr_sk = ad2.ca_address_sk AND c_first_sales_date_sk = d2.d_date_sk AND
c_first_shipto_date_sk = d3.d_date_sk AND ss_promo_sk = p_promo_sk AND
hd1.hd_income_band_sk = ib1.ib_income_band_sk AND
hd2.hd_income_band_sk = ib2.ib_income_band_sk AND
cd1.cd_marital_status <> cd2.cd_marital_status AND
i_color IN ('purple', 'burlywood', 'indian', 'spring', 'floral', 'medium') AND
i_current_price BETWEEN 64 AND 64 + 10 AND i_current_price BETWEEN 64 + 1 AND 64 + 15
GROUP BY i_product_name, i_item_sk, s_store_name, s_zip, ad1.ca_street_number,
ad1.ca_street_name, ad1.ca_city, ad1.ca_zip, ad2.ca_street_number,
ad2.ca_street_name, ad2.ca_city, ad2.ca_zip, d1.d_year, d2.d_year, d3.d_year)
SELECT
cs1.product_name,
cs1.store_name,
cs1.store_zip,
cs1.b_street_number,
cs1.b_streen_name,
cs1.b_city,
cs1.b_zip,
cs1.c_street_number,
cs1.c_street_name,
cs1.c_city,
cs1.c_zip,
cs1.syear,
cs1.cnt,
cs1.s1,
cs1.s2,
cs1.s3,
cs2.s1,
cs2.s2,
cs2.s3,
cs2.syear,
cs2.cnt
FROM cross_sales cs1, cross_sales cs2
WHERE cs1.item_sk = cs2.item_sk AND
cs1.syear = 1999 AND
cs2.syear = 1999 + 1 AND
cs2.cnt <= cs1.cnt AND
cs1.store_name = cs2.store_name AND
cs1.store_zip = cs2.store_zip
ORDER BY cs1.product_name, cs1.store_name, cs2.cnt
Query Analysis – Q64 CBO ON

[Physical plan fragments: with CBO, the cs_ui subquery result is broadcast (BroadcastExchange) and joined via BroadcastHashJoin into the SortMergeJoin pipeline over store_sales and store_returns; exchanges are reused across the two fragments. This plan turned out 10% slower than the plan without CBO.]
Query Analysis – Q64 CBO OFF

[Physical plan fragments: without CBO, the cs_ui subquery is aggregated, exchanged and sorted, then combined with the store_sales/store_returns branch via SortMergeJoin; the exchanged results are reused across the two fragments.]
CBO Demo
Wenchen Fan
Databricks
Current Status, Credits and Future Work
Ron Hu
Huawei Technologies
Available in Apache Spark 2.2
• Configured via spark.sql.cbo.enabled
• 'Off By Default'. Why?
– Spark is used in production
– Many Spark users may already rely on "human intelligence" to write their queries in the best order
– We plan to enable this by default in Spark 2.3
• We encourage you to test CBO with Spark 2.2!
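A sketch of opting in, using the SQL configurations that ship with Spark 2.2 (statistics must be collected first via ANALYZE TABLE for CBO to take effect):

// Turn on cost-based optimization (off by default in 2.2):
spark.conf.set("spark.sql.cbo.enabled", true)
// Also enable cost-based join reordering:
spark.conf.set("spark.sql.cbo.joinReorder.enabled", true)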
Current Status
• SPARK-16026 is the umbrella JIRA.
– 32 sub-tasks have been resolved
– A big project spanning 8 months
– 10+ Spark contributors involved
– 7000+ lines of Scala code have been contributed
• A good framework that allows further integrations:
– Use statistics to derive whether a join attribute is unique
– Benefits star-schema detection and its integration into join reordering
Birth of Spark SQL CBO
• Prototype
– In 2015, Ron Hu, Fang Cao, and others from Huawei's research department prototyped the CBO concept on Spark 1.2.
– After a successful prototype, we shared the technology with Zhenhua Wang, Fei Wang, and others from Huawei's product development team.
• We delivered a talk at Spark Summit 2016: "Enhancing Spark SQL Optimizer with Reliable Statistics".
• The talk was well received by the community.
– https://issues.apache.org/jira/browse/SPARK-16026
Collaboration
• Good community support
– Developers: Zhenhua Wang, Ron Hu, Reynold Xin, Wenchen Fan, Xiao Li
– Reviewers: Wenchen, Herman, Reynold, Xiao, Liang-chi, Ioana, Nattavut, Hyukjin, Shuai, …
– Extensive discussion in JIRAs and PRs (tens to hundreds of conversations)
– All the comments made the development take longer, but improved code quality
• It was a pleasure working with the community.
Future Work: Cost Based Optimizer
• The current cost formula is coarse:
  Cost = cardinality * weight + size * (1 - weight)
• It cannot tell the cost difference between a sort-merge join and a hash join
– spark.sql.join.preferSortMergeJoin defaults to true.
• It underestimates (or ignores) shuffle cost.
• We will improve the cost formula in the next release.
Future Work: Statistics Collection Framework
• Advanced statistics: e.g., histograms, sketches
• Hint mechanism
• Partition-level statistics
• Speed up statistics collection by sampling data for large tables
Conclusion
• Motivation
• Statistics Collection Framework
– Table/Column Level Statistics Collected
– Cardinality Estimation (Filters, Joins, Aggregates etc.)
• Cost-based Optimizations
– Build Side Selection
– Multi-way Join Re-ordering
• TPC-DS Benchmarks
• Demo
Thank You.
ron.hu@huawei.com
wangzhenhua@huawei.com
sameer@databricks.com
wenchen@databricks.com
Sameer’s Office Hours @ 4:30pm Today
Wenchen’s Office Hours @ 3pm Tomorrow
Multi-way Join Reorder – Example
• Given A ⋈ B ⋈ C ⋈ D with join conditions A.k1 = B.k1, B.k2 = C.k2 and C.k3 = D.k3:
level 0: p({A}), p({B}), p({C}), p({D})
level 1: p({A, B}), p({A, C}), p({A, D}), p({B, C}), p({B, D}), p({C, D})
level 2: p({A, B, C}), p({A, B, D}), p({A, C, D}), p({B, C, D})
level 3: p({A, B, C, D}) -- final output plan
Multi-way Join Reorder – Example
• Pruning strategy: exclude cartesian-product candidates. This significantly reduces the search space.
level 0: p({A}), p({B}), p({C}), p({D})
level 1: p({A, B}), p({B, C}), p({C, D}) (p({A, C}), p({A, D}) and p({B, D}) are pruned as cartesian products)
level 2: p({A, B, C}), p({B, C, D}) (p({A, B, D}) and p({A, C, D}) are pruned)
level 3: p({A, B, C, D}) -- final output plan
New Commands in Apache Spark 2.2
• CBO commands
– Collect table-level statistics:
  ANALYZE TABLE table_name COMPUTE STATISTICS
– Collect column-level statistics:
  ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column_name1, column_name2, …
– Display statistics in the optimized logical plan:
  EXPLAIN COST SELECT cc_call_center_sk, cc_call_center_id FROM call_center;
  …
  == Optimized Logical Plan ==
  Project [cc_call_center_sk#75, cc_call_center_id#76], Statistics(sizeInBytes=1680.0 B, rowCount=42, hints=none)
  +- Relation[…31 fields] parquet, Statistics(sizeInBytes=22.5 KB, rowCount=42, hints=none)
  …
  • 8. An Example (TPC-DS q11 variant) 8 SCAN: store_sales SCAN: customer SCAN: date_dim FILTER JOIN JOIN SELECT customer_id FROM customer, store_sales, date_dim WHERE c_customer_sk = ss_customer_sk AND ss_sold_date_sk = d_date_sk AND c_customer_sk > 1000
  • 9. An Example (TPC-DS q11 variant) 9 SCAN: store_sales SCAN: customer SCAN: date_dim FILTER JOIN JOIN 3 billion 12 million 2.5 billion 10 million 500 million 0.1 million
  • 10. An Example (TPC-DS q11 variant) 10 SCAN: store_sales SCAN: customer SCAN: date_dim FILTERJOIN JOIN 3 billion 12 million 2.5 billion 500 million 10 million 500 million 0.1 million 40% faster 80% less data
  • 11. An Example (TPC-DS q11 variant) 11 SCAN: store_sales SCAN: customer SCAN: date_dim FILTERJOIN JOIN 3 billion 12 million 2.5 billion 500 million 10 million 500 million 0.1 million How do we automatically optimize queries like these?
  • 12. Cost Based Optimizer (CBO) • Collect, infer and propagate table/column statistics on source/intermediate data • Calculate the cost for each operator in terms of number of output rows, size of output, etc. • Based on the cost calculation, pick the most optimal query execution plan 12
  • 13. Rest of the Talk • Statistics Collection Framework – Table/Column Level Statistics Collected – Cardinality Estimation (Filters, Joins, Aggregates etc.) • Cost-based Optimizations – Build Side Selection – Multi-way Join Re-ordering • TPC-DS Benchmarks • Demo 13
  • 14. Statistics Collection Framework and Cost Based Optimizations Ron Hu Huawei Technologies
  • 15. Step 1: Collect, infer and propagate table and column statistics on source and intermediate data
  • 16. Table Statistics Collected • Command to collect statistics of a table. – Ex: ANALYZE TABLE table-name COMPUTE STATISTICS • It collects table level statistics and saves into metastore. – Number of rows – Table size in bytes 16
  • 17. Column Statistics Collected • Command to collect column-level statistics of individual columns. – Ex: ANALYZE TABLE table-name COMPUTE STATISTICS FOR COLUMNS column-name1, column-name2, …. • It collects column-level statistics and saves them into the metastore. String/Binary type: ✓ Distinct count ✓ Null count ✓ Average length ✓ Max length. Numeric/Date/Timestamp type: ✓ Distinct count ✓ Max ✓ Min ✓ Null count ✓ Average length (fixed length) ✓ Max length (fixed length) 17
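A minimal sketch of running these commands from a Spark 2.2 session (the table and column names below are placeholders, not from the deck):

    // Table-level statistics: row count and size in bytes, saved to the metastore.
    spark.sql("ANALYZE TABLE customer COMPUTE STATISTICS")

    // Column-level statistics for columns used in predicates, joins, and group-bys.
    spark.sql("ANALYZE TABLE customer COMPUTE STATISTICS FOR COLUMNS c_customer_sk, c_birth_country")

    // The collected table-level statistics show up in the table metadata.
    spark.sql("DESCRIBE EXTENDED customer").show(100, truncate = false)

Slide 55 shows EXPLAIN COST, which surfaces the same statistics on the optimized logical plan.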
  • 18. Filter Cardinality Estimation • Between logical expressions: AND, OR, NOT • Within each logical expression: =, <, <=, >, >=, in, etc. • Currently supported types in expressions: – For <, <=, >, >=, <=>: Integer, Double, Date, Timestamp, etc. – For =, <=>: String, Integer, Double, Date, Timestamp, etc. • Example: A <= B – Based on the min/max/distinct count/null count values of A and B, decide the relationship between A and B; after evaluating this expression, we set the new min/max/distinct count/null count. – Assume all the data is evenly distributed if there is no histogram information. 18
  • 19. Filter Operator Example • Column A (op) literal B – (op) can be “=”, “<”, “<=”, “>”, “>=”, “like” – e.g., predicates such as “l_orderkey = 3”, “l_shipdate <= '1995-03-21'” – The column’s max/min/distinct count/null count should be updated – Example: Column A < value B (literal B placed on a number line against [A.min, A.max]): if B is below A.min, Filtering Factor = 0% and A’s statistics need to change (empty result); if B is above A.max, Filtering Factor = 100% and there is no need to change A’s statistics; without histograms, suppose data is evenly distributed: Filtering Factor = (B.value – A.min) / (A.max – A.min), A.min = no change, A.max = B.value, A.ndv = A.ndv * Filtering Factor 19
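As a rough illustration of the interpolation above, a hypothetical helper (not Spark's internal API) might look like this:

    // Selectivity of 'A < literal B', assuming evenly distributed data and no
    // histogram, following the three cases on the slide.
    def lessThanSelectivity(aMin: Double, aMax: Double, b: Double): Double =
      if (b <= aMin) 0.0                 // empty result: A's statistics must change
      else if (b > aMax) 1.0             // all rows qualify: statistics unchanged
      else (b - aMin) / (aMax - aMin)    // interpolate within [A.min, A.max]

For example, A in [0, 100] with predicate A < 25 gives selectivity 0.25, so A.max becomes 25 and A.ndv is scaled by 0.25.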
  • 20. Filter Operator Example • Column A (op) Column B – (op) can be “<”, “<=”, “>”, “>=” – We cannot suppose the data is evenly distributed, so the empirical filtering factor is set to 1/3 – Example: Column A < Column B [Diagram of the ranges of A and B: A entirely below B gives selectivity = 100%; A entirely above B gives selectivity = 0%; overlapping ranges give selectivity = 33.3%] 20
  • 21. Join Cardinality Estimation • Inner-Join: The number of rows of “A join B on A.k1 = B.k1” is estimated as: num(A ⋈ B) = num(A) * num(B) / max(distinct(A.k1), distinct(B.k1)), – where num(A) is the number of records in table A, and distinct is the number of distinct values of that column. – The underlying assumption for this formula is that each value of the smaller domain is included in the larger domain. • We similarly estimate cardinalities for Left-Outer Join, Right-Outer Join and Full-Outer Join 21
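The formula is easy to state as a standalone function; this is an illustrative sketch only (BigInt avoids overflow on billion-row inputs):

    // num(A ⋈ B) = num(A) * num(B) / max(distinct(A.k1), distinct(B.k1))
    def innerJoinRows(numA: BigInt, numB: BigInt, ndvA: BigInt, ndvB: BigInt): BigInt =
      numA * numB / (ndvA max ndvB)

    // e.g. 500 million fact rows joined to 10 million dimension rows on a key
    // with ~10 million distinct values on both sides: ~500 million output rows.
    innerJoinRows(BigInt(500000000L), BigInt(10000000L), BigInt(10000000L), BigInt(10000000L))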
  • 22. Other Operator Estimation • Project: does not change row count • Aggregate: consider uniqueness of group-by columns • Limit, Sample, etc. 22
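The group-by estimate can be sketched the same way; a hedged illustration mirroring the uniqueness idea above (not Spark's exact code): the output can exceed neither the product of the group-by columns' distinct counts nor the input row count:

    // Estimated output rows of GROUP BY c1, ..., cn.
    def aggregateRows(inputRows: BigInt, groupByNdvs: Seq[BigInt]): BigInt =
      if (groupByNdvs.isEmpty) BigInt(1)             // global aggregate: one row
      else groupByNdvs.product.min(inputRows)        // capped by the input size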
  • 23. Step 2: Cost Estimation and Optimal Plan Selection
  • 24. Build Side Selection • For two-way hash joins, we need to choose one operand as build side and the other as probe side. • Choose the lower-cost child as the build side of the hash join. – Before: build side was selected based on original table sizes. ➔ BuildRight – Now with CBO: build side is selected based on estimated cost of the various operators before the join. ➔ BuildLeft [Diagram: Join of (Filter t1.value = 200 over Scan t1: 5 billion records, 500 GB, filtered down to 1 million records, 100 MB) with (Scan t2: 100 million records, 20 GB)] 24
  • 25. Hash Join Implementation: Broadcast vs. Shuffle • Logical Plan: ➢ Equi-join • Inner Join • LeftSemi/LeftAnti Join • LeftOuter/RightOuter Join ➢ Theta-join • Physical Plan: ➢ SortMergeJoinExec / BroadcastHashJoinExec / ShuffledHashJoinExec ➢ CartesianProductExec / BroadcastNestedLoopJoinExec • Broadcast criterion: whether the join side’s output size is small (default 10 MB). [Diagram: Join of (Filter t1.value = 100 over Scan t1: 5 billion records, 500 GB, filtered down to only 1000 records, 100 KB) with (Scan t2: 100 million records, 20 GB); similar trees with an Aggregate or another Join below the join side] 25
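The broadcast criterion corresponds to a user-tunable threshold, and a broadcast can also be requested explicitly through the DataFrame API; the DataFrames below are placeholders:

    import org.apache.spark.sql.functions.broadcast

    // A join side whose (estimated) output size is below this threshold is
    // planned as a broadcast hash join; 10 MB is the default.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)

    // Force a broadcast of the smaller side regardless of the estimate.
    val joined = salesDf.join(broadcast(dateDimDf), Seq("d_date_sk"))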
  • 26. Multi-way Join Reorder • Reorder the joins using a dynamic programming algorithm. 1. First we put all items (basic joined nodes) into level 0. 2. Build all two-way joins at level 1 from plans at level 0 (single items). 3. Build all 3-way joins from plans at previous levels (two-way joins and single items). 4. Build all 4-way joins, etc., until we build all n-way joins and pick the best plan among them. • When building m-way joins, only keep the best plan (optimal sub-solution) for the same set of m items. – E.g., for 3-way joins of items {A, B, C}, we keep only the best plan among: (A J B) J C, (A J C) J B and (B J C) J A 26
  • 27. Multi-way Join Reorder 27 Selinger et al. Access Path Selection in a Relational Database Management System. In SIGMOD 1979
  • 28. Join Cost Formula • The cost of a plan is the sum of costs of all intermediate tables. • Cost = weight * Cost_cpu + Cost_io * (1 - weight) – In Spark, we use weight * cardinality + size * (1 – weight) – weight is a tuning parameter configured via spark.sql.cbo.joinReorder.card.weight (0.7 as default) 28
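A sketch of that cost function together with the knob that controls it (the helper is illustrative; the config key is real):

    // cost(plan) = weight * cardinality + (1 - weight) * size, summed over all
    // intermediate results produced by the join order under evaluation.
    def planCost(rowCount: BigInt, sizeInBytes: BigInt, weight: Double): Double =
      weight * rowCount.toDouble + (1 - weight) * sizeInBytes.toDouble

    // Tune the cardinality-vs-size weight used by join reordering (default 0.7).
    spark.conf.set("spark.sql.cbo.joinReorder.card.weight", "0.7")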
  • 29. TPC-DS Benchmarks and Query Analysis Zhenhua Wang Huawei Technologies
  • 30. Session 2 Topics • Motivation • Statistics Collection Framework • Cost Based Optimizations • TPC-DS Benchmark and Query Analysis • Demo 30
  • 31. Preliminary Performance Test • Setup: − TPC-DS size at 1 TB (scale factor 1000) − 4-node cluster (Huawei FusionServer RH2288: 40 cores, 384 GB mem) − Apache Spark 2.2 RC (dated 5/12/2017) • Statistics collection – A total of 24 tables and 425 columns ➢ Takes 14 minutes to collect statistics for all tables and all columns. – Fast because all statistics are computed by integrating with Spark’s built-in aggregate functions. – Should take much less time if we collect statistics only for columns used in predicates, joins, and group-bys. 31
  • 32. TPC-DS Query Q11 32 WITH year_total AS ( SELECT c_customer_id customer_id, c_first_name customer_first_name, c_last_name customer_last_name, c_preferred_cust_flag customer_preferred_cust_flag, c_birth_country customer_birth_country, c_login customer_login, c_email_address customer_email_address, d_year dyear, sum(ss_ext_list_price - ss_ext_discount_amt) year_total, 's' sale_type FROM customer, store_sales, date_dim WHERE c_customer_sk = ss_customer_sk AND ss_sold_date_sk = d_date_sk GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_country, c_login, c_email_address, d_year UNION ALL SELECT c_customer_id customer_id, c_first_name customer_first_name, c_last_name customer_last_name, c_preferred_cust_flag customer_preferred_cust_flag, c_birth_country customer_birth_country, c_login customer_login, c_email_address customer_email_address, d_year dyear, sum(ws_ext_list_price - ws_ext_discount_amt) year_total, 'w' sale_type FROM customer, web_sales, date_dim WHERE c_customer_sk = ws_bill_customer_sk AND ws_sold_date_sk = d_date_sk GROUP BY c_customer_id, c_first_name, c_last_name, c_preferred_cust_flag, c_birth_country, c_login, c_email_address, d_year) SELECT t_s_secyear.customer_preferred_cust_flag FROM year_total t_s_firstyear, year_total t_s_secyear, year_total t_w_firstyear, year_total t_w_secyear WHERE t_s_secyear.customer_id = t_s_firstyear.customer_id AND t_s_firstyear.customer_id = t_w_secyear.customer_id AND t_s_firstyear.customer_id = t_w_firstyear.customer_id AND t_s_firstyear.sale_type = 's' AND t_w_firstyear.sale_type = 'w' AND t_s_secyear.sale_type = 's' AND t_w_secyear.sale_type = 'w' AND t_s_firstyear.dyear = 2001 AND t_s_secyear.dyear = 2001 + 1 AND t_w_firstyear.dyear = 2001 AND t_w_secyear.dyear = 2001 + 1 AND t_s_firstyear.year_total > 0 AND t_w_firstyear.year_total > 0 AND CASE WHEN t_w_firstyear.year_total > 0 THEN t_w_secyear.year_total / t_w_firstyear.year_total ELSE NULL END > CASE WHEN t_s_firstyear.year_total > 0 THEN t_s_secyear.year_total / t_s_firstyear.year_total ELSE NULL END ORDER BY t_s_secyear.customer_preferred_cust_flag LIMIT 100
  • 33. Query Analysis – Q11 CBO OFF Large join result Join #1 store_sales customer date_dim 2.9 billion … … Join #2 web_sales customer date_dim Join #4 … Join #3 12 million 2.7 billion 73,049 73,049 12 million 720 million 534 million 719 million 144 million 33
  • 34. Query Analysis – Q11 CBO ON Small join result Join #1 store_sales date_dim customer 2.9 billion … … Join #2 web_sales date_dim customer Join #4 … Join #3 73,049 534 million 12 million 12 million 73,049 720 million 534 million 144 million 144 million 1.4x Speedup 80% less 34
  • 35. TPC-DS Query Q72 35 SELECT i_item_desc, w_warehouse_name, d1.d_week_seq, count(CASE WHEN p_promo_sk IS NULL THEN 1 ELSE 0 END) no_promo, count(CASE WHEN p_promo_sk IS NOT NULL THEN 1 ELSE 0 END) promo, count(*) total_cnt FROM catalog_sales JOIN inventory ON (cs_item_sk = inv_item_sk) JOIN warehouse ON (w_warehouse_sk = inv_warehouse_sk) JOIN item ON (i_item_sk = cs_item_sk) JOIN customer_demographics ON (cs_bill_cdemo_sk = cd_demo_sk) JOIN household_demographics ON (cs_bill_hdemo_sk = hd_demo_sk) JOIN date_dim d1 ON (cs_sold_date_sk = d1.d_date_sk) JOIN date_dim d2 ON (inv_date_sk = d2.d_date_sk) JOIN date_dim d3 ON (cs_ship_date_sk = d3.d_date_sk) LEFT OUTER JOIN promotion ON (cs_promo_sk = p_promo_sk) LEFT OUTER JOIN catalog_returns ON (cr_item_sk = cs_item_sk AND cr_order_number = cs_order_number) WHERE d1.d_week_seq = d2.d_week_seq AND inv_quantity_on_hand < cs_quantity AND d3.d_date > (cast(d1.d_date AS DATE) + interval 5 days) AND hd_buy_potential = '>10000' AND d1.d_year = 1999 AND hd_buy_potential = '>10000' AND cd_marital_status = 'D' AND d1.d_year = 1999 GROUP BY i_item_desc, w_warehouse_name, d1.d_week_seq ORDER BY total_cnt DESC, i_item_desc, w_warehouse_name, d_week_seq LIMIT 100
  • 36. Query Analysis – Q72 CBO OFF Join #1 Join #2 Join #3 Join #4 Join #5 Join #6 Join #7 Join #8 catalog_sales inventory warehouse item customer_demographics date_dim d1 date_dim d2 date_dim d3 household_demographics 1.4 billion 783 million 223 billion 20 300,000 223 billion 223 billion 44.7 million 7.5 million 1.6 million 9 million 1.9 million 7,200 73,049 73,049 73,049 8.7 million Really large intermediate results 36
  • 37. Query Analysis – Q72 CBO ON Join #1 Join #2 Join #3 Join #4 Join #8 Join #5 Join #6 Join #7 catalog_sales inventory warehouse item customer_demographics date_dim d1 date_dim d2 date_dim d3 household_demographics 1.4 billion 783 million 20 300,000 1.9 million 7,200 73,049 73,049 73,049 238 million 238 million 47.6 million 47.6 million 2,555 1 billion 1 billion 8.7 million Much smaller intermediate results! 2-3 orders of magnitude less! 8.1x Speedup 37
  • 38. TPC-DS Query Performance 38 [Bar chart: runtime in seconds (y-axis 0–1750) for TPC-DS queries q1–q98, comparing runs without CBO and with CBO]
  • 39. TPC-DS Query Speedup • TPC-DS query speedup ratio with CBO versus without CBO • 16 queries show speedup > 30% • The max speedup is 8X. • The geo-mean of speedup is 2.2X. 39
  • 40. TPC-DS Query 64 40 WITH cs_ui AS (SELECT cs_item_sk, sum(cs_ext_list_price) AS sale, sum(cr_refunded_cash + cr_reversed_charge + cr_store_credit) AS refund FROM catalog_sales, catalog_returns WHERE cs_item_sk = cr_item_sk AND cs_order_number = cr_order_number GROUP BY cs_item_sk HAVING sum(cs_ext_list_price) > 2 * sum(cr_refunded_cash + cr_reversed_charge + cr_store_credit)), cross_sales AS (SELECT i_product_name product_name, i_item_sk item_sk, s_store_name store_name, s_zip store_zip, ad1.ca_street_number b_street_number, ad1.ca_street_name b_streen_name, ad1.ca_city b_city, ad1.ca_zip b_zip, ad2.ca_street_number c_street_number, ad2.ca_street_name c_street_name, ad2.ca_city c_city, ad2.ca_zip c_zip, d1.d_year AS syear, d2.d_year AS fsyear, d3.d_year s2year, count(*) cnt, sum(ss_wholesale_cost) s1, sum(ss_list_price) s2, sum(ss_coupon_amt) s3 FROM store_sales, store_returns, cs_ui, date_dim d1, date_dim d2, date_dim d3, store, customer, customer_demographics cd1, customer_demographics cd2, promotion, household_demographics hd1, household_demographics hd2, customer_address ad1, customer_address ad2, income_band ib1, income_band ib2, item WHERE ss_store_sk = s_store_sk AND ss_sold_date_sk = d1.d_date_sk AND ss_customer_sk = c_customer_sk AND ss_cdemo_sk = cd1.cd_demo_sk AND ss_hdemo_sk = hd1.hd_demo_sk AND ss_addr_sk = ad1.ca_address_sk AND ss_item_sk = i_item_sk AND ss_item_sk = sr_item_sk AND ss_ticket_number = sr_ticket_number AND ss_item_sk = cs_ui.cs_item_sk AND c_current_cdemo_sk = cd2.cd_demo_sk AND c_current_hdemo_sk = hd2.hd_demo_sk AND c_current_addr_sk = ad2.ca_address_sk AND c_first_sales_date_sk = d2.d_date_sk AND c_first_shipto_date_sk = d3.d_date_sk AND ss_promo_sk = p_promo_sk AND hd1.hd_income_band_sk = ib1.ib_income_band_sk AND hd2.hd_income_band_sk = ib2.ib_income_band_sk AND cd1.cd_marital_status <> cd2.cd_marital_status AND i_color IN ('purple', 'burlywood', 'indian', 'spring', 'floral', 'medium') AND i_current_price BETWEEN 64 AND 64 + 10 AND i_current_price BETWEEN 64 + 1 AND 64 + 15 GROUP BY i_product_name, i_item_sk, s_store_name, s_zip, ad1.ca_street_number, ad1.ca_street_name, ad1.ca_city, ad1.ca_zip, ad2.ca_street_number, ad2.ca_street_name, ad2.ca_city, ad2.ca_zip, d1.d_year, d2.d_year, d3.d_year) SELECT cs1.product_name, cs1.store_name, cs1.store_zip, cs1.b_street_number, cs1.b_streen_name, cs1.b_city, cs1.b_zip, cs1.c_street_number, cs1.c_street_name, cs1.c_city, cs1.c_zip, cs1.syear, cs1.cnt, cs1.s1, cs1.s2, cs1.s3, cs2.s1, cs2.s2, cs2.s3, cs2.syear, cs2.cnt FROM cross_sales cs1, cross_sales cs2 WHERE cs1.item_sk = cs2.item_sk AND cs1.syear = 1999 AND cs2.syear = 1999 + 1 AND cs2.cnt <= cs1.cnt AND cs1.store_name = cs2.store_name AND cs1.store_zip = cs2.store_zip ORDER BY cs1.product_name, cs1.store_name, cs2.cnt
  • 41. Query Analysis – Q64 CBO ON 41 10% slower [Physical plan diagram, Fragment 1 and Fragment 2: FileScan (store_sales) and FileScan (store_returns) feed Exchange/Sort into a SortMergeJoin; the (cs_ui) subquery is broadcast via BroadcastExchange and consumed by BroadcastHashJoins, with several ReusedExchanges, topped by an Aggregate]
  • 42. Query Analysis – Q64 CBO OFF 42 [Physical plan diagram, Fragment 1 and Fragment 2: FileScan (store_sales) and FileScan (store_returns) feed Exchange/Sort into SortMergeJoins; the (cs_ui) subquery is built with Aggregate/Sort and joined via SortMergeJoins, with ReusedExchanges in place of recomputation]
  • 44. Current Status, Credits and Future Work Ron Hu Huawei Technologies
  • 45. Available in Apache Spark 2.2 • Configured via spark.sql.cbo.enabled • ‘Off By Default’. Why? – Spark is used in production – Many Spark users may already rely on “human intelligence” to write queries in the best order – Plan on enabling this by default in Spark 2.3 • We encourage you to test CBO with Spark 2.2! 45
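For reference, opting in looks like this in a Spark 2.2 session (both flags exist and default to false):

    // Enable cost-based optimization for the current session.
    spark.conf.set("spark.sql.cbo.enabled", true)

    // Cost-based join reordering is gated by its own flag.
    spark.conf.set("spark.sql.cbo.joinReorder.enabled", true)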
  • 46. Current Status • SPARK-16026 is the umbrella JIRA. – 32 sub-tasks have been resolved – A big project spanning 8 months – 10+ Spark contributors involved – 7000+ lines of Scala code have been contributed • Good framework to allow integrations – Use statistics to derive whether a join attribute is unique – Benefits star schema detection and its integration into join reorder 46
  • 47. Birth of Spark SQL CBO • Prototype – In 2015, Ron Hu, Fang Cao, and others from Huawei’s research department prototyped the CBO concept on Spark 1.2. – After a successful prototype, we shared the technology with Zhenhua Wang, Fei Wang, and others from Huawei’s product development team. • We delivered a talk at Spark Summit 2016: – “Enhancing Spark SQL Optimizer with Reliable Statistics”. • The talk was well received by the community. – https://issues.apache.org/jira/browse/SPARK-16026 47
  • 48. Collaboration • Good community support – Developers: Zhenhua Wang, Ron Hu, Reynold Xin, Wenchen Fan, Xiao Li – Reviewers: Wenchen, Herman, Reynold, Xiao, Liang-chi, Ioana, Nattavut, Hyukjin, Shuai, ….. – Extensive discussion in JIRAs and PRs (tens to hundreds of conversations). – All the comments made the development time longer, but improved code quality. • It was a pleasure working with the community. 48
  • 49. Future Work: Cost Based Optimizer • The current cost formula is coarse. Cost = cardinality * weight + size * (1 - weight) • Cannot tell the cost difference between sort-merge join and hash join – spark.sql.join.preferSortMergeJoin defaults to true. • Underestimates (or ignores) shuffle cost. • Will improve the cost formula in the next release. 49
  • 50. Future Work: Statistics Collection Framework • Advanced statistics: e.g. histograms, sketches. • Hint mechanism. • Partition level statistics. • Speed up statistics collection by sampling data for large tables. 50
  • 51. Conclusion • Motivation • Statistics Collection Framework – Table/Column Level Statistics Collected – Cardinality Estimation (Filters, Joins, Aggregates etc.) • Cost-based Optimizations – Build Side Selection – Multi-way Join Re-ordering • TPC-DS Benchmarks • Demo 51
  • 53. Multi-way Join Reorder – Example • Given A J B J C J D with join conditions A.k1 = B.k1 and B.k2 = C.k2 and C.k3 = D.k3 level 0: p({A}), p({B}), p({C}), p({D}) level 1: p({A, B}), p({A, C}), p({A, D}), p({B, C}), p({B, D}), p({C, D}) level 2: p({A, B, C}), p({A, B, D}), p({A, C, D}), p({B, C, D}) level 3: p({A, B, C, D}) -- final output plan 53
  • 54. Multi-way Join Reorder – Example • Pruning strategy: exclude cartesian-product candidates. This significantly reduces the search space. level 0: p({A}), p({B}), p({C}), p({D}) level 1: p({A, B}), p({B, C}), p({C, D}); cartesian-product candidates p({A, C}), p({A, D}), p({B, D}) are pruned level 2: p({A, B, C}), p({B, C, D}); cartesian-product candidates p({A, B, D}), p({A, C, D}) are pruned level 3: p({A, B, C, D}) -- final output plan 54
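A toy Scala sketch of this level-wise dynamic program with cartesian-product pruning; the Plan, joinable, and joinCost shapes are hypothetical simplifications (Spark's actual implementation is the CostBasedJoinReorder optimizer rule):

    case class Plan(items: Set[String], cost: Double)

    // leaves:   level-0 plans, one per base relation.
    // joinable: true iff some join condition connects the two item sets
    //           (false means a cartesian product, which we prune).
    // joinCost: cost of the intermediate result of joining the two plans.
    def reorder(leaves: Seq[Plan],
                joinable: (Set[String], Set[String]) => Boolean,
                joinCost: (Plan, Plan) => Double): Plan = {
      val best = scala.collection.mutable.Map[Set[String], Plan]()
      leaves.foreach(p => best(p.items) = p)              // level 0: single items
      for (m <- 2 to leaves.size) {                       // build all m-way joins
        val smaller = best.values.toList                  // plans from lower levels
        val candidates = for {
          p1 <- smaller
          p2 <- smaller
          if p1.items.size + p2.items.size == m           // combine two sub-plans
          if (p1.items & p2.items).isEmpty                // over disjoint item sets
          if joinable(p1.items, p2.items)                 // prune cartesian products
        } yield Plan(p1.items ++ p2.items, p1.cost + p2.cost + joinCost(p1, p2))
        // keep only the optimal sub-solution for each set of m items
        for ((items, plans) <- candidates.groupBy(_.items))
          best(items) = plans.minBy(_.cost)
      }
      best(leaves.flatMap(_.items).toSet)                 // the final n-way plan
    }

The sketch assumes a connected join graph and ignores build/probe-side choice; Spark's rule applies only to inner joins and leaves the plan unchanged when reordering does not apply.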
  • 55. New Commands in Apache Spark 2.2 • CBO commands – Collect table-level statistics • ANALYZE TABLE table_name COMPUTE STATISTICS – Collect column-level statistics • ANALYZE TABLE table_name COMPUTE STATISTICS FOR COLUMNS column_name1, column_name2, … – Display statistics in the optimized logical plan > EXPLAIN COST > SELECT cc_call_center_sk, cc_call_center_id FROM call_center; … == Optimized Logical Plan == Project [cc_call_center_sk#75, cc_call_center_id#76], Statistics(sizeInBytes=1680.0 B, rowCount=42, hints=none) +- Relation[…31 fields] parquet, Statistics(sizeInBytes=22.5 KB, rowCount=42, hints=none) … 55