2. Database performance
● Performance is a product of many
factors
● One of them is Query optimizer
● It produces query plans
– A “good” query plan only
reads rows that contribute to
the query result
– A “bad” query plan means
unnecessary work is done
3. Do my queries use bad query plans?
● Queries take a long time
● Some are just inherently hard to
compute
● Some look good but turn out bad
due to factors that were not
accounted for
4. Query plan cost depends on data statistics
select *
from
lineitem, orders
where
o_orderkey=l_orderkey and
o_orderdate between '1990-01-01' and '1998-12-06' and
l_extendedprice > 1000000
● orders->lineitem
vs
lineitem->orders
● Depends on
condition selectivity
5. Another choice optimizer has to make
select *
from
orders
where
o_orderstatus='F'
order by
order_date
limit 10
● Use index(order_date)
– Stop as soon as we find 10 matches
● Find rows with o_orderstatus='F'
– Sort by o_orderdate picking first 10
● Again, it depends on condition
selectivity.
6. Data statistics in MariaDB
● Table: #rows in the table
● Index
– cardinality: AVG(#lineitems per order)
– “range estimates” - #rows(t.key BETWEEN const1 and
const2)
● Non-index column? Histogram
7. Histogram
● Partition the value space into buckets
– Store bucket bounds and #values in the bucket
– Imprecise
– Very compact
8. Summary so far
● Good database performance requires good query plans
● To pick those, optimizer needs statistics about the data
– Condition selectivity is important
● Certain kinds of statistics are always available
– Indexes
– For non-indexed columns, histograms may be needed.
10. Will my queries benefit?
● Very complex question
● No definite answer
● Suggestions
– ANALYZE for statements, r_filtered.
– Slow query log
11. ANALYZE for statements and r_filtered
● filtered – % of rows left after applying condition (expectation)
– r_filtered - ... - the reality
● r_filtered << filtered – the optimizer didn’t know the condition is selective
– Happens on a non-first table? We are filtering out late!
●
Add histogram on the column (Check the cond in FORMAT=JSON)
analyze select *
from lineitem, orders
where o_orderkey=l_orderkey and
o_orderdate between '1990-01-01' and '1998-12-06' and
l_extendedprice > 1000000
+--+-----------+--------+----+-------------+-------+-------+-----------------+-------+-------+--------+----------+-----------+
|id|select_type|table |type|possible_keys|key |key_len|ref |rows |r_rows |filtered|r_filtered|Extra |
+--+-----------+--------+----+-------------+-------+-------+-----------------+-------+-------+--------+----------+-----------+
|1 |SIMPLE |orders |ALL |PRIMARY,i_...|NULL |NULL |NULL |1504278|1500000| 50.00 | 100.00 |Using where|
|1 |SIMPLE |lineitem|ref |PRIMARY,i_...|PRIMARY|4 |orders.o_orderkey|2 |4.00 | 100.00 | 0.00 |Using where|
+--+-----------+--------+----+-------------+-------+-------+-----------------+-------+-------+--------+----------+-----------+
12. # Query_time: 1.961549 Lock_time: 0.011164 Rows_sent: 1 Rows_examined: 11745000
# Rows_affected: 0 Bytes_sent: 73
# Full_scan: Yes Full_join: No Tmp_table: No Tmp_table_on_disk: No
# Filesort: No Filesort_on_disk: No Merge_passes: 0 Priority_queue: No
#
# explain: id select_type table type possible_keys key key_len ref rows r_rows
filtered r_filtered Extra
# explain: 1 SIMPLE inventory ALL NULL NULL NULL NULL 11837024
11745000.00 100.00 0.00 Using where
#
SET timestamp=1551155484;
select count(inv_date_sk) from inventory where inv_quantity_on_hand>10000;
Slow Query Log
slow-query-log
long-query-time=...
log-slow-verbosity=query_plan,explain
my.cnf
hostname-slow.log
● Rows_examined >> Rows_sent? Grouping,or a poor query plan
● log_slow_query=explain will shows ANALYZE output
14. Histograms in MariaDB
● Available since MariaDB 10.0 (Yes)
● Used by advanced users
● Not enabled by default
● Have limitations, not user-friendly
● MariaDB 10.4
– Fixes some of the limitations
– Makes histograms easier to use
16. Configuration for collecting histograms
histogram_size=0
histogram_type=SINGLE_PREC_HB
histogram_size=254
histogram_type=DOUBLE_PREC_HB
● MariaDB before 10.4: change the default histogram size
● MariaDB 10.4 : enable automatic sampling
histogram_size=254
histogram_type=DOUBLE_PREC_HB
analyze_sample_percentage=100
analyze_sample_percentage=0
17. Histograms are [still] not collected by default
● “ANALYZE TABLE” will not collect a histogram
MariaDB> analyze table t1;
+---------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+---------+---------+----------+----------+
| test.t1 | analyze | status | OK |
+---------+---------+----------+----------+
● This will collect only
– Total #rows in table
– Index cardinalities (#different values)
18. ANALYZE ... PERSISTENT collects histograms
– Collect statistics for everything:
analyze table t1 persistent
for columns (col1,...) indexes (idx1,...);
+---------+---------+----------+-----------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------+---------+----------+-----------------------------------------+
| test.t1 | analyze | status | Engine-independent statistics collected |
| test.t1 | analyze | status | OK |
+---------+---------+----------+-----------------------------------------+
analyze table t1 persistent for all;
19. Can make histogram collection automatic
set use_stat_tables='preferably';
analyze table t1;
+---------+---------+----------+-----------------------------------------+
| Table | Op | Msg_type | Msg_text |
+---------+---------+----------+-----------------------------------------+
| test.t1 | analyze | status | Engine-independent statistics collected |
| test.t1 | analyze | status | OK |
+---------+---------+----------+-----------------------------------------+
● Beware: this may be *much* slower than ANALYZE TABLE
you’re used to
● Great for migrations
20. Histogram collection performance
● MariaDB 10.0: uses all data in the table to build histogram
– Precise, but expensive
– Particularly so for VARCHARs
● A test on a real table:
– Real table, 740M rows, 90GB
– CHECKSUM TABLE: 5 min
– ANALYZE TABLE ... PERSISTENT FOR ALL – 30 min
21. MariaDB 10.4: Bernoulli sampling
● Default: analyze_sample_percentage=100
– Uses the entire table, slow
● Suggested: analyze_sample_percentage=0
– “Roll the dice” sampling, size picked automatically
analyze table t1 persistent for columns (...) indexes();
analyze table t1 persistent for all;
– full table and secondary index scans
– does a full table scan
22. Further plans: genuine sampling
● Work on avoiding full table scans is in progress
● Will allow to make ANALYZE TABLE collect all histograms
24. Make the optimizer use histograms
@@use_stat_tables=NEVER
@@optimizer_use_condition_selectivity=1
@@use_stat_tables=PREFERABLY // also affects ANALYZE!
@@optimizer_use_condition_selectivity=4
● MariaDB before 10.4: does not use histograms
● MariaDB 10.4 : uses histograms if they are collected
@@use_stat_tables=PREFERABLY_FOR_QUERIES
@@optimizer_use_condition_selectivity=4
– remember to re-collect!
25. Conclusions: how to start using histograms
● MariaDB before 10.4
analyze_sample_percentage=0
use_stat_tables=PREFERABLY # Changes optimizer
optimizer_use_condition_selectivity=4 # behavior
● MariaDB 10.4
● Both: ANALYZE TABLE ... PERSISTENT FOR ...
histogram_size=254 # No risk
histogram_type=DOUBLE_PREC_HB #
27. A stored procedure to analyze every table
CREATE PROCEDURE analyze_persistent_for_all(db_name VARCHAR(64))
BEGIN
DECLARE done INT DEFAULT FALSE;
DECLARE x VARCHAR(64);
DECLARE cur1 CURSOR FOR
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE' AND TABLE_SCHEMA=db_name;
DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;
OPEN cur1;
read_loop: LOOP
FETCH cur1 INTO x;
IF done THEN
LEAVE read_loop;
END IF;
SET @sql = CONCAT('analyze table ', x, ' persistent for all');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
END LOOP;
CLOSE cur1;
END|
28. Should I ANALYZE ... PERSISTENT every table?
● New application
– Worth giving it a try
– Provision for periodic ANALYZE
– Column correlations?
● Existing application
– Performance fixes on a case-by-case basis.
30. TPC-DS benchmark
● scale=1
● The same dataset
– without histograms: ~20 min
– after ‘call analyze_persistent_for_all(‘tpcds’) from two slides
prior: 5 min.
32. A customer case with ORDER BY ... LIMIT
● table/column names replaced
CREATE TABLE cars (
type varchar(10),
company varchar(20),
model varchar(20),
quantity int,
KEY quantity (quantity),
KEY type (type)
);
select * from cars
where
type='electric' and
company='audi'
order by
quantity
limit 3;
● table/column names replaced
● quantity matches the ORDER BY, but need to match condition
● type is a Restrictive index
33. A customer case with ORDER BY ... LIMIT
● Uses ORDER-BY compatible index by default
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cars
type: index
possible_keys: type
key: quantity
key_len: 5
ref: const
rows: 994266
r_rows: 700706.00
filtered: 0.20
r_filtered: 0.00
Extra: Using where
1 row in set (2.098 sec)
select * from cars
where
type='electric' and
company='audi'
order by
quantity
limit 3;
34. A customer case with ORDER BY ... LIMIT
● Providing the optimizer with histogram
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: cars
type: ref
possible_keys: type
key: type
key_len: 13
ref: const
rows: 2022
r_rows: 3.00
filtered: 100.00
r_filtered: 100.00
Extra: Using index condition; Using where; Using filesort
1 row in set (0.010 sec)
analyze table cars persistent for all;
select * from cars
where
type='electric' and
company='audi'
order by
quantity
limit 3;
39. Problem with correlated conditions
● Possible selectivities
– MIN(1/n, 1/m)
– (1/n) * (1/m)
– 0
select ...
from order_items
where shipdate='2015-12-15' AND item_name='christmas light'
'swimsuit'
40. Problem with correlated conditions
● PostgreSQL: Multi-variate statistics
– Detects functional dependencies, col1=F(col2)
– Only used for equality predicates
– Also #DISTINCT(a,b)
● MariaDB: MDEV-11107: Use table check constraints in optimizer
– In development
select ...
from order_items
where shipdate='2015-12-15' AND item_name='christmas light'
'swimsuit'