SlideShare a Scribd company logo
1 of 27
Download to read offline
SQL:
Query optimization
in practice
@jsuchal (@rubyslava #21)
Query optimization




  "The fastest query is
   the one you never
         make"
Query planner
● Problem: Find the most efficient way to
  execute this query
  SELECT s.* FROM students s
     JOIN enrolments e ON s.id = e.student_id
     JOIN courses c ON e.course_id = c.id
     WHERE s.gender = 'F' AND e.year = '2012'
         AND c.name = 'Programming in Ruby'

● Use
  ○   Indexes
  ○   Statistics about data
  ○   Access methods
  ○   Join types
  ○   Aggregation / Sort / Pipelining
B+ tree index




● leaf nodes
  ○ pointers to table data (heap)
  ○ doubly linked list (fast in-order traversal)
Access methods
● Sequential scan (a.k.a. full table scan)
   ○ fetch data directly from table (heap)


● Index scan
   ○ fetch data from table in index order


● Bitmap index scan + Bitmap heap scan
   ○ fetch data from table in table order
   ○ requires additional memory for intermediate bitmap
   ○
● Index-Only scan (a.k.a. covering index)
   ○ fetch data from index only
Join types
● Nested loop
  ○ for each row find matching rows using join condition
  ○ basically a nested "for loop"


● Hash join
  ○ create intermediate hash table from smaller table
  ○ loop through larger table and probe against hash
  ○ recheck & emit matching rows


● Sort-Merge join
  ○ sort both tables on join attribute (if necessary)
  ○ merge using interleaved linear scan
Sort, Aggregate, Pipelining

● Sort

● Aggregate
  ○ Plain Aggregate
  ○ Sort Aggregate
  ○ Hash Aggregate


● Pipelining
  ○ stop execution after X rows are emitted
  ○ SELECT * FROM students LIMIT 10
Reading EXPLAIN (in PostgreSQL)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM documents;

                              est.     est.             est.
                                                                     est. row
   Plan                     startup   total           emitted
                                                Node rows            size (B)
                actual       cost     cost
               execution                      execution
Seq Scan on documents (cost=0.00..4688.45
                         I/O                  rows=183445
                                                count           width=95)
                       heavy?
(actual time=0.008..19.241 rows=183445 loops=1)
  Buffers: shared hit=2854
Total runtime: 28.376 ms
Reading EXPLAIN (2)
EXPLAIN SELECT * FROM documents WHERE id < 72284

Index Scan using documents_pkey on documents (cost=0.00..19.74
rows=76 width=95) (actual time=0.008..0.076 rows=81 loops=1)
  Index Cond: (id < 72284)
  Buffers: shared hit=75
Total runtime: 0.102 ms
                                          !!!


EXPLAIN SELECT * FROM documents WHERE id > 72284

Seq Scan on documents (cost=0.00..5147.06 rows=183369
width=95) (actual time=0.017..28.905 rows=183363 loops=1)
  Filter: (id > 72284)
  Buffers: shared hit=2854
Total runtime: 37.752 ms
Reading EXPLAIN (3)
SELECT * FROM comments ORDER BY created_at DESC LIMIT 10




Limit (cost=5.37..5.40 rows=10 width=245) (actual time=0.160..0.166 rows=10 loops=1)
  Buffers: shared hit=5
  -> Sort (cost=5.37..5.56 rows=75 width=245) (actual time=0.157..0.160 rows=10
loops=1)
         Sort Key: created_at
         Sort Method: top-N heapsort Memory: 28kB"
         Buffers: shared hit=5
         -> Seq Scan on comments (cost=0.00..3.75 rows=75 width=245) (actual
            time=0.009..0.043 rows=91 loops=1)
               Buffers: shared hit=3
Total runtime: 0.229 ms
Reading EXPLAIN (4)
SELECT * FROM comments ORDER BY id DESC LIMIT 10

Limit  (cost=0.00..2.11 rows=10 width=245)
       (actual time=0.018..0.029 rows=10 loops=1)
  Buffers: shared hit=2
  -> Index Scan Backward using comments_pkey on comments
      (cost=0.00..15.86 rows=75 width=245)
      (actual time=0.016..0.027 rows=10 loops=1)
        Buffers: shared hit=2
Total runtime: 0.081 ms
Reading EXPLAIN (5)
SELECT * FROM comments c
   JOIN users u ON c.user_id = u.id
   ORDER BY c.id DESC
   LIMIT 10




Limit (cost=94.35..2375.31 rows=10 width=946)
  -> Nested Loop (cost=94.35..7393.42 rows=32 width=946)
       -> Index Scan Backward using comments_pkey on comments c
           (cost=0.00..15.86 rows=75 width=245)
       -> Bitmap Heap Scan on users u (cost=94.35..98.36 rows=1 width=701)
             Recheck Cond: (id = c.user_id)
             -> Bitmap Index Scan on users_pkey
                   (cost=0.00..94.34 rows=1 width=0)
                   Index Cond: (id = c.user_id)
Reading EXPLAIN
http://explain.depesz.com
Tricks
Optimizing ORDER BY
SELECT * FROM documents ORDER BY created_at DESC

Sort  (cost=20726.12..21184.73 rows=183445 width=95)
      (actual time=199.680..231.948 rows=183445 loops=1)
  Sort Key: created_at
  Sort Method: external sort Disk: 19448kB
  -> Seq Scan on documents
      (cost=0.00..4688.45 rows=183445 width=95)
      (actual time=0.016..18.877 rows=183445 loops=1)
Total runtime: 245.865 ms

Index Scan using index_on_created_at on documents
  (cost=0.00..7894.56 rows=183445 width=95)
  (actual time=0.162..57.519 rows=183445 loops=1)
Total runtime: 67.679 ms
Optimizing GROUP BY
SELECT attachment_id, COUNT(*) FROM pages GROUP BY attachment_id
  LIMIT 10

Limit (cost=0.00..65.18 rows=10 width=4)
  -> GroupAggregate (cost=0.00..267621.87 rows=41056 width=4)
       -> Index Scan using idx_attachment_id on pages
           (cost=0.00..261638.63 rows=1114537 width=4)


SELECT attachment_id, COUNT(*) FROM pages GROUP BY attachment_id

HashAggregate (cost=169181.05..169591.61 rows=41056 width=4)
  -> Seq Scan on pages
      (cost=0.00..163608.37 rows=1114537 width=4)
Multicolumn / composite indexes
SELECT * FROM pages WHERE number = 1 ORDER BY created_at LIMIT 100

Limit (cost=173572.08..173572.33 rows=100 width=1047)
  -> Sort (cost=173572.08..174041.45 rows=187749 width=1047)
       Sort Key: created_at
       -> Seq Scan on pages
           (cost=0.00..166396.45 rows=187749 width=1047)
             Filter: (number = 1)


CREATE INDEX idx_number_created_at ON pages (number, created_at);

Limit (cost=0.00..247.86 rows=100 width=1047)
  -> Index Scan Backward using idx_number_created_at on pages
       Index Cond: (number = 1)
Multicolumn / composite indexes
● Order is important!
  ○ equality conditions first


● Usable for GROUP BY with WHERE
  conditions

● Mixed ordering
  ○ ORDER BY a ASC, b DESC
  ○ CREATE INDEX idx ON t (a ASC , b DESC);
Index-only scan / covering index
● MySQL, PostgreSQL 9.2+, ...

● Very useful for fast lookup
   ○ on m:n join tables - order is important!
Partial index

CREATE INDEX idx ON pages (number, created_at);

CREATE INDEX idx ON pages (number, created_at)
      WHERE number = 1;



● Index size
   ○ 34MB vs 5MB
Function / expression index

SELECT 1 FROM users
   WHERE lower(username) = 'johno';

CREATE INDEX idx_username
   ON users (lower(username));
"Hacks"
UNNEST
Problem: Show last 6 checkins for N given users


SELECT *,
    UNNEST(ARRAY(
        SELECT movie_id FROM checkins WHERE user_id = u.id
        ORDER BY created_at DESC LIMIT 6)
    ) as movie_id

   FROM users AS u
     WHERE u.id IN (2079077510, 1355625182, ...)
Exploiting knowledge about data
● Correlated columns
  ○ Example: Events stream
  ○ ORDER BY created_at vs. ORDER BY id
  ○ Exploit primary key


● Fetch top candidates, filter and compute
  ○ Fast query for candidates
  ○ expensive operations in outer query
     SELECT * FROM (
         SELECT * FROM table ORDER BY c LIMIT 100
     ) AS t GROUP BY another column
Now what?
● Denormalization = redundancy
  ○ counters
  ○ duplicate columns
  ○ materialized views
● Specialized structures
  ○ nested set
● Specialized indexes
  ○ GIN, GiST, SP-GiST, Spatial
● Specialized engines
  ○ column stores, full text, graphs
Common caveats
● most selective first myth

● unique vs non-unique indexes
   ○ additional information planner can use


● Unnecessary subqueries
   ○ Good query planner might restructure it to join


● LEFT JOIN vs. JOIN
   ○ LEFT JOIN constraints joining order possibilities
   ○ not semantically equivalent!
Resources


● http://use-the-index-luke.com/

● http://www.postgresql.org/docs/9.
  2/static/indexes.html

● http://www.postgresql.org/docs/9.
  2/static/using-explain.html

More Related Content

What's hot (20)

Query optimization
Query optimizationQuery optimization
Query optimization
 
Query processing-and-optimization
Query processing-and-optimizationQuery processing-and-optimization
Query processing-and-optimization
 
Query compiler
Query compilerQuery compiler
Query compiler
 
Query processing and Query Optimization
Query processing and Query OptimizationQuery processing and Query Optimization
Query processing and Query Optimization
 
28890 lecture10
28890 lecture1028890 lecture10
28890 lecture10
 
Data structure
Data structureData structure
Data structure
 
Query processing System
Query processing SystemQuery processing System
Query processing System
 
ch13
ch13ch13
ch13
 
Query Optimization
Query OptimizationQuery Optimization
Query Optimization
 
Join operation
Join operationJoin operation
Join operation
 
Query-porcessing-& Query optimization
Query-porcessing-& Query optimizationQuery-porcessing-& Query optimization
Query-porcessing-& Query optimization
 
Ch13
Ch13Ch13
Ch13
 
Dynamic memory allocation
Dynamic memory allocationDynamic memory allocation
Dynamic memory allocation
 
Distributed Query Processing
Distributed Query ProcessingDistributed Query Processing
Distributed Query Processing
 
ADS Introduction
ADS IntroductionADS Introduction
ADS Introduction
 
Query processing and optimization (updated)
Query processing and optimization (updated)Query processing and optimization (updated)
Query processing and optimization (updated)
 
Heuristic approch monika sanghani
Heuristic approch  monika sanghaniHeuristic approch  monika sanghani
Heuristic approch monika sanghani
 
Data Structures 7
Data Structures 7Data Structures 7
Data Structures 7
 
Data Structures 6
Data Structures 6Data Structures 6
Data Structures 6
 
Data structures
Data structuresData structures
Data structures
 

Viewers also liked

Garelic: Google Analytics as App Performance monitoring
Garelic: Google Analytics as App Performance monitoringGarelic: Google Analytics as App Performance monitoring
Garelic: Google Analytics as App Performance monitoringJano Suchal
 
SQL Transactions - What they are good for and how they work
SQL Transactions - What they are good for and how they workSQL Transactions - What they are good for and how they work
SQL Transactions - What they are good for and how they workMarkus Winand
 
Postgres in Production - Best Practices 2014
Postgres in Production - Best Practices 2014Postgres in Production - Best Practices 2014
Postgres in Production - Best Practices 2014EDB
 
Modern SQL in Open Source and Commercial Databases
Modern SQL in Open Source and Commercial DatabasesModern SQL in Open Source and Commercial Databases
Modern SQL in Open Source and Commercial DatabasesMarkus Winand
 
Sql queries with answers
Sql queries with answersSql queries with answers
Sql queries with answersvijaybusu
 
GPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServiceGPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServicePivotalOpenSourceHub
 
An introduction to column store indexes and batch mode
An introduction to column store indexes and batch modeAn introduction to column store indexes and batch mode
An introduction to column store indexes and batch modeChris Adkin
 
How the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksHow the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksEDB
 
King hug uk
King hug ukKing hug uk
King hug ukhuguk
 
PL/pgSQL - An Introduction on Using Imperative Programming in PostgreSQL
PL/pgSQL - An Introduction on Using Imperative Programming in PostgreSQLPL/pgSQL - An Introduction on Using Imperative Programming in PostgreSQL
PL/pgSQL - An Introduction on Using Imperative Programming in PostgreSQLReactive.IO
 
Consultas en MS SQL Server 2012
Consultas en MS SQL Server 2012Consultas en MS SQL Server 2012
Consultas en MS SQL Server 2012CarlosFloresRoman
 
MS SQL SERVER: Creating Views
MS SQL SERVER: Creating ViewsMS SQL SERVER: Creating Views
MS SQL SERVER: Creating Viewssqlserver content
 
Using grouping sets in sql server 2008 tech republic
Using grouping sets in sql server 2008   tech republicUsing grouping sets in sql server 2008   tech republic
Using grouping sets in sql server 2008 tech republicKaing Menglieng
 
Postgres performance for humans
Postgres performance for humansPostgres performance for humans
Postgres performance for humansCraig Kerstiens
 
Sql tuning guideline
Sql tuning guidelineSql tuning guideline
Sql tuning guidelineSidney Chen
 

Viewers also liked (20)

Garelic: Google Analytics as App Performance monitoring
Garelic: Google Analytics as App Performance monitoringGarelic: Google Analytics as App Performance monitoring
Garelic: Google Analytics as App Performance monitoring
 
SQL Transactions - What they are good for and how they work
SQL Transactions - What they are good for and how they workSQL Transactions - What they are good for and how they work
SQL Transactions - What they are good for and how they work
 
Postgres in Production - Best Practices 2014
Postgres in Production - Best Practices 2014Postgres in Production - Best Practices 2014
Postgres in Production - Best Practices 2014
 
Modern SQL in Open Source and Commercial Databases
Modern SQL in Open Source and Commercial DatabasesModern SQL in Open Source and Commercial Databases
Modern SQL in Open Source and Commercial Databases
 
Sql queries with answers
Sql queries with answersSql queries with answers
Sql queries with answers
 
5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance
 
GPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a ServiceGPORCA: Query Optimization as a Service
GPORCA: Query Optimization as a Service
 
An introduction to column store indexes and batch mode
An introduction to column store indexes and batch modeAn introduction to column store indexes and batch mode
An introduction to column store indexes and batch mode
 
How the Postgres Query Optimizer Works
How the Postgres Query Optimizer WorksHow the Postgres Query Optimizer Works
How the Postgres Query Optimizer Works
 
King hug uk
King hug ukKing hug uk
King hug uk
 
PL/pgSQL - An Introduction on Using Imperative Programming in PostgreSQL
PL/pgSQL - An Introduction on Using Imperative Programming in PostgreSQLPL/pgSQL - An Introduction on Using Imperative Programming in PostgreSQL
PL/pgSQL - An Introduction on Using Imperative Programming in PostgreSQL
 
PostgreSQL: Advanced indexing
PostgreSQL: Advanced indexingPostgreSQL: Advanced indexing
PostgreSQL: Advanced indexing
 
Consultas en MS SQL Server 2012
Consultas en MS SQL Server 2012Consultas en MS SQL Server 2012
Consultas en MS SQL Server 2012
 
MS SQL SERVER: Creating Views
MS SQL SERVER: Creating ViewsMS SQL SERVER: Creating Views
MS SQL SERVER: Creating Views
 
Sql db optimization
Sql db optimizationSql db optimization
Sql db optimization
 
Using grouping sets in sql server 2008 tech republic
Using grouping sets in sql server 2008   tech republicUsing grouping sets in sql server 2008   tech republic
Using grouping sets in sql server 2008 tech republic
 
Joins SQL Server
Joins SQL ServerJoins SQL Server
Joins SQL Server
 
Oracle 10g Form
Oracle 10g Form Oracle 10g Form
Oracle 10g Form
 
Postgres performance for humans
Postgres performance for humansPostgres performance for humans
Postgres performance for humans
 
Sql tuning guideline
Sql tuning guidelineSql tuning guideline
Sql tuning guideline
 

Similar to SQL: Query optimization in practice

Dive into EXPLAIN - PostgreSql
Dive into EXPLAIN  - PostgreSqlDive into EXPLAIN  - PostgreSql
Dive into EXPLAIN - PostgreSqlDmytro Shylovskyi
 
PostgreSQL query planner's internals
PostgreSQL query planner's internalsPostgreSQL query planner's internals
PostgreSQL query planner's internalsAlexey Ermakov
 
Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...
Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...
Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...Ontico
 
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...Qbeast
 
On Beyond (PostgreSQL) Data Types
On Beyond (PostgreSQL) Data TypesOn Beyond (PostgreSQL) Data Types
On Beyond (PostgreSQL) Data TypesJonathan Katz
 
Row Pattern Matching in Oracle Database 12c
Row Pattern Matching in Oracle Database 12cRow Pattern Matching in Oracle Database 12c
Row Pattern Matching in Oracle Database 12cStew Ashton
 
Three steps to untangle data traffic jams
Three steps to untangle data traffic jamsThree steps to untangle data traffic jams
Three steps to untangle data traffic jamsBol.com Techlab
 
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAINA Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAINEDB
 
MySQL 8.0 EXPLAIN ANALYZE
MySQL 8.0 EXPLAIN ANALYZEMySQL 8.0 EXPLAIN ANALYZE
MySQL 8.0 EXPLAIN ANALYZENorvald Ryeng
 
Postgres can do THAT?
Postgres can do THAT?Postgres can do THAT?
Postgres can do THAT?alexbrasetvik
 
Row patternmatching12ctech14
Row patternmatching12ctech14Row patternmatching12ctech14
Row patternmatching12ctech14stewashton
 
Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Satalia
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQLSatoshi Nagayasu
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Spark Summit
 
Golang in TiDB (GopherChina 2017)
Golang in TiDB  (GopherChina 2017)Golang in TiDB  (GopherChina 2017)
Golang in TiDB (GopherChina 2017)PingCAP
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Gruter
 

Similar to SQL: Query optimization in practice (20)

Dive into EXPLAIN - PostgreSql
Dive into EXPLAIN  - PostgreSqlDive into EXPLAIN  - PostgreSql
Dive into EXPLAIN - PostgreSql
 
PostgreSQL query planner's internals
PostgreSQL query planner's internalsPostgreSQL query planner's internals
PostgreSQL query planner's internals
 
Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...
Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...
Полнотекстовый поиск в PostgreSQL за миллисекунды (Олег Бартунов, Александр К...
 
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
Extending Spark for Qbeast's SQL Data Source​ with Paola Pardo and Cesare Cug...
 
On Beyond (PostgreSQL) Data Types
On Beyond (PostgreSQL) Data TypesOn Beyond (PostgreSQL) Data Types
On Beyond (PostgreSQL) Data Types
 
Row Pattern Matching in Oracle Database 12c
Row Pattern Matching in Oracle Database 12cRow Pattern Matching in Oracle Database 12c
Row Pattern Matching in Oracle Database 12c
 
Three steps to untangle data traffic jams
Three steps to untangle data traffic jamsThree steps to untangle data traffic jams
Three steps to untangle data traffic jams
 
A Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAINA Deeper Dive into EXPLAIN
A Deeper Dive into EXPLAIN
 
MySQL 8.0 EXPLAIN ANALYZE
MySQL 8.0 EXPLAIN ANALYZEMySQL 8.0 EXPLAIN ANALYZE
MySQL 8.0 EXPLAIN ANALYZE
 
Postgres can do THAT?
Postgres can do THAT?Postgres can do THAT?
Postgres can do THAT?
 
Indexing
IndexingIndexing
Indexing
 
Row patternmatching12ctech14
Row patternmatching12ctech14Row patternmatching12ctech14
Row patternmatching12ctech14
 
Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++Options and trade offs for parallelism and concurrency in Modern C++
Options and trade offs for parallelism and concurrency in Modern C++
 
10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL10 Reasons to Start Your Analytics Project with PostgreSQL
10 Reasons to Start Your Analytics Project with PostgreSQL
 
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
Cost-Based Optimizer Framework for Spark SQL: Spark Summit East talk by Ron H...
 
MySQL performance tuning
MySQL performance tuningMySQL performance tuning
MySQL performance tuning
 
Golang in TiDB (GopherChina 2017)
Golang in TiDB  (GopherChina 2017)Golang in TiDB  (GopherChina 2017)
Golang in TiDB (GopherChina 2017)
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
linkedlist.pptx
linkedlist.pptxlinkedlist.pptx
linkedlist.pptx
 
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
Hadoop Summit 2014: Query Optimization and JIT-based Vectorized Execution in ...
 

More from Jano Suchal

Slovensko.Digital: Čo ďalej?
Slovensko.Digital: Čo ďalej?Slovensko.Digital: Čo ďalej?
Slovensko.Digital: Čo ďalej?Jano Suchal
 
Improving code quality
Improving code qualityImproving code quality
Improving code qualityJano Suchal
 
Beyond search queries
Beyond search queriesBeyond search queries
Beyond search queriesJano Suchal
 
Rank all the things!
Rank all the things!Rank all the things!
Rank all the things!Jano Suchal
 
Rank all the (geo) things!
Rank all the (geo) things!Rank all the (geo) things!
Rank all the (geo) things!Jano Suchal
 
Ako si vybrať programovácí jazyk alebo framework?
Ako si vybrať programovácí jazyk alebo framework?Ako si vybrať programovácí jazyk alebo framework?
Ako si vybrať programovácí jazyk alebo framework?Jano Suchal
 
Bonetics: Mastering Puppet Workshop
Bonetics: Mastering Puppet WorkshopBonetics: Mastering Puppet Workshop
Bonetics: Mastering Puppet WorkshopJano Suchal
 
Peter Mihalik: Puppet
Peter Mihalik: PuppetPeter Mihalik: Puppet
Peter Mihalik: PuppetJano Suchal
 
Tomáš Čorej: Configuration management & CFEngine3
Tomáš Čorej: Configuration management & CFEngine3Tomáš Čorej: Configuration management & CFEngine3
Tomáš Čorej: Configuration management & CFEngine3Jano Suchal
 
Ako si vybrať programovací jazyk a framework?
Ako si vybrať programovací jazyk a framework?Ako si vybrať programovací jazyk a framework?
Ako si vybrať programovací jazyk a framework?Jano Suchal
 
Miroslav Šimulčík: Temporálne databázy
Miroslav Šimulčík: Temporálne databázyMiroslav Šimulčík: Temporálne databázy
Miroslav Šimulčík: Temporálne databázyJano Suchal
 
Vojtech Rinik: Internship v USA - moje skúsenosti
Vojtech Rinik: Internship v USA - moje skúsenostiVojtech Rinik: Internship v USA - moje skúsenosti
Vojtech Rinik: Internship v USA - moje skúsenostiJano Suchal
 
Profiling and monitoring ruby & rails applications
Profiling and monitoring ruby & rails applicationsProfiling and monitoring ruby & rails applications
Profiling and monitoring ruby & rails applicationsJano Suchal
 
Aký programovací jazyk a framework si vybrať a prečo?
Aký programovací jazyk a framework si vybrať a prečo?Aký programovací jazyk a framework si vybrať a prečo?
Aký programovací jazyk a framework si vybrať a prečo?Jano Suchal
 
Petr Joachim: Redis na Super.cz
Petr Joachim: Redis na Super.czPetr Joachim: Redis na Super.cz
Petr Joachim: Redis na Super.czJano Suchal
 
Metaprogramovanie #1
Metaprogramovanie #1Metaprogramovanie #1
Metaprogramovanie #1Jano Suchal
 
PostgreSQL: Advanced features in practice
PostgreSQL: Advanced features in practicePostgreSQL: Advanced features in practice
PostgreSQL: Advanced features in practiceJano Suchal
 
elasticsearch - advanced features in practice
elasticsearch - advanced features in practiceelasticsearch - advanced features in practice
elasticsearch - advanced features in practiceJano Suchal
 

More from Jano Suchal (20)

Slovensko.Digital: Čo ďalej?
Slovensko.Digital: Čo ďalej?Slovensko.Digital: Čo ďalej?
Slovensko.Digital: Čo ďalej?
 
Datanest 3.0
Datanest 3.0Datanest 3.0
Datanest 3.0
 
Improving code quality
Improving code qualityImproving code quality
Improving code quality
 
Beyond search queries
Beyond search queriesBeyond search queries
Beyond search queries
 
Rank all the things!
Rank all the things!Rank all the things!
Rank all the things!
 
Rank all the (geo) things!
Rank all the (geo) things!Rank all the (geo) things!
Rank all the (geo) things!
 
Ako si vybrať programovácí jazyk alebo framework?
Ako si vybrať programovácí jazyk alebo framework?Ako si vybrať programovácí jazyk alebo framework?
Ako si vybrať programovácí jazyk alebo framework?
 
Bonetics: Mastering Puppet Workshop
Bonetics: Mastering Puppet WorkshopBonetics: Mastering Puppet Workshop
Bonetics: Mastering Puppet Workshop
 
Peter Mihalik: Puppet
Peter Mihalik: PuppetPeter Mihalik: Puppet
Peter Mihalik: Puppet
 
Tomáš Čorej: Configuration management & CFEngine3
Tomáš Čorej: Configuration management & CFEngine3Tomáš Čorej: Configuration management & CFEngine3
Tomáš Čorej: Configuration management & CFEngine3
 
Ako si vybrať programovací jazyk a framework?
Ako si vybrať programovací jazyk a framework?Ako si vybrať programovací jazyk a framework?
Ako si vybrať programovací jazyk a framework?
 
Miroslav Šimulčík: Temporálne databázy
Miroslav Šimulčík: Temporálne databázyMiroslav Šimulčík: Temporálne databázy
Miroslav Šimulčík: Temporálne databázy
 
Vojtech Rinik: Internship v USA - moje skúsenosti
Vojtech Rinik: Internship v USA - moje skúsenostiVojtech Rinik: Internship v USA - moje skúsenosti
Vojtech Rinik: Internship v USA - moje skúsenosti
 
Profiling and monitoring ruby & rails applications
Profiling and monitoring ruby & rails applicationsProfiling and monitoring ruby & rails applications
Profiling and monitoring ruby & rails applications
 
Aký programovací jazyk a framework si vybrať a prečo?
Aký programovací jazyk a framework si vybrať a prečo?Aký programovací jazyk a framework si vybrať a prečo?
Aký programovací jazyk a framework si vybrať a prečo?
 
Čo po GAMČI?
Čo po GAMČI?Čo po GAMČI?
Čo po GAMČI?
 
Petr Joachim: Redis na Super.cz
Petr Joachim: Redis na Super.czPetr Joachim: Redis na Super.cz
Petr Joachim: Redis na Super.cz
 
Metaprogramovanie #1
Metaprogramovanie #1Metaprogramovanie #1
Metaprogramovanie #1
 
PostgreSQL: Advanced features in practice
PostgreSQL: Advanced features in practicePostgreSQL: Advanced features in practice
PostgreSQL: Advanced features in practice
 
elasticsearch - advanced features in practice
elasticsearch - advanced features in practiceelasticsearch - advanced features in practice
elasticsearch - advanced features in practice
 

SQL: Query optimization in practice

  • 2. Query optimization "The fastest query is the one you never make"
  • 3. Query planner ● Problem: Find the most efficient way to execute this query SELECT s.* FROM students s JOIN enrolments e ON s.id = e.student_id JOIN courses c ON e.course_id = c.id WHERE s.gender = 'F' AND e.year = '2012' AND c.name = 'Programming in Ruby' ● Use ○ Indexes ○ Statistics about data ○ Access methods ○ Join types ○ Aggregation / Sort / Pipelining
  • 4. B+ tree index ● leaf nodes ○ pointers to table data (heap) ○ doubly linked list (fast in-order traversal)
  • 5. Access methods ● Sequential scan (a.k.a. full table scan) ○ fetch data directly from table (heap) ● Index scan ○ fetch data from table in index order ● Bitmap index scan + Bitmap heap scan ○ fetch data from table in table order ○ requires additional memory for intermediate bitmap ○ ● Index-Only scan (a.k.a. covering index) ○ fetch data from index only
  • 6. Join types ● Nested loop ○ for each row find matching rows using join condition ○ basically a nested "for loop" ● Hash join ○ create intermediate hash table from smaller table ○ loop through larger table and probe against hash ○ recheck & emit matching rows ● Sort-Merge join ○ sort both tables on join attribute (if necessary) ○ merge using interleaved linear scan
  • 7. Sort, Aggregate, Pipelining ● Sort ● Aggregate ○ Plain Aggregate ○ Sort Aggregate ○ Hash Aggregate ● Pipelining ○ stop execution after X rows are emitted ○ SELECT * FROM students LIMIT 10
  • 8. Reading EXPLAIN (in PostgreSQL) EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM documents; est. est. est. est. row Plan startup total emitted Node rows size (B) actual cost cost execution execution Seq Scan on documents (cost=0.00..4688.45 I/O rows=183445 count width=95) heavy? (actual time=0.008..19.241 rows=183445 loops=1) Buffers: shared hit=2854 Total runtime: 28.376 ms
  • 9. Reading EXPLAIN (2) EXPLAIN SELECT * FROM documents WHERE id < 72284 Index Scan using documents_pkey on documents (cost=0.00..19.74 rows=76 width=95) (actual time=0.008..0.076 rows=81 loops=1) Index Cond: (id < 72284) Buffers: shared hit=75 Total runtime: 0.102 ms !!! EXPLAIN SELECT * FROM documents WHERE id > 72284 Seq Scan on documents (cost=0.00..5147.06 rows=183369 width=95) (actual time=0.017..28.905 rows=183363 loops=1) Filter: (id > 72284) Buffers: shared hit=2854 Total runtime: 37.752 ms
  • 10. Reading EXPLAIN (3) SELECT * FROM comments ORDER BY created_at DESC LIMIT 10 Limit (cost=5.37..5.40 rows=10 width=245) (actual time=0.160..0.166 rows=10 loops=1) Buffers: shared hit=5 -> Sort (cost=5.37..5.56 rows=75 width=245) (actual time=0.157..0.160 rows=10 loops=1) Sort Key: created_at Sort Method: top-N heapsort Memory: 28kB" Buffers: shared hit=5 -> Seq Scan on comments (cost=0.00..3.75 rows=75 width=245) (actual time=0.009..0.043 rows=91 loops=1) Buffers: shared hit=3 Total runtime: 0.229 ms
  • 11. Reading EXPLAIN (4) SELECT * FROM comments ORDER BY id DESC LIMIT 10 Limit (cost=0.00..2.11 rows=10 width=245) (actual time=0.018..0.029 rows=10 loops=1) Buffers: shared hit=2 -> Index Scan Backward using comments_pkey on comments (cost=0.00..15.86 rows=75 width=245) (actual time=0.016..0.027 rows=10 loops=1) Buffers: shared hit=2 Total runtime: 0.081 ms
  • 12. Reading EXPLAIN (5) SELECT * FROM comments c JOIN users u ON c.user_id = u.id ORDER BY c.id DESC LIMIT 10 Limit (cost=94.35..2375.31 rows=10 width=946) -> Nested Loop (cost=94.35..7393.42 rows=32 width=946) -> Index Scan Backward using comments_pkey on comments c (cost=0.00..15.86 rows=75 width=245) -> Bitmap Heap Scan on users u (cost=94.35..98.36 rows=1 width=701) Recheck Cond: (id = c.user_id) -> Bitmap Index Scan on users_pkey (cost=0.00..94.34 rows=1 width=0) Index Cond: (id = c.user_id)
  • 15. Optimizing ORDER BY SELECT * FROM documents ORDER BY created_at DESC Sort (cost=20726.12..21184.73 rows=183445 width=95) (actual time=199.680..231.948 rows=183445 loops=1) Sort Key: created_at Sort Method: external sort Disk: 19448kB -> Seq Scan on documents (cost=0.00..4688.45 rows=183445 width=95) (actual time=0.016..18.877 rows=183445 loops=1) Total runtime: 245.865 ms Index Scan using index_on_created_at on documents (cost=0.00..7894.56 rows=183445 width=95) (actual time=0.162..57.519 rows=183445 loops=1) Total runtime: 67.679 ms
  • 16. Optimizing GROUP BY SELECT attachment_id, COUNT(*) FROM pages GROUP BY attachment_id LIMIT 10 Limit (cost=0.00..65.18 rows=10 width=4) -> GroupAggregate (cost=0.00..267621.87 rows=41056 width=4) -> Index Scan using idx_attachment_id on pages (cost=0.00..261638.63 rows=1114537 width=4) SELECT attachment_id, COUNT(*) FROM pages GROUP BY attachment_id HashAggregate (cost=169181.05..169591.61 rows=41056 width=4) -> Seq Scan on pages (cost=0.00..163608.37 rows=1114537 width=4)
  • 17. Multicolumn / composite indexes SELECT * FROM pages WHERE number = 1 ORDER BY created_at LIMIT 100 Limit (cost=173572.08..173572.33 rows=100 width=1047) -> Sort (cost=173572.08..174041.45 rows=187749 width=1047) Sort Key: created_at -> Seq Scan on pages (cost=0.00..166396.45 rows=187749 width=1047) Filter: (number = 1) CREATE INDEX idx_number_created_at ON pages (number, created_at); Limit (cost=0.00..247.86 rows=100 width=1047) -> Index Scan Backward using idx_number_created_at on pages Index Cond: (number = 1)
  • 18. Multicolumn / composite indexes ● Order is important! ○ equality conditions first ● Usable for GROUP BY with WHERE conditions ● Mixed ordering ○ ORDER BY a ASC, b DESC ○ CREATE INDEX idx ON t (a ASC , b DESC);
  • 19. Index-only scan / covering index ● MySQL, PostgreSQL 9.2+, ... ● Very useful for fast lookup ○ on m:n join tables - order is important!
  • 20. Partial index CREATE INDEX idx ON pages (number, created_at); CREATE INDEX idx ON pages (number, created_at) WHERE number = 1; ● Index size ○ 34MB vs 5MB
  • 21. Function / expression index SELECT 1 FROM users WHERE lower(username) = 'johno'; CREATE INDEX idx_username ON users (lower(username));
  • 23. UNNEST Problem: Show last 6 checkins for N given users SELECT *, UNNEST(ARRAY( SELECT movie_id FROM checkins WHERE user_id = u.id ORDER BY created_at DESC LIMIT 6) ) as movie_id FROM users AS u WHERE u.id IN (2079077510, 1355625182, ...)
  • 24. Exploiting knowledge about data ● Correlated columns ○ Example: Events stream ○ ORDER BY created_at vs. ORDER BY id ○ Exploit primary key ● Fetch top candidates, filter and compute ○ Fast query for candidates ○ expensive operations in outer query SELECT * FROM ( SELECT * FROM table ORDER BY c LIMIT 100 ) AS t GROUP BY another column
  • 25. Now what? ● Denormalization = redundancy ○ counters ○ duplicate columns ○ materialized views ● Specialized structures ○ nested set ● Specialized indexes ○ GIN, GiST, SP-GiST, Spatial ● Specialized engines ○ column stores, full text, graphs
  • 26. Common caveats ● most selective first myth ● unique vs non-unique indexes ○ additional information planner can use ● Unnecessary subqueries ○ Good query planner might restructure it to join ● LEFT JOIN vs. JOIN ○ LEFT JOIN constraints joining order possibilities ○ not semantically equivalent!
  • 27. Resources ● http://use-the-index-luke.com/ ● http://www.postgresql.org/docs/9. 2/static/indexes.html ● http://www.postgresql.org/docs/9. 2/static/using-explain.html