AWS July Webinar Series: Amazon Redshift Optimizing Performance

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Sanjay Kotecha, Solution Architect
Eric Ferreira, Principal Database Engineer
July 21, 2015
Best Practices: Amazon Redshift
Optimizing Performance

Getting Started – June Webinar Series:
https://www.youtube.com/watch?v=biqBjWqJi-Q
Best Practices – July Webinar Series:
Optimizing Performance – July 21, 2015
Migration and Data Loading – July 22, 2015
Reporting and Advanced Analytics – July 23, 2015
Amazon Redshift – Resources

Architecture
Distribution
Sort Keys
Compression
DDL
Loading
Vacuum
Analyze
Workload Management
Agenda

Leader Node
• SQL endpoint
• Stores metadata
• Coordinates query execution
Compute Nodes
• Local, columnar storage
• Execute queries in parallel
• Load, backup, restore via S3
• Parallel load from DynamoDB or SSH
Hardware optimized for data processing
• DS2: HDD; scale from 2TB to 2PB
• DC1: SSD; scale from 160GB to 326TB
[Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion; Backup; Restore]
Amazon Redshift Architecture

– One slice per core
– DS2 – 2 slices on XL, 16 on 8XL
– DC1 – 2 slices on XL, 32 on 8XL
Architecture – Nodes and Slices

Table Distribution Styles
• Distribution Key: same key to same location
• All: all data on every node
• Even: round robin distribution
[Diagram for each style: Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4)]
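The three styles can be sketched as a small Python simulation. This is only an illustration of the placement rules named on the slide, not Redshift's implementation: the `crc32` hash, the two-node/four-slice layout, and the choice to store the ALL copy on the first slice of each node are assumptions made for the sketch.

```python
from zlib import crc32

NUM_NODES = 2          # assumed cluster shape, matching the slide diagram
SLICES_PER_NODE = 2
NUM_SLICES = NUM_NODES * SLICES_PER_NODE


def slice_for_key(value):
    """Map a distribution key value to a slice (crc32 is a stand-in hash)."""
    return crc32(str(value).encode("utf-8")) % NUM_SLICES


def distribute(rows, style, key=None):
    """Return one list of rows per slice for the given distribution style."""
    slices = [[] for _ in range(NUM_SLICES)]
    for position, row in enumerate(rows):
        if style == "KEY":
            # Same key value always hashes to the same slice.
            slices[slice_for_key(row[key])].append(row)
        elif style == "EVEN":
            # Round robin, ignoring the row contents.
            slices[position % NUM_SLICES].append(row)
        elif style == "ALL":
            # One full copy per node (kept on each node's first slice here).
            for node in range(NUM_NODES):
                slices[node * SLICES_PER_NODE].append(row)
        else:
            raise ValueError(f"unknown distribution style: {style}")
    return slices


rows = [
    {"user_id": 1234, "uri": "/games/g1.exe"},
    {"user_id": 2345, "uri": "/imgs/ad1.png"},
    {"user_id": 4312, "uri": "/games/g10.exe"},
    {"user_id": 1234, "uri": "/img/ad_5.img"},
]

for style in ("KEY", "EVEN", "ALL"):
    placement = distribute(rows, style, key="user_id")
    print(style, [len(s) for s in placement])
```

With KEY, both rows for user_id 1234 always land on one slice; with EVEN, each slice gets one row; with ALL, every node holds all four rows.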

[Diagram: Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4), shown with sample records:]
• cloudfront: uri = /games/g1.exe, user_id=1234, …
• user_profile: user_id=1234, name=janet, …
• user_profile: user_id=6789, name=fred, …
• cloudfront: uri = /imgs/ad1.png, user_id=2345, …
• user_profile: user_id=2345, name=bill, …
• cloudfront: uri=/games/g10.exe, user_id=4312, …
• user_profile: user_id=4312, name=fred, …
• order_line: order_line_id = 25693, …
• cloudfront: uri = /img/ad_5.img, user_id=1234, …
Data Distribution with Distribution Keys

Distribution Keys determine which data resides on which slices
[Diagram: Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4) holding user_profile records (user_id=1234, 6789, 2345, 4312), cloudfront records (user_id=2345, 4312), and order_line records. Highlighted: the two cloudfront records with user_id=1234 (uri = /games/g1.exe and uri = /img/ad_5.img).]
Records with same distribution key for a table are on the same slice
Data Distribution and Distribution Keys

Distribution Keys help with data locality for join evaluation
[Diagram: sample records grouped by node]
Node 1 (Slice 1, Slice 2)
• cloudfront: uri = /games/g1.exe, user_id=1234, …
• user_profile: user_id=1234, name=janet, …
• cloudfront: uri = /imgs/ad1.png, user_id=2345, …
• user_profile: user_id=2345, name=bill, …
• order_line: …
• cloudfront: uri = /img/ad_5.img, user_id=1234, …
Node 2 (Slice 3, Slice 4)
• user_profile: user_id=6789, name=fred, …
• cloudfront: uri=/games/g10.exe, user_id=4312, …
• user_profile: user_id=4312, name=fred, …
Records with same distribution key for a table are on the same slice
Records from other tables with the same distribution key value are also on the same slice
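A minimal sketch of why a shared distribution key helps joins, using the sample records from the slide. It assumes both tables are distributed on user_id with the same hash (`crc32` here is a stand-in, and `slice_of`, `distribute_by_key`, and `join_on_user_id` are hypothetical helper names). Under that assumption each slice can join its own rows, and the union of the per-slice results equals the full join, so no rows need to move between slices.

```python
from zlib import crc32

NUM_SLICES = 4  # assumed: two nodes with two slices each, as in the diagram


def slice_of(user_id):
    """Stand-in hash that maps a distribution key value to a slice."""
    return crc32(str(user_id).encode("utf-8")) % NUM_SLICES


def distribute_by_key(rows, key):
    """Place each row on the slice chosen by hashing its key column."""
    slices = [[] for _ in range(NUM_SLICES)]
    for row in rows:
        slices[slice_of(row[key])].append(row)
    return slices


def join_on_user_id(cloudfront_rows, profile_rows):
    """Simple hash join: (uri, name) for rows with matching user_id."""
    names = {}
    for profile in profile_rows:
        names.setdefault(profile["user_id"], []).append(profile["name"])
    return [
        (hit["uri"], name)
        for hit in cloudfront_rows
        for name in names.get(hit["user_id"], [])
    ]


cloudfront = [
    {"uri": "/games/g1.exe", "user_id": 1234},
    {"uri": "/imgs/ad1.png", "user_id": 2345},
    {"uri": "/games/g10.exe", "user_id": 4312},
    {"uri": "/img/ad_5.img", "user_id": 1234},
]
user_profile = [
    {"user_id": 1234, "name": "janet"},
    {"user_id": 6789, "name": "fred"},
    {"user_id": 2345, "name": "bill"},
    {"user_id": 4312, "name": "fred"},
]

# Join slice by slice, never looking at another slice's rows.
local_result = []
for cf_slice, up_slice in zip(
    distribute_by_key(cloudfront, "user_id"),
    distribute_by_key(user_profile, "user_id"),
):
    local_result.extend(join_on_user_id(cf_slice, up_slice))

global_result = join_on_user_id(cloudfront, user_profile)
print(sorted(local_result) == sorted(global_result))
```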

Example Query (TPC-H dataset)
Distribution Type Query Time
Key 14 seconds
Even 39 seconds
Query against the tables with distribution key was 178% faster
Data Distribution - Comparison

Query plan for tables with distribution key
Data Distribution - Comparison
Query plan for tables without distribution key

Query Plan
http://docs.aws.amazon.com/redshift/latest/dg/c-query-processing.html

[Diagram: cloudfront records on Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4); the four slices hold 2M, 5M, 1M, and 4M records.]
Poor key choices lead to uneven distribution of records…

[Diagram: cloudfront records on Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4); the four slices hold 2M, 5M, 1M, and 4M records.]
Unevenly distributed data causes processing imbalances!

[Diagram: cloudfront records on Node 1 (Slice 1, Slice 2) and Node 2 (Slice 3, Slice 4); each of the four slices holds 2M records.]
Evenly distributed data improves query performance
select * from v_check_data_distribution where tablename = 'lineitem';
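The effect of skew can be put into rough numbers. The sketch below assumes that a query step finishes only when the busiest slice finishes and that work is proportional to row count; both are simplifications, and `skew_report` is a hypothetical helper, not part of the admin scripts.

```python
def skew_report(rows_per_slice):
    """Summarize how far a per-slice row count is from an even spread."""
    total = sum(rows_per_slice)
    ideal = total / len(rows_per_slice)      # rows per slice if perfectly even
    busiest = max(rows_per_slice)
    emptiest = min(rows_per_slice)
    return {
        "total_rows": total,
        "ideal_rows_per_slice": ideal,
        "busiest_slice_rows": busiest,
        # Ratio of the largest slice to the smallest slice.
        "skew_ratio": busiest / emptiest if emptiest else float("inf"),
        # How much longer the busiest slice works than it would if even.
        "slowdown_vs_even": busiest / ideal,
    }


uneven = skew_report([2_000_000, 5_000_000, 1_000_000, 4_000_000])
even = skew_report([2_000_000] * 4)

print(uneven["skew_ratio"], round(uneven["slowdown_vs_even"], 2))  # 5.0 1.67
print(even["skew_ratio"], even["slowdown_vs_even"])                # 1.0 1.0
```

For the uneven example from the slides, the busiest slice holds 5M rows against an ideal of 3M, so under these assumptions the step takes about 1.67 times as long as it would with an even spread.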

KEY
• Large Fact tables
• Large dimension tables
ALL
• Medium dimension tables (1K – 2M)
EVEN
• Tables with no joins or group by
• Small dimension tables (<1000)
Data Distribution

Tools – Admin Scripts: table_info.sql

SELECT COUNT(*) FROM LOGS WHERE DATE = '09-JUNE-2015'
Unsorted Table
• MIN: 01-JUNE-2015, MAX: 20-JUNE-2015 (READ)
• MIN: 08-JUNE-2015, MAX: 30-JUNE-2015 (READ)
• MIN: 12-JUNE-2015, MAX: 20-JUNE-2015
• MIN: 02-JUNE-2015, MAX: 25-JUNE-2015 (READ)
• MIN: 06-JUNE-2015, MAX: 12-JUNE-2015 (READ)
Sorted By Date
• MIN: 01-JUNE-2015, MAX: 06-JUNE-2015
• MIN: 07-JUNE-2015, MAX: 12-JUNE-2015 (READ)
• MIN: 13-JUNE-2015, MAX: 18-JUNE-2015
• MIN: 19-JUNE-2015, MAX: 24-JUNE-2015
• MIN: 25-JUNE-2015, MAX: 30-JUNE-2015
Sort Keys – Zone Maps
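The pruning shown on this slide can be reproduced with a few lines of Python. The block ranges are the ones from the slide; `blocks_to_read` is a hypothetical helper that models a zone map as a list of (min, max) pairs, which is a simplification of whatever Redshift stores internally.

```python
from datetime import date


def june(day):
    """Shorthand for a date in June 2015, the month used on the slide."""
    return date(2015, 6, day)


def blocks_to_read(zone_map, wanted):
    """Return the indexes of blocks whose min/max range could hold `wanted`."""
    return [i for i, (low, high) in enumerate(zone_map) if low <= wanted <= high]


unsorted_blocks = [
    (june(1), june(20)),
    (june(8), june(30)),
    (june(12), june(20)),
    (june(2), june(25)),
    (june(6), june(12)),
]
sorted_blocks = [
    (june(1), june(6)),
    (june(7), june(12)),
    (june(13), june(18)),
    (june(19), june(24)),
    (june(25), june(30)),
]

# WHERE DATE = '09-JUNE-2015'
print(blocks_to_read(unsorted_blocks, june(9)))  # [0, 1, 3, 4]
print(blocks_to_read(sorted_blocks, june(9)))    # [1]
```

The unsorted table needs four of its five blocks read for the same predicate that touches a single block once the table is sorted by date.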

Sort Keys - How to choose
Timestamp column
Frequent range filtering or equality filtering on one column
Join column:
create table customer (
c_custkey int8 not null,
c_name varchar(25) not null,
c_address varchar(40) not null,
c_nationkey int4 not null,
c_phone char(15) not null,
c_acctbal numeric(12,2) not null,
c_mktsegment char(10) not null,
c_comment varchar(117) not null
) distkey(c_custkey) sortkey(c_custkey) ;

Single Column
Compound
Interleaved
Sort Keys

Table is sorted by 1 column
[ SORTKEY ( date ) ]
Best for:
• Queries that use 1st column (e.g. date) as primary filter
• Can speed up joins and group bys
• Quickest to VACUUM
Date Region Country
2-JUN-2015 Oceania New Zealand
2-JUN-2015 Asia Singapore
2-JUN-2015 Africa Zaire
2-JUN-2015 Asia Hong Kong
3-JUN-2015 Europe Germany
3-JUN-2015 Asia Korea
Sort Keys – Single Column

• Table is sorted by 1st column, then 2nd column, etc.
[ SORTKEY COMPOUND ( date, region, country) ]
• Best for:
• Queries that use 1st column as primary filter, then other cols
• Can speed up joins and group bys
• Slower to VACUUM
Date Region Country
Sort Keys – Compound

• Equal weight is given to each column.
[ SORTKEY INTERLEAVED ( date, region, country) ]
• Best for:
• Queries that use different columns in filter
• Queries get faster the more columns used in the filter (up to 8)
• Slowest to VACUUM
Date Region Country
Sort Keys – Interleaved
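To see what "equal weight to each column" can mean in practice, the sketch below compares a compound ordering with a bit-interleaved (Z-order) ordering on a toy two-column table. The deck does not describe how Redshift implements interleaved keys, so Z-order is used here purely as an assumed stand-in; the 16x16 value grid and 16-row blocks are also invented for the example.

```python
def interleave_bits(a, b, bits=4):
    """Build a Z-order code by alternating the bits of a and b."""
    code = 0
    for i in range(bits):
        code |= ((a >> i) & 1) << (2 * i + 1)
        code |= ((b >> i) & 1) << (2 * i)
    return code


def build_blocks(rows, sort_key, block_size=16):
    """Sort the rows, then cut them into fixed-size blocks."""
    ordered = sorted(rows, key=sort_key)
    return [ordered[i:i + block_size] for i in range(0, len(ordered), block_size)]


def blocks_scanned(blocks, column, value):
    """Count blocks whose min/max for `column` could contain `value`."""
    count = 0
    for block in blocks:
        values = [row[column] for row in block]
        if min(values) <= value <= max(values):
            count += 1
    return count


# Toy table: every combination of two columns with values 0..15 (256 rows).
rows = [(a, b) for a in range(16) for b in range(16)]

compound = build_blocks(rows, sort_key=lambda r: (r[0], r[1]))
interleaved = build_blocks(rows, sort_key=lambda r: interleave_bits(r[0], r[1]))

for name, blocks in (("compound", compound), ("interleaved", interleaved)):
    first = blocks_scanned(blocks, column=0, value=5)
    second = blocks_scanned(blocks, column=1, value=5)
    print(name, "filter on 1st column:", first, "filter on 2nd column:", second)
```

In this toy setup the compound ordering scans 1 of 16 blocks when filtering on the first column but all 16 when filtering only on the second, while the interleaved ordering scans 4 blocks in either case. That matches the trade-off on the slides: compound is best when the first column is the primary filter, interleaved when queries filter on different columns.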

Sort Keys – Comparing Styles
Single
create table cust_sales_dt_single
sortkey (c_custkey)
as select * from cust_sales_date;
Compound
create table cust_sales_dt_compound
compound sortkey (c_custkey, c_region, c_mktsegment, d_date)
as select * from cust_sales_date;
Interleaved
create table cust_sales_dt_interleaved
interleaved sortkey (c_custkey, c_region, c_mktsegment, d_date)
as select * from cust_sales_date;

Query 1
select max(lo_revenue), min(lo_revenue)
from cust_sales_date_single
where c_custkey < 100000;
Query 2
select max(lo_revenue), min(lo_revenue)
from cust_sales_date_single
where c_region = 'ASIA'
and c_mktsegment = 'FURNITURE';
Query 3
select max(lo_revenue), min(lo_revenue)
from cust_sales_date_single
where d_date between '01/01/1996' and '01/14/1996'
and c_mktsegment = 'FURNITURE'
and c_region = 'ASIA';
Each query is also run against cust_sales_date_compound and cust_sales_date_interleaved.

Sort Style Query 1 Query 2 Query 3
Single 0.25 seconds 18.37 seconds 30.04 seconds
Compound 0.27 seconds 18.24 seconds 30.14 seconds
Interleaved 0.94 seconds 1.46 seconds 0.80 seconds

Increased load and vacuum times
More effective with large tables (100M+ rows)
Use Compound Sort Key when appending data in order
Sort Keys – Interleaved Considerations

Raw encoding (RAW)
Byte-dictionary (BYTEDICT)
Delta encoding (DELTA / DELTA32K)
Mostly encoding (MOSTLY8 / MOSTLY16 / MOSTLY32)
Runlength encoding (RUNLENGTH)
Text encoding (TEXT255 / TEXT32K)
LZO encoding (
Average: 2-4x
Compression - Encodings
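As an illustration of how one of these encodings saves space, here is a minimal run-length sketch in Python. It shows only the general idea behind RUNLENGTH (store a value once with a repeat count); it is not Redshift's on-disk format, and the sample column values are made up.

```python
from itertools import groupby


def runlength_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def runlength_decode(pairs):
    """Expand (value, count) pairs back into the original sequence."""
    return [value for value, count in pairs for _ in range(count)]


# A column where equal values sit next to each other compresses well.
region_column = ["ASIA"] * 4 + ["EUROPE"] * 3 + ["ASIA"]

encoded = runlength_encode(region_column)
print(encoded)  # [('ASIA', 4), ('EUROPE', 3), ('ASIA', 1)]
print(runlength_decode(encoded) == region_column)  # True
```

The encoding pays off only when equal values are adjacent, which is why this kind of scheme tends to suit columns with few distinct values stored in sorted order.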

COPY samples data automatically when loading into an empty table
• Samples up to 100,000 rows and picks optimal encoding
If you use temp tables or staging tables
• Turn off automatic compression
• Use analyze compression to determine the right encodings
• Bake those encodings into your DDL
COPY <tablename> FROM 's3://<bucket-name>/<object-prefix>'
CREDENTIALS 'aws_access_key_id=<AWS_ACCESS_KEY>;aws_secret_access_key=<AWS_SECRET_ACCESS_KEY>'
DELIMITER ',' COMPUPDATE OFF MANIFEST;
Compression

Compression Encodings
Compression - Comparison
No Compression Encodings

Example Query (TPC-H dataset)
Tables Query Time
Compressed 14 seconds
Uncompressed 37 seconds
Query against the tables with compression was 164% faster
Compression - Comparison

• Zone maps store min/max per block
• Once we know which block(s) contain the
range, we know which row offsets to scan
• Highly compressed sort keys mean many rows per block
• You’ll scan more data blocks than you need
• If your sort keys compress significantly more than your data columns, you may want to skip compression on the sort key
Compression – Sort Keys

CREATE TABLE orders (
orderkey int8 NOT NULL DISTKEY,
custkey int8 NOT NULL,
orderstatus char(1) NOT NULL ,
totalprice numeric(12,2) NOT NULL ,
orderdate date NOT NULL SORTKEY ,
orderpriority char(15) NOT NULL,
clerk char(15) NOT NULL ,
shippriority int4 NOT NULL,
comment varchar(79) NOT NULL
);
DDL

During queries and ingestion,
the system allocates buffers
based on column width
Wider than needed columns
mean memory is wasted
Fewer rows fit into memory;
increased likelihood of queries
spilling to disk
DDL – Make Columns as narrow as possible

Define Primary & Foreign Keys
Not enforced, but…
Helps optimizer with query plan
DDL

Use the COPY command
Each slice can load one file at a
time
A single input file means only one
slice is ingesting data
Instead of 100MB/s, you’re only
getting 6.25MB/s
Loading – Use multiple input files to maximize
throughput

Use the COPY command
You need at least as many input
files as you have slices
With 16 input files, all slices are
working so you maximize
throughput
Get 100MB/s per node; scale
linearly as you add nodes
Loading – Use multiple input files to maximize
throughput
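The arithmetic on these two slides, plus one simple way to prepare input files, can be sketched as follows. The linear throughput model and the round-robin split are assumptions for illustration; `load_throughput` and `split_round_robin` are hypothetical helpers, not AWS tooling.

```python
def load_throughput(node_mb_per_s, slices_per_node, input_files):
    """Estimate per-node load rate when each slice loads one file at a time."""
    busy_slices = min(input_files, slices_per_node)
    return node_mb_per_s * busy_slices / slices_per_node


def split_round_robin(lines, num_files):
    """Deal input lines into num_files parts of nearly equal size."""
    parts = [[] for _ in range(num_files)]
    for i, line in enumerate(lines):
        parts[i % num_files].append(line)
    return parts


# Figures from the slides: 100MB/s per node, 16 slices.
print(load_throughput(100, 16, input_files=1))   # 6.25
print(load_throughput(100, 16, input_files=16))  # 100.0

# Split 100 sample rows into one part per slice before uploading.
rows = [f"{n},sample" for n in range(100)]
parts = split_round_robin(rows, 16)
print([len(p) for p in parts])
```

Each part would then be written to its own object under a common prefix, so that a single COPY can pick up all of them in parallel.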

Tools – Use the AdminScripts

VACUUM reclaims space and re-sorts tables
VACUUM can be run in 4 modes:
• VACUUM FULL
• Reclaims space and re-sorts
• VACUUM DELETE ONLY
• Reclaims space but does not re-sort
• VACUUM SORT ONLY
• Re-sorts but does not reclaim space
• VACUUM REINDEX
• Used for INTERLEAVED sort keys.
• Re-Analyzes sort keys and then runs FULL VACUUM
Vacuum

VACUUM is an I/O intensive operation and can take time to run.
To minimize the impact of VACUUM:
• Run VACUUM on a regular schedule
• Use TRUNCATE instead of DELETE where possible
• TRUNCATE or DROP test tables
• Perform a Deep Copy instead of VACUUM
• Load Data in sort order and remove need for VACUUM
Vacuum

• Is an alternative to VACUUM.
• Will remove deleted rows and also re-sort the table
• Is more efficient than VACUUM
• You can’t make concurrent updates to the table
Deep copy options:
• Use original table DDL and run INSERT INTO…SELECT
• Best option - Retains all table attributes
• Use CREATE TABLE AS
• New table does not inherit encoding, distkey, sortkey, primary keys, or foreign keys.
• Use CREATE TABLE LIKE
• New table inherits all attributes except primary and foreign keys
• Use a TEMP table to COPY data out and back in again
• Retains all attributes but requires two full inserts of the table
Vacuum – Deep Copy

Redshift’s query optimizer relies on up-to-date statistics
Update stats on sort/dist key columns after every load
Analyze

Analyze – AdminScripts: missing_table_stats.sql

Workload Management
Workload management is about creating queues for different workloads
User Group A
Short-running queue
Long-running queue
Short
Query Group
Long
Query Group

Workload Management
Don’t set concurrency to more than you need
set query_group to allqueries;
select avg(l.priceperticket*s.qtysold) from listing l, sales s where l.listid <40000;
reset query_group;

Resources
Sanjay Kotecha | kotechas@amazon.com
Detail Pages
• http://aws.amazon.com/redshift
• https://aws.amazon.com/marketplace/redshift/
Best Practices
• http://docs.aws.amazon.com/redshift/latest/dg/c_loading-data-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c_designing-tables-best-practices.html
• http://docs.aws.amazon.com/redshift/latest/dg/c-optimizing-query-performance.html
Deep Dive Webinar Series in July
• Migration and Loading Data – July 22nd, 2015
• Reporting and Advanced Analytics – July 23rd, 2015
