2. Background
RealityMine provides digital behaviour analytics.
Our applications passively measure the activity of opt-in users across all digital platforms.
This data can be used to:
• direct marketing
• direct product development
• survey individuals who exhibit certain behaviour patterns
3. Starting State
• SQL Server DW on an in-house server
• SQL Server 2008 R2 Enterprise Edition
• Single 4-core (8-thread) i7 with 16 GB RAM
• 2 × 960 GB PCIe SSDs for DBs
• 1 × 240 GB PCIe SSD for TempDb
SQL Server to Redshift - @joeharris76
4. Data Environment
• ~20 billion rows in active use
• The largest table is also the widest
• Volume is doubling more often than annually
• Data is in many languages
• Starts as JSON, ends as a star schema DW
5. Pain Points
• Biggest cost is the SQL Server license
• Biggest bottleneck is single-threaded performance
• Hand tuning is needed to push the CPU / disks
• SSD reliability is not perfect
• SSD performance degrades over time
6. Why Redshift
• Vertica wanted £45k per terabyte
• 16 SQL Server Enterprise cores cost even more!
• Teradata, Netezza, etc. don't want <5 TB sales
• SAP HANA not viable for this volume on AWS
• Infobright does not support incremental loads
• Hadoop/Impala is slow & requires lots of learning
7. Data Processing Approach
• No ETL tool truly supports Redshift
– Requirement to load from S3 is a killer
– Tried SSIS, Pentaho, Talend and others
• You're stuck with ELT
– Load the data first, then transform as needed
– Keep the data as raw as possible from the source
8. War of Encodings
The road to heaven goes
through ÜÑÎÇØDÈ hell
9. Redshift: UTF-8 Only
• Redshift has zero tolerance for certain chars
– NUL/0x00 => treated as EOR, documented
– DEL/0x7F => treated as EOR, undocumented
– 0xEFBFBF (U+FFFF) => UTF-8 spec "guaranteed non-char"
– These must be removed before loading data
• Other control characters can be loaded by escaping
– You cannot escape a single column; it's all or nothing
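The pre-load cleanup above is small enough to sketch directly; a minimal Python illustration, assuming the data is already UTF-8 text (the function name is hypothetical):

```python
# Characters Redshift's loader rejects outright: NUL (0x00),
# DEL (0x7F), and the non-character U+FFFF (bytes EF BF BF in UTF-8).
FORBIDDEN = {"\x00", "\x7f", "\uffff"}

def scrub(text: str) -> str:
    """Remove code points that Redshift treats as end-of-record
    or rejects as invalid, before the file is staged for loading."""
    return "".join(ch for ch in text if ch not in FORBIDDEN)
```

Run this over every extract before compression; legitimate multi-byte characters pass through untouched.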
10. SQL Server: UTF-16LE Only
• NVARCHAR takes 2x as much space as VARCHAR
• Makes functions consistent across ASCII & Unicode
– N/VARCHAR(32) = 32 chars / Redshift = 32 bytes
• SQL Server tolerates anything in character columns
• Input and output are not sanitized against the UTF-16 spec
– Invalid or "guaranteed non-chars" are stored as-is
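The chars-vs-bytes distinction above is easy to check; a quick Python illustration (the sample string is arbitrary):

```python
# SQL Server's (N)VARCHAR(32) counts characters;
# Redshift's VARCHAR(32) counts bytes after UTF-8 encoding.
s = "Ünïcødé"                        # 7 characters
chars = len(s)                       # what SQL Server counts
bytes_utf8 = len(s.encode("utf-8"))  # what Redshift counts

print(chars, bytes_utf8)  # → 7 11
```

So a column that fits in SQL Server can overflow in Redshift; pad declared widths for non-ASCII data.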
11. SQL Extract: The Hard Way
• BCP is the “standard” way to extract data
• Using BCP your process looks something like this:
– Extract data as a huge UTF-16LE file using bcp
– Convert to a new UTF-8 file using iconv
– Remove or escape problem chars using sed
– Compress the final file using gzip
– All steps are heavily constrained by disk speed
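The convert / clean / compress steps above can be collapsed into one streaming pass; a hedged Python sketch of the iconv + sed + gzip stages (file paths and the function name are hypothetical):

```python
import gzip

def convert_extract(src_path: str, dst_path: str) -> None:
    """Re-encode a UTF-16LE bcp extract as gzipped UTF-8,
    dropping the characters Redshift rejects. Combines the
    iconv, sed, and gzip steps in a single pass over the file."""
    forbidden = {"\x00", "\x7f", "\uffff"}
    with open(src_path, "r", encoding="utf-16-le") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            dst.write("".join(c for c in line if c not in forbidden))
```

This still lands the source file on disk first, which is why the one-liner on the next slide is the easier route.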
12. SQL Extract: The Easy Way
SQLCMD one-liner for extracts:
chcp 65001 &                            ← set the cmd code page to UTF-8
sqlcmd -E                               ← interactive SQL terminal (trusted connection)
  -Q "SET NOCOUNT ON;                   ← prevent summary in output
      SELECT * FROM Db.Schema.Table;"   ← select from the table / view
  -h-1                                  ← no column headers
  -k1                                   ← remove special characters
  -s"|"                                 ← delimit output with 1 ASCII char
  -W                                    ← no padding in output
  -u                                    ← output in Unicode
  | gzip > "C:\file.gz"                 ← pipe stdout to gzip
13. Data Encryption
• On SQL Server we use TDE
• Redshift offers AES-encrypted data on disk
• Redshift can load client-side encrypted data
• Client-side encryption only applies while on S3
• "Small performance penalty" for using AES
14. Security
• S3 Access => Create bucket(s) just for Redshift staging
• Redshift admin => Use IAM, create automation user(s)
• Redshift database =>
– Do not use the admin user; it's like SQL Server's 'sa'
• Database objects =>
– Must actively GRANT access to each object
– Use groups to make management easier
15. Sizing your cluster
• Redshift is over-provisioned on storage
• Redshift is super efficient at compression
– Compression not affected by the data model
• Redshift scale out is almost perfectly linear
– 2 nodes is twice as fast as 1 node
• You'll be sizing your cluster for speed!
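Using the figures from the speaker notes (~2 TB of storage per node, on-cluster size roughly 2x the gzipped UTF-8 input), the storage floor is simple arithmetic; a sketch with a hypothetical function name, remembering you'll usually add nodes for speed, not disk:

```python
import math

def nodes_for_storage(gzipped_gb: float, node_tb: float = 2.0) -> int:
    """Minimum node count on storage alone, assuming Redshift data
    is ~2x the gzipped UTF-8 input (our observed ratio) and each
    node carries ~2 TB. Treat this as a floor, not a target."""
    redshift_gb = gzipped_gb * 2
    return max(1, math.ceil(redshift_gb / (node_tb * 1024)))
```

For example, 3 TB of gzipped input implies roughly 6 TB on cluster, so at least 3 nodes; speed requirements will usually push the count higher.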
16. Performance
• Redshift speed depends on node count
– A single node is not particularly fast
• Loading speed appears to be linked to S3 speed
– You must use multiple files for bulk loads
• Query speed appears to be CPU constrained
– Vacuum runs 250 MB/s, queries <20 MB/s
• Data modeling matters for complex query speed
– Use a star schema & well chosen distribution key
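Since bulk loads parallelize across files, splitting the extract is worth automating; a minimal sketch that round-robins rows into N gzipped parts (file naming is an assumption, not a Redshift requirement):

```python
import gzip

def split_for_copy(lines, prefix, parts=8):
    """Round-robin lines into N gzipped part files so a single
    bulk load can pull them from S3 in parallel."""
    files = [gzip.open(f"{prefix}.part{i:02d}.gz", "wt", encoding="utf-8")
             for i in range(parts)]
    try:
        for n, line in enumerate(lines):
            files[n % parts].write(line)
    finally:
        for f in files:
            f.close()
```

A common rule of thumb is one file per slice in the cluster, so the `parts` value should track your node count.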
17. Data Modeling
2 main concepts to learn
• Distribution key
– Where data is placed, which node & slice
– Needs to be common across most tables
• Sort key
– How data is ordered on disk within the slice
– Good sort keys simplify expensive joins
18. Database Maintenance
• Data loaded to non-empty tables is not sorted
• Data loaded to non-empty tables may kill their stats
• ANALYZE rebuilds the stats without making changes
• VACUUM re-sorts the physical data and rebuilds stats
– Needed to get the best performance
– Very similar to a REBUILD in SQL Server
19. Database Backups
• Redshift ‘backups’ are snapshots of the system
• Taken very quickly, much slower to restore
• Redshift automatically takes intra-day snapshots
• Manual snapshots can be run using AWS cmd line
• Snapshot storage is free up to size of cluster storage
• Snapshots must be restored to an identical cluster
• Snapshots cannot be restored to a running cluster
20. Code Changes
Code changes required so far
• ROW_NUMBER() is missing in Redshift
• We gain LAG() and LEAD(), which helps
• But it is very difficult to persist an order value
• DATETIMEOFFSET (i.e. with timezone) is not available
• DATETIMEs are now split into 2 columns
• Work in progress…
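One way to emulate the missing DATETIMEOFFSET is to carry a UTC timestamp plus the original offset; a hedged sketch of that two-column split (the exact column layout here is an assumption, not necessarily the scheme used in the talk):

```python
from datetime import datetime, timezone, timedelta

def split_datetimeoffset(dt: datetime) -> tuple[str, int]:
    """Split a timezone-aware datetime into two loadable columns:
    a UTC timestamp string and the original offset in minutes.
    `dt` must carry tzinfo; naive datetimes are not handled."""
    offset_min = int(dt.utcoffset().total_seconds() // 60)
    utc_ts = dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return utc_ts, offset_min

# Example: noon at UTC+05:30 becomes 06:30 UTC plus a 330-minute offset
ts, off = split_datetimeoffset(
    datetime(2014, 3, 1, 12, 0,
             tzinfo=timezone(timedelta(hours=5, minutes=30))))
```

Storing the offset separately keeps the local wall-clock time recoverable while letting range queries run on a single UTC column.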
22. Come Work With Me!
http://www.realitymine.com/careers/
• Currently trying to fill the following roles:
• Business Intelligence Architect (Redshift!)
• Business Intelligence Developer (Tableau!)
• Test Engineer (Quality!)
• Server Developer (C#!)
• Mobile App Developer (Android! iOS!)
• Project Manager
Editor's Notes
Data and Log are always on different disks. Criss-cross pattern used to balance wear. TempDb split across 8 files (1 per thread).
TDE required for data encryption. Compression used to maximise SSD speed. A lot of tuning done to push CPU and disks harder. We've seen silent partial failures without any indication. Now have to regularly run DBCC to verify databases. So far we've seen a ~20% perf loss over a year.
We're actually using our existing SQL Server automation setup to run batch scripts that execute SQL on Redshift.
Four-byte character support was recently added and that makes things a little easier. SQL Server's REPLACE() function is **broken** and ***cannot remove any of these values***! Yes, really. I can't tell you how fun it was to figure that out, because it wasn't fun at all. All escape-sensitive data must be escaped in all columns. Embedded newlines **must** be escaped as '\n'.
vs Oracle, which has LENGTH() for characters and LENGTHB() for bytes. vs Redshift, which has only LENGTH() and no way to get the byte length. SQL Server will tolerate _anything_ inside a character column. No sanitisation of inputs or outputs. UTF-16LE *compatible*, rather than *compliant*. I know this from painful experience.
All web searches will suggest using BCP. All ETL tools actually wrap BCP to get data out. **Forget about BCP. BCP is the enemy.** BCP DOES NOT SUPPORT STDOUT!!!
Voila! UTF-8 output from SQL Server directly to a gzip file.
* On SQL Server we use TDE (transparent data encryption)
  * Data on disk is AES encrypted, transparently.
* Redshift offers AES encryption of the data on disk.
  * Not actively encrypted during use, same as SQL Server.
* Redshift supports loading client-side 'envelope' encrypted data.
  * Good luck with that!
  * Slow: you'll have to land your data on disk and then reprocess it.
  * Custom: you'll have to write your own encrypter using OpenSSL or some such.
* Client-side encryption is somewhat moot as it only applies while data is on S3.
  * My 2p: enable AES on both S3 and Redshift. Call it a day.
* Amazon says there is a 'small performance penalty' for using AES.
  * In practice it seems to be acceptable.
  * I have *not actually tested* it without AES because I don't want to generate 10 billion rows of sample data.
* Managing user and admin access is kind of a pain in Redshift
1. Access to S3
  * Create bucket(s) just for Redshift staging data.
2. Access to Redshift admin
  * Use IAM access controls to limit individuals' access.
  * Create users just for automation and enforce password rotation.
3. Access to Redshift database
  * **Do not allow** use of the admin user - it's like SQL Server's `sa`.
  * Create a 1:1 map of external users to Redshift users (no LDAP/AD support)
4. Access to specific database objects
  * You must actively `GRANT` access to each object.
  * Use groups to make this task easier.
  * We have just 2 groups: "admin" (`GRANT ALL`) and "readers" (`GRANT SELECT`)
* Redshift nodes are waaaaaay over-provisioned on storage
  * 2 TB of storage available per node
* Redshift is suuuuuper efficient at compression
  * Our data in Redshift is roughly 2x the gzipped UTF8 input.
  * The size varies depending on how we sort the tables.
* Therefore you'll be sizing the cluster for **speed**.
  * You add nodes to go faster, _not when you run out of disk._
* Tough to get your head around.
Still faster than SQL Server on PCIe SSDs for our data. You must use multiple files for bulk loads.
You cannot schedule these AFAICT. They are auto-deleted on a schedule you can set. Default auto-delete is 1 day. Priced same as S3 beyond cluster size.