2. Background
RealityMine provides digital behaviour analytics.
Our applications passively measure the activity of opt-in users across all digital platforms.
This data can be used to:
• direct marketing
• direct product development
• survey individuals who exhibit certain behaviour patterns
3. Starting State
• SQL Server DW on an in-house server
• SQL Server 2008 R2 Enterprise Edition
• Single 4-core (8-thread) i7 with 16 GB RAM
• 2 × 960 GB PCIe SSDs for DBs
• 1 × 240 GB PCIe SSD for TempDb
SQL Server to Redshift - @joeharris76
4. Data Environment
• ~20 billion rows in active use
• The largest table is also the widest
• Volume is doubling more often than annually
• Data is in many languages
• Starts as JSON, ends as a star schema DW
5. Pain Points
• Biggest cost is the SQL Server license
• Biggest bottleneck is single-threaded performance
• Hand tuning is needed to push the CPU / disks
• SSD reliability is not perfect
• SSD performance degrades over time
6. Why Redshift
• Vertica wanted £45k per terabyte
• 16 SQL Server Enterprise cores cost even more!
• Teradata, Netezza, etc. don't want <5 TB sales
• SAP HANA not viable for this volume on AWS
• Infobright does not support incremental loads
• Hadoop/Impala is slow & requires lots of learning
7. Data Processing Approach
• No ETL tool truly supports Redshift
– Requirement to load from S3 is a killer
– Tried SSIS, Pentaho, Talend and others
• You're stuck with ELT
– Load the data first, then transform as needed
– Keep the data as raw as possible from the source
8. War of Encodings
The road to heaven goes
through ÜÑÎÇØDÈ hell
9. Redshift: UTF-8 Only
• Redshift has zero tolerance for certain chars
– NUL/0x00 => treated as EOR, documented
– DEL/0x7F => treated as EOR, undocumented
– 0xEFBFBF (U+FFFF) => UTF-8 spec "guaranteed non-char"
– These must be removed before loading data
• Other control characters can be loaded by escaping
– You cannot escape a single column; it's all or nothing
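The pre-load cleanup above is small enough to sketch directly; a minimal Python illustration, assuming the data is already UTF-8 text (the function name is hypothetical):

```python
# Characters Redshift's loader rejects outright: NUL (0x00),
# DEL (0x7F), and the non-character U+FFFF (bytes EF BF BF in UTF-8).
FORBIDDEN = {"\x00", "\x7f", "\uffff"}

def scrub(text: str) -> str:
    """Remove code points that Redshift treats as end-of-record
    or rejects as invalid, before the file is staged for loading."""
    return "".join(ch for ch in text if ch not in FORBIDDEN)
```

Run this over every extract before compression; legitimate multi-byte characters pass through untouched.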
10. SQL Server: UTF-16LE Only
• NVARCHAR takes 2x as much space as VARCHAR
• Makes functions consistent across ASCII & Unicode
– N/VARCHAR(32) = 32 chars / Redshift = 32 bytes
• SQL Server tolerates anything in character columns
• Input and output are not sanitized against the UTF-16 spec
– Invalid or "guaranteed non-chars" are stored as-is
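The chars-vs-bytes distinction above is easy to check; a quick Python illustration (the sample string is arbitrary):

```python
# SQL Server's (N)VARCHAR(32) counts characters;
# Redshift's VARCHAR(32) counts bytes after UTF-8 encoding.
s = "Ünïcødé"                        # 7 characters
chars = len(s)                       # what SQL Server counts
bytes_utf8 = len(s.encode("utf-8"))  # what Redshift counts

print(chars, bytes_utf8)  # → 7 11
```

So a column that fits in SQL Server can overflow in Redshift; pad declared widths for non-ASCII data.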
11. SQL Extract: The Hard Way
• BCP is the “standard” way to extract data
• Using BCP your process looks something like this:
– Extract data as a huge UTF-16LE file using bcp
– Convert to a new UTF-8 file using iconv
– Remove or escape problem chars using sed
– Compress the final file using gzip
– All steps are heavily constrained by disk speed
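The convert / clean / compress steps above can be collapsed into one streaming pass; a hedged Python sketch of the iconv + sed + gzip stages (file paths and the function name are hypothetical):

```python
import gzip

def convert_extract(src_path: str, dst_path: str) -> None:
    """Re-encode a UTF-16LE bcp extract as gzipped UTF-8,
    dropping the characters Redshift rejects. Combines the
    iconv, sed, and gzip steps in a single pass over the file."""
    forbidden = {"\x00", "\x7f", "\uffff"}
    with open(src_path, "r", encoding="utf-16-le") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            dst.write("".join(c for c in line if c not in forbidden))
```

This still lands the source file on disk first, which is why the one-liner on the next slide is the easier route.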
12. SQL Extract: The Easy Way
SQLCMD one-liner for extracts:
chcp 65001 &                            ← set the cmd code page to UTF-8
sqlcmd -E                               ← interactive SQL terminal (trusted connection)
  -Q "SET NOCOUNT ON;                   ← prevent summary in output
      SELECT * FROM Db.Schema.Table;"   ← select from the table / view
  -h-1                                  ← no column headers
  -k1                                   ← remove special characters
  -s"|"                                 ← delimit output with 1 ASCII char
  -W                                    ← no padding in output
  -u                                    ← output in Unicode
  | gzip > "C:\file.gz"                 ← pipe stdout to gzip
13. Data Encryption
• On SQL Server we use TDE
• Redshift offers AES-encrypted data on disk
• Redshift can load client-side encrypted data
• Client-side encryption only applies while on S3
• "Small performance penalty" for using AES
14. Security
• S3 Access => Create bucket(s) just for Redshift staging
• Redshift admin => Use IAM, create automation user(s)
• Redshift database =>
– Do not use the admin user; it's like SQL Server's 'sa'
• Database objects =>
– Must actively GRANT access to each object
– Use groups to make management easier
15. Sizing your cluster
• Redshift is over-provisioned on storage
• Redshift is super efficient at compression
– Compression not affected by the data model
• Redshift scale out is almost perfectly linear
– 2 nodes is twice as fast as 1 node
• You'll be sizing your cluster for speed!
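Using the figures from the speaker notes (~2 TB of storage per node, on-cluster size roughly 2x the gzipped UTF-8 input), the storage floor is simple arithmetic; a sketch with a hypothetical function name, remembering you'll usually add nodes for speed, not disk:

```python
import math

def nodes_for_storage(gzipped_gb: float, node_tb: float = 2.0) -> int:
    """Minimum node count on storage alone, assuming Redshift data
    is ~2x the gzipped UTF-8 input (our observed ratio) and each
    node carries ~2 TB. Treat this as a floor, not a target."""
    redshift_gb = gzipped_gb * 2
    return max(1, math.ceil(redshift_gb / (node_tb * 1024)))
```

For example, 3 TB of gzipped input implies roughly 6 TB on cluster, so at least 3 nodes; speed requirements will usually push the count higher.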
16. Performance
• Redshift speed depends on node count
– A single node is not particularly fast
• Loading speed appears to be linked to S3 speed
– You must use multiple files for bulk loads
• Query speed appears to be CPU constrained
– Vacuum runs 250 MB/s, queries <20 MB/s
• Data modeling matters for complex query speed
– Use a star schema & well chosen distribution key
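Since bulk loads parallelize across files, splitting the extract is worth automating; a minimal sketch that round-robins rows into N gzipped parts (file naming is an assumption, not a Redshift requirement):

```python
import gzip

def split_for_copy(lines, prefix, parts=8):
    """Round-robin lines into N gzipped part files so a single
    bulk load can pull them from S3 in parallel."""
    files = [gzip.open(f"{prefix}.part{i:02d}.gz", "wt", encoding="utf-8")
             for i in range(parts)]
    try:
        for n, line in enumerate(lines):
            files[n % parts].write(line)
    finally:
        for f in files:
            f.close()
```

A common rule of thumb is one file per slice in the cluster, so the `parts` value should track your node count.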
17. Data Modeling
2 main concepts to learn
• Distribution key
– Where data is placed, which node & slice
– Needs to be common across most tables
• Sort key
– How data is ordered on disk within the slice
– Good sort keys simplify expensive joins
18. Database Maintenance
• Data loaded to non-empty tables is not sorted
• Data loaded to non-empty tables may kill their stats
• ANALYZE rebuilds the stats without making changes
• VACUUM re-sorts the physical data and rebuilds stats
– Needed to get the best performance
– Very similar to a REBUILD in SQL Server
19. Database Backups
• Redshift ‘backups’ are snapshots of the system
• Taken very quickly, much slower to restore
• Redshift automatically takes intra-day snapshots
• Manual snapshots can be run using AWS cmd line
• Snapshot storage is free up to size of cluster storage
• Snapshots must be restored to an identical cluster
• Snapshots cannot be restored to a running cluster
20. Code Changes
Code changes required so far
• ROW_NUMBER() is missing in Redshift
• We gain LAG() and LEAD(), which helps
• But it is very difficult to persist an order value
• DATETIMEOFFSET (i.e. with timezone) is not available
• DATETIMEs are now split into 2 columns
• Work in progress…
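One way to emulate the missing DATETIMEOFFSET is to carry a UTC timestamp plus the original offset; a hedged sketch of that two-column split (the exact column layout here is an assumption, not necessarily the scheme used in the talk):

```python
from datetime import datetime, timezone, timedelta

def split_datetimeoffset(dt: datetime) -> tuple[str, int]:
    """Split a timezone-aware datetime into two loadable columns:
    a UTC timestamp string and the original offset in minutes.
    `dt` must carry tzinfo; naive datetimes are not handled."""
    offset_min = int(dt.utcoffset().total_seconds() // 60)
    utc_ts = dt.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")
    return utc_ts, offset_min

# Example: noon at UTC+05:30 becomes 06:30 UTC plus a 330-minute offset
ts, off = split_datetimeoffset(
    datetime(2014, 3, 1, 12, 0,
             tzinfo=timezone(timedelta(hours=5, minutes=30))))
```

Storing the offset separately keeps the local wall-clock time recoverable while letting range queries run on a single UTC column.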
22. Come Work With Me!
http://www.realitymine.com/careers/
• Currently trying to fill the following roles:
• Business Intelligence Architect (Redshift!)
• Business Intelligence Developer (Tableau!)
• Test Engineer (Quality!)
• Server Developer (C#!)
• Mobile App Developer (Android! iOS!)
• Project Manager
Editor's Notes
Data and Log are always on different disks. Criss-cross pattern used to balance wear. TempDb split across 8 files (1 per thread).
TDE required for data encryption. Compression used to maximise SSD speed. A lot of tuning done to push CPU and disks harder. We've seen silent partial failures without any indication. Now have to regularly run DBCC to verify databases. So far we've seen a ~20% perf loss over a year.
We're actually using our existing SQL Server automation setup to run batch scripts that execute SQL on Redshift.
Four-byte character support was recently added and that makes things a little easier. SQL Server's REPLACE() function is **broken** and ***cannot remove any of these values***! Yes, really. I can't tell you how fun it was to figure that out, because it wasn't fun at all. All escape-sensitive data must be escaped in all columns. Embedded newlines **must** be escaped as '\n'.
vs Oracle, which has LENGTH() for characters and LENGTHB() for bytes. vs Redshift, which has only LENGTH() and no way to get the byte length. SQL Server will tolerate _anything_ inside a character column. No sanitisation of inputs or outputs. UTF-16LE *compatible*, rather than *compliant*. I know this from painful experience.
All web searches will suggest using BCP. All ETL tools actually wrap BCP to get data out. **Forget about BCP. BCP is the enemy.** BCP DOES NOT SUPPORT STDOUT!!!
Voila! UTF-8 output from SQL Server directly to a gzip file.
* On SQL Server we use TDE (transparent data encryption)
  * Data on disk is AES encrypted, transparently.
* Redshift offers AES encryption of the data on disk.
  * Not actively encrypted during use, same as SQL Server.
* Redshift supports loading client-side 'envelope' encrypted data.
  * Good luck with that!
  * Slow: you'll have to land your data on disk and then reprocess it.
  * Custom: you'll have to write your own encrypter using OpenSSL or some such.
* Client-side encryption is somewhat moot as it only applies while data is on S3.
  * My 2p: enable AES on both S3 and Redshift. Call it a day.
* Amazon says there is a 'small performance penalty' for using AES.
  * In practice it seems to be acceptable.
  * I have *not actually tested* it without AES because I don't want to generate 10 billion rows of sample data.
* Managing user and admin access is kind of a pain in Redshift
1. Access to S3
  * Create bucket(s) just for Redshift staging data.
2. Access to Redshift admin
  * Use IAM access controls to limit individuals' access.
  * Create users just for automation and enforce password rotation.
3. Access to Redshift database
  * **Do not allow** use of the admin user - it's like SQL Server's `sa`.
  * Create a 1:1 map of external users to Redshift users (no LDAP/AD support)
4. Access to specific database objects
  * You must actively `GRANT` access to each object.
  * Use groups to make this task easier.
  * We have just 2 groups: "admin" (`GRANT ALL`) and "readers" (`GRANT SELECT`)
* Redshift nodes are waaaaaay over-provisioned on storage
  * 2 TB of storage available per node
* Redshift is suuuuuper efficient at compression
  * Our data in Redshift is roughly 2x the gzipped UTF8 input.
  * The size varies depending on how we sort the tables.
* Therefore you'll be sizing the cluster for **speed**.
  * You add nodes to go faster, _not when you run out of disk._
* Tough to get your head around.
Still faster than SQL Server on PCIe SSDs for our data. You must use multiple files for bulk loads.
You cannot schedule these AFAICT. They are auto-deleted on a schedule you can set. Default auto-delete is 1 day. Priced same as S3 beyond cluster size.