Learn how Boingo Wireless and online media provider Edmunds gained substantial business insights and saved money and time by migrating to Amazon Redshift. Get an inside look at how they accomplished their migration from on-premises solutions. Learn how they tuned their schema and queries to take full advantage of the columnar MPP architecture in Amazon Redshift, how they leveraged third-party solutions, and how they met their business intelligence needs in record time.
2. Amazon Redshift
• Relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
• A lot faster, a lot simpler, a lot cheaper
4. Data loading options
• Parallel upload to Amazon S3
• AWS Direct Connect
• AWS Import/Export
• Amazon Kinesis
• Systems integrators
5. Amazon Redshift architecture
Leader node
• Simple SQL endpoint (JDBC/ODBC)
• Stores metadata
• Optimizes the query plan
• Coordinates query execution
Compute nodes
• Local columnar storage
• Parallel/distributed execution of all queries, loads, backups, restores, and resizes
• Connected by a 10 GigE (HPC) network, which also carries ingestion, backup, and restore traffic
• Start at $0.25/hour, grow to 2 PB (compressed)
• DC1: SSD; scale from 160 GB to 326 TB
• DS2: HDD; scale from 2 TB to 2 PB
6. Amazon Redshift is priced to analyze all your data
DS2 (HDD)            Price per hour (DW1.XL single node)   Effective annual price per TB (compressed)
On-demand            $0.850                                $3,725
1-year reservation   $0.500                                $2,190
3-year reservation   $0.228                                $999

DC1 (SSD)            Price per hour (DW2.L single node)    Effective annual price per TB (compressed)
On-demand            $0.250                                $13,690
1-year reservation   $0.161                                $8,795
3-year reservation   $0.100                                $5,500
Pricing is simple
• Number of nodes × price/hour (for example, a 10-node DS2 cluster on demand costs 10 × $0.850 = $8.50/hour)
• No charge for the leader node
• No upfront costs
• Pay as you go
7. Common migration patterns
• Data from a variety of relational online transaction processing (OLTP) systems: the structure lends itself to SQL schemas
• Data from logs, devices, sensors, and so on: this data is less structured
8. Structured data loading
• Data is often already being loaded into another warehouse by an existing ETL process
• The temptation is to “lift and shift” the workload
• Resist the temptation; instead consider:
• What do I really want to do?
• What do I need?
9. Ingesting less-structured data
• Some data does not lend itself to a relational schema
• A common pattern is to use Amazon EMR to:
• Impose structure
• Import into Amazon Redshift
• Other solutions are often home-grown scripting applications
10. Loading data
• Load into an empty Amazon Redshift database (“truncate and load”)
• Load changes captured in the source system into Amazon Redshift
11. Truncate and load
This is by far the easiest option:
• Move the data to Amazon S3
• Multi-part upload
• Import/export service
• AWS Direct Connect
• COPY the data into Amazon Redshift, a table at a time
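A minimal sketch of the pattern, assuming a hypothetical sales table, S3 bucket, and placeholder IAM role ARN; adjust the file-format options to match your data:

```sql
-- Truncate-and-load: replace the table's contents wholesale.
TRUNCATE sales;

-- One COPY per table; the key prefix picks up every matching file on S3.
COPY sales
FROM 's3://mybucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
DELIMITER '|';
```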
12. Load changes
• Identify changes in source systems
• Move data to Amazon S3
• Load changes:
• ‘Upsert process’
• Partner ETL tools
13. Partner ETL
• Amazon Redshift is supported by a variety of ETL vendors
• Many simplify the process of data loading
• A variety of vendors offer a free trial of their products, allowing you to evaluate and choose the one that suits your needs
• Visit http://aws.amazon.com/redshift/partners
14. Upsert
• The goal is to insert new rows into, and update changed rows in, Amazon Redshift
• Load data into a temporary staging table
• Join the staging table with production and delete the common rows
• Copy the new data into the production table
• See Updating and Inserting New Data in the Amazon Redshift Database Developer Guide
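The merge pattern from the developer guide looks roughly like this; the sales table and its id column are hypothetical stand-ins for your production table and key:

```sql
BEGIN;

-- 1. Load the captured changes into a temporary staging table.
CREATE TEMP TABLE sales_staging (LIKE sales);
COPY sales_staging
FROM 's3://mybucket/changes/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP;

-- 2. Delete the production rows that are being replaced.
DELETE FROM sales
USING sales_staging
WHERE sales.id = sales_staging.id;

-- 3. Insert both the new and the updated rows.
INSERT INTO sales SELECT * FROM sales_staging;

DROP TABLE sales_staging;
COMMIT;
```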
15. COPY command
• Set COMPUPDATE to ON when loading into an empty table
• Use the COPY command
• Each slice can load one file at a time
• Partition input files so all slices can load in parallel
• Use a manifest file
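For example, a first load into an empty table might look like the following sketch (the events table, bucket, and role are placeholders); COMPUPDATE ON lets COPY sample the incoming data and choose column compression encodings automatically:

```sql
COPY events
FROM 's3://mybucket/events/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
COMPUPDATE ON;  -- analyze the data and apply compression encodings
```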
16. Use multiple input files to maximize throughput
• Use the COPY command
• Each slice can load one file at a time
• A single input file means only one slice is ingesting data
• Instead of 100 MB/s, you’re getting only 6.25 MB/s (100 MB/s ÷ 16 slices)
17. Use multiple input files to maximize throughput
• Use the COPY command
• You need at least as many input files as you have slices
• With 16 input files, all slices are working, so you maximize throughput
• Get 100 MB/s per node; scale linearly as you add nodes
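As a sketch, assuming a 16-slice cluster and input split into 16 gzip files named part-0000.gz through part-0015.gz (names and bucket are hypothetical), a single COPY against the shared prefix loads all files in parallel, one per slice:

```sql
-- All 16 files share the prefix, so one COPY engages every slice.
COPY events
FROM 's3://mybucket/events/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP;
```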
18. Primary keys and manifest files
• Amazon Redshift doesn’t enforce primary key constraints:
• If you load data multiple times, Amazon Redshift won’t complain
• If you declare primary keys in your data definition language (DDL), the optimizer expects the data to be unique
• Use manifest files to control exactly what is loaded and how to respond if input files are missing:
• Define a JSON manifest on Amazon S3
• Ensures that the cluster loads exactly what you want
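A sketch of a manifest-driven load; the bucket, file names, and role are hypothetical. With "mandatory": true, COPY fails rather than silently skipping a missing file:

```sql
-- Contents of s3://mybucket/load.manifest (a JSON file on S3):
-- {
--   "entries": [
--     {"url": "s3://mybucket/events/part-0000.gz", "mandatory": true},
--     {"url": "s3://mybucket/events/part-0001.gz", "mandatory": true}
--   ]
-- }
COPY events
FROM 's3://mybucket/load.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
MANIFEST;  -- treat the FROM target as a manifest, not a key prefix
```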
20. Agenda
- Data Architecture
- Success Criteria
- Solutions Evaluated
- Additional Benefits
- Big Data Agility
- Summary
21. Boingo: Reaching 1 Billion Consumers Annually
• Media: largest ad network engaging mobile audiences via Wi-Fi; 90+ M ad engagements/year
• Wi-Fi: largest operator of airport wireless networks in the world (100+ worldwide); 1 million+ hotspots
• DAS: largest operator of independent indoor cellular networks in the U.S.; 19 DAS locations
• Broadband: largest provider of wireless high-speed Internet & TV for the military; nearly 2,000 commercial locations
• 100 operator partners; 100+ countries; 6 continents
22. Boingo on AWS
• Compute and networking: Amazon EC2, AMI, Elastic IP, VPC, VPN connection, gateway(s), Route 53, route table, ELB, Auto Scaling, ENI, Lambda
• Storage and content delivery: S3 (data warehouse), EBS, Glacier, CloudFront
• Database: RDS (Oracle 11g R2, MySQL), ElastiCache
• Admin and security: CloudWatch, Trusted Advisor, IAM, CloudTrail, MFA token
• Deployment: Elastic Beanstalk, CloudFormation, OpsWorks
• App services: SQS
23. Data Architecture
1. ETL: SAP Data Services (Eng. data, flat files via S3)
2. Data storage: Oracle RDS 11g R2 database
3. Reporting: front-end visualization (Business Objects)
24. Issues
• Growing data is making OLAP slow
• Inefficient (mostly) row-based storage
• Standard Oracle compression
• Mediocre IOPS
• Single DB server (no concurrency)
• Not enough memory (64 GB)
• Administration
– Partitioning
– DB patches and updates; OS patches and updates
– Maintenance (backup, snapshots, replication)
– Recovery failures, etc.
• Expensive (license, hardware, support, etc.)
25. Success Criteria
What do we need?
• Memory (at least 256 GB)
• Parallel processing
• Plenty of IOPS
• Less administration
• Low TCO
Growth rate:
• Currently at 15 TB
• 2-3 TB average growth per year
Nice to have:
• Ingest any data type/store
• Real-time streaming analysis
• Massively parallel processing
• Scale (up or down)
• Integrate any (& every) database
• Multiple levels of security
• Smart alerts and monitoring
• Cost effective
• Lower (or zero) CAPEX
• Keep up with the industry
Security/compliance:
• Automated audit reporting
27. AWS Data Solutions
• RDS: Oracle, SQL Server, PostgreSQL, MySQL, Aurora (MySQL compatible)
• NoSQL: small and large scale, non-RDS, schemaless
• In memory: using open source memcached/Redis; works on any database
• Data warehouse (Redshift): petabyte scale; massively parallel processing
Fully managed, no CAPEX, highly secure, scalable
• DAT202: Understanding Database Options on AWS (Wednesday, Oct 7, 11:00 AM - 12:00 PM, San Polo 3501B)
• DAT302 - Relational Database Management Systems in the Cloud: Deploying SQL Server on AWS (Thursday, Oct 8, 5:30 PM - 6:30 PM, San Polo 3501B)
• DAT303: Oracle on AWS and Amazon RDS: Secure, Fast and Scalable (Friday, Oct 9 9:00-10AM, Delfino 4102)
28. Redshift TCO
1. ETL: EaaS (Eng. data, flat files via S3); annual cost: ~$6,500
2. Data storage: Redshift data warehouse; annual cost: $48,500
• Cluster of 50 DB servers, 100 CPU cores, 8 TB SSD storage, 750 GB memory
• Self-organizing cluster(s); 160 GB increments
• Managed service: database installs, patches, OS installs, patches, backup, replication, server maintenance, scaling, security, etc.
3. BI reports: front-end visualization (Business Objects)
Total annual cost: ~$55,000
30. Performance Results
• Query performance (1 year of data): latency in seconds dropped from 7,200 on the existing system to 15 on Redshift
• Data load performance (1 million records): latency in seconds dropped from 2,700 to 15
• ETL annual cost: $55,000 on the existing system vs. ~$6,500 on Redshift
31. Migration and Ease of Use
• Administration and support handled by the managed service: database installs, patches, OS installs, patches, backup, replication, server maintenance, scaling, security, etc.
• Migration time: about 2 months to Redshift vs. about 4 months for other systems
32. TCO
Elasticity:
• Estimated cluster (50 DB servers, 100 CPU cores, 8 TB SSD storage, 750 GB memory; self-organizing cluster(s), 160 GB increments): $48,500/year
• Actual cluster: $12,000/year
Reserved Instances:
• Savings: 40% for up to a 1-year term; 60% for up to a 3-year term
• Options: no upfront, 20%*; partial upfront, 41%-73%; all upfront, 42%-76%
• Cancellation: full refund within 7 days*; prorated refund within 30 days*; prorated refund within 90 days
* For a 1-year term RI
ETL:
• Talend ($6,500) vs. Python scripts ($0)
- ISM208 - The Science of Saving with AWS Reserved Instances (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Delfino 4105)
33. Additional Benefits
1. Access control
• “Deny all” DB cluster
• Firewall rules
• IAM management
2. VPC
• BYOIP
• Ingress access
• Extend to the corporate data center
3. Subnets
• Further isolation inside the VPC
Cloud
• MFA
• Encryption: in transit, SSL with TLS v1.2; at rest, storage encryption
• IAM management
• SEC302 - IAM Best Practices to Live By (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Palazzo K)
• NET201 - Creating Your Virtual Data Center: VPC Fundamentals and Connectivity Options (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Titian 2201B)
• ARC403 - From One to Many: Evolving VPC Design (Wednesday, Oct 7, 2:45 PM - 3:45 PM, Palazzo N)
35. Monitoring and Alerts
Intrusion detection service:
• DDoS
• MiTM
• IP spoofing
• Packet sniffing
• Port monitoring
• DVO303 - Scaling Infrastructure Operations with AWS Service Catalog, AWS Config, and AWS CloudTrail (Friday, Oct 9, 9:00 AM - 10:00 AM, Lido 3001B)
• ARC302 - Running Lean Architectures: How to Optimize for Cost Efficiency (Friday, Oct 9, 9:00 AM - 10:00 AM, Palazzo K)
36. Big Data Agility
Production data warehouse:
• Cluster of 50 DB servers, 100 CPU cores, 8 TB SSD storage, 750 GB memory
• Self-organizing cluster(s); 160 GB increments
From a backup, spin up additional clusters in under 30 minutes, each at under $5/hour:
• QA cluster
• Predictive analysis/ad hoc cluster
• Performance cluster
DAT311 - Large-Scale Genomic Analysis with Amazon Redshift (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Lando 4306)
DAT308 - How Yahoo! Analyzes Billions of Events a Day on Amazon Redshift (Thursday, Oct 8, 4:15 PM - 5:15 PM, Palazzo C)
BDT401 - Amazon Redshift Deep Dive: Tuning and Best Practices (Thursday, Oct 8, 2:45 PM - 3:45 PM, Marcello 4506)
37. Summary
• (Very) Cost Efficient
• (Highly) Secure (Enterprise grade Encryption)
• Managed service (Administration)
• Quick(er) Migration time
• 167+ Security and Compliance features
• Proven to work (NASDAQ, NASA, Financial Times, Pinterest, etc.)
• Faster, with better performance
• Future proof (Ecosystem, security, new services etc.)
• 2+ years on AWS
• Ease of use
ROI
38. Related Sessions
• DAT311 - Large-Scale Genomic Analysis with Amazon Redshift (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Lando 4306)
• DAT308 - How Yahoo! Analyzes Billions of Events a Day on Amazon Redshift (Thursday, Oct 8, 4:15 PM - 5:15 PM, Palazzo C)
• BDT401 - Amazon Redshift Deep Dive: Tuning and Best Practices (Thursday, Oct 8, 2:45 PM - 3:45 PM, Marcello 4506)
• DAT202 - Understanding Database Options on AWS (Wednesday, Oct 7, 11:00 AM - 12:00 PM, San Polo 3501B)
• DAT302 - Relational Database Management Systems in the Cloud: Deploying SQL Server on AWS (Thursday, Oct 8, 5:30 PM - 6:30 PM, San Polo 3501B)
• DAT303 - Oracle on AWS and Amazon RDS: Secure, Fast and Scalable (Friday, Oct 9, 9:00 AM - 10:00 AM, Delfino 4102)
• SEC302 - IAM Best Practices to Live By (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Palazzo K)
• NET201 - Creating Your Virtual Data Center: VPC Fundamentals and Connectivity Options (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Titian 2201B)
• ARC403 - From One to Many: Evolving VPC Design (Wednesday, Oct 7, 2:45 PM - 3:45 PM, Palazzo N)
• DVO303 - Scaling Infrastructure Operations with AWS Service Catalog, AWS Config, and AWS CloudTrail (Friday, Oct 9, 9:00 AM - 10:00 AM, Lido 3001B)
• ISM208 - The Science of Saving with AWS Reserved Instances (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Delfino 4105)
• ARC302 - Running Lean Architectures: How to Optimize for Cost Efficiency (Friday, Oct 9, 9:00 AM - 10:00 AM, Palazzo K)
41. 59% of car buyers influenced by Edmunds.com (*R. L. Polk & Co.)
43. Edmunds.com
• 18M unique visitors a month
• 200M+ page views a month
• Over 10k dealer partners
• 14k+ API users
• Over 6M automotive inventory
• Over 1M content pages
• Lots and lots of data
• Continuously growing data
• 24x7 real-time BI
• DWH in Amazon Redshift
• 32-node cluster
44. Improvement
From unsustainable, painful operations to:
• Efficient, cost-effective cluster
• Squeak-free operations
• Happy customers
• Cost reduction (the new system costs 1/5 of the old one)
45. Challenges
• Painfully slow queries
• High system resource utilization
• Slow data loading
• Timeouts!
• …all in all, we were running into HUGE PROBLEMS
46. Lessons learned
• Know the system, the strengths, and the limitations
• Understand the end-to-end usage scenario
• Design the processes following Best Practices
• Invest in real-time monitoring
• Lift and shift may not be the best choice
• Let Enterprise Support and TAMs be your partners
• Monitor, monitor, and trend
47. The System, the infrastructure
• Syntactical differences (e.g., PostgreSQL 7 vs. PostgreSQL 8)
• Architectural choices (e.g., columnar database)
• Transaction processing vs. historical data analysis and business intelligence
• Node type, cluster size
• Shared infrastructure vs. dedicated throughput
• The larger the cluster, the bigger the resizing effort
48. Make the up-front investment: Design
• Select the right sort key
• Timestamp, range filtering on column name, joins
• Compound sort key, interleaved sort key
• Measure query performance, system load, and vacuum
• Ensuring tables have a sort key alone helped us gain 20% performance
• Over 50% of our tables did not have a sort key
• Assigning the right sort key is the path to winning (a sketch follows below)
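A sketch of both sort key styles on a hypothetical page_views table; a compound key suits queries that always filter on the leading timestamp, while an interleaved key weights each column equally:

```sql
-- Compound sort key: best when queries range-filter on view_time.
CREATE TABLE page_views (
  view_time TIMESTAMP NOT NULL,
  user_id   BIGINT,
  page_url  VARCHAR(2048)
)
COMPOUND SORTKEY (view_time, user_id);

-- Interleaved alternative, when different queries filter on different columns:
-- CREATE TABLE page_views (...) INTERLEAVED SORTKEY (view_time, user_id);
```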
49. Make the up-front investment: Use cases
• Select the right distribution style
• Locate data faster
• Uniform load
• Less data movement
• A good distribution style ensures a healthy system (examples follow below)
• Many of our tables did not have the right distribution style
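For illustration, the three distribution styles on hypothetical tables; KEY co-locates joining rows, EVEN spreads rows uniformly, and ALL replicates a small table to every node:

```sql
-- KEY: rows with the same user_id land on the same slice,
-- so joins on user_id avoid cross-node data movement.
CREATE TABLE sessions (
  user_id    BIGINT,
  started_at TIMESTAMP
)
DISTSTYLE KEY DISTKEY (user_id);

-- EVEN: round-robin rows for a uniform load (no dominant join column).
CREATE TABLE raw_events (payload VARCHAR(65535)) DISTSTYLE EVEN;

-- ALL: copy a small dimension table to every node.
CREATE TABLE countries (code CHAR(2), name VARCHAR(64)) DISTSTYLE ALL;
```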
50. Queries
• SELECT * is the #1 performance killer
• Use a WHERE clause on the primary sort column
• Watch out for queries that create “temporary tables”
• Long-running queries might impact downstream services
• Define constraints
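For example, against the hypothetical page_views table sorted on view_time, name only the columns you need and filter on the leading sort column so the scan can skip blocks:

```sql
SELECT user_id, page_url          -- explicit columns, not SELECT *
FROM page_views
WHERE view_time >= '2015-09-01'   -- predicate on the primary sort column
  AND view_time <  '2015-10-01';
```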
51. VACUUM
• Run VACUUM frequently
• Run right after loading data
• Monitor vacuum time
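A minimal post-load routine for the hypothetical page_views table; the SVV_TABLE_INFO query is one way to see how much of a table remains unsorted:

```sql
VACUUM page_views;    -- re-sort rows and reclaim deleted space
ANALYZE page_views;   -- refresh planner statistics after the vacuum

-- Percentage of unsorted rows per table (system view):
SELECT "table", unsorted
FROM svv_table_info
WHERE "table" = 'page_views';
```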
52. Data loading
• Load data in sort key order
• Load using multiple files (1 MB to 1 GB each)
• Number of files: a multiple of the number of slices in the cluster
• Use compression
• Use a single COPY command
• S3 is your best friend
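To size the file count, one option is to ask the cluster how many slices it has (via the STV_SLICES system table) and split the input into a multiple of that number:

```sql
-- Total slices in the cluster; split input files into a multiple of this.
SELECT COUNT(*) AS slice_count FROM stv_slices;
```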
53. A closer look
• Each node is split into slices
• One slice per core
• Each slice is allocated memory, CPU, and disk space
• Each slice processes a piece of the workload in parallel
56. Monitoring
• Console/Amazon CloudWatch monitoring
• CPU, memory, processes
• Data distribution across slices
• Space used per table
• WLM query count, queue wait time, execution time
• Commit stats, top time-consuming queries
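Two sketch queries against Redshift system views that cover several of these checks; column names follow SVV_TABLE_INFO and STL_WLM_QUERY as documented:

```sql
-- Largest tables, with row counts and distribution skew:
SELECT "table", size AS size_mb, tbl_rows, skew_rows
FROM svv_table_info
ORDER BY size DESC
LIMIT 10;

-- WLM queue wait vs. execution time (microseconds -> seconds):
SELECT service_class,
       AVG(total_queue_time) / 1000000.0 AS avg_queue_secs,
       AVG(total_exec_time)  / 1000000.0 AS avg_exec_secs
FROM stl_wlm_query
GROUP BY service_class
ORDER BY service_class;
```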
57. In closing
• Amazon Redshift is a great data warehousing platform
• Parting advice: invest in best practices
• Check out Redshift Utils (the amazon-redshift-utils repository on GitHub)