Learn how Boingo Wireless and online media provider Edmunds gained substantial business insights and saved money and time by migrating to Amazon Redshift. Get an inside look at how they accomplished their migration from on-premises solutions. Learn how they tuned their schema and queries to take full advantage of the columnar MPP architecture in Amazon Redshift, how they leveraged third-party solutions, and how they met their business intelligence needs in record time.
2. Amazon Redshift
• Relational data warehouse
• Massively parallel; petabyte scale
• Fully managed
• HDD and SSD platforms
• $1,000/TB/year; starts at $0.25/hour
• A lot faster, a lot simpler, a lot cheaper
4. Data loading options
• Parallel upload to Amazon S3
• AWS Direct Connect
• AWS Import/Export
• Amazon Kinesis
• Systems integrators
5. Amazon Redshift architecture
Leader node
• Simple SQL endpoint (JDBC/ODBC)
• Stores metadata
• Optimizes the query plan
• Coordinates query execution
Compute nodes
• Local columnar storage
• Parallel/distributed execution of all queries, loads, backups, restores, and resizes
• Connected by a 10 GigE (HPC) network, which also carries ingestion, backup, and restore traffic
• Start at $0.25/hour, grow to 2 PB (compressed)
• DC1: SSD; scale from 160 GB to 326 TB
• DS2: HDD; scale from 2 TB to 2 PB
6. Amazon Redshift is priced to analyze all your data
DS2 (HDD)            Price per hour (DW1.XL single node)   Effective annual price per TB (compressed)
On-demand            $0.850                                $3,725
1-year reservation   $0.500                                $2,190
3-year reservation   $0.228                                $999

DC1 (SSD)            Price per hour (DW2.L single node)    Effective annual price per TB (compressed)
On-demand            $0.250                                $13,690
1-year reservation   $0.161                                $8,795
3-year reservation   $0.100                                $5,500
Pricing is simple
• Number of nodes × price/hour (for example, a 10-node DS2 cluster on demand costs 10 × $0.850 = $8.50/hour)
• No charge for the leader node
• No upfront costs
• Pay as you go
7. Common migration patterns
• Data from a variety of relational online transaction processing (OLTP) systems: the structure lends itself to SQL schemas
• Data from logs, devices, sensors, and so on: this data is less structured
8. Structured data loading
• Data is often already being loaded into another warehouse by an existing ETL process
• The temptation is to “lift and shift” the workload
• Resist the temptation; instead consider:
• What do I really want to do?
• What do I need?
9. Ingesting less-structured data
• Some data does not lend itself to a relational schema
• A common pattern is to use Amazon EMR to:
• Impose structure
• Import into Amazon Redshift
• Other solutions are often home-grown scripting applications
10. Loading data
• Load into an empty Amazon Redshift database (“truncate and load”)
• Load changes captured in the source system into Amazon Redshift
11. Truncate and load
This is by far the easiest option:
• Move the data to Amazon S3
• Multi-part upload
• Import/export service
• AWS Direct Connect
• COPY the data into Amazon Redshift, a table at a time
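A minimal sketch of the pattern, assuming a hypothetical sales table, S3 bucket, and placeholder IAM role ARN; adjust the file-format options to match your data:

```sql
-- Truncate-and-load: replace the table's contents wholesale.
TRUNCATE sales;

-- One COPY per table; the key prefix picks up every matching file on S3.
COPY sales
FROM 's3://mybucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
DELIMITER '|';
```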
12. Load changes
• Identify changes in source systems
• Move data to Amazon S3
• Load changes:
• ‘Upsert process’
• Partner ETL tools
13. Partner ETL
• Amazon Redshift is supported by a variety of ETL vendors
• Many simplify the process of data loading
• A variety of vendors offer a free trial of their products, allowing you to evaluate and choose the one that suits your needs
• Visit http://aws.amazon.com/redshift/partners
14. Upsert
• The goal is to insert new rows into, and update changed rows in, Amazon Redshift
• Load data into a temporary staging table
• Join the staging table with production and delete the common rows
• Copy the new data into the production table
• See Updating and Inserting New Data in the Amazon Redshift Database Developer Guide
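The merge pattern from the developer guide looks roughly like this; the sales table and its id column are hypothetical stand-ins for your production table and key:

```sql
BEGIN;

-- 1. Load the captured changes into a temporary staging table.
CREATE TEMP TABLE sales_staging (LIKE sales);
COPY sales_staging
FROM 's3://mybucket/changes/'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP;

-- 2. Delete the production rows that are being replaced.
DELETE FROM sales
USING sales_staging
WHERE sales.id = sales_staging.id;

-- 3. Insert both the new and the updated rows.
INSERT INTO sales SELECT * FROM sales_staging;

DROP TABLE sales_staging;
COMMIT;
```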
15. COPY command
• Set COMPUPDATE to ON when loading into an empty table
• Use the COPY command
• Each slice can load one file at a time
• Partition input files so all slices can load in parallel
• Use a manifest file
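For example, a first load into an empty table might look like the following sketch (the events table, bucket, and role are placeholders); COMPUPDATE ON lets COPY sample the incoming data and choose column compression encodings automatically:

```sql
COPY events
FROM 's3://mybucket/events/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
COMPUPDATE ON;  -- analyze the data and apply compression encodings
```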
16. Use multiple input files to maximize throughput
• Use the COPY command
• Each slice can load one file at a time
• A single input file means only one slice is ingesting data
• Instead of 100 MB/s, you’re getting only 6.25 MB/s (100 MB/s ÷ 16 slices)
17. Use multiple input files to maximize throughput
• Use the COPY command
• You need at least as many input files as you have slices
• With 16 input files, all slices are working, so you maximize throughput
• Get 100 MB/s per node; scale linearly as you add nodes
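As a sketch, assuming a 16-slice cluster and input split into 16 gzip files named part-0000.gz through part-0015.gz (names and bucket are hypothetical), a single COPY against the shared prefix loads all files in parallel, one per slice:

```sql
-- All 16 files share the prefix, so one COPY engages every slice.
COPY events
FROM 's3://mybucket/events/part-'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP;
```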
18. Primary keys and manifest files
• Amazon Redshift doesn’t enforce primary key constraints:
• If you load data multiple times, Amazon Redshift won’t complain
• If you declare primary keys in your data definition language (DDL), the optimizer expects the data to be unique
• Use manifest files to control exactly what is loaded and how to respond if input files are missing:
• Define a JSON manifest on Amazon S3
• Ensures that the cluster loads exactly what you want
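A sketch of a manifest-driven load; the bucket, file names, and role are hypothetical. With "mandatory": true, COPY fails rather than silently skipping a missing file:

```sql
-- Contents of s3://mybucket/load.manifest (a JSON file on S3):
-- {
--   "entries": [
--     {"url": "s3://mybucket/events/part-0000.gz", "mandatory": true},
--     {"url": "s3://mybucket/events/part-0001.gz", "mandatory": true}
--   ]
-- }
COPY events
FROM 's3://mybucket/load.manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
GZIP
MANIFEST;  -- treat the FROM target as a manifest, not a key prefix
```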
20. Agenda
- Data Architecture
- Success Criteria
- Solutions Evaluated
- Additional Benefits
- Big Data Agility
- Summary
21. Boingo: Reaching 1 Billion Consumers Annually
• Media: largest ad network engaging mobile audiences via Wi-Fi; 90+ M ad engagements/year
• Wi-Fi: largest operator of airport wireless networks in the world (100+ worldwide); 1 million+ hotspots
• DAS: largest operator of independent indoor cellular networks in the U.S.; 19 DAS locations
• Broadband: largest provider of wireless high-speed Internet & TV for the military; nearly 2,000 commercial locations
• 100 operator partners; 100+ countries; 6 continents
22. Boingo on AWS
• Compute and networking: Amazon EC2, AMI, Elastic IP, VPC, VPN connection, gateway(s), Route 53, route table, ELB, Auto Scaling, ENI, Lambda
• Storage and content delivery: S3 (data warehouse), EBS, Glacier, CloudFront
• Database: RDS (Oracle 11g R2, MySQL), ElastiCache
• Admin and security: CloudWatch, Trusted Advisor, IAM, CloudTrail, MFA token
• Deployment: Elastic Beanstalk, CloudFormation, OpsWorks
• App services: SQS
23. Data Architecture
1. ETL: SAP Data Services (Eng. data, flat files via S3)
2. Data storage: Oracle RDS 11g R2 database
3. Reporting: front-end visualization (Business Objects)
24. Issues
• Growing data is making OLAP slow
• Inefficient (mostly) row-based storage
• Standard Oracle compression
• Mediocre IOPS
• Single DB server (no concurrency)
• Not enough memory (64 GB)
• Administration
– Partitioning
– DB patches and updates; OS patches and updates
– Maintenance (backup, snapshots, replication)
– Recovery failures, etc.
• Expensive (license, hardware, support, etc.)
25. Success Criteria
What do we need?
• Memory (at least 256 GB)
• Parallel processing
• Plenty of IOPS
• Less administration
• Low TCO
Growth rate:
• Currently at 15 TB
• 2-3 TB average growth per year
Nice to have:
• Ingest any data type/store
• Real-time streaming analysis
• Massively parallel processing
• Scale (up or down)
• Integrate any (& every) database
• Multiple levels of security
• Smart alerts and monitoring
• Cost effective
• Lower (or zero) CAPEX
• Keep up with the industry
Security/compliance:
• Automated audit reporting
27. AWS Data Solutions
• RDS: Oracle, SQL Server, PostgreSQL, MySQL, Aurora (MySQL compatible)
• NoSQL: small and large scale, non-RDS, schemaless
• In memory: using open source memcached/Redis; works on any database
• Data warehouse (Redshift): petabyte scale; massively parallel processing
Fully managed, no CAPEX, highly secure, scalable
• DAT202: Understanding Database Options on AWS (Wednesday, Oct 7, 11:00 AM - 12:00 PM, San Polo 3501B)
• DAT302 - Relational Database Management Systems in the Cloud: Deploying SQL Server on AWS (Thursday, Oct 8, 5:30 PM - 6:30 PM, San Polo 3501B)
• DAT303: Oracle on AWS and Amazon RDS: Secure, Fast and Scalable (Friday, Oct 9 9:00-10AM, Delfino 4102)
28. Redshift TCO
1. ETL: EaaS (Eng. data, flat files via S3); annual cost: ~$6,500
2. Data storage: Redshift data warehouse; annual cost: $48,500
• Cluster of 50 DB servers, 100 CPU cores, 8 TB SSD storage, 750 GB memory
• Self-organizing cluster(s); 160 GB increments
• Managed service: database installs, patches, OS installs, patches, backup, replication, server maintenance, scaling, security, etc.
3. BI reports: front-end visualization (Business Objects)
Total annual cost: ~$55,000
30. Performance Results
• Query performance (1 year of data): latency in seconds dropped from 7,200 on the existing system to 15 on Redshift
• Data load performance (1 million records): latency in seconds dropped from 2,700 to 15
• ETL annual cost: $55,000 on the existing system vs. ~$6,500 on Redshift
31. Migration and Ease of Use
• Administration and support handled by the managed service: database installs, patches, OS installs, patches, backup, replication, server maintenance, scaling, security, etc.
• Migration time: about 2 months to Redshift vs. about 4 months for other systems
32. TCO
Elasticity:
• Estimated cluster (50 DB servers, 100 CPU cores, 8 TB SSD storage, 750 GB memory; self-organizing cluster(s), 160 GB increments): $48,500/year
• Actual cluster: $12,000/year
Reserved Instances:
• Savings: 40% for up to a 1-year term; 60% for up to a 3-year term
• Options: no upfront, 20%*; partial upfront, 41%-73%; all upfront, 42%-76%
• Cancellation: full refund within 7 days*; prorated refund within 30 days*; prorated refund within 90 days
* For a 1-year term RI
ETL:
• Talend ($6,500) vs. Python scripts ($0)
- ISM208 - The Science of Saving with AWS Reserved Instances (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Delfino 4105)
33. Additional Benefits
1. Access control
• “Deny all” DB cluster
• Firewall rules
• IAM management
2. VPC
• BYOIP
• Ingress access
• Extend to the corporate data center
3. Subnets
• Further isolation inside the VPC
Cloud
• MFA
• Encryption: in transit, SSL with TLS v1.2; at rest, storage encryption
• IAM management
• SEC302 - IAM Best Practices to Live By (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Palazzo K)
• NET201 - Creating Your Virtual Data Center: VPC Fundamentals and Connectivity Options (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Titian 2201B)
• ARC403 - From One to Many: Evolving VPC Design (Wednesday, Oct 7, 2:45 PM - 3:45 PM, Palazzo N)
35. Monitoring and Alerts
Intrusion detection service:
• DDoS
• MiTM
• IP spoofing
• Packet sniffing
• Port monitoring
• DVO303 - Scaling Infrastructure Operations with AWS Service Catalog, AWS Config, and AWS CloudTrail (Friday, Oct 9, 9:00 AM - 10:00 AM, Lido 3001B)
• ARC302 - Running Lean Architectures: How to Optimize for Cost Efficiency (Friday, Oct 9, 9:00 AM - 10:00 AM, Palazzo K)
36. Big Data Agility
Production data warehouse:
• Cluster of 50 DB servers, 100 CPU cores, 8 TB SSD storage, 750 GB memory
• Self-organizing cluster(s); 160 GB increments
From a backup, spin up additional clusters in under 30 minutes, each at under $5/hour:
• QA cluster
• Predictive analysis/ad hoc cluster
• Performance cluster
DAT311 - Large-Scale Genomic Analysis with Amazon Redshift (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Lando 4306)
DAT308 - How Yahoo! Analyzes Billions of Events a Day on Amazon Redshift (Thursday, Oct 8, 4:15 PM - 5:15 PM, Palazzo C)
BDT401 - Amazon Redshift Deep Dive: Tuning and Best Practices (Thursday, Oct 8, 2:45 PM - 3:45 PM, Marcello 4506)
37. Summary
• (Very) Cost Efficient
• (Highly) Secure (Enterprise grade Encryption)
• Managed service (Administration)
• Quick(er) Migration time
• 167+ Security and Compliance features
• Proven to work (NASDAQ, NASA, Financial Times, Pinterest, etc.)
• Faster, with better performance
• Future proof (Ecosystem, security, new services etc.)
• 2+ years on AWS
• Ease of use
ROI
38. Related Sessions
• DAT311 - Large-Scale Genomic Analysis with Amazon Redshift (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Lando 4306)
• DAT308 - How Yahoo! Analyzes Billions of Events a Day on Amazon Redshift (Thursday, Oct 8, 4:15 PM - 5:15 PM, Palazzo C)
• BDT401 - Amazon Redshift Deep Dive: Tuning and Best Practices (Thursday, Oct 8, 2:45 PM - 3:45 PM, Marcello 4506)
• DAT202 - Understanding Database Options on AWS (Wednesday, Oct 7, 11:00 AM - 12:00 PM, San Polo 3501B)
• DAT302 - Relational Database Management Systems in the Cloud: Deploying SQL Server on AWS (Thursday, Oct 8, 5:30 PM - 6:30 PM, San Polo 3501B)
• DAT303 - Oracle on AWS and Amazon RDS: Secure, Fast and Scalable (Friday, Oct 9, 9:00 AM - 10:00 AM, Delfino 4102)
• SEC302 - IAM Best Practices to Live By (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Palazzo K)
• NET201 - Creating Your Virtual Data Center: VPC Fundamentals and Connectivity Options (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Titian 2201B)
• ARC403 - From One to Many: Evolving VPC Design (Wednesday, Oct 7, 2:45 PM - 3:45 PM, Palazzo N)
• DVO303 - Scaling Infrastructure Operations with AWS Service Catalog, AWS Config, and AWS CloudTrail (Friday, Oct 9, 9:00 AM - 10:00 AM, Lido 3001B)
• ISM208 - The Science of Saving with AWS Reserved Instances (Wednesday, Oct 7, 1:30 PM - 2:30 PM, Delfino 4105)
• ARC302 - Running Lean Architectures: How to Optimize for Cost Efficiency (Friday, Oct 9, 9:00 AM - 10:00 AM, Palazzo K)
41. 59% of car buyers influenced by Edmunds.com (*R. L. Polk & Co.)
43. Edmunds.com
• 18M unique visitors a month
• 200M+ page views a month
• Over 10k dealer partners
• 14k+ API users
• Over 6M automotive inventory
• Over 1M content pages
• Lots and lots of data
• Continuously growing data
• 24x7 real-time BI
• DWH in Amazon Redshift
• 32-node cluster
44. Improvement
From unsustainable, painful operations to:
• Efficient, cost-effective cluster
• Squeak-free operations
• Happy customers
• Cost reduction (the new system costs 1/5 of the old one)
45. Challenges
• Painfully slow queries
• High system resource utilization
• Slow data loading
• Timeouts!
• …all in all, we were running into HUGE PROBLEMS
46. Lessons learned
• Know the system, the strengths, and the limitations
• Understand the end-to-end usage scenario
• Design the processes following Best Practices
• Invest in real-time monitoring
• Lift and shift may not be the best choice
• Let Enterprise Support and TAMs be your partners
• Monitor, monitor, and trend
47. The System, the infrastructure
• Syntactical differences (e.g., PostgreSQL 7 vs. PostgreSQL 8)
• Architectural choices (e.g., columnar database)
• Transaction processing vs. historical data analysis and business intelligence
• Node type, cluster size
• Shared infrastructure vs. dedicated throughput
• The larger the cluster, the bigger the resizing effort
48. Make the up-front investment: Design
• Select the right sort key
• Timestamp, range filtering on column name, joins
• Compound sort key, interleaved sort key
• Measure query performance, system load, and vacuum
• Ensuring tables have a sort key alone helped us gain 20% performance
• Over 50% of our tables did not have a sort key
• Assigning the right sort key is the path to winning (a sketch follows below)
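A sketch of both sort key styles on a hypothetical page_views table; a compound key suits queries that always filter on the leading timestamp, while an interleaved key weights each column equally:

```sql
-- Compound sort key: best when queries range-filter on view_time.
CREATE TABLE page_views (
  view_time TIMESTAMP NOT NULL,
  user_id   BIGINT,
  page_url  VARCHAR(2048)
)
COMPOUND SORTKEY (view_time, user_id);

-- Interleaved alternative, when different queries filter on different columns:
-- CREATE TABLE page_views (...) INTERLEAVED SORTKEY (view_time, user_id);
```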
49. Make the up-front investment: Use cases
• Select the right distribution style
• Locate data faster
• Uniform load
• Less data movement
• A good distribution style ensures a healthy system (examples follow below)
• Many of our tables did not have the right distribution style
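For illustration, the three distribution styles on hypothetical tables; KEY co-locates joining rows, EVEN spreads rows uniformly, and ALL replicates a small table to every node:

```sql
-- KEY: rows with the same user_id land on the same slice,
-- so joins on user_id avoid cross-node data movement.
CREATE TABLE sessions (
  user_id    BIGINT,
  started_at TIMESTAMP
)
DISTSTYLE KEY DISTKEY (user_id);

-- EVEN: round-robin rows for a uniform load (no dominant join column).
CREATE TABLE raw_events (payload VARCHAR(65535)) DISTSTYLE EVEN;

-- ALL: copy a small dimension table to every node.
CREATE TABLE countries (code CHAR(2), name VARCHAR(64)) DISTSTYLE ALL;
```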
50. Queries
• SELECT * is the #1 performance killer
• Use a WHERE clause on the primary sort column
• Watch out for queries that create “temporary tables”
• Long-running queries might impact downstream services
• Define constraints
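For example, against the hypothetical page_views table sorted on view_time, name only the columns you need and filter on the leading sort column so the scan can skip blocks:

```sql
SELECT user_id, page_url          -- explicit columns, not SELECT *
FROM page_views
WHERE view_time >= '2015-09-01'   -- predicate on the primary sort column
  AND view_time <  '2015-10-01';
```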
51. VACUUM
• Run VACUUM frequently
• Run right after loading data
• Monitor vacuum time
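A minimal post-load routine for the hypothetical page_views table; the SVV_TABLE_INFO query is one way to see how much of a table remains unsorted:

```sql
VACUUM page_views;    -- re-sort rows and reclaim deleted space
ANALYZE page_views;   -- refresh planner statistics after the vacuum

-- Percentage of unsorted rows per table (system view):
SELECT "table", unsorted
FROM svv_table_info
WHERE "table" = 'page_views';
```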
52. Data loading
• Load data in sort key order
• Load using multiple files (1 MB to 1 GB each)
• Number of files: a multiple of the number of slices in the cluster
• Use compression
• Use a single COPY command
• S3 is your best friend
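To size the file count, one option is to ask the cluster how many slices it has (via the STV_SLICES system table) and split the input into a multiple of that number:

```sql
-- Total slices in the cluster; split input files into a multiple of this.
SELECT COUNT(*) AS slice_count FROM stv_slices;
```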
53. A closer look
• Each node is split into slices
• One slice per core
• Each slice is allocated memory, CPU, and disk space
• Each slice processes a piece of the workload in parallel
56. Monitoring
• Console/Amazon CloudWatch monitoring
• CPU, memory, processes
• Data distribution across slices
• Space used per table
• WLM query count, queue wait time, execution time
• Commit stats, top time-consuming queries
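Two sketch queries against Redshift system views that cover several of these checks; column names follow SVV_TABLE_INFO and STL_WLM_QUERY as documented:

```sql
-- Largest tables, with row counts and distribution skew:
SELECT "table", size AS size_mb, tbl_rows, skew_rows
FROM svv_table_info
ORDER BY size DESC
LIMIT 10;

-- WLM queue wait vs. execution time (microseconds -> seconds):
SELECT service_class,
       AVG(total_queue_time) / 1000000.0 AS avg_queue_secs,
       AVG(total_exec_time)  / 1000000.0 AS avg_exec_secs
FROM stl_wlm_query
GROUP BY service_class
ORDER BY service_class;
```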
57. In closing
• Amazon Redshift is a great data warehousing platform
• Parting advice: invest in best practices
• Check out Redshift Utils (the amazon-redshift-utils repository on GitHub)