Amazon Redshift is a fast, simple, cost-effective data warehousing solution. In this session, we look at the tools and techniques you can use to migrate your existing data warehouse to Amazon Redshift, and then present a case study on Scholastic’s migration. Scholastic, a large, 100-year-old publishing company, was running its business on older, on-premises data warehousing and analytics solutions that could not keep up with business needs and were expensive. Scholastic also needed new capabilities such as streaming data and real-time analytics. Scholastic migrated to Amazon Redshift and achieved agility and faster time to insight while dramatically reducing costs. In this session, Scholastic discusses how they achieved this, including the options considered, the technical architecture implemented, the results, and lessons learned.
2. Today’s agenda
• Amazon Redshift Overview
• Use cases and benefits
• Migration options
• Scholastic’s use case
• Architecture details
• Technical overview
• Key project learnings
3. Amazon Redshift: relational data warehouse
Massively parallel; petabyte scale
Fully managed
HDD and SSD platforms
$1,000/TB/year; starts at $0.25/hour
A lot faster, a lot simpler, a lot cheaper
4. Forrester Wave™ Enterprise Data Warehouse, Q4 ’15
The Forrester Wave™ is copyrighted by Forrester Research, Inc. Forrester and Forrester Wave™ are trademarks of Forrester Research, Inc. The Forrester Wave™ is a graphical representation of Forrester's call on a market and is plotted using a detailed spreadsheet with exposed scores, weightings, and comments. Forrester does not endorse any vendor, product, or service depicted in the Forrester Wave. Information is based on best available resources. Opinions reflect judgment at the time and are subject to change.
6. Why migrate to Amazon Redshift?
From a transactional database: 100x faster; scales from GBs to PBs; analyze data without storage constraints
From an MPP database: 10x cheaper; easy to provision and operate; higher productivity
From Hadoop: 10x faster; no programming; standard interfaces and integration to leverage BI tools, machine learning, and streaming
7. Migration from Oracle @ Boingo Wireless
2000+ Commercial Wi-Fi locations
1 million+ Hotspots
90M+ ad engagements
100+ countries
Legacy DW: Oracle 11g-based data warehouse
Before migration:
Rapid data growth slowed analytics
Mediocre IOPS, limited memory, vertical scaling
Admin overhead
Expensive (license, hardware, support)
After migration:
180x performance improvement
7x cost savings
9. Migration from Greenplum @ NTT Docomo
68 million customers
10s of TBs of data per day across the mobile network
6 PB of total data (uncompressed)
Data science for marketing, operations, logistics, etc.
Legacy DW: Greenplum on-premises
After migration:
125 node DS2.8XL cluster
4,500 vCPUs, 30TB RAM
6 PB uncompressed
10x faster analytic queries
50% reduction in time for new BI application deployment
Significantly less operational overhead
10. Migration from SQL on Hadoop @ Yahoo
Analytics for website/mobile events across multiple Yahoo properties
On an average day
2B events
25M devices
Before migration: Hive. Found it to be slow, and hard to use, share, and repeat.
After migration:
21 node DC1.8XL (SSD)
50TB compressed data
100x performance improvement
Real-time insights
Easier deployment and maintenance
11. Migration from SQL on Hadoop @ Yahoo
[Chart: query latency in seconds (log scale, 1 to 10,000) for Amazon Redshift vs. Impala across four workloads: count distinct devices, count all events, filter clauses, and joins]
12. Business Value and Productivity
Business Productivity Benefits
Analyze more data
Faster time to market
Get better insights
Match capacity with demand
13. How to migrate? Engine X → Amazon Redshift
What moves: ETL scripts, SQL in reports, ad hoc queries
1. Schema conversion: map data types; choose compression encodings, sort keys, and distribution keys; generate and apply DDL
2. Database migration: bulk load; capture updates
3. Schema and data transformation: convert SQL code; transformations
4. Assess gaps: stored procedures, functions
14. Convert schema in a few clicks: AWS Schema Conversion Tool (AWS SCT)
Sources include Oracle, Teradata, Greenplum, and Netezza
Automatic schema optimization
Converts application SQL code
Detailed assessment report
16. Start your first migration in a few minutes: AWS Database Migration Service (AWS DMS)
Sources include Aurora, Oracle, SQL Server, MySQL, and PostgreSQL
Bulk load and continuous replication (see the sketch below)
Migrate a TB for $3
Fault tolerant
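For teams scripting the migration rather than using the console, a DMS task can also be created programmatically. Below is a minimal, hedged sketch using boto3; the endpoint and replication-instance ARNs, task name, and schema name are placeholders, not values from the session.

```python
import json
import boto3

# Hypothetical ARNs: replace with the endpoints and replication
# instance created for your own source and Redshift target.
SOURCE_ENDPOINT_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:SRC"
TARGET_ENDPOINT_ARN = "arn:aws:dms:us-east-1:123456789012:endpoint:TGT"
REPLICATION_INSTANCE_ARN = "arn:aws:dms:us-east-1:123456789012:rep:INST"

dms = boto3.client("dms")

# Migrate every table in a hypothetical "sales" schema, then keep
# applying changes (bulk load plus continuous replication).
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-sales-schema",
            "object-locator": {"schema-name": "sales", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

response = dms.create_replication_task(
    ReplicationTaskIdentifier="sales-to-redshift",
    SourceEndpointArn=SOURCE_ENDPOINT_ARN,
    TargetEndpointArn=TARGET_ENDPOINT_ARN,
    ReplicationInstanceArn=REPLICATION_INSTANCE_ARN,
    MigrationType="full-load-and-cdc",   # bulk load, then capture updates
    TableMappings=json.dumps(table_mappings),
)
print(response["ReplicationTask"]["Status"])
```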
21. Where were we?
Platform
13+ years old. IBM AS/400 DB2 and Microsoft SQL Server are the primary data warehouse platforms. The BI platform is primarily Microsoft (SSRS, SSAS, Excel, SharePoint)
500+ direct users across every LOB and business function
20+ TB. 5,500+ DB2 workloads, 350+ SQL Server workloads, 15 SSAS cubes, 150+ SSRS reports
Challenges
Inflexible, multi-layered architecture – slow time to market
Inability to meet internal SLAs due to performance of daily ETL processes
Scalability limitations with SQL Server Analysis Services (SSAS) for reports
Limited ability to perform self-service Business Intelligence
22. Moving forward: Key decision factors
• Improved performance, scalability, availability, logging, security
• Enablement of self-service business intelligence
• Leverage the skill set of the current team (relational DB & SQL)
• Integration with existing technology stack
• Alignment with the tech strategy (DevOps model, Cloud First)
• Ability to support Big Data initiatives
• Team up with an experienced consulting partner
23. Why we chose AWS and Amazon Redshift
AWS was chosen for its agility, scalability, elasticity, and security
Redshift
• Scalable, fast
• Managed service, cost-optimization models, elastic
• SQL/relational matched the team’s skill set
S3 was chosen as the location for the ingestion process
NorthBay was chosen as the implementation partner for their expertise in Big Data and Redshift migrations
24. How the project unfolded
Goals
• 3-month pilot to migrate a Functional area in key LOB
• Demonstrate immediate business value
• Use AWS stack and open source for data movement from DB2 (no CDC/ETL tool)
Outcomes
• Core Framework for Migration
• ELT Architecture and Validation
• Visualization/Self-service capability through Tableau
26. Core Framework
• Jobs and Job Groups are defined as metadata in DynamoDB (see the sketch below)
• Control-M scheduler, a custom application, and Data Pipeline for orchestration
• ELT process with EMR/Sqoop for extraction; load and transform the data through Redshift SQL scripts
• The core framework enables:
  • Restart capability from the point of failure
  • Capturing of operational statistics (# of rows updated, etc.)
  • Audit capability (which feed caused the Fact to change, etc.)
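As a rough illustration of the metadata-driven approach, the sketch below registers one job definition in DynamoDB with boto3. The table name, keys, and attributes are assumptions for illustration; Scholastic's actual metadata schema was not shown in the session.

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# Hypothetical metadata table keyed by job group and job name.
jobs_table = dynamodb.Table("elt_job_metadata")

# Register one extraction job; the orchestrators (Control-M, the custom
# application, and Data Pipeline) would read items like this to decide
# what to run, track status, and restart from the point of failure.
jobs_table.put_item(
    Item={
        "job_group": "lob1_daily",
        "job_name": "extract_orders",
        "source": "DB2.ORDERS",
        "target": "staging.orders",
        "load_type": "truncate_and_load",
        "status": "PENDING",
        "rows_loaded": 0,        # operational statistic captured per run
        "last_batch_id": "",     # supports restart and audit
    }
)
```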
27. Extract
• Pre-create EMR resources at the start of Batch
• Achieve parallelism in Sqoop with mappers and Fair Scheduling
• Sqoop query adds additional fields like Batch_id, Updated_date, etc. (see the sketch below)
• Data extracts are split and compressed for optimized loading into Redshift
[Diagram: Data Pipeline orchestrates the extract: metadata lookup and KMS credential decryption, then EMR with Sqoop pulls from AS/400 DB2 and writes to S3; control flow and data flow are shown as numbered steps]
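A hedged sketch of what a parallel Sqoop extract from DB2 might look like, launched from Python. The JDBC URL, table, split column, bucket, and added columns are placeholders; the framework's actual invocation was not shown in the session.

```python
import subprocess
from datetime import datetime, timezone

batch_id = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")

# Free-form Sqoop query that appends batch_id and updated_date columns;
# $CONDITIONS is required by Sqoop to split the query across mappers.
query = (
    "SELECT o.*, '{bid}' AS batch_id, CURRENT TIMESTAMP AS updated_date "
    "FROM ORDERS o WHERE $CONDITIONS".format(bid=batch_id)
)

subprocess.run(
    [
        "sqoop", "import",
        "--connect", "jdbc:db2://as400-host:446/PRODDB",   # placeholder JDBC URL
        "--username", "etl_user",
        "--password-file", "hdfs:///user/etl/db2.password",
        "--query", query,
        "--split-by", "o.ORDER_ID",     # column used to parallelize mappers
        "--num-mappers", "8",
        "--fields-terminated-by", "|",
        "--target-dir", "s3://example-bucket/landing/orders/{}/".format(batch_id),
        "--compress",                   # split + compressed output for faster COPY
        "--compression-codec", "org.apache.hadoop.io.compress.GzipCodec",
    ],
    check=True,
)
```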
28. Load
• Truncate and Load through Data Pipeline for Staging tables
• Dynamic workload management (WLM) queues set up to allow maximum resources during loading/transformation
• Check and terminate any locks on tables to allow truncation
• Capture metrics related to number of rows loaded, time taken, etc. (see the COPY sketch below)
[Diagram: Data Pipeline orchestrates the load: an EC2-based step decrypts credentials via KMS and copies data from S3 into Redshift staging tables; control flow and data flow are shown as numbered steps]
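The load step is essentially a TRUNCATE followed by a Redshift COPY from the compressed S3 extracts. Below is a minimal sketch using psycopg2; the cluster endpoint, table, S3 path, IAM role, and batch ID are placeholders.

```python
import psycopg2

# Placeholder connection details for the Redshift cluster.
conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dw",
    user="etl_user",
    password="...",
)
conn.autocommit = True

batch_id = "20161130120000"  # supplied by the job metadata in practice
with conn.cursor() as cur:
    # Truncate-and-load pattern for staging tables.
    cur.execute("TRUNCATE TABLE staging.orders;")
    cur.execute(
        """
        COPY staging.orders
        FROM 's3://example-bucket/landing/orders/{}/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        GZIP
        DELIMITER '|'
        """.format(batch_id)
    )
    # Capture a simple operational statistic for the metadata store.
    cur.execute("SELECT COUNT(*) FROM staging.orders;")
    print("rows loaded:", cur.fetchone()[0])
conn.close()
```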
29. Transform
• Custom application for building Dimensions and Facts
• SQL scripts are stored in S3 and executed by the ELT process (see the sketch below)
• SQL scripts refactored from SQL Server and AS/400 scripts
• Non-functional requirements are achieved through the custom application
[Diagram: the custom application reads SQL scripts from S3 and job metadata, then builds Dimensions and Facts from Staging tables in Redshift; control flow and data flow are shown as numbered steps]
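A hedged sketch of the transform step: pull a SQL script from S3 and run it against Redshift. The bucket, key, and connection details are placeholders, and the statement splitting is deliberately simplified; the real application also handled restartability and auditing.

```python
import boto3
import psycopg2

s3 = boto3.client("s3")

# Fetch the transformation script that builds a fact table (placeholder key).
obj = s3.get_object(Bucket="example-bucket", Key="elt/sql/load_fact_orders.sql")
sql_script = obj["Body"].read().decode("utf-8")

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dw", user="etl_user", password="...",
)
try:
    with conn, conn.cursor() as cur:
        # Naive statement split on ';' for illustration only; the framework
        # would also log which feed changed which fact for audit purposes.
        for statement in filter(None, (s.strip() for s in sql_script.split(";"))):
            cur.execute(statement)
finally:
    conn.close()
```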
30. Schema Design
• Modified star schema
  • Natural keys instead of generating unique identifiers
  • Commonly used columns from Dimensions are copied over to Facts
  • Surrogate keys are eliminated except for a few cases
• Compression
• Define appropriate distribution and sort keys
• Define primary keys and foreign keys (see the DDL sketch below)
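To make these schema choices concrete, here is a hedged example of what a fact table DDL might look like under this design: a natural key, denormalized dimension columns, explicit column encodings, a distribution key, a sort key, and informational primary/foreign keys. Table and column names are illustrative, not Scholastic's actual model, and dw.dim_customer is assumed to already exist.

```python
import psycopg2

FACT_ORDERS_DDL = """
CREATE TABLE IF NOT EXISTS dw.fact_orders (
    order_id      BIGINT        NOT NULL ENCODE delta,     -- natural key (no surrogate)
    order_date    DATE          NOT NULL ENCODE raw,       -- sort key column left unencoded
    customer_id   BIGINT        NOT NULL ENCODE delta,
    product_id    BIGINT        NOT NULL ENCODE delta,
    product_name  VARCHAR(200)           ENCODE lzo,       -- copied over from the product dimension
    channel_name  VARCHAR(50)            ENCODE bytedict,  -- copied over from the channel dimension
    quantity      INTEGER                ENCODE delta,
    net_amount    DECIMAL(12,2)          ENCODE delta32k,
    batch_id      VARCHAR(20)            ENCODE lzo,
    PRIMARY KEY (order_id),
    FOREIGN KEY (customer_id) REFERENCES dw.dim_customer (customer_id)
)
DISTKEY (customer_id)   -- collocate with the customer dimension for joins
SORTKEY (order_date);   -- range-restricted scans on date
"""

conn = psycopg2.connect(
    host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="dw", user="etl_user", password="...",
)
with conn, conn.cursor() as cur:
    cur.execute(FACT_ORDERS_DDL)
conn.close()
```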
31. Security
• AWS Key Management Service (KMS) is used for encrypting access credentials to the source and target databases
• A Jenkins job allows database administrators to encrypt credentials using KMS directly
• Amazon EMR and Jenkins resources are given KMS decrypt permissions to allow connecting to sources and targets during the ELT process (see the sketch below)
• Standard security in transit and at rest throughout the process
• IAM federation through enterprise Active Directory
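A hedged sketch of the credential handling described above: a Jenkins job encrypts a database password with KMS, and the ELT process (on EMR or EC2, with kms:Decrypt on the key) decrypts it just before connecting. The key alias and values are placeholders.

```python
import base64
import boto3

kms = boto3.client("kms")
KEY_ID = "alias/elt-credentials"  # placeholder CMK alias

def encrypt_credential(plaintext: str) -> str:
    """Run by the Jenkins job: encrypt a DB password, return base64 ciphertext."""
    resp = kms.encrypt(KeyId=KEY_ID, Plaintext=plaintext.encode("utf-8"))
    return base64.b64encode(resp["CiphertextBlob"]).decode("ascii")

def decrypt_credential(ciphertext_b64: str) -> str:
    """Run by the ELT process on EMR/EC2 before opening a DB connection."""
    resp = kms.decrypt(CiphertextBlob=base64.b64decode(ciphertext_b64))
    return resp["Plaintext"].decode("utf-8")

# Example round trip; the encrypted value would normally be stored with
# the job metadata, not generated inline.
token = encrypt_credential("example-db2-password")
print(decrypt_credential(token))
```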
32. Reporting
• Business users access Facts/Dimensions through Tableau
• Power users access Staging tables through Tableau
• Data analysts access files in S3 using Hive/Presto (see the sketch below)
• Self-service capability across business users
[Diagram: business analysts and power users query Facts/Dimensions and Staging tables in Redshift through Tableau; data analysts query files in S3 through Presto/Hive on EMR]
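For the data-analyst path, here is a minimal sketch of querying the S3 extracts through Presto on EMR with PyHive. The host and table name are assumptions, and the Hive external table over the S3 files would be defined separately.

```python
from pyhive import presto

# Presto on the EMR master node (8889 is the default EMR Presto port).
conn = presto.connect(
    host="emr-master.example.internal",
    port=8889,
    catalog="hive",
    schema="landing",
)

cursor = conn.cursor()
# "orders_raw" is a hypothetical Hive external table over the S3 extracts.
cursor.execute(
    "SELECT batch_id, count(*) AS events "
    "FROM orders_raw GROUP BY batch_id ORDER BY batch_id DESC LIMIT 10"
)
for row in cursor.fetchall():
    print(row)
```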
33. Workstream Effort
• Define Jobs and Job Groups specific to each workstream
• Create Redshift tables (Staging, Facts, Dimensions) based on mapping from AS/400 and best practices learned
• Create new SQL scripts (based on the logic from AS/400/SQL Server code) for transformation
• Develop, test, and deploy in 2-week Agile sprints
34. Key Lessons - Technical
• Isolate the core framework from project-specific code repositories
• Consolidating a logging solution across Amazon S3, Amazon Redshift, Amazon DynamoDB, etc., was a challenge
• Make appropriate schema changes when migrating to a new platform
• Build a custom framework for gathering operational stats (e.g., # of rows loaded)
• Start with test automation tools and Acceptance Test Driven Development (ATDD) earlier in the project
35. Project timeline revisited
After the successful pilot:
• Executive leadership accelerated the timeline:
  • Reduce the project timeline by 50% (to 12 months) to deliver value faster to LOBs
  • Realize cost savings by eliminating the DB2 and SQL Server platforms earlier
• Users wanted to be on the new platform!
• Scholastic & NorthBay partnered to create a training curriculum to ensure a supply of skilled staff would be available to our teams
36. Scaling up: 7 workstreams
• Developed a model for estimating effort and cost (AWS costs & labor per LOB migration)
• Ran agile teams in parallel; employed Agile coaches
• Enhanced the core framework to ensure it would scale effectively when in use by multiple teams simultaneously
• Built a code repository for use by all teams
• Built CI/CD frameworks
37. Where are we now?
• 4 of 7 LOBs migrated; the framework enables complete migration of a functional area within days/weeks as opposed to months. On track to migrate and decommission the entire legacy environment within the next 6 months
• 10 weeks to migrate away from an external vendor hosting data and providing reports for one LOB
• Cost of the data ingestion framework is under $40/day (EC2, EMR, Data Pipeline)
• First “Big Data” initiative in production captures and processes an average of 1.5 million e-reading events daily (peak: 7 million)
• Profile: LOB #1
  • Loading ~5-6 million rows/day (6-7 GB/day)
  • Processing over 1.5 billion rows within Redshift daily
  • Complete ETL/ELT batch cycle performance improved by over 170%
38. Key lessons – project execution
• Essential to monitor and optimize AWS costs
• A “Data Champion” / “Data Guide” partnership is absolutely critical for successful adoption of new platforms
• Importance of strong Agile coaches while scaling out Agile teams
• Criticality of choosing consulting partners (AWS & NorthBay) who can ramp up and supply key resources fast, then cycle off the project when finished
• Creating new data platforms and migrating data into them is easy, especially with AWS. Decommissioning existing data platforms is hard!
41. Related Sessions
Hear from other customers discussing their Amazon Redshift use cases:
• BDM402 — Best Practices for Data Warehousing with Amazon Redshift (King.com)
• BDA304 — What’s New with Amazon Redshift
• SVR308 — Content and Data Platforms at Vevo: Rebuilding and Scaling from Zero in One Year
• GAM301 — How EA Leveraged Amazon Redshift and AWS Partner 47Lining to Gather Meaningful Player Insights
• BDA207 — Fanatics: Deploying Scalable, Self-Service Business Intelligence on AWS
• BDM306 — Netflix: Using Amazon S3 as the Fabric of Our Big Data Ecosystem
• BDA203 — Billions of Rows Transformed in Record Time Using Matillion ETL for Amazon Redshift (GE Power and Water)
• BDM206 — Understanding IoT Data: How to Leverage Amazon Kinesis in Building an IoT Analytics Platform on AWS (Hello)
• STG307 — Case Study: How Prezi Built and Scales a Cost-Effective, Multipetabyte Data Platform and Storage Infrastructure on Amazon S3