Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse service that costs less than $1,000 per TB per year, under a tenth the price of most traditional data warehousing solutions. Learn how Yahoo! uses both to build a billion-event-a-day infrastructure that is fast, easy, and cost-effective. Dive into how Yahoo performs advanced user retention and cohort analysis to make near-real-time product and marketing decisions.
2. Fast, simple, petabyte-scale data warehousing for $1,000/TB/Year
Amazon Redshift: a lot faster, a lot cheaper, a lot simpler
3. Amazon Redshift architecture
Leader node
Simple SQL end point
Stores metadata
Optimizes query plan
Coordinates query execution
Compute nodes
Local columnar storage
Parallel/distributed execution of all queries, loads, backups, restores, resizes
Start at $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS2: HDD; scale from 2 TB to 2 PB
10 GigE (HPC)
Ingestion
Backup
Restore
JDBC/ODBC
4. Amazon Redshift is priced to analyze all your data
Pricing is simple
# of nodes X hourly price
No charge for leader node
3x data compression on avg
Three copies of data
DS2 (HDD)
Term                 Price per hour, smallest single node   Effective annual price per TB, compressed
On-Demand            $0.850                                 $3,725
1 Year Reservation   $0.500                                 $2,190
3 Year Reservation   $0.228                                 $999
DC1 (SSD)
Term                 Price per hour, smallest single node   Effective annual price per TB, compressed
On-Demand            $0.250                                 $13,690
1 Year Reservation   $0.161                                 $8,795
3 Year Reservation   $0.100                                 $5,500
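The pricing rule above (number of nodes times hourly price, no charge for the leader node) can be sketched as a small calculation. This is an illustrative sketch only: the function names are hypothetical, and the node sizes (2 TB for the smallest DS2 node, 160 GB for the smallest DC1 node) are taken from the architecture slide. The results land within a few dollars of the on-demand figures in the table, which appear to be rounded.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def cluster_cost_per_hour(node_count: int, hourly_price: float) -> float:
    """Hourly cluster cost: compute nodes times the per-node hourly price."""
    # The leader node is not billed, so only compute nodes are counted.
    return node_count * hourly_price


def annual_price_per_tb(hourly_price: float, tb_per_node: float) -> float:
    """Effective annual price per TB for one node running all year."""
    return hourly_price * HOURS_PER_YEAR / tb_per_node


# Smallest DS2 node: assumed 2 TB, $0.850/hour on demand.
ds2_on_demand = annual_price_per_tb(0.850, 2.0)
# Smallest DC1 node: assumed 160 GB (0.16 TB), $0.250/hour on demand.
dc1_on_demand = annual_price_per_tb(0.250, 0.16)

print(f"DS2 on demand: ${ds2_on_demand:,.0f} per TB per year")
print(f"DC1 on demand: ${dc1_on_demand:,.0f} per TB per year")
print(f"4-node DS2 cluster: ${cluster_cost_per_hour(4, 0.850):.2f} per hour")
```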
5. Amazon Redshift is easy to use
Provision in minutes
Monitor query performance
Point and click resize
Built-in security
Automatic backups
8. What to expect from the session
• What does analytics mean for Yahoo?
• Learn how our extract, transform, load (ETL) process runs
• Learn about our Amazon Redshift architecture
• Do’s, don’ts, and best practices for working with Amazon Redshift
• Deep dive into advanced analytics, featuring how we define and report user retention
33. Join mitigation strategies
Key distribution
• Records distributed by distkey
• Choose a field that you join on
• Avoid causing excess skew
All distribution
• All records distributed to all nodes
• Most robust, but most space-intensive
Fastest joins occur when records are colocated
Key distribution
Node 1: A.1 B.1, A.3 B.3, A.5 B.5
Node 2: A.2 B.2, A.4 B.4, A.6 B.6
All distribution
Node 1: A.1 B.1, A.2 B.2, A.3 B.3, A.4 B.4, A.5 B.5, A.6 B.6
Node 2: A.1 B.1, A.2 B.2, A.3 B.3, A.4 B.4, A.5 B.5, A.6 B.6
Even distribution
Node 1: A.1 B.6, A.5 B.2, A.3 B.3
Node 2: A.4 B.1, A.2 B.5, A.6 B.4
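To make the colocation point concrete, here is a toy simulation of how rows from two tables, A and B, land on two nodes under key and even distribution. It is a sketch under stated assumptions, not Redshift's actual placement logic: the modulo stands in for whatever hash the cluster uses, and the `offset` parameter is a hypothetical way to model two tables being loaded in different orders.

```python
NODES = 2
keys = [1, 2, 3, 4, 5, 6]  # join keys shared by tables A and B


def by_key(keys, nodes=NODES):
    """KEY style: a row's node is a function of its join key."""
    # Modulo is a stand-in for the real hash function.
    return {k: k % nodes for k in keys}


def evenly(keys, nodes=NODES, offset=0):
    """EVEN style: round-robin placement that ignores the key."""
    order = keys[offset:] + keys[:offset]  # rows arrive in a rotated order
    return {k: i % nodes for i, k in enumerate(order)}


def colocated(placement_a, placement_b):
    """Count keys whose A row and B row sit on the same node."""
    return sum(placement_a[k] == placement_b[k] for k in placement_a)


# Same key function for both tables: every matching pair shares a node.
key_hits = colocated(by_key(keys), by_key(keys))
# Round-robin with different arrival orders: matches are left to chance.
even_hits = colocated(evenly(keys), evenly(keys, offset=1))

print(f"key distribution:  {key_hits} of {len(keys)} pairs colocated")
print(f"even distribution: {even_hits} of {len(keys)} pairs colocated")
```

With all distribution every node holds a full copy of the table, so any row can be joined locally, at the cost of storing the table once per node.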
36. DON’T
Run ETL in the default queue
Workload management (WLM) is your friend
37. Example WLM configuration
Queue   Concurrency   User Groups   Timeout (ms)   Memory (%)
1       1             etl           -              50
2       10            -             60,000         50
Two queues: ETL and ad hoc
Purpose: Insulate normal users from ETL and free up plenty of memory for big batch jobs
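The two-queue setup above could be written as a WLM JSON document along the following lines. Treat this as a hedged sketch: the key names (`user_group`, `query_concurrency`, `max_execution_time`, `memory_percent_to_use`) follow the `wlm_json_configuration` parameter as I understand it and should be checked against the current Amazon Redshift documentation before use. The slides do not show the actual configuration Yahoo applied.

```python
import json

# Two queues mirroring the table above. Key names are an assumption based on
# the wlm_json_configuration parameter; verify them before applying.
wlm_queues = [
    {
        # Queue 1: ETL users only, one query at a time, half the memory.
        "user_group": ["etl"],
        "query_concurrency": 1,
        "memory_percent_to_use": 50,
    },
    {
        # Queue 2: everyone else (ad hoc), with a 60-second timeout.
        "query_concurrency": 10,
        "max_execution_time": 60000,  # milliseconds
        "memory_percent_to_use": 50,
    },
]

# Sanity check: the memory shares should add up to 100 percent.
total_memory = sum(q["memory_percent_to_use"] for q in wlm_queues)

wlm_json = json.dumps(wlm_queues)
print(wlm_json)
```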
44. User retention and growth
[Figure: daily active users (0 to 6,000) by product age in days (0 to 14), for Product A and Product B]
45. High churn = wasted ad dollars
[Figure: dollar amount ($0 to $25,000) by product age in days (1 to 14), for Product A and Product B]
46. The Sputnik method
For generating a multidimensional user retention analysis table
event_date install_date os_name country active_users cohort_size
monday monday android us 100 100
tuesday monday android us 83 100
monday monday ios us 75 75
tuesday monday ios us 75 75
47. Get one-day retention
SELECT
  SUM(active_users) AS active_users,
  SUM(cohort_size) AS cohort_size,
  -- cast so that integer columns do not truncate the ratio to 0
  SUM(active_users)::FLOAT / SUM(cohort_size) AS retention
FROM user_retention
WHERE
  event_date - install_date = 1 AND
  CURRENT_DATE - 1 > event_date;
48. Get one-day retention
event_date install_date os_name country active_users cohort_size
monday monday android us 100 100
tuesday monday android us 83 100
monday monday ios us 75 75
tuesday monday ios us 75 75
Active Users: 83 + 75 = 158
Cohort Size: 100 + 75 = 175
-------------------------------
Pct Retention = 158 / 175 = 90%
49. Get one-day retention by OS
SELECT
  os_name,
  SUM(active_users) AS active_users,
  SUM(cohort_size) AS cohort_size,
  -- cast so that integer columns do not truncate the ratio to 0
  SUM(active_users)::FLOAT / SUM(cohort_size) AS retention
FROM user_retention
WHERE
  event_date - install_date = 1 AND
  CURRENT_DATE - 1 > event_date
GROUP BY 1;
50. Get one-day retention
event_date install_date os_name country active_users cohort_size
monday monday android us 100 100
tuesday monday android us 83 100
monday monday ios us 75 75
tuesday monday ios us 75 75
Android:
Active Users: 83
Cohort Size: 100
-------------------
Pct Retention = 83%
iOS:
Active Users: 75
Cohort Size: 75
--------------------
Pct Retention = 100%
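The one-day retention arithmetic above can be checked with a small runnable sketch. It uses Python's built-in SQLite rather than Amazon Redshift, so two things differ from the slide queries and are assumptions of this example: dates are stored as ISO text and subtracted with `julianday()`, and the ratio is multiplied by 1.0 to avoid integer division. Monday and Tuesday are mapped onto 2015-10-05 and 2015-10-06, which are arbitrary stand-in dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE user_retention (
        event_date TEXT, install_date TEXT,
        os_name TEXT, country TEXT,
        active_users INTEGER, cohort_size INTEGER)
""")
conn.executemany(
    "INSERT INTO user_retention VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2015-10-05", "2015-10-05", "android", "us", 100, 100),
        ("2015-10-06", "2015-10-05", "android", "us", 83, 100),
        ("2015-10-05", "2015-10-05", "ios", "us", 75, 75),
        ("2015-10-06", "2015-10-05", "ios", "us", 75, 75),
    ],
)

# Rows exactly one day after install, excluding days that are not yet complete.
DAY_ONE = """
    julianday(event_date) - julianday(install_date) = 1 AND
    date('now', '-1 day') > event_date
"""

overall = conn.execute(f"""
    SELECT SUM(active_users), SUM(cohort_size),
           SUM(active_users) * 1.0 / SUM(cohort_size)
    FROM user_retention
    WHERE {DAY_ONE}
""").fetchone()

by_os = conn.execute(f"""
    SELECT os_name, SUM(active_users), SUM(cohort_size),
           SUM(active_users) * 1.0 / SUM(cohort_size)
    FROM user_retention
    WHERE {DAY_ONE}
    GROUP BY os_name
    ORDER BY os_name
""").fetchall()

print("overall:", overall[0], "/", overall[1], "=", f"{overall[2]:.0%}")
for os_name, active, cohort, retention in by_os:
    print(f"{os_name}: {active} / {cohort} = {retention:.0%}")
```

Because the table stores both the numerator and the denominator per cohort, the same rows answer the overall question and the per-OS question; only the GROUP BY changes.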
52. The Sputnik method
Calculate cohort sizes
• Count users by all dimensions
• For example: Male, iOS, in USA, who installed today
Determine user activity
• For each day, for each user, were they active?
• Create a table with user_id and event_date
Join and aggregate
• Join user table to user_activity on user_id
• SUM active users by cohort and join to cohort sizes
53. Calculate cohort sizes
user_id install_date os_name country
1 2015-10-02 iOS us
2 2015-10-01 android ca
3 2015-10-01 android ca
SELECT
install_date, os_name, country,
COUNT(*) AS cohort_size
FROM "user"  -- quoted because USER is a reserved word
GROUP BY 1,2,3;
54. Calculate cohort sizes
install_date os_name country cohort_size
2015-10-02 iOS us 1
2015-10-01 android ca 2
SELECT
install_date, os_name, country,
COUNT(*) AS cohort_size
FROM "user"  -- quoted because USER is a reserved word
GROUP BY 1,2,3;
55. Determine user activity
user_id event_date action
1 2015-10-02 app_open
1 2015-10-02 spam
1 2015-10-03 app_open
CREATE TEMP TABLE user_activity AS
SELECT
DISTINCT user_id, event_date
FROM event_daily
WHERE action = 'app_open';
56. Determine user activity
user_id event_date action
1 2015-10-02 app_open
1 2015-10-02 spam
1 2015-10-03 app_open
CREATE TEMP TABLE all_users AS
SELECT DISTINCT user_id FROM event_daily;
CREATE TEMP TABLE all_dates AS
SELECT DISTINCT event_date FROM event_daily;
57. Determine user activity
user_id event_date action
1 2015-10-02 app_open
1 2015-10-02 spam
1 2015-10-03 app_open
CREATE TABLE active_users_by_day AS
SELECT
xproduct.user_id, xproduct.event_date
FROM (
SELECT * FROM all_users CROSS JOIN all_dates
) xproduct
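-- match on the date as well as the user, so only days with activity remain
INNER JOIN user_activity u ON
  u.user_id = xproduct.user_id AND
  u.event_date = xproduct.event_date;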
58. Determine cohort activity
user_id event_date
1 2015-10-02
1 2015-10-03
CREATE TEMP TABLE cohort_activity AS
SELECT
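u.*, all_dates.event_date,
-- 1 if the user was active that day (the left join found a row), else 0
CASE WHEN au.user_id IS NOT NULL THEN 1 ELSE 0 END AS is_active
FROM "user" AS u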
LEFT JOIN all_dates ON all_dates.event_date >= u.install_date
LEFT JOIN active_users_by_day AS au ON
au.user_id = u.user_id AND
au.event_date = all_dates.event_date
WHERE all_dates.event_date >= u.install_date;
user_id install_date os_name country
1 2015-10-02 iOS us
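The is_active flag in the cohort activity step can be derived from whether the left join to the activity table found a matching row. The sketch below runs that idea in Python's built-in SQLite, not Redshift. The table is named `users` here (a hypothetical rename to avoid the reserved word), and the `all_dates` rows, including 2015-10-04, are sample values chosen so the output matches the three-row result shown on the next slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        user_id INTEGER, install_date TEXT, os_name TEXT, country TEXT);
    INSERT INTO users VALUES (1, '2015-10-02', 'iOS', 'us');

    CREATE TABLE all_dates (event_date TEXT);
    INSERT INTO all_dates VALUES
        ('2015-10-01'), ('2015-10-02'), ('2015-10-03'), ('2015-10-04');

    CREATE TABLE active_users_by_day (user_id INTEGER, event_date TEXT);
    INSERT INTO active_users_by_day VALUES
        (1, '2015-10-02'), (1, '2015-10-03');
""")

rows = conn.execute("""
    SELECT
        u.user_id, d.event_date, u.install_date, u.os_name, u.country,
        -- a matched activity row means the user was active that day
        CASE WHEN au.user_id IS NOT NULL THEN 1 ELSE 0 END AS is_active
    FROM users AS u
    -- one row per user per day on or after install
    JOIN all_dates AS d ON d.event_date >= u.install_date
    LEFT JOIN active_users_by_day AS au
        ON au.user_id = u.user_id AND au.event_date = d.event_date
    ORDER BY u.user_id, d.event_date
""").fetchall()

for row in rows:
    print(row)
```

ISO-formatted date strings sort chronologically, which is why the text comparison against install_date works in this SQLite stand-in.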
59. Determine cohort activity
user_id event_date install_date os_name country is_active
1 2015-10-02 2015-10-02 iOS us 1
1 2015-10-03 2015-10-02 iOS us 1
1 2015-10-04 2015-10-02 iOS us 0
CREATE TEMP TABLE active_users AS
SELECT
event_date,
install_date, os_name, country,
SUM(is_active) AS count
FROM cohort_activity
GROUP BY 1, 2, 3, 4;
60. Determine cohort activity
event_date install_date os_name country is_active
2015-10-03 2015-10-02 iOS us 100
2015-10-03 2015-10-02 android us 350
2015-10-03 2015-10-02 iOS ca 50
Join these two tables on matching cohort dimensions
install_date os_name country cohort_size
2015-10-02 iOS us 200
2015-10-02 android us 400
2015-10-02 iOS ca 60
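The slides do not show the final join, so the following is a minimal sketch of what joining the two tables on their cohort dimensions might look like, again in Python's built-in SQLite rather than Redshift. The table name `cohort_sizes` and the column name `active_users` are assumptions made for this example; the output has the same columns as the user_retention table introduced earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE active_users (
        event_date TEXT, install_date TEXT,
        os_name TEXT, country TEXT, active_users INTEGER);
    INSERT INTO active_users VALUES
        ('2015-10-03', '2015-10-02', 'iOS', 'us', 100),
        ('2015-10-03', '2015-10-02', 'android', 'us', 350),
        ('2015-10-03', '2015-10-02', 'iOS', 'ca', 50);

    CREATE TABLE cohort_sizes (
        install_date TEXT, os_name TEXT, country TEXT, cohort_size INTEGER);
    INSERT INTO cohort_sizes VALUES
        ('2015-10-02', 'iOS', 'us', 200),
        ('2015-10-02', 'android', 'us', 400),
        ('2015-10-02', 'iOS', 'ca', 60);

    -- Attach each cohort's size to its daily active count.
    CREATE TABLE user_retention AS
    SELECT
        a.event_date, a.install_date, a.os_name, a.country,
        a.active_users, c.cohort_size
    FROM active_users AS a
    JOIN cohort_sizes AS c
        ON c.install_date = a.install_date
        AND c.os_name = a.os_name
        AND c.country = a.country;
""")

rows = conn.execute("""
    SELECT os_name, country, active_users, cohort_size
    FROM user_retention
    ORDER BY os_name, country
""").fetchall()

for os_name, country, active, cohort in rows:
    print(f"{os_name}/{country}: {active} of {cohort} retained")
```

Joining on every cohort dimension keeps one row per cohort per day, so later queries can sum active_users and cohort_size over any subset of dimensions.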
61. Big wins for Yahoo
Real-time insights
Easier deployment and maintenance
Data-driven product development
Cutting-edge analytics
63. Related sessions
Hear from other customers discussing their Amazon Redshift use cases:
• DAT201—Introduction to Amazon Redshift (RetailMeNot)
• ISM303—Migrating Your Enterprise Data Warehouse to Amazon Redshift (Boingo Wireless
and Edmunds)
• ARC303—Pure Play Video OTT: A Microservices Architecture in the Cloud (Verizon)
• ARC305—Self-Service Cloud Services: How J&J Is Managing AWS at Scale for Enterprise
Workloads
• BDT306—The Life of a Click: How Hearst Publishing Manages Clickstream Analytics with
AWS
• DAT311—Large-Scale Genomic Analysis with Amazon Redshift (Human Longevity)
• BDT314—Running a Big Data and Analytics Application on Amazon EMR and Amazon
Redshift with a Focus on Security (Nasdaq)
• BDT316—Offloading ETL to Amazon Elastic MapReduce (Amgen)
• BDT401—Amazon Redshift Deep Dive (TripAdvisor)
• Building a Mobile App using Amazon EC2, Amazon S3, Amazon DynamoDB, and Amazon
Redshift (Tinder)