Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to analyze your data with existing BI tools for a fraction of the cost of traditional data warehouses.
This webinar will familiarize you with reporting, visualization, and business intelligence options for your Amazon Redshift data warehouse. You will learn how to effectively use existing BI tools and SQL clients with your Amazon Redshift data warehouse, as well as techniques for performing advanced analytics.
Learning Objectives:
Options for processing, analyzing, and visualizing data in Amazon Redshift
Extending the Amazon Redshift SQL query capabilities
Optimizing query performance with Redshift ODBC / JDBC driver
Overview of BI solutions from our partners
2. Amazon Redshift – Resources
Getting Started – June Webinar Series:
https://www.youtube.com/watch?v=biqBjWqJi-Q
Best Practices – July Webinar Series:
Optimizing Performance – July 21, 2015
Migration and Data Loading – July 22, 2015
Reporting and Advanced Analytics – July 23, 2015
3. Agenda
• Connecting to Amazon Redshift
• Case Study – Redshift analytics at Yahoo
• Case Study - Redshift Optimizations at Looker
• Questions and Answers
4. Amazon Redshift
Petabyte scale; massively parallel
Relational data warehouse
Fully managed; zero admin
SSD & HDD platforms
As low as $1,000/TB/Year
5. Common Customer Use Cases
Traditional Enterprise DW:
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business
Companies with Big Data:
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
SaaS Companies:
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
6. Custom ODBC and JDBC Drivers
Up to 35% higher performance than open source drivers
Supported by most Business Intelligence tools
Will continue to support PostgreSQL open source drivers
Download drivers from console
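As a minimal sketch of what connecting with the custom driver involves, the following builds a Redshift JDBC connection URL (the `jdbc:redshift://` scheme used by the driver downloadable from the console). The cluster endpoint, port, and database name below are placeholders for illustration, not a real cluster:

```python
def redshift_jdbc_url(endpoint, port, database):
    """Build a JDBC URL for the Amazon Redshift custom driver.

    The endpoint, port, and database are whatever your cluster's
    console page shows; 5439 is Redshift's default port.
    """
    return f"jdbc:redshift://{endpoint}:{port}/{database}"

# Hypothetical cluster endpoint, for illustration only.
url = redshift_jdbc_url(
    "examplecluster.abc123.us-east-1.redshift.amazonaws.com", 5439, "dev"
)
print(url)
# → jdbc:redshift://examplecluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev
```

BI tools and SQL clients that accept a JDBC URL can use this string directly once the driver JAR is on their classpath.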
9. Introduction
Who am I?
• Yahoo growth team
• Supporting analytics for 6 products in Yahoo’s mobile portfolio
10. Introduction
What do we do?
▪ Real-time ad-hoc analytics
▪ Mobile properties
▪ What do we care about?
› Engagement and Activity
› User demographics
› Experimentation
› Funnel analysis
› Modeling revenue and user Lifetime Value
› Cohort analysis and retention
17. Definitions
▪ Cohort - A group of product users that share one or more attributes
› Example: All users who installed on Monday with Android devices
▪ Retention - How many members of a cohort continue to use the product over time
› Example: 100 users installed on Monday with Android devices. 7 days later, 50 of those users returned to the product. We would say the 7-day retention for this cohort is 50%.
18. Why Study User Retention?
▪ Quantifies how “sticky” your product is
▪ Allows us to measure Customer Lifetime Value (CLV or
LTV)
19. Why Study User Retention?
[Chart: % retained over time, contrasting asymptotic retention with no retention]
20. Why Study User Retention?
[Chart: total users over time, contrasting asymptotic retention with no retention]
21. Calculating User Retention
Definition: For each possible combination of cohort dimensions, for every possible event date, how many devices belong to that cohort, and how many devices from that cohort were active on that day.
Example with one dimension, os_name:
event_date | product | install_date | os_name | active_users | cohort_size
monday     | mail    | monday       | android | 100          | 100
tuesday    | mail    | monday       | android | 83           | 100
monday     | mail    | monday       | ios     | 75           | 75
tuesday    | mail    | monday       | ios     | 62           | 75
22. Calculating User Retention
Example with one dimension, os_name: What’s my 1-day retention for users who installed on Monday?
event_date | product | install_date | os_name | active_users | cohort_size
monday     | mail    | monday       | android | 100          | 100
tuesday    | mail    | monday       | android | 83           | 100
monday     | mail    | monday       | ios     | 75           | 75
tuesday    | mail    | monday       | ios     | 62           | 75
24. Calculating User Retention
Example with one dimension, os_name: What’s my 1-day retention for users who installed on Monday?
event_date | product | install_date | os_name | active_users | cohort_size
tuesday    | mail    | monday       | android | 83           | 100
tuesday    | mail    | monday       | ios     | 62           | 75
total      |         |              |         | 145          | 175
Aggregate retention across both iOS and Android is (83 + 62) / (100 + 75) ≈ 83%
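The aggregation above can be sketched in a few lines of Python; the rows mirror the slide’s table, and the key point is that aggregate retention is a ratio of sums, not an average of per-cohort percentages:

```python
# Day-1 retention rows for the Monday install cohort (from the slide).
rows = [
    {"os_name": "android", "active_users": 83, "cohort_size": 100},
    {"os_name": "ios",     "active_users": 62, "cohort_size": 75},
]

# Sum actives and cohort sizes first, then divide. Averaging the
# per-OS percentages instead would over-weight the smaller cohort.
retention = sum(r["active_users"] for r in rows) / sum(r["cohort_size"] for r in rows)
print(round(retention * 100))  # → 83
```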
25. Calculating User Retention
Steps:
1. For each day, determine whether each device was active or not
device_id | date       | is_active
1         | 2015-01-01 | 1
1         | 2015-01-02 | 0
2         | 2015-01-01 | 1
2         | 2015-01-02 | 1
26. Calculating User Retention
Steps:
1. For each day, determine whether each device was active or not
2. Join device attributes to results of Step 1
device_id | date       | is_active | os  | install_date
1         | 2015-01-01 | 1         | ios | 2015-01-01
1         | 2015-01-02 | 0         | ios | 2015-01-01
2         | 2015-01-01 | 1         | ios | 2015-01-01
2         | 2015-01-02 | 1         | ios | 2015-01-01
27. Calculating User Retention
Steps:
1. For each day, determine whether each device was active or not
2. Join device attributes to results of Step 1
3. SUM is_active column, grouping by date, os, and install_date (and any other cohort dimensions)
date       | active_user_count | os  | install_date
2015-01-01 | 2                 | ios | 2015-01-01
2015-01-02 | 1                 | ios | 2015-01-01
28. Calculating User Retention
Steps:
1. For each day, determine whether each device was active or not
2. Join device attributes to results of Step 1
3. SUM is_active column, grouping by date, os, and install_date (and any other cohort dimensions)
4. Join the size of each cohort to the result of Step 3
date       | active_user_count | os  | install_date | cohort_size
2015-01-01 | 2                 | ios | 2015-01-01   | 2
2015-01-02 | 1                 | ios | 2015-01-01   | 2
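The four steps can be sketched end to end in plain Python (a stand-in for the equivalent Redshift SQL). The sample rows follow the tables above, assuming device 2 was active on both days, which is consistent with the Step 3 counts:

```python
from collections import defaultdict

# Step 1: daily activity per device (device_id, date, is_active).
activity = [
    (1, "2015-01-01", 1),
    (1, "2015-01-02", 0),
    (2, "2015-01-01", 1),
    (2, "2015-01-02", 1),
]

# Step 2: join device attributes (os, install_date) onto the activity rows.
devices = {1: ("ios", "2015-01-01"), 2: ("ios", "2015-01-01")}
joined = [(d, day, active, *devices[d]) for d, day, active in activity]

# Step 3: SUM is_active, grouping by (date, os, install_date).
active_counts = defaultdict(int)
for _, day, active, os_name, install_date in joined:
    active_counts[(day, os_name, install_date)] += active

# Step 4: join the size of each (os, install_date) cohort onto Step 3.
cohort_members = defaultdict(set)
for d, (os_name, install_date) in devices.items():
    cohort_members[(os_name, install_date)].add(d)

result = [
    (day, count, os_name, install_date, len(cohort_members[(os_name, install_date)]))
    for (day, os_name, install_date), count in sorted(active_counts.items())
]
print(result)
```

Running this reproduces the final table: two devices active on 2015-01-01, one on 2015-01-02, out of a cohort of two.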
30. Lessons Learned
▪ Summarize data for optimal query performance (hourly or daily rollups)
▪ Think carefully about the data model ahead of time. Choose the right sort keys.
▪ Invest in a good tool for ETL (we use Airflow)
▪ Invest in a good tool for query building and sharing (we use Looker)
▪ Reserve plenty of spare capacity (at least 40% free)
▪ Reserved nodes are much cheaper
▪ DC (dense compute) nodes are faster but have much smaller capacity
32. Introduction
• We use Redshift to power our own implementation of Looker, which serves every department with business intelligence and data for analytics.
• I have worked at Looker for just over two years, doing everything from Sales Engineering to Professional Services to Data Engineering. I currently head up our internal analytics efforts.
33. Agenda
• How Looker uses Redshift to supply business intelligence and drive analytics internally.
• How a few Looker customers use Redshift for reporting and analytics.
34. Looker and Redshift
At Looker, we have two major use cases which drove our decision to go with Redshift:
• fast analysis of usage data (300+ million events);
• centralization of multiple data sources into a single warehouse.
35. What We Care About Most
• Customer Health:
- MoM/WoW percent change in usage
- Users added/removed
- User engagement (developer, explorer, consumer, occasional consumer)
- LookML contributions and contributors
• Product Usage:
- Features used/not used
- Release pain points
- Github issue/feature tracking
• Reporting for Sales and Marketing:
- Usage in trial
- Performance to quota (sales, meetings, leads, etc.)
- Lead/prospect fit
- Campaign attribution
- SaaS metrics: MRR, cMRR, Churn
44. Analyze - Lead Scoring
• Construct historical data set or “Look.”
• GET the “Look” using the Looker API (API 3.0).
• Train/test model in R.
• Output PMML file.
• EC2 hosts Openscoring REST service + PMML.
• Hit Salesforce API for new leads (GET lead); score leads; update each lead record (UPDATE lead).
• View prioritized lists in Looker.
45. Why Our Customers Use Redshift
• Scale/Performance
- Transactional databases are not ideal for analytics (slow).
- Redshift scales quickly and is incredibly fast.
• Accessibility
- SQL is in many analysts’ wheelhouse and is easy to adopt.
- Obvious choice for those in the AWS ecosystem or who prefer managed offerings.
• Centralization of data
- When it comes time to tie top-of-funnel actions to bottom-of-funnel behavior.
46. How Our Customers Use Redshift
• Backstage/Sonicbids: They built an artist search tool that uses social data from Facebook, Twitter, YouTube, and Soundcloud to inform booking agents on what sort of draw they could expect from a certain artist. They used Snowplow, Redshift, the Looker API, and Elasticsearch to build this system.
47. How Our Customers Use Redshift
• Smartling: sources website translation snippets from translators the world over. They maintain a database of translated snippets, like “the car is red” in Turkish, in order to validate incoming translations. So, when a request for “the car is blue” in Turkish comes in, they can assess the syntactic validity of the translation.