Amazon Redshift is a fast, petabyte-scale data warehouse that makes it easy to analyze your data with existing BI tools for a fraction of the cost of traditional data warehouses.
This webinar will familiarize you with reporting, visualization, and business intelligence options for your Amazon Redshift data warehouse. You will learn how to effectively use existing BI tools and SQL clients with your Amazon Redshift data warehouse, as well as techniques for performing advanced analytics.
Learning Objectives:
Options for processing, analyzing, and visualizing data in Amazon Redshift
Extending the Amazon Redshift SQL query capabilities
Optimizing query performance with Redshift ODBC / JDBC driver
Overview of BI solutions from our partners
2. Amazon Redshift – Resources
Getting Started – June Webinar Series:
https://www.youtube.com/watch?v=biqBjWqJi-Q
Best Practices – July Webinar Series:
Optimizing Performance – July 21, 2015
Migration and Data Loading – July 22, 2015
Reporting and Advanced Analytics – July 23, 2015
3. Agenda
• Connecting to Amazon Redshift
• Case Study – Redshift analytics at Yahoo
• Case Study - Redshift Optimizations at Looker
• Questions and Answers
4. Amazon Redshift
Petabyte scale; massively parallel
Relational data warehouse
Fully managed; zero admin
SSD & HDD platforms
As low as $1,000/TB/Year
5. Common Customer Use Cases
Traditional Enterprise DW:
• Reduce costs by extending DW rather than adding HW
• Migrate completely from existing DW systems
• Respond faster to business
Companies with Big Data:
• Improve performance by an order of magnitude
• Make more data available for analysis
• Access business data via standard reporting tools
SaaS Companies:
• Add analytic functionality to applications
• Scale DW capacity as demand grows
• Reduce HW & SW costs by an order of magnitude
6. Custom ODBC and JDBC Drivers
Up to 35% higher performance than open source drivers
Supported by most Business Intelligence tools
Will continue to support PostgreSQL open source drivers
Download drivers from console
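As a minimal sketch of what connecting with the custom driver involves, the following builds a Redshift JDBC connection URL (the `jdbc:redshift://` scheme used by the driver downloadable from the console). The cluster endpoint, port, and database name below are placeholders for illustration, not a real cluster:

```python
def redshift_jdbc_url(endpoint, port, database):
    """Build a JDBC URL for the Amazon Redshift custom driver.

    The endpoint, port, and database are whatever your cluster's
    console page shows; 5439 is Redshift's default port.
    """
    return f"jdbc:redshift://{endpoint}:{port}/{database}"

# Hypothetical cluster endpoint, for illustration only.
url = redshift_jdbc_url(
    "examplecluster.abc123.us-east-1.redshift.amazonaws.com", 5439, "dev"
)
print(url)
# → jdbc:redshift://examplecluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev
```

BI tools and SQL clients that accept a JDBC URL can use this string directly once the driver JAR is on their classpath.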
9. Introduction
Who am I?
• Yahoo growth team
• Supporting analytics for 6 products in Yahoo’s mobile portfolio
10. Introduction
What do we do?
▪ Real-time ad-hoc analytics
▪ Mobile properties
▪ What do we care about?
› Engagement and Activity
› User demographics
› Experimentation
› Funnel analysis
› Modeling revenue and user Lifetime Value
› Cohort analysis and retention
17. Definitions
▪ Cohort - A group of product users that share one or more attributes
› Example: All users who installed on Monday with Android devices
▪ Retention - How many members of a cohort continue to use the product over time
› Example: 100 users installed on Monday with Android devices. 7 days later, 50 of those users returned to the product. We would say the 7-day retention for this cohort is 50%.
18. Why Study User Retention?
▪ Quantifies how “sticky” your product is
▪ Allows us to measure Customer Lifetime Value (CLV or
LTV)
19. Why Study User Retention?
[Chart: % retained over time, contrasting asymptotic retention with no retention]
20. Why Study User Retention?
[Chart: total users over time, contrasting asymptotic retention with no retention]
21. Calculating User Retention
Definition: For each possible combination of cohort dimensions, for every possible event date, how many devices belong to that cohort, and how many devices from that cohort were active on that day.
Example with one dimension, os_name:
event_date | product | install_date | os_name | active_users | cohort_size
monday     | mail    | monday       | android | 100          | 100
tuesday    | mail    | monday       | android | 83           | 100
monday     | mail    | monday       | ios     | 75           | 75
tuesday    | mail    | monday       | ios     | 62           | 75
22. Calculating User Retention
Example with one dimension, os_name: What’s my 1-day retention for users who installed on Monday?
event_date | product | install_date | os_name | active_users | cohort_size
monday     | mail    | monday       | android | 100          | 100
tuesday    | mail    | monday       | android | 83           | 100
monday     | mail    | monday       | ios     | 75           | 75
tuesday    | mail    | monday       | ios     | 62           | 75
24. Calculating User Retention
Example with one dimension, os_name: What’s my 1-day retention for users who installed on Monday?
event_date | product | install_date | os_name | active_users | cohort_size
tuesday    | mail    | monday       | android | 83           | 100
tuesday    | mail    | monday       | ios     | 62           | 75
total      |         |              |         | 145          | 175
Aggregate retention across both iOS and Android is (83 + 62) / (100 + 75) ≈ 83%
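The aggregation above can be sketched in a few lines of Python; the rows mirror the slide’s table, and the key point is that aggregate retention is a ratio of sums, not an average of per-cohort percentages:

```python
# Day-1 retention rows for the Monday install cohort (from the slide).
rows = [
    {"os_name": "android", "active_users": 83, "cohort_size": 100},
    {"os_name": "ios",     "active_users": 62, "cohort_size": 75},
]

# Sum actives and cohort sizes first, then divide. Averaging the
# per-OS percentages instead would over-weight the smaller cohort.
retention = sum(r["active_users"] for r in rows) / sum(r["cohort_size"] for r in rows)
print(round(retention * 100))  # → 83
```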
25. Calculating User Retention
Steps:
1. For each day, determine whether each device was active or not
device_id | date       | is_active
1         | 2015-01-01 | 1
1         | 2015-01-02 | 0
2         | 2015-01-01 | 1
2         | 2015-01-02 | 1
26. Calculating User Retention
Steps:
1. For each day, determine whether each device was active or not
2. Join device attributes to results of Step 1
device_id | date       | is_active | os  | install_date
1         | 2015-01-01 | 1         | ios | 2015-01-01
1         | 2015-01-02 | 0         | ios | 2015-01-01
2         | 2015-01-01 | 1         | ios | 2015-01-01
2         | 2015-01-02 | 1         | ios | 2015-01-01
27. Calculating User Retention
Steps:
1. For each day, determine whether each device was active or not
2. Join device attributes to results of Step 1
3. SUM is_active column, grouping by date, os, and install_date (and any other cohort dimensions)
date       | active_user_count | os  | install_date
2015-01-01 | 2                 | ios | 2015-01-01
2015-01-02 | 1                 | ios | 2015-01-01
28. Calculating User Retention
Steps:
1. For each day, determine whether each device was active or not
2. Join device attributes to results of Step 1
3. SUM is_active column, grouping by date, os, and install_date (and any other cohort dimensions)
4. Join the size of each cohort to the result of Step 3
date       | active_user_count | os  | install_date | cohort_size
2015-01-01 | 2                 | ios | 2015-01-01   | 2
2015-01-02 | 1                 | ios | 2015-01-01   | 2
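The four steps can be sketched end to end in plain Python (a stand-in for the equivalent Redshift SQL). The sample rows follow the tables above, assuming device 2 was active on both days, which is consistent with the Step 3 counts:

```python
from collections import defaultdict

# Step 1: daily activity per device (device_id, date, is_active).
activity = [
    (1, "2015-01-01", 1),
    (1, "2015-01-02", 0),
    (2, "2015-01-01", 1),
    (2, "2015-01-02", 1),
]

# Step 2: join device attributes (os, install_date) onto the activity rows.
devices = {1: ("ios", "2015-01-01"), 2: ("ios", "2015-01-01")}
joined = [(d, day, active, *devices[d]) for d, day, active in activity]

# Step 3: SUM is_active, grouping by (date, os, install_date).
active_counts = defaultdict(int)
for _, day, active, os_name, install_date in joined:
    active_counts[(day, os_name, install_date)] += active

# Step 4: join the size of each (os, install_date) cohort onto Step 3.
cohort_members = defaultdict(set)
for d, (os_name, install_date) in devices.items():
    cohort_members[(os_name, install_date)].add(d)

result = [
    (day, count, os_name, install_date, len(cohort_members[(os_name, install_date)]))
    for (day, os_name, install_date), count in sorted(active_counts.items())
]
print(result)
```

Running this reproduces the final table: two devices active on 2015-01-01, one on 2015-01-02, out of a cohort of two.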
30. Lessons Learned
▪ Summarize data for optimal query performance (hourly or daily rollups)
▪ Think carefully about the data model ahead of time. Choose the right sort keys.
▪ Invest in a good tool for ETL (we use Airflow)
▪ Invest in a good tool for query building and sharing (we use Looker)
▪ Reserve plenty of spare capacity (at least 40% free)
▪ Reserved nodes are much cheaper
▪ DC (dense compute) nodes are faster but have much smaller capacity
32. Introduction
• We use Redshift to power our own implementation of Looker, which serves every department with business intelligence and data for analytics.
• I have worked at Looker for just over two years, doing everything from Sales Engineering to Professional Services to Data Engineering. I currently head up our internal analytics efforts.
33. Agenda
• How Looker uses Redshift to supply business intelligence and drive analytics internally.
• How a few Looker customers use Redshift for reporting and analytics.
34. Looker and Redshift
At Looker, we have two major use cases which drove our decision to go with Redshift:
• fast analysis of usage data (300+ million events);
• centralization of multiple data sources into a single warehouse.
35. What We Care About Most
• Customer Health:
- MoM/WoW percent change in usage
- Users added/removed
- User engagement (developer, explorer, consumer, occasional consumer)
- LookML contributions and contributors
• Product Usage:
- Features used/not used
- Release pain points
- Github issue/feature tracking
• Reporting for Sales and Marketing:
- Usage in trial
- Performance to quota (sales, meetings, leads, etc.)
- Lead/prospect fit
- Campaign attribution
- SaaS metrics: MRR, cMRR, Churn
44. Analyze - Lead Scoring
• Construct historical data set or “Look.”
• GET the “Look” using the Looker API (API 3.0).
• Train/test model in R.
• Output PMML file.
• EC2 hosts Openscoring REST service + PMML.
• Hit Salesforce API for new leads (GET lead); score leads; update each lead record (UPDATE lead).
• View prioritized lists in Looker.
45. Why Our Customers Use Redshift
• Scale/Performance
- Transactional databases are not ideal for analytics (slow).
- Redshift scales quickly and is incredibly fast.
• Accessibility
- SQL is in many analysts’ wheelhouse and is easy to adopt.
- Obvious choice for those in the AWS ecosystem or who prefer managed offerings.
• Centralization of data
- When it comes time to tie top-of-funnel actions to bottom-of-funnel behavior.
46. How Our Customers Use Redshift
• Backstage/Sonicbids: They built an artist search tool that uses social data from Facebook, Twitter, YouTube, and Soundcloud to inform booking agents on what sort of draw they could expect from a certain artist. They used Snowplow, Redshift, the Looker API, and Elasticsearch to build this system.
47. How Our Customers Use Redshift
• Smartling: sources website translation snippets from translators the world over. They maintain a database of translated snippets, like “the car is red” in Turkish, in order to validate incoming translations. So, when a request for “the car is blue” in Turkish comes in, they can assess the syntactic validity of the translation.