© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Adam Savitzky, Yahoo!
Tina Adams, AWS
October 2015
DAT308
How Yahoo! Analyzes Billions of Events with
Amazon Redshift
Fast, simple, petabyte-scale data warehousing for $1,000/TB/Year
Amazon Redshift: a lot faster, a lot cheaper, a lot simpler
Amazon Redshift architecture
Leader node
Simple SQL end point
Stores metadata
Optimizes query plan
Coordinates query execution
Compute nodes
Local columnar storage
Parallel/distributed execution of all queries,
loads, backups, restores, resizes
Start at $0.25/hour, grow to 2 PB (compressed)
DC1: SSD; scale from 160 GB to 326 TB
DS2: HDD; scale from 2 TB to 2 PB
10 GigE
(HPC)
Ingestion
Backup
Restore
JDBC/ODBC
Amazon Redshift is priced to analyze all your data
Pricing is simple
# of nodes X hourly price
No charge for leader node
3x data compression on avg
Three copies of data
DS2 (HDD)            Price per hour for smallest single node   Effective annual price per TB (compressed)
On-Demand            $0.850                                    $3,725
1 Year Reservation   $0.500                                    $2,190
3 Year Reservation   $0.228                                    $999

DC1 (SSD)            Price per hour for smallest single node   Effective annual price per TB (compressed)
On-Demand            $0.250                                    $13,690
1 Year Reservation   $0.161                                    $8,795
3 Year Reservation   $0.100                                    $5,500
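The "effective annual price per TB" column can be approximated from the hourly price. The sketch below assumes the smallest node sizes given on the architecture slide (2 TB for DS2, 160 GB for DC1) and a node that runs all year; the function name is illustrative. The results land within about 1% of the slide's figures, which appear to be rounded.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def annual_price_per_tb(hourly_price_usd: float, node_storage_tb: float) -> float:
    """Cost of running one node for a full year, divided by its storage in TB."""
    return hourly_price_usd * HOURS_PER_YEAR / node_storage_tb

# Assumed smallest-node capacities, taken from the architecture slide.
DS2_NODE_TB = 2.0    # "DS2: HDD; scale from 2 TB"
DC1_NODE_TB = 0.16   # "DC1: SSD; scale from 160 GB"

ds2_on_demand = annual_price_per_tb(0.850, DS2_NODE_TB)
ds2_three_year = annual_price_per_tb(0.228, DS2_NODE_TB)
dc1_on_demand = annual_price_per_tb(0.250, DC1_NODE_TB)

print(round(ds2_on_demand), round(ds2_three_year), round(dc1_on_demand))
```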
Amazon Redshift is easy to use
Provision in minutes
Monitor query performance
Point and click resize
Built-in security
Automatic backups
Selected Amazon Redshift customers
Analytics at Yahoo
What to expect from the session
• What does analytics mean for Yahoo?
• Learn how our extract, transform, load (ETL) process runs
• Learn about our Amazon Redshift architecture
• Do’s, don’ts, and best practices for working with
Amazon Redshift
• Deep dive into advanced analytics, featuring how we
define and report user retention
Setting the stage
“We are returning an iconic company
to greatness.”
—Marissa Mayer
Guiding principles
“You can’t grow a product that hasn’t
reached product market fit.”
—Arjun Sethi, @arjset
Guiding principles
Analytics is critical for growth
Overall volume
[Bar chart, in billions (axis 0 to 90): Yahoo Events, Auto Miles Driven, Google Searches, McDonald's Fries Served, Babies Born]
Audience data breakdown
[Chart segments: Desktop, Mail, Tumblr, Sports, Weather, Front Page, Aviate, Other]
Hadoop
Clusters   Nodes    Data centers   Data
14         42,000   3              500 PB
Hive
Slow Hard to use
Hard to share
Hard to repeat
Hive
And many others…
Benchmarks (lower is better)
[Bar chart, seconds on a log scale (1 to 10,000): Count Distinct Devices, Count All Events, Filter Clauses, Joins; series: Amazon Redshift, Vertica, Impala]
Amazon Redshift at Yahoo
Nodes         Events per Day   Queries per Day   Data
21 dc1.8xl    2B               1,200             27 TB
Architecture
Extract, transform, load (ETL)
Hadoop • Pig
S3 • Airflow
Amazon Redshift • Looker
ETL—upstream
Clickstream Data (Hadoop)
  -> Hourly Batch Process (Oozie)
  -> Intermediate Storage (HDFS)
  -> Custom Uploader (python/boto)
  -> AWS (S3)
ETL—downstream
Data available?
Copy to Amazon Redshift
Sanitize
Export new installs
Process new installs
Update hourly table
Update install table
Update params
Subdivide params
Clean up
Subdivide events
Data flows in hourly from S3 to Amazon Redshift, where it's processed and subdivided.
ETL—downstream
Visualization of running and complete tasks
Schema
[Schema diagram: raw tables, summary tables, and derived tables]
event_raw_mail, event_raw_flickr, event_raw_homerun, event_raw_stark, event_raw_livetext (combined in the event_raw union view)
param_mail, param_flickr, param_homerun, param_stark, param_livetext (combined in the param union view)
event_hourly, event_daily
install, install_attribution
user_retention, funnel, first_event_date
is_active, param_keys, telemetry_daily, revenue_daily
ETL—Nightly
24 hours available?
Wipe old data
Build daily table
Build user retention
Build funnel
Vacuum
Runs all daily aggregations and cleans up/vacuums
Do’s and don’ts
DO
Summarize
user_id event_date action
1 2015-10-08 spam
1 2015-10-08 spam
1 2015-10-08 spam
1 2015-10-08 spam
1 2015-10-08 spam
user_id event_date action event_count
1 2015-10-08 spam 5
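The summarization above can be sketched as a GROUP BY that collapses repeated events into one row with a count. This is a minimal illustration using Python's built-in sqlite3 module rather than Amazon Redshift; the table names event_raw and event_daily follow the schema slide, and the column layout is assumed from the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_raw (user_id INTEGER, event_date TEXT, action TEXT)")
conn.execute(
    "CREATE TABLE event_daily "
    "(user_id INTEGER, event_date TEXT, action TEXT, event_count INTEGER)"
)

# Five identical raw events, as on the slide.
conn.executemany(
    "INSERT INTO event_raw VALUES (?, ?, ?)",
    [(1, "2015-10-08", "spam")] * 5,
)

# One summary row per (user, day, action) instead of one row per event.
conn.execute(
    """
    INSERT INTO event_daily
    SELECT user_id, event_date, action, COUNT(*) AS event_count
    FROM event_raw
    GROUP BY user_id, event_date, action
    """
)

summary = conn.execute("SELECT * FROM event_daily").fetchall()
print(summary)  # [(1, '2015-10-08', 'spam', 5)]
```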
DO
Choose good sort keys (and use them)
CREATE TABLE revenue (
    customer_id    BIGINT,
    transaction_id BIGINT,
    location       VARCHAR(64),
    event_date     DATE,
    event_ts       TIMESTAMP,
    revenue_usd    DECIMAL(18,2)  -- a bare DECIMAL defaults to scale 0 and would drop cents
)
DISTKEY(customer_id)
SORTKEY(
    location,
    event_date,
    customer_id
);
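Why "use them" matters: Redshift records the minimum and maximum value held in each storage block, so a filter on a leading sort key column lets it skip blocks that cannot match. The toy model below is only an illustration of that idea, not Redshift's implementation; the block size and the data are made up.

```python
def blocks_scanned(values, lo, hi, block_size=100):
    """Count blocks whose [min, max] range overlaps the filter lo <= v < hi."""
    scanned = 0
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        if min(block) < hi and max(block) >= lo:
            scanned += 1
    return scanned

# The same 1,000 values, stored sorted and stored in a scattered order.
sorted_column = list(range(1000))
unsorted_column = [(i * 37) % 1000 for i in range(1000)]  # a permutation of 0..999

sorted_scan = blocks_scanned(sorted_column, 250, 260)
unsorted_scan = blocks_scanned(unsorted_column, 250, 260)
print(sorted_scan, unsorted_scan)  # 1 10
```

With sorted data the filter touches a single block; with scattered data every block's range overlaps the filter, so nothing can be skipped.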
DO
Vacuum nightly
(or weekly and tell people you do it nightly)
DO
Avoid joins
where possible—and learn mitigation strategies for when
you must join
Join mitigation strategies
Key distribution
• Records distributed by distkey
• Choose a field that you join on
• Avoid causing excess skew
All distribution
• All records distributed to all nodes
• Most robust, but most space-intensive
Fastest joins occur when records are colocated
Key distribution
Node 1: A.1 B.1, A.3 B.3, A.5 B.5
Node 2: A.2 B.2, A.4 B.4, A.6 B.6
All distribution
Node 1: A.1 B.1, A.2 B.2, A.3 B.3, A.4 B.4, A.5 B.5, A.6 B.6
Node 2: A.1 B.1, A.2 B.2, A.3 B.3, A.4 B.4, A.5 B.5, A.6 B.6
Even distribution
Node 1: A.1 B.6, A.5 B.2, A.3 B.3
Node 2: A.4 B.1, A.2 B.5, A.6 B.4
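The three layouts above can be compared by counting how many join keys have their A row and B row on the same node. This is a toy sketch: a modulo stands in for the distkey hash, the arrival orders are chosen to reproduce the placement shown for even distribution, and all names are illustrative.

```python
def colocated_keys(node_of_a, node_of_b):
    """Return the join keys whose A row and B row sit on the same node."""
    return sorted(k for k in node_of_a if node_of_a[k] == node_of_b.get(k))

keys = range(1, 7)
NODES = 2

# KEY distribution: both tables place a row by the same function of the join key.
key_a = {k: k % NODES for k in keys}
key_b = {k: k % NODES for k in keys}

# EVEN distribution: round-robin by arrival order, independent of the key.
a_arrival = [1, 4, 5, 2, 3, 6]
b_arrival = [6, 1, 2, 5, 3, 4]
even_a = {k: i % NODES for i, k in enumerate(a_arrival)}
even_b = {k: i % NODES for i, k in enumerate(b_arrival)}

print(colocated_keys(key_a, key_b))    # [1, 2, 3, 4, 5, 6]
print(colocated_keys(even_a, even_b))  # [3, 4]
# ALL distribution is not modeled: every node holds every row, so every join
# is local, at the cost of storing one copy of the table per node.
```

Rows that are not colocated have to be moved between nodes at query time, which is the cost the key and all styles avoid.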
DO
Automate
DON’T
Fill the cluster
(leave more than you think)
DON’T
Run ETL in the default queue
Workload management (WLM) is your friend
Example WLM configuration
Queue   Concurrency   User Groups   Timeout (ms)   Memory (%)
1       1             etl           -              50
2       10            -             60,000         50
Two queues: ETL and ad hoc
Purpose: Insulate normal users from ETL and free up plenty of memory for big batch jobs
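The two-queue setup above might be expressed as a WLM JSON document along these lines. This is a sketch: the key names are the ones I understand the wlm_json_configuration cluster parameter to accept, so check them against the current Redshift documentation before relying on them.

```python
import json

wlm_config = [
    {
        # Queue 1: ETL user group only, one query at a time, half the memory.
        "user_group": ["etl"],
        "query_concurrency": 1,
        "memory_percent_to_use": 50,
    },
    {
        # Queue 2: default queue for ad hoc users, with a 60-second timeout.
        "query_concurrency": 10,
        "max_execution_time": 60000,  # milliseconds
        "memory_percent_to_use": 50,
    },
]

wlm_json = json.dumps(wlm_config)
print(wlm_json)
```

Giving ETL a concurrency of 1 means its single slot gets the queue's whole memory share, which suits large batch jobs.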
DON’T
Use CREATE TABLE AS
For permanent tables
DON’T
Email SQL around
Find a good reporting tool
Deep dive: user retention
User retention is…
The most important* quality metric for your product
* kinda
Day-14 retention over time
User retention and growth
N-day retention
User retention and growth
[Line chart: Daily Active Users (0 to 6,000) by Product Age (days 0 to 14), Product A vs. Product B]
High churn = wasted ad dollars
[Chart: dollars ($0 to $25,000) by Product age (days 1 to 14), Product A vs. Product B]
The Sputnik method
For generating a multidimensional user retention analysis table
event_date install_date os_name country active_users cohort_size
monday monday android us 100 100
tuesday monday android us 83 100
monday monday ios us 75 75
tuesday monday ios us 75 75
Get one-day retention
SELECT
    SUM(active_users) AS active_users,
    SUM(cohort_size) AS cohort_size,
    -- Cast to FLOAT: dividing two integer sums would truncate the ratio
    SUM(active_users)::FLOAT / SUM(cohort_size) AS retention
FROM user_retention
WHERE
    event_date - install_date = 1 AND
    CURRENT_DATE - 1 > event_date;
Get one-day retention
event_date install_date os_name country active_users cohort_size
monday monday android us 100 100
tuesday monday android us 83 100
monday monday ios us 75 75
tuesday monday ios us 75 75
Active Users: 83 + 75 = 158
Cohort Size: 100 + 75 = 175
-------------------------------
Pct Retention = 158 / 175 = 90%
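The arithmetic above can be checked with a small runnable sketch. It uses Python's built-in sqlite3 module instead of Amazon Redshift, so the date difference is written with julianday() and the CURRENT_DATE guard is left out because the sample dates are fixed. The concrete dates (Monday 2015-10-05, Tuesday 2015-10-06) are assumptions standing in for "monday" and "tuesday".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_retention (event_date TEXT, install_date TEXT, "
    "os_name TEXT, country TEXT, active_users INTEGER, cohort_size INTEGER)"
)
conn.executemany(
    "INSERT INTO user_retention VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2015-10-05", "2015-10-05", "android", "us", 100, 100),
        ("2015-10-06", "2015-10-05", "android", "us", 83, 100),
        ("2015-10-05", "2015-10-05", "ios", "us", 75, 75),
        ("2015-10-06", "2015-10-05", "ios", "us", 75, 75),
    ],
)

# Keep only rows exactly one day after install, then divide the two sums.
active, cohort, retention = conn.execute(
    """
    SELECT SUM(active_users), SUM(cohort_size),
           1.0 * SUM(active_users) / SUM(cohort_size)
    FROM user_retention
    WHERE CAST(julianday(event_date) - julianday(install_date) AS INTEGER) = 1
    """
).fetchone()
print(active, cohort, round(retention, 3))  # 158 175 0.903
```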
Get one-day retention by OS
SELECT
    os_name,
    SUM(active_users) AS active_users,
    SUM(cohort_size) AS cohort_size,
    -- Cast to FLOAT: dividing two integer sums would truncate the ratio
    SUM(active_users)::FLOAT / SUM(cohort_size) AS retention
FROM user_retention
WHERE
    event_date - install_date = 1 AND
    CURRENT_DATE - 1 > event_date
GROUP BY 1;
Get one-day retention
event_date install_date os_name country active_users cohort_size
monday monday android us 100 100
tuesday monday android us 83 100
monday monday ios us 75 75
tuesday monday ios us 75 75
Android:
Active Users: 83
Cohort Size: 100
-------------------
Pct Retention = 83%
iOS:
Active Users: 75
Cohort Size: 75
--------------------
Pct Retention = 100%
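The per-OS breakdown differs from the overall calculation only by a GROUP BY on the dimension. A minimal sketch with Python's built-in sqlite3 module (not Amazon Redshift), again assuming concrete dates for "monday" and "tuesday":

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE user_retention (event_date TEXT, install_date TEXT, "
    "os_name TEXT, country TEXT, active_users INTEGER, cohort_size INTEGER)"
)
conn.executemany(
    "INSERT INTO user_retention VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2015-10-05", "2015-10-05", "android", "us", 100, 100),
        ("2015-10-06", "2015-10-05", "android", "us", 83, 100),
        ("2015-10-05", "2015-10-05", "ios", "us", 75, 75),
        ("2015-10-06", "2015-10-05", "ios", "us", 75, 75),
    ],
)

# Same day-1 filter as before, with one result row per operating system.
by_os = conn.execute(
    """
    SELECT os_name, SUM(active_users), SUM(cohort_size),
           1.0 * SUM(active_users) / SUM(cohort_size) AS retention
    FROM user_retention
    WHERE CAST(julianday(event_date) - julianday(install_date) AS INTEGER) = 1
    GROUP BY os_name
    ORDER BY os_name
    """
).fetchall()
for os_name, active, cohort, retention in by_os:
    print(os_name, active, cohort, round(retention, 2))
```

Because the table stores sums per cohort rather than ratios, any dimension can be added to or dropped from the GROUP BY without rebuilding it.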
The Sputnik method
You will need:
Daily event
summary
User
user_id
The Sputnik method
Calculate cohort sizes
• Count users by all dimensions
• For example: Male, iOS, in USA, who installed today
Determine user activity
• For each day, for each user, were they active?
• Create a table with user_id and event_date
Join and aggregate
• Join user table to user_activity on user_id
• SUM active users by cohort and join to cohort sizes
Calculate cohort sizes
Input (user table):
user_id install_date os_name country
1 2015-10-02 iOS us
2 2015-10-01 android ca
3 2015-10-01 android ca
SELECT
    install_date, os_name, country,
    COUNT(*) AS cohort_size
FROM "user"  -- quoted because USER is a reserved word in Amazon Redshift
GROUP BY 1, 2, 3;
Result:
install_date os_name country cohort_size
2015-10-02 iOS us 1
2015-10-01 android ca 2
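The cohort-size step can be run end to end on the three sample users. This sketch uses Python's built-in sqlite3 module rather than Amazon Redshift, and spells out the GROUP BY columns by name instead of by position.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "user" '
    "(user_id INTEGER, install_date TEXT, os_name TEXT, country TEXT)"
)
conn.executemany(
    'INSERT INTO "user" VALUES (?, ?, ?, ?)',
    [
        (1, "2015-10-02", "iOS", "us"),
        (2, "2015-10-01", "android", "ca"),
        (3, "2015-10-01", "android", "ca"),
    ],
)

# One row per combination of cohort dimensions, with the number of users in it.
cohort_sizes = conn.execute(
    """
    SELECT install_date, os_name, country, COUNT(*) AS cohort_size
    FROM "user"
    GROUP BY install_date, os_name, country
    ORDER BY install_date
    """
).fetchall()
print(cohort_sizes)
# [('2015-10-01', 'android', 'ca', 2), ('2015-10-02', 'iOS', 'us', 1)]
```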
Determine user activity
user_id event_date action
1 2015-10-02 app_open
1 2015-10-02 spam
1 2015-10-03 app_open
CREATE TEMP TABLE user_activity AS
SELECT
    DISTINCT user_id, event_date
FROM event_daily
WHERE action = 'app_open';
CREATE TEMP TABLE all_users AS
SELECT DISTINCT user_id FROM event_daily;
CREATE TEMP TABLE all_dates AS
SELECT DISTINCT event_date FROM event_daily;
CREATE TABLE active_users_by_day AS
SELECT
    xproduct.user_id, xproduct.event_date
FROM (
    SELECT * FROM all_users CROSS JOIN all_dates
) xproduct
INNER JOIN user_activity u ON
    u.user_id = xproduct.user_id AND
    u.event_date = xproduct.event_date;  -- match on the date too, so only days with activity remain
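The user-activity statements above can be exercised on the sample events. This is a sketch with Python's built-in sqlite3 module, not Amazon Redshift. It assumes the date helper table is named all_dates, as in the later joins, and that the final join matches on both user_id and event_date. User 2 is a hypothetical extra row added to show that a user who never opens the app is excluded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE event_daily (user_id INTEGER, event_date TEXT, action TEXT);
    INSERT INTO event_daily VALUES
        (1, '2015-10-02', 'app_open'),
        (1, '2015-10-02', 'spam'),
        (1, '2015-10-03', 'app_open'),
        (2, '2015-10-04', 'spam');

    -- A user counts as active on a day if they opened the app that day.
    CREATE TEMP TABLE user_activity AS
    SELECT DISTINCT user_id, event_date FROM event_daily WHERE action = 'app_open';

    CREATE TEMP TABLE all_users AS SELECT DISTINCT user_id FROM event_daily;
    CREATE TEMP TABLE all_dates AS SELECT DISTINCT event_date FROM event_daily;

    -- Every (user, day) pair, kept only where the user was active that day.
    CREATE TEMP TABLE active_users_by_day AS
    SELECT xproduct.user_id, xproduct.event_date
    FROM (SELECT * FROM all_users CROSS JOIN all_dates) xproduct
    INNER JOIN user_activity u
        ON u.user_id = xproduct.user_id AND u.event_date = xproduct.event_date;
    """
)

active_days = conn.execute(
    "SELECT user_id, event_date FROM active_users_by_day ORDER BY event_date"
).fetchall()
print(active_days)  # [(1, '2015-10-02'), (1, '2015-10-03')]
```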
Determine cohort activity
user_id event_date
1 2015-10-02
1 2015-10-03
CREATE TEMP TABLE cohort_activity AS
SELECT
    u.*, all_dates.event_date,
    CASE WHEN au.user_id IS NOT NULL THEN 1 ELSE 0 END AS is_active
FROM "user" AS u
INNER JOIN all_dates ON all_dates.event_date >= u.install_date
LEFT JOIN active_users_by_day AS au ON
    au.user_id = u.user_id AND
    au.event_date = all_dates.event_date;
user_id install_date os_name country
1 2015-10-02 iOS us
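The cohort-activity join can be tried on the sample user. This sketch uses Python's built-in sqlite3 module rather than Amazon Redshift and writes the hit-or-miss flag as a CASE expression, which is one plausible reading of the placeholder in the query. The four dates in all_dates are assumed for illustration; the first one precedes the install date and should be dropped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE "user" (user_id INTEGER, install_date TEXT, os_name TEXT, country TEXT);
    INSERT INTO "user" VALUES (1, '2015-10-02', 'iOS', 'us');

    CREATE TABLE all_dates (event_date TEXT);
    INSERT INTO all_dates VALUES
        ('2015-10-01'), ('2015-10-02'), ('2015-10-03'), ('2015-10-04');

    CREATE TABLE active_users_by_day (user_id INTEGER, event_date TEXT);
    INSERT INTO active_users_by_day VALUES (1, '2015-10-02'), (1, '2015-10-03');

    -- One row per user per day since install; is_active is 1 on a hit, 0 on a miss.
    CREATE TEMP TABLE cohort_activity AS
    SELECT u.user_id, d.event_date, u.install_date, u.os_name, u.country,
           CASE WHEN au.user_id IS NOT NULL THEN 1 ELSE 0 END AS is_active
    FROM "user" AS u
    INNER JOIN all_dates AS d ON d.event_date >= u.install_date
    LEFT JOIN active_users_by_day AS au
        ON au.user_id = u.user_id AND au.event_date = d.event_date;
    """
)

cohort_activity = conn.execute(
    "SELECT * FROM cohort_activity ORDER BY event_date"
).fetchall()
for row in cohort_activity:
    print(row)
```

The LEFT JOIN is what keeps the inactive days: without it, the 2015-10-04 row would disappear instead of counting as a miss.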
Determine cohort activity
user_id event_date install_date os_name country is_active
1 2015-10-02 2015-10-02 iOS us 1
1 2015-10-03 2015-10-02 iOS us 1
1 2015-10-04 2015-10-02 iOS us 0
CREATE TEMP TABLE active_users AS
SELECT
    event_date,
    install_date, os_name, country,
    SUM(is_active) AS active_users
FROM cohort_activity
GROUP BY 1, 2, 3, 4;
Determine cohort activity
event_date install_date os_name country active_users
2015-10-03 2015-10-02 iOS us 100
2015-10-03 2015-10-02 android us 350
2015-10-03 2015-10-02 iOS ca 50
Join these two tables on matching cohort dimensions
install_date os_name country cohort_size
2015-10-02 iOS us 200
2015-10-02 android us 400
2015-10-02 iOS ca 60
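The final join can be sketched as follows, using Python's built-in sqlite3 module rather than Amazon Redshift. The slides do not name the cohort-size table, so cohort_sizes is a hypothetical name; the output columns follow the user_retention table shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE active_users (event_date TEXT, install_date TEXT,
                               os_name TEXT, country TEXT, active_users INTEGER);
    INSERT INTO active_users VALUES
        ('2015-10-03', '2015-10-02', 'iOS', 'us', 100),
        ('2015-10-03', '2015-10-02', 'android', 'us', 350),
        ('2015-10-03', '2015-10-02', 'iOS', 'ca', 50);

    CREATE TABLE cohort_sizes (install_date TEXT, os_name TEXT,
                               country TEXT, cohort_size INTEGER);
    INSERT INTO cohort_sizes VALUES
        ('2015-10-02', 'iOS', 'us', 200),
        ('2015-10-02', 'android', 'us', 400),
        ('2015-10-02', 'iOS', 'ca', 60);

    CREATE TABLE user_retention (event_date TEXT, install_date TEXT, os_name TEXT,
                                 country TEXT, active_users INTEGER, cohort_size INTEGER);

    -- Attach each cohort's size to its daily active count.
    INSERT INTO user_retention
    SELECT a.event_date, a.install_date, a.os_name, a.country,
           a.active_users, c.cohort_size
    FROM active_users AS a
    INNER JOIN cohort_sizes AS c
        ON c.install_date = a.install_date
       AND c.os_name = a.os_name
       AND c.country = a.country;
    """
)

retention_rows = conn.execute(
    "SELECT * FROM user_retention ORDER BY cohort_size"
).fetchall()
for row in retention_rows:
    print(row)
```

Storing active_users and cohort_size side by side, rather than a precomputed percentage, is what lets the earlier retention queries sum across any subset of dimensions.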
Big wins for Yahoo
Real-time insights
Easier deployment and maintenance
Data-driven product development
Cutting edge analytics
Thank you!
Related sessions
Hear from other customers discussing their Amazon Redshift use cases:
• DAT201—Introduction to Amazon Redshift (RetailMeNot)
• ISM303—Migrating Your Enterprise Data Warehouse to Amazon Redshift (Boingo Wireless
and Edmunds)
• ARC303—Pure Play Video OTT: A Microservices Architecture in the Cloud (Verizon)
• ARC305—Self-Service Cloud Services: How J&J Is Managing AWS at Scale for Enterprise
Workloads
• BDT306—The Life of a Click: How Hearst Publishing Manages Clickstream Analytics with
AWS
• DAT311—Large-Scale Genomic Analysis with Amazon Redshift (Human Longevity)
• BDT314—Running a Big Data and Analytics Application on Amazon EMR and Amazon
Redshift with a Focus on Security (Nasdaq)
• BDT316—Offloading ETL to Amazon Elastic MapReduce (Amgen)
• BDT401—Amazon Redshift Deep Dive (TripAdvisor)
• Building a Mobile App using Amazon EC2, Amazon S3, Amazon DynamoDB, and Amazon
Redshift (Tinder)

More Related Content

What's hot

Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Amazon Web Services
 
Leveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseLeveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseAmazon Web Services
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
 
Powering Interactive Data Analysis at Pinterest by Amazon Redshift
Powering Interactive Data Analysis at Pinterest by Amazon RedshiftPowering Interactive Data Analysis at Pinterest by Amazon Redshift
Powering Interactive Data Analysis at Pinterest by Amazon RedshiftJie Li
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon RedshiftUses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon RedshiftAmazon Web Services
 
Building your data warehouse with Redshift
Building your data warehouse with RedshiftBuilding your data warehouse with Redshift
Building your data warehouse with RedshiftAmazon Web Services
 
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Amazon Web Services
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features Amazon Web Services
 
Scalability of Amazon Redshift Data Loading and Query Speed
Scalability of Amazon Redshift Data Loading and Query SpeedScalability of Amazon Redshift Data Loading and Query Speed
Scalability of Amazon Redshift Data Loading and Query SpeedFlyData Inc.
 
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar SeriesGetting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar SeriesAmazon Web Services
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013Amazon Web Services
 
Getting Started with Amazon EC2 and Compute Services
Getting Started with Amazon EC2 and Compute ServicesGetting Started with Amazon EC2 and Compute Services
Getting Started with Amazon EC2 and Compute ServicesAmazon Web Services
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon Web Services
 
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Web Services
 
AWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
AWS June 2016 Webinar Series - Amazon Redshift or Big Data AnalyticsAWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
AWS June 2016 Webinar Series - Amazon Redshift or Big Data AnalyticsAmazon Web Services
 

What's hot (20)

Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
 
Leveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data WarehouseLeveraging Amazon Redshift for your Data Warehouse
Leveraging Amazon Redshift for your Data Warehouse
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
Powering Interactive Data Analysis at Pinterest by Amazon Redshift
Powering Interactive Data Analysis at Pinterest by Amazon RedshiftPowering Interactive Data Analysis at Pinterest by Amazon Redshift
Powering Interactive Data Analysis at Pinterest by Amazon Redshift
 
Redshift overview
Redshift overviewRedshift overview
Redshift overview
 
Uses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon RedshiftUses and Best Practices for Amazon Redshift
Uses and Best Practices for Amazon Redshift
 
Building your data warehouse with Redshift
Building your data warehouse with RedshiftBuilding your data warehouse with Redshift
Building your data warehouse with Redshift
 
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r...
 
AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features AWS Webcast - Redshift Overview and New Features
AWS Webcast - Redshift Overview and New Features
 
Scalability of Amazon Redshift Data Loading and Query Speed
Scalability of Amazon Redshift Data Loading and Query SpeedScalability of Amazon Redshift Data Loading and Query Speed
Scalability of Amazon Redshift Data Loading and Query Speed
 
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar SeriesGetting Started with Amazon Redshift - AWS July 2016 Webinar Series
Getting Started with Amazon Redshift - AWS July 2016 Webinar Series
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
Introduction to Amazon Redshift and What's Next (DAT103) | AWS re:Invent 2013
 
Getting Started with Amazon EC2 and Compute Services
Getting Started with Amazon EC2 and Compute ServicesGetting Started with Amazon EC2 and Compute Services
Getting Started with Amazon EC2 and Compute Services
 
Amazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best PracticesAmazon EMR Deep Dive & Best Practices
Amazon EMR Deep Dive & Best Practices
 
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
Amazon Redshift in Action: Enterprise, Big Data, and SaaS Use Cases (DAT205) ...
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Deep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDBDeep Dive on Amazon DynamoDB
Deep Dive on Amazon DynamoDB
 
AWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
AWS June 2016 Webinar Series - Amazon Redshift or Big Data AnalyticsAWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
AWS June 2016 Webinar Series - Amazon Redshift or Big Data Analytics
 

Viewers also liked

AWS初心者向けWebinar AWSでBig Data活用
AWS初心者向けWebinar AWSでBig Data活用AWS初心者向けWebinar AWSでBig Data活用
AWS初心者向けWebinar AWSでBig Data活用Amazon Web Services Japan
 
Apache Hbase バルクロードの使い方
Apache Hbase バルクロードの使い方Apache Hbase バルクロードの使い方
Apache Hbase バルクロードの使い方Takeshi Mikami
 
Bluemixを使ったTwitter分析
Bluemixを使ったTwitter分析Bluemixを使ったTwitter分析
Bluemixを使ったTwitter分析Tanaka Yuichi
 
Apache Airflow入門 (マーケティングデータ分析基盤技術勉強会)
Apache Airflow入門  (マーケティングデータ分析基盤技術勉強会)Apache Airflow入門  (マーケティングデータ分析基盤技術勉強会)
Apache Airflow入門 (マーケティングデータ分析基盤技術勉強会)Takeshi Mikami
 
データ分析基盤構築のポイントと関連クラスメソッドサービスの紹介
データ分析基盤構築のポイントと関連クラスメソッドサービスの紹介データ分析基盤構築のポイントと関連クラスメソッドサービスの紹介
データ分析基盤構築のポイントと関連クラスメソッドサービスの紹介Yosuke Katsuki
 
たまにはOpenShiftも触ってみよう
たまにはOpenShiftも触ってみようたまにはOpenShiftも触ってみよう
たまにはOpenShiftも触ってみようKazuto Kusama
 
短期間で大規模なシンクラ環境を用意した話
短期間で大規模なシンクラ環境を用意した話短期間で大規模なシンクラ環境を用意した話
短期間で大規模なシンクラ環境を用意した話淳 千葉
 
AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側SmartNews, Inc.
 
iOSアプリ開発者から見たMobile Hub
iOSアプリ開発者から見たMobile HubiOSアプリ開発者から見たMobile Hub
iOSアプリ開発者から見たMobile HubJun Kato
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engineWalter Liu
 
Microsoft Azure - SQL Data Warehouse
Microsoft Azure - SQL Data WarehouseMicrosoft Azure - SQL Data Warehouse
Microsoft Azure - SQL Data WarehouseMicrosoft
 
Cloud Foundryで学ぶ、PaaSのしくみ講座
Cloud Foundryで学ぶ、PaaSのしくみ講座Cloud Foundryで学ぶ、PaaSのしくみ講座
Cloud Foundryで学ぶ、PaaSのしくみ講座Kazuto Kusama
 
Re:dash Use Cases at iPROS
Re:dash Use Cases at iPROSRe:dash Use Cases at iPROS
Re:dash Use Cases at iPROSJumpei Yokota
 
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...Amazon Web Services
 
Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSSmartNews, Inc.
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesDataWorks Summit
 
EmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とEmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とToru Takahashi
 
AWS Black Belt Online Seminar 2016 Amazon EMR
AWS Black Belt Online Seminar 2016 Amazon EMRAWS Black Belt Online Seminar 2016 Amazon EMR
AWS Black Belt Online Seminar 2016 Amazon EMRAmazon Web Services Japan
 

Viewers also liked (20)

AWS初心者向けWebinar AWSでBig Data活用
AWS初心者向けWebinar AWSでBig Data活用AWS初心者向けWebinar AWSでBig Data活用
AWS初心者向けWebinar AWSでBig Data活用
 
Apache Hbase バルクロードの使い方
Apache Hbase バルクロードの使い方Apache Hbase バルクロードの使い方
Apache Hbase バルクロードの使い方
 
Bluemixを使ったTwitter分析
Bluemixを使ったTwitter分析Bluemixを使ったTwitter分析
Bluemixを使ったTwitter分析
 
Apache Airflow入門 (マーケティングデータ分析基盤技術勉強会)
Apache Airflow入門  (マーケティングデータ分析基盤技術勉強会)Apache Airflow入門  (マーケティングデータ分析基盤技術勉強会)
Apache Airflow入門 (マーケティングデータ分析基盤技術勉強会)
 
データ分析基盤構築のポイントと関連クラスメソッドサービスの紹介
データ分析基盤構築のポイントと関連クラスメソッドサービスの紹介データ分析基盤構築のポイントと関連クラスメソッドサービスの紹介
データ分析基盤構築のポイントと関連クラスメソッドサービスの紹介
 
Cloud Foundry varz
Cloud Foundry varzCloud Foundry varz
Cloud Foundry varz
 
たまにはOpenShiftも触ってみよう
たまにはOpenShiftも触ってみようたまにはOpenShiftも触ってみよう
たまにはOpenShiftも触ってみよう
 
短期間で大規模なシンクラ環境を用意した話
短期間で大規模なシンクラ環境を用意した話短期間で大規模なシンクラ環境を用意した話
短期間で大規模なシンクラ環境を用意した話
 
AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側
 
iOSアプリ開発者から見たMobile Hub
iOSアプリ開発者から見たMobile HubiOSアプリ開発者から見たMobile Hub
iOSアプリ開発者から見たMobile Hub
 
Airflow - a data flow engine
Airflow - a data flow engineAirflow - a data flow engine
Airflow - a data flow engine
 
Microsoft Azure - SQL Data Warehouse
Microsoft Azure - SQL Data WarehouseMicrosoft Azure - SQL Data Warehouse
Microsoft Azure - SQL Data Warehouse
 
Cloud Foundryで学ぶ、PaaSのしくみ講座
Cloud Foundryで学ぶ、PaaSのしくみ講座Cloud Foundryで学ぶ、PaaSのしくみ講座
Cloud Foundryで学ぶ、PaaSのしくみ講座
 
Re:dash Use Cases at iPROS
Re:dash Use Cases at iPROSRe:dash Use Cases at iPROS
Re:dash Use Cases at iPROS
 
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
AWS re:Invent 2016: Netflix: Using Amazon S3 as the fabric of our big data ec...
 
Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWS
 
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data PipelinesAirflow - An Open Source Platform to Author and Monitor Data Pipelines
Airflow - An Open Source Platform to Author and Monitor Data Pipelines
 
AWS Black Belt Online Seminar 2016 AWS IoT
AWS Black Belt Online Seminar 2016 AWS IoTAWS Black Belt Online Seminar 2016 AWS IoT
AWS Black Belt Online Seminar 2016 AWS IoT
 
EmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤とEmbulkとDigdagとデータ分析基盤と
EmbulkとDigdagとデータ分析基盤と
 
AWS Black Belt Online Seminar 2016 Amazon EMR
AWS Black Belt Online Seminar 2016 Amazon EMRAWS Black Belt Online Seminar 2016 Amazon EMR
AWS Black Belt Online Seminar 2016 Amazon EMR
 

Similar to (DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift

AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced AnalyticsAWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced AnalyticsAmazon Web Services
 
Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”Amazon Web Services
 
Introducing Amazon RDS for PostgreSQL (DAT210) | AWS re:Invent 2013
Introducing Amazon RDS for PostgreSQL (DAT210) | AWS re:Invent 2013Introducing Amazon RDS for PostgreSQL (DAT210) | AWS re:Invent 2013
Introducing Amazon RDS for PostgreSQL (DAT210) | AWS re:Invent 2013Amazon Web Services
 
Azure Stream Analytics : Analyse Data in Motion
Azure Stream Analytics  : Analyse Data in MotionAzure Stream Analytics  : Analyse Data in Motion
Azure Stream Analytics : Analyse Data in MotionRuhani Arora
 
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...
AWS Partner Webcast - Analyze Big Data for Consumer Applications with Looker ...Amazon Web Services
 
High-performance database technology for rock-solid IoT solutions
High-performance database technology for rock-solid IoT solutionsHigh-performance database technology for rock-solid IoT solutions
High-performance database technology for rock-solid IoT solutionsClusterpoint
 
A Data Culture with Embedded Analytics in Action
A Data Culture with Embedded Analytics in ActionA Data Culture with Embedded Analytics in Action
A Data Culture with Embedded Analytics in ActionAmazon Web Services
 
Microsoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMicrosoft Azure Big Data Analytics
Microsoft Azure Big Data AnalyticsMark Kromer
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
How Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon RedshiftHow Glidewell Moves Data to Amazon Redshift
How Glidewell Moves Data to Amazon RedshiftAttunity
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureMark Kromer
 
Big Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudBig Data Use Cases and Solutions in the AWS Cloud
Big Data Use Cases and Solutions in the AWS CloudAmazon Web Services
 
AWS Summit Stockholm 2014 – B4 – Business intelligence on AWS
AWS Summit Stockholm 2014 – B4 – Business intelligence on AWSAWS Summit Stockholm 2014 – B4 – Business intelligence on AWS
AWS Summit Stockholm 2014 – B4 – Business intelligence on AWSAmazon Web Services
 
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...
Hadoop World 2011: Building Web Analytics Processing on Hadoop at CBS Interac...Cloudera, Inc.
 
eBay EDW元数据管理及应用
eBay EDW元数据管理及应用eBay EDW元数据管理及应用
eBay EDW元数据管理及应用mysqlops
 
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTDataHadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTData
Hadoop World 2011: Hadoop’s Life in Enterprise Systems - Y Masatani, NTTDataCloudera, Inc.
 
Understanding The Azure Platform November 09
Understanding The Azure Platform   November 09Understanding The Azure Platform   November 09
Understanding The Azure Platform November 09DavidGristwood
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 

Similar to (DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift (20)

AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced AnalyticsAWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
AWS July Webinar Series: Amazon Redshift Reporting and Advanced Analytics
 
Building Analytic Apps for SaaS: “Analytics as a Service”





(DAT308) Yahoo! Analyzes Billions of Events a Day on Amazon Redshift

  • 1. How Yahoo! Analyzes Billions of Events with Amazon Redshift (DAT308)
      Adam Savitzky, Yahoo!; Tina Adams, AWS. October 2015.
      © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  • 2. Amazon Redshift: fast, simple, petabyte-scale data warehousing for $1,000/TB/year. A lot faster, a lot cheaper, a lot simpler.
  • 3. Amazon Redshift architecture
      Leader node: simple SQL endpoint; stores metadata; optimizes the query plan; coordinates query execution.
      Compute nodes: local columnar storage; parallel/distributed execution of all queries, loads, backups, restores, and resizes.
      Start at $0.25/hour, grow to 2 PB (compressed). DC1: SSD; scale from 160 GB to 326 TB. DS2: HDD; scale from 2 TB to 2 PB.
      Diagram labels: JDBC/ODBC; 10 GigE (HPC); Ingestion; Backup; Restore.
  • 4. Amazon Redshift is priced to analyze all your data
      Pricing is simple: number of nodes × hourly price. No charge for the leader node. 3x data compression on average. Three copies of data.
      DS2 (HDD)            Price per hour (smallest single node)   Effective annual price per TB (compressed)
      On-Demand            $0.850                                  $3,725
      1 Year Reservation   $0.500                                  $2,190
      3 Year Reservation   $0.228                                  $999
      DC1 (SSD)            Price per hour (smallest single node)   Effective annual price per TB (compressed)
      On-Demand            $0.250                                  $13,690
      1 Year Reservation   $0.161                                  $8,795
      3 Year Reservation   $0.100                                  $5,500
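The "effective annual price per TB" column can be approximated from the hourly price. The sketch below assumes the figure is simply hourly price × hours per year ÷ storage of the smallest node (2 TB for DS2 and 160 GB for DC1, taken from the architecture slide). That derivation is an assumption, not something the deck states, and the slide's numbers appear to be rounded, so the results only roughly match.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours


def annual_price_per_tb(hourly_price_usd, node_storage_tb):
    """Yearly cost of one TB on a single node that runs all year."""
    return hourly_price_usd * HOURS_PER_YEAR / node_storage_tb


# Smallest node sizes as listed on the architecture slide (assumed inputs).
ds2_on_demand = annual_price_per_tb(0.850, 2.0)
ds2_one_year = annual_price_per_tb(0.500, 2.0)
ds2_three_year = annual_price_per_tb(0.228, 2.0)
dc1_on_demand = annual_price_per_tb(0.250, 0.16)

for label, value in [
    ("DS2 on-demand", ds2_on_demand),
    ("DS2 1-year", ds2_one_year),
    ("DS2 3-year", ds2_three_year),
    ("DC1 on-demand", dc1_on_demand),
]:
    print(f"{label}: ${value:,.0f} per TB per year")
```

The 3-year DS2 reservation is the row that lands near the headline $1,000/TB/year figure.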
  • 5. Amazon Redshift is easy to use: provision in minutes; monitor query performance; point-and-click resize; built-in security; automatic backups.
  • 8. What to expect from the session
      • What does analytics mean for Yahoo?
      • Learn how our extract, transform, load (ETL) process runs
      • Learn about our Amazon Redshift architecture
      • Do’s, don’ts, and best practices for working with Amazon Redshift
      • Deep dive into advanced analytics, featuring how we define and report user retention
  • 9. Setting the stage “We are returning an iconic company to greatness.” —Marissa Mayer
  • 11. Guiding principles “You can’t grow a product that hasn’t reached product market fit.” —Arjun Sethi, @arjset
  • 12. Guiding principles: Analytics is critical for growth
  • 13. Overall volume [bar chart, in billions (axis 0 to 90): Yahoo events, auto miles driven, Google searches, McDonald's fries served, babies born]
  • 15. Hadoop: 14 clusters; 42,000 nodes; 3 data centers; 500 PB of data
  • 16. Hive: slow; hard to use; hard to share; hard to repeat
  • 17. Hive
  • 19. Benchmarks (lower is better) [bar chart: seconds on a log scale from 1 to 10,000 for Count Distinct Devices, Count All Events, Filter Clauses, and Joins; Amazon Redshift vs. Vertica vs. Impala]
  • 20. Amazon Redshift at Yahoo: 21 dc1.8xl nodes; 2B events per day; 1,200 queries per day; 27 TB of data
  • 22. Extract, transform, load (ETL)
      Hadoop: Pig
      S3: Airflow
      Amazon Redshift: Looker
  • 24. ETL: downstream
      Data flows in hourly from S3 to Amazon Redshift, where it’s processed and subdivided.
      Steps in the flow diagram: Data available?; Copy to Amazon Redshift; Sanitize; Export new installs; Process new installs; Update hourly table; Update install table; Update params; Subdivide params; Clean up; Subdivide events.
  • 27. ETL: nightly
      Runs all daily aggregations and cleans up/vacuums.
      Steps in the flow diagram: 24 hours available?; Wipe old data; Build daily table; Build user retention; Build funnel; Vacuum.
  • 29. DO Summarize
      Before:
      user_id  event_date  action
      1        2015-10-08  spam
      1        2015-10-08  spam
      1        2015-10-08  spam
      1        2015-10-08  spam
      1        2015-10-08  spam
      After:
      user_id  event_date  action  event_count
      1        2015-10-08  spam    5
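The summarization on this slide amounts to a GROUP BY with a count. Below is a minimal sketch that runs the idea in SQLite through Python's `sqlite3` module as a local stand-in for Amazon Redshift. The raw table name `event_raw` is hypothetical; `event_daily` is borrowed from later slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_raw (user_id INTEGER, event_date TEXT, action TEXT)")
conn.execute(
    "CREATE TABLE event_daily ("
    "user_id INTEGER, event_date TEXT, action TEXT, event_count INTEGER)"
)

# Five identical raw events, as in the slide's "before" table.
conn.executemany(
    "INSERT INTO event_raw VALUES (?, ?, ?)",
    [(1, "2015-10-08", "spam")] * 5,
)

# Collapse repeated (user, day, action) rows into a single row with a count.
conn.execute(
    """
    INSERT INTO event_daily
    SELECT user_id, event_date, action, COUNT(*) AS event_count
    FROM event_raw
    GROUP BY user_id, event_date, action
    """
)

summary = conn.execute("SELECT * FROM event_daily").fetchall()
print(summary)
```

The summary table is created explicitly and filled with INSERT ... SELECT, in keeping with the deck's later advice against CREATE TABLE AS for permanent tables.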
  • 30. DO Choose good sort keys (and use them)
      CREATE TABLE revenue (
        customer_id BIGINT,
        transaction_id BIGINT,
        location VARCHAR(64),
        event_date DATE,
        event_ts TIMESTAMP,
        revenue_usd DECIMAL
      )
      DISTKEY(customer_id)
      SORTKEY(
        location,
        event_date,
        customer_id
      )
  • 31. DO Vacuum nightly (or weekly and tell people you do it nightly)
  • 32. DO Avoid joins where possible, and learn mitigation strategies for when you must join
  • 33. Join mitigation strategies
      Fastest joins occur when records are colocated.
      Key distribution: records distributed by distkey. Choose a field that you join on. Avoid causing excess skew.
      All distribution: all records distributed to all nodes. Most robust, but most space-intensive.
      Diagram (records held by each node):
      Key distribution:   [A.1 B.1 A.3 B.3 A.5 B.5]  [A.2 B.2 A.4 B.4 A.6 B.6]
      All distribution:   [A.1 B.1 A.2 B.2 A.3 B.3 A.4 B.4 A.5 B.5 A.6 B.6]  [A.1 B.1 A.2 B.2 A.3 B.3 A.4 B.4 A.5 B.5 A.6 B.6]
      Even distribution:  [A.1 B.6 A.5 B.2 A.3 B.3]  [A.4 B.1 A.2 B.5 A.6 B.4]
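Why key distribution colocates join partners can be seen in a toy model. The sketch below is not how Redshift is implemented: the hash function (`zlib.crc32`), the two-node setup, and all function names are illustrative assumptions. It only shows that hashing both tables on the join key puts matching rows on the same node, while round-robin placement gives no such guarantee.

```python
import zlib

NUM_NODES = 2  # matches the two-node diagram; any positive number works


def node_for_key(key):
    # Deterministic stand-in hash; Redshift's real hash function is not modeled.
    return zlib.crc32(str(key).encode()) % NUM_NODES


def distribute_by_key(rows):
    """KEY style: a row's join key decides which node stores it."""
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[node_for_key(row[0])].append(row)
    return nodes


def distribute_evenly(rows, start=0):
    """EVEN style: round-robin placement that ignores the row's values."""
    nodes = [[] for _ in range(NUM_NODES)]
    for i, row in enumerate(rows):
        nodes[(i + start) % NUM_NODES].append(row)
    return nodes


def colocated_matches(nodes_a, nodes_b):
    """Count join matches where both rows already sit on the same node."""
    total = 0
    for rows_a, rows_b in zip(nodes_a, nodes_b):
        keys_b = {row[0] for row in rows_b}
        total += sum(1 for row in rows_a if row[0] in keys_b)
    return total


table_a = [(k, f"A.{k}") for k in range(1, 7)]
table_b = [(k, f"B.{k}") for k in range(1, 7)]

key_local = colocated_matches(distribute_by_key(table_a), distribute_by_key(table_b))
# A different starting node models two tables loaded independently of each other.
even_local = colocated_matches(
    distribute_evenly(table_a), distribute_evenly(table_b, start=1)
)
print(key_local, even_local)
```

With key distribution all six matches are local; in this round-robin arrangement none are, so every joined row would have to move between nodes.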
  • 35. DON’T Fill the cluster (leave more than you think)
  • 36. DON’T Run ETL in the default queue. Workload management (WLM) is your friend
  • 37. Example WLM configuration
      Queue  Concurrency  User Groups  Timeout (ms)  Memory (%)
      1      1            etl                        50
      2      10                        60,000        50
      Two queues: ETL and ad hoc.
      Purpose: Insulate normal users from ETL and free up plenty of memory for big batch jobs.
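One way to express the two queues above is as a JSON value for the cluster parameter group's `wlm_json_configuration` parameter. The sketch below builds that value in Python. The key names (`user_group`, `query_concurrency`, `max_execution_time`, `memory_percent_to_use`) follow that parameter's format as I understand it and should be checked against the current Amazon Redshift documentation. The assignment of table cells to queues is also an assumption based on the flattened slide.

```python
import json

# Queue 1: ETL user group, one query at a time, half of the memory.
# Queue 2: everyone else (ad hoc), ten concurrent queries, 60 s timeout.
wlm_config = [
    {
        "user_group": ["etl"],
        "query_concurrency": 1,
        "memory_percent_to_use": 50,
    },
    {
        "query_concurrency": 10,
        "max_execution_time": 60000,  # milliseconds
        "memory_percent_to_use": 50,
    },
]

wlm_json = json.dumps(wlm_config)
print(wlm_json)
```

The single-slot ETL queue gives each batch job the queue's full memory share, while the timeout on the ad hoc queue keeps a runaway query from blocking other users.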
  • 38. DON’T Use CREATE TABLE AS for permanent tables
  • 39. DON’T Email SQL around. Find a good reporting tool
  • 40. Deep dive: user retention
  • 42. User retention is… The most important* quality metric for your product * kinda
  • 43. User retention and growth [charts: N-day retention; day-14 retention over time]
  • 44. User retention and growth [line chart: daily active users (0 to 6,000) against product age (days 0 to 14), Product A vs. Product B]
  • 45. High churn = wasted ad dollars [chart: dollars ($0 to $25,000) against product age (days 1 to 14), Product A vs. Product B]
  • 46. The Sputnik method
      For generating a multidimensional user retention analysis table
      event_date  install_date  os_name  country  active_users  cohort_size
      monday      monday        android  us       100           100
      tuesday     monday        android  us       83            100
      monday      monday        ios      us       75            75
      tuesday     monday        ios      us       75            75
  • 47. Get one-day retention
      SELECT
        SUM(active_users) AS active_users,
        SUM(cohort_size) AS cohort_size,
        SUM(active_users) / SUM(cohort_size) AS retention
      FROM user_retention
      WHERE event_date - install_date = 1
        AND CURRENT_DATE - 1 > event_date;
  • 48. Get one-day retention
      event_date  install_date  os_name  country  active_users  cohort_size
      monday      monday        android  us       100           100
      tuesday     monday        android  us       83            100
      monday      monday        ios      us       75            75
      tuesday     monday        ios      us       75            75
      Active users: 83 + 75 = 158
      Cohort size: 100 + 75 = 175
      Pct retention = 158 / 175 ≈ 90%
  • 49. Get one-day retention by OS
      SELECT
        os_name,
        SUM(active_users) AS active_users,
        SUM(cohort_size) AS cohort_size,
        SUM(active_users) / SUM(cohort_size) AS retention
      FROM user_retention
      WHERE event_date - install_date = 1
        AND CURRENT_DATE - 1 > event_date
      GROUP BY 1;
  • 50. Get one-day retention by OS
      event_date  install_date  os_name  country  active_users  cohort_size
      monday      monday        android  us       100           100
      tuesday     monday        android  us       83            100
      monday      monday        ios      us       75            75
      tuesday     monday        ios      us       75            75
      Android: active users 83; cohort size 100; pct retention = 83%
      iOS: active users 75; cohort size 75; pct retention = 100%
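The two retention queries can be tried locally on the four sample rows. This sketch uses SQLite via Python's `sqlite3` as a stand-in for Amazon Redshift, so date subtraction is written with `julianday()` and the yesterday filter with `date('now', '-1 day')`. The concrete dates replacing "monday" and "tuesday" are assumptions. The explicit cast to REAL is my addition: if the two columns are integers, integer division would otherwise truncate the ratio.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE user_retention (
        event_date TEXT, install_date TEXT, os_name TEXT, country TEXT,
        active_users INTEGER, cohort_size INTEGER
    )
    """
)
# "monday" and "tuesday" from the slide, given concrete (assumed) dates.
conn.executemany(
    "INSERT INTO user_retention VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2015-10-05", "2015-10-05", "android", "us", 100, 100),
        ("2015-10-06", "2015-10-05", "android", "us", 83, 100),
        ("2015-10-05", "2015-10-05", "ios", "us", 75, 75),
        ("2015-10-06", "2015-10-05", "ios", "us", 75, 75),
    ],
)

# Rows exactly one day after install, excluding days that are not yet complete.
ONE_DAY_FILTER = (
    "julianday(event_date) - julianday(install_date) = 1 "
    "AND date('now', '-1 day') > event_date"
)

overall = conn.execute(
    f"""
    SELECT SUM(active_users), SUM(cohort_size),
           CAST(SUM(active_users) AS REAL) / SUM(cohort_size)
    FROM user_retention
    WHERE {ONE_DAY_FILTER}
    """
).fetchone()

by_os = conn.execute(
    f"""
    SELECT os_name, CAST(SUM(active_users) AS REAL) / SUM(cohort_size)
    FROM user_retention
    WHERE {ONE_DAY_FILTER}
    GROUP BY os_name
    ORDER BY os_name
    """
).fetchall()

print(overall)
print(by_os)
```

Because every row carries both `active_users` and `cohort_size`, any subset of dimensions can be rolled up with the same two SUMs, which is what makes the table multidimensional.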
  • 51. The Sputnik method You will need: Daily event summary User user_id
  • 52. The Sputnik method
      Calculate cohort sizes
        • Count users by all dimensions
        • For example: Male, iOS, in USA, who installed today
      Determine user activity
        • For each day, for each user, were they active
        • Create a table with user_id and event_date
      Join and aggregate
        • Join user table to user_activity on user_id
        • SUM active users by cohort and join to cohort sizes
  • 53. Calculate cohort sizes
      user_id  install_date  os_name  country
      1        2015-10-02    iOS      us
      2        2015-10-01    android  ca
      3        2015-10-01    android  ca
      SELECT
        install_date,
        os_name,
        country,
        COUNT(*) AS cohort_size
      FROM user
      GROUP BY 1,2,3;
  • 54. Calculate cohort sizes
      install_date  os_name  country  cohort_size
      2015-10-02    iOS      us       1
      2015-10-01    android  ca       2
      SELECT
        install_date,
        os_name,
        country,
        COUNT(*) AS cohort_size
      FROM user
      GROUP BY 1,2,3;
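The cohort-size step can be checked against the three sample users. This is a sketch in SQLite via Python's `sqlite3`, standing in for Amazon Redshift. The table name is quoted as `"user"` because that word is reserved in PostgreSQL-family dialects and may need quoting there; the quoting is my assumption, not something the slide shows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "user" ('
    "user_id INTEGER, install_date TEXT, os_name TEXT, country TEXT)"
)
conn.executemany(
    'INSERT INTO "user" VALUES (?, ?, ?, ?)',
    [
        (1, "2015-10-02", "iOS", "us"),
        (2, "2015-10-01", "android", "ca"),
        (3, "2015-10-01", "android", "ca"),
    ],
)

# One row per combination of dimensions, with the number of users in it.
cohorts = conn.execute(
    """
    SELECT install_date, os_name, country, COUNT(*) AS cohort_size
    FROM "user"
    GROUP BY install_date, os_name, country
    ORDER BY install_date
    """
).fetchall()
print(cohorts)
```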
  • 55. Determine user activity
      user_id  event_date  action
      1        2015-10-02  app_open
      1        2015-10-02  spam
      1        2015-10-03  app_open
      CREATE TEMP TABLE user_activity AS
      SELECT DISTINCT
        user_id,
        event_date
      FROM event_daily
      WHERE action = 'app_open';
  • 56. Determine user activity
      user_id  event_date  action
      1        2015-10-02  app_open
      1        2015-10-02  spam
      1        2015-10-03  app_open
      CREATE TEMP TABLE all_users AS
      SELECT DISTINCT user_id
      FROM event_daily;
      CREATE TEMP TABLE all_dates AS
      SELECT DISTINCT event_date
      FROM event_daily;
  • 57. Determine user activity
      user_id  event_date  action
      1        2015-10-02  app_open
      1        2015-10-02  spam
      1        2015-10-03  app_open
      CREATE TABLE active_users_by_day AS
      SELECT
        xproduct.user_id,
        xproduct.event_date
      FROM (
        SELECT * FROM all_users CROSS JOIN all_dates
      ) xproduct
      INNER JOIN user_activity u
        ON u.user_id = xproduct.user_id
        AND u.event_date = xproduct.event_date;
  • 58. Determine cohort activity
      active_users_by_day:
      user_id  event_date
      1        2015-10-02
      1        2015-10-03
      user:
      user_id  install_date  os_name  country
      1        2015-10-02    iOS      us
      CREATE TEMP TABLE cohort_activity AS
      SELECT
        u.*,
        all_dates.event_date,
        CASE WHEN au.user_id IS NOT NULL THEN 1 ELSE 0 END AS is_active
      FROM user AS u
      LEFT JOIN all_dates
        ON all_dates.event_date >= u.install_date
      LEFT JOIN active_users_by_day AS au
        ON au.user_id = u.user_id
        AND au.event_date = all_dates.event_date
      WHERE all_dates.event_date >= u.install_date;
  • 59. Determine cohort activity
      user_id  event_date  install_date  os_name  country  is_active
      1        2015-10-02  2015-10-02    iOS      us       1
      1        2015-10-03  2015-10-02    iOS      us       1
      1        2015-10-04  2015-10-02    iOS      us       0
      CREATE TEMP TABLE active_users AS
      SELECT
        event_date,
        install_date,
        os_name,
        country,
        SUM(is_active) AS count
      FROM cohort_activity
      GROUP BY 1, 2, 3, 4;
  • 60. Determine cohort activity
      Join these two tables on matching cohort dimensions.
      event_date  install_date  os_name  country  is_active
      2015-10-03  2015-10-02    iOS      us       100
      2015-10-03  2015-10-02    android  us       350
      2015-10-03  2015-10-02    iOS      ca       50
      install_date  os_name  country  cohort_size
      2015-10-02    iOS      us       200
      2015-10-02    android  us       400
      2015-10-02    iOS      ca       60
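The whole Sputnik pipeline can be run end to end on a tiny data set. The sketch below uses SQLite through Python's `sqlite3` as a local stand-in for Amazon Redshift. The sample rows, the second user, and the table name `cohort_sizes` are hypothetical. As a simplification of mine, it joins `user_activity` directly instead of going through the intermediate `active_users_by_day` table, and it produces the final `user_retention` table in one step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE "user" (user_id INTEGER, install_date TEXT, os_name TEXT, country TEXT);
    CREATE TABLE event_daily (user_id INTEGER, event_date TEXT, action TEXT);

    INSERT INTO "user" VALUES
        (1, '2015-10-02', 'iOS', 'us'),
        (2, '2015-10-02', 'iOS', 'us');
    INSERT INTO event_daily VALUES
        (1, '2015-10-02', 'app_open'),
        (1, '2015-10-02', 'spam'),
        (1, '2015-10-03', 'app_open'),
        (2, '2015-10-02', 'app_open'),
        (2, '2015-10-04', 'app_open');

    -- Step 1: cohort sizes by every dimension.
    CREATE TEMP TABLE cohort_sizes AS
    SELECT install_date, os_name, country, COUNT(*) AS cohort_size
    FROM "user"
    GROUP BY install_date, os_name, country;

    -- Step 2: one row per user per active day, plus the list of all days.
    CREATE TEMP TABLE user_activity AS
    SELECT DISTINCT user_id, event_date FROM event_daily WHERE action = 'app_open';
    CREATE TEMP TABLE all_dates AS
    SELECT DISTINCT event_date FROM event_daily;

    -- Step 3: every user on every day since install, flagged active or not.
    CREATE TEMP TABLE cohort_activity AS
    SELECT u.install_date, u.os_name, u.country, d.event_date,
           CASE WHEN a.user_id IS NOT NULL THEN 1 ELSE 0 END AS is_active
    FROM "user" AS u
    JOIN all_dates AS d ON d.event_date >= u.install_date
    LEFT JOIN user_activity AS a
      ON a.user_id = u.user_id AND a.event_date = d.event_date;

    -- Step 4: sum active users per cohort and attach the cohort size.
    CREATE TEMP TABLE user_retention AS
    SELECT c.event_date, c.install_date, c.os_name, c.country,
           SUM(c.is_active) AS active_users, s.cohort_size
    FROM cohort_activity AS c
    JOIN cohort_sizes AS s
      ON s.install_date = c.install_date
     AND s.os_name = c.os_name
     AND s.country = c.country
    GROUP BY c.event_date, c.install_date, c.os_name, c.country, s.cohort_size;
    """
)

rows = conn.execute(
    "SELECT event_date, active_users, cohort_size FROM user_retention ORDER BY event_date"
).fetchall()
for row in rows:
    print(row)
```

Both sample users belong to the same cohort, so each output row reports how many of the two were active on that day. Days on which nobody was active still appear with a count of 0, provided some event exists for that date.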
  • 61. Big wins for Yahoo: real-time insights; easier deployment and maintenance; data-driven product development; cutting-edge analytics
  • 63. Related sessions
      Hear from other customers discussing their Amazon Redshift use cases:
      • DAT201: Introduction to Amazon Redshift (RetailMeNot)
      • ISM303: Migrating Your Enterprise Data Warehouse to Amazon Redshift (Boingo Wireless and Edmunds)
      • ARC303: Pure Play Video OTT: A Microservices Architecture in the Cloud (Verizon)
      • ARC305: Self-Service Cloud Services: How J&J Is Managing AWS at Scale for Enterprise Workloads
      • BDT306: The Life of a Click: How Hearst Publishing Manages Clickstream Analytics with AWS
      • DAT311: Large-Scale Genomic Analysis with Amazon Redshift (Human Longevity)
      • BDT314: Running a Big Data and Analytics Application on Amazon EMR and Amazon Redshift with a Focus on Security (Nasdaq)
      • BDT316: Offloading ETL to Amazon Elastic MapReduce (Amgen)
      • BDT401: Amazon Redshift Deep Dive (TripAdvisor)
      • Building a Mobile App using Amazon EC2, Amazon S3, Amazon DynamoDB, and Amazon Redshift (Tinder)