SlideShare a Scribd company logo
1 of 81
Download to read offline
Building a Sustainable
Data Platform on AWS
Takumi Sakamoto
2016.01.27
Takumi Sakamoto
@takus
😍 = ⚽ ✈ 📷
http://bit.ly/1MCOyBX
JAWSDAYS 2015
Mentioned by @jeffbarr
https://twitter.com/jeffbarr/status/649575575787454464
http://www.slideshare.net/smartnews/smart-newss-journey-into-microservices
AWS Case Study
http://aws.amazon.com/solutions/case-studies/smartnews/
Data Platform at
SmartNews
What is SmartNews?
• News Discovery App
• Launched in 2012
• 15M+ Downloads in World Wide
https://www.smartnews.com/en/
Our Mission
the world's quality information?
the people who need it?
How?
Machine Learning
URLs Found
Structure Analysis
Semantics Analysis
Importance Estimation
Diversification
Internet
100,000+ /day
1000+ /day
Feedback
Deliver
Trending Stories
Data Platform Use Cases
• Product development
• track KPI such as DAU and MAU
• A/B test for new feature, on-boarding, etc...
• ad-hoc analysis
• Provide data to applications
• realtime re-ranking news articles
• CTR prediction of Ads system
• dashboard service for media partners
Data & Its Numbers
• User activities
• ~100 GBs per day (compressed)
• 60+ record types
• User demographics or configurations etc...
• 15M+ records
• Articles metadata
• 100K+ records per day
Sustainable
Data Platform?
Sustainable Data Platform
• Provide a reliable and scalable "Lambda Architecture"
• Minimize both operation & running cost
• Be open to uncertain future
Lambda Architecture
http://lambda-architecture.net/
Why Sustainable?
• Do a lot with a few engineers
• no one is a full-time maintainer
• avoid to waste too much time
• Empower brilliant engineers in SmartNews
• everything should be as self-serve as possible
• don't ask for permission, beg for forgiveness
System Design
λ Architecture at SmartNews
Input Batch Serving
Speed
Output
Design Principles
• Decoupled "Computation" and "Storage" layers
• multiple consumers can use the same data
• run consumers on Spot Instances
• prevent serious data lost with minimum effort
• Use the right tool for the job
• leverage AWS managed service as possible
• fill in the missing pieces by Presto & PipelineDB
An Example
Amazon EMR
AMI 3.x
Amazon S3
Amazon EMR
Hive
General
Users
Application
Engineer
I wanna
upgrade hive
Ad
Engineer
I wanna combine
news data with
ad data
Amazon EMR
AMI 4.x
Amazon EMR
Spark
We’re satisfied
with current
version
Data
Scientist
I wanna test my
algorithm with the
latest spark
Batch Layer
Run multiple EMR clusters for each usages
Kinesis
Stream
Spark
on EMR
AWS
Lambda
Data
Scientist
I wanna consume
streaming data by
Spark
Application
Engineer
I wanna add a
streaming monitor
by Lambda
Speed Layer
Consume the same data for each usages
• AWS managed services
• Replicated data into Multiple AZs
• High availability
Input Data
Collect Events by Fluentd
• Forwarder (running on each instances)
• store JSON events to S3
• forward events to aggregators
• collect metrics and post them to Datadog
• Aggregator
• input events into Kinesis & PipelineDB
• other reporting tasks (not mentioned today)
Forwards to S3
<source>
@type tail
format json
path /data/log/user_activity.log
pos_file /data/log/pos/user_activity.pos
tag smartnews.user_activity
time_key timestamp
</source>
<match smartnews.user_activity>
@type copy
<store>
@type relabel
@label @s3
</store>
<store>
@type forward
@label @forward
</store>
</match>
@include conf.d/s3.conf
@include conf.d/forward.conf
<label @s3>
<% node[:td_agent][:s3].each do |c| -%>
<match <%= c[:tag] %>>
@id s3.<%= c[:tag] %>
@type s3
...
path fluentd/<%= node[:env] %>/<%= node[:role] %>/<%= c[:tag] %>
time_slice_format dt=%Y-%m-%d/hh=%H
time_key timestamp
include_time_key
time_as_epoch
reduced_redundancy true
format json
utc
buffer_chunk_limit 2048m
</match>
<% end -%>
</label>
td-agent.conf conf.d/s3.conf
Capture DynamoDB Streams
<source>
type dynamodb_streams
stream_arn YOUR_DDB_STREAMS_ARN
pos_file /path/to/table.pos
fetch_interval 1
fetch_size 100
</source>
https://github.com/takus/fluent-plugin-dynamodb-streams
DynamoDB DynamoDB
Streams
Amazon S3
AWS
Lambda
Fluentd
Recommended Practices
• Make configuration simple as possible
• fluentd can cover everything, but shouldn't
• keep stateless
• Use v0.12 or later
• "Filter" : better performance
• "Label": eliminate 'output_tag' configuration
Monitor Fluentd Status
• Monitor traffic volume & retry count by Datadog
• Datadog's fluentd integration
• fluent-plugin-flowcounter
• fluent-plugin-dogstatsd
Archive to Amazon S3
• I have 2 recommended settings
• versioning
• enable to recover from human error
• lifecycle policy
• minify storage cost
Archives to IA or Gracier
xx days after the creation date
Keep previous versions xx days
Save you in the future!!
Batch Layer
Various ETL Tasks
• Extract
• dump MySQL records by Embulk
• make files on S3 readable to Hive
• Transform
• transform text files into columnar files (RCFile, ORC)
• generate features for machine learning
• aggregate records (by country, by channel)
• Load
• load aggregated metrics into Amazon Aurora
Hive
• Most popular project on Hadoop ecosystem
• famous for its lovely logo :)
• HiveQL and MapReduce
• convert SQL-like query into MR jobs
• Not adopt Tez engine yet
• Amazon EMR doesn't support now
• limited improvement to our queries
How to process JSON?
A. Transform into columnar table periodically
• required converting job
• better performance
B. Use JSON-SerDe for temporary analysis
• easy way for querying raw json text files
• required to "drop table" for change schema
• performance is not good
Transform Tables
-- Make S3 files readable by Hive
ALTER TABLE raw_activities ADD IF NOT EXISTS PARTITION
(dt='${DATE}', hh='${HOUR}');
-- Transform text files into columnar files (Flatten JSON)
INSERT OVERWRITE TABLE activities
PARTITION (dt='${DATE}', action)
SELECT
user_id, timestamp, os, country,
data,
action
FROM raw_activities
LATERAL VIEW json_tuple(
raw_activities.json,
'userId','timestamp','platform','country','action','data'
) a as user_id, timestamp, os, country, action, data
WHERE dt = '${DATE}'
CLUSTER BY os, country, action, user_id
;
JSON-SerDe
-- Define table with SERDE
CREATE TABLE json_table (
country string,
languages array<string>,
religions map<string,array<int>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
-- Result: 10
SELECT religions['catholic'][0] FROM json_table;
cf. hive-ruby-scripting
-- Define your ruby (JRuby) script
SET rb.script=
require 'json'
def parse (json)
j = JSON.load(json)
j['profile']['attribute1']
end
;
-- Use the script in HQL
SELECT rb_exec('&parse', json) FROM user;
https://github.com/gree/hive-ruby-scripting
Spark
http://www.slideshare.net/smartnews/aws-meetupapache-spark-on-emr
Self-Serve via AWS CLI
# Create EMR clusters that runs Hive & Spark & Ganglia
aws emr create-cluster 
--name "My Cluster" 
--release-label emr-4.2.0 
--applications Name=Hive Name=Spark Name=GANGLIA 
--ec2-attributes KeyName=myKey 
--instance-type c3.4xlarge 
--instance-count 4 
--use-default-roles
Minimize expenses
• Use Spot Instances as possible
• typically discount 50-90%
• select instance type with stable price
• C3 families spike often :(
• Dynamic cluster resizing
• x2 capacity during daily batch job
• 1/2 capacity during midnight
Handle Data Dependencies
Typical Anti-Pattern
5 * * * * app hive -f query_1.hql
15 * * * * app hive -f query_2.hql
30 * * * * app hive -f query_3.hql
0 * * * * app hive -f query_4.hql
1 * * * * app hive -f query_5.hql
Workflow Management
• Define dependencies
• task E is executed after finishing task C and task D
• Scheduling
• task A is kicked after 09:00 AM
• throttle concurrent running of the same task
• Monitoring
• notification in failure
• task C must finish before 01:00 PM (SLA)
cf. http://www.slideshare.net/taroleo/workflow-hacks-1-dots-tokyo
Airflow
• A workflow management systems
• define workflow by Python
• built in shiny UI & CLI
• pluggable architecture
http://nerds.airbnb.com/airflow/
Define Tasks
dag = DAG('tutorial', default_args=default_args)
t1 = BashOperator(
task_id='print_date',
bash_command='date',
dag=dag)
t2 = BashOperator(
task_id='sleep',
bash_command='sleep 5',
retries=3,
dag=dag)
t3 = BashOperator(
task_id='templated',
bash_command="""
{% for i in range(5) %}
echo "{{ ds }}"
echo "{{ macros.ds_add(ds, 7)}}"
echo "{{ params.my_param }}"
{% endfor %}
""",
params={'my_param': 'Parameter I passed in'},
dag=dag)
t2.set_upstream(t1)
t3.set_upstream(t1)
Task
Dependencies
Python code
DAG
Workflow as Code
Deploy codes automatically after merging into master
Visualize Dependencies
What is done or not?
Alerting to Slack
• SLA Violation
• task A should be done till 00:00 PM
• other team's task K has dependency into task A
• Output validation failure
• stop the following tasks if the output is doubtful
Retry from Web UI
Once clear histories, airflow scheduler back fill the histories
Retry from CLI
// Clear some histories from 2016-01-01
airflow clear etl_smartnews 
--task_regex user_ 
--downstream 
--start_date 2016-01-01
// Backfill uncompleted tasks
airflow backfill etl_smartnews 
--start_date 2016-01-01
Check Rendered Query
How Long Each Tasks?
Pluggable Architecture
• Built-in plugins
• operator: bash, hive, preto, mysql
• transfer: hive_to_mysql
• sensor: wait_hive_partition, wait_s3_file
• Written our own plugin
• mysql_partition
Examples
user_sensor = S3KeySensor(
task_id='wait_user',
bucket_name='smartnews',
bucket_key='user/dt={{ ds }}/dump.csv',
)
etl = HiveOperator(
task_id="task1",
hql="INSERT OVERWRITE INTO...."
)
etl.set_upstream(user_sensor)
import = HiveToMySqlTransfer(
task_id=name,
mysql_preoperator="DELETE FROM %s WHERE date = '{{ ds }}'" % table,
sql="SELECT country, count(*) FROM %s" % table,
mysql_table=table
)
import.set_upstream(etl)
Wait a S3 file creation
After the file is created,
Run ETL Query
After that,
Import into MySQL
Serving Layer
Provides batch views
in low-latency and ad-hoc way
Presto
• A distributed SQL query engine
• join multiple data sources (Hive + MySQL)
• support standard ANSI SQL
• designed to handle TBs or PBs scale data
cf. http://www.slideshare.net/frsyuki/presto-hadoop-conference-japan-2014
Presto Architecture
Amazon S3 Kinesis
Stream
Amazon
RDS
Amazon
Aurora
Presto
Worker
Presto
Worker
Presto
Worker
Presto
Worker
Presto
Worker
Presto
Worker
Presto
Coordinator
Client
1. Query with Standard SQL
4. Scan data concurrently
5. Aggregate data without disk I/O
6. Return result to client
2. Generate execution plan
3. Dispatch tasks into multiple workers
Amazon EMR
(Hive Metastore)
Provides Hive table metadata
(S3 access only)
※ https://github.com/qubole/presto-kinesis
※
Why Presto?
• Join multiple data sources
• skip large parts of ETL process
• enable to merge Hive/MySQL/Kinesis/PipelineDB
• Low latency
• ~30s to scan billions records in S3
• Low maintenance cost
• stateless, and easy to integrate with Auto Scaling
Use case: A/B Test
-- Suppose that this table exists
DESC hive.default.user_activities;
user_id bigint
action varchar
abtest array<map<varchar, bigint>>
url varchar
-- Summarize page view per A/B Test identifier
-- for comparing two algorithms v1 & v2
SELECT
  dt,
  t['behaviorId'],
  count(*) as pv
FROM hive.default.user_activities CROSS JOIN UNNEST(abtest) AS t (t)
WHERE dt like '2016-01-%' AND action = 'viewArticle'
AND t['definitionId'] = 163
GROUP BY dt, t['behaviorId'] ORDER BY dt
;
2015-12-01 | algorithm_v1 | 40000
2015-12-01 | algorithm_v2 | 62000
Use case: Troubleshoot
-- Store access logs to S3, and query to them
-- Summarize access & 95pct response time by SQL
SELECT
from_unixtime(timestamp),
count(*) as access,
approx_percentile(reqtime, 0.95) as pct95_reqtime
FROM hive.default.access_log
WHERE dt = '2015-11-04' AND hh = '13' AND role = 'xxx'
GROUP BY timestamp ORDER BY timestamp
;
2015-11-04 22:00:00.000 | 6377 | 0.522
2015-11-04 22:00:01.000 | 3580 | 0.422
Scheduled Auto Scaling
$ aws autoscaling describe-scheduled-actions
{
"ScheduledUpdateGroupActions": [
{
"DesiredCapacity": 2,
"AutoScalingGroupName": "presto-worker-prd",
"Recurrence": "59 14 * * *",
"ScheduledActionName": "scalein-2359-jst"
},
{
"DesiredCapacity": 20,
"AutoScalingGroupName": "presto-worker-prd",
"Recurrence": "45 0 * * 1-5",
"ScheduledActionName": "scaleout-0945-jst"
}
]
}
Presto Covers Everything? No!
• Fixed system on Amazon Aurora (or other RDB)
• provides KPI for products & business
• require high availability & low latency
• has no flexibility
• Ad-hoc system on Presto
• provides access to all dataset on data platform
• require high scalability
• has flexibility (join various data sources)
Why Fixed vs Ad-hoc?
• Difficulties on the Ad-hoc only solution
• difficult to prevent heavy queries
• large distinct count exhausts computing resources
• decrease presto maintainability
Output Data
Chartio
• Dashboard as A Service
• helps businesses analyze and track their critical data
• one of AWS partners (※)
• Combine multiple data sources at one dashboard
• Presto, MySQL, Redshift, BigQuery, Elasticsearch ...
• enable to join BigQuery + MySQL internally
• Easy to use for every one
• everyone can make their own dashboard
• write SQL directly / generate query by drag & drop
※ http://www.aws-partner-directory.com/PartnerDirectory/PartnerDetail?id=8959
Creating dashboard
1. Building query
(Drag&Drop / SQL)
2. Add step
(filter、sort、modify)
3. Select visualize way
(table、graph)
Examples
Why Chartio?
• Chartio saves a lot of engineering resources
• before
• maintain in-house dashboard written by rails
• everyone got tired to maintain it
• after
• everyone can build their own dashboard easily
• Chartio's UI is cool
• very important factor for dashboard tool
Missing Pieces of Chartio
• No programable API provides
• need to edit dashboard / chart manually
• No rollback feature
• all changes are recorded, but not rollback to the
previous state
• work around : clone => edit => rename
Speed Layer
Why Speed is Matter?
Today’s News is Wrapping
Tomorrow’s Fish and Chips
↑
Yesterday's News
http://www.personalchefapproach.com/tomorrows-fish-n-chips-wrapper/
How News Behaves?
https://gdsdata.blog.gov.uk/2013/10/22/the-half-life-of-news/
Use cases
• Re-rank news articles by user feedback
• track user's positive/negative signal
• consider gender, age, location, interests
• Realtime article monitoring
• detect high bounce rate (may be broken?)
• make realtime reporting dashboard for A/B test
Realtime Re-Ranking
ref. Stream 処理 (Spark Streaming + Kinesis) と Offline 処理 (Hive) の統合
www.slideshare.net/smartnews/stremspark-streaming-kinesisofflinehive
Amazon
CloudSearch
Search
API
API
Gateway
Kinesis
Stream
Amazon S3
Amazon EMR
Amazon S3 Amazon EMR
DynamoDB
Realtime
Feedback
Re-rank
Articles
Article
Metadata
User
Interests
User
Behaviors
Offline Procees
by Hive / Spark
Realtime Monitoring
API
Gateway
Stream
Continuous
View
Continuous
View
Continuous
View
Discard raw record soon after
consumed by Continuous View
Incrementally
updated in realtime
PipelineDB Chartio
AWS
Lambda
Slack
Access Continuous View
by PostgreSQL Client
Record
※1
※1
Use cron on 26 Feb. 2016
Migrate it soon after supporting VPC
PipelineDB
• OSS & enterprise streaming SQL database
• PostgreSQL compatible
• connect to Chartio 😍
• join stream to normal PostgreSQL table
• Support probabilistic data structures
• e.g. HyperLogLog
https://www.pipelinedb.com/
http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
Continuous View
-- Calculate unique users seen per media each day
-- Using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT
day(arrival_timestamp),
substring(url from '.*://([^/]*)') as hostname,
COUNT(DISTINCT user_id::integer)
FROM activity_stream GROUP BY day,hostname;
-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
SELECT COUNT(*) FROM imps_stream;
-- What are the 90th, 95th, 99th percentiles of request latency?
CREATE CONTINUOUS VIEW latency AS
SELECT
percentile_cont(array[90, 95, 99])
WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;
Summary
Sustainable Data Platform
• build a reliable and scalable lambda architecture
• minimize operation & running cost
• be open to uncertain future
My Wishlist to AWS
• Support Reduced Redundancy Storage (RRS) on EMR
• Faster EMR Launch
• Set TTL to DynamoDB records
• Auto-scale Kinesis Stream
• Launch Kinesis Analytics in Tokyo region
Thank you!!

More Related Content

What's hot

Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptxWasm1953
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introductionsudhakara st
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017Amazon Web Services
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshConfluentInc1
 
AWS 환경에서의 위협 탐지 및 사냥 - 신은수, AWS 솔루션즈 아키텍트:: AWS Summit Online Korea 2020
AWS 환경에서의 위협 탐지 및 사냥 - 신은수, AWS  솔루션즈 아키텍트::  AWS Summit Online Korea 2020AWS 환경에서의 위협 탐지 및 사냥 - 신은수, AWS  솔루션즈 아키텍트::  AWS Summit Online Korea 2020
AWS 환경에서의 위협 탐지 및 사냥 - 신은수, AWS 솔루션즈 아키텍트:: AWS Summit Online Korea 2020Amazon Web Services Korea
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaCloudera, Inc.
 
Amazon VPC: Security at the Speed Of Light (NET313) - AWS re:Invent 2018
Amazon VPC: Security at the Speed Of Light (NET313) - AWS re:Invent 2018Amazon VPC: Security at the Speed Of Light (NET313) - AWS re:Invent 2018
Amazon VPC: Security at the Speed Of Light (NET313) - AWS re:Invent 2018Amazon Web Services
 
Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceSnowflake Computing
 
Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018Amazon Web Services
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!Chris Taylor
 
AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration
AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration  AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration
AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration Amazon Web Services
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
[2018] 고객 사례를 통해 본 클라우드 전환 전략
[2018] 고객 사례를 통해 본 클라우드 전환 전략[2018] 고객 사례를 통해 본 클라우드 전환 전략
[2018] 고객 사례를 통해 본 클라우드 전환 전략NHN FORWARD
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 

What's hot (20)

Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
Databricks on AWS.pptx
Databricks on AWS.pptxDatabricks on AWS.pptx
Databricks on AWS.pptx
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Apache Spark Introduction
Apache Spark IntroductionApache Spark Introduction
Apache Spark Introduction
 
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017How to build a data lake with aws glue data catalog (ABD213-R)  re:Invent 2017
How to build a data lake with aws glue data catalog (ABD213-R) re:Invent 2017
 
Apache Kafka® and the Data Mesh
Apache Kafka® and the Data MeshApache Kafka® and the Data Mesh
Apache Kafka® and the Data Mesh
 
AWS 환경에서의 위협 탐지 및 사냥 - 신은수, AWS 솔루션즈 아키텍트:: AWS Summit Online Korea 2020
AWS 환경에서의 위협 탐지 및 사냥 - 신은수, AWS  솔루션즈 아키텍트::  AWS Summit Online Korea 2020AWS 환경에서의 위협 탐지 및 사냥 - 신은수, AWS  솔루션즈 아키텍트::  AWS Summit Online Korea 2020
AWS 환경에서의 위협 탐지 및 사냥 - 신은수, AWS 솔루션즈 아키텍트:: AWS Summit Online Korea 2020
 
Performance Optimizations in Apache Impala
Performance Optimizations in Apache ImpalaPerformance Optimizations in Apache Impala
Performance Optimizations in Apache Impala
 
Amazon VPC: Security at the Speed Of Light (NET313) - AWS re:Invent 2018
Amazon VPC: Security at the Speed Of Light (NET313) - AWS re:Invent 2018Amazon VPC: Security at the Speed Of Light (NET313) - AWS re:Invent 2018
Amazon VPC: Security at the Speed Of Light (NET313) - AWS re:Invent 2018
 
Introduction to AWS Glue
Introduction to AWS Glue Introduction to AWS Glue
Introduction to AWS Glue
 
Demystifying Data Warehouse as a Service
Demystifying Data Warehouse as a ServiceDemystifying Data Warehouse as a Service
Demystifying Data Warehouse as a Service
 
Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018
Building Advanced Workflows with AWS Glue (ANT333) - AWS re:Invent 2018
 
AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!AWS Glue - let's get stuck in!
AWS Glue - let's get stuck in!
 
AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration
AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration  AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration
AWS Webcast - Amazon RDS for Oracle: Best Practices and Migration
 
Final terraform
Final terraformFinal terraform
Final terraform
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
DAOS Middleware overview
DAOS Middleware overviewDAOS Middleware overview
DAOS Middleware overview
 
[2018] 고객 사례를 통해 본 클라우드 전환 전략
[2018] 고객 사례를 통해 본 클라우드 전환 전략[2018] 고객 사례를 통해 본 클라우드 전환 전략
[2018] 고객 사례를 통해 본 클라우드 전환 전략
 
Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 

Viewers also liked

[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例Amazon Web Services Japan
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
 
Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築Minero Aoki
 
AWS Black Belt Online Seminar 2017 Amazon Connect
AWS Black Belt Online Seminar 2017 Amazon ConnectAWS Black Belt Online Seminar 2017 Amazon Connect
AWS Black Belt Online Seminar 2017 Amazon ConnectAmazon Web Services Japan
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best PracticesAmazon Web Services
 
AWS Black Belt Online Seminar 2017 AWS Shield
AWS Black Belt Online Seminar 2017 AWS ShieldAWS Black Belt Online Seminar 2017 AWS Shield
AWS Black Belt Online Seminar 2017 AWS ShieldAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめAWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Deployment on AWS
AWS Black Belt Online Seminar 2017 Deployment on AWSAWS Black Belt Online Seminar 2017 Deployment on AWS
AWS Black Belt Online Seminar 2017 Deployment on AWSAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-RayAWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-RayAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon Aurora
AWS Black Belt Online Seminar 2017 Amazon AuroraAWS Black Belt Online Seminar 2017 Amazon Aurora
AWS Black Belt Online Seminar 2017 Amazon AuroraAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAmazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB Amazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon EMR
AWS Black Belt Online Seminar 2017 Amazon EMR AWS Black Belt Online Seminar 2017 Amazon EMR
AWS Black Belt Online Seminar 2017 Amazon EMR Amazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計Amazon Web Services Japan
 
AWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハック
AWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハックAWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハック
AWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハックAmazon Web Services Japan
 

Viewers also liked (20)

20170725 black belt_monitoring_on_aws
20170725 black belt_monitoring_on_aws20170725 black belt_monitoring_on_aws
20170725 black belt_monitoring_on_aws
 
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
[よくわかるクラウドデータベース] リクルートにおけるRedshift導入・活用事例
 
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon Redshift
 
20170726 black belt_stepfunctions
20170726 black belt_stepfunctions20170726 black belt_stepfunctions
20170726 black belt_stepfunctions
 
Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築Amazon Redshiftによるリアルタイム分析サービスの構築
Amazon Redshiftによるリアルタイム分析サービスの構築
 
AWS Black Belt Online Seminar 2017 Amazon Connect
AWS Black Belt Online Seminar 2017 Amazon ConnectAWS Black Belt Online Seminar 2017 Amazon Connect
AWS Black Belt Online Seminar 2017 Amazon Connect
 
(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices(BDT305) Amazon EMR Deep Dive and Best Practices
(BDT305) Amazon EMR Deep Dive and Best Practices
 
AWS Black Belt Online Seminar 2017 AWS Shield
AWS Black Belt Online Seminar 2017 AWS ShieldAWS Black Belt Online Seminar 2017 AWS Shield
AWS Black Belt Online Seminar 2017 AWS Shield
 
20170621 aws-black belt-ads-sms
20170621 aws-black belt-ads-sms20170621 aws-black belt-ads-sms
20170621 aws-black belt-ads-sms
 
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめAWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
AWS Black Belt Online Seminar 2017 AWS Summit Tokyo 2017 まとめ
 
AWS Black Belt Online Seminar 2017 Deployment on AWS
AWS Black Belt Online Seminar 2017 Deployment on AWSAWS Black Belt Online Seminar 2017 Deployment on AWS
AWS Black Belt Online Seminar 2017 Deployment on AWS
 
AWS Black Belt online seminar 2017 Snowball
AWS Black Belt online seminar 2017 SnowballAWS Black Belt online seminar 2017 Snowball
AWS Black Belt online seminar 2017 Snowball
 
AWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-RayAWS Black Belt Online Seminar 2017 AWS X-Ray
AWS Black Belt Online Seminar 2017 AWS X-Ray
 
AWS Black Belt Online Seminar 2017 Amazon Aurora
AWS Black Belt Online Seminar 2017 Amazon AuroraAWS Black Belt Online Seminar 2017 Amazon Aurora
AWS Black Belt Online Seminar 2017 Amazon Aurora
 
AWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLiftAWS Black Belt Online Seminar 2017 Amazon GameLift
AWS Black Belt Online Seminar 2017 Amazon GameLift
 
AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB AWS Black Belt Online Seminar 2017 Amazon DynamoDB
AWS Black Belt Online Seminar 2017 Amazon DynamoDB
 
AWS BlackBelt AWS上でのDDoS対策
AWS BlackBelt AWS上でのDDoS対策AWS BlackBelt AWS上でのDDoS対策
AWS BlackBelt AWS上でのDDoS対策
 
AWS Black Belt Online Seminar 2017 Amazon EMR
AWS Black Belt Online Seminar 2017 Amazon EMR AWS Black Belt Online Seminar 2017 Amazon EMR
AWS Black Belt Online Seminar 2017 Amazon EMR
 
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
AWS Black Belt Online Seminar 2017 AWSへのネットワーク接続とAWS上のネットワーク内部設計
 
AWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハック
AWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハックAWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハック
AWS Black Belt Online Seminar 2017 Amazon Pinpoint で始めるモバイルアプリのグロースハック
 

Similar to Building a Sustainable Data Platform on AWS

Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformChester Chen
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesSadayuki Furuhashi
 
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013Amazon Web Services
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebookAniket Mokashi
 
Migrating on premises workload to azure sql database
Migrating on premises workload to azure sql databaseMigrating on premises workload to azure sql database
Migrating on premises workload to azure sql databasePARIKSHIT SAVJANI
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션Amazon Web Services Korea
 
Evolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.jsEvolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.jsSteve Jamieson
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextPrateek Maheshwari
 
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB
 
React state management with Redux and MobX
React state management with Redux and MobXReact state management with Redux and MobX
React state management with Redux and MobXDarko Kukovec
 
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoTAmazon Web Services
 
Orchestrating complex workflows with aws step functions
Orchestrating complex workflows with aws step functionsOrchestrating complex workflows with aws step functions
Orchestrating complex workflows with aws step functionsChris Shenton
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore IndexSolidQ
 
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...Amazon Web Services
 
Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Zhenxiao Luo
 
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel PartnersCraeg Strong
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
 
MongoDB.local Austin 2018: Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
MongoDB.local Austin 2018:  Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...MongoDB.local Austin 2018:  Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
MongoDB.local Austin 2018: Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...MongoDB
 

Similar to Building a Sustainable Data Platform on AWS (20)

Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
 
Automating Workflows for Analytics Pipelines
Automating Workflows for Analytics PipelinesAutomating Workflows for Analytics Pipelines
Automating Workflows for Analytics Pipelines
 
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
Automate Your Big Data Workflows (SVC201) | AWS re:Invent 2013
 
XStream: stream processing platform at facebook
XStream:  stream processing platform at facebookXStream:  stream processing platform at facebook
XStream: stream processing platform at facebook
 
Migrating on premises workload to azure sql database
Migrating on premises workload to azure sql databaseMigrating on premises workload to azure sql database
Migrating on premises workload to azure sql database
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
AWS Step Functions을 활용한 서버리스 앱 오케스트레이션
 
Evolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.jsEvolution of a cloud start up: From C# to Node.js
Evolution of a cloud start up: From C# to Node.js
 
Apache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's NextApache Samza 1.0 - What's New, What's Next
Apache Samza 1.0 - What's New, What's Next
 
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
 
React state management with Redux and MobX
React state management with Redux and MobXReact state management with Redux and MobX
React state management with Redux and MobX
 
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
(MBL305) You Have Data from the Devices, Now What?: Getting the Value of the IoT
 
Orchestrating complex workflows with aws step functions
Orchestrating complex workflows with aws step functionsOrchestrating complex workflows with aws step functions
Orchestrating complex workflows with aws step functions
 
DW on AWS
DW on AWSDW on AWS
DW on AWS
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Work...
 
Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019
 
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
20211028 ADDO Adapting to Covid with Serverless Craeg Strong Ariel Partners
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
MongoDB.local Austin 2018: Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
MongoDB.local Austin 2018:  Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...MongoDB.local Austin 2018:  Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
MongoDB.local Austin 2018: Ch-Ch-Ch-Ch-Changes: Taking Your MongoDB Stitch A...
 

More from SmartNews, Inc.

SmartNewsを支えるデータパイプラインとその運用
SmartNewsを支えるデータパイプラインとその運用SmartNewsを支えるデータパイプラインとその運用
SmartNewsを支えるデータパイプラインとその運用SmartNews, Inc.
 
Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤SmartNews, Inc.
 
エンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへエンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへSmartNews, Inc.
 
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.SmartNews, Inc.
 
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_cccSmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_cccSmartNews, Inc.
 
Stream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysStream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysSmartNews, Inc.
 
AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側SmartNews, Inc.
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews, Inc.
 
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...SmartNews, Inc.
 
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテムSmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテムSmartNews, Inc.
 
SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解SmartNews, Inc.
 
SmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews, Inc.
 
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合SmartNews, Inc.
 
SmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォームSmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォームSmartNews, Inc.
 
AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」SmartNews, Inc.
 
Smartnews Product Manager Night
Smartnews Product Manager NightSmartnews Product Manager Night
Smartnews Product Manager NightSmartNews, Inc.
 
SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015SmartNews, Inc.
 
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法SmartNews, Inc.
 

More from SmartNews, Inc. (19)

SmartNewsを支えるデータパイプラインとその運用
SmartNewsを支えるデータパイプラインとその運用SmartNewsを支えるデータパイプラインとその運用
SmartNewsを支えるデータパイプラインとその運用
 
Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤
 
エンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへエンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへ
 
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
 
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_cccSmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
 
Stream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdaysStream Processing in SmartNews #jawsdays
Stream Processing in SmartNews #jawsdays
 
AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
 
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
 
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテムSmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
 
SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解
 
NLP in SmartNews
NLP in SmartNewsNLP in SmartNews
NLP in SmartNews
 
SmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews's journey into microservices
SmartNews's journey into microservices
 
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
 
SmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォームSmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォーム
 
AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」
 
Smartnews Product Manager Night
Smartnews Product Manager NightSmartnews Product Manager Night
Smartnews Product Manager Night
 
SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015
 
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
 

Recently uploaded

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Recently uploaded (20)

Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

Building a Sustainable Data Platform on AWS

  • 1. Building a Sustainable Data Platform on AWS Takumi Sakamoto 2016.01.27
  • 7. What is SmartNews? • News Discovery App • Launched in 2012 • 15M+ Downloads in World Wide https://www.smartnews.com/en/
  • 8. Our Mission the world's quality information? the people who need it? How?
  • 9. Machine Learning URLs Found Structure Analysis Semantics Analysis Importance Estimation Diversification Internet 100,000+ /day 1000+ /day Feedback Deliver Trending Stories
  • 10. Data Platform Use Cases • Product development • track KPI such as DAU and MAU • A/B test for new feature, on-boarding, etc... • ad-hoc analysis • Provide data to applications • realtime re-ranking news articles • CTR prediction of Ads system • dashboard service for media partners
  • 11. Data & Its Numbers • User activities • ~100 GBs per day (compressed) • 60+ record types • User demographics or configurations etc... • 15M+ records • Articles metadata • 100K+ records per day
  • 13. Sustainable Data Platform • Provide a reliable and scalable "Lambda Architecture" • Minimize both operation & running cost • Be open to uncertain future
  • 15. Why Sustainable? • Do a lot with a few engineers • no one is a full-time maintainer • avoid to waste too much time • Empower brilliant engineers in SmartNews • everything should be as self-serve as possible • don't ask for permission, beg for forgiveness
  • 17. λ Architecture at SmartNews Input Batch Serving Speed Output
  • 18. Design Principles • Decoupled "Computation" and "Storage" layers • multiple consumers can use the same data • run consumers on Spot Instances • prevent serious data lost with minimum effort • Use the right tool for the job • leverage AWS managed service as possible • fill in the missing pieces by Presto & PipelineDB
  • 19. An Example Amazon EMR AMI 3.x Amazon S3 Amazon EMR Hive General Users Application Engineer I wanna upgrade hive Ad Engineer I wanna combine news data with ad data Amazon EMR AMI 4.x Amazon EMR Spark We’re satisfied with current version Data Scientist I wanna test my algorithm with the latest spark Batch Layer Run multiple EMR clusters for each usages Kinesis Stream Spark on EMR AWS Lambda Data Scientist I wanna consume streaming data by Spark Application Engineer I wanna add a streaming monitor by Lambda Speed Layer Consume the same data for each usages • AWS managed services • Replicated data into Multiple AZs • High availability
  • 21. Collect Events by Fluentd • Forwarder (running on each instances) • store JSON events to S3 • forward events to aggregators • collect metrics and post them to Datadog • Aggregator • input events into Kinesis & PipelineDB • other reporting tasks (not mentioned today)
  • 22. Forwards to S3 <source> @type tail format json path /data/log/user_activity.log pos_file /data/log/pos/user_activity.pos tag smartnews.user_activity time_key timestamp </source> <match smartnews.user_activity> @type copy <store> @type relabel @label @s3 </store> <store> @type forward @label @forward </store> </match> @include conf.d/s3.conf @include conf.d/forward.conf <label @s3> <% node[:td_agent][:s3].each do |c| -%> <match <%= c[:tag] %>> @id s3.<%= c[:tag] %> @type s3 ... path fluentd/<%= node[:env] %>/<%= node[:role] %>/<%= c[:tag] %> time_slice_format dt=%Y-%m-%d/hh=%H time_key timestamp include_time_key time_as_epoch reduced_redundancy true format json utc buffer_chunk_limit 2048m </match> <% end -%> </label> td-agent.conf conf.d/s3.conf
  • 23. Capture DynamoDB Streams <source> type dynamodb_streams stream_arn YOUR_DDB_STREAMS_ARN pos_file /path/to/table.pos fetch_interval 1 fetch_size 100 </source> https://github.com/takus/fluent-plugin-dynamodb-streams DynamoDB DynamoDB Streams Amazon S3 AWS Lambda Fluentd
  • 24. Recommended Practices • Make configuration simple as possible • fluentd can cover everything, but shouldn't • keep stateless • Use v0.12 or later • "Filter" : better performance • "Label": eliminate 'output_tag' configuration
  • 25. Monitor Fluentd Status • Monitor traffic volume & retry count by Datadog • Datadog's fluentd integration • fluent-plugin-flowcounter • fluent-plugin-dogstatsd
  • 26. Archive to Amazon S3 • I have 2 recommended settings • versioning • enable to recover from human error • lifecycle policy • minify storage cost Archives to IA or Gracier xx days after the creation date Keep previous versions xx days Save you in the future!!
  • 28. Various ETL Tasks • Extract • dump MySQL records by Embulk • make files on S3 readable to Hive • Transform • transform text files into columnar files (RCFile, ORC) • generate features for machine learning • aggregate records (by country, by channel) • Load • load aggregated metrics into Amazon Aurora
  • 29. Hive • Most popular project on Hadoop ecosystem • famous for its lovely logo :) • HiveQL and MapReduce • convert SQL-like query into MR jobs • Not adopt Tez engine yet • Amazon EMR doesn't support now • limited improvement to our queries
  • 30. How to process JSON? A. Transform into columnar table periodically • required converting job • better performance B. Use JSON-SerDe for temporary analysis • easy way for querying raw json text files • required to "drop table" for change schema • performance is not good
  • 31. Transform Tables -- Make S3 files readable by Hive ALTER TABLE raw_activities ADD IF NOT EXISTS PARTITION (dt='${DATE}', hh='${HOUR}'); -- Transform text files into columnar files (Flatten JSON) INSERT OVERWRITE TABLE activities PARTITION (dt='${DATE}', action) SELECT user_id, timestamp, os, country, data, action FROM raw_activities LATERAL VIEW json_tuple( raw_activities.json, 'userId','timestamp','platform','country','action','data' ) a as user_id, timestamp, os, country, action, data WHERE dt = '${DATE}' CLUSTER BY os, country, action, user_id ;
  • 32. JSON-SerDe -- Define table with SERDE CREATE TABLE json_table ( country string, languages array<string>, religions map<string,array<int>> ) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' STORED AS TEXTFILE; -- Result: 10 SELECT religions['catholic'][0] FROM json_table;
  • 33. cf. hive-ruby-scripting -- Define your ruby (JRuby) script SET rb.script= require 'json' def parse (json) j = JSON.load(json) j['profile']['attribute1'] end ; -- Use the script in HQL SELECT rb_exec('&parse', json) FROM user; https://github.com/gree/hive-ruby-scripting
  • 35. Self-Serve via AWS CLI # Create EMR clusters that runs Hive & Spark & Ganglia aws emr create-cluster --name "My Cluster" --release-label emr-4.2.0 --applications Name=Hive Name=Spark Name=GANGLIA --ec2-attributes KeyName=myKey --instance-type c3.4xlarge --instance-count 4 --use-default-roles
  • 36. Minimize expenses • Use Spot Instances as possible • typically discount 50-90% • select instance type with stable price • C3 families spike often :( • Dynamic cluster resizing • x2 capacity during daily batch job • 1/2 capacity during midnight
  • 38. Typical Anti-Pattern 5 * * * * app hive -f query_1.hql 15 * * * * app hive -f query_2.hql 30 * * * * app hive -f query_3.hql 0 * * * * app hive -f query_4.hql 1 * * * * app hive -f query_5.hql
  • 39. Workflow Management • Define dependencies • task E is executed after finishing task C and task D • Scheduling • task A is kicked after 09:00 AM • throttle concurrent running of the same task • Monitoring • notification in failure • task C must finish before 01:00 PM (SLA) cf. http://www.slideshare.net/taroleo/workflow-hacks-1-dots-tokyo
  • 40. Airflow • A workflow management systems • define workflow by Python • built in shiny UI & CLI • pluggable architecture http://nerds.airbnb.com/airflow/
  • 41. Define Tasks dag = DAG('tutorial', default_args=default_args) t1 = BashOperator( task_id='print_date', bash_command='date', dag=dag) t2 = BashOperator( task_id='sleep', bash_command='sleep 5', retries=3, dag=dag) t3 = BashOperator( task_id='templated', bash_command=""" {% for i in range(5) %} echo "{{ ds }}" echo "{{ macros.ds_add(ds, 7)}}" echo "{{ params.my_param }}" {% endfor %} """, params={'my_param': 'Parameter I passed in'}, dag=dag) t2.set_upstream(t1) t3.set_upstream(t1) Task Dependencies Python code DAG
  • 42. Workflow as Code Deploy codes automatically after merging into master
  • 44. What is done or not?
  • 45. Alerting to Slack • SLA Violation • task A should be done till 00:00 PM • other team's task K has dependency into task A • Output validation failure • stop the following tasks if the output is doubtful
  • 46. Retry from Web UI Once clear histories, airflow scheduler back fill the histories
  • 47. Retry from CLI // Clear some histories from 2016-01-01 airflow clear etl_smartnews --task_regex user_ --downstream --start_date 2016-01-01 // Backfill uncompleted tasks airflow backfill etl_smartnews --start_date 2016-01-01
  • 49. How Long Each Tasks?
  • 50. Pluggable Architecture • Built-in plugins • operator: bash, hive, preto, mysql • transfer: hive_to_mysql • sensor: wait_hive_partition, wait_s3_file • Written our own plugin • mysql_partition
  • 51. Examples user_sensor = S3KeySensor( task_id='wait_user', bucket_name='smartnews', bucket_key='user/dt={{ ds }}/dump.csv', ) etl = HiveOperator( task_id="task1", hql="INSERT OVERWRITE INTO...." ) etl.set_upstream(user_sensor) import = HiveToMySqlTransfer( task_id=name, mysql_preoperator="DELETE FROM %s WHERE date = '{{ ds }}'" % table, sql="SELECT country, count(*) FROM %s" % table, mysql_table=table ) import.set_upstream(etl) Wait a S3 file creation After the file is created, Run ETL Query After that, Import into MySQL
  • 53. Provides batch views in low-latency and ad-hoc way
  • 54. Presto • A distributed SQL query engine • join multiple data sources (Hive + MySQL) • support standard ANSI SQL • designed to handle TBs or PBs scale data cf. http://www.slideshare.net/frsyuki/presto-hadoop-conference-japan-2014
  • 55. Presto Architecture Amazon S3 Kinesis Stream Amazon RDS Amazon Aurora Presto Worker Presto Worker Presto Worker Presto Worker Presto Worker Presto Worker Presto Coordinator Client 1. Query with Standard SQL 4. Scan data concurrently 5. Aggregate data without disk I/O 6. Return result to client 2. Generate execution plan 3. Dispatch tasks into multiple workers Amazon EMR (Hive Metastore) Provides Hive table metadata (S3 access only) ※ https://github.com/qubole/presto-kinesis ※
  • 56. Why Presto? • Join multiple data sources • skip large parts of ETL process • enable to merge Hive/MySQL/Kinesis/PipelineDB • Low latency • ~30s to scan billions records in S3 • Low maintenance cost • stateless, and easy to integrate with Auto Scaling
  • 57. Use case: A/B Test -- Suppose that this table exists DESC hive.default.user_activities; user_id bigint action varchar abtest array<map<varchar, bigint>> url varchar -- Summarize page view per A/B Test identifier -- for comparing two algorithms v1 & v2 SELECT   dt,   t['behaviorId'],   count(*) as pv FROM hive.default.user_activities CROSS JOIN UNNEST(abtest) AS t (t) WHERE dt like '2016-01-%' AND action = 'viewArticle' AND t['definitionId'] = 163 GROUP BY dt, t['behaviorId'] ORDER BY dt ; 2015-12-01 | algorithm_v1 | 40000 2015-12-01 | algorithm_v2 | 62000
  • 58. Use case: Troubleshoot -- Store access logs to S3, and query to them -- Summarize access & 95pct response time by SQL SELECT from_unixtime(timestamp), count(*) as access, approx_percentile(reqtime, 0.95) as pct95_reqtime FROM hive.default.access_log WHERE dt = '2015-11-04' AND hh = '13' AND role = 'xxx' GROUP BY timestamp ORDER BY timestamp ; 2015-11-04 22:00:00.000 | 6377 | 0.522 2015-11-04 22:00:01.000 | 3580 | 0.422
  • 59. Scheduled Auto Scaling $ aws autoscaling describe-scheduled-actions { "ScheduledUpdateGroupActions": [ { "DesiredCapacity": 2, "AutoScalingGroupName": "presto-worker-prd", "Recurrence": "59 14 * * *", "ScheduledActionName": "scalein-2359-jst" }, { "DesiredCapacity": 20, "AutoScalingGroupName": "presto-worker-prd", "Recurrence": "45 0 * * 1-5", "ScheduledActionName": "scaleout-0945-jst" } ] }
  • 60. Presto Covers Everything? No! • Fixed system on Amazon Aurora (or other RDB) • provides KPI for products & business • require high availability & low latency • has no flexibility • Ad-hoc system on Presto • provides access to all dataset on data platform • require high scalability • has flexibility (join various data sources)
  • 61. Why Fixed vs Ad-hoc? • Difficulties on the Ad-hoc only solution • difficult to prevent heavy queries • large distinct count exhausts computing resources • decrease presto maintainability
  • 63. Chartio • Dashboard as A Service • helps businesses analyze and track their critical data • one of AWS partners (※) • Combine multiple data sources at one dashboard • Presto, MySQL, Redshift, BigQuery, Elasticsearch ... • enable to join BigQuery + MySQL internally • Easy to use for every one • everyone can make their own dashboard • write SQL directly / generate query by drag & drop ※ http://www.aws-partner-directory.com/PartnerDirectory/PartnerDetail?id=8959
  • 64. Creating dashboard 1. Building query (Drag&Drop / SQL) 2. Add step (filter、sort、modify) 3. Select visualize way (table、graph)
  • 66. Why Chartio? • Chartio saves a lot of engineering resources • before • maintain in-house dashboard written by rails • everyone got tired to maintain it • after • everyone can build their own dashboard easily • Chartio's UI is cool • very important factor for dashboard tool
  • 67. Missing Pieces of Chartio • No programable API provides • need to edit dashboard / chart manually • No rollback feature • all changes are recorded, but not rollback to the previous state • work around : clone => edit => rename
  • 69. Why Speed is Matter?
  • 70. Today’s News is Wrapping Tomorrow’s Fish and Chips
  • 73. Use cases • Re-rank news articles by user feedback • track user's positive/negative signal • consider gender, age, location, interests • Realtime article monitoring • detect high bounce rate (may be broken?) • make realtime reporting dashboard for A/B test
  • 74. Realtime Re-Ranking ref. Stream 処理 (Spark Streaming + Kinesis) と Offline 処理 (Hive) の統合 www.slideshare.net/smartnews/stremspark-streaming-kinesisofflinehive Amazon CloudSearch Search API API Gateway Kinesis Stream Amazon S3 Amazon EMR Amazon S3 Amazon EMR DynamoDB Realtime Feedback Re-rank Articles Article Metadata User Interests User Behaviors Offline Procees by Hive / Spark
  • 75. Realtime Monitoring API Gateway Stream Continuous View Continuous View Continuous View Discard raw record soon after consumed by Continuous View Incrementally updated in realtime PipelineDB Chartio AWS Lambda Slack Access Continuous View by PostgreSQL Client Record ※1 ※1 Use cron on 26 Feb. 2016 Migrate it soon after supporting VPC
  • 76. PipelineDB • OSS & enterprise streaming SQL database • PostgreSQL compatible • connect to Chartio 😍 • join stream to normal PostgreSQL table • Support probabilistic data structures • e.g. HyperLogLog https://www.pipelinedb.com/ http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
  • 77. Continuous View -- Calculate unique users seen per media each day -- Using only a constant amount of space (HyperLogLog) CREATE CONTINUOUS VIEW uniques AS SELECT day(arrival_timestamp), substring(url from '.*://([^/]*)') as hostname, COUNT(DISTINCT user_id::integer) FROM activity_stream GROUP BY day,hostname; -- How many impressions have we served in the last five minutes? CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS SELECT COUNT(*) FROM imps_stream; -- What are the 90th, 95th, 99th percentiles of request latency? CREATE CONTINUOUS VIEW latency AS SELECT percentile_cont(array[90, 95, 99]) WITHIN GROUP (ORDER BY latency::integer) FROM latency_stream;
  • 79. Sustainable Data Platform • build a reliable and scalable lambda architecture • minimize operation & running cost • be open to uncertain future
  • 80. My Wishlist to AWS • Support Reduced Redundancy Storage (RRS) on EMR • Faster EMR Launch • Set TTL to DynamoDB records • Auto-scale Kinesis Stream • Launch Kinesis Analytics in Tokyo region