Stream Processing
in SmartNews
Takumi Sakamoto
2016.03.12
Takumi Sakamoto
@takus
😍 = ⚽ ✈ 📷
http://bit.ly/1MCOyBX
JAWSDAYS 2015
AWS Case Study
http://aws.amazon.com/solutions/case-studies/smartnews/
What is SmartNews?
• News Discovery App for Mobile
• Launched in 2012
• 15M+ downloads worldwide
https://www.smartnews.com/en/
How Do We Deliver News?
Internet → Algorithms → Trending News
Why Stream Processing?
Today’s News Is Wrapping Tomorrow’s Fish and Chips
↑ Yesterday's News
http://www.personalchefapproach.com/tomorrows-fish-n-chips-wrapper/
News Articles Lifetime
https://gdsdata.blog.gov.uk/2013/10/22/the-half-life-of-news/
Speed Matters for Us
System Overview
News Delivery Pipeline
[Diagram: Internet, Crawler, Analyzer, Indexer, CloudSearch, API Search, API Gateway, Mobile App, API Tracker, DynamoDB]
Index System / Feedback System (latency labels in the diagram: 1 minute, 5 minutes)
Index System
• Crawler
  • collect news articles & social signals
• Analyzer
  • extract title, content, thumbnail...
  • classify topics (sports, politics, technology...)
• Indexer
  • upload article metadata into CloudSearch
Feedback System
• API Tracker
  • receive users' activity logs from the mobile app
• Spark Streaming
  • generate various metrics for news ranking
  • store metrics in DynamoDB
How Do We Glue the Services Together?
Ref: Amazon Kinesis: Real-time Streaming Big data Processing Applications
Why Kinesis Streams?
• Fully managed service
• Multiple consumer applications
• Reasonable pricing
Multiple Consumers
[Diagram: one Kinesis Stream feeding both Spark on EMR and AWS Lambda]
• Data Scientist: "I wanna consume streaming data with Spark"
• Application Engineer: "I wanna add a streaming monitor with Lambda"
Empowers Engineers to Do Trial and Error
News Delivery Pipeline
[Diagram: Internet, Crawler, Analyzer, Indexer, CloudSearch, API Search, API Gateway, Mobile App, API Tracker, DynamoDB, with three Kinesis Streams connecting the services]
Data & Its Numbers
• User activities
  • ~100 GB per day (compressed)
  • 60+ record types
• User demographics, configurations, etc.
  • 15M+ records
• Article metadata
  • 100K+ records per day
How Do We Produce to and Consume from Kinesis Streams?
Index System
[Diagram: Crawler (KPL) → Analyzer (KCL in, KPL out) → Indexer (KCL) → CloudSearch, with three KPL or KCL instances at each hop]
Collect, Analyze and Index Articles with Kinesis Libraries (KPL & KCL)
Kinesis Libraries
• Kinesis Producer Library (KPL)
  • put records into a stream
  • asynchronous architecture (buffer records)
• Kinesis Client Library (KCL)
  • consume and process data from a stream
  • handle complex tasks associated with distributed computing
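The "buffer records" idea can be illustrated with a toy model. This is a minimal sketch in plain Scala, not the KPL itself: `BufferedProducer`, `Buffer`, `maxRecords` and `send` are hypothetical names, and the real library also flushes on a time limit and sends in the background, which this sketch leaves out.

```scala
// Toy model of client-side buffering (NOT the real KPL API; all names are hypothetical).
object BufferedProducer {
  final class Buffer[A](maxRecords: Int, send: Seq[A] => Unit) {
    private val pending = scala.collection.mutable.ArrayBuffer.empty[A]

    // Callers hand over a record and return; it is sent later as part of a batch.
    def add(record: A): Unit = {
      pending += record
      if (pending.size >= maxRecords) flush()
    }

    // Send whatever is currently buffered as one batch, then empty the buffer.
    def flush(): Unit =
      if (pending.nonEmpty) {
        send(pending.toList)
        pending.clear()
      }
  }
}
```

With `maxRecords = 3`, adding seven records and then calling `flush()` hands `send` three batches: two of three records and one holding the remaining record.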
KPL/KCL Monitoring
• KPL/KCL publish custom CloudWatch metrics
• Key Metrics for KPL
  • User Records Received, User Records Pending
  • All Errors
• Key Metrics for KCL
  • RecordsProcessed
  • MillisBehindLatest
  • RecordProcessor.processRecords.Time
https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kpl.html
https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kcl.html
Monitoring with Datadog
Feedback System
Generate Metrics by User Clusters for Ranking Articles
[Diagram components: Amazon CloudSearch, API Search, API Gateway, API Tracker, Kinesis Stream, Amazon S3, Hive / Spark, DynamoDB, Amazon S3 (Offline ETL / Machine Learning), Push Notification; data labels: User Feedback, User Clusters, Article Metadata, Metrics by Cluster]
Why Metrics by Cluster?
Consider Each User's Interests
Ensure Diversity for Avoiding Filter Bubble
https://en.wikipedia.org/wiki/Filter_bubble
[Diagram: the user sends GET /news/sports to the API, which combines the Article Inventory (Amazon CloudSearch) with Metrics by User Cluster (DynamoDB)]
User
userId: 1000
gender: Male
age: 36
location: San Francisco, US
interests: Baseball
Article                      raw score   weight   score
San Francisco Giants …       3.5         1        3.5
New York Yankees …           6.2         0.6      3
FIFA World Cup …             20.4        0.2      4.08
U.S. Open Championships …    8.4         0.2      1.68
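The scoring example above can be sketched in a few lines of Scala. Three of the four rows on the slide agree with score = raw score × weight (for example 20.4 × 0.2 = 4.08), so this sketch assumes multiplication and uses only those rows. `ClusterRanking`, `Article` and `rank` are hypothetical names, and keying the weights by article title is a simplification for illustration.

```scala
// A minimal sketch, assuming score = rawScore * weight (all names are hypothetical).
object ClusterRanking {
  final case class Article(title: String, rawScore: Double)

  // Weight each raw score for the requesting user's cluster,
  // then sort by the weighted score, highest first.
  // Articles without a weight keep their raw score (weight 1.0).
  def rank(articles: Seq[Article], weights: Map[String, Double]): Seq[(String, Double)] =
    articles
      .map(a => a.title -> a.rawScore * weights.getOrElse(a.title, 1.0))
      .sortBy { case (_, score) => -score }

  def main(args: Array[String]): Unit = {
    val articles = Seq(
      Article("San Francisco Giants", 3.5),
      Article("FIFA World Cup", 20.4),
      Article("U.S. Open Championships", 8.4)
    )
    val weights = Map(
      "San Francisco Giants" -> 1.0,
      "FIFA World Cup" -> 0.2,
      "U.S. Open Championships" -> 0.2
    )
    rank(articles, weights).foreach { case (title, score) =>
      println(f"$title%-26s $score%.2f")
    }
  }
}
```

The weights pull a globally popular article (FIFA World Cup) much closer to the baseball article this user's cluster cares about, which is how the per-cluster metrics balance personal interest against overall popularity.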
Input Data by Fluentd
• Forwarder (running on each instance)
  • archive events to S3
  • forward events to aggregators
• Aggregator (HA Configuration※)
  • put events into Kinesis Stream
  • alert and report (not mentioned here)
※ http://docs.fluentd.org/articles/high-availability
Example Configurations
Forwarder
<source>
  @type tail
  tag smartnews.user_activity
  ...
</source>
<match smartnews.user_activity>
  @type copy
  <store>
    @type s3
    ...
  </store>
  <store>
    @type forward
    ...
  </store>
</match>
Aggregator
<source>
  @type forward
  ...
</source>
<match smartnews.user_activity>
  @type copy
  <store>
    @type kinesis
    ...
  </store>
  <store>
    ...
  </store>
</match>
http://docs.fluentd.org/articles/kinesis-stream
Offline ETL Flow
Transform Text Files into Columnar Files
Various Machine Learning Tasks
[Diagram: API, RDS, Amazon S3, Hive on EMR, Amazon S3, Spark on EMR; Airflow manages the workflow]
Activities
{
  "timestamp": 1453161447,
  "userId": 1234,
  "platform": "ios",
  "edition": "ja_JP",
  "action": "viewArticle",
  "data": {
    "articleId": 1234,
    "duration": 30.2
  }
}
Users
userId, age, gender, location,
1234, 28, M, Tokyo, …
1235, 32, F, Nagano, …
1240, 18, F, Kyoto, …
Airflow: Workflow Engine
Execute Task A -> Task B -> Task C, D
5 * * * * app hive -f query_1.hql
15 * * * * app hive -f query_2.hql
30 * * * * app hive -f query_3.hql
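The cron entries above encode ordering only through fixed start times, while a workflow engine runs each task once its upstream tasks have finished. The idea can be sketched in plain Scala; this is not Airflow's API, and `Workflow`, `executionOrder` and the `upstream` map are hypothetical names used for illustration.

```scala
// A minimal sketch of dependency-ordered execution (not Airflow's API).
object Workflow {
  // upstream(t) is the set of tasks that must finish before t may start.
  // Returns the tasks in an order that respects every dependency.
  def executionOrder(upstream: Map[String, Set[String]]): List[String] = {
    @annotation.tailrec
    def loop(remaining: Map[String, Set[String]], done: List[String]): List[String] =
      if (remaining.isEmpty) done.reverse
      else {
        // A task is ready when all of its upstream tasks are already done.
        val ready = remaining.collect {
          case (task, deps) if deps.forall(d => done.contains(d)) => task
        }.toList.sorted
        require(ready.nonEmpty, "cycle or unknown task in dependencies")
        loop(remaining -- ready, ready.reverse ::: done)
      }
    loop(upstream, Nil)
  }
}
```

For the slide's example, `Map("A" -> Set.empty[String], "B" -> Set("A"), "C" -> Set("B"), "D" -> Set("B"))` yields A, B, C, D. C and D become ready together once B is done, so a real engine could run them in parallel.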
Spark Streaming
[Diagram: Kinesis Stream shards 1-3 are read as DStreams 1-3; the RDDs of each DStream are combined into a Minutely RDD, which is joined with a Pre Computed RDD of user clusters (Female, Male, Teen, ...) and written to DynamoDB as Minutely Metrics by User Cluster]
Split Streams into Minutely RDD
Join Minutely RDD on PreComputed RDD
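The two steps above (bucket events by minute, then join them with a pre-computed user-to-cluster table) can be sketched with plain Scala collections standing in for RDDs. The record shapes `Activity` and `MetricKey`, the epoch-seconds timestamp, and counting events as the metric are all assumptions; the slides do not show the real schema or metric definitions.

```scala
// A minimal sketch with collections in place of RDDs (all names are hypothetical).
object MinutelyMetrics {
  final case class Activity(timestamp: Long, userId: Long, articleId: Long) // epoch seconds
  final case class MetricKey(minute: Long, cluster: String, articleId: Long)

  // Bucket each activity by minute, look up the user's cluster in the
  // pre-computed table (an inner join: users without a cluster are dropped),
  // and count events per (minute, cluster, article).
  def aggregate(activities: Seq[Activity], userClusters: Map[Long, String]): Map[MetricKey, Int] =
    activities
      .flatMap { a =>
        userClusters.get(a.userId).map(c => MetricKey(a.timestamp / 60, c, a.articleId))
      }
      .groupBy(identity)
      .map { case (key, hits) => key -> hits.size }
}
```

In Spark the same shape would typically be a keyed join between each batch and the pre-computed RDD followed by a reduce by key, with the result written to DynamoDB.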
Monitor Spark Streaming
Spark UI is Useful for Monitoring
Integrate with CloudWatch
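import com.amazonaws.services.cloudwatch.AmazonCloudWatchClient
import com.amazonaws.services.cloudwatch.model.{MetricDatum, PutMetricDataRequest}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted, StreamingListenerBatchStarted}

// Relays Spark Streaming batch events to CloudWatch as custom metrics.
class CloudWatchRelay(conf: SparkConf) extends StreamingListener {
  // Not shown on the slide: a hypothetical helper using the AWS SDK for Java v1.
  // The namespace "SparkStreaming" is an assumption.
  private val cloudWatch = new AmazonCloudWatchClient()
  private def putMetricToCloudWatch(name: String, value: Double): Unit =
    cloudWatch.putMetricData(new PutMetricDataRequest()
      .withNamespace("SparkStreaming")
      .withMetricData(new MetricDatum().withMetricName(name).withValue(value)))

  override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = {
    putMetricToCloudWatch("BatchStarted", 1.0)
  }

  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    putMetricToCloudWatch("BatchCompleted", 1.0)
    putMetricToCloudWatch("BatchRecordsProcessed", info.numRecords.toDouble)
    // The delays are Option[Long] in milliseconds; each is reported only when present.
    info.processingDelay.foreach(delay => putMetricToCloudWatch("ProcessingDelay", delay.toDouble))
    info.schedulingDelay.foreach(delay => putMetricToCloudWatch("SchedulingDelay", delay.toDouble))
    info.totalDelay.foreach(delay => putMetricToCloudWatch("TotalDelay", delay.toDouble))
  }
}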
Set Alert to SchedulingDelay
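Scheduling delay is the time a batch waits before its processing starts, so a value that stays high means batches are arriving faster than they are processed. A minimal sketch of such an alert rule follows; `DelayAlert`, `shouldAlert`, the threshold and the window size are all hypothetical, and in practice the rule would more likely be a CloudWatch or Datadog alarm than application code.

```scala
// A minimal sketch of an alert rule on SchedulingDelay (all names and values are hypothetical).
object DelayAlert {
  // True when the most recent `window` scheduling delays (milliseconds)
  // all exceed the threshold, i.e. the backlog is sustained rather than a single spike.
  def shouldAlert(delaysMs: Seq[Long], thresholdMs: Long, window: Int): Boolean =
    delaysMs.size >= window && delaysMs.takeRight(window).forall(_ > thresholdMs)
}
```

Requiring several consecutive breaches keeps one slow batch from paging anyone, at the cost of reacting a few batch intervals later.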
Summary
• Fast & stable stream processing is crucial for SmartNews
  • lifetime of news is very short
  • process events as fast as possible
• Kinesis Stream plays an important role
  • one-click provision & scale
  • empowers engineers to do trial & error
Discuss More?
Join Our Free Lunch in Tokyo Office!!
We’re hiring!!!
ML/NLP engineer
Site reliability engineer
Web application engineer
iOS/Android engineer
Ad engineer
http://about.smartnews.com/en/careers/
See Also
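• The Platform Supporting SmartNews's Webmining (SmartNews の Webmining を支えるプラットフォーム)
• Integrating Stream Processing and Offline Processing (Stream 処理と Offline 処理の統合)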
• Building a Sustainable Data Platform on AWS
• AWS meetup「Apache Spark on EMR」
PipelineDB
• OSS & enterprise streaming SQL database
• PostgreSQL compatible
• connect to Chartio 😍
• join stream to normal PostgreSQL table
• Support probabilistic data structures
• e.g. HyperLogLog
https://www.pipelinedb.com/
http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
Realtime Monitoring
[Diagram: API Gateway → Stream → three Continuous Views in PipelineDB → Chartio, and AWS Lambda → Slack]
• Each raw record is discarded soon after it is consumed by the Continuous Views
• Continuous Views are incrementally updated in realtime
• Access Continuous Views with a PostgreSQL client
Continuous View
-- Calculate unique users seen per media each day
-- Using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT
day(arrival_timestamp),
substring(url from '.*://([^/]*)') as hostname,
COUNT(DISTINCT user_id::integer)
FROM activity_stream GROUP BY day, hostname;
-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
SELECT COUNT(*) FROM imps_stream;
-- What are the 90th, 95th, 99th percentiles of request latency?
CREATE CONTINUOUS VIEW latency AS
SELECT
percentile_cont(array[90, 95, 99])
WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;
Dashboard in Chartio
1. Build a query (drag & drop / SQL)
2. Add steps (filter, sort, modify)
3. Select a visualization (table, graph)

More Related Content

What's hot

Deep Dive on Amazon S3 (May 2016)
Deep Dive on Amazon S3 (May 2016)Deep Dive on Amazon S3 (May 2016)
Deep Dive on Amazon S3 (May 2016)Julien SIMON
 
Keynote: Future of IT - future of enterprise it Canada
Keynote: Future of IT - future of enterprise it CanadaKeynote: Future of IT - future of enterprise it Canada
Keynote: Future of IT - future of enterprise it CanadaAmazon Web Services
 
AWS APAC Webinar Week - Launching Your First Big Data Project on AWS
AWS APAC Webinar Week - Launching Your First Big Data Project on AWSAWS APAC Webinar Week - Launching Your First Big Data Project on AWS
AWS APAC Webinar Week - Launching Your First Big Data Project on AWSAmazon Web Services
 
Overview of IoT Infrastructure and Connectivity at AWS & Getting Started with...
Overview of IoT Infrastructure and Connectivity at AWS & Getting Started with...Overview of IoT Infrastructure and Connectivity at AWS & Getting Started with...
Overview of IoT Infrastructure and Connectivity at AWS & Getting Started with...Amazon Web Services
 
BDA402 Deep Dive: Log analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log analytics with Amazon Elasticsearch ServiceBDA402 Deep Dive: Log analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log analytics with Amazon Elasticsearch ServiceAmazon Web Services
 
Build A Website on AWS for Your First 10 Million Users
Build A Website on AWS for Your First 10 Million UsersBuild A Website on AWS for Your First 10 Million Users
Build A Website on AWS for Your First 10 Million UsersAmazon Web Services
 
AWS APAC Webinar Week - Understanding AWS Storage Options
AWS APAC Webinar Week - Understanding AWS Storage OptionsAWS APAC Webinar Week - Understanding AWS Storage Options
AWS APAC Webinar Week - Understanding AWS Storage OptionsAmazon Web Services
 
(DVO303) Scaling Infrastructure Operations with AWS
(DVO303) Scaling Infrastructure Operations with AWS(DVO303) Scaling Infrastructure Operations with AWS
(DVO303) Scaling Infrastructure Operations with AWSAmazon Web Services
 
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesMigrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesAmazon Web Services
 
Building Your First Big Data Application on AWS
Building Your First Big Data Application on AWSBuilding Your First Big Data Application on AWS
Building Your First Big Data Application on AWSAmazon Web Services
 
SEC303 Automating Security in Cloud Workloads with DevSecOps
SEC303 Automating Security in Cloud Workloads with DevSecOpsSEC303 Automating Security in Cloud Workloads with DevSecOps
SEC303 Automating Security in Cloud Workloads with DevSecOpsAmazon Web Services
 
ENT313 Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum E...
ENT313 Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum E...ENT313 Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum E...
ENT313 Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum E...Amazon Web Services
 
(STG406) Using S3 to Build and Scale an Unlimited Storage Service
(STG406) Using S3 to Build and Scale an Unlimited Storage Service(STG406) Using S3 to Build and Scale an Unlimited Storage Service
(STG406) Using S3 to Build and Scale an Unlimited Storage ServiceAmazon Web Services
 
February 2016 Webinar Series - Use AWS Cloud Storage as the Foundation for Hy...
February 2016 Webinar Series - Use AWS Cloud Storage as the Foundation for Hy...February 2016 Webinar Series - Use AWS Cloud Storage as the Foundation for Hy...
February 2016 Webinar Series - Use AWS Cloud Storage as the Foundation for Hy...Amazon Web Services
 
AWS APAC Webinar Week - 2015 An Amazing Year in AWS
AWS APAC Webinar Week - 2015 An Amazing Year in AWSAWS APAC Webinar Week - 2015 An Amazing Year in AWS
AWS APAC Webinar Week - 2015 An Amazing Year in AWSAmazon Web Services
 
AWS April 2016 Webinar Series - S3 Best Practices - A Decade of Field Experience
AWS April 2016 Webinar Series - S3 Best Practices - A Decade of Field ExperienceAWS April 2016 Webinar Series - S3 Best Practices - A Decade of Field Experience
AWS April 2016 Webinar Series - S3 Best Practices - A Decade of Field ExperienceAmazon Web Services
 

What's hot (20)

Amazon EC2:Masterclass
Amazon EC2:MasterclassAmazon EC2:Masterclass
Amazon EC2:Masterclass
 
Deep Dive on Amazon S3 (May 2016)
Deep Dive on Amazon S3 (May 2016)Deep Dive on Amazon S3 (May 2016)
Deep Dive on Amazon S3 (May 2016)
 
Deep Dive on Amazon S3
Deep Dive on Amazon S3Deep Dive on Amazon S3
Deep Dive on Amazon S3
 
Keynote: Future of IT - future of enterprise it Canada
Keynote: Future of IT - future of enterprise it CanadaKeynote: Future of IT - future of enterprise it Canada
Keynote: Future of IT - future of enterprise it Canada
 
AWS APAC Webinar Week - Launching Your First Big Data Project on AWS
AWS APAC Webinar Week - Launching Your First Big Data Project on AWSAWS APAC Webinar Week - Launching Your First Big Data Project on AWS
AWS APAC Webinar Week - Launching Your First Big Data Project on AWS
 
Overview of IoT Infrastructure and Connectivity at AWS & Getting Started with...
Overview of IoT Infrastructure and Connectivity at AWS & Getting Started with...Overview of IoT Infrastructure and Connectivity at AWS & Getting Started with...
Overview of IoT Infrastructure and Connectivity at AWS & Getting Started with...
 
BDA402 Deep Dive: Log analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log analytics with Amazon Elasticsearch ServiceBDA402 Deep Dive: Log analytics with Amazon Elasticsearch Service
BDA402 Deep Dive: Log analytics with Amazon Elasticsearch Service
 
Build A Website on AWS for Your First 10 Million Users
Build A Website on AWS for Your First 10 Million UsersBuild A Website on AWS for Your First 10 Million Users
Build A Website on AWS for Your First 10 Million Users
 
Big Data Architectural Patterns
Big Data Architectural PatternsBig Data Architectural Patterns
Big Data Architectural Patterns
 
AWS APAC Webinar Week - Understanding AWS Storage Options
AWS APAC Webinar Week - Understanding AWS Storage OptionsAWS APAC Webinar Week - Understanding AWS Storage Options
AWS APAC Webinar Week - Understanding AWS Storage Options
 
(DVO303) Scaling Infrastructure Operations with AWS
(DVO303) Scaling Infrastructure Operations with AWS(DVO303) Scaling Infrastructure Operations with AWS
(DVO303) Scaling Infrastructure Operations with AWS
 
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar SeriesMigrate your Data Warehouse to Amazon Redshift - September Webinar Series
Migrate your Data Warehouse to Amazon Redshift - September Webinar Series
 
Building Your First Big Data Application on AWS
Building Your First Big Data Application on AWSBuilding Your First Big Data Application on AWS
Building Your First Big Data Application on AWS
 
SEC303 Automating Security in Cloud Workloads with DevSecOps
SEC303 Automating Security in Cloud Workloads with DevSecOpsSEC303 Automating Security in Cloud Workloads with DevSecOps
SEC303 Automating Security in Cloud Workloads with DevSecOps
 
Sec301 Security @ (Cloud) Scale
Sec301 Security @ (Cloud) ScaleSec301 Security @ (Cloud) Scale
Sec301 Security @ (Cloud) Scale
 
ENT313 Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum E...
ENT313 Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum E...ENT313 Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum E...
ENT313 Deploying a Disaster Recovery Site on AWS: Minimal Cost with Maximum E...
 
(STG406) Using S3 to Build and Scale an Unlimited Storage Service
(STG406) Using S3 to Build and Scale an Unlimited Storage Service(STG406) Using S3 to Build and Scale an Unlimited Storage Service
(STG406) Using S3 to Build and Scale an Unlimited Storage Service
 
February 2016 Webinar Series - Use AWS Cloud Storage as the Foundation for Hy...
February 2016 Webinar Series - Use AWS Cloud Storage as the Foundation for Hy...February 2016 Webinar Series - Use AWS Cloud Storage as the Foundation for Hy...
February 2016 Webinar Series - Use AWS Cloud Storage as the Foundation for Hy...
 
AWS APAC Webinar Week - 2015 An Amazing Year in AWS
AWS APAC Webinar Week - 2015 An Amazing Year in AWSAWS APAC Webinar Week - 2015 An Amazing Year in AWS
AWS APAC Webinar Week - 2015 An Amazing Year in AWS
 
AWS April 2016 Webinar Series - S3 Best Practices - A Decade of Field Experience
AWS April 2016 Webinar Series - S3 Best Practices - A Decade of Field ExperienceAWS April 2016 Webinar Series - S3 Best Practices - A Decade of Field Experience
AWS April 2016 Webinar Series - S3 Best Practices - A Decade of Field Experience
 

Viewers also liked

Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSSmartNews, Inc.
 
Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤SmartNews, Inc.
 
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.SmartNews, Inc.
 
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_cccSmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_cccSmartNews, Inc.
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews, Inc.
 
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテムSmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテムSmartNews, Inc.
 
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...SmartNews, Inc.
 
エンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへエンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへSmartNews, Inc.
 
SmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews, Inc.
 
AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側SmartNews, Inc.
 
SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解SmartNews, Inc.
 
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法SmartNews, Inc.
 
SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015SmartNews, Inc.
 
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合SmartNews, Inc.
 
Smartnews Product Manager Night
Smartnews Product Manager NightSmartnews Product Manager Night
Smartnews Product Manager NightSmartNews, Inc.
 
AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」SmartNews, Inc.
 
SmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォームSmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォームSmartNews, Inc.
 
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
[SmartNews] Globally Scalable Web Document Classification Using Word2VecKouhei Nakaji
 
LDAを用いた教師なし単語分類
LDAを用いた教師なし単語分類LDAを用いた教師なし単語分類
LDAを用いた教師なし単語分類Kouhei Nakaji
 

Viewers also liked (20)

Building a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWSBuilding a Sustainable Data Platform on AWS
Building a Sustainable Data Platform on AWS
 
Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤Spring で実現する SmartNews のニュース配信基盤
Spring で実現する SmartNews のニュース配信基盤
 
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
SpringOne Platform 2016 報告会「A Lite Rx API for the JVM」/ 井口 貝 @ SmartNews, Inc.
 
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_cccSmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
SmartNewsのニュース配信を支えるサーバ技術 / Kazhiro Sera @ SmartNews,Inc. #jjug_ccc
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
 
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテムSmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
SmartNews TechNight Vol5 : SmartNews AdServer 解体新書 / ポストモーテム
 
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
SmartNews TechNight Vol.5 : SmartNews Ads の配信最適化の仕組みはどうなってるの? (エンジニア / SmartN...
 
エンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへエンジニアからプロダクトマネージャーへ
エンジニアからプロダクトマネージャーへ
 
NLP in SmartNews
NLP in SmartNewsNLP in SmartNews
NLP in SmartNews
 
SmartNews's journey into microservices
SmartNews's journey into microservicesSmartNews's journey into microservices
SmartNews's journey into microservices
 
AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側AWSの進化とSmartNewsの裏側
AWSの進化とSmartNewsの裏側
 
SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解SmartNews TechNight vol5 SmartNews Ads大図解
SmartNews TechNight vol5 SmartNews Ads大図解
 
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
インフラ専任エンジニアが一人もいないSmartNewsにおけるクラウド活用法
 
SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015SmartNews Ads System - AWS Summit Tokyo 2015
SmartNews Ads System - AWS Summit Tokyo 2015
 
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
Strem処理(Spark Streaming + Kinesis)とOffline処理(Hive)の統合
 
Smartnews Product Manager Night
Smartnews Product Manager NightSmartnews Product Manager Night
Smartnews Product Manager Night
 
AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」AWS meetup「Apache Spark on EMR」
AWS meetup「Apache Spark on EMR」
 
SmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォームSmartNews の Webmining を支えるプラットフォーム
SmartNews の Webmining を支えるプラットフォーム
 
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
[SmartNews] Globally Scalable Web Document Classification Using Word2Vec
 
LDAを用いた教師なし単語分類
LDAを用いた教師なし単語分類LDAを用いた教師なし単語分類
LDAを用いた教師なし単語分類
 

Similar to Stream Processing in SmartNews #jawsdays

MongoDB for Time Series Data
MongoDB for Time Series DataMongoDB for Time Series Data
MongoDB for Time Series DataMongoDB
 
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014Amazon Web Services
 
Barga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 KeynoteBarga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 KeynoteRoger Barga
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseHao Chen
 
Stay clear of the bugs: Troubleshooting Applications in Microsoft Azure
Stay clear of the bugs: Troubleshooting Applications in Microsoft AzureStay clear of the bugs: Troubleshooting Applications in Microsoft Azure
Stay clear of the bugs: Troubleshooting Applications in Microsoft AzureHARMAN Services
 
AWS re:Invent 2016: Life Without SSH: Immutable Infrastructure in Production ...
AWS re:Invent 2016: Life Without SSH: Immutable Infrastructure in Production ...AWS re:Invent 2016: Life Without SSH: Immutable Infrastructure in Production ...
AWS re:Invent 2016: Life Without SSH: Immutable Infrastructure in Production ...Amazon Web Services
 
Why and How SmartNews uses SaaS?
Why and How SmartNews uses SaaS?Why and How SmartNews uses SaaS?
Why and How SmartNews uses SaaS?Takumi Sakamoto
 
Getting Started with Real-Time Analytics
Getting Started with Real-Time AnalyticsGetting Started with Real-Time Analytics
Getting Started with Real-Time AnalyticsAmazon Web Services
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleSriram Krishnan
 
Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek Docker Logging and analysing with Elastic Stack - Jakub Hajek
Docker Logging and analysing with Elastic Stack - Jakub Hajek PROIDEA
 
Docker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackDocker Logging and analysing with Elastic Stack
Docker Logging and analysing with Elastic StackJakub Hajek
 
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...
MongoDB World 2018: Ch-Ch-Ch-Ch-Changes: Taking Your Stitch Application to th...MongoDB
 
AWS Webcast - Introduction to Amazon Kinesis
AWS Webcast - Introduction to Amazon KinesisAWS Webcast - Introduction to Amazon Kinesis
AWS Webcast - Introduction to Amazon KinesisAmazon Web Services
 
AWS re:Invent 2016: Workshop: Building Your First Big Data Application with A...
AWS re:Invent 2016: Workshop: Building Your First Big Data Application with A...AWS re:Invent 2016: Workshop: Building Your First Big Data Application with A...
AWS re:Invent 2016: Workshop: Building Your First Big Data Application with A...Amazon Web Services
 
Native cloud security monitoring
Native cloud security monitoringNative cloud security monitoring
Native cloud security monitoringJohn Varghese
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureJoey Bolduc-Gilbert
 
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017
Monitoring Big Data Systems Done "The Simple Way" - Codemotion Berlin 2017Demi Ben-Ari
 
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon KinesisDay 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon Kinesis
Day 5 - Real-time Data Processing/Internet of Things (IoT) with Amazon KinesisAmazon Web Services
 

Similar to Stream Processing in SmartNews #jawsdays (20)

MongoDB for Time Series Data
MongoDB for Time Series DataMongoDB for Time Series Data
MongoDB for Time Series Data
 
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
(WEB301) Operational Web Log Analysis | AWS re:Invent 2014
 
Barga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 KeynoteBarga IC2E & IoTDI'16 Keynote
Barga IC2E & IoTDI'16 Keynote
 
Apache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real TimeApache Eagle: Secure Hadoop in Real Time
Apache Eagle: Secure Hadoop in Real Time
 
Apache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San JoseApache Eagle at Hadoop Summit 2016 San Jose
Apache Eagle at Hadoop Summit 2016 San Jose
 
Stream Processing in SmartNews #jawsdays

  • 5. What is SmartNews? • News Discovery App for Mobile • Launched in 2012 • 15M+ Downloads Worldwide https://www.smartnews.com/en/
  • 6. How Do We Deliver News? Internet Algorithms Trending News
  • 8. Today’s News is Wrapping Tomorrow’s Fish and Chips
  • 11. Speed Matters for Us
  • 13. News Delivery Pipeline Internet Crawler Analyzer Indexer CloudSearch API Search API Gateway Mobile App API Tracker DynamoDB Index System Feedback System 1 minute 5 minutes
  • 14. Index System • Crawler • collect news articles & social signals • Analyzer • extract title, content, thumbnail... • classify topics (sports, politics, technology...) • Indexer • upload article metadata into CloudSearch
  • 15. Feedback System • API Tracker • receive users' activity logs from the mobile app • Spark Streaming • generate various metrics for news ranking • store metrics in DynamoDB
  • 16. How to Glue Each Service?
  • 17. Ref: Amazon Kinesis: Real-time Streaming Big data Processing Applications
  • 18. Why Kinesis Streams? • Fully managed service • Multiple consumer applications • Reasonable pricing
  • 19. Multiple Consumers Kinesis Stream Spark on EMR AWS Lambda Data Scientist: "I want to consume streaming data with Spark" Application Engineer: "I want to add a streaming monitor with Lambda" Empowers Engineers to Do Trial and Error
  • 20. News Delivery Pipeline Internet Crawler Analyzer Indexer CloudSearch API Search API Gateway Mobile App API Tracker DynamoDB Kinesis Stream Kinesis Stream Kinesis Stream
  • 21. Data & Its Numbers • User activities • ~100 GB per day (compressed) • 60+ record types • User demographics, configurations, etc. • 15M+ records • Article metadata • 100K+ records per day
  • 24. Kinesis Libraries • Kinesis Producer Library (KPL) • put records into a stream • asynchronous architecture (buffer records) • Kinesis Client Library (KCL) • consume and process data from a stream • handle complex tasks associated with distributed computing
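The "buffer records" idea on slide 24 can be sketched in a few lines. This is a toy illustration, not the KPL itself: the class name `BufferingProducer`, the `send_batch` callback, and the demo batch size of 3 are assumptions made for the example, and the sketch flushes synchronously whereas the real library works asynchronously. The default of 500 mirrors the per-request record limit of the Kinesis PutRecords API.

```python
from typing import Callable, List, Tuple

Record = Tuple[str, bytes]  # (partition_key, payload)


class BufferingProducer:
    """Toy producer that buffers records and hands them off in batches.

    Hypothetical helper for illustration only; it is not the KPL API.
    """

    def __init__(self, send_batch: Callable[[List[Record]], None], max_batch: int = 500):
        self._send_batch = send_batch  # assumed callback, e.g. a PutRecords wrapper
        self._max_batch = max_batch
        self._buffer: List[Record] = []

    def put(self, partition_key: str, payload: bytes) -> None:
        """Buffer one record and flush once the batch is full."""
        self._buffer.append((partition_key, payload))
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered as one batch (no-op when empty)."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send_batch(batch)


# Demo: collect batches in a list instead of calling a real stream.
sent_batches: List[List[Record]] = []
producer = BufferingProducer(sent_batches.append, max_batch=3)
for i in range(7):
    producer.put(f"user-{i}", b"{}")
producer.flush()  # push out the final partial batch
print([len(batch) for batch in sent_batches])  # [3, 3, 1]
```

Batching trades a little latency for far fewer requests, which is why a final `flush()` is needed before shutdown.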
  • 25. KPL/KCL Monitoring • KPL/KCL publish custom CloudWatch metrics • Key Metrics for KPL • User Records Received, User Records Pending • All Errors • Key Metrics for KCL • RecordsProcessed • MillisBehindLatest • RecordProcessor.processRecords.Time https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kpl.html https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kcl.html
  • 27. Feedback System Generate Metrics by User Clusters for Ranking Articles Amazon CloudSearch API Search API Gateway Kinesis Stream Amazon S3 Hive / Spark DynamoDB User Clusters User Feedback API Tracker Amazon S3 Offline ETL / Machine Learning Push Notification Article Metadata Metrics by Cluster
  • 28. Why Metrics by Cluster? Consider Each User's Interests. Ensure Diversity to Avoid a Filter Bubble https://en.wikipedia.org/wiki/Filter_bubble
    Diagram labels: Amazon CloudSearch, API, DynamoDB, Article Inventory, Metrics by User Cluster
    User (GET /news/sports): userId: 1000, gender: Male, age: 36, location: San Francisco, US, interests: Baseball
    Article                      raw score   weight   score
    San Francisco Giants …       3.5         1        3.5
    New York Yankees …           6.2         0.6      3.72
    FIFA World Cup …             20.4        0.2      4.08
    U.S. Open Championships …    8.4         0.2      1.68
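The weighting on slide 28 can be read as multiplying each article's raw score by a weight and ranking by the product. The sketch below makes that reading explicit. The function name `rank_articles` and the dictionary layout are assumptions for illustration, and the slide does not say how the weights are derived, so they are simply copied from it.

```python
from typing import Dict, List, Tuple

# Raw scores and weights as listed on the slide.
raw_scores: Dict[str, float] = {
    "San Francisco Giants": 3.5,
    "New York Yankees": 6.2,
    "FIFA World Cup": 20.4,
    "U.S. Open Championships": 8.4,
}
weights: Dict[str, float] = {
    "San Francisco Giants": 1.0,
    "New York Yankees": 0.6,
    "FIFA World Cup": 0.2,
    "U.S. Open Championships": 0.2,
}


def rank_articles(raw: Dict[str, float], weight: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (article, weighted score) pairs, highest score first.

    Articles without a weight get 0.0 (an assumption of this sketch).
    """
    scored = [(title, round(score * weight.get(title, 0.0), 2)) for title, score in raw.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


for title, score in rank_articles(raw_scores, weights):
    print(f"{title}: {score}")
```

Under this reading, a high raw score alone is not enough to rank first, which is one way to keep a feed both personalized and diverse.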
  • 29. Input Data by Fluentd • Forwarder (running on each instance) • archive events to S3 • forward events to aggregators • Aggregator (HA Configuration※) • put events into Kinesis Stream • alert and report (not mentioned here) ※ http://docs.fluentd.org/articles/high-availability
  • 30. Example Configurations
    Forwarder:
      <source>
        @type tail
        tag smartnews.user_activity
        ...
      </source>
      <match smartnews.user_activity>
        @type copy
        <store>
          @type s3
          ...
        </store>
        <store>
          @type forward
          ...
        </store>
      </match>
    Aggregator:
      <source>
        @type forward
        ...
      </source>
      <match smartnews.user_activity>
        @type copy
        <store>
          @type kinesis
          ...
        </store>
        <store>
          ...
        </store>
      </match>
    http://docs.fluentd.org/articles/kinesis-stream
  • 31. Offline ETL Flow: Transform Text Files into Columnar Files. Various Machine Learning Tasks
    Diagram labels: API, RDS, Amazon S3, Hive on EMR, Amazon S3, Spark on EMR, Airflow (Manage Workflow)
    Activities:
      {
        "timestamp": 1453161447,
        "userId": 1234,
        "platform": "ios",
        "edition": "ja_JP",
        "action": "viewArticle",
        "data": { "articleId": 1234, "duration": 30.2 }
      }
    Users:
      userId, age, gender, location, …
      1234, 28, M, Tokyo, …
      1235, 32, F, Nagano, …
      1240, 18, F, Kyoto, …
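To make "transform text files into columnar files" on slide 31 concrete, here is a minimal in-memory sketch that pivots JSON activity lines into one list per field. The deck does this with Hive on EMR and does not name a file format, so the function `to_columns`, the `FIELDS` list, and the second sample event are assumptions for illustration only.

```python
import json
from typing import Any, Dict, Iterable, List

# Column order for the output; nested "data" fields are flattened into it.
FIELDS = ["timestamp", "userId", "platform", "edition", "action", "articleId", "duration"]


def to_columns(json_lines: Iterable[str]) -> Dict[str, List[Any]]:
    """Pivot row-oriented JSON lines into column-oriented lists.

    Missing fields become None so that all columns keep the same length.
    """
    columns: Dict[str, List[Any]] = {name: [] for name in FIELDS}
    for line in json_lines:
        event = json.loads(line)
        flat = {**event, **event.get("data", {})}  # lift nested fields to the top level
        for name in FIELDS:
            columns[name].append(flat.get(name))
    return columns


lines = [
    '{"timestamp": 1453161447, "userId": 1234, "platform": "ios", "edition": "ja_JP",'
    ' "action": "viewArticle", "data": {"articleId": 1234, "duration": 30.2}}',
    # Hypothetical second event without a duration.
    '{"timestamp": 1453161450, "userId": 1235, "platform": "android", "edition": "ja_JP",'
    ' "action": "openApp", "data": {}}',
]
columns = to_columns(lines)
print(columns["userId"])    # [1234, 1235]
print(columns["duration"])  # [30.2, None]
```

Storing values column by column lets a query that touches only `userId` and `action` skip every other field, which is the usual motivation for columnar formats.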
  • 32. Airflow: Workflow Engine. Execute Task A -> Task B -> Task C, D
      5 * * * * app hive -f query_1.hql
      15 * * * * app hive -f query_2.hql
      30 * * * * app hive -f query_3.hql
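Slide 32 contrasts time-staggered cron entries with a workflow engine that runs tasks in dependency order. The sketch below shows that ordering idea with the standard library's `graphlib` (Python 3.9+). It is not Airflow's API, and reading "Task A -> Task B -> Task C, D" as "C and D both depend on B" is an assumption.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Each task maps to the set of tasks it depends on.
dependencies: Dict[str, Set[str]] = {
    "A": set(),
    "B": {"A"},
    "C": {"B"},
    "D": {"B"},
}


def run_in_order(deps: Dict[str, Set[str]], run: Callable[[str], None]) -> None:
    """Run every task only after all of its dependencies have finished."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises CycleError if the graph has a cycle
    while sorter.is_active():
        for task in sorted(sorter.get_ready()):  # sorted only to make the demo deterministic
            run(task)
            sorter.done(task)  # unblocks tasks that were waiting on this one


executed: List[str] = []
run_in_order(dependencies, executed.append)
print(executed)  # ['A', 'B', 'C', 'D']
```

With explicit dependencies, a slow Task A delays Task B instead of letting it start on stale input, which fixed cron offsets cannot guarantee.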
  • 33. Spark Streaming [diagram labels: Kinesis Stream (Shard 1, Shard 2, Shard 3); DStream 1, DStream 2, DStream 3; RDDs; Minutely RDD; Pre Computed RDD (Female, Male, Teen, ...); Minutely Metrics by User Cluster; DynamoDB] Split Streams into Minutely RDD. Join Minutely RDD on PreComputed RDD
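The two steps named on slide 33, splitting the stream into per-minute batches and joining each batch with a pre-computed user-to-cluster table, can be sketched without Spark. The event layout, the function name `minutely_metrics`, and the choice of "event count per article" as the metric are assumptions for illustration; the deck does not list the actual metrics.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[int, int, int]       # (timestamp in seconds, user_id, article_id)
MetricKey = Tuple[int, str, int]   # (minute start, cluster, article_id)


def minutely_metrics(events: Iterable[Event], user_cluster: Dict[int, str]) -> Dict[MetricKey, int]:
    """Count events per (minute, user cluster, article).

    Users missing from the cluster table are dropped, like an inner join.
    """
    counts: Counter = Counter()
    for timestamp, user_id, article_id in events:
        cluster = user_cluster.get(user_id)
        if cluster is None:
            continue
        minute_start = timestamp - timestamp % 60  # bucket into one-minute windows
        counts[(minute_start, cluster, article_id)] += 1
    return dict(counts)


# Pre-computed table: user id -> cluster (cluster names taken from the slide).
user_cluster = {1: "Female", 2: "Male", 3: "Teen"}
events = [(0, 1, 1234), (10, 2, 1234), (59, 1, 1234), (65, 3, 5678), (70, 9, 5678)]
metrics = minutely_metrics(events, user_cluster)
print(metrics[(0, "Female", 1234)])  # 2
print(metrics[(60, "Teen", 5678)])   # 1
```

Keeping the cluster table pre-computed means the streaming side only does a lookup per event, so each minute's batch stays cheap to process.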
  • 34. Monitor Spark Streaming Spark UI is Useful for Monitoring
  • 35. Integrate with CloudWatch
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.scheduler._

      // Relays Spark Streaming batch events to CloudWatch as custom metrics.
      // putMetricToCloudWatch is a helper that is not shown on the slide.
      class CloudWatchRelay(conf: SparkConf) extends StreamingListener {
        override def onBatchStarted(batchStarted: StreamingListenerBatchStarted): Unit = {
          putMetricToCloudWatch("BatchStarted", 1.0)
        }

        override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
          val info = batchCompleted.batchInfo
          putMetricToCloudWatch("BatchCompleted", 1.0)
          putMetricToCloudWatch("BatchRecordsProcessed", info.numRecords.toDouble)
          // Each delay is an Option in milliseconds and is empty until it is known.
          info.processingDelay.foreach { delay => putMetricToCloudWatch("ProcessingDelay", delay) }
          info.schedulingDelay.foreach { delay => putMetricToCloudWatch("SchedulingDelay", delay) }
          info.totalDelay.foreach { delay => putMetricToCloudWatch("TotalDelay", delay) }
        }
      }
    Set an alert on SchedulingDelay
  • 37. Summary • Fast & stable stream processing is crucial for SmartNews • lifetime of news is very short • process events as fast as possible • Kinesis Stream plays an important role • one-click provision & scale • empowers engineers to do trial & error
  • 38. Discuss More? Join Our Free Lunch in Tokyo Office!!
  • 39. We’re hiring!!! ML/NLP engineer Site reliability engineer Web application engineer iOS/Android engineer Ad engineer http://about.smartnews.com/en/careers/
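  • 40. See Also • SmartNews の Webmining を支えるプラットフォーム (The Platform Behind SmartNews's Web Mining) • Stream 処理と Offline 処理の統合 (Integrating Stream Processing and Offline Processing) • Building a Sustainable Data Platform on AWS • AWS meetup "Apache Spark on EMR"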
  • 42. PipelineDB • OSS & enterprise streaming SQL database • PostgreSQL compatible • connects to Chartio 😍 • joins streams to normal PostgreSQL tables • supports probabilistic data structures • e.g. HyperLogLog https://www.pipelinedb.com/ http://developer.smartnews.com/blog/2015/09/09/20150907pipelinedb/
  • 43. Realtime Monitoring [diagram labels: API Gateway, Record, Stream, Continuous View (x3), PipelineDB, Chartio, AWS Lambda, Slack] Raw records are discarded soon after being consumed by a Continuous View. Continuous Views are incrementally updated in real time. Access Continuous Views with a PostgreSQL client.
  • 44. Continuous View
      -- Calculate unique users seen per media each day
      -- Using only a constant amount of space (HyperLogLog)
      CREATE CONTINUOUS VIEW uniques AS
        SELECT day(arrival_timestamp),
               substring(url from '.*://([^/]*)') AS hostname,
               COUNT(DISTINCT user_id::integer)
        FROM activity_stream
        GROUP BY day, hostname;

      -- How many impressions have we served in the last five minutes?
      CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
        SELECT COUNT(*) FROM imps_stream;

      -- What are the 90th, 95th, 99th percentiles of request latency?
      CREATE CONTINUOUS VIEW latency AS
        SELECT percentile_cont(array[90, 95, 99])
               WITHIN GROUP (ORDER BY latency::integer)
        FROM latency_stream;
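The `max_age = '5 minutes'` view on slide 44 answers "how many events arrived in the last five minutes" and is updated incrementally rather than recomputed. A minimal sketch of that sliding-window count follows. The class `SlidingWindowCounter` is hypothetical, it assumes timestamps arrive in increasing order, and it says nothing about how PipelineDB implements the feature internally.

```python
from collections import deque
from typing import Deque


class SlidingWindowCounter:
    """Count events whose timestamp falls within the last max_age seconds."""

    def __init__(self, max_age_seconds: float):
        self._max_age = max_age_seconds
        self._timestamps: Deque[float] = deque()  # oldest event on the left

    def add(self, timestamp: float) -> None:
        """Record one event; timestamps are assumed to be non-decreasing."""
        self._timestamps.append(timestamp)

    def count(self, now: float) -> int:
        """Evict events older than the window, then return how many remain."""
        cutoff = now - self._max_age
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)


impressions = SlidingWindowCounter(max_age_seconds=300)  # five minutes
for ts in (0, 100, 290):
    impressions.add(ts)
print(impressions.count(now=299))  # 3
impressions.add(400)
print(impressions.count(now=400))  # 2
```

Each event is appended once and evicted once, so the cost per event stays constant no matter how long the stream runs.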
  • 45. Dashboard in Chartio 1. Build a query (drag & drop / SQL) 2. Add steps (filter, sort, modify) 3. Select a visualization (table, graph)