13. News Delivery Pipeline
(Diagram. Labels: Internet, Crawler, Analyzer, Indexer, CloudSearch, API Search, API Gateway, Mobile App, API Tracker, DynamoDB. Groupings: Index System, Feedback System. Timing labels: 1 minute, 5 minutes.)
14. Index System
• Crawler
  • collect news articles & social signals
• Analyzer
  • extract title, content, thumbnail...
  • classify topics (sports, politics, technology...)
• Indexer
  • upload article metadata into CloudSearch
15. Feedback System
• API Tracker
  • receive users' activity logs from the mobile app
• Spark Streaming
  • generate various metrics for news ranking
  • store metrics in DynamoDB
20. News Delivery Pipeline
(Diagram. Same pipeline as slide 13, with three Kinesis Stream components added. Labels: Internet, Crawler, Analyzer, Indexer, CloudSearch, API Search, API Gateway, Mobile App, API Tracker, DynamoDB, Kinesis Stream (×3).)
21. Data & Its Numbers
• User activities
  • ~100 GB per day (compressed)
  • 60+ record types
• User demographics, configurations, etc.
  • 15M+ records
• Article metadata
  • 100K+ records per day
24. Kinesis Libraries
• Kinesis Producer Library (KPL)
  • put records into a stream
  • asynchronous architecture (buffer records)
• Kinesis Consumer Library (KCL)
  • consume and process data from a stream
  • handle complex tasks associated with distributed computing
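The buffering idea behind the KPL can be sketched in a few lines. This is a simplified, synchronous illustration and not the KPL's actual API: `BufferingProducer` and the injected `send_batch` callable are hypothetical names, and the real library also flushes on a timer from a background thread. The 500-record cap mirrors the Kinesis PutRecords per-request limit.

```python
from typing import Callable, List, Tuple

# PutRecords accepts at most 500 records per request (a Kinesis API limit).
MAX_BATCH = 500

Record = Tuple[str, bytes]  # (partition_key, payload)


class BufferingProducer:
    """Hypothetical stand-in for KPL-style buffering, not the real KPL API."""

    def __init__(self, send_batch: Callable[[List[Record]], None],
                 max_batch: int = MAX_BATCH) -> None:
        self._send_batch = send_batch  # e.g. a wrapper around a PutRecords call
        self._max_batch = max_batch
        self._buffer: List[Record] = []

    def put(self, partition_key: str, payload: bytes) -> None:
        """Buffer one record; send a batch once the buffer is full."""
        self._buffer.append((partition_key, payload))
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered, if anything."""
        if not self._buffer:
            return
        batch, self._buffer = self._buffer, []
        self._send_batch(batch)

    @property
    def pending(self) -> int:
        """Number of records waiting to be sent."""
        return len(self._buffer)
```

The `pending` count plays the same role as the "pending" metric a producer would watch: if it keeps growing, records are arriving faster than they are being sent.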
25. KPL/KCL Monitoring
• KPL and KCL publish custom CloudWatch metrics
• Key Metrics for KPL
  • User Records Received, User Records Pending
  • All Errors
• Key Metrics for KCL
  • RecordsProcessed
  • MillisBehindLatest
  • RecordProcessor.processRecords.Time
https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kpl.html
https://docs.aws.amazon.com/kinesis/latest/dev/monitoring-with-kcl.html
27. Feedback System
Generate Metrics by User Clusters for Ranking Articles
(Diagram. Labels: Amazon CloudSearch, API Search, API Gateway, API Tracker, Kinesis Stream, Amazon S3, Hive / Spark, DynamoDB, Offline ETL / Machine Learning, Push Notification. Data labels: User Clusters, User Feedback, Article Metadata, Metrics by Cluster.)
28. Why Metrics by Cluster?
Consider Each User's Interests
Ensure Diversity to Avoid a Filter Bubble
https://en.wikipedia.org/wiki/Filter_bubble
(Diagram. A user sends GET /news/sports to the API, which combines the Article Inventory from Amazon CloudSearch with the Metrics by User Cluster from DynamoDB.)
User
  userId: 1000
  gender: Male
  age: 36
  location: San Francisco, US
  interests: Baseball
Article                      raw score   weight   score
San Francisco Giants …       3.5         1        3.5
New York Yankees …           6.2         0.6      3.72
FIFA World Cup …             20.4        0.2      4.08
U.S. Open Championships …    8.4         0.2      1.68
score = raw score × weight
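The scoring on this slide can be read as a per-article product of a raw score and a weight. The sketch below assumes exactly that (score = raw score × weight) and uses the slide's numbers. The function name `rank_articles` and the choice to key weights by article title are illustrative assumptions, not the production implementation. Under this reading, the product for the New York Yankees row is 6.2 × 0.6 = 3.72.

```python
from typing import Dict, List, Tuple


def rank_articles(raw_scores: Dict[str, float],
                  weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Multiply each raw score by its weight and sort by the result, highest first.

    Articles without a weight get 0.0 (an assumption made for this sketch).
    """
    scored = [
        (title, round(raw * weights.get(title, 0.0), 2))
        for title, raw in raw_scores.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Numbers taken from the slide; the weights are assumed to come from the
# metrics for this user's cluster.
raw_scores = {
    "San Francisco Giants": 3.5,
    "New York Yankees": 6.2,
    "FIFA World Cup": 20.4,
    "U.S. Open Championships": 8.4,
}
weights = {
    "San Francisco Giants": 1.0,
    "New York Yankees": 0.6,
    "FIFA World Cup": 0.2,
    "U.S. Open Championships": 0.2,
}

ranking = rank_articles(raw_scores, weights)
```

The weighting shrinks the gap between a globally popular article (FIFA World Cup, raw score 20.4) and articles closer to the user's interests, while still leaving room for articles outside those interests.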
29. Input Data by Fluentd
• Forwarder (running on each instance)
  • archive events to S3
  • forward events to aggregators
• Aggregator (HA configuration※)
  • put events into Kinesis Stream
  • alert and report (not mentioned here)
※ http://docs.fluentd.org/articles/high-availability
33. Spark Streaming
(Diagram. A Kinesis Stream with Shard 1, Shard 2, and Shard 3 feeds DStream 1, DStream 2, and DStream 3, each a sequence of RDDs. The DStreams are combined into a Minutely RDD, which is joined with a Pre Computed RDD of user clusters (Female, Male, Teen, ...) to produce Minutely Metrics by User Cluster, stored in DynamoDB.)
• Split streams into minutely RDDs
• Join each minutely RDD with the precomputed RDD
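The two steps named on this slide, splitting the stream into one-minute batches and joining each batch with precomputed user clusters, can be illustrated without Spark. The following is a plain-Python sketch of the logic only and does not use the Spark Streaming API. The event layout `(epoch_seconds, user_id, article_id)` and the function name `minutely_metrics` are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[int, int, str]  # (epoch_seconds, user_id, article_id), assumed layout


def minutely_metrics(events: Iterable[Event],
                     user_cluster: Dict[int, str]) -> Counter:
    """Count events per (minute, cluster, article).

    `user_cluster` plays the role of the precomputed RDD: user_id -> cluster.
    """
    counts: Counter = Counter()
    for ts, user_id, article_id in events:
        cluster = user_cluster.get(user_id)
        if cluster is None:
            continue  # inner join: drop events from users without a cluster
        minute = ts - ts % 60  # start of the one-minute bucket
        counts[(minute, cluster, article_id)] += 1
    return counts
```

Each resulting key corresponds to one row of "minutely metrics by user cluster"; in the pipeline described here, such rows would be written to DynamoDB.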
37. Summary
• Fast & stable stream processing is crucial for SmartNews
  • lifetime of news is very short
  • process events as fast as possible
• Kinesis Stream plays an important role
  • one-click provisioning & scaling
  • empowers engineers to do trial & error
39. We’re hiring!!!
ML/NLP engineer
Site reliability engineer
Web application engineer
iOS/Android engineer
Ad engineer
http://about.smartnews.com/en/careers/
40. See Also
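• SmartNews の Webmining を支えるプラットフォーム (The Platform Supporting SmartNews's Web Mining)
• Stream 処理と Offline 処理の統合 (Integrating Stream Processing and Offline Processing)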
• Building a Sustainable Data Platform on AWS
• AWS meetup「Apache Spark on EMR」
44. Continuous View
-- Calculate unique users seen per media each day
-- Using only a constant amount of space (HyperLogLog)
CREATE CONTINUOUS VIEW uniques AS
SELECT
  day(arrival_timestamp),
  substring(url from '.*://([^/]*)') AS hostname,
  COUNT(DISTINCT user_id::integer)
FROM activity_stream
GROUP BY day, hostname;

-- How many impressions have we served in the last five minutes?
CREATE CONTINUOUS VIEW imps WITH (max_age = '5 minutes') AS
SELECT COUNT(*) FROM imps_stream;

-- What are the 90th, 95th, 99th percentiles of request latency?
-- (percentile_cont takes fractions between 0 and 1)
CREATE CONTINUOUS VIEW latency AS
SELECT
  percentile_cont(array[0.90, 0.95, 0.99])
    WITHIN GROUP (ORDER BY latency::integer)
FROM latency_stream;
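The `max_age = '5 minutes'` view keeps a count over a sliding window. As a rough illustration of that idea only (not of how the database implements it), the sketch below counts events whose age is strictly less than a maximum. The class name `SlidingWindowCounter` is hypothetical, and it assumes events are added in non-decreasing time order.

```python
from collections import deque


class SlidingWindowCounter:
    """Count events younger than `max_age_seconds` (illustrative sketch)."""

    def __init__(self, max_age_seconds: float) -> None:
        self._max_age = max_age_seconds
        self._arrivals = deque()  # arrival timestamps, oldest first

    def add(self, arrival_ts: float) -> None:
        """Record one event; timestamps must arrive in non-decreasing order."""
        self._arrivals.append(arrival_ts)

    def count(self, now: float) -> int:
        """Drop events that have aged out, then return how many remain."""
        while self._arrivals and self._arrivals[0] <= now - self._max_age:
            self._arrivals.popleft()
        return len(self._arrivals)
```

For a five-minute window, `SlidingWindowCounter(300)` would be fed one timestamp per impression and queried with the current time.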
45. Dashboard in Chartio
1. Build a query
(drag & drop or SQL)
2. Add steps
(filter, sort, modify)
3. Choose a visualization
(table, graph)