Viki Analytics Infrastructure
BigData Singapore Meetup
Oct 2013
Viki’s Data Pipeline
1. Collecting Data
What data do we collect?
• Clickstream data
• An event is a user interaction or a product-related action
• A client (web/mobile) sends these events as
HTTP calls.
• Format: JSON
– Schema-less
– Flexible
{"origin":"tv_show_show", "app_ver":"2.9.3.151",
"uuid":"80833c5a760597bf1c8339819636df04",
"user_id":"5298933u", "vs_id":"1008912v-1380452660-7920",
"app_id":"100004a", "event":"video_play", "timed_comment":"off",
"stream_quality":"variable", "bottom_subtitle":"en",
"device_size":"tablet", "feature":"auto_play",
"video_id":"1008912v", "subtitle_completion_percent":"100",
"device_id":"iPad2,1|6.1.3|apple", "t":"1380452846",
"ip":"99.232.169.246", "country":"ca",
"city_name":"Toronto", "region_name":"ON"}
…
How to keep this data clean?
• Problem: Clients often send erroneous data, e.g. a missing parameter.
• Solution: We write client libraries for each client to enforce “world peace”.
P.S. There is no such thing as “world peace”.
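A minimal sketch of what such a client library might enforce, assuming a small set of required parameters. The field list and the `build_event` helper are hypothetical; the slides do not show Viki's actual client code.

```python
import json
import time

# Hypothetical required fields; the slides do not list the real ones.
REQUIRED_FIELDS = ("event", "app_id", "uuid")


def build_event(event, app_id, uuid, **properties):
    """Return a JSON string for one event, refusing incomplete payloads."""
    payload = {"event": event, "app_id": app_id, "uuid": uuid}
    payload.update(properties)

    # Reject the event before it is sent if a required parameter is empty.
    missing = [name for name in REQUIRED_FIELDS if not payload.get(name)]
    if missing:
        raise ValueError("missing required parameter(s): " + ", ".join(missing))

    # Unix timestamp as a string, matching the "t" field in the sample event.
    payload.setdefault("t", str(int(time.time())))
    return json.dumps(payload)
```

Validating on the client keeps bad events out of the pipeline, but as the slide notes, it does not remove the need for clean-up later.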
How to collect > 60 M events a day?
• fluentd
– Scalable
– Extensible
– Lets you send data to Hadoop, MongoDB, PostgreSQL, etc.
• Writes to Hadoop (TD), Amazon S3, MongoDB
Where do we store?
• Hadoop (Treasure Data)
– It’s fast and easy to set up!
– We don’t have the money or time to hire a Hadoop engineer.
– We retrieve data from Hadoop in batch jobs.
• Amazon S3
– Backup
• MongoDB: Real-time data
2. Retrieving & Processing Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
Getting All Data To 1 Place
• Port data from different production databases into PG
• Retrieve click-stream data from Hadoop to PG
a) Production Databases → Analytics DB (PostgreSQL):
thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1
thor db:cp --source A --destination B -t reporting.video_plays --increment
b) Click-stream Data (Hadoop) → Analytics DB:
Hadoop → Aggregation (Hive) → Export Output / Sqoop → PostgreSQL
Raw events (Hadoop):
{"origin":"tv_show_show", "app_ver":"2.9.3.151",
"uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u",
"vs_id":"1008912v-1380452660-7920", "app_id":"100004a",
"event":"video_play", "timed_comment":"off", "stream_quality":"variable",
"bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play",
"video_id":"1008912v", "subtitle_completion_percent":"100",
"device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246",
"country":"ca", "city_name":"Toronto", "region_name":"ON"}
…
Aggregated rows (PostgreSQL):
date        source   partner  event       video_id  country  cnt
2013-09-29  ios      viki     video_play  1008912v  ca       2
2013-09-29  android  viki     video_play  1008912v  us       18
…
SELECT
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'],
COUNT(1) as cnt
FROM events
WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30')
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ),
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'];
Simple Aggregation SQL
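For readers less familiar with Hive, the same grouping can be sketched in plain Python. This is an illustration only, not how the pipeline runs: it assumes events are dicts carrying the Unix timestamp in the `t` field (as in the sample event) and that dates are taken in UTC, which the slides do not state.

```python
from collections import Counter
from datetime import datetime, timezone

DIMENSIONS = ("source", "partner", "event", "video_id", "country")


def aggregate_plays(events):
    """Count video_play events per (date, source, partner, event, video_id, country)."""
    counts = Counter()
    for e in events:
        # Equivalent of: WHERE v['event'] = 'video_play'
        if e.get("event") != "video_play":
            continue
        # Equivalent of: SUBSTR(FROM_UNIXTIME(time), 0, 10), assuming UTC.
        day = datetime.fromtimestamp(int(e["t"]), tz=timezone.utc).strftime("%Y-%m-%d")
        # Equivalent of: GROUP BY ... COUNT(1)
        counts[(day,) + tuple(e.get(d) for d in DIMENSIONS)] += 1
    return counts
```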
But… The Data Is Not Clean!
Event properties and names change as we develop:
Old Version: {"user_id": "152", "country_code": "sg"}
New Version: {"user_id": "152u", "country": "sg"}
SELECT SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['app_id'] AS `app_id`,
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
ELSE LOWER( v['partner'] )
END AS `partner`,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] )
END AS `source` ,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ) AS `country` ,
COALESCE ( v['device_size'] ,v['device'] ) AS `device`,
COUNT( 1 ) AS `cnt`
FROM events
WHERE time >= 1380326400 AND time <= 1380412799
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ),
v['app_id'],
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
ELSE LOWER( v['partner'] )
END,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] )
END,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ),
COALESCE ( v['device_size'] ,v['device'] );
(Not so) simple Aggregation SQL
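Two of the normalization rules in the query can be sketched as small Python functions, which makes them easier to test in isolation. This is a hedged illustration, not the pipeline's code. One assumption to flag: in SQL `LIKE`, `_` matches any single character, so `'%_ax'` matches any value ending in `ax` preceded by at least one character; the sketch assumes a literal `_ax` suffix was intended.

```python
# Suffix-to-partner mapping taken from the CASE expression in the query.
PARTNER_BY_SUFFIX = {
    "_ax": "axis",
    "_kd": "amazon",
    "_kf": "amazon",
    "_lv": "lenovo",
    "_nx": "nexian",
    "_sf": "smartfren",
    "_vp": "samsung_viki_premiere",
}


def normalize_partner(event):
    """Derive the partner from the app_ver suffix, else lower-case the partner field."""
    app_ver = event.get("app_ver") or ""
    for suffix, partner in PARTNER_BY_SUFFIX.items():
        # Assumption: the underscore is treated literally here, unlike SQL LIKE.
        if app_ver.endswith(suffix):
            return partner
    partner = event.get("partner")
    return partner.lower() if partner is not None else None


def normalize_country(event):
    """COALESCE(country, country_code), trimmed; keep only 2-letter codes, lower-cased."""
    raw = event.get("country")
    if raw is None:
        raw = event.get("country_code")
    if raw is None:
        return None
    code = raw.strip()
    return code.lower() if len(code) == 2 else None
```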
Cleaning continues even after import (Hadoop → PostgreSQL):
UPDATE "reporting"."cl_main_2013_09"
SET source = 'embed', partner = 'partner1'
WHERE app_id = '100105a' AND (source != 'embed' OR partner != 'partner1');
UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100105a'
WHERE (source = 'embed' AND partner = 'partner1') AND (app_id != '100105a');
UPDATE reporting.cl_main_2013_09
SET user_id = user_id || 'u'
WHERE RIGHT(user_id, 1) ~ '[0-9]';
UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100106a'
WHERE (source = 'embed' AND partner = 'partner2') AND (app_id != '100106a');
UPDATE reporting.cl_main_2013_09
SET source = 'raynor', partner = 'viki', app_id = '100000a'
WHERE event = 'pv'
AND source IS NULL
AND partner IS NULL
AND app_id IS NULL;
Cleaning Up Data Takes Lots of Time
• Import data: 30%
• Clean up data: 70%
Transforming Data
Analytics DB (PostgreSQL): Table A, Table B, …
a) Reducing Table Size By Dropping Dimension
video_plays_with_video_id (20M records):
date        source  partner  event       video_id  country  cnt
2013-09-29  ios     viki     video_play  1v        ca       2
2013-09-29  ios     viki     video_play  2v        ca       18
…
video_plays (4M records):
date        source  partner  event       country  cnt
2013-09-29  ios     viki     video_play  ca       20
…
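The roll-up above amounts to grouping by every remaining column and summing `cnt`. A minimal Python sketch of that idea, assuming rows are dicts keyed by column name (the `drop_dimension` helper is hypothetical; in the deck this is done inside PostgreSQL):

```python
def drop_dimension(rows, dimension, measure="cnt"):
    """Remove one dimension column and sum the measure over the remaining columns."""
    totals = {}
    for row in rows:
        # Group key: every column except the dropped dimension and the measure.
        key = tuple((k, v) for k, v in row.items() if k not in (dimension, measure))
        totals[key] = totals.get(key, 0) + row[measure]
    return [{**dict(key), measure: total} for key, total in totals.items()]
```

Applied to the two `video_plays_with_video_id` rows shown above, dropping `video_id` yields the single `video_plays` row with `cnt` 20.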
b) Injecting Extra Fields For Analysis
Relation: containers (1) to videos (n)
containers (before):
id  title
1c  Game of Thrones
2c  My Girlfriend Is A Gumiho
…
containers (after, with video_count):
id  title                      video_count
1c  Game of Thrones            30
2c  My Girlfriend Is A Gumiho  16
…
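The injected `video_count` field is a count over the one-to-many relation between containers and videos. A small Python sketch of the idea, assuming each video row carries a `container_id` column (a hypothetical name; the slides do not show the videos table's schema):

```python
from collections import Counter


def add_video_count(containers, videos):
    """Return container rows extended with the number of videos that reference them."""
    # Assumption: videos reference their container through "container_id".
    counts = Counter(v["container_id"] for v in videos)
    return [{**c, "video_count": counts.get(c["id"], 0)} for c in containers]
```

Precomputing the count once at transform time means analysts can filter or sort by it without joining against the videos table in every report.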
Chunk Tables By Month
video_plays (parent table)
– video_plays_2013_06
– video_plays_2013_07
– video_plays_2013_08
– video_plays_2013_09
– …
ALTER TABLE video_plays_2013_09 INHERIT video_plays;
ALTER TABLE video_plays_2013_09
ADD CHECK (date >= '2013-09-01' AND date < '2013-10-01');
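Since the same two statements are needed every month, they can be generated. A minimal sketch, assuming the child table already exists and follows the `<parent>_YYYY_MM` naming shown above; the `monthly_partition_sql` helper is hypothetical and only builds the SQL strings.

```python
from datetime import date


def monthly_partition_sql(parent, year, month, column="date"):
    """Build the INHERIT and CHECK statements for one monthly child table."""
    start = date(year, month, 1)
    # The upper bound is exclusive: the first day of the following month.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    child = f"{parent}_{start:%Y_%m}"
    return [
        f"ALTER TABLE {child} INHERIT {parent};",
        f"ALTER TABLE {child} ADD CHECK ({column} >= '{start}' AND {column} < '{end}');",
    ]
```

With the CHECK constraints in place, PostgreSQL's constraint exclusion can skip child tables whose date range cannot match a query on the parent.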
Managing Job Dependency
Analytics DB (PostgreSQL): tableA, tableB, …
• Azkaban
• Cron dependency management
(Viki Cron Dependency Graph)
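The core of dependency management is running each job only after the jobs it depends on. A minimal sketch of that ordering using Python's standard `graphlib` module (Python 3.9+). The job names are hypothetical, and this is not Azkaban's API; it only illustrates the idea behind a dependency graph.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each job maps to the set of jobs it depends on.
JOBS = {
    "import_clickstream": set(),
    "copy_production_dbs": set(),
    "clean_video_plays": {"import_clickstream"},
    "build_video_plays_rollup": {"clean_video_plays"},
    "logship_dashboard": {"build_video_plays_rollup", "copy_production_dbs"},
}


def run_order(jobs):
    """Return a job order in which every job comes after all of its dependencies."""
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    return list(TopologicalSorter(jobs).static_order())
```

Plain cron can only schedule by clock time, so a late upstream job silently feeds stale data downstream; an explicit graph avoids that.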
Data Presentation
Dashboard
• Yes, dashboard on Rails.
• We have a daily logship process to port the data over to the dashboard server.
thor db:logship -t big_table
Data Visualization
Tableau is slow when working directly on PostgreSQL
– Export compressed CSVs to the Tableau server
– Windows ☹
– Line charts do solve most problems
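A compressed CSV export can be sketched with the Python standard library alone. This assumes rows are already available as dicts; the `export_gzip_csv` helper and the file layout are hypothetical, since the slides do not show how the export is actually done.

```python
import csv
import gzip


def export_gzip_csv(rows, fieldnames, path):
    """Write rows (dicts) to a gzip-compressed CSV file with a header line."""
    # Text mode ("wt") lets the csv module write straight into the gzip stream.
    with gzip.open(path, "wt", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path
```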
Engineering involvement in report creation
• Bad idea!
• Enter Query Reports!
– Fast report churn rate
“Give me six hours to chop down a tree and I will spend the first four sharpening the axe” – Abraham Lincoln
Query Reports
Summary report
• Higher level view of metrics
• See changes over time
• (screen shot)
Data Explorer
“The world is your oyster”
One more thing! (Viki Live)
Recap
Lessons Learnt
• Line charts can solve most problems
• Chart your data quickly
• Our dataset is not that big
Simple DIY Suggestion
• Put QueryReports on top of your database. Or Tableau
Desktop.
• Use Mixpanel/KISSMetrics for Product Analytics
• fluentd can write data to Postgres (hstore)
We are hiring!
Thank you!
ishan@viki.com
huy@viki.com
Viki’s Data Pipeline

More Related Content

Viewers also liked

How experiments drive product growth at Viki
How experiments drive product growth at VikiHow experiments drive product growth at Viki
How experiments drive product growth at Vikiishanagrawal90
 
Genius Company Report
Genius Company Report Genius Company Report
Genius Company Report Amaraj Judge
 
Local or Bust! Google Local and all Things Links WCMKE 2014
Local or Bust! Google Local and all Things Links WCMKE 2014Local or Bust! Google Local and all Things Links WCMKE 2014
Local or Bust! Google Local and all Things Links WCMKE 2014Rachel Fredrickson
 
[RakutenTechConf2013] [C-2_1] Viki - Technology evolution from idea to acquis...
[RakutenTechConf2013] [C-2_1] Viki - Technology evolution from idea to acquis...[RakutenTechConf2013] [C-2_1] Viki - Technology evolution from idea to acquis...
[RakutenTechConf2013] [C-2_1] Viki - Technology evolution from idea to acquis...Rakuten Group, Inc.
 
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy NguyenGrokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy NguyenHuy Nguyen
 
Wikipedia YouTube & Information Literacy
Wikipedia YouTube & Information LiteracyWikipedia YouTube & Information Literacy
Wikipedia YouTube & Information LiteracyEsther Grassian
 
Video Recommender in Viki (VikiでのVideoレコメンド事例)
Video Recommender in Viki (VikiでのVideoレコメンド事例)Video Recommender in Viki (VikiでのVideoレコメンド事例)
Video Recommender in Viki (VikiでのVideoレコメンド事例)umekoumeda
 
CorruptionTrak presentation at CeBIT2012 Code_n12
CorruptionTrak presentation at CeBIT2012 Code_n12CorruptionTrak presentation at CeBIT2012 Code_n12
CorruptionTrak presentation at CeBIT2012 Code_n12ishanagrawal90
 
Communities of Practice: Conversations To Collaboration
Communities of Practice: Conversations To CollaborationCommunities of Practice: Conversations To Collaboration
Communities of Practice: Conversations To CollaborationCollabor8now Ltd
 
A Presentation About Community, By The Community
A Presentation About Community, By The CommunityA Presentation About Community, By The Community
A Presentation About Community, By The CommunityNeil Perkin
 
Fusion investor presentation september 2013 final 1
Fusion investor presentation september 2013 final 1Fusion investor presentation september 2013 final 1
Fusion investor presentation september 2013 final 1Henry Val
 

Viewers also liked (15)

How experiments drive product growth at Viki
How experiments drive product growth at VikiHow experiments drive product growth at Viki
How experiments drive product growth at Viki
 
Genius Company Report
Genius Company Report Genius Company Report
Genius Company Report
 
Local or Bust! Google Local and all Things Links WCMKE 2014
Local or Bust! Google Local and all Things Links WCMKE 2014Local or Bust! Google Local and all Things Links WCMKE 2014
Local or Bust! Google Local and all Things Links WCMKE 2014
 
[RakutenTechConf2013] [C-2_1] Viki - Technology evolution from idea to acquis...
[RakutenTechConf2013] [C-2_1] Viki - Technology evolution from idea to acquis...[RakutenTechConf2013] [C-2_1] Viki - Technology evolution from idea to acquis...
[RakutenTechConf2013] [C-2_1] Viki - Technology evolution from idea to acquis...
 
Viki Media Kit - 2015.
Viki Media Kit - 2015.Viki Media Kit - 2015.
Viki Media Kit - 2015.
 
Genius
GeniusGenius
Genius
 
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy NguyenGrokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
Grokking Engineering - Data Analytics Infrastructure at Viki - Huy Nguyen
 
Wikipedia YouTube & Information Literacy
Wikipedia YouTube & Information LiteracyWikipedia YouTube & Information Literacy
Wikipedia YouTube & Information Literacy
 
Dynamics Of Wikipedia
Dynamics Of  WikipediaDynamics Of  Wikipedia
Dynamics Of Wikipedia
 
Video Recommender in Viki (VikiでのVideoレコメンド事例)
Video Recommender in Viki (VikiでのVideoレコメンド事例)Video Recommender in Viki (VikiでのVideoレコメンド事例)
Video Recommender in Viki (VikiでのVideoレコメンド事例)
 
CorruptionTrak presentation at CeBIT2012 Code_n12
CorruptionTrak presentation at CeBIT2012 Code_n12CorruptionTrak presentation at CeBIT2012 Code_n12
CorruptionTrak presentation at CeBIT2012 Code_n12
 
Communities of Practice: Conversations To Collaboration
Communities of Practice: Conversations To CollaborationCommunities of Practice: Conversations To Collaboration
Communities of Practice: Conversations To Collaboration
 
A Presentation About Community, By The Community
A Presentation About Community, By The CommunityA Presentation About Community, By The Community
A Presentation About Community, By The Community
 
Community Development
Community DevelopmentCommunity Development
Community Development
 
Fusion investor presentation september 2013 final 1
Fusion investor presentation september 2013 final 1Fusion investor presentation september 2013 final 1
Fusion investor presentation september 2013 final 1
 

Similar to Viki Big Data Meetup 2013_10

OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...
OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...
OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...Altinity Ltd
 
Neo4j GraphSummit Copenhagen - The path to success with Graph Database and Gr...
Neo4j GraphSummit Copenhagen - The path to success with Graph Database and Gr...Neo4j GraphSummit Copenhagen - The path to success with Graph Database and Gr...
Neo4j GraphSummit Copenhagen - The path to success with Graph Database and Gr...Neo4j
 
Building Your First App with MongoDB Stitch
Building Your First App with MongoDB StitchBuilding Your First App with MongoDB Stitch
Building Your First App with MongoDB StitchMongoDB
 
Powering Heap With PostgreSQL And CitusDB (PGConf Silicon Valley 2015)
Powering Heap With PostgreSQL And CitusDB (PGConf Silicon Valley 2015)Powering Heap With PostgreSQL And CitusDB (PGConf Silicon Valley 2015)
Powering Heap With PostgreSQL And CitusDB (PGConf Silicon Valley 2015)Dan Robinson
 
Patterns and Practices for Event Design With Adam Bellemare | Current 2022
Patterns and Practices for Event Design With Adam Bellemare | Current 2022Patterns and Practices for Event Design With Adam Bellemare | Current 2022
Patterns and Practices for Event Design With Adam Bellemare | Current 2022HostedbyConfluent
 
Connecting Your Customers – Building Successful Mobile Games through the Powe...
Connecting Your Customers – Building Successful Mobile Games through the Powe...Connecting Your Customers – Building Successful Mobile Games through the Powe...
Connecting Your Customers – Building Successful Mobile Games through the Powe...Amazon Web Services
 
How to send gzipped requests with boto3
How to send gzipped requests with boto3How to send gzipped requests with boto3
How to send gzipped requests with boto3Luciano Mammino
 
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Lucidworks
 
Browsers with Wings
Browsers with WingsBrowsers with Wings
Browsers with WingsRemy Sharp
 
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDKGDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDKNate Wiger
 
Event sourcing the best ubiquitous pattern you have never heard off
Event sourcing   the best ubiquitous pattern you have never heard offEvent sourcing   the best ubiquitous pattern you have never heard off
Event sourcing the best ubiquitous pattern you have never heard offJoe Drumgoole
 
Tracking a soccer game with BigData
Tracking a soccer game with BigDataTracking a soccer game with BigData
Tracking a soccer game with BigDataWSO2
 
Big data streams, Internet of Things, and Complex Event Processing Improve So...
Big data streams, Internet of Things, and Complex Event Processing Improve So...Big data streams, Internet of Things, and Complex Event Processing Improve So...
Big data streams, Internet of Things, and Complex Event Processing Improve So...Chris Haddad
 
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...it-people
 
Database Development Replication Security Maintenance Report
Database Development Replication Security Maintenance ReportDatabase Development Replication Security Maintenance Report
Database Development Replication Security Maintenance Reportnyin27
 
AWS IoTで家庭内IoTをやってみた【JAWS DAYS 2016】
AWS IoTで家庭内IoTをやってみた【JAWS DAYS 2016】AWS IoTで家庭内IoTをやってみた【JAWS DAYS 2016】
AWS IoTで家庭内IoTをやってみた【JAWS DAYS 2016】tsuchimon
 
Tutorial: Building Your First App with MongoDB Stitch
Tutorial: Building Your First App with MongoDB StitchTutorial: Building Your First App with MongoDB Stitch
Tutorial: Building Your First App with MongoDB StitchMongoDB
 
Consumer driven contract testing
Consumer driven contract testingConsumer driven contract testing
Consumer driven contract testingMike van Vendeloo
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdfsash236
 
[Serverless Meetup Tokyo #3] Serverless in Azure (Azure Functionsのアップデート、事例、デ...
[Serverless Meetup Tokyo #3] Serverless in Azure (Azure Functionsのアップデート、事例、デ...[Serverless Meetup Tokyo #3] Serverless in Azure (Azure Functionsのアップデート、事例、デ...
[Serverless Meetup Tokyo #3] Serverless in Azure (Azure Functionsのアップデート、事例、デ...Naoki (Neo) SATO
 

Similar to Viki Big Data Meetup 2013_10 (20)

OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...
OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...
OSA Con 2022 - Building Event Collection SDKs and Data Models - Paul Boocock ...
 
Neo4j GraphSummit Copenhagen - The path to success with Graph Database and Gr...
Neo4j GraphSummit Copenhagen - The path to success with Graph Database and Gr...Neo4j GraphSummit Copenhagen - The path to success with Graph Database and Gr...
Neo4j GraphSummit Copenhagen - The path to success with Graph Database and Gr...
 
Building Your First App with MongoDB Stitch
Building Your First App with MongoDB StitchBuilding Your First App with MongoDB Stitch
Building Your First App with MongoDB Stitch
 
Powering Heap With PostgreSQL And CitusDB (PGConf Silicon Valley 2015)
Powering Heap With PostgreSQL And CitusDB (PGConf Silicon Valley 2015)Powering Heap With PostgreSQL And CitusDB (PGConf Silicon Valley 2015)
Powering Heap With PostgreSQL And CitusDB (PGConf Silicon Valley 2015)
 
Patterns and Practices for Event Design With Adam Bellemare | Current 2022
Patterns and Practices for Event Design With Adam Bellemare | Current 2022Patterns and Practices for Event Design With Adam Bellemare | Current 2022
Patterns and Practices for Event Design With Adam Bellemare | Current 2022
 
Connecting Your Customers – Building Successful Mobile Games through the Powe...
Connecting Your Customers – Building Successful Mobile Games through the Powe...Connecting Your Customers – Building Successful Mobile Games through the Powe...
Connecting Your Customers – Building Successful Mobile Games through the Powe...
 
How to send gzipped requests with boto3
How to send gzipped requests with boto3How to send gzipped requests with boto3
How to send gzipped requests with boto3
 
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
Events Processing and Data Analysis with Lucidworks Fusion: Presented by Kira...
 
Browsers with Wings
Browsers with WingsBrowsers with Wings
Browsers with Wings
 
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDKGDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK
GDC 2015 - Game Analytics with AWS Redshift, Kinesis, and the Mobile SDK
 
Event sourcing the best ubiquitous pattern you have never heard off
Event sourcing   the best ubiquitous pattern you have never heard offEvent sourcing   the best ubiquitous pattern you have never heard off
Event sourcing the best ubiquitous pattern you have never heard off
 
Tracking a soccer game with BigData
Tracking a soccer game with BigDataTracking a soccer game with BigData
Tracking a soccer game with BigData
 
Big data streams, Internet of Things, and Complex Event Processing Improve So...
Big data streams, Internet of Things, and Complex Event Processing Improve So...Big data streams, Internet of Things, and Complex Event Processing Improve So...
Big data streams, Internet of Things, and Complex Event Processing Improve So...
 
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
Jonathan Ellis "Apache Cassandra 2.0 and 2.1". Выступление на Cassandra conf ...
 
Database Development Replication Security Maintenance Report
Database Development Replication Security Maintenance ReportDatabase Development Replication Security Maintenance Report
Database Development Replication Security Maintenance Report
 
AWS IoTで家庭内IoTをやってみた【JAWS DAYS 2016】
AWS IoTで家庭内IoTをやってみた【JAWS DAYS 2016】AWS IoTで家庭内IoTをやってみた【JAWS DAYS 2016】
AWS IoTで家庭内IoTをやってみた【JAWS DAYS 2016】
 
Tutorial: Building Your First App with MongoDB Stitch
Tutorial: Building Your First App with MongoDB StitchTutorial: Building Your First App with MongoDB Stitch
Tutorial: Building Your First App with MongoDB Stitch
 
Consumer driven contract testing
Consumer driven contract testingConsumer driven contract testing
Consumer driven contract testing
 
112 portfpres.pdf
112 portfpres.pdf112 portfpres.pdf
112 portfpres.pdf
 
[Serverless Meetup Tokyo #3] Serverless in Azure (Azure Functionsのアップデート、事例、デ...
[Serverless Meetup Tokyo #3] Serverless in Azure (Azure Functionsのアップデート、事例、デ...[Serverless Meetup Tokyo #3] Serverless in Azure (Azure Functionsのアップデート、事例、デ...
[Serverless Meetup Tokyo #3] Serverless in Azure (Azure Functionsのアップデート、事例、デ...
 

Recently uploaded

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy

Viki Big Data Meetup 2013_10

  • 1. Viki Analytics Infrastructure BigData Singapore Meetup Oct 2013
  • 4. What data do we collect? • Clickstream data • An event is a user interaction or something product-related • A client (web/mobile) sends these events as HTTP calls. • Format: JSON – Schema-less – Flexible {"origin":"tv_show_show", "app_ver":"2.9.3.151", "uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u", "vs_id":"1008912v-1380452660-7920", "app_id":"100004a", "event":"video_play", "timed_comment":"off", "stream_quality":"variable", "bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play", "video_id":"1008912v", "subtitle_completion_percent":"100", "device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246", "country":"ca", "city_name":"Toronto", "region_name":"ON"} …
  • 5. How to keep this data clean? • Problem: Clients often send erroneous data, e.g. a missing parameter • Solution: We write client libraries for each client to enforce “world peace” P.S.: there is no such thing as “world peace”
  • 6. How to collect > 60 M events a day? • fluentd – Scalable – Extensible – Lets you send data to Hadoop, MongoDB, PostgreSQL etc. • Writes to Hadoop (TD), Amazon S3, MongoDB
  • 7. Where do we store? • Hadoop (Treasure Data) – It’s fast and easy to set up! – We don’t have money or time to hire a Hadoop engineer. – We retrieve data from Hadoop in batch jobs • Amazon S3 – Backup • MongoDB: Real-time data
  • 8. 2. Retrieving & Processing Data
  • 9. 2. Retrieving & Processing Data • Centralizing All Data Sources • Cleaning Data • Transforming Data • Managing Job Dependencies
  • 10. 2. Retrieving & Processing Data • Centralizing All Data Sources • Cleaning Data • Transforming Data • Managing Job Dependencies
  • 11. Getting All Data To 1 Place • Port data from different production databases into PG • Retrieve click-stream data from Hadoop to PG a) Production Databases → Analytics DB (PostgreSQL):
      thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1
      thor db:cp --source A --destination B -t reporting.video_plays --increment
  • 12. b) Click-stream Data (Hadoop) → Analytics DB: Hadoop → Aggregation (Hive) → Export Output / Sqoop → PostgreSQL
      {"origin":"tv_show_show", "app_ver":"2.9.3.151", "uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u", "vs_id":"1008912v-1380452660-7920", "app_id":"100004a", "event":"video_play", "timed_comment":"off", "stream_quality":"variable", "bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play", "video_id":"1008912v", "subtitle_completion_percent":"100", "device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246", "country":"ca", "city_name":"Toronto", "region_name":"ON"} …
      date        source   partner  event       video_id  country  cnt
      2013-09-29  ios      viki     video_play  1008912v  ca       2
      2013-09-29  android  viki     video_play  1008912v  us       18
      …
  • 13. Simple Aggregation SQL
      SELECT
        SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
        v['source'], v['partner'], v['event'], v['video_id'], v['country'],
        COUNT(1) as cnt
      FROM events
      WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30')
        AND v['event'] = 'video_play'
      GROUP BY
        SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ),
        v['source'], v['partner'], v['event'], v['video_id'], v['country'];
  • 14. But… The Data Is Not Clean! Event properties and names change as we develop: Old Version: {"user_id": "152u", "country": "sg" } New Version: {"user_id": "152", "country_code":"sg" }
  • 15. (Not so) simple Aggregation SQL (Hadoop)
      SELECT
        SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
        v['app_id'] AS `app_id`,
        CASE
          WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
          WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
          WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
          WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
          WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
          WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
          WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
          ELSE LOWER( v['partner'] )
        END AS `partner`,
        CASE
          WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
          WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
          WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
          WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
          WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
          ELSE TRIM( v['source'] )
        END AS `source`,
        LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2 THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ELSE NULL END ) AS `country`,
        COALESCE ( v['device_size'] ,v['device'] ) AS `device`,
        COUNT( 1 ) AS `cnt`
      FROM events
      WHERE time >= 1380326400 AND time <= 1380412799
        AND v['event'] = 'video_play'
      GROUP BY
        SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ),
        v['app_id'],
        CASE
          WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
          WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
          WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
          WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
          WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
          WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
          WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
          ELSE LOWER( v['partner'] )
        END,
        CASE
          WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
          WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
          WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
          WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
          WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
          ELSE TRIM( v['source'] )
        END,
        LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2 THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ELSE NULL END ),
        COALESCE ( v['device_size'] ,v['device'] );
  • 16. …even after import (PostgreSQL)
      UPDATE "reporting"."cl_main_2013_09" SET source = 'embed', partner = 'partner1' WHERE app_id = '100105a' AND (source != 'embed' OR partner != 'partner1');
      UPDATE "reporting"."cl_main_2013_09" SET app_id = '100105a' WHERE (source = 'embed' AND partner = 'partner1') AND (app_id != '100105a');
      UPDATE reporting.cl_main_2013_09 SET user_id = user_id || 'u' WHERE RIGHT(user_id, 1) ~ '[0-9]';
      UPDATE "reporting"."cl_main_2013_09" SET app_id = '100106a' WHERE (source = 'embed' AND partner = 'partner2') AND (app_id != '100106a');
      UPDATE reporting.cl_main_2013_09 SET source = 'raynor', partner = 'viki', app_id = '100000a' WHERE event = 'pv' AND source IS NULL AND partner IS NULL AND app_id IS NULL;
  • 17. Cleaning Up Data Takes Lots of Time (chart: Import data 30%, Clean up data 70%)
  • 18. Transforming Data • Centralizing All Data Sources • Cleaning Data • Transforming Data • Managing Job Dependencies
  • 19. Transforming Data … Table A Table B … Analytics DB (PostgreSQL)
  • 20. a) Reducing Table Size By Dropping Dimension (PostgreSQL)
      video_plays_with_video_id (20M records):
      date        source  partner  event       video_id  country  cnt
      2013-09-29  ios     viki     video_play  1v        ca       2
      2013-09-29  ios     viki     video_play  2v        ca       18
      …
      video_plays (4M records):
      date        source  partner  event       country  cnt
      2013-09-29  ios     viki     video_play  ca       20
      …
  • 21. b) Injecting Extra Fields For Analysis (PostgreSQL)
      containers (1) to videos (n)
      containers, before:
      id  title
      1c  Game of Thrones
      2c  My Girlfriend Is A Gumiho
      …
      containers, after:
      id  title                      video_count
      1c  Game of Thrones            30
      2c  My Girlfriend Is A Gumiho  16
      …
  • 22. Injecting Extra Fields For Analysis (PostgreSQL)
      containers (1) to videos (n)
      containers, before:
      id  title
      1c  Game of Thrones
      2c  My Girlfriend Is A Gumiho
      …
      containers, after:
      id  title                      video_count
      1c  Game of Thrones            30
      2c  My Girlfriend Is A Gumiho  16
      …
  • 23. Chunk Tables By Month
      video_plays (parent table): video_plays_2013_06, video_plays_2013_07, video_plays_2013_08, video_plays_2013_09, …
      ALTER TABLE video_plays_2013_09 INHERIT video_plays;
      ALTER TABLE video_plays_2013_09 ADD CHECK (date >= '2013-09-01' AND date < '2013-10-01');
  • 24. Managing Job Dependency • Centralizing All Data Sources • Cleaning Data • Transforming Data • Managing Job Dependencies
  • 30. Dashboard • Yes, dashboard on Rails. • We have a daily logship process to port the data over to the dashboard server. thor db:logship -t big_table
  • 31. Data Visualization • Tableau is slow when working directly on PostgreSQL – Export compressed CSVs to Tableau Server – Windows • Line charts do solve most problems
  • 32. Engineering involvement in report creation • Bad idea! • Enter Query Reports! – Fast report churn rate “Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” – Abraham Lincoln
  • 35. Summary report • Higher level view of metrics • See changes over time • (screen shot)
  • 36. Data Explorer “The world is your oyster”
  • 37. One more thing! (Viki Live)
  • 38. Recap
  • 39. Lessons Learnt • Line charts can solve most problems • Chart your data quickly • Our dataset is not that big
  • 40. Simple DIY Suggestion • Put QueryReports on top of your database. Or Tableau Desktop. • Use Mixpanel/KISSMetrics for Product Analytics • fluentd writes data to Postgres (hstore) CAN
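
The cleanup rules on slides 14 to 16 and the dimension drop on slide 20 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the Viki pipeline: the function names `normalize_event` and `count_plays` are made up here, and the rules simply mirror the COALESCE over `country` / `country_code`, the two-letter country check, and the `'u'` suffix appended to user IDs that end in a digit.

```python
from collections import Counter


def normalize_event(raw):
    """Return a cleaned copy of one raw event dict (hypothetical helper)."""
    event = dict(raw)

    # Some events carry "country" and others "country_code", so take
    # whichever is present, like COALESCE(v['country'], v['country_code']).
    country = event.get("country")
    if country is None:
        country = event.get("country_code")
    event.pop("country_code", None)

    # Keep only trimmed, lower-cased two-letter codes; anything else is None.
    cleaned = country.strip().lower() if isinstance(country, str) else ""
    event["country"] = cleaned if len(cleaned) == 2 else None

    # Some events send the user ID without its trailing 'u'; add it when
    # the ID ends in a digit, as the UPDATE on slide 16 does.
    user_id = event.get("user_id")
    if user_id and user_id[-1] in "0123456789":
        event["user_id"] = user_id + "u"

    return event


def count_plays(events, dimensions=("source", "partner", "event", "country")):
    """Count events per combination of the chosen dimensions.

    Leaving a field such as video_id out of `dimensions` merges the rows
    that differ only in that field, which is what shrinks the table.
    """
    counts = Counter()
    for raw in events:
        event = normalize_event(raw)
        counts[tuple(event.get(name) for name in dimensions)] += 1
    return counts
```

Calling `count_plays(events)` with the default dimensions gives one counter entry per (source, partner, event, country) combination; adding `"video_id"` to `dimensions` reproduces the larger, per-video table.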

Editor's Notes

  1. Hey, I am Ishan and this is Huy. We are data engineers at Viki. I want to start by saying that we love the big data community, and we would like to thank John for organizing this and giving us an opportunity to share the infrastructure that we have built at Viki over the past year.
  2. We want to break it down in simple steps and walk you through the process that we went through while building it.
  3. It’s a bit like picking trash. You need to know what you want. You don’t want to collect everything, but you also don’t want to leave out anything important.
  4. Add example of an event JSON
  5. Errors in reporting. Humans are prone to error.
  6. We collect over 60 million events a day! To put things in perspective, if you put one sheep on a football ground for each event that we get, that would be a lot of sheep to be hanging out on a football field! 700 events a second. Why Hadoop? It allows unstructured data. Write Hive queries to easily retrieve it.
  7. We don’t have money or time to hire a Hadoop engineer. Not even now. Reason: it’s an easy way to store semi-structured data and easily query it using Hive (SQL-like). Capped collection in MongoDB for real-time reporting.
  8. Centralizing All Data Sources, Data Cleanliness, Data Transformation, Managing Job Dependencies
  9. Centralizing All Data Sources, Data Cleanliness, Data Transformation, Managing Job Dependencies
  10. Centralizing All Data Sources, Data Cleanliness, Data Transformation, Managing Job Dependencies
  11. To effectively run queries on our data, we need to bring all the data into the same database. In this case we chose Postgres, since all our databases are already in Postgres. Does anyone here know Postgres? It’s like MySQL, but it’s better. We’ve built command line tools to copy tables from database to database. So the following command copies all tables in the public schema of the gaia database to our analytics database, and gives them a separate schema. In PG, a schema is something like a namespace for tables.
  12. Take a look at 1 sample event being stored in Hadoop in semi-structured JSON form: you have a video play event for that video ID running on an iPad device, coming from an autoplay feature, from Toronto, Ontario, Canada. That’s a hell of a lot of dimensions. We want to aggregate and select a subset of dimensions to port into PG. The Hadoop provider we use (Treasure Data) has a feature that allows you to specify a destination data store (in this case Postgres); it’ll execute the Hadoop job and write the results into the selected database. It’s the equivalent of using Sqoop to bulk export data into Postgres.
  13. As we develop, our data changes: we make mistakes, we forget to set a variable somewhere, we change our data structure. So the new data gets mixed up with the old data. And to make the data meaningful, the simple query becomes not so simple.
  14. Cleaning up the data takes a lot of time, both in processing time and actual human work. But it’s absolutely necessary, since when writing your SQL query to analyze, you just want to focus on your query logic; you don’t want to handle different data values.
  15. Centralizing All Data Sources, Data Cleanliness, Data Transformation, Managing Job Dependencies
  16. Once all our data is in Postgres, we start to transform and aggregate it, depending on the purpose.
  17. For example, to reduce the size of a table so we can serve it on a web UI front-end, we aggregate the data further. In this example, we’re dropping the video_id dimension, thus grouping the 2 records together as a new record with the cnt field totalling 20. And that reduces the table size.
  18. For example, to reduce the size of a table so we can serve it on a web UI front-end, we aggregate the data further. In this example, we’re dropping the video_id dimension, thus grouping the 2 records together as a new record with the cnt field totalling 20. And that reduces the table size.
  19. For example, to reduce the size of a table so we can serve it on a web UI front-end, we aggregate the data further. In this example, we’re dropping the video_id dimension, thus grouping the 2 records together as a new record with the cnt field totalling 20. And that reduces the table size.
  20. We also chunk our data tables by month, so that when the new month comes, you don’t touch the old months’ data. This also reduces the index size and makes it easier to archive your old data. When we first implemented this, we didn’t know how to query across months, so we had to write complicated queries (like UNION); sometimes we even had to load the data into memory and process it. But then we found out about this awesome feature in Postgres called Table Inheritance. It lets you define a parent table with a bunch of children. You just need to query the parent table and, depending on your query, it’ll figure out the correct child tables to hit.
  21. Centralizing All Data Sources, Data Cleanliness, Data Transformation, Managing Job Dependencies
  22. Can anyone tell me what this means? OK, no one can. That’s exactly my point. At some point, our daily job workflow grew so complicated that it became hard to manage with crontab.
  23. Can anyone tell me what this means? OK, no one can. That’s exactly my point. At some point, our daily job workflow grew so complicated that it became hard to manage with crontab.
  24. We can’t do complex visualizations in the dashboard.
  25. We don’t completely exploit the potential of Tableau, but we do have some rather complicated reports running in Tableau.
  26. Query reports increased our report churn rate quite a lot! A lot of requests from management. Tableau was too complicated and slow for a fast report churn-out time. Current reporting process, to something approaching Tableau. Rails app requires us to make changes for report creation. Process an analyst goes through.
  27. Add another slide! (with drop downs)
  28. There are too many reports! I want to see the high-level metrics all in one place.
  29. Enabling the product and business folks to “write” their own queries
  30. A fun side project, where you see what our viewers are watching. We use the MongoDB capped collection for this.
  31. Collecting Data: fluentd; write to Hadoop, MongoDB. Processing Data: PostgreSQL. Presenting Data: Query Report, Summary Report, Data Explorer, Tableau.
  32. As they say, you can get away with almost anything on the internet as long as you put a cat picture next to it 
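
Notes 22 and 23 say the daily workflow outgrew crontab, but the deck as transcribed here does not show what replaced it. As a rough sketch of the underlying idea only, job dependencies can be declared explicitly and a valid run order derived with a topological sort. The job names below are hypothetical, loosely following the pipeline stages in the slides, and `graphlib` is Python's standard-library module (3.9+), not a tool named in the talk.

```python
from graphlib import TopologicalSorter

# Hypothetical daily jobs. Each key maps to the set of jobs that must
# finish before it can start.
JOBS = {
    "import_production_tables": set(),
    "import_clickstream": set(),
    "clean_clickstream": {"import_clickstream"},
    "aggregate_video_plays": {"clean_clickstream", "import_production_tables"},
    "logship_to_dashboard": {"aggregate_video_plays"},
}


def run_order(jobs):
    """Return the job names in an order that respects every dependency.

    TopologicalSorter raises graphlib.CycleError if the jobs depend on
    each other in a loop.
    """
    return list(TopologicalSorter(jobs).static_order())


if __name__ == "__main__":
    for name in run_order(JOBS):
        print(name)
```

Compared with staggering start times in crontab, declaring the dependencies means a slow import delays the jobs behind it instead of letting them run on incomplete data.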