4. What data do we collect?
• Clickstream data
• An event is a user interaction or a product-related action
• A client (web/mobile) sends these events as HTTP calls.
• Format: JSON
– Schema-less
– Flexible
{"origin":"tv_show_show", "app_ver":"2.9.3.151",
"uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u",
"vs_id":"1008912v-1380452660-7920", "app_id":"100004a",
"event":"video_play", "timed_comment":"off", "stream_quality":"variable",
"bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play",
"video_id":"1008912v", "subtitle_completion_percent":"100",
"device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246",
"country":"ca", "city_name":"Toronto", "region_name":"ON"}
…
5. How to keep this data clean?
• Problem: Clients often send erroneous data, e.g. a missing parameter
• Solution: We write client libraries for each client to enforce "world peace"
P.S.: there is no such thing as "world peace"
6. How to collect > 60M events a day?
• fluentd
– Scalable
– Extensible
– Lets you send data to Hadoop, MongoDB, PostgreSQL, etc.
• Writes to Hadoop (TD), Amazon S3, MongoDB
7. Where do we store it?
• Hadoop (Treasure Data)
– It's fast and easy to set up!
– We don't have the money or time to hire a Hadoop engineer.
– We retrieve data from Hadoop in batch jobs
• Amazon S3: Backup
• MongoDB: Real-time data
9. 2. Retrieving & Processing Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
11. Getting All Data To 1 Place
• Port data from different
production databases into PG
• Retrieve click-stream data
from Hadoop to PG
a) Production Databases → Analytics DB (PostgreSQL):
thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1
thor db:cp --source A --destination B -t reporting.video_plays --increment
12. {"origin":"tv_show_show", "app_ver":"2.9.3.151",
"uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u",
"vs_id":"1008912v-1380452660-7920", "app_id":"100004a",
"event":"video_play", "timed_comment":"off", "stream_quality":"variable",
"bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play",
"video_id":"1008912v", "subtitle_completion_percent":"100",
"device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246",
"country":"ca", "city_name":"Toronto", "region_name":"ON"}
…
date source partner event video_id country cnt
2013-09-29 ios viki video_play 1008912v ca 2
2013-09-29 android viki video_play 1008912v us 18
…
b) Click-stream Data (Hadoop) → Analytics DB:
Hadoop → Aggregation (Hive) → Export output / Sqoop → PostgreSQL
13. SELECT
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'],
COUNT(1) as cnt
FROM events
WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30')
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ),
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'];
Simple Aggregation SQL
14. The Data Is Not Clean!
Event properties and names change as we
develop:
But…
Old version: {"user_id": "152u", "country": "sg"}
New version: {"user_id": "152", "country_code": "sg"}
15. SELECT SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['app_id'] AS `app_id`,
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
ELSE LOWER( v['partner'] )
END AS `partner`,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] )
END AS `source` ,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ) AS `country` ,
COALESCE ( v['device_size'] ,v['device'] ) AS `device`,
COUNT( 1 ) AS `cnt`
FROM events
WHERE time >= 1380326400 AND time <= 1380412799
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ), v['app_id'],
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
     WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
     WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
     WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
     WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
     WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
     WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
     ELSE LOWER( v['partner'] )
END,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] ) END,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ),
COALESCE ( v['device_size'] ,v['device'] );
(Not so) simple Aggregation SQL
Hadoop
UPDATE "reporting"."cl_main_2013_09"
SET source = 'embed', partner = 'partner1'
WHERE app_id = '100105a' AND (source != 'embed' OR partner != 'partner1');
UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100105a'
WHERE (source = 'embed' AND partner = 'partner1') AND (app_id != '100105a');
UPDATE reporting.cl_main_2013_09
SET user_id = user_id || 'u'
WHERE RIGHT(user_id, 1) ~ '[0-9]';
UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100106a'
WHERE (source = 'embed' AND partner = 'partner2') AND (app_id != '100106a');
UPDATE reporting.cl_main_2013_09
SET source = 'raynor', partner = 'viki', app_id = '100000a'
WHERE event = 'pv'
AND source IS NULL
AND partner IS NULL
AND app_id IS NULL;
…even after import
PostgreSQL
20. a) Reducing Table Size By Dropping Dimension (PostgreSQL)
video_plays_with_video_id (20M records):
date        source  partner  event       video_id  country  cnt
2013-09-29  ios     viki     video_play  1v        ca       2
2013-09-29  ios     viki     video_play  2v        ca       18
…
video_plays (4M records, video_id dimension dropped):
date        source  partner  event       country  cnt
2013-09-29  ios     viki     video_play  ca       20
…
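The roll-up itself is a plain GROUP BY that sums the counts once video_id is left out. A minimal sketch, assuming both tables already exist with the columns shown above (an illustration, not the exact production job):
INSERT INTO video_plays (date, source, partner, event, country, cnt)
SELECT date, source, partner, event, country,
       SUM(cnt) AS cnt          -- counts collapse once video_id is dropped
FROM video_plays_with_video_id
GROUP BY date, source, partner, event, country;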
21. b) Injecting Extra Fields For Analysis (PostgreSQL)
containers (production; related 1:n to videos):
id  title
1c  Game of Thrones
2c  My Girlfriend Is A Gumiho
…
containers (analytics, with injected field):
id  title                      video_count
1c  Game of Thrones            30
2c  My Girlfriend Is A Gumiho  16
…
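One way the extra field could be filled in after import: a minimal sketch, assuming the analytics schema has a videos table with a container_id column pointing at containers (these names are assumptions for illustration, not taken from the slides):
ALTER TABLE containers ADD COLUMN video_count integer;
UPDATE containers c
SET video_count = (
  SELECT COUNT(*)               -- number of videos belonging to this container
  FROM videos v
  WHERE v.container_id = c.id
);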
23. Chunk Tables By Month
video_plays_2013_06
video_plays_2013_07
video_plays_2013_08
video_plays_2013_09
…
ALTER TABLE video_plays_2013_09 INHERIT video_plays;
ALTER TABLE video_plays_2013_09
ADD CHECK (date >= '2013-09-01'
       AND date < '2013-10-01');
video_plays (parent table)
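Because the monthly children inherit from video_plays and carry CHECK constraints, queries can go through the parent table and PostgreSQL's constraint exclusion skips the months that cannot match. A minimal sketch of such a query (constraint_exclusion = partition is the PostgreSQL default; column names follow the earlier slides):
SET constraint_exclusion = partition;
-- only video_plays_2013_09 is scanned, thanks to its CHECK constraint
SELECT date, SUM(cnt) AS plays
FROM video_plays
WHERE date >= '2013-09-01' AND date < '2013-10-01'
GROUP BY date;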
24. Managing Job Dependency
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
30. Dashboard
• Yes, dashboard on Rails.
• We have a daily logship process to port the data over to the dashboard server.
thor db:logship -t big_table
31. Data Visualization
Tableau is slow when working directly on PostgreSQL
Export compressed CSVs to the Tableau server (Windows)
Line charts do solve most problems
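One way to produce such an extract from Postgres, as a minimal sketch (the table name follows the earlier slides; the output path and the follow-up compression step are assumptions, not the exact pipeline):
COPY (SELECT * FROM reporting.video_plays)
TO '/tmp/video_plays.csv' WITH CSV HEADER;   -- compress afterwards, e.g. with gzip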
32. Engineering involvement in report creation
• Bad idea!
• Enter Query Reports!
Fast report churn rate
“Give me six hours to chop down a tree and I
will spend the first four sharpening the axe” –
Abraham Lincoln
39. Lessons Learnt
• Line charts can solve most problems
• Chart your data quickly
• Our dataset is not that big
40. Simple DIY Suggestion
• Put Query Reports on top of your database, or Tableau Desktop.
• Use Mixpanel/KISSMetrics for product analytics
• fluentd writes data to Postgres (hstore); see the sketch below
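If you take the Postgres + hstore route, the raw events can be queried much like the Hive examples earlier. A minimal sketch, with table and column names chosen for illustration (they are assumptions, not from the slides):
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE events (time timestamptz, v hstore);  -- fluentd appends one row per event
SELECT v -> 'country' AS country, COUNT(*) AS cnt
FROM events
WHERE v -> 'event' = 'video_play'
GROUP BY v -> 'country';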
Hey, I am Ishan and this is Huy. We are data engineers at Viki. I want to start by saying that we love the big data community, and we would like to thank John for organizing this and giving us an opportunity to share the infrastructure that we have built at Viki over the past year.
We want to break it down in simple steps and walk you through the process that we went through while building it.
It's a bit like picking trash: you need to know what you want. You don't want to collect everything, but you also don't want to leave out anything important.
Add example of an event JSON
Errors in reporting. Humans are prone to error.
We collect over 60 million events a day! To put things in perspective, if you put one sheep on a football ground for each event that we get, that would be a lot of sheep hanging out on a football field! That's about 700 events a second. Why Hadoop? It allows unstructured data, and we can write Hive queries to easily retrieve it.
We don't have the money or time to hire a Hadoop engineer. Not even now. Reason: it's an easy way to store semi-structured data and easily query it using Hive (SQL-like). We use a capped collection in MongoDB for real-time reporting.
Centralizing All Data Sources; Data Cleanliness; Data Transformation; Managing Job Dependencies.
To effectively run queries on our data, we need to bring all the data into the same database. In this case we chose Postgres, since all our databases are already in Postgres. Anyone here know Postgres? It's like MySQL, but better. We've built command-line tools to copy tables from database to database. The following command copies all tables in the public schema of the gaia database to our analytics database and gives them a separate schema. In PG, a schema is something like a namespace for tables.
Take a look at one sample event being stored in Hadoop in semi-structured JSON form: you have a video play event for that video id, running on an iPad device, coming from an autoplay feature, from Toronto, Ontario, Canada. That's a hell of a lot of dimensions. We want to aggregate and select a subset of dimensions to port into PG. The Hadoop provider we use (Treasure Data) has a feature that allows you to specify a destination data storage (in this case Postgres); it executes the Hadoop job and writes the results into the selected database. It's the equivalent of using Sqoop to bulk-export data into Postgres.
As we develop, our data changes: we make mistakes, we forget to set a variable somewhere, we change our data structure. So new data gets mixed up with old data, and the simple query becomes not so simple.
Cleaning up the data takes a lot of time, both in processing time and actual human work. But it's absolutely necessary: when writing your SQL query for analysis, you want to focus on your query logic, not on handling different data values.
Centralizing All Data Sources; Data Cleanliness; Data Transformation; Managing Job Dependencies.
Once all our data is in Postgres, we start to perform transformations and aggregations on it for various purposes.
For example, to reduce the size of a table so it can be served on a web UI front-end, we aggregate the data further. In this example, we drop the video_id dimension, grouping the two records together into a new record whose cnt field totals 20. And that reduces the table size.
We also chunk our data tables by month, so that when the new month comes, you don't touch the old months' data. This also reduces the index size and makes it easier to archive your old data. When we first implemented this, we didn't know how to query across months, so we had to write complicated queries (like UNION); sometimes we even had to load the data into memory and process it there. But then we found out about this awesome feature in Postgres called table inheritance. It lets you define a parent table with a bunch of children, and you just need to query the parent table; depending on your query, it will find the correct child tables to hit.
Centralizing All Data Sources; Data Cleanliness; Data Transformation; Managing Job Dependencies.
Can anyone tell me what this means? OK, no one can. That's exactly my point. At some point, our daily job workflow grew so complicated that it became hard to manage with crontab.
We can't do complex visualizations in the dashboard.
We don't completely exploit the potential of Tableau, but we do have some rather complicated reports running in Tableau.
Query Reports increased our report churn rate quite a lot! There were a lot of requests from management, and Tableau was too complicated and slow for a fast report churn-out time. We went from the current reporting process to something approaching Tableau. The Rails app requires us to make changes for report creation. This is the process an analyst goes through.
Add another slide! (with drop downs)
There are too many reports! I want to see the high-level metrics all in one place.
Enabling the product and business folks to “write” their own queries
A fun side project where you can see what our viewers are watching. We use a MongoDB capped collection for this.
Collecting data: fluentd, writing to Hadoop and MongoDB. Processing data: PostgreSQL. Presenting data: Query Reports, Summary Report, Data Explorer, Tableau.
As they say, you can get away with almost anything on the internet as long as you put a cat picture next to it.