4. What data do we collect?
• Clickstream data
• An event is a user interaction or a product-related action
• A client (web/mobile) sends these events as HTTP calls.
• Format: JSON
– Schema-less
– Flexible
{"origin":"tv_show_show", "app_ver":"2.9.3.151",
"uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u",
"vs_id":"1008912v-1380452660-7920", "app_id":"100004a",
"event":"video_play", "timed_comment":"off", "stream_quality":"variable",
"bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play",
"video_id":"1008912v", "subtitle_completion_percent":"100",
"device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246",
"country":"ca", "city_name":"Toronto", "region_name":"ON"}
…
5. How to keep this data clean?
• Problem: Clients often send erroneous data, e.g. a missing parameter
• Solution: We write client libraries for each client to enforce "world peace"
P.S.: there is no such thing as "world peace"
6. How to collect > 60M events a day?
• fluentd
– Scalable
– Extensible
– Lets you send data to Hadoop, MongoDB, PostgreSQL, etc.
• Writes to Hadoop (TD), Amazon S3, MongoDB
7. Where do we store it?
• Hadoop (Treasure Data)
– It's fast and easy to set up!
– We don't have the money or time to hire a Hadoop engineer.
– We retrieve data from Hadoop in batch jobs
• Amazon S3: Backup
• MongoDB: Real-time data
9. 2. Retrieving & Processing Data
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
11. Getting All Data To 1 Place
• Port data from different
production databases into PG
• Retrieve click-stream data
from Hadoop to PG
a) Production Databases → Analytics DB (PostgreSQL):
thor db:cp --source prod1 --destination analytics -t public.* --force-schema prod1
thor db:cp --source A --destination B -t reporting.video_plays --increment
12. {"origin":"tv_show_show", "app_ver":"2.9.3.151",
"uuid":"80833c5a760597bf1c8339819636df04", "user_id":"5298933u",
"vs_id":"1008912v-1380452660-7920", "app_id":"100004a",
"event":"video_play", "timed_comment":"off", "stream_quality":"variable",
"bottom_subtitle":"en", "device_size":"tablet", "feature":"auto_play",
"video_id":"1008912v", "subtitle_completion_percent":"100",
"device_id":"iPad2,1|6.1.3|apple", "t":"1380452846", "ip":"99.232.169.246",
"country":"ca", "city_name":"Toronto", "region_name":"ON"}
…
date source partner event video_id country cnt
2013-09-29 ios viki video_play 1008912v ca 2
2013-09-29 android viki video_play 1008912v us 18
…
b) Click-stream Data (Hadoop) → Analytics DB:
Hadoop → Aggregation (Hive) → Export output / Sqoop → PostgreSQL
13. SELECT
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'],
COUNT(1) as cnt
FROM events
WHERE TIME_RANGE(time, '2013-09-29', '2013-09-30')
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ),
v['source'],
v['partner'],
v['event'],
v['video_id'],
v['country'];
Simple Aggregation SQL
14. The Data Is Not Clean!
Event properties and names change as we
develop:
But…
Old version: {"user_id": "152u", "country": "sg"}
New version: {"user_id": "152", "country_code": "sg"}
15. SELECT SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ) AS `date_d`,
v['app_id'] AS `app_id`,
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
ELSE LOWER( v['partner'] )
END AS `partner`,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] )
END AS `source` ,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ) AS `country` ,
COALESCE ( v['device_size'] ,v['device'] ) AS `device`,
COUNT( 1 ) AS `cnt`
FROM events
WHERE time >= 1380326400 AND time <= 1380412799
AND v['event'] = 'video_play'
GROUP BY
SUBSTR( FROM_UNIXTIME( time ) ,0 ,10 ), v['app_id'],
CASE WHEN v['app_ver'] LIKE '%_ax' THEN 'axis'
     WHEN v['app_ver'] LIKE '%_kd' THEN 'amazon'
     WHEN v['app_ver'] LIKE '%_kf' THEN 'amazon'
     WHEN v['app_ver'] LIKE '%_lv' THEN 'lenovo'
     WHEN v['app_ver'] LIKE '%_nx' THEN 'nexian'
     WHEN v['app_ver'] LIKE '%_sf' THEN 'smartfren'
     WHEN v['app_ver'] LIKE '%_vp' THEN 'samsung_viki_premiere'
     ELSE LOWER( v['partner'] )
END,
CASE WHEN ( v['app_id'] = '65535a' AND ( v['site'] IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) ) THEN 'direct'
WHEN ( v['event'] = 'pv' OR v['app_id'] = '100000a' ) THEN 'direct'
WHEN ( v['app_id'] = '65535a' AND v['site'] NOT IN ( 'www.viki.com' ,'viki.com' ,'www.viki.mx' ,'viki.mx' ,'' ) ) THEN 'embed'
WHEN ( v['source'] = 'mobile' AND v['os'] = 'android' ) THEN 'android'
WHEN ( v['source'] = 'mobile' AND v['device_id'] LIKE '%|apple' ) THEN 'ios'
ELSE TRIM( v['source'] ) END,
LOWER( CASE WHEN LENGTH( TRIM( COALESCE ( v['country'] ,v['country_code'] ) ) ) = 2
THEN TRIM( COALESCE ( v['country'] ,v['country_code'] ) )
ELSE NULL END ),
COALESCE ( v['device_size'] ,v['device'] );
(Not so) simple Aggregation SQL
Hadoop
UPDATE "reporting"."cl_main_2013_09"
SET source = 'embed', partner = 'partner1'
WHERE app_id = '100105a' AND (source != 'embed' OR partner != 'partner1');
UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100105a'
WHERE (source = 'embed' AND partner = 'partner1') AND (app_id != '100105a');
UPDATE reporting.cl_main_2013_09
SET user_id = user_id || 'u'
WHERE RIGHT(user_id, 1) ~ '[0-9]';
UPDATE "reporting"."cl_main_2013_09"
SET app_id = '100106a'
WHERE (source = 'embed' AND partner = 'partner2') AND (app_id != '100106a');
UPDATE reporting.cl_main_2013_09
SET source = 'raynor', partner = 'viki', app_id = '100000a'
WHERE event = 'pv'
AND source IS NULL
AND partner IS NULL
AND app_id IS NULL;
…even after import
PostgreSQL
20. a) Reducing Table Size By Dropping Dimension (PostgreSQL)
video_plays_with_video_id (20M records):
date        source  partner  event       video_id  country  cnt
2013-09-29  ios     viki     video_play  1v        ca       2
2013-09-29  ios     viki     video_play  2v        ca       18
…
video_plays (4M records, video_id dimension dropped):
date        source  partner  event       country  cnt
2013-09-29  ios     viki     video_play  ca       20
…
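The roll-up itself is a plain GROUP BY that sums the counts once video_id is left out. A minimal sketch, assuming both tables already exist with the columns shown above (an illustration, not the exact production job):
INSERT INTO video_plays (date, source, partner, event, country, cnt)
SELECT date, source, partner, event, country,
       SUM(cnt) AS cnt          -- counts collapse once video_id is dropped
FROM video_plays_with_video_id
GROUP BY date, source, partner, event, country;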
21. b) Injecting Extra Fields For Analysis (PostgreSQL)
containers (production; related 1:n to videos):
id  title
1c  Game of Thrones
2c  My Girlfriend Is A Gumiho
…
containers (analytics, with injected field):
id  title                      video_count
1c  Game of Thrones            30
2c  My Girlfriend Is A Gumiho  16
…
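One way the extra field could be filled in after import: a minimal sketch, assuming the analytics schema has a videos table with a container_id column pointing at containers (these names are assumptions for illustration, not taken from the slides):
ALTER TABLE containers ADD COLUMN video_count integer;
UPDATE containers c
SET video_count = (
  SELECT COUNT(*)               -- number of videos belonging to this container
  FROM videos v
  WHERE v.container_id = c.id
);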
23. Chunk Tables By Month
video_plays_2013_06
video_plays_2013_07
video_plays_2013_08
video_plays_2013_09
…
ALTER TABLE video_plays_2013_09 INHERIT video_plays;
ALTER TABLE video_plays_2013_09
ADD CHECK (date >= '2013-09-01'
       AND date < '2013-10-01');
video_plays (parent table)
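Because the monthly children inherit from video_plays and carry CHECK constraints, queries can go through the parent table and PostgreSQL's constraint exclusion skips the months that cannot match. A minimal sketch of such a query (constraint_exclusion = partition is the PostgreSQL default; column names follow the earlier slides):
SET constraint_exclusion = partition;
-- only video_plays_2013_09 is scanned, thanks to its CHECK constraint
SELECT date, SUM(cnt) AS plays
FROM video_plays
WHERE date >= '2013-09-01' AND date < '2013-10-01'
GROUP BY date;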
24. Managing Job Dependency
• Centralizing All Data Sources
• Cleaning Data
• Transforming Data
• Managing Job Dependencies
30. Dashboard
• Yes, dashboard on Rails.
• We have a daily logship process to port the data over to the dashboard server.
thor db:logship -t big_table
31. Data Visualization
Tableau is slow when working directly on PostgreSQL
Export compressed CSVs to the Tableau server (Windows)
Line charts do solve most problems
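One way to produce such an extract from Postgres, as a minimal sketch (the table name follows the earlier slides; the output path and the follow-up compression step are assumptions, not the exact pipeline):
COPY (SELECT * FROM reporting.video_plays)
TO '/tmp/video_plays.csv' WITH CSV HEADER;   -- compress afterwards, e.g. with gzip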
32. Engineering involvement in report creation
• Bad idea!
• Enter Query Reports!
Fast report churn rate
“Give me six hours to chop down a tree and I
will spend the first four sharpening the axe” –
Abraham Lincoln
39. Lessons Learnt
• Line charts can solve most problems
• Chart your data quickly
• Our dataset is not that big
40. Simple DIY Suggestion
• Put Query Reports on top of your database, or Tableau Desktop.
• Use Mixpanel/KISSMetrics for product analytics
• fluentd writes data to Postgres (hstore); see the sketch below
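If you take the Postgres + hstore route, the raw events can be queried much like the Hive examples earlier. A minimal sketch, with table and column names chosen for illustration (they are assumptions, not from the slides):
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE events (time timestamptz, v hstore);  -- fluentd appends one row per event
SELECT v -> 'country' AS country, COUNT(*) AS cnt
FROM events
WHERE v -> 'event' = 'video_play'
GROUP BY v -> 'country';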
Hey, I am Ishan and this is Huy. We are data engineers at Viki. I want to start by saying that we love the big data community, and we would like to thank John for organizing this and giving us an opportunity to share the infrastructure that we have built at Viki over the past year.
We want to break it down in simple steps and walk you through the process that we went through while building it.
It's a bit like picking trash: you need to know what you want. You don't want to collect everything, but you also don't want to leave out anything important.
Add example of an event JSON
Errors in reporting. Humans are prone to error.
We collect over 60 million events a day! To put things in perspective, if you put one sheep on a football ground for each event that we get, that would be a lot of sheep hanging out on a football field! That's about 700 events a second. Why Hadoop? It allows unstructured data, and we can write Hive queries to easily retrieve it.
We don't have the money or time to hire a Hadoop engineer. Not even now. Reason: it's an easy way to store semi-structured data and easily query it using Hive (SQL-like). We use a capped collection in MongoDB for real-time reporting.
Centralizing All Data Sources; Data Cleanliness; Data Transformation; Managing Job Dependencies.
To effectively run queries on our data, we need to bring all the data into the same database. In this case we chose Postgres, since all our databases are already in Postgres. Anyone here know Postgres? It's like MySQL, but better. We've built command-line tools to copy tables from database to database. The following command copies all tables in the public schema of the gaia database to our analytics database and gives them a separate schema. In PG, a schema is something like a namespace for tables.
Take a look at one sample event being stored in Hadoop in semi-structured JSON form: you have a video play event for that video id, running on an iPad device, coming from an autoplay feature, from Toronto, Ontario, Canada. That's a hell of a lot of dimensions. We want to aggregate and select a subset of dimensions to port into PG. The Hadoop provider we use (Treasure Data) has a feature that allows you to specify a destination data storage (in this case Postgres); it executes the Hadoop job and writes the results into the selected database. It's the equivalent of using Sqoop to bulk-export data into Postgres.
As we develop, our data changes: we make mistakes, we forget to set a variable somewhere, we change our data structure. So new data gets mixed up with old data, and the simple query becomes not so simple.
Cleaning up the data takes a lot of time, both in processing time and actual human work. But it's absolutely necessary: when writing your SQL query for analysis, you want to focus on your query logic, not on handling different data values.
Centralizing All Data Sources; Data Cleanliness; Data Transformation; Managing Job Dependencies.
Once all our data is in Postgres, we start to perform transformations and aggregations on it for various purposes.
For example, to reduce the size of a table so it can be served on a web UI front-end, we aggregate the data further. In this example, we drop the video_id dimension, grouping the two records together into a new record whose cnt field totals 20. And that reduces the table size.
We also chunk our data tables by month, so that when the new month comes, you don't touch the old months' data. This also reduces the index size and makes it easier to archive your old data. When we first implemented this, we didn't know how to query across months, so we had to write complicated queries (like UNION); sometimes we even had to load the data into memory and process it there. But then we found out about this awesome feature in Postgres called table inheritance. It lets you define a parent table with a bunch of children, and you just need to query the parent table; depending on your query, it will find the correct child tables to hit.
Centralizing All Data Sources; Data Cleanliness; Data Transformation; Managing Job Dependencies.
Can anyone tell me what this means? OK, no one can. That's exactly my point. At some point, our daily job workflow grew so complicated that it became hard to manage with crontab.
We can't do complex visualizations in the dashboard.
We don't completely exploit the potential of Tableau, but we do have some rather complicated reports running in Tableau.
Query Reports increased our report churn rate quite a lot! There were a lot of requests from management, and Tableau was too complicated and slow for a fast report churn-out time. We went from the current reporting process to something approaching Tableau. The Rails app requires us to make changes for report creation. This is the process an analyst goes through.
Add another slide! (with drop downs)
There are too many reports! I want to see the high-level metrics all in one place.
Enabling the product and business folks to “write” their own queries
A fun side project where you can see what our viewers are watching. We use a MongoDB capped collection for this.
Collecting data: fluentd, writing to Hadoop and MongoDB. Processing data: PostgreSQL. Presenting data: Query Reports, Summary Report, Data Explorer, Tableau.
As they say, you can get away with almost anything on the internet as long as you put a cat picture next to it.