3. Starting point
● we had two developers
● we had one live server
● we had one cold backup
● we can’t store all the data
● we can’t process all the data
4. Batch processing - the downsides
● batch every 3 hours
○ delete old data
● updating counters
○ you need to define them upfront
● throwing away old data
○ developer point of view
■ you have no way to correct your mistake
○ business point of view
■ you lose your data
5. Batch processing - the benefits
● you will learn
○ profiler is your best friend
○ optimizing can be hard and can take time
● what are access logs good for
○ reconstruct your deleted data (see the sketch below)
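A minimal sketch of the "reconstruct your deleted data" point, assuming Apache combined-format access logs; the regex and the example line are illustrative:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class AccessLogParser {
        // host ident user [timestamp] "request" status bytes "referer" "user-agent"
        private static final Pattern LINE = Pattern.compile(
                "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) \\S+ \"[^\"]*\" \"([^\"]*)\"$");

        public static void main(String[] args) {
            String line = "10.0.0.1 - - [15/Jan/2013:10:00:00 +0100] "
                    + "\"GET /article/42 HTTP/1.1\" 200 512 \"-\" \"Mozilla/5.0\"";
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                // Each matched line can be replayed as one reconstructed impression.
                System.out.println("time=" + m.group(2) + " request=" + m.group(3));
            }
        }
    }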
7. Big Data
● It’s not only about the volume
● What were we going to do with it?
○ We had NO idea!
● We rented more servers.
○ We needed a place to store the data
8. Big Data
● We went the NoSQL way
○ MongoDB
■ easy replication, possible sharding
■ upsert (sketch below)
■ rich document-based queries - we still had one foot in the SQL world
■ fast prototype
● We were still doing batch processing
● ~15M impressions per day, resulting in ~5 GB of raw data per day
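The "upsert" point in practice - a minimal sketch with the MongoDB Java driver; the database, collection, and key names are assumptions:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.UpdateOptions;
    import com.mongodb.client.model.Updates;
    import org.bson.Document;

    public class UpsertExample {
        public static void main(String[] args) {
            MongoClient client = MongoClients.create("mongodb://localhost:27017");
            MongoCollection<Document> col = client.getDatabase("analytics")
                    .getCollection("impressions_2013_01_15"); // hypothetical per-day collection
            // Upsert: increment the counter if the document exists, create it otherwise.
            col.updateOne(
                    Filters.eq("_id", "article|42"), // hypothetical counter key
                    Updates.inc("count", 1),
                    new UpdateOptions().upsert(true));
            client.close();
        }
    }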
9. Big Data
● each day as collection
○ easy for batch processing
● each impression as a document
● adding processed parameters over time
● pulling data from 30 collections (sketch below)
○ the server stops responding
○ virtual memory runs low
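Why pulling from 30 collections hurts - a sketch assuming the day-per-collection layout; the naming scheme and the filter are assumptions:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    import java.time.LocalDate;

    public class MonthlyPull {
        public static void main(String[] args) {
            MongoClient client = MongoClients.create("mongodb://localhost:27017");
            MongoDatabase db = client.getDatabase("analytics");
            long total = 0;
            LocalDate day = LocalDate.of(2013, 1, 1);
            // One collection per day: a 30-day report touches 30 collections and
            // drags a month of working set into memory at once.
            for (int i = 0; i < 30; i++, day = day.plusDays(1)) {
                String name = "impressions_" + day; // e.g. impressions_2013-01-01 (hypothetical)
                total += db.getCollection(name).countDocuments(new Document("entityType", "article"));
            }
            System.out.println("pageviews: " + total);
            client.close();
        }
    }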
10. Big Data - analytics
● Visitors counts on website/section
○ active - with subscription
○ inactive - without subscription
○ anonymous
● Content consumption
○ how many pageviews
■ active
■ inactive
■ anonymous
● and others
12. What we really need
● COUNT(* || DISTINCT ...) GROUP BY
○ entities
○ date periods (day, week, month)
○ combination of entities and date periods [and some other flags]
● Special demands from analytics team
○ Not too hard to implement with SQL magic
● As fast as possible
○ At least as fast as the data comes in
● Still store all historical raw data
○ Ideally compressed
13. What to do
● Processing raw data?
○ Uses a lot of space before you get a result
■ We need to store historical data anyway
■ You can store compressed files (LZO) in Hadoop
● Sharding
○ For how long?
○ How to properly determine sharding key(s)?
● Do you really have a big amount of data?
● Do you have the hardware for running Hadoop? Really?
● What does overnight batch processing really mean?
14. Naive solution
● Separate counter for each needed combination, updated on each impression, maybe touching the DB every time
○ Fast to generate a unique key for a combination (sketch below)
■ md5([entityType, entityId, day, dayId].join("|"))
○ Really fast to get a value
■ Always a primary key
■ Multiget
○ Need to define all GROUP BY combinations at the beginning
○ Failure while processing one impression
■ Counters need to be incremented in a transaction
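A sketch of the slide's key generation in plain Java; the field values are illustrative:

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class CounterKey {
        // md5([entityType, entityId, day, dayId].join("|")) from the slide, in Java.
        static String key(String entityType, String entityId, String day, String dayId)
                throws Exception {
            String raw = String.join("|", entityType, entityId, day, dayId);
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(raw.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        }

        public static void main(String[] args) throws Exception {
            // One counter per predefined GROUP BY combination; reads are always by primary key.
            System.out.println(key("article", "42", "2013-01-15", "15"));
        }
    }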
15. Real world solution
● Kafka
○ Buffering incoming data
○ Web workers as producers
● Storm / Trident
○ Consuming data from Kafka
○ Processing incoming data
○ Using cassandra as storage backend
● Cassandra
○ Holding counters and helper information to determine uniqueness
16. Storm
● Real-time processing of unbounded streams of data
○ Processing data as it comes
○ You still need to have computing power
○ Need to transform COUNT(* || DISTINCT ...) GROUP BY everything into steps of counter updates (see the bolt sketch below)
○ Java, but bolts can be written in different languages
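A minimal sketch of that COUNT(*) GROUP BY to counter-updates transformation inside a bolt; the field names and the in-memory map (a stand-in for a real store) are assumptions:

    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;

    import java.util.HashMap;
    import java.util.Map;

    public class CounterBolt extends BaseBasicBolt {
        // In-memory stand-in for the real counter store (Cassandra in our case).
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            // COUNT(*) GROUP BY entityId, day becomes: bump one counter per impression.
            String key = tuple.getStringByField("entityId") + "|" + tuple.getStringByField("day");
            counts.merge(key, 1L, Long::sum);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: nothing is emitted downstream.
        }
    }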
18. Trident
● High level abstraction over Storm (sketch below)
○ Joins
○ Aggregations
○ Grouping
○ Filtering
○ Functions
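A sketch of those operations on an impression stream; the spout, field names, and the in-memory state (a stand-in for Cassandra) are assumptions:

    import org.apache.storm.trident.TridentState;
    import org.apache.storm.trident.TridentTopology;
    import org.apache.storm.trident.operation.builtin.Count;
    import org.apache.storm.trident.operation.builtin.FilterNull;
    import org.apache.storm.trident.testing.FixedBatchSpout;
    import org.apache.storm.trident.testing.MemoryMapState;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class ImpressionTopology {
        public static TridentState build(TridentTopology topology) {
            // Hypothetical spout emitting one tuple per impression; the counter key
            // combines entity and day, like the md5 key earlier.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("counterKey"), 3,
                    new Values("article|42|2013-01-15"),
                    new Values("article|42|2013-01-15"),
                    new Values("article|7|2013-01-15"));

            return topology.newStream("impressions", spout)
                    .each(new Fields("counterKey"), new FilterNull())  // filtering
                    .groupBy(new Fields("counterKey"))                 // grouping
                    .persistentAggregate(new MemoryMapState.Factory(), // aggregation into state
                            new Count(), new Fields("count"));
        }
    }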
19. Trident
● Operating in transactions
● Persistent aggregation
○ “Memcached”
○ Cassandra
● DRPC calls
○ No need to touch Cassandra (query sketch below)
● Local cluster for development
● Easy to learn basics
● Hard to discover advanced stuff
○ Lack of documentation
○ Need to tune configuration
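The "no need to touch Cassandra" point as code - a DRPC stream querying the Trident state directly; counts is the state returned by the topology sketch above, and the function name is arbitrary:

    import org.apache.storm.trident.TridentState;
    import org.apache.storm.trident.TridentTopology;
    import org.apache.storm.trident.operation.builtin.MapGet;
    import org.apache.storm.tuple.Fields;

    public class CountQuery {
        public static void attach(TridentTopology topology, TridentState counts) {
            // A client call like drpc.execute("counter", "article|42|2013-01-15")
            // reads the aggregated value straight out of Trident state.
            topology.newDRPCStream("counter")
                    .stateQuery(counts, new Fields("args"), new MapGet(), new Fields("count"));
        }
    }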
20. Trident
● Functions (example below)
○ You can do everything you want
■ Touch DB, read emails, …
○ Stay with Java
■ No dependency problems
■ No performance penalty
● Topology
○ Good to define at the beginning
■ Spend time on a detailed diagram
■ It saves you during implementation and future updates
○ Don't make it too complex
■ Problems with loading it
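A minimal custom function in plain Java; the URL field and the section extraction are illustrative:

    import org.apache.storm.trident.operation.BaseFunction;
    import org.apache.storm.trident.operation.TridentCollector;
    import org.apache.storm.trident.tuple.TridentTuple;
    import org.apache.storm.tuple.Values;

    public class ExtractSection extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            // Plain Java, so the body can do anything: touch a DB, read mail, ...
            String url = tuple.getStringByField("url"); // e.g. "/sport/match-42"
            collector.emit(new Values(url.split("/")[1])); // -> "sport"
        }
    }

Wired into a stream as: .each(new Fields("url"), new ExtractSection(), new Fields("section"))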
22. Cassandra
● Already in our production on a different project
● No SPOF
● Multi Master
● Scalable
● More good stuff
● Lot of new features in 2.x
○ Lightweight transactions
○ Lot of fixes
■ Good old times on 0.8
■ Our bug report from 2011 - double load of commit log on node start :)
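How the counters land in Cassandra - a sketch with the DataStax Java driver; the keyspace, table, and key format are assumptions:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CounterStore {
        public static void main(String[] args) {
            // Assumed schema: CREATE TABLE daily_counters (key text PRIMARY KEY, value counter);
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("analytics");
            // Counter columns are incremented in place - no read-before-write needed.
            session.execute("UPDATE daily_counters SET value = value + 1 WHERE key = ?",
                    "article|42|2013-01-15");
            cluster.close();
        }
    }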
23. Kafka
● A high-throughput distributed messaging system
● Something like a distributed commit log
○ You can set retention
○ You can move reading offset back
■ Used by Trident transactions
● Cluster
● Ideal to use with Trident (producer sketch below)
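A minimal producer sketch for the web-worker side; the broker address, topic name, and payload are assumptions:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ImpressionProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // Each web worker publishes raw impressions; broker retention keeps them
            // replayable, which is what Trident's transactional reads rely on.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("impressions", "article|42", "raw impression json"));
            }
        }
    }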