3. Starting point
● we had two developers
● we had one live server
● we had one cold backup
● we can’t store all the data
● we can’t process all the data
4. Batch processing - the downsides
● batch every 3 hours
○ delete old data
● updating counters
○ you need to define them upfront
● throwing away old data
○ developer point of view
■ you have no way to correct your mistake
○ business point of view
■ you lose your data
5. Batch processing - the benefits
● you will learn
○ profiler is your best friend
○ optimizing can be hard and can take time
● what are access logs good for
○ reconstruct your deleted data (see the sketch below)
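A minimal sketch of the "reconstruct your deleted data" point, assuming Apache combined-format access logs; the regex and the example line are illustrative:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class AccessLogParser {
        // host ident user [timestamp] "request" status bytes "referer" "user-agent"
        private static final Pattern LINE = Pattern.compile(
                "^(\\S+) \\S+ \\S+ \\[([^\\]]+)\\] \"([^\"]*)\" (\\d{3}) \\S+ \"[^\"]*\" \"([^\"]*)\"$");

        public static void main(String[] args) {
            String line = "10.0.0.1 - - [15/Jan/2013:10:00:00 +0100] "
                    + "\"GET /article/42 HTTP/1.1\" 200 512 \"-\" \"Mozilla/5.0\"";
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                // Each matched line can be replayed as one reconstructed impression.
                System.out.println("time=" + m.group(2) + " request=" + m.group(3));
            }
        }
    }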
7. Big Data
● It’s not only about the volume
● What were we going to do with it?
○ We had NO idea!
● We rented more servers.
○ We needed a place to store the data
8. Big Data
● We went the NoSQL way
○ MongoDB
■ easy replication, possible sharding
■ upsert (sketch below)
■ rich document-based queries - we still had one foot in the SQL world
■ fast prototype
● We were still doing batch processing
● ~15M impressions per day, resulting in ~5 GB of raw data per day
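The "upsert" point in practice - a minimal sketch with the MongoDB Java driver; the database, collection, and key names are assumptions:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.model.Filters;
    import com.mongodb.client.model.UpdateOptions;
    import com.mongodb.client.model.Updates;
    import org.bson.Document;

    public class UpsertExample {
        public static void main(String[] args) {
            MongoClient client = MongoClients.create("mongodb://localhost:27017");
            MongoCollection<Document> col = client.getDatabase("analytics")
                    .getCollection("impressions_2013_01_15"); // hypothetical per-day collection
            // Upsert: increment the counter if the document exists, create it otherwise.
            col.updateOne(
                    Filters.eq("_id", "article|42"), // hypothetical counter key
                    Updates.inc("count", 1),
                    new UpdateOptions().upsert(true));
            client.close();
        }
    }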
9. Big Data
● each day as collection
○ easy for batch processing
● each impression as a document
● adding processed parameters over time
● pulling data from 30 collections (sketch below)
○ the server stops responding
○ virtual memory runs low
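Why pulling from 30 collections hurts - a sketch assuming the day-per-collection layout; the naming scheme and the filter are assumptions:

    import com.mongodb.client.MongoClient;
    import com.mongodb.client.MongoClients;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    import java.time.LocalDate;

    public class MonthlyPull {
        public static void main(String[] args) {
            MongoClient client = MongoClients.create("mongodb://localhost:27017");
            MongoDatabase db = client.getDatabase("analytics");
            long total = 0;
            LocalDate day = LocalDate.of(2013, 1, 1);
            // One collection per day: a 30-day report touches 30 collections and
            // drags a month of working set into memory at once.
            for (int i = 0; i < 30; i++, day = day.plusDays(1)) {
                String name = "impressions_" + day; // e.g. impressions_2013-01-01 (hypothetical)
                total += db.getCollection(name).countDocuments(new Document("entityType", "article"));
            }
            System.out.println("pageviews: " + total);
            client.close();
        }
    }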
10. Big Data - analytics
● Visitors counts on website/section
○ active - with subscription
○ inactive - without subscription
○ anonymous
● Content consumption
○ how many pageviews
■ active
■ inactive
■ anonymous
● and others
12. What we really need
● COUNT(* || DISTINCT ...) GROUP BY
○ entities
○ date periods (day, week, month)
○ combination of entities and date periods [and some other flags]
● Special demands from analytics team
○ Not too hard to implement with SQL magic
● As fast as possible
○ At least as fast as the data comes in
● Still store all historical raw data
○ Ideally compressed
13. What to do
● Processing raw data?
○ Uses a lot of space before you get a result
■ We need to store historical data anyway
■ You can store compressed files (LZO) in Hadoop
● Sharding
○ For how long?
○ How to properly determine sharding key(s)?
● Do you really have a big amount of data?
● Do you have the hardware for running Hadoop? Really?
● What does overnight batch processing really mean?
14. Naive solution
● Separate counter for each needed combination, updated on each impression, maybe touching the DB every time
○ Fast to generate a unique key for a combination (sketch below)
■ md5([entityType, entityId, day, dayId].join("|"))
○ Really fast to get a value
■ Always a primary key
■ Multiget
○ Need to define all GROUP BY combinations at the beginning
○ Failure while processing one impression
■ Counters need to be incremented in a transaction
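A sketch of the slide's key generation in plain Java; the field values are illustrative:

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;

    public class CounterKey {
        // md5([entityType, entityId, day, dayId].join("|")) from the slide, in Java.
        static String key(String entityType, String entityId, String day, String dayId)
                throws Exception {
            String raw = String.join("|", entityType, entityId, day, dayId);
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(raw.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        }

        public static void main(String[] args) throws Exception {
            // One counter per predefined GROUP BY combination; reads are always by primary key.
            System.out.println(key("article", "42", "2013-01-15", "15"));
        }
    }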
15. Real world solution
● Kafka
○ Buffering incoming data
○ Web workers as producers
● Storm / Trident
○ Consuming data from Kafka
○ Processing incoming data
○ Using cassandra as storage backend
● Cassandra
○ Holding counters and helper information to determine uniqueness
16. Storm
● Real-time processing of unbounded streams of data
○ Processing data as it comes
○ You still need to have computing power
○ Need to transform COUNT(* || DISTINCT ...) GROUP BY everything into steps of counter updates (see the bolt sketch below)
○ Java, but bolts can be written in different languages
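A minimal sketch of that COUNT(*) GROUP BY to counter-updates transformation inside a bolt; the field names and the in-memory map (a stand-in for a real store) are assumptions:

    import org.apache.storm.topology.BasicOutputCollector;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseBasicBolt;
    import org.apache.storm.tuple.Tuple;

    import java.util.HashMap;
    import java.util.Map;

    public class CounterBolt extends BaseBasicBolt {
        // In-memory stand-in for the real counter store (Cassandra in our case).
        private final Map<String, Long> counts = new HashMap<>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            // COUNT(*) GROUP BY entityId, day becomes: bump one counter per impression.
            String key = tuple.getStringByField("entityId") + "|" + tuple.getStringByField("day");
            counts.merge(key, 1L, Long::sum);
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: nothing is emitted downstream.
        }
    }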
18. Trident
● High level abstraction over Storm (sketch below)
○ Joins
○ Aggregations
○ Grouping
○ Filtering
○ Functions
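A sketch of those operations on an impression stream; the spout, field names, and the in-memory state (a stand-in for Cassandra) are assumptions:

    import org.apache.storm.trident.TridentState;
    import org.apache.storm.trident.TridentTopology;
    import org.apache.storm.trident.operation.builtin.Count;
    import org.apache.storm.trident.operation.builtin.FilterNull;
    import org.apache.storm.trident.testing.FixedBatchSpout;
    import org.apache.storm.trident.testing.MemoryMapState;
    import org.apache.storm.tuple.Fields;
    import org.apache.storm.tuple.Values;

    public class ImpressionTopology {
        public static TridentState build(TridentTopology topology) {
            // Hypothetical spout emitting one tuple per impression; the counter key
            // combines entity and day, like the md5 key earlier.
            FixedBatchSpout spout = new FixedBatchSpout(new Fields("counterKey"), 3,
                    new Values("article|42|2013-01-15"),
                    new Values("article|42|2013-01-15"),
                    new Values("article|7|2013-01-15"));

            return topology.newStream("impressions", spout)
                    .each(new Fields("counterKey"), new FilterNull())  // filtering
                    .groupBy(new Fields("counterKey"))                 // grouping
                    .persistentAggregate(new MemoryMapState.Factory(), // aggregation into state
                            new Count(), new Fields("count"));
        }
    }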
19. Trident
● Operating in transactions
● Persistent aggregation
○ “Memcached”
○ Cassandra
● DRPC calls
○ No need to touch Cassandra (query sketch below)
● Local cluster for development
● Easy to learn basics
● Hard to discover advanced stuff
○ Lack of documentation
○ Need to tune configuration
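The "no need to touch Cassandra" point as code - a DRPC stream querying the Trident state directly; counts is the state returned by the topology sketch above, and the function name is arbitrary:

    import org.apache.storm.trident.TridentState;
    import org.apache.storm.trident.TridentTopology;
    import org.apache.storm.trident.operation.builtin.MapGet;
    import org.apache.storm.tuple.Fields;

    public class CountQuery {
        public static void attach(TridentTopology topology, TridentState counts) {
            // A client call like drpc.execute("counter", "article|42|2013-01-15")
            // reads the aggregated value straight out of Trident state.
            topology.newDRPCStream("counter")
                    .stateQuery(counts, new Fields("args"), new MapGet(), new Fields("count"));
        }
    }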
20. Trident
● Functions (example below)
○ You can do everything you want
■ Touch DB, read emails, …
○ Stay with Java
■ No dependency problems
■ No performance penalty
● Topology
○ Good to define at the beginning
■ Spend time on a detailed diagram
■ It saves you during implementation and future updates
○ Don't make it too complex
■ Problems with loading it
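A minimal custom function in plain Java; the URL field and the section extraction are illustrative:

    import org.apache.storm.trident.operation.BaseFunction;
    import org.apache.storm.trident.operation.TridentCollector;
    import org.apache.storm.trident.tuple.TridentTuple;
    import org.apache.storm.tuple.Values;

    public class ExtractSection extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            // Plain Java, so the body can do anything: touch a DB, read mail, ...
            String url = tuple.getStringByField("url"); // e.g. "/sport/match-42"
            collector.emit(new Values(url.split("/")[1])); // -> "sport"
        }
    }

Wired into a stream as: .each(new Fields("url"), new ExtractSection(), new Fields("section"))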
22. Cassandra
● Already in our production on a different project
● No SPOF
● Multi Master
● Scalable
● More good stuff
● Lot of new features in 2.x
○ Lightweight transactions
○ Lot of fixes
■ Good old times on 0.8
■ Our bug report from 2011 - double load of commit log on node start :)
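How the counters land in Cassandra - a sketch with the DataStax Java driver; the keyspace, table, and key format are assumptions:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CounterStore {
        public static void main(String[] args) {
            // Assumed schema: CREATE TABLE daily_counters (key text PRIMARY KEY, value counter);
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("analytics");
            // Counter columns are incremented in place - no read-before-write needed.
            session.execute("UPDATE daily_counters SET value = value + 1 WHERE key = ?",
                    "article|42|2013-01-15");
            cluster.close();
        }
    }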
23. Kafka
● A high-throughput distributed messaging system
● Something like a distributed commit log
○ You can set retention
○ You can move reading offset back
■ Used by Trident transactions
● Cluster
● Ideal to use with Trident (producer sketch below)
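A minimal producer sketch for the web-worker side; the broker address, topic name, and payload are assumptions:

    import java.util.Properties;

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ImpressionProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            // Each web worker publishes raw impressions; broker retention keeps them
            // replayable, which is what Trident's transactional reads rely on.
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.send(new ProducerRecord<>("impressions", "article|42", "raw impression json"));
            }
        }
    }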