
Piano Media - approach to data gathering and processing

Lessons learned when changing our mindset from batch processing to real-time processing of an unbounded stream of data.



  1. Nearly three years of continuous changes in our approach to data gathering and processing (Martin Strycek, Juraj Sottnik) @rubyslava 2014
  2. We'd better get it right the first time!
  3. Starting point ● we had two developers ● we had one live server ● we had one cold backup ● we couldn't store all the data ● we couldn't process all the data
  4. Batch processing - the downsides ● batch every 3 hours ○ delete old data ● updating counters ○ you need to define them upfront ● throwing away old data ○ developer point of view ■ you have no way to correct your mistake ○ business ■ you lose your data
  5. Batch processing - the benefits ● you will learn ○ the profiler is your best friend ○ optimizing can be hard and can take time ● what access logs are good for ○ reconstructing your deleted data
  6. Business says: save all data
  7. Big Data ● It's not only about the volume ● What are we going to do with it? ○ We had NO idea! ● We rented more servers. ○ We needed a place to store the data
  8. Big Data ● We went the NoSQL way ○ MongoDB ■ easy replication, possible sharding ■ upsert ■ rich document-based queries - we still had one foot in the SQL world ■ fast prototyping ● We were still doing batch processing ● ~15M impressions per day, ending up with ~5 GB of raw data per day
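
For context, a minimal sketch of what an upsert into a per-day impression collection might have looked like with the MongoDB Java driver of that era. The database name, collection naming scheme, and fields are assumptions for illustration, not Piano's actual schema:

```java
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class ImpressionUpsert {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        DB db = client.getDB("analytics");                                     // hypothetical database name
        DBCollection impressions = db.getCollection("impressions_20140301");   // one collection per day (assumed naming)

        // one document per impression, keyed by an assumed impression id
        BasicDBObject query = new BasicDBObject("_id", "imp-123");
        BasicDBObject update = new BasicDBObject("$set",
                new BasicDBObject("userType", "active").append("section", "sport"));

        // upsert = true inserts the document when it is missing, otherwise updates it in place
        impressions.update(query, update, true /* upsert */, false /* multi */);
        client.close();
    }
}
```
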
  9. Big Data ● each day as a collection ○ easy for batch processing ● each impression as a document ● adding processed parameters over time ● pulling data from 30 collections ○ the server is not responding ○ virtual memory is low
  10. Big Data - analytics ● Visitor counts per website/section ○ active - with a subscription ○ inactive - without a subscription ○ anonymous ● Content consumption ○ how many pageviews ■ active ■ inactive ■ anonymous ● and others
  11. Business asks: how many UNIQUE users did … in a month
  12. What we really need ● COUNT(* || DISTINCT ...) GROUP BY ○ entities ○ date periods (day, week, month) ○ combinations of entities and date periods [and some other flags] ● Special demands from the analytics team ○ Not too hard to implement with SQL magic ● As fast as possible ○ At least as fast as the data comes in ● Still store all historical raw data ○ Ideally compressed
  13. What to do ● Processing raw data? ○ Uses a lot of space before getting a result ■ We need to store historical data anyway ■ You can store compressed files (LZO) in Hadoop ● Sharding ○ For how long? ○ How to properly determine the sharding key(s)? ● Do you really have a big amount of data? ● Do you have the hardware for running Hadoop? Really? ● What does overnight batch processing really mean?
  14. Naive solution ● A separate counter for each needed combination, updated for each impression, possibly touching the DB each time ○ Fast to generate a unique key for a combination ■ md5([entityType, entityId, day, dayId].join("|")) ○ Really fast to get a value ■ Always a primary key ■ Multiget ○ Need to define all GROUP BY combinations at the beginning ○ Failure during processing of one impression ■ Need to increment counters in a transaction
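
A minimal Java sketch of the key-generation idea above. The md5-of-joined-fields scheme is the one from the slide; the field names and the way the counter would be incremented afterwards are assumptions:

```java
import java.math.BigInteger;
import java.security.MessageDigest;

public class CounterKey {
    // Mirrors md5([entityType, entityId, day, dayId].join("|")) from the slide:
    // one deterministic key per (entity, period) combination.
    static String counterKey(String entityType, String entityId,
                             String period, String periodId) throws Exception {
        String raw = String.join("|", entityType, entityId, period, periodId);
        byte[] digest = MessageDigest.getInstance("MD5").digest(raw.getBytes("UTF-8"));
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        // e.g. the pageview counter for section 42 on 2014-03-01; the naive approach
        // would increment the value stored under this key once per matching impression
        System.out.println(counterKey("section", "42", "day", "2014-03-01"));
    }
}
```
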
  15. Real world solution ● Kafka ○ Buffering incoming data ○ Web workers as producers ● Storm / Trident ○ Consuming data from Kafka ○ Processing incoming data ○ Using Cassandra as the storage backend ● Cassandra ○ Holding counters and helper information used to determine uniqueness
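
A rough sketch of the "web workers as producers" side, using the standard Kafka Java producer (shown here with the newer client classes; the 0.8-era API the talk would have used looked different). The broker address, topic name, and message format are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ImpressionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092");   // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // each web worker pushes raw impressions into a topic; Storm/Trident consumes them later
        producer.send(new ProducerRecord<>("impressions", "user-42",
                "{\"entity\":\"site-a\",\"section\":\"sport\",\"ts\":1396310400}"));
        producer.close();
    }
}
```
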
  16. Storm ● Real-time processing of unbounded streams of data ○ Processing data as they come ○ You still need to have the computing power ○ Need to transform COUNT(* || DISTINCT ...) GROUP BY everything into steps of counter updates ○ Java, but bolts can be written in different languages
  17. Storm ● Spouts ● Bolts
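
To make the spout/bolt split concrete, a toy bolt in the pre-1.0 backtype.storm API (the package layout Storm used around 2014). The field names and the parsing are illustrative assumptions, not the actual topology:

```java
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Receives raw impressions from a spout and emits (section, 1) pairs for counting downstream.
public class ExtractSectionBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String rawImpression = tuple.getString(0);
        // assumed format "section|entityId|userId|..."; real impressions would be parsed properly
        String section = rawImpression.split("\\|")[0];
        collector.emit(new Values(section, 1L));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("section", "count"));
    }
}
```
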
  18. Trident ● High level abstraction over Storm ○ Joins ○ Aggregations ○ Grouping ○ Filtering ○ Functions
  19. Trident ● Operating in transactions ● Persistent aggregation ○ “Memcached” ○ Cassandra ● DRPC calls ○ No need to touch Cassandra ● Local cluster for development ● Easy to learn the basics ● Hard to discover the advanced stuff ■ Lack of documentation ■ Need to tune the configuration
  20. Trident ● Functions ○ You can do everything you want ■ Touch the DB, read emails, … ○ Stay with Java ■ No dependency problems ■ No performance penalty ● Topology ○ Good to define at the beginning ■ Spend time on a detailed diagram ■ It will save you during implementation and future updates ○ Don't make it too complex ■ Problems with loading it
  21. Trident
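
A small Trident sketch of the pattern the previous slides describe: group a stream of impressions and keep a persistent count per group. It uses the in-memory test spout and state that ship with Trident; the field names and sample values are assumptions, and the real topology (fed from Kafka, backed by Cassandra) was certainly more involved:

```java
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.builtin.Count;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;

public class ImpressionCountTopology {
    public static void main(String[] args) {
        // a fake spout replaying a few impressions, just to exercise the topology locally
        FixedBatchSpout spout = new FixedBatchSpout(new Fields("entity", "day"), 3,
                new Values("site-a", "2014-03-01"),
                new Values("site-a", "2014-03-01"),
                new Values("site-b", "2014-03-01"));
        spout.setCycle(false);

        TridentTopology topology = new TridentTopology();
        // GROUP BY (entity, day) expressed as a grouped stream with a persistent COUNT aggregate
        topology.newStream("impressions", spout)
                .groupBy(new Fields("entity", "day"))
                .persistentAggregate(new MemoryMapState.Factory(), new Count(), new Fields("count"));
    }
}
```
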
  22. Cassandra ● Already in production on a different project ● No SPOF ● Multi-master ● Scalable ● More good stuff ● Lots of new features in 2.x ○ Lightweight transactions ○ Lots of fixes ■ Good old times on 0.8 ■ Our bug report from 2011 - double load of the commit log on node start :)
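
A hedged sketch of what incrementing one of those counters might look like through the DataStax Java driver; the keyspace, table, and column names are made up for illustration:

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class CounterIncrement {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("analytics");   // hypothetical keyspace

        // counter table assumed as: CREATE TABLE counters (key text PRIMARY KEY, views counter)
        PreparedStatement inc = session.prepare(
                "UPDATE counters SET views = views + 1 WHERE key = ?");
        session.execute(inc.bind("section|42|day|2014-03-01"));

        cluster.close();
    }
}
```
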
  23. Kafka ● A high-throughput distributed messaging system ● Something like a distributed commit log ○ You can set the retention ○ You can move the read offset back ■ Used by Trident transactions ● Cluster ● Ideal to use with Trident
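
Wiring Kafka into Trident is typically done through the storm-kafka spout; a minimal sketch, assuming a ZooKeeper address and an "impressions" topic (both made up here):

```java
import backtype.storm.spout.SchemeAsMultiScheme;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;
import storm.kafka.trident.OpaqueTridentKafkaSpout;
import storm.kafka.trident.TridentKafkaConfig;

public class KafkaSpoutSetup {
    public static OpaqueTridentKafkaSpout impressionSpout() {
        ZkHosts zkHosts = new ZkHosts("zookeeper1:2181");                           // assumed ZooKeeper address
        TridentKafkaConfig config = new TridentKafkaConfig(zkHosts, "impressions"); // assumed topic name
        config.scheme = new SchemeAsMultiScheme(new StringScheme());                // read messages as plain strings
        // an opaque transactional spout lets Trident rewind the Kafka read offset on failure,
        // which is the "move reading offset back" point from the slide
        return new OpaqueTridentKafkaSpout(config);
    }
}
```
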
  24. Business asks: are you ready for ~250M impressions per day?
  25. Thank you.
