Paris DataGeek - SummingBird

Deck of the presentation I gave during the Paris DataGeek meetup on streaming platforms.

  1. SUMMINGBIRD
     Sofian DJAMAA (@sdjamaa)
  2. ANALYTICS INFRASTRUCTURE
     We log all data coming out of our web servers:
     • Events of ad displays
     • Events of clicks
     • Events of arbitrage
     • …
  3. SO MUCH DATA TO COMPUTE THAT WE USE HADOOP
  4. YEAH, BUT WHAT ABOUT REAL NUMBERS?
     We have those too, but you'll note that the marketing team was not involved in the design :)
     • 1.4M requests per second on average across 6 DCs
     • 30B events logged and imported into HDFS each day
     • 2.5B displays
     • 840M unique users
     …and 70 BI analysts who want to play with it all…
  5. DATA ANALYSTS
     Data analysts (or BIs) use our data to:
     – Improve our algorithms and strategy
     – Build financial reports
     They obviously learn SQL
  6. QUERYING TOOL
     Of course we have a « fast » datastore fitting 80% of analyst needs…
     … with limited updates per day
     [Diagram: LOGS are batch-loaded into Hadoop (log store & compute platform), some PhD-level computations and transformations run there, and the results are batch-loaded into Vertica (the datamart), which analysts query]
  7. ANOTHER QUERYING TOOL
     Therefore sometimes data analysts need to use Hive
  8. MAKE LIFE BETTER
     Can we get « big data » faster?
     • Current situation: we have PBs of useful data to handle
       – We have a low/mid-latency datastore
       – But batches take hours (literally)
         • e.g. arbitrage events are up to 200GB per hour
  9. STREAMING ISSUES
     But we do have Storm!
     • Fault-tolerance???
     • We need to recompute historical data whenever a new metric emerges
     • To be honest, those issues can be overcome, but at a high cost
       – We already have a satisfying Hadoop architecture
       – The switch to a fully streaming architecture would require rebuilding our backends
  10. OUR WISH
      Ideally, we'd like…
      Hadoop with reliable data:
      • Get reliable data
      • Process lots of events (daily/monthly/quarterly aggregates)
      Storm with fresh data:
      • Get a first overview of our aggregates faster
        – Errors are bounded to the current « batch » being processed
  11. THIS IS CALLED: LAMBDA ARCHITECTURE
      [Diagram: HADOOP and STORM]
  12. WHAT ARE THE CHOICES THEN?
      One job to write per platform?
      – MapReduce/Cascading/Hive jobs
      – Storm topology
      ✗ Learn different technologies to achieve the same result
      ✗ Risks of discrepancies
      ✗ Deployment complexity
      Our choice: SummingBird. PIOU!
  13. CORE CONCEPTS
      Main concept: Platform[P]
      Currently P = [Scalding | Storm | Spark]
      Every job is written with the same piece of code, in Scala:
      • Storm topology
      • Scalding pipes
      • And the thing that runs on Spark (?!)
  14. SUMMINGBIRD: CODE SAMPLE
      [Code on the slide, with these callouts:]
      • Data source (either an HDFS path or a Kafka queue)
      • All the processing happens here (converted to either a Storm topology or a Cascading job)
      • We merge the processing with another source
      • This is the output of the job
      • We sum the output of both sources (DisplayClick) inside a datastore (Memcache, MySQL…)
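The code on this slide is an image, so only its callouts survive above. Here is a minimal sketch, in the spirit of SummingBird's word-count example, of what such a platform-generic job could look like: the Event fields, the two source names and the simplified Long counter are my assumptions; only the overall shape (two sources, processing, merge, sumByKey into a store) and the CampaignAffiliate name come from the deck.

```scala
import com.twitter.summingbird.{ Platform, Producer }

// Hypothetical input event: the slide does not show the real schema.
case class Event(campaignId: Long, affiliateId: Long)
case class CampaignAffiliate(campaignId: Long, affiliateId: Long)

object EventCountJob {
  // Written once; SummingBird plans it as either a Storm topology or a
  // Cascading/Scalding job depending on the concrete Platform[P].
  def job[P <: Platform[P]](
      displays: Producer[P, Event],                // data source #1 (HDFS path or Kafka queue)
      clicks: Producer[P, Event],                  // data source #2, merged with the first one
      store: P#Store[CampaignAffiliate, Long]) =   // the output datastore (Memcache, MySQL, ...)
    displays
      .merge(clicks)                               // merge the processing with another source
      .map { e =>                                  // all the processing happens here
        CampaignAffiliate(e.campaignId, e.affiliateId) -> 1L
      }
      .sumByKey(store)                             // sum the output of both sources into the store
}
```

In the real job the summed value would be the DisplayClick entity introduced on slide 18, with its own monoid, rather than a bare Long.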
  15. BATCHID: CORE OF THE SUMMINGBIRD API
      • A BatchID is a unit of time, e.g. 1H or 6H
      • The BatchID is set to the Hadoop frequency to bound errors
      • We merge the previous Hadoop results with the new batch, which means Storm results are volatile
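The batch granularity itself is configured through SummingBird's Batcher. A minimal sketch, assuming a 1-hour batch (the deck only says the BatchID is aligned with the Hadoop frequency, e.g. 1H or 6H):

```scala
import com.twitter.summingbird.batch.Batcher

object BatchConfig {
  // One BatchID covers one hour of events. Every completed batch eventually
  // comes from Hadoop; only the most recent, still-volatile batches come from Storm.
  implicit val batcher: Batcher = Batcher.ofHours(1)
}
```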
  16. UPDATING DATA IN REAL-TIME
      We use a data type: monoids
      « In abstract algebra, a branch of mathematics, a monoid is an algebraic structure with a single associative binary operation and an identity element. » (Wikipedia)
      Monoids are cool:
      – Associativity loves parallelism
      – No need to be commutative
  17. ALGEBIRD: CODE SAMPLE
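The Algebird sample is an image on the slide. As a stand-in, here is a small sketch (the example values are mine) of what Algebird provides out of the box: Monoid instances for numbers and for maps, which is what the per-key sums above rely on.

```scala
import com.twitter.algebird.Monoid

object AlgebirdSketch extends App {
  // Numbers: an identity element and an associative plus.
  val total   = Monoid.plus(40, 2)    // 42
  val nothing = Monoid.zero[Int]      // 0

  // Maps are merged key by key and the values are combined with their own
  // monoid: exactly the shape of a "counters per key" aggregate.
  val hour1  = Map("displays" -> 10L, "clicks" -> 1L)
  val hour2  = Map("displays" -> 5L,  "clicks" -> 2L)
  val merged = Monoid.plus(hour1, hour2)  // Map(displays -> 15, clicks -> 3)

  println(s"$total $nothing $merged")
}
```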
  18. PUSHING DATA TO THE STORE
      For ease of programming, we define entities
      CampaignAffiliate & DisplayClick are stored in a « store »
  19. PUSHING DATA TO THE STORE
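The entity code is an image on these two slides. A hedged sketch of what such entities could look like follows; the field names are my assumption, only the two entity names come from the deck. Giving DisplayClick a Monoid is what lets SummingBird sum it by key and lets the store merge it.

```scala
import com.twitter.algebird.Monoid

// The key of the store: which campaign/affiliate the counters belong to.
case class CampaignAffiliate(campaignId: Long, affiliateId: Long)

// The value we aggregate per key.
case class DisplayClick(displays: Long, clicks: Long)

object DisplayClick {
  // Identity element + associative plus: enough for sumByKey and for mergeable stores.
  implicit val monoid: Monoid[DisplayClick] = new Monoid[DisplayClick] {
    def zero: DisplayClick = DisplayClick(0L, 0L)
    def plus(a: DisplayClick, b: DisplayClick): DisplayClick =
      DisplayClick(a.displays + b.displays, a.clicks + b.clicks)
  }
}
```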
  20. BIJECTION
      How can we transform an entity to:
      – An Array[Byte] for a Memcached store?
      – A String for an in-memory Map[String, String] store?
      → Another cool library: Bijection!
      « In mathematics, an injective function or injection or one-to-one function is a function that preserves distinctness: it never maps distinct elements of its domain to the same element of its codomain. » (Wikipedia)
      Example: Int => String is an injection into the whole set of Strings, because not every String maps back to an Int:
      « foo » => Int is not possible
      « 12.42 » => Int is not possible
  21. BIJECTION: CODE SAMPLE
      [Code on the slide, with these callouts:]
      • A simple Injection from Twitter
      • From Int to String
      • From String to maybe an Int
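A hedged reconstruction of what the callouts describe, using the stock Injection[Int, String] shipped with bijection-core (whether invert returns an Option or a Try depends on the Bijection version):

```scala
import com.twitter.bijection.Injection

object BijectionSketch extends App {
  // A simple Injection from Twitter: the built-in Int <-> String injection.
  val intToString = implicitly[Injection[Int, String]]

  // From Int to String: always succeeds.
  val asString = intToString(42)          // "42"

  // From String to maybe an Int: "foo" or "12.42" have no Int preimage,
  // so the result comes back wrapped (Option in older releases, Try in newer ones).
  val ok   = intToString.invert("42")
  val nope = intToString.invert("foo")

  println(s"$asString $ok $nope")
}
```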
  22. DATA PERSISTENCE
      • Now we know how to convert a (CampaignAffiliate, DisplayClick) tuple to an Array[Byte], and we can even modify it
      • One last thing is missing: what's the interaction with the data store?
      → Last cool thing: Storehaus!!!
  23. STOREHAUS
      « Storehaus is a library that makes it easy to work with asynchronous key value stores. Storehaus is built on top of Twitter's Future. » (Twitter's GitHub)
      What is really cool: storehaus-algebra
      • It makes all stores mergeable (using Monoid[V])
  24. STOREHAUS: CODE SAMPLE
      [Code on the slide, with these callouts:]
      • An Injection, because MySqlStore only takes MySqlValues
      • The original store, with only put/get operations
      • Make the store Mergeable to allow updates
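The MySQL wiring on the slide is an image, so rather than guess at MySqlStore's exact constructor, here is a minimal sketch against Storehaus' in-memory ConcurrentHashMapStore showing the asynchronous put/get contract the callouts mention; wrapping such a store into a Mergeable one via Monoid[V] is what the storehaus-algebra module adds on top.

```scala
import com.twitter.storehaus.{ ConcurrentHashMapStore, Store }
import com.twitter.util.Await

object StorehausSketch extends App {
  // In-memory stand-in for the MySQL/Memcache stores mentioned in the deck.
  val store: Store[String, Long] = new ConcurrentHashMapStore[String, Long]()

  // Storehaus is asynchronous: put and get return Twitter Futures.
  Await.result(store.put(("clicks", Some(42L))))
  val clicks = Await.result(store.get("clicks"))   // Some(42)

  // Putting None for a key deletes it.
  Await.result(store.put(("clicks", None)))

  println(clicks)
}
```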
  25. SUMMINGBIRD CLIENT
      [Diagram: Hadoop fills a batch store, Storm fills a live store, client queries read from both]
      • With 2 different stores: we have to use the ClientStore from SummingBird to merge online and offline data
      • With a single store: we need to know whether the data is reliable or not
  26. SUMMINGBIRD CLIENT
  27. Client query
      We want to query data up to BatchID #3:
      • We get the latest processed batch from Hadoop (#1)
      • We merge those results with the 2 latest batches from Storm (#2 and #3)
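A conceptual illustration of that merge, not the actual ClientStore API, with made-up numbers, using the same monoid that produced the per-batch values:

```scala
import com.twitter.algebird.Monoid

object ClientMergeSketch extends App {
  // Hypothetical per-key click counts. Hadoop has fully processed batch #1;
  // Storm holds the still-volatile batches #2 and #3.
  val offlineUpToBatch1 = 1000L                        // reliable, from the batch store
  val onlineDeltas      = Map(2L -> 40L, 3L -> 7L)     // fresh, from the live store

  // "Query up to BatchID #3" = offline result + online deltas for #2 and #3,
  // combined with the value's monoid. SummingBird's ClientStore automates this.
  val answer = Monoid.sum(offlineUpToBatch1 :: onlineDeltas.values.toList)

  println(answer)   // 1047
}
```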
  28. DEPLOY EVERYTHING
      • We decided to build one assembly per platform…
        – … using SBT
        – Unfortunately, the 2 samples you can find on all the Internets are monolithic
      • This is how we run the Scalding part:
        $> java <f******g lots of Hadoop conf keys> -cp <sh*t classpath value> org.apache.hadoop.util.RunJar job.jar JobName -arg1 value -arg2 value
      • The Storm JAR assembly is deployed as usual
  29. BENEFITS OF SUMMINGBIRD
      Cool:
      • Write once, run everywhere
        – We plan to deploy pure Scalding jobs thanks to SummingBird
        – The same can be done with Storm
      • I didn't need to learn any Storm o/
      • Use of Scala to write concise jobs
        – Especially after some months using Cascading...
      • Opportunity to do some cool Open Source coding
      Not cool:
      • SummingBird = SummingBird + Storehaus + Bijection + Algebird
        – Need to learn all those APIs
        – And it's hard if you're not a Scala master
      • Not all of these libraries are up-to-date
      • As for any new Open Source project, you won't find:
        – Tutorials
        – Examples
        – StackOverflow posts
        – Weird Chinese mailing lists listing your stacktrace
  30. WE HIRE! WE HIRE! WE HIRE! WE HIRE!
