4. YEAH, BUT WHAT ABOUT REAL NUMBERS?
We have those too, but you’ll note that the marketing team was not involved in the design
• 1.4M requests per second average across 6 DCs
• 30B events logged and imported into HDFS each day
• 2.5B displays
• 840M unique users
…and 70 BI Analysts who want to play with it all…
5. DATA ANALYSTS
Data analysts (or BIs) use our data to:
– Improve our algorithms and strategy
– Build financial reports
They obviously learn SQL
6. QUERYING TOOL
Of course we have a « fast » datastore fitting 80% of
analyst needs…
… with limited updates per day
[Architecture diagram: LOGS are batch-loaded into Hadoop (log store & compute platform), where some PhD-level computations and transformations run; results are batch-loaded into Vertica (datamart), which analysts query.]
8. MAKE LIFE BETTER
Can we get « big data » faster?
• Current situation: we have PBs of useful data to handle
– We have a low/middle-latency datastore
– But batches take hours (literally)
• e.g. arbitrage events are up to 200 GB per hour
9. STREAMING ISSUES
But we do have Storm!
• Fault tolerance???
• We need to recompute historical data whenever a new metric emerges
• To be honest, those issues can be overcome, but at a high cost
– We already have a satisfying Hadoop architecture
– The switch to a fully streamed architecture would require rebuilding our backends
10. OUR WISH
Ideally, we’d like…
Hadoop with reliable data
• Get reliable data
• Process lots of events (daily/monthly/quarterly aggregates)
Storm with fresh data
• Get a first overview of our aggregates faster
– Errors are bounded to the current « batch » being processed
12. WHAT ARE THE CHOICES THEN?
One job to write per platform?
– MapReduce/Cascading/Hive jobs
– Storm topology
Learning different technologies to achieve the same result
Risk of discrepancies
Deployment complexity
Our choice: SummingBird
PIOU!
13. CORE CONCEPTS
Main concept: Platform[P]
Currently P = [Scalding | Storm | Spark]
Every job is written with the same piece of code, in Scala:
• Storm topology
• Scalding pipes
• And the thing that runs on Spark (?!)
14. SUMMINGBIRD: CODE SAMPLE
[Code screenshot with callouts:]
• Data source (either an HDFS path or a Kafka queue)
• All the processing happens here (converted to either a Storm topology or a Cascading job)
• We merge the processing with another source
• This is the output of the job
• We sum the output of both sources (DisplayClick) inside a datastore (Memcache, MySQL…)
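As a rough illustration of the shape of that job, here is a self-contained plain-Scala analogue (NOT the real SummingBird API — `DisplayClick`, `plus` and `sumByKey` are simplified stand-ins): two sources are merged, keyed, and summed into a key-value store.

```scala
// Self-contained analogue of the job pictured on this slide (plain Scala,
// not the real SummingBird API): two event sources are merged, keyed, and
// their values summed into a key-value "store". All names are illustrative.
case class DisplayClick(displays: Long, clicks: Long)

// Monoid-style sum applied by the store when two values share a key.
def plus(a: DisplayClick, b: DisplayClick): DisplayClick =
  DisplayClick(a.displays + b.displays, a.clicks + b.clicks)

def sumByKey(events: Seq[(String, DisplayClick)],
             store: Map[String, DisplayClick]): Map[String, DisplayClick] =
  events.foldLeft(store) { case (acc, (k, v)) =>
    acc.updated(k, acc.get(k).map(plus(_, v)).getOrElse(v))
  }

// Two sources (think: an HDFS path and a Kafka queue), merged then summed.
val displays = Seq("campA" -> DisplayClick(1, 0), "campB" -> DisplayClick(1, 0))
val clicks   = Seq("campA" -> DisplayClick(0, 1))
val result   = sumByKey(displays ++ clicks, Map.empty)
```

In the real API the same pipeline compiles to either a Storm topology or a Cascading job depending on the platform it is bound to.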
15. BATCHID: CORE OF SUMMINGBIRD API
• We merge previous Hadoop results with the new batch
• It means Storm results are volatile
• A BatchID is a unit of time, e.g. 1h, 6h…
• The BatchID is set to the Hadoop frequency to bound errors
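A hypothetical sketch of the BatchID idea (the real API differs): time is cut into fixed-size batches at the Hadoop frequency, here 6 hours, so any streaming error stays bounded to the batch currently being processed.

```scala
// Hypothetical sketch of a BatchID (not the real SummingBird type): time is
// divided into fixed-size batches matching the Hadoop batch frequency.
val batchSizeMs: Long = 6L * 3600 * 1000  // a 6-hour batch

def batchIdOf(timestampMs: Long): Long = timestampMs / batchSizeMs

// Two events 6 hours apart land in consecutive batches.
val b0 = batchIdOf(0L)
val b1 = batchIdOf(batchSizeMs)
```

Once Hadoop has processed a batch, its reliable results replace whatever Storm computed for that BatchID.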
16. UPDATING DATA IN REAL-TIME
We use a data type: monoids
« In abstract algebra, a branch of mathematics, a monoid is an algebraic structure with a single associative binary operation and an identity element. » Wikipedia
Monoids are cool
– Associativity loves parallelism
– No need to be commutative
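To see why associativity loves parallelism, here is a minimal monoid in plain Scala (a simplified version of what Algebird provides): the same sum can be computed in one sequential pass (Storm-style) or over chunks (Hadoop-style) with identical results.

```scala
// A monoid: an associative `plus` and an identity `zero`. Associativity is
// what lets the sum be computed sequentially or in parallel chunks with
// the same result, regardless of how the data is split.
trait Monoid[T] { def zero: T; def plus(a: T, b: T): T }

val longSum: Monoid[Long] = new Monoid[Long] {
  def zero = 0L
  def plus(a: Long, b: Long) = a + b
}

def sumAll[T](xs: Seq[T])(m: Monoid[T]): T = xs.foldLeft(m.zero)(m.plus)

val data = (1L to 100L).toSeq
// One sequential pass vs. ten chunks summed independently, then combined.
val sequential = sumAll(data)(longSum)
val chunked =
  sumAll(data.grouped(10).map(chunk => sumAll(chunk)(longSum)).toSeq)(longSum)
```

No commutativity is needed: the chunks are recombined in order, only the grouping changes.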
20. BIJECTION
How can we transform an entity to
– An Array[Byte] for Memcached store?
– A String for an in-memory Map[String, String] store?
Another cool library: Bijection!
« In mathematics, an injective function or injection or one-to-one
function is a function that preserves distinctness: it never maps
distinct elements of its domain to the same element of its codomain ».
Wikipedia
Example: Int => String is an injection into the set of Strings: distinct Ints map to distinct Strings, but the inverse is only partial, because
« foo » => Int is not possible
« 12.42 » => Int is not possible
21. BIJECTION: CODE SAMPLE
[Code screenshot with callouts:]
• A simple Injection from Twitter
• From Int to String
• From String to maybe an Int
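A plain-Scala version of the same idea (mirroring the shape of Bijection's `Injection[Int, String]`, but without the library — the function names here are my own): the forward direction always succeeds, while the inverse returns an Option because it may fail.

```scala
// Plain-Scala sketch of an Injection between Int and String (names are
// illustrative, not the Bijection API): Int => String always succeeds,
// String => Int may fail, hence the Option.
def intToString(i: Int): String = i.toString

def stringToInt(s: String): Option[Int] =
  try Some(s.toInt)
  catch { case _: NumberFormatException => None }

val roundTrip = stringToInt(intToString(42))  // Some(42)
val bad1 = stringToInt("foo")                 // None: « foo » => Int impossible
val bad2 = stringToInt("12.42")               // None
```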
22. DATA PERSISTENENCE
• Now we know how to convert a (CampaignAffiliate,
DisplayClick) tuple to an Array[Byte] and we can even
modify it
• One last thing is missing: what’s the interaction with the
data store?
Last cool thing: Storehaus!!!
23. STOREHAUS
« Storehaus is a library that makes it easy to work
with asynchronous key value stores. Storehaus is
built on top of Twitter's Future. » Twitter’s GitHub
What is really cool: storehaus-algebird
• Makes all stores mergeable (using Monoid[V])
24. STOREHAUS: CODE SAMPLE
[Code screenshot with callouts:]
• An Injection, because MySqlStore only takes MySqlValues
• The original store, with only put/get operations
• Make the store Mergeable to allow updates
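The sketch below models the idea in self-contained plain Scala (NOT the real Storehaus API — `KVStore` and `merge` are illustrative stand-ins): a plain put/get store gains a `merge` operation that combines the incoming value with the current one, via a monoid-style sum, instead of overwriting it.

```scala
// Self-contained model of a "mergeable" store (not the real Storehaus API):
// a plain put/get key-value store, plus a merge that accumulates values.
class KVStore[K, V] {
  private var data = Map.empty[K, V]
  def get(k: K): Option[V] = data.get(k)
  def put(k: K, v: V): Unit = data += (k -> v)
}

// merge reads the current value, sums with the new one (monoid-style),
// and writes the result back: updates accumulate instead of overwriting.
def merge[K](store: KVStore[K, Long], k: K, v: Long): Unit =
  store.put(k, store.get(k).getOrElse(0L) + v)

val store = new KVStore[String, Long]
merge(store, "campA", 3L)
merge(store, "campA", 4L)  // now holds 7, not 4
```

In Storehaus the real stores are asynchronous (built on Twitter's Future), which this toy model ignores.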
25. SUMMINGBIRD CLIENT
[Diagram: Hadoop feeds a batch store, Storm feeds a live store; the client query reads both.]
With 2 different stores: we have to use the ClientStore from SummingBird to merge online and offline data
With a single store: we need to know whether the data is reliable or not
27. CLIENT QUERY
We want to query data up to BatchID #3:
• We get the latest processed batch from Hadoop (#1)
• We merge those results with the 2 latest batches from Storm (#2 and #3)
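The client-side merge can be sketched as follows (simplified, with hypothetical numbers): Hadoop has reliably summed everything up to batch #1, Storm holds volatile per-batch sums for #2 and #3, and the client adds them together.

```scala
// Sketch of the client query (hypothetical data, not the real ClientStore
// API): reliable Hadoop results up to batch #1, plus volatile Storm sums
// for batches #2 and #3, combined with the same monoid-style sum.
val hadoopUpToBatch1 = Map("campA" -> 10L)
val stormBatches = Seq(
  Map("campA" -> 3L),  // batch #2
  Map("campA" -> 1L)   // batch #3
)

val merged: Map[String, Long] =
  (hadoopUpToBatch1 +: stormBatches)
    .flatten                     // all (key, value) pairs from every store
    .groupBy(_._1)               // group by key
    .map { case (k, kvs) => k -> kvs.map(_._2).sum }
```

Once Hadoop catches up and processes batch #2, its reliable result simply replaces the corresponding Storm sum in the merge.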
28. DEPLOY EVERYTHING
• We decided to build one assembly per platform…
– … using SBT
– Unfortunately, the two samples you can find on all the Internets are monolithic
• This is how we run the Scalding part
$> java <f******g lots of Hadoop conf keys> –cp <sh*t
classpath value> org.apache.hadoop.commons.RunJar job.jar
JobName –arg1 value –arg2 value
• The Storm JAR assembly is deployed as usual
29. BENEFITS OF SUMMINGBIRD
Cool
• Write once, run everywhere
– We plan to deploy pure Scalding jobs thanks to SummingBird
– Same can be done with Storm
• I didn’t need to learn any Storm o/
• Use of Scala to write concise jobs
– Especially after some months using Cascading...
• Opportunity to do some cool Open Source coding
Not cool
• SummingBird = SummingBird + Storehaus + Bijection + Algebird
– Need to learn all those APIs
– And it’s hard if you’re not a Scala master
• Not all of these libraries are up to date
• As for any new Open Source project, you won’t find:
– Tutorials
– Examples
– StackOverflow posts
– Weird Chinese mailing-lists listing your stacktrace