4. YEAH, BUT WHAT ABOUT REAL NUMBERS?
We have those too, but you’ll note that the marketing team was not involved in the design
• 1.4M requests per second average across 6 DCs
• 30B events logged and imported into HDFS each day
• 2.5B displays
• 840M unique users
…and 70 BI Analysts who want to play with it all…
5. DATA ANALYSTS
Data analysts (or BIs) use our data to:
– Improve our algorithms and strategy
– Build financial reports
They obviously learn SQL
6. QUERYING TOOL
Of course we have a « fast » datastore fitting 80% of
analyst needs…
… with limited updates per day
[Architecture diagram: LOGS are batch-loaded into Hadoop (log store & compute platform), where some PhD-level computations and transformations run; results are batch-loaded into Vertica (datamart), which analysts query.]
8. MAKE LIFE BETTER
Can we get « big data » faster?
• Current situation: we have PBs of useful data to handle
– We have a low/middle-latency datastore
– But batches take hours (literally)
• e.g. arbitrage events are up to 200 GB per hour
9. STREAMING ISSUES
But we do have Storm!
• Fault tolerance???
• We need to recompute historical data whenever a new metric emerges
• To be honest, those issues can be overcome, but at a high cost
– We already have a satisfying Hadoop architecture
– The switch to a fully streamed architecture would require rebuilding our backends
10. OUR WISH
Ideally, we’d like…
Hadoop with reliable data
• Get reliable data
• Process lots of events (daily/monthly/quarterly aggregates)
Storm with fresh data
• Get a first overview of our aggregates faster
– Errors are bounded to the current « batch » being processed
12. WHAT ARE THE CHOICES THEN?
One job to write per platform?
– MapReduce/Cascading/Hive jobs
– Storm topology
Learning different technologies to achieve the same result
Risk of discrepancies
Deployment complexity
Our choice: SummingBird
PIOU!
13. CORE CONCEPTS
Main concept: Platform[P]
Currently P = [Scalding | Storm | Spark]
Every job is written with the same piece of code, in Scala:
• Storm topology
• Scalding pipes
• And the thing that runs on Spark (?!)
14. SUMMINGBIRD: CODE SAMPLE
[Code screenshot with callouts:]
• Data source (either an HDFS path or a Kafka queue)
• All the processing happens here (converted to either a Storm topology or a Cascading job)
• We merge the processing with another source
• This is the output of the job
• We sum the output of both sources (DisplayClick) inside a datastore (Memcache, MySQL…)
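As a rough illustration of the shape of that job, here is a self-contained plain-Scala analogue (NOT the real SummingBird API — `DisplayClick`, `plus` and `sumByKey` are simplified stand-ins): two sources are merged, keyed, and summed into a key-value store.

```scala
// Self-contained analogue of the job pictured on this slide (plain Scala,
// not the real SummingBird API): two event sources are merged, keyed, and
// their values summed into a key-value "store". All names are illustrative.
case class DisplayClick(displays: Long, clicks: Long)

// Monoid-style sum applied by the store when two values share a key.
def plus(a: DisplayClick, b: DisplayClick): DisplayClick =
  DisplayClick(a.displays + b.displays, a.clicks + b.clicks)

def sumByKey(events: Seq[(String, DisplayClick)],
             store: Map[String, DisplayClick]): Map[String, DisplayClick] =
  events.foldLeft(store) { case (acc, (k, v)) =>
    acc.updated(k, acc.get(k).map(plus(_, v)).getOrElse(v))
  }

// Two sources (think: an HDFS path and a Kafka queue), merged then summed.
val displays = Seq("campA" -> DisplayClick(1, 0), "campB" -> DisplayClick(1, 0))
val clicks   = Seq("campA" -> DisplayClick(0, 1))
val result   = sumByKey(displays ++ clicks, Map.empty)
```

In the real API the same pipeline compiles to either a Storm topology or a Cascading job depending on the platform it is bound to.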
15. BATCHID: CORE OF SUMMINGBIRD API
• We merge previous Hadoop results with the new batch
• It means Storm results are volatile
• A BatchID is a unit of time, e.g. 1h, 6h…
• The BatchID is set to the Hadoop frequency to bound errors
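A hypothetical sketch of the BatchID idea (the real API differs): time is cut into fixed-size batches at the Hadoop frequency, here 6 hours, so any streaming error stays bounded to the batch currently being processed.

```scala
// Hypothetical sketch of a BatchID (not the real SummingBird type): time is
// divided into fixed-size batches matching the Hadoop batch frequency.
val batchSizeMs: Long = 6L * 3600 * 1000  // a 6-hour batch

def batchIdOf(timestampMs: Long): Long = timestampMs / batchSizeMs

// Two events 6 hours apart land in consecutive batches.
val b0 = batchIdOf(0L)
val b1 = batchIdOf(batchSizeMs)
```

Once Hadoop has processed a batch, its reliable results replace whatever Storm computed for that BatchID.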
16. UPDATING DATA IN REAL-TIME
We use a data type: monoids
« In abstract algebra, a branch of mathematics, a monoid is an algebraic structure with a single associative binary operation and an identity element. » Wikipedia
Monoids are cool
– Associativity loves parallelism
– No need to be commutative
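To see why associativity loves parallelism, here is a minimal monoid in plain Scala (a simplified version of what Algebird provides): the same sum can be computed in one sequential pass (Storm-style) or over chunks (Hadoop-style) with identical results.

```scala
// A monoid: an associative `plus` and an identity `zero`. Associativity is
// what lets the sum be computed sequentially or in parallel chunks with
// the same result, regardless of how the data is split.
trait Monoid[T] { def zero: T; def plus(a: T, b: T): T }

val longSum: Monoid[Long] = new Monoid[Long] {
  def zero = 0L
  def plus(a: Long, b: Long) = a + b
}

def sumAll[T](xs: Seq[T])(m: Monoid[T]): T = xs.foldLeft(m.zero)(m.plus)

val data = (1L to 100L).toSeq
// One sequential pass vs. ten chunks summed independently, then combined.
val sequential = sumAll(data)(longSum)
val chunked =
  sumAll(data.grouped(10).map(chunk => sumAll(chunk)(longSum)).toSeq)(longSum)
```

No commutativity is needed: the chunks are recombined in order, only the grouping changes.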
20. BIJECTION
How can we transform an entity to
– An Array[Byte] for Memcached store?
– A String for an in-memory Map[String, String] store?
Another cool library: Bijection!
« In mathematics, an injective function or injection or one-to-one
function is a function that preserves distinctness: it never maps
distinct elements of its domain to the same element of its codomain ».
Wikipedia
Example: Int => String is an injection into the set of Strings: distinct Ints map to distinct Strings, but the inverse is only partial, because
« foo » => Int is not possible
« 12.42 » => Int is not possible
21. BIJECTION: CODE SAMPLE
[Code screenshot with callouts:]
• A simple Injection from Twitter
• From Int to String
• From String to maybe an Int
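A plain-Scala version of the same idea (mirroring the shape of Bijection's `Injection[Int, String]`, but without the library — the function names here are my own): the forward direction always succeeds, while the inverse returns an Option because it may fail.

```scala
// Plain-Scala sketch of an Injection between Int and String (names are
// illustrative, not the Bijection API): Int => String always succeeds,
// String => Int may fail, hence the Option.
def intToString(i: Int): String = i.toString

def stringToInt(s: String): Option[Int] =
  try Some(s.toInt)
  catch { case _: NumberFormatException => None }

val roundTrip = stringToInt(intToString(42))  // Some(42)
val bad1 = stringToInt("foo")                 // None: « foo » => Int impossible
val bad2 = stringToInt("12.42")               // None
```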
22. DATA PERSISTENENCE
• Now we know how to convert a (CampaignAffiliate,
DisplayClick) tuple to an Array[Byte] and we can even
modify it
• One last thing is missing: what’s the interaction with the
data store?
Last cool thing: Storehaus!!!
23. STOREHAUS
« Storehaus is a library that makes it easy to work
with asynchronous key value stores. Storehaus is
built on top of Twitter's Future. » Twitter’s GitHub
What is really cool: storehaus-algebird
• Makes all stores mergeable (using Monoid[V])
24. STOREHAUS: CODE SAMPLE
[Code screenshot with callouts:]
• An Injection, because MySqlStore only takes MySqlValues
• The original store, with only put/get operations
• Make the store Mergeable to allow updates
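The sketch below models the idea in self-contained plain Scala (NOT the real Storehaus API — `KVStore` and `merge` are illustrative stand-ins): a plain put/get store gains a `merge` operation that combines the incoming value with the current one, via a monoid-style sum, instead of overwriting it.

```scala
// Self-contained model of a "mergeable" store (not the real Storehaus API):
// a plain put/get key-value store, plus a merge that accumulates values.
class KVStore[K, V] {
  private var data = Map.empty[K, V]
  def get(k: K): Option[V] = data.get(k)
  def put(k: K, v: V): Unit = data += (k -> v)
}

// merge reads the current value, sums with the new one (monoid-style),
// and writes the result back: updates accumulate instead of overwriting.
def merge[K](store: KVStore[K, Long], k: K, v: Long): Unit =
  store.put(k, store.get(k).getOrElse(0L) + v)

val store = new KVStore[String, Long]
merge(store, "campA", 3L)
merge(store, "campA", 4L)  // now holds 7, not 4
```

In Storehaus the real stores are asynchronous (built on Twitter's Future), which this toy model ignores.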
25. SUMMINGBIRD CLIENT
[Diagram: Hadoop feeds a batch store, Storm feeds a live store; the client query reads both.]
With 2 different stores: we have to use the ClientStore from SummingBird to merge online and offline data
With a single store: we need to know whether the data is reliable or not
27. CLIENT QUERY
We want to query data up to BatchID #3:
• We get the latest processed batch from Hadoop (#1)
• We merge those results with the 2 latest batches from Storm (#2 and #3)
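The client-side merge can be sketched as follows (simplified, with hypothetical numbers): Hadoop has reliably summed everything up to batch #1, Storm holds volatile per-batch sums for #2 and #3, and the client adds them together.

```scala
// Sketch of the client query (hypothetical data, not the real ClientStore
// API): reliable Hadoop results up to batch #1, plus volatile Storm sums
// for batches #2 and #3, combined with the same monoid-style sum.
val hadoopUpToBatch1 = Map("campA" -> 10L)
val stormBatches = Seq(
  Map("campA" -> 3L),  // batch #2
  Map("campA" -> 1L)   // batch #3
)

val merged: Map[String, Long] =
  (hadoopUpToBatch1 +: stormBatches)
    .flatten                     // all (key, value) pairs from every store
    .groupBy(_._1)               // group by key
    .map { case (k, kvs) => k -> kvs.map(_._2).sum }
```

Once Hadoop catches up and processes batch #2, its reliable result simply replaces the corresponding Storm sum in the merge.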
28. DEPLOY EVERYTHING
• We decided to build one assembly per platform…
– … using SBT
– Unfortunately, the two samples you can find on all the Internets are monolithic
• This is how we run the Scalding part
$> java <f******g lots of Hadoop conf keys> –cp <sh*t
classpath value> org.apache.hadoop.commons.RunJar job.jar
JobName –arg1 value –arg2 value
• The Storm JAR assembly is deployed as usual
29. BENEFITS OF SUMMINGBIRD
Cool
• Write once, run everywhere
– We plan to deploy pure Scalding jobs thanks to SummingBird
– Same can be done with Storm
• I didn’t need to learn any Storm o/
• Use of Scala to write concise jobs
– Especially after some months using Cascading...
• Opportunity to do some cool Open Source coding
Not cool
• SummingBird = SummingBird + Storehaus + Bijection + Algebird
– Need to learn all those APIs
– And it’s hard if you’re not a Scala master
• Not all of these libraries are up to date
• As for any new Open Source project, you won’t find:
– Tutorials
– Examples
– StackOverflow posts
– Weird Chinese mailing-lists listing your stacktrace