Taboola's data processing architecture has evolved over time from directly writing to databases to using Apache Spark for scalable real-time processing. Spark allows Taboola to process terabytes of data daily across multiple data centers for real-time recommendations, analytics, and algorithm calibration. Key aspects of Taboola's architecture include using Cassandra for event storage, Spark for distributed computing, Mesos for cluster management, and Zookeeper for coordination across a large Spark cluster.
2. Recommendation Engine Inputs
• Collaborative Filtering – Bucketed Consumption Groups
• Geo – Region-based Recommendations
• Context – Metadata
• Social – Facebook/Twitter API
• User Behavior – Cookie Data
Engine focused on maximizing CTR & post-click engagement
4. What Does it Mean?
• Zero downtime allowed
– Every single component is fault tolerant
• 5 data centers across the globe
• Terabytes of data / day (many billions of events)
• Data must be processed and analyzed in real time, for example:
– Real-time, per-user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated recommendation algorithm calibration
– Real-time analytics
5. Taboola 2007
• Events and logs (rawdata) written directly to DB
• Recs are read from DB
• Crashed when CNN launched
[Diagram: Frontend RecServer writing to and reading from the DB]
6. Taboola 2007.5
• Same as before, but without direct write to DB
• Switching to bulk load
• But – very basic reporting, not scalable
[Diagram: Frontend RecServer → Bulk Load → DB]
7. Taboola 2008
• Introduced semi-realtime event parsing services: Session Parser and Session Analyzer
• Divided analysis work by unit (session)
• Files were pushed from RecServer(s) to backend processing
• Files are gzipped textual INSERT statements
• But – not real time enough
[Diagram: Frontend RecServer writes rawdata to NFS; in the backend, SessionParser reads the rawdata and writes session files, and SessionAnalyzer reads the session files and writes summarized data to the DB]
8. Taboola 2010
• Made a leap towards real-time stream processing
• Unified Session Parser and Session Analyzer into an in-memory service (without going through disk)
• Dramatic optimizations to memory allocation and data models
• Failure-safe architecture – can endure data delays and front-end server malfunctions
• No direct DB access – key for performance; bulk loading only, for loading hourly data
[Diagram: Frontend RecServer writes rawdata to NFS; the backend Session Parser + Analyzer reads the rawdata and writes hourly data to the DB via bulk loading]
9. Taboola 2011-2013
• Roughly the same architecture
• Handled backend growth by scaling up (monster machines)
• Introduced real-time analyzers
• Introduced sharding
• Moved to lsync-based file sync
• Introduced Top Reports capabilities
[Diagram: same flow as before, with Lsync replacing NFS for file transfer]
10. Taboola 2014
• Spark as the distributed engine for data analysis (and distributed computing in general)
• All critical data paths already moved to Spark
• New data modelling based on ProtoStuff(Buf)
• Easily scalable
• Easy ad hoc analysis/research
11. About Spark
• Open sourced
• Apache top-level project (since Feb. 19th, 2014)
• Databricks – a commercial company that supports it
• Hadoop-compatible computing engine
• Can run side-by-side with Hadoop/Hive on the same data
• Drastically faster than Hadoop through in-memory computing
• Multiple H/A options – standalone cluster, Apache Mesos with ZooKeeper, or YARN
12. Spark Development Community
• With over 100 developers and 25 companies, one of the most active communities in big data
• Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)
• Past 6 months: more active devs than Hadoop MapReduce!
15. Spark API
• Simple to write through easy APIs in Java, Scala and Python
• The same analytics code can be used for both streaming data and offline data processing
16. Spark Key Concepts
Resilient Distributed Datasets (RDDs)
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
• Immutable
Operations
• Transformations (e.g. map, filter, groupBy) – lazy operations that build RDDs from other RDDs
• Actions (e.g. count, collect, save) – return a result or write it to storage
Write programs in terms of transformations on distributed datasets
17. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                        # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))      # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
. . .

Full-text search of Wikipedia:
• 60GB on 20 EC2 machines
• 0.5 sec (in memory) vs. 20s on-disk
19. Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on a cluster
– Mesos, YARN or standalone mode
• Accesses storage systems via Hadoop InputFormat API
– Can use HBase, HDFS, S3, …
[Diagram: your application holds a SparkContext, which runs local threads and talks to a cluster manager; each Worker runs a Spark executor on top of HDFS or other storage]
20. System Architecture & Data Flow @ Taboola
[Diagram: FE Servers → C* Cluster → Spark Cluster (Driver + Consumers) → MySQL Cluster]
21. Execution Graph @ Taboola
• Data start point (dates, etc):
rdd1 = Context.parallelize([data])
• Loading data from external sources (Cassandra, MySQL, etc):
rdd2 = rdd1.mapPartitions(loadfunc())
• Aggregating the data and storing results:
rdd3 = rdd2.reduceByKey(reduceFunc())
• Saving the results to a DB:
rdd4 = rdd3.mapPartitions(saverfunc())
• Executing the above graph by forcing an output operation:
rdd4.count()
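To make the data flow concrete, here is a minimal local emulation of the three Spark primitives used above (parallelize, mapPartitions, reduceByKey) in plain Python. The load step, the fake data store, and all names are hypothetical illustrations, not Taboola's actual code; on the real cluster these operations run distributed and lazily.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

def parallelize(data, partitions=2):
    """Split a list into partitions (local stand-in for Context.parallelize)."""
    return [data[i::partitions] for i in range(partitions)]

def map_partitions(rdd, func):
    """Apply func to each partition's iterator (stand-in for mapPartitions)."""
    return [list(func(iter(part))) for part in rdd]

def reduce_by_key(rdd, func):
    """Merge values per key across all partitions (stand-in for reduceByKey)."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(rdd):
        groups[key].append(value)
    return [[(k, reduce(func, vs)) for k, vs in groups.items()]]

# Hypothetical load step: each "date" expands to (campaign, clicks) records.
def load_func(dates):
    fake_store = {"2014-02-19": [("camp1", 2), ("camp2", 1)],
                  "2014-02-20": [("camp1", 3)]}
    for d in dates:
        yield from fake_store[d]

rdd1 = parallelize(["2014-02-19", "2014-02-20"])       # data start point
rdd2 = map_partitions(rdd1, load_func)                 # load from external source
rdd3 = reduce_by_key(rdd2, lambda a, b: a + b)         # aggregate per campaign
result = dict(chain.from_iterable(rdd3))
print(result)  # {'camp1': 5, 'camp2': 1}
```

The final `dict(...)` plays the role of the output action (`rdd4.count()` above) that forces the graph to execute.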
22. Cassandra as a Distributed Storage
• Event log files saved as blobs to a dedicated keyspace
• C* tables holding the event log files are partitioned by day – a new table per day. This makes maintenance easier and loading into Spark simpler
• Using Astyanax driver + CQL3
– Recipe to load all keys of a table very fast (hundreds of thousands / sec)
– Split by keys and then load data by key in batches – in parallel partitions
• Wrote a Hadoop InputFormat that supports loading this into a lines RDD<String>
– The DataStax InputFormat had issues and at the time was not formally supported
• Worked well, but ended up not using it – instead using mapPartitions
– Very simple, no overhead, no need to be tied to Hadoop
– Will probably use the InputFormat when we deploy a Shark solution
• Plans to open source all this

Table userevent_2014-02-19 (and likewise userevent_2014-02-20):
Key (String)                     | Data (blob)
GUID (originally log file name)  | Gzipped file
GUID                             | Gzipped file
…                                | …
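The "split by keys, then load data by key in batches, in parallel" recipe can be sketched in plain Python with a thread pool. The `fetch_blob` function is a hypothetical stand-in for the real Astyanax/CQL3 query; everything else (table name, batch size, worker count) is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_blob(table, key):
    """Hypothetical stand-in for a CQL3 query fetching one blob by key."""
    return f"gzipped-data-for-{key}".encode()

def load_batch(table, keys):
    # One batch: fetch each key's blob sequentially within the batch.
    return [(k, fetch_blob(table, k)) for k in keys]

def load_table(table, all_keys, batch_size=2, workers=4):
    """Split the key list into batches and load the batches in parallel."""
    batches = [all_keys[i:i + batch_size]
               for i in range(0, len(all_keys), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda b: load_batch(table, b), batches)
    return [row for batch in results for row in batch]

rows = load_table("userevent_2014-02-19", ["guid1", "guid2", "guid3"])
print(len(rows))  # 3
```

Inside a Spark job, a function like `load_table` is what each `mapPartitions` task would run over its own slice of the key space.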
23. Sample – Click Counting for Campaign Stopping
1. mapPartitions – mapping from strings to objects with a pre-designed click key
2. reduceByKey – removing duplicate clicks (see next slide)
3. map – switch keys to a campaign+day key
4. reduceByKey – aggregate the data by campaign+day
24. Campaign Stopping – Removing Dup Clicks
• When more than 1 click is found from the same user on the same item, keep only the oldest
• Using accumulators to track duplicate counts
• Notice – not Spark specific
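As the slide notes, the dedup logic is not Spark specific. Here is a minimal plain-Python sketch of the pipeline from the previous slide: dedup per (user, item) keeping the oldest click, then re-key and aggregate by campaign+day. The click records, field layout, and the `duplicates` counter (playing the role of a Spark accumulator) are all hypothetical illustrations.

```python
from collections import defaultdict

# Hypothetical click records: (user, item, campaign, day, timestamp)
clicks = [
    ("u1", "i1", "camp1", "2014-02-19", 100),
    ("u1", "i1", "camp1", "2014-02-19", 90),   # same user+item, older – kept
    ("u2", "i1", "camp1", "2014-02-19", 120),
    ("u1", "i2", "camp2", "2014-02-19", 110),
]

duplicates = 0  # plays the role of a Spark accumulator

# Steps 1-2: key by (user, item); when duplicates appear, keep only the oldest.
oldest = {}
for user, item, campaign, day, ts in clicks:
    key = (user, item)
    if key in oldest:
        duplicates += 1
        if oldest[key][4] <= ts:
            continue  # existing click is older – keep it
    oldest[key] = (user, item, campaign, day, ts)

# Steps 3-4: re-key by (campaign, day) and count the surviving clicks.
per_campaign = defaultdict(int)
for user, item, campaign, day, ts in oldest.values():
    per_campaign[(campaign, day)] += 1

print(dict(per_campaign))  # {('camp1', '2014-02-19'): 2, ('camp2', '2014-02-19'): 1}
print(duplicates)          # 1
```

In the real job, the two dedup steps map onto the first `mapPartitions` + `reduceByKey`, and the aggregation onto the following `map` + `reduceByKey`.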
25. Our Deployment
• 16 nodes, each –
– 24 cores
– 256GB RAM
– 6 × 1TB SSD disks – JBOD configuration
– 10G Ethernet
• Total cluster power
– 4096GB RAM
– 384 CPUs
– 96TB storage (effective space is less; Cassandra keyspaces are defined with replication factor 3)
• Symmetric deployment
– Mesos + Spark
– Cassandra
• More
– RabbitMQ on 3 nodes
– ZooKeeper on 3 nodes
– MySQL cluster outside this cluster
• Loads & processes ~1 terabyte (unzipped data) in ~3 minutes
26. Things that work well with Spark (from our experience)
• Very easy to code complex jobs
– Harder than SQL, but better than other MapReduce options
– Simple concepts, "small" API
• Easy to unit test
– Runs in local mode, so ideal for micro E2E tests
– Each mapper/reducer can be unit tested without Spark – if you do not use anonymous classes
• Very resilient
• Can read/write to/from any data source, including RDBMS, Cassandra, HDFS, local files, etc.
• Great monitoring
• Easy to deploy & upgrade
• Blazing fast
27. Things that do not work that well (from our experience)
• Long (endless) running tasks require some workarounds
– Temp files – Spark creates a lot of files in spark.local.dir, which requires periodic cleanup
– Use spark.cleaner.ttl for long running tasks
• Spark Streaming – not fully mature when we tested it
– Some edge cases can cause loss of data
– Sliding window / batch model does not fit our needs
• We always load some history to deal with late-arriving data
• State management is left to the user and is not trivial
– BUT – we were able to easily implement a bulletproof, home-grown, near real time streaming solution with a minimal amount of code
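The two long-running-task settings mentioned above can be passed as ordinary Spark configuration. A sketch, with illustrative values (the SSD paths and 24-hour TTL are assumptions, not Taboola's actual settings; spark.cleaner.ttl applied to the Spark versions of this era and was later removed):

```shell
# Illustrative spark-submit flags for a long-running job
spark-submit \
  --conf spark.local.dir=/mnt/ssd1/spark,/mnt/ssd2/spark \
  --conf spark.cleaner.ttl=86400 \
  your_job.py
```

Listing several spark.local.dir paths spreads shuffle/temp files across the JBOD SSDs; the TTL lets Spark forget old metadata and temp files instead of growing without bound.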
28. General / Optimization Tips
• Use Spark accumulators to collect and report operational data
• 10G Ethernet
• Multiple SSD disks per node, JBOD configuration
• A lot of memory for the cluster
29. Technologies Taboola Uses for Spark
• Spark – computing cluster
• Mesos – cluster resource manager
– Better resource allocation (coarse-grained) for Spark
• ZooKeeper – distributed coordination
– Enables multi-master for Mesos & Spark
• Cassandra – distributed data store
• Monitoring – http://metrics.codahale.com/
30. Attributions
Many of the general Spark slides were taken from the Databricks Spark Summit 2013 slides.
There are great materials at:
https://spark.apache.org/
http://spark-summit.org/summit-2013/