Taboola's data processing architecture has evolved over time from directly writing to databases to using Apache Spark for scalable real-time processing. Spark allows Taboola to process terabytes of data daily across multiple data centers for real-time recommendations, analytics, and algorithm calibration. Key aspects of Taboola's architecture include using Cassandra for event storage, Spark for distributed computing, Mesos for cluster management, and Zookeeper for coordination across a large Spark cluster.
2. Recommendation Engine Inputs
• Collaborative Filtering – Bucketed Consumption Groups
• Geo – Region-based Recommendations
• Context – Metadata
• Social – Facebook/Twitter API
• User Behavior – Cookie Data
Engine focused on maximizing CTR & post-click engagement
4. What Does it Mean?
• Zero downtime allowed
– Every single component is fault tolerant
• 5 data centers across the globe
• Terabytes of data / day (many billions of events)
• Data must be processed and analyzed in real time, for example:
– Real-time, per-user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated recommendation algorithm calibration
– Real-time analytics
5. Taboola 2007
• Events and logs (rawdata) written directly to DB
• Recs are read from DB
• Crashed when CNN launched
[Diagram: Frontend RecServer writing to and reading from the DB]
6. Taboola 2007.5
• Same as before, but without direct write to DB
• Switching to bulk load
• But – very basic reporting, not scalable
[Diagram: Frontend RecServer → Bulk Load → DB]
7. Taboola 2008
• Introduced semi-realtime event parsing services: Session Parser and Session Analyzer
• Divided analysis work by unit (session)
• Files were pushed from RecServer(s) to backend processing
• Files are gzipped textual INSERT statements
• But – not real time enough
[Diagram: Frontend RecServer writes rawdata to NFS; in the backend, SessionParser reads the rawdata and writes session files, and SessionAnalyzer reads the session files and writes summarized data to the DB]
8. Taboola 2010
• Made a leap towards real-time stream processing
• Unified Session Parser and Session Analyzer into an in-memory service (without going through disk)
• Dramatic optimizations to memory allocation and data models
• Failure-safe architecture – can endure data delays and front-end server malfunctions
• No direct DB access – key for performance; bulk loading only, for loading hourly data
[Diagram: Frontend RecServer writes rawdata to NFS; the backend Session Parser + Analyzer reads the rawdata and writes hourly data to the DB via bulk loading]
9. Taboola 2011-2013
• Roughly the same architecture
• Handled backend growth by scaling up (monster machines)
• Introduced real-time analyzers
• Introduced sharding
• Moved to lsync-based file sync
• Introduced Top Reports capabilities
[Diagram: same flow as before, with Lsync replacing NFS for file transfer]
10. Taboola 2014
• Spark as the distributed engine for data analysis (and distributed computing in general)
• All critical data paths already moved to Spark
• New data modelling based on ProtoStuff(Buf)
• Easily scalable
• Easy ad hoc analysis/research
11. About Spark
• Open sourced
• Apache top-level project (since Feb. 19th, 2014)
• Databricks – a commercial company that supports it
• Hadoop-compatible computing engine
• Can run side-by-side with Hadoop/Hive on the same data
• Drastically faster than Hadoop through in-memory computing
• Multiple H/A options – standalone cluster, Apache Mesos with ZooKeeper, or YARN
12. Spark Development Community
• With over 100 developers and 25 companies, one of the most active communities in big data
• Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)
• Past 6 months: more active devs than Hadoop MapReduce!
15. Spark API
• Simple to write through easy APIs in Java, Scala and Python
• The same analytics code can be used for both streaming data and offline data processing
16. Spark Key Concepts
Resilient Distributed Datasets (RDDs)
• Collections of objects spread across a cluster, stored in RAM or on disk
• Built through parallel transformations
• Automatically rebuilt on failure
• Immutable
Operations
• Transformations (e.g. map, filter, groupBy) – lazy operations that build RDDs from other RDDs
• Actions (e.g. count, collect, save) – return a result or write it to storage
Write programs in terms of transformations on distributed datasets
17. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                        # base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))      # transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()
messages.filter(lambda s: "php" in s).count()
. . .

Full-text search of Wikipedia:
• 60GB on 20 EC2 machines
• 0.5 sec (in memory) vs. 20s on-disk
19. Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on a cluster
– Mesos, YARN or standalone mode
• Accesses storage systems via Hadoop InputFormat API
– Can use HBase, HDFS, S3, …
[Diagram: your application holds a SparkContext, which runs local threads and talks to a cluster manager; each Worker runs a Spark executor on top of HDFS or other storage]
20. System Architecture & Data Flow @ Taboola
[Diagram: FE Servers → C* Cluster → Spark Cluster (Driver + Consumers) → MySQL Cluster]
21. Execution Graph @ Taboola
• Data start point (dates, etc):
rdd1 = Context.parallelize([data])
• Loading data from external sources (Cassandra, MySQL, etc):
rdd2 = rdd1.mapPartitions(loadfunc())
• Aggregating the data and storing results:
rdd3 = rdd2.reduceByKey(reduceFunc())
• Saving the results to a DB:
rdd4 = rdd3.mapPartitions(saverfunc())
• Executing the above graph by forcing an output operation:
rdd4.count()
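To make the data flow concrete, here is a minimal local emulation of the three Spark primitives used above (parallelize, mapPartitions, reduceByKey) in plain Python. The load step, the fake data store, and all names are hypothetical illustrations, not Taboola's actual code; on the real cluster these operations run distributed and lazily.

```python
from collections import defaultdict
from functools import reduce
from itertools import chain

def parallelize(data, partitions=2):
    """Split a list into partitions (local stand-in for Context.parallelize)."""
    return [data[i::partitions] for i in range(partitions)]

def map_partitions(rdd, func):
    """Apply func to each partition's iterator (stand-in for mapPartitions)."""
    return [list(func(iter(part))) for part in rdd]

def reduce_by_key(rdd, func):
    """Merge values per key across all partitions (stand-in for reduceByKey)."""
    groups = defaultdict(list)
    for key, value in chain.from_iterable(rdd):
        groups[key].append(value)
    return [[(k, reduce(func, vs)) for k, vs in groups.items()]]

# Hypothetical load step: each "date" expands to (campaign, clicks) records.
def load_func(dates):
    fake_store = {"2014-02-19": [("camp1", 2), ("camp2", 1)],
                  "2014-02-20": [("camp1", 3)]}
    for d in dates:
        yield from fake_store[d]

rdd1 = parallelize(["2014-02-19", "2014-02-20"])       # data start point
rdd2 = map_partitions(rdd1, load_func)                 # load from external source
rdd3 = reduce_by_key(rdd2, lambda a, b: a + b)         # aggregate per campaign
result = dict(chain.from_iterable(rdd3))
print(result)  # {'camp1': 5, 'camp2': 1}
```

The final `dict(...)` plays the role of the output action (`rdd4.count()` above) that forces the graph to execute.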
22. Cassandra as a Distributed Storage
• Event log files saved as blobs to a dedicated keyspace
• C* tables holding the event log files are partitioned by day – a new table per day. This makes maintenance easier and loading into Spark simpler
• Using Astyanax driver + CQL3
– Recipe to load all keys of a table very fast (hundreds of thousands / sec)
– Split by keys and then load data by key in batches – in parallel partitions
• Wrote a Hadoop InputFormat that supports loading this into a lines RDD<String>
– The DataStax InputFormat had issues and at the time was not formally supported
• Worked well, but ended up not using it – instead using mapPartitions
– Very simple, no overhead, no need to be tied to Hadoop
– Will probably use the InputFormat when we deploy a Shark solution
• Plans to open source all this

Table userevent_2014-02-19 (and likewise userevent_2014-02-20):
Key (String)                     | Data (blob)
GUID (originally log file name)  | Gzipped file
GUID                             | Gzipped file
…                                | …
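The "split by keys, then load data by key in batches, in parallel" recipe can be sketched in plain Python with a thread pool. The `fetch_blob` function is a hypothetical stand-in for the real Astyanax/CQL3 query; everything else (table name, batch size, worker count) is illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_blob(table, key):
    """Hypothetical stand-in for a CQL3 query fetching one blob by key."""
    return f"gzipped-data-for-{key}".encode()

def load_batch(table, keys):
    # One batch: fetch each key's blob sequentially within the batch.
    return [(k, fetch_blob(table, k)) for k in keys]

def load_table(table, all_keys, batch_size=2, workers=4):
    """Split the key list into batches and load the batches in parallel."""
    batches = [all_keys[i:i + batch_size]
               for i in range(0, len(all_keys), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda b: load_batch(table, b), batches)
    return [row for batch in results for row in batch]

rows = load_table("userevent_2014-02-19", ["guid1", "guid2", "guid3"])
print(len(rows))  # 3
```

Inside a Spark job, a function like `load_table` is what each `mapPartitions` task would run over its own slice of the key space.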
23. Sample – Click Counting for Campaign Stopping
1. mapPartitions – mapping from strings to objects with a pre-designed click key
2. reduceByKey – removing duplicate clicks (see next slide)
3. map – switch keys to a campaign+day key
4. reduceByKey – aggregate the data by campaign+day
24. Campaign Stopping – Removing Dup Clicks
• When more than 1 click is found from the same user on the same item, keep only the oldest
• Using accumulators to track duplicate counts
• Notice – not Spark specific
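As the slide notes, the dedup logic is not Spark specific. Here is a minimal plain-Python sketch of the pipeline from the previous slide: dedup per (user, item) keeping the oldest click, then re-key and aggregate by campaign+day. The click records, field layout, and the `duplicates` counter (playing the role of a Spark accumulator) are all hypothetical illustrations.

```python
from collections import defaultdict

# Hypothetical click records: (user, item, campaign, day, timestamp)
clicks = [
    ("u1", "i1", "camp1", "2014-02-19", 100),
    ("u1", "i1", "camp1", "2014-02-19", 90),   # same user+item, older – kept
    ("u2", "i1", "camp1", "2014-02-19", 120),
    ("u1", "i2", "camp2", "2014-02-19", 110),
]

duplicates = 0  # plays the role of a Spark accumulator

# Steps 1-2: key by (user, item); when duplicates appear, keep only the oldest.
oldest = {}
for user, item, campaign, day, ts in clicks:
    key = (user, item)
    if key in oldest:
        duplicates += 1
        if oldest[key][4] <= ts:
            continue  # existing click is older – keep it
    oldest[key] = (user, item, campaign, day, ts)

# Steps 3-4: re-key by (campaign, day) and count the surviving clicks.
per_campaign = defaultdict(int)
for user, item, campaign, day, ts in oldest.values():
    per_campaign[(campaign, day)] += 1

print(dict(per_campaign))  # {('camp1', '2014-02-19'): 2, ('camp2', '2014-02-19'): 1}
print(duplicates)          # 1
```

In the real job, the two dedup steps map onto the first `mapPartitions` + `reduceByKey`, and the aggregation onto the following `map` + `reduceByKey`.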
25. Our Deployment
• 16 nodes, each –
– 24 cores
– 256GB RAM
– 6 × 1TB SSD disks – JBOD configuration
– 10G Ethernet
• Total cluster power
– 4096GB RAM
– 384 CPUs
– 96TB storage (effective space is less; Cassandra keyspaces are defined with replication factor 3)
• Symmetric deployment
– Mesos + Spark
– Cassandra
• More
– RabbitMQ on 3 nodes
– ZooKeeper on 3 nodes
– MySQL cluster outside this cluster
• Loads & processes ~1 terabyte (unzipped data) in ~3 minutes
26. Things that work well with Spark (from our experience)
• Very easy to code complex jobs
– Harder than SQL, but better than other MapReduce options
– Simple concepts, "small" API
• Easy to unit test
– Runs in local mode, so ideal for micro E2E tests
– Each mapper/reducer can be unit tested without Spark – if you do not use anonymous classes
• Very resilient
• Can read/write to/from any data source, including RDBMS, Cassandra, HDFS, local files, etc.
• Great monitoring
• Easy to deploy & upgrade
• Blazing fast
27. Things that do not work that well (from our experience)
• Long (endless) running tasks require some workarounds
– Temp files – Spark creates a lot of files in spark.local.dir, which requires periodic cleanup
– Use spark.cleaner.ttl for long running tasks
• Spark Streaming – not fully mature when we tested it
– Some edge cases can cause loss of data
– Sliding window / batch model does not fit our needs
• We always load some history to deal with late-arriving data
• State management is left to the user and is not trivial
– BUT – we were able to easily implement a bulletproof, home-grown, near real time streaming solution with a minimal amount of code
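The two long-running-task settings mentioned above can be passed as ordinary Spark configuration. A sketch, with illustrative values (the SSD paths and 24-hour TTL are assumptions, not Taboola's actual settings; spark.cleaner.ttl applied to the Spark versions of this era and was later removed):

```shell
# Illustrative spark-submit flags for a long-running job
spark-submit \
  --conf spark.local.dir=/mnt/ssd1/spark,/mnt/ssd2/spark \
  --conf spark.cleaner.ttl=86400 \
  your_job.py
```

Listing several spark.local.dir paths spreads shuffle/temp files across the JBOD SSDs; the TTL lets Spark forget old metadata and temp files instead of growing without bound.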
28. General / Optimization Tips
• Use Spark accumulators to collect and report operational data
• 10G Ethernet
• Multiple SSD disks per node, JBOD configuration
• A lot of memory for the cluster
29. Technologies Taboola Uses for Spark
• Spark – computing cluster
• Mesos – cluster resource manager
– Better resource allocation (coarse-grained) for Spark
• ZooKeeper – distributed coordination
– Enables multi-master for Mesos & Spark
• Cassandra – distributed data store
• Monitoring – http://metrics.codahale.com/
30. Attributions
Many of the general Spark slides were taken from the Databricks Spark Summit 2013 slides.
There are great materials at:
https://spark.apache.org/
http://spark-summit.org/summit-2013/