At Taboola we receive a constant feed of data (many billions of user events a day) and use Apache Spark together with Cassandra for both real-time stream processing and offline data processing. We'd like to share our experience with these cutting-edge technologies.
Apache Spark is an open source, Hadoop-compatible computing engine that makes big data analysis drastically faster through in-memory computing, and simpler to write through easy APIs in Java, Scala and Python. The project was born as part of PhD work at UC Berkeley's AMPLab (part of the BDAS stack, pronounced "Bad Ass") and became an Apache incubator project with more active contributors than Hadoop. Surprisingly, Yahoo! is one of the biggest contributors to the project and already runs large production clusters of Spark on YARN.
Spark can run as a standalone cluster, or on Apache Mesos (with ZooKeeper) or YARN, and can run side by side with Hadoop/Hive on the same data.
One of Spark's biggest benefits is its very simple API: the same analytics code can be used for both streaming data and offline data processing.
2. Engine Focused on Maximizing CTR & Post-Click Engagement
(Diagram: signals feeding the recommendation engine)
• Context – Metadata
• Geo – Region-based Recommendations
• User Behavior – Cookie Data
• Social – Facebook/Twitter API
• Collaborative Filtering – Bucketed Consumption Groups
4. What Does it Mean?
• 5 Data Centers across the globe
• Terabytes of data / day (many billions of events)
• Data must be processed and analyzed in real time, for example:
– Real-time, per-user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated recommendation algorithm calibration
– Real-time analytics
5. About Spark
• Open sourced
• Apache top-level project (since Feb. 19th)
• DataBricks – a commercial company that supports it
• Hadoop-compatible computing engine
• Can run side-by-side with Hadoop/Hive on the same data
• Drastically faster than Hadoop through in-memory computing
• Multiple H/A options – standalone cluster, Apache Mesos with ZooKeeper, or YARN
6. Spark Development Community
• With over 100 developers and 25 companies, one of the most active communities in big data
• Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)
• Past 6 months: more active devs than Hadoop MapReduce!
9. Spark API
• Simple to write through easy APIs in Java, Scala and Python
• The same analytics code can be used for both streaming data and offline data processing
10. Spark Key Concepts
Write programs in terms of transformations on distributed datasets.
• Resilient Distributed Datasets (RDDs)
– Collections of objects spread across a cluster, stored in RAM or on disk
– Built through parallel transformations
– Automatically rebuilt on failure
• Operations
– Transformations (e.g. map, filter, groupBy)
– Actions (e.g. count, collect, save)
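The transformation/action split above can be sketched in a few lines of plain Python. This is a toy, single-machine model of the idea (the `MiniRDD` name and structure are illustrative, not Spark's API): transformations only record work to be done, and an action forces evaluation.

```python
# A minimal, single-machine sketch of the RDD model: transformations
# are lazy (they build a new MiniRDD wrapping a deferred computation),
# actions force evaluation and return a value.

class MiniRDD:
    def __init__(self, compute):
        self._compute = compute  # zero-arg function producing the data

    # --- transformations: lazy, return a new MiniRDD ---
    def map(self, f):
        return MiniRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self._compute() if pred(x)])

    # --- actions: force evaluation and return a value ---
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())


def parallelize(data):
    return MiniRDD(lambda: list(data))


lines = parallelize(["ERROR mysql down", "INFO ok", "ERROR php fatal"])
errors = lines.filter(lambda s: s.startswith("ERROR"))  # nothing runs yet
print(errors.count())  # the action triggers the whole chain -> 2
```

Real Spark adds partitioning, lineage-based fault recovery and caching on top of this same lazy structure.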
11. Working With RDDs
(Diagram: a base RDD flows through transformations into new RDDs; an action finally returns a value.)

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()   # 74
linesWithSpark.first()   # "# Apache Spark"
12. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")                     # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()          # Action
messages.filter(lambda s: "php" in s).count()
. . .

(Diagram: the driver sends tasks to workers; each worker reads a block of the file and caches its partition of messages, so repeated queries hit the cache.)

Full-text search of Wikipedia:
• 60GB on 20 EC2 machines
• 0.5 sec (cached) vs. 20s on-disk
14. Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on a cluster
– Mesos, YARN or standalone mode
• Accesses storage systems via the Hadoop InputFormat API
– Can use HBase, HDFS, S3, …

(Diagram: your application holds a SparkContext, which talks to a cluster manager or local threads; workers run Spark executors on top of HDFS or other storage.)
15. System Architecture & Data Flow @ Taboola
(Diagram: FE Servers, C* Cluster, Spark Cluster with Driver + Consumers, MySQL Cluster.)
16. Execution Graph @ Taboola
rdd1 = Context.parallelize([data])        • Data start point (dates, etc.)
rdd2 = rdd1.mapPartitions(loadfunc())     • Loading data from external sources (Cassandra, MySQL, etc.)
rdd3 = rdd2.reduce(reduceFunc())          • Aggregating the data and storing results
rdd4 = rdd3.mapPartitions(saverfunc())    • Saving the results to a DB
rdd4.count()                              • Executing the above graph by forcing an output operation
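The shape of this graph can be sketched without Spark at all. In the sketch below, `load_partition`, `reduce_events` and `save_partition` are illustrative stand-ins for the `loadfunc`/`reduceFunc`/`saverfunc` on the slide, and the in-memory `fake_store` dict stands in for Cassandra/MySQL.

```python
# A pure-Python sketch of the execution graph above (no Spark needed):
# start from dates, load one partition per date, aggregate, then save.

from collections import Counter
from functools import reduce

def load_partition(date):
    # Pretend each "partition" loads (event_key, count) pairs for one
    # day from an external source such as Cassandra or MySQL.
    fake_store = {
        "2014-02-19": [("click", 2), ("view", 5)],
        "2014-02-20": [("click", 1), ("view", 3)],
    }
    return fake_store[date]

def reduce_events(acc, part):
    # Merge per-partition counts into one sorted aggregate.
    merged = Counter(dict(acc))
    merged.update(dict(part))
    return sorted(merged.items())

saved = []
def save_partition(events):
    # Stand-in for writing results to a DB.
    saved.extend(events)

dates = ["2014-02-19", "2014-02-20"]             # rdd1: data start point
partitions = [load_partition(d) for d in dates]  # rdd2: load per partition
aggregated = reduce(reduce_events, partitions)   # rdd3: aggregate
save_partition(aggregated)                       # rdd4: save
print(saved)  # [('click', 3), ('view', 8)]
```

In real Spark the final `count()` plays the role of the forced output; here the list comprehension and `reduce` run eagerly, which is exactly the difference lazy RDDs hide.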
17. Cassandra as a Distributed Storage
• Event Log Files saved as blobs to a dedicated keyspace
• C* tables holding the Event Log Files are partitioned by day – a new table per day. This makes maintenance easier and loading into Spark simpler
• Using the Astyanax driver + CQL3
• Wrote a Hadoop InputFormat that supports loading this into a lines RDD<String>
– The DataStax InputFormat had issues and at the time was not formally supported
– Worked well, but ended up not using it – instead using mapPartitions
• Recipe to load all keys of a table very fast (hundreds of thousands / sec)
– Split by keys and then load data by key in batches – in parallel partitions
– Very simple, no overhead, no need to be tied to Hadoop
• Will probably use the InputFormat when we deploy a Shark solution
• Plans to open source all this

Table layout (userevent_2014-02-19, userevent_2014-02-20, …):
• Key (String) – GUID (originally the log file name)
• Data (blob) – gzipped file
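The "load all keys, then fetch data by key in batches, in parallel" recipe can be sketched as follows. This is a hedged illustration: the in-memory `table` dict and the `fetch_batch` function stand in for a real daily Cassandra table and the driver's batch reads.

```python
# Sketch of the parallel key-batch loading recipe: enumerate all keys,
# split them into fixed-size batches, and fetch batches concurrently.

from concurrent.futures import ThreadPoolExecutor

# Stand-in for one day's table: GUID -> gzipped blob.
table = {f"guid-{i}": f"gzipped-file-{i}" for i in range(10)}

def fetch_batch(keys):
    # Stand-in for a CQL batch read of blob data by key.
    return [(k, table[k]) for k in keys]

def batches(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

all_keys = sorted(table)                 # step 1: load all keys fast
with ThreadPoolExecutor(max_workers=4) as pool:
    # step 2: load data by key, in parallel batches
    results = pool.map(fetch_batch, batches(all_keys, 3))
rows = [row for batch in results for row in batch]
print(len(rows), "rows loaded")
```

In the real cluster each batch corresponds to one Spark partition inside `mapPartitions`, so the parallelism comes from Spark rather than a local thread pool.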
18. Sample – Click Counting for Campaign Stopping
1. mapPartitions – mapping from strings to objects with a pre-designed click key
2. reduceByKey – removing duplicate clicks (see next slide)
3. map – switch keys to a campaign+day key
4. reduceByKey – aggregate the data by campaign+day
19. Campaign Stopping – Removing Dup Clicks
• When more than 1 click is found from the same user on the same item, leave only the oldest
• Using accumulators to track duplicate numbers
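The four-step pipeline from the previous slide, including the keep-the-oldest dedup rule and an accumulator-style duplicate counter, can be sketched in plain Python. The click record layout here is illustrative, not Taboola's actual schema.

```python
# Pure-Python sketch of the click-counting pipeline: key by (user, item),
# dedup keeping the oldest click, re-key by (campaign, day), aggregate.

clicks = [
    # (user, item, campaign, day, timestamp)
    ("u1", "i1", "c1", "2014-02-19", 100),
    ("u1", "i1", "c1", "2014-02-19", 90),   # duplicate: same user+item, older
    ("u2", "i1", "c1", "2014-02-19", 105),
    ("u1", "i2", "c2", "2014-02-19", 110),
]

# 1. map to (click_key, record)
keyed = [((u, i), (u, i, c, d, ts)) for (u, i, c, d, ts) in clicks]

# 2. "reduceByKey" – keep only the oldest click per (user, item)
oldest = {}
duplicates = 0  # plays the role of a Spark accumulator
for key, rec in keyed:
    if key in oldest:
        duplicates += 1
        oldest[key] = min(oldest[key], rec, key=lambda r: r[4])
    else:
        oldest[key] = rec

# 3. map – re-key by (campaign, day)
by_campaign = [((c, d), 1) for (_, _, c, d, _) in oldest.values()]

# 4. "reduceByKey" – aggregate counts per campaign+day
totals = {}
for key, n in by_campaign:
    totals[key] = totals.get(key, 0) + n

print(totals, "duplicates:", duplicates)
```

In Spark, steps 2 and 4 would be actual `reduceByKey` calls and the duplicate counter would be an accumulator updated inside the reduce function.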
20. Our Deployment
• 7 nodes, each:
– 24 cores
– 256GB RAM
– 6 × 1TB SSD disks – JBOD configuration
– 10G Ethernet
• Total cluster power:
– 1760GB RAM
– 168 CPUs
– 42TB storage (effective space is less; Cassandra keyspaces are defined with replication factor 3)
• Symmetric deployment:
– Mesos + Spark
– Cassandra
• More:
– RabbitMQ on 3 nodes
– ZooKeeper on 3 nodes
– MySQL cluster outside this cluster
• Loads & processes ~1 terabyte (unzipped data) in ~3 minutes
21. Things that work well with Spark (from our experience)
• Very easy to code complex jobs
– Harder than SQL, but better than other MapReduce options
– Simple concepts, "small" API
• Easy to unit test
– Runs in local mode, so ideal for micro E2E tests
– Each mapper/reducer can be unit tested without Spark – if you do not use anonymous classes
• Very resilient
• Can read/write to/from any data source, including RDBMS, Cassandra, HDFS, local files, etc.
• Great monitoring
• Easy to deploy & upgrade
• Blazing fast
22. Things that do not work that well (from our experience)
• Long (endless) running tasks require some workarounds
– Temp files – Spark creates a lot of files in spark.local.dir, requiring periodic cleanup
– Use spark.cleaner.ttl for long running tasks
– Periodic driver restarts (Spark 0.8.1)
• Spark Streaming – not fully mature
– Some edge cases can cause loss of data
– The sliding window / batch model does not fit our needs; we always load some history to deal with late-arriving data
– State management is left to the user and is not trivial
– BUT – we were able to easily implement a bulletproof, home-grown, near-real-time streaming solution with a minimal amount of code
23. General / Optimization Tips
• Use Spark accumulators to collect and report operational data
• 10G Ethernet
• Multiple SSD disks per node, JBOD configuration
• A lot of memory for the cluster
• Use leader election for driver H/A
– In Spark 0.9 this may not be needed, with the new option to run the driver inside the cluster
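The accumulator tip above is about counting operational events (bad records, retries, duplicates) on the side while the data flows through the job. A minimal sketch, mimicking Spark accumulators in plain Python (in real PySpark you would use `sc.accumulator(0)` rather than this hand-rolled class):

```python
# Side-channel counting while parsing: the data pipeline yields parsed
# messages; the Accumulator collects a count of malformed records that
# the driver can read and report afterwards.

class Accumulator:
    """Add-only counter, like Spark's accumulator (driver reads .value)."""
    def __init__(self):
        self.value = 0

    def add(self, n=1):
        self.value += n

malformed = Accumulator()

def parse(line, acc):
    parts = line.split("\t")
    if len(parts) < 3:
        acc.add()        # report the bad record, keep processing
        return None
    return parts[2]

lines = ["a\tb\tERROR mysql down", "broken-line", "a\tb\tERROR php fatal"]
messages = [m for m in (parse(l, malformed) for l in lines) if m is not None]
print(messages, "malformed:", malformed.value)
```

The point of the pattern is that operational reporting never changes the shape of the data pipeline itself; it rides alongside it.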
24. Technologies Taboola Uses for Spark
• Spark – computing cluster
• Mesos – cluster resource manager
– Better resource allocation (coarse-grained) for Spark
• ZooKeeper – distributed coordination
– Enables multi-master for Mesos & Spark
• Curator
– Leader election for Taboola's Spark driver
• Cassandra – distributed data store
• Monitoring – http://metrics.codahale.com/
25. Attributions
Many of the general Spark slides were taken from the DataBricks Spark Summit 2013 slides.
There are great materials at:
https://spark.incubator.apache.org/
http://spark-summit.org/summit-2013/