At Taboola we receive a constant feed of data (many billions of user events a day) and use Apache Spark together with Cassandra for both real-time stream processing and offline data processing. We'd like to share our experience with these cutting-edge technologies.
Apache Spark is an open source, Hadoop-compatible computing engine that makes big data analysis drastically faster through in-memory computing, and simpler to write through easy APIs in Java, Scala and Python. The project was born as part of PhD work at UC Berkeley's AMPLab (part of the BDAS stack, pronounced "Bad Ass") and became an Apache incubator project with more active contributors than Hadoop. Surprisingly, Yahoo! is one of the biggest contributors to the project and already runs large production clusters of Spark on YARN.
Spark can run as a standalone cluster, or on Apache Mesos (with ZooKeeper) or YARN, and can run side by side with Hadoop/Hive on the same data.
One of Spark's biggest benefits is its very simple API: the same analytics code can be used for both streaming data and offline data processing.
2. Engine Focused on Maximizing CTR & Post-Click Engagement
(Diagram: signals feeding the recommendation engine)
• Context – Metadata
• Geo – Region-based Recommendations
• User Behavior – Cookie Data
• Social – Facebook/Twitter API
• Collaborative Filtering – Bucketed Consumption Groups
4. What Does it Mean?
• 5 Data Centers across the globe
• Terabytes of data / day (many billions of events)
• Data must be processed and analyzed in real time, for example:
– Real-time, per-user content recommendations
– Real-time expenditure reports
– Automated campaign management
– Automated recommendation algorithm calibration
– Real-time analytics
5. About Spark
• Open sourced
• Apache top-level project (since Feb. 19th)
• DataBricks – a commercial company that supports it
• Hadoop-compatible computing engine
• Can run side-by-side with Hadoop/Hive on the same data
• Drastically faster than Hadoop through in-memory computing
• Multiple H/A options – standalone cluster, Apache Mesos with ZooKeeper, or YARN
6. Spark Development Community
• With over 100 developers and 25 companies, one of the most active communities in big data
• Comparison: Storm (48), Giraph (52), Drill (18), Tez (12)
• Past 6 months: more active devs than Hadoop MapReduce!
9. Spark API
• Simple to write through easy APIs in Java, Scala and Python
• The same analytics code can be used for both streaming data and offline data processing
10. Spark Key Concepts
Write programs in terms of transformations on distributed datasets.
• Resilient Distributed Datasets (RDDs)
– Collections of objects spread across a cluster, stored in RAM or on disk
– Built through parallel transformations
– Automatically rebuilt on failure
• Operations
– Transformations (e.g. map, filter, groupBy)
– Actions (e.g. count, collect, save)
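The transformation/action split above can be sketched in a few lines of plain Python. This is a toy, single-machine model of the idea (the `MiniRDD` name and structure are illustrative, not Spark's API): transformations only record work to be done, and an action forces evaluation.

```python
# A minimal, single-machine sketch of the RDD model: transformations
# are lazy (they build a new MiniRDD wrapping a deferred computation),
# actions force evaluation and return a value.

class MiniRDD:
    def __init__(self, compute):
        self._compute = compute  # zero-arg function producing the data

    # --- transformations: lazy, return a new MiniRDD ---
    def map(self, f):
        return MiniRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self._compute() if pred(x)])

    # --- actions: force evaluation and return a value ---
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())


def parallelize(data):
    return MiniRDD(lambda: list(data))


lines = parallelize(["ERROR mysql down", "INFO ok", "ERROR php fatal"])
errors = lines.filter(lambda s: s.startswith("ERROR"))  # nothing runs yet
print(errors.count())  # the action triggers the whole chain -> 2
```

Real Spark adds partitioning, lineage-based fault recovery and caching on top of this same lazy structure.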
11. Working With RDDs
(Diagram: a base RDD flows through transformations into new RDDs; an action finally returns a value.)

textFile = sc.textFile("SomeFile.txt")
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
linesWithSpark.count()   # 74
linesWithSpark.first()   # "# Apache Spark"
12. Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")                     # Base RDD
errors = lines.filter(lambda s: s.startswith("ERROR"))   # Transformed RDD
messages = errors.map(lambda s: s.split("\t")[2])
messages.cache()
messages.filter(lambda s: "mysql" in s).count()          # Action
messages.filter(lambda s: "php" in s).count()
. . .

(Diagram: the driver sends tasks to workers; each worker reads a block of the file and caches its partition of messages, so repeated queries hit the cache.)

Full-text search of Wikipedia:
• 60GB on 20 EC2 machines
• 0.5 sec (cached) vs. 20s on-disk
14. Software Components
• Spark runs as a library in your program (1 instance per app)
• Runs tasks locally or on a cluster
– Mesos, YARN or standalone mode
• Accesses storage systems via the Hadoop InputFormat API
– Can use HBase, HDFS, S3, …

(Diagram: your application holds a SparkContext, which talks to a cluster manager or local threads; workers run Spark executors on top of HDFS or other storage.)
15. System Architecture & Data Flow @ Taboola
(Diagram: FE Servers, C* Cluster, Spark Cluster with Driver + Consumers, MySQL Cluster.)
16. Execution Graph @ Taboola
rdd1 = Context.parallelize([data])        • Data start point (dates, etc.)
rdd2 = rdd1.mapPartitions(loadfunc())     • Loading data from external sources (Cassandra, MySQL, etc.)
rdd3 = rdd2.reduce(reduceFunc())          • Aggregating the data and storing results
rdd4 = rdd3.mapPartitions(saverfunc())    • Saving the results to a DB
rdd4.count()                              • Executing the above graph by forcing an output operation
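The shape of this graph can be sketched without Spark at all. In the sketch below, `load_partition`, `reduce_events` and `save_partition` are illustrative stand-ins for the `loadfunc`/`reduceFunc`/`saverfunc` on the slide, and the in-memory `fake_store` dict stands in for Cassandra/MySQL.

```python
# A pure-Python sketch of the execution graph above (no Spark needed):
# start from dates, load one partition per date, aggregate, then save.

from collections import Counter
from functools import reduce

def load_partition(date):
    # Pretend each "partition" loads (event_key, count) pairs for one
    # day from an external source such as Cassandra or MySQL.
    fake_store = {
        "2014-02-19": [("click", 2), ("view", 5)],
        "2014-02-20": [("click", 1), ("view", 3)],
    }
    return fake_store[date]

def reduce_events(acc, part):
    # Merge per-partition counts into one sorted aggregate.
    merged = Counter(dict(acc))
    merged.update(dict(part))
    return sorted(merged.items())

saved = []
def save_partition(events):
    # Stand-in for writing results to a DB.
    saved.extend(events)

dates = ["2014-02-19", "2014-02-20"]             # rdd1: data start point
partitions = [load_partition(d) for d in dates]  # rdd2: load per partition
aggregated = reduce(reduce_events, partitions)   # rdd3: aggregate
save_partition(aggregated)                       # rdd4: save
print(saved)  # [('click', 3), ('view', 8)]
```

In real Spark the final `count()` plays the role of the forced output; here the list comprehension and `reduce` run eagerly, which is exactly the difference lazy RDDs hide.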
17. Cassandra as a Distributed Storage
• Event Log Files saved as blobs to a dedicated keyspace
• C* tables holding the Event Log Files are partitioned by day – a new table per day. This makes maintenance easier and loading into Spark simpler
• Using the Astyanax driver + CQL3
• Wrote a Hadoop InputFormat that supports loading this into a lines RDD<String>
– The DataStax InputFormat had issues and at the time was not formally supported
– Worked well, but ended up not using it – instead using mapPartitions
• Recipe to load all keys of a table very fast (hundreds of thousands / sec)
– Split by keys and then load data by key in batches – in parallel partitions
– Very simple, no overhead, no need to be tied to Hadoop
• Will probably use the InputFormat when we deploy a Shark solution
• Plans to open source all this

Table layout (userevent_2014-02-19, userevent_2014-02-20, …):
• Key (String) – GUID (originally the log file name)
• Data (blob) – gzipped file
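The "load all keys, then fetch data by key in batches, in parallel" recipe can be sketched as follows. This is a hedged illustration: the in-memory `table` dict and the `fetch_batch` function stand in for a real daily Cassandra table and the driver's batch reads.

```python
# Sketch of the parallel key-batch loading recipe: enumerate all keys,
# split them into fixed-size batches, and fetch batches concurrently.

from concurrent.futures import ThreadPoolExecutor

# Stand-in for one day's table: GUID -> gzipped blob.
table = {f"guid-{i}": f"gzipped-file-{i}" for i in range(10)}

def fetch_batch(keys):
    # Stand-in for a CQL batch read of blob data by key.
    return [(k, table[k]) for k in keys]

def batches(seq, size):
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

all_keys = sorted(table)                 # step 1: load all keys fast
with ThreadPoolExecutor(max_workers=4) as pool:
    # step 2: load data by key, in parallel batches
    results = pool.map(fetch_batch, batches(all_keys, 3))
rows = [row for batch in results for row in batch]
print(len(rows), "rows loaded")
```

In the real cluster each batch corresponds to one Spark partition inside `mapPartitions`, so the parallelism comes from Spark rather than a local thread pool.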
18. Sample – Click Counting for Campaign Stopping
1. mapPartitions – mapping from strings to objects with a pre-designed click key
2. reduceByKey – removing duplicate clicks (see next slide)
3. map – switch keys to a campaign+day key
4. reduceByKey – aggregate the data by campaign+day
19. Campaign Stopping – Removing Dup Clicks
• When more than 1 click is found from the same user on the same item, leave only the oldest
• Using accumulators to track duplicate numbers
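The four-step pipeline from the previous slide, including the keep-the-oldest dedup rule and an accumulator-style duplicate counter, can be sketched in plain Python. The click record layout here is illustrative, not Taboola's actual schema.

```python
# Pure-Python sketch of the click-counting pipeline: key by (user, item),
# dedup keeping the oldest click, re-key by (campaign, day), aggregate.

clicks = [
    # (user, item, campaign, day, timestamp)
    ("u1", "i1", "c1", "2014-02-19", 100),
    ("u1", "i1", "c1", "2014-02-19", 90),   # duplicate: same user+item, older
    ("u2", "i1", "c1", "2014-02-19", 105),
    ("u1", "i2", "c2", "2014-02-19", 110),
]

# 1. map to (click_key, record)
keyed = [((u, i), (u, i, c, d, ts)) for (u, i, c, d, ts) in clicks]

# 2. "reduceByKey" – keep only the oldest click per (user, item)
oldest = {}
duplicates = 0  # plays the role of a Spark accumulator
for key, rec in keyed:
    if key in oldest:
        duplicates += 1
        oldest[key] = min(oldest[key], rec, key=lambda r: r[4])
    else:
        oldest[key] = rec

# 3. map – re-key by (campaign, day)
by_campaign = [((c, d), 1) for (_, _, c, d, _) in oldest.values()]

# 4. "reduceByKey" – aggregate counts per campaign+day
totals = {}
for key, n in by_campaign:
    totals[key] = totals.get(key, 0) + n

print(totals, "duplicates:", duplicates)
```

In Spark, steps 2 and 4 would be actual `reduceByKey` calls and the duplicate counter would be an accumulator updated inside the reduce function.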
20. Our Deployment
• 7 nodes, each:
– 24 cores
– 256GB RAM
– 6 × 1TB SSD disks – JBOD configuration
– 10G Ethernet
• Total cluster power:
– 1760GB RAM
– 168 CPUs
– 42TB storage (effective space is less; Cassandra keyspaces are defined with replication factor 3)
• Symmetric deployment:
– Mesos + Spark
– Cassandra
• More:
– RabbitMQ on 3 nodes
– ZooKeeper on 3 nodes
– MySQL cluster outside this cluster
• Loads & processes ~1 terabyte (unzipped data) in ~3 minutes
21. Things that work well with Spark (from our experience)
• Very easy to code complex jobs
– Harder than SQL, but better than other MapReduce options
– Simple concepts, "small" API
• Easy to unit test
– Runs in local mode, so ideal for micro E2E tests
– Each mapper/reducer can be unit tested without Spark – if you do not use anonymous classes
• Very resilient
• Can read/write to/from any data source, including RDBMS, Cassandra, HDFS, local files, etc.
• Great monitoring
• Easy to deploy & upgrade
• Blazing fast
22. Things that do not work that well (from our experience)
• Long (endless) running tasks require some workarounds
– Temp files – Spark creates a lot of files in spark.local.dir, requiring periodic cleanup
– Use spark.cleaner.ttl for long running tasks
– Periodic driver restarts (Spark 0.8.1)
• Spark Streaming – not fully mature
– Some edge cases can cause loss of data
– The sliding window / batch model does not fit our needs; we always load some history to deal with late-arriving data
– State management is left to the user and is not trivial
– BUT – we were able to easily implement a bulletproof, home-grown, near-real-time streaming solution with a minimal amount of code
23. General / Optimization Tips
• Use Spark accumulators to collect and report operational data
• 10G Ethernet
• Multiple SSD disks per node, JBOD configuration
• A lot of memory for the cluster
• Use leader election for driver H/A
– In Spark 0.9 this may not be needed, with the new option to run the driver inside the cluster
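The accumulator tip above is about counting operational events (bad records, retries, duplicates) on the side while the data flows through the job. A minimal sketch, mimicking Spark accumulators in plain Python (in real PySpark you would use `sc.accumulator(0)` rather than this hand-rolled class):

```python
# Side-channel counting while parsing: the data pipeline yields parsed
# messages; the Accumulator collects a count of malformed records that
# the driver can read and report afterwards.

class Accumulator:
    """Add-only counter, like Spark's accumulator (driver reads .value)."""
    def __init__(self):
        self.value = 0

    def add(self, n=1):
        self.value += n

malformed = Accumulator()

def parse(line, acc):
    parts = line.split("\t")
    if len(parts) < 3:
        acc.add()        # report the bad record, keep processing
        return None
    return parts[2]

lines = ["a\tb\tERROR mysql down", "broken-line", "a\tb\tERROR php fatal"]
messages = [m for m in (parse(l, malformed) for l in lines) if m is not None]
print(messages, "malformed:", malformed.value)
```

The point of the pattern is that operational reporting never changes the shape of the data pipeline itself; it rides alongside it.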
24. Technologies Taboola Uses for Spark
• Spark – computing cluster
• Mesos – cluster resource manager
– Better resource allocation (coarse-grained) for Spark
• ZooKeeper – distributed coordination
– Enables multi-master for Mesos & Spark
• Curator
– Leader election for Taboola's Spark driver
• Cassandra – distributed data store
• Monitoring – http://metrics.codahale.com/
25. Attributions
Many of the general Spark slides were taken from the DataBricks Spark Summit 2013 slides.
There are great materials at:
https://spark.incubator.apache.org/
http://spark-summit.org/summit-2013/