2. Who are we?
Tal Sliwowicz
Director, R&D
tal@taboola.com
Ruthy Goldberg
Sr. Software Engineer
ruthy@taboola.com
3. Our War Story
“A good plan violently executed now is better than a
perfect plan executed next week”
George S. Patton
4. Our Data Requirements
• Lots of incoming traffic (100K requests/sec)
• Data:
– Personalized served recommendations – per user, per page view
– Events – what the user actually read and did
• The data needs to be joined and processed in real time
– Campaigns Management
– Recommendations
– Billing
– Reports
– Etc.
• The data needs to be available for offline research
5. Challenges
• We care about sessions – the chain of page views and events for a specific user
– Length can be hours or even days
• We care about users – the chain of sessions across sites
– Length can be days or even months
• Stateless application – a single user's data is sent from multiple data centers and multiple servers
– No deterministic affinity to a server or DC
– Order isn’t guaranteed
– Must be robust and automatically deal with late arrivals
– “Exactly once” semantics
6. Challenges Cont.
• Many streams of data that need to be joined (user,
session, page view, widgets, recommendations,
events, actions)
• 5+TB of daily data
• Data analysis requires pre-joining the streams and looking at the data across time
7. Naïve / Brute Force Solution
• Join some streams in the FE (front-end) server
– De-normalization is done as early as possible
– Everything that isn't an event or an action is joined
– However, we cannot assume a single page view (PV) happens on a single server
• Join the above with events and actions in Spark memory
– Minutes of data – OK
– 2+ hours of data – slow (30+ minutes of processing)
– Days of data – #Fail
8. Why Did it Fail?
• Incoming data is received by data class (e.g. Request, Event) and by incoming timestamp
– Separate RDD per class
– The RDDs contain hash-partitioned incoming data – effectively random with respect to the join key
• The join key is the session and page view IDs
9. Why Did it Fail?
• To join the data:
– First, remap the incoming data to a PairRDD and add the join key (this needs to be done individually, per RDD class)
– Second, cogroup the PairRDDs – a shuffle must be performed on all participating RDDs
• The initial data is distributed randomly across many nodes and multiple RDDs
– Small data sets → small shuffles
– Huge data sets → unmanageable shuffles
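A minimal sketch of this naive join (class and field names here are hypothetical, not the production code): each class-specific RDD is remapped to a PairRDD keyed by the session and page view IDs, and the cogroup then forces a shuffle of every participating RDD:

  import org.apache.spark.rdd.RDD

  case class Request(sessionId: String, viewId: String /* ... */)
  case class Event(sessionId: String, viewId: String /* ... */)
  case class Action(sessionId: String, viewId: String /* ... */)

  def naiveJoin(requests: RDD[Request], events: RDD[Event], actions: RDD[Action]) = {
    // Step 1: remap each class-specific RDD to a PairRDD, adding the
    // join key - done individually, per RDD class.
    val requestsByKey = requests.map(r => ((r.sessionId, r.viewId), r))
    val eventsByKey   = events.map(e => ((e.sessionId, e.viewId), e))
    val actionsByKey  = actions.map(a => ((a.sessionId, a.viewId), a))

    // Step 2: cogroup - repartitions ALL participating RDDs by the join
    // key, i.e. a full shuffle of the randomly partitioned input data.
    requestsByKey.cogroup(eventsByKey, actionsByKey)
  }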
12. The Solution
• Designed to avoid the initial / heaviest shuffle
• Go through an intermediate phase before reading the data for analysis
• As streamed data is received, save each message to Cassandra
– All classes are saved together in a single table
– The table is partitioned by the read key
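One way to sketch this ingestion step, assuming the DataStax spark-cassandra-connector (the keyspace, table, and field names below are illustrative; the talk does not specify the exact write path):

  import com.datastax.spark.connector._
  import com.datastax.spark.connector.streaming._
  import org.apache.spark.streaming.dstream.DStream

  // Hypothetical message envelope; field names follow the table model
  // described on the next slide.
  case class Msg(sessionStartHour: java.util.Date, userBucket: Int,
                 publisherId: Long, userId: String, sessionId: String,
                 viewId: String, dataType: String, dataHash: Long,
                 data: Array[Byte])

  // Persist every incoming message as-is; no joining happens here.
  def persist(messages: DStream[Msg]): Unit =
    messages.saveToCassandra("analytics", "user_data",
      SomeColumns("session_start_hour", "user_bucket", "publisher_id",
        "user_id", "session_id", "view_id", "data_type", "data_hash", "data"))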
13. Table Model in C*
• Partition key – session start hour + user bucket (0–9,999)
• Clustering key – publisher_id, user_id, session_id, view_id, data_type, data_hash
• Data type – MULTI_REQUEST, USER_EVENT, ACTION_CONVERSION, …
• Data – blobs of protobuf
• Results:
– All the data of a single session is in one place, regardless of time of arrival
– Idempotent process – if the same message is received twice, it overwrites the previous arrival because it carries the same hash id
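The model above, written out as a one-time CQL table definition (column names and types are assumptions based on the keys listed; the keyspace/table names match the earlier sketch):

  import com.datastax.spark.connector.cql.CassandraConnector
  import org.apache.spark.SparkConf

  def createTable(conf: SparkConf): Unit =
    CassandraConnector(conf).withSessionDo { session =>
      session.execute("""
        CREATE TABLE IF NOT EXISTS analytics.user_data (
          session_start_hour timestamp,
          user_bucket        int,     // 0-9,999
          publisher_id       bigint,
          user_id            text,
          session_id         text,
          view_id            text,
          data_type          text,    // MULTI_REQUEST, USER_EVENT, ...
          data_hash          bigint,
          data               blob,    // protobuf payload
          PRIMARY KEY ((session_start_hour, user_bucket),
                       publisher_id, user_id, session_id, view_id,
                       data_type, data_hash)
        )""")
    }

Because data_hash is part of the primary key, writing the same message twice targets the same row, which is what makes the ingestion idempotent.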
15. Result
• A week of data (~35TB) – 2 hours to analyze and report
• Analyzing a 1% sample of the users reduces this linearly (pruning by partition key) – see the sketch after this list
• Analyzing a single publisher that accounts for 1% of the data reduces this almost linearly (pruning by clustering key)
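For example, since the user bucket is a partition key column, a 1% user sample can be read by fetching only 100 of the 10,000 buckets. A sketch, assuming the connector's joinWithCassandraTable and the illustrative table above:

  import com.datastax.spark.connector._
  import org.apache.spark.SparkContext

  // Direct partition-key lookups: only the (hour, bucket) partitions
  // for buckets 0-99 are read - roughly 1% of the users.
  def onePercentSample(sc: SparkContext, hour: java.util.Date) = {
    val keys = sc.parallelize((0 until 100).map(bucket => (hour, bucket)))
    keys.joinWithCassandraTable("analytics", "user_data")
  }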
16. Good, but not good enough
• We used Cassandra because we had it as an
available resource
• However, Cassandra:
– Isn't columnar – cannot read partial rows (specific columns)
– Is eventually consistent – not accurate enough
– Suffers from memory issues under heavy loads
– Cross-DC replication isn't reliable under heavy load
• Now working on the next gen solution
– See you in a future meetup…
17. Some More Tips
• Avoid cogroup and use broadcasts when one of the RDDs is small enough – see the sketch after this list
• Whenever possible, use map() instead of mapPartitions()
– Gains memory and processing efficiency
– Unless per-partition setup is expensive
• G1GC – we have had a very good experience with it in tight memory situations
– Does not work well out of the box; requires some tweaking – see the note after this list
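A sketch of the broadcast tip (hypothetical names): when one side of the join fits in memory, ship it to the executors once instead of shuffling both sides the way cogroup does:

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Hypothetical example: enrich page views with publisher metadata.
  def enrich(sc: SparkContext,
             pageViews: RDD[(Long, String)],   // large: (publisherId, viewData)
             publishers: RDD[(Long, String)]   // small: (publisherId, name)
            ): RDD[(Long, String, Option[String])] = {
    // One broadcast of the small side replaces a full shuffle of both.
    val byId = sc.broadcast(publishers.collectAsMap())
    pageViews.map { case (pub, view) => (pub, view, byId.value.get(pub)) }
  }

For G1GC, the collector itself is enabled with spark.executor.extraJavaOptions=-XX:+UseG1GC; the further tweaking the slide alludes to is workload-specific.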