2. Who are we?
Tal Sliwowicz
Director, R&D
tal@taboola.com
Ruthy Goldberg
Sr. Software Engineer
ruthy@taboola.com
3. Our War Story
“A good plan violently executed now is better than a
perfect plan executed next week”
George S. Patton
4. Our Data Requirements
• Lots of incoming traffic (100K requests/sec)
• Data:
– Personalized served recommendations – per user, per page view
– Events – what the user actually read and did
• The data needs to be joined and processed in real time
– Campaigns Management
– Recommendations
– Billing
– Reports
– Etc.
• The data needs to be available for offline research
5. Challenges
• We care about sessions – the chain of page views and events for a specific user
– Length can be hours or even days
• We care about users – the chain of sessions across sites
– Length can be days or even months
• Stateless application – a single user's data is sent from multiple data centers and multiple servers
– No deterministic affinity to a server or DC
– Order isn’t guaranteed
– Must be robust and automatically deal with late arrivals
– “Exactly once” semantics
6. Challenges Cont.
• Many streams of data that need to be joined (user,
session, page view, widgets, recommendations,
events, actions)
• 5+TB of daily data
• Data analysis requires pre-joining the streams and looking at the data across time
7. Naïve / Brute Force Solution
• Join some streams in the FE (front-end) server
– De-normalization is done as early as possible
– Everything that isn't an event or an action is joined
– However, we cannot assume a single page view (PV) happens on a single server
• Join the above with events and actions in Spark memory
– Minutes of data – OK
– 2+ hours of data – slow (30+ minutes of processing)
– Days of data – #Fail
8. Why Did it Fail?
• Incoming data is received by data class (e.g. Request, Event) and by incoming timestamp
– Separate RDD per class
– The RDDs contain hash-partitioned incoming data – effectively random with respect to the join key
• The join key is the session and page view IDs
9. Why Did it Fail?
• To join the data:
– First, remap the incoming data to a PairRDD and add the join key (this needs to be done individually, per RDD class)
– Second, cogroup the PairRDDs – a shuffle must be performed on all participating RDDs
• The initial data is distributed randomly across many nodes and multiple RDDs
– Small data sets → small shuffles
– Huge data sets → unmanageable shuffles
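A minimal sketch of this naive join (class and field names here are hypothetical, not the production code): each class-specific RDD is remapped to a PairRDD keyed by the session and page view IDs, and the cogroup then forces a shuffle of every participating RDD:

  import org.apache.spark.rdd.RDD

  case class Request(sessionId: String, viewId: String /* ... */)
  case class Event(sessionId: String, viewId: String /* ... */)
  case class Action(sessionId: String, viewId: String /* ... */)

  def naiveJoin(requests: RDD[Request], events: RDD[Event], actions: RDD[Action]) = {
    // Step 1: remap each class-specific RDD to a PairRDD, adding the
    // join key - done individually, per RDD class.
    val requestsByKey = requests.map(r => ((r.sessionId, r.viewId), r))
    val eventsByKey   = events.map(e => ((e.sessionId, e.viewId), e))
    val actionsByKey  = actions.map(a => ((a.sessionId, a.viewId), a))

    // Step 2: cogroup - repartitions ALL participating RDDs by the join
    // key, i.e. a full shuffle of the randomly partitioned input data.
    requestsByKey.cogroup(eventsByKey, actionsByKey)
  }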
12. The Solution
• Designed to avoid the initial / heaviest shuffle
• Go through an intermediate phase before reading the data for analysis
• As streamed data is received, save each message to Cassandra
– All classes are saved together in a single table
– The table is partitioned by the read key
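One way to sketch this ingestion step, assuming the DataStax spark-cassandra-connector (the keyspace, table, and field names below are illustrative; the talk does not specify the exact write path):

  import com.datastax.spark.connector._
  import com.datastax.spark.connector.streaming._
  import org.apache.spark.streaming.dstream.DStream

  // Hypothetical message envelope; field names follow the table model
  // described on the next slide.
  case class Msg(sessionStartHour: java.util.Date, userBucket: Int,
                 publisherId: Long, userId: String, sessionId: String,
                 viewId: String, dataType: String, dataHash: Long,
                 data: Array[Byte])

  // Persist every incoming message as-is; no joining happens here.
  def persist(messages: DStream[Msg]): Unit =
    messages.saveToCassandra("analytics", "user_data",
      SomeColumns("session_start_hour", "user_bucket", "publisher_id",
        "user_id", "session_id", "view_id", "data_type", "data_hash", "data"))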
13. Table Model in C*
• Partition key – session start hour + user bucket (0–9,999)
• Clustering key – publisher_id, user_id, session_id, view_id, data_type, data_hash
• Data type – MULTI_REQUEST, USER_EVENT, ACTION_CONVERSION, …
• Data – blobs of protobuf
• Results:
– All the data of a single session is in one place, regardless of time of arrival
– Idempotent process – if the same message is received twice, it overwrites the previous arrival because it carries the same hash id
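The model above, written out as a one-time CQL table definition (column names and types are assumptions based on the keys listed; the keyspace/table names match the earlier sketch):

  import com.datastax.spark.connector.cql.CassandraConnector
  import org.apache.spark.SparkConf

  def createTable(conf: SparkConf): Unit =
    CassandraConnector(conf).withSessionDo { session =>
      session.execute("""
        CREATE TABLE IF NOT EXISTS analytics.user_data (
          session_start_hour timestamp,
          user_bucket        int,     // 0-9,999
          publisher_id       bigint,
          user_id            text,
          session_id         text,
          view_id            text,
          data_type          text,    // MULTI_REQUEST, USER_EVENT, ...
          data_hash          bigint,
          data               blob,    // protobuf payload
          PRIMARY KEY ((session_start_hour, user_bucket),
                       publisher_id, user_id, session_id, view_id,
                       data_type, data_hash)
        )""")
    }

Because data_hash is part of the primary key, writing the same message twice targets the same row, which is what makes the ingestion idempotent.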
15. Result
• A week of data (~35TB) – 2 hours to analyze and report
• Analyzing a 1% sample of the users reduces this linearly (pruning by partition key) – see the sketch after this list
• Analyzing a single publisher that accounts for 1% of the data reduces this almost linearly (pruning by clustering key)
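For example, since the user bucket is a partition key column, a 1% user sample can be read by fetching only 100 of the 10,000 buckets. A sketch, assuming the connector's joinWithCassandraTable and the illustrative table above:

  import com.datastax.spark.connector._
  import org.apache.spark.SparkContext

  // Direct partition-key lookups: only the (hour, bucket) partitions
  // for buckets 0-99 are read - roughly 1% of the users.
  def onePercentSample(sc: SparkContext, hour: java.util.Date) = {
    val keys = sc.parallelize((0 until 100).map(bucket => (hour, bucket)))
    keys.joinWithCassandraTable("analytics", "user_data")
  }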
16. Good, but not good enough
• We used Cassandra because we had it as an
available resource
• However, Cassandra:
– Isn't columnar – cannot read partial rows (specific columns)
– Is eventually consistent – not accurate enough
– Suffers from memory issues under heavy loads
– Cross-DC replication isn't reliable under heavy load
• Now working on the next gen solution
– See you in a future meetup…
17. Some More Tips
• Avoid cogroup and use broadcasts when one of the RDDs is small enough – see the sketch after this list
• Whenever possible, use map() instead of mapPartitions()
– Gains memory and processing efficiency
– Unless per-partition setup is expensive
• G1GC – we have had a very good experience with it in tight memory situations
– Does not work well out of the box; requires some tweaking – see the note after this list
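A sketch of the broadcast tip (hypothetical names): when one side of the join fits in memory, ship it to the executors once instead of shuffling both sides the way cogroup does:

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  // Hypothetical example: enrich page views with publisher metadata.
  def enrich(sc: SparkContext,
             pageViews: RDD[(Long, String)],   // large: (publisherId, viewData)
             publishers: RDD[(Long, String)]   // small: (publisherId, name)
            ): RDD[(Long, String, Option[String])] = {
    // One broadcast of the small side replaces a full shuffle of both.
    val byId = sc.broadcast(publishers.collectAsMap())
    pageViews.map { case (pub, view) => (pub, view, byId.value.get(pub)) }
  }

For G1GC, the collector itself is enabled with spark.executor.extraJavaOptions=-XX:+UseG1GC; the further tweaking the slide alludes to is workload-specific.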