ATC (Air Traffic Controller) is a system built on Samza to manage communications with LinkedIn members. It aims to improve the member experience by applying common functionality across many different communication types and use cases, handling thousands of communications per second while maintaining a near-real-time understanding of members' state. ATC focuses on sending the right message to the right member through the right channel at the right time, using techniques such as filtering, aggregation, channel selection, and delivery-time optimization. It is built to be highly scalable on streaming technologies such as Kafka and RocksDB, uses Samza's host affinity to avoid rebuilding state on redeployment, and relies on replicated input streams to keep state in multiple datacenters for redundancy. Personalization is achieved through relevance scores computed offline and stored in RocksDB.
3. What problem are we trying to solve?
In the past, LinkedIn provided a poor communications experience to some of its members: too much email, low-quality email, and messages fired on multiple channels at once.
Our goal was to build a system which could apply common functionality across many different communication types and use cases in order to improve the member experience:
● Handle thousands of communications per second
● Maintain a good understanding of members' state on the site in near-real-time
4. How does ATC think about creating a delightful member experience?
5. 5 Rights
● Right member
● Right message (useful to the member; not something they've already seen)
● Right frequency
● Right channel
● Right time
9. Delivery-time Optimization
● Hold on to a message and deliver it at the right moment.
● Ex: Don't buzz my phone at 2 AM.
● Ex: I like to read my daily digest after work.
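For illustration, here is a minimal sketch of how a hold-and-deliver-later step could look as a Samza task, using a RocksDB-backed store keyed by delivery time. The store name ("pending-messages"), output stream ("atc-outbound"), and chooseDeliveryTime() helper are hypothetical; this sketches the technique, not ATC's actual code.

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.Entry;
    import org.apache.samza.storage.kv.KeyValueIterator;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.*;

    public class DelayedDeliveryTask implements StreamTask, WindowableTask, InitableTask {
      private static final SystemStream OUTPUT = new SystemStream("kafka", "atc-outbound");
      private KeyValueStore<String, String> pending;

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // RocksDB-backed store defined as "pending-messages" in the job config.
        pending = (KeyValueStore<String, String>) context.getStore("pending-messages");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                          TaskCoordinator coordinator) {
        String message = (String) envelope.getMessage();
        long deliverAt = chooseDeliveryTime(message);
        // Zero-pad the timestamp so lexicographic key order matches time order.
        pending.put(String.format("%013d:%s", deliverAt, envelope.getKey()), message);
      }

      @Override
      public void window(MessageCollector collector, TaskCoordinator coordinator) {
        // Called every task.window.ms; flush everything that has come due.
        long now = System.currentTimeMillis();
        KeyValueIterator<String, String> it = pending.all(); // iterates in key order
        try {
          while (it.hasNext()) {
            Entry<String, String> entry = it.next();
            long deliverAt = Long.parseLong(entry.getKey().split(":", 2)[0]);
            if (deliverAt > now) break; // all later keys are still in the future
            collector.send(new OutgoingMessageEnvelope(OUTPUT, entry.getValue()));
            pending.delete(entry.getKey());
          }
        } finally {
          it.close();
        }
      }

      private long chooseDeliveryTime(String message) {
        // Placeholder: real logic is per-member (e.g. "after work", never 2 AM).
        return System.currentTimeMillis();
      }
    }

Because RocksDB iterates keys in sorted order, the periodic window() callback only has to scan the front of the store to find everything that has come due.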
11. Requirements for ATC
● Highly-scalable
● Nearline (but close to real-time!)
● Ingest data from many sources
● Persist some data, though most of it is only needed for a short TTL
18. Streaming Technologies
Kafka: publish-subscribe messaging system
● Used to send input to ATC to trigger communications
● Many actions and signals in the LinkedIn ecosystem are tracked as Kafka events. We can consume these signals to better understand the state of the ecosystem.
Databus: change-capture system for databases
● Produces an event whenever an entry in a database changes
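For illustration, a minimal Samza job configuration sketch showing Kafka wired up as the input system; the job name, task class, and topic names are hypothetical:

    # Kafka as a Samza "system" that feeds input to the job.
    job.factory.class=org.apache.samza.job.yarn.YarnJobFactory
    job.name=atc-example
    task.class=com.example.atc.ExampleTask
    task.inputs=kafka.communication-requests,kafka.member-activity
    systems.kafka.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
    systems.kafka.consumer.zookeeper.connect=localhost:2181
    systems.kafka.producer.bootstrap.servers=localhost:9092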
19. Host affinity
By default, whenever a Samza app is deployed, the task instances can be moved to any host in the cluster, regardless of where the instances were previously deployed.
If there was any state saved (e.g. in RocksDB), then the new instances would have to rebuild that state from the changelog. This bootstrapping can take some time depending on the amount of data to reload, and task instances can't process new input until bootstrapping is complete.
We have some use cases which can't be delayed for the amount of time it takes to rebuild state.
20. Host affinity (continued)
Host affinity is a Samza feature which allows us to deploy task instances back to the same hosts as the previous deployment, so state does not need to be reloaded.
If an individual instance fails, Samza can fall back to moving that instance elsewhere and bootstrapping it from the changelog.
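Both behaviors are driven by job configuration. A minimal sketch, with hypothetical store and topic names:

    # Ask the cluster manager to place task instances on their previous hosts.
    job.host-affinity.enabled=true

    # RocksDB-backed store with a Kafka changelog, used to rebuild state when
    # host affinity can't be honored.
    stores.pending-messages.factory=org.apache.samza.storage.kv.RocksDbKeyValueStorageEngineFactory
    stores.pending-messages.changelog=kafka.pending-messages-changelog
    stores.pending-messages.key.serde=string
    stores.pending-messages.msg.serde=string
    serializers.registry.string.class=org.apache.samza.serializers.StringSerdeFactory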
21. Multiple datacenters
Samza does not currently support replicating persistent application state (e.g. RocksDB) across multiple clusters which are running the same app.
We need ATC to run in multiple datacenters for redundancy, and we need state in each datacenter so that if we have to move processing between datacenters, we can continue to properly handle input.
22. Multiple datacenters (continued)
We rely on the input streams to replicate the main input, so that we can do processing and build up state in all datacenters.
The side effects (e.g. triggering the actual email send) are then emitted by only one of the datacenters, and we can dynamically choose where side effects are triggered.
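A minimal sketch of that gating pattern in a Samza task, assuming a hypothetical atc.side-effects.enabled flag; a real deployment would flip this dynamically rather than via static config:

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.system.OutgoingMessageEnvelope;
    import org.apache.samza.system.SystemStream;
    import org.apache.samza.task.*;

    public class SideEffectGatingTask implements StreamTask, InitableTask {
      private static final SystemStream OUTBOUND = new SystemStream("kafka", "email-sends");
      private KeyValueStore<String, String> memberState;
      private Config config;

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        this.config = config;
        this.memberState = (KeyValueStore<String, String>) context.getStore("member-state");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                          TaskCoordinator coordinator) {
        String memberId = (String) envelope.getKey();
        String event = (String) envelope.getMessage();

        // Every datacenter consumes the replicated input and updates local state,
        // so state stays warm everywhere.
        memberState.put(memberId, event);

        // Only the datacenter currently designated for side effects actually sends.
        if (config.getBoolean("atc.side-effects.enabled", false)) {
          collector.send(new OutgoingMessageEnvelope(OUTBOUND, memberId, event));
        }
      }
    }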
24. Deployments
When we deploy changes to ATC, we can deploy to a single datacenter at a time in order to test new versions on only a fraction of traffic.
In some cases, we shift all side effects out of a datacenter to do an upgrade. Since we still process all input, we can validate almost all of our functionality and ensure performance doesn't take an unexpected hit.
25. Store migrations
In some cases, we need to migrate our system to use a new instance of a store. For example, when support was added to use RocksDB TTL, we needed to migrate some of our stores.
Since we only needed the last X days of data, we could use the following strategy for the migration (see the sketch below):
● Write to both the old and new store for X days, but continue to read from the old store.
● After X days, read from the new store, but continue writing to both stores so we could fall back to the old store if anything went wrong.
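A minimal sketch of that dual-write pattern, with hypothetical store names and a time-based cutover; it illustrates the technique, not ATC's actual code:

    import org.apache.samza.storage.kv.KeyValueStore;

    // Phase 1 (before cutover): write both, read old.
    // Phase 2 (after X days): write both, read new; old remains as a fallback.
    public class MigratingStore {
      private final KeyValueStore<String, String> oldStore;
      private final KeyValueStore<String, String> newStore;
      private final long cutoverTimeMs; // when reads switch to the new store

      public MigratingStore(KeyValueStore<String, String> oldStore,
                            KeyValueStore<String, String> newStore,
                            long cutoverTimeMs) {
        this.oldStore = oldStore;
        this.newStore = newStore;
        this.cutoverTimeMs = cutoverTimeMs;
      }

      public void put(String key, String value) {
        // Dual-write throughout the migration so either store can serve reads.
        oldStore.put(key, value);
        newStore.put(key, value);
      }

      public String get(String key) {
        if (System.currentTimeMillis() < cutoverTimeMs) {
          return oldStore.get(key); // phase 1: the old store is authoritative
        }
        return newStore.get(key);   // phase 2: read new; old is the fallback
      }
    }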
26. Personalization through relevance
We work closely with a relevance team in order to make better decisions about the communications we send out, e.g. channel selection, delivery time, and aggregation thresholds.
Every day, scores for different decisions are computed offline (in Hadoop) by the relevance team. Those scores are pushed to ATC through Kafka, and ATC stores them in RocksDB.
Scores are generated for each member, so we can personalize the experience.
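For illustration, the ingestion path could look like the following sketch, assuming a hypothetical relevance-scores topic keyed by member ID and a RocksDB-backed "relevance-scores" store:

    import org.apache.samza.config.Config;
    import org.apache.samza.storage.kv.KeyValueStore;
    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.*;

    public class RelevanceScoreIngestTask implements StreamTask, InitableTask {
      private KeyValueStore<String, String> scores;

      @Override
      @SuppressWarnings("unchecked")
      public void init(Config config, TaskContext context) {
        // RocksDB-backed store; with a TTL configured, stale daily scores age out.
        scores = (KeyValueStore<String, String>) context.getStore("relevance-scores");
      }

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                          TaskCoordinator coordinator) {
        // Scores arrive keyed by member ID; each daily push simply overwrites.
        String memberId = (String) envelope.getKey();
        String scorePayload = (String) envelope.getMessage();
        scores.put(memberId, scorePayload);
      }
    }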
28. Remote calls
Some data is not available on a Kafka stream in a pragmatic way, so we make REST requests to fetch that data.
This is done at the beginning of the pipeline (see the sketch below):
● Extract the event
● Make remote calls and decorate the event
● Process the decorated event
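A minimal sketch of that extract/decorate/process shape, with a hypothetical ProfileClient standing in for the REST call:

    import org.apache.samza.system.IncomingMessageEnvelope;
    import org.apache.samza.task.MessageCollector;
    import org.apache.samza.task.StreamTask;
    import org.apache.samza.task.TaskCoordinator;

    public class DecoratingTask implements StreamTask {
      public interface ProfileClient { String fetchProfile(String memberId); }

      // Stub: real code would issue a REST request here.
      private final ProfileClient profileClient = memberId -> "{}";

      @Override
      public void process(IncomingMessageEnvelope envelope, MessageCollector collector,
                          TaskCoordinator coordinator) {
        // 1. Extract the event from the raw envelope.
        String memberId = (String) envelope.getKey();
        String event = (String) envelope.getMessage();

        // 2. Decorate: fetch data that isn't practically available on Kafka.
        String profile = profileClient.fetchProfile(memberId);

        // 3. Process the decorated event with all the data in hand.
        handleDecorated(event, profile, collector);
      }

      private void handleDecorated(String event, String profile, MessageCollector collector) {
        // Downstream logic (filtering, aggregation, channel selection) goes here.
      }
    }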
29. Remote calls - Efficiently
● Use ParSeq, LinkedIn's open-sourced framework for writing asynchronous code in Java (see the sketch below)
● ParSeq uses a thread pool for making remote calls
● The rest of the processing happens serially
● Checkpointing is handled by the application
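A minimal sketch of ParSeq usage based on its public API, running two remote fetches in parallel; the fetch bodies are stubs standing in for real REST calls:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import com.linkedin.parseq.Engine;
    import com.linkedin.parseq.EngineBuilder;
    import com.linkedin.parseq.Task;

    public class ParSeqExample {
      public static void main(String[] args) throws Exception {
        // ParSeq engine backed by a thread pool for the remote calls.
        ExecutorService taskExecutor = Executors.newFixedThreadPool(8);
        ScheduledExecutorService timerScheduler = Executors.newSingleThreadScheduledExecutor();
        Engine engine = new EngineBuilder()
            .setTaskExecutor(taskExecutor)
            .setTimerScheduler(timerScheduler)
            .build();

        // Two independent fetches run in parallel on the engine's thread pool.
        Task<String> profile = Task.callable("fetchProfile", () -> "profile-data");
        Task<String> settings = Task.callable("fetchSettings", () -> "settings-data");
        Task<String> decorated = Task.par(profile, settings)
            .map("decorate", (p, s) -> p + "+" + s);

        engine.run(decorated);
        decorated.await();
        System.out.println(decorated.get());

        engine.shutdown();
        taskExecutor.shutdown();
        timerScheduler.shutdown();
      }
    }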
30. Real-time Processing
● Some messages require real-time latency.
● We tuned Kafka's batching configuration to achieve sub-second pre-ATC latency, and it can be tuned even more aggressively.
● ATC/Samza processes most events in 2-3 ms.
● We make no remote calls for these messages.
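On the producer side, that kind of batching tuning can look like the following sketch (illustrative values; the talk does not specify ATC's actual settings):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class LowLatencyProducerExample {
      public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Latency-oriented batching: don't wait to fill batches before sending.
        // linger.ms and batch.size are the main knobs; this value is illustrative.
        props.put("linger.ms", "0");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
          // Hypothetical topic name; the record is flushed without batching delay.
          producer.send(new ProducerRecord<>("atc-input", "member-123", "realtime-event"));
        }
      }
    }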