HyperConnect's 1-to-1 video matchmaking system combines various machine learning techniques to maximize user satisfaction. The system manages a large user context, including actions taken a few seconds earlier, and must react within milliseconds to produce meaningful new results in each user session. This is difficult to achieve with traditional approaches, so distributed streaming is essential. Topics include:
- Why our team chose Apache Flink over the alternatives
- Matchmaking streaming architecture, with abstraction levels detailed per Flink operator
- Pairwise scoring microservice management with Flink
- Stateful matchmaking computation with low latency, fault tolerance, and scalability
- How to manage large-scale events: classifying feature types, collecting with a multi-window stream
- Applications: personalization, multi-armed bandit on stream
13. Matchmaking Steps
- Preprocess the data stream before it enters the match cycle
- Conditional tagging for segmentation (e.g., A/B tests)
- Vectorize user features for machine learning models
- Save managed state on the operator and connect it to the feature engineering operator
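The tagging and vectorizing steps above can be sketched in plain Python. This is a minimal illustration, not the talk's implementation: the hash-based A/B split, the function names `ab_segment` and `vectorize`, the experiment name, and the feature names in `SCHEMA` are all assumptions made for the example.

```python
import hashlib

def ab_segment(user_id: str, experiment: str, treatment_ratio: float = 0.5) -> str:
    """Deterministically tag a user as 'A' or 'B' for one experiment (hypothetical scheme)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "B" if bucket < treatment_ratio else "A"

def vectorize(features: dict, schema: list) -> list:
    """Turn a feature dict into a fixed-order numeric vector; missing keys become 0.0."""
    return [float(features.get(name, 0.0)) for name in schema]

# Hypothetical feature names, chosen only for this example.
SCHEMA = ["age", "recent_matches", "avg_call_seconds"]

request = {"user_id": "u42", "features": {"age": 27, "avg_call_seconds": 95.5}}
request["segment"] = ab_segment(request["user_id"], "exp-demo")
request["vector"] = vectorize(request["features"], SCHEMA)
```

Hashing the user id together with the experiment name keeps a user in the same segment across sessions without storing the assignment, which is one common way to tag a stream; the talk does not say which scheme is used.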
14. Matchmaking Steps
- Split the match request attributes into two kinds of information
1) Match context info
- can change during the matchmaking cycle
- small
- transferred directly to the next operator
2) Match properties
- cannot change
- large
- broadcast only to the scoring operators
[Diagram: total feature memory, divided into a small mutable context and large immutable properties]
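One way to picture this split is with two record types, one mutable and one frozen. The sketch below is an assumption-laden illustration: the class names `MatchContext` and `MatchProperties`, their fields, and `split_request` are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass

@dataclass
class MatchContext:
    """Small, mutable part: passed directly from operator to operator."""
    request_id: str
    segment: str = "default"
    rounds_waited: int = 0

@dataclass(frozen=True)
class MatchProperties:
    """Large, immutable part: broadcast only to the scoring operators."""
    request_id: str
    feature_vector: tuple

def split_request(raw: dict):
    """Split one raw request into its context and its properties."""
    ctx = MatchContext(raw["request_id"], raw.get("segment", "default"))
    props = MatchProperties(raw["request_id"], tuple(raw["feature_vector"]))
    return ctx, props

ctx, props = split_request({"request_id": "r1", "feature_vector": [0.1] * 512})
ctx.rounds_waited += 1  # the context may change during the cycle; props may not
```

Keeping the large payload out of the records that travel between operators means only the small context is serialized at every hop.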
16. Matchmaking Steps
- Aggregate by window
- Global tumbling window
- Keyed tumbling window when segmentation is enabled
- Custom trigger & evictor, depending on the matchmaking function
- Custom pair-generation protocol with the scoring operators
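The difference between the global and the keyed tumbling window can be shown with a small stand-alone model. This is only a sketch: it groups already-collected events by timestamp and does not model Flink's triggers, evictors, or watermarks; the function name `tumbling_windows` and the event layout are assumptions.

```python
from collections import defaultdict

def tumbling_windows(events, size_ms, key_fn=None):
    """Group (timestamp_ms, request) events into tumbling windows.

    With key_fn=None there is one global window per interval;
    otherwise there is one window per (key, interval).
    """
    windows = defaultdict(list)
    for ts, req in events:
        window_start = ts - ts % size_ms  # start of the interval containing ts
        key = key_fn(req) if key_fn else None
        windows[(key, window_start)].append(req)
    return dict(windows)

events = [
    (100, {"id": "a", "segment": "A"}),
    (900, {"id": "b", "segment": "B"}),
    (1200, {"id": "c", "segment": "A"}),
]
global_w = tumbling_windows(events, 1000)
keyed_w = tumbling_windows(events, 1000, key_fn=lambda r: r["segment"])
```

In the global case requests `a` and `b` land in the same window and can be paired; in the keyed case they are separated by segment, so each segment is matched independently.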
17. Matchmaking Steps
- Efficient pair generation and bucketizing that take communication costs into account
- Total communication cost = network transfer time + serialization/deserialization time
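To see why bucketizing affects communication cost, the sketch below compares two ways of assigning candidate pairs to scorer buckets and counts how many request payloads each bucket must receive. The talk does not describe its actual protocol; the block scheme, the function names, and the per-payload timings passed to `communication_cost` are assumptions for illustration.

```python
from itertools import combinations

def round_robin_buckets(pairs, n_buckets):
    """Naive assignment: deal pairs out to buckets one by one."""
    buckets = [[] for _ in range(n_buckets)]
    for i, pair in enumerate(pairs):
        buckets[i % n_buckets].append(pair)
    return buckets

def block_buckets(ids, n_groups):
    """Split ids into groups; one bucket per (group_i, group_j) with i <= j."""
    groups = [ids[g::n_groups] for g in range(n_groups)]
    buckets = []
    for i in range(n_groups):
        for j in range(i, n_groups):
            if i == j:
                buckets.append(list(combinations(groups[i], 2)))
            else:
                buckets.append([(a, b) for a in groups[i] for b in groups[j]])
    return buckets

def payloads_shipped(buckets):
    """Each bucket needs the properties of every request that appears in it."""
    return sum(len({r for pair in bucket for r in pair}) for bucket in buckets)

def communication_cost(payloads, network_ms, serde_ms):
    """Total cost = network transfer time + serialization/deserialization time."""
    return payloads * network_ms + payloads * serde_ms

ids = [f"u{i}" for i in range(12)]
pairs = list(combinations(ids, 2))   # 66 candidate pairs
rr = round_robin_buckets(pairs, 3)   # 3 buckets, pairs dealt out in order
block = block_buckets(ids, 2)        # also 3 buckets, grouped by id block
```

Both schemes cover the same 66 pairs with three buckets, but the block scheme ships fewer payloads because each bucket touches only one or two groups of requests. The trade-off in this toy version is uneven bucket sizes, which a real protocol would have to balance.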
19. Matchmaking Steps
Matchmaking logic is divided into several microservices
- Many teams can contribute to improving the scoring logic
- The scoring operator works with broadcast request data or an external in-memory cache
- Prepare the data to be provided to each microservice
- Get responses from the microservices
- Scoring microservices can be REST or event-driven
- Use Flink's AsyncDataStream to call them asynchronously
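The asynchronous fan-out to scoring microservices can be imitated with `asyncio`. In a Flink job this role is played by AsyncDataStream; the sketch below is only an analogy, and `call_scorer`, `score_pair`, the scorer names, and the delays are all hypothetical stand-ins for real service calls.

```python
import asyncio

async def call_scorer(name, pair, delay_s):
    """Stand-in for a REST or event-driven scoring microservice call."""
    await asyncio.sleep(delay_s)  # simulated network and compute time
    return name, float(len(pair[0]) + len(pair[1]))  # dummy score

async def score_pair(pair, scorers, timeout_s):
    """Call all scorers concurrently; a scorer that times out yields None."""
    async def guarded(name, delay_s):
        try:
            return await asyncio.wait_for(call_scorer(name, pair, delay_s), timeout_s)
        except asyncio.TimeoutError:
            return name, None

    results = await asyncio.gather(*(guarded(n, d) for n, d in scorers.items()))
    return dict(results)

# "fast" answers within the timeout, "slow" does not.
scores = asyncio.run(
    score_pair(("u1", "u22"), {"fast": 0.01, "slow": 0.5}, timeout_s=0.1)
)
```

Because the calls run concurrently, the time spent on one pair is bounded by the timeout rather than by the sum of all scorer latencies, and a slow scorer degrades the score instead of stalling the stream.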
23. Matchmaking Steps
- Trigger condition: timeout, or all candidates have arrived
- For fault tolerance, the generating operator sends a dummy timeout token
[Diagram: Pair Generator, Scorers, Match Maker, Leftovers; a timeout token is sent for the custom trigger]
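The trigger condition on this slide can be sketched as a small class that fires either when every expected candidate has arrived or when the timeout token shows up. This is a plain-Python model, not a Flink Trigger implementation; `MatchTrigger`, `TIMEOUT_TOKEN`, and the element layout are assumptions.

```python
# Dummy token the pair generator emits when the deadline passes (hypothetical).
TIMEOUT_TOKEN = object()

class MatchTrigger:
    """Fires when all expected scored pairs arrived, or when a timeout token is seen."""

    def __init__(self, expected: int):
        self.expected = expected
        self.scored = []

    def on_element(self, element) -> bool:
        """Return True if the match maker should fire now."""
        if element is TIMEOUT_TOKEN:
            return True  # fire with whatever has arrived so far
        self.scored.append(element)
        return len(self.scored) >= self.expected

trigger = MatchTrigger(expected=3)
fired_early = trigger.on_element(("u1", "u2", 0.9))   # one of three arrived
fired_on_timeout = trigger.on_element(TIMEOUT_TOKEN)  # a scorer never answered
```

Sending the timeout as an ordinary element in the stream means the match maker still fires when a scorer is lost, which is the fault-tolerance role the slide gives to the dummy token.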
24. Matchmaking Steps
- Create app-level backpressure by buffering some requests when a scorer fails consecutively
[Diagram: Pair Generator, Scorers, Match Maker, Leftovers; too many leftovers trigger backpressure, which buffers selected requests]
Failure cascade: some pair scorers fail => more requests go unmatched => more requests enter the next match cycle => the cycle repeats => the service fails
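A minimal sketch of such app-level backpressure follows. The talk does not say how requests are selected for buffering, so the policy here (admit the oldest requests up to a cap, hold the rest) and the names `BackpressureGate`, `failure_threshold`, and `max_admit` are assumptions.

```python
from collections import deque

class BackpressureGate:
    """Holds back requests while scorers keep failing (hypothetical policy)."""

    def __init__(self, failure_threshold: int, max_admit: int):
        self.failure_threshold = failure_threshold
        self.max_admit = max_admit
        self.consecutive_failures = 0
        self.buffer = deque()

    def record_scoring(self, ok: bool):
        """Track consecutive scorer failures; any success resets the count."""
        self.consecutive_failures = 0 if ok else self.consecutive_failures + 1

    def admit(self, new_requests):
        """Return the requests for the next cycle; buffer the rest under pressure."""
        pending = list(self.buffer) + list(new_requests)
        self.buffer.clear()
        if self.consecutive_failures < self.failure_threshold:
            return pending
        admitted, held = pending[:self.max_admit], pending[self.max_admit:]
        self.buffer.extend(held)
        return admitted

gate = BackpressureGate(failure_threshold=2, max_admit=2)
first = gate.admit(["r1", "r2", "r3"])   # healthy: everything passes
gate.record_scoring(False)
gate.record_scoring(False)               # two failures in a row
second = gate.admit(["r4", "r5", "r6"])  # under pressure: only two pass
gate.record_scoring(True)                # scorer recovered
third = gate.admit(["r7"])               # buffered request is released first
```

Capping the cycle size while scorers are failing keeps leftovers from piling onto the next cycle, which is what breaks the failure cascade described on the slide.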
28. Infra & Operation
- Easy to set up new staging infra
- For performance testing
- Metrics on Prometheus/Grafana
[Diagram: HAProxy routing to ACTIVE and STANDBY deployments; Match App1 and Match App2 (each with a Job manager and Task manager), Score App1 and Score App2]
- For zero downtime: blue-green deployment
- Score microservices can be updated with rolling updates
29. Performance Tuning
Network cost & General tips
- Latency vs Throughput trade-off: setBufferTimeout
- Forward vs Rebalance: setParallelism
- For resource isolation: stream.slotSharingGroup
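The latency versus throughput trade-off behind setBufferTimeout can be illustrated with a toy model of an output buffer. This is a simplification and not how Flink implements its network buffers: the model assumes a buffer is flushed when it is full or when `timeout_ms` has passed since its first record, and `flush_schedule` is a hypothetical name.

```python
def flush_schedule(arrivals_ms, capacity, timeout_ms):
    """Simulate a buffer that flushes when full or on timeout.

    Returns (number_of_flushes, max_wait_ms), where max_wait_ms is the longest
    time the oldest record of any buffer waited before being sent.
    """
    flushes, max_wait, buf = 0, 0, []
    for t in arrivals_ms:
        # A timeout flush happens before the new record is accepted.
        if buf and t - buf[0] >= timeout_ms:
            max_wait = max(max_wait, timeout_ms)
            flushes += 1
            buf = []
        buf.append(t)
        if len(buf) == capacity:  # a full buffer is sent immediately
            max_wait = max(max_wait, t - buf[0])
            flushes += 1
            buf = []
    if buf:  # the last partial buffer goes out when its timeout expires
        max_wait = max(max_wait, timeout_ms)
        flushes += 1
    return flushes, max_wait

arrivals = list(range(0, 100, 10))  # one record every 10 ms
low_latency = flush_schedule(arrivals, capacity=4, timeout_ms=5)
high_throughput = flush_schedule(arrivals, capacity=4, timeout_ms=100)
```

With the short timeout every record is sent on its own (ten flushes, at most 5 ms of waiting); with the long timeout the same records go out in three flushes, but a record may wait up to 100 ms. Fewer, fuller flushes mean less per-record network overhead, which is the throughput side of the trade-off.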