Tales Of The Black Knight - Keeping EverythingMe running
1.
2. • “One Tap Happiness”
• Smart and Contextual Launcher / Phone UI
• App Organization
• In-Phone Search
• App Search
• App Recommendations (+ sponsored)
• Contextual Content Discovery (cards)
What this means:
• Lots of algorithms and data from the servers
• Millions of downloads, hundreds of sustained requests per second
• 1B use events collected per month
• Fucking up means fucking up the users’ phones
3. • 100% Cloud-based (EC2)
• 100% Automated Infrastructure
• Third-party software is FOSS only
• Continuous Deployment
• Loads of metrics and logging
• Databases: Redis, MySQL, Cassandra
• Languages: Python, Go, C++ (Java on Android)
• Important Tools: Tornado, Thrift, Scribe,
Statsd, Kibana, Celery, ZooKeeper, Chef,
Docker
4. • Servers may be terminated at any moment
• Disks may fail at any moment
• The LAN may fail or hiccup
• Services you query may crash or be restarted
• Your code may crash at any moment
• The Client is an idiot that might send garbage
• The Server is an idiot that might return garbage
• We Are All Idiots
• We should accept and embrace all that
5. • Separation into many small services
• No SPOFs
• Little reliance on disk
• Aspire for statelessness
• Dynamic Endpoint Management
• In App Failover and LB
• Aggressive Timeouts
• Multi tiered alerting
• Sane Fallback Values
• Graceful Degradation
6. [Architecture diagram; labels: API, Search, Ads, Images, Geo, Context, Auto-Complete, Redis (x4), C*]
● Thin API Layer
○ Input validation
○ Connection funnelling
● Many smaller services
● Many Redis instances
○ “database” = instance
● Thrift for internal APIs
● Deploys are less scary
● Scaling is easier
● Well defined contracts
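7.
import logging

log = logging.getLogger("black_knight")
KeepFighting = "KeepFighting"
SwitchDataCenter_PLZ_KTXBAI = "SwitchDataCenter_PLZ_KTXBAI"

def decide(machine_is_down, fucked_services_count, num_running_services):
    # Only when almost nothing is left do we give up on this data center.
    if machine_is_down:
        # All is well
        return KeepFighting
    elif fucked_services_count == 1:
        log.info("Tis But a Scratch")
        return KeepFighting
    elif fucked_services_count == 2:
        log.info("Just a Flesh Wound!")
        return KeepFighting
    elif num_running_services >= 2:
        log.info("I'll Bite Your Legs Off!")
        return KeepFighting
    else:
        log.info("All Right, We'll Call It A Draw")
        return SwitchDataCenter_PLZ_KTXBAI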
8. • No Database Master ⇒ remain read only
• No Queue ⇒ Write to log for future processing
• No Service X ⇒ Return a default response, not an exception
• No MySQL ⇒ Everything is ready for serving in Redis anyway
• No internal service ⇒ fall back to external service
• Etc...
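The "default response, not an exception" rule can be sketched as a small decorator. This is a minimal illustration, not the EverythingMe implementation: `with_fallback`, `recommended_apps` and `search_apps` are hypothetical names, and the failing service is simulated.

```python
import functools
import logging

log = logging.getLogger("fallbacks")


def with_fallback(default):
    """Return `default` instead of raising when the wrapped call fails."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:  # broad on purpose: any failure degrades
                log.warning("%s failed (%s); serving default", func.__name__, exc)
                return default
        return wrapper
    return decorator


@with_fallback(default=())
def recommended_apps(user_id):
    # Hypothetical service call; it always fails here to show the fallback.
    raise ConnectionError("recommendation service is down")


@with_fallback(default=())
def search_apps(query):
    # Hypothetical healthy call.
    return ("maps", "mail") if query else ()
```

The failure is still logged, so the degradation shows up in metrics and alerts even though the user only sees an empty result. The defaults are immutable tuples so one caller cannot modify the shared fallback value.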
9. • Multi Edge / single Central DCs
• Geo-DNS based
• Edges are read only
• Central is write only
• API / Logs
• All edges are data-symmetrical
• Any Edge may be taken out
• Central may be taken out without service disruption
10. • ZooKeeper manages an endpoint tree
• Watchdog registers services - no self announce
• Changes to endpoints are published to all services
• Automatic switching and adding of endpoints
• Facilitates no-downtime deploys with downtime :)
• A dead machine is deleted from ZK automatically
• A static snapshot of endpoints kept on all
machines
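The endpoint flow above (watchdog registers, changes are published, a static snapshot stays on disk) can be sketched without a real ZooKeeper. Everything here is an assumption for illustration: `EndpointTree` is an in-memory stand-in, not a ZooKeeper client, and the JSON snapshot format is invented.

```python
import json
import os
import tempfile


class EndpointTree:
    """In-memory stand-in for the ZooKeeper endpoint tree (hypothetical API)."""

    def __init__(self, snapshot_path):
        self.snapshot_path = snapshot_path
        self.endpoints = {}      # service name -> sorted list of "host:port"
        self.subscribers = []    # callbacks invoked on every change

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def register(self, service, endpoint):
        """Called by the watchdog, not by the service itself."""
        current = set(self.endpoints.get(service, []))
        current.add(endpoint)
        self._update(service, current)

    def remove(self, service, endpoint):
        """Called when a machine dies or is taken out for a deploy."""
        current = set(self.endpoints.get(service, []))
        current.discard(endpoint)
        self._update(service, current)

    def _update(self, service, endpoint_set):
        self.endpoints[service] = sorted(endpoint_set)
        self._write_snapshot()
        for callback in self.subscribers:
            callback(service, list(self.endpoints[service]))

    def _write_snapshot(self):
        # Write to a temp file and rename so readers never see a partial file.
        tmp_path = self.snapshot_path + ".tmp"
        with open(tmp_path, "w") as fh:
            json.dump(self.endpoints, fh)
        os.replace(tmp_path, self.snapshot_path)


def load_snapshot(snapshot_path):
    """Fallback for when the coordination service is unreachable."""
    with open(snapshot_path) as fh:
        return json.load(fh)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "endpoints.json")
    tree = EndpointTree(path)
    seen = []
    tree.subscribe(lambda service, eps: seen.append((service, eps)))
    tree.register("search", "10.0.0.1:9090")
    tree.register("search", "10.0.0.2:9090")
    tree.remove("search", "10.0.0.1:9090")   # simulated dead machine
    return seen[-1], load_snapshot(path)
```

In a real deployment the automatic deletion of dead machines would typically come from the coordination service itself; the explicit `remove` call here only simulates that.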
11. • Internal “learning” Load Balancing Connection Pool
• Protocol Agnostic (kinda...)
• Python magic - no code changes
• Silent failovers
• Automatic fast banning / exploration
• Why in-app?
• Application Aware
• Less latency
• EP management support
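A rough sketch of in-app failover with fast banning and exploration follows. The class name, the ban duration and the exploration rate are all assumptions; the slides do not say how the real pool "learns", so this only shows the general idea.

```python
import random
import time


class BanningPool:
    """Picks an endpoint per call, bans failing ones, occasionally retries them."""

    def __init__(self, endpoints, ban_seconds=5.0, explore_rate=0.1,
                 clock=time.monotonic, rng=None):
        self.endpoints = list(endpoints)
        self.ban_seconds = ban_seconds
        self.explore_rate = explore_rate
        self.clock = clock
        self.rng = rng or random.Random()
        self.banned_until = {}   # endpoint -> time when the ban expires

    def _is_banned(self, endpoint):
        return self.banned_until.get(endpoint, 0.0) > self.clock()

    def pick(self):
        healthy = [e for e in self.endpoints if not self._is_banned(e)]
        banned = [e for e in self.endpoints if self._is_banned(e)]
        # Exploration: sometimes try a banned endpoint to see if it recovered.
        if banned and (not healthy or self.rng.random() < self.explore_rate):
            return self.rng.choice(banned)
        return self.rng.choice(healthy)

    def call(self, func, max_attempts=3):
        """Run func(endpoint), failing over silently to other endpoints."""
        last_error = None
        for _ in range(max_attempts):
            endpoint = self.pick()
            try:
                result = func(endpoint)
            except Exception as exc:
                self.banned_until[endpoint] = self.clock() + self.ban_seconds
                last_error = exc
                continue
            self.banned_until.pop(endpoint, None)  # success lifts the ban
            return result
        raise last_error
```

Because the pool lives inside the application, the caller only sees an error after every attempt has failed; a single bad endpoint costs one failed call and is then skipped until its ban expires or exploration finds it healthy again.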
12. • Proper timeouts are a key factor for a distributed system
• They should be as low as possible while avoiding false positives
• Internal service timeouts should be < 50ms
• Client timeouts can be rather big to support retries
• Without proper timeouts any link in the chain can bring you
down
• Log them but try to recover
• Don’t forget they add up!
• Bad Timeout == Point Of Failure
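"They add up" is easy to see with a little arithmetic. The sketch below is an assumption-laden illustration: the `Deadline` class and the numbers are hypothetical, and all times are in milliseconds.

```python
import time


class Deadline:
    """Shared time budget (in milliseconds) for one request across many calls."""

    def __init__(self, budget_ms, clock=lambda: time.monotonic() * 1000):
        self.clock = clock
        self.expires_at = clock() + budget_ms

    def remaining(self):
        return max(0.0, self.expires_at - self.clock())

    def timeout_for(self, per_call_cap_ms):
        """Timeout for the next hop: the cap, or what is left if that is less."""
        left = self.remaining()
        if left <= 0.0:
            raise TimeoutError("request budget exhausted")
        return min(per_call_cap_ms, left)


def worst_case_latency(per_call_timeouts_ms, attempts_per_call=1):
    """Sequential calls: timeouts (times retries) add up along the chain."""
    return sum(t * attempts_per_call for t in per_call_timeouts_ms)
```

For example, three sequential internal calls with a 50 ms timeout and one retry each can already cost `worst_case_latency([50, 50, 50], attempts_per_call=2)`, which is 300 ms, before the client has received anything. Passing a shared `Deadline` down the chain is one way to keep the total bounded.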
13. • The obvious dark side of all this
• Survival Strategies
• Separate non time-critical API calls
• Client Side Backoffs
• Selective Failover Retries
• Capacity barriers in internal services
• Tune well
• Fail fast!
• Return sane defaults, not errors
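Two of the survival strategies above, capacity barriers and client-side backoffs, can be sketched with the standard library only. The names and default numbers are assumptions chosen for illustration, not values from the talk.

```python
import random
import threading


class CapacityBarrier:
    """Rejects work immediately once `limit` calls are already in flight."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def run(self, func, default):
        # Non-blocking acquire: when full, fail fast with the sane default.
        if not self._slots.acquire(blocking=False):
            return default
        try:
            return func()
        finally:
            self._slots.release()


def backoff_delays(base=0.1, factor=2.0, cap=5.0, attempts=5, rng=None):
    """Client-side backoff: exponentially growing, jittered sleep intervals."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        delays.append(rng.uniform(0.0, ceiling))  # random jitter below the ceiling
    return delays
```

The barrier never queues work: once the limit is reached, extra callers get the default right away instead of piling up behind a struggling service. The jitter in the backoff keeps many clients from retrying at the same instant.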
14. • We are constantly improving our infrastructure
• Our DevOps team does that rather than maintaining servers
• We accept that all solutions are temporary
• Start-up infrastructure is always a compromise
• We embrace Post-Mortems as an opportunity to improve
16. • MySQL/Redis data abstraction library
• Objects can be saved or loaded from either
• MySQL is write only, Redis read only
• MySQL can be down without disruption
• Only MySQL is replicated between DCs
• Automatic migrations to Redis
• Spartacus: a pseudo MySQL slave that notifies on changes
[Diagram: Central (MySQL, Redis) and Edge (MySQL, Spartacus, Redis)]
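The write-to-MySQL, read-from-Redis split can be sketched with plain dictionaries standing in for both databases. This is not the actual library: `ObjectStore` and `ChangeNotifier` are hypothetical, and the notifier only imitates the role the slides give to Spartacus.

```python
class ChangeNotifier:
    """Stand-in for Spartacus: tells listeners about each committed change."""

    def __init__(self):
        self.listeners = []

    def publish(self, key, value):
        for listener in self.listeners:
            listener(key, value)


class ObjectStore:
    """Writes go to the 'MySQL' dict; reads are served from the 'Redis' dict."""

    def __init__(self, notifier):
        self.mysql = {}        # system of record (write path)
        self.redis = {}        # serving copy (read path)
        self.mysql_up = True   # flip to False to simulate a MySQL outage
        self.notifier = notifier
        notifier.listeners.append(self._migrate)

    def save(self, key, value):
        if not self.mysql_up:
            raise RuntimeError("write path unavailable: staying read only")
        self.mysql[key] = value
        self.notifier.publish(key, value)   # triggers migration to Redis

    def load(self, key, default=None):
        # Never touches MySQL, so reads survive a MySQL outage.
        return self.redis.get(key, default)

    def _migrate(self, key, value):
        self.redis[key] = value
```

With this shape, taking "MySQL" down only blocks new writes; everything already migrated keeps being served from "Redis".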