• “One Tap Happiness” 
• Smart and Contextual Launcher / Phone UI 
• App Organization 
• In-Phone Search 
• App Search 
• App Recommendations (+ sponsored) 
• Contextual Content Discovery (cards) 
What this means: 
• Lots of algorithms and data from the servers 
• Millions of downloads, hundreds of sustained requests per second 
• 1B use events collected per month 
• Fucking up means fucking up the users’ phones
• 100% Cloud-based (EC2) 
• 100% Automated Infrastructure 
• Third-party software is FOSS only 
• Continuous Deployment 
• Loads of metrics and logging 
• Databases: Redis, MySQL, Cassandra 
• Languages: Python, Go, C++ (Java on Android) 
• Important Tools: Tornado, Thrift, Scribe, 
Statsd, Kibana, Celery, ZooKeeper, Chef, 
Docker
• Servers may be terminated at any moment 
• Disks may fail at any moment 
• The LAN may fail or hiccup 
• Services you query may crash or be restarted 
• Your code may crash at any moment 
• The Client is an idiot that might send garbage 
• The Server is an idiot that might return garbage 
• We Are All Idiots 
• We should accept and embrace all that
• Separation into many small services 
• No single points of failure (SPOFs) 
• Little reliance on disk 
• Aspire for statelessness 
• Dynamic Endpoint Management 
• In-app failover and load balancing (LB) 
• Aggressive Timeouts 
• Multi-tiered alerting 
• Sane Fallback Values 
• Graceful Degradation
[Architecture diagram. Labels: API, Search, Ads, Images, Geo, Context, Auto-Complete, C* (Cassandra), Redis ×4]
● Thin API Layer 
○ Input validation 
○ Connection funnelling 
● Many smaller services 
● Many Redis instances 
○ “database” = instance 
● Thrift for internal APIs 
● Deploys are less scary 
● Scaling is easier 
● Well-defined contracts 
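import logging

log = logging.getLogger("black_knight")

# Stand-in return values so the slide's logic can actually run.
KeepFighting = "keep fighting"
SwitchDataCenter_PLZ_KTXBAI = "switch data center"


def black_knight(machine_is_down, fucked_services_count, num_running_services):
    if machine_is_down:
        # All is well
        return KeepFighting
    elif fucked_services_count == 1:
        log.info("Tis But a Scratch")
        return KeepFighting
    elif fucked_services_count == 2:
        log.info("Just a Flesh Wound!")
        return KeepFighting
    elif num_running_services >= 2:
        log.info("I'll Bite Your Legs Off!")
        return KeepFighting
    else:
        log.info("All Right, We'll Call It A Draw")
        return SwitchDataCenter_PLZ_KTXBAI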
• No Database Master ⇒ remain read only 
• No Queue ⇒ Write to log for future processing 
• No Service X ⇒ Return a default response, not 
an exception 
• No MySQL ⇒ Everything is ready for serving in 
Redis anyway 
• No internal service ⇒ Fall back to an external service 
• Etc...
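The "default response, not an exception" rule above can be sketched roughly as follows. This is a minimal illustration, not the talk's actual code: the decorator `with_fallback`, the function `recommend_apps`, and the two backends are hypothetical names invented for the example.

```python
import functools
import logging

log = logging.getLogger("fallback")


def with_fallback(default):
    """Return `default` instead of raising when the wrapped call fails."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Log the failure, but never let it reach the caller.
                log.exception("%s failed, serving default", func.__name__)
                return default
        return wrapper
    return decorator


@with_fallback(default=())
def recommend_apps(user_id, backend):
    # `backend` is a hypothetical stand-in for a call to an internal service.
    return backend(user_id)


def broken_backend(user_id):
    raise ConnectionError("service X is down")


def healthy_backend(user_id):
    return ("maps", "mail")


degraded = recommend_apps(1, broken_backend)   # empty result, no exception
normal = recommend_apps(1, healthy_backend)
```

The default is an immutable tuple so that one caller cannot accidentally modify the value handed to the next.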
• Multi Edge / single Central DCs 
• Geo-DNS based 
• Edges are read only 
• Central is write only 
• API / Logs 
• All edges are data-symmetrical 
• Any Edge may be taken out 
• Central may be taken out without 
service disruption
• ZooKeeper manages an endpoint tree 
• Watchdog registers services - no self announce 
• Changes to endpoints are published to all services 
• Automatic switching and adding of endpoints 
• Facilitates no-downtime deploys with downtime :) 
• A dead machine is deleted from ZK automatically 
• A static snapshot of endpoints kept on all 
machines
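To make the flow above concrete, here is a small in-memory sketch. It does not use ZooKeeper at all: `EndpointRegistry`, `register`, `remove_machine` and `load_snapshot` are hypothetical names that only mimic the behaviour listed (watchdog registration, publishing changes, dropping dead machines, keeping a static snapshot on disk).

```python
import json
import os
import tempfile


class EndpointRegistry:
    """In-memory stand-in for the endpoint tree (hypothetical, no ZooKeeper)."""

    def __init__(self, snapshot_path):
        self.snapshot_path = snapshot_path
        self.endpoints = {}      # service name -> sorted list of "host:port"
        self.subscribers = []    # callbacks invoked on every change

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def _publish(self, service):
        # Persist a static snapshot first, then notify every subscriber.
        with open(self.snapshot_path, "w") as f:
            json.dump(self.endpoints, f)
        for callback in self.subscribers:
            callback(service, list(self.endpoints.get(service, [])))

    def register(self, service, address):
        # Meant to be called by a watchdog, not by the service itself.
        current = set(self.endpoints.get(service, []))
        current.add(address)
        self.endpoints[service] = sorted(current)
        self._publish(service)

    def remove_machine(self, host):
        # Drop every endpoint that lives on a dead machine.
        for service, addresses in list(self.endpoints.items()):
            alive = [a for a in addresses if a.split(":")[0] != host]
            if alive != addresses:
                self.endpoints[service] = alive
                self._publish(service)


def load_snapshot(path):
    """Fallback for when the registry itself is unreachable."""
    with open(path) as f:
        return json.load(f)


snapshot = os.path.join(tempfile.mkdtemp(), "endpoints.json")
registry = EndpointRegistry(snapshot)
seen = []
registry.subscribe(lambda service, addrs: seen.append((service, addrs)))
registry.register("search", "10.0.0.1:9090")
registry.register("search", "10.0.0.2:9090")
registry.remove_machine("10.0.0.1")
```

The snapshot file is what lets a service keep its last known endpoints if the registry cannot be reached.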
• Internal “learning” Load Balancing Connection Pool 
• Protocol Agnostic (kinda...) 
• Python magic - no code changes 
• Silent failovers 
• Automatic fast banning / exploration 
• Why in-app? 
• Application Aware 
• Less latency 
• Endpoint (EP) management support
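A rough sketch of the "silent failover plus fast banning" idea, assuming endpoints are plain callables. `BanningPool` and its parameters are hypothetical; the real pool described above is protocol-aware and "learning", which this toy version is not.

```python
import random
import time


class BanningPool:
    """Pick endpoints at random, temporarily banning ones that fail."""

    def __init__(self, endpoints, ban_seconds=5.0, clock=time.monotonic):
        self.endpoints = list(endpoints)
        self.ban_seconds = ban_seconds
        self.clock = clock
        self.banned_until = {}   # endpoint -> time at which its ban expires

    def _available(self):
        now = self.clock()
        return [e for e in self.endpoints if self.banned_until.get(e, 0) <= now]

    def call(self, request):
        """Try each candidate once; fail over silently on errors."""
        # If everything is banned, explore the full list rather than give up.
        candidates = self._available() or list(self.endpoints)
        random.shuffle(candidates)
        last_error = RuntimeError("no endpoints configured")
        for endpoint in candidates:
            try:
                return endpoint(request)
            except Exception as exc:
                # Ban quickly; the endpoint is retried once the ban expires.
                self.banned_until[endpoint] = self.clock() + self.ban_seconds
                last_error = exc
        raise last_error


def dead(request):
    raise TimeoutError("no answer")


def alive(request):
    return "ok:" + request


fake_now = [0.0]   # injectable clock so the example needs no sleeping
pool = BanningPool([dead, alive], ban_seconds=5.0, clock=lambda: fake_now[0])
results = [pool.call("ping") for _ in range(20)]
```

Callers only ever see successful responses here; the failing endpoint is skipped until its ban runs out.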
• Proper timeouts are a key factor for a distributed system 
• They should be as low as possible while avoiding false positives 
• Internal service timeouts should be < 50ms 
• Client timeouts can be rather big to support retries 
• Without proper timeouts any link in the chain can bring you 
down 
• Log them but try to recover 
• Don’t forget they add up! 
• Bad Timeout == Point Of Failure
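The point that timeouts add up can be illustrated with a small request budget. This is a sketch under assumptions: `Deadline`, `timeout_for` and `worst_case_latency` are hypothetical helpers, and all values are in milliseconds.

```python
import time


class Deadline:
    """Tracks the time left for a whole request across several hops."""

    def __init__(self, budget_ms, clock=lambda: time.monotonic() * 1000):
        self.clock = clock
        self.expires_at = clock() + budget_ms

    def remaining(self):
        return max(0, self.expires_at - self.clock())

    def timeout_for(self, per_call_cap_ms):
        """Timeout for the next call: the cap, or less if the budget is low."""
        left = self.remaining()
        if left <= 0:
            raise TimeoutError("request budget exhausted")
        return min(per_call_cap_ms, left)


def worst_case_latency(call_timeouts_ms, retries=0):
    """Sequential calls add up; each retry repeats the full timeout."""
    return sum(t * (1 + retries) for t in call_timeouts_ms)


# Three sequential internal calls at 50 ms each, first without retries,
# then with one retry per call.
no_retry = worst_case_latency([50, 50, 50])
one_retry = worst_case_latency([50, 50, 50], retries=1)

now_ms = [0]   # fake clock for the example
deadline = Deadline(budget_ms=120, clock=lambda: now_ms[0])
first = deadline.timeout_for(50)
now_ms[0] = 100
second = deadline.timeout_for(50)   # only 20 ms of the budget is left
```

Capping each call by the remaining budget keeps a chain of individually reasonable timeouts from exceeding what the client is willing to wait.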
• The obvious dark side of all this 
• Survival Strategies 
• Separate non time-critical API calls 
• Client Side Backoffs 
• Selective Failover Retries 
• Capacity barriers in internal services 
• Tune well 
• Fail fast! 
• Return sane defaults, not errors
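Two of the strategies above, client-side backoff and capacity barriers, might look roughly like this. Both `backoff_delays` and `CapacityBarrier` are hypothetical names, and the jittered exponential backoff is one common choice rather than something the talk specifies.

```python
import random
import threading


def backoff_delays(base=0.5, factor=2.0, cap=30.0, attempts=5, rng=random.random):
    """Yield jittered delays, each uniform in [0, min(cap, base * factor**n))."""
    for attempt in range(attempts):
        ceiling = min(cap, base * factor ** attempt)
        yield rng() * ceiling


class CapacityBarrier:
    """Reject work immediately once `limit` calls are already in flight."""

    def __init__(self, limit):
        self._slots = threading.BoundedSemaphore(limit)

    def run(self, func, *args, default=None):
        if not self._slots.acquire(blocking=False):
            # Fail fast with a sane default instead of queueing up.
            return default
        try:
            return func(*args)
        finally:
            self._slots.release()


# With the random factor pinned to 1.0 the delay ceilings are visible.
ceilings = list(backoff_delays(rng=lambda: 1.0))

barrier = CapacityBarrier(limit=1)
# The inner call arrives while the outer one still holds the only slot.
nested = barrier.run(lambda: barrier.run(lambda: "inner", default="busy"))
afterwards = barrier.run(lambda: "ok")
```

The barrier never blocks: a caller either gets a slot right away or gets the default back, which keeps an overloaded service from piling up work.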
• We are constantly improving our infrastructure 
• Our DevOps team are doing that, not maintaining servers 
• We accept that all solutions are temporary 
• Start-up infrastructure is always a compromise 
• We embrace Post-Mortems as an opportunity to improve
Want to improve our infra? We’re hiring ;) 
Ping me: 
Twitter: @dvirsky 
dvir@everything.me
• MySQL/Redis data abstraction library 
• Objects can be saved or loaded from either 
• MySQL is write only, Redis read only 
• MySQL can be down without disruption 
• Only MySQL is replicated between DCs 
• Automatic migrations to Redis 
• Spartacus - a pseudo MySQL slave that notifies 
on changes 
[Diagram. Labels: Central (MySQL, Redis); Edge (MySQL, Spartacus, Redis)]
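The write-to-MySQL, read-from-Redis split can be sketched with plain dicts standing in for both stores. `DualStore` and `sync_to_cache` are hypothetical names; the listener only imitates the role of a change notifier and says nothing about how Spartacus is actually built.

```python
class DualStore:
    """Writes go to the primary store; reads are served from the cache.

    `primary` and `cache` are plain dicts standing in for MySQL and Redis.
    """

    def __init__(self, primary, cache):
        self.primary = primary
        self.cache = cache
        self.listeners = []   # called on every write to the primary

    def save(self, key, obj):
        self.primary[key] = dict(obj)
        # Stand-in for the change notification that feeds the read store.
        for listener in self.listeners:
            listener(key, obj)

    def load(self, key, default=None):
        # Reads never touch the primary, so it can be down without impact.
        return self.cache.get(key, default)


def sync_to_cache(cache):
    """Build a listener that copies each changed object into the cache."""
    def on_change(key, obj):
        cache[key] = dict(obj)
    return on_change


mysql, redis = {}, {}
store = DualStore(mysql, redis)
store.listeners.append(sync_to_cache(redis))
store.save("app:1", {"name": "Maps"})

mysql.clear()   # simulate the primary going away
still_served = store.load("app:1")
missing = store.load("app:2", default={})
```

Because reads only ever hit the cache, losing the primary affects writes but not serving.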

Tales Of The Black Knight - Keeping EverythingMe running
