Humans by the hundred

Scaling Big Data for Big Team Growth

  1. Humans By The Hundred Scaling Big Data for Big Team Growth
  2. $ whoami SRE Manager at Yelp CWRU Alum Pittsburgh native <3 Web Operations Just a dude
  3. Yelp’s Mission: Connecting people with great local businesses.
  4. Yelp Stats (as of Q2 2015): 83M, 32, 68%, 83M
  5. What is Yelp? Many sites: www, m, biz, api Mobile apps Partner platform Hundreds of developers Thousands of servers
  6. Why Am I Here?
  7. This talk is about people
  8. The Goal
  9. Iterate as fast as possible
  10. Regardless of how many people are participating
  11. Deployment
  12. How It Starts
  13. Deployment: the early days Get a few people together in slack/irc/etc. Merge up the code Run the tests Manually test it in stage Cross your fingers
  14. Things get slower... Tests take longer to run More hosts = longer downloads More developers = more eyeballs More features = more code
  15. The Problem: Humans Are Fallible
  16. The Problem: Humans Are Fallible “…oh @$#&”
  17. The Problem, With Math. Assume: every change has a 98% chance of success (that means no test failures, no reverts, etc.); every deploy has a number of changes, n; any failure in the pipeline invalidates the deploy. Let’s figure out the probability of a successful deployment, p.
  18. The Problem, With Math. Only you: p = .98 (98%). You and a friend: p = .98 * .98 = .96 (96%). You and nine co-workers: p = .98 * .98 * .98 * … * .98 = .82 (82%).
  19. The Problem, With Math p = (.98)^n
  20. The Problem, With Math p = (.98)^n: exponential decay!
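The formula on slides 17 to 20 can be checked with a few lines of Python. This is a minimal sketch, not code from the talk: the function name `deploy_success_probability` is hypothetical, and multiplying the per-change rates assumes that changes succeed or fail independently of one another.

```python
def deploy_success_probability(per_change_success: float, num_changes: int) -> float:
    """Probability that a deploy of `num_changes` changes succeeds.

    Assumes each change succeeds independently with the same probability,
    and that any single failure invalidates the whole deploy.
    """
    if not 0.0 <= per_change_success <= 1.0:
        raise ValueError("per_change_success must be between 0 and 1")
    if num_changes < 0:
        raise ValueError("num_changes must be non-negative")
    return per_change_success ** num_changes


if __name__ == "__main__":
    # 1, 2 and 10 match the cases on slide 18; larger n shows the decay.
    for n in (1, 2, 10, 50, 100):
        p = deploy_success_probability(0.98, n)
        print(f"n={n:>3}  p={p:.2f}")
```

Running it for larger values of n shows how quickly the chance of a clean deploy drops as more changes are batched together.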
  21. This doesn’t scale! More developers = more changes More changes = longer deploys Longer deploys = less time to develop Less time to develop = slower to iterate Slower to iterate != the goal
  22. Mitigating Exponential Decay p = (.98)^n
  23. Mitigating Exponential Decay p = (.98)^n
  24. Making it harder to screw up Write more tests Write better tests Get better code reviews Get better infrastructure Switch programming languages Use better tools
  25. Just write better software and stop making mistakes!
  27. The Real World Testing builds confidence in our changes Testing does not protect you from failure Better tools, tests, and infrastructure can raise our success rates
  28. Mitigating Exponential Decay p = (.98)^n
  29. Mitigating Exponential Decay p = (.98)^n
  30. Service-Oriented Architecture Large monolith → smaller services Services communicate over network Usually HTTP, but you can do RPC, SOAP, etc. Service = independent code base Independent deployments
  31. Service-Oriented Architecture Benefits Smaller code bases = upper bound to n Failure domains become isolated Technology independence Federated responsibility
  32. Service-Oriented Architecture Drawbacks: everything becomes decoupled; function calls start looking like HTTP requests; versioning can be a nightmare; tracking dependencies is hard; data consistency becomes challenging; end-to-end testing becomes hard(er), if not impossible
  33. SOA scales people, not code.
  34. Conquering SOA With the monolith, it’s easy to focus on mean time between failures (MTBF)
  35. Conquering SOA In a SOA, focus on mean time to recovery (MTTR)
  36. Conquering SOA Fail fast Anticipate failure Leverage iteration speed to recover fast
  37. Conquering SOA Treat everything as distributed That means everything will fail Use timeouts, retries Find ways to degrade gracefully Fail fast & isolated Don’t rely on synchronous processes Prepare for eventual consistency
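One way to picture the "retries" and "degrade gracefully" advice from slide 37 is a small wrapper around a remote call. This is an illustrative sketch only, not Yelp's implementation: `call_with_retries`, its parameters, and the backoff policy are all assumptions, and the wrapped operation is expected to enforce its own timeout by raising `TimeoutError`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    operation: Callable[[], T],
    fallback: T,
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `operation` up to `attempts` times, then degrade to `fallback`.

    `operation` is assumed to apply its own timeout and to raise
    TimeoutError or ConnectionError when the remote service misbehaves.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            # Back off exponentially between attempts, but not after the last one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed: return a degraded answer instead of propagating the error.
    return fallback


if __name__ == "__main__":
    def always_down() -> list:
        raise TimeoutError("service did not answer in time")

    # Falls back to an empty result after three failed attempts.
    print(call_with_retries(always_down, fallback=[], sleep=lambda _: None))
```

Returning a fallback keeps the failure isolated to one feature; whether a stale or empty answer is acceptable depends on the caller.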
  38. Reaping the Benefits Smaller failure domains Fewer people & changes to manage Deploys get smaller Deploys get faster Deploys become continuous
  39. Reaping the Benefits Smaller changes means smaller code reviews means faster validation means smaller blast radius means faster iteration
  40. Continuous Delivery Everyone works against master branch Master is deployed when commits added Deployment gated by tests Monitoring knows something is wrong before you do!
  42. Testing
  43. Tests are hard to get right.
  44. How can we do better?
  45. “Not Recommended” Tests
  46. “Not Recommended” Tests. If a test fails on master: a feature is broken on the live website, or your test sucks and you should ditch it. In either case, we disable it. Ticket is created. Developers can fix it later, or just bin it and start fresh.
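The workflow on slide 46 (a test fails on master, it gets disabled, a ticket is created) can be sketched as a small piece of bookkeeping. Everything here is hypothetical and for illustration: the `SuiteState` class, its fields, and the ticket text are not taken from the talk or from any real test runner.

```python
from dataclasses import dataclass, field


@dataclass
class SuiteState:
    """Tracks which tests gate deploys and which are 'not recommended'."""

    enabled: set[str]
    not_recommended: set[str] = field(default_factory=set)
    tickets: list[str] = field(default_factory=list)

    def record_master_failure(self, test_name: str) -> None:
        """Disable a test that failed on master and open a ticket for it."""
        if test_name not in self.enabled:
            # Already disabled or unknown: nothing more to do.
            return
        self.enabled.remove(test_name)
        self.not_recommended.add(test_name)
        self.tickets.append(f"Fix or delete disabled test: {test_name}")


if __name__ == "__main__":
    suite = SuiteState(enabled={"test_login", "test_search"})
    suite.record_master_failure("test_search")
    print(sorted(suite.enabled))          # tests still gating deploys
    print(sorted(suite.not_recommended))  # tests waiting for a fix
    print(suite.tickets)
```

The point of the design is that a flaky or broken test stops blocking everyone else immediately, while the ticket keeps the follow-up visible.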
  47. Reliable tests >> test coverage.
  48. Don’t always run all the tests!
  49. Tests of external services should be monitoring
  50. Define your boundaries.
  51. / dataset_challenge ● 61K businesses ● 61K checkin-sets ● 481K business attributes ● 1.6M reviews ● 366K users ● 2.8M edge social-graph ● 495K tips Your academic project, research or visualizations, submitted by Dec 31, 2015 = $5,000 prize + $1,000 for publication + $500 for presenting* Academic dataset from 10 cities in 4 countries!
  52. @YelpEngineering YelpEngineers
  54. Questions?