22. Deployment: the early days
Get a few people together in Slack/IRC/etc.
Merge up the code
Run the tests
Manually test it in staging
Cross your fingers
25. Things get slower...
Tests take longer to run
More hosts = longer downloads
More developers = more eyeballs
More features = more code
29. The Problem, With Math
Assume:
Every change has a chance of success: 98%
That means no test failures, no reverts, etc.
Every deploy has a number of changes: n
Any failure in the pipeline invalidates the deploy
Let’s figure out the probability of a successful deployment: p
30. The Problem, With Math
Only you
p = .98 (98%)
You and a friend
p = .98 * .98 = .96 (96%)
You and nine co-workers
p = .98 * .98 * .98 * … * .98 = .82 (82%)
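In code, that compounding looks like this (a minimal Python sketch; the function name is ours, the numbers are from the slide):

```python
# Probability that a deploy containing n changes succeeds, assuming each
# change independently succeeds with probability 0.98.
def deploy_success(n, per_change=0.98):
    return per_change ** n

for n in (1, 2, 10, 20):
    print(n, round(deploy_success(n), 3))
# -> 1 0.98, 2 0.96, 10 0.817, 20 0.668
```

Note how fast it drops: at 20 concurrent changes, fewer than seven deploys in ten go out clean.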
34. This doesn’t scale!
More developers = more changes
More changes = longer deploys
Longer deploys = less time to develop
Less time to develop = slower to iterate
Slower to iterate != the goal
38. Making it harder to screw up
Write more tests
Write better tests
Get better code reviews
Get better infrastructure
Switch programming languages
Use better tools
42. The Real World
Testing builds confidence in our changes
Testing does not protect you from failure
Better tools, tests, and infrastructure can raise our success rates
45. Service-Oriented Architecture
Large monolith → smaller services
Services communicate over network
Usually HTTP, but you can do RPC, SOAP, etc.
Service = independent code base
Independent deployments
47. Service-Oriented Architecture
Drawbacks
everything becomes decoupled
function calls start looking like HTTP requests
versioning can be a nightmare
tracking dependencies is hard
data consistency becomes challenging
end-to-end testing becomes hard(er), if not impossible
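To make the “function calls start looking like HTTP requests” drawback concrete, here’s a hedged sketch; the service name and URL are invented for illustration:

```python
import requests

# Monolith: an in-process function call that basically cannot fail.
#   user = get_user(user_id)

# SOA: the same lookup is now a network round trip that can time out,
# return a 500, or hit an incompatible version of the API.
def get_user(user_id):
    resp = requests.get(
        f"https://user-service.internal/v1/users/{user_id}",  # hypothetical service
        timeout=1.0,
    )
    resp.raise_for_status()
    return resp.json()
```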
52. Conquering SOA
Treat everything as distributed
That means everything will fail
Use timeouts, retries
Find ways to degrade gracefully
Fail fast & isolated
Don’t rely on synchronous processes
Prepare for eventual consistency
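A minimal sketch of those rules working together, assuming a hypothetical recommendations service; the URL, timeout, and retry budget are all placeholders:

```python
import time
import requests

def fetch_recommendations(user_id, retries=2, timeout=0.5):
    """Timeouts + retries, then degrade gracefully instead of erroring out."""
    url = f"https://recs.internal/v1/users/{user_id}"  # hypothetical endpoint
    for attempt in range(retries + 1):
        try:
            resp = requests.get(url, timeout=timeout)  # fail fast
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt < retries:
                time.sleep(0.1 * 2 ** attempt)  # brief backoff before retrying
    return []  # degrade gracefully: the page renders without recommendations
```

The last line is the point: a failed dependency costs you one feature, not the whole page.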
53. Reaping the Benefits
Smaller failure domains
Fewer people & changes to manage
Deploys get smaller
Deploys get faster
Deploys become continuous
54. Reaping the Benefits
Smaller changes
means smaller code reviews
means faster validation
means smaller blast radius
means faster iteration
55. Continuous Delivery
Everyone works against master branch
Master is deployed when commits are added
Deployment gated by tests
Monitoring knows something is wrong before you do!
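As a sketch, the gate can be as simple as “green tests or no deploy”; `deploy.sh` here stands in for whatever your pipeline actually runs:

```python
import subprocess
import sys

# Gate the deploy on the test suite: any failure aborts before anything ships.
tests = subprocess.run(["pytest", "-q"])
if tests.returncode != 0:
    sys.exit("Tests failed; deploy aborted.")

# Only reached on green: push the current master build out (hypothetical script).
subprocess.run(["./deploy.sh"], check=True)
```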
68. “Not Recommended” Tests
If a test fails on master:
a feature is broken on the live website, or
your test sucks and you should ditch it
In either case, we disable it
Ticket is created
Developers can fix it later or just bin it and start fresh
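In pytest terms, “disable it and file a ticket” can look like this; the ticket ID is made up:

```python
import pytest

@pytest.mark.skip(reason="Flaky on master; disabled pending TICKET-1234")
def test_checkout_flow():
    ...
```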
4 years at Yelp, 80 people -> hundreds
Just going to talk about what I’ve learned along the way
For this talk to make sense, we have to also talk about what Yelp is.
Connect people w/ great local businesses
Approx. 83 million unique monthly visitors (UMVs) via mobile
More than 83 million reviews contributed since inception
Approx. 68% of all searches on Yelp came from mobile (mobile web & app)
Yelp is present across 32 countries
To drive numbers like that, you need a big platform
Lots of sites and apps with lots of features, lots of people working on them, and lots of computers that power it all
Well, I can write a convincing talk proposal. ...well, and we’re sponsoring
You’ve been conned! And now you can’t leave!
You thought I was going to come up here and talk about big data (we’re at CHUG, right?)
...but instead I am here to talk to you about people
and that’s because Skynet hasn’t taken over yet, so at the end of the day it’s still humans that write software
Figuring out how to make good software is hard, especially when solving big data problems
and the infrastructure that allows that software to exist is challenging to create and maintain effectively
both the software and the infrastructure are critical, but the infrastructure tends to be much harder to change than the software
It’s like the foundation of your house… or the wheels on your car
and that infrastructure is as important for scaling the human parts of your company as the technology!
so these things are all intertwined. if you solve the people problems, the technology will follow (and vice versa)
Just going to talk about some challenging problems we’ve faced.
We have cool technology, but the way we succeed is in how we scale our people!
OK, but actually, WTF are we talking about?
one of the very first things that gets in the way with this, particularly for something like a website, is...
This is one of the biggest challenges that makes accomplishing our goal difficult
This is how most projects, companies, etc. start: single code repo, maybe a server or two, and one or a couple of developers
And then ship it!
Dump the code into production, probably restart everything. Click around, make sure stuff looks good. Maybe you’ve even got error monitoring!
This works for a while, and it’s all you need when you get started.
But then time passes, and the monolith grows. You add features. You add developers to make those features.
As you add code and scale out your org + infrastructure, things naturally take longer. What was once a 10 minute deploy process might take closer to 30 minutes… or 45 minutes… or maybe even an hour!
...but that’s not a big deal, right?
And here we run into a problem.
HUMANS SCREW UP
As you grow, you’re doing more stuff. More people writing more code to power more features, covered by more tests, going out in deployments that batch up more people’s changes.
More stuff means more chances to screw up, which we do, because we are humans. And when you screw up, it means back to square one… new build, new test run, new deployment.
...and everyone has lost as much time as it takes to get this far. And they’ll have to invest it all again to get their code out!
This starts looking pretty grim even around 10 branches, and that’s assuming a (generous) 98% success rate! At 20 branches we’re below 70%!
So… how can we do better?
Well, we can try to improve this number...
Make it harder to screw up! Decrease the chances of failure.
This is where almost all teams start focusing their efforts first.
Here are some common ways people try to make screwing up more difficult.
It’s easy!
This doesn’t work in the real world. At the end of the day, we’re still human.
We’re people! We make mistakes! We just spent a long time talking about how we’re fallible. Why would this be any less true of the systems we create to prevent us from making mistakes?
In reality, doing all those things does help
But at the end of the day, you need more to scale an org. We want the asymptotic solutions, not the constant factor.
...and of course, as computing professionals, you’ve all probably been writhing in your seats, trying to tell me to do this first
We tackle this asymptotic factor with SOA. Split up large code bases into smaller ones that can be developed independently and communicate over common interfaces.
How you size this is up to you. Don’t fall prey to the hype of “microservices” if it doesn’t make sense for you.
It is a lot harder to do SOA than a monolith, and it can decrease your rate of success dramatically! It takes a lot of effort and discipline to get it right.
However, it’s very difficult to obtain the advantages it provides any other way.
Embrace the idea that failures will happen, and be ready for them!
In a world like this, you need to safeguard your deployment process. It’s a problem if it gets slow, because it’s your out when you screw up.
Ok, but wait a second. You just said deployment is gated using tests, right? But I thought those were hard to get right!
...and there is a real cost to getting them wrong!
It can be easy, especially in dynamic languages, to accrue dependencies on other things over time. “Katamari” dependencies
It can be tempting to write tests that cover way, way too much
Tests that don’t have enough coverage
Slow tests
Tests that rely on lots of external infrastructure
Tests that test the outside world: external APIs, vendors, etc. (see the mocking sketch below)
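One way out is to mock the outside world instead of calling it; here’s a sketch with `unittest.mock`, where `billing.charge_card`, `orders`, and `place_order` are hypothetical names:

```python
from unittest import mock

from orders import place_order  # hypothetical module under test

# Replace the real vendor call with a canned response so the test
# exercises our logic, not the vendor's uptime.
@mock.patch("billing.charge_card", return_value={"status": "ok"})
def test_order_completes(mock_charge):
    result = place_order(order_id=42)
    assert result["paid"]
    mock_charge.assert_called_once()
```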
Like we talked about before: stop treating the tests as a safety net.
Not all tests are sacred, and indeed a lot of them are probably hurting you if you’ve been around long enough. Some tests are better than others.
A test that fails on master when nothing is actually broken deprives you of information. You have no idea whether what that test covers is OK, because a failure doesn’t mean anything, and therefore a pass doesn’t mean anything either!
Reliable tests are wayyy better than achieving high test coverage. CD only works when green actually means green and red actually means red.
We have a hard time with this one because Python is a dynamic language, but if you can do this, you should!
Monitoring will let you know even sooner when your integration code with a partner breaks, especially when the breakage is on their side. Running this check at integration time isn’t telling you anything, and it’s slowing you down.
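A sketch of what that monitoring probe might look like, run on a schedule instead of in the deploy pipeline; the endpoint and alert hook are hypothetical:

```python
import requests

def check_partner_api():
    """Runs every minute from cron/monitoring, not from the test suite."""
    try:
        resp = requests.get("https://api.partner.example/health", timeout=3)
        healthy = resp.status_code == 200
    except requests.RequestException:
        healthy = False
    if not healthy:
        page_oncall("partner API health check failing")  # hypothetical alert hook

def page_oncall(message):
    print("ALERT:", message)  # stand-in for a real paging integration
```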