Four years at Yelp, 80 people -> hundreds. Just going to talk about what I’ve learned along the way.
For this talk to make sense, we have to also talk about what Yelp is.
Connect people w/ great local businesses
Approx. 83 million unique monthly visitors (UMVs) via mobile. More than 83 million reviews contributed since inception. Approx. 68% of all searches on Yelp came from mobile (mobile web & app). Yelp is present across 32 countries.
To drive numbers like that, you need a big platform.
Lots of sites and apps with lots of features, lots of people working on them, and lots of computers that power it all
Well, I can write a convincing talk proposal. ...well, and we’re sponsoring
You’ve been conned! And now you can’t leave!
You thought I was going to come up here and talk about big data (we’re at CHUG, right?)
...but instead I am here to talk to you about people
and that’s because skynet hasn’t taken over yet, so at the end of the day it’s still humans that write software
Figuring out how to make good software is hard, especially when solving big data problems
and the infrastructure that allows that software to exist is challenging to create and maintain effectively
both the software and the infrastructure are critical, but the infrastructure tends to be much harder to change than the software
It’s like the foundation of your house… or the wheels on your car
and that infrastructure is as important for scaling the human parts of your company as the technology!
so these things are all intertwined. if you solve the people problems, the technology will follow (and vice versa)
Just going to talk about some challenging problems we’ve faced.
We have cool technology, but the way we succeed is in how we scale our people!
Ok but actually WTF are we talking about
one of the very first things that gets in the way with this, particularly for something like a website, is...
This is one of the biggest challenges that makes accomplishing our goal difficult
This is how most projects, companies, etc. start: single code repo, maybe a server or two, and one or a couple of developers
And then ship it!
Dump the code into production, probably restart everything. Click around, make sure stuff looks good. Maybe you’ve even got error monitoring!
This works for a while, and it’s all you need when you get started.
But then time passes, and the monolith grows. You add features. You add developers to make those features.
As you add code and scale out your org + infrastructure, things naturally take longer. What was once a 10 minute deploy process might take closer to 30 minutes… or 45 minutes… or maybe even an hour!
...but that’s not a big deal, right?
And here we run into a problem.
HUMANS SCREW UP
As you grow, you’re doing more stuff. More people writing more code to power more features, covered by more tests that gate deployments containing more people’s changes.
More stuff means more chances to screw up, which we do, because we are humans. And when you screw up, it means back to square one… new build, new test run, new deployment.
...and everyone has lost as much time as it takes to get this far. And they’ll have to invest it all again to get their code out!
This starts looking pretty grim even around 10 branches, and that’s assuming a (generous) 98% success rate! At 20 branches we’re below 70%!
So… how can we do better?
Well, we can try to improve this number...
Make it harder to screw up! Decrease the chances of failure.
This is where almost all teams start focusing their efforts first.
Here are some common ways people try to make screwing up more difficult.
This doesn’t work in the real world. At the end of the day, we’re still human
We’re people! We make mistakes! We just spent a long time talking about how we’re fallible. Why would this be any less true of the systems we create to prevent us from making mistakes?
In reality, doing all those things does help
But at the end of the day, you need more to scale an org. We want the asymptotic solutions, not the constant factor.
...and of course, as computing professionals, you’ve all probably been writhing in your seats, trying to tell me to do this first
We tackle this asymptotic factor with SOA. Split up large code bases into smaller ones that can be developed independently and communicate over common interfaces.
How you size this is up to you. Don’t fall prey to the hype of “microservices” if it doesn’t make sense for you.
It is a lot harder to do SOA than a monolith, and it can decrease your rate of success dramatically! It takes a lot of effort and discipline to get it right.
However, it’s very difficult to obtain the advantages it provides any other way.
Embrace the idea that failures will happen, and be ready for them!
In a world like this, you need to safeguard your deployment process. It’s a problem if it gets slow, because it’s your out when you screw up.
Ok, but wait a second. You just said deployment is gated using tests, right? But I thought those were hard to get right!
...and there is a real cost to getting them wrong!
It can be easy, especially in dynamic languages, for tests to accrue dependencies on other things over time: “Katamari” dependencies that roll up everything they touch.
It can be tempting to write tests that cover way, way too much
Tests that don’t have enough coverage
tests that rely on lots of external infrastructure
tests that test the outside world: external apis, vendors, etc.
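One way to avoid tests that exercise the outside world is to stub the external client so the test only checks your own integration code. A minimal sketch using the standard library’s `unittest.mock`; the client, endpoint, and field names here are all made up for illustration, not Yelp’s actual API:

```python
# Hypothetical sketch: stub out a vendor client so the test exercises only
# our integration code, never the outside world. The client, endpoint, and
# field names are invented for this example.
from unittest import mock

def fetch_business_rating(client, business_id):
    # Our integration code under test: call the vendor, pull out one field.
    resp = client.get(f"/businesses/{business_id}")
    return resp["rating"]

def test_fetch_business_rating():
    fake_client = mock.Mock()
    fake_client.get.return_value = {"rating": 4.5}
    assert fetch_business_rating(fake_client, "abc123") == 4.5
    fake_client.get.assert_called_once_with("/businesses/abc123")

test_fetch_business_rating()
```

The test now fails only when your code is wrong, not when the vendor has a bad day. The vendor itself gets covered by monitoring instead (more on that below).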
Like we talked about before: stop treating the tests as a safety net.
Not all tests are sacred, and indeed a lot of them are probably hurting you if you’ve been around long enough. Some tests are better than others.
A test that fails on master when nothing is actually broken is depriving you of information. You have no idea if what that test tests is OK, because a failure doesn’t mean anything and therefore a pass doesn’t mean anything either!
Reliable tests are wayyy better than achieving high test coverage. CD only works when green actually means green and red actually means red.
We have a hard time with this one because Python is a dynamic language, but if you can do this, you should!
Monitoring will let you know even sooner when your integration code with a partner breaks, especially when it’s their side. Running this at integration time isn’t telling you anything and it’s slowing you down.
The Problem, With Math
Assume:
- Every change has a chance of success: 98%. That means no test failures, no reverts, etc.
- Every deploy has a number of changes: n
- Any failure in the pipeline invalidates the deploy
Let’s figure out the probability of a successful deployment: p
The Problem, With Math
- Only you: p = .98 (98%)
- You and a friend: p = .98 * .98 = .96 (96%)
- You and nine co-workers: p = .98 * .98 * … * .98 = .98^10 = .82 (82%)
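The compounding above fits in a few lines of Python. The 98% per-change success rate is the talk’s assumed number, not a measurement:

```python
# Sketch of the deploy-success math: each of n changes must independently
# succeed, so the whole deploy succeeds with probability p_change ** n.
def deploy_success_probability(n, p_change=0.98):
    """Probability that a deploy bundling n independent changes succeeds."""
    return p_change ** n

for n in (1, 2, 10, 20):
    print(f"{n:2d} changes -> {deploy_success_probability(n):.0%}")
# -> 98%, 96%, 82%, 67%
```

Those last two numbers are the “pretty grim at 10 branches, below 70% at 20” claim from earlier.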
Conquering SOA
- Treat everything as distributed; that means everything will fail
- Use timeouts, retries
- Find ways to degrade gracefully
- Fail fast & isolated
- Don’t rely on synchronous processes
- Prepare for eventual consistency
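A minimal sketch of the “use timeouts, retries” point, with jittered backoff so a thundering herd doesn’t retry in lockstep. Function and parameter names are invented for this example, not Yelp’s actual helpers:

```python
# Illustrative sketch: call a flaky remote operation with a timeout and a
# bounded number of retries, then give up (fail fast, stay isolated).
import random
import time

def call_with_retries(fn, retries=3, timeout=1.0, backoff=0.2):
    """Invoke fn(timeout=...), retrying on timeout/connection errors."""
    last_err = None
    for attempt in range(retries):
        try:
            return fn(timeout=timeout)
        except (TimeoutError, ConnectionError) as err:
            last_err = err
            # Jittered exponential backoff between attempts.
            time.sleep(backoff * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise last_err
```

A real version would also cap total elapsed time and distinguish retryable from non-retryable failures, but the shape is the same: bound the damage, then degrade gracefully.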
“Not Recommended” Tests
If a test fails on master, either:
- a feature is broken on the live website, or
- your test sucks and you should ditch it
In either case, we disable it. A ticket is created; developers can fix it later or just bin it and start fresh.
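The disable-and-ticket policy can be as simple as a skip decorator with the ticket reference in the reason. A hypothetical sketch using stdlib unittest; the test name and ticket ID are made up:

```python
# Hypothetical quarantine: a test that fails on master gets disabled with a
# ticket reference so it stops blocking deploys until someone fixes or bins it.
import unittest

@unittest.skip("Fails on master; not trusted. Tracked in TICKET-1234.")
class TestSearchRankingSnapshot(unittest.TestCase):
    def test_snapshot(self):
        ...  # the flaky assertions live here until fixed or deleted
```

The key is that the skip is loud (it shows up in test output and in the ticket queue), so quarantined tests don’t silently rot forever.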