22. Deployment: the early days
Get a few people together in Slack/IRC/etc.
Merge up the code
Run the tests
Manually test it in staging
Cross your fingers
25. Things get slower...
Tests take longer to run
More hosts = longer downloads
More developers = more eyeballs
More features = more code
29. The Problem, With Math
Assume:
Every change has a chance of success: 98%
That means no test failures, no reverts, etc.
Every deploy has a number of changes: n
Any failure in the pipeline invalidates the deploy
Let’s figure out the probability of a successful deployment: p
30. The Problem, With Math
Only you
p = .98 (98%)
You and a friend
p = .98 * .98 = .96 (96%)
You and nine co-workers
p = .98 * .98 * .98 * … * .98 = .82 (82%)
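In code, that compounding looks like this (a minimal Python sketch; the function name is ours, the numbers are from the slide):

```python
# Probability that a deploy containing n changes succeeds, assuming each
# change independently succeeds with probability 0.98.
def deploy_success(n, per_change=0.98):
    return per_change ** n

for n in (1, 2, 10, 20):
    print(n, round(deploy_success(n), 3))
# -> 1 0.98, 2 0.96, 10 0.817, 20 0.668
```

Note how fast it drops: at 20 concurrent changes, fewer than seven deploys in ten go out clean.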
34. This doesn’t scale!
More developers = more changes
More changes = longer deploys
Longer deploys = less time to develop
Less time to develop = slower to iterate
Slower to iterate != the goal
38. Making it harder to screw up
Write more tests
Write better tests
Get better code reviews
Get better infrastructure
Switch programming languages
Use better tools
42. The Real World
Testing builds confidence in our changes
Testing does not protect you from failure
Better tools, tests, and infrastructure can raise our success rates
45. Service-Oriented Architecture
Large monolith → smaller services
Services communicate over network
Usually HTTP, but you can do RPC, SOAP, etc.
Service = independent code base
Independent deployments
47. Service-Oriented Architecture
Drawbacks
everything becomes decoupled
function calls start looking like HTTP requests
versioning can be a nightmare
tracking dependencies is hard
data consistency becomes challenging
end-to-end testing becomes hard(er), if not impossible
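To make the “function calls start looking like HTTP requests” drawback concrete, here’s a hedged sketch; the service name and URL are invented for illustration:

```python
import requests

# Monolith: an in-process function call that basically cannot fail.
#   user = get_user(user_id)

# SOA: the same lookup is now a network round trip that can time out,
# return a 500, or hit an incompatible version of the API.
def get_user(user_id):
    resp = requests.get(
        f"https://user-service.internal/v1/users/{user_id}",  # hypothetical service
        timeout=1.0,
    )
    resp.raise_for_status()
    return resp.json()
```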
52. Conquering SOA
Treat everything as distributed
That means everything will fail
Use timeouts, retries
Find ways to degrade gracefully
Fail fast & isolated
Don’t rely on synchronous processes
Prepare for eventual consistency
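A minimal sketch of those rules working together, assuming a hypothetical recommendations service; the URL, timeout, and retry budget are all placeholders:

```python
import time
import requests

def fetch_recommendations(user_id, retries=2, timeout=0.5):
    """Timeouts + retries, then degrade gracefully instead of erroring out."""
    url = f"https://recs.internal/v1/users/{user_id}"  # hypothetical endpoint
    for attempt in range(retries + 1):
        try:
            resp = requests.get(url, timeout=timeout)  # fail fast
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            if attempt < retries:
                time.sleep(0.1 * 2 ** attempt)  # brief backoff before retrying
    return []  # degrade gracefully: the page renders without recommendations
```

The last line is the point: a failed dependency costs you one feature, not the whole page.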
53. Reaping the Benefits
Smaller failure domains
Fewer people & changes to manage
Deploys get smaller
Deploys get faster
Deploys become continuous
54. Reaping the Benefits
Smaller changes
means smaller code reviews
means faster validation
means smaller blast radius
means faster iteration
55. Continuous Delivery
Everyone works against master branch
Master is deployed when commits are added
Deployment gated by tests
Monitoring knows something is wrong before you do!
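As a sketch, the gate can be as simple as “green tests or no deploy”; `deploy.sh` here stands in for whatever your pipeline actually runs:

```python
import subprocess
import sys

# Gate the deploy on the test suite: any failure aborts before anything ships.
tests = subprocess.run(["pytest", "-q"])
if tests.returncode != 0:
    sys.exit("Tests failed; deploy aborted.")

# Only reached on green: push the current master build out (hypothetical script).
subprocess.run(["./deploy.sh"], check=True)
```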
68. “Not Recommended” Tests
If a test fails on master:
a feature is broken on the live website, or
your test sucks and you should ditch it
In either case, we disable it
Ticket is created
Developers can fix it later or just bin it and start fresh
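In pytest terms, “disable it and file a ticket” can look like this; the ticket ID is made up:

```python
import pytest

@pytest.mark.skip(reason="Flaky on master; disabled pending TICKET-1234")
def test_checkout_flow():
    ...
```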
4 years at Yelp, 80 people -> hundreds
Just going to talk about what I’ve learned along the way
For this talk to make sense, we have to also talk about what Yelp is.
Connect people w/ great local businesses
Approx. 83 million unique monthly visitors (UMVs) via mobile
More than 83 million reviews contributed since inception
Approx. 68% of all searches on Yelp came from mobile (mobile web & app)
Yelp is present across 32 countries
To drive numbers like that, you need a big platform
Lots of sites and apps with lots of features, lots of people working on them, and lots of computers that power it all
Well, I can write a convincing talk proposal. ...well, and we’re sponsoring
You’ve been conned! And now you can’t leave!
You thought I was going to come up here and talk about big data (we’re at CHUG, right?)
...but instead I am here to talk to you about people
and that’s because Skynet hasn’t taken over yet, so at the end of the day it’s still humans that write software
Figuring out how to make good software is hard, especially when solving big data problems
and the infrastructure that allows that software to exist is challenging to create and maintain effectively
both the software and the infrastructure are critical, but the infrastructure tends to be much harder to change than the software
It’s like the foundation of your house… or the wheels on your car
and that infrastructure is as important for scaling the human parts of your company as the technology!
so these things are all intertwined. if you solve the people problems, the technology will follow (and vice versa)
Just going to talk about some challenging problems we’ve faced.
We have cool technology, but the way we succeed is in how we scale our people!
OK, but actually, WTF are we talking about?
one of the very first things that gets in the way with this, particularly for something like a website, is...
This is one of the biggest challenges that makes accomplishing our goal difficult
This is how most projects, companies, etc. start: single code repo, maybe a server or two, and one or a couple of developers
And then ship it!
Dump the code into production, probably restart everything. Click around, make sure stuff looks good. Maybe you’ve even got error monitoring!
This works for a while, and it’s all you need when you get started.
But then time passes, and the monolith grows. You add features. You add developers to make those features.
As you add code and scale out your org + infrastructure, things naturally take longer. What was once a 10 minute deploy process might take closer to 30 minutes… or 45 minutes… or maybe even an hour!
...but that’s not a big deal, right?
And here we run into a problem.
HUMANS SCREW UP
As you grow, you’re doing more stuff. More people writing more code to power more features, covered by more tests, going out in deployments that batch up more people’s changes.
More stuff means more chances to screw up, which we do, because we are humans. And when you screw up, it means back to square one… new build, new test run, new deployment.
...and everyone has lost as much time as it takes to get this far. And they’ll have to invest it all again to get their code out!
This starts looking pretty grim even around 10 branches, and that’s assuming a (generous) 98% success rate! At 20 branches we’re below 70%!
So… how can we do better?
Well, we can try to improve this number...
Make it harder to screw up! Decrease the chances of failure.
This is where almost all teams start focusing their efforts first.
Here are some common ways people try to make screwing up more difficult.
It’s easy!
This doesn’t work in the real world. At the end of the day, we’re still human.
We’re people! We make mistakes! We just spent a long time talking about how we’re fallible. Why would this be any less true of the systems we create to prevent us from making mistakes?
In reality, doing all those things does help
But at the end of the day, you need more to scale an org. We want the asymptotic solutions, not the constant factor.
...and of course, as computing professionals, you’ve all probably been writhing in your seats, trying to tell me to do this first
We tackle this asymptotic factor with SOA. Split up large code bases into smaller ones that can be developed independently and communicate over common interfaces.
How you size this is up to you. Don’t fall prey to the hype of “microservices” if it doesn’t make sense for you.
It is a lot harder to do SOA than a monolith, and it can decrease your rate of success dramatically! It takes a lot of effort and discipline to get it right.
However, it’s very difficult to obtain the advantages it provides any other way.
Embrace the idea that failures will happen, and be ready for them!
In a world like this, you need to safeguard your deployment process. It’s a problem if it gets slow, because it’s your out when you screw up.
Ok, but wait a second. You just said deployment is gated using tests, right? But I thought those were hard to get right!
...and there is a real cost to getting them wrong!
It can be easy, especially in dynamic languages, to accrue dependencies on other things over time. “Katamari” dependencies
It can be tempting to write tests that cover way, way too much
Tests that don’t have enough coverage
Slow tests
Tests that rely on lots of external infrastructure
Tests that test the outside world: external APIs, vendors, etc. (see the mocking sketch below)
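One way out is to mock the outside world instead of calling it; here’s a sketch with `unittest.mock`, where `billing.charge_card`, `orders`, and `place_order` are hypothetical names:

```python
from unittest import mock

from orders import place_order  # hypothetical module under test

# Replace the real vendor call with a canned response so the test
# exercises our logic, not the vendor's uptime.
@mock.patch("billing.charge_card", return_value={"status": "ok"})
def test_order_completes(mock_charge):
    result = place_order(order_id=42)
    assert result["paid"]
    mock_charge.assert_called_once()
```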
Like we talked about before: stop treating the tests as a safety net.
Not all tests are sacred, and indeed a lot of them are probably hurting you if you’ve been around long enough. Some tests are better than others.
A test that fails on master when nothing is actually broken deprives you of information. You have no idea whether what that test covers is OK, because a failure doesn’t mean anything, and therefore a pass doesn’t mean anything either!
Reliable tests are wayyy better than achieving high test coverage. CD only works when green actually means green and red actually means red.
We have a hard time with this one because Python is a dynamic language, but if you can do this, you should!
Monitoring will let you know even sooner when your integration code with a partner breaks, especially when the breakage is on their side. Running this check at integration time isn’t telling you anything, and it’s slowing you down.
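A sketch of what that monitoring probe might look like, run on a schedule instead of in the deploy pipeline; the endpoint and alert hook are hypothetical:

```python
import requests

def check_partner_api():
    """Runs every minute from cron/monitoring, not from the test suite."""
    try:
        resp = requests.get("https://api.partner.example/health", timeout=3)
        healthy = resp.status_code == 200
    except requests.RequestException:
        healthy = False
    if not healthy:
        page_oncall("partner API health check failing")  # hypothetical alert hook

def page_oncall(message):
    print("ALERT:", message)  # stand-in for a real paging integration
```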