DevOps is far more about culture and organization than it is about technology and tooling. This talk will discuss the speaker's experiences leading high-performing engineering teams at Google, eBay, and Stitch Fix, and will offer suggestions for other organizations to level up their DevOps game.
https://www.meetup.com/SV-ELC/events/240087808/
Modern software-service models benefit greatly from having the same team both build the software and operate it in production -- "You Build It; You Run It" is the Amazon mantra. What does this mean in practice?
Organizationally, it means small teams with well-defined areas of responsibility, directly aligned with the business. The teams are cross-functional, meaning that each team has all the skill sets it requires to do its job, while at the same time relying on other teams for supporting services, tools, and libraries.
Process-wise, it means doubling down on practices like test-driven development and continuous delivery. Using continuous delivery practices, high-performing teams can and do release their applications and services multiple times a day. This enables them to iterate rapidly, experiment courageously, and fail more quickly.
Culturally, it means end-to-end ownership. Each team owns its software end-to-end, from design to development to deployment to retirement. The same engineers who are responsible for the features are responsible for quality, performance, operations, and maintenance. This ownership puts incentives in the right place to encourage building maintainable, observable, and operable systems from the start.
All these techniques and approaches are available to everyone, and practical examples in this talk will help other organizations on their journey.
2. Background
• VP Engineering at Stitch Fix
o Combining “Art and Science” to revolutionize apparel retail
• Consulting “CTO as a service”
o Helping companies scale engineering organizations and technology
• Director of Engineering for Google App Engine
o World’s largest Platform-as-a-Service
• Chief Engineer / Distinguished Architect at eBay
o Multiple generations of eBay’s infrastructure
@randyshoup linkedin.com/in/randyshoup
7. Combining Art and [Data] Science
• 1:1 Ratio of Data Science to Engineering
o Almost 100 software engineers
o Almost 100 data scientists and algorithm developers
o Unique in our industry
• Apply intelligence to *every* part of the business
o Buying
o Inventory management
o Logistics optimization
o Styling recommendations
o Demand prediction
• Humans and machines augmenting each other
15. Conway’s Law
• Organization determines architecture
o Design of a system will be a reflection of the communication paths within the organization
• Modular system requires modular organization
o Small, independent teams lead to more flexible, composable systems
o Larger, interdependent teams lead to larger systems
• We can engineer the system we want by engineering the organization
16. Small “Service” Teams
• Amazon “2 Pizza” Teams
o No team should be larger than can be fed by 2 large pizzas
o Typically 4-6 people
o Mix of junior and senior people
• Aligned to Business Domains
o Clear, well-defined area of responsibility
o Single service or set of related services
o Minimal, well-defined “interface”
17. Full-Stack Teams
• All disciplines required for the team’s function
o Design
o Development
o Quality and Performance
o Maintenance
o Operations
• Symphony, not a Factory
o Diversity of skills and talents
o Working together more important than individual talent
• Depend on other teams for supporting services, libraries, and tools
18. End-to-End Ownership
• Teams own their roadmap
• No separate maintenance or sustaining engineering team
• Teams are long-term
o Team owns service from design to deployment to retirement
19. Stable Points in Organization Size
• ~5
o Everyone fits around a conference table
o Single team, no structure
o High bandwidth communication between individuals
o Fluid roles
• ~20
o Very difficult to manage as a single team, but possible
o Need to introduce structure, but can be challenging to make it optimal / efficient
o Potential trough of productivity and motivation
• ~50-100+
o Requires shift from coordinating individuals to coordinating teams
o High-bandwidth within teams, loose coupling between teams
o Focus on team structure and responsibilities
21. Test-Driven Development
• Tests help you go faster
o Tests “have your back”
o Development velocity
• Tests make better code
o Confidence to break things
o Courage to refactor mercilessly
• Tests make better systems
o Catch bugs earlier, fail faster
22. “Do you have time to do it twice?”
“We don’t have time to do it right!”
23. Test-Driven Development
• Do it right (enough) the first time
o The more constrained you are on time and resources, the more important it is to build solid features
o Build one great thing instead of two half-finished things
• Right ≠ Perfect (80 / 20 Rule)
• Basically no bug tracking system (!)
o Bugs are fixed as they come up
o Backlog contains features we want to build
o Backlog contains technical debt we want to repay
24. Transitioning to Testing
• Write functional tests around a component
o If you can only have a few tests, they should be meaningful ones
• Fail any build that breaks a test
• Opportunistically add tests
o For every new bug, add a test that reproduces the bug and verifies the fix
o For every new feature, add tests for that feature
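The "add a test for every new bug" practice can be sketched as follows. This is a minimal illustration, not code from the talk: the `parse_quantity` component and the blank-input bug are hypothetical, and the tests are written as plain assert-style functions of the kind a runner such as pytest would collect.

```python
def parse_quantity(text):
    """Parse the quantity field of an order form (hypothetical component)."""
    cleaned = text.strip()
    if not cleaned:
        # The fix: blank input used to raise ValueError (hypothetical bug).
        return 0
    return int(cleaned)


def test_blank_quantity_is_zero():
    # Reproduces the bug report and verifies the fix in one place.
    assert parse_quantity("   ") == 0


def test_normal_quantity_still_parses():
    # A functional test around the component, kept alongside the regression test.
    assert parse_quantity(" 3 ") == 3
```

Once a test like this is part of the build, "fail any build that breaks a test" keeps the bug from quietly returning.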
26. Continuous Delivery
• Most applications deployed multiple times per day
• More solid systems
o Release smaller units of work
o Smaller changes to roll back or roll forward
o Faster to repair, easier to understand, simpler to diagnose
• Rapid experimentation
o Small experiments and rapid iteration are cheap
27. Continuous Delivery
• Enabled by
o API-driven infrastructure (“cloud”)
o PaaS
o Containers
• eBay 2-week trains vs. today
28. Triangle of Technical Tradeoffs
• When you choose date and features, you implicitly choose a level of quality
• Changing one changes the others
• Be open and honest when you are making these tradeoffs
[Diagram: triangle with corners labeled Date, Quality, Features]
33. Culture eats strategy and organization and technology and process and … for breakfast.
-- me
34. Cross-Functional Collaboration
• Open communication
o Individuals encouraged to work directly with each other
o Prefer informal cooperation over formal channels
• Best decisions made through partnership
o Agreement on goals and priorities makes it easier to agree on tactics
o Given common context, well-meaning people will generally agree
o “Disagree and Commit”
• Solve problems instead of pointing fingers
o Otherwise, playing strategy instead of solving the problem
o Otherwise, avoiding blame and hiding the ball
35. None of us is as smart as all of us.
-- Japanese proverb, as quoted by Bob Taylor
36. Goals of a Service Owner
• Meet the needs of my clients …
o Functionality
o Quality
o Performance
o Stability and reliability
o Constant improvement over time
• … at minimum cost and effort
o Leverage common tools and infrastructure
o Leverage other services
o Automate building, deploying, and operating my service
o Optimize for efficient use of resources
37. Responsibilities of a Service Owner
• End-to-end Ownership
o Team owns service from design to deployment to retirement
o No separate maintenance or sustaining engineering team
• Autonomy and Accountability
o Freedom to choose technology, methodology, working environment
o Responsibility for the results of those choices
39. Service Relationships
• Vendor – Customer Relationship
o Friendly and cooperative, but structured
o Clear ownership and division of responsibility
• Customer Focus
o Value of service comes from its value to its customers
• Customer can choose to use service or not (!)
o Must be strictly better than the alternatives of build, buy, borrow
40. Service-Service Relationships
• Service-Level Agreement (SLA)
o Promise of service levels by the provider
o Customer needs to be able to rely on the service, like a utility
• Charging and Cost Allocation
o Charge customers for *usage* of the service
o Aligns economic incentives of customer and provider
o Motivates both sides to optimize for efficiency
o (+) Pre- / post-allocation at Google
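Charging customers for usage can be sketched as a proportional split of a service's cost. This is an illustrative assumption only: the function name `allocate_cost`, the team names, and the figures are hypothetical, and it does not describe how pre- or post-allocation works at Google.

```python
def allocate_cost(total_cost, usage_by_customer):
    """Split total_cost across customers in proportion to their usage.

    usage_by_customer maps a customer (team) name to a usage count,
    for example requests served in the billing period.
    """
    total_usage = sum(usage_by_customer.values())
    if total_usage == 0:
        # Nobody used the service, so nobody is charged.
        return {customer: 0.0 for customer in usage_by_customer}
    return {
        customer: total_cost * usage / total_usage
        for customer, usage in usage_by_customer.items()
    }


# Hypothetical month: the service cost 1000.0 to run and served two client teams.
charges = allocate_cost(1000.0, {"search": 300, "checkout": 100})
# charges == {"search": 750.0, "checkout": 250.0}
```

Because each customer's bill tracks its own usage, a customer that trims unnecessary calls sees its charge fall, and the provider is motivated to reduce the total cost being divided.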
41. Blameless Post-Mortems
• Post-mortem After Every Incident
o Document exactly what happened
o What went right
o What went wrong
• Open and Honest Discussion
o What contributed to the incident?
o What could we have done better?
• Engineers compete to take personal responsibility (!)
o “Finally we can fix that broken system”
42. Blameless Post-Mortems
• Action Items
o How will we change process, technology, documentation, etc.?
o How could we have automated the problems away?
o How could we have diagnosed more quickly?
o How could we have restored service more quickly?
• Follow up (!)
43. Failure is not falling down, but refusing to get back up.
-- Theodore Roosevelt
45. Thanks!
• Stitch Fix is hiring!
o www.stitchfix.com/careers
o Based in San Francisco
o Hiring everywhere!
o More than half remote, all across US
o Application development, Platform engineering, Data Science
• Please contact me
o @randyshoup
o linkedin.com/in/randyshoup
47. Sharing Specialty Skills
• Specialty Skills
o Security, User Experience, Compliance, DBA, etc.
o Often quite difficult to hire
o Rarely need a full-time person on each team
• Approach 1: Service Model
o Make domain teams as self-sufficient as possible
o Encode best practices into service / tool / library
o Own those specialty services just like domain teams do
48. Sharing Specialty Skills
• Approach 2: Shared Model
o Single person shared among multiple teams
• Approach 3: Coaching / Advisory Model
o Specialty team is a pool of advisors
o Provide special expertise as needed
o Goal is to make domain teams self-sufficient
49. Team Anti-Patterns
• Skill-based teams
o Based around tiers or technologies (e.g., front-end team, application team, DBA team, Ops team)
o (-) Every project crosses many team boundaries
o (-) No end-to-end ownership of the system
o (-) No end-to-end ownership of the customer experience
• Project-based teams
o Form ad-hoc team for a particular project, then disband
o (-) No long-term ownership of code, product, service
o (-) Encourages a short-term approach over sustainable engineering, so technical debt accumulates
50. Team Anti-Patterns
• Large teams
o (-) Teams larger than 8-10 should be split
o (-) Communication and coordination overhead makes it increasingly difficult to sustain velocity
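One common rule of thumb for why coordination overhead grows with team size, offered here as an assumption rather than something stated on the slides, is to count the pairwise communication paths C(n) among n people:

```latex
C(n) = \binom{n}{2} = \frac{n(n-1)}{2}, \qquad C(5) = 10, \quad C(10) = 45, \quad C(20) = 190
```

Under this simple model, doubling a team from 5 to 10 people more than quadruples the number of paths, which is consistent with the advice to split teams once they grow past 8-10.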
51. Effective Global Teams
• Local Ownership
o Well-defined area of responsibility
o Clean interface with the rest of the organization
• Individual teams are co-located
o High-bandwidth communication within a team
o Minimal coordination across teams
52. Global Team Anti-Patterns
• Anti-Pattern: Split Teams Over Geographies
o (-) Constant need for coordination over time zones
o (-) Local conversations become disruptive rather than helpful
o (-) No local pride of ownership
• Anti-Pattern: Remote Team as Job Shop
o (-) Constant need for management and task assignment
o (-) Resentment between first-tier and second-tier sites
o (-) No local pride of ownership
o Ex. eBay remote offices vs. Google remote offices
53. Remote Teams
• Fully remote *OR* fully co-located
o Remote teams rely on virtual proximity (chat, hangouts, IRC)
o Co-located teams rely on physical proximity (co-working)
• Anti-Pattern: “Mostly” co-located
o (-) Co-located majority ends up determining communication methods
o (-) Remote individuals left out, less able to contribute, less productive
54. Feature Flags
• Configuration “flag” to enable / disable a feature for a particular set of users
o Independently discovered at eBay, Facebook, Google, etc.
• More solid systems
o Decouple feature delivery from code delivery
o Rapid on and off
o Develop / test / verify in production
o Dark launches
• Enables experimentation
o A/B testing
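A feature flag of the kind described above can be sketched in a few lines. This is a minimal in-memory illustration under stated assumptions, not the implementation used at eBay, Facebook, Google, or Stitch Fix: the `FeatureFlags` class, the `bucket` helper, and the flag and user names are all hypothetical, and a real system would load flag state from configuration rather than hold it in memory.

```python
import hashlib


def bucket(flag_name, user_id):
    """Map a (flag, user) pair to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


class FeatureFlags:
    """Hypothetical in-memory flag store."""

    def __init__(self):
        self._flags = {}

    def set_flag(self, name, enabled=True, users=(), percent=0):
        # users: explicit allow-list; percent: share of all users (0-100).
        self._flags[name] = {
            "enabled": enabled,
            "users": set(users),
            "percent": percent,
        }

    def is_enabled(self, name, user_id):
        flag = self._flags.get(name)
        if flag is None or not flag["enabled"]:
            return False  # unknown or switched-off flags are always off
        if user_id in flag["users"]:
            return True
        return bucket(name, user_id) < flag["percent"]


flags = FeatureFlags()

# Dark launch: the code is deployed, but only one internal user sees it.
flags.set_flag("new_checkout", users={"alice"})

# Rapid off: flip one setting, with no code deployment.
flags.set_flag("old_banner", enabled=False, percent=100)
```

Because `bucket` is a deterministic hash, raising `percent` gradually widens exposure while each user keeps a consistent experience, which is also what makes the same mechanism usable for A/B tests.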