TD Presents: Reliability x Large Scale talks for infrastructure and Site Reliability Engineering
Talk: Building Reliable Services - A journey from servers to services
Speaker: Chris Maxwell
Event: https://techplay.jp/event/657905
Location: Tokyo, Japan
Date: March 15, 2018
[TD Presents] Companies running "reliability x large-scale" services speak! A technical case-study meetup on operating services stably and scalably: Infrastructure/SRE edition
4. WHY?
Building Reliable Services
• Reliability is an emergent property
• You cannot buy reliability
• You can invest in communication, tools, and
processes that increase reliability
6. MANY DEPLOYMENTS
8+ Environments
Varying capabilities and scale per environment
50+ Services
Not a microservices architecture…
275+ Deployments
Production clusters from 3 to 200+ instances
7. RUNTIME CONVERGENCE
Cookbooks Downloaded
Configuration Management Server Pattern
Code Downloaded
Configuration Management of releases
Runtime Failures
Dependencies and Releases use same process
Dependencies Downloaded
3rd Party dependencies are everywhere
8. OUR HERO
Infrastructure Engineer
Systems Engineer who owns the resources
underlying services. Automation, Cloud, Networks,
Security Groups, DNS, Production Support services
Site Reliability Engineer
Software Engineer and Systems Engineer who
improves services with automation, system-wide
tools, and best practices
9. INCREASE VELOCITY
Faster than Weekly Deployments
• Releases through Configuration Management
• Infrastructure team gatekeeping
More Sites
• We need more sites by end of the year
• 50+ services per site
10. COMPLEX PLATFORM
Where to Start?
• Job Control
• Query and Compute
• Storage
• Segmentation
Many Differences
• Ruby
• Java
• Hadoop
• Presto
• Scala
Many teams
• Backend
• Query
• API
• Integrations
• Frontend
• Infrastructure
Growth and Change
• New features every week
• Product evolution
11. SERVICE DELIVERY IS HARD
Hero Refuses
Politely…
Teams continue using existing practices
Foundation is Dirty Work
Thankless tasks
Change exposes implicit usage
Measure Reliability
Improves existing processes
Starts measuring features
12. WISDOM FROM OUTSIDE
Simple First
“Everything should be made as
simple as possible, but not
simpler.”
— Paraphrase of Albert Einstein
13. ON EXPERTS AND ADVICE
You’re the expert given
your specific context
and needs
14. MENTOR RETURNS
The number of “chunks”
of context a human engineer
can retain is the:
“magical number seven (7),
plus or minus two”
— George Miller
15. FIRST CHANGES
Standard Deployment Targets
For our environment, we need:
• Site - data residency
• Cloud - vendor / implementation
• Region - resource location
• Service - internal service name
• Stage - delivery stages
• Cluster - deployment target
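
The six fields above can be sketched as a small record type. This is a minimal illustration, not the tooling described in the talk: the class name, the dash-joined naming scheme, and all example values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentTarget:
    """One deployment target, using the six fields listed on the slide."""
    site: str      # data residency
    cloud: str     # vendor / implementation
    region: str    # resource location
    service: str   # internal service name
    stage: str     # delivery stage
    cluster: str   # deployment target

    def name(self) -> str:
        # Hypothetical naming scheme: join the fields with dashes.
        return "-".join(
            [self.site, self.cloud, self.region,
             self.service, self.stage, self.cluster]
        )


# Example values are made up for illustration.
target = DeploymentTarget(
    site="jp", cloud="aws", region="ap-northeast-1",
    service="presto", stage="production", cluster="a",
)
print(target.name())  # jp-aws-ap-northeast-1-presto-production-a
```

Making the record immutable means a target can be used as a dictionary key or compared directly, which is convenient when the same six fields identify every deployment.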
16. HARD WORK AHEAD
Reliability sometimes means
rolling up your sleeves and
getting dirty,
working on core infrastructure
to create a strong foundation
to rely upon
17. FIRST CHANGES
Standard Startup Services
For our environment, we need:
• preinit - discover deployment target
• ephemeral - automatic volume mounting
• final - bootstrap configuration management
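
The three startup services above run in a fixed order, with each phase building on the previous one. The sketch below shows that ordering only; the function bodies are placeholders and the state keys and values are assumptions, not details from the talk.

```python
from typing import Callable, Dict, List

# Phase order taken from the slide.
PHASES: List[str] = ["preinit", "ephemeral", "final"]


def preinit(state: Dict[str, str]) -> None:
    # Placeholder: discover the deployment target.
    state["target"] = "example-target"


def ephemeral(state: Dict[str, str]) -> None:
    # Placeholder: mount ephemeral volumes automatically.
    state["volumes"] = "mounted"


def final(state: Dict[str, str]) -> None:
    # Placeholder: bootstrap configuration management for the target
    # that preinit discovered.
    state["config_management"] = "bootstrapped for " + state["target"]


HANDLERS: Dict[str, Callable[[Dict[str, str]], None]] = {
    "preinit": preinit,
    "ephemeral": ephemeral,
    "final": final,
}


def run_startup() -> Dict[str, str]:
    """Run every phase in order; an exception stops the sequence."""
    state: Dict[str, str] = {}
    for phase in PHASES:
        HANDLERS[phase](state)
    return state


print(run_startup())
```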
18. KEEP IT SIMPLE
“Complexity is the root
cause of the vast
majority of problems
with software today” —
Moseley & Marks
19. ACCEPTS CHALLENGE
Standard Service Definition
• Autoscale Group
• Optional CodeDeploy Package
• Internal Load Balancer
• Internal DNS Endpoint
• Optional External Load Balancer & DNS Endpoint
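
One way to picture the standard service definition is as a record with required and optional parts. This is a sketch under assumptions: the field names, the component labels, and the example values are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServiceDefinition:
    name: str
    autoscale_min: int
    autoscale_max: int
    internal_dns: str                          # internal DNS endpoint
    codedeploy_package: Optional[str] = None   # optional on the slide
    external_dns: Optional[str] = None         # optional external LB + DNS

    def components(self) -> List[str]:
        """List the pieces this service needs, in slide order."""
        parts = ["autoscale-group"]
        if self.codedeploy_package is not None:
            parts.append("codedeploy-package")
        parts += ["internal-load-balancer", "internal-dns"]
        if self.external_dns is not None:
            parts.append("external-load-balancer-and-dns")
        return parts


# Example values are made up for illustration.
minimal = ServiceDefinition("presto", 3, 200, "presto.internal.example")
print(minimal.components())
```

Keeping the optional parts as `None` by default means every service shares the same shape, and only the exceptions need extra fields.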
20. AUTOSCALING PRESTO
Attach to the Team
Our hero joins a service team
Autoscaling Presto
Helps to autoscale the entire service
Work with Team
Helps transition config into artifact
21. CODEDEPLOY PRESTO
Learn from Team
Their challenges and needs
Artifact Code + Config
Transition from simple autoscaling to
Code + Config Artifacts
Simple is Hard
3+ sources of configuration truth
12+ nearly identical but divergent configurations
Complexity was a workaround for inflexible
Configuration Management
22. MOVE FAST
Direct API Tools
• Service API not complete
• Team needed compound operations
Conductor to manage cluster ops
• Built service-specific tools using underlying APIs
• Routing and Segmentation
23. FRIENDS FOR THE JOURNEY
AutoScaling & Launch Configuration
IAM Instance-Profile Roles
Route53
CodeDeploy
EC2 Security Group
Application Load Balancer & Target Group
24. MORE FRIENDS
Trusting Team
Software Engineering teams trusted our hero
Outside Experience
Engineers with domain-specific experience helped
our hero understand the systems
25. SLIDE TITLE
value of explicitly
defined service
contracts
talk first,
software later
26. DELIVERY STATES
Dangerous Shutdown
Some services require careful shutdown procedures
Delivery cannot hard-fail 14-day running jobs
Loose definition of responsibility
Delivery is an organic combination of Configuration
Management, system service control, release control
New Orchestration exposes old assumptions
In-place is sub-optimal for 2-week jobs
New-cluster is sub-optimal for remaining jobs
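
The trade-off above can be read as a simple rule: long-running jobs favour a new cluster, so the old one can finish its work, while everything else favours an in-place update. The sketch below encodes that reading; the one-day threshold and the function name are assumptions, not values from the talk.

```python
from datetime import timedelta

# Assumed cut-off between "long-running" and "remaining" jobs.
LONG_JOB_THRESHOLD = timedelta(days=1)


def choose_delivery(longest_running_job: timedelta) -> str:
    """Pick a delivery strategy from the longest job a service runs."""
    if longest_running_job >= LONG_JOB_THRESHOLD:
        # In-place delivery would interrupt the job, so start a new cluster.
        return "new-cluster"
    # Short jobs can tolerate an in-place update.
    return "in-place"


print(choose_delivery(timedelta(days=14)))    # new-cluster
print(choose_delivery(timedelta(minutes=5)))  # in-place
```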
27. MENTOR RETURNS
Tools express the process
Process should uplift the
organization
“Tools are necessary but not sufficient. To build a
future we all can live with, we have to build it
together” — Bridget Kromhout
28. OUR HERO
Service Tool
Orchestrate 6 infrastructure APIs with MVP tools:
• Leverage immediate gain
  • Orchestration
• Paying interest
  • Learning team needs and behaviour
• Liability that must be paid in full
  • Intend to replace with API + client
29. SERVICES FIRST
All services should look the same
Any engineer can
• Create a cluster
• Update a cluster
• Deploy to a cluster
• Delete a cluster
Safely, using the same tool
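
The four operations above suggest a single, uniform interface for every service. The following is an in-memory stand-in to show the shape of such a tool, not the tool from the talk: the class name, method signatures, and the reading of "safely" as refusing to act on unknown or duplicate clusters are all assumptions.

```python
from typing import Dict, Optional


class ClusterTool:
    """In-memory stand-in for one tool that treats every service alike."""

    def __init__(self) -> None:
        self._clusters: Dict[str, Dict[str, Optional[object]]] = {}

    def _require(self, name: str) -> Dict[str, Optional[object]]:
        # "Safely" is interpreted here as: never act on an unknown cluster.
        if name not in self._clusters:
            raise KeyError(f"unknown cluster {name!r}")
        return self._clusters[name]

    def create(self, name: str, size: int) -> None:
        if name in self._clusters:
            raise ValueError(f"cluster {name!r} already exists")
        self._clusters[name] = {"size": size, "version": None}

    def update(self, name: str, size: int) -> None:
        self._require(name)["size"] = size

    def deploy(self, name: str, version: str) -> None:
        self._require(name)["version"] = version

    def delete(self, name: str) -> None:
        self._require(name)
        del self._clusters[name]

    def describe(self, name: str) -> Dict[str, Optional[object]]:
        return dict(self._require(name))


tool = ClusterTool()
tool.create("presto-a", size=3)
tool.update("presto-a", size=5)
tool.deploy("presto-a", version="1.2.3")
print(tool.describe("presto-a"))  # {'size': 5, 'version': '1.2.3'}
```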
30. SLIDE TITLE
Survey the Work
How deep does the hole go?
Start with Friends
API and Segmentation
Where to Start?
Look for the greatest need
31. COMPLEXITY
Complex Service(s)
• Manual Post-Start Actions
• Service Discovery because no standards
Duplication in Many Places
• 5 copies of the same service
• We were pushing the limits of the legacy model
32. COMPLEXITY
Unclear boundaries
• Configuration ownership shared across teams
• Service Discovery because no standards
Unclear assumptions
• Inconsistent naming and usage
• The way it works now is the way it should be
33. MIGRATION
Simplifying Complex
Re-evaluate all choices
in light of services-first
Many Transitional Changes
Startup Services
Infrastructure to Application
Precision Replacement
Coordinated Handover
Careful work
34. THE PROCESS
Legacy Process
• Servers First
• Human Orchestration
Transition
• Services First
• Automation triggers legacy process
Value
• Replace legacy with artifact
37. THE REWARD
Service Patterns for Scaling
• Deployment Targets
• Standard Startup
• Standard Services
New Powers
• On-Demand Clusters
• Per-Cluster Versioning
• Immediate Feedback
38. OUR HERO
Your team builds it,
your team runs it;
we can help
your team
run it better
39. OUR BLUEPRINT
Standard Services
• Deployment Target
• Internal Hostname
• Internal Load Balancer
• Autoscale Group
• CodeDeploy Artifact
Supporting Services
Artifacts are easier with:
• Configuration support hooks
• Service Control hooks
• Remote Execution hooks
• Metrics, monitors, logs, alerts
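
The hooks above can be pictured as named extension points that each service fills in. This is a minimal sketch of that idea; the registry class, the hook kind names, and the example hooks are hypothetical and not taken from the talk.

```python
from typing import Callable, Dict, List

# Hook kinds mirror the three hook bullets on the slide.
HOOK_KINDS = ("configuration", "service_control", "remote_execution")


class HookRegistry:
    def __init__(self) -> None:
        self._hooks: Dict[str, List[Callable[[], str]]] = {
            kind: [] for kind in HOOK_KINDS
        }

    def register(self, kind: str, hook: Callable[[], str]) -> None:
        if kind not in self._hooks:
            raise ValueError(f"unknown hook kind {kind!r}")
        self._hooks[kind].append(hook)

    def run(self, kind: str) -> List[str]:
        """Run every hook of one kind, in registration order."""
        if kind not in self._hooks:
            raise ValueError(f"unknown hook kind {kind!r}")
        return [hook() for hook in self._hooks[kind]]


registry = HookRegistry()
# Example hooks are placeholders that only report what they would do.
registry.register("configuration", lambda: "render config")
registry.register("service_control", lambda: "drain connections")
registry.register("service_control", lambda: "restart service")
print(registry.run("service_control"))
```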
40. REMAINING SERVICES
41+ Services
Just 41+ more to go
Each one needs conversion
200+ Deployments
Just 200+ more to go
Each one needs re-deployment
Empathy
Not all services were designed for
a multi-cluster environment
Not all services were designed for
graceful termination
Not all services have active
improvements planned
Challenges
• Non-idempotent
• Stateful / Disk-full
• Master/Worker Co-Services
• Maintain Service Levels
• High Throughput Environment
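
To make the "non-idempotent" challenge concrete, the sketch below contrasts an operation that changes the result on every run with one that can be repeated safely. The function names and the settings are made up for illustration.

```python
from typing import List


def append_setting(lines: List[str], setting: str) -> List[str]:
    # Non-idempotent: every run adds another copy of the setting.
    return lines + [setting]


def ensure_setting(lines: List[str], setting: str) -> List[str]:
    # Idempotent: running it again after the first time changes nothing.
    return lines if setting in lines else lines + [setting]


config = ["workers=4"]

once = ensure_setting(config, "port=8080")
twice = ensure_setting(once, "port=8080")
print(once == twice)  # True

repeated = append_setting(append_setting(config, "port=8080"), "port=8080")
print(len(repeated))  # 3
```

A redeployment that reruns the idempotent version is harmless, while rerunning the non-idempotent one leaves duplicated state behind, which is why such services need extra care during conversion.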
41. THE WAY HOME
Best Practices
Standard Services
Standard Delivery
Standard Tooling
Work for Teams
Improve Service as a Service
Work with Teams
Enable Super Powers
Deploy on Demand
Per-Cluster Versions
42. REMAINING SERVICES
Service Improvements
Target business value:
Delivery Velocity
High-Trust Services
Support Config Management
No Big-Bang Replacements
Business Depends on Previous Process
Strategy to Improve
Small Iterations
Incremental Value
43. OUR SERVICE IS NOT YOUR SERVICE
All software is created
within a context, and
trade-offs are made
based on that context
45. INVESTMENTS
Understandable
Make every service easy to understand
Allow any engineer to quickly operate and improve
Consistent
Make every service look the same
Allow any engineer to work on any system without context
Repeatable
Practice makes perfect
47. NO HEROES, ONLY TEAM
Yuu Yamashita Takashi Kokubun Yuki Ito
Chris Maxwell You?
Site Reliability Engineer
Robin Bowes
You?
Site Reliability Engineer
You?
Infrastructure Engineer
You?
Site Reliability Engineer
48. TREASURE DATA
BUILDING RELIABLE SERVICES
• @WrathOfChris
https://twitter.com/WrathOfChris
• Chris Maxwell
https://www.linkedin.com/in/wrathofchris/
• Careers
https://www.treasuredata.co.jp/careers/
• Treasure Data K.K.
https://www.linkedin.com/company/treasure-data-inc-