Tokyo SRE Meetup - Building Reliable Services - A Journey from servers to services

TREASURE DATA
BUILDING RELIABLE SERVICES
The journey from servers to services
Chris Maxwell
Site Reliability Manager

WHY?
Building Reliable Services
• Reliability is an emergent property
• You cannot buy reliability
• You can invest in communication, tools, and
processes that increase reliability

Product
Sales
Marketing
Analytics
DAILY WORKLOAD
1+ Million Events / Sec
400,000+ Queries / Day
15+ Trillion Rows / Day 
173+ Million Rows / Sec
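
The two row figures above agree with each other. A quick check, assuming the load is spread evenly over a 24-hour day (the variable names are illustrative only):

```python
# Sanity check: do "15+ trillion rows / day" and "173+ million rows / sec" agree?
# Assumption: rows arrive at a uniform rate across a 24-hour day.
ROWS_PER_DAY = 15 * 10**12
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

rows_per_second = ROWS_PER_DAY / SECONDS_PER_DAY
print(f"{rows_per_second / 1e6:.1f} million rows/sec")  # prints "173.6 million rows/sec"
```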

MANY DEPLOYMENTS
8+ Environments
Varying capabilities and scale per environment
50+ Services
Not a microservices architecture…
275+ Deployments
Production clusters from 3 to 200+ instances

RUNTIME CONVERGENCE
Cookbooks Downloaded
Configuration Management Server Pattern
Code Downloaded
Configuration Management of releases
Runtime Failures
Dependencies and Releases use same process
Dependencies Downloaded
3rd Party dependencies are everywhere

OUR HERO
Infrastructure Engineer
Systems Engineer who owns the resources
underlying services. Automation, Cloud, Networks,
Security Groups, DNS, Production Support services
Site Reliability Engineer
Software Engineer and Systems Engineer who
improves services with automation, system-wide
tools, and best practices

INCREASE VELOCITY
Faster than Weekly Deployments
• Releases through Configuration Management
• Infrastructure team gatekeeping
More Sites
• We need more sites by end of the year
• 50+ services per site

COMPLEX PLATFORM
Where to Start?
• Job Control
• Query and Compute
• Storage
• Segmentation
Many Differences
• Ruby
• Java
• Hadoop
• Presto
• Scala
Many teams
• Backend
• Query
• API
• Integrations
• Frontend
• Infrastructure
Growth and Change
• New features every week
• Product evolution

SERVICE DELIVERY IS HARD
Hero Refuses
Politely…
Teams continue using existing practices
Foundation is Dirty Work
Thankless tasks
Change exposes implicit usage
Measure Reliability
Improves existing processes
Starts measuring features

WISDOM FROM OUTSIDE
Simple First
“Everything should be made as
simple as possible, but not
simpler.”
— Paraphrase of Albert Einstein

ON EXPERTS AND ADVICE
You’re the expert given
your specific context
and needs

MENTOR RETURNS
The number of “chunks”
of context a human engineer
can retain is the:
“magical number seven (7),
plus or minus two”
— George Miller

FIRST CHANGES
Standard Deployment Targets
For our environment, we need:
• Site - data residency
• Cloud - vendor / implementation
• Region - resource location
• Service - internal service name
• Stage - delivery stages
• Cluster - deployment target
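
One way to picture the six-part deployment target is as a small immutable record. This is only a sketch: the class name, the `/` separator, and the example values are assumptions for illustration, not the naming scheme used in the talk.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class DeploymentTarget:
    """Hypothetical record holding the six fields listed on the slide."""
    site: str      # data residency
    cloud: str     # vendor / implementation
    region: str    # resource location
    service: str   # internal service name
    stage: str     # delivery stage
    cluster: str   # deployment target

    def name(self) -> str:
        # "/" is an assumed separator; region names already contain hyphens.
        return "/".join(astuple(self))


# Example values are made up for illustration.
target = DeploymentTarget("jp", "aws", "ap-northeast-1", "presto", "staging", "a")
print(target.name())  # prints "jp/aws/ap-northeast-1/presto/staging/a"
```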

HARD WORK AHEAD
Reliability sometimes means
rolling up your sleeves and
getting dirty,
working on core infrastructure
to create a strong foundation
to rely upon

FIRST CHANGES
Standard Startup Services
For our environment, we need:
• preinit - discover deployment target
• ephemeral - automatic volume mounting
• final - bootstrap configuration management
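
The three startup services can be thought of as ordered stages that pass along what they discover. A minimal sketch, assuming each stage is a plain function sharing one context dictionary; the function bodies and values below are placeholders, not the real services:

```python
from typing import Callable, Dict, List, Tuple


def run_startup(stages: Dict[str, Callable[[dict], None]]) -> Tuple[List[str], dict]:
    """Run stages in insertion order; an exception stops all later stages."""
    context: dict = {}
    completed: List[str] = []
    for stage_name, stage in stages.items():
        stage(context)
        completed.append(stage_name)
    return completed, context


def preinit(ctx: dict) -> None:
    ctx["target"] = "example-cluster"      # placeholder for target discovery


def ephemeral(ctx: dict) -> None:
    ctx["mounts"] = ["/mnt/ephemeral0"]    # placeholder for volume mounting


def final(ctx: dict) -> None:
    # Bootstrap only makes sense once the earlier stages have run.
    ctx["bootstrapped"] = "target" in ctx and "mounts" in ctx


completed, context = run_startup(
    {"preinit": preinit, "ephemeral": ephemeral, "final": final}
)
print(completed, context["bootstrapped"])
```

Keeping the stages ordered and separate means a failure in target discovery never reaches the configuration management bootstrap.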

KEEP IT SIMPLE
“Complexity is the root
cause of the vast
majority of problems
with software today” —
Moseley & Marks

ACCEPTS CHALLENGE
Standard Service Definition
• Autoscale Group
• Optional CodeDeploy Package
• Internal Load Balancer
• Internal DNS Endpoint
• Optional External Load Balancer & DNS Endpoint
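
The standard service definition above has three required parts and two optional ones. A hedged sketch of that shape, where the class, field names, and resource labels are hypothetical and do not correspond to any real AWS API:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServiceDefinition:
    """Hypothetical model of the slide's standard service definition."""
    name: str
    codedeploy_package: Optional[str] = None  # optional CodeDeploy package
    external_endpoint: bool = False           # optional external LB + DNS

    def resources(self) -> List[str]:
        # Every service gets the same three internal resources.
        required = ["autoscale_group", "internal_load_balancer", "internal_dns_endpoint"]
        if self.codedeploy_package:
            required.append("codedeploy_package")
        if self.external_endpoint:
            required += ["external_load_balancer", "external_dns_endpoint"]
        return required


internal_only = ServiceDefinition("presto")
public_api = ServiceDefinition("api", codedeploy_package="api-1.0.zip", external_endpoint=True)
print(internal_only.resources())
print(public_api.resources())
```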

AUTOSCALING PRESTO
Attach to the Team
Our hero joins a service team
Autoscaling Presto
Helps to autoscale the entire service
Work with Team
Helps transition config into artifact

CODEDEPLOY PRESTO
Learn from Team
Their challenges and needs
Artifact Code + Config
Transition from simple autoscaling to
Code + Config Artifacts
Simple is Hard
3+ sources of configuration truth
12+ mostly the same, but different, configurations
Complexity was a workaround for inflexible
Configuration Management

MOVE FAST
Direct API Tools
• Service API not complete
• Team needed compound operations
Conductor to manage cluster ops
• Built service-specific tools using underlying APIs
• Routing and Segmentation

FRIENDS FOR THE JOURNEY
AutoScaling & Launch Configuration
IAM Instance-Profile Roles
Route53
CodeDeploy
EC2 Security Group
Application Load Balancer & Target Group

MORE FRIENDS
Trusting Team
Software Engineering teams trusted our hero
Outside Experience
Engineers with domain-specific experience helped
our hero understand the systems

Value of explicitly
defined service
contracts
Talk first,
software later

DELIVERY STATES
Dangerous Shutdown
Some services require careful shutdown procedures
Delivery cannot hard-fail 14-day running jobs
Loose definition of responsibility
Delivery is an organic combination of Configuration
Management, system service control, release control
New Orchestration exposes old assumptions
In-place is sub-optimal for 2-week jobs
New-cluster is sub-optimal for remaining jobs

MENTOR RETURNS
Tools express the process
Process should uplift the
organization
“Tools are necessary but not sufficient. To build a
future we all can live with, we have to build it
together” — Bridget Kromhout

OUR HERO
Service Tool
Orchestrate 6 infrastructure APIs with MVP tools:
• Leverage: immediate gain, orchestration
• Paying interest: learning team needs and behaviour
• Liability that must be paid in full: intend to replace with API + client

SERVICES FIRST
All services should look the same
Any engineer can
• Create a cluster
• Update a cluster
• Deploy to a cluster
• Delete a cluster
Safely, using the same tool
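
The four operations can be sketched as one small interface with basic safety checks. This is an in-memory illustration only: `ClusterTool` and its methods are hypothetical names, and a real tool would call the underlying infrastructure APIs instead of updating a dictionary.

```python
class ClusterTool:
    """Hypothetical single entry point for the four cluster operations."""

    def __init__(self) -> None:
        self._clusters = {}

    def _get(self, name: str) -> dict:
        # Refuse to act on a cluster that was never created.
        if name not in self._clusters:
            raise KeyError(f"unknown cluster {name!r}")
        return self._clusters[name]

    def create(self, name: str, size: int) -> None:
        # Refuse to overwrite an existing cluster.
        if name in self._clusters:
            raise ValueError(f"cluster {name!r} already exists")
        self._clusters[name] = {"size": size, "version": None}

    def update(self, name: str, size: int) -> None:
        self._get(name)["size"] = size

    def deploy(self, name: str, version: str) -> None:
        self._get(name)["version"] = version

    def delete(self, name: str) -> None:
        self._get(name)
        del self._clusters[name]

    def describe(self, name: str) -> dict:
        return dict(self._get(name))


tool = ClusterTool()
tool.create("presto-staging-a", size=3)    # example names and values are made up
tool.update("presto-staging-a", size=5)
tool.deploy("presto-staging-a", version="v1")
state = tool.describe("presto-staging-a")
print(state)  # prints "{'size': 5, 'version': 'v1'}"
tool.delete("presto-staging-a")
```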

Survey the Work
How deep does the hole go?
Start with Friends
API and Segmentation
Where to Start?
Look for the greatest need

COMPLEXITY
Complex Service(s)
• Manual Post-Start Actions
• Service Discovery because no standards
Duplication in Many Places
• 5 services of the same service
• We were pushing the limits of the legacy model

COMPLEXITY
Unclear boundaries
• Configuration ownership shared across teams
• Service Discovery because no standards
Unclear assumptions
• Inconsistent naming and usage
• The way it works now is the way it should be

MIGRATION
Simplifying Complex
Re-evaluate all choices
in light of services-first
Many Transitional Changes
Startup Services
Infrastructure to Application
Precision Replacement
Coordinated Handover
Careful work

THE PROCESS
Legacy Process
• Servers First
• Human Orchestration
Transition
• Services First
• Automatic triggers legacy
Value
• Replace legacy with artifact

VISION
Standard Services First
With standards, 
exceptions are hard;
Without standards,
everything is hard

OUR HERO
Autoscaling Implemented
• Second Services Team:
• Launched to Staging last week
• Launched to Production yesterday

THE REWARD
Service Patterns for Scaling
• Deployment Targets
• Standard Startup
• Standard Services
New Powers
• On-Demand Clusters
• Per-Cluster Versioning
• Immediate Feedback

OUR HERO
Your team builds it,
your team runs it;
we can help
your team
run it better

OUR BLUEPRINT
Standard Services
• Deployment Target
• Internal Hostname
• Internal Load Balancer
• Autoscale Group
• CodeDeploy Artifact
Supporting Services
Artifacts are easier with:
• Configuration support hooks
• Service Control hooks
• Remote Execution hooks
• Metrics, monitors, logs, alerts

REMAINING SERVICES
41+ Services
Just 41+ more to go
Each one needs conversion
200+ Deployments
Just 200+ more to go
Each one needs re-deployment
Empathy
Not all services were designed for
a multi-cluster environment
Not all services were designed for
graceful termination
Not all services have active
improvements planned
Challenges
• Non-idempotent
• Stateful / Disk-full
• Master/Worker Co-Services
• Maintain Service Levels
• High Throughput Environment

THE WAY HOME
Best Practices
Standard Services
Standard Delivery
Standard Tooling
Work for Teams
Improve Service as a Service
Work with Teams
Enable Super Powers
Deploy on Demand
Per-Cluster Versions

REMAINING SERVICES
Service Improvements
Target business value:
Delivery Velocity
High-Trust Services
Support Config Management
No Big-Bang Replacements
Business Depends on Previous Process
Strategy to Improve
Small Iterations
Incremental Value

OUR SERVICE IS NOT YOUR SERVICE
All software is created
within a context, and
trade-offs are made
based on that context

RELIABILITY
Reliability is:
The quality of being
trustworthy or performing
consistently well

INVESTMENTS
Understandable
Make every service easy to understand
Allow any engineer to quickly operate and improve
Consistent
Make every service look the same
Allow any engineer to work on any system without context
Repeatable
Practice makes perfect

NO HEROES, ONLY TEAM
Yuu Yamashita Takashi Kokubun Yuki Ito
Chris Maxwell You?
Robin Bowes
You?
You?
Infrastructure Engineer
You?

TREASURE DATA
BUILDING RELIABLE SERVICES
• @WrathOfChris 
https://twitter.com/WrathOfChris
• Chris Maxwell 
https://www.linkedin.com/in/wrathofchris/
• Careers
https://www.treasuredata.co.jp/careers/
• Treasure Data, Inc.
https://www.linkedin.com/company/treasure-data-inc-
