BUILDING RELIABLE SERVICES
TREASURE DATA
The journey from servers to services
Chris Maxwell
Site Reliability Manager
Treasure Data Services
WHY?
Building Reliable Services
• Reliability is an emergent property
• You cannot buy reliability
• You can invest in communication, tools, and
processes that increase reliability
Product
Sales
Marketing
Analytics
DAILY WORKLOAD
1+ Million Events / Sec
400,000+ Queries / Day
15+ Trillion Rows / Day

173+ Million Rows / Sec
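The per-second figure follows from the daily one. A quick check, assuming a 24-hour day of 86,400 seconds and taking the slide's "15+ trillion" as exactly 15 trillion:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds

rows_per_day = 15 * 10**12  # "15+ Trillion Rows / Day" from the slide
rows_per_sec = rows_per_day / SECONDS_PER_DAY

# About 173.6 million rows per second, consistent with "173+ Million Rows / Sec"
print(f"{rows_per_sec / 1e6:.1f} million rows/sec")
```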
MANY DEPLOYMENTS
8+ Environments
Varying capabilities and scale per environment
50+ Services
Not a microservices architecture…
275+ Deployments
Production clusters from 3 to 200+ instances
RUNTIME CONVERGENCE
Cookbooks Downloaded
Configuration Management Server Pattern
Code Downloaded
Configuration Management of releases
Runtime Failures
Dependencies and Releases use same process
Dependencies Downloaded
3rd Party dependencies are everywhere
OUR HERO
Infrastructure Engineer
Systems Engineer who owns the resources
underlying services. Automation, Cloud, Networks,
Security Groups, DNS, Production Support services
Site Reliability Engineer
Software Engineer and Systems Engineer who
improves services with automation and
system-wide tools and best practices
INCREASE VELOCITY
Faster than Weekly Deployments
• Releases through Configuration Management
• Infrastructure team gatekeeping
More Sites
• We need more sites by end of the year
• 50+ services per site
COMPLEX PLATFORM
Where to Start?
• Job Control
• Query and Compute
• Storage
• Segmentation
Many Differences
• Ruby
• Java
• Hadoop
• Presto
• Scala
Many teams
• Backend
• Query
• API
• Integrations
• Frontend
• Infrastructure
Growth and Change
• New features every week
• Product evolution
SERVICE DELIVERY IS HARD
Hero Refuses
Politely…
Teams continue using existing practices
Foundation is Dirty Work
Thankless tasks
Change exposes implicit usage
Measure Reliability
Improves existing processes
Starts measuring features
WISDOM FROM OUTSIDE
Simple First
“Everything should be made as
simple as possible, but not
simpler.”
— Paraphrase of Albert Einstein
ON EXPERTS AND ADVICE
You’re the expert given
your specific context
and needs
MENTOR RETURNS
The number of “chunks”
of context a human engineer
can retain is the:
“magical number seven (7),
plus or minus two”
— George Miller
FIRST CHANGES
Standard Deployment Targets
For our environment, we need:
• Site - data residency
• Cloud - vendor / implementation
• Region - resource location
• Service - internal service name
• Stage - delivery stages
• Cluster - deployment target
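The six coordinates above can be modelled as one immutable record. This is a minimal sketch, not the team's actual implementation: the class name `DeploymentTarget`, the `/`-joined naming scheme, and the example values are all assumptions made for illustration.

```python
from dataclasses import astuple, dataclass


@dataclass(frozen=True)
class DeploymentTarget:
    site: str     # data residency
    cloud: str    # vendor / implementation
    region: str   # resource location
    service: str  # internal service name
    stage: str    # delivery stage
    cluster: str  # deployment target

    def name(self) -> str:
        # Hypothetical naming scheme: "/" is used as the separator because
        # region names such as "us-east-1" already contain dashes.
        return "/".join(astuple(self))


# Example values are invented for illustration only.
target = DeploymentTarget(
    site="us", cloud="aws", region="us-east-1",
    service="presto", stage="staging", cluster="a",
)
print(target.name())
```

Making the record frozen means a target can be used as a dictionary key or set member, which is convenient when one tool has to track hundreds of deployments.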
HARD WORK AHEAD
Reliability sometimes means
rolling up your sleeves and
getting dirty,
working on core infrastructure
to create a strong foundation
to be reliable upon
FIRST CHANGES
Standard Startup Services
For our environment, we need:
• preinit - discover deployment target
• ephemeral - automatic volume mounting
• final - bootstrap configuration management
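The three startup services form an ordered pipeline in which each stage depends on the one before it. The sketch below only illustrates that ordering; the function bodies are placeholders, and the context keys and values are assumptions rather than anything taken from the real startup scripts.

```python
from typing import Callable, Dict, List, Tuple

Context = Dict[str, str]
Stage = Tuple[str, Callable[[Context], None]]


def preinit(ctx: Context) -> None:
    # Placeholder: a real implementation would discover the deployment
    # target from the instance's environment.
    ctx["target"] = "example-site/example-service/staging"


def ephemeral(ctx: Context) -> None:
    # Placeholder for detecting and mounting ephemeral volumes.
    ctx["volumes"] = "mounted"


def final(ctx: Context) -> None:
    # Configuration management needs the target discovered by preinit.
    if "target" not in ctx:
        raise RuntimeError("preinit must run before final")
    ctx["config_management"] = "bootstrapped for " + ctx["target"]


STAGES: List[Stage] = [
    ("preinit", preinit),
    ("ephemeral", ephemeral),
    ("final", final),
]


def run_startup(stages: List[Stage]) -> Context:
    """Run each stage in order; an exception stops the sequence."""
    ctx: Context = {}
    for name, step in stages:
        step(ctx)
        ctx["last_stage"] = name
    return ctx


startup_result = run_startup(STAGES)
print(startup_result["last_stage"])
```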
KEEP IT SIMPLE
“Complexity is the root
cause of the vast
majority of problems
with software today” —
Moseley & Marks
ACCEPTS CHALLENGE
Standard Service Definition
• Autoscale Group
• Optional CodeDeploy Package
• Internal Load Balancer
• Internal DNS Endpoint
• Optional External Load Balancer & DNS Endpoint
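One way to make this definition explicit is a record whose required fields are the mandatory parts and whose optional parts default to absent. This is a hedged sketch: the type names, field names, and example values are hypothetical and do not come from the original tooling.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExternalEndpoint:
    load_balancer: str
    dns_name: str


@dataclass(frozen=True)
class ServiceDefinition:
    # Required for every service
    autoscale_group: str
    internal_load_balancer: str
    internal_dns: str
    # Optional parts of the definition
    codedeploy_package: Optional[str] = None
    external: Optional[ExternalEndpoint] = None

    def is_public(self) -> bool:
        return self.external is not None


# Example values are invented for illustration only.
internal_only = ServiceDefinition(
    autoscale_group="presto-asg",
    internal_load_balancer="presto-internal-lb",
    internal_dns="presto.internal.example",
)
print(internal_only.is_public())
```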
AUTOSCALING PRESTO
Attach to the Team
Our hero joins a service team
Autoscaling Presto
Helps to autoscale the entire service
Work with Team
Helps transition config into artifact
CODEDEPLOY PRESTO
Learn from Team
Their challenges and needs
Artifact Code + Config
Transition from simple autoscaling to
Code + Config Artifacts
Simple is Hard
3+ sources of configuration truth
12+ mostly same but different configurations
Complexity was a workaround for inflexible
Configuration Management
MOVE FAST
Direct API Tools
• Service API not complete
• Team needed compound operations
Conductor to manage cluster ops
• Built service-specific tools using underlying APIs
• Routing and Segmentation
FRIENDS FOR THE JOURNEY
AutoScaling &
Launch Configuration
IAM Instance-Profile Roles
Route53
CodeDeploy
EC2 Security Group
Application Load Balancer
& Target Group
MORE FRIENDS
Trusting Team
Software Engineering teams trusted our hero
Outside Experience
Engineers with Domain Specific experience helped
our hero understand the systems
SLIDE TITLE
value of explicitly
defined service
contracts
talk first,
software later
DELIVERY STATES
Dangerous Shutdown
Some services require careful shutdown procedures
Delivery cannot hard-fail 14-day running jobs
Loose definition of responsibility
Delivery is an organic combination of Configuration
Management, system service control, release control
New Orchestration exposes old assumptions
In-place is sub-optimal for 2-week jobs
New-cluster is sub-optimal for remaining jobs
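The trade-off between the two delivery modes can be written down as a simple rule. The sketch below is only an illustration of that trade-off: the function name, the one-day threshold, and the strategy labels are assumptions, not the team's actual policy.

```python
from datetime import timedelta

# Assumed cut-off between "long" and "short" jobs; not from the original talk.
LONG_JOB_THRESHOLD = timedelta(days=1)


def choose_delivery(longest_running_job: timedelta) -> str:
    """Pick a delivery mode from the longest job a cluster may be running."""
    if longest_running_job >= LONG_JOB_THRESHOLD:
        # An in-place restart would hard-fail multi-day jobs, so bring up
        # a new cluster and let the old one drain.
        return "new-cluster"
    # Short jobs can be retried cheaply, so an in-place update is enough.
    return "in-place"


print(choose_delivery(timedelta(days=14)))
print(choose_delivery(timedelta(minutes=5)))
```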
MENTOR RETURNS
Tools express the process
Process should uplift the
organization
“Tools are necessary but not sufficient. To build a
future we all can live with, we have to build it
together” — Bridget Kromhout
OUR HERO
Service Tool
Orchestrate 6 infrastructure APIs with MVP tools:
• Leverage immediate gain
• orchestration
• Paying interest
• Learning team needs and behaviour
• Liability that must be paid in full
• Intend to replace with API + client
SERVICES FIRST
All services should look the same
Any engineer can
• Create a cluster
• Update a cluster
• Deploy to a cluster
• Delete a cluster
Safely, using the same tool
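The four operations suggest a single, uniform interface that every service shares. Below is a minimal in-memory sketch of such an interface; `ClusterTool`, `Cluster`, and all method signatures are hypothetical, and a real tool would call the underlying infrastructure APIs instead of updating a dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Cluster:
    size: int
    version: Optional[str] = None  # nothing deployed yet


class ClusterTool:
    """Same four verbs for every service, with basic safety checks."""

    def __init__(self) -> None:
        self._clusters: Dict[str, Cluster] = {}

    def get(self, name: str) -> Cluster:
        if name not in self._clusters:
            raise KeyError(f"unknown cluster {name!r}")
        return self._clusters[name]

    def create(self, name: str, size: int) -> Cluster:
        # Refuse to overwrite an existing cluster.
        if name in self._clusters:
            raise ValueError(f"cluster {name!r} already exists")
        self._clusters[name] = Cluster(size=size)
        return self._clusters[name]

    def update(self, name: str, size: int) -> Cluster:
        cluster = self.get(name)
        cluster.size = size
        return cluster

    def deploy(self, name: str, version: str) -> Cluster:
        cluster = self.get(name)
        cluster.version = version
        return cluster

    def delete(self, name: str) -> None:
        self.get(name)  # raises KeyError if the cluster does not exist
        del self._clusters[name]


tool = ClusterTool()
tool.create("presto-a", size=3)
tool.update("presto-a", size=5)
tool.deploy("presto-a", version="v1")
print(tool.get("presto-a"))
```

Because every service goes through the same verbs, an engineer who has operated one cluster already knows how to operate the others.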
SLIDE TITLE
Survey the Work
How deep does the hole go?
Start with Friends
API and Segmentation
Where to Start?
Look for the greatest need
COMPLEXITY
Complex Service(s)
• Manual Post-Start Actions
• Service Discovery because no standards
Duplication in Many Places
• 5 services of the same service
• We were pushing the limits of legacy model
COMPLEXITY
Unclear boundaries
• Configuration ownership shared across teams
• Service Discovery because no standards
Unclear assumptions
• Inconsistent naming and usage
• The way it works now is the way it should be
MIGRATION
Simplifying Complex
Re-evaluate all choices
in light of services-first
Many Transitional Changes
Startup Services
Infrastructure to Application
Precision Replacement
Coordinated Handover
Careful work
THE PROCESS
Legacy Process
• Servers First
• Human Orchestration
Transition
• Services First
• Automatic triggers legacy
Value
• Replace legacy with artifact
VISION
Standard Services First
With standards,
exceptions are hard;
Without standards,
everything is hard
OUR HERO
Autoscaling Implemented
• Second Services Team:
• Launched to Staging last week
• Launched to Production yesterday
THE REWARD
Service Patterns for Scaling
• Deployment Targets
• Standard Startup
• Standard Services
New Powers
• On-Demand Clusters
• Per-Cluster Versioning
• Immediate Feedback
OUR HERO
Your team builds it,
your team runs it;
we can help
your team
run it better
OUR BLUEPRINT
Standard Services
• Deployment Target
• Internal Hostname
• Internal Load Balancer
• Autoscale Group
• CodeDeploy Artifact
Supporting Services
Artifacts are easier with:
• Configuration support hooks
• Service Control hooks
• Remote Execution hooks
• Metrics, monitors, logs, alerts
REMAINING SERVICES
41+ Services
Just 41+ more to go
Each one needs conversion
200+ Deployments
Just 200+ more to go
Each one needs re-deployment
Empathy
Not all services were designed for
a multi-cluster environment
Not all services were designed for
graceful termination
Not all services have active
improvements planned
Challenges
• Non-idempotent
• Stateful / Diskful
• Master/Worker Co-Services
• Maintain Service Levels
• High Throughput Environment
THE WAY HOME
Best Practices
Standard Services
Standard Delivery
Standard Tooling
Work for Teams
Improve Service as a Service
Work with Teams
Enable Super Powers
Deploy on Demand
Per-Cluster Versions
REMAINING SERVICES
Service Improvements
Target business value:
Delivery Velocity
High-Trust Services
Support Config Management
No Big-Bang Replacements
Business Depends on Previous Process
Strategy to Improve
Small Iterations
Incremental Value
OUR SERVICE IS NOT YOUR SERVICE
All software is created
within a context, and
trade-offs are made
based on that context
RELIABILITY
Reliability is:
The quality of being
trustworthy or performing
consistently well
INVESTMENTS
Understandable
Make every service easy to understand
Allow any engineer to quickly operate and improve
Consistent
Make every service look the same
Allow any engineer to work on any system without context
Repeatable
Practice makes perfect
HEROES ARE FOR STORIES
NO HEROES, ONLY TEAM
Yuu Yamashita Takashi Kokubun Yuki Ito
Chris Maxwell You?
Site Reliability Engineer
Robin Bowes
You?
Site Reliability Engineer
You?
Infrastructure Engineer
You?
Site Reliability Engineer
TREASURE DATA
BUILDING RELIABLE SERVICES
• @WrathOfChris
https://twitter.com/WrathOfChris
• Chris Maxwell
https://www.linkedin.com/in/wrathofchris/
• Careers (採用情報)
https://www.treasuredata.co.jp/careers/
• Treasure Data K.K. (トレジャーデータ株式会社)
https://www.linkedin.com/company/treasure-data-inc-

Tokyo SRE Meetup - Building Reliable Services - A Journey from servers to services

  • 2. T R E A S U R E D A T A BUILDING RELIABLE SERVICES The journey from servers to services Chris Maxwell Site Reliability Manager
  • 4. WHY? Building Reliable Services • Reliability is an emergent property • You cannot buy reliability • You can invest in communication, tools, and processes that increase reliability
  • 5. Product Sales M arketing Analytics DAILY WORKLOAD 1+ Million Events / Sec 400,000+ Queries / Day 15+ Trillion Rows / Day
 173+ Million Rows / Sec
  • 6. MANY DEPLOYMENTS 8+ Environments Varying capabilities and scale per environment 50+ Services Not a micro services architecture… 275+ Deployments Production clusters from 3 to 200+ instances
  • 7. RUNTIME CONVERGENCE Cookbooks Downloaded Configuration Management Server Pattern Code Downloaded Configuration Management of releases Runtime Failures Dependencies and Releases use same process Dependencies Downloaded 3rd Party dependencies are everywhere
  • 8. OUR HERO Infrastructure Engineer Systems Engineer who owns the resources underlying services. Automation, Cloud, Networks, Security Groups, DNS, Production Support services Site Reliability Engineer Software Engineer and Systems Engineer that improves services with automation and system- wide tools and best practices
  • 9. INCREASE VELOCITY Faster than Weekly Deployments • Releases through Configuration Management • Infrastructure team gatekeeping More Sites • We need more sites by end of the year • 50+ services per site
  • 10. COMPLEX PLATFORM Where to Start? • Job Control • Query and Compute • Storage • Segmentation Many Differences • Ruby • Java • Hadoop • Presto • Scala Many teams • Backend • Query • API • Integrations • Frontend • Infrastructure Growth and Change • New features every week • Product evolution
  • 11. SERVICE DELIVERY IS HARD Hero Refuses Politely… Teams continue using existing practices Foundation is Dirty Work Thankless tasks Change exposes implicit usage Measure Reliability Improves existing processes Starts measuring features
  • 12. WISDOM FROM OUTSIDE Simple First “Everything should be made as simple as possible, but not simpler.” — Paraphrase of Albert Einstein
  • 13. ON EXPERTS AND ADVICE You’re the expert given your specific context and needs
  • 14. MENTOR RETURNS The number of “chunks” of context an human engineer 
 
 can retain is the: “magical number seven (7), plus or minus two” — George Miller
  • 15. FIRST CHANGES Standard Deployment Targets For our environment, we need: • Site - data residency • Cloud - vendor / implementation • Region - resource location • Service - internal service name • Stage - delivery stages • Cluster - deployment target
  • 16. HARD WORK AHEAD Reliability sometimes means rolling up your sleeves and getting dirty, working on core infrastructure to create a strong foundation to be reliable upon
  • 17. FIRST CHANGES Standard Startup Services For our environment, we need: • preinit - discover deployment target • ephemeral - automatic volume mounting • final - bootstrap configuration management
  • 18. KEEP IT SIMPLE “Complexity is the root cause of the vast majority of problems with software today” — Moseley & Marks
  • 19. ACCEPTS CHALLENGE Standard Service Definition • Autoscale Group • Optional CodeDeploy Package • Internal Load Balancer • Internal DNS Endpoint • Optional External Load Balancer & DNS Endpoint
  • 20. AUTOSCALING PRESTO Attach to the Team Our hero joins a service team Autoscaling Presto Helps to autoscale the entire service Work with Team Helps transition config into artifact
  • 21. CODEDEPLOY PRESTO Learn from Team Their challenges and needs Artifact Code + Config Transition from simple autoscaling to Code + Config Artifacts Simple is Hard 3+ sources of configuration truth 12+ mostly same but different configurations Complexity was workaround for inflexible Configuration Management
  • 22. MOVE FAST Direct API Tools • Service API not complete • Team needed compound operations Conductor to manage cluster ops • Built service-specific tools using underlying APIs • Routing and Segmentation
  • 23. FRIENDS FOR THE JOURNEY AutoScaling & Launch Configuration IAM Instance-Profile RolesRoute53CodeDeploy EC2 Security GroupApplication Load Balancer & Target Group
  • 24. MORE FRIENDS Trusting Team Software Engineering teams trusted our hero Outside Experience Engineers with Domain Specific experience helped our hero understand the systems
  • 25. SLIDE TITLE value of explicitly defined service contracts talk first, software later
  • 26. DELIVERY STATES Dangerous Shutdown Some services require careful shutdown procedures Delivery cannot hard-fail 14-day running jobs Loose definition of responsibility Delivery is an organic combination of Configuration Management, system service control, release control New Orchestration exposes old assumptions In-place is sub-optimal for 2-week jobs New-cluster is sub-optimal for remaining jobs
  • 27. MENTOR RETURNS Tools express the process Process should uplift the organization “Tools are necessary but not sufficient. To build a future we all can live with, we have to build it together” — Bridget Kromhout
  • 28. OUR HERO Service Tool Orchestrate 6 infrastructure APIs with MVP tools: • Leverage immediate gain • orchestration • Paying interest • Learning team needs and behaviour • Liability that must be paid in full • Intend to replace with API + client
  • 29. SERVICES FIRST All services should look the same Any engineer can • Create a cluster • Update a cluster • Deploy to a cluster • Delete a cluster Safely, using the same tool
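The "same tool for every service" idea can be sketched as one interface exposing the four operations named on the slide. This is a minimal in-memory illustration, not the tool described in the talk: `ClusterTool`, `Cluster`, and every method signature are hypothetical, and a real tool would call infrastructure APIs instead of updating a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Cluster:
    service: str
    name: str
    size: int
    version: Optional[str] = None  # None until the first deploy


@dataclass
class ClusterTool:
    """In-memory stand-in for a tool that would call infrastructure APIs."""
    clusters: Dict[str, Cluster] = field(default_factory=dict)

    def _get(self, service: str, name: str) -> Cluster:
        key = f"{service}/{name}"
        if key not in self.clusters:
            raise KeyError(f"no such cluster {key}")
        return self.clusters[key]

    def create(self, service: str, name: str, size: int) -> Cluster:
        key = f"{service}/{name}"
        if key in self.clusters:
            raise ValueError(f"cluster {key} already exists")
        self.clusters[key] = Cluster(service, name, size)
        return self.clusters[key]

    def update(self, service: str, name: str, size: int) -> Cluster:
        cluster = self._get(service, name)
        cluster.size = size
        return cluster

    def deploy(self, service: str, name: str, version: str) -> Cluster:
        cluster = self._get(service, name)
        cluster.version = version
        return cluster

    def delete(self, service: str, name: str) -> None:
        self._get(service, name)  # raises KeyError if the cluster is unknown
        del self.clusters[f"{service}/{name}"]
```

Because every operation takes the service name as data rather than living in a service-specific script, the same four calls apply to any service that follows the standard.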
  • 30. SLIDE TITLE Survey the Work How deep does the hole go? Start with Friends API and Segmentation Where to Start? Look for the greatest need
  • 31. COMPLEXITY Complex Service(s) • Manual Post-Start Actions • Service Discovery because no standards Duplication in Many Places • 5 services of the same service • We were pushing the limits of legacy model
  • 32. COMPLEXITY Unclear boundaries • Configuration ownership shared across teams • Service Discovery because no standards Unclear assumptions • Inconsistent naming and usage • The way it works now is the way it should be
  • 33. MIGRATION Simplifying Complex Re-evaluate all choices in light of services-first Many Transitional Changes Startup Services Infrastructure to Application Precision Replacement Coordinated Handover Careful work
  • 34. THE PROCESS Legacy Process • Servers First • Human Orchestration Transition • Services First • Automatic triggers legacy Value • Replace legacy with artifact
  • 35. VISION Standard Services First With standards, exceptions are hard; Without standards, everything is hard
  • 36. OUR HERO Autoscaling Implemented • Second Services Team: • Launched to Staging last week • Launched to Production yesterday
  • 37. THE REWARD Service Patterns for Scaling • Deployment Targets • Standard Startup • Standard Services New Powers • On-Demand Clusters • Per-Cluster Versioning • Immediate Feedback
  • 38. OUR HERO Your team builds it, your team runs it; we can help your team run it better
  • 39. OUR BLUEPRINT Standard Services • Deployment Target • Internal Hostname • Internal Load Balancer • Autoscale Group • CodeDeploy Artifact Supporting Services Artifacts are easier with: • Configuration support hooks • Service Control hooks • Remote Execution hooks • Metrics, monitors, logs, alerts
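The five standard components on the blueprint slide lend themselves to a simple checklist. The sketch below is an assumption about how such a check might look: the `ServiceBlueprint` type, its field names, and the example identifiers are hypothetical and only mirror the slide's list.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ServiceBlueprint:
    """The five standard components from the slide; values are hypothetical IDs."""
    deployment_target: Optional[str] = None
    internal_hostname: Optional[str] = None
    internal_load_balancer: Optional[str] = None
    autoscale_group: Optional[str] = None
    codedeploy_artifact: Optional[str] = None


def missing_components(blueprint: ServiceBlueprint) -> List[str]:
    """Return the names of standard components the service does not have yet."""
    return [f.name for f in fields(blueprint) if getattr(blueprint, f.name) is None]
```

A partially converted service, such as `ServiceBlueprint(deployment_target="presto-staging", autoscale_group="presto-asg")`, would report the three components still to be added.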
  • 40. REMAINING SERVICES 41+ Services Just 41+ more to go Each one needs conversion 200+ Deployments Just 200+ more to go Each one needs re-deployment Empathy Not all services were designed for multi-cluster environments Not all services were designed for graceful termination Not all services have active improvements planned Challenges • Non-idempotent • Stateful / Disk-full • Master/Worker Co-Services • Maintain Service Levels • High Throughput Environment
  • 41. THE WAY HOME Best Practices Standard Services Standard Delivery Standard Tooling Work for Teams Improve Service as a Service Work with Teams Enable Super Powers Deploy on Demand Per-Cluster Versions
  • 42. REMAINING SERVICES Service Improvements Target business value: Delivery Velocity High-Trust Services Support Config Management No Big-Bang Replacements Business Depends on Previous Process Strategy to Improve Small Iterations Incremental Value
  • 43. OUR SERVICE IS NOT YOUR SERVICE All software is created within a context, and trade-offs are made based on that context
  • 44. RELIABILITY Reliability is: The quality of being trustworthy or performing consistently well
  • 45. INVESTMENTS Understandable Make every service easy to understand Allow any engineer to quickly operate and improve Consistent Make every service look the same Allow any engineer to work on any system without context Repeatable Practice makes perfect
  • 46. HEROES ARE FOR STORIES
  • 47. NO HEROES, ONLY TEAM Yuu Yamashita Takashi Kokubun Yuki Ito Chris Maxwell You? Site Reliability Engineer Robin Bowes You? Site Reliability Engineer You? Infrastructure Engineer You? Site Reliability Engineer
  • 48. T R E A S U R E D A T A BUILDING RELIABLE SERVICES
 • @WrathOfChris https://twitter.com/WrathOfChris
 • Chris Maxwell https://www.linkedin.com/in/wrathofchris/
 • Careers https://www.treasuredata.co.jp/careers/
 • Treasure Data, Inc. https://www.linkedin.com/company/treasure-data-inc-