Redefine Operations in a DevOps World
The New Role for Site Reliability Engineering
Todd Palino
DO2T61S
DEVOPS: AGILE OPERATIONS
Senior Staff Engineer, Site Reliability
LinkedIn
COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED #CAWORLD #NOBARRIERS
Abstract
Across industries, modern operations teams have noted the emergence of a new role: the Site
Reliability Engineer (SRE), a new IT craftsperson who fuses software engineering and
operations best practices to enable highly reliable software systems. Once the domain of
web-scale businesses, this discipline is both applicable and important for any organization looking to
differentiate itself in a world increasingly defined by software.
In this session, Todd Palino from LinkedIn explores SRE from organizational, team and
individual perspectives. He’ll describe how, through automation and problem solving, SRE
can permeate a technical organization, not only ensuring a high-performing,
always-available site but also informing decision making in everything from
system procurement to application design, builds and deployment.
Todd will talk in depth about what constitutes the best in SRE in a DevOps world, using
examples to examine the techniques needed to accelerate value and grow teams. Taking the
lid off SRE at LinkedIn, Todd describes how it started and continues to evolve, what
goals are important, and how it’s instrumental in building the high-trust, inclusive team culture
needed to drive continuous improvement and, importantly, to have lots of fun doing it!
Todd Palino
LinkedIn
Senior Staff Engineer
Site Reliability
Automate the easy things
Make the hard things easy
What gets measured,
gets fixed
We are here to
attack the problem
Not the person
Be impeccable with your word
Don’t take things personally
Don’t make assumptions
Always do your best
There are three kinds of feedback:
Positive, negative, and none
Only one of them is bad
Resources
Every Day is Monday in Operations – Ben Purgason and David Henke
https://www.linkedin.com/pulse/introduction-every-day-monday-operations-benjamin-purgason/
Site Reliability Engineering, from O’Reilly Media
https://landing.google.com/sre/book.html
More Questions About SRE?
Stay connected at communities.ca.com
Thank you.
DevOps:
Agile Operations
For more information on DevOps: Agile Operations,
please visit: http://cainc.to/CAW17-AO




Redefine Operations in a DevOps World: The New Role for Site Reliability Engineering

  • 1. Redefine Operations in a DevOps World The New Role for Site Reliability Engineering Todd Palino DO2T61S DEVOPS: AGILE OPERATIONS Senior Staff Engineer, Site Reliability LinkedIn
  • 2. 2 COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED#CAWORLD #NOBARRIERS Abstract Across industries, modern operations teams have noted the emergence of a new role: the Site Reliability Engineer (SRE); a new IT craftsperson who fuses software engineering and operations best practices to enable highly reliable software systems. Once the domain of web- scale businesses, this discipline is both applicable and important for any organization looking to differentiate itself in a world increasingly defined by software. In this session, Todd Palino from LinkedIn explores SRE from organizational, team and individual perspectives. He’ll describe how by crafting automation and problem solving, SRE can permeate across a technical organization – not only ensuring a massively high-performant and always available site, but used to inform optimum decision making - in everything from system procurement to application design, builds and deployment. Todd will talk in depth about what constitutes the best in SRE in a DevOps world, using examples to examine the techniques needed to accelerate value and grow teams. Taking the ‘lid-off’ SRE at LinkedIn, join Todd as he describes how it started and continues to evolve, what goals are important, and how it’s instrumental in building a high-trust and inclusive team culture needed to drive continuous improvement -- and importantly, have lots of fun doing it! Todd Palino LinkedIn Senior Staff Engineer Site Reliability
  • 3. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 4. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 5. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 6. Automate the easy things Make the hard things easy
  • 7. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 8. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 9. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 10. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 11. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 12. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 13. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 14.
  • 15. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 16. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 17. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 19. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED We are here to attack the problem Not the person
  • 20. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 21. Be impeccable with your word Don’t take things personally Don’t make assumptions Always do your best
  • 22. There are three kinds of feedback: Positive, negative, and none Only one of them is bad
  • 23. ‹#› #CAWORLD #NOBARRIERS COPYRIGHT © 2017 CA. ALL RIGHTS RESERVED
  • 24.
  • 25.
  • 26.
  • 27.
  • 28. More Questions About SRE? Resources: Every Day is Monday in Operations, by Ben Purgason and David Henke (https://www.linkedin.com/pulse/introduction-every-day-monday-operations-benjamin-purgason/); Site Reliability Engineering, from O’Reilly Media (https://landing.google.com/sre/book.html)
  • 29. Stay connected at communities.ca.com. Thank you.
  • 30. DevOps: Agile Operations. For more information on DevOps: Agile Operations, please visit: http://cainc.to/CAW17-AO

Editor's Notes

  1. Site reliability engineering is a phrase that was coined by Ben Treynor at Google in 2008, and he’s described it as what happens when you treat operations like a software problem. SRE codifies the rules about how we run our infrastructure, develops tools that implement those rules, and monitors the entire system to make sure it’s working the way we expect. And when it isn’t, we mitigate the problem and add more to our tools to make them work better the next time. Many companies have SRE organizations now, with one of the largest being at LinkedIn. Of course, Facebook, Apple, and Netflix are no slouches either. We all do it a little bit differently – for example, at LinkedIn SRE and software engineering are different organizations, as opposed to Google, where SREs and SWEs are part of the same organization. There’s no one right way to SRE, and you don’t have to have a huge company either. This does make my job here today a little tricky, as I can’t give you a step-by-step guide on how to implement SRE at your company. What I can do is tell you what SRE is, and what kind of environment it thrives in. LinkedIn is an excellent example of this, and here’s why.
  2. These are LinkedIn’s six core values: members first; relationships matter; be open, honest, and constructive; demand excellence; take intelligent risks; and act like an owner. They’re also the reason SRE works at LinkedIn, and you’re going to hear these themes echoed in everything I talk about. If we left out any of these, our site reliability organization would not be effective. Now, these are not your company’s values, unless you’re one of my colleagues, but you probably recognize all of them as being important.
  3. This is a mantra you often find in DevOps organizations – move fast and break things. It’s not completely crazy – if you’re taking intelligent risks, sometimes things are going to break. But breaking things should not be the goal. Yes, we need to move quickly. But we should not treat breaking things as the holy grail. We have users, and the user experience matters. I want to enable my developers to move quickly in the safest way possible, and this is one of the things that differentiates SRE – our focus is first on site up. What does that mean? Site stability comes first – if there is a problem with the site that is impacting users, getting that mitigated at the very least is everyone’s primary focus. And this is built into our DNA across the entire company. Even the development team that I work with, when they develop their OKRs each quarter, has as the very first objective “Site Up”.
  4. Distilling it down, beyond keeping the site up, SRE’s job is to automate all the easy things. And to make the hard things easy. Don’t worry, you’re not going to run out of things to do. Remember, you’re working with developers
  5. There’s a lot of them, and they’re always creating new features and applications because they’re being driven by product teams
  6. Who may not always be as focused on stability as you are. So they need tools, controls, and data to be able to accomplish their goals, and do it in a way that maintains the reliability of the entire site. That’s where SRE comes in.
  7. This is Ben, who is one of our SRE managers. He’s written a fantastic post series about SRE and operations titled “Every Day is Monday in Operations” – the URL is at the end of the presentation. This picture came about when he said during an all hands that you could call him any time when there’s a problem, and posted his phone number. But really, this isn’t SRE. We’re not superheroes. I’m more like…
  8. Batman. Well…
  9. That’s more like it. I don’t have super powers. I build wonderful toys to solve problems. LinkedIn SRE has a lot of tools available to us, which we’ve either built or improved, to make everyone’s job easier, both SRE and software engineer. We have build and deployment systems that automate getting code into production, in a manner consistent with our policies. We have monitoring and alerting systems that are common across the entire site. We have an auto-remediation system called Nurse that can respond to alerts and run through mitigation and recovery without waking us up. And when we find a new task, we write a new tool. This is because an SRE is also…
  10. Professionally lazy. My job is not to respond to the alert and get the site back up. OK, it’s part of my job. But really, my job is to make sure that the problem never happens more than once, and that we spend as little time as possible finding and fixing the issues. My job is to automate myself out of a job. To the old school sysadmins, this sounds like a bad idea – you want to make yourself indispensable. You want to make sure that everyone appreciates you and realizes that the site would fall apart without you. If you do that, they can’t let you go. They also can’t let you go on vacation. As an SRE, I know there is always another challenge. The developers are always adding new features that I need to understand, monitor, and ensure are built to be scalable and easy to run. There’s always an improvement I can make to our monitoring, or a system that needs to be tuned a little better. I want the next challenge, not last week’s challenge. Besides being lazy, what kind of skills does a typical SRE need to demonstrate?
  11. One of the biggest skills that you have to hire for is being able to work with code. Not only being able to write it, and write it well, but also being able to read and review. SREs write a lot of tools, but especially because those tools are part of our infrastructure that keeps the entire site running, they are treated as first-class applications, just like the products that serve our members. We strive to write good applications, not just scripts that we’ve hacked together. And like a good engineering organization, we have code reviews. It doesn’t stop there, however
  12. My particular brand of SRE is called “embedded SRE”. This means that my team is embedded with a development team, and we work together on a product as a single team. In my case, this is Apache Kafka, which many may be familiar with as a high performance streaming data infrastructure. The Streaming SRE team and the Streaming Engineering team sit together, we plan work together, and we handle issues together. Even though we have separate management chains, for all intents and purposes we are a single team. As such we both write tools to support running Kafka, such as monitoring applications, though the SRE team spends more of their time on this task. We also both debug deeper problems in Kafka, or discuss and plan new features. The software engineering team tends to spend more time on this side of the work. We each have our own expertise that we bring to the team – they go much deeper on the code, particularly the product code, than SRE does. But SRE knows much more…
  13. How all the pieces fit together. SREs need to not only understand their own applications, and how the various bits interact, but also how the site infrastructure in general works. A site like LinkedIn has hundreds of applications that variously depend on each other. It’s difficult for the development team to understand what all of these applications do, which ones are upstream, and which ones are downstream. And then there are all of the infrastructure tools that help us deploy applications. How do we get hardware? How do we define where apps get deployed? How do you set up network access controls? Source code ACLs? Network port numbers? SRE is here to help take the application that has been developed, and both get it, and keep it, running to serve the members. We have to know what all the pieces are, who is responsible for them, and how to use them. Especially in a large organization, this is a lot of information, and we don’t want to be the irreplaceable ops guy, so
  14. A lot of an SRE’s job is knowing where to find the answers. I’m sure I’m not the only person here who looks like a genius to their friends and family because we know how to Google the answer to their questions. Seriously, we all know how screwed we would be if Google or StackOverflow went down. But as long as they’re up, and thank your deity of choice that they have SREs, I don’t need to know all the answers, I just need to know where to look for them. It’s not too hard to identify this type of person, because they’re the ones who are willing to answer a question asked of them with “I don’t know, but here’s how I’d go about looking for the answer.” They’re also the person who is constantly learning new things, which is another important attribute for an SRE.
  15. We’ve talked about what SRE is, and what an SRE looks like. So how do you go about building an SRE organization that works? I was going to talk about hiring the right people here, but that’s not what comes first. Before you can hire SREs, you need to have a company that is willing to support them. All the technical expertise in the world will do you no good if you have an environment that ties their hands. So we have to start with a company that is willing to listen to the site reliability organization, and trust in their assessment and direction. Depending on how bad things are, this might include halting new features until the site can be stabilized. At LinkedIn, we call this a Code Yellow – the team that is in this state is declaring that they have to pare back everything they’re doing to stabilize their current problems. It’s not a failure – it’s a declaration that things are bad, it’s not acceptable, and they’re prioritizing fixing it. The first thing you need is data. If you don’t have it, it needs to be the first thing SRE works on.
  16. What gets measured, gets fixed. This is a quote from David Henke, who led Engineering and Operations at LinkedIn for many years. He can be credited with the pivot in our technical organization that got us focused on fixing what was wrong with the infrastructure as we were in our hypergrowth phase. You cannot tackle the problem if you don’t know that you have a problem, or how bad it is. SRE loves data. Data cuts to the truth of the situation, without bias or equivocation. LinkedIn currently generates over 100 terabytes a day of application metrics, logs, tracking and measurement data. I know, because it’s one of several types of data that Kafka carries internally. This is what drives the decisions we make – what problem is impacting our members the most, what feature is the most worthwhile to pursue, which team has the most on-call pain? Once SRE can identify where the problems are, we can attack them. Which leads us to our first big culture point for SRE.
  17. We are here to attack the problem, not the person. Another quote from David, which was spoken specifically about incident post-mortems, but applies overall to how we work. LinkedIn strives to maintain a blameless environment in engineering and operations. We all make mistakes – I, myself, have knocked out the entirety of one of our backend datacenters with a broadcast command that wiped out all of Zookeeper. Knowing who is to blame does nothing for fixing the situation that led to a problem. It only serves to make that person feel isolated. Breaking things happens – if you’re taking intelligent risks, some of them are going to fail. These are opportunities to learn, and figure out how to do better next time.
  18. I saw one of the best examples of this this past August. At the end of August we had our SREinCon – an internal 2-day conference for our SRE organization where we get together and share what we are doing and what we are learning. This year, one of my colleagues stood up on stage and told her story about the anatomy of a major incident that took place shortly after she started at the company, which was over a year previous to the conference. In that incident, the actions she took to try and mitigate the problem ended up taking down the entire site. She spoke of how it all happened, and what was learned both organizationally and personally. That she was willing to stand in front of her peers and discuss that openly and honestly is the finest testament to our blameless culture. The conference itself is another cultural touchstone
  19. This is having a collaborative environment. We’ve all seen companies where internal political squabbles detract from the values and mission of the company. I hope this is not somewhere where any of you are right now. We know that this is toxic and will destroy a team, and a large part of that is because you cannot be open and honest. David also promoted the idea of the Four Agreements – these are the agreements that we need to have with our colleagues. Do your best. Be impeccable with your word. Don’t assume anything. And don’t take things personally. These four agreements are the foundation for collaboration, and a team that can trust each individual to be doing their job, and not working against the team. When I trust the rest of the team to handle their own applications well, it frees me to work on my own. This increases the efficiency of the entire company, because we are not duplicating effort.
  20. It also means that when someone has feedback for me, I can accept it without being concerned about ulterior motives. And I want their feedback, because it’s the only way we can improve. Yes, Mr. Henke is a very smart man. The only bad feedback is no feedback, especially when it comes to tools and infrastructure. It either means that the application is not being used, or the people using it don’t care enough to improve it. If it’s that people don’t care, well, you have a culture problem that needs to be addressed. But if the problem is that the application is not being used, especially if it’s an infrastructure tool that everyone is expected to use, it means that there are problems that are being worked around, and for some reason there is a lack of trust that feedback, if given, will be acted on.
  21. Of course you’re going to have to compensate your SREs well. But we can get money anywhere. Especially if you’re located in a tech-dense area, such as the San Francisco Bay area, good engineers are in high demand. If you want to hire and keep your SREs, you need to offer more.
  22. The key is to provide the opportunity to learn, grow, and be recognized for it. When I was hired at LinkedIn, nearly four years ago, I knew what SRE was, though it wasn’t my role at the time. What I didn’t know was the first thing about big data. I had no idea what Apache Kafka was. I was brought in for my general skills as an engineer, my ability to mesh with the team, and my ability to learn. Since that time, I have essentially reinvented my career, and that is thanks in large part to the support I have received from LinkedIn, and my management chain, to build my own brand. Sure, I’m paid well. But I stay because I’m treated well, and I have the ability to make a real impact. Not only on LinkedIn, but also for our members. That impact is a direct function of my ability to be a technical leader within SRE.
  23. Part of being an SRE is engaging with other teams. Our infrastructure consists of hundreds of interconnected applications, so it’s rare that you’ll run into an issue that isn’t shared by multiple teams. Even if it’s a problem with your own application, it nearly always impacts someone else, either upstream or downstream. This is one of the reasons that LinkedIn’s career progression path documents specifically call out an increasing amount of interaction with teams both inside and outside of the company. As an example, some of my responsibilities that are not directly related to the application I run include leading our Site Reliability Technology Leadership Group, developing new standards for incident management across the entire company, and taking advantage of wonderful opportunities like these to talk with my peers in the industry about both Apache Kafka and site reliability engineering.
  24. But when you make it a priority for your SREs to constantly improve themselves, and you must, you also need to accept that sometimes you need to let them go. At LinkedIn we frequently use the phrase “next play”. For any sports fans, the origin of this phrase is with college basketball, and a coach who would constantly emphasize “next play” every time his team completed a sequence. For us, it’s the same – you completed that project? Great, what’s the next play? But it also refers to your personal next play. As individuals, we must be constantly aware of what we need in our career, what the next logical step is. Sometimes this is movement within the company to a new team, and we make this very easy with clear guidelines and open doors for discussion. Sometimes it’s at another company. Either way, everyone must feel comfortable with discussing this with their management without fear, and management must support their engineers with making the changes that are right for that person first.
  25. The most important thing to remember, though, is that culture starts at the top. I wouldn’t work here if I didn’t trust our entire team, from Satya Nadella on down. LinkedIn’s values are fully supported by Jeff, Kevin, and Mohak. Our site reliability organization has been built by David, Bruno Connelly, and the teams they have built. David Henke had something else that he said that I’d like to leave you with. You are only as good as your lieutenants. Leadership builds the team that supports their vision, and needs to trust them to execute on it. If nothing else, walk out of this room understanding that you need to be the change you wish to see in your company.
  26. If you have more questions about SRE at LinkedIn, I encourage you to check out Ben’s post series titled Every Day is Monday in Operations. He collaborated with David Henke on this, and you’ll find it published on LinkedIn, of course. You should also check out Site Reliability Engineering, published by O’Reilly Media, which was authored by several SREs at Google. Keep in mind, of course, that neither of these will tell you exactly how SRE is supposed to work. They will only tell you how it works at LinkedIn and Google. You’ll need to take that information and figure out how it should look for you.
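
Note 9 mentions an auto-remediation system that responds to alerts and runs through mitigation and recovery without waking anyone up. As a rough illustration of that general idea only, here is a minimal Python sketch. It is not how LinkedIn's Nurse works; the `Alert` and `Remediator` classes, the alert names, and the host names are all hypothetical. Each alert name maps to an ordered list of remediation steps, and a human is paged only when no step succeeds.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    name: str  # hypothetical alert identifier, e.g. "service_down"
    host: str  # hypothetical host the alert fired on


# A remediation step returns True when it believes the problem is mitigated.
Step = Callable[[Alert], bool]


@dataclass
class Remediator:
    """Toy alert handler: try known fixes first, page a human last."""
    playbooks: Dict[str, List[Step]] = field(default_factory=dict)
    paged: List[Alert] = field(default_factory=list)

    def register(self, alert_name: str, steps: List[Step]) -> None:
        # Associate an ordered list of remediation steps with an alert name.
        self.playbooks[alert_name] = steps

    def handle(self, alert: Alert) -> str:
        # Run each step in order and stop at the first one that succeeds.
        for step in self.playbooks.get(alert.name, []):
            if step(alert):
                return "remediated"
        # Unknown alert, or every step failed: escalate to the on-call.
        self.paged.append(alert)
        return "paged"


def restart_service(alert: Alert) -> bool:
    # Stand-in for a real restart; pretend it fails on one particular host.
    return alert.host != "db-7"


r = Remediator()
r.register("service_down", [restart_service])
print(r.handle(Alert("service_down", "app-1")))  # a known fix succeeds
print(r.handle(Alert("disk_full", "app-1")))     # no playbook, so a human is paged
```

Keeping the playbooks as plain data means that "when we find a new task, we write a new tool" becomes one more `register` call rather than a change to the handler itself.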