SlideShare a Scribd company logo
1 of 36
Making a Lion Bulletproof: SRE in Banking
Robin van Zijll & Janna Brummel (@jannabrummel) QCon NY, June 26 2019
InfoQ.com: News & Community Site
• 750,000 unique visitors/month
• Published in 4 languages (English, Chinese, Japanese and Brazilian
Portuguese)
• Post content from our QCon conferences
• News 15-20 / week
• Articles 3-4 / week
• Presentations (videos) 12-15 / week
• Interviews 2-3 / week
• Books 1 / month
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
ing-sre-teams-practices/
Presented at QCon New York
www.qconnewyork.com
Purpose of QCon
- to empower software development by facilitating the spread of
knowledge and innovation
Strategy
- practitioner-driven conference designed for YOU: influencers of
change and innovation in your teams
- speakers and topics driving the evolution and innovation
- connecting and catalyzing the influencers and innovators
Highlights
- attended by more than 12,000 delegates since 2007
- held in 9 cities worldwide
ING is a global financial organization, active in 41 countries
This talk is about the retail bank of NL with…
9 million debit cards
8 million retail customers
7 million ATM transactions/month
Mobile banking is used by 4.5 million customers
Together, they log in 6 million times a day (100+ TPS)
99.77
99.87
0.22 0.13
INTERNET BANKING MOBILE BANKING
AVAILABILITY FIGURES 2018
PRIME TIME (06:30 AM – 01:00 AM)
Uptime Downtime
99.88regulator target
Logins per second for Mobile Banking
100
40
20
0
60
80
120
140
00:00 04:00 08:00 12:00 16:00 20:0020:00 00:00 04:00 08:00 12:00 16:00
99.63
99.78
0.37 0.22
INTERNET BANKING MOBILE BANKING
AVAILABILITY FIGURES 2018
24 HOURS A DAY
Uptime Downtime
99.999customer
expectation
SR-what?
Site Reliability Engineering is “what happens when you
ask a software engineer to design an operations
function” – Ben Traynor (Google)
People
At ING we are organized in tribes with (Biz)DevOps squads
responsible for build and run
product owners
tribe
tribe
lead
squad squad squad squad
tribe tribe tribe
Our SRE team is a ‘horizontal’ squad part of a productivity engineering tribe
We support 1700 engineers across 340 squads
Our SRE team
7 engineers (4 dev, 3 ops)
2 more joining soon
1 product owner
1 chapter lead
mostly with engineering and on-call
experience in ING product engineering
When we hire SREs, we look for someone who’s
Passionate about reliability, problems, DevOps and open source
OK with failure
Insensitive to hierarchy
Willing to teach and advise engineers about reliability
Experienced in on-call duties and 1+ language(s) in our stack
Still excited to work with us after meeting half our team and having heard
realistic job expectations
Process
Why and how did we start with SRE?
We used to have a small team of ops engineers on call for online channels
These engineers were the ones up at night, but they could not structurally
improve service reliability because of our DevOps model
SRE pilot was started and supported
• Team was transformed and given a new purpose
• Decided on SRE model, way of working and roadmap
• Experiences and proposal were presented to senior management
After knowledge transfer of old tasks, SRE was launched :)
For SRE, we generally see 3 organizational models
product engineering + SRE
product
engineering
SREs
tribe SRE
product
engineering
Service ownership is shared
between PE and SRE
SREs are distributed and
embedded in PE teams,
service ownership is shared
Service ownership is with PE,
SRE consults and creates tools
our model
What do we do as SREs?
Product
Development
Capacity Planning
Testing + Release
Procedures
Postmortem/RCA
Incident Response
Monitoring
Service Reliability Hierarchy, from O’Reilly’s Site Reliability Engineering (2016)
Curious to learn more about…
• Learning from failure? Check out
Jason’s and Ryan’s talk
• Chaos engineering and graceful
degredation? Check out Lorne’s talk
• High impact outlier system failures?
Check out Laura’s talk
What do we do as SREs?
We spend 80% of our time on engineering
• We deliver the Reliability Toolkit: a white-box monitoring and alerting stack
• We work on a secure container platform with a service mesh in public cloud
We spread SRE love and best practices
• We reach out to engineers to consult and get feedback
• We educate on reliability topics
What we don’t do
• On-call for product engineering
• Work on SRE-topics already covered by other teams in our organization
We do outreach and we educate on SRE topics
We educate engineers
• Engineering onboarding
• Prometheus workshops
We facilitate knowledge sharing
• Cross-domain SRE guild
• SRE demo sessions open to all
• Guidance via chat and intranet
• Prometheus user community
• Conference report out
We reach out to engineers
• Feedback loop for products
• We are reliability advocates
When we demo, we sometimes block the hallway
We use these principles in our way of working
We work with industry standards
We work with open source products and practices
We automate toil wherever and whenever we can
Technology
Why did we develop the Reliability Toolkit?
Mean time to repair is too long – we waste time finding
incident owners
Lack of insight into application health for teams
High level of technology diversity makes implementing
monitoring difficult
How does the Reliability Toolkit work?
Prometheus Alert Manager
Model Builder
Grafana
E-mail, SMS (Message Bird) and
ChatOps (Mattermost)
Applications
How do we provision the Reliability Toolkit?
SRE Team
Together with a
team we create
a joint config
We maintain and
update binaries
We deliver the
Reliability Toolkit
on 5 instances over
3 environments, we
remain responsible
We deliver client libraries
so metrics can be
scraped from servers
Before, teams would own and use a full pipeline…
version
control
combine
configurations
build publish deploy
=
reliability
toolkit
done by
devops
team
done by
devops
team
…now they only own and update config
version
control
combine
configurations
build
deploy
=
reliability
toolkit
done by
devops
team
Increasing and improving usage of Reliability Toolkit
Include client libraries in engineering frameworks
Ensure a good feedback loop: in person or in tooling
Educate others during onboarding and workshops
Template team dashboards and make other dashboards
accessible to all
And now Reliability Toolkit usage has been increasing
We made onboarding and using our Reliability Toolkit
easy, but our 70 onboarded teams still need to ensure that
Prometheus can scrape metrics
How can we reach all 340 teams?
Let’s try a service mesh!
Curious? Check the Software Defined Infrastructure track
Why use service mesh to improve reliability?
• Service mesh helps us to get new/updated functionality to applications fast
• We can improve observability for all: metrics, logs, distributed tracing and
resilience patterns based on incident learnings that work out of the box
• We can introduce/expand A/B testing, canary releasing and staged rollouts
• Engineers only need to worry about security at application level: immutable
containers, zero trust network and security policies for free, taking away
risk documentation work
What are we working on next?
• Scaling in our Reliability Toolkit stack for efficient use of
resources, scaling up number of teams using our stack
• Expanding our role as reliability advocates
• Completing PoC with service mesh
Takeaways
• Hire SREs from your product engineering domain
• Never compromise on mindset in SREs
• Start with a pilot if you are not sure if SRE works for you
• Pick a SRE model that works well for your organization
• Try to get senior management support and understanding
• Invest in SRE outreach and education
• Focus on scalability and ease-of-use in your tooling
• Don’t be afraid of redesign if it makes users happier
Questions?
Icons used are all from flaticon.com
Watch the video with slide
synchronization on InfoQ.com!
https://www.infoq.com/presentations/
ing-sre-teams-practices/

More Related Content

More from C4Media

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDC4Media
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine LearningC4Media
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at SpeedC4Media
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsC4Media
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsC4Media
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerC4Media
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleC4Media
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeC4Media
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereC4Media
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing ForC4Media
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data EngineeringC4Media
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreC4Media
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsC4Media
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechC4Media
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/awaitC4Media
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaC4Media
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayC4Media
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?C4Media
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseC4Media
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinC4Media
 

More from C4Media (20)

Shifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CDShifting Left with Cloud Native CI/CD
Shifting Left with Cloud Native CI/CD
 
CI/CD for Machine Learning
CI/CD for Machine LearningCI/CD for Machine Learning
CI/CD for Machine Learning
 
Fault Tolerance at Speed
Fault Tolerance at SpeedFault Tolerance at Speed
Fault Tolerance at Speed
 
Architectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep SystemsArchitectures That Scale Deep - Regaining Control in Deep Systems
Architectures That Scale Deep - Regaining Control in Deep Systems
 
ML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.jsML in the Browser: Interactive Experiences with Tensorflow.js
ML in the Browser: Interactive Experiences with Tensorflow.js
 
Build Your Own WebAssembly Compiler
Build Your Own WebAssembly CompilerBuild Your Own WebAssembly Compiler
Build Your Own WebAssembly Compiler
 
User & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix ScaleUser & Device Identity for Microservices @ Netflix Scale
User & Device Identity for Microservices @ Netflix Scale
 
Scaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's EdgeScaling Patterns for Netflix's Edge
Scaling Patterns for Netflix's Edge
 
Make Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home EverywhereMake Your Electron App Feel at Home Everywhere
Make Your Electron App Feel at Home Everywhere
 
The Talk You've Been Await-ing For
The Talk You've Been Await-ing ForThe Talk You've Been Await-ing For
The Talk You've Been Await-ing For
 
Future of Data Engineering
Future of Data EngineeringFuture of Data Engineering
Future of Data Engineering
 
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and MoreAutomated Testing for Terraform, Docker, Packer, Kubernetes, and More
Automated Testing for Terraform, Docker, Packer, Kubernetes, and More
 
Navigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery TeamsNavigating Complexity: High-performance Delivery and Discovery Teams
Navigating Complexity: High-performance Delivery and Discovery Teams
 
High Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in AdtechHigh Performance Cooperative Distributed Systems in Adtech
High Performance Cooperative Distributed Systems in Adtech
 
Rust's Journey to Async/await
Rust's Journey to Async/awaitRust's Journey to Async/await
Rust's Journey to Async/await
 
Opportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven UtopiaOpportunities and Pitfalls of Event-Driven Utopia
Opportunities and Pitfalls of Event-Driven Utopia
 
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/DayDatadog: a Real-Time Metrics Database for One Quadrillion Points/Day
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day
 
Are We Really Cloud-Native?
Are We Really Cloud-Native?Are We Really Cloud-Native?
Are We Really Cloud-Native?
 
CockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL DatabaseCockroachDB: Architecture of a Geo-Distributed SQL Database
CockroachDB: Architecture of a Geo-Distributed SQL Database
 
A Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with BrooklinA Dive into Streams @LinkedIn with Brooklin
A Dive into Streams @LinkedIn with Brooklin
 

Recently uploaded

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 

Recently uploaded (20)

New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 

Making a Lion Bulletproof: SRE in Banking

  • 1. Making a Lion Bulletproof: SRE in Banking Robin van Zijll & Janna Brummel (@jannabrummel) QCon NY, June 26 2019
  • 2. InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ ing-sre-teams-practices/
  • 3. Presented at QCon New York www.qconnewyork.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. ING is a global financial organization, active in 41 countries This talk is about the retail bank of NL with… 9 million debit cards 8 million retail customers 7 million ATM transactions/month
  • 5. Mobile banking is used by 4.5 million customers Together, they log in 6 million times a day (100+ TPS)
  • 6. 99.77 99.87 0.22 0.13 INTERNET BANKING MOBILE BANKING AVAILABILITY FIGURES 2018 PRIME TIME (06:30 AM – 01:00 AM) Uptime Downtime 99.88regulator target
  • 7. Logins per second for Mobile Banking 100 40 20 0 60 80 120 140 00:00 04:00 08:00 12:00 16:00 20:0020:00 00:00 04:00 08:00 12:00 16:00
  • 8. 99.63 99.78 0.37 0.22 INTERNET BANKING MOBILE BANKING AVAILABILITY FIGURES 2018 24 HOURS A DAY Uptime Downtime 99.999customer expectation
  • 9. SR-what? Site Reliability Engineering is “what happens when you ask a software engineer to design an operations function” – Ben Traynor (Google)
  • 11. At ING we are organized in tribes with (Biz)DevOps squads responsible for build and run product owners tribe tribe lead squad squad squad squad tribe tribe tribe Our SRE team is a ‘horizontal’ squad part of a productivity engineering tribe We support 1700 engineers across 340 squads
  • 12. Our SRE team 7 engineers (4 dev, 3 ops) 2 more joining soon 1 product owner 1 chapter lead mostly with engineering and on-call experience in ING product engineering
  • 13. When we hire SREs, we look for someone who’s Passionate about reliability, problems, DevOps and open source OK with failure Insensitive to hierarchy Willing to teach and advise engineers about reliability Experienced in on-call duties and 1+ language(s) in our stack Still excited to work with us after meeting half our team and having heard realistic job expectations
  • 15. Why and how did we start with SRE? We used to have a small team of ops engineers on call for online channels These engineers were the ones up at night, but they could not structurally improve service reliability because of our DevOps model SRE pilot was started and supported • Team was transformed and given a new purpose • Decided on SRE model, way of working and roadmap • Experiences and proposal were presented to senior management After knowledge transfer of old tasks, SRE was launched :)
  • 16. For SRE, we generally see 3 organizational models product engineering + SRE product engineering SREs tribe SRE product engineering Service ownership is shared between PE and SRE SREs are distributed and embedded in PE teams, service ownership is shared Service ownership is with PE, SRE consults and creates tools our model
  • 17. What do we do as SREs? Product Development Capacity Planning Testing + Release Procedures Postmortem/RCA Incident Response Monitoring Service Reliability Hierarchy, from O’Reilly’s Site Reliability Engineering (2016) Curious to learn more about… • Learning from failure? Check out Jason’s and Ryan’s talk • Chaos engineering and graceful degredation? Check out Lorne’s talk • High impact outlier system failures? Check out Laura’s talk
  • 18. What do we do as SREs? We spend 80% of our time on engineering • We deliver the Reliability Toolkit: a white-box monitoring and alerting stack • We work on a secure container platform with a service mesh in public cloud We spread SRE love and best practices • We reach out to engineers to consult and get feedback • We educate on reliability topics What we don’t do • On-call for product engineering • Work on SRE-topics already covered by other teams in our organization
  • 19. We do outreach and we educate on SRE topics We educate engineers • Engineering onboarding • Prometheus workshops We facilitate knowledge sharing • Cross-domain SRE guild • SRE demo sessions open to all • Guidance via chat and intranet • Prometheus user community • Conference report out We reach out to engineers • Feedback loop for products • We are reliability advocates
  • 20. When we demo, we sometimes block the hallway
  • 21. We use these principles in our way of working We work with industry standards We work with open source products and practices We automate toil wherever and whenever we can
  • 23. Why did we develop the Reliability Toolkit? Mean time to repair is too long – we waste time finding incident owners Lack of insight into application health for teams High level of technology diversity makes implementing monitoring difficult
  • 24. How does the Reliability Toolkit work? Prometheus Alert Manager Model Builder Grafana E-mail, SMS (Message Bird) and ChatOps (Mattermost) Applications
  • 25. How do we provision the Reliability Toolkit? SRE Team Together with a team we create a joint config We maintain and update binaries We deliver the Reliability Toolkit on 5 instances over 3 environments, we remain responsible We deliver client libraries so metrics can be scraped from servers
  • 26. Before, teams would own and use a full pipeline… version control combine configurations build publish deploy = reliability toolkit done by devops team done by devops team
  • 27. …now they only own and update config version control combine configurations build deploy = reliability toolkit done by devops team
  • 28. Increasing and improving usage of Reliability Toolkit Include client libraries in engineering frameworks Ensure a good feedback loop: in person or in tooling Educate others during onboarding and workshops Template team dashboards and make other dashboards accessible to all
  • 29. And now Reliability Toolkit usage has been increasing
  • 30. We made onboarding and using our Reliability Toolkit easy, but our 70 onboarded teams still need to ensure that Prometheus can scrape metrics How can we reach all 340 teams?
  • 31. Let’s try a service mesh! Curious? Check the Software Defined Infrastructure track
  • 32. Why use service mesh to improve reliability? • Service mesh helps us to get new/updated functionality to applications fast • We can improve observability for all: metrics, logs, distributed tracing and resilience patterns based on incident learnings that work out of the box • We can introduce/expand A/B testing, canary releasing and staged rollouts • Engineers only need to worry about security at application level: immutable containers, zero trust network and security policies for free, taking away risk documentation work
  • 33. What are we working on next? • Scaling in our Reliability Toolkit stack for efficient use of resources, scaling up number of teams using our stack • Expanding our role as reliability advocates • Completing PoC with service mesh
  • 34. Takeaways • Hire SREs from your product engineering domain • Never compromise on mindset in SREs • Start with a pilot if you are not sure if SRE works for you • Pick a SRE model that works well for your organization • Try to get senior management support and understanding • Invest in SRE outreach and education • Focus on scalability and ease-of-use in your tooling • Don’t be afraid of redesign if it makes users happier
  • 35. Questions? Icons used are all from flaticon.com
  • 36. Watch the video with slide synchronization on InfoQ.com! https://www.infoq.com/presentations/ ing-sre-teams-practices/