Reducing incident MTTR
via automated diagnosis and remediation
Auto-remediation and Event Driven Automation Meetup
Kiran Gollu, Founder
Neptune.io © 2017
Neptune
•  One-stop incident automation-as-a-service for DevOps
•  Fewer alerts, more uptime, and more sleep
•  Highly scalable (millions of alerts), secure, and reliable (tolerates region-wide data center outages)
Me
•  Present:
•  Founder, Neptune
•  Y Combinator and Data Collective VC backed startup
•  Past:
•  Founding engineer/architect at AWS S3 and DynamoDB
•  Published research papers on distributed systems in collaboration with Microsoft Research
What I will talk about
•  State of incident response today
•  Alert enrichment and auto-remediation techniques / tradeoffs
•  How do you solve this problem for your company?
•  Demo Neptune
•  Our learnings & recommendations
What is Incident Response?
How to handle alerts/incidents/outages?
Many more..
Alerts
Maturity level of incident response teams
@jpaulreed @kfinnbraun DevOps Enterprise Summit
Problem #1: Lots of alerts
Problem #2: Failures are complicated
Problem #3: Debugging is hard!
Problem #4: 95% of Time To Recovery (TTR) is still manual today
[Figure: incident timeline by phase: Alert; Troubleshooting (Triage | Investigate | Identify); Resolution; Documentation. Percentages shown: 73%, 10%, 5%, 12%.]
Source: DevOps survey; VictorOps incident response
Snapshots
•  Graphs & metrics
•  Logs
•  Webpages
Service health checks
•  Internal
•  External
Host/App diagnostics
•  “top”, “df -H”, etc.
•  Heap dumps/Stack traces
Runbooks
•  On a single host or a cluster of hosts
•  Any script, any language
Cloud API/CLI actions
•  Start/Stop/Reboot
•  Scale up/down
Root-cause analysis & Audit
•  Heap dumps
•  Logs
•  Graphs
Post-mortem
•  History
•  Diagnostics
FBAR : Facebook Auto Remediation Platform
“…It’s doing the work of approximately 200 sys admins…”
“We built an internal tool for AWS”
Nurse: Auto-remediation platform
“60% of problems are fixed automatically…”
Winston: Event driven automation tool
Your Options
•  SaaS product, on-premise offering on AWS, deep integrations with monitoring tools
•  Open-source event-driven automation
•  Build it in-house
Common Issues
•  Noise
•  non-actionable alerts
•  false positives
•  self-recovering alerts
•  Engineer burnout: Too much manual work
•  Not measuring cost of dealing with incidents
•  Incorrect monitoring thresholds
•  Band-aids instead of fixing root causes
•  Alert correlation is hard – downstream/upstream dependencies
•  Not having clear incident response and escalation processes
Architecting automated diagnosis and remediation
1. One-stop incident tracking
•  Helps identify the top 20% of alerts causing 80% of the pain
•  Sorted by frequency and MTTR
•  Capture:
•  MTTA (mean time to acknowledgement)
•  MTTR (mean time to resolution)
•  Frequency of occurrence (number of times a particular alert has occurred)
•  Reporting + Auditing
•  Audit all activity (both manual + automated)
•  Leads to data-driven post mortems
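The tracking metrics above can be sketched in a few lines of Python. This is only an illustration, not Neptune's implementation: the `Incident` record, its field names, and the ranking by `count * MTTR` (used here as a stand-in for "pain") are all assumptions.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    alert_name: str            # hypothetical identifier of the alert rule
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def summarize(incidents):
    """Return per-alert stats, sorted by count * MTTR (highest first)."""
    grouped = defaultdict(list)
    for inc in incidents:
        grouped[inc.alert_name].append(inc)

    rows = []
    for name, group in grouped.items():
        # MTTA: trigger -> acknowledgement; MTTR: trigger -> resolution.
        mtta = mean((i.acknowledged_at - i.triggered_at).total_seconds() for i in group)
        mttr = mean((i.resolved_at - i.triggered_at).total_seconds() for i in group)
        rows.append({"alert": name, "count": len(group), "mtta_s": mtta, "mttr_s": mttr})

    # Assumed "pain" score: how often the alert fires times how long it takes to resolve.
    rows.sort(key=lambda r: r["count"] * r["mttr_s"], reverse=True)
    return rows


# Example usage with made-up incident history.
t0 = datetime(2017, 1, 1)
history = [
    Incident("disk_full", t0, t0 + timedelta(minutes=1), t0 + timedelta(minutes=10)),
    Incident("disk_full", t0, t0 + timedelta(minutes=3), t0 + timedelta(minutes=20)),
    Incident("high_cpu", t0, t0 + timedelta(minutes=2), t0 + timedelta(minutes=5)),
]
for row in summarize(history):
    print(row)
```

The alerts at the top of this list are the natural first candidates for enrichment or automation.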
2. Enrich Alerts
•  When an alert occurs:
•  Gather debugging context automatically from different tools
Use cases:
1.  High memory alert
•  capture top-10 memory hogs, thread dump, memory usage graphs
2.  High app error rate
•  capture error rate & latency trends
•  app logs for 5xx errors from Splunk/Sumo Logic
•  app health checks
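A minimal sketch of the high-memory use case, assuming the raw process listing has already been collected as text (for example from `ps -eo pid,rss,comm` on Linux). The `ENRICHERS` registry, the `enrich` function, and the alert dictionary layout are hypothetical names chosen for illustration.

```python
def top_memory_hogs(ps_output, n=10):
    """Parse `pid rss command` lines and return the n largest processes by RSS."""
    procs = []
    for line in ps_output.strip().splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue  # skip the header row and malformed lines
        procs.append({"pid": int(parts[0]), "rss_kb": int(parts[1]), "command": parts[2]})
    procs.sort(key=lambda p: p["rss_kb"], reverse=True)
    return procs[:n]


# Hypothetical mapping from alert type to the collectors that should run for it.
ENRICHERS = {
    "high_memory": [("top_memory_hogs", top_memory_hogs)],
}


def enrich(alert, raw_inputs):
    """Attach debugging context to an alert; a failing collector never blocks the alert."""
    context = {}
    for name, collector in ENRICHERS.get(alert["type"], []):
        try:
            context[name] = collector(raw_inputs[name])
        except Exception as exc:  # record the failure instead of dropping the alert
            context[name] = {"error": repr(exc)}
    return {**alert, "context": context}
```

Keeping collectors as plain functions in a registry makes it easy to add other context (graphs, log excerpts, health checks) per alert type without touching the alert pipeline itself.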
3. Auto-Remediate repetitive alerts
•  When an alert occurs:
•  If it’s a known alert → run a remediation runbook
•  Use cases:
•  Process crashed → restart process
•  Host is unpingable → restart up to 3 times and escalate if it still fails
•  Service is down → capture graphs, run a remediation workflow
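The "restart, then escalate" pattern can be sketched as follows. The `check`, `restart`, and `escalate` callables are hypothetical hooks supplied by the caller (for example a ping test, a cloud reboot call, and a pager notification); the retry count of 3 comes from the use case above.

```python
import time


def remediate_with_escalation(check, restart, escalate,
                              max_attempts=3, wait_s=0.0, sleep=time.sleep):
    """Restart up to max_attempts times; escalate if the check still fails.

    Returns a log of the actions taken, which can feed the incident audit trail.
    """
    log = []
    for attempt in range(1, max_attempts + 1):
        restart()
        log.append(f"restart attempt {attempt}")
        sleep(wait_s)  # give the host or process time to come back
        if check():
            log.append("recovered")
            return log
    escalate()  # automation gave up: hand the incident to a human
    log.append("escalated")
    return log
```

Returning the action log rather than only a success flag keeps every automated step auditable alongside manual ones.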
Cultural aspects
•  Have a clear incident resolution and escalation process
•  Break silos: Dev and Ops can both resolve incidents and share how they were resolved
•  Document and version your runbooks
•  A single consolidated report per incident to make post-mortems easy
•  Audit all manual and automated actions for an incident
•  Use your own communication tools (Slack, HipChat) but record incident logs
•  Use tools to log team collaboration activity
Neptune Demo
How do we ensure military grade security?
Our learnings from 200+ SRE/Ops teams
•  Automate simple things first
•  Have checks in place to avoid cascading failures
•  Rate limiting, handling correlated failures
•  Capture state and snapshots for self-recovering alerts
•  Don’t automatically fix when you don’t know the root cause
•  Enriching incidents is as important as automating repetitive incidents
•  Availability of the automation tool should be >>> that of your apps
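One possible check against cascading failures is a rate limit on automated actions. The sketch below is an assumption about how such a guard might look, not a description of any particular product: the `RemediationLimiter` class and its parameters are hypothetical, and it uses a simple sliding window.

```python
import time
from collections import deque


class RemediationLimiter:
    """Allow at most max_actions automated actions per sliding window of window_s seconds."""

    def __init__(self, max_actions, window_s, clock=time.monotonic):
        self.max_actions = max_actions
        self.window_s = window_s
        self.clock = clock
        self._stamps = deque()

    def allow(self):
        now = self.clock()
        # Drop timestamps that have fallen out of the window.
        while self._stamps and now - self._stamps[0] >= self.window_s:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_actions:
            return False  # too many recent actions: stop and hand off to a human
        self._stamps.append(now)
        return True
```

A runbook would call `allow()` before each restart or reboot. When many correlated alerts fire at once, the limiter stops automation from rebooting a whole fleet and leaves the decision to an engineer.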
What I talked about
•  Reducing MTTR gives you:
•  More agility → frees up engineers to work on business-critical problems
•  More uptime, thus better customer experience
•  More sleep, happier engineers
•  To reduce MTTR for production alerts/incidents:
•  Create actionable alerts and fix monitoring thresholds
•  Embrace automation – enrich and automate alerts
•  Instill good incident response processes
Sign up for a 14-day free trial!
kiran@neptune.io
Backup
How do you get better at it?
•  Continuously eliminate manual effort involved
•  Streamline your incident response workflow (cultural aspects)
•  Encourage good behavior and punish bad behavior
•  Measure time spent in incident response
•  Make your alerts actionable
•  Fix the monitoring thresholds as a continuous process
•  Enrich and automate incidents to reduce MTTR
Neptune.io © 2017

More Related Content

What's hot

Runecast Analyzer Overview
Runecast Analyzer OverviewRunecast Analyzer Overview
Runecast Analyzer OverviewStanimir Markov
 
Optimize & Secure Your Hybrid Cloud with Runecast (September 2021)
Optimize & Secure Your Hybrid Cloud with Runecast (September 2021)Optimize & Secure Your Hybrid Cloud with Runecast (September 2021)
Optimize & Secure Your Hybrid Cloud with Runecast (September 2021)Jason Mashak
 
The Joy of Proactive Security
The Joy of Proactive SecurityThe Joy of Proactive Security
The Joy of Proactive SecurityAndy Hoernecke
 
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationEmpower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationRundeck
 
Runecast: Simplified Security with Unparalleled Transparency (March 2022)
Runecast: Simplified Security with Unparalleled Transparency (March 2022)Runecast: Simplified Security with Unparalleled Transparency (March 2022)
Runecast: Simplified Security with Unparalleled Transparency (March 2022)Jason Mashak
 
Patch your workplaces at home, in a meeting center or at the office
Patch your workplaces at home, in a meeting center or at the officePatch your workplaces at home, in a meeting center or at the office
Patch your workplaces at home, in a meeting center or at the officeIvanti
 
Intro to Puppet Enterprise 04.20.2017
Intro to Puppet Enterprise 04.20.2017Intro to Puppet Enterprise 04.20.2017
Intro to Puppet Enterprise 04.20.2017Puppet
 
Final observability starts_with_data
Final observability starts_with_dataFinal observability starts_with_data
Final observability starts_with_dataDave McAllister
 
BlueHat v18 || Improving security posture through increased agility with meas...
BlueHat v18 || Improving security posture through increased agility with meas...BlueHat v18 || Improving security posture through increased agility with meas...
BlueHat v18 || Improving security posture through increased agility with meas...BlueHat Security Conference
 
Shmoocon 2015 - httpscreenshot
Shmoocon 2015 - httpscreenshotShmoocon 2015 - httpscreenshot
Shmoocon 2015 - httpscreenshotjstnkndy
 
Nick Drage & Fraser Scott - Epic battle devops vs security
Nick Drage & Fraser Scott - Epic battle devops vs securityNick Drage & Fraser Scott - Epic battle devops vs security
Nick Drage & Fraser Scott - Epic battle devops vs securityDevSecCon
 
Lessons from DevOps: Taking DevOps practices into your AppSec Life
Lessons from DevOps: Taking DevOps practices into your AppSec LifeLessons from DevOps: Taking DevOps practices into your AppSec Life
Lessons from DevOps: Taking DevOps practices into your AppSec LifeMatt Tesauro
 
Intro to Puppet Enterprise Webinar 07.27.2017
Intro to Puppet Enterprise Webinar 07.27.2017Intro to Puppet Enterprise Webinar 07.27.2017
Intro to Puppet Enterprise Webinar 07.27.2017Claire Priester Papas
 
AllDayDevOps Security Chaos Engineering 2019
AllDayDevOps Security Chaos Engineering 2019 AllDayDevOps Security Chaos Engineering 2019
AllDayDevOps Security Chaos Engineering 2019 Aaron Rinehart
 
DEVSECOPS: Coding DevSecOps journey
DEVSECOPS: Coding DevSecOps journeyDEVSECOPS: Coding DevSecOps journey
DEVSECOPS: Coding DevSecOps journeyJason Suttie
 
Taking AppSec to 11 - BSides Austin 2016
Taking AppSec to 11 - BSides Austin 2016Taking AppSec to 11 - BSides Austin 2016
Taking AppSec to 11 - BSides Austin 2016Matt Tesauro
 
Chaos engineering for cloud native security
Chaos engineering for cloud native securityChaos engineering for cloud native security
Chaos engineering for cloud native securityKennedy
 
Consumerproduct
Consumerproduct Consumerproduct
Consumerproduct Webroot
 

What's hot (20)

Runecast Analyzer Overview
Runecast Analyzer OverviewRunecast Analyzer Overview
Runecast Analyzer Overview
 
Optimize & Secure Your Hybrid Cloud with Runecast (September 2021)
Optimize & Secure Your Hybrid Cloud with Runecast (September 2021)Optimize & Secure Your Hybrid Cloud with Runecast (September 2021)
Optimize & Secure Your Hybrid Cloud with Runecast (September 2021)
 
The Joy of Proactive Security
The Joy of Proactive SecurityThe Joy of Proactive Security
The Joy of Proactive Security
 
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital TransformationEmpower Devs, Simplify Ops, and Accelerate your Digital Transformation
Empower Devs, Simplify Ops, and Accelerate your Digital Transformation
 
Runecast: Simplified Security with Unparalleled Transparency (March 2022)
Runecast: Simplified Security with Unparalleled Transparency (March 2022)Runecast: Simplified Security with Unparalleled Transparency (March 2022)
Runecast: Simplified Security with Unparalleled Transparency (March 2022)
 
Patch your workplaces at home, in a meeting center or at the office
Patch your workplaces at home, in a meeting center or at the officePatch your workplaces at home, in a meeting center or at the office
Patch your workplaces at home, in a meeting center or at the office
 
Intro to Puppet Enterprise 04.20.2017
Intro to Puppet Enterprise 04.20.2017Intro to Puppet Enterprise 04.20.2017
Intro to Puppet Enterprise 04.20.2017
 
Final observability starts_with_data
Final observability starts_with_dataFinal observability starts_with_data
Final observability starts_with_data
 
BlueHat v18 || Improving security posture through increased agility with meas...
BlueHat v18 || Improving security posture through increased agility with meas...BlueHat v18 || Improving security posture through increased agility with meas...
BlueHat v18 || Improving security posture through increased agility with meas...
 
Chaos monitoring
Chaos monitoringChaos monitoring
Chaos monitoring
 
Shmoocon 2015 - httpscreenshot
Shmoocon 2015 - httpscreenshotShmoocon 2015 - httpscreenshot
Shmoocon 2015 - httpscreenshot
 
Nick Drage & Fraser Scott - Epic battle devops vs security
Nick Drage & Fraser Scott - Epic battle devops vs securityNick Drage & Fraser Scott - Epic battle devops vs security
Nick Drage & Fraser Scott - Epic battle devops vs security
 
Lessons from DevOps: Taking DevOps practices into your AppSec Life
Lessons from DevOps: Taking DevOps practices into your AppSec LifeLessons from DevOps: Taking DevOps practices into your AppSec Life
Lessons from DevOps: Taking DevOps practices into your AppSec Life
 
Intro to Puppet Enterprise Webinar 07.27.2017
Intro to Puppet Enterprise Webinar 07.27.2017Intro to Puppet Enterprise Webinar 07.27.2017
Intro to Puppet Enterprise Webinar 07.27.2017
 
AllDayDevOps Security Chaos Engineering 2019
AllDayDevOps Security Chaos Engineering 2019 AllDayDevOps Security Chaos Engineering 2019
AllDayDevOps Security Chaos Engineering 2019
 
DEVSECOPS: Coding DevSecOps journey
DEVSECOPS: Coding DevSecOps journeyDEVSECOPS: Coding DevSecOps journey
DEVSECOPS: Coding DevSecOps journey
 
Taking AppSec to 11 - BSides Austin 2016
Taking AppSec to 11 - BSides Austin 2016Taking AppSec to 11 - BSides Austin 2016
Taking AppSec to 11 - BSides Austin 2016
 
Chaos engineering for cloud native security
Chaos engineering for cloud native securityChaos engineering for cloud native security
Chaos engineering for cloud native security
 
Qradar as a SOC core
Qradar as a SOC coreQradar as a SOC core
Qradar as a SOC core
 
Consumerproduct
Consumerproduct Consumerproduct
Consumerproduct
 

Viewers also liked

Security Orchestration and Automation with Hexadite+
Security Orchestration and Automation with Hexadite+Security Orchestration and Automation with Hexadite+
Security Orchestration and Automation with Hexadite+Nathan Burke
 
Automation and Orchestration - Harnessing Threat Intelligence for Better Inci...
Automation and Orchestration - Harnessing Threat Intelligence for Better Inci...Automation and Orchestration - Harnessing Threat Intelligence for Better Inci...
Automation and Orchestration - Harnessing Threat Intelligence for Better Inci...Chris Ross
 
Event Driven Automation Meetup May 14/2015
Event Driven Automation Meetup May 14/2015Event Driven Automation Meetup May 14/2015
Event Driven Automation Meetup May 14/2015Dmitri Zimine
 
Netflix Winston meetup presentation 2015-11-18
Netflix Winston meetup presentation 2015-11-18Netflix Winston meetup presentation 2015-11-18
Netflix Winston meetup presentation 2015-11-18Sayli Karmarkar
 
March 2009 - Reducing Incidents: 3-2-1-0 Approach
March 2009 - Reducing Incidents: 3-2-1-0 ApproachMarch 2009 - Reducing Incidents: 3-2-1-0 Approach
March 2009 - Reducing Incidents: 3-2-1-0 ApproachIT Service and Support
 
Goodbye CLI, hello API: Leveraging network programmability in security incid...
Goodbye CLI, hello API:  Leveraging network programmability in security incid...Goodbye CLI, hello API:  Leveraging network programmability in security incid...
Goodbye CLI, hello API: Leveraging network programmability in security incid...Joel W. King
 
Strategy for Reducing Ticket Backlog
Strategy for Reducing Ticket BacklogStrategy for Reducing Ticket Backlog
Strategy for Reducing Ticket BacklogMark Copeland
 
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015DevOpsDays Tel Aviv
 
If We Only Had the Time: How Security Teams Can Focus On What’s Important
If We Only Had the Time: How Security Teams Can Focus On What’s ImportantIf We Only Had the Time: How Security Teams Can Focus On What’s Important
If We Only Had the Time: How Security Teams Can Focus On What’s ImportantNathan Burke
 
LEAN Project: Incident Reduction
LEAN Project: Incident ReductionLEAN Project: Incident Reduction
LEAN Project: Incident ReductionSagnik Pal
 
Event driven-automation and workflows
Event driven-automation and workflowsEvent driven-automation and workflows
Event driven-automation and workflowsDmitri Zimine
 
AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...
AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...
AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...Amazon Web Services
 
StackStorm DevOps Automation Webinar
StackStorm DevOps Automation WebinarStackStorm DevOps Automation Webinar
StackStorm DevOps Automation WebinarStackStorm
 
Ignite slides minimum viable runbooks lite
Ignite slides minimum viable runbooks   liteIgnite slides minimum viable runbooks   lite
Ignite slides minimum viable runbooks liteWill La
 
ITIL v3 Problem Management
ITIL v3 Problem ManagementITIL v3 Problem Management
ITIL v3 Problem ManagementJosep Bardallo
 

Viewers also liked (19)

Security Orchestration and Automation with Hexadite+
Security Orchestration and Automation with Hexadite+Security Orchestration and Automation with Hexadite+
Security Orchestration and Automation with Hexadite+
 
Automation and Orchestration - Harnessing Threat Intelligence for Better Inci...
Automation and Orchestration - Harnessing Threat Intelligence for Better Inci...Automation and Orchestration - Harnessing Threat Intelligence for Better Inci...
Automation and Orchestration - Harnessing Threat Intelligence for Better Inci...
 
Overview
OverviewOverview
Overview
 
Event Driven Automation Meetup May 14/2015
Event Driven Automation Meetup May 14/2015Event Driven Automation Meetup May 14/2015
Event Driven Automation Meetup May 14/2015
 
Nurse tech talk
Nurse tech talkNurse tech talk
Nurse tech talk
 
Netflix Winston meetup presentation 2015-11-18
Netflix Winston meetup presentation 2015-11-18Netflix Winston meetup presentation 2015-11-18
Netflix Winston meetup presentation 2015-11-18
 
March 2009 - Reducing Incidents: 3-2-1-0 Approach
March 2009 - Reducing Incidents: 3-2-1-0 ApproachMarch 2009 - Reducing Incidents: 3-2-1-0 Approach
March 2009 - Reducing Incidents: 3-2-1-0 Approach
 
Goodbye CLI, hello API: Leveraging network programmability in security incid...
Goodbye CLI, hello API:  Leveraging network programmability in security incid...Goodbye CLI, hello API:  Leveraging network programmability in security incid...
Goodbye CLI, hello API: Leveraging network programmability in security incid...
 
Strategy for Reducing Ticket Backlog
Strategy for Reducing Ticket BacklogStrategy for Reducing Ticket Backlog
Strategy for Reducing Ticket Backlog
 
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
Monitoring at Facebook - Ran Leibman, Facebook - DevOpsDays Tel Aviv 2015
 
If We Only Had the Time: How Security Teams Can Focus On What’s Important
If We Only Had the Time: How Security Teams Can Focus On What’s ImportantIf We Only Had the Time: How Security Teams Can Focus On What’s Important
If We Only Had the Time: How Security Teams Can Focus On What’s Important
 
LEAN Project: Incident Reduction
LEAN Project: Incident ReductionLEAN Project: Incident Reduction
LEAN Project: Incident Reduction
 
Event driven-automation and workflows
Event driven-automation and workflowsEvent driven-automation and workflows
Event driven-automation and workflows
 
AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...
AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...
AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...
 
Design Thinking and Lean UX
Design Thinking and Lean UXDesign Thinking and Lean UX
Design Thinking and Lean UX
 
Incident Management
Incident ManagementIncident Management
Incident Management
 
StackStorm DevOps Automation Webinar
StackStorm DevOps Automation WebinarStackStorm DevOps Automation Webinar
StackStorm DevOps Automation Webinar
 
Ignite slides minimum viable runbooks lite
Ignite slides minimum viable runbooks   liteIgnite slides minimum viable runbooks   lite
Ignite slides minimum viable runbooks lite
 
ITIL v3 Problem Management
ITIL v3 Problem ManagementITIL v3 Problem Management
ITIL v3 Problem Management
 

Similar to Neptune facebook autoremediation_talk

NextGen Endpoint Security for Dummies
NextGen Endpoint Security for DummiesNextGen Endpoint Security for Dummies
NextGen Endpoint Security for DummiesAtif Ghauri
 
TIG / Infocyte: Proactive Cybersecurity for State and Local Government
TIG / Infocyte: Proactive Cybersecurity for State and Local GovernmentTIG / Infocyte: Proactive Cybersecurity for State and Local Government
TIG / Infocyte: Proactive Cybersecurity for State and Local GovernmentInfocyte
 
SanerNow a platform for Endpoint security and systems Management
SanerNow  a platform for Endpoint security and systems ManagementSanerNow  a platform for Endpoint security and systems Management
SanerNow a platform for Endpoint security and systems ManagementSecPod Technologies
 
Microservice Architecture
Microservice ArchitectureMicroservice Architecture
Microservice ArchitectureEngin Yoeyen
 
Webinar: Machine learning analytics for immediate resolution to the most chal...
Webinar: Machine learning analytics for immediate resolution to the most chal...Webinar: Machine learning analytics for immediate resolution to the most chal...
Webinar: Machine learning analytics for immediate resolution to the most chal...Melina Black
 
Security Outsourcing - Couples Counseling - Atif Ghauri
Security Outsourcing - Couples Counseling - Atif GhauriSecurity Outsourcing - Couples Counseling - Atif Ghauri
Security Outsourcing - Couples Counseling - Atif GhauriAtif Ghauri
 
Lean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big DataLean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big DataStylight
 
Itsummit2015 blizzard
Itsummit2015 blizzardItsummit2015 blizzard
Itsummit2015 blizzardkevin_donovan
 
The Biggest Mistake you can make with your Data Center Licenses
The Biggest Mistake you can make with your Data Center LicensesThe Biggest Mistake you can make with your Data Center Licenses
The Biggest Mistake you can make with your Data Center LicensesIvanti
 
DevOpsRoadTrip San Francisco Final Speaking Deck
DevOpsRoadTrip San Francisco Final Speaking Deck DevOpsRoadTrip San Francisco Final Speaking Deck
DevOpsRoadTrip San Francisco Final Speaking Deck VictorOps
 
ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015Shannon Lietz
 
DevSecCon KeyNote London 2015
DevSecCon KeyNote London 2015DevSecCon KeyNote London 2015
DevSecCon KeyNote London 2015Shannon Lietz
 
Its Not You Its Me MSSP Couples Counseling
Its Not You Its Me   MSSP Couples CounselingIts Not You Its Me   MSSP Couples Counseling
Its Not You Its Me MSSP Couples CounselingAtif Ghauri
 
Shift Left Security - The What, Why and How
Shift Left Security - The What, Why and HowShift Left Security - The What, Why and How
Shift Left Security - The What, Why and HowDevOps.com
 
AWS live hack: Atlassian + Snyk OSS on AWS
AWS live hack: Atlassian + Snyk OSS on AWSAWS live hack: Atlassian + Snyk OSS on AWS
AWS live hack: Atlassian + Snyk OSS on AWSEric Smalling
 
How to Solve Your Top IT Security Reporting Challenges with AlienVault
How to Solve Your Top IT Security Reporting Challenges with AlienVaultHow to Solve Your Top IT Security Reporting Challenges with AlienVault
How to Solve Your Top IT Security Reporting Challenges with AlienVaultAlienVault
 

Similar to Neptune facebook autoremediation_talk (20)

NextGen Endpoint Security for Dummies
NextGen Endpoint Security for DummiesNextGen Endpoint Security for Dummies
NextGen Endpoint Security for Dummies
 
TIG / Infocyte: Proactive Cybersecurity for State and Local Government
TIG / Infocyte: Proactive Cybersecurity for State and Local GovernmentTIG / Infocyte: Proactive Cybersecurity for State and Local Government
TIG / Infocyte: Proactive Cybersecurity for State and Local Government
 
SanerNow a platform for Endpoint security and systems Management
SanerNow  a platform for Endpoint security and systems ManagementSanerNow  a platform for Endpoint security and systems Management
SanerNow a platform for Endpoint security and systems Management
 
Microservice Architecture
Microservice ArchitectureMicroservice Architecture
Microservice Architecture
 
Webinar: Machine learning analytics for immediate resolution to the most chal...
Webinar: Machine learning analytics for immediate resolution to the most chal...Webinar: Machine learning analytics for immediate resolution to the most chal...
Webinar: Machine learning analytics for immediate resolution to the most chal...
 
Security Outsourcing - Couples Counseling - Atif Ghauri
Security Outsourcing - Couples Counseling - Atif GhauriSecurity Outsourcing - Couples Counseling - Atif Ghauri
Security Outsourcing - Couples Counseling - Atif Ghauri
 
Lean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big DataLean Enterprise, Microservices and Big Data
Lean Enterprise, Microservices and Big Data
 
Itsummit2015 blizzard
Itsummit2015 blizzardItsummit2015 blizzard
Itsummit2015 blizzard
 
The Biggest Mistake you can make with your Data Center Licenses
The Biggest Mistake you can make with your Data Center LicensesThe Biggest Mistake you can make with your Data Center Licenses
The Biggest Mistake you can make with your Data Center Licenses
 
DevOpsRoadTrip San Francisco Final Speaking Deck
DevOpsRoadTrip San Francisco Final Speaking Deck DevOpsRoadTrip San Francisco Final Speaking Deck
DevOpsRoadTrip San Francisco Final Speaking Deck
 
ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015
 
DevSecCon KeyNote London 2015
DevSecCon KeyNote London 2015DevSecCon KeyNote London 2015
DevSecCon KeyNote London 2015
 
DevSecCon Keynote
DevSecCon KeynoteDevSecCon Keynote
DevSecCon Keynote
 
Découvrez le Rugged DevOps
Découvrez le Rugged DevOpsDécouvrez le Rugged DevOps
Découvrez le Rugged DevOps
 
Its Not You Its Me MSSP Couples Counseling
Its Not You Its Me   MSSP Couples CounselingIts Not You Its Me   MSSP Couples Counseling
Its Not You Its Me MSSP Couples Counseling
 
Synapse Automated Spreadsheet Reporting
Synapse Automated Spreadsheet ReportingSynapse Automated Spreadsheet Reporting
Synapse Automated Spreadsheet Reporting
 
Shift Left Security - The What, Why and How
Shift Left Security - The What, Why and HowShift Left Security - The What, Why and How
Shift Left Security - The What, Why and How
 
AWS live hack: Atlassian + Snyk OSS on AWS
AWS live hack: Atlassian + Snyk OSS on AWSAWS live hack: Atlassian + Snyk OSS on AWS
AWS live hack: Atlassian + Snyk OSS on AWS
 
How to Solve Your Top IT Security Reporting Challenges with AlienVault
How to Solve Your Top IT Security Reporting Challenges with AlienVaultHow to Solve Your Top IT Security Reporting Challenges with AlienVault
How to Solve Your Top IT Security Reporting Challenges with AlienVault
 
Many products-no-security (1)
Many products-no-security (1)Many products-no-security (1)
Many products-no-security (1)
 

Recently uploaded

DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Neptune facebook autoremediation_talk

  • 1. Reducing incident MTTR via automated diagnosis and remediation Auto-remediation and Event Driven Automation Meetup Kiran Gollu, Founder Neptune.io © 2017
  • 2. Neptune
      •  One-stop incident automation-as-a-service for DevOps
      •  Fewer alerts, more uptime, and more sleep
      •  Highly scalable (millions of alerts), secure, and reliable (tolerates data center and region-wide outages)
  • 3. Me
      •  Present:
          •  Founder, Neptune
          •  Y Combinator and Data Collective VC-backed startup
      •  Past:
          •  Founding engineer/architect at AWS S3 and DynamoDB
          •  Published research papers on distributed systems in collaboration with Microsoft Research
  • 4. What I will talk about
      •  State of incident response today
      •  Alert enrichment and auto-remediation techniques and tradeoffs
      •  How do you solve this problem for your company?
      •  Demo of Neptune
      •  Our learnings and recommendations
  • 5. What is Incident Response? How to handle alerts/incidents/outages? [Figure labels: "Many more..", "Alerts"]
  • 6. Maturity level of incident response teams (source: @jpaulreed, @kfinnbraun, DevOps Enterprise Summit)
  • 7. Problem #1: Lots of alerts
  • 8. Problem #2: Failures are complicated
  • 10. #4: 95% of Time To Recovery (TTR) is still manual today (source: DevOps survey; VictorOps incident response)
      •  Incident phases shown: Alert; Troubleshooting (Triage | Investigate | Identify); Resolution; Documentation
      •  Percentages shown: 73%, 10%, 5%, 12%
      •  Snapshots: graphs & metrics; logs; webpages
      •  Service health checks: internal; external
      •  Host/app diagnostics: "top", "df -H", etc.; heap dumps/stack traces
      •  Runbooks: on a single host or a cluster of hosts; any script, any language
      •  Cloud API/CLI actions: start/stop/reboot; scale up/down
      •  Root-cause analysis & audit: heap dumps; logs; graphs
      •  Post-mortem: history; diagnostics
  • 11. FBAR: Facebook Auto Remediation Platform
      •  “…It's doing the work of approximately 200 sys admins…”
      •  “We built an internal tool for AWS”
      •  Nurse: Auto-remediation platform
      •  “60% of problems are fixed automatically…”
      •  Winston: Event-driven automation tool
  • 12. Your Options
      •  SaaS product, on-premise offering on AWS, deep integrations with monitoring tools
      •  Open-source event-driven automation
      •  Build it in-house
  • 13. Common Issues
      •  Noise
          •  Non-actionable alerts
          •  False positives
          •  Self-recovering alerts
      •  Engineer burnout: too much manual work
      •  Not measuring the cost of dealing with incidents
      •  Incorrect monitoring thresholds
      •  Band-aids instead of root-causing problems
      •  Alert correlation is hard: downstream/upstream dependencies
      •  Not having clear incident response and escalation processes
  • 14. Architecting automated diagnosis and remediation
  • 15. 1. One-stop incident tracking
      •  Helps identify the top 20% of alerts causing 80% of the pain
          •  Sorted by frequency and MTTR
      •  Capture:
          •  MTTA (mean time to acknowledgement)
          •  MTTR (mean time to resolution)
          •  Frequency of occurrence (number of times a particular alert has occurred)
      •  Reporting + auditing
          •  Audit all activity (both manual and automated)
          •  Leads to data-driven post-mortems
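The tracking metrics on this slide can be sketched in a few lines of Python. This is a minimal illustration, not Neptune's implementation: the `Incident` record, its field names, and the choice to rank by frequency multiplied by MTTR are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record layout; timestamps are seconds since some epoch.
    alert_name: str
    triggered_at: float
    acknowledged_at: float
    resolved_at: float


def summarize(incidents):
    """Return per-alert frequency, MTTA and MTTR, most painful alerts first."""
    grouped = defaultdict(list)
    for inc in incidents:
        grouped[inc.alert_name].append(inc)

    rows = []
    for name, group in grouped.items():
        rows.append({
            "alert": name,
            "frequency": len(group),
            "mtta": mean(i.acknowledged_at - i.triggered_at for i in group),
            "mttr": mean(i.resolved_at - i.triggered_at for i in group),
        })
    # Assumed ranking: total time lost per alert (frequency x MTTR).
    rows.sort(key=lambda r: r["frequency"] * r["mttr"], reverse=True)
    return rows
```

Sorting by the product is only one way to combine "frequency and MTTR"; a team could equally sort by either column on its own.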
  • 16. 2. Enrich Alerts
      •  When an alert occurs:
          •  Gather debugging context automatically from different tools
      •  Use cases:
          1.  High memory alert
              •  Capture top-10 memory hogs, thread dump, memory usage graphs
          2.  High app error rate
              •  Capture error rate & latency trends
              •  App logs for 5xx errors from Splunk/Sumo Logic
              •  App health checks
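One way to picture alert enrichment is a table that maps an alert type to a list of context collectors. The sketch below is illustrative only: the alert dictionary shape, the `ENRICHERS` table, and the stub collectors are hypothetical, and a real collector would call out to the host or to a logging tool instead of returning a placeholder string.

```python
from typing import Callable, Dict, List

Collector = Callable[[dict], str]


def top_memory_processes(alert: dict) -> str:
    # Stub: a real collector might run a process listing on alert["host"].
    return f"top-10 memory consumers on {alert['host']} (stub)"


def thread_dump(alert: dict) -> str:
    # Stub standing in for a thread dump of the affected application.
    return f"thread dump from {alert['host']} (stub)"


def recent_5xx_logs(alert: dict) -> str:
    # Stub standing in for a log search in a tool such as Splunk.
    return f"recent 5xx log lines for {alert['host']} (stub)"


# Assumed mapping from alert type to the collectors to run for it.
ENRICHERS: Dict[str, List[Collector]] = {
    "high_memory": [top_memory_processes, thread_dump],
    "high_error_rate": [recent_5xx_logs],
}


def enrich(alert: dict, enrichers: Dict[str, List[Collector]] = ENRICHERS) -> dict:
    """Attach debugging context to an alert; one failing collector never blocks the rest."""
    context = {}
    for collector in enrichers.get(alert["type"], []):
        try:
            context[collector.__name__] = collector(alert)
        except Exception as exc:
            context[collector.__name__] = f"collection failed: {exc}"
    return {**alert, "context": context}
```

Catching failures per collector reflects the idea that partial context is still more useful to the on-call engineer than none.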
  • 17. 3. Auto-Remediate repetitive alerts
      •  When an alert occurs:
          •  If it's a known alert → run a remediation runbook
      •  Use cases:
          •  Process crashed → restart the process
          •  Host is unpingable → restart up to 3 times and escalate if it still fails
          •  Service is down → capture graphs, run a remediation workflow
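The "restart up to 3 times and escalate" use case amounts to a bounded retry loop. The following is a rough sketch under stated assumptions: `check`, `fix`, and `escalate` are hypothetical callables supplied by the caller (for example a ping test, a reboot API call, and a paging hook), and the return strings are made up for the example.

```python
import time


def remediate_with_escalation(check, fix, escalate, max_attempts=3, wait_seconds=0.0):
    """Run fix() up to max_attempts times; escalate if check() never passes.

    check    -- returns True once the resource is healthy again
    fix      -- performs one remediation attempt (e.g. a restart)
    escalate -- hands the incident to a human
    """
    for attempt in range(1, max_attempts + 1):
        fix()
        if wait_seconds:
            time.sleep(wait_seconds)  # give the resource time to come back
        if check():
            return f"recovered after {attempt} attempt(s)"
    escalate()
    return "escalated"
```

Capping the attempts keeps a runbook from restarting a host forever when the underlying cause is something a restart cannot fix.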
  • 18. Cultural aspects
      •  Have a clear incident resolution and escalation process
      •  Break silos: Dev and Ops can both resolve incidents and share how they were resolved
      •  Document and version your runbooks
      •  Single consolidated report per incident to make post-mortems easy
      •  Audit all manual and automated actions for an incident
      •  Use your own communication tools (Slack, HipChat) but record incident logs
      •  Use tools to log team collaboration activity
  • 20. How do we ensure military-grade security?
  • 21. Our learnings from 200+ SRE/Ops teams
      •  Automate simple things first
      •  Have checks in place to avoid cascading failures
          •  Rate limiting, handling correlated failures
      •  Capture state and snapshots for self-recovering alerts
      •  Don't automatically fix when you don't know the root cause
      •  Enriching incidents is as important as automating repetitive incidents
      •  Availability of the automation tool should be far higher than that of your apps
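The rate-limiting check mentioned on this slide could look something like the sliding-window limiter below. It is a sketch only: the class name, the parameters, and the policy of refusing further automated actions once the budget is spent are assumptions, not a description of any particular product.

```python
import time
from collections import deque


class RemediationLimiter:
    """Allow at most max_actions automated remediations per sliding window."""

    def __init__(self, max_actions, window_seconds, clock=time.monotonic):
        self._max = max_actions
        self._window = window_seconds
        self._clock = clock          # injectable so the limiter can be tested
        self._timestamps = deque()   # times of recently allowed actions

    def allow(self):
        now = self._clock()
        # Forget actions that have aged out of the window.
        while self._timestamps and now - self._timestamps[0] >= self._window:
            self._timestamps.popleft()
        if len(self._timestamps) < self._max:
            self._timestamps.append(now)
            return True
        return False  # budget spent: the caller should escalate instead
```

If a correlated failure fires the same alert on many hosts at once, a limiter of this kind stops the automation from, say, rebooting an entire fleet in one go.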
  • 22. What I talked about
      •  Reducing MTTR gives you:
          •  More agility → frees up engineers to work on business-critical problems
          •  More uptime, thus better customer experience
          •  More sleep, happier engineers
      •  To reduce MTTR for production alerts and incidents:
          •  Create actionable alerts and fix monitoring thresholds
          •  Embrace automation: enrich and automate alerts
          •  Instill good incident response processes
  • 23. Sign up for a 14-day free trial! kiran@neptune.io
  • 25. How do you get better at it?
      •  Continuously eliminate the manual effort involved
      •  Streamline your incident response workflow (cultural aspects)
      •  Encourage good behavior and punish bad behavior
      •  Measure time spent in incident response
      •  Make your alerts actionable
      •  Fix the monitoring thresholds as a continuous process
      •  Enrich and automate incidents to reduce MTTR