Slides for my talk at Scale15x: https://www.socallinuxexpo.org/scale/15x/presentations/stackstorm-if-devops-automation
Devops automation, open-source,
Demo was at the core of the talk, the video is at https://youtu.be/3TjhBGshvvY?t=3h31m5s
14. What is being automated with StackStorm
• Security checks
– On malware detection in a VM, isolate
network port on a switch
• App blue-green deployment
– On Jenkins tests passed, bring new vm
claster, deploy and configure app, set
loadbalancer to send % of traffic to new
app, monitor, roll forward, or back out
• Networking
– On BGP peer goes down: collect
troubleshooting data, post on slack &
create JIRA ticket
– On Link aggregation member error,
check load, if capacity of rest of LAG
bundle enough, disable link with error
• OpenStack
– orphan VM clean-up: On orphans
detected, shut down, email owner, keep for
few days, delete
– VM evacuation on HW failures: On host
RAID failure, get list of impacted VMs,
email VM owners, evacuate VMs, create
JIRA ticket for hardware replacement.
• Service remediation:
– Cassandra “node down” recovery: On ring
node dying, deploy new node, configure,
add to the ring.
– Remediating RabbitMQ, Galera cluster,
MySQL, and more…
14
15. What SHOULD be automated?
From: Practice of Cloud System Administration, by Thomas Limoncelli
17. Auto Remediation
17
FBAR (saving 13,680 hours/day)
Naoru
Nurse
Winston (powered by StackStorm)
Azure Automation
Mistral workflow service
StackStorm automation platform
ACT
OBSERVE
ORIENT
DECIDE
18. “Auto-Remediation & Automation at Facebook” @ Auto-Remediation Meetup SF
https://www.meetup.com/Auto-Remediation-and-Event-Driven-Automation/events/236704012/
43 % Problem Fixed
51 % False positives
94 % Automated
19. “Sleep Better at Night: OpenStack Cloud Auto-Healing” @ OpenStack Summit Barcelona
Mirantis: Auto-remediating 2,000 node OpenStack cluster at Symantec with StackStorm
20. Engineer
Wakes up
Logs in
and ACK
Checks
runbook
Studies
the alert
Fixes the
problem
Runs
diagnostics
PagerDuty
Alert
2:02 AM 2:07 AM 2:15 AM2:10 AM 2:30 AM2:20 AM2:00 AM
On-call, Without Automation
Source: “Winston: Helping Netflix Engineers Sleep at Night” @ Qcon ‘16 SF
https://goo.gl/lHzq4r
21. False
Positive
Winston
2:00 AM
2:05 AM
2:05 AM
2:15
AM
Assisted
Diagnostics
Fixed the
problem
On-call With Winston
Source: “Winston: Helping Netflix Engineers Sleep at Night” @ Qcon ‘16 SF
https://goo.gl/lHzq4r
30. StackStorm is like …
30
AWS Lambda AWS Step Functions
OpenSource, for DIY Serverless
31. StackStorm is like …
31
ActionsSensors
WorkflowsRules
IT Domains
Config mgmtStorageNetworking ContainersCloud InfraMonitoring Ops Support
Step Functions
AWS Lambda
32. Serverless with Swarm
for Genomic Annotation Computing
Dmitri Zimine
http://github.com/dzimine/serverless-swarm
@dzimine
Image by Miki Yoshihito, Creative Commons license
38. Why not Scripts?
38
• Simple to define, reason, visualize
• Transparent
– state is clear, execution is trackable: running, complete, failed steps
• Reliable
– Workflows are long-running
– Crash tolerance
– “Restart from point of failure”
44. 44
Infrastructure as code
Case Study
• Service Catalog backed up by workflow
• Automate provisioning on VMW/OpenStack, 4 Data centers
• Before: CPO, operator updates via GUI, click and pray, x4
• After: StackStorm, dev -> code review -> staging -> QA-> prod
56. • Use StackStorm.
Try it, find automation, nail POC. Let us know, good & bad.
curl -sSL https://stackstorm.com/packages/install.sh | bash -s
docs.stackstorm.com/install
• Commit code. Become a “community maintainer”
It is not hard (weekend hack?). We help & support.
• Spread the word
Blog. Tweet. Talk. Mention. Bug. Github Star!
56
Contribute! Everything counts