Monitoring - we all have to do it, but most people don’t seem to like it very much. Etsy has been using Nagios for over a decade to monitor its infrastructure, and over that time has created a set of tools that has allowed multiple teams to deploy, manage, and scale it. In this talk we will offer guidelines on how to scale monitoring and alerting setups, ideas for workflows around monitoring, and methods of reducing friction and alert fatigue for on-call engineers.
51. Problems
• Four git repos, inconsistent mess, duplication
• Broken semi-useful automation - need to regain trust
• Some shared config, some unique
• Gain confidence in changes
• Stop editing on the production box
81.
• git pull repo on deploy host
• Run Chef recipe to add automated pieces
• Re-run the try-nagios script against that
• rsync copy from deploy box to Nagios hosts
• Create symlink for nagios.cfg
• Restart Nagios
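The deploy steps above can be sketched as a small Python script that builds the command list and, by default, only prints it. This is a rough illustration under stated assumptions, not the tooling from the talk: every path, the Chef recipe name, the host names, the way try-nagios is invoked, and the choice of `service nagios restart` are hypothetical placeholders.

```python
import shlex
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class DeployConfig:
    # Every value here is a hypothetical placeholder, not taken from the talk.
    repo_dir: str = "/srv/deploy/nagios-config"
    chef_recipe: str = "recipe[nagios::generated_config]"
    try_nagios: str = "/usr/local/bin/try-nagios"  # assumed location and arguments
    release_dir: str = "/etc/nagios/releases/current"
    live_cfg: str = "/etc/nagios/nagios.cfg"


def deploy_host_steps(cfg: DeployConfig) -> List[List[str]]:
    """Commands run once on the deploy host, before anything is copied out."""
    return [
        ["git", "-C", cfg.repo_dir, "pull", "--ff-only"],
        # Chef adds the automated (generated) pieces of the config.
        ["chef-client", "-o", cfg.chef_recipe],
        # Validate the combined config before it leaves the deploy box.
        [cfg.try_nagios, cfg.repo_dir],
    ]


def nagios_host_steps(cfg: DeployConfig, host: str) -> List[List[str]]:
    """Commands that push the validated config to one Nagios host."""
    src = cfg.repo_dir.rstrip("/") + "/"
    return [
        ["rsync", "-a", "--delete", src, f"{host}:{cfg.release_dir}/"],
        ["ssh", host, "ln", "-sfn", f"{cfg.release_dir}/nagios.cfg", cfg.live_cfg],
        ["ssh", host, "sudo", "service", "nagios", "restart"],
    ]


def plan(cfg: DeployConfig, hosts: List[str]) -> List[List[str]]:
    """Full ordered command list: deploy-host steps first, then each host."""
    steps = deploy_host_steps(cfg)
    for host in hosts:
        steps.extend(nagios_host_steps(cfg, host))
    return steps


def run(steps: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for step in steps:
        print(shlex.join(step))
        if not dry_run:
            # check=True aborts on the first failure, so a failed
            # validation stops the deploy before any rsync happens.
            subprocess.run(step, check=True)


if __name__ == "__main__":
    run(plan(DeployConfig(), ["nagios01.example.com"]))
```

Keeping the plan as plain data separates deciding what to run from running it, which makes the order easy to review and a dry run trivial. That fits the goal of gaining confidence in changes without editing on the production box.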
83. @beerops - @lozzd Velocity 2016
LESSONS LEARNED:
Use the tools you have.