Leveling up monitoring: A decade of automating and scaling Nagios


Monitoring - we all have to do it, but most people don’t seem to like it very much. Etsy has been using Nagios for over a decade to monitor its infrastructure, and over that time has created a set of tools that has allowed multiple teams to deploy, manage, and scale it. In this talk we will offer guidelines on how to scale monitoring and alerting setups, ideas for workflows around monitoring, and methods of reducing friction and alert fatigue for on-call engineers.


Leveling Up Monitoring:
A Decade of Automating and
Scaling Nagios
Katherine Daniels and Laurie Denness
@beerops - @lozzd Velocity 2016

Katherine Daniels
@beerops
Senior Operations Engineer, Etsy
Co-Author of Effective DevOps

Laurie Denness
@lozzd
Staff Operations Engineer, Etsy
Official Graph Enthusiast

Agenda
1. In The Beginning...
2. Automation
3. Deployinator
4. Scaling + Tooling

About Etsy
• 25M Active Buyers
• 1.6M Active Sellers
• $2.39B 2015 Annual GMS
(As of March 31, 2016)

bit.ly/yaynagios

https://kartar.net/2015/08/monitoring-survey-2015---tools/

In The Beginning

Sometimes your statement needs emphasis with
a black background.

LESSONS LEARNED:
Templates are awesome.

define service {
    use                  generic-service
    hostgroups           Linux_hosts,!email-only-servers
    service_description  SSH
    check_command        check_ssh
}
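The `!` prefix in the hostgroup list above excludes a group from the selection. As a rough illustration of that behaviour (not Nagios's own implementation), the Python sketch below expands such a list into a set of hosts. The group membership in `GROUPS` is invented for the example.

```python
def expand_hostgroups(expression, groups):
    """Expand a Nagios-style hostgroup list such as 'a,b,!c' into a set of hosts.

    Names prefixed with '!' are exclusions: their members are removed from
    the union of all the other named groups.
    """
    included, excluded = set(), set()
    for name in (part.strip() for part in expression.split(",")):
        if not name:
            continue
        if name.startswith("!"):
            excluded |= set(groups[name[1:]])
        else:
            included |= set(groups[name])
    return included - excluded


# Hypothetical membership, invented for this example.
GROUPS = {
    "Linux_hosts": ["web01", "web02", "mail01"],
    "email-only-servers": ["mail01"],
}

# The SSH check applies to every Linux host except the email-only ones.
ssh_hosts = expand_hostgroups("Linux_hosts,!email-only-servers", GROUPS)
print(sorted(ssh_hosts))
```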

define service {
    use              disk-space-service
    hostgroup_name   email-only-servers
    contact_groups   ops_nonurgent
}
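The definition above is short because most of its directives come from the template named in `use`. The Python sketch below shows the general idea of that inheritance: a service starts from its template's directives and overrides only what differs. The template contents, and the assumption that `disk-space-service` itself builds on `generic-service`, are hypothetical. Real Nagios inheritance also supports multiple templates per object, which this sketch ignores.

```python
def effective(obj, templates):
    """Return an object's directives after following its 'use' chain.

    Parent values are loaded first, so anything set on the object itself
    (or on a nearer template) overrides what it inherits.
    """
    parent = obj.get("use")
    merged = effective(templates[parent], templates) if parent else {}
    merged.update((key, value) for key, value in obj.items() if key != "use")
    return merged


# Hypothetical template contents, invented for this example.
TEMPLATES = {
    "generic-service": {
        "check_interval": "5",
        "contact_groups": "ops",
    },
    "disk-space-service": {
        "use": "generic-service",
        "service_description": "Disk Space",
        "check_command": "check_disk",
    },
}

# The service from the slide: three lines of config, five effective directives.
service = {
    "use": "disk-space-service",
    "hostgroup_name": "email-only-servers",
    "contact_groups": "ops_nonurgent",
}

for key, value in sorted(effective(service, TEMPLATES).items()):
    print(f"{key:22}{value}")
```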

LESSONS LEARNED:
Start small.

Nagios and Chef

LESSONS LEARNED:
Automation is awesome!
HA HA JUST KIDDING

LESSONS LEARNED:
Trust but verify.

How Many Repos?

LESSONS LEARNED:
?!?!?!?!??!?!

LESSONS LEARNED:
Try, fail, learn, and try again.

Problems
• Four git repos, inconsistent mess, duplication
• Broken semi-useful automation - need to regain trust
• Some shared config, some unique
• Gain confidence in changes
• Stop editing on the production box

Nagios and Chef
and Deployinator!

Solution 1:  
Merge everything: find and remove duplication,
shared configs

Thanks Murphy!

Super Secret Option!!!

Solution 2:
Using Jenkins CI to test changes before
production
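The slides do not show what the Jenkins job actually runs. One plausible check, sketched here in Python, is Nagios's built-in verification mode (`nagios -v <main config file>`), which parses the configuration and exits non-zero on errors. The binary name, the config path, and the `verify_config` helper are assumptions for illustration. The `runner` parameter exists only so the function can be exercised without Nagios installed.

```python
import subprocess


def verify_config(cfg_path, binary="nagios", runner=subprocess.run):
    """Run the Nagios pre-flight check and return (ok, combined_output).

    `binary` and `cfg_path` are assumptions; adjust them to the local install.
    """
    result = runner([binary, "-v", cfg_path], capture_output=True, text=True)
    return result.returncode == 0, result.stdout + result.stderr


# Example use in a CI step (hypothetical path):
#   ok, output = verify_config("/etc/nagios/nagios.cfg")
#   print(output)
#   raise SystemExit(0 if ok else 1)
```

A CI job built this way fails the build whenever the verification fails, so a broken config never reaches the deploy step.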

Solution 3:
Use Deployinator to run Chef recipe to generate
automated configs
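The Chef recipe itself is not shown in the deck. To illustrate the general idea of generating config from an inventory rather than writing it by hand, the Python sketch below swaps in a plain function that renders Nagios host definitions from a list of hosts. The inventory, the addresses, and the `generic-host` template name are all assumptions made for the example.

```python
def render_host(name, address, hostgroups, template="generic-host"):
    """Render one Nagios host definition as text."""
    lines = [
        "define host {",
        f"    use         {template}",
        f"    host_name   {name}",
        f"    address     {address}",
        f"    hostgroups  {','.join(hostgroups)}",
        "}",
    ]
    return "\n".join(lines)


def render_hosts(inventory):
    """Render every host, sorted by name so regenerated files diff cleanly."""
    ordered = sorted(inventory, key=lambda host: host["name"])
    return "\n\n".join(render_host(**host) for host in ordered) + "\n"


# Hypothetical inventory; in the talk this data would come from Chef.
INVENTORY = [
    {"name": "web02", "address": "10.0.0.12", "hostgroups": ["Linux_hosts"]},
    {"name": "mail01", "address": "10.0.0.20",
     "hostgroups": ["Linux_hosts", "email-only-servers"]},
]

print(render_hosts(INVENTORY))
```

Sorting the output keeps the generated file stable between runs, which makes review of config changes easier.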

Solution 4:
Use Deployinator to rsync config to all boxes

• git pull repo on deploy host
• Run Chef recipe to add automated pieces
• Re-run the try-nagios script against that
• rsync copy from deploy box to Nagios hosts
• Create symlink for nagios.cfg
• Restart Nagios
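The steps above can be sketched as an ordered pipeline that stops at the first failure, so a config that fails its test is never copied to a Nagios host. This is a Python illustration, not Deployinator's code. The `./generate-config` wrapper, the way `try-nagios` is invoked, the host names, and the remote directory are all assumptions. Further per-host steps can be appended to the same list.

```python
import subprocess


def deploy_steps(repo_dir, nagios_hosts, remote_dir="/etc/nagios/"):
    """Build the ordered list of (label, command) pairs for one deploy."""
    steps = [
        ("git pull", ["git", "-C", repo_dir, "pull", "--ff-only"]),
        # Hypothetical wrapper standing in for the Chef recipe run.
        ("generate automated config", ["./generate-config", repo_dir]),
        # try-nagios is named in the deck; this invocation is an assumption.
        ("test config", ["./try-nagios", repo_dir]),
    ]
    for host in nagios_hosts:
        steps.append((
            f"rsync to {host}",
            ["rsync", "-a", "--delete", f"{repo_dir}/", f"{host}:{remote_dir}"],
        ))
    return steps


def run_pipeline(steps, runner=subprocess.run):
    """Run steps in order; return (ok, completed_labels, failed_label)."""
    completed = []
    for label, command in steps:
        if runner(command).returncode != 0:
            return False, completed, label
        completed.append(label)
    return True, completed, None


# Example (hypothetical paths and hosts):
#   steps = deploy_steps("/srv/nagios-config", ["nagios01", "nagios02"])
#   ok, done, failed = run_pipeline(steps)
```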

LESSONS LEARNED:
Use the tools you have.

Scaling things up!

Core Workers

LESSONS LEARNED:
If at first you don’t succeed,
rub some webscale on it.

Iterating and Iterating

LESSONS LEARNED:
Iterate
Iterate
Iterate

To Infinity and Beyond

Final Lessons Learned
• Templates are awesome
• Start small
• Automation is awesome
• Trust but verify
• Learn from (y)our mistakes
• Iterate on the tools you have

Open Source Summary
• http://github.com/etsy/deployinator
• http://github.com/etsy/pushbot
• http://github.com/etsy/trylib
• http://github.com/etsy/opsweekly
• http://github.com/etsy/nagios-herald
• http://github.com/RJ/irccat

THANK YOU!


