Working in Web Operations means dealing with production systems that in most cases need to be operational 24x7x365.
To reach 99.99999% uptime, you must fail as little as possible.
This talk will go through a few real-world incidents and failures experienced by our small WebOps team, and outline what we are learning (the hard way), and how we’re trying to improve.
What could possibly go wrong? :-)
9. misplaced comma +
fix didn't make it to master +
unintended general rollout +
parser choked on comma +
fork with no rate limiting +
fatal() dumped core +
kernel.core_uses_pid = 1 +
small SSD metadata partition +
index corruption =
massive outage (no data loss)
11. DO
Rate limit fork of children
Test disk full conditions
Master your infrastructure
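A minimal sketch of the "rate limit fork of children" advice, assuming a simple fixed-window counter. The names fork_limiter, fork_allowed and fork_limited are hypothetical and are not taken from the system that failed; a real daemon might prefer a token bucket or a back-off on repeated child crashes.

```c
#include <assert.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical limiter: at most max_forks children per window_secs. */
struct fork_limiter {
    time_t   window_start;  /* start of the current window */
    time_t   window_secs;   /* window length in seconds */
    unsigned max_forks;     /* forks allowed per window */
    unsigned used;          /* forks already used in this window */
};

/* Returns 1 and consumes one slot if a fork is allowed at time `now`,
 * otherwise returns 0 and leaves the counter untouched. */
static inline int fork_allowed(struct fork_limiter *l, time_t now)
{
    assert(l != NULL && l->window_secs > 0);
    if (now - l->window_start >= l->window_secs) {
        /* The window has expired: start a new one. */
        l->window_start = now;
        l->used = 0;
    }
    if (l->used >= l->max_forks)
        return 0;
    l->used++;
    return 1;
}

/* Calls fork() only when the limiter allows it. Returns -2 when
 * throttled, otherwise whatever fork() returns. */
static inline pid_t fork_limited(struct fork_limiter *l)
{
    if (!fork_allowed(l, time(NULL)))
        return (pid_t)-2;
    return fork();
}
```

A supervisor loop would call fork_limited() instead of fork() and sleep or log when it gets -2, so a child that dies immediately (for example on a parse error) cannot be respawned in a tight loop that fills a partition with core files.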
20. Subject: [PATCH] sched: avoid unnecessary overflow in sched_clock
From: Salman Qazi <sqazi@google.com>
Date: 2011-11-16 20:55:31
In hundreds of days, the __cycles_2_ns calculation in sched_clock
has an overflow. cyc * per_cpu(cyc2ns, cpu) exceeds 64 bits, causing
the final value to become zero. We can solve this without losing
any precision.
We can decompose TSC into quotient and remainder of division by the
scale factor, and then use this to convert TSC into nanoseconds.
Reviewed-by: Paul Turner <pjt@google.com>
Acked-by: John Stultz <johnstul@us.ibm.com>
Signed-off-by: Salman Qazi <sqazi@google.com>
---
arch/x86/include/asm/timer.h | 23 ++++++++++++++++++++++-
1 files changed, 22 insertions(+), 1 deletions(-)
Patch #1, Nov 16th 2011
diff --git a/arch/x86/include/asm/timer.h b/arch/x86/include/asm/timer.h
index fa7b917..431793e 100644
--- a/arch/x86/include/asm/timer.h
+++ b/arch/x86/include/asm/timer.h
@@ -32,6 +32,22 @@ extern int no_timer_check;
* (mathieu.desnoyers@polymtl.ca)
*
* -johnstul@us.ibm.com "math is hard, lets go shopping!"
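The overflow described in the commit message can be reproduced in user space. This is a sketch, not the kernel code: it assumes CYC2NS_SCALE_FACTOR is 10 and a hypothetical 2.4 GHz TSC, and the helper names cycles_2_ns_naive, cycles_2_ns_split and overflow_days are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define CYC2NS_SCALE_FACTOR 10          /* assumed value: scale by 2^10 */
#define NSEC_PER_MSEC       1000000ULL

/* Scale factor, computed the same way as in set_cyc2ns_scale(). */
static inline uint64_t cyc2ns_scale(uint64_t cpu_khz)
{
    assert(cpu_khz > 0);
    return (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR) / cpu_khz;
}

/* Pre-patch form: cyc * scale silently wraps once it needs more
 * than 64 bits. */
static inline uint64_t cycles_2_ns_naive(uint64_t cyc, uint64_t scale)
{
    return (cyc * scale) >> CYC2NS_SCALE_FACTOR;
}

/* Patched form: split cyc into quotient and remainder of the scale
 * factor, so the large product is never formed. The result is the
 * same as the naive form whenever the naive form does not overflow. */
static inline uint64_t cycles_2_ns_split(uint64_t cyc, uint64_t scale)
{
    uint64_t quot = cyc >> CYC2NS_SCALE_FACTOR;
    uint64_t rem  = cyc & ((1ULL << CYC2NS_SCALE_FACTOR) - 1);

    return quot * scale + ((rem * scale) >> CYC2NS_SCALE_FACTOR);
}

/* Whole days of uptime before cyc * scale stops fitting in 64 bits. */
static inline uint64_t overflow_days(uint64_t cpu_khz)
{
    uint64_t scale = cyc2ns_scale(cpu_khz);

    assert(scale > 0);
    return UINT64_MAX / scale / (cpu_khz * 1000ULL) / 86400ULL;
}
```

With these assumptions, overflow_days(2400000) evaluates to 208, which matches the "hundreds of days" in the commit message: a machine has to stay up for months before the bug can trigger.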
21. Patch #2, Mar 8th 2012
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -608,6 +608,8 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, ...)
{
unsigned long long tsc_now, ns_now, *offset;
unsigned long flags, *scale;
+ unsigned long long quot;
+ unsigned long long rem;
local_irq_save(flags);
sched_clock_idle_sleep_event();
@@ -620,7 +622,15 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, ...)
if (cpu_khz) {
*scale = (NSEC_PER_MSEC << CYC2NS_SCALE_FACTOR)/cpu_khz;
- *offset = ns_now - (tsc_now * *scale >> CYC2NS_SCALE_FACTOR);
+
+ /*
+ * Avoid premature overflow by splitting into quotient
+ * and remainder. See the comment above __cycles_2_ns
+ */
+ quot = (tsc_now >> CYC2NS_SCALE_FACTOR);
+ rem = tsc_now & ((1ULL << CYC2NS_SCALE_FACTOR) - 1);
+ *offset = ns_now - (quot * *scale +
+ ((rem * *scale) >> CYC2NS_SCALE_FACTOR));
}
29. t - 4y 2m
From: Roman Zippel <zippel@linux-m68k.org>
Date: Thu, 1 May 2008 04:34:41 -0700
Subject: [PATCH] ntp: handle leap second via timer
Remove the leap second handling from second_overflow(), which doesn't have to
check for it every second anymore. With CONFIG_NO_HZ this also makes sure the
leap second is handled close to the full second. Additionally this makes it
possible to abort a leap second properly by resetting the STA_INS/STA_DEL status bits.
Signed-off-by: Roman Zippel <zippel@linux-m68k.org>
Cc: john stultz <johnstul@us.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---
include/linux/clocksource.h | 2 +
include/linux/timex.h | 1 +
kernel/time/ntp.c | 133 +++++++++++++++++++++++++++++--------------
kernel/time/timekeeping.c | 4 +-
40. T + {1,2}m
{August,September} 1st, 2012
fake leap seconds
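One way to notice a pending (or fake) leap second before it fires is to read the STA_INS/STA_DEL status bits mentioned in the ntp commit message. The sketch below assumes Linux with glibc and uses adjtimex() in query-only mode; the helper names leap_flag and read_leap_flag are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE     /* exposes adjtimex() in glibc */
#endif
#include <assert.h>
#include <string.h>
#include <sys/timex.h>

/* Hypothetical helper: +1 if a leap second insertion is armed,
 * -1 if a deletion is armed, 0 if neither bit is set. */
static inline int leap_flag(int status)
{
    if (status & STA_INS)
        return 1;
    if (status & STA_DEL)
        return -1;
    return 0;
}

/* Reads the kernel's NTP status word without changing anything
 * (modes = 0 means "query only"). Returns 0 on success, -1 on error. */
static inline int read_leap_flag(int *flag_out)
{
    struct timex tx;

    assert(flag_out != NULL);
    memset(&tx, 0, sizeof(tx));
    if (adjtimex(&tx) == -1)
        return -1;
    *flag_out = leap_flag(tx.status);
    return 0;
}
```

A monitoring check could call read_leap_flag() periodically and alert when the flag is non-zero in a month where no leap second is scheduled.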
41. read more
A story of leaping seconds
http://blog.fastmail.fm/2012/07/03/a-story-of-leaping-seconds/
Tips and tricks to deal with leap seconds
http://my.opera.com/marcomarongiu/blog/index.dml/tag/ntp
Serverfault question on random debian crashes
http://serverfault.com/questions/403732/leapocalypse
Wired article about leap second problems
http://www.wired.com/wiredenterprise/2012/07/leap-second-bug-wreaks-havoc-with-java-linux/
42. DO
Keep your kernel updated
Use valuable external resources (serverfault etc.)
46. ops lessons learned
Don't repeat yourself (DRY)
Always keep it simple (KISS)
A separate ops team doesn't work well
Practice continuous deployment. Now.
Communication makes the difference
Learn your tools
Master your infrastructure
RTFM
...