29. Event duration and who to communicate with
[Figure: escalation chart. Horizontal axis (t): event duration, marked at 10m, 30m, 60m, and 120m. Vertical axis: who to communicate with (SRE; Manager; PM, Boss; Customers). Example event start times: Friday 23:30, Monday 03:34, Sunday 11:30, Tuesday 10:34.]
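The idea behind the chart, that the audience grows as an event runs longer, can be sketched as a simple lookup. The 10, 30, and 60 minute marks and the role names come from the chart; which role joins at which mark is an assumption made for illustration, and `ESCALATION_LADDER` and `audiences_for` are hypothetical names.

```python
from datetime import timedelta

# Hypothetical escalation ladder: (minimum event duration, audience added).
# The pairing of roles to thresholds is an assumption, not read from the chart.
ESCALATION_LADDER = [
    (timedelta(minutes=0), "SRE"),
    (timedelta(minutes=10), "Manager"),
    (timedelta(minutes=30), "PM, Boss"),
    (timedelta(minutes=60), "Customers"),
]


def audiences_for(duration: timedelta) -> list[str]:
    """Return everyone who should be informed once an event has lasted `duration`."""
    return [role for threshold, role in ESCALATION_LADDER if duration >= threshold]
```

A real policy would probably also depend on when the event starts, since an outage late on a Friday night and one on a weekday morning reach customers very differently.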
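30. Stop the bleeding
● Goals:
a. Restore service as quickly as possible
b. Reduce business losses
● Approach
a. Start from the symptoms and use the architecture and metrics to locate the problem (Part II)
b. rollback, rollback, rollback
c. Use the simplest method: add resources, remove the faulty node, add new nodes, build a firebreak
● Sync
a. Contact the relevant people: Backend, Frontend, DBA, Networking
b. Collect symptoms and metrics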
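As one illustration of "remove the faulty node", the sketch below splits a pool into nodes to keep and nodes to pull out of rotation, based on an error-rate metric. It is a minimal sketch under stated assumptions: `Node`, `split_pool`, the 0.5 threshold, and the `min_healthy` guard are all hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    error_rate: float  # fraction of failed requests in the last window, 0.0 to 1.0


def split_pool(nodes, max_error_rate=0.5, min_healthy=1):
    """Partition nodes into (keep, remove) using an assumed error-rate threshold."""
    keep = [n for n in nodes if n.error_rate <= max_error_rate]
    remove = [n for n in nodes if n.error_rate > max_error_rate]
    if len(keep) < min_healthy:
        # Removing nodes would leave too little capacity, so keep everything
        # and fall back to another mitigation (rollback, adding nodes).
        return list(nodes), []
    return keep, remove
```

The guard reflects the goal of restoring service first: draining every node because all of them look unhealthy would turn a degraded service into a full outage.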
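40. Architecture at the time the problem occurred
[Figure: architecture diagram. Components: LB; three Web API / ES Node pairs; Service API and ES Group; Batch; DB. Callers: Service A (Search), Service B (Search), Service C (Add commodities, categories). Data flow label: Sync commodities, categories.]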
96. Fallacies of distributed computing
Back in the 1990s, computer scientist Peter Deutsch set out the Fallacies of distributed computing, a list of false assumptions that are easy to overlook or take lightly:
● The network is reliable
● Latency is zero
● Bandwidth is infinite
● The network is secure
● Topology doesn’t change
● There is one administrator
● Transport cost is zero
● The network is homogeneous
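To make the first two fallacies concrete, here is a minimal sketch of a caller that does not assume the network is reliable: it retries a failing call with exponential backoff and jitter. `TransientNetworkError`, `call_with_retry`, and `make_flaky` are hypothetical names for illustration, and the simulated failure stands in for a real timeout or dropped connection.

```python
import random
import time


class TransientNetworkError(Exception):
    """Stand-in for a timeout or dropped connection."""


def call_with_retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call `operation`, retrying on TransientNetworkError with backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientNetworkError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the failure to the caller
            # The base delay doubles on each attempt; random jitter keeps
            # many clients from retrying in lockstep.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay))


def make_flaky(failures):
    """Build a fake remote call that fails `failures` times, then returns "ok"."""
    state = {"remaining": failures}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientNetworkError("simulated dropped request")
        return "ok"

    return operation
```

For example, `call_with_retry(make_flaky(2))` succeeds on the third attempt, while `call_with_retry(make_flaky(5))` gives up after three attempts and re-raises the error. Retries are only safe when the remote operation is idempotent, which is itself a consequence of not trusting the network.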