23. Issue mitigation protocol – priority 1
Service interruption
• Acknowledge alert (45 mins)
• Begin resolution (+30 mins)
• Update every 120 mins (MS Teams)
• Publish post-mortem (72 hours)
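The timeline above can be expressed as concrete deadlines. The following is a minimal sketch, not the team's actual tooling: it assumes that "+30 mins" is counted from the acknowledgement deadline and that the 72-hour post-mortem window starts at the alert time. All names (`P1Deadlines`, `p1_deadlines`, `next_update_due`) are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Intervals taken from the slide; how they chain together is an assumption.
ACK_WINDOW = timedelta(minutes=45)
RESOLUTION_START_WINDOW = timedelta(minutes=30)  # assumed: counted from the ack deadline
UPDATE_INTERVAL = timedelta(minutes=120)
POST_MORTEM_WINDOW = timedelta(hours=72)         # assumed: counted from the alert time


@dataclass(frozen=True)
class P1Deadlines:
    acknowledge_by: datetime
    begin_resolution_by: datetime
    post_mortem_by: datetime


def p1_deadlines(alert_time: datetime) -> P1Deadlines:
    """Compute the priority 1 deadlines for an alert raised at alert_time."""
    acknowledge_by = alert_time + ACK_WINDOW
    return P1Deadlines(
        acknowledge_by=acknowledge_by,
        begin_resolution_by=acknowledge_by + RESOLUTION_START_WINDOW,
        post_mortem_by=alert_time + POST_MORTEM_WINDOW,
    )


def next_update_due(last_update: datetime) -> datetime:
    """Return when the next status update (e.g. in MS Teams) is due."""
    return last_update + UPDATE_INTERVAL
```

For an alert raised at 12:00, this gives an acknowledgement deadline of 12:45 and a resolution start deadline of 13:15 under the stated assumptions.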
24. What it looks like in production
POST /subscriptions returns 500
36. Metrics that we pay attention to
• Resource utilization – CPU, memory, container pool
• API responses – response time (SLA), status codes
• Structure of load – IP vs PUA vs SSO vs legacy auth
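As an illustration of the API response and load structure metrics, the sketch below aggregates a batch of request records. It is a simplified example under stated assumptions: the `RequestRecord` shape, the `summarize` function and the SLA threshold parameter are hypothetical, and resource utilization is left out because it normally comes from the platform rather than from request data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class RequestRecord:
    status_code: int
    duration_ms: float
    auth_type: str  # e.g. "IP", "PUA", "SSO", "Legacy"


def summarize(records: Iterable[RequestRecord], sla_ms: float) -> dict:
    """Aggregate status classes, SLA breaches and the share of each auth type."""
    records = list(records)
    total = len(records)
    if total == 0:
        return {"total": 0, "status_classes": {}, "sla_breach_ratio": 0.0,
                "load_structure": {}}

    # Group status codes into classes such as "2xx" or "5xx".
    status_classes = Counter(f"{r.status_code // 100}xx" for r in records)
    # A request breaches the SLA when it takes longer than the threshold.
    breaches = sum(1 for r in records if r.duration_ms > sla_ms)
    load = Counter(r.auth_type for r in records)

    return {
        "total": total,
        "status_classes": dict(status_classes),
        "sla_breach_ratio": breaches / total,
        "load_structure": {auth: count / total for auth, count in load.items()},
    }
```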
47. Things to log
(1) POST /subscription
(2) Workflow steps: Lookup customer, Create customer, Taxes and coupons, Create subscription, Activate user
(3) Outgoing requests; anomalies
(4) Data manipulation; audit trail
(5) Availability
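One way to log the incoming request and each workflow step is to emit structured entries that share a request identifier. The sketch below only illustrates the idea: the logger name, event names, `request_id` field and the empty step bodies are assumptions, not the production implementation.

```python
import json
import logging
import time
import uuid
from contextlib import contextmanager

logger = logging.getLogger("subscriptions")  # hypothetical logger name

WORKFLOW_STEPS = (
    "lookup_customer",
    "create_customer",
    "taxes_and_coupons",
    "create_subscription",
    "activate_user",
)


def log_event(event: str, request_id: str, **fields) -> None:
    """Emit one JSON log line so entries can be filtered by request_id."""
    logger.info(json.dumps({"event": event, "request_id": request_id, **fields}))


@contextmanager
def workflow_step(name: str, request_id: str):
    """Log the start, outcome and duration of a single workflow step."""
    started = time.monotonic()
    log_event("step.start", request_id, step=name)
    try:
        yield
    except Exception as exc:
        log_event("step.error", request_id, step=name, error=repr(exc))
        raise
    else:
        duration_ms = round((time.monotonic() - started) * 1000, 1)
        log_event("step.done", request_id, step=name, duration_ms=duration_ms)


def handle_post_subscription() -> str:
    """Walk through the workflow, logging the request and every step."""
    request_id = uuid.uuid4().hex
    log_event("request.received", request_id, method="POST", path="/subscription")
    for step in WORKFLOW_STEPS:
        with workflow_step(step, request_id):
            pass  # placeholder: the real step would run here
    log_event("request.completed", request_id)
    return request_id
```

Because every entry carries the same `request_id`, a failed step can be traced back to the request that triggered it. Note that the logger level must be set to INFO or lower for these entries to be emitted.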
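59. Zipkin collection flow
Lanes: App code, Trace instrumentation, Http Client, Zipkin Collector (Service A)
(1) App code issues GET /foo.
(2) Trace instrumentation records tags and timestamp.
(3) Trace instrumentation adds trace headers and hands the request to the Http Client:
    GET /foo
    X-B3-TraceId: aa
    X-B3-SpanId: 6b
(4) Http Client invokes the request and receives 200 OK.
(5) Trace instrumentation records the duration and returns 200 OK to the app code.
(6) Trace instrumentation reports the span to the Zipkin Collector asynchronously.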
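The flow above can be sketched without any tracing library. The code below is a simplified illustration, not Zipkin's own instrumentation: `SpanReporter`, `traced_call` and the `http_call` callable are hypothetical stand-ins, the span fields are loosely modeled on Zipkin's span format, and the background thread only collects spans in memory instead of sending them to a collector.

```python
import queue
import secrets
import threading
import time
from typing import Callable, Dict, List

# Hypothetical stand-in for the HTTP client: (method, path, headers) -> status code.
HttpCall = Callable[[str, str, Dict[str, str]], int]


class SpanReporter:
    """Hands finished spans to a background thread so callers are not blocked."""

    def __init__(self) -> None:
        self._queue: "queue.Queue[dict]" = queue.Queue()
        self.reported: List[dict] = []
        threading.Thread(target=self._drain, daemon=True).start()

    def report(self, span: dict) -> None:
        self._queue.put(span)  # returns immediately

    def _drain(self) -> None:
        while True:
            span = self._queue.get()
            self.reported.append(span)  # a real reporter would send this to the collector
            self._queue.task_done()

    def flush(self) -> None:
        """Block until every queued span has been processed."""
        self._queue.join()


def traced_call(method: str, path: str, http_call: HttpCall,
                reporter: SpanReporter) -> int:
    """Wrap one outgoing request with the steps shown in the flow."""
    trace_id = secrets.token_hex(8)  # 16 hex characters
    span_id = secrets.token_hex(8)

    # Record tags and timestamp (microseconds since the epoch).
    span = {
        "traceId": trace_id,
        "id": span_id,
        "name": f"{method} {path}",
        "tags": {"http.method": method, "http.path": path},
        "timestamp": int(time.time() * 1_000_000),
    }

    # Add trace headers, then invoke the request.
    headers = {"X-B3-TraceId": trace_id, "X-B3-SpanId": span_id}
    started = time.monotonic()
    status = http_call(method, path, headers)

    # Record duration, then report the span asynchronously.
    span["duration"] = int((time.monotonic() - started) * 1_000_000)
    reporter.report(span)
    return status
```

The application code receives the response as soon as `traced_call` returns; reporting happens on the background thread, which is why a slow or unavailable collector does not delay the request in this design.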
68. Additional notes
• Proactive vs reactive usage
• Use 20/80 approach proactively
• Nice to have: combination of tracing and logging
• Custom annotations and naming where needed
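Combining tracing and logging can be as simple as stamping every log line with the current trace ID. The sketch below shows one possible approach using only the Python standard library; `current_trace_id`, `TraceIdFilter`, `configure_logger` and `demo` are hypothetical names, and in a real service the trace ID would be set by the trace instrumentation rather than by hand.

```python
import contextvars
import io
import logging

# Holds the trace ID of the request being handled; "-" when there is none.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


def configure_logger(name: str, handler: logging.Handler) -> logging.Logger:
    """Create a logger whose output includes the trace ID."""
    handler.addFilter(TraceIdFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace=%(trace_id)s %(message)s")
    )
    log = logging.getLogger(name)
    log.setLevel(logging.INFO)
    log.addHandler(handler)
    return log


def demo() -> str:
    """Log one message inside a trace and return the formatted output."""
    stream = io.StringIO()
    log = configure_logger("tracing-demo", logging.StreamHandler(stream))
    token = current_trace_id.set("aa")  # normally set by the trace instrumentation
    try:
        log.info("subscription created")
    finally:
        current_trace_id.reset(token)
    return stream.getvalue()
```

With the trace ID in each log line, a slow span found in the tracing UI can be matched to the log entries written while that request was processed.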