Slides from my presentation on distributed tracing, explaining what latency is and why it matters. We looked at OpenZipkin and its concepts, such as how the core annotations work and what tags and logs are, followed by a demo application built with Go and Java (Spring Boot, Spring Cloud Sleuth, Zipkin). You can find the source code here:
https://github.com/nklmish/go-distributed-tracing-demo
https://github.com/nklmish/java-distributed-tracing-demo
25. @nklmish
Clients impacted by longtail latency…
N: No. of visits (500,000)
D: Delay (50 ms)
S: Highly active service (suffering from longtail latency)
Percentile: 99th => 1 out of 100 visits experiences D
Total visits experiencing delay: N ÷ 100 => 5,000
1 visit (in our distributed system): 8 downstream calls => interacting with S (99% fast & 1% slow)
Likelihood of 1 visit encountering latency: 1 - 0.99^8 = 1 - 0.923 => 0.077 ≈ 8%
Total visits affected: 8% of N => 40,000
Impacts:
a. Lot of visits
b. Repeated visits in a day
27. @nklmish
But we still don’t know…
Request timeline (when it started and which operation)
Log correlation
How the same operation behaved across different clusters/regions/zones
How much it deviates from the acceptable value
Call graph
29. @nklmish
Distributed tracing
Tracks request flow.
Fast reaction (traced data available within minutes)
Dynamically instruments apps.
System insight, critical path, understanding call graphs (which services, which operations, at what time, etc.)
Measuring E2E latency
Call patterns (optimisation) & bug discovery (spotting redundant requests, sync vs async)
31. @nklmish
Via Tracing system
Tracing system should:
Trace
Have low overhead
Be scalable
Work 24/7, 365 days a year (production bugs are difficult to reproduce)
Shouldn’t:
Rely on programmers' collaboration
34. @nklmish
Span
Denotes a logical unit of work done (timestamped)
Work done is expressed as a human-readable string (operation name)
Created by the tracer (instrumenting code)
Slim (KiB or less)
Root span: a span without a parent ID
35. @nklmish
Zipkin annotations
Client <=> Server
cs: client send (HTTP request: get catalog, span starts)
sr: server received
ss: server send
cr: client received (HTTP response: catalog, span ends)
Processing time = ss - sr
Response time = cr - cs
Network latency (request) = sr - cs
Network latency (response) = cr - ss
48. @nklmish
Tracer
Does most of the heavy lifting, e.g. span creation, context generation, passing info, data propagation, etc.
49. @nklmish
Sampling
Controls how much to record
High-traffic systems: a fraction of the traffic is enough
Low-traffic systems: adjust based on your needs
Note: Debug spans are always recorded.
57. @nklmish
Summary
Distributed systems are hard to reason about and have complex call graphs
Distributed tracing helps to analyse E2E latency and understand call graphs
Instrumentation is tricky (async, thread pools, callbacks, etc.)
OpenZipkin provides:
An open source tracing system
Visualisation of request flow
Spring Cloud Sleuth brings tracing to the Spring world
OpenTracing: aims to standardise tracing