Distributed tracing - get a grasp on your production

@nklmish
“the most wanted and missed tool in the microservice world”

Agenda
Why latency?
Distributed tracing
Short demo
Zipkin & core concepts
Code walkthrough

Every little bit counts

With scale, you see
(source: https://gist.github.com/hellerbarde/2843375)

Remember, slow pages lose users

Distributed systems - latency analysis

Story time: How Bob met longtail latency

Bob didn’t know he was suffering from longtail latency

Bob trying to troubleshoot longtail latency in a distributed system

Option 1: Log Analysis

Not everything is in the critical path.

Correlating logs is manual work

It simply doesn’t make sense

Option 2: What about Metrics?

Something is wrong

Can’t tell the cause

Aggregates (avg, stdev) may deceive
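
To see how an average can hide a long tail, here is a minimal sketch with made-up numbers (the latency values and the `percentile` helper are illustrative assumptions, not taken from the talk). It compares the mean and median with the 99th percentile for the same set of requests.

```python
import math
import statistics

# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [2000] * 2


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


mean_ms = statistics.mean(latencies_ms)      # pulled up a little by the two outliers
median_ms = statistics.median(latencies_ms)  # does not see the tail at all
p99_ms = percentile(latencies_ms, 99)        # exposes the slow requests

print(f"mean={mean_ms} ms, median={median_ms} ms, p99={p99_ms} ms")
```

The mean and median both look healthy here, while the 99th percentile shows that some users wait two seconds.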

Bob, could we find out how many clients are impacted?

Bob learns about percentiles

Clients impacted by longtail latency…
N: No. of visits (500,000); D: Delay (50 ms); S: Highly active service (suffering from longtail latency)
Percentile: 99th => 1 out of 100 visits experiences D
Total visits experiencing delay: N ÷ 100 => 5,000
1 visit (in our distributed system): 8 downstream calls => interacting with S (99% fast & 1% slow)
Likelihood of 1 visit encountering latency: 1 - (0.99^8) = 1 - 0.923 => 0.077 ≈ 8%
Total visits affected: 8% of N => 40,000
Impacts:
a. Lot of visits
b. Repeated visits in a day
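
The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: it assumes the 8 downstream calls are independent, and the variable names are illustrative.

```python
visits = 500_000        # N: number of visits
calls_per_visit = 8     # downstream calls to S per visit
slow_fraction = 0.01    # 1% of calls to S are slow (the 99th percentile)

# Naive reading of "99th percentile": 1 visit in 100 is delayed.
naive_delayed_visits = visits // 100

# A visit is fast only if all 8 calls are fast (assuming independent calls),
# so the chance of hitting at least one slow call is 1 - 0.99^8.
p_visit_slow = 1 - (1 - slow_fraction) ** calls_per_visit
affected_visits = visits * p_visit_slow

print(f"naive estimate: {naive_delayed_visits} visits")
print(f"likelihood per visit: {p_visit_slow:.3f}")
print(f"affected visits: {affected_visits:.0f}")
```

The unrounded likelihood is a little under 8%, so the exact count comes out slightly below the 40,000 obtained on the slide by rounding to 8%.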

But we still don’t know…
Request timeline (when it started & which operation)
Log correlation
How the same operation behaved across different clusters/regions/zones
How much it deviates from the acceptable value
Call graph

Bob was missing Distributed Tracing

Distributed tracing
Tracks request flow.
Fast reaction (traced data available within minutes)
Dynamically instruments apps.
System insight, critical path, understanding call graphs (which services, which operations, at what time, etc.)
Measuring E2E latency
Call patterns (optimisation) & bug discovery (spotting redundant requests, sync vs async)

How can we apply this knowledge?

Via a tracing system
A tracing system should:
Trace
Have low overhead
Be scalable
Work 24/7/365 (production bugs are difficult to reproduce)
It shouldn’t:
Rely on programmers’ collaboration

OpenZipkin - open source tracing system

OpenZipkin
Zipkin is:
Distributed tracing system
Created by Twitter
Based on Dapper.
OpenZipkin:
GitHub organisation
Primary fork of Zipkin
Open source
Pluggable architecture

Span
Denotes a logical unit of work done (timestamped)
Work done is expressed as a human-readable string (operation name)
Created by the tracer (instrumenting code)
Slim (KiB or less)
Root span - span without a parent id

Zipkin annotations
cs: client send (HTTP Request: get catalog, span starts)
sr: server receive
ss: server send
cr: client receive (HTTP Response: catalog, span ends)
Processing time = ss - sr
Response time = cr - cs
Network latency (request) = sr - cs
Network latency (response) = cr - ss
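
The four annotations are enough to derive every timing on this slide. A minimal sketch, assuming made-up timestamps in milliseconds and clocks that agree across client and server:

```python
# Hypothetical annotation timestamps, in ms since the client sent the request.
annotations = {"cs": 0, "sr": 15, "ss": 95, "cr": 112}

processing_time = annotations["ss"] - annotations["sr"]   # time spent inside the server
response_time = annotations["cr"] - annotations["cs"]     # what the client experiences
network_request = annotations["sr"] - annotations["cs"]   # client -> server
network_response = annotations["cr"] - annotations["ss"]  # server -> client

print(processing_time, response_time, network_request, network_response)
```

The response time seen by the client is the sum of the server's processing time and the two network legs.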

It’s all about trace & span
HTTP Request: get catalog
CatalogService: getCatalog() (traceId: 1, parentId: none, spanId: 1)
PriceService: getPrice() (traceId: 1, parentId: 1, spanId: 2)
ProductService: getProducts() (traceId: 1, parentId: 1, spanId: 3)
Database call (spanId: 4)
Data analytic call (spanId: 5)
Span: each call above. Trace: the whole request.
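
The ids on this slide are all that is needed to rebuild the call tree. The sketch below is illustrative only: the `Span` class and helper functions are hypothetical, and the parent of spans 4 and 5 is not legible in the slide text, so attaching them to ProductService (span 3) is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    trace_id: int
    span_id: int
    parent_id: Optional[int]  # None marks the root span
    name: str


spans = [
    Span(1, 1, None, "CatalogService: getCatalog()"),
    Span(1, 2, 1, "PriceService: getPrice()"),
    Span(1, 3, 1, "ProductService: getProducts()"),
    Span(1, 4, 3, "Database call"),        # parent id assumed
    Span(1, 5, 3, "Data analytic call"),   # parent id assumed
]


def build_tree(all_spans):
    """Index spans by parent id and find the root (the span without a parent)."""
    children = defaultdict(list)
    root = None
    for span in all_spans:
        if span.parent_id is None:
            root = span
        else:
            children[span.parent_id].append(span)
    return root, children


def render(span, children, depth=0):
    """Return the tree as indented lines, one span per line."""
    lines = ["  " * depth + f"{span.name} (spanId: {span.span_id})"]
    for child in children[span.span_id]:
        lines.extend(render(child, children, depth + 1))
    return lines


root, children = build_tree(spans)
tree_lines = render(root, children)
print("\n".join(tree_lines))
```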

Trace (E2E latency graph)
DAG of spans that forms a latency tree.

Demo
https://github.com/nklmish/java-distributed-tracing-demo
https://github.com/nklmish/go-

Demo application - Zipkin visualises dependencies

Zipkin’s architecture
Flow: instrumented service -> Transport -> Collector -> Storage -> API -> UI
Transport (Scribe/Kafka): collect & convert spans
Collector: receives spans; deserialising, sampling & scheduling for storage
Storage (DB: Cassandra/MySQL/Elasticsearch): stores spans
API: retrieves data
UI: visualises

Tags
Tag denotes:
Key-value pair
Not timestamped
A span may contain zero or more tags

Log
Log denotes:
Event name (marks a meaningful moment in the lifetime of a span)
Timestamped
A span may contain zero or more logs
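
The difference between tags and logs is easiest to see side by side. This is a minimal sketch with a hypothetical `SpanRecord` class, not the API of any particular tracer: tags are plain key-value pairs, while every log carries a timestamp.

```python
import time
from dataclasses import dataclass, field


@dataclass
class SpanRecord:
    """Hypothetical span holding tags (untimestamped) and logs (timestamped)."""
    name: str
    tags: dict = field(default_factory=dict)
    logs: list = field(default_factory=list)

    def set_tag(self, key, value):
        # A tag describes the span as a whole, so no timestamp is stored.
        self.tags[key] = value

    def log(self, event, timestamp=None):
        # A log marks a moment in the span's lifetime, so it is stored with a time.
        when = timestamp if timestamp is not None else time.time()
        self.logs.append((when, event))


span = SpanRecord("getCatalog")
span.set_tag("http.path", "/catalog")
span.log("cache.miss")
span.log("db.query.start")
```

Tags suit querying and aggregation ("all spans for this path"), while logs explain where time went inside a single span.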

Annotations
Help explain latency with a timestamp.
Annotations are often codes, e.g. sr, cs, etc.

Binary Annotations
Tag a span with context, usually to support query or aggregation (e.g. http.path).
Repeatable, and vary by host.

Can I have large spans (e.g. MiB)?
They decrease usability & increase the cost of the tracing system

Beware of clock skew!!!
10:00 10:00

Beware of clock skew!!!
10:00:01 10:00:22
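
A small sketch of why skew matters, assuming the server clock runs 21 seconds ahead of the client clock (loosely mirroring the two readings above) and made-up event times. Durations measured on a single host stay correct, while anything computed across hosts is distorted.

```python
# True event times in ms on an imaginary perfect clock (hypothetical values).
true_ms = {"cs": 0, "sr": 15, "ss": 95, "cr": 112}

# Assumption: the server clock is 21 s ahead of the client clock.
SERVER_SKEW_MS = 21_000

# What each host actually records using its own clock.
recorded_ms = {
    "cs": true_ms["cs"],                   # client clock
    "cr": true_ms["cr"],                   # client clock
    "sr": true_ms["sr"] + SERVER_SKEW_MS,  # server clock
    "ss": true_ms["ss"] + SERVER_SKEW_MS,  # server clock
}

# Same-host differences: the skew cancels out.
processing_ms = recorded_ms["ss"] - recorded_ms["sr"]
response_ms = recorded_ms["cr"] - recorded_ms["cs"]

# Cross-host differences: the skew leaks in.
request_network_ms = recorded_ms["sr"] - recorded_ms["cs"]
response_network_ms = recorded_ms["cr"] - recorded_ms["ss"]

print(processing_ms, response_ms, request_network_ms, response_network_ms)
```

With these numbers the request leg appears to take about 21 seconds and the response leg comes out negative, which is the kind of impossible timing that hints at skew.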

Tracer
Does most of the heavy lifting, e.g. span creation, context generation, passing info, data propagation, etc.
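
As an illustration of context generation and propagation, the sketch below passes trace context between services in HTTP headers. It assumes the B3 header names used by Zipkin instrumentation (`X-B3-TraceId`, `X-B3-SpanId`, `X-B3-ParentSpanId`, `X-B3-Sampled`); the functions and the dictionary layout are hypothetical and far simpler than a real tracer.

```python
import secrets


def new_id():
    """Random 64-bit id rendered as 16 lower-case hex characters."""
    return secrets.token_hex(8)


def start_trace(sampled=True):
    """Context generation: create the root span's context (no parent)."""
    return {"trace_id": new_id(), "span_id": new_id(), "parent_id": None, "sampled": sampled}


def inject(context):
    """Client side: copy the current context into outgoing HTTP headers."""
    headers = {
        "X-B3-TraceId": context["trace_id"],
        "X-B3-SpanId": context["span_id"],
        "X-B3-Sampled": "1" if context["sampled"] else "0",
    }
    if context["parent_id"] is not None:
        headers["X-B3-ParentSpanId"] = context["parent_id"]
    return headers


def extract_child(headers):
    """Downstream side: continue the same trace in a new child span."""
    return {
        "trace_id": headers["X-B3-TraceId"],
        "span_id": new_id(),
        "parent_id": headers["X-B3-SpanId"],
        "sampled": headers.get("X-B3-Sampled") == "1",
    }


root_context = start_trace()
outgoing_headers = inject(root_context)
child_context = extract_child(outgoing_headers)
```

The trace id stays constant across hops, while each hop gets a fresh span id whose parent is the caller's span.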

Sampling
Controls how much to record
High-traffic systems: a fraction of traffic is enough
Low-traffic systems: adjust based on your needs
Note: Debug spans are always recorded.
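
A rate-based sampler that honours the debug rule could look roughly like this. It is a sketch under assumptions: the class name, the `rate` parameter and the injectable random generator are illustrative, not Zipkin's actual sampler.

```python
import random


class ProbabilitySampler:
    """Records roughly `rate` of all traces; debug traces are always recorded."""

    def __init__(self, rate, rng=None):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0.0 and 1.0")
        self.rate = rate
        self._rng = rng or random.Random()

    def should_record(self, debug=False):
        if debug:
            return True  # debug spans bypass sampling
        # random() returns a float in [0.0, 1.0), so rate=0.0 never records
        # and rate=1.0 always records.
        return self._rng.random() < self.rate


# High-traffic service: keep about 10% of traces (seeded only to make the demo repeatable).
sampler = ProbabilitySampler(0.1, rng=random.Random(42))
recorded = sum(sampler.should_record() for _ in range(10_000))
print(f"recorded {recorded} of 10000 traces")
```

In practice the decision is usually made once, at the root of the trace, and then propagated downstream so that a trace is either recorded in full or not at all.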

OpenTracing
Standardises tracing
Vendor-neutral tracing API
Implementations available in 6 languages
http://opentracing.io/documentation/

Spring Cloud Sleuth Zipkin
Brings distributed tracing to Spring Cloud
Spring Cloud Starter Zipkin (Zipkin + Sleuth)
Supports
Hystrix
Async
RestTemplate
Feign
Zuul
Spring Integration
…
http://tiny.cc/scs-doc

Code Walkthrough
https://github.com/nklmish/java-
https://github.com/nklmish/go-

Who uses tracing
http://tiny.cc/tracing-impl

Summary: Latency is never zero, embrace it

Summary
Distributed systems are hard to reason about, with complex call graphs
Distributed tracing helps to analyse E2E latency & understand call graphs
Instrumentation is tricky (async, thread pools, callbacks, etc.)
OpenZipkin provides:
Open source tracing system
Visualisation of request flow
Spring Cloud Sleuth brings tracing to the Spring world
OpenTracing - aims to standardise tracing

Thank You
Questions?
http://tiny.cc/tracing
Slides => http://tiny.cc/tracing-slides
Review =>
Source Code
