4. Our team
> Large-scale metrics platform (400M datapoints per min)
> Logs platform (1M logs per min)
> Distributed Tracing platform (2M spans per min)
https://speakerdeck.com/line_devday2019/deep-dive-into-lines-time-series-database-flash
6. Prerequisite know-how
> What is Distributed Tracing
> Request-scoped DistTrace
> Concepts
7. Our setup
> Brave-based client (see the setup sketch after this slide)
> Custom Armeria-based collector (Thrift)
  https://github.com/line/armeria
> Zipkin-based internal server
> Elasticsearch on NVMe-based, high-spec machines
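For orientation, a minimal Brave + Zipkin reporter setup might look like the sketch below. This is not our actual client (our spans go over Thrift to the custom Armeria-based collector); the endpoint URL and service name are placeholders, and the sketch reports JSON over HTTP instead.

```java
import brave.Span;
import brave.Tracer;
import brave.Tracing;
import zipkin2.reporter.AsyncReporter;
import zipkin2.reporter.okhttp3.OkHttpSender;

public class TracingSetupSketch {
  public static void main(String[] args) {
    // Sender posts spans to a Zipkin-compatible collector endpoint (placeholder URL).
    OkHttpSender sender = OkHttpSender.create("http://collector.example.internal:9411/api/v2/spans");
    AsyncReporter<zipkin2.Span> reporter = AsyncReporter.create(sender);

    // Tracing component; the local service name is what shows up in the Zipkin UI.
    Tracing tracing = Tracing.newBuilder()
        .localServiceName("example-service")
        .spanReporter(reporter)
        .build();

    // Record one span so something is reported.
    Tracer tracer = tracing.tracer();
    Span span = tracer.newTrace().name("example-operation").start();
    try {
      // ... business logic ...
    } finally {
      span.finish();
    }

    tracing.close();
    reporter.close();
    sender.close();
  }
}
```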
8. Our setup
> In-house customized Zipkin UI
  - Now the upstream default (zipkin-lens)
9. Problems with OSS multi-tenant DistTracing
> Storage cost + scalability
> Standard war (instrumentation libs)
> UI / UX for multi-tenant use
> User voice: high implementation cost; useful spans get sampled out
11. Sampling problem
> Sampling is used to reduce storage cost
> All OSS solutions employ “head-based” sampling (unbiased sampling; see the sketch below)
  - Much useful data gets sampled out
  - What is useful data, anyway?
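To make the head-based part concrete, here is a minimal sketch using Brave's built-in probabilistic sampler (the rate and service name are arbitrary): the keep/drop decision is made once at the root of the trace, before anything is known about how the request will turn out.

```java
import brave.Tracing;
import brave.sampler.Sampler;

public class HeadBasedSamplingSketch {
  public static void main(String[] args) {
    // Head-based sampling: decided up front, so slow or failing requests are just as
    // likely to be dropped as healthy ones.
    Tracing tracing = Tracing.newBuilder()
        .localServiceName("example-service")
        .sampler(Sampler.create(0.01f)) // keep ~1% of traces, chosen uniformly at random
        .build();

    // Every child span inherits the decision made for its root span.
    tracing.tracer().newTrace().name("handle-request").start().finish();
    tracing.close();
  }
}
```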
12. OSS UI / UX problem
> Searching by “serviceName” is hard when you have 100 teams and each team has 50 services
> Time-range queries are mostly useless when you have 1000 rps
13. Standard war
Zipkin B3 header (example below)
OpenTracing
OpenCensus
OpenTelemetry
Language-based (Go, JVM)
Middleware-based (HTrace..)
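As an illustration of the first entry in that list, a request carrying Zipkin B3 context looks roughly like this on the wire; the IDs below are made-up values.

```java
import java.util.HashMap;
import java.util.Map;

public class B3HeadersSketch {
  public static void main(String[] args) {
    // Zipkin B3 propagation: the trace context crosses process boundaries as a few
    // HTTP headers attached to each request.
    Map<String, String> headers = new HashMap<>();
    headers.put("X-B3-TraceId", "463ac35c9f6413ad48485a3953bb6124"); // 64- or 128-bit trace id
    headers.put("X-B3-SpanId", "a2fb4a1d1a96d312");
    headers.put("X-B3-ParentSpanId", "0020000000000001");
    headers.put("X-B3-Sampled", "1"); // the head-based sampling decision travels with the request

    headers.forEach((k, v) -> System.out.println(k + ": " + v));
  }
}
```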
15. Trace without trace
> The Mystery Machine: End-to-end performance analysis of large-scale Internet services
  - First tracing solution at Facebook (before Canopy)
  - Uses “LOG”s for causal analysis
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.3942&rep=rep1&type=pdf
16. Trace without trace (cont)
> Canopy: An End-to-End Performance Tracing And Analysis System (Facebook)
> From-scratch solution (even the trace API and trace standard)
> Head-based sampling (token bucket)
https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
17. Trace without trace (cont)
> Transparent tracing of microservice-based applications
  - Utilizes a proxy for trace interception
  - Utilizes Linux syscalls for trace instrumentation
https://dl.acm.org/doi/abs/10.1145/3297280.3297403
18. Unify standard
> Universal context propagation for distributed system instrumentation
  - Proposes a universal “standard” for context data to overcome the standard war
https://dl.acm.org/doi/abs/10.1145/3190508.3190526
19. Better sampling method
> Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering
  - Tail-based sampling method
  - Utilizes machine learning to overcome the feature-selection problem
https://dl.acm.org/doi/abs/10.1145/3357223.3362736
22. What we could do
Some insights from our company:
> Logs are the most informative of the 3 pillars
> Tail-based sampling seems to be the cure for the “usefulness” of a DistTrace service
23. OSS move
Firehose mode of Zipkin and Jaeger (100% sampling, skip indexing)
> https://cwiki.apache.org/confluence/display/ZIPKIN/Firehose+mode
> https://github.com/jaegertracing/jaeger/issues/1731
24. Proposed architecture
> Trace client: firehose mode
> Log client: inject trace ID
> Feature-based sampler (see the sketch below)
  - Constructs traces by buffering spans in memory
  - Samples in error traces and high-latency traces
> Storage
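A minimal sketch of the feature-based sampler box, assuming the “features” are simply error status and latency; the class, record, and threshold below are made up for illustration, not our implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class FeatureBasedSamplerSketch {
  // Minimal span model for the sketch (real spans come from the firehose-mode trace client).
  record Span(String traceId, boolean error, long durationMicros) {}

  private final Map<String, List<Span>> buffer = new ConcurrentHashMap<>();
  private final long latencyThresholdMicros;

  FeatureBasedSamplerSketch(long latencyThresholdMicros) {
    this.latencyThresholdMicros = latencyThresholdMicros;
  }

  // Called for every incoming span; firehose mode means no upfront sampling here.
  void onSpan(Span span) {
    buffer.computeIfAbsent(span.traceId(), id -> new CopyOnWriteArrayList<>()).add(span);
  }

  // Called once a trace is considered complete (e.g. after an idle timeout).
  // Returns the whole buffered trace if it should be written to storage, or null to drop it.
  List<Span> sampleIn(String traceId) {
    List<Span> spans = buffer.remove(traceId);
    if (spans == null) return null;
    boolean keep = spans.stream()
        .anyMatch(s -> s.error() || s.durationMicros() > latencyThresholdMicros);
    return keep ? spans : null;
  }
}
```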
25. No more search UI for traces
> Users navigate to a trace by trace ID only (no more time-range-based search)
> Users get the trace ID from the LOG search UI (you need searchable logs, e.g. Kibana); a trace-ID injection sketch follows
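A sketch of the “inject trace ID” step under the assumption that logging goes through SLF4J: copy the current Brave trace ID into the MDC so every log line carries it, and the log search UI (e.g. Kibana) becomes the entry point to a trace. Brave also ships an MDC integration in brave-context-slf4j; the key name "traceId" here is just a convention.

```java
import brave.Span;
import brave.Tracer;
import brave.Tracing;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class TraceIdInLogsSketch {
  static final Logger log = LoggerFactory.getLogger(TraceIdInLogsSketch.class);

  static void handleRequest(Tracer tracer) {
    Span span = tracer.nextSpan().name("handle-request").start();
    try {
      // Make the trace ID visible to the log layout (pattern contains %X{traceId}).
      MDC.put("traceId", span.context().traceIdString());
      log.info("handled request"); // this line is now searchable by trace ID in Kibana
    } finally {
      MDC.remove("traceId");
      span.finish();
    }
  }

  public static void main(String[] args) {
    handleRequest(Tracing.newBuilder().build().tracer());
  }
}
```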