4. Our team
> Large-scale metrics platform (400M datapoints per min)
> Logs platform (1M logs per min)
> Distributed Tracing platform (2M spans per min)
https://speakerdeck.com/line_devday2019/deep-dive-into-lines-time-series-database-flash
6. Prerequisite know-how
> What is Distributed Tracing
> Request-scoped DistTrace
> Concepts
7. Our setup
> Brave-based client (see the setup sketch after this slide)
> Custom Armeria-based collector (Thrift)
  https://github.com/line/armeria
> Zipkin-based internal server
> Elasticsearch on NVMe-based, high-spec machines
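For orientation, a minimal Brave + Zipkin reporter setup might look like the sketch below. This is not our actual client (our spans go over Thrift to the custom Armeria-based collector); the endpoint URL and service name are placeholders, and the sketch reports JSON over HTTP instead.

```java
import brave.Span;
import brave.Tracer;
import brave.Tracing;
import zipkin2.reporter.AsyncReporter;
import zipkin2.reporter.okhttp3.OkHttpSender;

public class TracingSetupSketch {
  public static void main(String[] args) {
    // Sender posts spans to a Zipkin-compatible collector endpoint (placeholder URL).
    OkHttpSender sender = OkHttpSender.create("http://collector.example.internal:9411/api/v2/spans");
    AsyncReporter<zipkin2.Span> reporter = AsyncReporter.create(sender);

    // Tracing component; the local service name is what shows up in the Zipkin UI.
    Tracing tracing = Tracing.newBuilder()
        .localServiceName("example-service")
        .spanReporter(reporter)
        .build();

    // Record one span so something is reported.
    Tracer tracer = tracing.tracer();
    Span span = tracer.newTrace().name("example-operation").start();
    try {
      // ... business logic ...
    } finally {
      span.finish();
    }

    tracing.close();
    reporter.close();
    sender.close();
  }
}
```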
8. Our setup
> In-house customized Zipkin UI
  - Now the upstream default (zipkin-lens)
9. Problems with OSS multi-tenant DistTracing
> Storage cost + scalability
> Standard war (instrumentation libs)
> UI / UX for multi-tenant use
> User voice: high implementation cost; useful spans get sampled out
11. Sampling problem
> Sampling is used to reduce storage cost
> All OSS solutions employ “head-based” sampling (unbiased sampling; see the sketch below)
  - Much useful data gets sampled out
  - What is useful data, anyway?
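To make the head-based part concrete, here is a minimal sketch using Brave's built-in probabilistic sampler (the rate and service name are arbitrary): the keep/drop decision is made once at the root of the trace, before anything is known about how the request will turn out.

```java
import brave.Tracing;
import brave.sampler.Sampler;

public class HeadBasedSamplingSketch {
  public static void main(String[] args) {
    // Head-based sampling: decided up front, so slow or failing requests are just as
    // likely to be dropped as healthy ones.
    Tracing tracing = Tracing.newBuilder()
        .localServiceName("example-service")
        .sampler(Sampler.create(0.01f)) // keep ~1% of traces, chosen uniformly at random
        .build();

    // Every child span inherits the decision made for its root span.
    tracing.tracer().newTrace().name("handle-request").start().finish();
    tracing.close();
  }
}
```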
12. OSS UI / UX problem
> Searching by “serviceName” is hard when you have 100 teams and each team has 50 services
> Time-range queries are mostly useless when you have 1000 rps
13. Standard war
Zipkin B3 header (example below)
OpenTracing
OpenCensus
OpenTelemetry
Language-based (Go, JVM)
Middleware-based (HTrace..)
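As an illustration of the first entry in that list, a request carrying Zipkin B3 context looks roughly like this on the wire; the IDs below are made-up values.

```java
import java.util.HashMap;
import java.util.Map;

public class B3HeadersSketch {
  public static void main(String[] args) {
    // Zipkin B3 propagation: the trace context crosses process boundaries as a few
    // HTTP headers attached to each request.
    Map<String, String> headers = new HashMap<>();
    headers.put("X-B3-TraceId", "463ac35c9f6413ad48485a3953bb6124"); // 64- or 128-bit trace id
    headers.put("X-B3-SpanId", "a2fb4a1d1a96d312");
    headers.put("X-B3-ParentSpanId", "0020000000000001");
    headers.put("X-B3-Sampled", "1"); // the head-based sampling decision travels with the request

    headers.forEach((k, v) -> System.out.println(k + ": " + v));
  }
}
```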
15. Trace without trace
> The Mystery Machine: End-to-end performance analysis of large-scale Internet services
  - First tracing solution at Facebook (before Canopy)
  - Uses “LOG”s for causal analysis
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.3942&rep=rep1&type=pdf
16. Trace without trace (cont)
> Canopy: An End-to-End Performance Tracing And Analysis System (Facebook)
> From-scratch solution (even the trace API and trace standard)
> Head-based sampling (token bucket)
https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
17. Trace without trace (cont)
> Transparent tracing of microservice-based applications
  - Utilizes a proxy for trace interception
  - Utilizes Linux syscalls for trace instrumentation
https://dl.acm.org/doi/abs/10.1145/3297280.3297403
18. Unify standard
> Universal context propagation for distributed system instrumentation
  - Proposes a universal “standard” for context data to overcome the standard war
https://dl.acm.org/doi/abs/10.1145/3190508.3190526
19. Better sampling method
> Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering
  - Tail-based sampling method
  - Utilizes machine learning to overcome the feature-selection problem
https://dl.acm.org/doi/abs/10.1145/3357223.3362736
22. What we could do
Some insights from our company:
> Logs are the most informative of the 3 pillars
> Tail-based sampling seems to be the cure for the “usefulness” of a DistTrace service
23. OSS move
Firehose mode of Zipkin and Jaeger (100% sampling, skip indexing)
> https://cwiki.apache.org/confluence/display/ZIPKIN/Firehose+mode
> https://github.com/jaegertracing/jaeger/issues/1731
24. Proposed architecture
> Trace client: firehose mode
> Log client: inject trace ID
> Feature-based sampler (see the sketch below)
  - Constructs traces by buffering spans in memory
  - Samples in error traces and high-latency traces
> Storage
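A minimal sketch of the feature-based sampler box, assuming the “features” are simply error status and latency; the class, record, and threshold below are made up for illustration, not our implementation.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class FeatureBasedSamplerSketch {
  // Minimal span model for the sketch (real spans come from the firehose-mode trace client).
  record Span(String traceId, boolean error, long durationMicros) {}

  private final Map<String, List<Span>> buffer = new ConcurrentHashMap<>();
  private final long latencyThresholdMicros;

  FeatureBasedSamplerSketch(long latencyThresholdMicros) {
    this.latencyThresholdMicros = latencyThresholdMicros;
  }

  // Called for every incoming span; firehose mode means no upfront sampling here.
  void onSpan(Span span) {
    buffer.computeIfAbsent(span.traceId(), id -> new CopyOnWriteArrayList<>()).add(span);
  }

  // Called once a trace is considered complete (e.g. after an idle timeout).
  // Returns the whole buffered trace if it should be written to storage, or null to drop it.
  List<Span> sampleIn(String traceId) {
    List<Span> spans = buffer.remove(traceId);
    if (spans == null) return null;
    boolean keep = spans.stream()
        .anyMatch(s -> s.error() || s.durationMicros() > latencyThresholdMicros);
    return keep ? spans : null;
  }
}
```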
25. No more search UI for traces
> Users navigate to a trace by trace ID only (no more time-range-based search)
> Users get the trace ID from the LOG search UI (you need searchable logs, e.g. Kibana); a trace-ID injection sketch follows
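A sketch of the “inject trace ID” step under the assumption that logging goes through SLF4J: copy the current Brave trace ID into the MDC so every log line carries it, and the log search UI (e.g. Kibana) becomes the entry point to a trace. Brave also ships an MDC integration in brave-context-slf4j; the key name "traceId" here is just a convention.

```java
import brave.Span;
import brave.Tracer;
import brave.Tracing;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

public class TraceIdInLogsSketch {
  static final Logger log = LoggerFactory.getLogger(TraceIdInLogsSketch.class);

  static void handleRequest(Tracer tracer) {
    Span span = tracer.nextSpan().name("handle-request").start();
    try {
      // Make the trace ID visible to the log layout (pattern contains %X{traceId}).
      MDC.put("traceId", span.context().traceIdString());
      log.info("handled request"); // this line is now searchable by trace ID in Kibana
    } finally {
      MDC.remove("traceId");
      span.finish();
    }
  }

  public static void main(String[] args) {
    handleRequest(Tracing.newBuilder().build().tracer());
  }
}
```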