SlideShare a Scribd company logo
1 of 26
Download to read offline
Distributed Tracing
Insights from
internal SAAS team
Calendar
> Background
> Our internal DistTrace problems
> Searching for better solution
> Conclusion
> @dxhuy (フィ)
> LINE’s Observability Team EM
Introduction
> Large scale metrics platform (400M
datapoints per min)
> ~~ Logs platform (1M log per min)
> ~~ Dist Tracing platform (2M spans
per min)
Our team
https://speakerdeck.com/line_devday2019/deep-dive-into-lines-time-series-database-flash
> Large scale metrics platform (400M
datapoints per min)
> ~~ Logs platform (1M log per min)
> ~~ Dist Tracing platform (2M spans
per min)
Our team
> What is Distributed Tracing
> Request scope DistTrace
> Concepts
Prerequisite know-how
Elasticsearch
Our setup
Brave

base client
Zipkin base

internal server
Elasticsearch
Elasticsearch
Thrift
Custom Armeria

based collector
https://github.com/line/armeria
NvME based

High spec machines
Our setup
> Inhouse customized Zipkin UI
- Now became upstream default
(zipkin-lens)
Problems with
OSS multi-tenant DistTracing
> Storage cost + scalability
> Standard war (instrument lib)
> UI / UX for multi-tenant
> User voice: high implement
cost, useful spans be sampled
out..
Storage problem
Infra cost
Useful data
If we sampling 

100% data
Infra cost for Trace
>>
Infra cost for app
Sampling problem
> Sampling to reduce storage cost
> All OSS employ “head-based”
sampling (unbiased sampling)
- Many useful data will be
sampling out
- What is useful data btw?
OSS UI / UX problem
> Search by “serviceName” is hard
when you have 100 teams and each
team has 50 services
> Time range query is mostly
useless when you have 1000 rps
Standard war
Zipkin B3 header
OpenTracing
OpenCencus
OpenTelemetry
Language base (go, jvm)
Middleware base (htrace..)
How could we
make it better?
Trace without trace
> The Mystery Machine: End-to-end performance
analysis of large-scale Internet services
- First trace solution at facebook (before
Canopi)
- Use “LOG” to calculate causal analysis
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.3942&rep=rep1&type=pdf
Trace without trace (cont)
> Canopy: An End-to-End Performance Tracing
And Analysis System (Facebook)
> From scratch solution (even trace API,
trace standard)
> Head based sampling (token bucket)
https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
Trace without Trace (cont)
> Transparent tracing of microservice-based
applications
- Utilize Proxy for trace interception
- Utilize Linux Syscall for trace
instrumentation
https://dl.acm.org/doi/abs/10.1145/3297280.3297403
Unify standard
> Universal context propagation for
distributed system instrumentation
- Propose universal “standard” for context
data to overcome standard war
https://dl.acm.org/doi/abs/10.1145/3190508.3190526
Better sampling method
> Sifter: Scalable Sampling for Distributed
Traces, without Feature Engineering
- Tailed-based sampling method
- Utilize Machine-Learning to overcome
feature selection problem
https://dl.acm.org/doi/abs/10.1145/3357223.3362736
Better sampling method
> Honeycomb refinery (DistTrace vendor)
- Feature based, tail-based sampling
- https://docs.honeycomb.io/working-with-
your-data/tracing/refinery/
https://dl.acm.org/doi/abs/10.1145/3357223.3362736
Better sampling method (cont)
> Sifter: Scalable Sampling for Distributed
Traces, without Feature Engineering
- Tailed-based sampling method
- Utilize Machine-Learning to overcome
feature selection problem
https://dl.acm.org/doi/abs/10.1145/3357223.3362736
What we could do
Some insights from our company
> Log is the most informative one
of 3 pillars
> Tail-based sampling seems the
cure for “usefulness” of
DistTrace service
OSS move
Firehorse mode of Zipkin and Jeager
(100% sampling, skip indexing)
> https://cwiki.apache.org/
confluence/display/ZIPKIN/
Firehose+mode
> https://github.com/jaegertracing/
jaeger/issues/1731
Propose architecture
Firehose mode
Trace Client
Feature base

Sampler
Storage 

Log Client
Inject trace ID
Sample-in
-Error trace
-High Latency trace
Construct Trace by
Buffer span in memory
No more search UI
for trace
> User traverse to trace by trace
ID only (no more time range base
search)
> User get trace ID from LOG
search UI (you need search-able
LOG, example: Kibana)
Thanks you
for listening

More Related Content

What's hot

Wayfair Use Case: The four R's of Metrics Delivery
Wayfair Use Case: The four R's of Metrics DeliveryWayfair Use Case: The four R's of Metrics Delivery
Wayfair Use Case: The four R's of Metrics DeliveryInfluxData
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Hadoop User Group
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slidesryancox
 
SignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseSignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseDataStax Academy
 
HBase at Flurry
HBase at FlurryHBase at Flurry
HBase at Flurryddlatham
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanNarayana B
 
XNAT Tuning & Monitoring
XNAT Tuning & MonitoringXNAT Tuning & Monitoring
XNAT Tuning & MonitoringJohn Paulett
 
Flume and Hadoop performance insights
Flume and Hadoop performance insightsFlume and Hadoop performance insights
Flume and Hadoop performance insightsOmid Vahdaty
 
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...Cloudera, Inc.
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed StorageHBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed StorageCloudera, Inc.
 
Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems confluent
 
Backup, Restore, and Disaster Recovery
Backup, Restore, and Disaster RecoveryBackup, Restore, and Disaster Recovery
Backup, Restore, and Disaster RecoveryMongoDB
 
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQLHBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQLCloudera, Inc.
 
Exadata下的数据并行加载、并行卸载及性能监控
Exadata下的数据并行加载、并行卸载及性能监控Exadata下的数据并行加载、并行卸载及性能监控
Exadata下的数据并行加载、并行卸载及性能监控Kaiyao Huang
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance ComputersDave Hiltbrand
 
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB,  or how we implemented a 10-times faster CassandraSeastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB, or how we implemented a 10-times faster CassandraTzach Livyatan
 

What's hot (20)

Wayfair Use Case: The four R's of Metrics Delivery
Wayfair Use Case: The four R's of Metrics DeliveryWayfair Use Case: The four R's of Metrics Delivery
Wayfair Use Case: The four R's of Metrics Delivery
 
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
Yahoo! Hadoop User Group - May Meetup - HBase and Pig: The Hadoop ecosystem a...
 
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter SlidesJuly 2010 Triangle Hadoop Users Group - Chad Vawter Slides
July 2010 Triangle Hadoop Users Group - Chad Vawter Slides
 
SignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series DatabaseSignalFx: Making Cassandra Perform as a Time Series Database
SignalFx: Making Cassandra Perform as a Time Series Database
 
HBase at Flurry
HBase at FlurryHBase at Flurry
HBase at Flurry
 
Hadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_PlanHadoop Architecture_Cluster_Cap_Plan
Hadoop Architecture_Cluster_Cap_Plan
 
MongoDB Backup & Disaster Recovery
MongoDB Backup & Disaster RecoveryMongoDB Backup & Disaster Recovery
MongoDB Backup & Disaster Recovery
 
XNAT Tuning & Monitoring
XNAT Tuning & MonitoringXNAT Tuning & Monitoring
XNAT Tuning & Monitoring
 
Flume and Hadoop performance insights
Flume and Hadoop performance insightsFlume and Hadoop performance insights
Flume and Hadoop performance insights
 
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ...
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed StorageHBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
HBaseCon 2013: Apache HBase at Pinterest - Scaling Our Feed Storage
 
Hadoop architecture by ajay
Hadoop architecture by ajayHadoop architecture by ajay
Hadoop architecture by ajay
 
Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems Kafka on ZFS: Better Living Through Filesystems
Kafka on ZFS: Better Living Through Filesystems
 
HDFS Internals
HDFS InternalsHDFS Internals
HDFS Internals
 
Backup, Restore, and Disaster Recovery
Backup, Restore, and Disaster RecoveryBackup, Restore, and Disaster Recovery
Backup, Restore, and Disaster Recovery
 
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQLHBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
HBaseCon 2013: How (and Why) Phoenix Puts the SQL Back into NoSQL
 
Exadata下的数据并行加载、并行卸载及性能监控
Exadata下的数据并行加载、并行卸载及性能监控Exadata下的数据并行加载、并行卸载及性能监控
Exadata下的数据并行加载、并行卸载及性能监控
 
Using R on High Performance Computers
Using R on High Performance ComputersUsing R on High Performance Computers
Using R on High Performance Computers
 
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB,  or how we implemented a 10-times faster CassandraSeastar / ScyllaDB,  or how we implemented a 10-times faster Cassandra
Seastar / ScyllaDB, or how we implemented a 10-times faster Cassandra
 

Similar to Distributed Tracing, from internal SAAS insights

Splunk app for stream
Splunk app for stream Splunk app for stream
Splunk app for stream csching
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultDataWorks Summit
 
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINEKafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINEkawamuray
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Databricks
 
Threat hunting on the wire
Threat hunting on the wireThreat hunting on the wire
Threat hunting on the wireInfoSec Addicts
 
Using Istio to Secure & Monitor Your Services
Using Istio to Secure & Monitor Your ServicesUsing Istio to Secure & Monitor Your Services
Using Istio to Secure & Monitor Your ServicesAlcide
 
TechWiseTV Workshop: Catalyst Switching Programmability
TechWiseTV Workshop: Catalyst Switching ProgrammabilityTechWiseTV Workshop: Catalyst Switching Programmability
TechWiseTV Workshop: Catalyst Switching ProgrammabilityRobb Boyd
 
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)Jakub Botwicz
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals Vrushali Lanjewar
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...GeeksLab Odessa
 
Web Template Mechanisms in SOC Verification - DVCon.pdf
Web Template Mechanisms in SOC Verification - DVCon.pdfWeb Template Mechanisms in SOC Verification - DVCon.pdf
Web Template Mechanisms in SOC Verification - DVCon.pdfSamHoney6
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareDatabricks
 
Automated prevention of ransomware with machine learning and gpos
Automated prevention of ransomware with machine learning and gposAutomated prevention of ransomware with machine learning and gpos
Automated prevention of ransomware with machine learning and gposPriyanka Aash
 
SPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOs
SPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOsSPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOs
SPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOsRod Soto
 
Monitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusMonitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusFabian Reinartz
 
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get DiagnosticsBoris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get DiagnosticsShapeBlue
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Amazon Web Services
 

Similar to Distributed Tracing, from internal SAAS insights (20)

Splunk app for stream
Splunk app for stream Splunk app for stream
Splunk app for stream
 
Best practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at RenaultBest practices and lessons learnt from Running Apache NiFi at Renault
Best practices and lessons learnt from Running Apache NiFi at Renault
 
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINEKafka meetup JP #3 - Engineering Apache Kafka at LINE
Kafka meetup JP #3 - Engineering Apache Kafka at LINE
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - Introduction
 
Data Con LA 2022 Keynote
Data Con LA 2022 KeynoteData Con LA 2022 Keynote
Data Con LA 2022 Keynote
 
Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0Native Support of Prometheus Monitoring in Apache Spark 3.0
Native Support of Prometheus Monitoring in Apache Spark 3.0
 
Threat hunting on the wire
Threat hunting on the wireThreat hunting on the wire
Threat hunting on the wire
 
Using Istio to Secure & Monitor Your Services
Using Istio to Secure & Monitor Your ServicesUsing Istio to Secure & Monitor Your Services
Using Istio to Secure & Monitor Your Services
 
TechWiseTV Workshop: Catalyst Switching Programmability
TechWiseTV Workshop: Catalyst Switching ProgrammabilityTechWiseTV Workshop: Catalyst Switching Programmability
TechWiseTV Workshop: Catalyst Switching Programmability
 
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)
Cotopaxi - IoT testing toolkit (Black Hat Asia 2019 Arsenal)
 
Distributed Database practicals
Distributed Database practicals Distributed Database practicals
Distributed Database practicals
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
AI&BigData Lab 2016. Сарапин Виктор: Размер имеет значение: анализ по требова...
 
Web Template Mechanisms in SOC Verification - DVCon.pdf
Web Template Mechanisms in SOC Verification - DVCon.pdfWeb Template Mechanisms in SOC Verification - DVCon.pdf
Web Template Mechanisms in SOC Verification - DVCon.pdf
 
What's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You CareWhat's New in Apache Spark 2.3 & Why Should You Care
What's New in Apache Spark 2.3 & Why Should You Care
 
Automated prevention of ransomware with machine learning and gpos
Automated prevention of ransomware with machine learning and gposAutomated prevention of ransomware with machine learning and gpos
Automated prevention of ransomware with machine learning and gpos
 
SPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOs
SPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOsSPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOs
SPO2-T11_Automated-Prevention-of-Ransomware-with-Machine-Learning-and-GPOs
 
Monitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with PrometheusMonitoring a Kubernetes-backed microservice architecture with Prometheus
Monitoring a Kubernetes-backed microservice architecture with Prometheus
 
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get DiagnosticsBoris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
Boris Stoyanov - Troubleshooting the Virtual Router - Run and Get Diagnostics
 
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv
 

More from Huy Do

Some note about GC algorithm
Some note about GC algorithmSome note about GC algorithm
Some note about GC algorithmHuy Do
 
Engineering Efficiency in LINE
Engineering Efficiency in LINEEngineering Efficiency in LINE
Engineering Efficiency in LINEHuy Do
 
GOCON Autumn (Story of our own Monitoring Agent in golang)
GOCON Autumn (Story of our own Monitoring Agent in golang)GOCON Autumn (Story of our own Monitoring Agent in golang)
GOCON Autumn (Story of our own Monitoring Agent in golang)Huy Do
 
Story Writing Byte Serializer in Golang
Story Writing Byte Serializer in GolangStory Writing Byte Serializer in Golang
Story Writing Byte Serializer in GolangHuy Do
 
Akka と Typeの話
Akka と Typeの話Akka と Typeの話
Akka と Typeの話Huy Do
 
[Scalameetup]spark shuffle
[Scalameetup]spark shuffle[Scalameetup]spark shuffle
[Scalameetup]spark shuffleHuy Do
 
DI in ruby
DI in rubyDI in ruby
DI in rubyHuy Do
 
Itlc2015
Itlc2015Itlc2015
Itlc2015Huy Do
 
Consistent Hashingの小ネタ
Consistent Hashingの小ネタConsistent Hashingの小ネタ
Consistent Hashingの小ネタHuy Do
 
Thriftを用いた分散型のNyancatを作ってきた
Thriftを用いた分散型のNyancatを作ってきたThriftを用いた分散型のNyancatを作ってきた
Thriftを用いた分散型のNyancatを作ってきたHuy Do
 
NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]Huy Do
 
実践Akka
実践Akka実践Akka
実践AkkaHuy Do
 
CA15卒勉強会 メタプログラミングについて
CA15卒勉強会 メタプログラミングについてCA15卒勉強会 メタプログラミングについて
CA15卒勉強会 メタプログラミングについてHuy Do
 
Making CLI app in ruby
Making CLI app in rubyMaking CLI app in ruby
Making CLI app in rubyHuy Do
 
CacheとRailsの簡単まとめ
CacheとRailsの簡単まとめCacheとRailsの簡単まとめ
CacheとRailsの簡単まとめHuy Do
 
[Htmlday]present
[Htmlday]present[Htmlday]present
[Htmlday]presentHuy Do
 

More from Huy Do (16)

Some note about GC algorithm
Some note about GC algorithmSome note about GC algorithm
Some note about GC algorithm
 
Engineering Efficiency in LINE
Engineering Efficiency in LINEEngineering Efficiency in LINE
Engineering Efficiency in LINE
 
GOCON Autumn (Story of our own Monitoring Agent in golang)
GOCON Autumn (Story of our own Monitoring Agent in golang)GOCON Autumn (Story of our own Monitoring Agent in golang)
GOCON Autumn (Story of our own Monitoring Agent in golang)
 
Story Writing Byte Serializer in Golang
Story Writing Byte Serializer in GolangStory Writing Byte Serializer in Golang
Story Writing Byte Serializer in Golang
 
Akka と Typeの話
Akka と Typeの話Akka と Typeの話
Akka と Typeの話
 
[Scalameetup]spark shuffle
[Scalameetup]spark shuffle[Scalameetup]spark shuffle
[Scalameetup]spark shuffle
 
DI in ruby
DI in rubyDI in ruby
DI in ruby
 
Itlc2015
Itlc2015Itlc2015
Itlc2015
 
Consistent Hashingの小ネタ
Consistent Hashingの小ネタConsistent Hashingの小ネタ
Consistent Hashingの小ネタ
 
Thriftを用いた分散型のNyancatを作ってきた
Thriftを用いた分散型のNyancatを作ってきたThriftを用いた分散型のNyancatを作ってきた
Thriftを用いた分散型のNyancatを作ってきた
 
NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]NoSQL for great good [hanoi.rb talk]
NoSQL for great good [hanoi.rb talk]
 
実践Akka
実践Akka実践Akka
実践Akka
 
CA15卒勉強会 メタプログラミングについて
CA15卒勉強会 メタプログラミングについてCA15卒勉強会 メタプログラミングについて
CA15卒勉強会 メタプログラミングについて
 
Making CLI app in ruby
Making CLI app in rubyMaking CLI app in ruby
Making CLI app in ruby
 
CacheとRailsの簡単まとめ
CacheとRailsの簡単まとめCacheとRailsの簡単まとめ
CacheとRailsの簡単まとめ
 
[Htmlday]present
[Htmlday]present[Htmlday]present
[Htmlday]present
 

Recently uploaded

Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVshikhaohhpro
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendArshad QA
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AIABDERRAOUF MEHENNI
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...kellynguyen01
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 

Recently uploaded (20)

Optimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTVOptimizing AI for immediate response in Smart CCTV
Optimizing AI for immediate response in Smart CCTV
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Test Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and BackendTest Automation Strategy for Frontend and Backend
Test Automation Strategy for Frontend and Backend
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AISyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
SyndBuddy AI 2k Review 2024: Revolutionizing Content Syndication with AI
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
Short Story: Unveiling the Reasoning Abilities of Large Language Models by Ke...
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 
What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 

Distributed Tracing, from internal SAAS insights

  • 2. Calendar > Background > Our internal DistTrace problems > Searching for better solution > Conclusion
  • 3. > @dxhuy (フィ) > LINE’s Observability Team EM Introduction
  • 4. > Large scale metrics platform (400M datapoints per min) > ~~ Logs platform (1M log per min) > ~~ Dist Tracing platform (2M spans per min) Our team https://speakerdeck.com/line_devday2019/deep-dive-into-lines-time-series-database-flash
  • 5. > Large scale metrics platform (400M datapoints per min) > ~~ Logs platform (1M log per min) > ~~ Dist Tracing platform (2M spans per min) Our team
  • 6. > What is Distributed Tracing > Request scope DistTrace > Concepts Prerequisite know-how
  • 7. Elasticsearch Our setup Brave
 base client Zipkin base
 internal server Elasticsearch Elasticsearch Thrift Custom Armeria
 based collector https://github.com/line/armeria NvME based
 High spec machines
  • 8. Our setup > Inhouse customized Zipkin UI - Now became upstream default (zipkin-lens)
  • 9. Problems with OSS multi-tenant DistTracing > Storage cost + scalability > Standard war (instrument lib) > UI / UX for multi-tenant > User voice: high implement cost, useful spans be sampled out..
  • 10. Storage problem Infra cost Useful data If we sampling 
 100% data Infra cost for Trace >> Infra cost for app
  • 11. Sampling problem > Sampling to reduce storage cost > All OSS employ “head-based” sampling (unbiased sampling) - Many useful data will be sampling out - What is useful data btw?
  • 12. OSS UI / UX problem > Search by “serviceName” is hard when you have 100 teams and each team has 50 services > Time range query is mostly useless when you have 1000 rps
  • 13. Standard war Zipkin B3 header OpenTracing OpenCencus OpenTelemetry Language base (go, jvm) Middleware base (htrace..)
  • 14. How could we make it better?
  • 15. Trace without trace > The Mystery Machine: End-to-end performance analysis of large-scale Internet services - First trace solution at facebook (before Canopi) - Use “LOG” to calculate causal analysis http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.478.3942&rep=rep1&type=pdf
  • 16. Trace without trace (cont) > Canopy: An End-to-End Performance Tracing And Analysis System (Facebook) > From scratch solution (even trace API, trace standard) > Head based sampling (token bucket) https://research.fb.com/publications/canopy-end-to-end-performance-tracing-at-scale/
  • 17. Trace without Trace (cont) > Transparent tracing of microservice-based applications - Utilize Proxy for trace interception - Utilize Linux Syscall for trace instrumentation https://dl.acm.org/doi/abs/10.1145/3297280.3297403
  • 18. Unify standard > Universal context propagation for distributed system instrumentation - Propose universal “standard” for context data to overcome standard war https://dl.acm.org/doi/abs/10.1145/3190508.3190526
  • 19. Better sampling method > Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering - Tailed-based sampling method - Utilize Machine-Learning to overcome feature selection problem https://dl.acm.org/doi/abs/10.1145/3357223.3362736
  • 20. Better sampling method > Honeycomb refinery (DistTrace vendor) - Feature based, tail-based sampling - https://docs.honeycomb.io/working-with- your-data/tracing/refinery/ https://dl.acm.org/doi/abs/10.1145/3357223.3362736
  • 21. Better sampling method (cont) > Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering - Tailed-based sampling method - Utilize Machine-Learning to overcome feature selection problem https://dl.acm.org/doi/abs/10.1145/3357223.3362736
  • 22. What we could do Some insights from our company > Log is the most informative one of 3 pillars > Tail-based sampling seems the cure for “usefulness” of DistTrace service
  • 23. OSS move Firehorse mode of Zipkin and Jeager (100% sampling, skip indexing) > https://cwiki.apache.org/ confluence/display/ZIPKIN/ Firehose+mode > https://github.com/jaegertracing/ jaeger/issues/1731
  • 24. Propose architecture Firehose mode Trace Client Feature base
 Sampler Storage 
 Log Client Inject trace ID Sample-in -Error trace -High Latency trace Construct Trace by Buffer span in memory
  • 25. No more search UI for trace > User traverse to trace by trace ID only (no more time range base search) > User get trace ID from LOG search UI (you need search-able LOG, example: Kibana)