PinTrace
Distributed Tracing@Pinterest
Suman Karumuri
Proprietary and Confidential
Agenda
● About me
● What is distributed tracing?
● Why PinTrace?
● PinTrace architecture
● Challenges and lessons
● Contributions
● Q&A
About me
● Lead for the tracing effort at Pinterest.
● Former lead of Twitter Zipkin (open source distributed tracing project).
● Former engineer at Twitter, Facebook, Amazon, Yahoo and Goldman Sachs.
● Published papers on automatic trace instrumentation at Brown CS.
● Passionate about distributed tracing and distributed cloud infrastructure.
Distributed system
[Diagram: Client, Service 1, Service 2, Service 3]
Why distributed tracing?
10th Rule of Distributed System Monitoring:
“Any sufficiently complicated distributed system contains an ad-hoc, informally-specified, siloed implementation of causal tracing.”
- Rodrigo Fonseca
What is distributed tracing?
[Diagram: Client, Service 1, Service 2. Events logged for request r1, in timestamp order:]
ts1, r1, client req sent
ts2, r1, server req rcvd
ts3, r1, client req sent
ts4, r1, server req rcvd
ts5, r1, server resp sent
ts6, r1, client resp rcvd
ts7, r1, server resp sent
ts8, r1, client resp rcvd
Structured logging on steroids.
Annotation
[Diagram: Client, Service 1, Service 2. Events logged for request r1, in timestamp order; the first event is shown as the annotation CS:]
ts1, r1, CS
ts2, r1, server req rcvd
ts3, r1, client req sent
ts4, r1, server req rcvd
ts5, r1, server resp sent
ts6, r1, client resp rcvd
ts7, r1, server resp sent
ts8, r1, client resp rcvd
Timestamped event name with a structured payload.
Span
[Diagram: Client, Service 1, Service 2. Events for request r1 in timestamp order; the four annotations of span s1 carry a span id and an empty parent id:]
ts1, r1, s1, -, CS
ts2, r1, s1, -, SR
ts3, r1, client req sent
ts4, r1, server req rcvd
ts5, r1, server resp sent
ts6, r1, client resp rcvd
ts7, r1, s1, -, SS
ts8, r1, s1, -, CR
A logical unit of work captured as a set of annotations. Ex: a request/response pair.
Trace
[Diagram: Client, Service 1, Service 2. Columns: timestamp, request id, span id, parent span id, annotation:]
ts1, r1, s1, 0, CS
ts2, r1, s1, 0, SR
ts3, r1, s2, s1, CS
ts4, r1, s2, s1, SR
ts5, r1, s2, s1, SS
ts6, r1, s2, s1, CR
ts7, r1, s1, 0, SS
ts8, r1, s1, 0, CR
A DAG of spans that belong to the same request.
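The annotations on the Trace slide can be grouped into spans and linked through their parent ids. The following is a minimal Python sketch of that idea, assuming the column layout read off the slide (timestamp, request id, span id, parent span id, annotation). The names `Annotation`, `build_spans` and `children_of` are illustrative only and are not part of PinTrace or Zipkin. Timestamps ts1..ts8 are represented as the integers 1..8.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Annotation(NamedTuple):
    ts: int                   # timestamp (ts1..ts8 on the slide, as integers)
    trace_id: str             # request id, "r1" on the slide
    span_id: str
    parent_id: Optional[str]  # None for the root span (shown as 0 on the slide)
    event: str                # CS, SR, SS or CR


# The eight annotations from the Trace slide.
ANNOTATIONS = [
    Annotation(1, "r1", "s1", None, "CS"),
    Annotation(2, "r1", "s1", None, "SR"),
    Annotation(3, "r1", "s2", "s1", "CS"),
    Annotation(4, "r1", "s2", "s1", "SR"),
    Annotation(5, "r1", "s2", "s1", "SS"),
    Annotation(6, "r1", "s2", "s1", "CR"),
    Annotation(7, "r1", "s1", None, "SS"),
    Annotation(8, "r1", "s1", None, "CR"),
]


def build_spans(annotations):
    """Group annotations by span id; each span maps event name -> timestamp."""
    spans = defaultdict(dict)
    parents = {}
    for a in annotations:
        spans[a.span_id][a.event] = a.ts
        parents[a.span_id] = a.parent_id
    return dict(spans), parents


def children_of(parents):
    """Invert the parent links to get the edges of the trace graph."""
    children = defaultdict(list)
    for span_id, parent_id in parents.items():
        if parent_id is not None:
            children[parent_id].append(span_id)
    return dict(children)


spans, parents = build_spans(ANNOTATIONS)
edges = children_of(parents)
# The duration of a span as seen by its caller is CR - CS.
durations = {sid: ev["CR"] - ev["CS"] for sid, ev in spans.items()}
print(edges)       # {'s1': ['s2']}
print(durations)   # {'s1': 7, 's2': 3}
```

With only two spans the graph is a single edge from s1 to s2; a request that fans out to several services would produce several children under the same parent.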
Components of tracing infrastructure
Tracer: a piece of software that traces a request and generates spans.
Sampler: selects which requests to trace.
Reporter: gathers the spans from a tracer and sends them to the collector.
Span aggregation pipeline: a mechanism to transfer spans from the reporter to the collector.
Collector: a service that gathers, via the pipeline, the spans emitted by the various services.
Span storage: a backend used by the collector to store the spans.
Client/UI: an interface to search, access and visualize trace data.
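To make two of the components above concrete, here is a small Python sketch of a sampler and a reporter. It is an illustration under stated assumptions, not the PinTrace implementation: the class names `RateSampler` and `BufferingReporter`, the hash-based sampling decision, the batch size, and the in-memory `pipeline` list standing in for the span aggregation pipeline are all hypothetical.

```python
import zlib
from typing import Callable, Dict, List


class RateSampler:
    """Decides which requests to trace. Hashing the trace id (an assumption of
    this sketch) makes the decision repeatable for the same request."""

    def __init__(self, rate: float):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rate must be between 0 and 1")
        self.rate = rate

    def should_trace(self, trace_id: str) -> bool:
        bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
        return bucket < self.rate * 10_000


class BufferingReporter:
    """Gathers spans from a tracer and hands them to the pipeline in batches."""

    def __init__(self, send: Callable[[List[Dict]], None], batch_size: int = 2):
        self.send = send
        self.batch_size = batch_size
        self.buffer: List[Dict] = []

    def report(self, span: Dict) -> None:
        self.buffer.append(span)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []  # start a new list so the sent batch is not mutated


# Stand-in for the span aggregation pipeline: an in-memory list of batches.
pipeline: List[List[Dict]] = []
reporter = BufferingReporter(pipeline.append, batch_size=2)
sampler = RateSampler(rate=1.0)  # trace everything in this demo

for i in range(3):
    trace_id = f"r{i}"
    if sampler.should_trace(trace_id):
        reporter.report({"trace_id": trace_id, "span_id": "s1"})
reporter.flush()

print([len(batch) for batch in pipeline])  # [2, 1]
```

Keeping the sampling decision a pure function of the trace id means every service that sees the same id can reach the same decision without coordination, which is one common way to avoid partially sampled traces.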
Why PinTrace?
Motivation:
Success of project prestige, HBase debugging, Pinpoint.
Make the backend faster and cheaper. Speed => more engagement.
Loading the home feed involves ~50 backend services.
Uses of traces:
Understand what we built: service dependency graphs.
Understand where a request spent its time: for debugging, tuning, cost attribution.
Improve time to triage. Ex: What service caused this request to fail? Why is the search API slow after a recent deployment?
PinTrace architecture
[Architecture diagram. Components shown:]
Varnish, ngapi
Python service: Py thrift tracer, Py span logger
Java service(s): Java thrift tracer, Java span logger
Go service
MySQL, Memcached, Decider
Singer - Kafka pipeline
(Spark) Span aggregation: trace processing & storage
ES trace store
Zipkin UI, The Wall
Testing
Ensuring data quality:
Tracing infrastructure can be fragile since it has a lot of moving parts.
The more customized the pipeline, the harder it is to ensure data quality.
Use metrics and alerting to monitor the pipeline for correctness.
E2E monitoring: Sentinel
Traces a known request path periodically and checks the resulting trace for correctness.
The known request path should cover all known language/protocol combinations.
Measures end-to-end trace latency.
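The kind of check an end-to-end monitor could run on the trace of a known request path can be sketched as follows. This is not Sentinel's code. The span field names, the service names, the specific checks, and the reading of "end-to-end trace latency" as the delay between a request finishing and its trace becoming queryable are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def check_trace(spans: List[Dict], expected_services: Set[str]) -> List[str]:
    """Return the problems found in one trace; an empty list means it looks correct."""
    problems = []
    span_ids = {s["span_id"] for s in spans}
    roots = [s for s in spans if s["parent_id"] is None]
    if len(roots) != 1:
        problems.append(f"expected exactly 1 root span, found {len(roots)}")
    for s in spans:
        if s["parent_id"] is not None and s["parent_id"] not in span_ids:
            problems.append(f"span {s['span_id']} has missing parent {s['parent_id']}")
        if s["end"] < s["start"]:
            problems.append(f"span {s['span_id']} ends before it starts")
    missing = expected_services - {s["service"] for s in spans}
    if missing:
        problems.append(f"missing services: {sorted(missing)}")
    return problems


def trace_latency(request_finished_at: float, trace_visible_at: float) -> float:
    """Delay between the request finishing and its trace being queryable (assumed definition)."""
    return trace_visible_at - request_finished_at


# Hypothetical trace of the known request path.
good = [
    {"span_id": "s1", "parent_id": None, "service": "python-api", "start": 1, "end": 8},
    {"span_id": "s2", "parent_id": "s1", "service": "java-backend", "start": 3, "end": 6},
]
# Same path, but the child points at a span that never arrived.
broken = [
    good[0],
    {"span_id": "s2", "parent_id": "s9", "service": "java-backend", "start": 3, "end": 6},
]

print(check_trace(good, {"python-api", "java-backend"}))  # []
print(check_trace(broken, {"python-api", "java-backend", "go-backend"}))
print(trace_latency(100.0, 112.5))  # 12.5
```

Listing the expected services explicitly is what lets the check notice when one language/protocol combination silently stops emitting spans.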
Tracing tarpit
Collecting a lot of trace data that provides very few insights.
Spending more time scaling the trace collection infrastructure than providing value.
Using tracing when simpler methods would suffice.
Use simpler time series metrics for counting the number of API calls.
Tracing is expensive:
Higher dark latency compared to other methods.
Tracing infrastructure is expensive since we are dealing with an order of magnitude more data.
Our tracing philosophy
Tracing is not the solution to a problem; it’s a tool.
Build tools around traces to solve a problem.
Tracing should augment our time series metrics and logging platform.
Traces should only be used for computing distributed metrics.
Tracing infrastructure should be cheap and easy to run.
Quality of traces is more important than quantity of traces.
Do all processing and analysis of traces on ingestion; avoid post-processing.
Instrumentation challenges
Instrumentation is hard.
Instrumenting the framework is less brittle, agnostic to business logic and more reusable.
Even after instrumenting the framework, there will be snowflakes.
The more opinionated the framework, the easier it is to instrument. Ex: Java/Go vs Python.
Need instrumentation for every language/protocol combination.
Use a framework that is already enabled for tracing.
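The point about instrumenting the framework rather than each handler can be illustrated with a toy example. The sketch below is hypothetical throughout: `Framework`, `traced`, `RECORDED_SPANS` and the service name "home-feed" are made up for illustration and do not correspond to the PinTrace tracers or to any real web framework.

```python
import functools
import time
from typing import Callable, Dict, List

RECORDED_SPANS: List[Dict] = []  # stand-in for a span logger


def traced(service: str) -> Callable:
    """Wrap a request handler so every call records SR/SS annotations."""
    def decorator(handler: Callable) -> Callable:
        @functools.wraps(handler)
        def wrapper(request: Dict):
            span = {
                "trace_id": request.get("trace_id"),
                "service": service,
                "name": handler.__name__,
                "SR": time.monotonic(),  # server receive
            }
            try:
                return handler(request)
            except Exception as exc:
                span["error"] = repr(exc)
                raise
            finally:
                span["SS"] = time.monotonic()  # server send
                RECORDED_SPANS.append(span)
        return wrapper
    return decorator


class Framework:
    """Toy framework: routes are wrapped on registration, so handlers stay unaware of tracing."""

    def __init__(self, service: str):
        self.service = service
        self.routes: Dict[str, Callable] = {}

    def route(self, path: str) -> Callable:
        def register(handler: Callable) -> Callable:
            self.routes[path] = traced(self.service)(handler)
            return handler
        return register

    def dispatch(self, path: str, request: Dict):
        return self.routes[path](request)


app = Framework("home-feed")


@app.route("/feed")
def get_feed(request: Dict) -> str:
    # Business logic only; no tracing code here.
    return f"feed for {request['user']}"


result = app.dispatch("/feed", {"trace_id": "r1", "user": "alice"})
print(result)                     # feed for alice
print(RECORDED_SPANS[0]["name"])  # get_feed
```

Because the wrapping happens once in `route`, adding a new handler requires no tracing code, which is the sense in which framework-level instrumentation is agnostic to business logic.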
Deployment challenges
Deploying tracing at scale is a complex and challenging process.
Needs a company-wide span aggregation pipeline.
Enabling and deploying instrumentation across several Java/Python services is like herding cats.
Scaling the tracing backend.
Dealing with multiple stakeholders and doing things the “right” way.
Can’t see its benefits or ensure data quality until it is fully deployed.
Do deployments along key request paths first for best results.
Lessons learned
User education is very important.
Most people use tracing for needle-in-a-haystack problems, and SREs get tracing, but it is still an esoteric concept even for good engineers.
Explain the use cases where tracing applies: insights into performance bottlenecks or global visibility.
Tracing landscape is confusing.
The distributed tracing/Zipkin landscape is rapidly evolving and can be confusing.
Zipkin UI has some rough edges.
Lessons learned (contd)
Data quality
For identifying performance bottlenecks from traces, relative durations are most important.
When deployed in the right order, even partial tracing is useful.
Trace errors are OK when they are in the leaves.
Tracing infrastructure
Tracing infrastructure is a Tier 2 service in almost all companies.
Tracing is expensive.
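One way to read the remark on relative durations is that a bottleneck shows up as the span holding the largest share of the request's total time, whatever the absolute clock values are. The Python sketch below computes that share per span. The function name `self_time_shares`, the span fields and the sample numbers are hypothetical, and the sketch assumes the children of a span do not overlap in time.

```python
from collections import defaultdict
from typing import Dict, List


def self_time_shares(spans: List[Dict]) -> Dict[str, float]:
    """Fraction of the root span's duration spent in each span itself,
    excluding time spent in its direct children (assumed non-overlapping)."""
    durations = {s["span_id"]: s["end"] - s["start"] for s in spans}
    child_time = defaultdict(int)
    root_id = None
    for s in spans:
        if s["parent_id"] is None:
            root_id = s["span_id"]
        else:
            child_time[s["parent_id"]] += durations[s["span_id"]]
    if root_id is None:
        raise ValueError("trace has no root span")
    total = durations[root_id]
    if total <= 0:
        raise ValueError("root span has no duration")
    return {sid: (d - child_time[sid]) / total for sid, d in durations.items()}


# Hypothetical trace: a root with two children, one of which has its own child.
trace = [
    {"span_id": "root", "parent_id": None, "start": 0, "end": 100},
    {"span_id": "a", "parent_id": "root", "start": 10, "end": 40},
    {"span_id": "b", "parent_id": "root", "start": 50, "end": 90},
    {"span_id": "c", "parent_id": "b", "start": 60, "end": 80},
]

shares = self_time_shares(trace)
print(shares)  # {'root': 0.3, 'a': 0.3, 'b': 0.2, 'c': 0.2}
```

Only differences between timestamps within a span enter the result, so a constant clock offset on one host does not change the shares.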
Wins
● Identified that we use a really old version of the finagle-memcache client that is blocking the finagle upgrade.
● Identified ~7% of Java code as dead code and deleted 20KLoC so far.
● First company-wide log/span aggregation pipeline.
● Identified a synchronous MySQL client; now moving to an asynchronous one.
● Local Zipkin setup: debugging HBase latency issues.
Future work
● Short term
○ Finish Python instrumentation.
○ Open source the Spark backend.
○ Robust and scalable backend:
■ Trace all employee requests by default.
■ Make it easy to look at trace data for a request in the Pinterest app and web UI.
● Medium term
○ End-to-end traces to measure user-perceived wait time. Ex: Mobile/Browser -> Java/Python/Go -> MySQL/Memcached/HBase.
○ Apply tracing to other use cases like Jenkins build times.
○ Improve Zipkin UI.
Q&A
Thank you!
skarumuri@pinterest.com
Btw, we are hiring!

PinTrace Advanced AWS meetup
