Singer, Pinterest's Logging Infrastructure
Krishna Gade
Data Engineering Manager
Discover Pinterest
Big Data and Apache Mesos
Connor Doyle
Mesosphere
Roger Wang
Pinterest
Bernardo Gomez Palacio
Guavus
Pinterest is a data product.
A/B Experimentation
Promoted Pins
Product Insights
Spam Control
Related Pins
Home Feed
Search Quality
DATA
Numbers
• > 30 billion Pins
• 10 billion messages-a-day logged to Kafka
• 10 petabytes of data in S3
• Ingest 20 terabytes of new data each day
• Petabyte-a-day processed in Hadoop
• 6 Hadoop clusters of 3000 nodes in AWS
• Over 100 regular users running over 2,000 jobs each day
4x Data Growth
Data Architecture Overview
pins
repins, likes
impressions
Kafka
App
Storm
Hadoop
Singer
HBase
Redshift
Insights
Features
Roadmap
• Switch to Kafka 0.8 for all data streams
• Invest in scalable stream processing for realtime insights and products
• Migrate to a robust Hadoop 2.0 platform
• Experiment with Spark, especially for machine learning
• Unified batch and stream compute framework
Roger Wang
Software Engineer
Singer
A High-Performance Logging Infrastructure
Logging Infrastructure before Singer
[Diagram components: hosts running kafka agents, Kafka cluster, Storm, Kafka consumer, Kafka copier, S3, Hadoop cluster]
Logging Infrastructure with Singer
[Diagram components: hosts running kafka agents or the Singer agent, Kafka cluster, Storm, Kafka consumer, Secor, S3, Hadoop cluster]
Singer Logging Agent
•Simple logging mechanism for applications
• Decouple applications from log repository
• Support existing applications that log to disk
•Isolate applications from Singer agent failure
•Isolate applications from log repository failure
• Avoid internal buffering and log loss
•Better resource usage
• Connection consolidation
• Flexible batching
Singer Features
•At-least-once delivery
•Configurable adaptive log latency through periodic tailing
•Dynamically discover new log streams
•Dynamically pick up new log configuration
•Pluggable log stream reader
•Pluggable log stream writer
•Rich set of stats via Ostrich
Singer Architecture
[Diagram components: LogStream monitor; configuration watcher reading the log configuration; LogStream processors for streams A-1, A-2, B-1 and C-1, each a Reader/Writer pair; log repository]
Singer Concepts and Components
•LogStream/LogFile
•LogPosition
•LogStreamMonitor
•LogStreamProcessor
•LogStreamReader/LogFileReader
•LogStreamWriter
Log Stream Monitor
[Diagram labels: LogStream monitor (periodic task); LogStream registrar; log streams A-1, B-1, B-2 and an empty log stream; processors with their stats]
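A minimal Python sketch of the discovery step a periodic monitor task might run. The directory scan, the `.log` suffix, and the name `discover_new_streams` are illustrative assumptions, not Singer's actual code.

```python
import os
from typing import List, Set


def discover_new_streams(log_dir: str, known: Set[str],
                         suffix: str = ".log") -> List[str]:
    """Return log streams that appeared since the last scan and register them.

    `known` plays the role of the registrar; the suffix filter is a
    hypothetical naming convention.
    """
    new_streams = []
    for name in sorted(os.listdir(log_dir)):
        if name.endswith(suffix) and name not in known:
            known.add(name)           # register so it is reported only once
            new_streams.append(name)  # caller would start a processor for it
    return new_streams
```

Calling the function on every tick of a timer would pick up streams created while the agent is running, without a restart.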
Log Stream Processor
[Flowchart, one processing cycle:
1. Refresh LogStream (abort on exception).
2. Load position and seek reader (abort on exception).
3. Read the next batch. If not EOS, process the batch (abort on exception) and repeat.
4. On EOS, update stats, calculate the next processing time, and schedule the next processing cycle.
Processing a batch: Reader -> Writer -> commit position.]
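The cycle on this slide can be sketched in Python as follows. All names here (`run_cycle` and the four callbacks) are hypothetical, and the in-memory list stands in for a log file. The point of the sketch is the ordering: the position is committed only after the writer succeeds, which is what gives at-least-once delivery.

```python
from typing import Callable, List, Tuple

ReadBatch = Callable[[int, int], Tuple[List[bytes], int]]


def run_cycle(read_batch: ReadBatch,
              write_batch: Callable[[List[bytes]], None],
              load_position: Callable[[], int],
              commit_position: Callable[[int], None],
              batch_size: int = 2) -> int:
    """Process batches until end of stream; return the number of messages written."""
    position = load_position()           # resume from the last committed position
    written = 0
    while True:
        messages, next_position = read_batch(position, batch_size)
        if not messages:                 # end of stream (EOS): cycle is done
            return written
        write_batch(messages)            # an exception here aborts the cycle
        commit_position(next_position)   # commit only after a successful write
        position = next_position
        written += len(messages)


if __name__ == "__main__":
    log = [b"m0", b"m1", b"m2", b"m3", b"m4"]
    state = {"position": 0}
    sink: List[bytes] = []

    def read_batch(position: int, n: int) -> Tuple[List[bytes], int]:
        batch = log[position:position + n]
        return batch, position + len(batch)

    count = run_cycle(read_batch, sink.extend,
                      lambda: state["position"],
                      lambda p: state.update(position=p))
    print(count, state["position"], sink)
```

If the process dies after `write_batch` returns but before `commit_position` runs, the next cycle re-reads and re-sends that batch, so downstream consumers may see duplicates but no message is skipped.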
Adaptive Log Processing Interval
No message: next cycle = min(MaxInterval, 2 * current interval)
One or more messages: next cycle = MinInterval
Interval range: [MinInterval, MaxInterval]
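The interval rule above fits in a few lines of Python. The function and parameter names are illustrative, and the sketch assumes the current interval already lies within [MinInterval, MaxInterval].

```python
def next_interval(current: float, messages_read: int,
                  min_interval: float, max_interval: float) -> float:
    """Return the delay before the next processing cycle."""
    if messages_read > 0:
        # The stream is active: poll again as soon as allowed.
        return min_interval
    # The stream is idle: double the interval, capped at max_interval.
    return min(max_interval, 2 * current)


if __name__ == "__main__":
    interval = 1.0
    for messages_read in [0, 0, 0, 0, 5, 0]:
        interval = next_interval(interval, messages_read,
                                 min_interval=1.0, max_interval=8.0)
        print(messages_read, interval)
```

An idle stream is tailed less and less often, while the first message seen resets the stream to the lowest latency.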
Pluggable Log Stream Reader
LogFileReader LogMessage with LogPosition
LogMessage: {key: <binary>; timestamp: <timestamp>; message: <binary>}
LogPosition: inode + byte offset
Log Message
Envelope Thrift message passed between Reader and Writer:
key        binary    Uninterpreted binary used to co-locate messages. Example: a session id, so that all log entries in the session are on the same partition. No serde cost.
timestamp  nanosecs
message    binary    Uninterpreted binary data. Examples: text log line, Thrift message, or file path. No serde cost.
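A rough Python rendering of the envelope, with a toy example of how an uninterpreted key can keep a session on one partition. The field name `timestamp_ns`, the function `partition_for`, and the use of CRC32 are assumptions made for illustration. They are not Singer's Thrift definition or Kafka's partitioner.

```python
import time
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class LogMessage:
    key: bytes         # uninterpreted; only used to co-locate messages
    timestamp_ns: int  # nanoseconds
    message: bytes     # uninterpreted payload (log line, Thrift bytes, path)


def partition_for(msg: LogMessage, num_partitions: int) -> int:
    """Map a message to a partition from its key bytes alone.

    CRC32 is an illustrative hash choice; the payload is never parsed.
    """
    return zlib.crc32(msg.key) % num_partitions


if __name__ == "__main__":
    first = LogMessage(b"session-42", time.time_ns(), b"GET /pin/1")
    second = LogMessage(b"session-42", time.time_ns(), b"GET /pin/2")
    print(partition_for(first, 8) == partition_for(second, 8))
```

Because both key and payload stay as raw bytes, the agent pays no serialization or deserialization cost for the content it forwards.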
Log Position
● Caching can give wrong byte offset
● Implement a generic buffered Java InputStream which tracks byte offsets
● Restrictions: Reader should not cache or read-ahead.
LogFile     inode                          next log file to read from
byteOffset  byte offset from head of file  next byte to read from the file
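The slide describes a buffered Java InputStream that tracks byte offsets. The same idea is sketched here in Python rather than Java, with hypothetical names: the reader keeps its own buffer and reports the offset of the next unconsumed byte, which can differ from the position of the underlying file once data has been read ahead.

```python
import io
from typing import BinaryIO, Optional


class OffsetTrackingReader:
    """Buffered line reader that tracks the offset of the next unconsumed byte."""

    def __init__(self, raw: BinaryIO, start_offset: int = 0,
                 chunk_size: int = 4096) -> None:
        self._raw = raw
        self._raw.seek(start_offset)
        self._buf = b""
        self._offset = start_offset   # offset of the first byte not yet returned
        self._chunk_size = chunk_size

    @property
    def offset(self) -> int:
        return self._offset

    def read_line(self) -> Optional[bytes]:
        """Return the next complete line, or None if no complete line is available."""
        while True:
            newline = self._buf.find(b"\n")
            if newline >= 0:
                line = self._buf[:newline + 1]
                self._buf = self._buf[newline + 1:]
                self._offset += len(line)   # advance only past consumed bytes
                return line
            chunk = self._raw.read(self._chunk_size)
            if not chunk:
                return None                 # partial line stays buffered
            self._buf += chunk


if __name__ == "__main__":
    raw = io.BytesIO(b"a\nbb\nccc")
    reader = OffsetTrackingReader(raw, chunk_size=4)
    print(reader.read_line(), reader.offset, raw.tell())
```

In the demo the first line ends at offset 2 while the underlying stream has already advanced to 4, which is the kind of mismatch the slide warns about when a reader caches or reads ahead on its own.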
Log Rotation
[Figure: log files log, log.1 ... log.7 before and after rotation; inodes 10 to 17 before, inodes 10 to 16 and 18 after]
1. Use the inode to identify a log file.
2. Check the inode<->filename mapping when opening a file by name.
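Both rules can be illustrated with Python's `os.stat` and `os.fstat`. This is a sketch under assumptions: the function names and the filename-prefix convention are hypothetical, and it relies on a rename keeping the inode, as on typical Linux filesystems.

```python
import os
from typing import BinaryIO, Optional


def find_path_by_inode(log_dir: str, prefix: str, inode: int) -> Optional[str]:
    """Return the current path of the file with this inode, if it still exists."""
    for name in sorted(os.listdir(log_dir)):
        if not name.startswith(prefix):
            continue
        path = os.path.join(log_dir, name)
        try:
            if os.stat(path).st_ino == inode:
                return path
        except FileNotFoundError:
            continue   # file was rotated away between listdir and stat
    return None


def open_if_inode_matches(path: str, inode: int) -> Optional[BinaryIO]:
    """Open by name, then verify the open file is the one we expected."""
    f = open(path, "rb")
    if os.fstat(f.fileno()).st_ino != inode:
        f.close()      # the name now points at a different file
        return None
    return f
```

After a rotation renames `app.log` to `app.log.1`, a saved position that stores the inode still resolves to the right file, whereas a position that stored only the name would point at the new, empty `app.log`.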
Duplicate inodes
[Figure: log files log, log.1 ... log.7 and their inode numbers before and after rotation]
Skip the cycle to wait for log rotation.
Log File Reader Caveats
Corrupted block
Partial LogMessage
Log File Reader is kept open between processing cycles to avoid the cost of reopening the file
Pluggable Log Stream Writer
•Writer interprets LogMessage
•Examples:
• Log archiver interprets the message as a file path
• Kafka writer creates a Kafka message without deserializing the content in the envelope.
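One way to picture the pluggable writer is a small interface with two implementations, sketched below in Python. Every name is hypothetical, and the pass-through writer only collects bytes in a list where a real one would hand them to a Kafka producer.

```python
from dataclasses import dataclass
from typing import List, Protocol, Tuple


@dataclass(frozen=True)
class LogMessage:
    key: bytes
    timestamp_ns: int
    message: bytes


class LogStreamWriter(Protocol):
    def write(self, messages: List[LogMessage]) -> None: ...


class ArchivePathWriter:
    """Interprets each message body as a UTF-8 file path to archive."""

    def __init__(self) -> None:
        self.paths: List[str] = []

    def write(self, messages: List[LogMessage]) -> None:
        for m in messages:
            self.paths.append(m.message.decode("utf-8"))


class PassThroughWriter:
    """Forwards key and payload bytes untouched (stand-in for a Kafka writer)."""

    def __init__(self) -> None:
        self.sent: List[Tuple[bytes, bytes]] = []

    def write(self, messages: List[LogMessage]) -> None:
        for m in messages:
            self.sent.append((m.key, m.message))   # no deserialization


def deliver(writer: LogStreamWriter, messages: List[LogMessage]) -> None:
    """The processor only depends on the writer interface."""
    writer.write(messages)
```

The same envelope flows through either writer; only the writer decides what the payload bytes mean.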
Log Configuration
Puppet master
Watcher
Restart Singer on change
puppet agent
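A watcher of this kind could detect a changed configuration file by comparing content digests, as in the Python sketch below. The class name, the digest approach, and the idea of polling are assumptions. The slide only states that a watcher restarts Singer when the configuration changes.

```python
import hashlib


def file_digest(path: str) -> str:
    """Return the SHA-256 hex digest of the file's content."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


class ConfigWatcher:
    """Reports when the watched file's content differs from the last check."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._digest = file_digest(path)

    def changed(self) -> bool:
        """Return True once for each content change since the previous call."""
        digest = file_digest(self._path)
        if digest == self._digest:
            return False
        self._digest = digest
        return True
```

A periodic task could call `changed()` and trigger the restart when it returns True. Comparing content rather than modification time avoids restarts when Puppet rewrites an identical file.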
Singer Deployment
•Debian package: part of base image?
•Dynamic configuration update through Puppet
•Resource footprint enforced
•Rich stats exported through Ostrich to OpenTSDB
Alternatives
•Scribe
•Logstash
•…
What’s next?
•Resilient file format so that we can skip corrupted blocks
•Pluggable log processing policy