6. Numbers
• > 30 billion Pins
• 10 billion messages-a-day logged to Kafka
• 10 petabytes of data in S3
• Ingest 20 terabytes of new data each day
• Petabyte-a-day processed in Hadoop
• 6 Hadoop clusters of 3000 nodes in AWS
• Over 100 regular users running over 2,000 jobs each day
9. Roadmap
• Switch to Kafka 0.8 for all data streams
• Invest in scalable stream processing for realtime insights and products
• Migrate to a robust Hadoop 2.0 platform
• Experiment with Spark, especially for machine learning
• Unified batch and stream compute framework
13. Singer Logging Agent
• Simple logging mechanism for applications
  • Decouple applications from log repository
  • Existing applications that log to disk
• Isolate applications from Singer agent failure
• Isolate applications from log repository failure
  • Avoid internal buffering and log loss
• Better resource usage
  • Connection consolidation
  • Flexible batching
14. Singer Features
• At-least-once delivery
• Configurable adaptive log latency by periodic tailing
• Dynamically discover new log streams
• Dynamically pick up new log configuration
• Pluggable log stream reader
• Pluggable log stream writer
• Rich set of stats via Ostrich
18. Log Stream Processor
[Flowchart: processing a batch. Refresh LogStream; load position and seek reader; the Reader returns the next batch; if not EOS, the Writer processes the batch and the position is committed; on EOS, update stats, calculate next processing time, and schedule next processing cycle. Steps marked "Abort on exception" include loading the position and processing the batch.]
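The steps named in the flowchart can be sketched as a single processing cycle. This is only a minimal illustration under assumptions: the `Reader`, `Writer`, and `PositionStore` interfaces and the `runCycle` method are hypothetical names, not Singer's actual API, and the ordering (write the batch, then commit the position) is inferred from the at-least-once delivery goal.

```java
import java.util.List;

/** Hypothetical sketch of one log stream processing cycle; not Singer's real classes. */
final class LogStreamCycle {
    interface Reader {
        void seek(long position) throws Exception;
        List<byte[]> nextBatch() throws Exception; // assumption: an empty list signals end of stream
        long position();
    }

    interface Writer {
        void write(List<byte[]> batch) throws Exception;
    }

    interface PositionStore {
        long load() throws Exception;
        void commit(long position) throws Exception;
    }

    /**
     * Runs one cycle and returns the number of messages delivered.
     * Any exception propagates to the caller, which aborts the cycle.
     */
    static int runCycle(Reader reader, Writer writer, PositionStore store) throws Exception {
        reader.seek(store.load());            // load position and seek reader
        int delivered = 0;
        while (true) {
            List<byte[]> batch = reader.nextBatch();
            if (batch.isEmpty()) {            // EOS: nothing more to read in this cycle
                return delivered;
            }
            writer.write(batch);              // process batch
            store.commit(reader.position());  // commit only after a successful write
            delivered += batch.size();
        }
    }
}
```

Committing after the write means a crash between the two steps replays the last batch on restart, which duplicates messages but does not lose them.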
19. Adaptive Log Processing Interval
• No message: next cycle = min(MaxInterval, 2 * current interval)
• > 1 messages: next cycle = MinInterval
• Interval range: [MinInterval, MaxInterval]
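The backoff rule on this slide can be sketched in a few lines. The class and method names below are illustrative assumptions, and the sketch treats any cycle that read at least one message as the "messages seen" case, which is one reading of the slide's "> 1 messages" label.

```java
import java.time.Duration;

/** Hypothetical helper for the adaptive processing interval; names are not from Singer. */
final class AdaptiveInterval {
    private final Duration minInterval;
    private final Duration maxInterval;
    private Duration current;

    AdaptiveInterval(Duration minInterval, Duration maxInterval) {
        if (minInterval.isZero() || minInterval.isNegative()
                || minInterval.compareTo(maxInterval) > 0) {
            throw new IllegalArgumentException("require 0 < MinInterval <= MaxInterval");
        }
        this.minInterval = minInterval;
        this.maxInterval = maxInterval;
        this.current = minInterval;
    }

    /** Returns the delay before the next cycle, given how many messages this cycle read. */
    Duration next(int messagesRead) {
        if (messagesRead > 0) {
            // Assumption: any message resets the interval to the minimum.
            current = minInterval;
        } else {
            // No message: double the interval, capped at MaxInterval.
            Duration doubled = current.multipliedBy(2);
            current = doubled.compareTo(maxInterval) > 0 ? maxInterval : doubled;
        }
        return current;
    }
}
```

With MinInterval = 1 s and MaxInterval = 5 s, three consecutive idle cycles yield delays of 2 s, 4 s, and 5 s, and the next cycle that reads messages drops the delay back to 1 s.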
21. Log Message
Envelope Thrift message passed between Reader and Writer:
• key (binary): Uninterpreted binary used to co-locate messages. Example: a session id, so that all log entries in the session are on the same partition. No serde cost.
• timestamp: nanoseconds
• message (binary): Uninterpreted binary data. Examples: text log line, Thrift message, or file path. No serde cost.
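As a rough illustration of the envelope, the sketch below models the three fields as a plain Java class and shows how an opaque key could map messages to a partition. The class name, the `partitionFor` helper, and the hash-based partitioning are assumptions made for illustration; the slide only states that the key is used for co-location.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical in-memory view of the envelope; the real type is a Thrift struct. */
final class LogMessageSketch {
    final byte[] key;          // opaque bytes, never decoded by the agent
    final long timestampNanos; // timestamp in nanoseconds
    final byte[] message;      // opaque payload: text line, Thrift bytes, or a file path

    LogMessageSketch(byte[] key, long timestampNanos, byte[] message) {
        this.key = key;
        this.timestampNanos = timestampNanos;
        this.message = message;
    }

    /** Assumption: partition chosen by hashing the raw key bytes. */
    static int partitionFor(byte[] key, int numPartitions) {
        return Math.floorMod(Arrays.hashCode(key), numPartitions);
    }

    static LogMessageSketch ofText(String sessionId, long timestampNanos, String line) {
        return new LogMessageSketch(
                sessionId.getBytes(StandardCharsets.UTF_8),
                timestampNanos,
                line.getBytes(StandardCharsets.UTF_8));
    }
}
```

Because both fields stay as raw bytes, the agent can forward a message without deserializing it, and two entries carrying the same session id always resolve to the same partition.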
22. Log Position
● Caching can give wrong byte offset
● Implement a generic buffered Java InputStream which tracks byte offsets
● Restrictions: Reader should not cache or read-ahead.
Position fields:
● LogFile (inode): next log file to read from
● byteOffset (byte offset from head of file): next byte to read from the file
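One way to keep the byte offset exact is to count bytes at the point where the caller consumes them, rather than where an underlying buffer fills. The sketch below is an illustrative assumption, not Singer's implementation: `OffsetTrackingInputStream` and `byteOffset()` are hypothetical names, and the class is meant to wrap whatever buffered stream is in use so that read-ahead below it does not distort the reported offset.

```java
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical stream wrapper that reports how many bytes the caller has consumed. */
final class OffsetTrackingInputStream extends FilterInputStream {
    private long offset;

    OffsetTrackingInputStream(InputStream in, long startOffset) {
        super(in);
        this.offset = startOffset;
    }

    @Override
    public int read() throws IOException {
        int b = super.read();
        if (b >= 0) {
            offset++; // count only bytes actually handed to the caller
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            offset += n;
        }
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        long skipped = super.skip(n);
        offset += skipped;
        return skipped;
    }

    // mark/reset would rewind the stream without rewinding the counter, so disable them.
    @Override
    public boolean markSupported() {
        return false;
    }

    @Override
    public synchronized void mark(int readlimit) {
        // intentionally a no-op
    }

    @Override
    public synchronized void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    /** Offset from the head of the file of the next byte the caller will read. */
    long byteOffset() {
        return offset;
    }
}
```

The restriction on the slide still applies to anything layered on top: a reader that pulls bytes from this stream ahead of the messages it has emitted would again report an offset that is too large.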
23. Log Rotation
1. Use the inode to identify a log file.
2. Check the inode<->filename mapping when opening a file by name.
[Figure: log files log, log.1, ..., log.7 with inodes 10-17 before rotation; after rotation the files have inodes 10-16 and a new inode 18.]
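The inode check can be sketched with `java.nio.file`. This is a hedged example: the `InodeLookup` class and its methods are hypothetical, and the `"unix:ino"` attribute is specific to JDK implementations on POSIX-like platforms (it is not available on Windows), so treat its use here as an assumption about the runtime.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helpers for identifying a log file by inode rather than by name. */
final class InodeLookup {
    /** Reads the inode number through the platform-specific "unix" attribute view. */
    static long inodeOf(Path path) throws IOException {
        return ((Number) Files.getAttribute(path, "unix:ino")).longValue();
    }

    /** True if the file currently reachable under this name is the one recorded earlier. */
    static boolean stillSameFile(Path path, long expectedInode) throws IOException {
        return Files.exists(path) && inodeOf(path) == expectedInode;
    }
}
```

After a rotation renames `log` to `log.1`, `stillSameFile` on the old name returns false for the recorded inode, which tells the reader to look for that inode under another name before resuming from the saved byte offset.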
28. Singer Deployment
• Debian package: part of base image?
• Dynamic configuration update through Puppet
• Resource footprint enforced
• Rich stats exported through Ostrich to OpenTSDB