This presentation discusses how logs and stream-processing can form a backbone for data flow, ETL, and real-time data processing. It will describe the challenges and lessons learned as LinkedIn built out its real-time data subscription and processing infrastructure. It will also discuss the role of real-time processing and its relationship to offline processing frameworks such as MapReduce.
6. Three principles
1. One pipeline to rule them all
2. Stream processing >> messaging
3. Clusters not servers
7. Characteristics
• Scalability of a filesystem
– Hundreds of MB/sec/server throughput
– Many TB per server
• Guarantees of a database
– Messages strictly ordered
– All data persistent
• Distributed by default
– Replication
– Partitioning model
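The characteristics above can be sketched as a tiny in-memory log: an append-only, strictly ordered sequence of records indexed by offset. This is an illustrative sketch with hypothetical names, not Kafka's API; real Kafka adds persistence, replication, and partitioning on top of this abstraction.

```python
class Log:
    """Minimal sketch of an append-only, strictly ordered log."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Appends preserve arrival order; the offset is the record's
        # permanent position in the log.
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset, max_records=10):
        # Consumers read sequentially from an offset, in strict order.
        return self._records[offset:offset + max_records]

log = Log()
assert log.append({"user": "alice", "action": "login"}) == 0
assert log.append({"user": "bob", "action": "click"}) == 1
assert log.read(0)[0]["user"] == "alice"  # ordering preserved
```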
8. Kafka At LinkedIn
• 175 TB of in-flight log data per colo
• Low-latency: ~1.5 ms
• Replicated to each datacenter
• Tens of thousands of data producers
• Thousands of consumers
• 7 million messages written/sec
• 35 million messages read/sec
• Hadoop integration
9. Open source
• Apache Software Foundation
• Very healthy usage outside LinkedIn
• Broad base of committers
• 30 clients in 15 languages
• Great ecosystem of supporting tools
10. The Plan
1. Apache Kafka
2. Logs and Distributed Systems
3. Logs and Data Integration
4. Logs and Stream Processing
Who are you? What is this talk about?
What is a log and what is it good for?
Exciting topic.
Producers, consumers distributed
First project was an open source clone of Amazon Dynamo (Project Voldemort).
Makes explaining things easier.
1 pipeline for database data.
1 pipeline for metrics.
1 pipeline for events.
1 pipeline for real-time processing.
No pipeline for application logs.
300 ActiveMQ brokers.
200 Kafka-related projects on GitHub.
1000+ emails/month.
The log is the fundamental abstraction Kafka provides.
You can use a log as a drop-in replacement for a messaging system, but it can also do a lot more.
What is a log? Traditional uses? Non-traditional uses…
Time ordered. Semi-structured.
List of changes.
Contents of a record don’t matter.
Indexed by “time”.
Not an application log (i.e. a text file).
Data model of Kafka: a topic.
Partitions can be spread over machines, replicated.
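The topic/partition data model can be sketched as a set of partitioned logs where a record's key determines its partition, so all records for one key stay in one ordered log. The hashing below is illustrative (Kafka's actual default partitioner differs in detail), and the names are hypothetical.

```python
import zlib

class Topic:
    """Sketch of a topic: a fixed set of partitioned logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def produce(self, key, value):
        # Hash the key to pick a partition; the same key always lands in
        # the same partition, so per-key ordering is preserved.
        p = zlib.crc32(key.encode()) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

topic = Topic(num_partitions=4)
p1 = topic.produce("user-42", "login")
p2 = topic.produce("user-42", "logout")
assert p1 == p2  # both records live in one partition, in order
```

In a real deployment, the partitions (not the topic as a whole) are the unit of parallelism, spreading and replication across machines.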
The whole system is one big distributed system
Paxos, Zookeeper (Zab), Raft, etc.
Traditional databases, HBase/Bigtable, Spanner, HDFS namenode, etc.
Log has two purposes: replication and consistency.
Very important problem
What if a replica is down?
Ordering is important. Time is important.
Log is a list of changes.
Key point: can re-create any point in time.
In banking: credits and debits.
In software: the version control changelog.
State-machine replication: log the incoming requests (logical logging)
Log the changed rows (physical logging)
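The two replication styles can be sketched side by side: logical logging records the incoming commands and the replica re-executes them, while physical logging records the resulting row values and the replica just overwrites. Either way, replaying the log from the start re-creates the state. Function names and the banking domain below are illustrative.

```python
def apply_logical(state, log):
    # Logical logging: each entry is a command the replica re-executes.
    for op, key, amount in log:
        if op == "credit":
            state[key] = state.get(key, 0) + amount
        elif op == "debit":
            state[key] = state.get(key, 0) - amount
    return state

def apply_physical(state, log):
    # Physical logging: each entry is the changed row itself.
    for key, new_value in log:
        state[key] = new_value
    return state

logical_log = [("credit", "alice", 100), ("debit", "alice", 30)]
physical_log = [("alice", 100), ("alice", 70)]
# Both logs describe the same history, so both replicas converge.
assert apply_logical({}, logical_log) == apply_physical({}, physical_log)
```

The trade-off: logical logs are compact but require deterministic replay; physical logs are larger but trivially safe to apply.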
Outside of distributed systems internals…
AKA ETL.
Many systems. Event data.
Most important problem for data-centric companies.
Integration >> ML.
Two exacerbating factors
One-size fits all
Database cache coherency.
Data deployment from Hadoop.
Never get to full connectivity.
Metcalfe’s law.
All data in multi-subscriber, real-time logs.
The company is a big distributed system.
Batch is the dominant paradigm for data processing. Why?
First thing you want when you have real-time data streams is real-time transformations.
1790 census: 3,929,214 people, $44k.
Data collected by horses and wagons, a high-latency, batch channel.
Networks => stream processing.
Service: one input = one output.
Batch job: all inputs = all outputs.
Stream computing: any window = output for that window.
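The three computation styles above can be sketched as a spectrum: a service maps one input to one output, a batch job maps all inputs to all outputs, and stream processing emits an output per window. The doubling function and tumbling-window count below are illustrative placeholders, not anything from the talk.

```python
from collections import defaultdict

def service(x):
    return x * 2                      # one input => one output

def batch(xs):
    return [x * 2 for x in xs]        # all inputs => all outputs

def stream(timestamps, window_size):
    # Tumbling windows: count events per window of `window_size` time units.
    windows = defaultdict(int)
    for ts in timestamps:
        windows[ts // window_size] += 1
    return dict(windows)              # any window => output for that window

assert service(3) == 6
assert batch([1, 2]) == [2, 4]
assert stream([0, 1, 5, 6, 7], window_size=5) == {0: 2, 1: 3}
```

Batch is then just the special case where the window is the whole dataset, and a service is the case where the window is a single event.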
Importance of the log: buffering, multi-subscriber.
Output goes to a live serving system or another batch processing system (Hadoop, DWH).
Examples: recommendations, email, monitoring, security.
Storm and Samza.
About process management; both integrate with Kafka.
MapReduce and HDFS.