ApacheCon-Flume-Kafka-2016
- 1. © 2014, Conversant, Inc. All rights reserved.
PRESENTED BY
May 18, 2016
APACHE FLUME OR
APACHE KAFKA?
HOW ABOUT BOTH?
Apache North America Big Data Conference - May 10, 2016
Jayesh Thakrar (jthakrar@conversantmedia.com)
- 2.
Conversant (www.conversantmedia.com)
• Adserving - real-time bidding
• Intelligent messaging using online and offline activity without using
personally identifiable information (PII)
Hadoop Engineering
• Designs, builds, and manages clusters running Hadoop, HBase, Spark,
Storm, Kafka, Cassandra, OpenTSDB, etc.
• Team: 4 people, 20+ clusters, 500+ servers, PBs of storage, etc.
- 3.
AGENDA
History and Evolution of Conversant's Data Pipeline
Flume Customization
Compare Flume and Kafka
Metrics and Monitoring
- 5.
INTER-DATACENTER DATA PIPELINE
[Diagram: users, web sites (publishers), and Internet ad exchanges feed four data centers - U.S. East Coast, U.S. West Coast, Europe, and Chicago (EDW).]
- 6.
DATA PIPELINE VERSION 1
(PRIOR TO SEPT 2013)
Home-grown log collection system in Perl, shell, and Python
15-20 billion log lines
Comma- or tab-separated log format with an implicit schema
[Diagram: AdServer applications in each data center write to a local log manager, which forwards to the Chicago log aggregator and on into the data warehouse.]
- 7.
DATA PIPELINE VERSION 1
Non-trivial operational and recovery effort during:
• Network/WAN outages
• Planned/unplanned server maintenance
Difficult file format/schema evolution
Delayed reporting and metrics (2-3 hours)
Scaling and storage utilization issues on the local log manager
- 8.
DATA PIPELINE VERSION 2
(SEP 2013 - MAR 2015)
Application logging in Avro format
50-80+ billion daily log lines
3-hop Flume pipeline
Flume event schema: event header + event payload
• Header key/value pairs = log type, log version, server-id, UUID, timestamp, # of log lines
• Payload = byte array = an Avro file
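The event layout above can be sketched in plain Python (the header keys mirror the ones listed on the slide, but the helper itself and the payload encoding are hypothetical stand-ins for the real Flume Event API and Avro file):

```python
import time
import uuid

def build_event(log_type, log_version, server_id, log_lines):
    """Wrap a batch of log lines as one pipeline event: a header map
    plus an opaque byte-array payload (an Avro file in the real
    pipeline; newline-joined bytes here for brevity)."""
    payload = "\n".join(log_lines).encode("utf-8")  # stand-in for an Avro file
    headers = {
        "log.type": log_type,
        "log.version": str(log_version),
        "server.id": server_id,
        "uuid": str(uuid.uuid4()),        # used downstream for deduping
        "timestamp": str(int(time.time() * 1000)),
        "num.lines": str(len(log_lines)),
    }
    return {"headers": headers, "body": payload}

event = build_event("impression", 3, "adserver-042", ["line1", "line2"])
```

One batch of application log lines thus travels as exactly one event, which later maps 1:1 to a Kafka message (see the "Key Points" slide).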
[Diagram: AdServer applications with local Flume agents in each data center feed local compressor Flume agents, which forward to the Chicago deduping and bifurcating Flume agents and on to a dedicated Hadoop cluster.]
- 9.
DATA PIPELINE VERSION 2
Explicit application log schema
Version-tagged payload = easier log file schema evolution
No manual recovery during network outages or server maintenance
Detailed, explicit metrics in real time
- 10.
DATA PIPELINE VERSION 3
(MAR 2015 - JUN 2015)
Switched from the dedicated MapR cluster to a CDH cluster (the new EDW)
[Diagram: same pipeline as version 2, with the Chicago deduping and bifurcating Flume agents now writing to both the dedicated Hadoop cluster and the enterprise Hadoop cluster.]
- 11.
DATA PIPELINE VERSION 3
About 4-5K file creations/sec by Flume - the Namenode was overwhelmed!
Manual intervention for data recovery - a painful reminder of version 1
- 12.
DATA PIPELINE VERSION 4
(JUNE 2015+)
Embedded Flume agents in applications
Kafka to buffer and self-regulate data flow
Camus MapReduce framework to land data
[Diagram: AdServer applications with embedded Flume agents feed local compressor Flume agents, which forward to the Chicago deduping and bifurcating Flume agents; events pass through Kafka, and Camus MapReduce jobs land them on the enterprise Hadoop cluster.]
- 13.
DATA PIPELINE VERSION 4
Kafka + Flume = Hadoop decoupling and data redundancy
Additional metrics and visibility from Kafka
In the future, allows for data sniffing/sampling and real-time stream processing of log data
- 15.
ADSTACK DATA CENTER BUILDING BLOCK
The multi-threaded application flushes batched Avro log lines through its embedded Flume agent based on time and/or line-count thresholds
The compressor agent compresses the data and sends it downstream to Chicago
• Custom Flume interceptor = compression and filtering
• Custom Flume selector = event forwarding to a specific channel
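The flush policy described above - emit a batch when either a line count or an elapsed-time threshold is hit - can be sketched like this (class name, thresholds, and the sink callback are illustrative, not the actual implementation):

```python
import time

class BatchFlusher:
    """Buffer log lines and flush when either a line-count or an
    elapsed-time threshold is reached, mirroring the embedded-agent
    policy on the slide. Thresholds here are made-up defaults."""

    def __init__(self, sink, max_lines=1000, max_secs=5.0, clock=time.monotonic):
        self.sink = sink               # callable that receives each batch
        self.max_lines = max_lines
        self.max_secs = max_secs
        self.clock = clock
        self.buffer = []
        self.started = clock()

    def append(self, line):
        self.buffer.append(line)
        # Flush on whichever threshold trips first: count or time
        if (len(self.buffer) >= self.max_lines
                or self.clock() - self.started >= self.max_secs):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(list(self.buffer))  # hand the batch to the embedded agent
        self.buffer.clear()
        self.started = self.clock()
```

A periodic timer would also call `flush()` so a slow trickle of lines still leaves the host within the time threshold.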
- 16.
CHICAGO DEDUPING AND BIFURCATING AGENTS
Landing Flume agent
• Custom interceptor = check HBase for the UUID; forward only if absent (check-and-forward) and insert into HBase
• Custom selector = forward every Nth event to the QA flow (dedicated channel and sink)
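The two custom pieces combine into logic of roughly this shape (a sketch only: an in-memory set stands in for the HBase UUID table, and the class and sink names are made up):

```python
class DedupingBifurcatingAgent:
    """Sketch of the landing agent: check-and-forward dedupe by event
    UUID, plus a selector that copies every Nth surviving event to a
    QA flow. A set stands in for the HBase lookup table."""

    def __init__(self, main_sink, qa_sink, qa_every_nth=100):
        self.seen = set()              # stand-in for the HBase UUID table
        self.main_sink = main_sink
        self.qa_sink = qa_sink
        self.qa_every_nth = qa_every_nth
        self.count = 0

    def process(self, event):
        event_uuid = event["headers"]["uuid"]
        if event_uuid in self.seen:    # duplicate delivery: drop it
            return False
        self.seen.add(event_uuid)      # "insert into HBase"
        self.count += 1
        self.main_sink(event)          # forward on the main flow
        if self.count % self.qa_every_nth == 0:
            self.qa_sink(event)        # bifurcate every Nth event to QA
        return True
```

Because upstream agents retry on failure, duplicates are expected; the UUID check makes redelivery harmless rather than forbidden.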
- 18.
KEY POINTS
A batch of application log lines = "logical log file" = 1 Flume event = 1 Kafka message
Applications create custom header key/value pairs in Flume events: log type, server-id, UUID, log version, # of log lines, timestamp, etc.
Events are compressed at the remote data center
Events are deduped in Chicago using an HBase lookup (check-and-forward)
The data pipeline is resilient to server and network outages and to system maintenance
- 20.
FLUME IN A NUTSHELL: ARCHITECTURE
[Diagram: within a Flume agent, a source (with an interceptor and a selector) routes events to channel 1 and channel 2, which drain to sink 1 and sink 2; the source and sinks connect the agent to its neighbors.]
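Wiring of this shape is declared in a Flume agent properties file. A hedged sketch (agent, channel, and sink names are invented, and the `log.type` header used for routing is an assumption):

```properties
# One agent: Avro source -> interceptor -> multiplexing selector
# -> two channels -> two sinks. All names are illustrative.
agent1.sources  = src1
agent1.channels = ch1 ch2
agent1.sinks    = sink1 sink2

agent1.sources.src1.type = avro
agent1.sources.src1.bind = 0.0.0.0
agent1.sources.src1.port = 4141
agent1.sources.src1.channels = ch1 ch2

# Interceptor: stamp each event with the host it arrived on
agent1.sources.src1.interceptors = i1
agent1.sources.src1.interceptors.i1.type = host

# Selector: route on a header value, defaulting to ch1
agent1.sources.src1.selector.type = multiplexing
agent1.sources.src1.selector.header = log.type
agent1.sources.src1.selector.mapping.impression = ch1
agent1.sources.src1.selector.mapping.click = ch2
agent1.sources.src1.selector.default = ch1

agent1.channels.ch1.type = file
agent1.channels.ch2.type = memory

agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /flume/events
agent1.sinks.sink1.channel = ch1

# Avro sink daisy-chains to the next agent's Avro source
agent1.sinks.sink2.type = avro
agent1.sinks.sink2.hostname = downstream.example.com
agent1.sinks.sink2.port = 4141
agent1.sinks.sink2.channel = ch2
```

The Avro sink-to-Avro source pairing at the bottom is what makes the multi-hop (agent-to-agent) topology on the earlier pipeline slides possible.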
- 21.
FLUME IN A NUTSHELL: ECOSYSTEM
Pre-canned Flume sources
• Avro (pairs with the Avro Flume sink for daisy-chaining agents)
• Thrift
• Exec (Unix pipe/stdout)
• Kafka
• Netcat
• HTTP
• Spooling Directory
• Custom code
Pre-canned Flume sinks
• HDFS
• Hive
• Avro (Flume sink, for daisy-chaining)
• Thrift
• Kafka
• File Roll (output spooling directory)
• HBase
• Solr
• Elasticsearch
• Custom code
Pre-canned channels
• Memory channel
• File channel
• Kafka channel
• JDBC channel
- 22.
KAFKA IN A NUTSHELL: ARCHITECTURE
[Diagram: a Kafka broker holds an append-only log running from oldest data to latest data; the producer appends at the tail while consumer A and consumer B each read from their own offset.]
Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
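The core idea in that picture - one shared log, per-consumer read positions - fits in a few lines of Python (a toy model, not Kafka's API; no partitions, replication, or retention):

```python
class MiniLog:
    """Toy append-only log: the broker stores each record once, from
    oldest to latest, and every consumer just tracks its own offset."""

    def __init__(self):
        self.records = []              # oldest data at index 0
        self.offsets = {}              # consumer name -> next offset to read

    def produce(self, record):
        self.records.append(record)    # producers only ever append at the tail
        return len(self.records) - 1   # offset of the new record

    def consume(self, consumer):
        pos = self.offsets.get(consumer, 0)
        if pos >= len(self.records):
            return None                # caught up: nothing new to read
        self.offsets[consumer] = pos + 1
        return self.records[pos]

log = MiniLog()
log.produce("e1")
log.produce("e2")
log.consume("A")   # -> "e1"; consumer B's position is unaffected
```

Because consuming only advances an offset, a slow or restarted consumer (such as a periodic Camus job) re-reads from where it left off - this is what lets Kafka "buffer/self-regulate" the pipeline.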
- 23.
KAFKA IN A NUTSHELL: SCALABILITY
[Diagram: producer 1 and producer 2 writing in parallel, illustrating scale-out across brokers and partitions.]
- 24.
FLUME AND KAFKA: DATA PIPELINE BLOCKS
[Diagram: "Flume or Kafka" blocks as composable pipeline units - a single block connects a data source to a data destination, and blocks chain and fan out to connect many data sources to many data destinations.]
- 25.
FLUME VS. KAFKA: DATA AND ROUTING
Data pipeline block philosophy
• Flume = buffered pipeline => transfer and forget
• Kafka = buffered temporal log => transfer and remember (short-term)
Data introspection, manipulation, and conditional routing/multiplexing
• Flume = can intercept and manipulate events (source/sink interceptors) and conditionally route/multiplex events (source/sink selectors)
• Kafka = pass-through only
Changes in data destination or data source
• Flume = requires reconfiguration
• Kafka = N/A; source (producer) and destination (consumer) agnostic
- 26.
FLUME VS. KAFKA: RELIABILITY, SCALABILITY, ECOSYSTEM
Server outage
• Flume = for Flume-to-Flume or incoming flows, fail over to a backup Flume agent; outgoing flows are buffered in the agent's channel
• Kafka = producers/consumers fail over to another broker (replica partition)
Scalability
• Flume = add agents and reconfigure (re-wire) data flows
• Kafka = add brokers, increase topic partitions, and (re)distribute partitions
Ecosystem
• Flume = pre-canned sources, sinks, and channels
• Kafka = Kafka Connect and Kafka Streams
- 28.
ADMINISTRATION
No UI in either system
Flume: agent stop/start shell script
Kafka
• Stop/start brokers
• Create/delete/view/manage topics, partitions, and topic configuration
• Other utilities - e.g. viewing log data, stress testing, etc.
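For example, the stock shell utilities shipped with the Kafka distribution of that era cover the tasks above (topic name, partition counts, and host addresses below are made up; these commands are illustrative, not from the deck):

```shell
# Create and inspect a topic
bin/kafka-topics.sh --zookeeper zk1:2181 --create \
    --topic adserver-logs --partitions 12 --replication-factor 3
bin/kafka-topics.sh --zookeeper zk1:2181 --describe --topic adserver-logs

# View log data from the console
bin/kafka-console-consumer.sh --zookeeper zk1:2181 \
    --topic adserver-logs --from-beginning --max-messages 10

# Stress-test producer throughput
bin/kafka-producer-perf-test.sh --topic adserver-logs \
    --num-records 1000000 --record-size 1024 --throughput -1 \
    --producer-props bootstrap.servers=broker1:9092
```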
- 30.
METRICS
Kafka - JMX and API
• Broker network traffic
• Topic and partition traffic
• Replication and consumer lag
- 31.
MONITORING AND ALERTING
Flume Key Health Indicators
• Flume listener port
• Incoming traffic rate and errors
• Outgoing traffic rate and errors
• Channel capacity utilization
Kafka Key Health Indicators
• Broker listener port
• Under-replicated partitions (in-sync replica count and amount of replica lag)
• Consumer lag
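The first indicator in each list - is the listener port up? - is just a TCP connect check. A minimal sketch (function name and defaults are invented; real monitoring would also scrape the JMX metrics):

```python
import socket

def listener_port_open(host, port, timeout=2.0):
    """Liveness probe for a Flume source or Kafka broker listener:
    return True if a TCP connection to host:port completes within
    the timeout, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A scheduler (cron, Nagios, etc.) would run this against every agent and broker and alert on False.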
- 32.
MONITORING & METRICS @ CONVERSANT
[TSDB graph of Flume events across data centers (legend: Chicago, East Coast, West Coast, Europe); the blip is a rolling restart of servers for a software deploy.]
- 34.
MORE INFO ON CONVERSANT DATA PIPELINE
Conversant Blog
http://engineering.conversantmedia.com/community/2015/06/01/conversant-big-data-everywhere
Sample GitHub Project
https://github.com/mbkeane/BigDataTechCon
Chicago Area Kafka Enthusiasts (CAKE)
http://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/230867233