© 2014, Conversant, Inc. All rights reserved.
PRESENTED BY
May 18, 2016
APACHE FLUME OR
APACHE KAFKA?
HOW ABOUT BOTH?
Apache North America Big Data Conference - May 10, 2016
Jayesh Thakrar (jthakrar@conversantmedia.com)
 Conversant (www.conversantmedia.com)
• Adserving - real-time bidding
• Intelligent messaging using online and offline activity without using
personally identifiable information (PII)
 Hadoop Engineering
• Designs, builds, and manages clusters running Hadoop, HBase, Spark,
Storm, Kafka, Cassandra, OpenTSDB, etc.
• Team: 4 people, 20+ clusters, 500+ servers, PBs of storage, etc.
AGENDA
 History and Evolution of Conversant's Data Pipeline
 Flume Customization
 Compare Flume and Kafka
 Metrics and Monitoring
Conversant Data
Pipeline Overview
INTER-DATACENTER DATA PIPELINE
[Diagram: users, web sites (publishers), and internet ad exchanges feed the U.S. East Coast, U.S. West Coast, and European data centers, which all forward to the Chicago data center (EDW)]
DATA PIPELINE VERSION 1
(PRIOR TO SEPT 2013)
 Home-grown log collection system in Perl, shell and Python
 15-20 billion log lines
 Comma- or tab-separated log format, implicit schema
[Diagram: AdServer applications → local log manager (per data center) → Chicago log aggregator → data warehouse]
 Non-trivial operational and recovery effort during
• Network/WAN outage
• Planned/unplanned server maintenance
 Difficult file format/schema evolution
 Delayed reporting and metrics (2-3 hours)
 Scaling and storage utilization on local log manager
DATA PIPELINE VERSION 1
DATA PIPELINE VERSION 2
(SEP 2013 - MAR 2015)
 Application logging in Avro format
 50-80+ billion daily log lines
 3-hop flume pipeline
 Flume event schema : event header, event payload
• Header key/value = log type, log version, server-id, UUID, timestamp, # of log lines
• Payload = byte array = Avro file
[Diagram: AdServer applications with local Flume agents → local compressor Flume agents (per data center) → Chicago deduping and bifurcating Flume agents → dedicated Hadoop cluster]
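The version 2 event schema above (header key/value pairs plus an opaque payload) can be sketched in Python. The header key names and the plain-text payload below are illustrative; the real pipeline serializes each batch of log lines as an Avro file.

```python
import time
import uuid

def make_event(log_type: str, log_version: int, server_id: str, lines: list[str]) -> dict:
    """Package a batch of application log lines as one Flume-style event.

    The payload here is plain UTF-8 for illustration; the pipeline
    described above uses an Avro file as the byte-array payload.
    """
    return {
        "headers": {
            "log.type": log_type,                      # illustrative key names
            "log.version": str(log_version),
            "server.id": server_id,
            "uuid": str(uuid.uuid4()),                 # used downstream for deduping
            "timestamp": str(int(time.time() * 1000)),
            "line.count": str(len(lines)),
        },
        "body": "\n".join(lines).encode("utf-8"),      # stand-in for the Avro file
    }

event = make_event("impression", 3, "adserver-042", ["line1", "line2"])
print(event["headers"]["line.count"])  # "2"
```

One event per batch keeps the per-event header overhead small relative to the payload, which matters at 50-80+ billion daily log lines.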
DATA PIPELINE VERSION 2
 Explicit application log schema
 Version tagged payload = easier log file schema evolution
 No manual recovery during network outages and server maintenance
 Detailed, explicit metrics in real-time
DATA PIPELINE VERSION 3
(MAR 2015-JUN 2015)
 Switch from dedicated MapR cluster to CDH cluster (new EDW)
[Diagram: AdServer applications with local Flume agents → local compressor Flume agents (per data center) → Chicago deduping and bifurcating Flume agents → dedicated Hadoop cluster and enterprise Hadoop cluster]
DATA PIPELINE VERSION 3
 About 4,000-5,000 file creations/sec by Flume - NameNode overwhelmed!!
 Manual intervention for data recovery - a painful reminder of version 1
DATA PIPELINE VERSION 4
(JUNE 2015+ )
 Embedded Flume agents in applications
 Kafka to "buffer/self-regulate" data flow
 Camus MapReduce framework to land data
[Diagram: AdServer applications with embedded Flume agents → local compressor Flume agents (per data center) → Chicago deduping and bifurcating Flume agents → Kafka → enterprise Hadoop cluster + Camus MapReduce]
DATA PIPELINE VERSION 4
 Kafka + Flume = Hadoop decoupling and data redundancy
 Additional metrics and visibility from Kafka
 In the future, allows for data sniffing/sampling and real-time stream
processing of log data
Flume Customization
ADSTACK DATA CENTER BUILDING BLOCK
 Multi-threaded application flushes
batched Avro log lines through
embedded Flume agent based on
time and/or line count thresholds
 Compressor agent compresses
data and sends downstream to
Chicago
• Custom Flume interceptor =
compression and filtering
• Custom Flume selector = event
forwarding to specific channel
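The flush-on-count-or-time behaviour described above can be sketched as follows. The thresholds, sink callback, and single-threaded design are illustrative simplifications of the multi-threaded embedded agents.

```python
import time

class BatchFlusher:
    """Buffer log lines and flush when either threshold is hit.

    Mirrors the embedded-agent behaviour described above: flush on
    line count OR elapsed time, whichever comes first. Thread safety
    is omitted for brevity; the real agents are multi-threaded.
    """
    def __init__(self, sink, max_lines=1000, max_seconds=5.0, clock=time.monotonic):
        self.sink = sink                  # called with each completed batch
        self.max_lines = max_lines
        self.max_seconds = max_seconds
        self.clock = clock
        self.buffer = []
        self.started = clock()

    def append(self, line):
        if not self.buffer:
            self.started = self.clock()   # timer starts at first buffered line
        self.buffer.append(line)
        if (len(self.buffer) >= self.max_lines
                or self.clock() - self.started >= self.max_seconds):
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)
            self.buffer = []

batches = []
f = BatchFlusher(batches.append, max_lines=3, max_seconds=60)
for i in range(7):
    f.append(f"log-{i}")
f.flush()  # flush the trailing partial batch
print([len(b) for b in batches])  # [3, 3, 1]
```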
CHICAGO DEDUPING AND BIFURCATING AGENTS
 Landing Flume Agent
• Custom Interceptor = Check
HBase for UUID, forward if
absent (check-and-forward)
and insert into HBase
• Custom selector = forward
every Nth event to QA flow
(dedicated channel and sink)
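A minimal sketch of the check-and-forward dedup plus every-Nth QA sampling described above, with a Python set standing in for the HBase UUID table and hypothetical channel names:

```python
class DedupAndBifurcate:
    """Check-and-forward dedup plus every-Nth QA sampling.

    A set stands in for the HBase UUID table used by the real
    interceptor; "main" and "qa" are illustrative channel names.
    """
    def __init__(self, n=100):
        self.seen = set()   # HBase stand-in: UUIDs already forwarded
        self.n = n
        self.count = 0

    def route(self, event):
        uid = event["headers"]["uuid"]
        if uid in self.seen:
            return []                      # duplicate: drop the event
        self.seen.add(uid)                 # insert into "HBase", then forward
        self.count += 1
        channels = ["main"]
        if self.count % self.n == 0:       # every Nth event also goes to QA
            channels.append("qa")
        return channels

router = DedupAndBifurcate(n=3)
events = [{"headers": {"uuid": u}} for u in ["a", "b", "a", "c", "d"]]
routes = [router.route(e) for e in events]
print(routes)  # [['main'], ['main'], [], ['main', 'qa'], ['main']]
```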
INTO THE DATA WAREHOUSE
KEY POINTS
 Batch of application log lines = "logical log file"
= 1 Flume event = Kafka message
 Application created custom header key/value pairs in Flume events -
log type, server-id, UUID, log version, # of log lines, timestamp, etc.
 Events compressed at remote data center
 Events deduped using HBase lookup (check-and-forward) in Chicago
 Data pipeline resilient to server and network outages and system
maintenance
Flume and Kafka
or
Flume v/s Kafka
FLUME IN A NUTSHELL: ARCHITECTURE
[Diagram: inside a Flume agent, a source (with interceptor and selector) writes to one or more channels (channel 1, channel 2), each drained by its own sink (sink 1, sink 2); sources and sinks connect agents to each other]
FLUME IN A NUTSHELL: ECOSYSTEM
Pre-canned Flume Sources
 Avro (pairs with the Avro sink for daisy-chaining agents)
 Thrift
 Exec (Unix pipe/stdout)
 Kafka
 Netcat
 HTTP
 Spooling Directory
 Custom Code
Pre-canned Flume Sinks
 HDFS
 Hive
 Avro (pairs with the Avro source for daisy-chaining agents)
 Thrift
 Kafka
 File Roll (output spooling directory)
 HBase
 Solr
 Elastic Search
 Custom Code
Pre-canned Channels
 Memory Channel
 File Channel
 Kafka Channel
 JDBC Channel
KAFKA IN A NUTSHELL: ARCHITECTURE
[Diagram: a Kafka topic partition as an append-only log on the broker, oldest data at the head and latest at the tail; a producer appends at the tail while consumers A and B each read from their own position]
Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
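The append-only log semantics pictured above (a producer appends at the tail; each consumer reads from its own offset) can be sketched as a toy in-memory log; class and method names are illustrative:

```python
class MiniLog:
    """Toy append-only log: producers append at the tail, and each
    consumer tracks its own read offset, so consumers at different
    positions (like consumers A and B above) read independently."""
    def __init__(self):
        self.records = []
        self.offsets = {}                  # consumer name -> next offset to read

    def produce(self, record):
        self.records.append(record)
        return len(self.records) - 1       # offset of the appended record

    def consume(self, consumer, max_records=10):
        start = self.offsets.get(consumer, 0)
        batch = self.records[start:start + max_records]
        self.offsets[consumer] = start + len(batch)
        return batch

log = MiniLog()
for r in ["e1", "e2", "e3"]:
    log.produce(r)
a1 = log.consume("A", 2)   # ['e1', 'e2'] -- A stops partway through
b1 = log.consume("B")      # ['e1', 'e2', 'e3'] -- B reads the whole log
a2 = log.consume("A")      # ['e3'] -- A resumes from its own offset
```

This "remember until retention expires" model is exactly what lets Kafka buffer and self-regulate the pipeline: a slow or restarted consumer just resumes from its stored offset.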
KAFKA IN A NUTSHELL: SCALABILITY
[Diagram: multiple producers (producer 1, producer 2) writing in parallel across partitions and brokers]
FLUME AND KAFKA: DATA PIPELINE BLOCKS
[Diagram: Flume or Kafka as an interchangeable pipeline block: a single data source feeding a single data destination, many sources fanning in to one block, and one block fanning out to many destinations]
FLUME V/S KAFKA: DATA AND ROUTING
 Data pipeline block philosophy
• Flume = buffered pipeline => transfer and forget
• Kafka = buffered temporal log => transfer and remember (short-term)
 Data introspection, manipulation and conditional routing/multiplexing
• Flume = can intercept and manipulate events (source/sink interceptor)
• Flume = conditional routing, multiplexing of events (source/sink selector)
• Kafka = Pass-through only
 Changes in data destination or data source
• Flume = requires re-configuration
• Kafka = N/A, source (producer) and destination (consumer) agnostic
FLUME V/S KAFKA: RELIABILITY, SCALABILITY, ECOSYSTEM
 Server Outage
• Flume = incoming (Flume-to-Flume) flow fails over to a backup Flume agent;
outgoing flow is buffered in the agent's channel
• Kafka = Producer/consumer failover to another broker (replica partition)
 Scalability
• Flume = add agents, re-configure (re-wire) data flows
• Kafka = add brokers, increase topic partitions and (re)distribute partitions
 Ecosystem
• Flume = Pre-canned sources, sinks and channels
• Kafka = Kafka Connect and Kafka Streams
Administration,
Metrics and Monitoring
ADMINISTRATION
 No UI in either of them
 Flume: agent stop/start shell script
 Kafka
• Stop/start brokers
• Create/delete/view/manage topics, partitions and topic configuration
• Other utilities - e.g. view log data, stress testing, etc.
FLUME METRICS: JMX AND HTTP/JSON ENDPOINT
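A sketch of reading channel utilization out of a Flume HTTP/JSON metrics document. The payload below is hand-written in the shape Flume's HTTP monitoring endpoint serves (string-valued counters keyed by component type and name); the channel name and counter values are illustrative.

```python
import json

# Hand-written sample shaped like the JSON a Flume agent serves when
# started with -Dflume.monitoring.type=http; values are strings in
# Flume's output, and "mem-ch" is an illustrative channel name.
payload = json.dumps({
    "CHANNEL.mem-ch": {
        "Type": "CHANNEL",
        "ChannelCapacity": "100000",
        "ChannelSize": "2500",
        "EventPutSuccessCount": "120000",
        "EventTakeSuccessCount": "117500",
    }
})

def channel_fill(metrics_json: str) -> dict:
    """Return {channel_name: fill fraction} from a Flume metrics document."""
    metrics = json.loads(metrics_json)
    return {
        name.split(".", 1)[1]: int(m["ChannelSize"]) / int(m["ChannelCapacity"])
        for name, m in metrics.items()
        if name.startswith("CHANNEL.")
    }

fills = channel_fill(payload)
print(fills)  # {'mem-ch': 0.025}
```

Channel fill fraction is the number to alert on: a channel filling up means the sink cannot keep pace with the source.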
METRICS
 Kafka - JMX and API
• Broker network traffic
• Topic and partition traffic
• Replication and consumer lag
MONITORING AND ALERTING
 Flume Key Health Indicators
• Flume listener port
• Incoming traffic rate and errors
• Outgoing traffic rate and errors
• Channel capacity utilization
 Kafka Key Health Indicators
• Broker listener port
• Under-replicated partitions (In-sync replica and amount of replica lag)
• Consumer lag
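Consumer lag, the last indicator above, is simply the log-end offset minus the group's committed offset, per partition. The offsets below are hand-written stand-ins for values that would come from the broker (log-end) and the consumer group (committed).

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition lag = log-end offset minus committed offset.

    Partitions with no committed offset are treated as fully behind
    (committed = 0), which is one common convention.
    """
    return {
        tp: log_end_offsets[tp] - committed_offsets.get(tp, 0)
        for tp in log_end_offsets
    }

log_end = {("clicks", 0): 1000, ("clicks", 1): 980}
committed = {("clicks", 0): 990, ("clicks", 1): 980}
lag = consumer_lag(log_end, committed)
print(lag)                  # {('clicks', 0): 10, ('clicks', 1): 0}
print(sum(lag.values()))    # total lag across partitions: 10
```

A steadily growing total is the alert condition; a constant non-zero lag usually just reflects batching.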
MONITORING & METRICS @ CONVERSANT
TSDB Graph of
Flume Events
Across Data Centers
The blip is a rolling
restart of servers for a
software deploy
Legend
Chicago
East Coast
West Coast
Europe
MONITORING & METRICS IN GRAFANA DASHBOARDS
MORE INFO ON CONVERSANT DATA PIPELINE
 Conversant Blog
http://engineering.conversantmedia.com/community/2015/06/01/conversant-big-data-everywhere
 Sample GitHub Project
https://github.com/mbkeane/BigDataTechCon
 Chicago Area Kafka Enthusiasts (CAKE)
http://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/230867233
Questions?

More Related Content

What's hot

Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoopChristophe Marchal
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingRapheephan Thongkham-Uan
 
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014Steve Hoffman
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...DataWorks Summit
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDataWorks Summit
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRyan Bosshart
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with FlumeRatnakar Pawar
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBasedave_revell
 
Flume @ Austin HUG 2/17/11
Flume @ Austin HUG 2/17/11Flume @ Austin HUG 2/17/11
Flume @ Austin HUG 2/17/11Cloudera, Inc.
 
Extracting twitter data using apache flume
Extracting twitter data using apache flumeExtracting twitter data using apache flume
Extracting twitter data using apache flumeBharat Khanna
 
Apache Phoenix: Use Cases and New Features
Apache Phoenix: Use Cases and New FeaturesApache Phoenix: Use Cases and New Features
Apache Phoenix: Use Cases and New FeaturesHBaseCon
 

What's hot (20)

Big data: Loading your data with flume and sqoop
Big data:  Loading your data with flume and sqoopBig data:  Loading your data with flume and sqoop
Big data: Loading your data with flume and sqoop
 
Flume
FlumeFlume
Flume
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in Manufacturing
 
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
 
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
Apache Flume - Streaming data easily to Hadoop from any source for Telco oper...
 
Apache Flume (NG)
Apache Flume (NG)Apache Flume (NG)
Apache Flume (NG)
 
Deploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analyticsDeploying Apache Flume to enable low-latency analytics
Deploying Apache Flume to enable low-latency analytics
 
Apache flume - Twitter Streaming
Apache flume - Twitter Streaming Apache flume - Twitter Streaming
Apache flume - Twitter Streaming
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
Apache flume
Apache flumeApache flume
Apache flume
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Realtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLibRealtime Detection of DDOS attacks using Apache Spark and MLLib
Realtime Detection of DDOS attacks using Apache Spark and MLLib
 
Centralized logging with Flume
Centralized logging with FlumeCentralized logging with Flume
Centralized logging with Flume
 
Cloudera's Flume
Cloudera's FlumeCloudera's Flume
Cloudera's Flume
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 
Flume intro-100715
Flume intro-100715Flume intro-100715
Flume intro-100715
 
Flume @ Austin HUG 2/17/11
Flume @ Austin HUG 2/17/11Flume @ Austin HUG 2/17/11
Flume @ Austin HUG 2/17/11
 
Extracting twitter data using apache flume
Extracting twitter data using apache flumeExtracting twitter data using apache flume
Extracting twitter data using apache flume
 
Apache Phoenix: Use Cases and New Features
Apache Phoenix: Use Cases and New FeaturesApache Phoenix: Use Cases and New Features
Apache Phoenix: Use Cases and New Features
 

Viewers also liked

Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVROairisData
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Adam Muise
 
大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点 大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点 Chao Zhu
 
Implementing and running a secure datalake from the trenches
Implementing and running a secure datalake from the trenches Implementing and running a secure datalake from the trenches
Implementing and running a secure datalake from the trenches DataWorks Summit
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascienceAdam Muise
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 

Viewers also liked (9)

Parquet and AVRO
Parquet and AVROParquet and AVRO
Parquet and AVRO
 
Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015Moving to a data-centric architecture: Toronto Data Unconference 2015
Moving to a data-centric architecture: Toronto Data Unconference 2015
 
大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点 大型电商的数据服务的要点和难点
大型电商的数据服务的要点和难点
 
Implementing and running a secure datalake from the trenches
Implementing and running a secure datalake from the trenches Implementing and running a secure datalake from the trenches
Implementing and running a secure datalake from the trenches
 
Parquet overview
Parquet overviewParquet overview
Parquet overview
 
Paytm labs soyouwanttodatascience
Paytm labs soyouwanttodatasciencePaytm labs soyouwanttodatascience
Paytm labs soyouwanttodatascience
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Flume vs. kafka
Flume vs. kafkaFlume vs. kafka
Flume vs. kafka
 

Similar to ApacheCon-Flume-Kafka-2016

OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsTimothy Spann
 
Schema Registry & Stream Analytics Manager
Schema Registry  & Stream Analytics ManagerSchema Registry  & Stream Analytics Manager
Schema Registry & Stream Analytics ManagerSriharsha Chintalapani
 
SAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made EasySAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made EasyDataWorks Summit
 
Kafka at the Edge: an IoT scenario with OpenShift Streams for Apache Kafka | ...
Kafka at the Edge: an IoT scenario with OpenShift Streams for Apache Kafka | ...Kafka at the Edge: an IoT scenario with OpenShift Streams for Apache Kafka | ...
Kafka at the Edge: an IoT scenario with OpenShift Streams for Apache Kafka | ...Red Hat Developers
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Timothy Spann
 
Leverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platformLeverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platformconfluent
 
PortoTechHub - Hail Hydrate! From Stream to Lake with Apache Pulsar and Friends
PortoTechHub  - Hail Hydrate! From Stream to Lake with Apache Pulsar and FriendsPortoTechHub  - Hail Hydrate! From Stream to Lake with Apache Pulsar and Friends
PortoTechHub - Hail Hydrate! From Stream to Lake with Apache Pulsar and FriendsTimothy Spann
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopDataWorks Summit
 
ITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming AppsITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming AppsTimothy Spann
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoophadooparchbook
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache KafkaJoe Stein
 
Flurry Analytic Backend - Processing Terabytes of Data in Real-time
Flurry Analytic Backend - Processing Terabytes of Data in Real-timeFlurry Analytic Backend - Processing Terabytes of Data in Real-time
Flurry Analytic Backend - Processing Terabytes of Data in Real-timeTrieu Nguyen
 
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit DataBuilding Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit DataTimothy Spann
 
Building Real-Time Travel Alerts
Building Real-Time Travel AlertsBuilding Real-Time Travel Alerts
Building Real-Time Travel AlertsTimothy Spann
 
Part 2: Architecture and the Operator Experience (Pivotal Cloud Platform Road...
Part 2: Architecture and the Operator Experience (Pivotal Cloud Platform Road...Part 2: Architecture and the Operator Experience (Pivotal Cloud Platform Road...
Part 2: Architecture and the Operator Experience (Pivotal Cloud Platform Road...VMware Tanzu
 
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel OptimizationCoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel OptimizationTimothy Spann
 
Music city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lakeMusic city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lakeTimothy Spann
 
Data Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, HowData Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, HowPat Patterson
 

Similar to ApacheCon-Flume-Kafka-2016 (20)

OSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming AppsOSSNA Building Modern Data Streaming Apps
OSSNA Building Modern Data Streaming Apps
 
Schema Registry & Stream Analytics Manager
Schema Registry  & Stream Analytics ManagerSchema Registry  & Stream Analytics Manager
Schema Registry & Stream Analytics Manager
 
SAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made EasySAM - Streaming Analytics Made Easy
SAM - Streaming Analytics Made Easy
 
Kafka at the Edge: an IoT scenario with OpenShift Streams for Apache Kafka | ...
Kafka at the Edge: an IoT scenario with OpenShift Streams for Apache Kafka | ...Kafka at the Edge: an IoT scenario with OpenShift Streams for Apache Kafka | ...
Kafka at the Edge: an IoT scenario with OpenShift Streams for Apache Kafka | ...
 
Streaming analytics manager
Streaming analytics managerStreaming analytics manager
Streaming analytics manager
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
 
Leverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platformLeverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platform
 
PortoTechHub - Hail Hydrate! From Stream to Lake with Apache Pulsar and Friends
PortoTechHub  - Hail Hydrate! From Stream to Lake with Apache Pulsar and FriendsPortoTechHub  - Hail Hydrate! From Stream to Lake with Apache Pulsar and Friends
PortoTechHub - Hail Hydrate! From Stream to Lake with Apache Pulsar and Friends
 
Architecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with HadoopArchitecting a Fraud Detection Application with Hadoop
Architecting a Fraud Detection Application with Hadoop
 
Fraud Detection Architecture
Fraud Detection ArchitectureFraud Detection Architecture
Fraud Detection Architecture
 
ITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming AppsITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming Apps
 
Fraud Detection using Hadoop
Fraud Detection using HadoopFraud Detection using Hadoop
Fraud Detection using Hadoop
 
Current and Future of Apache Kafka
Current and Future of Apache KafkaCurrent and Future of Apache Kafka
Current and Future of Apache Kafka
 
Flurry Analytic Backend - Processing Terabytes of Data in Real-time
Flurry Analytic Backend - Processing Terabytes of Data in Real-timeFlurry Analytic Backend - Processing Terabytes of Data in Real-time
Flurry Analytic Backend - Processing Terabytes of Data in Real-time
 
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit DataBuilding Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
 
Building Real-Time Travel Alerts
Building Real-Time Travel AlertsBuilding Real-Time Travel Alerts
Building Real-Time Travel Alerts
 
Part 2: Architecture and the Operator Experience (Pivotal Cloud Platform Road...
Part 2: Architecture and the Operator Experience (Pivotal Cloud Platform Road...Part 2: Architecture and the Operator Experience (Pivotal Cloud Platform Road...
Part 2: Architecture and the Operator Experience (Pivotal Cloud Platform Road...
 
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel OptimizationCoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
 
Music city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lakeMusic city data Hail Hydrate! from stream to lake
Music city data Hail Hydrate! from stream to lake
 
Data Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, HowData Integration with Apache Kafka: What, Why, How
Data Integration with Apache Kafka: What, Why, How
 

More from Jayesh Thakrar

ApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data SourcesApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data SourcesJayesh Thakrar
 
Apache big-data-2017-spark-profiling
Apache big-data-2017-spark-profilingApache big-data-2017-spark-profiling
Apache big-data-2017-spark-profilingJayesh Thakrar
 
Data Modeling for IoT and Big Data
Data Modeling for IoT and Big DataData Modeling for IoT and Big Data
Data Modeling for IoT and Big DataJayesh Thakrar
 
Apache big-data-2017-scala-sql
Apache big-data-2017-scala-sqlApache big-data-2017-scala-sql
Apache big-data-2017-scala-sqlJayesh Thakrar
 
Data Loss and Duplication in Kafka
Data Loss and Duplication in KafkaData Loss and Duplication in Kafka
Data Loss and Duplication in KafkaJayesh Thakrar
 
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14Jayesh Thakrar
 

More from Jayesh Thakrar (6)

ApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data SourcesApacheCon North America 2018: Creating Spark Data Sources
ApacheCon North America 2018: Creating Spark Data Sources
 
Apache big-data-2017-spark-profiling
Apache big-data-2017-spark-profilingApache big-data-2017-spark-profiling
Apache big-data-2017-spark-profiling
 
Data Modeling for IoT and Big Data
Data Modeling for IoT and Big DataData Modeling for IoT and Big Data
Data Modeling for IoT and Big Data
 
Apache big-data-2017-scala-sql
Apache big-data-2017-scala-sqlApache big-data-2017-scala-sql
Apache big-data-2017-scala-sql
 
Data Loss and Duplication in Kafka
Data Loss and Duplication in KafkaData Loss and Duplication in Kafka
Data Loss and Duplication in Kafka
 
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
Chicago-Java-User-Group-Meetup-Some-Garbage-Talk-2015-01-14
 

ApacheCon-Flume-Kafka-2016

  • 1. © 2014, Conversant, Inc. All rights reserved. PRESENTED BY May 18, 2016 APACHE FLUME OR APACHE KAFKA? HOW ABOUT BOTH? Apache North America Big Data Conference - May 10, 2016 Jayesh Thakrar (jthakrar@conversantmedia.com)
  • 2. © 2014, Conversant, Inc. All rights reserved.2  Conversant (www.conversantmedia.com) • Adserving - real-time bidding • Intelligent messaging using online and offline activity without using personally identifiable information (PII)  Hadoop Engineering • Designs, builds, and manages clusters running Hadoop, HBase, Spark, Storm, Kafka, Cassandra, OpenTSDB, etc. • Team: 4 people, 20+ clusters, 500+ servers, PBs of storage, etc.
  • 3. © 2014, Conversant, Inc. All rights reserved.3 AGENDA  History and Evolution of Conversant's Data Pipeline  Flume Customization  Compare Flume and Kafka  Metrics and Monitoring
  • 4. © 2014, Conversant, Inc. All rights reserved.4 Conversant Data Pipeline Overview
  • 5. © 2014, Conversant, Inc. All rights reserved.5 INTER-DATACENTER DATA PIPELINE InternetAd Exchanges Web Sites (Publishers) Users U.S. East Coast Data Center European Data Center Chicago Data Center (EDW) U.S. West Coast Data Center
  • 6. © 2014, Conversant, Inc. All rights reserved.6  Home-grown log collection system in PERL, shell and Python  15-20 billion log lines  Comma or tab separated log format, implicit schema DATA PIPELINE VERSION 1 (PRIOR TO SEPT 2013) AdServer Application AdServer Application AdServer Applications AdServer Application AdServer Application Data Center Local Log Manager AdServer Application Chicago Log Aggregator Data Warehouse
  • 7. © 2014, Conversant, Inc. All rights reserved.7  Non-trivial operational and recovery effort during • Network/WAN outage • Planned/unplanned server maintenance  Difficult file format/schema evolution  Delayed reporting and metrics (2-3 hours)  Scaling and storage utilization on local log manager DATA PIPELINE VERSION 1
  • 8. © 2014, Conversant, Inc. All rights reserved.8 DATA PIPELINE VERSION 2 (SEP 2013 - MAR 2015)  Application logging in Avro format  50-80+ billion daily log lines  3-hop flume pipeline  Flume event schema : event header, event payload • Header key/value = log type, log version, server-id, UUID, timestamp, # of log lines • Payload = byte array = Avro file AdServer Application AdServer Application AdServer Applications with Local Flume Agents AdServer Application AdServer Application Data Center Local Compressor Flume Agents AdServer Application Chicago Deduping and Bifurcating Flume Agents Dedicated Hadoop Cluster
  • 9. © 2014, Conversant, Inc. All rights reserved.9 DATA PIPELINE VERSION 2  Explicit application log schema  Version tagged payload = easier log file schema evolution  No manual recovery during network outages and server maintenance  Detailed, explicit metrics in real-time
  • 10. © 2014, Conversant, Inc. All rights reserved.10 DATA PIPELINE VERSION 3 (MAR 2015-JUN 2015)  Switch from dedicated MapR cluster to CDH cluster (new EDW) AdServer Application AdServer Application AdServer Applications with Local Flume Agents AdServer Application AdServer Application Data Center Local Compressor Flume Agents AdServer Application Chicago Deduping and Bifurcating Flume Agents Dedicated Hadoop Cluster Enterprise Hadoop Cluster
  • 11. DATA PIPELINE VERSION 3
     About 4-5K file creations/sec by Flume - NameNode overwhelmed!
     Manual intervention for data recovery - a painful reminder of version 1
  • 12. DATA PIPELINE VERSION 4 (JUNE 2015+)
     Embedded Flume agents in applications
     Kafka to "buffer/self-regulate" data flow
     Camus MapReduce framework to land data
    Diagram: AdServer applications with embedded Flume agents → data center local compressor Flume agents → Chicago deduping and bifurcating Flume agents → Kafka → enterprise Hadoop cluster + Camus MapReduce
  • 13. DATA PIPELINE VERSION 4
     Kafka + Flume = decoupling from Hadoop and data redundancy
     Additional metrics and visibility from Kafka
     In the future, allows data sniffing/sampling and real-time stream processing of log data
  • 14. Flume Customization
  • 15. ADSTACK DATA CENTER BUILDING BLOCK
     Multi-threaded application flushes batched Avro log lines through an embedded Flume agent based on time and/or line-count thresholds
     Compressor agent compresses data and sends it downstream to Chicago
    • Custom Flume interceptor = compression and filtering
    • Custom Flume selector = event forwarding to a specific channel
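The compress-and-filter interceptor logic above can be sketched language-agnostically. Flume interceptors are Java classes implementing `org.apache.flume.interceptor.Interceptor`; the Python stand-in below (hypothetical names, events modeled as dicts with `headers` and a bytes `body`) only illustrates the idea, not Conversant's actual code:

```python
import gzip

def compress_and_filter(event, drop_log_types=frozenset()):
    """Sketch of a compressing/filtering interceptor: drop unwanted
    log types, gzip-compress the payload of everything else."""
    if event["headers"].get("logType") in drop_log_types:
        return None  # filtered out, like returning null from intercept()
    compressed = dict(event)
    compressed["body"] = gzip.compress(event["body"])
    # Mark the payload so downstream agents know to decompress it
    compressed["headers"] = {**event["headers"], "compressed": "gzip"}
    return compressed
```

In real Flume this logic would live in `intercept(Event)`, with the selector (next bullet) deciding which channel the surviving events go to.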
  • 16. CHICAGO DEDUPING AND BIFURCATING AGENTS
     Landing Flume agent
    • Custom interceptor = check HBase for the event UUID; forward if absent (check-and-forward) and insert it into HBase
    • Custom selector = forward every Nth event to a QA flow (dedicated channel and sink)
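The check-and-forward and every-Nth routing described above can be sketched as follows. This is an illustrative Python stand-in with hypothetical names: an in-memory set plays the role of the HBase UUID table, and route labels replace real Flume channels:

```python
class DedupingBifurcatingAgent:
    """Sketch: forward an event only if its UUID is unseen
    (check-and-forward), and copy every Nth forwarded event
    to a QA flow."""

    def __init__(self, qa_every_nth=100):
        self.seen = set()          # stands in for the HBase UUID lookup
        self.qa_every_nth = qa_every_nth
        self.forwarded = 0

    def route(self, event):
        uuid = event["headers"]["uuid"]
        if uuid in self.seen:
            return []              # duplicate: drop
        self.seen.add(uuid)        # insert into "HBase" after the check
        self.forwarded += 1
        routes = ["main"]
        if self.forwarded % self.qa_every_nth == 0:
            routes.append("qa")    # selector copies every Nth event to QA
        return routes
```

In the real pipeline the dedup check is an interceptor (HBase get/put) and the QA copy is a channel selector with a dedicated channel and sink.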
  • 17. INTO THE DATA WAREHOUSE
  • 18. KEY POINTS
     Batch of application log lines = "logical log file" = 1 Flume event = 1 Kafka message
     Applications create custom header key/value pairs in Flume events: log type, server-id, UUID, log version, # of log lines, timestamp, etc.
     Events are compressed at the remote data centers
     Events are deduped via an HBase lookup (check-and-forward) in Chicago
     Data pipeline is resilient to server and network outages and system maintenance
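The "batch of log lines = one event = one message" framing can be made concrete with a small sketch. Header names follow the slide; the function and field spellings are hypothetical, and plain bytes stand in for the Avro-file payload of the real pipeline:

```python
import time
import uuid

def make_event(log_type, log_version, server_id, log_lines):
    """Sketch: wrap a batch of application log lines ("logical log
    file") as a single Flume event / Kafka message with the headers
    described above."""
    return {
        "headers": {
            "logType": log_type,
            "logVersion": log_version,
            "serverId": server_id,
            "uuid": str(uuid.uuid4()),   # used downstream for deduping
            "timestamp": str(int(time.time() * 1000)),
            "lineCount": str(len(log_lines)),
        },
        "body": b"\n".join(log_lines),   # stand-in for an Avro payload
    }
```

Because the headers travel with the event, every hop (compressor, deduper, QA selector) can route and account for the batch without parsing the payload.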
  • 19. Flume and Kafka, or Flume vs. Kafka?
  • 20. FLUME IN A NUTSHELL: ARCHITECTURE
    Diagram: within a Flume agent, a source (with interceptor and selector) feeds Channel 1 and Channel 2, each drained by its own sink (Sink 1, Sink 2); the external endpoints are themselves sources or sinks of other systems
  • 21. FLUME IN A NUTSHELL: ECOSYSTEM
    Pre-canned Flume sources:
     Avro (for daisy chaining)
     Thrift
     Exec (Unix pipe/stdout)
     Kafka
     Netcat
     HTTP
     Spooling directory
     Custom code
    Pre-canned Flume sinks:
     HDFS
     Hive
     Avro (for daisy chaining)
     Thrift
     Kafka
     File roll (output spooling directory)
     HBase
     Solr
     Elasticsearch
     Custom code
    Pre-canned channels:
     Memory channel
     File channel
     Kafka channel
     JDBC channel
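A minimal Flume agent wiring one of these pre-canned combinations (an Avro source for daisy chaining, a durable file channel, and an HDFS sink) looks roughly like this; the agent name, port, and paths are hypothetical:

```properties
# One agent: avro source -> file channel -> hdfs sink
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# Avro source: receives events from an upstream agent's Avro sink
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 4141
a1.sources.r1.channels = c1

# Durable file channel: buffered events survive agent restarts
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data

# HDFS sink: lands events into date-partitioned directories
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = /data/logs/%Y-%m-%d
```

Swapping the source, channel, or sink `type` (e.g. to `kafka`) rewires the block without touching application code.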
  • 22. KAFKA IN A NUTSHELL: ARCHITECTURE
    Diagram: a producer appends to a log on a Kafka broker (oldest data at one end, latest data at the other) while Consumer A and Consumer B read at their own offsets
    Source: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
  • 23. KAFKA IN A NUTSHELL: SCALABILITY
    Diagram: multiple producers (Producer 1, Producer 2) write in parallel to a partitioned topic
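Kafka scales writes by spreading a topic over partitions, and a keyed producer maps each key deterministically to one partition so per-key ordering is preserved. A sketch of that idea (Kafka's actual default partitioner hashes the key bytes with murmur2; CRC32 merely stands in here):

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Sketch of keyed partitioning: the same key always lands on
    the same partition, while different keys spread the load."""
    return zlib.crc32(key) % num_partitions
```

Adding partitions increases parallelism for producers and consumers alike, which is why the slide pairs "increase topic partitions" with "add brokers" as the scaling levers.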
  • 24. FLUME AND KAFKA: DATA PIPELINE BLOCKS
    Diagram: Flume or Kafka as a reusable building block between one or more data sources and one or more data destinations, including fan-in and fan-out topologies
  • 25. FLUME VS. KAFKA: DATA AND ROUTING
     Data pipeline block philosophy
    • Flume = buffered pipeline => transfer and forget
    • Kafka = buffered temporal log => transfer and remember (short term)
     Data introspection, manipulation, and conditional routing/multiplexing
    • Flume = can intercept and manipulate events (source/sink interceptor) and conditionally route/multiplex events (source/sink selector)
    • Kafka = pass-through only
     Changes in data destination or data source
    • Flume = requires reconfiguration
    • Kafka = N/A; source (producer) and destination (consumer) agnostic
  • 26. FLUME VS. KAFKA: RELIABILITY, SCALABILITY, ECOSYSTEM
     Server outage
    • Flume = for flume-to-flume or incoming flows, fail over to a backup Flume agent; outgoing flows are buffered in the agent's channel
    • Kafka = producers/consumers fail over to another broker (replica partition)
     Scalability
    • Flume = add agents, re-configure (re-wire) data flows
    • Kafka = add brokers, increase topic partitions, and (re)distribute partitions
     Ecosystem
    • Flume = pre-canned sources, sinks, and channels
    • Kafka = Kafka Connect and Kafka Streams
  • 27. Administration, Metrics and Monitoring
  • 28. ADMINISTRATION
     No built-in UI in either
     Flume: agent stop/start shell script
     Kafka:
    • Stop/start brokers
    • Create/delete/view/manage topics, partitions, and topic configuration
    • Other utilities - e.g. viewing log data, stress testing, etc.
  • 29. FLUME METRICS: JMX AND HTTP/JSON ENDPOINT
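Flume exposes these counters as JSON when an agent is started with `-Dflume.monitoring.type=http` (served at the agent's `/metrics` endpoint). A sketch of consuming such a payload, using a sample response shaped like a real one (the component names and values below are illustrative, not from Conversant's agents):

```python
import json

def channels_over_threshold(metrics_json: str, pct: float = 80.0):
    """Sketch: pick out channels whose fill percentage exceeds a
    threshold from a Flume HTTP/JSON metrics payload. Component
    keys look like 'CHANNEL.c1'; metric values arrive as strings."""
    metrics = json.loads(metrics_json)
    return sorted(
        name for name, m in metrics.items()
        if name.startswith("CHANNEL.")
        and float(m.get("ChannelFillPercentage", "0")) > pct
    )

# Illustrative payload shaped like a /metrics response
sample = json.dumps({
    "CHANNEL.c1": {"Type": "CHANNEL", "ChannelFillPercentage": "95.2",
                   "ChannelSize": "952000", "ChannelCapacity": "1000000"},
    "CHANNEL.c2": {"Type": "CHANNEL", "ChannelFillPercentage": "1.4"},
    "SINK.k1": {"Type": "SINK", "EventDrainSuccessCount": "123456"},
})
```

Channel fill percentage is the natural alerting signal: a filling channel means the sink cannot keep up with the source.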
  • 30. METRICS
     Kafka - JMX and API
    • Broker network traffic
    • Topic and partition traffic
    • Replication and consumer lag
  • 31. MONITORING AND ALERTING
     Flume key health indicators
    • Flume listener port
    • Incoming traffic rate and errors
    • Outgoing traffic rate and errors
    • Channel capacity utilization
     Kafka key health indicators
    • Broker listener port
    • Under-replicated partitions (in-sync replicas and amount of replica lag)
    • Consumer lag
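The under-replicated-partitions check above reduces to comparing each partition's in-sync replica (ISR) set against its assigned replica set; a hypothetical sketch of that test over cluster state:

```python
def under_replicated(assignments):
    """Sketch: a partition is under-replicated when its in-sync
    replica (ISR) set is smaller than its assigned replica set.
    `assignments` maps (topic, partition) -> (replicas, isr)."""
    return sorted(
        tp for tp, (replicas, isr) in assignments.items()
        if len(isr) < len(replicas)
    )
```

A persistently non-empty result means a broker is down or lagging, so writes are running with reduced redundancy until replication catches up.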
  • 32. MONITORING & METRICS @ CONVERSANT
    TSDB graph of Flume events across data centers (legend: Chicago, East Coast, West Coast, Europe); the blip is a rolling restart of servers for a software deploy
  • 33. MONITORING & METRICS IN GRAFANA DASHBOARDS
  • 34. MORE INFO ON CONVERSANT DATA PIPELINE
     Conversant blog: http://engineering.conversantmedia.com/community/2015/06/01/conversant-big-data-everywhere
     Sample GitHub project: https://github.com/mbkeane/BigDataTechCon
     Chicago Area Kafka Enthusiasts (CAKE): http://www.meetup.com/Chicago-Area-Kafka-Enthusiasts/events/230867233
  • 35. Questions?