Kinesis vs-kafka-and-kafka-deep-dive

Kinesis vs. Kafka –
Kafka Deep Dive
Yifeng Jiang
Solutions Engineer, Hortonworks
© Hortonworks Inc. 2011 – 2015. All Rights Reserved

About Me
Yifeng Jiang (蒋 逸峰)
•  Solutions Engineer, Hortonworks
•  HBase book author
•  It has been 10 years since I came to Japan…
•  Hobby: mountain climbing
•  Twitter: @uprush

About Hortonworks
Customer Momentum
•  556 customers (as of August 5, 2015)
•  119 customers added in Q2 2015
•  Publicly traded on NASDAQ: HDP
Hortonworks Data Platform
•  Completely open multi-tenant platform
for any app and any data
•  Consistent enterprise services for security,
operations, and governance
Partner for Customer Success
•  Leader in open-source community, focused on
innovation to meet enterprise needs
•  Unrivaled Hadoop support subscriptions
Founded in 2011
Original 24 architects, developers,
operators of Hadoop from Yahoo!
740+ employees
1350+ ecosystem partners

Hortonworks Data Platform (HDP)
Deploy on premises and cloud

Kinesis vs. Kafka

Amazon Kinesis -- Introduction
Amazon Kinesis is a fully managed, cloud-based service for real-time data
processing over large, distributed data streams.

Apache Kafka -- Introduction
Messaging systems
Real-time
Scalable to handle large data volume
Low Latency
Fault tolerant
Originated at LinkedIn
Aimed at solving data movement across systems
Scala and Java
Open Source (Apache 2.0)
Adopted by many companies

Kinesis vs. Kafka – Features
Similar Features
•  Messaging system for large scale
real-time data processing
•  High performance, highly scalable,
low latency
•  Fault tolerant
Differences
•  Fully managed cloud service vs. OSS
•  Data durability and performance trade-off
•  Interface
•  AWS service integration vs. OSS or single-platform (e.g., HDP) integration

Kinesis vs. Kafka – Data Durability
Kinesis
•  Synchronously replicates data
across three facilities
•  High durability for free
Kafka
•  Replication across servers in the same DC/AZ. Configurable minimum number of in-sync replicas and ACKs.
•  Asynchronously mirrors data across clusters in different datacenters / AZs
Performance trade-off
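As a rough illustration of the configurable ACKs and minimum in-sync replicas, the sketch below models when a write counts as successful. The function name `write_accepted` and its parameters are hypothetical; the rule it encodes (the minimum in-sync replica count is only enforced when the producer asks for acknowledgement from all in-sync replicas) reflects my understanding of Kafka's `acks` and `min.insync.replicas` settings, not actual broker code.

```python
def write_accepted(acks, in_sync_replicas, min_insync_replicas=1):
    """Model whether a produce request is treated as successful.

    acks: 0 (do not wait), 1 (leader only) or "all" (every in-sync replica).
    in_sync_replicas: current ISR size, leader included (assumed >= 1).
    min_insync_replicas: hypothetical stand-in for min.insync.replicas.
    """
    if in_sync_replicas < 1:
        raise ValueError("the model assumes a live leader (ISR size >= 1)")
    if acks in (0, 1):
        # The minimum ISR size is not checked for acks=0 or acks=1.
        return True
    if acks == "all":
        # With acks=all the write is rejected when the ISR has shrunk too far.
        return in_sync_replicas >= min_insync_replicas
    raise ValueError(f"unsupported acks value: {acks!r}")


# Replication factor 2, one replica has fallen out of the ISR:
print(write_accepted("all", in_sync_replicas=1, min_insync_replicas=2))
print(write_accepted(1, in_sync_replicas=1, min_insync_replicas=2))
```

Stricter settings trade latency and availability for durability, which is the performance trade-off named on this slide.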

Kinesis vs. Kafka – Interface
Kinesis
•  REST only
•  Client library wraps REST API
Kafka
•  Low level API
•  REST API available (wrapping low
level API).
Impacts throughput and latency

Kinesis vs. Kafka – Processing
Kafka
•  Custom consumers
•  Event monitoring and alerting use case
•  Storm
•  Fraud detection, simple aggregation
•  Spark Streaming / Storm Trident
•  Micro-batch, near real-time
•  Camus
•  Batch Hadoop ingestion
Kinesis
•  KCL applications on EC2
•  Storm
•  Spark Streaming
•  EMR for batch ingestion, e.g., write to S3

Kinesis vs. Kafka – Deployment & Operation
Kafka
•  HDP: almost one-click deploy with Ambari
•  Basic monitoring with Ambari
•  Expand and rebalance: partition assignment
and consumer rebalance
•  ZooKeeper can also be managed by Ambari
Kinesis
•  Fully managed, one-click deploy
•  CloudWatch monitoring
•  Expand and rebalance: resharding a stream
•  Easy operation

Kafka Deep Dive

Kafka – Concepts
* ZK is used by Broker, Consumer
Broker-0: P0.R0 (L), P1.R0
Broker-1: P0.R1, P2.R1 (L)
Broker-2: P1.R2 (L), P2.R2
Topic with 3 partitions and replication factor 2 ((L) marks the leader replica)
Clients: Producer, Consumer

Kafka -- Concepts
Topics
Partitions
•  Offset
•  Ordered
Replication
•  Prevents data loss
•  Follower replicas are never read from or written to by clients
•  Does not increase throughput
•  Tolerates (replication factor - 1) broker failures
[ambari-qa@c6401 bin]$ kafka-topics.sh --zookeeper c6401:2181 --describe --topic page_visits
Topic:page_visits   PartitionCount:4   ReplicationFactor:2   Configs:
    Topic: page_visits   Partition: 0   Leader: 1   Replicas: 0,1   Isr: 1,0
    Topic: page_visits   Partition: 1   Leader: 0   Replicas: 1,0   Isr: 0,1
    Topic: page_visits   Partition: 2   Leader: 1   Replicas: 0,1   Isr: 1,0
    Topic: page_visits   Partition: 3   Leader: 0   Replicas: 1,0   Isr: 0,1
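To make the failure-tolerance bullet concrete, here is a minimal sketch that takes the `page_visits` replica assignment from the describe output and checks which partitions still have a live replica after brokers fail. The helper name `available_partitions` is hypothetical, and the model ignores leader election and ISR membership.

```python
# Replica assignment copied from the describe output:
# partition id -> ids of the brokers that hold a replica.
REPLICAS = {0: [0, 1], 1: [1, 0], 2: [0, 1], 3: [1, 0]}


def available_partitions(replicas, failed_brokers):
    """Return the partitions that keep at least one replica on a live broker."""
    failed = set(failed_brokers)
    return sorted(
        partition
        for partition, brokers in replicas.items()
        if any(broker not in failed for broker in brokers)
    )


# Replication factor 2 survives the loss of one broker...
print(available_partitions(REPLICAS, failed_brokers=[0]))     # [0, 1, 2, 3]
# ...but not the loss of both brokers that hold the replicas.
print(available_partitions(REPLICAS, failed_brokers=[0, 1]))  # []
```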

Kafka Broker
Store messages (logs) on local disk
•  Messages are appended to log file
•  Log Retention – time and size based
Controller
•  Cluster management
•  Runs on each broker machine
•  One leader, the others are followers
Partition Leader
•  Broker that is the leader for certain partitions
Use ZK for coordination
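The append-only log with size-based retention can be sketched in a few lines. This is a simplified model under stated assumptions: the class `PartitionLog` and its `retention_bytes` parameter are hypothetical, messages are dropped one at a time, and time-based retention is left out. A real broker manages retention on log files on disk rather than on individual in-memory messages.

```python
from collections import deque


class PartitionLog:
    """In-memory stand-in for one partition's append-only log."""

    def __init__(self, retention_bytes):
        self.retention_bytes = retention_bytes
        self.entries = deque()   # (offset, payload) pairs, oldest first
        self.next_offset = 0     # offsets only ever grow
        self.size = 0            # total payload bytes currently retained

    def append(self, payload: bytes) -> int:
        """Append a message at the tail and return its offset."""
        offset = self.next_offset
        self.entries.append((offset, payload))
        self.next_offset += 1
        self.size += len(payload)
        self._enforce_retention()
        return offset

    def _enforce_retention(self):
        # Drop the oldest messages until the log fits, keeping the newest one.
        while self.size > self.retention_bytes and len(self.entries) > 1:
            _, old_payload = self.entries.popleft()
            self.size -= len(old_payload)

    def read(self, offset):
        """Return retained (offset, payload) pairs at or after `offset`."""
        return [(o, p) for o, p in self.entries if o >= offset]


log = PartitionLog(retention_bytes=10)
for _ in range(3):
    log.append(b"aaaa")
print([o for o, _ in log.read(0)])  # offset 0 has been dropped by retention
```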

Kafka Producer
New Producer API in 0.8.2
•  kafka-clients.jar
•  New Java API
•  Asynchronous mode by default
Creates a new message and publishes it to a topic and partition
•  Takes a topic, a value, and an optional key and partition id
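How the optional key and partition id could influence where a message lands is sketched below. The class `PartitionChooser` is hypothetical, and `zlib.crc32` is only a stand-in hash chosen for this example; it is not the hash the Kafka client uses. The order of the rules (explicit partition first, then key hash, then round-robin) is my reading of the producer's default behavior and should be treated as an assumption.

```python
import itertools
import zlib


class PartitionChooser:
    """Pick a partition from an explicit id, a key, or round-robin."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._round_robin = itertools.count()

    def choose(self, key=None, partition=None):
        if partition is not None:
            # An explicit partition id wins over everything else.
            if not 0 <= partition < self.num_partitions:
                raise ValueError(f"no such partition: {partition}")
            return partition
        if key is not None:
            # Same key -> same partition (stand-in hash, see the note above).
            return zlib.crc32(key) % self.num_partitions
        # No key: spread messages evenly across partitions.
        return next(self._round_robin) % self.num_partitions


chooser = PartitionChooser(num_partitions=3)
print(chooser.choose(partition=2))
print(chooser.choose(key=b"user-42") == chooser.choose(key=b"user-42"))
print([chooser.choose() for _ in range(4)])
```

Keyed messages keep per-key ordering because they always land in the same partition.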

Kafka Producer API (0.8.2) – Cont.
•  Original messages are partitioned and then split into batches
•  Each split batch is sent to the leader broker (and then replicated to the ISR, the in-sync replicas)
•  Each send is acknowledged by the leader broker only or by all in-sync replicas, depending on the ACK setting
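[Diagram: the App hands messages m1 to m5 to the Producer Lib; the partitioner assigns each message a partition (p1, p2, p3), and the messages are then split into per-partition batches.]
Broker-0: P0.R0 (L), P1.R0
Broker-1: P0.R1, P2.R1 (L)
Broker-2: P1.R2 (L), P2.R2
Topic with 3 partitions and replication factor 2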
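The partition-then-batch flow can be sketched as follows. The `LEADERS` map is read off the broker diagram (P0 led by Broker-0, P1 by Broker-2, P2 by Broker-1); the function `build_batches` and the message-count `batch_size` are hypothetical simplifications, since a real producer would more plausibly bound batches by bytes.

```python
from collections import defaultdict

# Leader broker per partition, taken from the diagram's (L) markers.
LEADERS = {0: "Broker-0", 1: "Broker-2", 2: "Broker-1"}


def build_batches(records, batch_size):
    """Group (partition, message) records per partition, then split into batches.

    Returns a list of (leader_broker, partition, messages) tuples.
    """
    per_partition = defaultdict(list)
    for partition, message in records:
        per_partition[partition].append(message)  # preserves per-partition order

    batches = []
    for partition in sorted(per_partition):
        messages = per_partition[partition]
        for start in range(0, len(messages), batch_size):
            chunk = messages[start:start + batch_size]
            batches.append((LEADERS[partition], partition, chunk))
    return batches


records = [(0, "m1"), (1, "m2"), (0, "m3"), (2, "m4"), (0, "m5")]
for leader, partition, messages in build_batches(records, batch_size=2):
    print(leader, partition, messages)
```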

Kafka Consumer
Read data from Kafka brokers
•  JVM APIs are supported out of the box by the project
•  Consumers pull data from brokers
•  Consumer apps have to keep track of the topic-partition offset read
Consumer API
Simple API
•  Greater control over consumption of topic/partitions
•  Consumer apps are more complex because they must handle details such as offset management themselves
High-level
•  Uses Simple API internally
•  Consumer apps are simple to implement because offset tracking comes out of the box
•  But not flexible about which partitions to read
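The pull model and consumer-side offset tracking can be illustrated with a toy consumer over an in-memory list, where a message's list index plays the role of its offset. The class `SimpleConsumerSketch` is hypothetical and only mirrors the idea behind the Simple API; it is not that API.

```python
class SimpleConsumerSketch:
    """Pulls messages from one partition and tracks its own read offset."""

    def __init__(self, log, start_offset=0):
        self.log = log              # list of messages; index == offset
        self.offset = start_offset  # next offset to read, owned by the consumer

    def poll(self, max_messages):
        """Pull up to max_messages starting at the current offset."""
        batch = self.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)   # advance only past what was actually read
        return batch


partition_log = ["a", "b", "c"]
consumer = SimpleConsumerSketch(partition_log)
print(consumer.poll(2), consumer.offset)

partition_log.append("d")           # the producer keeps appending
print(consumer.poll(10), consumer.offset)

# A restarted consumer resumes from the offset it saved somewhere durable.
resumed = SimpleConsumerSketch(partition_log, start_offset=consumer.offset)
print(resumed.poll(10))
```

Because the broker does not track the position, the application decides where and when to persist `offset`.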

Kafka Consumer – Cont.
Consumer Groups
•  Allow multiple hosts to form a group to access a topic
•  Consumer hosts join a group by using the same group.id
•  Guarantees a message is read by only one consumer in a group
•  Partitions are assigned to consumers in a group
•  A consumer node may get one or more partitions
•  But one partition is assigned to only one consumer host
•  Message order is guaranteed within a partition
•  Max parallelism – determined by topic partitions
•  More consumers than partitions – some consumers will be idle
Broker-0: P0, P3
Broker-1: P1, P2
Consumer Group - 1: C1, C2
Consumer Group - 2: C3, C4, C5, C6
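The assignment rules for consumer groups (each partition goes to exactly one consumer in the group, and surplus consumers sit idle) can be sketched like this. The function `assign_partitions` is hypothetical and uses a simple contiguous-range split; the exact pairing a real cluster produces may differ.

```python
def assign_partitions(partitions, consumers):
    """Split partitions into contiguous ranges, one range per consumer.

    Every partition is assigned to exactly one consumer; consumers beyond
    the number of partitions receive an empty list (they are idle).
    """
    if not consumers:
        raise ValueError("a group needs at least one consumer")
    partitions = sorted(partitions)
    consumers = sorted(consumers)
    base, extra = divmod(len(partitions), len(consumers))

    assignment, start = {}, 0
    for index, consumer in enumerate(consumers):
        count = base + (1 if index < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


topic = ["P0", "P1", "P2", "P3"]
print(assign_partitions(topic, ["C1", "C2"]))              # two partitions each
print(assign_partitions(topic, ["C3", "C4", "C5", "C6"]))  # one partition each
print(assign_partitions(["P0", "P1"], ["C1", "C2", "C3"])) # C3 is idle
```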

Kafka – Why Kafka is fast
Fast Writes
Writes are appends to file system
Partitions improve performance and throughput
Uses OS buffer cache
Lots of memory on the machine helps
Fast Reads
Memory mapped files
Efficient transfer from file descriptor to socket descriptor
Linux sendfile(), JVM transferTo() implementation
Why Performance?
Disk flushes are delayed
Durability is guaranteed via replication
When consumers read the latest data, they read it from the page cache
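The file-descriptor-to-socket transfer can be tried out from Python, whose `socket.sendfile()` uses the operating system's `sendfile()` where available and falls back to ordinary sends otherwise. This is only an analogy to the JVM `transferTo()` path named above; the file name `segment.log` and the helper names are made up for the example.

```python
import os
import socket
import tempfile


def send_log_segment(path, sock):
    """Send a whole file over a connected stream socket; return bytes sent."""
    with open(path, "rb") as segment:
        # Hands the file object to the socket layer, which uses os.sendfile
        # when the platform supports it for this file and socket.
        return sock.sendfile(segment)


def demo():
    payload = b"message-1\nmessage-2\n"
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "segment.log")
        with open(path, "wb") as f:
            f.write(payload)
        # A connected socket pair stands in for the broker/consumer connection.
        broker_side, consumer_side = socket.socketpair()
        with broker_side, consumer_side:
            sent = send_log_segment(path, broker_side)
            received = consumer_side.recv(4096)
    return sent, received


sent, received = demo()
print(sent, received)
```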

Kafka – Cluster Mirroring
Mirror Maker
•  Mirror data across clusters even in different DCs / AZs
•  Standalone tool that uses the Consumer and Producer APIs
•  Reads from one or more source clusters and writes to a target cluster
•  Whitelist/blacklist topics

Kafka REST Interface
REST Interface
•  Wraps Producer and Consumer API
Performance Overhead
•  Two hops
•  Extra REST server to maintain
•  Parse JSON payload

Kinesis vs. Kafka -- Terms
Amazon Kinesis                            | Apache Kafka
Streams                                   | Topics
Data Records                              | Messages
Producers                                 | Producers
Kinesis Producer Library                  | Producer API
Consumers                                 | Consumers
Kinesis Applications                      | Consumer Applications
Kinesis Client Library                    | Consumer – High level API
N/A                                       | Consumer – Simple API
Shards                                    | Partitions
N/A (built in MD5 hash on partition keys) | Custom partitioner
Sequence Numbers                          | Offset
Application Name                          | Consumer Group ID
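The built-in MD5 hash on partition keys can be sketched as below: the key is hashed to a 128-bit integer, and the shard whose hash range contains that integer receives the record. The function `shard_for_key` is hypothetical, and it assumes the shards split the 128-bit hash space evenly and in order, which need not hold after resharding a stream.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest is a 128-bit value


def shard_for_key(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard index via its MD5 hash.

    Assumes shard i owns an equal, contiguous slice of the hash space.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")
    return hash_value * num_shards // HASH_SPACE


for key in ("user-1", "user-2", "user-3"):
    print(key, shard_for_key(key, num_shards=4))
```

With Kafka, the equivalent decision is made by the partitioner, which can be replaced with a custom one.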
