Spark Streaming + Kafka
@joanvr
About me: Joan Viladrosa @joanvr
Degree in Computer Science
Advanced programming techniques & system interfaces and integration
Co-Founder at Educabits
Big data solutions using the AWS cloud
Big Data Developer at Trovit
Hadoop and the MapReduce framework
Big Data Architect & Tech Lead at BillyMobile
Designing a new architecture with cutting-edge Hadoop-ecosystem technologies:
Kafka, Storm, Hive, HBase, Spark, Kylin, Druid
What is Kafka?
- Publish-Subscribe Messaging System

What makes it great?
- Fast
- Scalable
- Durable
- Fault-tolerant
What is Kafka?
[Diagram: many Producers publish messages into Kafka, and many Consumers read them from it]
What is Kafka?
[Diagram: on one side, Apache Storm, Apache Spark, a Java app or a logger publish into Kafka; on the other, Apache Storm, Apache Spark, a Java app or a monitoring tool consume from it]
Kafka terminology
- Topic: a feed of messages
- Producer: a process that publishes messages to a topic
- Consumer: a process that subscribes to topics and processes the feed of published messages
- Broker: each server of a Kafka cluster; it holds, receives and sends the actual data
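
For illustration (not from the talk), a minimal producer sketch using the plain Kafka Java client from Scala; the broker address, topic name, key and value are made-up values:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties()
props.put("bootstrap.servers", "broker01:9092")   // Kafka brokers, NOT ZooKeeper
props.put("key.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer",
  "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
// publish one message; the broker appends it to one partition of the topic
producer.send(new ProducerRecord[String, String]("sometopic", "some-key", "some-value"))
producer.close()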
Kafka Topic Partitions
[Diagram: a topic split into Partition 0, 1 and 2; each partition is an ordered, append-only log of messages numbered by offset (0, 1, 2, …); writes are always appended at the new end, older messages stay at the old end]
Kafka Topic Partitions
[Diagram: a producer keeps appending to Partition 0 while Consumer A reads at offset 6 and Consumer B reads at offset 12; each consumer tracks its own position in the log independently]
Kafka Topic Partitions
[Diagram: partitions P0 to P8 spread across Broker 1, Broker 2 and Broker 3, with consumers and producers talking to all of them]
[Same diagram, labelled "More Storage": adding brokers adds storage capacity]
[Same diagram, labelled "More Parallelism": more partitions across brokers allow more parallel reads and writes]
Apache Kafka Semantics
- In short: consumer delivery semantics are up to you, not Kafka.
- Kafka doesn’t store the state of the consumers*
- It just sends you what you ask for (topic, partition, offset, length)
- You have to take care of your state (see the sketch below)
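
Because Kafka just serves whatever (topic, partition, offset) you ask for, a client that keeps its own state can simply seek to a stored offset. A minimal sketch with the Kafka 0.10 consumer API; the topic, partition and stored offset (42) are made-up values:

import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "broker01:9092")
props.put("group.id", "my-group")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("enable.auto.commit", "false")      // we manage our own state

val consumer = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("sometopic", 0)
consumer.assign(Collections.singletonList(tp))
consumer.seek(tp, 42L)                        // resume exactly where *we* left off
val records = consumer.poll(1000)             // Kafka sends what we asked for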
Apache Kafka Timeline
- nov-2012 (0.7): Apache Incubator project
- nov-2013 (0.8): New Producer
- nov-2015 (0.9): New Consumer, Security
- may-2016 (0.10): Kafka Streams
What is Spark Streaming?
- Process streams of data
- Micro-batching approach

What makes it great?
- Same API as Spark
- Same integrations as Spark
- Same guarantees & semantics as Spark
How does it work?
- Discretized Streams (DStream): the live input stream is chopped into micro-batches, and each micro-batch is processed as an RDD (see the sketch below)
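
As an illustration of the micro-batching model (not from the talk), a minimal DStream word count over a socket source; every 5-second micro-batch is processed as a regular RDD with the usual Spark API:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("dstream-sketch")
val ssc  = new StreamingContext(conf, Seconds(5))   // one RDD every 5 seconds

val lines  = ssc.socketTextStream("localhost", 9999)
val counts = lines.flatMap(_.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)                // same API as Spark batch jobs
counts.print()

ssc.start()
ssc.awaitTermination()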
Spark Streaming Semantics: Side effects
As in Spark:
- No guarantee of exactly-once semantics for output actions
- Any side-effecting output operation may be repeated
- Because of node failures, process failures, etc.
- So, be careful when writing to external systems (see the sketch below)
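
For example, a sketch of an output operation made safe against replays by writing with a deterministic key; ExternalStore.upsert is a hypothetical idempotent write, and the DStream is assumed to hold (id, value) pairs:

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    records.foreach { case (id, value) =>
      // keyed upsert: if the micro-batch is replayed after a failure,
      // the same rows are overwritten instead of duplicated
      ExternalStore.upsert(id, value)   // hypothetical helper
    }
  }
}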
Spark Streaming Kafka Integration
Spark Streaming Kafka Integration Timeline
- sep-2014 (Spark 1.1): Receivers
- dec-2014 (Spark 1.2): Fault-tolerant WAL + Python API
- mar-2015 (Spark 1.3): Direct Streams + Python API
- jun-2015 (Spark 1.4): Improved Streaming UI
- sep-2015 (Spark 1.5): Metadata in UI (offsets) + Graduated Direct
- jan-2016 (Spark 1.6)
- jul-2016 (Spark 2.0): Native Kafka 0.10 (experimental)
- dec-2016 (Spark 2.1)
Kafka Receiver (<= Spark 1.1)
[Diagram: a Receiver running inside an Executor continuously receives data using the Kafka high-level API and updates offsets in ZooKeeper; the Driver launches jobs on the received data]
Kafka Receiver with WAL (Spark 1.2)
[Diagram: same as before, but the Receiver also writes the received data to a Write-Ahead Log on HDFS before the Driver launches jobs on it; offsets are still tracked in ZooKeeper]
Kafka Receiver with WAL (Spark 1.2)
[Diagram: the Receiver takes the input stream, block data is written both to memory and to the log, block metadata is written to the log, and the computation (Streaming Context / Spark Context jobs in the Application Driver) is checkpointed]
Kafka Receiver with WAL (Spark 1.2): recovery
[Diagram: after a failure, the restarted Driver restarts the computation from the checkpoint info and relaunches jobs; the restarted Executor recovers block metadata and block data from the log, and the restarted Receiver resends unacked data]
New Direct Kafka Integration w/o Receivers and WALs (Spark 1.3)
[Diagram sequence, Driver and Executors:
1. The Driver queries the latest offsets and decides the offset ranges for the next batch, e.g. topic1, p1, (2000, 2100); topic1, p2, (2010, 2110); topic1, p3, (2002, 2102).
2. The Driver launches jobs on the Executors using those offset ranges.
3. The Executors read the data for their offset ranges directly from Kafka using the Simple API.]
Direct Kafka API benefits
- No WALs or Receivers
- Enables end-to-end exactly-once pipelines
  - as long as updates to downstream systems are idempotent or transactional
- More fault-tolerant
- More efficient
- Easier to use
How to use
Direct Kafka API
Maven dependency
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka_2.10</artifactId>
<version>1.3.0</version><!-- until 1.6.3 -->
</dependency>
How to use Direct Kafka API
Scala example code

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(...)
// hostname:port for Kafka brokers, NOT ZooKeeper
val kafkaParams = Map(
  "metadata.broker.list" -> "broker01:9092,broker02:9092"
)
val topics = Set(
  "sometopic",
  "anothertopic"
)
val stream = KafkaUtils.createDirectStream
  [String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, topics)
How to use Direct Kafka API
Scala example code

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges]
    .offsetRanges
  rdd.foreachPartition { iter =>
    val osr: OffsetRange = offsetRanges(
      TaskContext.get.partitionId)
    // get any needed data from the offset range
    val topic = osr.topic
    val kafkaPartitionId = osr.partition
    val begin = osr.fromOffset
    val end = osr.untilOffset
  }
}
How to use Direct Kafka API
Scala example code

val offsetRanges = Array(
  OffsetRange("sometopic", 0, 110, 220),
  OffsetRange("sometopic", 1, 100, 313),
  OffsetRange("anothertopic", 0, 456, 789)
)
val rdd = KafkaUtils.createRDD
  [String, String, StringDecoder, StringDecoder](
    sc, kafkaParams, offsetRanges)
Spark Streaming UI improvements (Spark 1.4)
Kafka Metadata (offsets) in UI (Spark 1.5)
What about Spark 2.0+ and the new Kafka Integration?
This is why we are here, right?
                             spark-streaming-kafka-0-8   spark-streaming-kafka-0-10
Broker Version               0.8.2.1 or higher           0.10.0 or higher
API Stability                Stable                      Experimental
Language Support             Scala, Java, Python         Scala, Java
Receiver DStream             Yes                         No
Direct DStream               Yes                         Yes
SSL / TLS Support            No                          Yes
Offset Commit API            No                          Yes
Dynamic Topic Subscription   No                          Yes
How to use the old Kafka Integration on Spark 2.0+
Maven dependency
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
<version>2.0.0</version><!-- until 2.1.0 right now -->
</dependency>
How to use the new Kafka Integration on Spark 2.0+
Maven dependency
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
<version>2.0.0</version><!-- until 2.1.0 right now -->
</dependency>
What’s really new with this new Kafka Integration?
- New Consumer API (instead of the Simple API)
- Location Strategies
- Consumer Strategies
- SSL / TLS
- No Python API :(
Location Strategies
- The new consumer API pre-fetches messages into buffers
- So it pays to keep cached consumers on the executors
- And to schedule partitions on the hosts that already have the appropriate consumers
Location Strategies
- PreferConsistent
Distribute partitions evenly across available executors
- PreferBrokers
If your executors are on the same hosts as your Kafka brokers
- PreferFixed
Specify an explicit mapping of partitions to hosts
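
A minimal sketch of picking a location strategy, assuming the LocationStrategies API of the 0-10 integration (the host names are made up):

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.LocationStrategies

// the usual choice: spread partitions evenly over available executors
val consistent = LocationStrategies.PreferConsistent

// pin specific partitions to specific hosts
val fixed = LocationStrategies.PreferFixed(Map(
  new TopicPartition("topicA", 0) -> "executor-host-1",
  new TopicPartition("topicA", 1) -> "executor-host-2"
))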
Consumer Strategies
- New consumer API has a number of different ways to
specify topics, some of which require considerable
post-object-instantiation setup.
- ConsumerStrategies provides an abstraction that allows
Spark to obtain properly configured consumers even
after restart from checkpoint.
Consumer Strategies
- Subscribe: subscribe to a fixed collection of topics
- SubscribePattern: use a regex to specify topics of interest
- Assign: specify a fixed collection of partitions
- All have overloaded constructors to specify the starting offset for a particular partition
- ConsumerStrategy is a public class that you can extend
(see the sketch below)
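
A minimal sketch of the three strategies, assuming the ConsumerStrategies API of the 0-10 integration; kafkaParams is the map shown in the full example a few slides below, and the starting offset (1234) is a made-up value:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.ConsumerStrategies

// a fixed collection of topics
val byName = ConsumerStrategies.Subscribe[String, String](
  Array("topicA", "topicB"), kafkaParams)

// topics matching a regex
val byPattern = ConsumerStrategies.SubscribePattern[String, String](
  java.util.regex.Pattern.compile("topic.*"), kafkaParams)

// a fixed collection of partitions, with an explicit starting offset
val byPartition = ConsumerStrategies.Assign[String, String](
  Array(new TopicPartition("topicA", 0)),
  kafkaParams,
  Map(new TopicPartition("topicA", 0) -> 1234L))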
SSL/TLS encryption
- New consumer API supports SSL
- Only applies to communication between Spark and
Kafka brokers
- Still responsible for separately securing Spark
inter-node communication
SSL/TLS encryption
val kafkaParams = Map[String, Object](
// ... //
// the usual params, make sure the port in bootstrap.servers is the TLS one
// ... //
"security.protocol" -> "SSL",
"ssl.truststore.location" -> "/some-directory/kafka.client.truststore.jks",
"ssl.truststore.password" -> "test1234",
"ssl.keystore.location" -> "/some-directory/kafka.client.keystore.jks",
"ssl.keystore.password" -> "test1234",
"ssl.key.password" -> "test1234"
)
How to use New Kafka Integration on Spark 2.0+
Scala example code

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker01:9092,broker02:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "stream_group_id",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
stream.map(record => (record.key, record.value))
How to use New Kafka Integration on Spark 2.0+
Scala example code

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges]
    .offsetRanges
  rdd.foreachPartition { iter =>
    val osr: OffsetRange = offsetRanges(
      TaskContext.get.partitionId)
    // get any needed data from the offset range
    val topic = osr.topic
    val kafkaPartitionId = osr.partition
    val begin = osr.fromOffset
    val end = osr.untilOffset
  }
}
How to use New Kafka Integration on Spark 2.0+
Scala example code

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges]
    .offsetRanges
  // DO YOUR STUFF with DATA
  stream.asInstanceOf[CanCommitOffsets]
    .commitAsync(offsetRanges)
}
Kafka + Spark Semantics
Three use cases:
- At most once
- At least once
- Exactly once
Kafka + Spark Semantics: At most once
- We don’t want duplicates
- Not worth the hassle of ensuring that messages don’t get lost
- Example: sending statistics over UDP
1. Set spark.task.maxFailures to 1
2. Make sure spark.speculation is false (the default)
3. Set Kafka param auto.offset.reset to “largest”
4. Set Kafka param enable.auto.commit to true
(see the sketch below)
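
A minimal sketch of those settings; note that with the 0.10 consumer the offset-reset value is "latest" (the old 0.8 consumer called it "largest"):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "1")       // fail fast, never retry a task
  .set("spark.speculation", "false")        // the default, but be explicit

val kafkaParams = Map[String, Object](
  "auto.offset.reset"  -> "latest",         // "largest" with the old 0.8 consumer
  "enable.auto.commit" -> (true: java.lang.Boolean)
  // ... plus bootstrap.servers, deserializers, group.id, as usual ...
)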
Kafka + Spark Semantics: At most once
- This means you will lose messages on restart
- But at least they shouldn’t get replayed
- Test this carefully if it’s actually important to you that a message never gets repeated, because it’s not a common use case
Kafka + Spark Semantics: At least once
- We don’t want to lose any record
- We don’t care about duplicates
- Example: sending internal alerts on relatively rare occurrences in the stream
1. Set spark.task.maxFailures > 1000
2. Set Kafka param auto.offset.reset to “smallest”
3. Set Kafka param enable.auto.commit to false
(see the sketch below)
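
A minimal sketch of those settings; with the 0.10 consumer the offset-reset value is "earliest" (the old 0.8 consumer called it "smallest"):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.task.maxFailures", "1000")    // keep retrying rather than drop data

val kafkaParams = Map[String, Object](
  "auto.offset.reset"  -> "earliest",       // "smallest" with the old 0.8 consumer
  "enable.auto.commit" -> (false: java.lang.Boolean)
  // commit manually (e.g. commitAsync) only after the batch is fully processed
)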
Kafka + Spark Semantics: At least once
- Don’t be silly! Do NOT replay your whole log on every restart…
- Manually commit the offsets when you are 100% sure records are processed
- If this is “too hard”, you’d better have a relatively short retention log
- Or be REALLY ok with duplicates, for example when you are outputting to an external system that handles duplicates for you (HBase)
Kafka + Spark Semantics: Exactly once
- We don’t want to lose any record
- We don’t want duplicates either
- Example: storing the stream in a data warehouse
1. We need some kind of idempotent writes, or whole-or-nothing writes
2. Only store offsets EXACTLY after writing data
3. Same parameters as at least once
(see the sketch below)
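
A minimal sketch of "write the data first, store the offsets exactly after" inside foreachRDD; writeBatchAtomically and saveOffsets are hypothetical helpers (for instance one HDFS file per partition-batch and a ZooKeeper znode, as in the ETL use case later):

import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  val offsetRanges: Array[OffsetRange] =
    rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // 1. idempotent / whole-or-nothing write of the batch
  writeBatchAtomically(rdd)       // hypothetical helper

  // 2. only then persist the offsets of the finished batch
  saveOffsets(offsetRanges)       // hypothetical helper
}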
Kafka + Spark Semantics: Exactly once
- Probably the hardest to get right
- There is still some small chance of failure if your app fails just between writing data and committing offsets… (but REALLY small)
- If your write is not “atomic” or has several “steps”, you should checkpoint each step and, when recovering from a failure, finish the “current micro-batch” from the failed step
It’s DEMO time!
Spark Streaming + Kafka at Billy Mobile
a story of love and fury
Some Billy insights (we rock it!)
- 10B records monthly
- 20TB weekly retention log
- 5K events/second
- x4 growth/year
Our use cases: ETL to Data Warehouse
- Input events from Kafka
- Enrich the events with some external data sources
- Finally store them to Hive
- We do NOT want duplicates
- We do NOT want to lose events
Exactly once!
Our use cases: ETL to Data Warehouse
- Hive is not transactional
- Nor does it have idempotent writes
- But writing files to HDFS is “atomic” (whole or nothing)
- So: a 1:1 relation from each partition-batch to a file in HDFS
- Store to ZK the current state of the batch
- Store to ZK the offsets of the last finished batch
Our use cases: ETL to Data Warehouse
On failure:
- If an executor fails, just keep going (reschedule the task)
  > spark.task.maxFailures = 1000
- If the driver fails (or restarts):
  - Load offsets and state from the “current batch” if it exists, and “finish” it (KafkaUtils.createRDD)
  - Continue the stream from the last saved offsets
(see the sketch below)
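
A sketch of that recovery flow at driver start-up, assuming the 0-10 integration; the Zk helpers and writeBatchAtomically are hypothetical, while KafkaUtils.createRDD and createDirectStream are the real APIs (sc, ssc, topics and kafkaParams are assumed to be in scope):

import scala.collection.JavaConverters._
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._

// hypothetical ZooKeeper helpers that track our own state
val currentBatch: Option[Array[OffsetRange]] = Zk.loadCurrentBatchState()
val lastOffsets: Map[TopicPartition, Long]   = Zk.loadLastFinishedOffsets()

// 1. a batch was in flight when the driver died: "finish" it as a plain RDD
currentBatch.foreach { ranges =>
  val rdd = KafkaUtils.createRDD[String, String](
    sc, kafkaParams.asJava, ranges, LocationStrategies.PreferConsistent)
  writeBatchAtomically(rdd)                  // same atomic write as the streaming path
  Zk.markBatchFinished(ranges)               // hypothetical helper
}

// 2. then resume the stream from the last saved offsets
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, lastOffsets))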
Our use cases: Anomalies Detection
- Input events from Kafka
- Periodically load a batch-computed model
- Detect when an offer stops converting (or converts too much)
- We do not care about losing some events (on restart)
- We always need to process the “real-time” stream
At most once!
Our use cases: Anomalies Detection
- It’s useless to detect anomalies on a lagged stream!
- Actually it could be very bad
- Always restart stream on latest offsets
- Restart with “fresh” state
Our use cases: Store it to Entity Cache
- Input events from Kafka
- Almost no processing
- Store them to HBase (which has idempotent writes)
- We do not care about duplicates
- We can’t lose a single event
At least once!
Our use cases: Store it to Entity Cache
- Since HBase has idempotent writes, we can write events multiple times without hassle
- But we do NOT start from the earliest offsets…
  - that would be 7 days of redundant writes…!!!
- We store the offsets of the last finished batch
- But obviously we might re-write some events on restart or failure
Lessons Learned
- Do NOT use checkpointing!
  - Not recoverable across upgrades
  - Track offsets yourself (ZK, HDFS, DB…)
- Do NOT use enable.auto.commit
  - It only makes sense for at-most-once
- Memory might be an issue
  - You do not want to waste it…
  - Adjust batchDuration and maxRatePerPartition (see the sketch below)
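
For the last point, a minimal sketch of the two knobs involved (the values are made up):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  // cap how many records each Kafka partition contributes per second
  // to a single micro-batch
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

// batchDuration: long enough to keep up, short enough to stay near real-time
val ssc = new StreamingContext(conf, Seconds(10))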
Future Research
- Dynamic Allocation
  spark.dynamicAllocation.enabled vs spark.streaming.dynamicAllocation.enabled
  https://issues.apache.org/jira/browse/SPARK-12133
  But there is no reference in the docs…
- Monitoring
- Graceful shutdown
- Structured Streaming
Thank you
very much!
and have a beer!
Questions?
WE ARE HIRING!
@joanvr
joan.viladrosa@billymob.com