State Management in
Structured Streaming
Chandan Prakash
Copyright 2018 © Qubole
Agenda
● Structured Streaming : Brief Intro
● Types of Stream Processing : Stateless vs Stateful
● State in Stream Processing
● State Store in Stream Processing
● State Management in Old Spark Streaming
● State Management in Structured Streaming
● Demo with Code Example
● Quiz, Food for Thought
What does this picture represent?
[Image: Batch Processing vs. Stream Processing]
Structured Streaming : Brief Intro
● Built on the Spark SQL engine
● Illusion presented to the user: the stream of incoming data is an unbounded input table, the processing logic is a SQL query, and the output of processing is a result table
● Internally, the query is converted into incremental micro-batch processing
Structured Streaming Query Example
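The query shown on this slide is not preserved in the transcript. As a stand-in, here is a minimal sketch in plain Python (deliberately not the Spark API) of the model from the previous slide: rows arrive in micro-batches, and each trigger folds only the new rows into a running result table. The function name `process_micro_batch` and the sample data are made up for illustration.

```python
from collections import Counter


def process_micro_batch(result_table, new_rows):
    """Fold only the newly arrived rows into the running result table."""
    for line in new_rows:
        result_table.update(line.split())
    return result_table


# Each inner list stands for the rows appended to the unbounded input table
# between two triggers (hypothetical sample data).
micro_batches = [["spark streaming"], ["structured streaming"], ["spark"]]

result_table = Counter()
for batch in micro_batches:
    process_micro_batch(result_table, batch)
```

The result table after the last trigger equals what a batch word count over all rows would give, which is the point of the "unbounded table" illusion.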
Types of Stream Processing
● Stateless Streaming
○ Processing of every record is independent
○ Operations like map, filter
● Stateful Streaming
○ Processing of a record depends on previous records
○ Operations like counting records per distinct key, deduplicating records
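The difference can be sketched in a few lines of plain Python. This is an illustration under simple assumptions, not Spark code; `to_upper` and `Deduplicator` are hypothetical names.

```python
def to_upper(record):
    """Stateless map: the output depends only on the current record."""
    return record.upper()


class Deduplicator:
    """Stateful operator: remembers every record it has already emitted."""

    def __init__(self):
        self.seen = set()  # the state carried between records

    def process(self, record):
        if record in self.seen:
            return None  # duplicate, drop it
        self.seen.add(record)
        return record


stream = ["a", "b", "a", "c", "b"]
stateless_out = [to_upper(r) for r in stream]

dedup = Deduplicator()
stateful_out = [r for r in map(dedup.process, stream) if r is not None]
```

The stateless operator can be restarted anywhere without loss. The deduplicator gives wrong answers after a restart unless `seen` is saved somewhere, which is what the rest of the talk is about.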
State in Stream Processing
● State of Streaming Progress
○ Metadata of stream processing : offsets
○ Keeps track of how much data has been processed so far
○ Needed for fault tolerance
○ Present in both stateless and stateful processing
● State of Data
○ Intermediate information carried between records
○ Operations like aggregation, deduplication
○ Present in stateful processing
Note: When we say “State”, we generally mean the state of data. The state of
streaming progress is referred to as metadata/offsets
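A small plain-Python sketch of the two kinds of state during failure recovery. The `source` list, the `run` function and the checkpoint layout are assumptions made for this example; offsets and counts sit in one dictionary only for brevity.

```python
source = ["k1", "k2", "k1", "k3", "k1"]  # replayable input, e.g. a log


def run(checkpoint, stop_after=None):
    """Resume from a checkpoint, process records, return the new checkpoint."""
    offset = checkpoint["offset"]        # state of streaming progress
    counts = dict(checkpoint["counts"])  # state of data
    processed = 0
    while offset < len(source):
        if stop_after is not None and processed == stop_after:
            break  # simulated failure
        key = source[offset]
        counts[key] = counts.get(key, 0) + 1
        offset += 1
        processed += 1
    return {"offset": offset, "counts": counts}


start = {"offset": 0, "counts": {}}
after_crash = run(start, stop_after=3)  # fails after three records
recovered = run(after_crash)            # restart picks up at offset 3
```

With only the offset restored, the restart would continue at the right record but with empty counts. With only the counts restored, records would be counted twice.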
State Store in Streaming
● Reliable place for reading and writing intermediate data (state)
● Survives streaming failures and lets processing resume from the same point
● Options: in-memory, file systems, storage systems
In-Memory HashMap
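What a state store has to offer can be sketched as a tiny interface. This is an illustrative toy in plain Python, not the Spark `StateStore` API; the class and method names are assumptions. It covers the in-memory option only, so it shows versioned commits and aborting a failed batch, but not durability.

```python
class InMemoryStateStore:
    """Toy key-value state store with versioned commits (illustrative only)."""

    def __init__(self):
        self.version = 0
        self._committed = {}  # last committed state
        self._working = {}    # uncommitted changes of the current batch

    def get(self, key, default=None):
        return self._working.get(key, self._committed.get(key, default))

    def put(self, key, value):
        self._working[key] = value

    def commit(self):
        """Make the current batch's changes visible as a new version."""
        self._committed.update(self._working)
        self._working = {}
        self.version += 1
        return self.version

    def abort(self):
        """Discard the changes of a failed batch; committed state is untouched."""
        self._working = {}


store = InMemoryStateStore()
store.put("user-1", 1)
store.commit()           # version 1
store.put("user-1", 99)  # this batch fails before commit
store.abort()
```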
State Management in Old (DStream) Spark Streaming
● RDD-based streaming
● Inefficient, flawed design
○ State persisted together with offset metadata
○ Complete state snapshot persisted every micro-batch
○ Tightly coupled to, and synchronous with, Spark RDD tasks
○ No provision for incremental state persistence
○ Processing overhead becomes a bottleneck as state grows
State Management in Structured Streaming
Fundamental shift from Old Spark Streaming
● Decoupled from offsets/metadata checkpointing
● Asynchronous to Spark Tasks/Jobs
● Incremental State persistence
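Why incremental persistence matters can be shown with a rough cost comparison. The sketch below is plain Python with made-up data; it counts the bytes of JSON that would be written per micro-batch when the full state is persisted versus only the keys that changed. JSON is just a convenient stand-in for whatever serialization a real engine uses.

```python
import json


def bytes_written(batches, incremental):
    """Return bytes persisted per micro-batch for a keyed running count."""
    state, written = {}, []
    for batch in batches:
        delta = {}
        for key in batch:
            state[key] = state.get(key, 0) + 1
            delta[key] = state[key]
        payload = delta if incremental else state
        written.append(len(json.dumps(payload, sort_keys=True)))
    return written


# 1000 distinct keys arrive first, then each batch touches a single key.
batches = [[f"key-{i}" for i in range(1000)]] + [["key-0"]] * 3
full = bytes_written(batches, incremental=False)
incr = bytes_written(batches, incremental=True)
```

After the first batch, the full-snapshot approach keeps rewriting all 1000 entries even though one key changed, while the incremental approach writes a single entry.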
HDFS-backed State Management
1. In-memory HashMap + HDFS
2. Versioned key-value store per partition
3. Versioned delta file per partition
4. A partition's task is scheduled, where possible, on the executor that holds its previous state
5. Synchronous writes to the HashMap and to the delta file output stream
6. Asynchronous daemon thread per executor for snapshotting and for purging/deleting files in HDFS
7. Only one thread in an executor can write to a delta file, but threads from multiple executors can try to write to the same delta file.
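Points 1 to 3 and 6 can be sketched with a local directory standing in for HDFS. This is a simplified toy in plain Python: the class name, the JSON encoding and the `<version>.delta` / `<version>.snapshot` file names are assumptions of the sketch, key removals are not handled, and the snapshot is taken inline rather than by a background thread.

```python
import json
import os
import tempfile


class DeltaBackedStore:
    """Toy per-partition store: in-memory dict plus versioned files on disk."""

    def __init__(self, directory):
        self.dir, self.version, self.data = directory, 0, {}

    def commit(self, updates):
        """Apply one batch of updates and persist them as <version>.delta."""
        self.data.update(updates)
        self.version += 1
        self._write(f"{self.version}.delta", updates)

    def snapshot(self):
        """Persist the full map so that older deltas are no longer needed."""
        self._write(f"{self.version}.snapshot", self.data)

    def _write(self, name, payload):
        with open(os.path.join(self.dir, name), "w") as f:
            json.dump(payload, f)

    @classmethod
    def load(cls, directory, version):
        """Rebuild a version from the newest snapshot <= version plus later deltas."""
        store = cls(directory)
        snaps = [int(n.split(".")[0]) for n in os.listdir(directory)
                 if n.endswith(".snapshot") and int(n.split(".")[0]) <= version]
        base = max(snaps, default=0)
        if base:
            with open(os.path.join(directory, f"{base}.snapshot")) as f:
                store.data = json.load(f)
        for v in range(base + 1, version + 1):
            with open(os.path.join(directory, f"{v}.delta")) as f:
                store.data.update(json.load(f))
        store.version = version
        return store


workdir = tempfile.mkdtemp()
live = DeltaBackedStore(workdir)
live.commit({"a": 1})
live.commit({"b": 1})
live.snapshot()        # writes 2.snapshot
live.commit({"a": 2})
restored = DeltaBackedStore.load(workdir, 3)
```

Recovery of version 3 reads one snapshot and one delta instead of replaying every delta from the beginning, which is why periodic snapshotting keeps restart time bounded.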
Code Entities in HDFS-backed State Management
● StatefulOperators
○ define the computation logic to be executed against the state store for the set of rows in a partition
● StateStoreOps
○ prepares a StateStoreRDD that runs the computation logic passed by the stateful operator against the
state store.
● storeUpdateFunction
○ contains the computation logic defining what to do against the state store with the data generated in a
partition task.
● HDFSBackedStateStore
○ concrete implementation of the state store using a concurrent hash map, backed by the HDFS file system
for persistence.
● HDFSBackedStateStoreProvider
○ contains methods to get a given store and to execute maintenance tasks (snapshotting, purging,
deleting files, cleaning old states).
● StateStoreCoordinator
○ ensures the task for a partition gets scheduled on an executor where its last versioned state is
held in the hash map.
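The coordinator's role can be sketched as a registry that remembers which executor last loaded the store for a partition, so the next task can be sent there. This is a plain-Python illustration with hypothetical names (`Coordinator`, `report_active`, `preferred_location`, `schedule`), not Spark's internal API; for brevity the scheduler itself reports the location, and locality is treated as a preference rather than a guarantee.

```python
class Coordinator:
    """Driver-side registry: which executor holds the loaded store for a partition."""

    def __init__(self):
        self._locations = {}

    def report_active(self, operator_id, partition, executor):
        self._locations[(operator_id, partition)] = executor

    def preferred_location(self, operator_id, partition):
        return self._locations.get((operator_id, partition))


def schedule(coordinator, operator_id, partition, free_executors):
    """Pick the executor that already holds the state if free, else any free one."""
    preferred = coordinator.preferred_location(operator_id, partition)
    chosen = preferred if preferred in free_executors else sorted(free_executors)[0]
    coordinator.report_active(operator_id, partition, chosen)
    return chosen


coord = Coordinator()
first = schedule(coord, "agg-0", 7, {"exec-1", "exec-2"})
second = schedule(coord, "agg-0", 7, {"exec-1", "exec-2", "exec-3"})
fallback = schedule(coord, "agg-0", 7, {"exec-3"})
```

When the task lands on the fallback executor, the in-memory map is not there, so the state has to be rebuilt from the persisted snapshot and delta files before processing can continue.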
Code Flow of Stateful Structured Streaming
Quiz Time
Possible issues with the HDFS-backed implementation in production?
● State size is constrained by executor memory
● The same executor memory is shared with RDD computation
● A single daemon thread is responsible for snapshotting entire state HashMaps, cleaning up files, etc.
Food for Thought
In-Memory HashMap: Possible Solution?
Embedded/Local Store: RocksDB
● Embedded key-value data store
● Improved fork of LevelDB, open-sourced by Facebook
● Brings the database close to the processing
● Pros:
○ No memory issues (unlike the HashMap)
○ No network latency (unlike Cassandra)
○ Fast writes: buffer + sequential transaction log
○ Isolation
● Cons:
○ Not distributed
○ Not replicated
○ Maintenance overhead, non-JVM memory
● Architecture:
○ Memtable: in-memory buffer
○ Change log
○ SST files on disk
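The three architecture pieces fit together roughly as follows. This is a toy log-structured store in plain Python, written only to show the write and read paths; it is not RocksDB's API, everything is kept in memory, and compaction and deletes are left out. The class name `TinyLSM` and the tiny memtable limit are assumptions for the example.

```python
import bisect


class TinyLSM:
    """Toy log-structured store: change log + memtable + sorted immutable tables."""

    def __init__(self, memtable_limit=2):
        self.log = []        # sequential change log (would live on disk)
        self.memtable = {}   # in-memory write buffer
        self.sstables = []   # flushed tables, oldest first; each a sorted list
        self.limit = memtable_limit

    def put(self, key, value):
        self.log.append((key, value))  # 1. append to the log
        self.memtable[key] = value     # 2. update the buffer
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        """Write the memtable out as one sorted, immutable table."""
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.log = []  # flushed entries no longer need replay

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest table wins
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


db = TinyLSM()
db.put("a", 1)
db.put("b", 2)  # memtable full: flushed as the first sorted table
db.put("a", 3)  # newer value stays in the memtable
```

Writes only append to the log and touch the buffer, which is where the "fast writes" claim comes from. Reads check the buffer first and then the flushed tables from newest to oldest.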
RocksDB in Streaming Systems
● Apache Flink
https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html
● Apache Samza
https://samza.apache.org/learn/documentation/0.7.0/container/state-management.html
● Kafka Streams
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management
Summary
● What is Stateful Processing and State in Streaming
● Architecture of State Management in Stateful processing of Structured
Streaming
● Code Example
● Why an embedded store like RocksDB is so important in stream processing
Thank You. Questions?
Qubole Blog : https://www.qubole.com/blog/

State management in Structured Streaming

  • 1. State Management in Structured Streaming Chandan Prakash
  • 2. 00Copyright 2018 © Qubole Agenda ● Structured Streaming : Brief Intro ● Types of Stream Processing : Stateless vs Stateful ● State in Stream Processing ● State Store in Stream Processing ● State Management in Old Spark Streaming ● State Management in Structured Streaming ● Demo with Code Example ● Quiz , Food For Thought
  • 3. 00Copyright 2018 © Qubole What does this picture represent ? Image Source: google
  • 4. 00Copyright 2018 © Qubole Batch Processing Stream Processing Image Source: google
  • 5. 00Copyright 2018 © Qubole Structured Streaming : Brief Intro Image Source: google ● Built on Spark SQL engine ● Illusion : Stream of incoming data as unbounded Input Table, Processing logic as Sql Query, output of processing as Results Table ● Internally query gets converted into incremental Micro-batch processing
  • 6. 00Copyright 2018 © Qubole Structured Streaming Query Example
  • 7. 00Copyright 2018 © Qubole Types of Stream Processing ● Stateless Streaming ○ Processing of every record is independent ○ Operations like map, filter ● Stateful Streaming ○ Processing of record is dependent on previous records ○ Operations like aggregating count of records per distinct key, deduplicating records
  • 8. 00Copyright 2018 © Qubole State in Stream Processing ● State of Streaming Progress ○ Metadata of stream processing : offsets ○ Keeping track how much data processed so far ○ Needed for fault tolerance ○ Present in both stateless and stateful processing ● State of Data ○ Intermediate data information between records ○ Operations like aggregation, deduplication ○ Present in Stateful Processing Note: When we say “State”, in general it means the State of data for processing. The other one is called metadata/offsets
  • 9. 00Copyright 2018 © Qubole State Store in Streaming ● Reliable place providing read and write of intermediate data (state) ● Can sustain streaming failures and restore processing from the same point ● Options : In-memory, File Systems, Storage Systems In-Memory HashMap
  • 10. 00Copyright 2018 © Qubole State Management in old/Dstream Spark Streaming ● RDD based Streaming ● Inefficient Flawed design ○ State persisted with offset metadata ○ Complete snapshot persistence every microbatch ○ Tightly coupled, synchronous with Spark RDD tasks ○ No provision for incremental state persistence ○ Processing overhead, bottleneck as state grows
  • 11. 00Copyright 2018 © Qubole State Management in Structured Streaming Fundamental shift from Old Spark Streaming ● Decoupled from offsets/metadata checkpointing ● Asynchronous to Spark Tasks/Jobs ● Incremental State persistence
  • 12. 00Copyright 2018 © Qubole HDFS backed State Management 1. In-Memory Hashmap + HDFS 2. Versioned key-value store per partition 3. Versioned Delta file per partition 4. Partition Task scheduled on same executor where previous state is 5. Synchronous write to HashMap and Delta file outputstream 6. Asynchronous daemon thread per executor for snapshotting, file purging/deletion in HDFS 7. Only one thread in Executor can write to a delta file. But threads from multiple executors can try to write to same delta file.
  • 13. 00Copyright 2018 © Qubole Code Entities in HDFS backed State Management ● StatefulOperators ○ defines computation logic to be executed against the state store with set of rows in a partition ● StateStoreOps ○ prepares a StateStoreRDD for doing computations against state store with the computation logic passed by the stateful operator. ● storeUpdateFunction ○ contains the computation logic defining what to do against the state store with data generated in a partition task. ● HDFSBackedStateStore ○ concrete implementation of State Store using concurrent hashmap, backed by HDFS file system for persistence. ● HDFSBackedStateStoreProvider ○ contains methods to get given store and execute maintenance task (snapshotting , purging, deleting files, cleaning old states). ● StateStoreCoordinator ○ ensures task for a partition gets scheduled on an executor where its last versioned state is maintained in hashmap.
  • 14. Code Flow of Stateful Structured Streaming
  • 15. Quiz Time: Possible issues with the HDFS-backed implementation in production?
  • 16. Quiz Time: Possible issues with the HDFS-backed implementation in production? ● State is constrained by executor memory ● The same executor memory has to be shared with RDD computation ● A single daemon thread is responsible for snapshotting all state hashmaps, cleaning up files, etc.
  • 17. In-Memory HashMap Possible Solution? Food for Thought
  • 18. Embedded/Local Store (RocksDB): ● Embedded key-value data store ● Improved version of LevelDB, open-sourced by Facebook ● Brings the database close to the processing ● Pros: ○ No memory issues (HashMap) ○ No network latency (Cassandra) ○ Fast writes: buffer + sequential transaction log ○ Isolation ● Cons: ○ Not distributed ○ Not replicated ○ Maintenance overhead, non-JVM memory ● Architecture: ○ Memtable: in-memory buffer ○ Change log ○ SST files on disk Image Source: google
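The three architecture pieces named on this slide (memtable, change log, SST files) can be sketched as a toy store. This is a heavily simplified illustration and not RocksDB's actual design or API: `ToyLsmStore`, the JSON file format and the `memtable_limit` parameter are assumptions, and compaction, deletes and Bloom filters are left out.

```python
import json
import os
import tempfile


class ToyLsmStore:
    """Toy embedded store: append-only change log + memtable + sorted files on disk."""

    def __init__(self, directory, memtable_limit=2):
        self.directory = directory
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}
        self.sst_files = []  # oldest first
        self.log_path = os.path.join(directory, "change.log")
        self._recover()

    def put(self, key, value):
        # Sequential append to the log first, then buffer the write in memory.
        with open(self.log_path, "a") as log:
            log.write(json.dumps([key, value]) + "\n")
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.sst_files):  # newest file wins
            with open(path) as f:
                table = dict(json.load(f))
            if key in table:
                return table[key]
        return None

    def _flush(self):
        """Write the memtable as a sorted, immutable file and start a fresh log."""
        path = os.path.join(self.directory, f"{len(self.sst_files):06d}.sst")
        with open(path, "w") as f:
            json.dump(sorted(self.memtable.items()), f)
        self.sst_files.append(path)
        self.memtable = {}
        open(self.log_path, "w").close()  # flushed entries no longer need the log

    def _recover(self):
        """On restart: find existing sorted files, then replay the change log."""
        self.sst_files = sorted(
            os.path.join(self.directory, name)
            for name in os.listdir(self.directory)
            if name.endswith(".sst")
        )
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    key, value = json.loads(line)
                    self.memtable[key] = value


def demo():
    directory = tempfile.mkdtemp()
    store = ToyLsmStore(directory)
    store.put("a", 1)
    store.put("b", 2)  # memtable reaches its limit and is flushed to disk
    store.put("a", 3)  # newer value exists only in the memtable and the log
    reopened = ToyLsmStore(directory)  # simulates a process restart
    return reopened.get("a"), reopened.get("b"), len(reopened.sst_files)
```

Writes touch only memory and a sequentially appended file, which is the "buffer + sequential transaction log" point on the slide, while state that no longer fits in memory lives on local disk rather than in the JVM heap.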
  • 19. RocksDB in Streaming Systems ● Apache Flink https://flink.apache.org/features/2018/01/30/incremental-checkpointing.html ● Apache Samza https://samza.apache.org/learn/documentation/0.7.0/container/state-management.html ● Kafka Streams https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Internal+Data+Management
  • 20. Summary ● What stateful processing and state in streaming are ● Architecture of state management in stateful processing of Structured Streaming ● Code example ● Why an embedded store like RocksDB is so important in stream processing
  • 21. Thank You. Questions? Qubole Blog : https://www.qubole.com/blog/

Editor's Notes

  1. How many of you have an idea about streaming, have worked on any streaming system, or understand the term “state management”? This should be useful for every one of you. State is information about past input that can be used to influence the processing of future input; we will see this in detail. Feel free to ask questions at any point during the presentation.
  2. Why would you like to listen to this? Although the talk is specific to Spark Structured Streaming, the design, architecture, concepts and thought process behind why it is the way it is will give you a good understanding of any streaming technology. They are all like distant cousins of the same family, and you will see many overlaps between different streaming systems. Understanding one helps you understand the others. Many of them copy, or say are inspired by, each other. This will give you the perspective of a streaming engine developer.
  3. *Quick question: What do you infer from this picture?
  4. *This pretty much sums up the difference between batch and stream processing. Batch is data at rest: you take a chunk of data each time you process. In streaming you keep getting data, and you need to process it as and when it arrives.
  5. We will see a running version of this example on a Qubole Notebook after understanding state management. START THE CLUSTER. The objective of showing this code example is to give you an idea of stateful processing, so that when we talk about state management you can relate to it and understand it easily.
  6. Having given a rough idea of Structured Streaming, let's start with the actual topic that we want to discuss today. By analogy to SQL, the select and where clauses of a query are usually stateless, but join, group by and aggregation functions like sum and count require state.
  7. Intermediate information in stream processing. State of progress: offsets/commits.
  8. Often easy to understand when compared with the predecessor; evolution is a constant process, and something new comes because of the limitations of the old. Story about experience with stateless stream processing, maintaining offsets in ZooKeeper.
  9. This is the main meat of this talk, which I want to go into in detail.
  10. Prepared this diagram based on my understanding of the internal code and how it works in the upcoming Spark 2.4. It is very important to note here that all these concepts, like incremental checkpointing and asynchronous state management, are not specific to Spark Streaming. You will find them in other streaming systems like Flink, etc. as well, under different names.
  11. Slide for those interested in checking out the code themselves: the classes/interfaces/methods involved in doing the state management. Won't go into detail; instead will show the code flow of the state management in the next slide.
  12. The stateful operator is where the logic to interact with the state store resides. Show code.
  13. Before I go forward, do you have any questions here? Because now I have a question for you.
  14. Do you see any possible issues with this architecture? Honestly, I have not encountered any issues, but let's discuss what the possible issues with this approach could be.
  15. Go back to architecture diagram
  16. Had intentionally not talked about RocksDB at the start; now is the time. Really wanted to talk about this embedded storage, or local persistent store.
  17. Why embedded storage? It became popular in the flash memory/SSD era, when writing to local disks became much faster than the client-server model of going over the network to storage systems. Sequential read/write: analogy of an airport conveyor belt for spinning disks, with latency from rotation and from the seek time to reach the right sector of the data. Hadoop was about moving processing closer to data; RocksDB is about moving the database closer to processing. Improved LevelDB: multithreaded writes and compaction, support for bloom scans while reading data, improved compaction logic similar to HBase.
  18. RocksDB is present in almost every recent streaming system that needs to keep unlimited state without the penalty of a network call. Storm: currently does not use local storage like RocksDB; it still relies on remote stores like Redis, HBase and Cassandra. Samza: features at LinkedIn, like the personalized feed sent to your wall, are decided after joining a lot of information with the available feed using Samza. Kafka and Samza were written by the same people at LinkedIn, who later went on to found a company called Confluent, where they wrote Kafka Streams. So you will find many similarities.
  19. As said in the beginning, understanding one system will help us understand the others. Understanding RocksDB is one example. Incremental checkpointing, snapshotting and asynchronous state management are other such concepts. Technologies might be different and implementations might be different, but after all they are trying to solve similar problems of the distributed world, which has the same challenges, limitations and expectations; fault tolerance, exactly-once processing, etc. will be there everywhere.
  20. Please keep a close watch on Qubole Engineering. We write a lot of interesting stuff on big data on the cloud, Spark, the open-sourced SparkLens, tuning, Hive, Presto, AWS,