If you want to extend Apache Spark, you don't need to maintain a separate code base in your own fork: you can customize different components of the framework, such as file commit protocols or state and checkpoint stores.
1. Customizing Apache Spark - beyond SparkSessionExtensions
Bartosz Konieczny @waitingforcode
Implementing a custom state store
2. About me
Bartosz Konieczny
Data Engineer @OCTOTechnology
#ApacheSparkEnthusiast #DataOnTheCloud
👓 read my data & Spark articles at waitingforcode.com
🎓 learn data engineering with me at becomedataengineer.com
follow me @waitingforcode
check github.com/bartosz25 for data code snippets
9. State store - my own simplified definition
A versioned, partition-based map used to store the intermediary results (state) of stateful operations (aggregations, streaming joins, arbitrary stateful processing, deduplication, global limit).
10. State store customization 101
▪ How?
▪ spark.sql.streaming.stateStore.providerClass (see the sketch after this list)
▪ What?
▪ org.apache.spark.sql.execution.streaming.state.StateStoreProvider
▪ org.apache.spark.sql.execution.streaming.state.StateStore
▪ Why?
▪ RocksDB rocks 🤘
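For example, plugging in a custom provider is a one-line configuration change. A minimal sketch (com.example.MyStateStoreProvider is a placeholder for your own class):

import org.apache.spark.sql.SparkSession

// The provider class below is a placeholder; point the property at your
// own StateStoreProvider implementation.
val spark = SparkSession.builder()
  .appName("custom-state-store")
  .master("local[*]")
  .config("spark.sql.streaming.stateStore.providerClass",
    "com.example.MyStateStoreProvider")
  .getOrCreate()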
11. APIs - 5 main operation types
trait StateStore
  // CRUD
  def get(key: UnsafeRow): UnsafeRow
  def put(key: UnsafeRow, value: UnsafeRow): Unit
  def remove(key: UnsafeRow): Unit
  // "transaction" management
  def commit(): Long
  def abort(): Unit
  def hasCommitted: Boolean
  // state expiration
  def iterator(): Iterator[UnsafeRowPair]
  def getRange(start: Option[UnsafeRow], end: Option[UnsafeRow]): Iterator[UnsafeRowPair]
  // state store metrics
  def metrics: StateStoreMetrics

trait StateStoreProvider
  // maintenance
  def doMaintenance(): Unit
  // state store metrics
  def supportedCustomMetrics: Seq[StateStoreCustomMetric]
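To make the five groups concrete, here is a toy, self-contained mirror of the store's shape. Plain Scala types stand in for UnsafeRow, and the real trait has more members (e.g. id, version); this is an illustration, not an implementation:

import scala.collection.mutable

// Toy mirror of the StateStore operation groups.
class ToyStateStore(val version: Long) {
  private val data = mutable.HashMap.empty[String, String]
  private var committed = false

  // CRUD
  def get(key: String): Option[String] = data.get(key)
  def put(key: String, value: String): Unit = data.put(key, value)
  def remove(key: String): Unit = data.remove(key)

  // "transaction" management: commit publishes the next version
  def commit(): Long = { committed = true; version + 1 }
  def abort(): Unit = { data.clear(); committed = false }
  def hasCommitted: Boolean = committed

  // state expiration: Spark iterates over all entries to find expired states
  def iterator(): Iterator[(String, String)] = data.iterator

  // state store metrics
  def metrics: Map[String, Long] = Map("numKeys" -> data.size.toLong)
}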
13. CRUD with API
▪ initialize state store: StateStoreOps#mapPartitionsWithStateStore wraps the computation in a StateStoreRDD (or goes through a state store manager), which calls StateStoreProvider#createAndInit and StateStore#getStore(version: Long): StateStore
▪ get current value (state): StateStore#get
▪ set new value (state): StateStore#put
▪ transform state: a Spark-defined function, or a user-defined function for arbitrary stateful processing
▪ examples:
⚪ StreamingDeduplicateExec#store.put(key, EMPTY_ROW)
⚪ FlatMapGroupsWithStateExec#stateManager.putState(store, stateData.keyRow, updatedStateObj, currentTimeoutTimestamp)
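Put together, a single key's update cycle within one micro-batch looks roughly like this (a sketch: the Spark API calls are real, but the wiring and the transform function are illustrative):

import org.apache.spark.sql.catalyst.expressions.UnsafeRow
import org.apache.spark.sql.execution.streaming.state.{StateStore, StateStoreProvider}

// Illustrative CRUD cycle for one key; `transform` stands for the
// Spark-defined or user-defined state transformation.
def updateOneKey(provider: StateStoreProvider, version: Long, key: UnsafeRow,
                 newInput: UnsafeRow,
                 transform: (Option[UnsafeRow], UnsafeRow) => UnsafeRow): Long = {
  val store: StateStore = provider.getStore(version)  // initialize state store
  val currentState = Option(store.get(key))           // get current value (state)
  val newState = transform(currentState, newInput)    // transform state
  store.put(key, newState)                            // set new value (state)
  store.commit()                                      // publish as version + 1
}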
17. State finalization with API
▪ after processing the alive and expired states (through CompletionIterator/NextIterator), validate the modified state and call StateStore#commit
▪ on task completion, the task completion listener invokes the state store listener; if the task failed (version not committed), StateStore#abort (see the sketch below)
▪ once all tasks have terminated, gather & log the state metrics: StateStore#metrics
"customMetrics" : {
  "loadedMapCacheHitCount": 12,
  "loadedMapCacheMissCount": 0,
  "stateOnCurrentVersionSizeBytes": 208
}
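The abort-on-failure part maps to Spark's task completion listener API. A hedged sketch of the pattern (the wrapper function is illustrative):

import org.apache.spark.TaskContext
import org.apache.spark.sql.execution.streaming.state.StateStore

// Register a rollback hook for the current task: if the task ends
// without committing the new version, discard the uncommitted changes.
def abortIfNotCommitted(store: StateStore): Unit = {
  TaskContext.get().addTaskCompletionListener[Unit] { _ =>
    if (!store.hasCommitted) store.abort() // failure: version not committed
  }
}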
19. State maintenance - with API
▪ a background thread per partition (store) starts the maintenance job every spark.sql.streaming.stateStore.maintenanceInterval
▪ the job calls StateStoreProvider#doMaintenance (sketch below)
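What doMaintenance does is provider-specific; the default HDFS-backed provider uses it to snapshot the accumulated deltas and purge old versions. A toy illustration of those duties (plain class with hypothetical helpers; the real trait has more abstract members):

// Toy shape of a maintenance hook; both helpers are hypothetical.
class ToyMaintenance {
  def doMaintenance(): Unit = {
    compactDeltasIntoSnapshot() // hypothetical: fold delta files into one snapshot
    purgeOldVersions()          // hypothetical: drop versions past the retention limit
  }
  private def compactDeltasIntoSnapshot(): Unit = ()
  private def purgeOldVersions(): Unit = ()
}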
24. Remember
▪ getRange(start, end) - no real range scan (the built-in implementation ignores the bounds and returns everything)
▪ state expiration - iteration over all states
▪ iterator() - UnsafeRowPair is mutable
▪ put() - UnsafeRow can be reused, use the copies Luke! (see the sketch after this list)
▪ consistency awareness - spark.sql.streaming.minBatchesToRetain
▪ state reloading semantics - incremental changes (delta) vs snapshot in time
▪ state reloading semantics - delete markers
▪ state store implementation is immutable - it must remain the same between runs
▪ state store commit - micro-batch/epoch + 1!
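On the put() caveat: Spark reuses the same UnsafeRow instance across records, so storing the reference directly would let later records silently overwrite earlier state. A defensive-copy sketch (the class and backing map are illustrative):

import scala.collection.mutable
import org.apache.spark.sql.catalyst.expressions.UnsafeRow

// Illustrative backing map; copy() snapshots the row's bytes so the
// stored entry survives Spark reusing the incoming row buffers.
class CopyingStore {
  private val entries = mutable.HashMap.empty[UnsafeRow, UnsafeRow]

  def put(key: UnsafeRow, value: UnsafeRow): Unit =
    entries.put(key.copy(), value.copy())
}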
25. Resources
▪ follow-up blog post series: https://www.waitingforcode.com/tags/data-ai-summit-europe-2020-articles
▪ Github project - MapDB-backed state store, customized checkpoint manager and file committer: https://github.com/bartosz25/data-ai-summit-2020
▪ blog posts/talks about custom:
  data sources: https://databricks.com/session_eu19/extending-spark-sql-2-4-with-new-data-sources-live-coding-session-continues
  plugins: https://issues.apache.org/jira/browse/SPARK-28091 and https://databricks.com/session_eu20/what-is-new-with-apache-spark-performance-monitoring-in-spark-3-0
  SQL plan: https://databricks.com/session/how-to-extend-apache-spark-with-customized-optimizations and https://www.waitingforcode.com/tags/spark-sql-customization
26. Feedback
Your feedback is important to us. Don't forget to rate and review the sessions.
Thank you!
@waitingforcode / waitingforcode.com
@OCTOTechnology / blog.octo.com/en