Extending Apache Spark – Beyond Spark Session Extensions


If you want to extend Apache Spark and think that you will need to maintain a separate code base in your own fork, you’re wrong. You can customize different components of the framework, like file commit protocols or state and checkpoint stores.



  1. Customizing Apache Spark - beyond SparkSessionExtensions: implementing a custom state store. Bartosz Konieczny, @waitingforcode
  2. About me: Bartosz Konieczny, Data Engineer @OCTOTechnology, #ApacheSparkEnthusiast #DataOnTheCloud
     👓 read my data & Spark articles at waitingforcode.com
     🎓 learn data engineering with me at becomedataengineer.com
     follow me @waitingforcode and check github.com/bartosz25 for data code snippets
  3. A customized Apache Spark?
  4. 3 levels of customization (subjective): User-Defined-*
  5. 3 levels of customization (subjective): User-Defined-*; SQL plans, data sources/sinks, plugins, file committers, checkpoint manager, state stores
  6. 3 levels of customization (subjective): User-Defined-*; SQL plans, data sources/sinks, plugins, file committers, checkpoint manager, state stores; topology mapper, recovery mode 😱
  7. 3 levels of customization (subjective): User-Defined-*; SQL plans, data sources/sinks, plugins, file committers, checkpoint manager, state stores; topology mapper, recovery mode 😱
  8. A customized state store?
  9. State store, a simplified definition by myself: a versioned, partition-based map used to store intermediary results (state) of stateful operations (aggregations, streaming joins, arbitrary stateful processing, deduplication, global limit).
  10. State store customization 101
      ▪ How? spark.sql.streaming.stateStore.providerClass
      ▪ What? org.apache.spark.sql.execution.streaming.state.StateStoreProvider and org.apache.spark.sql.execution.streaming.state.StateStore
      ▪ Why? RocksDB rocks 🤘
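A minimal configuration sketch for the "How?" bullet above: the provider class is plugged in through the Spark SQL option named on the slide. The class name com.example.MapDbStateStoreProvider is a hypothetical placeholder for your own StateStoreProvider implementation.

    import org.apache.spark.sql.SparkSession

    // Point Structured Streaming at a custom state store provider.
    // "com.example.MapDbStateStoreProvider" is a made-up class name.
    val spark = SparkSession.builder()
      .appName("custom-state-store-demo")
      .master("local[*]")
      .config("spark.sql.streaming.stateStore.providerClass",
        "com.example.MapDbStateStoreProvider")
      .getOrCreate()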
  11. APIs - 5 main operation types: CRUD, maintenance, "transaction" management, state expiration, state store metrics
      trait StateStore
        def get(key: UnsafeRow): UnsafeRow
        def put(key: UnsafeRow, value: UnsafeRow): Unit
        def remove(key: UnsafeRow): Unit
        def commit(): Long
        def abort(): Unit
        def hasCommitted: Boolean
        def iterator(): Iterator[UnsafeRowPair]
        def getRange(start: Option[UnsafeRow], end: Option[UnsafeRow]): Iterator[UnsafeRowPair]
        def metrics: StateStoreMetrics
      trait StateStoreProvider
        def doMaintenance(): Unit
        def supportedCustomMetrics: Seq[StateStoreCustomMetric]
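To make the listed operations concrete, here is a condensed, in-memory sketch of a StateStore, assuming the Spark 3.0 shape of the trait shown above. It is an illustration only, not the MapDB implementation from the linked Github project; a real provider would also implement StateStoreProvider and persist the map in commit().

    import scala.collection.mutable
    import org.apache.spark.sql.catalyst.expressions.UnsafeRow
    import org.apache.spark.sql.execution.streaming._        // UnsafeRowPair
    import org.apache.spark.sql.execution.streaming.state._  // StateStore, StateStoreId, StateStoreMetrics

    // In-memory StateStore sketch: one map per (partition, version).
    class InMemoryStateStore(
        override val id: StateStoreId,
        override val version: Long,
        entries: mutable.Map[UnsafeRow, UnsafeRow]) extends StateStore {

      private var committed = false

      override def get(key: UnsafeRow): UnsafeRow = entries.get(key).orNull

      override def put(key: UnsafeRow, value: UnsafeRow): Unit =
        entries.put(key.copy(), value.copy())   // rows are reused by the caller, keep copies

      override def remove(key: UnsafeRow): Unit = entries.remove(key)

      override def iterator(): Iterator[UnsafeRowPair] =
        entries.iterator.map { case (k, v) => new UnsafeRowPair(k, v) }

      override def commit(): Long = {
        // a durable implementation would persist the new version here
        // (delta file, snapshot, external store, ...)
        committed = true
        version + 1
      }

      override def abort(): Unit = { committed = false }

      override def hasCommitted: Boolean = committed

      override def metrics: StateStoreMetrics =
        StateStoreMetrics(entries.size, entries.size * 128L, Map.empty)  // 128 bytes/entry is a rough placeholder
    }

getRange(start, end) is not overridden here because, as slide 15 recalls, its default implementation simply delegates to iterator().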
  12. CRUD: initialize state store; get current value (state); set new value (state); transform state (Spark-defined function, user-defined function for arbitrary stateful processing)
  13. CRUD with API
      ▪ steps: initialize state store; get current value (state); set new value (state); transform state (Spark-defined function, user-defined function for arbitrary stateful processing)
      ▪ involved APIs: StateStore#getStore(version: Long): StateStore + StateStoreProvider#createAndInit; StateStore#get; StateStore#put; StateStoreOps#mapPartitionsWithStateStore (StateStoreRDD) or state store manager
      ▪ examples:
        ⚪ StreamingDeduplicateExec: store.put(key, EMPTY_ROW)
        ⚪ FlatMapGroupsWithStateExec: stateManager.putState(store, stateData.keyRow, updatedStateObj, currentTimeoutTimestamp)
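The examples above boil down to the same get / transform / put cycle; the sketch below shows that cycle in isolation. The names store, key, incoming and mergeStates are assumptions standing in for the operator's own state rows and merge logic.

    import org.apache.spark.sql.catalyst.expressions.UnsafeRow
    import org.apache.spark.sql.execution.streaming.state.StateStore

    // Generic upsert: read the previous state, combine it with the incoming row, write it back.
    // `mergeStates` is a hypothetical user-supplied function.
    def upsertState(store: StateStore, key: UnsafeRow, incoming: UnsafeRow,
                    mergeStates: (UnsafeRow, UnsafeRow) => UnsafeRow): Unit = {
      val current = store.get(key)   // null when the key has no state yet
      val updated = if (current == null) incoming else mergeStates(current, incoming)
      store.put(key, updated)
    }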
  14. State expiration: list all states; apply the expiration predicate (e.g. watermark predicate); remove the state for every matching key
  15. State expiration - with API
      ▪ list all states: StateStore#getRange, StateStore#iterator
      ▪ apply the expiration predicate (e.g. watermark predicate) and remove the state for every matching key: StateStore#remove
      ▪ examples:
        store.getRange(None, None).map { p =>
          stateData.withNew(p.key, p.value, getStateObject(p.value), getTimestamp(p.value))
        }

        // StateStore default implementation
        def getRange(start: Option[UnsafeRow], end: Option[UnsafeRow]): Iterator[UnsafeRowPair] = {
          iterator()
        }

        // StreamingAggregationStateManagerBaseImpl
        override def iterator(store: StateStore): Iterator[UnsafeRowPair] = {
          store.iterator()
        }
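A compact sketch of that expiration loop, assuming the state row carries an event-time column readable through a hypothetical getTimestamp extractor; keys older than the watermark are removed one by one, which is exactly the full-scan behaviour the "Remember" slides warn about.

    import org.apache.spark.sql.catalyst.expressions.UnsafeRow
    import org.apache.spark.sql.execution.streaming.state.StateStore

    // Walk over every state and drop the ones older than the watermark.
    def expireStates(store: StateStore, watermarkMs: Long,
                     getTimestamp: UnsafeRow => Long): Unit = {
      store.getRange(None, None).foreach { pair =>
        if (getTimestamp(pair.value) < watermarkMs) {
          store.remove(pair.key.copy())   // copy: the returned pair is mutable and reused
        }
      }
    }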
  16. State finalization
      ▪ after processing alive and expired states: validate the modified state
      ▪ task completed: invoke the state store listener (task completion listener)
  17. State finalization with API
      ▪ after processing alive and expired states, validate the modified state: StateStore#commit (CompletionIterator, NextIterator)
      ▪ if failure (version not committed), all tasks terminated: StateStore#abort (task completion listener)
      ▪ gather & log state metrics: StateStore#metrics
        "customMetrics" : {
          "loadedMapCacheHitCount": 12,
          "loadedMapCacheMissCount": 0,
          "stateOnCurrentVersionSizeBytes": 208
        }
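Put together, the finalization path looks roughly like the sketch below: a task completion listener aborts any store whose version was never committed, the happy path commits, and the metrics are read afterwards. The CompletionIterator/NextIterator wiring that triggers this inside Spark is omitted, and the println merely stands in for real metric reporting.

    import org.apache.spark.TaskContext
    import org.apache.spark.sql.execution.streaming.state.StateStore

    def finalizeStore(store: StateStore): Unit = {
      // failure path: if the task dies before commit(), discard the uncommitted version
      TaskContext.get().addTaskCompletionListener[Unit] { _ =>
        if (!store.hasCommitted) store.abort()
      }
      // happy path: publish the state as micro-batch version + 1 and expose the metrics
      val committedVersion = store.commit()
      val storeMetrics = store.metrics
      println(s"committed version $committedVersion with ${storeMetrics.numKeys} keys")
    }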
  18. State maintenance: a background thread starts the maintenance job for every partition's store, every spark.sql.streaming.stateStore.maintenanceInterval
  19. State maintenance - with API: a background thread starts the maintenance job for every partition's store, every spark.sql.streaming.stateStore.maintenanceInterval → StateStoreProvider#doMaintenance
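What doMaintenance() actually does is entirely provider-specific; a common pattern, sketched below under that assumption, is to consolidate delta files into a snapshot and prune versions older than the retention threshold. listCommittedVersions, writeSnapshot and deleteVersion are hypothetical helpers of the custom provider.

    // Typical maintenance routine: snapshot the latest version, then prune old ones.
    def doMaintenance(minVersionsToRetain: Int,
                      listCommittedVersions: () => Seq[Long],
                      writeSnapshot: Long => Unit,
                      deleteVersion: Long => Unit): Unit = {
      val versions = listCommittedVersions().sorted
      if (versions.nonEmpty) {
        val latest = versions.last
        writeSnapshot(latest)                                // consolidate deltas into one snapshot
        versions.filter(_ <= latest - minVersionsToRetain)   // keep only the most recent versions
                .foreach(deleteVersion)
      }
    }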
  20. Remember
      ▪ getRange(start, end) - no range
      ▪ state expiration - iteration over all states
  21. Remember
      ▪ getRange(start, end) - no range
      ▪ state expiration - iteration over all states
      ▪ iterator() - UnsafeRowPair is mutable
      ▪ put() - UnsafeRow can be reused, use the copies, Luke!
  22. Remember
      ▪ getRange(start, end) - no range
      ▪ state expiration - iteration over all states
      ▪ iterator() - UnsafeRowPair is mutable
      ▪ put() - UnsafeRow can be reused, use the copies, Luke!
      ▪ consistency awareness - spark.sql.streaming.minBatchesToRetain
  23. Remember
      ▪ getRange(start, end) - no range
      ▪ state expiration - iteration over all states
      ▪ iterator() - UnsafeRowPair is mutable
      ▪ put() - UnsafeRow can be reused, use the copies, Luke!
      ▪ consistency awareness - spark.sql.streaming.minBatchesToRetain
      ▪ state reloading semantic - incremental changes (delta) vs snapshot in time
      ▪ state reloading semantic - delete markers
  24. Remember
      ▪ getRange(start, end) - no range
      ▪ state expiration - iteration over all states
      ▪ iterator() - UnsafeRowPair is mutable
      ▪ put() - UnsafeRow can be reused, use the copies, Luke!
      ▪ consistency awareness - spark.sql.streaming.minBatchesToRetain
      ▪ state reloading semantic - incremental changes (delta) vs snapshot in time
      ▪ state reloading semantic - delete markers
      ▪ state store implementation is immutable - remains the same between runs
      ▪ state store commit - micro-batch/epoch + 1!
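Two of the bullets above lend themselves to short illustrations. First, the mutable-row caveat: the UnsafeRowPair returned by iterator()/getRange() and the rows passed to put() are reused buffers, so anything retained must be a defensive copy. isExpired is a hypothetical predicate.

    import org.apache.spark.sql.catalyst.expressions.UnsafeRow
    import org.apache.spark.sql.execution.streaming.state.StateStore

    // Materializing keys out of the iterator: without copy() every element
    // would end up pointing at the same reused buffer.
    def collectExpiredKeys(store: StateStore, isExpired: UnsafeRow => Boolean): Seq[UnsafeRow] =
      store.iterator()
        .filter(pair => isExpired(pair.value))
        .map(pair => pair.key.copy())
        .toSeq

Second, the state reloading semantic: a hedged, store-agnostic sketch of rebuilding version N from the last snapshot plus the deltas written after it, where an entry without a value acts as a delete marker. DeltaEntry, readSnapshot and readDelta are hypothetical; only the delta-vs-snapshot and delete-marker ideas come from the slide.

    // value = None plays the role of the delete marker in a delta file
    case class DeltaEntry[K, V](key: K, value: Option[V])

    def loadVersion[K, V](version: Long,
                          lastSnapshotVersion: Long,
                          readSnapshot: Long => Map[K, V],
                          readDelta: Long => Seq[DeltaEntry[K, V]]): Map[K, V] = {
      var state = readSnapshot(lastSnapshotVersion)          // start from the consolidated snapshot
      ((lastSnapshotVersion + 1) to version).foreach { v =>  // replay the deltas committed after it
        readDelta(v).foreach {
          case DeltaEntry(key, Some(value)) => state += (key -> value)
          case DeltaEntry(key, None)        => state -= key  // apply the delete marker
        }
      }
      state
    }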
  25. Resources
      ▪ follow-up blog posts series: https://www.waitingforcode.com/tags/data-ai-summit-europe-2020-articles
      ▪ Github project - MapDB-backed state store, customized checkpoint manager and file committer: https://github.com/bartosz25/data-ai-summit-2020
      ▪ blog posts/talks about custom:
        data sources: https://databricks.com/session_eu19/extending-spark-sql-2-4-with-new-data-sources-live-coding-session-continues
        plugins: https://issues.apache.org/jira/browse/SPARK-28091 and https://databricks.com/session_eu20/what-is-new-with-apache-spark-performance-monitoring-in-spark-3-0
        SQL plan: https://databricks.com/session/how-to-extend-apache-spark-with-customized-optimizations and https://www.waitingforcode.com/tags/spark-sql-customization
  26. Feedback: Your feedback is important to us. Don't forget to rate and review the sessions. Thank you! @waitingforcode / waitingforcode.com, @OCTOTechnology / blog.octo.com/en
