Arbitrary Stateful Aggregation and MERGE INTO - Data + AI Summit EU 2020

Jules Damji and Denny Lee from Databricks Developer Relations will recap some keynote highlights, and each will briefly present personal picks from sessions that resonated well with them. Next, Jacek Laskowski, an independent consultant, will speak about Spark 3.0 internals, and Scott Haines from Twilio, Inc. will give a talk about structured streaming microservice architectures. This live coding session and technical deep dive are not to be missed!

  1. Arbitrary Stateful Aggregation and MERGE INTO
       Spark Structured Streaming + Delta Lake = “Double Metrics”
       Jacek Laskowski / jaceklaskowski / November 2020
  2. About the Speaker
       Jacek Laskowski is an IT Freelancer specializing in Apache Spark, Delta Lake, Apache Kafka and Kafka Streams. Contact me at or DM on twitter @jaceklaskowski to discuss opportunities. Best known by "The Internals Of" online books @
  3. The Internals of Delta Lake
       1. Available for free @
  4. Friendly Reminder
       Should you have any questions, feel free to ask them in the chat window. I’m going to answer them at the end of the talk. Thank you!
  5. Client Requirements and Recommendations
       1. A client wants to load Kafka records at regular intervals
            ● Spark Structured Streaming
       2. A client wants to do a stateful aggregation in a custom per-group way
            ● KeyValueGroupedDataset.flatMapGroupsWithState
       3. A client wants to update a Delta table with aggregation results
            ● MERGE INTO
            ● DataStreamWriter.foreachBatch
  6. Arbitrary Stateful Aggregation
       1. KeyValueGroupedDataset.flatMapGroupsWithState (scaladoc)
       2. A user-defined per-group state
       3. For a static batch Dataset, the function will be invoked once per group
       4. For a streaming Dataset, the function will be invoked for each group repeatedly in every trigger, and updates to each group's state will be saved across invocations
  7. The Code
       1. Code?! Open IntelliJ IDEA! 😎
  8. Delta Lake Users Mailing List
       1. Multiple executions of flatMapGroupsWithState when DeltaTable.merge
  9. Possible Way-Outs (“Solutions”)
       1. Separate Delta table for state?
            a. Avoid multiple passes over flatMapGroupsWithState
  10. O’Reilly Learning Spark 2nd Edition
       1. Available for free @
       2. Chapter 9 “Building Reliable Data Lakes with Apache Spark” touches Delta Lake
            a. Also the competitors: Apache Hudi and Apache Iceberg
  11. That’s all folks!
       Thank you! ❤
       /me Answering questions...
       Jacek Laskowski / @jaceklaskowski /
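
The slides do not include the code from the live session, so the three recommendations on slide 5 (a Kafka source, `flatMapGroupsWithState`, and `MERGE INTO` through `foreachBatch`) are sketched below in Scala. This is a minimal illustration under stated assumptions, not the speaker's implementation: the `Event` and `KeyCount` types, the per-key count, the `events` topic, the broker address and all paths are hypothetical, and the sketch assumes the Kafka source and Delta Lake libraries are on the classpath and that the target Delta table already exists.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}

// Hypothetical record types for this sketch (not from the talk).
case class Event(key: String, value: String)
case class KeyCount(key: String, count: Long)

object StatefulMergeSketch {

  // Pure state transition: previous count plus the number of new values.
  def nextCount(previous: Option[Long], values: Iterator[String]): Long =
    previous.getOrElse(0L) + values.size

  // Per-group function: called for each group that has new data in a micro-batch.
  // The GroupState carries the user-defined state across invocations.
  def countPerKey(key: String, events: Iterator[Event], state: GroupState[Long]): Iterator[KeyCount] = {
    val updated = nextCount(state.getOption, events.map(_.value))
    state.update(updated)
    Iterator(KeyCount(key, updated))
  }

  // Upserts one micro-batch into an existing Delta table keyed by `key`.
  def upsert(tablePath: String)(batch: Dataset[KeyCount], batchId: Long): Unit = {
    DeltaTable.forPath(batch.sparkSession, tablePath).as("t")
      .merge(batch.toDF().as("s"), "t.key = s.key")
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stateful-merge-sketch").getOrCreate()
    import spark.implicits._

    // Requirement 1: load Kafka records (broker and topic are assumptions).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value")
      .as[Event]

    // Requirement 2: custom per-group stateful aggregation.
    val counts = events
      .groupByKey(_.key)
      .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.NoTimeout())(countPerKey)

    // Requirement 3: MERGE INTO a Delta table from every micro-batch.
    // A typed function value avoids the overload ambiguity of foreachBatch.
    val writer: (Dataset[KeyCount], Long) => Unit = upsert("/tmp/delta/key_counts") _

    counts.writeStream
      .foreachBatch(writer)
      .outputMode("update")
      .option("checkpointLocation", "/tmp/checkpoints/key_counts")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()
      .awaitTermination()
  }
}
```

The sketch does nothing about the repeated executions of `flatMapGroupsWithState` raised on slide 8. The way-out on slide 9 (a separate Delta table for state) is not shown here; another commonly suggested mitigation, which is not taken from the slides, is to persist the micro-batch Dataset before calling `merge` and unpersist it afterwards.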