Arbitrary Stateful Aggregation and MERGE INTO - Data + AI Summit EU 2020

Jules Damji and Denny Lee from Databricks Developer Relations will recap some keynote highlights, and each will briefly present personal picks from sessions that resonated well with them. Next, Jacek Laskowski, an independent consultant, will speak about Spark 3.0 internals, and Scott Haines from Twilio, Inc. will give a talk about structured streaming microservice architectures. This live coding session and technical deep dive are not to be missed!

Arbitrary Stateful Aggregation
and MERGE INTO
Spark Structured Streaming + Delta Lake = “Double Metrics”
Jacek Laskowski jaceklaskowski / November 2020

About the Speaker
Jacek Laskowski is an IT Freelancer specializing in Apache
Spark, Delta Lake, Apache Kafka and Kafka Streams.
Contact me at jacek@japila.pl or DM on Twitter
@jaceklaskowski to discuss opportunities.
Best known for "The Internals Of" online books @
https://books.japila.pl

The Internals of Delta Lake
1. Available for free @
https://books.japila.pl/delta-lake-internals

Friendly Reminder
Should you have any questions,
feel free to ask them in the chat window.
I’m going to answer them at the end of the talk.
Thank you!

Client Requirements and Recommendations
1. A client wants to load Kafka records at
regular intervals
● Spark Structured Streaming
2. A client wants to do a stateful
aggregation in a custom per-group way
● KeyValueGroupedDataset.flatMapGroupsWithState
3. A client wants to update a Delta table
with aggregation results
● MERGE INTO
● DataStreamWriter.foreachBatch
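
The three recommendations above could be wired together roughly as follows. This is a minimal sketch, not the code shown in the talk: the `Event` and `Metric` types, the `nextTotal` helper, the broker address, the topic name, the paths, the 10-second trigger and the `key` join column are all assumptions made for illustration. It also assumes the Spark Kafka source and Delta Lake libraries are on the classpath and that the target Delta table already exists.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}

// Hypothetical record and result types (not from the talk).
case class Event(key: String, value: Long)
case class Metric(key: String, total: Long)

object PipelineSketch {
  // Pure helper: previous total (if any) plus the values of the current trigger.
  def nextTotal(previous: Option[Long], values: Iterator[Long]): Long =
    previous.getOrElse(0L) + values.sum

  // Requirement 2: custom per-group stateful aggregation.
  def updateTotal(key: String, events: Iterator[Event], state: GroupState[Long]): Iterator[Metric] = {
    val total = nextTotal(state.getOption, events.map(_.value))
    state.update(total)
    Iterator(Metric(key, total))
  }

  // Requirement 3: upsert one micro-batch into the Delta table (MERGE INTO).
  def upsert(tablePath: String): (Dataset[Metric], Long) => Unit = (batch, _) =>
    DeltaTable.forPath(batch.sparkSession, tablePath).as("t")
      .merge(batch.toDF().as("s"), "t.key = s.key")
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("double-metrics-sketch").getOrCreate()
    import spark.implicits._

    // Requirement 1: load Kafka records (assumed broker and topic).
    val events = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()
      .selectExpr(
        "CAST(key AS STRING) AS key",
        "COALESCE(CAST(CAST(value AS STRING) AS BIGINT), 0) AS value")
      .where("key IS NOT NULL")
      .as[Event]

    val metrics = events
      .groupByKey(_.key)
      .flatMapGroupsWithState[Long, Metric](
        OutputMode.Update(), GroupStateTimeout.NoTimeout())(updateTotal)

    metrics.writeStream
      .outputMode(OutputMode.Update())
      .trigger(Trigger.ProcessingTime("10 seconds")) // "regular intervals"
      .option("checkpointLocation", "/tmp/double-metrics/checkpoint")
      .foreachBatch(upsert("/tmp/double-metrics/table"))
      .start()
      .awaitTermination()
  }
}
```

Returning the batch handler as a typed `(Dataset[Metric], Long) => Unit` value is a deliberate choice: with Scala 2.12, passing an untyped lambda to `foreachBatch` can be ambiguous between the Scala and Java overloads.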

Arbitrary Stateful Aggregation
1. KeyValueGroupedDataset.flatMapGroupsWithState (scaladoc)
2. A user-defined per-group state
3. For a static batch Dataset, the function will be invoked once per group
4. For a streaming Dataset, the function will be invoked for each group repeatedly
in every trigger, and updates to each group's state will be saved across
invocations
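
To make the batch case concrete, here is a small sketch that runs flatMapGroupsWithState on a static Dataset in local mode, where the function is invoked once per group and starts from an empty state. The `Reading`, `DeviceState` and `DeviceSummary` types and the `merge` helper are hypothetical names chosen for this example. On a streaming Dataset, the same `update` function would receive the state saved by the previous trigger through `state.getOption`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical types for illustration only.
case class Reading(device: String, value: Long)
case class DeviceState(count: Long, sum: Long)
case class DeviceSummary(device: String, count: Long, sum: Long)

object StatefulAggregationSketch {
  // Pure step: combine the previous state (if any) with the values of this invocation.
  def merge(previous: Option[DeviceState], values: Seq[Long]): DeviceState = {
    val old = previous.getOrElse(DeviceState(0L, 0L))
    DeviceState(old.count + values.size, old.sum + values.sum)
  }

  // The user-defined per-group function passed to flatMapGroupsWithState.
  def update(
      device: String,
      readings: Iterator[Reading],
      state: GroupState[DeviceState]): Iterator[DeviceSummary] = {
    val next = merge(state.getOption, readings.map(_.value).toList)
    state.update(next) // in a streaming query this is saved across invocations
    Iterator(DeviceSummary(device, next.count, next.sum))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("stateful-aggregation-sketch")
      .getOrCreate()
    import spark.implicits._

    // Static batch Dataset: update is called once for "a" and once for "b".
    val readings = Seq(Reading("a", 1L), Reading("a", 2L), Reading("b", 10L)).toDS()

    readings
      .groupByKey(_.device)
      .flatMapGroupsWithState[DeviceState, DeviceSummary](
        OutputMode.Update(), GroupStateTimeout.NoTimeout())(update)
      .show()

    spark.stop()
  }
}
```

Keeping the arithmetic in a pure `merge` function is one way to unit-test the aggregation logic without a running Spark session.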

The Code
1. Code?! Open IntelliJ IDEA! 😎

Delta Lake Users Mailing List
1. Multiple executions of flatMapGroupsWithState when DeltaTable.merge

Possible Way-Outs (“Solutions”)
1. Separate Delta table for state?
a. Avoid multiple passes over flatMapGroupsWithState
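
The slide leaves the details open, so the following is only one possible reading, not the speaker's implementation. The Delta Lake documentation notes that a merge may scan its source more than once, which would re-run the stateful operator if the micro-batch is recomputed. The sketch shows two hypothetical `foreachBatch` handlers: one caches the micro-batch (the approach the Spark documentation suggests for avoiding recomputation), the other first materializes it into a separate Delta table and merges from there. The object name, the paths and the `key` join column are assumptions.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.DataFrame

object MergeOnceSketch {
  // Shared MERGE INTO step; joining on a `key` column is an assumption.
  def mergeInto(targetPath: String, source: DataFrame): Unit =
    DeltaTable.forPath(source.sparkSession, targetPath).as("t")
      .merge(source.as("s"), "t.key = s.key")
      .whenMatched().updateAll()
      .whenNotMatched().insertAll()
      .execute()

  // Variant 1: cache the micro-batch so repeated scans by MERGE reuse the computed rows.
  def mergeCached(targetPath: String): (DataFrame, Long) => Unit = (batch, _) => {
    batch.persist()
    try mergeInto(targetPath, batch)
    finally batch.unpersist()
  }

  // Variant 2: write the micro-batch to a separate Delta table, then merge from that table.
  def mergeViaStaging(targetPath: String, stagingPath: String): (DataFrame, Long) => Unit =
    (batch, _) => {
      batch.write.format("delta").mode("overwrite").save(stagingPath)
      val staged = batch.sparkSession.read.format("delta").load(stagingPath)
      mergeInto(targetPath, staged)
    }
}
```

Either handler could be passed to `DataStreamWriter.foreachBatch` on a DataFrame-typed streaming query, for example `.foreachBatch(MergeOnceSketch.mergeCached("/tmp/double-metrics/table"))`. Whether this fully removes the duplicate invocations in a given Spark and Delta Lake version would need to be verified.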

O’Reilly Learning Spark
2nd Edition
1. Available for free @ https://dbricks.co/get-ebook
2. Chapter 9 “Building Reliable Data Lakes with
Apache Spark” touches on Delta Lake
a. Also the competitors: Apache Hudi and
Apache Iceberg

That’s all folks! Thank you! ❤
/me Answering questions...
Jacek Laskowski / @jaceklaskowski / jacek@japila.pl
