Delta Lake and the Delta Architecture
07/07/2021
Personal introduction
Yue Fang
A big data enthusiast who has worked with big data technologies for almost 10 years.
Builds data pipelines and platforms on Cloudera's platform and the Azure cloud.
A certified AWS Solutions Architect.
Has deep experience with Spark Structured Streaming, Kafka, Cassandra, Hive, HBase, Solr, Event Hubs, and Cosmos DB.
Has also worked on the Azure Databricks platform and Delta Lake.
Outline
● Apache Spark problems
● Data Lake problems
● What is Databricks?
● Delta Lake key features
● Delta Lake architecture
● Lakehouse architecture
Apache Spark Problems
● Not ACID compliant
● Missing schema enforcement
● Small files - big problems
- File listing
- File opening/closing
- Reduced compression effectiveness
- Excessive metadata (external Hive tables)
See these two documents for details:
Generic Load/Save Functions - Spark 3.1.2 Documentation
Transactional writes to cloud storage with DBIO | Databricks on AWS
Data Lakes Problems
A data lake is a centralized repository that allows you to store all your
structured and unstructured data at any scale.
Reliability issues
● Failed production jobs leave data in a corrupt state
● Lack of schema enforcement creates inconsistent, low-quality data (schema-on-read)
● Lack of consistency makes it almost impossible to mix appends and reads, or batch and streaming
Performance issues
● Inconsistent file sizes: files are either too small or too large
● Slow read/write performance of cloud storage compared to file system storage
Garbage In Garbage Out
Why Databricks?
source: Comparing Databricks to Apache Spark
Databricks builds on top of Spark and adds:
- Highly reliable and performant data
pipelines
- Productive data science at scale.
Delta Lake introduction
What is Delta Lake?
● an open source project that enables building a Lakehouse architecture on top
of data lakes.
● a storage layer that brings scalable, ACID transactions to Apache Spark and
other big-data engines.
● Delta Lake provides ACID transactions, scalable metadata handling, and
unifies streaming and batch data processing on top of existing data lakes,
such as S3, ADLS, GCS, and HDFS.
Delta Lake key features
● ACID Transactions
● Scalable Metadata Handling
● Time Travel (data versioning)
● Open Format
● Delta Lake change data feed
● Unified Batch and Streaming Source and Sink
● Schema Enforcement
● Schema Evolution
● Audit History
● Updates and Deletes
● 100% Compatible with Apache Spark API
● Data Clean-up
Delta Lake key feature - ACID transaction
● What the transaction log is.
● How the transaction log serves as a single source of truth to support ACID.
● How Delta Lake computes the state of each table.
● Using optimistic concurrency control.
● How Delta Lake uses mutual exclusion to ensure that commits
are serialized properly.
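The commit protocol outlined above can be sketched with a toy model. This is not Delta Lake's implementation: the function names are hypothetical, a local directory stands in for the _delta_log folder, and exclusive file creation stands in for whatever mutual-exclusion primitive the storage system offers. Each commit is a numbered JSON file, and a writer that loses the race re-reads the latest version and tries again.

```python
import json
import os
import tempfile


def latest_version(log_dir):
    """Return the highest committed version, or -1 for an empty log."""
    versions = [int(name.split(".")[0]) for name in os.listdir(log_dir)
                if name.endswith(".json")]
    return max(versions, default=-1)


def try_commit(log_dir, read_version, actions):
    """Attempt to write commit read_version + 1; fail if it already exists."""
    path = os.path.join(log_dir, f"{read_version + 1:020d}.json")
    try:
        # Mode "x" is exclusive create: it raises if the file exists,
        # which plays the role of mutual exclusion between writers.
        with open(path, "x") as f:
            for action in actions:
                f.write(json.dumps(action) + "\n")
        return True
    except FileExistsError:
        return False


def commit_with_retry(log_dir, actions, max_attempts=5):
    """Optimistic loop: re-read the latest version and retry on conflict."""
    for _ in range(max_attempts):
        version = latest_version(log_dir)
        if try_commit(log_dir, version, actions):
            return version + 1
    raise RuntimeError("too many concurrent commits")


log_dir = tempfile.mkdtemp()
# Two writers both read the table at the same (empty) version.
seen_by_a = latest_version(log_dir)
seen_by_b = latest_version(log_dir)
a_won = try_commit(log_dir, seen_by_a, [{"add": "part-a.parquet"}])
b_won = try_commit(log_dir, seen_by_b, [{"add": "part-b.parquet"}])
# Writer B lost the race, so it retries against the new latest version.
b_version = commit_with_retry(log_dir, [{"add": "part-b.parquet"}])
print(a_won, b_won, b_version)
```

The sketch retries blindly. A real engine would first check whether the winning commit logically conflicts with what the losing writer read before retrying.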
DEMO
Delta Lake key feature - Schema Enforcement
Schema enforcement, also known as schema validation, is a safeguard in Delta
Lake that ensures data quality by rejecting writes to a table that do not match the
table’s schema.
● Schema validation happens on write.
● The data being written cannot contain any additional columns that are not present in the target table’s schema.
● It cannot have column data types that differ from the column data types in the target table.
● It cannot contain column names that differ only by case.
● The table’s schema is saved in JSON format inside the transaction log.
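The three rejection rules above can be expressed as a small check. This is a toy model rather than Delta Lake code: schemas are plain dicts mapping a column name to a type name, and validate_write is a hypothetical helper that returns the list of violations.

```python
def validate_write(table_schema, data_schema):
    """Return a list of violations; an empty list means the write is allowed.

    Both schemas are plain dicts of column name -> type name (toy model).
    """
    errors = []
    lowered = {}
    for name in data_schema:
        key = name.lower()
        if key in lowered:
            errors.append(f"columns {lowered[key]!r} and {name!r} differ only by case")
        lowered[key] = name
    for name, dtype in data_schema.items():
        if name not in table_schema:
            errors.append(f"extra column {name!r} not in table schema")
        elif table_schema[name] != dtype:
            errors.append(f"column {name!r} is {dtype}, table expects {table_schema[name]}")
    return errors


table = {"id": "long", "amount": "double"}
ok = validate_write(table, {"id": "long", "amount": "double"})
bad = validate_write(table, {"id": "string", "amount": "double", "note": "string"})
case = validate_write(table, {"id": "long", "ID": "long"})
print(ok, bad, case, sep="\n")
```

Returning every violation at once, instead of failing on the first, makes the rejected write easier to diagnose.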
Delta Lake key feature - Schema Evolution
Schema evolution is a feature that allows users to easily change a table’s current
schema to accommodate data that is changing over time.
“Read-Compatible” Schema Change .option('mergeSchema', 'true')
● Adding new columns (this is the most common scenario)
● Changing data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType
“Non-Read-Compatible” Schema Change .option("overwriteSchema", "true")
● Dropping a column
● Changing an existing column’s data type (in place)
● Renaming column names that differ only by case (e.g. “Foo” and “foo”)
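As an illustration of the read-compatible rules listed for mergeSchema, the sketch below merges an incoming schema into a table schema. It is a simplified model under stated assumptions: schemas are dicts of column name to type name, only the upcast chain named on the slide is recognized, and merge_schema and can_upcast are hypothetical helpers, not Delta Lake functions. Any other type change raises, which corresponds to the cases that need overwriteSchema.

```python
# Upcasts named on the slide: ByteType -> ShortType -> IntegerType.
WIDENING_ORDER = ["ByteType", "ShortType", "IntegerType"]


def can_upcast(old, new):
    """True if changing a column from old to new is read-compatible."""
    if old == new or old == "NullType":
        return True
    if old in WIDENING_ORDER and new in WIDENING_ORDER:
        return WIDENING_ORDER.index(old) < WIDENING_ORDER.index(new)
    return False


def merge_schema(table_schema, data_schema):
    """Return the evolved schema, or raise if the change is not read-compatible."""
    merged = dict(table_schema)
    for name, dtype in data_schema.items():
        if name not in merged:
            merged[name] = dtype  # new column
        elif can_upcast(merged[name], dtype):
            merged[name] = dtype  # NullType -> any type, or a widening upcast
        elif can_upcast(dtype, merged[name]):
            pass  # incoming data is narrower; the table type is kept
        else:
            raise ValueError(f"incompatible change for {name!r}: {merged[name]} -> {dtype}")
    return merged


table = {"id": "ShortType", "tag": "NullType"}
evolved = merge_schema(table, {"id": "IntegerType", "tag": "StringType", "country": "StringType"})
print(evolved)
```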
DEMO
Delta Lake key feature - Time Travel
Delta Lake time travel allows you to query an older snapshot of a Delta table.
● Timestamp based
● Version number based
● Data retention
○ Transaction log file retention period: delta.logRetentionDuration = 30 days by default
○ Data file retention period: delta.deletedFileRetentionDuration = 7 days by default
● Use cases:
○ Audit data changes
○ Reproduce experiments and reports
○ Rollbacks
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-alter-table.html#delta-table-schema-options
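Version-based and timestamp-based time travel can both be pictured as replaying the transaction log up to a chosen commit. The sketch below is a toy model with an in-memory log, not the Delta Lake reader: the log layout, the integer timestamps, and the helper names snapshot_at and version_as_of are assumptions made for illustration.

```python
def snapshot_at(log, version):
    """Replay commits 0..version and return the set of live data files."""
    if not 0 <= version < len(log):
        raise ValueError(f"version {version} is not in the log")
    files = set()
    for commit in log[: version + 1]:
        for action, path in commit["actions"]:
            if action == "add":
                files.add(path)
            elif action == "remove":
                files.discard(path)
    return files


def version_as_of(log, timestamp):
    """Latest version committed at or before the given timestamp."""
    candidates = [v for v, c in enumerate(log) if c["timestamp"] <= timestamp]
    if not candidates:
        raise ValueError("timestamp is before the first commit")
    return candidates[-1]


log = [
    {"timestamp": 100, "actions": [("add", "f1.parquet")]},
    {"timestamp": 200, "actions": [("add", "f2.parquet")]},
    {"timestamp": 300, "actions": [("remove", "f1.parquet"), ("add", "f3.parquet")]},
]
v1 = snapshot_at(log, 1)
v2 = snapshot_at(log, 2)
by_time = snapshot_at(log, version_as_of(log, 250))
print(sorted(v1), sorted(v2), sorted(by_time))
```

The model also shows why retention matters: an old snapshot can only be read while both its log entries and the data files it references are still kept.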
DEMO
Delta Lake key feature - Table Utility Commands
● Remove files no longer referenced by a Delta table
● Audit History
● Retrieve Table Details
● Generate a manifest file
● Convert a Parquet table to a Delta table
● Convert a Delta table to a Parquet table
DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks
Table deletes, updates, and merges | Databricks on AWS
DEMO
Delta Lake key feature - Insert|Delete|UpSert
● SQL
○ INSERT
○ DELETE
○ UPDATE
○ MERGE
● Delta Table API
○ delete
○ update
○ merge
A merge operation can fail if multiple rows of the source dataset match and
attempt to update the same rows of the target Delta table
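The failure case described above is easy to reproduce in a toy upsert. The sketch uses lists of dicts in place of tables and a hypothetical merge_upsert helper. It is not the Delta Lake MERGE implementation, only an illustration of why two source rows matching one target row make the update ambiguous.

```python
def merge_upsert(target, source, key):
    """Upsert source rows into target (both lists of dicts) by key.

    Raises ValueError when more than one source row matches the same
    target row, because the update would be ambiguous.
    """
    matches = {}
    for row in source:
        matches.setdefault(row[key], []).append(row)
    target_keys = {row[key] for row in target}
    for k, rows in matches.items():
        if k in target_keys and len(rows) > 1:
            raise ValueError(f"{len(rows)} source rows match target key {k!r}")
    result = []
    for row in target:
        if row[key] in matches:
            result.append({**row, **matches[row[key]][0]})  # matched: update
        else:
            result.append(row)
    for k, rows in matches.items():
        if k not in target_keys:
            result.extend(rows)  # not matched: insert
    return result


target = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
merged = merge_upsert(target, [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}], "id")
print(merged)

try:
    merge_upsert(target, [{"id": 2, "name": "x"}, {"id": 2, "name": "y"}], "id")
    ambiguous = False
except ValueError:
    ambiguous = True
print(ambiguous)
```

A common way to avoid the error is to deduplicate the source on the merge key before merging.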
Delta Lake key feature - Clean up
● Transaction log clean up
○ _delta_log directory
○ Checkpoint log file
○ delta.logRetentionDuration = 30 days by default
● Data file clean up
○ VACUUM command (SQL or API)
○ Retention is 7 days by default
○ spark.databricks.delta.retentionDurationCheck.enabled = true|false
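The data file clean-up rule can be sketched as a filter over files that have been removed from the table. This is a toy model under stated assumptions: files_to_vacuum is a hypothetical helper, the 7-day threshold is the default quoted on the slide, and the check_enabled flag only imitates the role of the retention duration check setting.

```python
from datetime import datetime, timedelta

DEFAULT_RETENTION = timedelta(days=7)


def files_to_vacuum(removed_files, now, retention=DEFAULT_RETENTION, check_enabled=True):
    """Pick removed data files whose removal is older than the retention window.

    removed_files maps a file path to the time it was removed from the table.
    """
    if check_enabled and retention < DEFAULT_RETENTION:
        raise ValueError("retention below 7 days; disable the check to proceed")
    cutoff = now - retention
    return sorted(path for path, removed_at in removed_files.items()
                  if removed_at < cutoff)


now = datetime(2021, 7, 7)
removed = {
    "old.parquet": datetime(2021, 6, 1),
    "recent.parquet": datetime(2021, 7, 5),
}
expired = files_to_vacuum(removed, now)
forced = files_to_vacuum(removed, now, retention=timedelta(hours=1), check_enabled=False)
print(expired, forced)
```

Shortening the retention window also shortens how far back time travel can reach, since vacuumed files can no longer be read.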
Delta Lake key feature - Streaming as source and sink
● Delta Lake is deeply integrated with Spark Structured Streaming through
readStream and writeStream.
● Delta Lake overcomes many of the limitations typically associated with
streaming systems and files, including:
○ Maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs)
○ Efficiently discovering which files are new when using files as the source for a stream
Delta Lake key feature - Streaming as source and sink
As a source
A Delta table used as a streaming source does not handle input that is not an append; it throws an exception if any modification occurs on the source table. There are two ways to deal with this:
● Delete the output and checkpoint, and restart the stream from the beginning.
● Set either of these two options:
○ ignoreDeletes
○ ignoreChanges
Specify the initial position
● startingVersion
● startingTimestamp
As a sink
● Append mode
● Complete mode
Delta Lake key feature - Delta Lake change data feed
● Supported in Databricks Runtime 8.4 and above
● The Delta change data feed represents row-level changes between versions of a Delta table.
● Enable it for new tables: set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;
● Change data event schema
In addition to the data columns, change data contains metadata columns that identify the type of change event:
○ _change_type: insert, update_preimage, update_postimage, delete
○ _commit_version
○ _commit_timestamp
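To make the event schema concrete, the sketch below derives change events by diffing two versions of a small table. This is only an illustration: the real change data feed is recorded by the engine as part of each write rather than computed from snapshots, and change_feed is a hypothetical helper. The metadata column names and change types are the ones listed above.

```python
def change_feed(before, after, key, commit_version, commit_timestamp):
    """Diff two table versions (lists of dicts) into change events."""
    old = {row[key]: row for row in before}
    new = {row[key]: row for row in after}
    meta = {"_commit_version": commit_version, "_commit_timestamp": commit_timestamp}
    events = []
    for k in sorted(old.keys() | new.keys()):
        if k not in old:
            events.append({**new[k], "_change_type": "insert", **meta})
        elif k not in new:
            events.append({**old[k], "_change_type": "delete", **meta})
        elif old[k] != new[k]:
            # An update yields two events: the row before and the row after.
            events.append({**old[k], "_change_type": "update_preimage", **meta})
            events.append({**new[k], "_change_type": "update_postimage", **meta})
    return events


v1 = [{"id": 1, "qty": 5}, {"id": 2, "qty": 7}]
v2 = [{"id": 1, "qty": 6}, {"id": 3, "qty": 1}]
events = change_feed(v1, v2, "id", 2, "2021-07-07T00:00:00")
types = [(e["id"], e["_change_type"]) for e in events]
print(types)
```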
Delta Lake Architecture
DEMO
Lakehouse Architecture
A paradigm, or concept, for modern data architecture.
Relies on Delta Lake under the hood
Replaces a separate data warehouse and data lake
Needs a fast SQL analysis engine
A future trend
Thank you
Any questions are welcome.
Learning together

More Related Content

What's hot

Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseDatabricks
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta LakeKnoldus Inc.
 
Getting Started with Databricks SQL Analytics
Getting Started with Databricks SQL AnalyticsGetting Started with Databricks SQL Analytics
Getting Started with Databricks SQL AnalyticsDatabricks
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Databricks
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lakeMykola Zerniuk
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDatabricks
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure DatabricksDustin Vannoy
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)James Serra
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...DataScienceConferenc1
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)James Serra
 
Actionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceActionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceHarald Erb
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake OverviewJames Serra
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshJeffrey T. Pollock
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...Databricks
 

What's hot (20)

Free Training: How to Build a Lakehouse
Free Training: How to Build a LakehouseFree Training: How to Build a Lakehouse
Free Training: How to Build a Lakehouse
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
Getting Started with Databricks SQL Analytics
Getting Started with Databricks SQL AnalyticsGetting Started with Databricks SQL Analytics
Getting Started with Databricks SQL Analytics
 
Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0Achieving Lakehouse Models with Spark 3.0
Achieving Lakehouse Models with Spark 3.0
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Intro to databricks delta lake
 Intro to databricks delta lake Intro to databricks delta lake
Intro to databricks delta lake
 
DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Delta from a Data Engineer's Perspective
Delta from a Data Engineer's PerspectiveDelta from a Data Engineer's Perspective
Delta from a Data Engineer's Perspective
 
Delta Lake with Azure Databricks
Delta Lake with Azure DatabricksDelta Lake with Azure Databricks
Delta Lake with Azure Databricks
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)Data Lakehouse, Data Mesh, and Data Fabric (r2)
Data Lakehouse, Data Mesh, and Data Fabric (r2)
 
Actionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data ScienceActionable Insights with AI - Snowflake for Data Science
Actionable Insights with AI - Snowflake for Data Science
 
Data Lake Overview
Data Lake OverviewData Lake Overview
Data Lake Overview
 
Data Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to MeshData Mesh Part 4 Monolith to Mesh
Data Mesh Part 4 Monolith to Mesh
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 

Similar to Delta lake and the delta architecture

Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...StreamNative
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSDatabricks
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Eric Sun
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkDatabricks
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAiougVizagChapter
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksKnoldus Inc.
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureLuan Moreno Medeiros Maciel
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangDatabricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Cloudera, Inc.
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineChester Chen
 
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...Databricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore IndexSolidQ
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoopnvvrajesh
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...HostedbyConfluent
 

Similar to Delta lake and the delta architecture (20)

Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Growing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RSGrowing the Delta Ecosystem to Rust and Python with Delta-RS
Growing the Delta Ecosystem to Rust and Python with Delta-RS
 
Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)Reshape Data Lake (as of 2020.07)
Reshape Data Lake (as of 2020.07)
 
Healthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache SparkHealthcare Claim Reimbursement using Apache Spark
Healthcare Claim Reimbursement using Apache Spark
 
MS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTUREMS-SQL SERVER ARCHITECTURE
MS-SQL SERVER ARCHITECTURE
 
Aioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_featuresAioug vizag oracle12c_new_features
Aioug vizag oracle12c_new_features
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Getting Started with Delta Lake on Databricks
Getting Started with Delta Lake on DatabricksGetting Started with Delta Lake on Databricks
Getting Started with Delta Lake on Databricks
 
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft AzureOtimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
Otimizações de Projetos de Big Data, Dw e AI no Microsoft Azure
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Transactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric LiangTransactional writes to cloud storage with Eric Liang
Transactional writes to cloud storage with Eric Liang
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
Simplifying Hadoop with RecordService, A Secure and Unified Data Access Path ...
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
 
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
Building the Petcare Data Platform using Delta Lake and 'Kyte': Our Spark ETL...
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
In-memory ColumnStore Index
In-memory ColumnStore IndexIn-memory ColumnStore Index
In-memory ColumnStore Index
 
SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin...
 

More from Adam Doyle

Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering RolesAdam Doyle
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster ServicesAdam Doyle
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations PresentationAdam Doyle
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowAdam Doyle
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAdam Doyle
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAAdam Doyle
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop DevelopmentAdam Doyle
 
The new big data
The new big dataThe new big data
The new big dataAdam Doyle
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020Adam Doyle
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleAdam Doyle
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAAdam Doyle
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackAdam Doyle
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020Adam Doyle
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does dataAdam Doyle
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsAdam Doyle
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingAdam Doyle
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019Adam Doyle
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleAdam Doyle
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user groupAdam Doyle
 

More from Adam Doyle (20)

ML Ops.pptx
ML Ops.pptxML Ops.pptx
ML Ops.pptx
 
Data Engineering Roles
Data Engineering RolesData Engineering Roles
Data Engineering Roles
 
Managed Cluster Services
Managed Cluster ServicesManaged Cluster Services
Managed Cluster Services
 
Great Expectations Presentation
Great Expectations PresentationGreat Expectations Presentation
Great Expectations Presentation
 
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflowMay 2021 Spark Testing ... or how to farm reputation on StackOverflow
May 2021 Spark Testing ... or how to farm reputation on StackOverflow
 
Automate your data flows with Apache NIFI
Automate your data flows with Apache NIFIAutomate your data flows with Apache NIFI
Automate your data flows with Apache NIFI
 
Apache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEAApache Iceberg Presentation for the St. Louis Big Data IDEA
Apache Iceberg Presentation for the St. Louis Big Data IDEA
 
Localized Hadoop Development
Localized Hadoop DevelopmentLocalized Hadoop Development
Localized Hadoop Development
 
The new big data
The new big dataThe new big data
The new big data
 
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020Feature store Overview   St. Louis Big Data IDEA Meetup aug 2020
Feature store Overview St. Louis Big Data IDEA Meetup aug 2020
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 
Operationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEAOperationalizing Data Science St. Louis Big Data IDEA
Operationalizing Data Science St. Louis Big Data IDEA
 
Retooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech StackRetooling on the Modern Data and Analytics Tech Stack
Retooling on the Modern Data and Analytics Tech Stack
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
How stlrda does data
How stlrda does dataHow stlrda does data
How stlrda does data
 
Tailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analyticsTailoring machine learning practices to support prescriptive analytics
Tailoring machine learning practices to support prescriptive analytics
 
Synthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-makingSynthesis of analytical methods data driven decision-making
Synthesis of analytical methods data driven decision-making
 
Big Data IDEA 101 2019
Big Data IDEA 101 2019Big Data IDEA 101 2019
Big Data IDEA 101 2019
 
Data Engineering and the Data Science Lifecycle
Data Engineering and the Data Science LifecycleData Engineering and the Data Science Lifecycle
Data Engineering and the Data Science Lifecycle
 
Data engineering Stl Big Data IDEA user group
Data engineering   Stl Big Data IDEA user groupData engineering   Stl Big Data IDEA user group
Data engineering Stl Big Data IDEA user group
 

Delta Lake and the Delta Architecture

  • 1. Delta Lake and the Delta Architecture 07/07/2021
  • 2. Personal introduction: Yue Fang is a big data enthusiast who has worked with big data technologies for almost 10 years; builds data pipelines and platforms on Cloudera's platform and the Azure cloud; is a certified AWS solution architect; has deep experience with Spark Structured Streaming, Kafka, Cassandra, Hive, HBase, Solr, Event Hubs, and Cosmos DB; and has also worked on the Azure Databricks platform and Delta Lake.
  • 3. Outline ● Apache Spark problems ● Data Lake problems ● What is Databricks? ● Delta Lake key features ● Delta Lake architecture ● Lakehouse architecture
  • 4. Apache Spark Problems ● Not ACID compliant ● Missing schema enforcement ● Small files - big problems: - File listing - File opening/closing - Reduced compression effectiveness - Excessive metadata (external Hive tables). Two docs for details: Generic Load/Save Functions - Spark 3.1.2 Documentation; Transactional writes to cloud storage with DBIO | Databricks on AWS
  • 5. Data Lake Problems. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. Reliability issues: ● Failed production jobs leave data in a corrupt state ● Lack of schema enforcement creates inconsistent and low-quality data (schema-on-read) ● Lack of consistency makes it almost impossible to mix appends and reads, batch and streaming. Performance issues: ● Inconsistent file sizes, with files that are either too small or too big ● Slow read/write performance of cloud storage compared to file system storage. Garbage in, garbage out.
  • 6. Why Databricks? Databricks builds on top of Spark and adds: - Highly reliable and performant data pipelines - Productive data science at scale. Source: Comparing Databricks to Apache Spark
  • 8. What is Delta Lake? ● An open source project that enables building a Lakehouse architecture on top of data lakes. ● A storage layer that brings scalable ACID transactions to Apache Spark and other big-data engines. ● Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS.
  • 9. Delta Lake key features ● ACID Transactions ● Scalable Metadata Handling ● Time Travel (data versioning) ● Open Format ● Delta Lake change data feed ● Unified Batch and Streaming Source and Sink ● Schema Enforcement ● Schema Evolution ● Audit History ● Updates and Deletes ● 100% Compatible with Apache Spark API ● Data Clean-up
  • 10. Delta Lake key feature - ACID transactions ● What the transaction log is. ● How the transaction log serves as a single source of truth to support ACID. ● How Delta Lake computes the state of each table. ● Using optimistic concurrency control. ● How Delta Lake uses mutual exclusion to ensure that commits are serialized properly. DEMO
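The ideas on slide 10 can be made concrete with a small toy model. The sketch below is not Delta Lake's implementation: `ToyLog`, `try_commit`, `snapshot`, and `commit_with_retry` are hypothetical names, the log lives in memory rather than in a `_delta_log` directory, and only `add` and `remove` actions are modelled. It shows three things under those assumptions: table state is computed by replaying ordered commits, each version number can be written only once (mutual exclusion), and a writer that loses the race retries on the next version (optimistic concurrency).

```python
import json


class ToyLog:
    """In-memory stand-in for a transaction log (hypothetical, for illustration only)."""

    def __init__(self):
        self.commits = {}  # version number -> JSON-lines string of actions

    def try_commit(self, version, actions):
        # Mutual exclusion: a given version may be written only once.
        if version in self.commits:
            return False
        self.commits[version] = "\n".join(json.dumps(a) for a in actions)
        return True

    def latest_version(self):
        return max(self.commits, default=-1)

    def snapshot(self, version=None):
        """Replay commits 0..version in order and return the set of live data files."""
        if version is None:
            version = self.latest_version()
        live = set()
        for v in range(version + 1):
            for line in self.commits[v].splitlines():
                action = json.loads(line)
                if "add" in action:
                    live.add(action["add"]["path"])
                elif "remove" in action:
                    live.discard(action["remove"]["path"])
        return live


def commit_with_retry(log, read_version, actions, max_attempts=5):
    """Optimistic concurrency: try read_version + 1, move to the next version on conflict."""
    version = read_version + 1
    for _ in range(max_attempts):
        if log.try_commit(version, actions):
            return version
        # A real engine would inspect the winning commit for logical conflicts here.
        version += 1
    raise RuntimeError("too many concurrent commits")


log = ToyLog()
commit_with_retry(log, -1, [{"add": {"path": "part-0.parquet"}}])
commit_with_retry(log, 0, [{"add": {"path": "part-1.parquet"}}])

# Two writers both start from version 1; the second one has to retry.
v_a = commit_with_retry(
    log, 1, [{"remove": {"path": "part-0.parquet"}}, {"add": {"path": "part-2.parquet"}}]
)
v_b = commit_with_retry(log, 1, [{"add": {"path": "part-3.parquet"}}])
print(v_a, v_b, sorted(log.snapshot()))
```

Calling `log.snapshot(1)` replays only the first two commits, which is the same mechanism that makes reading an older version of a table possible.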
  • 11. Delta Lake key feature - Schema Enforcement. Schema enforcement, also known as schema validation, is a safeguard in Delta Lake that ensures data quality by rejecting writes to a table that do not match the table’s schema. ● Schema validation happens on write. The data being written: ● Cannot contain any additional columns that are not present in the target table’s schema. ● Cannot have column data types that differ from the column data types in the target table. ● Cannot contain column names that differ only by case. ● The table’s schema is saved in JSON format inside the transaction log.
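The three write-time rules on slide 11 can be sketched as a plain validation function. This is a simplified illustration, not Delta Lake's code: `check_write_schema` is a hypothetical helper, and schemas are modelled as dictionaries mapping column name to a type name string rather than as Spark `StructType` objects.

```python
def check_write_schema(table_schema, incoming_schema):
    """Return a list of violations; an empty list means this toy check accepts the write.

    Both arguments are dicts of column name -> type name (a simplification).
    """
    problems = []

    # Rule: column names must not differ only by case.
    seen = {}
    for name in incoming_schema:
        key = name.lower()
        if key in seen:
            problems.append(f"columns {seen[key]!r} and {name!r} differ only by case")
        else:
            seen[key] = name

    for name, dtype in incoming_schema.items():
        if name not in table_schema:
            # Rule: no additional columns that the target table does not have.
            problems.append(f"unexpected column {name!r}")
        elif table_schema[name] != dtype:
            # Rule: data types must match the target table's types.
            problems.append(f"column {name!r}: expected {table_schema[name]}, got {dtype}")
    return problems


table = {"id": "long", "name": "string"}
print(check_write_schema(table, {"id": "long", "name": "string"}))
print(check_write_schema(table, {"id": "string", "extra": "int"}))
```

Returning a list of violations instead of raising on the first one is a design choice for the sketch; it makes it easy to report every mismatch in a rejected write at once.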
  • 12. Delta Lake key feature - Schema Evolution. Schema evolution is a feature that allows users to easily change a table’s current schema to accommodate data that is changing over time. “Read-Compatible” schema changes, enabled with .option('mergeSchema', 'true'): ● Adding new columns (this is the most common scenario) ● Changing data types from NullType -> any other type, or upcasts from ByteType -> ShortType -> IntegerType. “Non-Read-Compatible” schema changes, enabled with .option("overwriteSchema", "true"): ● Dropping a column ● Changing an existing column’s data type (in place) ● Renaming column names that differ only by case (e.g. “Foo” and “foo”). DEMO
  • 13. Delta Lake key feature - Time Travel. Delta Lake time travel allows you to query an older snapshot of a Delta table. ● Timestamp based ● Version number based ● Data retention: the transaction log file retention period is delta.logRetentionDuration (default: 30 days); the data file retention period is delta.deletedFileRetentionDuration (default: 7 days) ● Use cases: ○ Audit data changes ○ Reproduce experiments & reports ○ Rollbacks. https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-alter-table.html#delta-table-schema-options DEMO
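Timestamp-based and version-based time travel on slide 13 are two ways of naming the same snapshot. A minimal sketch of how a timestamp could be resolved to a version follows. It assumes that the chosen version is the latest one committed at or before the requested time, and `version_as_of` is a hypothetical helper, not a Delta Lake API.

```python
from bisect import bisect_right
from datetime import datetime


def version_as_of(commit_times, ts):
    """Map a timestamp to a table version.

    commit_times[i] is the commit time of version i, in ascending order.
    Assumption: the answer is the latest version committed at or before ts.
    """
    idx = bisect_right(commit_times, ts) - 1
    if idx < 0:
        raise ValueError("timestamp is earlier than the first commit")
    return idx


commit_times = [datetime(2021, 7, 1), datetime(2021, 7, 3), datetime(2021, 7, 5)]
print(version_as_of(commit_times, datetime(2021, 7, 4)))
```

The retention settings on the slide bound how far back such a lookup can go: once log entries or data files older than the retention period are cleaned up, the corresponding versions can no longer be read.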
  • 14. Delta Lake key feature - Table Utility Commands ● Remove files no longer referenced by a Delta table ● Audit history ● Retrieve table details ● Generate a manifest file ● Convert a Parquet table to a Delta table ● Convert a Delta table to a Parquet table. References: DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks; Table deletes, updates, and merges | Databricks on AWS. DEMO
  • 15. Delta Lake key feature - Insert | Delete | Upsert ● SQL: INSERT, DELETE, UPDATE, MERGE ● Delta Table API: delete, update, merge. A merge operation can fail if multiple rows of the source dataset match and attempt to update the same rows of the target Delta table. References: DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks; Table deletes, updates, and merges | Databricks on AWS
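The merge failure mode noted on slide 15 is easier to see with a toy upsert over plain Python rows. This is an illustration only: `toy_merge` is a hypothetical function operating on lists of dictionaries, not the Delta Table API. It shows why a merge becomes ambiguous when two source rows match the same target row.

```python
from collections import Counter


def toy_merge(target, source, key):
    """Upsert source rows into target by key.

    Raises ValueError when more than one source row matches the same target row,
    because it would be ambiguous which update should win.
    """
    counts = Counter(row[key] for row in source)
    target_keys = {row[key] for row in target}
    ambiguous = sorted(k for k, n in counts.items() if n > 1 and k in target_keys)
    if ambiguous:
        raise ValueError(f"multiple source rows match target keys: {ambiguous}")

    updates = {row[key]: row for row in source if row[key] in target_keys}
    merged = [updates.get(row[key], row) for row in target]               # matched: update
    merged.extend(row for row in source if row[key] not in target_keys)  # not matched: insert
    return merged


target = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
source = [{"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
result = toy_merge(target, source, "id")
print(result)
```

A common remedy is to deduplicate the source on the merge key before merging, so that each target row is matched by at most one source row.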
  • 16. Delta Lake key feature - Clean up ● Transaction log clean-up: _delta_log, checkpoint files, log files; delta.logRetentionDuration = 30 days by default ● Data file clean-up: VACUUM command (SQL or API); retention is 7 days by default; spark.databricks.delta.retentionDurationCheck.enabled = true|false
  • 17. Delta Lake key feature - Streaming as source and sink ● Delta Lake is deeply integrated with Spark Structured Streaming through readStream and writeStream. ● Delta Lake overcomes many of the limitations typically associated with streaming systems and files, including: maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs); efficiently discovering which files are new when using files as the source for a stream.
  • 18. Delta Lake key feature - Streaming as source and sink. As a source: a Delta table does not handle input that is not an append, and the stream throws an exception if any modifications occur on the table being used as a source. To deal with this, either: ● delete the output and checkpoint and restart the stream from the beginning, or ● set either of these two options: ignoreDeletes, ignoreChanges. Specify the initial position with: ● startingVersion ● startingTimestamp. As a sink: ● Append mode ● Complete mode
  • 19. Delta Lake key feature - Delta Lake change data feed ● Supported on Databricks Runtime 8.4 and above ● The Delta change data feed represents row-level changes between versions of a Delta table. ● set spark.databricks.delta.properties.defaults.enableChangeDataFeed = true; ● Change data event schema: in addition to the data columns, change data contains metadata columns that identify the type of change event: _change_type (insert, update_preimage, update_postimage, delete), _commit_version, _commit_timestamp
  • 21. Lakehouse Architecture ● A paradigm for modern data architecture ● Relies on Delta Lake under the hood ● Replaces the separate data warehouse and data lake ● Needs a fast SQL analysis engine ● A future trend
  • 22. Thank you. Any questions are welcome. Learning together.