2. Personal introduction
Yue Fang
● A big data enthusiast with almost 10 years of hands-on big data experience.
● Builds data pipelines and platforms on Cloudera's platform and the Azure cloud.
● A certified AWS Solutions Architect.
● Deep experience with Spark Structured Streaming, Kafka, Cassandra, Hive, HBase, Solr, Event Hubs, and Cosmos DB.
● Has also worked with the Azure Databricks platform and Delta Lake.
3. Outline
● Apache Spark problems
● Data Lake problems
● What is Databricks?
● Delta Lake key features
● Delta Lake architecture
● Lakehouse architecture
4. Apache Spark Problems
● Not ACID compliant
● Missing schema enforcement
● Small files - big problems
- File listing
- File opening/closing
- Reduced compression effectiveness
- Excessive metadata (external Hive tables)
See these two docs for details:
Generic Load/Save Functions - Spark 3.1.2 Documentation
Transactional writes to cloud storage with DBIO | Databricks on AWS
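To make the small-files point concrete, here is a minimal PySpark sketch of one common mitigation: compacting a directory of many small Parquet files into a few larger ones. The paths and partition count are hypothetical.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("compact-small-files").getOrCreate()

    # Read the fragmented dataset, then coalesce() to shrink the number of
    # output partitions (and files) without a full shuffle. Pick a count
    # that yields files in roughly the 128 MB to 1 GB range.
    df = spark.read.parquet("/data/events_small_files")
    df.coalesce(8).write.mode("overwrite").parquet("/data/events_compacted")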
5. Data Lake Problems
A data lake is a centralized repository that allows you to store all your
structured and unstructured data at any scale.
Reliability issues
● Failed production jobs leave data in a corrupt state
● Lack of schema enforcement creates inconsistent and low-quality data (schema-on-read)
● Lack of consistency makes it almost impossible to mix appends and reads, or batch and streaming
Performance issues
● File size inconsistency, with either too-small or too-big files
● Slow read/write performance of cloud storage compared to file system storage
Garbage In, Garbage Out
6. Why Databricks?
Source: Comparing Databricks to Apache Spark
Databricks builds on top of Spark and adds:
- Highly reliable and performant data pipelines
- Productive data science at scale
8. What is Delta Lake?
● An open source project that enables building a Lakehouse architecture on top of data lakes.
● A storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines.
● Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes, such as S3, ADLS, GCS, and HDFS.
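A minimal sketch of writing and reading a Delta table with PySpark, assuming the delta-spark pip package is installed; the path is hypothetical.

    from delta import configure_spark_with_delta_pip
    from pyspark.sql import SparkSession

    # Configure a local session with the Delta Lake extensions enabled.
    builder = (SparkSession.builder.appName("delta-demo")
               .config("spark.sql.extensions",
                       "io.delta.sql.DeltaSparkSessionExtension")
               .config("spark.sql.catalog.spark_catalog",
                       "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
    spark = configure_spark_with_delta_pip(builder).getOrCreate()

    # Write a tiny table in Delta format, then read it back.
    spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta/demo")
    spark.read.format("delta").load("/tmp/delta/demo").show()

The later sketches in this deck reuse a spark session configured like this one.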
9. Delta Lake key features
● ACID Transactions
● Scalable Metadata Handling
● Time Travel (data versioning)
● Open Format
● Delta Lake change data feed
● Unified Batch and Streaming Source and Sink
● Schema Enforcement
● Schema Evolution
● Audit History
● Updates and Deletes
● 100% Compatible with Apache Spark API
● Data Clean-up
10. Delta Lake key feature - ACID transaction
● What the transaction log is.
● How the transaction log serves as a single source of truth to support ACID.
● How Delta Lake computes the state of each table.
● Using optimistic concurrency control.
● How Delta Lake uses mutual exclusion to ensure that commits
are serialized properly.
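To make the bullets above concrete, a small sketch that peeks at the transaction log of the hypothetical local table from the earlier sketch. Each numbered JSON file under _delta_log is one atomic commit; replaying them in order reconstructs the table's current state.

    import os

    # Assumes the Delta table at /tmp/delta/demo from the earlier sketch
    # (local filesystem, so plain os.listdir works here).
    log_dir = "/tmp/delta/demo/_delta_log"
    for name in sorted(os.listdir(log_dir)):
        print(name)  # e.g. 00000000000000000000.json, 00000000000000000001.json, ...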
DEMO
11. Delta Lake key feature - Schema Enforcement
Schema enforcement, also known as schema validation, is a safeguard in Delta
Lake that ensures data quality by rejecting writes to a table that do not match the
table’s schema.
● Schema validation happens on write.
● The incoming data cannot contain additional columns that are not present in the target table’s schema.
● The incoming data cannot have column data types that differ from the column data types in the target table.
● The incoming data cannot contain column names that differ only by case.
● The table’s schema is saved in JSON format inside the transaction log.
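A minimal sketch of schema enforcement in action, reusing the Delta-enabled spark session from the earlier sketch; the table path is hypothetical.

    # Create a two-column Delta table.
    spark.createDataFrame([(1, "Ann")], ["id", "name"]) \
        .write.format("delta").save("/tmp/delta/people")

    # Appending data with an extra column is rejected: Delta raises an
    # AnalysisException describing the schema mismatch.
    bad = spark.createDataFrame([(2, "Bob", 30)], ["id", "name", "age"])
    try:
        bad.write.format("delta").mode("append").save("/tmp/delta/people")
    except Exception as err:
        print("write rejected:", err)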
12. Delta Lake key feature - Schema Evolution
Schema evolution is a feature that allows users to easily change a table’s current
schema to accommodate data that is changing over time.
“Read-Compatible” Schema Change .option('mergeSchema', 'true')
● Adding new columns (this is the most common scenario)
● Changing data types from NullType to any other type, or upcasts from ByteType -> ShortType -> IntegerType
“Non-Read-Compatible” Schema Change .option("overwriteSchema", "true")
● Dropping a column
● Changing an existing column’s data type (in place)
● Renaming column names that differ only by case (e.g. “Foo” and “foo”)
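A minimal sketch of both evolution paths, continuing the hypothetical /tmp/delta/people table from the schema-enforcement sketch.

    new_rows = spark.createDataFrame([(2, "Bob", 30)], ["id", "name", "age"])

    # Read-compatible change: mergeSchema adds the new `age` column.
    new_rows.write.format("delta").mode("append") \
        .option("mergeSchema", "true").save("/tmp/delta/people")

    # Non-read-compatible change: overwriteSchema replaces schema and data.
    new_rows.write.format("delta").mode("overwrite") \
        .option("overwriteSchema", "true").save("/tmp/delta/people")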
DEMO
13. Delta Lake key feature - Time Travel
Delta Lake time travel allows you to query an older snapshot of a Delta table.
● Timestamp based
● Version number based
● Data retention
Transaction log file retention period:
delta.logRetentionDuration (default: interval 30 days)
Data file retention period:
delta.deletedFileRetentionDuration (default: interval 7 days)
● Use cases:
○ Audit data changes
○ Reproduce experiments & report
○ Rollbacks
https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-alter-table.html#delta-table-schema-options
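A minimal sketch of both flavors of time travel against the hypothetical table from the earlier sketches; the timestamp below is made up.

    # Read the table as of a specific version number.
    v0 = (spark.read.format("delta")
          .option("versionAsOf", 0)
          .load("/tmp/delta/people"))

    # Read the table as of a point in time.
    old = (spark.read.format("delta")
           .option("timestampAsOf", "2021-08-01 00:00:00")
           .load("/tmp/delta/people"))
    v0.show()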
DEMO
14. Delta Lake key feature - Table Utility Commands
● Remove files no longer referenced by the Delta table
● Audit history
● Retrieve table details
● Generate a manifest file
● Convert a Parquet table to a Delta table
● Convert a Delta table back to a Parquet table
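A minimal sketch mapping the bullets above to calls, reusing the Delta-enabled spark session; the paths are hypothetical. Converting a Delta table back to plain Parquet amounts to vacuuming with zero retention and then deleting the _delta_log directory.

    from delta.tables import DeltaTable

    dt = DeltaTable.forPath(spark, "/tmp/delta/people")
    dt.vacuum()                             # remove files no longer referenced
    dt.history().show()                     # audit history of commits
    spark.sql("DESCRIBE DETAIL delta.`/tmp/delta/people`").show()  # table details
    dt.generate("symlink_format_manifest")  # manifest file, e.g. for Presto/Athena

    # Convert an existing Parquet directory in place into a Delta table.
    DeltaTable.convertToDelta(spark, "parquet.`/tmp/parquet/people`")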
DEMO
15. Delta Lake key feature - Insert | Delete | Upsert
● SQL
INSERT
DELETE
UPDATE
MERGE
● Delta Table API
delete
update
merge
A merge operation can fail if multiple rows of the source dataset match and
attempt to update the same rows of the target Delta table.
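A minimal upsert sketch with the Delta Table API, continuing the hypothetical /tmp/delta/people table; as noted above, deduplicate the source on the join key first to avoid the multiple-match failure.

    from delta.tables import DeltaTable

    target = DeltaTable.forPath(spark, "/tmp/delta/people")
    updates = spark.createDataFrame([(1, "Anne"), (3, "Cid")], ["id", "name"])

    # Update matched rows, insert unmatched ones; unspecified columns stay null.
    (target.alias("t")
        .merge(updates.alias("u"), "t.id = u.id")
        .whenMatchedUpdate(set={"name": "u.name"})
        .whenNotMatchedInsert(values={"id": "u.id", "name": "u.name"})
        .execute())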
DML Internals (Delete, Update, Merge) - Delta Lake Tech Talks
Table deletes, updates, and merges | Databricks on AWS
16. Delta Lake key feature - Clean up
● Transaction log clean-up
_delta_log directory
Checkpoint files
delta.logRetentionDuration (default: interval 30 days)
● Data file clean-up
SQL
API
VACUUM command
Retention: 7 days by default
spark.databricks.delta.retentionDurationCheck.enabled = true|false
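A minimal clean-up sketch on the hypothetical table. Disabling the retention check is only sensible for demos, since vacuuming files newer than 7 days can break concurrent readers and time travel.

    from delta.tables import DeltaTable

    dt = DeltaTable.forPath(spark, "/tmp/delta/people")
    dt.vacuum()    # delete unreferenced data files older than the 7-day default

    # Demo only: bypass the safety check and vacuum everything immediately.
    spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
    dt.vacuum(0)   # retention argument is in hours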
17. Delta Lake key feature - Streaming as source and sink
● Delta Lake is deeply integrated with Spark Structured Streaming through
readStream and writeStream.
● Delta Lake overcomes many of the limitations typically associated with
streaming systems and files, including:
○ Maintaining “exactly-once” processing with more than one stream (or concurrent batch jobs)
○ Efficiently discovering which files are new when using files as the source for a stream
18. Delta Lake key feature - Streaming as source and sink
As a source
does not handle input that is not an append and
throws an exception if any modifications occur on
the table being used as a source.
● delete the output and checkpoint and restart
the stream from the beginning.
● set either of these two options:
ignoreDeletes
ignoreChanges
Specify initial position
● startingVersion
● startingTimestamp
As a sink
● Append mode
● Complete mode
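A minimal sketch that uses one Delta table as a streaming source (with the source-side options above) and another as an append-mode sink; the paths are hypothetical.

    # Read a Delta table as a stream, tolerating upstream updates/deletes.
    stream = (spark.readStream.format("delta")
              .option("ignoreChanges", "true")
              .option("startingVersion", 0)   # or startingTimestamp
              .load("/tmp/delta/events"))

    # Write the stream to another Delta table in append mode.
    query = (stream.writeStream.format("delta")
             .outputMode("append")
             .option("checkpointLocation", "/tmp/delta/_chk/events_copy")
             .start("/tmp/delta/events_copy"))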
19. Delta Lake key feature - Delta Lake change data feed
● Supported on Databricks Runtime 8.4 and above
● The Delta change data feed represents row-level changes between
versions of a Delta table.
● Enable for all new tables: set
spark.databricks.delta.properties.defaults.enableChangeDataFeed = true;
● Change data event schema
In addition to the data columns, change data contains metadata
columns that identify the type of change event:
_change_type: insert, update_preimage, update_postimage, delete
_commit_version
_commit_timestamp
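A minimal sketch of reading the change feed, assuming a hypothetical table named people that was created with the change data feed enabled; on some older runtimes the reader option is spelled readChangeData rather than readChangeFeed.

    # Read row-level changes starting from a given table version.
    changes = (spark.read.format("delta")
               .option("readChangeFeed", "true")
               .option("startingVersion", 1)
               .table("people"))
    changes.select("id", "_change_type",
                   "_commit_version", "_commit_timestamp").show()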
21. Lakehouse Architecture
● A paradigm for modern data architecture
● Relies on Delta Lake under the hood
● Replaces separate data warehouses and data lakes
● Requires a fast SQL analysis engine
● A future trend