Leveraging Apache Spark and Delta Lake for Efficient Data Encryption at Scale
Jason Hale and Daniel Harrington
Agenda
Background – Authors, Mars, and the Petcare Data Platform.
CCPA – Enhanced privacy rights and consumer protection.
Gecko – A deep dive into our bespoke, automated CCPA compliance tool.
Background
Authors
Jason Hale
• Master's in Physics – University of Exeter.
• Data Engineer for Mars Petcare – 16 months.
• Worked on ELT framework development and Gecko.
Daniel Harrington
• Master's in Integrated Mechanical and Electrical Engineering – University of Bath.
• Data Engineer for Mars Petcare – 10 months.
• Worked solely on Gecko.
We're part of a broad and diverse family company that's constantly evolving.
We've grown from our beginnings in pet food in 1935 to a family of nutrition, health, and services businesses today.
The Petcare Data Platform
Introduction
The Petcare Data Platform
• The Data & Analytics team manages a platform of anonymized data from across Mars Petcare's brands and businesses.
• Our ingestion pipeline, Kyte, has ingested data from across these business units to form the basis of the Petcare Data Platform (PDP).
• We have built engines and designed processes that have enhanced the business value and integrity of the PDP.
• One of these is Gecko: our CCPA compliance ecosystem designed for Spark and Delta Lake.
The PDP
[Architecture diagram: the PDP comprises Ingestion, Engines, Transformations, Web Services, and Activation/Marketing layers.]
California Consumer Privacy Act (CCPA)
CCPA
• California Consumer Privacy Act – effective January 2020.
• Protects the personal information a business collects about consumers, and governs how it is used and shared.
• Three key rights:
▪ Right to Opt Out.
▪ Right to Request Disclosure.
▪ Right to Request Deletion (Right to Be Forgotten).
Our Mission
1. Handle CCPA Right to Forget requests more efficiently, safely, and effectively.
2. Increase the overall security of PII data in the Petcare Data Platform (PDP) Vault.
3. Maintain the non-PII data structure, in order to continue to provide analytical value and overall data integrity.
The Gecko Ecosystem
Gecko Ecosystem
Core Concepts
• The concept behind Gecko is to use row-level (client-level) encryption for PII data, and to store the encryption keys in a single Delta Lake table in our lake (sketched below).
• Gecko is made up of two core functions: Gecko Crawl and Gecko Delete.
• Gecko Crawl:
▪ Handles encryption/decryption of PII data.
▪ Generates a “Master Table” containing all PII within the PDP.
• Gecko Delete:
▪ Handles CCPA compliance through redaction of encryption keys.
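As a minimal sketch of this core concept (with an assumed path, schema, and column names, not Gecko's actual config), the ID_SALT table can be a small Delta table holding one 16-byte salt per client identifier:

```python
import os
from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.getOrCreate()

# Hypothetical ID_SALT config: one row, and one salt, per client identifier.
rows = [
    Row(Source_Id="banfield", Id_Value="7586241", Salt=bytearray(os.urandom(16))),
    Row(Source_Id="banfield", Id_Value="7586242", Salt=bytearray(os.urandom(16))),
]
spark.createDataFrame(rows).write.format("delta").mode("append").save("/mnt/config/id_salt")
```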
Gecko Crawl
Architecture
1. Key Generation
*Illustrative Example
• Loop through each source and table.
• Salt = 16-byte binary string.
• One salt per Source_Id.
• Client IDs prioritized over primary keys.
1. Key Generation
*Illustrative Example
salt + password = encryption key
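The slides don't name the key-derivation function, so this is a hedged sketch using a standard PBKDF2 derivation from Python's cryptography package: the per-Source_Id salt plus a shared password yields a Fernet key. The password and iteration count here are illustrative assumptions.

```python
import base64
import os
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.pbkdf2 import PBKDF2HMAC

def derive_fernet_key(salt: bytes, password: bytes) -> bytes:
    """salt + password = encryption key (32 urlsafe-base64 bytes, as Fernet requires)."""
    kdf = PBKDF2HMAC(algorithm=hashes.SHA256(), length=32, salt=salt, iterations=390_000)
    return base64.urlsafe_b64encode(kdf.derive(password))

salt = os.urandom(16)                                  # one salt per Source_Id
key = derive_fernet_key(salt, b"example-password")     # hypothetical shared secret
token = Fernet(key).encrypt(b"jane.doe@example.com")   # reversible via Fernet(key).decrypt(token)
```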
2. Data Encryption
• Multithreading notebooks at the table level: encrypt data in parallel.
• Three locations required to encrypt, due to three write locations in the ELT process.
• Join tables to the ID_SALT table on the configured Id_Column / Source_Id in order to derive the salt key.
• A Fernet encryption UDF is applied across all PII columns (sketched below).
• Encrypted data is validated and overwrites the existing ingested data.
• Ability to decrypt and obtain the original PII when required and permitted.
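A minimal PySpark sketch of this step, assuming the ID_SALT join has already produced an encryption_key column (derived as above); the table paths, join keys, and PII column names are illustrative, not the actual Kyte/Gecko configuration.

```python
from cryptography.fernet import Fernet
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

@F.udf(StringType())
def encrypt_value(value, key):
    # Fernet-encrypt a single PII value with the row's derived key; pass nulls through.
    if value is None or key is None:
        return value
    return Fernet(key).encrypt(value.encode()).decode()

table = spark.read.format("delta").load("/mnt/raw/clients")        # hypothetical path
id_salt = spark.read.format("delta").load("/mnt/config/id_salt")   # hypothetical path

joined = table.join(id_salt, on=["Source_Id", "Id_Value"], how="left")  # assumed join keys
for pii_col in ["name", "email", "phone"]:                              # illustrative PII columns
    joined = joined.withColumn(pii_col, encrypt_value(F.col(pii_col), F.col("encryption_key")))

# Delta's snapshot isolation allows the validated result to overwrite the source table.
joined.drop("encryption_key").write.format("delta").mode("overwrite").save("/mnt/raw/clients")
```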
2. Data Encryption
*Illustrative Example
2. Data Encryption
Parquet (x2 ELT write locations):
▪ Individual files for each date ingested.
▪ Required to encrypt each file path one by one (600 in some cases).
▪ Encryption process not easily optimised.
Delta (x1 ELT write location):
▪ Single delta table for all dates.
▪ Single path to encrypt.
▪ Encryption process easily optimised by Spark.
Optimizing Parquet Encryption
Initial Shortfalls – Looping file by file
• Data wasn't partitioned across the cluster, leading to extremely low utilization.
• Made worse by skewed data sets (typically those with free-text fields).
• Runs took an extremely long time.
Optimizing Parquet Encryption
Solution: Parallelism with threading + Spark
• Increasing the number of partitions after the shuffle removes skew effects and ensures Spark parallelism.
• Python's concurrent.futures lets us execute the encryption logic for multiple parquet files across multiple workers in parallel, as sketched below.
• Massively increased cluster utilization and reduced run time from days to hours.
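A sketch of the threading + Spark pattern under stated assumptions: encrypt_pii stands in for the Fernet UDF logic above, list_ingested_paths is a hypothetical helper, and the worker count and partition numbers are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

spark.conf.set("spark.sql.shuffle.partitions", "400")  # illustrative; tune to the cluster

def encrypt_parquet_path(path: str) -> str:
    df = spark.read.parquet(path)
    # Repartition to break up skewed files (e.g. free-text columns) before the UDF runs.
    encrypted = encrypt_pii(df.repartition(200))       # encrypt_pii: hypothetical helper
    # Plain parquet cannot safely overwrite a path it is being read from, so stage the output.
    encrypted.write.mode("overwrite").parquet(path + "_encrypted")
    return path

paths = list_ingested_paths()  # hypothetical: up to ~600 date-stamped paths per table
with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(encrypt_parquet_path, p) for p in paths]
    for done in as_completed(futures):
        print(f"encrypted {done.result()}")
```

Each thread only submits Spark jobs from the driver; the executors do the heavy lifting, so a modest thread count is enough to keep the cluster busy.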
3. Master Table Generation
PII Labeling and Collection for Future Use
• Collect all PII in the PDP into a single Master Table.
• Fields for each PII attribute: Name, Phone Number, Email, Address, Note.
• Each field contains an array of encrypted PII for each Source_Id, as sketched below.
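A minimal sketch of the aggregation, assuming the encrypted tables expose one column per PII attribute; the column and path names are illustrative.

```python
from pyspark.sql import functions as F

encrypted = spark.read.format("delta").load("/mnt/raw/clients")  # hypothetical path

master = encrypted.groupBy("Source_Id").agg(
    F.collect_list("name").alias("Name"),
    F.collect_list("phone").alias("Phone_Number"),
    F.collect_list("email").alias("Email"),
    F.collect_list("address").alias("Address"),
    F.collect_list("note").alias("Note"),
)
master.write.format("delta").mode("overwrite").save("/mnt/gecko/master_table")
```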
3. Master Table Generation
*Illustrative Example
Gecko Delete
Gecko Delete
• The Gecko Delete process offers superior simplicity, consistency, and traceability.
• The process is as follows:
1. The request is ingested into our configs.
2. The delete pipeline is triggered.
3. The request ID is used to filter our ID_SALT config.
4. The relevant salt is redacted (a single Delta update, sketched below).
• The client's PII is now irretrievable, and the non-PII data structure is maintained.
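The redaction itself reduces to a single Delta Lake update. This sketch assumes the ID_SALT table path and column names used in the earlier examples, and a string-typed salt column so the [GECKO REDACTED] marker can be written in place.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

request_id = "7586241"  # client ID from the right-to-forget request

id_salt = DeltaTable.forPath(spark, "/mnt/config/id_salt")  # hypothetical path
id_salt.update(
    condition=F.col("Id_Value") == request_id,              # assumed ID column
    set={"Salt": F.lit("[GECKO REDACTED]")},                # salt gone => key underivable
)
```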
Gecko Delete
*Illustrative Example
1. A Right to Forget request (e.g. Banfield ID 7586241) comes from the Mars Petcare business unit (Veterinary) via the OneTrust system.
2. The client is identified within the Vault (the Petcare Data Platform).
3. The client's salt is redacted from the ID table: 7586241 → [GECKO REDACTED].
• Without the salt, the encryption key cannot be generated.
• We only have to remove a single record from a single table, but we achieve both of the following:
▪ From this point onwards, the client's PII can never be retrieved from the Vault.
▪ All of the non-PII data stays exactly as it was in the lake, safely maintaining its overall integrity and value.
How have our Processes been Improved?
Benefits
Benefits
1. Security – every instance of an individual's information is encrypted.
2. Speed – a single filter and redact.
3. Auditability – the config contains metadata about the deletions performed.
4. Automation – easily monitored as part of a BAU process.
5. Data Integrity – by not deleting rows, data structure and integrity are maintained.
Next Steps for Gecko
Future Work
Future Work
Additions and Improvements
• Potentially building a custom NLP model for automatic PII detection on ingestion.
• An API layer over key access to enhance speed and security.
• Integration of the Gecko module into the ingestion process itself.
Thank you
@marsglobal
linkedin.com/company/mars/
facebook.com/mars
mars.com
Copyright © 2020 Mars, Incorporated — Confidential
Feedback
Your feedback is important to us. Don't forget to rate and review the sessions.