Data teams face a variety of tasks when migrating Hadoop-based platforms to Databricks. A common pitfall during migration is that often-overlooked access control policies can block adoption. This session focuses on best practices for migrating and modernizing Hadoop-based data access policies (such as those in Apache Ranger or Apache Sentry). Data architects must consider new, fine-grained access control requirements when migrating from Hadoop architectures to Databricks in order to deliver secure access to as many data sets and data consumers as possible. The session provides guidance across open source, AWS, Azure, and partner tools such as Immuta on how to scale existing Hadoop-based policies to dynamically support more classes of users, implement fine-grained access control, and use automation to protect sensitive data while maximizing utility, without manual effort.
Migrate and Modernize Hadoop-Based Security Policies for Databricks
1. Migrate and Modernize Hadoop-Based Security Policies for Databricks
Steve Touw
CTO, Immuta
2. Can I just migrate my Apache Ranger/Sentry Policies Directly to [Databricks]?
[presto] [synapse] [snowflake] [starburst] [etc…]
3. Can I just migrate my Apache Ranger/Sentry Policies Directly to [Databricks]?
Migrate: Yes!
Modernize: No!
How do I get to Yes for both? (that’s what this talk is about…)
5. A lot has changed in 8 years...
2012 - Development of Cloudera Access (later renamed to Sentry) starts
2013 - XA Secure created, later acquired by Hortonworks
6. Hadoop is No Longer The Center of the Universe
Multi-cloud, Multi-compute
Managing compute-specific controls across more than one of these systems is not feasible
7. Data Protection Laws of the World...Growing
https://www.dlapiperdataprotection.com/
8. WHY IMMUTA
Privacy Rules & Regulations driving data “fuel crisis”
[Chart, 1990 to 2025. Axis: Data Legally Usable for Analytics. Labels: Compliant Data for Analytics; HIPAA (1996); GLBA (1999); HITECH (2009); GDPR (2018); CCPA (2020); 350+ Privacy & Infosec Bills Proposed; The Data “Fuel” Crisis]
9. WHY IMMUTA
So the data “tug of war” has begun…
LEGAL / COMPLIANCE: We need to secure our data.
DATA ANALYSTS & SCIENTISTS: I need to use our data.
[Diagram: DATA in the middle, with the DATA PLATFORM OWNER / DATA ENGINEERING]
10. More Complexity, Changing Definitions of Privacy Preservation
Language from CCPA (and other similar language in GDPR)
“1798.145(a)(5): The obligations imposed on businesses by this title shall not restrict a business’ ability to collect, use, retain, sell, or disclose consumer information that is deidentified or in the aggregate consumer information.”
Meaning, if you deidentify/anonymize the data, CCPA doesn’t apply, yay!
But, nothing in life is free…
PI is defined as information "that identifies, relates to, describes, is capable of being associated with, or could reasonably be linked." !!!!!
11. How to balance the speed of the business with secure access to sensitive data?
The Privacy vs Utility Tradeoff
[Diagram: THE RISK OF DATA USE, a spectrum from FULL PRIVACY (Closed) to FULL UTILITY (Open) with a Sweet spot between them. LEGAL / COMPLIANCE and DATA ANALYSTS & SCIENTISTS are shown alongside the spectrum. Annotation: More stringent definitions are swinging the pendulum here (Momentum)]
12. The World has Changed.
We are in: The “Cloud Private Data Era”
● More regulatory and privacy concerns
● More stringent definitions of privacy preservation
● Complex data platform ecosystem
13. The “Cloud Private Data Era” Has Created a Role Tidal Wave
● More regulatory and privacy concerns
● More stringent definitions of privacy preservation
● Complex data platform ecosystem
14. Role Explosion Example (Real Customer Use Case)
Each row-level policy in Ranger is tied to an individual role, but they are all doing the “same thing”.
If you want to show new data, you need a new Role and a new Policy.
This isn’t just Ranger: think AWS IAM Roles too!
[Screenshot: list of Ranger row-level policies, names redacted. Annotations: user associated to role; the exact same policy written over and over again; the only change: the role]
15. Role-Based Access Control (RBAC) is Broken
▪ RBAC should really be named “Static-based Access Control”
▪ It’s like writing code without being able to use variables!
16. Conceived Before the Cloud Private Data Era
2012 - Development of Cloudera Access (later renamed to Sentry) starts
2013 - XA Secure created, later acquired by Hortonworks
17. You Must Do Both…
If You Don’t, You Won’t Realize the Benefits of the Cloud
Migrate: Yes!
Modernize: Yes!
18. Let’s Cover How To Fix Each of These...
Fixes:
● Attribute-based Access Control (ABAC)
● Privacy Enhancing Technologies (PETs)
● Separation of Policy from Platform
Problems:
● More regulatory and privacy concerns
● More stringent definitions of privacy preservation
● Complex data platform ecosystem
20. Separate Policy from Platform
Just like the big data era required the separation of compute from storage, the private data era requires the separation of policy from platform.
This allows defining policy externally from the platform and executing enforcement live in the platform without creating data copies/views.
● Table access controls
● Column level controls
● Row level security
● Cell-level controls
In a consistent manner, no matter your compute
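As a rough illustration of defining policy outside the platform and applying it at query time instead of building views or copies, the sketch below renders one externally stored policy into a SELECT statement. The policy shape, the table and column names, and the `render_query` helper are all hypothetical; real enforcement depends on the compute platform and is not shown here.

```python
# Hypothetical policy stored outside any compute platform.
POLICY = {
    "table": "patients",
    "mask_columns": ["ssn"],        # column-level control
    "row_filter": "region = 'EU'",  # row-level security
}


def render_query(policy, columns):
    """Render a SELECT that applies the policy at read time (no view, no copy)."""
    select = [
        f"NULL AS {col}" if col in policy["mask_columns"] else col
        for col in columns
    ]
    sql = f"SELECT {', '.join(select)} FROM {policy['table']}"
    if policy.get("row_filter"):
        sql += f" WHERE {policy['row_filter']}"
    return sql


print(render_query(POLICY, ["id", "ssn", "region"]))
```

Because the policy is plain data, the same definition could in principle be rendered for each engine's dialect, which is the point of keeping it separate from the platform.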
21. You Must Also Separate Policy from Physical
Policies written against thousands of tables and columns: thousands of policies
Abstract with logical metadata (PII, PHI, Address, SSN, etc...): very few, understandable policies
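A minimal sketch of the logical metadata idea, assuming a hypothetical catalog that tags physical columns with labels such as PII or PHI. Policies are written once per tag, so a newly tagged column is covered without a new policy. The catalog contents, tag names, and helper functions are invented for illustration.

```python
# Hypothetical catalog: physical columns tagged with logical metadata.
CATALOG = {
    "patients.ssn": {"PII", "SSN"},
    "patients.home_address": {"PII", "Address"},
    "claims.diagnosis": {"PHI"},
    "claims.amount": set(),
}

# One policy per tag instead of one policy per physical column.
POLICIES = {"PII": "mask", "PHI": "mask"}


def actions_for(column):
    """Return the policy actions that reach a column through its tags."""
    return {POLICIES[tag] for tag in CATALOG[column] if tag in POLICIES}


def protected_columns():
    """List every physical column covered by at least one tag-level policy."""
    return sorted(col for col in CATALOG if actions_for(col))


print(protected_columns())

# A new table only needs tags; the existing policies already apply to it.
CATALOG["visits.patient_ssn"] = {"PII", "SSN"}
print(protected_columns())
```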
23. Remember This?
▪ RBAC should really be named “Static-based Access Control”
▪ It’s like writing code without being able to use variables!
Wouldn’t it have been nice to just write this with a variable and have the policy dynamically defined at RUN TIME?
organization_name IN
(SELECT org_name FROM redacted
WHERE role IN (@role))
▪ This is ABAC and it really should be called “Dynamic-based Access Control”
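The variable-based rule above can be sketched in Python as a single filter whose role set is supplied at run time. The lookup table stands in for the slide's redacted table, and the role names, organization names, and `visible_rows` function are hypothetical.

```python
# Hypothetical stand-in for the redacted role-to-organization lookup table.
ORG_BY_ROLE = [
    {"role": "emea_analyst", "org_name": "Acme EU"},
    {"role": "us_analyst", "org_name": "Acme US"},
    {"role": "us_analyst", "org_name": "Acme Canada"},
]

ROWS = [
    {"id": 1, "organization_name": "Acme EU"},
    {"id": 2, "organization_name": "Acme US"},
    {"id": 3, "organization_name": "Acme Canada"},
]


def visible_rows(rows, user_roles):
    """One policy for every user: organization_name IN (orgs for @role)."""
    allowed = {r["org_name"] for r in ORG_BY_ROLE if r["role"] in user_roles}
    return [row for row in rows if row["organization_name"] in allowed]


# The same policy yields different results depending on who is asking.
print(visible_rows(ROWS, {"us_analyst"}))
print(visible_rows(ROWS, {"emea_analyst"}))
```

Adding a role or an organization changes only the lookup data, not the policy.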
24. Ranger/Hortonworks Real Customer Example
They had 8 rules per table times 12 tables for a total of 96 rules!
[Screenshot: list of Ranger row-level policies, names redacted. Annotations: user associated to role; the exact same policy written over and over again; the only change: the role]
25. With ABAC/Immuta, It’s a Single Policy!
This is because it separates the user details from the policy and treats them as a read-time variable. This also future-proofs the policy.
We can also build the rule once and have it apply to all 12 tables with our logical metadata layer (discussed previously). This also future-proofs adding new tables.
27. How to balance the speed of the business with secure access to sensitive data?
How Do We Hit The Privacy vs Utility Sweet Spot?
[Diagram: THE RISK OF DATA USE, a spectrum from FULL PRIVACY (Closed) to FULL UTILITY (Open) with a Sweet spot between them. LEGAL / COMPLIANCE and DATA ANALYSTS & SCIENTISTS are shown alongside the spectrum]
28. I know stuff about Judd and Leslie
photo credit: Gawker
29. New York Taxi & Limousine Commission
• Data was released containing taxi pickups, dropoffs, location, time, amount, and tip amount, among others
• This seems pretty harmless?
30. Well, Judd and Leslie May Not Think It’s Harmless
• This photo was geotagged (with time), so by simply querying by medallion and time, we know how much Judd and Leslie tip!
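To make the linkage concrete, here is a small sketch of the "query by medallion and time" step. Every record below is invented for illustration; the field names, medallion values, and the `link` helper are assumptions, not the schema of the released data.

```python
from datetime import datetime, timedelta

# Hypothetical released trip records (all values invented).
TRIPS = [
    {"medallion": "7A12", "pickup": datetime(2013, 7, 8, 19, 32), "tip": 0.0},
    {"medallion": "7A12", "pickup": datetime(2013, 7, 8, 22, 5), "tip": 4.0},
    {"medallion": "9C44", "pickup": datetime(2013, 7, 8, 19, 30), "tip": 2.0},
]


def link(trips, medallion, seen_at, window_minutes=5):
    """Return trips for this medallion picked up within the window of a sighting."""
    window = timedelta(minutes=window_minutes)
    return [
        trip for trip in trips
        if trip["medallion"] == medallion
        and abs(trip["pickup"] - seen_at) <= window
    ]


# External knowledge: a timestamped photo shows the passenger entering cab 7A12.
matches = link(TRIPS, "7A12", datetime(2013, 7, 8, 19, 30))
print(matches)
```

A single match ties a named person to a fare and tip even though the data set contains no names.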
31. Limit Features, Limit Records, Limit Functions
● Reduced specificity: Regular Expressions for strings; Rounding for numeric data
● Column restriction: Hide or replace values with NULL
● Row restrictions: Restrict access to certain types of rows
● Differential Privacy: Inject noise into aggregate measures based on privacy guarantees
● Hashing/Encryption
● Local DP: Randomly alter a percentage of data
● Aggregate-Only: Only allow aggregate functions on data
● K-anonymization: Suppress values that can lead to linkage attacks
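Three of the techniques listed above can be sketched in a few lines each. These are simplified illustrations under stated assumptions, not production implementations: the k-anonymization here only suppresses small groups (it does no generalization), and the differential privacy example assumes a simple count with sensitivity 1. Function and parameter names are hypothetical.

```python
import random
from collections import Counter


def round_to(value, step):
    """Reduced specificity: round a number to the nearest multiple of step."""
    return round(value / step) * step


def k_anonymize(rows, quasi_identifiers, k):
    """Suppress quasi-identifier values shared by fewer than k rows."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    counts = Counter(key(row) for row in rows)
    suppressed = {q: None for q in quasi_identifiers}
    return [
        dict(row) if counts[key(row)] >= k else {**row, **suppressed}
        for row in rows
    ]


def dp_count(true_count, epsilon, rng):
    """Differential privacy sketch: add Laplace noise with scale 1/epsilon."""
    # The difference of two exponential draws with rate epsilon is Laplace
    # distributed with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


rows = [
    {"zip": "10001", "age": 30, "tip": 2.0},
    {"zip": "10001", "age": 30, "tip": 3.5},
    {"zip": "94105", "age": 52, "tip": 1.0},
]
print(k_anonymize(rows, ["zip", "age"], k=2))
print(dp_count(len(rows), epsilon=1.0, rng=random.Random(0)))
```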
32. Taxi data properly anonymized while providing utility
Generalize: remove precision from time and space
Randomize: replace with false but legitimate values at a specified rate
Mask: using salted deterministic hash
Column types shown: Direct Identifier, Indirect Identifiers, Sensitive
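The three transformations named on this slide might look roughly like the following. This is a sketch under assumptions: a keyed HMAC-SHA256 stands in for the "salted deterministic hash", truncating to the hour and rounding coordinates stand in for "remove precision", and all function names and parameters are hypothetical.

```python
import hashlib
import hmac
import random
from datetime import datetime


def mask(value, salt):
    """Mask: keyed hash, so the same value and salt always give the same token."""
    return hmac.new(salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


def generalize_time(ts):
    """Generalize: drop minutes and seconds from a timestamp."""
    return ts.replace(minute=0, second=0, microsecond=0)


def generalize_location(lat, lon, decimals=2):
    """Generalize: round coordinates to fewer decimal places."""
    return (round(lat, decimals), round(lon, decimals))


def randomize(value, domain, rate, rng):
    """Randomize: with probability `rate`, swap in a legitimate value from `domain`."""
    return rng.choice(domain) if rng.random() < rate else value


rng = random.Random(42)
salt = b"example-salt"  # hypothetical; a real salt would be kept secret
print(mask("7A12", salt))
print(generalize_time(datetime(2013, 7, 8, 19, 32, 10)))
print(randomize(4.0, [0.0, 1.0, 2.0, 3.0], rate=0.2, rng=rng))
```

Because the mask is deterministic, masked identifiers can still be joined or counted, which preserves utility while hiding the raw value.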
33. Terminology (BACKGROUND)
Attack occurs when the potential for re-identification exists. Factors include:
● Access
● External Knowledge
● Incentives
Attack Event (A) represents the probability that an attack occurs
Success Event (S) represents the probability that an attack is successful
[Diagram: Attack, with labels A and S]
34. Data Risk
Mitigation: modify data to limit the ability of an adversary to make inferences
Inferences
● Record ownership
● Participation
● Attribute Values
Techniques
● k-Anon
● LDP
● DP
● Masking
[Diagram: Risk, with labels A and S]
35. Context Risk
Mitigation “shrinks” the attack surface.
Controls
● Limiting Access
● Limiting types of Queries
● Purpose Limitations
● Agreements
● Creating Disincentives
● Training
[Diagram: Risk, with labels A, A, and S]
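The slides name the events A and S but do not state how they combine. One common way to write overall re-identification risk, offered here as an assumption and not as the speaker's formula, is the probability of an attack times the probability that it succeeds given that it happens:

```latex
\[
P(\text{re-identification}) = P(A)\,P(S \mid A)
\]
```

Read this way, the Data Risk techniques (k-Anon, LDP, DP, Masking) aim to lower $P(S \mid A)$ by limiting what an adversary can infer, while the Context Risk controls (limiting access, purpose limitations, agreements, training) aim to lower $P(A)$.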