SlideShare a Scribd company logo
1 of 35
PROTECT YOUR PRIVATE DATA
WITH ORC COLUMN
ENCRYPTION
Owen O’Malley
owen@cloudera.com
September 2019
@owen_omalley
© 2019 Cloudera, Inc. All rights reserved. 2
WHO AM I?
• First committer added to Hadoop
 MapReduce
 Scaling
 Security
• Hive
 ACID transactions
• ORC
 Creator
SECURITY AND DATA PROTECTION
IN HADOOP
© 2019 Cloudera, Inc. All rights reserved. 4
EXAMPLE DATA LAKE SCENARIO
Marketing
Demographics
Electronic
medical records
CRM
POS
(Structured)(Structured) (Structured) (Structured) (Structured)
Cluster 1: Dublin Cluster 2: San Francisco
(Unstructured)(Unstructured)(Unstructured)
Cluster 3: Prague
(Structured)
On Premise Data Lakes
(Unstructured)(Structured) (Unstructured) (Structured)
Cloud Data Lakes
Social
Weblogs & Feeds
Transactional
Mobile
IoT
Personal Data
© 2019 Cloudera, Inc. All rights reserved. 5
DIFFERENCES IN THE BIG DATA CONTEXT
• Breaking down silos: fantastic for analytics, but security challenges
 Centralized data lake with multi-tenancy requires secure (and easy) authentication and fine-grained
authorization
• Data democratization and the Data Scientist role (often a data super user with
elevated privileges)
• Data is maintained over a long duration
• Cloud and Hybrid architectures spanning data center and (multiple) public clouds
further broaden the attack surface area and present novel authentication and
authorization challenges
• Along with adherence to security fundamentals and defense in-depth, a data-centric
approach to security becomes critical
Watch Towers
Limited Entry Points
Moat
Kerberos
Securing your data lake
High Hard Walls
Check Identity
Inner Walls
Firewall
Encryption, TLS, Key
Trustee, Navigator
Encrypt, Ranger KMS
LDAP/AD
Apache Knox: AuthN, API
Gateway, Proxy, SSO
Apache Ranger : ABAC
AuthZ, Audits,
Anonymization
Apache Metron:
Detection
© 2019 Cloudera, Inc. All rights reserved. 7
DATA PROTECTION IN HADOOP
• Must be applied in three different levels
Storage – encrypt data at rest
• HDFS encryption zones
• Volume encryption
Transmission – encrypt data in motion
• SSL
Upon access – apply restrictions when accessed
• Ranger dynamic masking and row filtering
Dynamic Row Filtering & Column Masking WithApache Ranger &Apache Hive
User 2: Ivanna
Location : EU
Group: HR
User 1: Joe
Location : US
Group: Analyst
Original Query:
SELECT country, nationalid,
ccnumber, mrn, name FROM
ww_customers
Country National ID CC No DOB MRN Name Policy ID
US 232323233 4539067047629850 9/12/1969 8233054331 John Doe nj23j424
US 333287465 5391304868205600 8/13/1979 3736885376 Jane Doe cadsd984
Germany T22000129 4532786256545550 3/5/1963 876452830A Ernie Schwarz KK-2345909
Country National ID CC No MRN Name
US xxxxx3233 4539 xxxx xxxx
xxxx
null John Doe
US xxxxx7465 5391 xxxx xxxx
xxxx
null Jane Doe
Ranger Policy Enforcement
Query Rewritten based on Dynamic Ranger
Policies:
Filter rows by region & apply relevant column
masking
Users from US Analyst group see data for
US persons with CC and National ID
(SSN) as masked values and MRN is
nullified
Country National ID Name MRN
Germany T22000129 Ernie Schwarz 876452830A
EU HR Policy Admins can see
unmasked but are restricted by
row filtering policies to see data
for EU persons only
Original Query:
SELECT country,
nationalid, name, mrn
FROM ww_customers
Analysts
HR Marketing
Ranger
WHAT ARE WE FIXING?
© 2019 Cloudera, Inc. All rights reserved. 10
FRAMING THE PROBLEM…
• Related data, different security requirements
• Authorization – who can see it
• Audit – track who read it
• Encrypt on disk – regulatory
• File-level (or blob) granularity isn’t enough
• File systems don’t understand columns
© 2019 Cloudera, Inc. All rights reserved. 11
REQUIREMENTS
• Readers should transparently decrypt data
If and only if the user has access to the key
The data must be decrypted locally
Old readers must not break!
• Columns are only decrypted as necessary
• Master keys must be managed securely
Support for Key Management Server & hardware
Support for key rolling
EARLIER WORKAROUNDS
© 2019 Cloudera, Inc. All rights reserved. 13
PARTIAL SOLUTION – HDFS ENCRYPTION ZONES
• Transparent HDFS Encryption
• Encryption zones
• HDFS directory trees
• Unique master key for each zone
• Client decrypts data
• Key Management via KeyProvider API
© 2019 Cloudera, Inc. All rights reserved. 14
HDFS ENCRYPTION ZONE LIMITATIONS
• Very coarse protection
• Only entire directory subtrees
• No ability to protect columns
• A lot of users need access to keys
• Moves between zones is painful
• When writing with Hive, data is moved multiple times per a query
© 2019 Cloudera, Inc. All rights reserved. 15
PARTIAL SOLUTION – HIVE SERVER 2
• Limit access to warehouse data to Hive
• Only “hive” user has HDFS access
• Breaks Hadoop’s multi-paradigm data access
• Many customers use both Hive & Spark
• JDBC isn’t a distributed protocol
• Funneling large data through a small pipe
• Spark Data Warehouse Connector to LLAP fixes this
© 2019 Cloudera, Inc. All rights reserved. 16
PARTIAL SOLUTION – SEPARATE TABLES
• Split private information out of tables
• Separate directories in HDFS
• HDFS and/or HS2 authorization
• Enables HDFS encryption
• Limitations
• Need to join with other tables
• Higher operational overhead
© 2019 Cloudera, Inc. All rights reserved. 17
PARTIAL SOLUTION – ENCRYPTION UDF
• Hive has user defined functions
• aes_encrypt and aes_decrypt
• Limitations
• Key management is problematic
• Encryption is not seeded
• Size of value leaks information
THE WINNER IS …
© 2019 Cloudera, Inc. All rights reserved. 19
COLUMNAR ENCRYPTION
• Columnar file formats, such as ORC and Parquet
• Write data in columns
• Column projection
• Better compression
• Encryption works really well
• Only encrypt bytes for column
• Can store multiple variants of data
© 2019 Cloudera, Inc. All rights reserved. 20
ORC FILE FORMAT
© 2019 Cloudera, Inc. All rights reserved. 21
USER EXPERIENCE
• Set table properties for encryption
• orc.encrypt = ”pii:ssn,email;credit:card_info”
• orc.mask = “sha256:card_info”
• Define where to get the encryption keys
• Configuration defines the key provider via URI
© 2019 Cloudera, Inc. All rights reserved. 22
KEY MANAGEMENT
• Create a master key for each use case
• “pii”, “pci”, or “hipaa”
• Each column in each file uses unique local key
• Allows audit of which users read which files
• Ranger policies limit access to keys
• Who, What, When, Where
© 2019 Cloudera, Inc. All rights reserved. 23
KEY PROVIDER API
• Provides limited access to encryption keys
• Encrypts or decrypts local keys
• Users are never given master keys
• Key versions and key rolling of master keys
• Allows 3rd party plugins
• Supports Cloud, Hadoop or Ranger KMS
© 2019 Cloudera, Inc. All rights reserved. 24
ENCRYPTION DATA FLOW
© 2019 Cloudera, Inc. All rights reserved. 25
ENCRYPTION FLOW
• Local key
• Random for each encrypted column in file
• Encrypted w/ master key by KMS
• Encrypted local key is stored in file metadata
• IV is generated to be unique
• Column, kind, stripe, & counter
© 2019 Cloudera, Inc. All rights reserved. 26
STATIC DATA MASKING
• What happens without key access?
• Define static masks
• Nullify – all values become null
• Redact – mask values ‘Xxxxx Xxxxx!’
• Can define ranges to unmask
• SHA256 – replace with SHA256
• Custom - user defined
© 2019 Cloudera, Inc. All rights reserved. 27
DATA ANONYMIZATION
• Anonymization is hard!
• AOL search logs
• Netflix prize datasets
• NYC taxi dataset
• Always evaluate security tradeoffs
• Tokenization is a useful technique
• Assigns arbitrary replacements
© 2019 Cloudera, Inc. All rights reserved. 28
KEY DISPOSAL
• Often need to keep data for 90 days
• Currently the data is written twice
• With column encryption:
• Roll keys daily
• Delete master key after 90 days
© 2019 Cloudera, Inc. All rights reserved. 29
ORC ENCRYPTION DESIGN
• Write both variants of streams
• Masked unencrypted
• Unmasked encrypted
• Encrypt both data and statistics
• Maintain compatibility for old readers
• Read unencrypted variant
• Preserve ability to seek in file
© 2019 Cloudera, Inc. All rights reserved. 30
ORC WRITE PIPELINE
• Streams go through pipeline
• Run length encoding
• Compression (zlib, snappy, lzo, lz4, zstd, or none)
• Encryption
• Encryption is AES/CTR
• Allows seek
• No padding
CONCLUSIONS
© 2019 Cloudera, Inc. All rights reserved. 32
CONCLUSIONS
• ORC column encryptions provides
Transparent encryption
Multi-paradigm column security
Compatible with old readers
Static masking
Audit logging (via KMS logging)
• Supports file merging
• Released in ORC 1.6
© 2019 Cloudera, Inc. All rights reserved. 33
INTEGRATION WITH OTHER TOOLS
• Hive & Spark
No change other than defining table properties
• Apache Hive’s LLAP
Cache and fast processing of SQL queries
Column encryption changes internal interfaces
Cache both encrypted and unencrypted variants
Ensure audit log reflects end-user and what they accessed
© 2019 Cloudera, Inc. All rights reserved. 34
LIMITATIONS
• Need encryption policy for write
Current Atlas & Ranger tags lag data
Auto-discovery requires pre-access
• Changes to masking policy
Need to re-write files
• Need additional data masks
Credit card, addresses, etc.
• Decrypted local keys could be saved
THANK YOU
Owen O’Malley
owen@cloudera.com
@owen_omalley

More Related Content

What's hot

Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache RangerDataWorks Summit
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 
Apache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for HadoopApache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for HadoopCloudera, Inc.
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsDatabricks
 
Frame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine LearningFrame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine LearningDavid Stein
 
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...aiuy
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxData
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in ImpalaCloudera, Inc.
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDatabricks
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingDataWorks Summit
 

What's hot (20)

Securing Hadoop with Apache Ranger
Securing Hadoop with Apache RangerSecuring Hadoop with Apache Ranger
Securing Hadoop with Apache Ranger
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 
Apache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for HadoopApache Sqoop: A Data Transfer Tool for Hadoop
Apache Sqoop: A Data Transfer Tool for Hadoop
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Optimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL JoinsOptimizing Apache Spark SQL Joins
Optimizing Apache Spark SQL Joins
 
Hive tuning
Hive tuningHive tuning
Hive tuning
 
Frame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine LearningFrame - Feature Management for Productive Machine Learning
Frame - Feature Management for Productive Machine Learning
 
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
DataFusion-and-Arrow_Supercharge-Your-Data-Analytical-Tool-with-a-Rusty-Query...
 
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
InfluxDB IOx Tech Talks: Intro to the InfluxDB IOx Read Buffer - A Read-Optim...
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Druid deep dive
Druid deep diveDruid deep dive
Druid deep dive
 
Query Compilation in Impala
Query Compilation in ImpalaQuery Compilation in Impala
Query Compilation in Impala
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Bringing complex event processing to Spark streaming
Bringing complex event processing to Spark streamingBringing complex event processing to Spark streaming
Bringing complex event processing to Spark streaming
 

Similar to Protect your private data with ORC column encryption

Fine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column EncryptionFine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column EncryptionOwen O'Malley
 
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column EncryptionProtect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column EncryptionDataWorks Summit
 
Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...
Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...
Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...Denodo
 
Hadoop security implementationon 20171003
Hadoop security implementationon 20171003Hadoop security implementationon 20171003
Hadoop security implementationon 20171003lee tracie
 
Security implementation on hadoop
Security implementation on hadoopSecurity implementation on hadoop
Security implementation on hadoopWei-Chiu Chuang
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020Adam Doyle
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartchCloudera, Inc.
 
Transparent Encryption in HDFS
Transparent Encryption in HDFSTransparent Encryption in HDFS
Transparent Encryption in HDFSDataWorks Summit
 
Overcoming the Challenges of Architecting for the Cloud
Overcoming the Challenges of Architecting for the CloudOvercoming the Challenges of Architecting for the Cloud
Overcoming the Challenges of Architecting for the CloudZscaler
 
cisco csr1000v
cisco csr1000vcisco csr1000v
cisco csr1000vMing914298
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Asug84339 how to secure privacy data in a hybrid s4 hana landscape
Asug84339   how to secure privacy data in a hybrid s4 hana landscapeAsug84339   how to secure privacy data in a hybrid s4 hana landscape
Asug84339 how to secure privacy data in a hybrid s4 hana landscapeDharma Atluri
 
Where to Store the Cloud Encryption Keys - InterOp 2012
Where to Store the Cloud Encryption Keys - InterOp 2012Where to Store the Cloud Encryption Keys - InterOp 2012
Where to Store the Cloud Encryption Keys - InterOp 2012Trend Micro
 
Get started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionGet started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionCloudera, Inc.
 
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...Cloudera, Inc.
 
IRJET- Survey of Cryptographic Techniques to Certify Sharing of Informati...
IRJET-  	  Survey of Cryptographic Techniques to Certify Sharing of Informati...IRJET-  	  Survey of Cryptographic Techniques to Certify Sharing of Informati...
IRJET- Survey of Cryptographic Techniques to Certify Sharing of Informati...IRJET Journal
 
CipherCloud for Any App
CipherCloud for Any AppCipherCloud for Any App
CipherCloud for Any AppCipherCloud
 
Fighting cyber fraud with hadoop
Fighting cyber fraud with hadoopFighting cyber fraud with hadoop
Fighting cyber fraud with hadoopNiel Dunnage
 
Fiware - communicating with ROS robots using Fast RTPS
Fiware - communicating with ROS robots using Fast RTPSFiware - communicating with ROS robots using Fast RTPS
Fiware - communicating with ROS robots using Fast RTPSJaime Martin Losa
 

Similar to Protect your private data with ORC column encryption (20)

Fine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column EncryptionFine Grain Access Control for Big Data: ORC Column Encryption
Fine Grain Access Control for Big Data: ORC Column Encryption
 
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column EncryptionProtect your Private Data in your Hadoop Clusters with ORC Column Encryption
Protect your Private Data in your Hadoop Clusters with ORC Column Encryption
 
Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...
Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...
Cryptographie avancée et Logical Data Fabric : Accélérez le partage et la mig...
 
Hadoop security implementationon 20171003
Hadoop security implementationon 20171003Hadoop security implementationon 20171003
Hadoop security implementationon 20171003
 
Security implementation on hadoop
Security implementation on hadoopSecurity implementation on hadoop
Security implementation on hadoop
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
Big data journey to the cloud 5.30.18 asher bartch
Big data journey to the cloud 5.30.18   asher bartchBig data journey to the cloud 5.30.18   asher bartch
Big data journey to the cloud 5.30.18 asher bartch
 
Transparent Encryption in HDFS
Transparent Encryption in HDFSTransparent Encryption in HDFS
Transparent Encryption in HDFS
 
LDSS for mobile cloud
LDSS for mobile cloud  LDSS for mobile cloud
LDSS for mobile cloud
 
Overcoming the Challenges of Architecting for the Cloud
Overcoming the Challenges of Architecting for the CloudOvercoming the Challenges of Architecting for the Cloud
Overcoming the Challenges of Architecting for the Cloud
 
cisco csr1000v
cisco csr1000vcisco csr1000v
cisco csr1000v
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Asug84339 how to secure privacy data in a hybrid s4 hana landscape
Asug84339   how to secure privacy data in a hybrid s4 hana landscapeAsug84339   how to secure privacy data in a hybrid s4 hana landscape
Asug84339 how to secure privacy data in a hybrid s4 hana landscape
 
Where to Store the Cloud Encryption Keys - InterOp 2012
Where to Store the Cloud Encryption Keys - InterOp 2012Where to Store the Cloud Encryption Keys - InterOp 2012
Where to Store the Cloud Encryption Keys - InterOp 2012
 
Get started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionGet started with Cloudera's cyber solution
Get started with Cloudera's cyber solution
 
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...
Comprehensive Security for the Enterprise III: Protecting Data at Rest and In...
 
IRJET- Survey of Cryptographic Techniques to Certify Sharing of Informati...
IRJET-  	  Survey of Cryptographic Techniques to Certify Sharing of Informati...IRJET-  	  Survey of Cryptographic Techniques to Certify Sharing of Informati...
IRJET- Survey of Cryptographic Techniques to Certify Sharing of Informati...
 
CipherCloud for Any App
CipherCloud for Any AppCipherCloud for Any App
CipherCloud for Any App
 
Fighting cyber fraud with hadoop
Fighting cyber fraud with hadoopFighting cyber fraud with hadoop
Fighting cyber fraud with hadoop
 
Fiware - communicating with ROS robots using Fast RTPS
Fiware - communicating with ROS robots using Fast RTPSFiware - communicating with ROS robots using Fast RTPS
Fiware - communicating with ROS robots using Fast RTPS
 

More from Owen O'Malley

Running An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid ThemRunning An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid ThemOwen O'Malley
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACIDOwen O'Malley
 
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetFast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetOwen O'Malley
 
Strata NYC 2018 Iceberg
Strata NYC 2018  IcebergStrata NYC 2018  Iceberg
Strata NYC 2018 IcebergOwen O'Malley
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetOwen O'Malley
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetOwen O'Malley
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopOwen O'Malley
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersOwen O'Malley
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
Adding ACID Updates to Hive
Adding ACID Updates to HiveAdding ACID Updates to Hive
Adding ACID Updates to HiveOwen O'Malley
 
ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013Owen O'Malley
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File IntroductionOwen O'Malley
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop OperationsOwen O'Malley
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduceOwen O'Malley
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroOwen O'Malley
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopOwen O'Malley
 

More from Owen O'Malley (20)

Running An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid ThemRunning An Apache Project: 10 Traps and How to Avoid Them
Running An Apache Project: 10 Traps and How to Avoid Them
 
Big Data's Journey to ACID
Big Data's Journey to ACIDBig Data's Journey to ACID
Big Data's Journey to ACID
 
ORC Deep Dive 2020
ORC Deep Dive 2020ORC Deep Dive 2020
ORC Deep Dive 2020
 
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and ParquetFast Access to Your Data - Avro, JSON, ORC, and Parquet
Fast Access to Your Data - Avro, JSON, ORC, and Parquet
 
Strata NYC 2018 Iceberg
Strata NYC 2018  IcebergStrata NYC 2018  Iceberg
Strata NYC 2018 Iceberg
 
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and ParquetFast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
Fast Spark Access To Your Complex Data - Avro, JSON, ORC, and Parquet
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
Protecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache HadoopProtecting Enterprise Data in Apache Hadoop
Protecting Enterprise Data in Apache Hadoop
 
Data protection2015
Data protection2015Data protection2015
Data protection2015
 
Structor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop ClustersStructor - Automated Building of Virtual Hadoop Clusters
Structor - Automated Building of Virtual Hadoop Clusters
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
Adding ACID Updates to Hive
Adding ACID Updates to HiveAdding ACID Updates to Hive
Adding ACID Updates to Hive
 
ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013ORC File and Vectorization - Hadoop Summit 2013
ORC File and Vectorization - Hadoop Summit 2013
 
ORC Files
ORC FilesORC Files
ORC Files
 
ORC File Introduction
ORC File IntroductionORC File Introduction
ORC File Introduction
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Next Generation Hadoop Operations
Next Generation Hadoop OperationsNext Generation Hadoop Operations
Next Generation Hadoop Operations
 
Next Generation MapReduce
Next Generation MapReduceNext Generation MapReduce
Next Generation MapReduce
 
Bay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 IntroBay Area HUG Feb 2011 Intro
Bay Area HUG Feb 2011 Intro
 
Plugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in HadoopPlugging the Holes: Security and Compatability in Hadoop
Plugging the Holes: Security and Compatability in Hadoop
 

Recently uploaded

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingNeil Barnes
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfLars Albertsson
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiSuhani Kapoor
 

Recently uploaded (20)

Brighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data StorytellingBrighton SEO | April 2024 | Data Storytelling
Brighton SEO | April 2024 | Data Storytelling
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
(PARI) Call Girls Wanowrie ( 7001035870 ) HI-Fi Pune Escorts Service
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 
Industrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdfIndustrialised data - the key to AI success.pdf
Industrialised data - the key to AI success.pdf
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service AmravatiVIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
VIP Call Girls in Amravati Aarohi 8250192130 Independent Escort Service Amravati
 
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
VIP Call Girls Service Charbagh { Lucknow Call Girls Service 9548273370 } Boo...
 

Protect your private data with ORC column encryption

  • 1. PROTECT YOUR PRIVATE DATA WITH ORC COLUMN ENCRYPTION Owen O’Malley owen@cloudera.com September 2019 @owen_omalley
  • 2. © 2019 Cloudera, Inc. All rights reserved. 2 WHO AM I? • First committer added to Hadoop  MapReduce  Scaling  Security • Hive  ACID transactions • ORC  Creator
  • 3. SECURITY AND DATA PROTECTION IN HADOOP
  • 4. © 2019 Cloudera, Inc. All rights reserved. 4 EXAMPLE DATA LAKE SCENARIO Marketing Demographics Electronic medical records CRM POS (Structured)(Structured) (Structured) (Structured) (Structured) Cluster 1: Dublin Cluster 2: San Francisco (Unstructured)(Unstructured)(Unstructured) Cluster 3: Prague (Structured) On Premise Data Lakes (Unstructured)(Structured) (Unstructured) (Structured) Cloud Data Lakes Social Weblogs & Feeds Transactional Mobile IoT Personal Data
  • 5. © 2019 Cloudera, Inc. All rights reserved. 5 DIFFERENCES IN THE BIG DATA CONTEXT • Breaking down silos: fantastic for analytics, but security challenges  Centralized data lake with multi-tenancy requires secure (and easy) authentication and fine-grained authorization • Data democratization and the Data Scientist role (often a data super user with elevated privileges) • Data is maintained over a long duration • Cloud and Hybrid architectures spanning data center and (multiple) public clouds further broaden the attack surface area and present novel authentication and authorization challenges • Along with adherence to security fundamentals and defense in-depth, a data-centric approach to security becomes critical
  • 6. Watch Towers Limited Entry Points Moat Kerberos Securing your data lake High Hard Walls Check Identity Inner Walls Firewall Encryption, TLS, Key Trustee, Navigator Encrypt, Ranger KMS LDAP/AD Apache Knox: AuthN, API Gateway, Proxy, SSO Apache Ranger : ABAC AuthZ, Audits, Anonymization Apache Metron: Detection
  • 7. © 2019 Cloudera, Inc. All rights reserved. 7 DATA PROTECTION IN HADOOP • Must be applied in three different levels Storage – encrypt data at rest • HDFS encryption zones • Volume encryption Transmission – encrypt data in motion • SSL Upon access – apply restrictions when accessed • Ranger dynamic masking and row filtering
  • 8. Dynamic Row Filtering & Column Masking WithApache Ranger &Apache Hive User 2: Ivanna Location : EU Group: HR User 1: Joe Location : US Group: Analyst Original Query: SELECT country, nationalid, ccnumber, mrn, name FROM ww_customers Country National ID CC No DOB MRN Name Policy ID US 232323233 4539067047629850 9/12/1969 8233054331 John Doe nj23j424 US 333287465 5391304868205600 8/13/1979 3736885376 Jane Doe cadsd984 Germany T22000129 4532786256545550 3/5/1963 876452830A Ernie Schwarz KK-2345909 Country National ID CC No MRN Name US xxxxx3233 4539 xxxx xxxx xxxx null John Doe US xxxxx7465 5391 xxxx xxxx xxxx null Jane Doe Ranger Policy Enforcement Query Rewritten based on Dynamic Ranger Policies: Filter rows by region & apply relevant column masking Users from US Analyst group see data for US persons with CC and National ID (SSN) as masked values and MRN is nullified Country National ID Name MRN Germany T22000129 Ernie Schwarz 876452830A EU HR Policy Admins can see unmasked but are restricted by row filtering policies to see data for EU persons only Original Query: SELECT country, nationalid, name, mrn FROM ww_customers Analysts HR Marketing Ranger
  • 9. WHAT ARE WE FIXING?
  • 10. © 2019 Cloudera, Inc. All rights reserved. 10 FRAMING THE PROBLEM… • Related data, different security requirements • Authorization – who can see it • Audit – track who read it • Encrypt on disk – regulatory • File-level (or blob) granularity isn’t enough • File systems don’t understand columns
  • 11. © 2019 Cloudera, Inc. All rights reserved. 11 REQUIREMENTS • Readers should transparently decrypt data If and only if the user has access to the key The data must be decrypted locally Old readers must not break! • Columns are only decrypted as necessary • Master keys must be managed securely Support for Key Management Server & hardware Support for key rolling
  • 13. © 2019 Cloudera, Inc. All rights reserved. 13 PARTIAL SOLUTION – HDFS ENCRYPTION ZONES • Transparent HDFS Encryption • Encryption zones • HDFS directory trees • Unique master key for each zone • Client decrypts data • Key Management via KeyProvider API
  • 14. © 2019 Cloudera, Inc. All rights reserved. 14 HDFS ENCRYPTION ZONE LIMITATIONS • Very coarse protection • Only entire directory subtrees • No ability to protect columns • A lot of users need access to keys • Moves between zones is painful • When writing with Hive, data is moved multiple times per a query
  • 15. © 2019 Cloudera, Inc. All rights reserved. 15 PARTIAL SOLUTION – HIVE SERVER 2 • Limit access to warehouse data to Hive • Only “hive” user has HDFS access • Breaks Hadoop’s multi-paradigm data access • Many customers use both Hive & Spark • JDBC isn’t a distributed protocol • Funneling large data through a small pipe • Spark Data Warehouse Connector to LLAP fixes this
  • 16. © 2019 Cloudera, Inc. All rights reserved. 16 PARTIAL SOLUTION – SEPARATE TABLES • Split private information out of tables • Separate directories in HDFS • HDFS and/or HS2 authorization • Enables HDFS encryption • Limitations • Need to join with other tables • Higher operational overhead
  • 17. © 2019 Cloudera, Inc. All rights reserved. 17 PARTIAL SOLUTION – ENCRYPTION UDF • Hive has user defined functions • aes_encrypt and aes_decrypt • Limitations • Key management is problematic • Encryption is not seeded • Size of value leaks information
  • 19. © 2019 Cloudera, Inc. All rights reserved. 19 COLUMNAR ENCRYPTION • Columnar file formats, such as ORC and Parquet • Write data in columns • Column projection • Better compression • Encryption works really well • Only encrypt bytes for column • Can store multiple variants of data
  • 20. © 2019 Cloudera, Inc. All rights reserved. 20 ORC FILE FORMAT
  • 21. © 2019 Cloudera, Inc. All rights reserved. 21 USER EXPERIENCE • Set table properties for encryption • orc.encrypt = ”pii:ssn,email;credit:card_info” • orc.mask = “sha256:card_info” • Define where to get the encryption keys • Configuration defines the key provider via URI
  • 22. © 2019 Cloudera, Inc. All rights reserved. 22 KEY MANAGEMENT • Create a master key for each use case • “pii”, “pci”, or “hipaa” • Each column in each file uses unique local key • Allows audit of which users read which files • Ranger policies limit access to keys • Who, What, When, Where
  • 23. © 2019 Cloudera, Inc. All rights reserved. 23 KEY PROVIDER API • Provides limited access to encryption keys • Encrypts or decrypts local keys • Users are never given master keys • Key versions and key rolling of master keys • Allows 3rd party plugins • Supports Cloud, Hadoop or Ranger KMS
  • 24. © 2019 Cloudera, Inc. All rights reserved. 24 ENCRYPTION DATA FLOW
  • 25. © 2019 Cloudera, Inc. All rights reserved. 25 ENCRYPTION FLOW • Local key • Random for each encrypted column in file • Encrypted w/ master key by KMS • Encrypted local key is stored in file metadata • IV is generated to be unique • Column, kind, stripe, & counter
  • 26. © 2019 Cloudera, Inc. All rights reserved. 26 STATIC DATA MASKING • What happens without key access? • Define static masks • Nullify – all values become null • Redact – mask values ‘Xxxxx Xxxxx!’ • Can define ranges to unmask • SHA256 – replace with SHA256 • Custom - user defined
  • 27. © 2019 Cloudera, Inc. All rights reserved. 27 DATA ANONYMIZATION • Anonymization is hard! • AOL search logs • Netflix prize datasets • NYC taxi dataset • Always evaluate security tradeoffs • Tokenization is a useful technique • Assigns arbitrary replacements
  • 28. © 2019 Cloudera, Inc. All rights reserved. 28 KEY DISPOSAL • Often need to keep data for 90 days • Currently the data is written twice • With column encryption: • Roll keys daily • Delete master key after 90 days
  • 29. © 2019 Cloudera, Inc. All rights reserved. 29 ORC ENCRYPTION DESIGN • Write both variants of streams • Masked unencrypted • Unmasked encrypted • Encrypt both data and statistics • Maintain compatibility for old readers • Read unencrypted variant • Preserve ability to seek in file
  • 30. © 2019 Cloudera, Inc. All rights reserved. 30 ORC WRITE PIPELINE • Streams go through pipeline • Run length encoding • Compression (zlib, snappy, lzo, lz4, zstd, or none) • Encryption • Encryption is AES/CTR • Allows seek • No padding
  • 32. © 2019 Cloudera, Inc. All rights reserved. 32 CONCLUSIONS • ORC column encryptions provides Transparent encryption Multi-paradigm column security Compatible with old readers Static masking Audit logging (via KMS logging) • Supports file merging • Released in ORC 1.6
  • 33. © 2019 Cloudera, Inc. All rights reserved. 33 INTEGRATION WITH OTHER TOOLS • Hive & Spark No change other than defining table properties • Apache Hive’s LLAP Cache and fast processing of SQL queries Column encryption changes internal interfaces Cache both encrypted and unencrypted variants Ensure audit log reflects end-user and what they accessed
  • 34. © 2019 Cloudera, Inc. All rights reserved. 34 LIMITATIONS • Need encryption policy for write Current Atlas & Ranger tags lag data Auto-discovery requires pre-access • Changes to masking policy Need to re-write files • Need additional data masks Credit card, addresses, etc. • Decrypted local keys could be saved