Row/Column-Level Security in SQL for Apache Spark

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Dongjoon Hyun – Software Engineer
Bikas Saha – Software Engineer
April 2017

Who am I
Dongjoon Hyun
• Software Engineer @ Hortonworks
• Apache REEF PMC member and committer
• Apache Spark project contributor
• https://github.com/dongjoon-hyun

Agenda
Security Issues
Goals
Components
How it works
Demo

Security
• One of the fundamental features for enterprise adoption
– Multi-tenancy: Billing team / Data science team / Marketing team
• Row- and column-level access control for SQL users
– Row filtering
– Column masking
• Must enforce shared policies across various SQL engines simultaneously
– E.g. Apache Spark 2.1/1.6 and Apache Hive 2.1
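
To make the two terms concrete, here is a minimal pure-Python sketch of what row filtering and column masking mean for a user's result set. It only illustrates the semantics; it is not how Ranger or LLAP implements them. The function names, the sample rows, and the "show last four characters" mask are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, str]


def apply_policy(
    rows: List[Row],
    row_filter: Callable[[Row], bool],
    column_masks: Dict[str, Callable[[str], str]],
) -> List[Row]:
    """Return only the rows that pass row_filter, with masked columns rewritten."""
    result = []
    for row in rows:
        if not row_filter(row):
            continue  # row filtering: the user never sees this row
        masked = dict(row)  # copy so the stored data is left untouched
        for column, mask in column_masks.items():
            if column in masked:
                masked[column] = mask(masked[column])  # column masking
        result.append(masked)
    return result


def mask_all_but_last4(value: str) -> str:
    """Replace every character except the last four with 'x'."""
    return "x" * max(len(value) - 4, 0) + value[-4:]


# Hypothetical sample data, invented for this illustration.
customers = [
    {"name": "Alice", "region": "US", "card": "1234567812345678"},
    {"name": "Bob", "region": "EU", "card": "8765432187654321"},
]

visible = apply_policy(
    customers,
    row_filter=lambda row: row["region"] == "US",
    column_masks={"card": mask_all_but_last4},
)
print(visible)
```

The point of the rest of the deck is that this logic should live in one trusted place and apply to every SQL engine, rather than being re-implemented per application as it is here.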

Issue 1
Security starts from storage
• Spark reads all or nothing
– Directory/file-based permissions are insufficient
• Permission 777 on the warehouse?

Issue 2
Overhead in setting up and maintaining security policies
• Spark apps must be rewritten
– Special data source tables
• Duplicated data
– Filtered rows
– Removed or masked columns
• SQL views
– Maintained manually

Goals

Goal 1: Spark SQL Apps
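Support row/column-level security in batch apps
from pyspark.sql import SparkSession

# Build a Hive-enabled session; the parentheses let the chained calls span lines
spark = (
    SparkSession
    .builder
    .enableHiveSupport()
    .getOrCreate()
)
spark.sql("select * from db_common.t_customer").show()
[Diagram: database db_common containing table t_customer, …]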

Goal 2: Spark shells (1/2)
Support row/column-level security in all shells
spark-shell
pyspark

Goal 2: Spark shells (2/2)
Support row/column-level security in all shells
sparkR
spark-sql

Goal 3: Spark Thrift Server
Support row/column-level security with Spark Thrift Server
Login as `hive`
Login as `spark`

Components

What is required?
• Kerberos
• Apache Hadoop (HDFS/YARN)
• Apache Ranger
• Apache Hive (LLAP)
• Spark-LLAP: A library and patches to integrate the above (the focus here)

Apache Ranger
Provides a standard authorization method across many Hadoop components
https://hortonworks.com/apache/ranger/#section_2

Apache Hive
• Hive Ranger Plugin & Policies
– Support row/column-level security
• LLAP Daemon (GA in HDP 2.6)
– Persistent query servers with intelligent in-memory caching
– Provide a secure relational datanode view of the data
Trusted Service

Milestone
HDP 2.5: Spark-LLAP for Spark 1.6
• User should use LlapContext
• Support Scala/Java and spark-shell
var lc = new LlapContext(sc)
lc.sql("select * from t").show
HDP 2.6: Spark-LLAP (Technical Preview)
• No need to rewrite SQL related code
• Support all languages and shells
Next
• Support YARN cluster mode

Spark-LLAP GitHub (Apache License)

How it works

How it works – Overview
Case: spark-submit with YARN cluster mode
[Diagram components: User, Admin, Spark, Hive (HiveServer2), Ranger, LLAP]
1. Manage policies
2. Launch
3. Get delegation token
4. Get metadata
5. Get data locations
6. Read filtered/masked data
7. Monitor audits
[Diagram label: Authorize]

How it works – Overview
[Same diagram as the previous slide; legend: Existing Infra / New for Spark / New for Hive (GA)]

Hive
Enable LLAP

Admin – Manage
Hive Database: db_common
Table: *
Hive Column: *
Select User: spark
Permissions: SELECT
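
As a rough illustration of what the policy fields above express, the following sketch models them as a small Python object and checks a request against it. This is not Ranger's data model or API. The class, its field names, and the shell-style wildcard matching via `fnmatchcase` are assumptions made for the example; only the field values come from the slide.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import FrozenSet


@dataclass(frozen=True)
class AccessPolicy:
    """Hypothetical stand-in for one access policy entry."""

    database: str
    table: str
    column: str
    users: FrozenSet[str]
    permissions: FrozenSet[str]

    def allows(self, user: str, permission: str,
               database: str, table: str, column: str) -> bool:
        """True when the user, permission, and resource all match the policy."""
        return (
            user in self.users
            and permission in self.permissions
            and fnmatchcase(database, self.database)
            and fnmatchcase(table, self.table)  # "*" matches any table
            and fnmatchcase(column, self.column)  # "*" matches any column
        )


# Values taken from the slide: db_common / * / * / spark / SELECT.
policy = AccessPolicy(
    database="db_common",
    table="*",
    column="*",
    users=frozenset({"spark"}),
    permissions=frozenset({"SELECT"}),
)

print(policy.allows("spark", "SELECT", "db_common", "t_customer", "name"))
print(policy.allows("hive", "SELECT", "db_common", "t_customer", "name"))
```

With this policy, the `spark` user may select any column of any table in `db_common`, and every other request falls through to whatever other policies exist.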

Admin – Audit

User
Launch Spark jobs
spark-submit \
  --jars spark-llap.jar \
  --conf spark.sql.hive.llap=true \
  --conf spark.yarn.security.credentials.hiveserver2.enabled=true \
  --master yarn \
  --deploy-mode cluster \
  sql.py
• --jars: the `--packages` option is supported, too
• spark.sql.hive.llap: easy to turn on/off
• spark.yarn.security.credentials.hiveserver2.enabled: only used for YARN cluster mode
Note: There are more static configurations related to LLAP
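
For scripted launches, the same command line can be assembled programmatically. The sketch below is one possible way to do that, using only the flags and settings shown on this slide. The helper name and its parameters are hypothetical, and tying the HiveServer2 credential setting to cluster mode is my reading of the slide's annotation rather than a documented rule.

```python
import shlex
from typing import List


def build_spark_submit(app: str, llap_enabled: bool = True,
                       cluster_mode: bool = True,
                       jar: str = "spark-llap.jar") -> List[str]:
    """Assemble a spark-submit argument list from the slide's settings."""
    args = ["spark-submit", "--jars", jar]
    # Single switch that turns the LLAP integration on or off.
    flag = "true" if llap_enabled else "false"
    args += ["--conf", f"spark.sql.hive.llap={flag}"]
    args += ["--master", "yarn"]
    if cluster_mode:
        # Assumption: the HiveServer2 token setting is only added in cluster mode.
        args += ["--conf",
                 "spark.yarn.security.credentials.hiveserver2.enabled=true"]
        args += ["--deploy-mode", "cluster"]
    args.append(app)
    return args


command = build_spark_submit("sql.py")
# Quote each argument so the line can be pasted into a shell.
print(" ".join(shlex.quote(arg) for arg in command))
```

The returned list can be passed directly to `subprocess.run` on a machine where `spark-submit` is on the PATH; the static LLAP configurations mentioned in the note are not covered here.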

Spark
Get delegation tokens
• HDFS Delegation Token (existing)
– HDFSCredentialProvider gets it from the NameNode
• Hive Metastore Delegation Token (existing)
– HiveCredentialProvider gets it from Hive Metastore
• HiveServer2 Delegation Token (Spark-LLAP)
– HiveServer2CredentialProvider gets it from HiveServer2
Note: Spark manages token renewal

Spark
LlapMetastoreCatalog: Replaces MetastoreRelation with LlapRelation
SELECT gender, count(*)
FROM db_common.t_customer
WHERE name LIKE '%Obama'
GROUP BY gender
Parsed Logical Plan (top to bottom)
  Aggregate: gender
  Filter: name like %Obama
  UnresolvedRelation
Analyzed Logical Plan (top to bottom)
  Aggregate: gender
  Filter: name like %Obama
  SubqueryAlias
  LlapRelation

Spark
LlapMetastoreCatalog: Replaces MetastoreRelation with LlapRelation
Without Spark-LLAP
With Spark-LLAP

Spark
LlapRelation supports predicate pushdown during optimization
Analyzed Logical Plan (top to bottom)
  Aggregate: gender
  Filter: name like %Obama
  SubqueryAlias
  LlapRelation
Optimized Logical Plan (top to bottom)
  Aggregate: gender
  Project: gender
  Filter: EndsWith(name,Obama)
  LlapRelation

Spark
LlapRelation supports predicate pushdown during optimization
Optimized Logical Plan (top to bottom)
  Aggregate: gender
  Project: gender
  Filter: EndsWith(name,Obama)
  LlapRelation
Physical Plan (top to bottom)
  …
  HashAggregate: gender
  Project: gender
  Filter: EndsWith(name, Obama)
  Scan LlapRelation
    PushedFilter: StringEndsWith(name, Obama)
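
The pushdown in these plans can be sketched in plain Python: a simple LIKE pattern is rewritten into a string predicate, and that predicate is handed to the scan so that non-matching rows are dropped at the source. This is a simplified illustration, not Spark's optimizer or the Spark-LLAP code. The class, both function names, and the sample rows are hypothetical, and patterns with `_` or escape characters are deliberately left unhandled.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class SourceFilter:
    """A predicate simple enough to hand to the data source."""

    kind: str  # "StringEndsWith", "StringStartsWith", "StringContains", "EqualTo"
    column: str
    value: str


def simplify_like(column: str, pattern: str) -> Optional[SourceFilter]:
    """Rewrite a simple LIKE pattern; return None when it cannot be pushed."""
    if "_" in pattern or "\\" in pattern:
        return None  # single-char wildcards and escapes are not handled here
    leading = pattern.startswith("%")
    trailing = len(pattern) > 1 and pattern.endswith("%")
    start = 1 if leading else 0
    end = len(pattern) - 1 if trailing else len(pattern)
    body = pattern[start:end]
    if not body or "%" in body:
        return None  # empty or inner wildcard: keep the generic LIKE
    if leading and trailing:
        return SourceFilter("StringContains", column, body)
    if leading:
        return SourceFilter("StringEndsWith", column, body)
    if trailing:
        return SourceFilter("StringStartsWith", column, body)
    return SourceFilter("EqualTo", column, body)


def scan(rows: List[Dict[str, str]], pushed: SourceFilter) -> List[Dict[str, str]]:
    """Apply the pushed filter while reading, as a source-side scan would."""
    tests = {
        "StringEndsWith": str.endswith,
        "StringStartsWith": str.startswith,
        "StringContains": str.__contains__,
        "EqualTo": str.__eq__,
    }
    test = tests[pushed.kind]
    return [row for row in rows if test(row[pushed.column], pushed.value)]


# Hypothetical rows standing in for db_common.t_customer.
rows = [
    {"name": "Barack Obama", "gender": "M"},
    {"name": "Michelle Obama", "gender": "F"},
    {"name": "Obama Fan", "gender": "F"},
]
pushed = simplify_like("name", "%Obama")
print(pushed)
print(scan(rows, pushed))
```

Note that the physical plan on the slide still keeps a `Filter: EndsWith(name, Obama)` above the scan, so the pushed filter reduces the data read without being the only place the predicate is evaluated.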

Spark
Read filtered and masked data from LLAP
jobConf.set("hive.llap.zk.registry.user", "hive")
jobConf.set("llap.if.hs2.connection", parameters("url"))
jobConf.set("llap.if.query", queryString)
…
// Create Hadoop RDD and convert LLAP Row into Spark Row
sc.sparkContext
.hadoopRDD(…)
.mapPartitionsWithInputSplit(…)

Demo (Video)

Some related Spark issues
• SPARK-14743 Add a configurable credential manager for Spark running on YARN
• SPARK-15777 Catalog federation (Open)
• SPARK-17767 Spark SQL ExternalCatalog API custom implementation support (Closed as Later)
• SPARK-17819 Support default database in connection URIs for Spark Thrift Server
• SPARK-18517 DROP TABLE IF EXISTS should not warn for non-exist
• SPARK-18840 Avoid throw exception when getting token renewal interval in non HDFS security env.
• SPARK-18857 Don't use `Iterator.duplicate` in STS
• SPARK-19021 Generalize HDFSCredentialProvider to support non HDFS security filesystems
• SPARK-19038 Avoid overwriting keytab configuration in yarn-client
• SPARK-19179 Change spark.yarn.access.namenodes config and update docs
• SPARK-19970 Table owner should be USER instead of PRINCIPAL
• SPARK-19995 Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode

Summary
• Support row/column-level security with
– Spark apps
– Spark shells
– Spark Thrift Server
• You can use the existing Spark 2.X SQL apps and scripts
• Easy to turn on/off with configuration only
• Ranger enforces policies on Hive and Spark simultaneously and consistently
Spark-LLAP with HDP 2.6 is a Technical Preview (TP)

Acknowledgement
• Apache Hive / Apache Spark / Apache Ranger
• Bikas Saha, Saisai Shao, Jason Dere, Thejas Nair, Zhan Zhang, and many others

Thank you
