1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Row/Column-level Security in SQL for Apache Spark
Dongjoon Hyun – Software Engineer
Bikas Saha – Software Engineer
April 2017
Who am I
Dongjoon Hyun
• Software Engineer @ Hortonworks
• Apache REEF PMC member and committer
• Apache Spark project contributor
• https://github.com/dongjoon-hyun
Agenda
Security Issues
Goals
Components
How it works
Demo
Security
• One of the fundamental features for enterprise adoption
– Multi-tenancy: Billing team / Data science team / Marketing teams
• Row and column-level access control for SQL users
– Row filtering
– Column masking
• Must enforce shared policies across multiple SQL engines simultaneously
– E.g. Apache Spark 2.1/1.6 and Apache Hive 2.1
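The two controls named above can be illustrated with a small, engine-independent sketch. This is a toy model in plain Python, not the Ranger or Spark API; the sample table, the `region` predicate, and the `mask_ssn` helper are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]


def apply_policy(
    rows: List[Row],
    row_filter: Callable[[Row], bool],
    masks: Dict[str, Callable[[object], object]],
) -> List[Row]:
    """Return only the rows the user may see, with masked columns rewritten."""
    visible = [r for r in rows if row_filter(r)]  # row filtering
    return [
        {col: masks[col](val) if col in masks else val for col, val in r.items()}
        for r in visible  # column masking
    ]


def mask_ssn(value: object) -> str:
    """Hypothetical mask: keep only the last four characters."""
    return "xxx-xx-" + str(value)[-4:]


# Hypothetical sample data standing in for a customer table.
customers: List[Row] = [
    {"name": "Alice", "region": "EU", "ssn": "123-45-6789"},
    {"name": "Bob", "region": "US", "ssn": "987-65-4321"},
]

result = apply_policy(customers, lambda r: r["region"] == "US", {"ssn": mask_ssn})
print(result)
```

The point of the sketch is that both controls are applied before the data reaches the caller, so the query text itself does not need to change.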
Issue 1
• Spark reads all or nothing
– Directory/file-based permissions are insufficient
• Permission 777 on warehouse?
Security starts from storage
Issue 2
• Spark apps need to be rewritten
– Special data source tables
• Duplicated data
– Filtered rows
– Removed or masked columns
• SQL views
– Maintained manually
Overhead of setting up and maintaining security policies
Goals
Goal 1: Spark SQL Apps
Support row/column-level security in batch apps
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("select * from db_common.t_customer").show()
db_common
t_customer
…
Goal 2: Spark shells (1/2)
Support row/column-level security in all shells
spark-shell
pyspark
Goal 2: Spark shells (2/2)
Support row/column-level security in all shells
sparkR
spark-sql
Goal 3: Spark Thrift Server
Support row/column-level security with Spark Thrift Server
Login as `hive`
Login as `spark`
Components
What is required?
• Kerberos
• Apache Hadoop (HDFS/YARN)
• Apache Ranger
• Apache Hive (LLAP)
• Spark-LLAP: a library and patches to integrate the above (focus here)
Apache Ranger
Provide a standard authorization method across many Hadoop components
https://hortonworks.com/apache/ranger/#section_2
Apache Hive
• Hive Ranger Plugin & Policies
– Support row/column-level security
• LLAP Daemon (GA in HDP 2.6)
– Persistent query servers with intelligent in-memory caching
– Provide a secure relational datanode view of the data
Trusted Service
Milestone: Spark-LLAP (Technical Preview)
HDP 2.5: Spark-LLAP for Spark 1.6
• Users should use LlapContext
• Supports Scala/Java and spark-shell
var lc = new LlapContext(sc)
lc.sql("select * from t").show
HDP 2.6: Spark-LLAP for Spark 2.1
• No need to rewrite SQL-related code
• Supports all languages and shells
Next: Spark-LLAP for Spark 2.1
• Support YARN cluster mode
Spark-LLAP GitHub (Apache License)
How it works
How it works – Overview
Case: spark-submit with YARN cluster mode
Components: Spark, Hive (HiveServer2), Ranger, LLAP
Actors: User, Admin
1. Manage policies
2. Launch
3. Get delegation token
4. Get metadata
5. Get data locations
6. Read filtered/masked data
7. Monitor audits
Authorize (unnumbered arrow in the diagram)
How it works – Overview
Legend: Existing Infra / New for Spark / New for Hive (GA)
Hive
Enable LLAP
Admin – Manage
Hive Database: db_common
Table: *
Hive Column: *
Select User: spark
Permissions: SELECT
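To make the policy fields above concrete, here is a minimal sketch of how a wildcard resource policy could be evaluated. It is a toy model, not Ranger's policy engine or REST schema; the `Policy` class and `is_allowed` function are hypothetical names, and only the field values (db_common, *, *, spark, SELECT) come from the slide.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import FrozenSet


@dataclass(frozen=True)
class Policy:
    database: str
    table: str
    column: str
    user: str
    permissions: FrozenSet[str]


def is_allowed(policy: Policy, user: str, database: str,
               table: str, column: str, action: str) -> bool:
    """True only if the user, action, and resource all match the policy."""
    return (
        user == policy.user
        and action in policy.permissions
        and fnmatchcase(database, policy.database)  # "*" matches any name
        and fnmatchcase(table, policy.table)
        and fnmatchcase(column, policy.column)
    )


# Values taken from the "Admin - Manage" slide.
policy = Policy("db_common", "*", "*", "spark", frozenset({"SELECT"}))

print(is_allowed(policy, "spark", "db_common", "t_customer", "name", "SELECT"))
print(is_allowed(policy, "spark", "db_common", "t_customer", "name", "UPDATE"))
```

In this sketch a request that matches no policy is simply not allowed; how the real system combines multiple policies is outside what the slides show.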
Admin – Audit
User
Launch Spark jobs
spark-submit \
  --jars spark-llap.jar \
  --conf spark.sql.hive.llap=true \
  --conf spark.yarn.security.credentials.hiveserver2.enabled=true \
  --master yarn \
  --deploy-mode cluster \
  sql.py
Note: There are more static configurations related to LLAP
The `--packages` option is supported, too
Easy to turn on/off
Only used for YARN cluster mode
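The "easy to turn on/off" point can be sketched as a small launcher helper that adds or omits the Spark-LLAP options. This is only an illustration: `build_spark_submit` is a hypothetical helper, and it assumes that leaving the three options out is what "off" means; the option names and values themselves are the ones on the slide.

```python
import shlex
from typing import List


def build_spark_submit(app: str, llap_enabled: bool = True) -> List[str]:
    """Assemble a spark-submit argument list, optionally with Spark-LLAP options."""
    args = ["spark-submit"]
    if llap_enabled:
        args += [
            "--jars", "spark-llap.jar",
            "--conf", "spark.sql.hive.llap=true",
            "--conf", "spark.yarn.security.credentials.hiveserver2.enabled=true",
        ]
    args += ["--master", "yarn", "--deploy-mode", "cluster", app]
    return args


secured = build_spark_submit("sql.py")
plain = build_spark_submit("sql.py", llap_enabled=False)

# shlex.join (Python 3.8+) renders the list as a shell-safe command line.
print(shlex.join(secured))
print(shlex.join(plain))
```

The application file (`sql.py`) is identical in both cases, which matches the deck's goal of not rewriting SQL apps.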
Spark
Get delegation tokens
• HDFS Delegation Token (existing)
– HDFSCredentialProvider gets it from the NameNode
• Hive Metastore Delegation Token (existing)
– HiveCredentialProvider gets it from Hive Metastore
• HiveServer2 Delegation Token (Spark-LLAP)
– HiveServer2CredentialProvider gets it from HiveServer2
Note: Spark manages token renewal
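The launch command's `spark.yarn.security.credentials.hiveserver2.enabled` setting suggests a per-service on/off switch for credential providers. The sketch below models that selection in plain Python. It is an assumption-laden toy: `enabled_providers` is a hypothetical function, the service names `hdfs` and `hive` are assumed (only `hiveserver2` appears in the slides), and the `default` value is a parameter rather than a statement about Spark's actual default.

```python
from typing import Dict, List

# Key pattern inferred from the slide's --conf option; treated as an assumption.
KEY_TEMPLATE = "spark.yarn.security.credentials.{service}.enabled"


def enabled_providers(conf: Dict[str, str], services: List[str],
                      default: bool = True) -> List[str]:
    """Return the services whose credential provider is switched on."""
    enabled = []
    for service in services:
        raw = conf.get(KEY_TEMPLATE.format(service=service))
        is_on = default if raw is None else raw.strip().lower() == "true"
        if is_on:
            enabled.append(service)
    return enabled


conf = {
    "spark.yarn.security.credentials.hiveserver2.enabled": "true",
    "spark.yarn.security.credentials.hive.enabled": "false",
}
active = enabled_providers(conf, ["hdfs", "hive", "hiveserver2"])
print(active)
```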
Spark
LlapMetastoreCatalog: Replaces MetastoreRelation with LlapRelation
SELECT gender, count(*)
FROM db_common.t_customer
WHERE name LIKE '%Obama'
GROUP BY gender
Parsed Logical Plan
  Aggregate: gender
    Filter: name like %Obama
      UnresolvedRelation
Analyzed Logical Plan
  Aggregate: gender
    Filter: name like %Obama
      SubqueryAlias
        LlapRelation
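The effect of swapping the leaf relation can be shown on a toy plan tree. This sketch only illustrates the outcome described on the slide; it does not reproduce how LlapMetastoreCatalog works internally, and the `Node`, `replace_relation`, and `render` names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)


def replace_relation(node: Node, old: str, new: str) -> Node:
    """Return a copy of the tree with every node named `old` renamed to `new`."""
    return Node(
        new if node.name == old else node.name,
        [replace_relation(child, old, new) for child in node.children],
    )


def render(node: Node, depth: int = 0) -> List[str]:
    """Render the tree root-first, two spaces of indentation per level."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines += render(child, depth + 1)
    return lines


plan = Node("Aggregate: gender", [
    Node("Filter: name like %Obama", [
        Node("SubqueryAlias", [Node("MetastoreRelation")]),
    ]),
])
llap_plan = replace_relation(plan, "MetastoreRelation", "LlapRelation")
print("\n".join(render(llap_plan)))
```

Everything above the leaf is untouched, which is why the user's SQL and the rest of the plan stay the same.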
Spark
LlapMetastoreCatalog: Replaces MetastoreRelation with LlapRelation
Without Spark-LLAP
With Spark-LLAP
Spark
LlapRelation supports predicate pushdown during optimization
Analyzed Logical Plan
  Aggregate: gender
    Filter: name like %Obama
      SubqueryAlias
        LlapRelation
Optimized Logical Plan
  Aggregate: gender
    Project: gender
      Filter: EndsWith(name,Obama)
        LlapRelation
Spark
LlapRelation supports predicate pushdown during optimization
Physical Plan
  …
  HashAggregate: gender
    Project: gender
      Filter: EndsWith(name, Obama)
        Scan LlapRelation
        PushedFilter: StringEndsWith(name, Obama)
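The step from `name like %Obama` to `EndsWith(name, Obama)` and then to a pushed `StringEndsWith` filter can be sketched as two small functions. This is a simplified model of LIKE simplification, not Spark's Catalyst code: `simplify_like` and `to_source_filter` are hypothetical names, and patterns with `_`, escapes, or inner `%` are deliberately left untouched.

```python
from typing import Tuple

Expr = Tuple[str, str, str]  # (operator, column, value)


def simplify_like(column: str, pattern: str) -> Expr:
    """Rewrite simple LIKE patterns into cheaper string predicates."""
    starts = pattern.startswith("%")
    ends = pattern.endswith("%") and len(pattern) > 1
    core = pattern[(1 if starts else 0):(len(pattern) - 1 if ends else len(pattern))]
    # Leave anything with remaining wildcards or escapes as a plain LIKE.
    if not core or any(ch in core for ch in "%_\\"):
        return ("Like", column, pattern)
    if starts and ends:
        return ("Contains", column, core)
    if starts:
        return ("EndsWith", column, core)
    if ends:
        return ("StartsWith", column, core)
    return ("EqualTo", column, core)


# Mapping from plan predicates to data source filter names.
PUSHDOWN_NAMES = {
    "EndsWith": "StringEndsWith",
    "StartsWith": "StringStartsWith",
    "Contains": "StringContains",
    "EqualTo": "EqualTo",
}


def to_source_filter(expr: Expr) -> Expr:
    """Translate a simplified predicate into a pushable filter, if one exists."""
    op, column, value = expr
    if op not in PUSHDOWN_NAMES:
        raise ValueError(f"{op} cannot be pushed down in this sketch")
    return (PUSHDOWN_NAMES[op], column, value)


optimized = simplify_like("name", "%Obama")
pushed = to_source_filter(optimized)
print(optimized, pushed)
```

Note that the physical plan on the slide keeps the `Filter` above the scan even though the predicate is also pushed, so the pushed filter acts as an optimization rather than the only check.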
Spark
Read filtered and masked data from LLAP
jobConf.set("hive.llap.zk.registry.user", "hive")
jobConf.set("llap.if.hs2.connection", parameters("url"))
jobConf.set("llap.if.query", queryString)
…
// Create Hadoop RDD and convert LLAP Row into Spark Row
sc.sparkContext
.hadoopRDD(…)
.mapPartitionsWithInputSplit(…)
Demo (Video)
Some related Spark issues
• SPARK-14743 Add a configurable credential manager for Spark running on YARN
• SPARK-15777 Catalog federation (Open)
• SPARK-17767 Spark SQL ExternalCatalog API custom implementation support (Closed as Later)
• SPARK-17819 Support default database in connection URIs for Spark Thrift Server
• SPARK-18517 DROP TABLE IF EXISTS should not warn for non-existing tables
• SPARK-18840 Avoid throw exception when getting token renewal interval in non HDFS security env.
• SPARK-18857 Don't use `Iterator.duplicate` in STS
• SPARK-19021 Generalize HDFSCredentialProvider to support non HDFS security filesystems
• SPARK-19038 Avoid overwriting keytab configuration in yarn-client
• SPARK-19179 Change spark.yarn.access.namenodes config and update docs
• SPARK-19970 Table owner should be USER instead of PRINCIPAL
• SPARK-19995 Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode
Summary
• Support row/column-level security with
– Spark apps
– Spark shells
– Spark Thrift Server
• You can use existing Spark 2.x SQL apps and scripts
• Easy to turn on/off through configuration alone
• Ranger enforces policies for Hive and Spark simultaneously and consistently
Spark-LLAP with HDP 2.6 is a Technical Preview (TP)
Acknowledgement
• Apache Hive / Apache Spark / Apache Ranger
• Bikas Saha, Saisai Shao, Jason Dere, Thejas Nair, Zhan Zhang, and many others
Thank you

More Related Content

What's hot

Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Databricks
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Vinoth Chandar
 
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsYaroslav Tkachenko
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Databricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...StreamNative
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialDaniel Abadi
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataDataWorks Summit
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookDatabricks
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
 

What's hot (20)

Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
Lessons from the Field: Applying Best Practices to Your Apache Spark Applicat...
 
A Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQLA Deep Dive into Query Execution Engine of Spark SQL
A Deep Dive into Query Execution Engine of Spark SQL
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived Hadoop Strata Talk - Uber, your hadoop has arrived
Hadoop Strata Talk - Uber, your hadoop has arrived
 
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBaseHBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Storing State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your AnalyticsStoring State Forever: Why It Can Be Good For Your Analytics
Storing State Forever: Why It Can Be Good For Your Analytics
 
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
Deep Dive into Stateful Stream Processing in Structured Streaming with Tathag...
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
Change Data Capture to Data Lakes Using Apache Pulsar and Apache Hudi - Pulsa...
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
SQL-on-Hadoop Tutorial
SQL-on-Hadoop TutorialSQL-on-Hadoop Tutorial
SQL-on-Hadoop Tutorial
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
Spark SQL Join Improvement at Facebook
Spark SQL Join Improvement at FacebookSpark SQL Join Improvement at Facebook
Spark SQL Join Improvement at Facebook
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 

Similar to Row/Column- Level Security in SQL for Apache Spark

Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDon't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDataWorks Summit
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with ZeppelinHortonworks
 
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4DataWorks Summit
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object StoresSteve Loughran
 
Security Updates: More Seamless Access Controls with Apache Spark and Apache ...
Security Updates: More Seamless Access Controls with Apache Spark and Apache ...Security Updates: More Seamless Access Controls with Apache Spark and Apache ...
Security Updates: More Seamless Access Controls with Apache Spark and Apache ...DataWorks Summit
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark Summit
 
Running Apache Zeppelin production
Running Apache Zeppelin productionRunning Apache Zeppelin production
Running Apache Zeppelin productionVinay Shukla
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in EnterpriseDataWorks Summit
 
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4DataWorks Summit
 
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...VMware Tanzu
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_featuresAlberto Romero
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017alanfgates
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseDataWorks Summit
 
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisDataWorks Summit/Hadoop Summit
 
Accumulo Summit 2016: Apache Accumulo on Docker with YARN Native Services
Accumulo Summit 2016: Apache Accumulo on Docker with YARN Native ServicesAccumulo Summit 2016: Apache Accumulo on Docker with YARN Native Services
Accumulo Summit 2016: Apache Accumulo on Docker with YARN Native ServicesAccumulo Summit
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark Hortonworks
 

Similar to Row/Column- Level Security in SQL for Apache Spark (20)

Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDon't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
 
Intro to Spark with Zeppelin
Intro to Spark with ZeppelinIntro to Spark with Zeppelin
Intro to Spark with Zeppelin
 
What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4What s new in spark 2.3 and spark 2.4
What s new in spark 2.3 and spark 2.4
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object Stores
 
Running Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in ProductionRunning Apache Spark & Apache Zeppelin in Production
Running Apache Spark & Apache Zeppelin in Production
 
Security Updates: More Seamless Access Controls with Apache Spark and Apache ...
Security Updates: More Seamless Access Controls with Apache Spark and Apache ...Security Updates: More Seamless Access Controls with Apache Spark and Apache ...
Security Updates: More Seamless Access Controls with Apache Spark and Apache ...
 
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
Spark and Object Stores —What You Need to Know: Spark Summit East talk by Ste...
 
Spark Security
Spark SecuritySpark Security
Spark Security
 
Running Apache Zeppelin production
Running Apache Zeppelin productionRunning Apache Zeppelin production
Running Apache Zeppelin production
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in Enterprise
 
What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4What’s new in Apache Spark 2.3 and Spark 2.4
What’s new in Apache Spark 2.3 and Spark 2.4
 
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics...
 
Hive acid and_2.x new_features
Hive acid and_2.x new_featuresHive acid and_2.x new_features
Hive acid and_2.x new_features
 
Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017Hive edw-dataworks summit-eu-april-2017
Hive edw-dataworks summit-eu-april-2017
 
An Apache Hive Based Data Warehouse
An Apache Hive Based Data WarehouseAn Apache Hive Based Data Warehouse
An Apache Hive Based Data Warehouse
 
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
 
Accumulo Summit 2016: Apache Accumulo on Docker with YARN Native Services
Accumulo Summit 2016: Apache Accumulo on Docker with YARN Native ServicesAccumulo Summit 2016: Apache Accumulo on Docker with YARN Native Services
Accumulo Summit 2016: Apache Accumulo on Docker with YARN Native Services
 
YARN Ready: Apache Spark
YARN Ready: Apache Spark YARN Ready: Apache Spark
YARN Ready: Apache Spark
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Row/Column-Level Security in SQL for Apache Spark

  • 1. Row/Column-level Security in SQL for Apache Spark
       Dongjoon Hyun – Software Engineer
       Bikas Saha – Software Engineer
       April 2017
       © Hortonworks Inc. 2011 – 2016. All Rights Reserved
  • 2. Who am I
       Dongjoon Hyun
       - Software Engineer @ Hortonworks
       - Apache REEF PMC member and committer
       - Apache Spark project contributor
       - https://github.com/dongjoon-hyun
  • 3. Agenda
       - Security Issues
       - Goals
       - Components
       - How it works
       - Demo
  • 4. Security
       - One of the fundamental features for enterprise adoption
         - Multi-tenancy: Billing team / Data science team / Marketing teams
       - Row- and column-level access control for SQL users
         - Row filtering
         - Column masking
       - Shared policies must be enforced across multiple SQL engines simultaneously
         - E.g. Apache Spark 2.1/1.6 and Apache Hive 2.1
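To make the two terms on this slide concrete, here is a minimal pure-Python sketch of what row filtering and column masking do to a result set. It is only an illustration of the semantics: in the architecture described in this talk the enforcement is done by Ranger policies and LLAP, not by application code. The sample data, the `apply_policy` helper, and the `mask_show_last4` masking function are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]


def mask_show_last4(value: str) -> str:
    """Hypothetical mask: replace all but the last four characters with 'x'."""
    if len(value) <= 4:
        return value
    return "x" * (len(value) - 4) + value[-4:]


def apply_policy(
    rows: Iterable[Row],
    row_filter: Callable[[Row], bool],
    column_masks: Dict[str, Callable[[str], str]],
) -> List[Row]:
    """Drop rows the user may not see, then mask the restricted columns."""
    visible = []
    for row in rows:
        if not row_filter(row):  # row filtering
            continue
        visible.append(
            {
                col: column_masks[col](val) if col in column_masks else val
                for col, val in row.items()  # column masking
            }
        )
    return visible


# Made-up sample data, not taken from the talk.
customers = [
    {"name": "Alice", "region": "US", "card": "1234567890123456"},
    {"name": "Bob", "region": "EU", "card": "9876543210987654"},
]

visible = apply_policy(
    customers,
    row_filter=lambda r: r["region"] == "US",
    column_masks={"card": mask_show_last4},
)
print(visible)
```

The point of the sketch is that both operations are applied before the user sees any data, so the same query text yields different results for different users.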
  • 5. Issue 1
       - Spark reads all or nothing
         - Directory/file-based permissions are insufficient
       - Permission 777 on warehouse?
       Security starts from storage
  • 6. Issue 2
       - Spark apps should be rewritten
         - Special data source tables
       - Duplicated data
         - Filtered rows
         - Removed or masked columns
       - SQL Views
         - Maintained manually
       Overhead when starting and maintaining security policies
  • 7. Goals
  • 8. Goal 1: Spark SQL Apps
       Support row/column-level security with the batch apps

           from pyspark.sql import SparkSession
           spark = SparkSession \
               .builder \
               .enableHiveSupport() \
               .getOrCreate()
           spark.sql("select * from db_common.t_customer").show()

       (Figure: database db_common, table t_customer, …)
  • 9. Goal 2: Spark shells (1/2)
       Support row/column-level security in all shells
       - spark-shell
       - pyspark
  • 10. Goal 2: Spark shells (2/2)
       Support row/column-level security in all shells
       - sparkR
       - spark-sql
  • 11. Goal 3: Spark Thrift Server
       Support row/column-level security with Spark Thrift Server
       - Login as `hive`
       - Login as `spark`
  • 12. Components
  • 13. What is required?
       - Kerberos
       - Apache Hadoop (HDFS/YARN)
       - Apache Ranger
       - Apache Hive (LLAP)
       - Spark-LLAP: A library and patches to integrate the above
       Focus here
  • 14. Apache Ranger
       Provides a standard authorization method across many Hadoop components
       https://hortonworks.com/apache/ranger/#section_2
  • 15. Apache Hive
       - Hive Ranger Plugin & Policies
         - Support row/column-level security
       - LLAP Daemon (GA in HDP 2.6)
         - Persistent query servers with intelligent in-memory caching
         - Provide a secure relational datanode view of the data
       Trusted Service
  • 16. Spark-LLAP (Technical Preview) Milestone
       - HDP 2.5: Spark-LLAP for Spark 1.6
         - User should use LlapContext
         - Support Scala/Java and spark-shell

             var lc = new LlapContext(sc)
             lc.sql("select * from t").show

       - HDP 2.6: Spark-LLAP for Spark 2.1
         - No need to rewrite SQL related code
         - Support all languages and shells
       - Next: Spark-LLAP for Spark 2.1
         - Support YARN cluster mode
  • 17. Spark-LLAP GitHub (Apache License)
  • 18. How it works
  • 19. How it works: Overview
       Case: spark-submit with YARN cluster mode
       (Diagram) Components: User, Admin, Spark, Hive (HiveServer2), Ranger, LLAP
       1. Manage policies
       2. Launch
       3. Get delegation token
       4. Get metadata
       5. Get data locations
       6. Read filtered/masked data
       7. Monitor audits
       Unnumbered arrow label: Authorize
  • 20. How it works: Overview
       (Same diagram and steps as slide 19, with a legend added)
       Legend: Existing Infra, New for Spark, New for Hive (GA)
  • 21. Hive: Enable LLAP
  • 22. Admin: Manage
       - Hive Database: db_common
       - Table: *
       - Hive Column: *
       - Select User: spark
       - Permissions: SELECT
  • 23. Admin: Audit
  • 24. User: Launch Spark jobs

           spark-submit --jars spark-llap.jar \
             --conf spark.sql.hive.llap=true \
             --conf spark.yarn.security.credentials.hiveserver2.enabled=true \
             --master yarn --deploy-mode cluster sql.py

       Callouts on the command: "Easy to turn on/off", "Only used for YARN cluster mode"
       Note: There are more static configurations related to LLAP. The `--packages` option is supported, too.
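The "easy to turn on/off" point can be sketched as a small Python helper that assembles the same command line with or without the Spark-LLAP options. The flag names and values are the ones shown on the slide; the helper `build_spark_submit_args`, its parameters, and the choice to drop all three LLAP-related options together when disabled are assumptions made for illustration.

```python
from typing import List


def build_spark_submit_args(
    app: str,
    enable_llap: bool = True,
    llap_jar: str = "spark-llap.jar",
) -> List[str]:
    """Assemble a spark-submit argument list (hypothetical helper).

    With enable_llap=False the LLAP jar and both --conf settings are
    omitted, so the same application runs without Spark-LLAP.
    """
    args = ["spark-submit"]
    if enable_llap:
        args += [
            "--jars", llap_jar,
            "--conf", "spark.sql.hive.llap=true",
            # Per the slide, relevant for YARN cluster mode.
            "--conf", "spark.yarn.security.credentials.hiveserver2.enabled=true",
        ]
    args += ["--master", "yarn", "--deploy-mode", "cluster", app]
    return args


print(" ".join(build_spark_submit_args("sql.py")))
print(" ".join(build_spark_submit_args("sql.py", enable_llap=False)))
```

The returned list can be passed to `subprocess.run` on a machine where `spark-submit` is installed; the application script itself (`sql.py`) stays unchanged in both cases.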
  • 25. Spark: Get delegation tokens
       - HDFS Delegation Token
         - HDFSCredentialProvider gets it from the namenode
       - Hive Metastore Delegation Token
         - HiveCredentialProvider gets it from Hive Metastore
       - HiveServer2 Delegation Token
         - HiveServer2CredentialProvider gets it from HiveServer2
       Legend: Spark-LLAP, Existing
       Note: Spark manages token renewal
  • 26. Spark: LlapMetastoreCatalog replaces MetastoreRelation with LlapRelation

           SELECT gender, count(*)
           FROM db_common.t_customer
           WHERE name LIKE '%Obama'
           GROUP BY gender

       Parsed Logical Plan:
         Aggregate: gender
           Filter: name like %Obama
             UnresolvedRelation
       Analyzed Logical Plan:
         Aggregate: gender
           Filter: name like %Obama
             SubqueryAlias
               LlapRelation
  • 27. Spark: LlapMetastoreCatalog replaces MetastoreRelation with LlapRelation
       (Screenshots: Without Spark-LLAP, With Spark-LLAP)
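The relation swap on these two slides can be pictured with a small Python sketch. The class names mirror the node names shown on the slides, but they are plain Python stand-ins, not the Spark or Spark-LLAP classes, and `lookup_relation` with its `llap_enabled` flag is a hypothetical function written only to show the idea: the rest of the plan is untouched, only the leaf relation differs.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class MetastoreRelation:
    """Stand-in for a table read directly from the warehouse files."""
    database: str
    table: str


@dataclass(frozen=True)
class LlapRelation:
    """Stand-in for a table read through LLAP."""
    database: str
    table: str


@dataclass(frozen=True)
class SubqueryAlias:
    alias: str
    child: Union[MetastoreRelation, LlapRelation]


def lookup_relation(database: str, table: str, llap_enabled: bool) -> SubqueryAlias:
    """Resolve a table name to a leaf relation (hypothetical catalog lookup)."""
    if llap_enabled:
        leaf: Union[MetastoreRelation, LlapRelation] = LlapRelation(database, table)
    else:
        leaf = MetastoreRelation(database, table)
    return SubqueryAlias(table, leaf)


print(lookup_relation("db_common", "t_customer", llap_enabled=False))
print(lookup_relation("db_common", "t_customer", llap_enabled=True))
```

Because the swap happens at name resolution, the SQL text and the operators above the relation (Filter, Aggregate) do not need to change.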
  • 28. Spark: LlapRelation supports predicate pushdown during optimization
       Analyzed Logical Plan:
         Aggregate: gender
           Filter: name like %Obama
             SubqueryAlias
               LlapRelation
       Optimized Logical Plan:
         Aggregate: gender
           Project: gender
             Filter: EndsWith(name, Obama)
               LlapRelation
  • 29. Spark: LlapRelation supports predicate pushdown during optimization
       Analyzed Logical Plan:
         Aggregate: gender
           Filter: name like %Obama
             SubqueryAlias
               LlapRelation
       Optimized Logical Plan:
         Aggregate: gender
           Project: gender
             Filter: EndsWith(name, Obama)
               LlapRelation
       Physical Plan:
         …
         HashAggregate: gender
           Project: gender
             Filter: EndsWith(name, Obama)
               Scan LlapRelation, PushedFilter: StringEndsWith(name, Obama)
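The rewrite from `name like %Obama` to `EndsWith(name, Obama)` and then to the pushed filter `StringEndsWith(name, Obama)` can be sketched in Python as follows. This is a simplified illustration, not Spark's optimizer code: the function names are hypothetical, and any pattern containing `_`, an escape character, or an inner `%` is conservatively left as a plain `Like` that is not pushed down.

```python
from typing import Optional, Tuple


def simplify_like(pattern: str) -> Tuple[str, str]:
    """Map a simple LIKE pattern to a cheaper string predicate.

    Returns (predicate_name, literal). Patterns this sketch does not
    handle are returned unchanged as ("Like", pattern).
    """
    if "_" in pattern or "\\" in pattern:
        return ("Like", pattern)
    leading = pattern.startswith("%")
    trailing = pattern.endswith("%") and len(pattern) > 1
    core = pattern[1 if leading else 0 : len(pattern) - (1 if trailing else 0)]
    if core == "" or "%" in core:
        return ("Like", pattern)
    if leading and trailing:
        return ("Contains", core)
    if leading:
        return ("EndsWith", core)
    if trailing:
        return ("StartsWith", core)
    return ("EqualTo", core)


# Names of the data source filters the simplified predicates map to.
PUSHED_FILTER_NAMES = {
    "EndsWith": "StringEndsWith",
    "StartsWith": "StringStartsWith",
    "Contains": "StringContains",
    "EqualTo": "EqualTo",
}


def pushed_filter(column: str, pattern: str) -> Optional[str]:
    """Render the pushed filter as text, or None if nothing is pushed."""
    predicate, literal = simplify_like(pattern)
    name = PUSHED_FILTER_NAMES.get(predicate)
    return None if name is None else f"{name}({column}, {literal})"


print(simplify_like("%Obama"))
print(pushed_filter("name", "%Obama"))
```

Pushing the filter into the scan means the rows are already reduced on the LLAP side before they are converted into Spark rows.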
  • 30. Spark: Read filtered and masked data from LLAP

           jobConf.set("hive.llap.zk.registry.user", "hive")
           jobConf.set("llap.if.hs2.connection", parameters("url"))
           jobConf.set("llap.if.query", queryString)
           …
           // Create Hadoop RDD and convert LLAP Row into Spark Row
           sc.sparkContext
             .hadoopRDD(…)
             .mapPartitionsWithInputSplit(…)
  • 31. Demo (Video)
  • 32. Some related SPARK Issues
       - SPARK-14743 Add a configurable credential manager for Spark running on YARN
       - SPARK-15777 Catalog federation (Open)
       - SPARK-17767 Spark SQL ExternalCatalog API custom implementation support (Closed as Later)
       - SPARK-17819 Support default database in connection URIs for Spark Thrift Server
       - SPARK-18517 DROP TABLE IF EXISTS should not warn for non-exist
       - SPARK-18840 Avoid throw exception when getting token renewal interval in non HDFS security env.
       - SPARK-18857 Don't use `Iterator.duplicate` in STS
       - SPARK-19021 Generalize HDFSCredentialProvider to support non HDFS security filesystems
       - SPARK-19038 Avoid overwriting keytab configuration in yarn-client
       - SPARK-19179 Change spark.yarn.access.namenodes config and update docs
       - SPARK-19970 Table owner should be USER instead of PRINCIPAL
       - SPARK-19995 Register tokens to current UGI to avoid re-issuing of tokens in yarn client mode
  • 33. Summary
       - Support row/column-level security with
         - Spark apps
         - Spark shells
         - Spark Thrift Server
       - You can use the existing Spark 2.X SQL apps and scripts
       - Easy to turn on/off with only configurations
       - Ranger enforces policies on Hive and Spark simultaneously and consistently
       Spark-LLAP with HDP 2.6 is a Technical Preview (TP)
  • 34. Acknowledgement
       - Apache Hive / Apache Spark / Apache Ranger
       - Bikas Saha, Saisai Shao, Jason Dere, Thejas Nair, Zhan Zhang, and many others
  • 35. Thank you