State of Security: Apache Spark & Apache Zeppelin
3. © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who Am I?
Product Management
Spark for 2.5+ years, Hadoop for 3+ years
Recovering Programmer
Blog at www.vinayshukla.com
Twitter: @neomythos
Addicted to Yoga, Hiking, & Coffee
Minor contributor to Apache Zeppelin
Vinay Shukla
4.
Security: Rings of Defense
Perimeter Level Security
•Network security (e.g. firewalls)
Data Protection
•Wire encryption
•HDFS TDE/DARE (data-at-rest encryption)
•Others
Authentication
•Kerberos
•Knox (Other Gateways)
OS Security
Authorization
•Apache Ranger/Sentry
•HDFS Permissions
•HDFS ACLs
•YARN ACL
5.
Key to Spark Security
Spark processes data in memory; it does not store it.
6.
Context: Spark Deployment Modes
• Spark on YARN
–Spark driver (SparkContext) runs in the YARN Application Master (yarn-cluster)
–Spark driver (SparkContext) runs locally in the client process (yarn-client)
• Spark Shell & Spark Thrift Server run in yarn-client mode only
[Diagram: YARN-Client vs YARN-Cluster — in yarn-client mode the Spark Driver runs in the client process and the cluster hosts the App Master and Executors; in yarn-cluster mode the Spark Driver runs inside the App Master on the cluster.]
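As a sketch of the two modes, the submit commands differ only in the --master value (Spark 1.x syntax, as used elsewhere in this deck; paths and jar names are illustrative):

```shell
# The same SparkPi example submitted in each mode; only --master differs.
CLIENT_CMD="./bin/spark-submit --master yarn-client --class org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10"
CLUSTER_CMD="./bin/spark-submit --master yarn-cluster --class org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10"
echo "$CLIENT_CMD"   # driver runs in this client process
echo "$CLUSTER_CMD"  # driver runs inside the YARN Application Master
```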
7.
Spark on YARN
[Diagram: John Doe runs Spark Submit against the Hadoop cluster — the client talks to the YARN ResourceManager, which launches the Spark AM; the AM requests Executors from Node Managers, and Executors read/write HDFS.]
8.
Spark – Security – Four Pillars
Authentication
Authorization
Audit
Encryption
Spark leverages Kerberos on YARN
Ensure network is secure
9.
Authentication: Kerberos Primer
Actors: Client (with its Kerberos ticket cache), KDC, NameNode (NN), DataNode (DN)
1. kinit – log in and get a Ticket Granting Ticket (TGT)
2. Client stores the TGT in its ticket cache
3. Get a NameNode service ticket (NN-ST) from the KDC
4. Client stores the NN-ST in its ticket cache
5. Read/write a file given the NN-ST and file name; the NN returns block locations, block IDs, and Block Access Tokens if access is permitted
6. Read/write a block on the DN given the Block Access Token and block ID
10.
Kerberos authentication within Spark
Actors: John Doe, KDC, Spark AM, NameNode (NN), YARN, Executors (Hadoop cluster)
1. kinit – John Doe authenticates to the KDC
2. Get a service ticket (ST) for Spark
3. Use the Spark ST to submit the Spark job
4. Spark gets a NameNode (NN) service ticket
5. YARN launches Spark Executors using John Doe's identity
6–7. Executors read from HDFS using John Doe's delegation token
11.
Spark + X (Source of Data)
The same flow applies to any Kerberos-enabled data source X:
1. kinit – John Doe authenticates to the KDC
2. Get a service ticket (ST) for Spark
3. Use the Spark ST to submit the Spark job
4. Spark gets a service ticket for X
5. YARN launches Spark Executors using John Doe's identity
6–7. Executors read from X using John Doe's delegation token
12.
Spark – Kerberos – Example
kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
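A small wrapper can make the kinit prerequisite explicit. This is a sketch, not part of the deck: klist -s is silent and exits non-zero when the ticket cache holds no valid ticket.

```shell
# Guard a submission behind a Kerberos ticket check. `klist -s` exits
# non-zero when no valid ticket is cached (or klist is unavailable).
submit_if_authenticated() {
  if klist -s 2>/dev/null; then
    "$@"   # ticket present: run the wrapped command
  else
    echo "No valid Kerberos ticket; run kinit first" >&2
    return 1
  fi
}
```

Usage: `submit_if_authenticated ./bin/spark-submit --master yarn-cluster ...`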
13.
Spark – Authorization
[Diagram: John Doe gets a service ticket for Spark from the KDC and submits the Spark job to the YARN cluster; Spark gets a NameNode (NN) service ticket and Executors read from HDFS. Ranger answers the authorization questions at each step: "Can John launch this job?" and "Can John read this file?"]
14.
Encryption: Spark – Communication Channels
[Diagram: communication channels between Spark Submit, the RM, the AM/Driver, the NM's Shuffle Service, Executors, and the data source — Control/RPC, Shuffle BlockTransfer, Shuffle Data, Read/Write Data, and FS (Broadcast, File Download).]
15.
Spark Communication Encryption Settings
–Control/RPC: spark.authenticate = true (leverages YARN to distribute keys)
–Shuffle BlockTransfer: spark.authenticate.enableSaslEncryption = true
–Shuffle Data (NM > Executor): leverages YARN-based SSL
–Read/Write Data: depends on the data source; for HDFS, RPC encryption (RC4 | 3DES) or SSL for WebHDFS
–FS (Broadcast, File Download): spark.ssl.enabled = true
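Collected into spark-defaults.conf, the settings above might look like this (a sketch; property names follow Spark's security configuration, but applicable values depend on your Spark version and cluster):

```
spark.authenticate                       true
spark.authenticate.enableSaslEncryption  true
spark.ssl.enabled                        true
```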
16.
Gotchas with Spark Security
Client -> Spark Thrift Server -> Spark Executors – no identity propagation on the 2nd hop
– Lowers security; forces STS to run as the Hive user to read all data
– Workaround: use SparkSQL via the shell or the programmatic API
– https://issues.apache.org/jira/browse/SPARK-5159
Spark + HBase with Kerberos
– Issue fixed in Spark 1.4 (SPARK-6918)
Spark Streaming + Kafka + Kerberos
– Issues fixed in HDP 2.4.x
– No SSL support yet
Spark jobs running > 72 hours
– Delegation token not renewed before Spark 1.4
Spark Shuffle – only SASL, no SSL support
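For the long-running-job gotcha, Spark 1.4+ on YARN can re-obtain delegation tokens itself when given a principal and keytab. A sketch, echoed rather than executed; the principal and keytab path are illustrative:

```shell
# Since Spark 1.4 on YARN, --principal/--keytab let the application
# renew delegation tokens for jobs outliving the token lifetime.
SUBMIT="./bin/spark-submit --master yarn-cluster \
--principal johndoe@EXAMPLE.COM \
--keytab /etc/security/keytabs/johndoe.keytab \
--class org.apache.spark.examples.SparkPi lib/spark-examples*.jar 10"
echo "$SUBMIT"
```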
17.
How can I get Row/Column/Masking with SparkSQL?
Hopefully you went to “Fine Grained Security for Hive & Spark” yesterday
18.
Key Features: Spark Column Security with LLAP
Fine-Grained Column Level Access Control for SparkSQL.
Fully dynamic policies per user. Doesn’t require views.
Use Standard Ranger policies and tools to control access and masking policies.
Flow:
1. SparkSQL gets data locations, known as "splits", from HiveServer2 and plans the query.
2. HiveServer2 authorizes access using Ranger; per-user policies such as row filtering are applied.
3. Spark gets a modified query plan based on the dynamic security policy.
4. Spark reads data from LLAP; filtering/masking is guaranteed by the LLAP server.
[Diagram: Spark Client ↔ HiveServer2 (authorization via Ranger dynamic policies) ↔ Hive Metastore (data locations, view definitions) ↔ LLAP (data read, filter pushdown).]
19.
Example: Per-User Row Filtering by Region in SparkSQL
Original query (from either user):
SELECT * FROM CUSTOMERS WHERE total_spend > 10000

Dynamic Ranger policies rewrite the query per user before LLAP data access:

Spark User 1 (West Region) – dynamic rewrite:
SELECT * FROM CUSTOMERS WHERE total_spend > 10000 AND region = 'west'

Spark User 2 (East Region) – dynamic rewrite:
SELECT * FROM CUSTOMERS WHERE total_spend > 10000 AND region = 'east'

CUSTOMERS table:
User ID | Region | Total Spend
1       | East   | 5,131
2       | East   | 27,828
3       | West   | 55,493
4       | West   | 7,193
5       | East   | 18,193
20.
Interacting with Spark
[Diagram: clients of Spark on YARN — Zeppelin, Spark Shell, the Spark Thrift Server, and a REST server each host a Spark Driver that talks to Executors on the cluster.]
21.
Apache Zeppelin Security
22.
Apache Zeppelin: Authentication + SSL
[Diagram: John Doe connects to Zeppelin over SSL through the firewall (1); Zeppelin authenticates the user against LDAP (2); Zeppelin then runs the user's code on Spark on YARN Executors (3).]
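The SSL leg is configured on the Zeppelin server itself; a zeppelin-site.xml sketch (property names per Zeppelin's configuration docs; keystore path and password are placeholders):

```
<property>
  <name>zeppelin.ssl</name>
  <value>true</value>
</property>
<property>
  <name>zeppelin.ssl.keystore.path</name>
  <value>/etc/zeppelin/conf/keystore.jks</value>
</property>
<property>
  <name>zeppelin.ssl.keystore.password</name>
  <value>change-me</value>
</property>
```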
23.
Zeppelin + Livy E2E Security
[Diagram: John Doe authenticates to Zeppelin via LDAP; Zeppelin's Spark interpreter group calls the Livy APIs over SPNEGO (Kerberos); Livy submits to Spark on YARN over Kerberos/RPC, and the job runs as John Doe.]
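Under the hood, the Livy API leg is plain REST. A sketch of the kind of session request involved — host, port, and payload are illustrative, and the command is echoed rather than run; with SPNEGO enabled, curl's --negotiate flag forwards the caller's Kerberos credentials:

```shell
# Create a Livy session as John Doe. `proxyUser` asks Livy to run the
# session as that end user rather than the Livy service principal.
LIVY_URL="http://livy-host:8998/sessions"   # illustrative host/port
PAYLOAD='{"kind": "spark", "proxyUser": "johndoe"}'
echo curl --negotiate -u : -X POST \
     -H "Content-Type: application/json" \
     -d "$PAYLOAD" "$LIVY_URL"
```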
24.
Apache Zeppelin: Authorization
Notebook-level authorization
Grant permissions (Owner, Reader, Writer) to users/groups on notebooks
LDAP Group integration just got merged (ZEPPELIN-946)
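Zeppelin's authentication front end is Apache Shiro, configured in conf/shiro.ini; a minimal sketch (the user/role entries are placeholders — a real deployment would use an LDAP realm instead of inline [users]):

```
[users]
johndoe = password1, admin

[urls]
/api/version = anon
/** = authc
```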
25.
Thank You
Vinay Shukla
@neomythos
Editor's Notes
Thank you, all the users of Hadoop & Spark
Thank you if you are developing, contributing to Hadoop & Spark
Thank you for coming to this session.
Access control is governed by the external data sources: e.g. for HDFS, S3, and HBase, their access policies still apply
John Doe first authenticates to Kerberos before launching Spark Shell
kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
The first step of security is network security
The second step of security is Authentication
Most Hadoop ecosystem projects rely on Kerberos for authentication
Kerberos – 3 Headed Guard Dog : https://en.wikipedia.org/wiki/Cerberus
Client talks to KDC with Kerberos Library
Orange line – Client to KDC communication
Green line – Client to HDFS communication, does not talk to Kerberos/KDC
Controlling HDFS Authorization is easy/Done
Controlling Hive row/column level authorization in Spark is WIP
For HDFS as Data Source can use RPC or use SSL with WebHDFS
For NM Shuffle Data – Use YARN SSL
Spark supports SSL for FS (Broadcast or File download)
Shuffle Block Transfer supports SASL based encryption – SSL coming
Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
All images from Flickr Commons