SlideShare a Scribd company logo
1 of 28
Running Apache Spark & Apache
Zeppelin in Production
Director, Product Management
August 31, 2016
Twitter: @neomythos
Vinay Shukla
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who am i?
 Product Management
 Spark for 2.5 + years, Hadoop for 3+ years
 Recovering Programmer
 Blog at www.vinayshukla.com
 Twitter: @neomythos
 Addicted to Yoga, Hiking, & Coffee
 Smallest contributor to Apache Zeppelin
Vinay Shukla
3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Security: Rings of Defense
Perimeter Level Security
•Network Security (i.e. Firewalls)
Data Protection
•Wire encryption
•HDFS TDE/Dare
•Others
Authentication
•Kerberos
•Knox (Other Gateways)
OS Security
Authorization
•Apache Ranger/Sentry
•HDFS Permissions
•HDFS ACLs
•YARN ACL
Page 3
4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Interacting with Spark
Ex
Spark on YARN
Zeppelin
Spark-
Shell
Ex
Spark
Thrift
Server
Driver
REST
ServerDriver
Driver
Driver
5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Context: Spark Deployment Modes
• Spark on YARN
–Spark driver (SparkContext) in YARN AM(yarn-cluster)
–Spark driver (SparkContext) in local (yarn-client):
• Spark Shell & Spark Thrift Server runs in yarn-client only
Client
Executor
App
MasterSpark Driver
Client
Executor
App Master
Spark Driver
YARN-Client YARN-Cluster
6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark on YARN
Spark Submit
John Doe
Spark
AM
Spark
AM
1
Hadoop Cluster
HDFS
Executor
YARN
RM
YARN
RM
4
2 3
Node
Manager
7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark – Security – Four Pillars
 Authentication
 Authorization
 Audit
 Encryption
Spark leverages Kerberos on YARN
Ensure network is secure
8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Kerberos authentication within Spark
KDC
Use Spark ST, submit
Spark Job
Spark gets Namenode (NN)
service ticket
YARN launches Spark
Executors using John
Doe’s identity
Get service ticket for
Spark,
John Doe
Spark AMSpark AM
NNNN
Executor reads from HDFS using
John Doe’s delegation token
kinit
1
2
3
4
5
6
7
Hadoop Cluster
9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark – Kerberos - Example
kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@
EXAMPLE.COM
./bin/spark-submit --class org.apache.spark.examples.SparkPi
--master yarn-cluster --num-executors 3 --driver-memory 512m
--executor-memory 512m --executor-cores 1 lib/spark-
examples*.jar 10
10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
HDFS
Spark – Authorization
YARN Cluster
A B C
KDC
Use Spark ST,
submit Spark Job
Get Namenode (NN)
service ticket
Executors
read from
HDFS
Client gets service
ticket for Spark
RangerRangerCan John launch this job?
Can John read this file
John Doe
11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Encryption: Spark – Communication Channels
Spark
Submit
RM
Shuffle
Service
AM
Driver
NM
Ex 1 Ex N
Shuffle Data
Control/RPC
Shuffle
BlockTransfer
Data
Source
Read/Write
Data
FS – Broadcast,
File Download
12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Communication Encryption Settings
Shuffle Data
Control/RPC
Shuffle
BlockTransfer
Read/Write
Data
FS – Broadcast,
File Download
spark.authenticate.enableSaslEncryption= true
spark.authenticate = true. Leverage YARN to distribute keys
Depends on Data Source, For HDFS RPC (RC4 | 3DES) or SSL for WebHDFS
NM > Ex leverages YARN based SSL
spark.ssl.enabled = true
13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Sharp Edges with Spark Security
 SparkSQL – Only coarse grain access control today
 Client -> Spark Thrift Server > Spark Executors – No identity propagation on 2nd
hop
– Lowers security, forces STS to run as Hive user to read all data
– Use SparkSQL via shell or programmatic API
– https://issues.apache.org/jira/browse/SPARK-5159
 Spark Stream + Kafka + Kerberos
– Issues fixed in HDP 2.4.x
– No SSL support yet
 Spark Shuffle > Only SASL, no SSL support
 Spark Shuffle > No encryption for spill to disk or intermediate data
Fine grained Security to
SparkSQL
http://bit.ly/2bLghGz
http://bit.ly/2bTX7Pm
14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin Security
15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin: Authentication + SSL
Spark on YARN
Ex Ex
LDAP
John Doe
1
2
3
SSL
Firewall
16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Zeppelin + Livy E2E Security
Zeppelin
Spark
Yarn
Livy
Ispark Group
Interpreter
SPNego: Kerberos Kerberos/RPC
Livy APIs
LDAP
John Doe
Job runs as John Doe
17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin: Authorization
 Note level authorization
 Grant Permissions (Owner, Reader, Writer)
to users/groups on Notes
 LDAP Group integration
 Zeppelin UI Authorization
 Allow only admins to configure interpreter
 Configured in shiro.ini
 For Spark with Zeppelin > Livy > Spark
– Identity Propagation Jobs run as End-User
 For Hive with Zeppelin > JDBC interpreter
 Shell Interpreter
– Runs as end-user
Authorization in Zeppelin Authorization at Data Level
[urls]
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin: Credentials
 LDAP/AD account
 Zeppelin leverages Hadoop Credential API
 Interpreter Credentials
 Not solved yet
 Credentials
Credentials in Zeppelin
This is
still an
open
issue
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin: AD Authentication
1. /etc/zeppelin/conf/shiro.ini
[urls]
/api/version = anon
/** = authc
Configure Zeppelin to Authenticate users
Zeppelin leverages
Apache Shiro for
authentication/authoriza
tion
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Apache Zeppelin: AD Authentication
1. Create an entry for AD credential
– Zeppelin leverages Hadoop Credential API
– >hadoop credential create
– activeDirectoryRealm.systemPassword -provider jceks://etc/zeppelin/conf/credentials.jceks
– chmod 400 with only Zeppelin process r/w access, no other user allowed accessCredentials
1. Configure Zeppelin to use AD
activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
activeDirectoryRealm.systemUsername = CN=Administrator,CN=Users,DC=HWQE,DC=HORTONWORKS,DC=COM
#activeDirectoryRealm.systemPassword = Password1!
activeDirectoryRealm.hadoopSecurityCredentialPath = jceks://etc/zeppelin/conf/credentials.jceks
activeDirectoryRealm.searchBase = CN=Users,DC=HWQE,DC=HORTONWORKS,DC=COM
activeDirectoryRealm.url = ldap://ad-nano.qe.hortonworks.com:389
activeDirectoryRealm.groupRolesMap =
"CN=admin,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"admin","CN=finance,OU=groups,DC=HWQE,DC=HORTONWORK
S,DC=COM":"finance","CN=zeppelin,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"zeppelin”
activeDirectoryRealm.authorizationCachingEnabled = true
Active Directory Authentication
Skip step 1 if securing
LDAP password is not an
issue
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Spark Performance
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
#1 Big or Small Executor ?
spark-submit --master yarn 
--deploy-mode client 
--num-executors ? 
--executor-cores ? 
--executor-memory ? 
--class MySimpleApp 
mySimpleApp.jar 
arg1 arg2
23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Show the Details
 Cluster resource: 72 cores + 288GB Memory
24 cores
96GB Memory
24 cores
96GB Memory
24 cores
96GB Memory
24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Benchmark with Different Combinations
Cores
Per E#
Memory
Per E
(GB)
E Per
Node
1 6 18
2 12 9
3 18 6
6 36 3
9 54 2
18 108 1
#E stands for Executor
Except too small or too big executor, the
performance is relatively close
18 cores 108GB memory per node# The lower the better
25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Big Or Small Executor ?
 Avoid too small or too big executor.
– Small executor will decrease CPU and memory efficiency.
– Big executor will introduce heavy GC overhead.
 Usually 3 ~ 6 cores and 10 ~ 40 GB of memory per executor is a preferable choice.
26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Any Other Thing ?
 Executor memory != Container memory
 Container memory = executor memory + overhead memory (10% of executory memory)
 Leave some resources to os and other services
 Enable CPU scheduling if you want to constrain CPU usage#
#CPU scheduling -
http://hortonworks.com/blog/managing-cpu-resources-in-y
27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Multi-tenancy for Spark
 Leverage YARN queues
– Set user quotas
– Set default yarn-queue in spark-defaults
– User can override for each job
 Leverage Dynamic Resource Allocation
– Specify range of executors a job uses
– This needs shuffle service to be used
Cluster resource Utilization
28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Thank You
Vinay Shukla
@neomythos

More Related Content

What's hot

Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits allJulian Hyde
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive QueriesOwen O'Malley
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in SparkDatabricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudDatabricks
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark Aakashdata
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationDatabricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applicationsKnoldus Inc.
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsDatabricks
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDatabricks
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark Summit
 

What's hot (20)

Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Apache Calcite: One planner fits all
Apache Calcite: One planner fits allApache Calcite: One planner fits all
Apache Calcite: One planner fits all
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
Optimizing Hive Queries
Optimizing Hive QueriesOptimizing Hive Queries
Optimizing Hive Queries
 
Data Source API in Spark
Data Source API in SparkData Source API in Spark
Data Source API in Spark
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Apache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the CloudApache Spark on K8S Best Practice and Performance in the Cloud
Apache Spark on K8S Best Practice and Performance in the Cloud
 
Introduction to apache spark
Introduction to apache spark Introduction to apache spark
Introduction to apache spark
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Apache Spark Core – Practical Optimization
Apache Spark Core – Practical OptimizationApache Spark Core – Practical Optimization
Apache Spark Core – Practical Optimization
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
Unit testing of spark applications
Unit testing of spark applicationsUnit testing of spark applications
Unit testing of spark applications
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Performance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark MetricsPerformance Troubleshooting Using Apache Spark Metrics
Performance Troubleshooting Using Apache Spark Metrics
 
The Impala Cookbook
The Impala CookbookThe Impala Cookbook
The Impala Cookbook
 
Dynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache SparkDynamic Partition Pruning in Apache Spark
Dynamic Partition Pruning in Apache Spark
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
Spark + Parquet In Depth: Spark Summit East Talk by Emily Curtin and Robbie S...
 

Similar to Running Apache Spark & Apache Zeppelin in Production: A Guide to Security, Performance and Multi-tenancy

Running Apache Zeppelin production
Running Apache Zeppelin productionRunning Apache Zeppelin production
Running Apache Zeppelin productionVinay Shukla
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in EnterpriseDataWorks Summit
 
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDon't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDataWorks Summit
 
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3DataWorks Summit
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinAlex Zeltov
 
Dataworks Berlin Summit 18' - Deep learning On YARN - Running Distributed Te...
Dataworks Berlin Summit 18' - Deep learning On YARN -  Running Distributed Te...Dataworks Berlin Summit 18' - Deep learning On YARN -  Running Distributed Te...
Dataworks Berlin Summit 18' - Deep learning On YARN - Running Distributed Te...Wangda Tan
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopYifeng Jiang
 
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisDataWorks Summit/Hadoop Summit
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...DataWorks Summit
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object StoresSteve Loughran
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionDataWorks Summit/Hadoop Summit
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementDataWorks Summit/Hadoop Summit
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)Chris Nauroth
 

Similar to Running Apache Spark & Apache Zeppelin in Production: A Guide to Security, Performance and Multi-tenancy (20)

Running Apache Zeppelin production
Running Apache Zeppelin productionRunning Apache Zeppelin production
Running Apache Zeppelin production
 
Running Zeppelin in Enterprise
Running Zeppelin in EnterpriseRunning Zeppelin in Enterprise
Running Zeppelin in Enterprise
 
State of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache ZeppelinState of Security: Apache Spark & Apache Zeppelin
State of Security: Apache Spark & Apache Zeppelin
 
Running Spark in Production
Running Spark in ProductionRunning Spark in Production
Running Spark in Production
 
Curb Your Insecurity - Tips for a Secure Cluster (with Spark too)!!
Curb Your Insecurity - Tips for a Secure Cluster (with Spark too)!!Curb Your Insecurity - Tips for a Secure Cluster (with Spark too)!!
Curb Your Insecurity - Tips for a Secure Cluster (with Spark too)!!
 
Curb your insecurity with HDP
Curb your insecurity with HDPCurb your insecurity with HDP
Curb your insecurity with HDP
 
Row/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache SparkRow/Column- Level Security in SQL for Apache Spark
Row/Column- Level Security in SQL for Apache Spark
 
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing SparkDon't Let the Spark Burn Your House: Perspectives on Securing Spark
Don't Let the Spark Burn Your House: Perspectives on Securing Spark
 
Hadoop 3 in a Nutshell
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a Nutshell
 
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3Deep learning on yarn  running distributed tensorflow etc on hadoop cluster v3
Deep learning on yarn running distributed tensorflow etc on hadoop cluster v3
 
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache ZeppelinIntro to Big Data Analytics using Apache Spark and Apache Zeppelin
Intro to Big Data Analytics using Apache Spark and Apache Zeppelin
 
Dataworks Berlin Summit 18' - Deep learning On YARN - Running Distributed Te...
Dataworks Berlin Summit 18' - Deep learning On YARN -  Running Distributed Te...Dataworks Berlin Summit 18' - Deep learning On YARN -  Running Distributed Te...
Dataworks Berlin Summit 18' - Deep learning On YARN - Running Distributed Te...
 
Hadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise HadoopHadoop Present - Open Enterprise Hadoop
Hadoop Present - Open Enterprise Hadoop
 
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data AnalysisApache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
Apache Zeppelin + LIvy: Bringing Multi Tenancy to Interactive Data Analysis
 
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ...
 
Apache Spark and Object Stores
Apache Spark and Object StoresApache Spark and Object Stores
Apache Spark and Object Stores
 
Spark Security
Spark SecuritySpark Security
Spark Security
 
Hadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in ProductionHadoop & Cloud Storage: Object Store Integration in Production
Hadoop & Cloud Storage: Object Store Integration in Production
 
Taming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop ManagementTaming the Elephant: Efficient and Effective Apache Hadoop Management
Taming the Elephant: Efficient and Effective Apache Hadoop Management
 
Hadoop & cloud storage object store integration in production (final)
Hadoop & cloud storage  object store integration in production (final)Hadoop & cloud storage  object store integration in production (final)
Hadoop & cloud storage object store integration in production (final)
 

More from DataWorks Summit/Hadoop Summit

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerDataWorks Summit/Hadoop Summit
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformDataWorks Summit/Hadoop Summit
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDataWorks Summit/Hadoop Summit
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...DataWorks Summit/Hadoop Summit
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...DataWorks Summit/Hadoop Summit
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLDataWorks Summit/Hadoop Summit
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)DataWorks Summit/Hadoop Summit
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...DataWorks Summit/Hadoop Summit
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesDataWorks Summit/Hadoop Summit
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors DataWorks Summit/Hadoop Summit
 

More from DataWorks Summit/Hadoop Summit (20)

Unleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache RangerUnleashing the Power of Apache Atlas with Apache Ranger
Unleashing the Power of Apache Atlas with Apache Ranger
 
Enabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science PlatformEnabling Digital Diagnostics with a Data Science Platform
Enabling Digital Diagnostics with a Data Science Platform
 
Revolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and ZeppelinRevolutionize Text Mining with Spark and Zeppelin
Revolutionize Text Mining with Spark and Zeppelin
 
Double Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSenseDouble Your Hadoop Performance with Hortonworks SmartSense
Double Your Hadoop Performance with Hortonworks SmartSense
 
Hadoop Crash Course
Hadoop Crash CourseHadoop Crash Course
Hadoop Crash Course
 
Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Apache Spark Crash Course
Apache Spark Crash CourseApache Spark Crash Course
Apache Spark Crash Course
 
Dataflow with Apache NiFi
Dataflow with Apache NiFiDataflow with Apache NiFi
Dataflow with Apache NiFi
 
Schema Registry - Set you Data Free
Schema Registry - Set you Data FreeSchema Registry - Set you Data Free
Schema Registry - Set you Data Free
 
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
Building a Large-Scale, Adaptive Recommendation Engine with Apache Flink and ...
 
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
Real-Time Anomaly Detection using LSTM Auto-Encoders with Deep Learning4J on ...
 
Mool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and MLMool - Automated Log Analysis using Data Science and ML
Mool - Automated Log Analysis using Data Science and ML
 
How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient How Hadoop Makes the Natixis Pack More Efficient
How Hadoop Makes the Natixis Pack More Efficient
 
HBase in Practice
HBase in Practice HBase in Practice
HBase in Practice
 
The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)The Challenge of Driving Business Value from the Analytics of Things (AOT)
The Challenge of Driving Business Value from the Analytics of Things (AOT)
 
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS HadoopBreaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
Breaking the 1 Million OPS/SEC Barrier in HOPS Hadoop
 
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
From Regulatory Process Verification to Predictive Maintenance and Beyond wit...
 
Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop Backup and Disaster Recovery in Hadoop
Backup and Disaster Recovery in Hadoop
 
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
 
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
How to Optimize Hortonworks Apache Spark ML Workloads on Modern Processors
 

Recently uploaded

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 

Recently uploaded (20)

GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Running Apache Spark & Apache Zeppelin in Production: A Guide to Security, Performance and Multi-tenancy

  • 1. Running Apache Spark & Apache Zeppelin in Production Director, Product Management August 31, 2016 Twitter: @neomythos Vinay Shukla
  • 2. 2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Who am i?  Product Management  Spark for 2.5 + years, Hadoop for 3+ years  Recovering Programmer  Blog at www.vinayshukla.com  Twitter: @neomythos  Addicted to Yoga, Hiking, & Coffee  Smallest contributor to Apache Zeppelin Vinay Shukla
  • 3. 3 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Security: Rings of Defense Perimeter Level Security •Network Security (i.e. Firewalls) Data Protection •Wire encryption •HDFS TDE/Dare •Others Authentication •Kerberos •Knox (Other Gateways) OS Security Authorization •Apache Ranger/Sentry •HDFS Permissions •HDFS ACLs •YARN ACL Page 3
  • 4. 4 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Interacting with Spark Ex Spark on YARN Zeppelin Spark- Shell Ex Spark Thrift Server Driver REST ServerDriver Driver Driver
  • 5. 5 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Context: Spark Deployment Modes • Spark on YARN –Spark driver (SparkContext) in YARN AM(yarn-cluster) –Spark driver (SparkContext) in local (yarn-client): • Spark Shell & Spark Thrift Server runs in yarn-client only Client Executor App MasterSpark Driver Client Executor App Master Spark Driver YARN-Client YARN-Cluster
  • 6. 6 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark on YARN Spark Submit John Doe Spark AM Spark AM 1 Hadoop Cluster HDFS Executor YARN RM YARN RM 4 2 3 Node Manager
  • 7. 7 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark – Security – Four Pillars  Authentication  Authorization  Audit  Encryption Spark leverages Kerberos on YARN Ensure network is secure
  • 8. 8 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Kerberos authentication within Spark KDC Use Spark ST, submit Spark Job Spark gets Namenode (NN) service ticket YARN launches Spark Executors using John Doe’s identity Get service ticket for Spark, John Doe Spark AMSpark AM NNNN Executor reads from HDFS using John Doe’s delegation token kinit 1 2 3 4 5 6 7 Hadoop Cluster
  • 9. 9 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark – Kerberos - Example kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@ EXAMPLE.COM ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark- examples*.jar 10
  • 10. 10 © Hortonworks Inc. 2011 – 2016. All Rights Reserved HDFS Spark – Authorization YARN Cluster A B C KDC Use Spark ST, submit Spark Job Get Namenode (NN) service ticket Executors read from HDFS Client gets service ticket for Spark RangerRangerCan John launch this job? Can John read this file John Doe
  • 11. 11 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Encryption: Spark – Communication Channels Spark Submit RM Shuffle Service AM Driver NM Ex 1 Ex N Shuffle Data Control/RPC Shuffle BlockTransfer Data Source Read/Write Data FS – Broadcast, File Download
  • 12. 12 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Communication Encryption Settings Shuffle Data Control/RPC Shuffle BlockTransfer Read/Write Data FS – Broadcast, File Download spark.authenticate.enableSaslEncryption= true spark.authenticate = true. Leverage YARN to distribute keys Depends on Data Source, For HDFS RPC (RC4 | 3DES) or SSL for WebHDFS NM > Ex leverages YARN based SSL spark.ssl.enabled = true
  • 13. 13 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Sharp Edges with Spark Security  SparkSQL – Only coarse grain access control today  Client -> Spark Thrift Server > Spark Executors – No identity propagation on 2nd hop – Lowers security, forces STS to run as Hive user to read all data – Use SparkSQL via shell or programmatic API – https://issues.apache.org/jira/browse/SPARK-5159  Spark Stream + Kafka + Kerberos – Issues fixed in HDP 2.4.x – No SSL support yet  Spark Shuffle > Only SASL, no SSL support  Spark Shuffle > No encryption for spill to disk or intermediate data Fine grained Security to SparkSQL http://bit.ly/2bLghGz http://bit.ly/2bTX7Pm
  • 14. 14 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Zeppelin Security
  • 15. 15 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Zeppelin: Authentication + SSL Spark on YARN Ex Ex LDAP John Doe 1 2 3 SSL Firewall
  • 16. 16 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Zeppelin + Livy E2E Security Zeppelin Spark Yarn Livy Ispark Group Interpreter SPNego: Kerberos Kerberos/RPC Livy APIs LDAP John Doe Job runs as John Doe
  • 17. 17 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Zeppelin: Authorization  Note level authorization  Grant Permissions (Owner, Reader, Writer) to users/groups on Notes  LDAP Group integration  Zeppelin UI Authorization  Allow only admins to configure interpreter  Configured in shiro.ini  For Spark with Zeppelin > Livy > Spark – Identity Propagation Jobs run as End-User  For Hive with Zeppelin > JDBC interpreter  Shell Interpreter – Runs as end-user Authorization in Zeppelin Authorization at Data Level [urls] /api/interpreter/** = authc, roles[admin] /api/configurations/** = authc, roles[admin] /api/credential/** = authc, roles[admin]
  • 18. 18 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Zeppelin: Credentials  LDAP/AD account  Zeppelin leverages Hadoop Credential API  Interpreter Credentials  Not solved yet  Credentials Credentials in Zeppelin This is still an open issue
  • 19. 19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Zeppelin: AD Authentication 1. /etc/zeppelin/conf/shiro.ini [urls] /api/version = anon /** = authc Configure Zeppelin to Authenticate users Zeppelin leverages Apache Shiro for authentication/authoriza tion
  • 20. 20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Apache Zeppelin: AD Authentication 1. Create an entry for AD credential – Zeppelin leverages Hadoop Credential API – >hadoop credential create – activeDirectoryRealm.systemPassword -provider jceks://etc/zeppelin/conf/credentials.jceks – chmod 400 with only Zeppelin process r/w access, no other user allowed accessCredentials 1. Configure Zeppelin to use AD activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm activeDirectoryRealm.systemUsername = CN=Administrator,CN=Users,DC=HWQE,DC=HORTONWORKS,DC=COM #activeDirectoryRealm.systemPassword = Password1! activeDirectoryRealm.hadoopSecurityCredentialPath = jceks://etc/zeppelin/conf/credentials.jceks activeDirectoryRealm.searchBase = CN=Users,DC=HWQE,DC=HORTONWORKS,DC=COM activeDirectoryRealm.url = ldap://ad-nano.qe.hortonworks.com:389 activeDirectoryRealm.groupRolesMap = "CN=admin,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"admin","CN=finance,OU=groups,DC=HWQE,DC=HORTONWORK S,DC=COM":"finance","CN=zeppelin,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"zeppelin” activeDirectoryRealm.authorizationCachingEnabled = true Active Directory Authentication Skip step 1 if securing LDAP password is not an issue
  • 21. 21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Spark Performance
  • 22. 22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved #1 Big or Small Executor ? spark-submit --master yarn --deploy-mode client --num-executors ? --executor-cores ? --executor-memory ? --class MySimpleApp mySimpleApp.jar arg1 arg2
  • 23. 23 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Show the Details  Cluster resource: 72 cores + 288GB Memory 24 cores 96GB Memory 24 cores 96GB Memory 24 cores 96GB Memory
  • 24. 24 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Benchmark with Different Combinations Cores Per E# Memory Per E (GB) E Per Node 1 6 18 2 12 9 3 18 6 6 36 3 9 54 2 18 108 1 #E stands for Executor Except too small or too big executor, the performance is relatively close 18 cores 108GB memory per node# The lower the better
  • 25. 25 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Big Or Small Executor ?  Avoid too small or too big executor. – Small executor will decrease CPU and memory efficiency. – Big executor will introduce heavy GC overhead.  Usually 3 ~ 6 cores and 10 ~ 40 GB of memory per executor is a preferable choice.
  • 26. 26 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Any Other Thing ?  Executor memory != Container memory  Container memory = executor memory + overhead memory (10% of executory memory)  Leave some resources to os and other services  Enable CPU scheduling if you want to constrain CPU usage# #CPU scheduling - http://hortonworks.com/blog/managing-cpu-resources-in-y
  • 27. 27 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Multi-tenancy for Spark  Leverage YARN queues – Set user quotas – Set default yarn-queue in spark-defaults – User can override for each job  Leverage Dynamic Resource Allocation – Specify range of executors a job uses – This needs shuffle service to be used Cluster resource Utilization
  • 28. 28 © Hortonworks Inc. 2011 – 2016. All Rights Reserved Thank You Vinay Shukla @neomythos

Editor's Notes

  1. John Doe first authenticates to Kerberos before launching Spark Shell kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
  2. The first step of security is network security The second step of security is Authentication Most Hadoop echo system projects rely on Kerberos for Authentication Kerberos – 3 Headed Guard Dog : https://en.wikipedia.org/wiki/Cerberus
  3. John Doe first authenticates to Kerberos before launching Spark Shell kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
  4. Controlling HDFS Authorization is easy/Done Controlling Hive row/column level authorization in Spark is WIP
  5. For HDFS as Data Source can use RPC or use SSL with WebHDFS For NM Shuffle Data – Use YARN SSL Spark support SSL for FS (Broadcast or File download) Shuffle Block Transfer supports SASL based encryption – SSL coming
  6. Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
  7. Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
  8. Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
  9. Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
  10. All Images from Flicker Commons