Spark on YARN
● Shashank L
● Big data consultant and trainer at
datamantra.io
● shashankgowda.me@gmail.com
Agenda
● YARN - Introduction
● Need for YARN
● OS Analogy
● Why run Spark on YARN
● YARN Architecture
● Modes of Spark on YARN
● Internals of Spark on YARN
● Recent developments
● Road ahead
● Hands-on
YARN
● Yet Another Resource Negotiator.
● A general-purpose, distributed application management
framework.
Need for YARN
Hadoop 1.0
● Single use system
● Capable of running only MR
Need for YARN
● Scalability
○ 2009 – 8 cores, 16GB of RAM, 4x1TB disk
○ 2012 – 16+ cores, 48-96GB of RAM, 12x2TB or 12x3TB of disk.
● Cluster utilization
○ distinct map slots and reduce slots
● Supporting workloads other than MapReduce
○ MapReduce is great for many applications, but not everything.
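The cluster-utilization point can be made concrete with a toy calculation. The sketch below compares a node with fixed map and reduce slots against a node that hands out generic containers. The function names and the numbers are illustrative assumptions, not Hadoop APIs or measurements.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Fraction of slots busy when map and reduce slots are not interchangeable."""
    busy = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return busy / (map_slots + reduce_slots)


def container_utilization(containers, map_tasks, reduce_tasks):
    """Fraction of containers busy when any container can run any task."""
    busy = min(containers, map_tasks + reduce_tasks)
    return busy / containers


# Hypothetical node: 8 map slots + 4 reduce slots vs 12 generic containers,
# during a map-heavy phase with 12 map tasks and no reduce tasks waiting.
slots = slot_utilization(map_slots=8, reduce_slots=4, map_tasks=12, reduce_tasks=0)
generic = container_utilization(containers=12, map_tasks=12, reduce_tasks=0)
print(f"slot-based: {slots:.0%}, container-based: {generic:.0%}")
```

In this made-up scenario the four reduce slots sit idle while map tasks wait, which is the kind of waste the slide refers to.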
Need for YARN
Hadoop 2.0
● Multi purpose platform
● Capable of running apps
other than MR
OS analogy
Traditional operating system
Storage:
File system
Execution/Scheduling:
Processes/ Kernel scheduler
OS analogy
Hadoop
Storage:
HDFS
Execution/Scheduling:
YARN
Why run Spark on YARN
● Leverage existing clusters
● Data locality
● Dynamically sharing the cluster resources between different frameworks.
● YARN schedulers can be used for categorizing, isolating, and prioritizing
workloads.
● Only cluster manager for Spark that supports security
YARN architecture
[Diagram, built up over five slides: a Resource manager and two Node managers; the Node managers host Containers; a Client is added; one Container runs the Application Master; further Containers run the application's processing.]
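A minimal sketch of the roles in the architecture diagram, assuming a toy model in which the resource manager only tracks free memory per node manager. The class names, the first-fit placement and the memory sizes are assumptions for illustration; real YARN schedulers are far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    name: str
    free_mb: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Toy stand-in for the YARN resource manager: tracks node capacity only."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, owner, memory_mb):
        """Place a container on the first node with enough free memory."""
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                node.free_mb -= memory_mb
                node.containers.append((owner, memory_mb))
                return node.name
        return None  # no capacity; a real scheduler would queue the request


rm = ResourceManager([NodeManager("nm1", 4096), NodeManager("nm2", 4096)])
# First a container for the application master, then containers for processing.
am_node = rm.allocate("application-master", 1024)
workers = [rm.allocate("processing", 2048) for _ in range(3)]
print(am_node, workers)
```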
Running Spark on YARN
Spark architecture
[Diagram: a Driver program (Spark Context), a Cluster manager, and two Worker nodes; each Worker node runs an Executor with a Cache and two Tasks.]
Spark architecture
● Driver Program is responsible for managing the job flow and scheduling
tasks that will run on the executors.
● Executors are processes that run computation and store data for a Spark
application.
● Cluster Manager is responsible for starting executor processes and deciding where
and when they run. Spark supports pluggable cluster managers.
Example: YARN, Mesos and the “standalone” cluster manager
Modes of Spark on YARN
● YARN-Client Mode
● YARN-Cluster Mode
YARN client mode
[Diagram: the Client hosts the Spark driver (Spark Context); on the Node manager(s), one Container runs the Spark Application Master (executor launcher) and another Container runs an Executor; the Resource manager sits above them.]
YARN client mode
● Driver runs in the client process, and the application master is only used for
requesting resources from YARN.
● Used for interactive and debugging uses where you want to see your
application’s output immediately (on the client process side).
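YARN cluster mode
[Diagram: the Client only runs Spark submit; on the Node manager(s), one Container runs the Spark Application Master together with the Spark driver and another Container runs an Executor; the Resource manager sits above them.]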
YARN cluster mode
● In yarn-cluster mode, the Spark driver runs inside an application master
process which is managed by YARN on the cluster, and the client can go
away after initiating the application.
● Yarn-cluster mode makes sense for production jobs.
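The two modes differ only in where the driver runs, and on the command line that comes down to one option. The helper below is a sketch that assembles a `spark-submit` argument list; the helper name, the jar and the class name are hypothetical. It assumes a Spark version that accepts `--master yarn` with `--deploy-mode`; older releases also accepted `--master yarn-client` and `--master yarn-cluster`.

```python
def spark_submit_command(app_jar, main_class, deploy_mode, num_executors=2):
    """Build a spark-submit argument list for YARN in the given deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", deploy_mode,   # client: driver in this process; cluster: driver in the AM
        "--class", main_class,
        "--num-executors", str(num_executors),
        app_jar,
    ]


# Hypothetical application jar and main class, for illustration only.
for mode in ("client", "cluster"):
    print(" ".join(spark_submit_command("wordcount.jar", "com.example.WordCount", mode)))
```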
Concurrency vs Parallelism
● Concurrency is about dealing with lots of things at once.
● Parallelism is about doing lots of things at once.
Akka
● Follows the Actor model
○ Actors keep mutable state internal and communicate through async messages
● Actors
Treat actors like people who never talk to each other in person. They
communicate only through mail.
○ An actor is an object
○ Conceptually runs in its own thread (Akka schedules actors on a shared thread pool)
○ Messages are kept in a queue and processed in order
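The actor idea can be sketched without Akka. The class below is a minimal stand-in, assuming one dedicated thread per actor and a standard-library queue as the mailbox; it is not Akka's API, and the names are invented for the example.

```python
import queue
import threading


class Actor:
    """Minimal actor: private state, a mailbox, and one thread draining it in order."""

    _STOP = object()

    def __init__(self):
        self._mailbox = queue.Queue()
        self._received = []  # mutable state touched only by the actor's own thread
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message):
        """Asynchronous send: enqueue and return immediately."""
        self._mailbox.put(message)

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is Actor._STOP:
                break
            self._received.append(message)

    def stop(self):
        """Ask the actor to finish queued messages, then wait for its thread."""
        self._mailbox.put(Actor._STOP)
        self._thread.join()
        return list(self._received)


actor = Actor()
for i in range(5):
    actor.tell(f"mail-{i}")
processed = actor.stop()
print(processed)
```

Because only the actor's thread touches `_received`, no lock is needed around the state; the queue is the single point of synchronization.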
Internals of Spark
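Internals of Spark on YARN
[Diagram: inside the YARN cluster, one Container holds the Spark AM and the Spark driver (Spark Context) with its DAG Scheduler, Task Scheduler and Scheduler backend; other Containers hold Executors; a Client is also shown. Arrows numbered 1 to 10 correspond to the steps listed on the following slides.]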
Internals of Spark on YARN
1. A container is requested for the AM, and the AM is launched in that container.
2. The SparkContext is created (inside the AM in cluster mode, inside the client in client mode).
This internally creates a DAG scheduler, a task scheduler and a scheduler
backend.
It also creates an Akka actor system.
3. The application master requests containers based on the required resources.
Once it gets the containers, it runs an executor process in each
container.
4. When an executor process comes up, it registers with the scheduler backend
through Akka.
5. When code has to be run on the cluster, the RDD runJob method
calls the DAG scheduler to create a DAG of tasks.
Internals of Spark on YARN
6. A set of tasks that can run in parallel is sent to the task
scheduler in the form of a TaskSet.
7. The task scheduler in turn contacts the scheduler backend to run the tasks on
the executors.
8. The scheduler backend, which keeps track of running executors and their statuses,
schedules the tasks on executors.
9. Task output, if any, is sent through heartbeats to the scheduler backend.
10. The scheduler backend passes the task output on to the task and DAG schedulers,
which can make use of that output.
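Steps 4 to 10 can be traced in a toy pipeline. The classes below borrow the names from the slides, but they are simplified stand-ins and not Spark's implementation: tasks are plain Python callables, "executors" are just ids, and round-robin placement is an assumption made for the example.

```python
from collections import deque


class SchedulerBackend:
    """Tracks registered executors and hands each task to one of them."""

    def __init__(self):
        self.executors = deque()

    def register(self, executor_id):          # step 4: executor registers on startup
        self.executors.append(executor_id)

    def launch(self, task):                   # step 8: pick an executor, run the task
        executor_id = self.executors[0]
        self.executors.rotate(-1)             # simple round-robin
        return executor_id, task()            # steps 9-10: output flows back


class TaskScheduler:
    def __init__(self, backend):
        self.backend = backend

    def submit(self, task_set):               # step 7: forward each task to the backend
        return [self.backend.launch(task) for task in task_set]


class DAGScheduler:
    def __init__(self, task_scheduler):
        self.task_scheduler = task_scheduler

    def run_job(self, stages):                # steps 5-6: one TaskSet per stage, in order
        return [self.task_scheduler.submit(task_set) for task_set in stages]


backend = SchedulerBackend()
for executor_id in ("exec-1", "exec-2"):
    backend.register(executor_id)

dag = DAGScheduler(TaskScheduler(backend))
# Two hypothetical stages: three independent "map" tasks, then one "reduce" task.
stages = [
    [lambda n=n: n * n for n in (1, 2, 3)],
    [lambda: "done"],
]
results = dag.run_job(stages)
print(results)
```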
Recent developments
● Dynamic resource allocation
○ No need to specify number of executors
○ Application grows and shrinks based on outstanding task count
○ Need to specify other things
● Data locality
○ Allocate executors close to data
○ SPARK-4352
● Cached RDDs
○ Keep executors around
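The "grows and shrinks based on outstanding task count" rule can be sketched as a small function. This is a simplified reading of the idea and not Spark's exact policy; the function and parameter names are assumptions. In Spark itself the bounds are set through configuration such as `spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors`, which is one example of the "other things" that still need to be specified.

```python
import math


def target_executors(pending_tasks, running_tasks, tasks_per_executor,
                     min_executors, max_executors):
    """Executors needed to run all outstanding tasks, clamped to [min, max]."""
    outstanding = pending_tasks + running_tasks
    needed = math.ceil(outstanding / tasks_per_executor)
    return max(min_executors, min(max_executors, needed))


# Hypothetical workloads: a large backlog, a small one, and an idle application.
print(target_executors(90, 10, 4, min_executors=1, max_executors=20))
print(target_executors(3, 1, 4, min_executors=1, max_executors=20))
print(target_executors(0, 0, 4, min_executors=2, max_executors=20))
```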
Road ahead
● Making dynamic allocation better
○ Reduce allocation latency
○ Handle cached RDDs
● Simplified allocation
● Encrypt shuffle files
● File distribution
○ Replace HTTP with RPC
Yarn Hands On
● Shashidhar E S
● Big data consultant and trainer at
datamantra.io
● shashidhar.e@gmail.com
Yarn components in different phases
[Diagram: the Yarn Client talks to the Resource Manager over ApplicationClientProtocol; the Application Master talks to the Resource Manager over ApplicationMasterProtocol and to the Node Manager over ContainerManagementProtocol.]
Protocols
Application Client Protocol (Client <--> RM): org.apache.hadoop.yarn.api.ApplicationClientProtocol
Application Master Protocol (AM <--> RM): org.apache.hadoop.yarn.api.ApplicationMasterProtocol
Container Management Protocol (AM <--> NM): org.apache.hadoop.yarn.api.ContainerManagementProtocol
Application Client Protocol
● Protocol between client and resource manager
● Allows clients to submit and abort jobs
● Enables clients to get information about applications
○ Cluster metrics - Active node managers
○ Nodes - Node details
○ Queues - Queue details
○ Delegation Tokens - Token for containers to interact with the services.
○ ApplicationAttemptReport - Application Details (host,port,diagnostics)
○ etc
Application Master Protocol
● Protocol between AM and RM
● Key functionalities
○ RegisterAM
○ FinishAM - Notify RM about completion
○ Allocate - RM responds with available/unused containers
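The register, allocate and finish sequence can be sketched with an in-memory stand-in for the resource manager. The class and method names below are invented for the example; in Hadoop the corresponding calls on `ApplicationMasterProtocol` are `registerApplicationMaster`, `allocate` and `finishApplicationMaster`, which exchange request and response records rather than plain integers.

```python
class ToyResourceManager:
    """In-memory stand-in for the RM side of the AM<->RM protocol (not the Hadoop API)."""

    def __init__(self, free_containers):
        self.free_containers = free_containers
        self.registered = set()
        self.finished = {}

    def register_am(self, app_id):
        self.registered.add(app_id)

    def allocate(self, app_id, wanted):
        """Grant up to `wanted` containers from whatever is currently free."""
        if app_id not in self.registered:
            raise RuntimeError("AM must register before allocating")
        granted = min(wanted, self.free_containers)
        self.free_containers -= granted
        return granted

    def finish_am(self, app_id, status):
        self.registered.discard(app_id)
        self.finished[app_id] = status


rm = ToyResourceManager(free_containers=3)
rm.register_am("app-1")
first = rm.allocate("app-1", wanted=2)    # 2 granted, 1 left
second = rm.allocate("app-1", wanted=2)   # only 1 left, so 1 granted
rm.finish_am("app-1", "SUCCEEDED")
print(first, second, rm.finished)
```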
Container Management Protocol
● Protocol between AM and NM
● Key functionalities
○ Start Containers
○ Stop Containers
○ Status of running containers - (NEW,RUNNING,COMPLETE)
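The three container states listed above can be modelled as a small state machine. This is a sketch: the `ToyNodeManager` class and the transition table are assumptions for illustration and do not reproduce the node manager's internal logic.

```python
from enum import Enum


class ContainerState(Enum):
    NEW = "NEW"
    RUNNING = "RUNNING"
    COMPLETE = "COMPLETE"


# Allowed transitions in this toy model; a container may also finish without ever running.
TRANSITIONS = {
    ContainerState.NEW: {ContainerState.RUNNING, ContainerState.COMPLETE},
    ContainerState.RUNNING: {ContainerState.COMPLETE},
    ContainerState.COMPLETE: set(),
}


class ToyNodeManager:
    def __init__(self):
        self.containers = {}

    def start_container(self, container_id):
        self.containers[container_id] = ContainerState.NEW
        self._move(container_id, ContainerState.RUNNING)

    def stop_container(self, container_id):
        self._move(container_id, ContainerState.COMPLETE)

    def status(self, container_id):
        return self.containers[container_id]

    def _move(self, container_id, new_state):
        current = self.containers[container_id]
        if new_state not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current.name} -> {new_state.name}")
        self.containers[container_id] = new_state


nm = ToyNodeManager()
nm.start_container("container-1")
print(nm.status("container-1").name)
nm.stop_container("container-1")
print(nm.status("container-1").name)
```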
Building Blocks of Communication
Records
Components in the YARN architecture communicate with each other by
exchanging records. Each request sent is a record.
Ex: localResource requests, applicationContext, containerLaunchContext, etc.
Each response received is a record.
Ex: applicationResponse, applicationMasterResponse, allocateResponse, etc.
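The record idea, where every request and every response is a plain data object, can be sketched with immutable dataclasses. The two record types below are hypothetical and only loosely modelled on a container launch context and an allocate response; they are not the Hadoop record classes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerLaunchRequest:      # hypothetical request record
    command: str
    memory_mb: int


@dataclass(frozen=True)
class AllocateReply:               # hypothetical response record
    granted: tuple
    rejected: tuple


def allocate(requests, free_mb):
    """Grant each request that still fits in the remaining memory; reject the others."""
    granted, rejected = [], []
    for request in requests:
        if request.memory_mb <= free_mb:
            free_mb -= request.memory_mb
            granted.append(request)
        else:
            rejected.append(request)
    return AllocateReply(tuple(granted), tuple(rejected))


requests = [
    ContainerLaunchRequest("run-task-a", 512),
    ContainerLaunchRequest("run-task-b", 2048),
    ContainerLaunchRequest("run-task-c", 256),
]
reply = allocate(requests, free_mb=1024)
print([r.command for r in reply.granted], [r.command for r in reply.rejected])
```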
Custom Yarn Application
Components
● Yarn Client
● Yarn Application Master
● Application
Yarn Hello world
[Diagram: Client, Resource Manager, Node manager, Application Master and Application.]
Package : com.madhukaraphatak.yarnexamples.helloworld
Yarn Hello world
Steps:
● Create application client
○ Communicate with RM to launch AM
○ Specify AM resource requirements
● Application master
○ Communicate with RM to get containers
○ Communicate with NMs to launch containers
● Application
○ Specify the application logic
AKKA remote example
[Diagram: a Local actor system (Local actor) and a Remote actor system (Remote actor) communicate through messages.]
Package : com.madhukaraphatak.yarnexamples.akka
AKKA remote example
Steps:
● Create the remote actor system with the following properties
○ Actor ref provider - references should be remote aware
○ Transports used - TCP is the transport layer protocol
○ hostname - 127.0.0.1
○ port - 5150
● Create the local actor system with the following properties
○ Actor ref provider - references should be remote aware
○ Transports used - TCP is the transport layer protocol
○ hostname - 127.0.0.1
○ port - 0
1. Akka actors behave like peers rather than as client and server.
2. Both sides use the same transport.
3. The only difference is the port: 0 means any free port.
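The "port 0 means any free port" convention is not specific to Akka; it comes from the operating system's socket layer. The sketch below shows it with a plain TCP socket from the standard library. The helper name is an assumption and the example does not involve Akka at all.

```python
import socket


def bind_local(port):
    """Bind a TCP socket on 127.0.0.1 and return (socket, actual_port)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind(("127.0.0.1", port))
    return sock, sock.getsockname()[1]


sock, assigned_port = bind_local(0)   # 0 asks the OS for any free port
print("OS assigned port", assigned_port)
sock.close()
```

A fixed port such as 5150 suits the side that others must find; port 0 suits the side that only initiates contact.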
AKKA application on Yarn
[Diagram: Resource Manager, Node manager, Client, Application Master and Application, with a local actor system (TaskResultSenderActor) and a remote actor system (Remote actor).]
AKKA application on Yarn
● Client
○ Defines the tasks to be performed
○ Submits the tasks as a separate set
● Scheduler
○ Creates a receiver actor (Remote Actor) for orchestration
○ Sets up resources for the AM
○ Launches the AM for the set of tasks
● Application Master
○ Creates an executor for each task
○ Sets up resources for the containers
AKKA application on Yarn
● Executor
○ Creates a local actor
○ Runs the task
○ Sends the response to the remote actor
○ Kills the local actor

More Related Content

What's hot

How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkDatabricks
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High PerformanceInderaj (Raj) Bains
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenDatabricks
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudNoritaka Sekiyama
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Databricks
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingDatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdfAmit Raj
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internalsKostas Tzoumas
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...DataWorks Summit/Hadoop Summit
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideIBM
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 

What's hot (20)

How to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache SparkHow to Automate Performance Tuning for Apache Spark
How to Automate Performance Tuning for Apache Spark
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Using Apache Hive with High Performance
Using Apache Hive with High PerformanceUsing Apache Hive with High Performance
Using Apache Hive with High Performance
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim ChenApache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
Apache Spark on Kubernetes Anirudh Ramanathan and Tim Chen
 
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the CloudAmazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
Amazon S3 Best Practice and Tuning for Hadoop/Spark in the Cloud
 
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
 
Managing Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic OptimizingManaging Apache Spark Workload and Automatic Optimizing
Managing Apache Spark Workload and Automatic Optimizing
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache spark
Apache sparkApache spark
Apache spark
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Spark architecture
Spark architectureSpark architecture
Spark architecture
 
Spark Performance Tuning .pdf
Spark Performance Tuning .pdfSpark Performance Tuning .pdf
Spark Performance Tuning .pdf
 
Apache Flink internals
Apache Flink internalsApache Flink internals
Apache Flink internals
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Spark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting GuideSpark 2.x Troubleshooting Guide
Spark 2.x Troubleshooting Guide
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 

Viewers also liked

Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 
Hadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageHadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageSandeep Patil
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...gethue
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARNDataWorks Summit
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDataWorks Summit
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterDataWorks Summit
 
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanDatabricks
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 

Viewers also liked (13)

Producing Spark on YARN for ETL
Producing Spark on YARN for ETLProducing Spark on YARN for ETL
Producing Spark on YARN for ETL
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 
Hadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better StorageHadoop and Spark Analytics over Better Storage
Hadoop and Spark Analytics over Better Storage
 
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
Spark Summit Europe: Building a REST Job Server for interactive Spark as a se...
 
Get most out of Spark on YARN
Get most out of Spark on YARNGet most out of Spark on YARN
Get most out of Spark on YARN
 
Dynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark ApplicationDynamically Allocate Cluster Resources to your Spark Application
Dynamically Allocate Cluster Resources to your Spark Application
 
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop ClusterSpark-on-YARN: Empower Spark Applications on Hadoop Cluster
Spark-on-YARN: Empower Spark Applications on Hadoop Cluster
 
SocSciBot(01 Mar2010) - Korean Manual
SocSciBot(01 Mar2010) - Korean ManualSocSciBot(01 Mar2010) - Korean Manual
SocSciBot(01 Mar2010) - Korean Manual
 
Spark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu KasinathanSpark Compute as a Service at Paypal with Prabhu Kasinathan
Spark Compute as a Service at Paypal with Prabhu Kasinathan
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Proxy Servers
Proxy ServersProxy Servers
Proxy Servers
 
Proxy Server
Proxy ServerProxy Server
Proxy Server
 

Similar to Spark on yarn

ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...Zhijie Shen
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - InstallationMartin Zapletal
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Apache Apex
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overviewKaran Alang
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...Databricks
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to YarnOmid Vahdaty
 
Data Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource ManagersData Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource ManagersAnant Corporation
 
YARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster BEYOND HADOOPYARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster BEYOND HADOOPOmkar Joshi
 
Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Jianfeng Zhang
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analyticsinoshg
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Xuan-Chao Huang
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streamingdatamantra
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKzmhassan
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 

Similar to Spark on yarn (20)

Spark on YARN
Spark on YARNSpark on YARN
Spark on YARN
 
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
ApacheCon North America 2014 - Apache Hadoop YARN: The Next-generation Distri...
 
Apache spark - Installation
Apache spark - InstallationApache spark - Installation
Apache spark - Installation
 
Apache Spark
Apache SparkApache Spark
Apache Spark
 
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
Intro to YARN (Hadoop 2.0) & Apex as YARN App (Next Gen Big Data)
 
Apache Spark - A High Level overview
Apache Spark - A High Level overviewApache Spark - A High Level overview
Apache Spark - A High Level overview
 
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F...
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
Data Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource ManagersData Engineer's Lunch #80: Apache Spark Resource Managers
Data Engineer's Lunch #80: Apache Spark Resource Managers
 
YARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster BEYOND HADOOPYARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster BEYOND HADOOP
 
Spark & Yarn better together 1.2
Spark & Yarn better together 1.2Spark & Yarn better together 1.2
Spark & Yarn better together 1.2
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Spark Driven Big Data Analytics
Spark Driven Big Data AnalyticsSpark Driven Big Data Analytics
Spark Driven Big Data Analytics
 
Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130Hadoop Spark Introduction-20150130
Hadoop Spark Introduction-20150130
 
Spark on Yarn @ Netflix
Spark on Yarn @ NetflixSpark on Yarn @ Netflix
Spark on Yarn @ Netflix
 
Interactive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark StreamingInteractive Data Analysis in Spark Streaming
Interactive Data Analysis in Spark Streaming
 
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARKSCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK
 
Introduction to yarn
Introduction to yarnIntroduction to yarn
Introduction to yarn
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
Apache spark
Apache sparkApache spark
Apache spark
 

More from datamantra

Multi Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and TelliusMulti Source Data Analysis using Spark and Tellius
Multi Source Data Analysis using Spark and Telliusdatamantra
 
State management in Structured Streaming
State management in Structured StreamingState management in Structured Streaming
State management in Structured Streamingdatamantra
 
Spark on Kubernetes
Spark on KubernetesSpark on Kubernetes
Spark on Kubernetesdatamantra
 
Understanding transactional writes in datasource v2
Understanding transactional writes in  datasource v2Understanding transactional writes in  datasource v2
Understanding transactional writes in datasource v2datamantra
 
Introduction to Datasource V2 API
Introduction to Datasource V2 APIIntroduction to Datasource V2 API
Introduction to Datasource V2 APIdatamantra
 
Exploratory Data Analysis in Spark
Exploratory Data Analysis in SparkExploratory Data Analysis in Spark
Exploratory Data Analysis in Sparkdatamantra
 
Core Services behind Spark Job Execution
Core Services behind Spark Job ExecutionCore Services behind Spark Job Execution
Core Services behind Spark Job Executiondatamantra
 
Optimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsOptimizing S3 Write-heavy Spark workloads
Optimizing S3 Write-heavy Spark workloadsdatamantra
 
Structured Streaming with Kafka
Structured Streaming with KafkaStructured Streaming with Kafka
Structured Streaming with Kafkadatamantra
 
Understanding time in structured streaming
Understanding time in structured streamingUnderstanding time in structured streaming
Understanding time in structured streamingdatamantra
 
Spark stack for Model life-cycle management
Spark stack for Model life-cycle managementSpark stack for Model life-cycle management
Spark stack for Model life-cycle managementdatamantra
 
Productionalizing Spark ML
Productionalizing Spark MLProductionalizing Spark ML
Productionalizing Spark MLdatamantra
 
Introduction to Structured streaming
Introduction to Structured streamingIntroduction to Structured streaming
Introduction to Structured streamingdatamantra
 
Building real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark StreamingBuilding real time Data Pipeline using Spark Streaming
Building real time Data Pipeline using Spark Streamingdatamantra
 
Testing Spark and Scala
Testing Spark and ScalaTesting Spark and Scala
Testing Spark and Scaladatamantra
 
Understanding Implicits in Scala
Understanding Implicits in ScalaUnderstanding Implicits in Scala
Understanding Implicits in Scaladatamantra
 
Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2Migrating to Spark 2.0 - Part 2
Migrating to Spark 2.0 - Part 2datamantra
 
Migrating to spark 2.0
Migrating to spark 2.0Migrating to spark 2.0
Migrating to spark 2.0datamantra
 
Scalable Spark deployment using Kubernetes
Scalable Spark deployment using KubernetesScalable Spark deployment using Kubernetes
Scalable Spark deployment using Kubernetesdatamantra
 
Introduction to concurrent programming with akka actors
Spark on YARN

  • 2. ● Shashank L ● Big data consultant and trainer at datamantra.io ● shashankgowda.me@gmail.com
  • 3. Agenda ● YARN - Introduction ● Need for YARN ● OS Analogy ● Why run Spark on YARN ● YARN Architecture ● Modes of Spark on YARN ● Internals of Spark on YARN ● Recent developments ● Road ahead ● Hands-on
  • 4. YARN ● Yet Another Resource Negotiator. ● A general-purpose, distributed application management framework.
  • 5. Need for YARN Hadoop 1.0 ● Single use system ● Capable of running only MR
  • 6. Need for YARN ● Scalability ○ 2009 – 8 cores, 16GB of RAM, 4x1TB disk ○ 2012 – 16+ cores, 48-96GB of RAM, 12x2TB or 12x3TB of disk. ● Cluster utilization ○ distinct map slots and reduce slots ● Supporting workloads other than MapReduce ○ MapReduce is great for many applications, but not everything.
  • 7. Need for YARN Hadoop 2.0 ● Multi purpose platform ● Capable of running apps other than MR
  • 8. OS analogy Traditional operating system Storage: File system Execution/Scheduling: Processes/ Kernel scheduler
  • 10. Why run Spark on YARN ● Leverage existing clusters ● Data locality ● Dynamically sharing the cluster resources between different frameworks. ● YARN schedulers can be used for categorizing, isolating, and prioritizing workloads. ● Only cluster manager for Spark that supports security
  • 12. YARN architecture Resource manager Node manager Node manager Container Container Container Container
  • 13. YARN architecture Resource manager Node manager Node manager Client
  • 14. YARN architecture Resource manager Node manager Node manager Container Application Master Client
  • 15. YARN architecture Resource manager Node manager Node manager Container Application Master Client Container Processing Container Processing
  • 17. Spark architecture Driver program Spark Context Cluster manager Worker node Executor Cache Task Task Worker node Executor Cache Task Task
  • 18. Spark architecture ● Driver Program is responsible for managing the job flow and scheduling tasks that will run on the executors. ● Executors are processes that run computation and store data for a Spark application. ● Cluster Manager is responsible for starting executor processes and for deciding where and when they run. Spark supports pluggable cluster managers, for example YARN, Mesos and the “standalone” cluster manager.
  • 19. Modes of Spark on YARN ● YARN-Client Mode ● YARN-Cluster Mode
  • 20. YARN client mode Resource manager Node manager(s) Container Spark Application Master/ Executor launcher Container Executor Client Spark driver/ Spark Context
  • 21. YARN client mode ● The driver runs in the client process, and the application master is only used for requesting resources from YARN. ● Suited to interactive use and debugging, where you want to see your application’s output immediately (on the client process side).
  • 22. YARN cluster mode Resource manager Node manager(s) Container Spark Application Master/ Spark driver Container Executor Client Spark submit
  • 23. YARN cluster mode ● In yarn-cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. ● Yarn-cluster mode makes sense for production jobs.
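
The difference between the two modes shows up at submission time as a single option. The sketch below builds a `spark-submit` argument list for either mode; it assumes the `--master yarn --deploy-mode <mode>` form (Spark releases contemporary with these slides also accepted `--master yarn-client` and `--master yarn-cluster`). The jar name and main class are hypothetical placeholders, not part of the talk.

```python
from typing import List


def spark_submit_args(app_jar: str, main_class: str, deploy_mode: str,
                      num_executors: int = 2) -> List[str]:
    """Build a spark-submit argument list for YARN in the given deploy mode."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", "yarn",
        # client: driver stays in the submitting process
        # cluster: driver runs inside the application master on the cluster
        "--deploy-mode", deploy_mode,
        "--num-executors", str(num_executors),
        "--class", main_class,
        app_jar,
    ]


# Hypothetical application jar and class, for illustration only.
client_cmd = spark_submit_args("app.jar", "com.example.Main", "client")
cluster_cmd = spark_submit_args("app.jar", "com.example.Main", "cluster")
print(" ".join(client_cmd))
print(" ".join(cluster_cmd))
```

Only the value after `--deploy-mode` changes; where the driver ends up is decided by Spark and YARN, not by the application code.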
  • 24. Concurrency vs Parallelism ● Concurrency is about dealing with lots of things at once. ● Parallelism is about doing lots of things at once.
  • 25. Akka ● Follows the Actor model ○ Actors keep mutable state internally and communicate through async messages ● Actors: treat actors like people who don't talk to each other in person; they only talk through mail. ○ An actor is an object ○ It processes one message at a time, as if running in its own thread ○ Messages are kept in a queue and processed in order
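
The actor idea on this slide can be sketched without Akka at all. The toy below is not Akka's API: the `Actor` and `Counter` classes are hypothetical names used only for illustration, and each actor gets a dedicated thread for simplicity, whereas Akka multiplexes many actors over a dispatcher's thread pool. What it does show is the three properties listed above: private mutable state, asynchronous sends, and a mailbox drained in order.

```python
import queue
import threading


class Actor:
    """Toy actor: private state, a mailbox, and one thread draining it in order."""

    _STOP = object()  # sentinel message that ends the actor's loop

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message):
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put(message)

    def stop(self):
        """Stop after all previously sent messages have been processed."""
        self._mailbox.put(self._STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is self._STOP:
                break
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError


class Counter(Actor):
    """Accumulates numbers; its state is only touched by the actor's own thread."""

    def __init__(self):
        self.total = 0
        self.seen = []
        super().__init__()  # start the thread only after state exists

    def receive(self, message):
        self.total += message
        self.seen.append(message)


counter = Counter()
for n in [1, 2, 3]:
    counter.tell(n)
counter.stop()
print(counter.total, counter.seen)  # 6 [1, 2, 3]
```

Because only the actor's thread ever mutates `total` and `seen`, no lock is needed around them; the queue is the single synchronization point.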
  • 27. Internals of Spark on YARN (diagram): a YARN cluster with one container running the Spark AM and Spark driver (SparkContext: DAG Scheduler, Task Scheduler, Scheduler backend), further containers running executors, and the client; arrows are numbered 1 to 10.
  • 28. Internals of Spark on YARN 1. A container is requested for the AM, and the AM is launched in it. 2. The SparkContext is created (inside the AM in cluster mode, inside the client in client mode). This internally creates a DAG Scheduler, a Task Scheduler and a SchedulerBackend, and creates an Akka actor system. 3. The application master requests containers based on the required resources. Once it gets the containers, it runs an executor process in each. 4. When an executor process comes up, it registers with the SchedulerBackend through Akka. 5. When code has to run on the cluster, the RDD runJob method calls the DAG Scheduler to create a DAG of tasks.
  • 29. Internals of Spark on YARN 6. A set of tasks that can run in parallel is sent to the Task Scheduler as a TaskSet. 7. The Task Scheduler in turn contacts the SchedulerBackend to run the tasks on the executors. 8. The SchedulerBackend, which keeps track of running executors and their statuses, schedules the tasks on executors. 9. Task output, if any, is sent through heartbeats to the SchedulerBackend. 10. The SchedulerBackend passes the task output on to the Task and DAG schedulers, which can make use of that output.
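
The hand-off between the three scheduler components in steps 4 to 10 can be sketched as a toy, in-process pipeline. This is not Spark's code: the class names mirror the slide, but every method name here is a hypothetical simplification, tasks run synchronously in the caller, and executors are just string ids assigned round-robin.

```python
class SchedulerBackend:
    """Tracks registered executors and places tasks on them (steps 4, 8, 9)."""

    def __init__(self):
        self.executors = []

    def register_executor(self, executor_id):
        self.executors.append(executor_id)

    def run_tasks(self, tasks):
        results = []
        for i, task in enumerate(tasks):
            # Round-robin placement stands in for real scheduling decisions.
            executor_id = self.executors[i % len(self.executors)]
            results.append((executor_id, task()))
        return results


class TaskScheduler:
    """Receives a TaskSet and forwards it to the backend (steps 6, 7)."""

    def __init__(self, backend):
        self.backend = backend

    def submit_task_set(self, task_set):
        return self.backend.run_tasks(task_set)


class DAGScheduler:
    """Submits one TaskSet per stage, in order, and collects output (steps 5, 10)."""

    def __init__(self, task_scheduler):
        self.task_scheduler = task_scheduler

    def run_job(self, stages):
        return [self.task_scheduler.submit_task_set(stage) for stage in stages]


backend = SchedulerBackend()
for executor_id in ("exec-1", "exec-2"):
    backend.register_executor(executor_id)

dag_scheduler = DAGScheduler(TaskScheduler(backend))
partitions = [[1, 2], [3, 4], [5]]
# One task per partition; together they form the TaskSet of a single stage.
stage = [lambda part=part: sum(part) for part in partitions]
job_output = dag_scheduler.run_job([stage])
print(job_output)  # [[('exec-1', 3), ('exec-2', 7), ('exec-1', 5)]]
```

In Spark the same hand-off is asynchronous and crosses process boundaries; the sketch only preserves who talks to whom.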
  • 30. Recent developments ● Dynamic resource allocation ○ No need to specify number of executors ○ Application grows and shrinks based on outstanding task count ○ Need to specify other things ● Data locality ○ Allocate executors close to data ○ SPARK-4352 ● Cached RDDs ○ Keep executors around
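
"Grows and shrinks based on outstanding task count" can be made concrete with a small calculation. The function below is a simplified sketch, not Spark's implementation: it assumes the target is the number of executors needed to run all pending and running tasks at once, clamped to configured bounds. The parameters loosely correspond to `spark.executor.cores`, `spark.task.cpus`, `spark.dynamicAllocation.minExecutors` and `spark.dynamicAllocation.maxExecutors`; timing aspects such as backlog and idle timeouts are left out.

```python
import math


def target_executors(pending_tasks, running_tasks, executor_cores,
                     task_cpus=1, min_executors=0, max_executors=10):
    """Executors needed to run all outstanding tasks at once, within bounds."""
    tasks_per_executor = executor_cores // task_cpus
    if tasks_per_executor < 1:
        raise ValueError("an executor must be able to run at least one task")
    needed = math.ceil((pending_tasks + running_tasks) / tasks_per_executor)
    return max(min_executors, min(needed, max_executors))


print(target_executors(20, 4, executor_cores=4))                   # 6
print(target_executors(0, 0, executor_cores=4, min_executors=1))   # 1
print(target_executors(1000, 0, executor_cores=4))                 # 10
```

The three calls show growth with backlog, shrinking to the floor when idle, and the ceiling imposed by the maximum.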
  • 31. Road ahead ● Making dynamic allocation better ○ Reduce allocation latency ○ Handle cached RDDs ● Simplified allocation ● Encrypt shuffle files ● File distribution ○ Replace HTTP with RPC
  • 33. ● Shashidhar E S ● Big data consultant and trainer at datamantra.io ● shashidhar.e@gmail.com
  • 34. Yarn components in different phases Resource Manager Yarn Client Application Master Node Manager ApplicationClientProtocol ApplicationMasterProtocol ContainerManagementProtocol Protocols
  • 35. Protocols ● Application Client Protocol (Client<-->RM) (org.apache.hadoop.yarn.api.ApplicationClientProtocol) ● Application Master Protocol (AM<-->RM) (org.apache.hadoop.yarn.api.ApplicationMasterProtocol) ● Container Management Protocol (AM<-->NM) (org.apache.hadoop.yarn.api.ContainerManagementProtocol)
  • 36. Application Client Protocol ● Protocol between client and resource manager ● Allows clients to submit and abort the jobs ● Enables clients to get information about applications ○ Cluster metrics - Active node managers ○ Nodes - Node details ○ Queues - Queue details ○ Delegation Tokens - Token for containers to interact with the services. ○ ApplicationAttemptReport - Application Details (host,port,diagnostics) ○ etc
  • 37. Application Master Protocol ● Protocol between AM and RM ● Key functionalities ○ RegisterAM ○ FinishAM - Notify RM about completion ○ Allocate - RM responds with available/unused containers
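
The register, allocate, finish sequence can be sketched with a toy resource manager. This is an in-memory stand-in, not the Hadoop API: `ToyResourceManager` is a hypothetical class whose method names only echo the protocol's `registerApplicationMaster`, `allocate` and `finishApplicationMaster` calls, and containers are plain strings. In real YARN, `allocate` also doubles as the AM's heartbeat and containers may arrive over several calls.

```python
class ToyResourceManager:
    """In-memory stand-in for the RM side of the AM<-->RM protocol."""

    def __init__(self, free_containers):
        self.free_containers = list(free_containers)
        self.registered = set()
        self.finished = {}

    def register_application_master(self, app_id):
        self.registered.add(app_id)

    def allocate(self, app_id, num_containers):
        """Grant up to num_containers of whatever is currently unused."""
        if app_id not in self.registered:
            raise RuntimeError("AM must register before calling allocate")
        granted = self.free_containers[:num_containers]
        self.free_containers = self.free_containers[num_containers:]
        return granted

    def finish_application_master(self, app_id, final_status):
        """Record completion so the RM stops expecting requests from this AM."""
        self.registered.discard(app_id)
        self.finished[app_id] = final_status


rm = ToyResourceManager(["c1", "c2", "c3"])
rm.register_application_master("app-1")
first = rm.allocate("app-1", 2)    # two containers granted
second = rm.allocate("app-1", 2)   # only one left, so fewer than requested
rm.finish_application_master("app-1", "SUCCEEDED")
print(first, second, rm.finished)
```

The second call illustrates the "available/unused containers" wording: the RM answers with what it has, which may be less than what was asked for.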
  • 38. Container Management Protocol ● Protocol between AM and NM ● Key functionalities ○ Start Containers ○ Stop Containers ○ Status of running containers - (NEW,RUNNING,COMPLETE)
  • 39. Building Blocks of Communication: Records ● Components in the YARN architecture communicate with each other by exchanging records. ● Each request sent is a record. Ex: localResource requests, applicationContext, containerLaunchContext etc. ● Each response obtained is a record. Ex: applicationResponse, applicationMasterResponse, allocateResponse etc.
  • 40. Custom Yarn Application Components ● Yarn Client ● Yarn Application Master ● Application
  • 41. Yarn Hello world Resource Manager Client Application Application Master Node manager Package : com.madhukaraphatak.yarnexamples.helloworld
  • 42. Yarn Hello world Steps: ● Create application client ○ Communicate with RM to launch AM ○ Specify AM resource requirements ● Application master ○ Communicate with RM to get containers ○ Communicate with NM’s to launch containers ● Application ○ Specify the application logic
  • 43. AKKA remote example Remote actor system (Remote actor) Local actor system (Local actor) Communicate through messages Package : com.madhukaraphatak.yarnexamples.akka
  • 44. AKKA remote example Steps: ● Create the remote client with the following properties ○ Actor ref provider - references should be remote aware ○ Transports used - TCP is the transport layer protocol ○ hostname - 127.0.0.1 ○ port - 5150 ● Create the local client with the following properties ○ Actor ref provider - references should be remote aware ○ Transports used - TCP is the transport layer protocol ○ hostname - 127.0.0.1 ○ port - 0 1. Akka actors behave like peers rather than client-server. 2. They talk over the same transport. 3. The only difference is the port: 0 means any free port.
  • 45. AKKA application on Yarn Resource Manager Application Local actor system (TaskResultSenderActor) Application Master Remote actor system (Remote actor) Client Node manager
  • 46. AKKA application on Yarn ● Client ○ Defines the tasks to be performed ○ Submits the tasks as a separate set ● Scheduler ○ Creates a receiver actor (Remote Actor) for orchestration ○ Sets up resources for the AM ○ Launches the AM for the set of tasks ● Application Master ○ Creates an Executor for each single task ○ Sets up resources for the Containers
  • 47. AKKA application on Yarn ● Executor ○ Create local Actor ○ Run task ○ Send response to remote actor ○ Kill local actor