www.prace-ri.eu
Introduction to Hadoop (part 2)
Vienna Technical University - IT Services (TU.it)
Dr. Giovanna Roda
Some advanced topics
▶ the YARN resource manager
▶ fault tolerance and HDFS Erasure Coding
▶ the mrjob library
▶ HDFS I/O benchmarking
▶ MapReduce spilling
YARN
Hadoop jobs are managed by YARN (an acronym for Yet Another Resource Negotiator), which is responsible
for allocating resources and managing job scheduling.
Basic resource types are:
▶ memory (memory-mb)
▶ virtual cores (vcores)
YARN supports an extensible resource model that allows you to define any countable resource. A
countable resource is a resource that is consumed while a container is running, but is released
afterwards. Such a resource can be, for instance:
▶ GPU (gpu)
YARN
Each job submitted to YARN is assigned:
▶ a container, an abstract entity which incorporates resources such as memory, CPU,
disk, network, etc. Container resources are allocated by the Scheduler.
▶ an ApplicationMaster service, assigned by the Applications Manager, for monitoring
the progress of the job and restarting tasks if needed.
YARN
The main idea of YARN is to have two distinct daemons for job monitoring and scheduling,
one global and one local to each application:
▶ the Resource Manager is the global job manager, consisting of:
▶ Scheduler → responsible for allocating resources across all applications
▶ Applications Manager → accepts job submissions and restarts Application Masters on
failure
▶ an Application Master for each application, a daemon responsible for negotiating
resources, monitoring the status of the job, and restarting failed tasks.
YARN architecture
Source: Apache Software Foundation
YARN Schedulers
YARN provides two types of schedulers out of the box:
▶ Capacity Scheduler
shares resources providing capacity guarantees → allows scheduler determinism
▶ Fair Scheduler
all apps get, on average, an equal share of resources over time. Fair sharing allows one
app to use the entire cluster if it’s the only one running.
By default, fairness is based on memory, but this can be configured to also include other
resources.
YARN Dynamic Resource Pools
YARN supports Dynamic Resource Pools (or dynamic queues).
Each job is assigned to a queue (the queue can be determined, for instance, by the user’s Unix
primary group).
Queues are assigned a weight, and resources are split among queues according to their weights.
For instance, if you have one queue with weight 20 and three other queues with weight 10, the
first queue will get 40%, and each of the other queues 20%, of the available resources. Letting 𝑥 be
the share per unit of weight,
1 ∗ 20𝑥 + 3 ∗ 10𝑥 = 100
gives 𝑥 = 2, i.e. 2% per unit of weight.
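The arithmetic above can be sketched in a few lines of Python (queue_shares is a hypothetical helper for illustration, not part of YARN):

```python
def queue_shares(weights):
    """Split 100% of the cluster's resources among queues in
    proportion to their weights (YARN's weighted dynamic pools)."""
    total = sum(weights)
    return [100 * w / total for w in weights]

# One queue with weight 20 and three queues with weight 10:
# total weight is 50, so each unit of weight is worth 2%.
print(queue_shares([20, 10, 10, 10]))  # [40.0, 20.0, 20.0, 20.0]
```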
Testing HDFS I/O throughput with TestDFSIO
TestDFSIO is a tool included in the Hadoop software distribution for measuring read and write
performance of the HDFS filesystem.
It’s a useful tool for assessing the performance of your Hadoop filesystem, identifying performance
bottlenecks, and supporting decisions for tuning your HDFS configuration.
TestDFSIO uses MapReduce to write and read files on the HDFS filesystem; the reducer is used
to collect and summarize test data.
Testing HDFS I/O throughput with TestDFSIO
Note: you won’t be able to run TestDFSIO on a cluster unless you’re the cluster’s
administrator. It is, of course, recommended to run TestDFSIO tests when the Hadoop
cluster is not otherwise in use.
To run tests as a non-superuser, you need read/write access to
/benchmarks/TestDFSIO on HDFS, or else you can specify another directory (on HDFS) where to
write and read data with the option -D test.build.data=myDir.
Testing HDFS I/O throughput with TestDFSIO
TestDFSIO writes and reads files on HDFS, spawning exactly one mapper per file.
These are the main options for running stress tests:
▶ -write to run write tests
▶ -read to run read tests
▶ -nrFiles the number of files (set to be equal to the number of mappers)
▶ -fileSize the size of each file (followed by B|KB|MB|GB|TB)
TestDFSIO: run a write test
Run a test with 80 files (remember: this is the number of mappers), each of size 10GB.
JARFILE=/opt/pkg/software/Hadoop/2.10.0-GCCcore-8.3.0-native/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.10.0-tests.jar
yarn jar $JARFILE TestDFSIO -write -nrFiles 80 -fileSize 10GB
The jar file needed to run TestDFSIO is hadoop-mapreduce-client-jobclient*tests.jar
and its location depends on your Hadoop installation.
TestDFSIO: examine test results
Test results are appended by default to the file TestDFSIO_results.log. The log file can
be changed with the option -resFile resultFileName.
The main measurements returned by the TestDFSIO test are:
▶ throughput in mb/sec
▶ average IO rate in mb/sec
▶ standard deviation of the IO rate
▶ test execution time
TestDFSIO: examine test results
Main units of measure:
▶ throughput or data transfer rate measures the amount of data read from or written to the
filesystem per unit of time (expressed in Megabytes per second, MB/s).
▶ IO rate, also abbreviated as IOPS, measures IO operations per second, that is, the number of
read or write operations that can be completed in one second.
Throughput and IO rate are connected:
Average IO size x IOPS = Throughput in MB/s
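This relation can be made concrete with a small sketch (throughput_mb_s is a hypothetical helper, and the numbers are invented for illustration):

```python
def throughput_mb_s(avg_io_size_mb, iops):
    """Data transfer rate (MB/s) = average IO size (MB) x IO operations per second."""
    return avg_io_size_mb * iops

# A workload doing 100 IO operations per second with an average
# IO size of 0.5 MB sustains a throughput of 50 MB/s:
print(throughput_mb_s(0.5, 100))  # 50.0
```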
TestDFSIO: examine test results
Sample output of a test write run:
----- TestDFSIO ----- : write
Date & time: Sun Sep 13 16:14:39 CEST 2020
Number of files: 80
Total MBytes processed: 819200
Throughput mb/sec: 30.64
Average IO rate mb/sec: 34.44
IO rate std deviation: 17.45
Test exec time sec: 429.73
TestDFSIO: examine test results
Concurrent versus overall throughput
The reported throughput is the average throughput among all map tasks. To get an
approximate overall throughput on the cluster, divide the total MBytes by the test
execution time in seconds.
In our test, the overall write throughput on the cluster is:
819200 / 429.73 ≈ 1906 MB/s
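This overall figure can be recomputed directly from the sample write run shown earlier (the values are the ones reported there):

```python
# Figures from the sample TestDFSIO write run.
total_mbytes = 819200.0    # "Total MBytes processed"
exec_time_sec = 429.73     # "Test exec time sec"

# Overall cluster write throughput: total data divided by wall-clock time.
overall = total_mbytes / exec_time_sec
print(f"{overall:.0f} MB/s")  # 1906 MB/s
```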
TestDFSIO: some things to try
▶ run a read test using the same command used for the write test, but with the option
-read in place of -write.
▶ use the option -D dfs.replication=1 to measure the effect of replication on I/O
performance:
yarn jar $JARFILE TestDFSIO -D dfs.replication=1 \
-write -nrFiles 80 -fileSize 10GB
TestDFSIO: remove data after completing tests
When done with testing, remove the temporary files with:
yarn jar $JARFILE TestDFSIO -clean
If you used a custom myDir output directory, use:
yarn jar $JARFILE TestDFSIO -D test.build.data=myDir -clean
Further reading: TestDFSIO documentation, source code TestDFSIO.java.
The mrjob library
A typical data processing pipeline consists of multiple steps that need to run in sequence.
mrjob is a Python library that offers a convenient framework which allows you to write
multi-step MapReduce jobs in pure Python, test them on your machine, and run them on a cluster.
The mrjob library: how to define a MapReduce job
A job is defined by a class that inherits from MRJob. This class contains methods that define
the steps of your job.
A “step” consists of a mapper, a combiner, and a reducer.
Step-by-step tutorials and documentation can be found at https://mrjob.readthedocs.io/
The mrjob library: a sample MapReduce job
from mrjob.job import MRJob

class MRWordFrequencyCount(MRJob):

    def mapper(self, _, line):
        yield "chars", len(line)
        yield "words", len(line.split())
        yield "lines", 1

    def reducer(self, key, values):
        yield key, sum(values)

if __name__ == '__main__':
    MRWordFrequencyCount.run()
The mrjob library: a sample MapReduce job
Save the script as mr_wordcount.py and run it with:
python mr_wordcount.py file.txt
This command will return the frequencies of characters, words, and lines in file.txt.
To get aggregated frequency counts for multiple files use:
python mr_wordcount.py file1.txt file2.txt …
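What the job computes can be simulated without Hadoop or mrjob installed: the functions below are plain-Python stand-ins that mirror the MRJob mapper and reducer above, not mrjob APIs.

```python
from collections import defaultdict

def mapper(line):
    # Mirrors MRWordFrequencyCount.mapper: emit one (key, value)
    # pair per metric for each input line.
    yield "chars", len(line)
    yield "words", len(line.split())
    yield "lines", 1

def reducer(pairs):
    # Mirrors the reducer: group values by key, then sum them.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

lines_in = ["hello world", "foo bar baz"]
pairs = [kv for line in lines_in for kv in mapper(line)]
result = reducer(pairs)
print(result)  # {'chars': 22, 'words': 5, 'lines': 2}
```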
The mrjob library: a sample MapReduce pipeline
To define a succession of MapReduce jobs, override the steps() method to return a list of
MRStep objects (MRStep is imported from mrjob.step):

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_count_words,
                   reducer=self.reducer_count_words),
            MRStep(reducer=self.reducer_find_max_word)
        ]
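The effect of such a two-step pipeline (count word frequencies, then pick the most frequent word) can be sketched in plain Python; count_words and find_max_word below are illustrative stand-ins for the steps, not mrjob APIs:

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[\w']+")

def count_words(lines):
    # Step 1: mapper + combiner + reducer collapsed into a single
    # pass that counts word occurrences.
    counts = Counter()
    for line in lines:
        counts.update(word.lower() for word in WORD_RE.findall(line))
    return counts

def find_max_word(counts):
    # Step 2: a reducer-only step that picks the most frequent word.
    return max(counts.items(), key=lambda kv: kv[1])

sample = ["the quick brown fox", "the lazy dog", "the end"]
print(find_max_word(count_words(sample)))  # ('the', 3)
```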
The mrjob library: run offline and on a cluster
Note that up to now we ran everything offline, on our local machine. To run the script on a
cluster, use the option -r hadoop. This will set Hadoop as the target runner.
Alternatively:
▶ -r emr for AWS Elastic MapReduce clusters
▶ -r dataproc for Google Cloud Dataproc
THANK YOU FOR YOUR ATTENTION
www.prace-ri.eu