Apache Hadoop: design and implementation
Emilio Coppa
April 29, 2014
Big Data Computing
Master of Science in Computer Science
Hadoop Facts
Open-source software framework for storage and large-scale
processing of data-sets on clusters of commodity hardware.
MapReduce paradigm: “Our abstraction is inspired by the map
and reduce primitives present in Lisp and many other
functional languages” (Dean & Ghemawat – Google – 2004)
First released in 2005 by D. Cutting (Yahoo) and Mike
Cafarella (U. Michigan)
Hadoop Facts (2)
2.5 million LOC – Java (47%), XML (36%)
681 years of effort (COCOMO)
Organized in 4 projects: Common, HDFS, YARN, MapReduce
81 contributors
Hadoop Facts (3) – Top Contributors
Analyzing the top 10 of contributors...
6 from HortonWorks (“We Do Hadoop”)
3 from Cloudera (“Ask Big Questions”)
1 from Yahoo
Doug Cutting currently works at Cloudera.
Apache Hadoop Architecture
Cluster: set of host machines (nodes). Nodes may be partitioned in
racks. This is the hardware part of the infrastructure.
YARN: Yet Another Resource Negotiator – framework responsible
for providing the computational resources (e.g., CPUs,
memory, etc.) needed for application executions.
HDFS: framework responsible for providing permanent, reliable and
distributed storage. This is typically used for storing inputs
and outputs (but not intermediate data).
Storage: alternative storage solutions may be plugged in. Amazon, for
instance, uses the Simple Storage Service (S3).
MapReduce: the software layer implementing the MapReduce paradigm.
Notice that YARN and HDFS can easily support other
frameworks (highly decoupled).
YARN Infrastructure:
Yet Another Resource Negotiator
YARN Infrastructure: overview
YARN handles the computational resources (CPU, memory, etc.) of
the cluster. The main actors are:
– Job Submitter: the client who submits an application
– Resource Manager: the master of the infrastructure
– Node Manager: a slave of the infrastructure
YARN Infrastructure: Node Manager
The Node Manager (NM) is the slave. When it starts, it announces
itself to the RM. Periodically, it sends a heartbeat to the RM. Its
resource capacity is the amount of memory and the number of vcores.
A container is a fraction of the NM capacity:
container := (amount of memory, # vcores)
# containers (on a NM) =
    yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb
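Worked example (assuming the default values listed in the recap): with
yarn.nodemanager.resource.memory-mb = 8192 and
yarn.scheduler.minimum-allocation-mb = 1024, a NM can host up to
8192 / 1024 = 8 containers.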
YARN Infrastructure: Resource Manager
The Resource Manager (RM) is the master. It knows where the
Node Managers are located (Rack Awareness) and how many
resources (containers) they have. It runs several services, the most
important being the Resource Scheduler.
YARN Infrastructure: Application Startup
1. a client submits an application to the RM
2. the RM allocates a container
3. the RM contacts the NM
4. the NM launches the container
5. the container executes the Application Master
YARN Infrastructure: Application Master
The AM is responsible for the execution of an application. It asks
the Resource Scheduler (RM) for containers and executes specific
programs (e.g., the main of a Java class) on the obtained containers.
The AM is framework-specific.
The RM is a single point of failure in YARN. By using AMs, YARN
spreads the metadata related to the running applications over the
cluster.
=⇒ RM: reduced load & fast recovery
MapReduce Framework:
Anatomy of a MR Job
MapReduce: Application MR Job
Timeline of a MR Job execution:
Map Phase: several Map Tasks are executed
Reduce Phase: several Reduce Tasks are executed
The MRAppMaster is the director of the job.
MapReduce: what does the user give us?
A Job submitted by a user is composed of:
a configuration: if partial then global/default values are used
a JAR containing:
a map() implementation
a combine() implementation (optional)
a reduce() implementation
input and output information:
input directory: are they on HDFS? S3? How many files?
output directory: where? HDFS? S3?
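As a concrete sketch of the user side (assuming WordCount-style classes;
the class names are illustrative, not taken from the slides), such a
submission boils down to a small Java driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // the configuration
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);              // the JAR
        job.setMapperClass(TokenizerMapper.class);       // map() implementation
        job.setCombinerClass(IntSumReducer.class);       // combine() implementation
        job.setReducerClass(IntSumReducer.class);        // reduce() implementation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}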
Map Phase: How many Map Tasks?
One Map Task for each input split (Job Submitter):
num_splits = 0
for each input file f:
    remaining = f.length
    while remaining / split_size > split_slope:
        num_splits += 1
        remaining -= split_size
    if remaining > 0:
        num_splits += 1   # one final split for the tail of the file
where:
split_slope = 1.1
split_size ≈ dfs.blocksize
mapreduce.job.maps is ignored in MRv2 (before, it was a hint)!
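Worked example (illustrative): a 1 GB file with split_size = dfs.blocksize
= 128 MB yields 8 splits: the loop emits 7 full splits (the ratio
remaining / split_size stays above 1.1 until 128 MB remain), and the
final 128 MB tail becomes the 8th split.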
Map Phase: MapTask launch
The MRAppMaster immediately asks for containers needed by all
MapTasks:
=⇒ num_splits container requests
A container request for a MapTask tries to exploit data locality:
a node where input split is stored
if not, a node in same rack
if not, any other node
This is just a hint to the Resource Scheduler!
After a container has been assigned, the MapTask is launched.
Map Phase: Execution Overview
Possible execution scenario:
2 Node Managers (capacity 2 containers each)
no other running applications
8 input splits
Map Phase: MapTask
Execution timeline:
Map Phase: MapTask – Init
1. create a context (TaskAttemptContext)
2. create an instance of the user Mapper class
3. setup input (InputFormat, InputSplit, RecordReader)
4. setup output (NewOutputCollector)
5. create a mapper context (MapContext, Mapper.Context)
6. initialize input, e.g.:
create a SplitLineReader object
create a HdfsDataInputStream object
Map Phase: MapTask – Execution
Mapper.Context.nextKeyValue() loads data from the input split
Mapper.Context.write() writes the output to a circular buffer
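A minimal Mapper sketch (WordCount-style; illustrative, not taken from
the slides) shows where these calls come into play: Mapper.run()
repeatedly calls nextKeyValue() and hands each record to map(), whose
context.write() calls land in the circular buffer:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // buffered in the MapOutputBuffer
        }
    }
}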
Map Phase: MapTask – Spilling
Mapper.Context.write() writes to a MapOutputBuffer of size
mapreduce.task.io.sort.mb (default: 100MB). If it is
mapreduce.map.sort.spill.percent (default: 80%) full, then a parallel
spilling phase is started.
If the circular buffer is 100% full, then map() is blocked!
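If spilling blocks the mapper too often, the buffer can be tuned per
job (a sketch; the values are illustrative, not recommendations):

Configuration conf = new Configuration();
conf.setInt("mapreduce.task.io.sort.mb", 256);            // larger circular buffer
conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f); // start spilling later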
Map Phase: MapTask – Spilling (2)
1. create a SpillRecord & a FSOutputStream (local fs)
2. in-memory sort the chunk of the buffer (quicksort), sorting by
<partitionIdx, key>
3. divide into partitions: 1 partition for each reducer
(mapreduce.job.reduces)
4. write the partitions into the output file
Map Phase: MapTask – Spilling (partitioning)
How do we partition the <key, value> tuples?
During a Mapper.Context.write():
partitionIdx = (key.hashCode() & Integer.MAX_VALUE) % numReducers
It is stored as metadata of the tuple in the circular buffer.
Use mapreduce.job.partitioner.class for a custom partitioner
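A custom partitioner is just a Partitioner subclass; a sketch mirroring
the default hash behaviour (the class name is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class MyPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReducers) {
        // same scheme as the default HashPartitioner
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}

It is enabled with job.setPartitionerClass(MyPartitioner.class).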
Map Phase: MapTask – Spilling (combine)
If the user specifies a combiner then, before writing the tuples to
the file, we apply it to the tuples of each partition:
1. create an instance of the user Reducer class
2. create a Reducer.Context: output on a local fs file
3. execute Reducer.run(): see the Reduce Task slides
The combiner typically uses the same implementation as the reduce()
function and can thus be seen as a local reducer.
Map Phase: MapTask – Spilling (end of execution)
At the end of the execution of Mapper.run():
1. sort and spill the remaining unspilled tuples
2. start the shuffle phase
Map Phase: MapTask – Shuffle
Spill files need to be merged: this is done by a k-way merge where
k is equal to mapreduce.task.io.sort.factor (default: 10).
These are intermediate output files of only one MapTask!
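Worked example (illustrative, assuming the plain k-way merge scheme):
with 100 spill files and a factor of 10, the merge takes two passes:
ten 10-way merges producing ten intermediate runs, followed by one
final 10-way merge of those runs.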
Map Phase: Execution Overview
Possible execution scenario:
2 Node Managers (capacity 2 containers each)
no other running applications
8 input splits
The Node Managers locally store the map outputs (reduce inputs).
Reduce Phase: Reduce Task Launch
The MRAppMaster waits until mapreduce.job.reduce.slowstart.completedmaps
(default: 5%) of the MapTasks are completed. Then, periodically:
if all maps have a container assigned, then all (remaining)
reducers are scheduled
otherwise it checks the percentage of completed maps:
check the cluster resources available to the app
check the resources needed for unassigned/rescheduled maps
ramp down (unschedule/kill) or ramp up (schedule) reduce tasks
When a reduce task is scheduled, a container request is made. This
does NOT exploit data locality.
A MapTask request has a higher priority than a Reduce Task request.
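On a busy cluster it can pay off to delay the reducers so they do not
hold containers while waiting for map outputs (a sketch; the value is
illustrative):

conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);
// reducers start only after 80% of the maps have completed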
Reduce Phase: Execution Overview
Possible execution scenario:
2 Node Managers (capacity 2 containers each)
no other running applications
4 reducers (mapreduce.job.reduces, default: 1)
Reduce Phase: Reduce Task
Execution timeline:
Reduce Phase: Reduce Task – Init
1. init a codec (if map outputs are compressed)
2. create an instance of the combine output collector (if needed)
3. create an instance of the shuffle plugin
(mapreduce.job.reduce.shuffle.consumer.plugin.class, default:
org.apache.hadoop.mapreduce.task.reduce.Shuffle.class)
4. create a shuffle context (ShuffleConsumerPlugin.Context)
Reduce Phase: Reduce Task – Shuffle
The shuffle has two steps:
1. fetch map outputs from Node Managers
2. merge them
Reduce Phase: Reduce Task – Shuffle (fetch)
Several parallel fetchers are started (up to
mapreduce.reduce.shuffle.parallelcopies, default: 5). Each fetcher
collects map outputs from one NM (possibly many containers). For each
map output:
if its size is less than 25% of the in-memory merge buffer
(mapreduce.reduce.shuffle.memory.limit.percent), create an in-memory
output (waiting until enough memory is available)
otherwise create an on-disk output
Reduce Phase: Reduce Task – Shuffle (fetch) (2)
The outputs are fetched over HTTP and added to the related merge queue.
A Reduce Task may start before the end of the Map Phase, thus it can
fetch only from completed map tasks. The fetch process is repeated
periodically.
Reduce Phase: Reduce Task – Shuffle (in memory merge)
The in-memory merger:
1. performs a k-way merge
2. runs the combiner (if needed)
3. writes the result to an On Disk Map Output and queues it
Reduce Phase: Reduce Task – Shuffle (on disk merge)
Extract from the queue, k-way merge, and queue the result.
Stop when all files have been merged together: the final merge
provides a RawKeyValueIterator instance (the input of the reducer).
Reduce Phase: Reduce Task – Execution (init)
1. create a context (TaskAttemptContext)
2. create an instance of the user Reducer class
3. setup output (RecordWriter, TextOutputFormat)
4. create a reducer context (Reducer.Context)
37 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra
Reduce Phase: Reduce Task – Execution (run)
The output is typically written to an HDFS file.
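A minimal Reducer sketch (WordCount-style; illustrative, not taken from
the slides): Reducer.run() iterates over the keys provided by the
RawKeyValueIterator and calls reduce() once per key:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();          // values come from the final merge
        }
        result.set(sum);
        context.write(key, result);  // emitted through the RecordWriter (typically to HDFS)
    }
}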
Reduce Phase: Execution Overview
MapReduce: Application MR Job
Possible execution timeline:
That’s it!
MapReduce: Task Progress
A MapTask has two phases:
Map (66%): progress due to the percentage of processed input
Sort (33%): 1 subphase for each reducer; subphase progress due to
the percentage of merged bytes
A ReduceTask has three phases:
Copy (33%): progress due to the percentage of fetched input
Sort (33%): progress due to the processed bytes in the final merge
Reduce (33%): progress due to the percentage of processed input
MapReduce: Speculation
The MRAppMaster may launch speculative tasks:

est = (ts - start) / MAX(0.0001, Status.progress())
estEndTime = start + est
estReplacementEndTime = now() + TaskDurations.mean()
if estEndTime < now():
    return PROGRESS_IS_GOOD
elif estReplacementEndTime >= estEndTime:
    return TOO_LATE_TO_SPECULATE
else:
    return estEndTime - estReplacementEndTime   # the score

A replica of the task with the highest score is launched.
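Worked example (illustrative numbers): a task that started 100 s ago
with progress 0.25 gives est = 100 / 0.25 = 400 s, hence estEndTime =
start + 400 s, i.e. 300 s from now. With a mean task duration of 120 s,
estReplacementEndTime = now + 120 s, which is earlier than estEndTime,
so the task gets score 300 − 120 = 180 s and is a candidate for
speculation.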
MapReduce: Application Status
The status of a MR job is tracked by the MRAppMaster using
several Finite State Machines:
Job: 14 states, 80 transitions, 19 events
Task: 14 states, 36 transitions, 9 events
Task Attempt: 13 states, 60 transitions, 17 events
A job is composed of several tasks. Each task may have several
task attempts. Each task attempt is executed on a container.
Instead, a Node Manager maintains the states of:
Application: 7 states, 21 transitions, 9 events
Container: 11 states, 46 transitions, 12 events
MapReduce: Job FSM (example)
Configuration Parameters (recap)
Parameter – Meaning
mapreduce.framework.name – The runtime framework for executing MapReduce jobs. Set to YARN.
mapreduce.job.reduces – Number of reduce tasks. Default: 1
dfs.blocksize – HDFS block size. Default: 128MB
yarn.resourcemanager.scheduler.class – Scheduler class. Default: CapacityScheduler
yarn.nodemanager.resource.memory-mb – Memory available on a NM for containers. Default: 8192
yarn.scheduler.minimum-allocation-mb – Min allocation for every container request. Default: 1024
mapreduce.map.memory.mb – Memory request for a MapTask. Default: 1024
mapreduce.reduce.memory.mb – Memory request for a ReduceTask. Default: 1024
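Most of these parameters can also be overridden per job from the command
line, assuming the driver goes through ToolRunner/GenericOptionsParser
(a sketch; the jar and class names are illustrative):

hadoop jar myapp.jar MyDriver \
    -D mapreduce.job.reduces=4 \
    -D mapreduce.map.memory.mb=2048 \
    in/ out/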
Configuration Parameters (recap) (2)
Parameter – Meaning
mapreduce.task.io.sort.mb – Size of the circular buffer (map output). Default: 100MB
mapreduce.map.sort.spill.percent – Circular buffer soft limit. Once reached, start the spilling process. Default: 0.80
mapreduce.job.partitioner.class – The Partitioner class. Default: HashPartitioner.class
map.sort.class – The sort class for sorting keys. Default: org.apache.hadoop.util.QuickSort
mapreduce.reduce.shuffle.memory.limit.percent – Maximum percentage of the in-memory limit that a single shuffle can consume. Default: 0.25
mapreduce.reduce.shuffle.input.buffer.percent – The % of the maximum heap size allocated to storing map outputs during the shuffle. Default: 0.70
Configuration Parameters (recap) (3)
Parameter – Meaning
mapreduce.reduce.shuffle.merge.percent – The usage % at which an in-memory merge will be initiated. Default: 0.66
mapreduce.map.combine.minspills – Apply the combiner only if there are at least this many spill files. Default: 3
mapreduce.task.io.sort.factor – The number of streams to merge at once while sorting files. Default: 10
mapreduce.job.reduce.slowstart.completedmaps – Fraction of the maps in the job which should be complete before reduces are scheduled. Default: 0.05
mapreduce.reduce.shuffle.parallelcopies – Number of parallel transfers run by reduce during the shuffle (fetch) phase. Default: 5
mapreduce.reduce.memory.totalbytes – Memory available to a ReduceTask for the shuffle. Default: Runtime.maxMemory()
Hadoop: a bad angel
Writing a MapReduce program is relatively easy. On the other
hand, writing an efficient MapReduce program is hard:
many configuration parameters:
YARN: 115 parameters
MapReduce: 195 parameters
HDFS: 173 parameters
core: 145 parameters
lack of control over the execution: how to debug?
many implementation details: what is happening?
How can we help the user?
We need profilers!
My current research is focused on this goal.
References
My Personal Page: sites.google.com/a/di.uniroma1.it/coppa/
Hadoop Internals: ercoppa.github.io/HadoopInternals/
50 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)

More Related Content

What's hot

MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduceFARUK BERKSÖZ
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce introGeoff Hendrey
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop EcosystemJ Singh
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Cloudera, Inc.
 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARNAdam Kawa
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformBikas Saha
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringGeorge Ang
 
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Simplilearn
 
Towards SLA-based Scheduling on YARN Clusters
Towards SLA-based Scheduling on YARN ClustersTowards SLA-based Scheduling on YARN Clusters
Towards SLA-based Scheduling on YARN ClustersDataWorks Summit
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep divet3rmin4t0r
 
YARN - Hadoop's Resource Manager
YARN - Hadoop's Resource ManagerYARN - Hadoop's Resource Manager
YARN - Hadoop's Resource ManagerVertiCloud Inc
 
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Sumeet Singh
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo pptPhil Young
 

What's hot (20)

Hadoop YARN
Hadoop YARNHadoop YARN
Hadoop YARN
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Hadoop hbase mapreduce
Hadoop hbase mapreduceHadoop hbase mapreduce
Hadoop hbase mapreduce
 
Real time hadoop + mapreduce intro
Real time hadoop + mapreduce introReal time hadoop + mapreduce intro
Real time hadoop + mapreduce intro
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2Introduction to YARN and MapReduce 2
Introduction to YARN and MapReduce 2
 
February 2014 HUG : Pig On Tez
February 2014 HUG : Pig On TezFebruary 2014 HUG : Pig On Tez
February 2014 HUG : Pig On Tez
 
Apache Hadoop YARN
Apache Hadoop YARNApache Hadoop YARN
Apache Hadoop YARN
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
YARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute PlatformYARN - Hadoop Next Generation Compute Platform
YARN - Hadoop Next Generation Compute Platform
 
October 2014 HUG : Hive On Spark
October 2014 HUG : Hive On SparkOctober 2014 HUG : Hive On Spark
October 2014 HUG : Hive On Spark
 
Hadoop Design and k -Means Clustering
Hadoop Design and k -Means ClusteringHadoop Design and k -Means Clustering
Hadoop Design and k -Means Clustering
 
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
Hadoop YARN | Hadoop YARN Architecture | Hadoop YARN Tutorial | Hadoop Tutori...
 
Towards SLA-based Scheduling on YARN Clusters
Towards SLA-based Scheduling on YARN ClustersTowards SLA-based Scheduling on YARN Clusters
Towards SLA-based Scheduling on YARN Clusters
 
Hadoop Internals
Hadoop InternalsHadoop Internals
Hadoop Internals
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
YARN - Hadoop's Resource Manager
YARN - Hadoop's Resource ManagerYARN - Hadoop's Resource Manager
YARN - Hadoop's Resource Manager
 
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
Hadoop Summit San Jose 2015: Towards SLA-based Scheduling on YARN Clusters
 
Hadoop demo ppt
Hadoop demo pptHadoop demo ppt
Hadoop demo ppt
 

Viewers also liked

MapReduce Container ReUse
MapReduce Container ReUseMapReduce Container ReUse
MapReduce Container ReUseHortonworks
 
Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Peter Wood
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview Hortonworks
 
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...Kevin Minder
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...DataWorks Summit
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop SecurityDataWorks Summit
 
Information security in big data -privacy and data mining
Information security in big data -privacy and data miningInformation security in big data -privacy and data mining
Information security in big data -privacy and data miningharithavijay94
 
Apache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXApache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXAbhishek Mallick
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the CloudDataWorks Summit
 
Troubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastTroubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastDataWorks Summit
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureUwe Printz
 
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersApache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersDataWorks Summit
 
Big Data Security with Hadoop
Big Data Security with HadoopBig Data Security with Hadoop
Big Data Security with HadoopCloudera, Inc.
 
Hadoop Security Today & Tomorrow with Apache Knox
Hadoop Security Today & Tomorrow with Apache KnoxHadoop Security Today & Tomorrow with Apache Knox
Hadoop Security Today & Tomorrow with Apache KnoxVinay Shukla
 
Hadoop and Data Access Security
Hadoop and Data Access SecurityHadoop and Data Access Security
Hadoop and Data Access SecurityCloudera, Inc.
 
OAuth - Open API Authentication
OAuth - Open API AuthenticationOAuth - Open API Authentication
OAuth - Open API Authenticationleahculver
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security ArchitectureOwen O'Malley
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 

Viewers also liked (20)

MapReduce Container ReUse
MapReduce Container ReUseMapReduce Container ReUse
MapReduce Container ReUse
 
An Approach for Multi-Tenancy Through Apache Knox
An Approach for Multi-Tenancy Through Apache KnoxAn Approach for Multi-Tenancy Through Apache Knox
An Approach for Multi-Tenancy Through Apache Knox
 
Hadoop
HadoopHadoop
Hadoop
 
Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)Big Data and Security - Where are we now? (2015)
Big Data and Security - Where are we now? (2015)
 
Hdp security overview
Hdp security overview Hdp security overview
Hdp security overview
 
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...
Securing Hadoop's REST APIs with Apache Knox Gateway Hadoop Summit June 6th, ...
 
Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...Treat your enterprise data lake indigestion: Enterprise ready security and go...
Treat your enterprise data lake indigestion: Enterprise ready security and go...
 
Improvements in Hadoop Security
Improvements in Hadoop SecurityImprovements in Hadoop Security
Improvements in Hadoop Security
 
Information security in big data -privacy and data mining
Information security in big data -privacy and data miningInformation security in big data -privacy and data mining
Information security in big data -privacy and data mining
 
Apache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOXApache Knox setup and hive and hdfs Access using KNOX
Apache Knox setup and hive and hdfs Access using KNOX
 
Built-In Security for the Cloud
Built-In Security for the CloudBuilt-In Security for the Cloud
Built-In Security for the Cloud
 
Troubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the BeastTroubleshooting Kerberos in Hadoop: Taming the Beast
Troubleshooting Kerberos in Hadoop: Taming the Beast
 
Hadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, FutureHadoop & Security - Past, Present, Future
Hadoop & Security - Past, Present, Future
 
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise UsersApache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
Apache Knox Gateway "Single Sign On" expands the reach of the Enterprise Users
 
Big Data Security with Hadoop
Big Data Security with HadoopBig Data Security with Hadoop
Big Data Security with Hadoop
 
Hadoop Security Today & Tomorrow with Apache Knox
Hadoop Security Today & Tomorrow with Apache KnoxHadoop Security Today & Tomorrow with Apache Knox
Hadoop Security Today & Tomorrow with Apache Knox
 
Hadoop and Data Access Security
Hadoop and Data Access SecurityHadoop and Data Access Security
Hadoop and Data Access Security
 
OAuth - Open API Authentication
OAuth - Open API AuthenticationOAuth - Open API Authentication
OAuth - Open API Authentication
 
Hadoop Security Architecture
Hadoop Security ArchitectureHadoop Security Architecture
Hadoop Security Architecture
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 

Similar to Hadoop Internals (2.3.0 or later)

XML Parsing with Map Reduce
XML Parsing with Map ReduceXML Parsing with Map Reduce
XML Parsing with Map ReduceEdureka!
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkMila, Université de Montréal
 
Hadoop with Lustre WhitePaper
Hadoop with Lustre WhitePaperHadoop with Lustre WhitePaper
Hadoop with Lustre WhitePaperDavid Luan
 
YARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster BEYOND HADOOPYARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster BEYOND HADOOPOmkar Joshi
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
Big data unit iv and v lecture notes qb model exam
Big data unit iv and v lecture notes   qb model examBig data unit iv and v lecture notes   qb model exam
Big data unit iv and v lecture notes qb model examIndhujeni
 
Dache: A Data Aware Caching for Big-Data using Map Reduce framework
Dache: A Data Aware Caching for Big-Data using Map Reduce frameworkDache: A Data Aware Caching for Big-Data using Map Reduce framework
Dache: A Data Aware Caching for Big-Data using Map Reduce frameworkSafir Shah
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hoodAdarsh Pannu
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to YarnApache Apex
 
Finalprojectpresentation
FinalprojectpresentationFinalprojectpresentation
FinalprojectpresentationSANTOSH WAYAL
 
Hadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingHadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingNandan Kumar
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce FrameworkEdureka!
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark InternalsPietro Michiardi
 

Similar to Hadoop Internals (2.3.0 or later) (20)

Hadoop Internals
Hadoop InternalsHadoop Internals
Hadoop Internals
 
Hadoop Internals
Hadoop InternalsHadoop Internals
Hadoop Internals
 
XML Parsing with Map Reduce
XML Parsing with Map ReduceXML Parsing with Map Reduce
XML Parsing with Map Reduce
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Large scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using sparkLarge scale logistic regression and linear support vector machines using spark
Large scale logistic regression and linear support vector machines using spark
 
Hadoop with Lustre WhitePaper
Hadoop with Lustre WhitePaperHadoop with Lustre WhitePaper
Hadoop with Lustre WhitePaper
 
YARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster BEYOND HADOOPYARN - way to share cluster BEYOND HADOOP
YARN - way to share cluster BEYOND HADOOP
 
YARN (2).pptx
YARN (2).pptxYARN (2).pptx
YARN (2).pptx
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
20140708hcj
20140708hcj20140708hcj
20140708hcj
 
Big data unit iv and v lecture notes qb model exam
Big data unit iv and v lecture notes   qb model examBig data unit iv and v lecture notes   qb model exam
Big data unit iv and v lecture notes qb model exam
 
Hadoop
HadoopHadoop
Hadoop
 
Dache: A Data Aware Caching for Big-Data using Map Reduce framework
Dache: A Data Aware Caching for Big-Data using Map Reduce frameworkDache: A Data Aware Caching for Big-Data using Map Reduce framework
Dache: A Data Aware Caching for Big-Data using Map Reduce framework
 
Apache Spark: What's under the hood
Apache Spark: What's under the hoodApache Spark: What's under the hood
Apache Spark: What's under the hood
 
Introduction to Yarn
Introduction to YarnIntroduction to Yarn
Introduction to Yarn
 
Finalprojectpresentation
FinalprojectpresentationFinalprojectpresentation
Finalprojectpresentation
 
Hadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch trainingHadoop 2.0 yarn arch training
Hadoop 2.0 yarn arch training
 
Hadoop MapReduce Framework
Hadoop MapReduce FrameworkHadoop MapReduce Framework
Hadoop MapReduce Framework
 
Introduction to Spark Internals
Introduction to Spark InternalsIntroduction to Spark Internals
Introduction to Spark Internals
 
Introduction to yarn
Introduction to yarnIntroduction to yarn
Introduction to yarn
 

Recently uploaded

What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineeringssuserb3a23b
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringHironori Washizaki
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprisepreethippts
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf31events.com
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 

Recently uploaded (20)

What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
Machine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their EngineeringMachine Learning Software Engineering Patterns and Their Engineering
Machine Learning Software Engineering Patterns and Their Engineering
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
Odoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 EnterpriseOdoo 14 - eLearning Module In Odoo 14 Enterprise
Odoo 14 - eLearning Module In Odoo 14 Enterprise
 
Sending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdfSending Calendar Invites on SES and Calendarsnack.pdf
Sending Calendar Invites on SES and Calendarsnack.pdf
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 

Hadoop Internals (2.3.0 or later)

  • 1. Introduction YARN MapReduce Conclusion Apache Hadoop: design and implementation Emilio Coppa April 29, 2014 Big Data Computing Master of Science in Computer Science 1 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 2. Introduction YARN MapReduce Conclusion Hadoop Facts Open-source software framework for storage and large-scale processing of data-sets on clusters of commodity hardware. MapReduce paradigm: “Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages” (Dean && Ghemawat – Google – 2004) First released in 2005 by D. Cutting (Yahoo) and Mike Cafarella (U. Michigan) 2 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 3. Introduction YARN MapReduce Conclusion Hadoop Facts (2) 2,5 millions of LOC – Java (47%), XML (36%) 681 years of effort (COCOMO) Organized in 4 projects: Common, HDFS, YARN, MapReduce 81 contributors 3 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 4. Introduction YARN MapReduce Conclusion Hadoop Facts (3) – Top Contributors Analyzing the top 10 of contributors... 4 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 5. Introduction YARN MapReduce Conclusion Hadoop Facts (3) – Top Contributors Analyzing the top 10 of contributors... 1 6 HortonWorks (“We Do Hadoop”) 4 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 6. Introduction YARN MapReduce Conclusion Hadoop Facts (3) – Top Contributors Analyzing the top 10 of contributors... 1 6 HortonWorks (“We Do Hadoop”) 2 3 Cloudera (“Ask Big Questions”) 4 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 7. Introduction YARN MapReduce Conclusion Hadoop Facts (3) – Top Contributors Analyzing the top 10 of contributors... 1 6 HortonWorks (“We Do Hadoop”) 2 3 Cloudera (“Ask Big Questions”) 3 1 Yahoo 4 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 8. Introduction YARN MapReduce Conclusion Hadoop Facts (3) – Top Contributors Analyzing the top 10 of contributors... 1 6 HortonWorks (“We Do Hadoop”) 2 3 Cloudera (“Ask Big Questions”) 3 1 Yahoo Doug Cutting currently works at Cloudera. 4 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 9. Introduction YARN MapReduce Conclusion Apache Hadoop Architecture 5 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 10. Introduction YARN MapReduce Conclusion Apache Hadoop Architecture Cluster: set of host machines (nodes). Nodes may be partitioned in racks. This is the hardware part of the infrastructure. 5 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 11. Introduction YARN MapReduce Conclusion Apache Hadoop Architecture YARN: Yet Another Resource Negotiator – framework responsible for providing the computational resources (e.g., CPUs, memory, etc.) needed for application executions. 5 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 12. Introduction YARN MapReduce Conclusion Apache Hadoop Architecture HDFS: framework responsible for providing permanent, reliable and distributed storage. This is typically used for storing inputs and output (but not intermediate ones). 5 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 13. Introduction YARN MapReduce Conclusion Apache Hadoop Architecture Storage: Other alternative storage solutions. Amazon uses the Simple Storage Service (S3). 5 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 14. Introduction YARN MapReduce Conclusion Apache Hadoop Architecture MapReduce: the software layer implementing the MapReduce paradigm. Notice that YARN and HDFS can easily support other frameworks (highly decoupled). 5 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 15. Introduction YARN MapReduce Conclusion YARN Infrastructure: Yet Another Resource Negotiator 6 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 16. Introduction YARN MapReduce Conclusion YARN Infrastructure: overview YARN handles the computational resources (CPU, memory, etc.) of the cluster. The main actors are: 7 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 17. Introduction YARN MapReduce Conclusion YARN Infrastructure: overview YARN handles the computational resources (CPU, memory, etc.) of the cluster. The main actors are: – Job Submitter: the client who submits an application 7 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 18. Introduction YARN MapReduce Conclusion YARN Infrastructure: overview YARN handles the computational resources (CPU, memory, etc.) of the cluster. The main actors are: – Job Submitter: the client who submits an application – Resource Manager: the master of the infrastructure 7 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 19. Introduction YARN MapReduce Conclusion YARN Infrastructure: overview YARN handles the computational resources (CPU, memory, etc.) of the cluster. The main actors are: – Job Submitter: the client who submits an application – Resource Manager: the master of the infrastructure – Node Manager: A slave of the infrastructure 7 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 20. Introduction YARN MapReduce Conclusion YARN Infrastructure: Node Manager The Node Manager (NM) is the slave. When it starts, it announces himself to the RM. Periodically, it sends an heartbeat to the RM. Its resource capacity is the amount of memory and the number of vcores. A container is a fraction of the NM capacity: container := (amount of memory, # vcores) # containers (on a NM) yarn.nodemanager.resource.memory-mb / yarn.scheduler.minimum-allocation-mb 8 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 21. Introduction YARN MapReduce Conclusion YARN Infrastructure: Resource Manager The Resource Manager (RM) is the master. It knows where the Node Managers are located (Rack Awareness) and how many resources (containers) they have. It runs several services, the most important is the Resource Scheduler. 9 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 22. Introduction YARN MapReduce Conclusion YARN Infrastructure: Application Startup 1 a client submits an application to the RM 2 the RM allocates a container 3 the RM contacts the NM 4 the NM launches the container 5 the container executes the Application Master 10 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 23. Introduction YARN MapReduce Conclusion YARN Infrastructure: Application Master The AM is responsible for the execution of an application. It asks for containers to the Resource Scheduler (RM) and executes specific programs (e.g., the main of a Java class) on the obtained containers. The AM is framework-specific. The RM is a single point of failure in YARN. Using AMs, YARN is spreading over the cluster the metadata related to the running applications. « RM: reduced load & fast recovery 11 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 24. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra MapReduce Framework: Anatomy of MR Job 12 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 25. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra MapReduce: Application MR Job Timeline of a MR Job execution: Map Phase: executed several Map Tasks Reduce Phase: executed several Reduce Tasks The MRAppMaster is the director of the job. 13 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 26. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra MapReduce: what does the user give us? A Job submitted by a user is composed by: a configuration: if partial then use global/default values a JAR containing: a map() implementation a combine implementation a reduce() implementation input and output information: input directory: are they on HDFS? S3? How many files? output directory: where? HDFS? S3? 14 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 27. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: How many Map Tasks? One Map Task for each input split (Job Submitter): num_splits = 0 for each input file f: remaining = f.length while remaining / split_size > split_slope: num_splits += 1 remaining -= split_size where: split_slope = 1.1 split_size dfs.blocksize mapreduce.job.maps is ignored in MRv2 (before it was an hint)! 15 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 28. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: MapTask launch The MRAppMaster immediately asks for containers needed by all MapTasks: =⇒ num_splits container requests A container request for a MapTask tries to exploit data locality: a node where input split is stored if not, a node in same rack if not, any other node This is just an hint to the Resource Scheduler! After a container has been assigned, the MapTask is launched. 16 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 29. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: Execution Overview Possible execution scenario: 2 Node Managers (capacity 2 containers) no other running applications 8 input splits 17 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 30. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: MapTask Execution timeline: 18 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 31. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: MapTask – Init 1 create a context (TaskAttemptContext) 2 create an instance of the user Mapper class 3 setup input (InputFormat, InputSplit, RecordReader) 4 setup output (NewOutputCollector) 5 create a mapper context (MapContext, Mapper.Context) 6 initialize input, e.g.: create a SplitLineReader obj create an HdfsDataInputStream obj 19 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 32. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: MapTask – Execution Mapper.Context.nextKeyValue() will load data from the input Mapper.Context.write() will write the output to a circular buffer 20 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
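A minimal sketch of a user Mapper driving this loop, in the WordCount style assumed by the driver example above (the class name is illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Mapper.run() pulls <offset, line> pairs via Context.nextKeyValue()
        // and invokes map() once per pair
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE); // lands in the circular buffer (MapOutputBuffer)
        }
    }
}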
  • 33. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: MapTask – Spilling Mapper.Context.write() writes to a MapOutputBuffer of size mapreduce.task.io.sort.mb (100MB). When it is mapreduce.map.sort.spill.percent (80%) full, a parallel spilling phase is started. If the circular buffer is 100% full, then map() is blocked! 21 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 34. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: MapTask – Spilling (2) 1 create a SpillRecord & create a FSOutputStream (local fs) 2 in-memory sort the chunk of the buffer (quicksort): sort by <partitionIdx, key> 3 divide into partitions: 1 partition for each reducer (mapreduce.job.reduces) write the partitions into the output file 22 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 35. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: MapTask – Spilling (partitioning) How do we partition the <key, value> tuples? During a Mapper.Context.write(): partitionIdx = (key.hashCode() & Integer.MAX_VALUE) % numReducers It is stored as metadata of the tuple in the circular buffer. Use mapreduce.job.partitioner.class for a custom partitioner. 23 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
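A sketch of a custom Partitioner that simply reproduces the default HashPartitioner logic shown above (the class name is illustrative); it can be registered with job.setPartitionerClass(WordPartitioner.class):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReducers) {
        // & Integer.MAX_VALUE clears the sign bit, so the index is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}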
  • 36. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: MapTask – Spilling (combine) If the user specifies a combiner then, before writing the tuples to the file, we apply it to the tuples of a partition: 1 create an instance of the user Reducer class 2 create a Reducer.Context: output on the local fs file 3 execute Reduce.run(): see Reduce Task slides The combiner typically uses the same implementation as the reduce() function and thus can be seen as a local reducer (safe only when the reduce function is commutative and associative). 24 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 37. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: MapTask – Spilling (end of execution) At the end of the execution of the Mapper.run(): 1 sort and spill the remaining unspilled tuples 2 start the shuffle phase 25 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 38. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: MapTask – Shuffle Spill files need to be merged: this is done by a k-way merge, where k is equal to mapreduce.task.io.sort.factor (100). These are the intermediate output files of only one MapTask! 26 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 39. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Map Phase: Execution Overview Possible execution scenario: 2 Node Managers (capacity 2 containers) no other running applications 8 input splits The Node Managers locally store the map outputs (reduce inputs). 27 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 40. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Reduce Phase: Reduce Task Launch The MRAppMaster waits until mapreduce.job.reduce.slowstart.completedmaps (5%) of the MapTasks are completed. Then (periodically executed): if all maps have a container assigned, then all (remaining) reducers are scheduled otherwise it checks the percentage of completed maps: check the available cluster resources for the app check the resources needed for unassigned and rescheduled maps ramp down (unschedule/kill) or ramp up (schedule) reduce tasks When a reduce task is scheduled, a container request is made. This does NOT exploit data locality. A MapTask request has a higher priority than a ReduceTask request. 28 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 41. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Reduce Phase: Execution Overview Possible execution scenario: 2 Node Managers (capacity 2 containers each) no other running applications 4 reducers (mapreduce.job.reduces, default: 1) 29 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 42. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Reduce Phase: Reduce Task Execution timeline: 30 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 43. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Reduce Phase: Reduce Task – Init 1 init a codec (if map outputs are compressed) 2 create an instance of the combine output collector (if needed) 3 create an instance of the shuffle plugin (mapreduce.job.reduce.shuffle.consumer.plugin.class, default: org.apache.hadoop.mapreduce.task.reduce.Shuffle.class) 4 create a shuffle context (ShuffleConsumerPlugin.Context) 31 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 44. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Reduce Phase: Reduce Task – Shuffle The shuffle has two steps: 1 fetch map outputs from Node Managers 2 merge them 32 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 45. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Reduce Phase: Reduce Task – Shuffle (fetch) Several parallel fetchers are started (up to mapreduce.reduce.shuffle.parallelcopies, default: 5). Each fetcher collects map outputs from one NM (possibly many containers). For each map output: if the output size is less than 25% of the NM memory, then create an in-memory output (wait until enough memory is available) otherwise create an on-disk output 33 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 46. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Reduce Phase: Reduce Task – Shuffle (fetch) (2) Fetch the outputs over HTTP and add them to the related merge queue. A Reduce Task may start before the end of the Map Phase, thus it can fetch only from completed map tasks. The fetch process is repeated periodically. 34 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 47. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Reduce Phase: Reduce Task – Shuffle (in memory merge) The in memory merger: 1 performs a k-way merge 2 runs the combiner (if needed) 3 writes the result to an On-Disk Map Output and queues it 35 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 48. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Reduce Phase: Reduce Task – Shuffle (on disk merge) Extract from the queue, k-way merge and queue the result: Stop when all files have been merged together: the final merge provides a RawKeyValueIterator instance (the input of the reducer). 36 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 49. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Reduce Phase: Reduce Task – Execution (init) 1 create a context (TaskAttemptContext) 2 create an instance of the user Reducer class 3 setup output (RecordWriter, TextOutputFormat) 4 create a reducer context (Reducer.Context) 37 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 50. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Reduce Phase: Reduce Task – Execution (run) The output is typically written to an HDFS file. 38 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
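A minimal sketch of a user Reducer for this run step, the WordCount-style counterpart of the earlier Mapper example: Reducer.run() walks the RawKeyValueIterator produced by the final merge and calls reduce() once per distinct key:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) { // all values fetched and merged for this key
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result); // RecordWriter -> TextOutputFormat -> HDFS
    }
}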
  • 51. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Reduce Phase: Execution Overview 39 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 52. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra MapReduce: Application MR Job Possible execution timeline: That’s it! 40 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 53. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra MapReduce: Task Progress A MapTask has two phases: Map (66%): progress due to perc. of processed input Sort (33%): 1 subphase for each reducer subphase progress due to perc. of merged bytes A ReduceTask has three phases: Copy (33%): progress due to perc. of fetched input Sort (33%): progress due to processed bytes in final merge Reduce (33%): progress due to perc. of processed input 41 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 54. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra MapReduce: Speculation MRAppMaster may launch speculative tasks:
est = (ts - start) / MAX(0.0001, Status.progress())
estEndTime = start + est
estReplacementEndTime = now() + TaskDurations.mean()
if estEndTime < now():
    return PROGRESS_IS_GOOD
elif estReplacementEndTime >= estEndTime:
    return TOO_LATE_TO_SPECULATE
else:
    return estEndTime - estReplacementEndTime  // score
Launch a replica of the task with the highest score. 42 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
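A worked example: a task that started 100s ago and reports 40% progress has est = 100 / 0.4 = 250s, so estEndTime = start + 250s, i.e. 150s from now. If the mean task duration is 60s, then estReplacementEndTime = now() + 60s < estEndTime, so the score is 150 - 60 = 90s and the task is a candidate for speculation.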
  • 55. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra MapReduce: Application Status The status of a MR job is tracked by the MRAppMaster using several Finite State Machines: Job: 14 states, 80 transitions, 19 events Task: 14 states, 36 transitions, 9 events Task Attempt: 13 states, 60 transitions, 17 events A job is composed of several tasks. Each task may have several task attempts. Each task attempt is executed in a container. Instead, a Node Manager maintains the states of: Application: 7 states, 21 transitions, 9 events Container: 11 states, 46 transitions, 12 events 43 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 56. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra MapReduce: Job FSM (example) 44 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 57. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Configuration Parameters (recap)
Parameter: Meaning
mapreduce.framework.name: The runtime framework for executing MapReduce jobs. Set to YARN.
mapreduce.job.reduces: Number of reduce tasks. Default: 1
dfs.blocksize: HDFS block size. Default: 128MB
yarn.resourcemanager.scheduler.class: Scheduler class. Default: CapacityScheduler
yarn.nodemanager.resource.memory-mb: Memory available on a NM for containers. Default: 8192
yarn.scheduler.minimum-allocation-mb: Min allocation for every container request. Default: 1024
mapreduce.map.memory.mb: Memory request for a MapTask. Default: 1024
mapreduce.reduce.memory.mb: Memory request for a ReduceTask. Default: 1024
45 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 58. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Configuration Parameters (recap) (2)
Parameter: Meaning
mapreduce.task.io.sort.mb: Size of the circular buffer (map output). Default: 100MB
mapreduce.map.sort.spill.percent: Circular buffer soft limit. Once reached, start the spilling process. Default: 0.80
mapreduce.job.partitioner.class: The Partitioner class. Default: HashPartitioner.class
map.sort.class: The sort class for sorting keys. Default: org.apache.hadoop.util.QuickSort
mapreduce.reduce.shuffle.memory.limit.percent: Maximum percentage of the in-memory limit that a single shuffle can consume. Default: 0.25
mapreduce.reduce.shuffle.input.buffer.percent: The % of memory to be allocated from the maximum heap size to storing map outputs during the shuffle. Default: 0.70
46 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 59. Introduction YARN MapReduce Conclusion Map Phase Reduce Phase Extra Configuration Parameters (recap) (3)
Parameter: Meaning
mapreduce.reduce.shuffle.merge.percent: The usage % at which an in-memory merge will be initiated. Default: 0.66
mapreduce.map.combine.minspills: Apply combine only if there are at least this number of spill files. Default: 3
mapreduce.task.io.sort.factor: The number of streams to merge at once while sorting files. Default: 100 (10)
mapreduce.job.reduce.slowstart.completedmaps: Fraction of the maps in the job which should be complete before reduces are scheduled. Default: 0.05
mapreduce.reduce.shuffle.parallelcopies: Number of parallel transfers run by reduce during the shuffle (fetch) phase. Default: 5
mapreduce.reduce.memory.totalbytes: Memory of a NM. Default: Runtime.maxMemory()
47 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
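To make the recap concrete, a hedged sketch of overriding a few of these parameters programmatically; the values are illustrative, not recommendations, and anything left unset falls back to the *-default.xml / *-site.xml values:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TunedJobConfig {
    public static Job buildJob() throws Exception {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");
        conf.setInt("mapreduce.job.reduces", 4);                              // 4 ReduceTasks
        conf.setInt("mapreduce.task.io.sort.mb", 200);                        // bigger circular buffer
        conf.setFloat("mapreduce.map.sort.spill.percent", 0.90f);             // spill later
        conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.50f); // start reducers later
        return Job.getInstance(conf, "tuned job");
    }
}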
  • 60. Introduction YARN MapReduce Conclusion Hadoop: a bad angel Writing a MapReduce program is relatively easy. On the other hand, writing an efficient MapReduce program is hard: many configuration parameters: YARN: 115 parameters MapReduce: 195 parameters HDFS: 173 parameters core: 145 parameters lack of control over the execution: how to debug? many implementation details: what is happening? 48 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 61. Introduction YARN MapReduce Conclusion How can we help the user? 49 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 62. Introduction YARN MapReduce Conclusion How can we help the user? We need profilers! 49 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 63. Introduction YARN MapReduce Conclusion How can we help the user? We need profilers! My current research is focused on this goal. 49 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)
  • 64. Introduction YARN MapReduce Conclusion References My Personal Page: sites.google.com/a/di.uniroma1.it/coppa/ Hadoop Internals: ercoppa.github.io/HadoopInternals/ 50 / 50 Emilio Coppa Hadoop Internals (2.3.0 or later)