SlideShare a Scribd company logo
1 of 23
MapReduce
Farzad Nozarian
4/11/15 @AUT
Purpose
This document describes how to set up and configure a single-node Hadoop
installation so that you can quickly perform simple operations using Hadoop
MapReduce
2
Supported Platforms
• GNU/Linux is supported as a development and production platform.
Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes.
• Windows is also a supported platform but the followings steps are for
Linux only.
3
Required Software
• Java™ must be installed. Recommended Java versions are described at
http://wiki.apache.org/hadoop/HadoopJavaVersions
• ssh must be installed and sshd must be running to use the Hadoop scripts
that manage remote Hadoop daemons.
• To get a Hadoop distribution, download a recent stable release from one
of the Apache Download Mirrors
$ sudo apt-get install ssh
$ sudo apt-get install rsync
4
Prepare to Start the Hadoop Cluster
• Unpack the downloaded Hadoop distribution. In the distribution, edit the
file etc/hadoop/hadoop-env.sh to define some parameters as follows:
• Try the following command:
This will display the usage documentation for the hadoop script.
# set to the root of your Java installation
export JAVA_HOME=/usr/lib/jvm/jdk1.7.0
# Assuming your installation directory is
/usr/local/hadoop
export HADOOP_PREFIX=/usr/local/hadoop
$ bin/hadoop
5
Prepare to Start the Hadoop Cluster (Cont.)
• Now you are ready to start your Hadoop cluster in one of the three
supported modes:
• Local (Standalone) Mode
• By default, Hadoop is configured to run in a non-distributed mode, as a single Java
process. This is useful for debugging.
• Pseudo-Distributed Mode
• Hadoop can also be run on a single-node in a pseudo-distributed mode where each
Hadoop daemon runs in a separate Java process.
• Fully-Distributed Mode
6
Pseudo-Distributed Configuration
• etc/hadoop/core-site.xml:
• etc/hadoop/hdfs-site.xml:
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
7
MapReduce Execution Pipeline
8
Main components of the MapReduce
execution pipeline
• Driver:
• The main program that initializes a MapReduce job.
• It defines job-specific configuration, and specifies all of its components:
• input and output formats
• mapper and reducer
• use of a combiner
• use of a custom partitioner
• The driver can also get back the status of the job execution.
9
Main components of the MapReduce
execution pipeline
• Context:
• The driver, mappers, and reducers are executed in different processes, typically
on multiple machines.
• A context object is available at any point of MapReduce execution.
• It provides a convenient mechanism for exchanging required system and job-
wide information.
10
Main components of the MapReduce
execution pipeline
• Input data:
• This is where the data for a MapReduce task is initially stored
• This data can reside in HDFS, HBase, or other storage.
• InputFormat:
• This defines how input data is read and split.
• InputFormat is a class that defines the InputSplits that break input data into
tasks.
• It provides a factory for RecordReader objects that read the file.
• Several InputFormats are provided by Hadoop
11
Main components of the MapReduce
execution pipeline
• InputSplit:
• An InputSplit defines a unit of work for a single map task in a MapReduce
program.
• The InputFormat (invoked directly by a job driver) defines the number of map
tasks that make up the mapping phase.
• Each map task is given a single InputSplit to work on
12
Main components of the MapReduce
execution pipeline
• RecordReader:
• Although the InputSplit defines a data subset for a map task, it does not
describe how to access the data.
• The RecordReader class actually reads the data from its source, converts it into
key/value pairs suitable for processing by the mapper, and delivers them to the
map method.
• The RecordReader class is defined by the InputFormat.
13
Main components of the MapReduce
execution pipeline
• Mapper:
• Performs the user-defined work of the first phase of the MapReduce program.
• It takes input data in the form of a series of key/value pairs (k1, v1), which are
used for individual map execution.
• The map typically transforms the input pair into an output pair (k2, v2), which is
used as an input for shuffle and sort.
14
Main components of the MapReduce
execution pipeline
• Partition:
• A subset of the intermediate key space (k2, v2) produced by each individual
mapper is assigned to each reducer.
• These subsets (or partitions) are the inputs to the reduce tasks.
• Each map task may emit key/value pairs to any partition.
• The Partitioner class determines which reducer a given key/value pair will go to.
• The default Partitioner computes a hash value for the key, and assigns the
partition based on this result.
15
Main components of the MapReduce
execution pipeline
• Shuffle:
• Once at least one map function for a given node is completed, and the keys’
space is partitioned, the run time begins moving the intermediate outputs
from the map tasks to where they are required by the reducers.
• This process of moving map outputs to the reducers is known as shuffling.
• Sort:
• The set of intermediate key/value pairs for a given reducer is automatically
sorted by Hadoop to form keys/values (k2, {v2, v2,…}) before they are presented
to the reducer.
16
Main components of the MapReduce
execution pipeline
• Reducer:
• A reducer is responsible for an execution of user-provided code for the second
phase of job-specific work.
• For each key assigned to a given reducer, the reducer’s reduce() method is called
once.
• This method receives a key, along with an iterator over all the values associated
with the key.
• The reducer typically transforms the input key/value pairs into
output pairs (k3, v3).
17
Main components of the MapReduce
execution pipeline
• OutputFormat:
• The responsibility of the OutputFormat is to define a location of the output
data and RecordWriter used for storing the resulting data.
• RecordWriter:
• A RecordWriter defines how individual output records are written.
18
Let’s try it with simple example!
Word Count
(the Hello World! for MapReduce, available in Hadoop sources)
We want to count the occurrences of every word
of a text file
19
Driver
20
public class WordCount {
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
…
Job job = new Job(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
for (int i = 0; i < otherArgs.length - 1; ++i) {
FileInputFormat.addInputPath(job, new Path(otherArgs[i]));
}
FileOutputFormat.setOutputPath(job, new Path(
otherArgs[otherArgs.length - 1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
} }
Mapper class
21
//inside WordCount class
public static class TokenizerMapper
extends Mapper<Object, Text, Text, IntWritable>{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context
) throws IOException, InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
Reducer class
22
//inside WordCount class
public static class IntSumReducer
extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values,
Context context
) throws IOException, InterruptedException {
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
References:
• hadoop.apache.org
• Professional Hadoop Solutions - Boris Lublinsky, Kevin T.
Smith, Alexey Yakubovich - WILEY
23

More Related Content

What's hot

Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component rebeccatho
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windowsMuhammad Shahid
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReducefvanvollenhoven
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterDataWorks Summit
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design PatternsDonald Miner
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 

What's hot (18)

Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 
Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component Introduction to Hadoop and Hadoop component
Introduction to Hadoop and Hadoop component
 
06 pig etl features
06 pig etl features06 pig etl features
06 pig etl features
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Map reduce and Hadoop on windows
Map reduce and Hadoop on windowsMap reduce and Hadoop on windows
Map reduce and Hadoop on windows
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
01 hbase
01 hbase01 hbase
01 hbase
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Hadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduceHadoop, HDFS and MapReduce
Hadoop, HDFS and MapReduce
 
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at TwitterHadoop Performance Optimization at Scale, Lessons Learned at Twitter
Hadoop Performance Optimization at Scale, Lessons Learned at Twitter
 
Hadoop - Introduction to mapreduce
Hadoop -  Introduction to mapreduceHadoop -  Introduction to mapreduce
Hadoop - Introduction to mapreduce
 
6.hive
6.hive6.hive
6.hive
 
Unit 1
Unit 1Unit 1
Unit 1
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
MapReduce Design Patterns
MapReduce Design PatternsMapReduce Design Patterns
MapReduce Design Patterns
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 

Similar to Apache Hadoop MapReduce Tutorial

Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigKhanKhaja1
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfBikalAdhikari4
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentationAhmad El Tawil
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map ReduceUrvashi Kataria
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and PipesHanborq Inc.
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersKumari Surabhi
 
A slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsA slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsKrishnaVeni451953
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advancedChirag Ahuja
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce scriptHaripritha
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionSubhas Kumar Ghosh
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programsjani shaik
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introductionDong Ngoc
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2Fabio Fumarola
 
Hadoop Map-Reduce from the subject: Big Data Analytics
Hadoop Map-Reduce from the subject: Big Data AnalyticsHadoop Map-Reduce from the subject: Big Data Analytics
Hadoop Map-Reduce from the subject: Big Data AnalyticsRUHULAMINHAZARIKA
 

Similar to Apache Hadoop MapReduce Tutorial (20)

Hadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pigHadoop eco system with mapreduce hive and pig
Hadoop eco system with mapreduce hive and pig
 
Introduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdfIntroduction to the Map-Reduce framework.pdf
Introduction to the Map-Reduce framework.pdf
 
Hadoop
HadoopHadoop
Hadoop
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Map reduce presentation
Map reduce presentationMap reduce presentation
Map reduce presentation
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Report Hadoop Map Reduce
Report Hadoop Map ReduceReport Hadoop Map Reduce
Report Hadoop Map Reduce
 
Hadoop MapReduce Streaming and Pipes
Hadoop MapReduce  Streaming and PipesHadoop MapReduce  Streaming and Pipes
Hadoop MapReduce Streaming and Pipes
 
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop ClustersA performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
A performance analysis of OpenStack Cloud vs Real System on Hadoop Clusters
 
A slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analyticsA slide share pig in CCS334 for big data analytics
A slide share pig in CCS334 for big data analytics
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
Mapreduce script
Mapreduce scriptMapreduce script
Mapreduce script
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
writing Hadoop Map Reduce programs
writing Hadoop Map Reduce programswriting Hadoop Map Reduce programs
writing Hadoop Map Reduce programs
 
Hadoop introduction
Hadoop introductionHadoop introduction
Hadoop introduction
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
11. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:211. From Hadoop to Spark 1:2
11. From Hadoop to Spark 1:2
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Hadoop Map-Reduce from the subject: Big Data Analytics
Hadoop Map-Reduce from the subject: Big Data AnalyticsHadoop Map-Reduce from the subject: Big Data Analytics
Hadoop Map-Reduce from the subject: Big Data Analytics
 

More from Farzad Nozarian

SHARE Interface in Flash Storage for Relational and NoSQL Databases
SHARE Interface in Flash Storage for Relational and NoSQL DatabasesSHARE Interface in Flash Storage for Relational and NoSQL Databases
SHARE Interface in Flash Storage for Relational and NoSQL DatabasesFarzad Nozarian
 
Ultimate Goals In Robotics
Ultimate Goals In RoboticsUltimate Goals In Robotics
Ultimate Goals In RoboticsFarzad Nozarian
 
Tank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engineTank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engineFarzad Nozarian
 
The Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring ModelThe Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring ModelFarzad Nozarian
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesFarzad Nozarian
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab AssignmentFarzad Nozarian
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentFarzad Nozarian
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud ComputingFarzad Nozarian
 
Big Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing EnvironmentsBig Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing EnvironmentsFarzad Nozarian
 
S4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformS4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformFarzad Nozarian
 

More from Farzad Nozarian (14)

SHARE Interface in Flash Storage for Relational and NoSQL Databases
SHARE Interface in Flash Storage for Relational and NoSQL DatabasesSHARE Interface in Flash Storage for Relational and NoSQL Databases
SHARE Interface in Flash Storage for Relational and NoSQL Databases
 
Object Based Databases
Object Based DatabasesObject Based Databases
Object Based Databases
 
Ultimate Goals In Robotics
Ultimate Goals In RoboticsUltimate Goals In Robotics
Ultimate Goals In Robotics
 
Tank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engineTank Battle - A simple game powered by JMonkey engine
Tank Battle - A simple game powered by JMonkey engine
 
The Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring ModelThe Continuous Distributed Monitoring Model
The Continuous Distributed Monitoring Model
 
Big data Clustering Algorithms And Strategies
Big data Clustering Algorithms And StrategiesBig data Clustering Algorithms And Strategies
Big data Clustering Algorithms And Strategies
 
Shark - Lab Assignment
Shark - Lab AssignmentShark - Lab Assignment
Shark - Lab Assignment
 
Apache HBase - Lab Assignment
Apache HBase - Lab AssignmentApache HBase - Lab Assignment
Apache HBase - Lab Assignment
 
Apache HDFS - Lab Assignment
Apache HDFS - Lab AssignmentApache HDFS - Lab Assignment
Apache HDFS - Lab Assignment
 
Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Apache Storm Tutorial
Apache Storm TutorialApache Storm Tutorial
Apache Storm Tutorial
 
Big Data and Cloud Computing
Big Data and Cloud ComputingBig Data and Cloud Computing
Big Data and Cloud Computing
 
Big Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing EnvironmentsBig Data Processing in Cloud Computing Environments
Big Data Processing in Cloud Computing Environments
 
S4: Distributed Stream Computing Platform
S4: Distributed Stream Computing PlatformS4: Distributed Stream Computing Platform
S4: Distributed Stream Computing Platform
 

Recently uploaded

WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is insideshinachiaurasa2
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfkalichargn70th171
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxAnnaArtyushina1
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...masabamasaba
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024VictoriaMetrics
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...masabamasaba
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...Jittipong Loespradit
 
tonesoftg
tonesoftgtonesoftg
tonesoftglanshi9
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplatePresentation.STUDIO
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastPapp Krisztián
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...masabamasaba
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...chiefasafspells
 

Recently uploaded (20)

WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 
The title is not connected to what is inside
The title is not connected to what is insideThe title is not connected to what is inside
The title is not connected to what is inside
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdfPayment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
Payment Gateway Testing Simplified_ A Step-by-Step Guide for Beginners.pdf
 
Artyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptxArtyushina_Guest lecture_YorkU CS May 2024.pptx
Artyushina_Guest lecture_YorkU CS May 2024.pptx
 
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
%+27788225528 love spells in new york Psychic Readings, Attraction spells,Bri...
 
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
Large-scale Logging Made Easy: Meetup at Deutsche Bank 2024
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
MarTech Trend 2024 Book : Marketing Technology Trends (2024 Edition) How Data...
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
AI & Machine Learning Presentation Template
AI & Machine Learning Presentation TemplateAI & Machine Learning Presentation Template
AI & Machine Learning Presentation Template
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
Love witchcraft +27768521739 Binding love spell in Sandy Springs, GA |psychic...
 

Apache Hadoop MapReduce Tutorial

  • 2. Purpose This document describes how to set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce 2
  • 3. Supported Platforms • GNU/Linux is supported as a development and production platform. Hadoop has been demonstrated on GNU/Linux clusters with 2000 nodes. • Windows is also a supported platform but the followings steps are for Linux only. 3
  • 4. Required Software • Java™ must be installed. Recommended Java versions are described at http://wiki.apache.org/hadoop/HadoopJavaVersions • ssh must be installed and sshd must be running to use the Hadoop scripts that manage remote Hadoop daemons. • To get a Hadoop distribution, download a recent stable release from one of the Apache Download Mirrors $ sudo apt-get install ssh $ sudo apt-get install rsync 4
  • 5. Prepare to Start the Hadoop Cluster • Unpack the downloaded Hadoop distribution. In the distribution, edit the file etc/hadoop/hadoop-env.sh to define some parameters as follows: • Try the following command: This will display the usage documentation for the hadoop script. # set to the root of your Java installation export JAVA_HOME=/usr/lib/jvm/jdk1.7.0 # Assuming your installation directory is /usr/local/hadoop export HADOOP_PREFIX=/usr/local/hadoop $ bin/hadoop 5
  • 6. Prepare to Start the Hadoop Cluster (Cont.) • Now you are ready to start your Hadoop cluster in one of the three supported modes: • Local (Standalone) Mode • By default, Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging. • Pseudo-Distributed Mode • Hadoop can also be run on a single-node in a pseudo-distributed mode where each Hadoop daemon runs in a separate Java process. • Fully-Distributed Mode 6
  • 7. Pseudo-Distributed Configuration • etc/hadoop/core-site.xml: • etc/hadoop/hdfs-site.xml: <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://localhost:9000</value> </property> </configuration> <configuration> <property> <name>dfs.replication</name> <value>1</value> </property> </configuration> 7
  • 9. Main components of the MapReduce execution pipeline • Driver: • The main program that initializes a MapReduce job. • It defines job-specific configuration, and specifies all of its components: • input and output formats • mapper and reducer • use of a combiner • use of a custom partitioner • The driver can also get back the status of the job execution. 9
  • 10. Main components of the MapReduce execution pipeline • Context: • The driver, mappers, and reducers are executed in different processes, typically on multiple machines. • A context object is available at any point of MapReduce execution. • It provides a convenient mechanism for exchanging required system and job- wide information. 10
  • 11. Main components of the MapReduce execution pipeline • Input data: • This is where the data for a MapReduce task is initially stored • This data can reside in HDFS, HBase, or other storage. • InputFormat: • This defines how input data is read and split. • InputFormat is a class that defines the InputSplits that break input data into tasks. • It provides a factory for RecordReader objects that read the file. • Several InputFormats are provided by Hadoop 11
  • 12. Main components of the MapReduce execution pipeline • InputSplit: • An InputSplit defines a unit of work for a single map task in a MapReduce program. • The InputFormat (invoked directly by a job driver) defines the number of map tasks that make up the mapping phase. • Each map task is given a single InputSplit to work on 12
  • 13. Main components of the MapReduce execution pipeline • RecordReader: • Although the InputSplit defines a data subset for a map task, it does not describe how to access the data. • The RecordReader class actually reads the data from its source, converts it into key/value pairs suitable for processing by the mapper, and delivers them to the map method. • The RecordReader class is defined by the InputFormat. 13
  • 14. Main components of the MapReduce execution pipeline • Mapper: • Performs the user-defined work of the first phase of the MapReduce program. • It takes input data in the form of a series of key/value pairs (k1, v1), which are used for individual map execution. • The map typically transforms the input pair into an output pair (k2, v2), which is used as an input for shuffle and sort. 14
  • 15. Main components of the MapReduce execution pipeline • Partition: • A subset of the intermediate key space (k2, v2) produced by each individual mapper is assigned to each reducer. • These subsets (or partitions) are the inputs to the reduce tasks. • Each map task may emit key/value pairs to any partition. • The Partitioner class determines which reducer a given key/value pair will go to. • The default Partitioner computes a hash value for the key, and assigns the partition based on this result. 15
  • 16. Main components of the MapReduce execution pipeline • Shuffle: • Once at least one map function for a given node is completed, and the keys’ space is partitioned, the run time begins moving the intermediate outputs from the map tasks to where they are required by the reducers. • This process of moving map outputs to the reducers is known as shuffling. • Sort: • The set of intermediate key/value pairs for a given reducer is automatically sorted by Hadoop to form keys/values (k2, {v2, v2,…}) before they are presented to the reducer. 16
  • 17. Main components of the MapReduce execution pipeline • Reducer: • A reducer is responsible for an execution of user-provided code for the second phase of job-specific work. • For each key assigned to a given reducer, the reducer’s reduce() method is called once. • This method receives a key, along with an iterator over all the values associated with the key. • The reducer typically transforms the input key/value pairs into output pairs (k3, v3). 17
  • 18. Main components of the MapReduce execution pipeline • OutputFormat: • The responsibility of the OutputFormat is to define a location of the output data and RecordWriter used for storing the resulting data. • RecordWriter: • A RecordWriter defines how individual output records are written. 18
  • 19. Let’s try it with simple example! Word Count (the Hello World! for MapReduce, available in Hadoop sources) We want to count the occurrences of every word of a text file 19
  • 20. Driver 20 public class WordCount { public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); … Job job = new Job(conf, "word count"); job.setJarByClass(WordCount.class); job.setMapperClass(TokenizerMapper.class); job.setCombinerClass(IntSumReducer.class); job.setReducerClass(IntSumReducer.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(IntWritable.class); for (int i = 0; i < otherArgs.length - 1; ++i) { FileInputFormat.addInputPath(job, new Path(otherArgs[i])); } FileOutputFormat.setOutputPath(job, new Path( otherArgs[otherArgs.length - 1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } }
  • 21. Mapper class 21 //inside WordCount class public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{ private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } }
  • 22. Reducer class 22 //inside WordCount class public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable(); public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } }
  • 23. References: • hadoop.apache.org • Professional Hadoop Solutions - Boris Lublinsky, Kevin T. Smith, Alexey Yakubovich - WILEY 23