Map-Reduce Programming with Hadoop
CS5225 Parallel and Concurrent Programming
Dilum Bandara
Dilum.Bandara@uom.lk
Some slides adapted from Dr. Srinath Perera
HDFS
• HDFS – Hadoop Distributed File System
• The distributed file system that ships with Hadoop
• Based on ideas presented in the "The Google File
  System" paper
• Highly scalable file system for handling large
  data
2
HDFS Architecture
3
HDFS Architecture (Cont.)
• HDFS has a master-slave architecture
• Name Node – master node
  ◦ Manages the file system namespace
  ◦ Regulates access to files by clients
• Data Node – slave node
  ◦ Manages the storage attached to its node
  ◦ Serves read & write requests from file system
    clients
  ◦ Performs block creation, deletion, & replication upon
    instruction from the Name Node (see the client-side sketch below)
4
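This division of labour shows up directly in the client API: metadata queries are answered by the Name Node, while the file bytes themselves stream from Data Nodes. A minimal sketch (assuming an already-opened FileSystem handle fs, a hypothetical file path, & a surrounding method that throws IOException):

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

// Ask the Name Node which Data Nodes hold each block of a file
FileStatus status = fs.getFileStatus(new Path("/data/input.txt"));
BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
for (BlockLocation b : blocks) {
  // Each block lists the hosts (Data Nodes) storing a replica
  System.out.println(b.getOffset() + " -> "
      + java.util.Arrays.toString(b.getHosts()));
}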
HDFS Architecture (Cont.)
5
HDFS in Production
• Yahoo! Search Webmap is a Hadoop application
  ◦ Webmap starts with every webpage crawled by Yahoo!
    & produces a database of all known web pages
  ◦ This derived data feeds into Machine-Learned Ranking
    algorithms
• Runs on 10,000+ core Linux clusters & produces
  data that is used in every Yahoo! Web search
  query
  ◦ 1 trillion links
  ◦ Produces over 300 TB of output, compressed!
  ◦ Over 5 Petabytes of raw disk used in the production cluster
6
HDFS Java Client
// Load the cluster configuration from the Hadoop config files
Configuration conf = new Configuration(false);
conf.addResource(new Path("/works/fsaas/hadoop-0.20.2/conf/core-site.xml"));
conf.addResource(new Path("/works/fsaas/hadoop-0.20.2/conf/hdfs-site.xml"));
FileSystem fs = FileSystem.get(conf);
// filename is the path of the file to create (a String)
Path filenamePath = new Path(filename);
if (fs.exists(filenamePath)) {
  // Remove the file first
  fs.delete(filenamePath, false);
}
// Write the current system time to the file
long currentSystemTime = System.currentTimeMillis();
FSDataOutputStream out = fs.create(filenamePath);
out.writeUTF(String.valueOf(currentSystemTime));
out.close();
// Read it back
FSDataInputStream in = fs.open(filenamePath);
String messageIn = in.readUTF();
System.out.print(messageIn);
in.close();
// Print length, file/directory counts, & quota of the path
System.out.println(fs.getContentSummary(filenamePath).toString());
7
Install Hadoop
• 3 different options:
1. Local (standalone)
  ◦ Everything runs in one JVM
  ◦ Just unzip & run
2. Pseudo distributed
  ◦ All daemons on one machine, each in its own JVM,
    mimicking a distributed installation
3. Distributed installation
8
More General Map/Reduce
• Typical Map-Reduce implementations are a bit
  more general, with five pluggable steps:
1. Formatters
  ◦ Parse the input into (key, value) records
2. Partition Function
  ◦ Distributes map output across the reduce function
    instances
3. Map Function
4. Combine Function
  ◦ When there are many map tasks, this step combines
    their results locally before handing them to Reduce
5. Reduce Function
9
Example – Word Count
• Find the words in a collection of documents & their
  frequency of occurrence

Map(docId, text):
    for all terms t in text
        emit(t, 1);

Reduce(t, values[]):
    int sum = 0;
    for all values v
        sum += v;
    emit(t, sum);
10
Example – Mean
• Compute the mean of the values associated with each key
• Note: unlike Word Count, this reducer cannot be reused
  as a combiner – the mean of partial means is not the
  overall mean (see the sketch after this example)

Map(k, value):
    emit(k, value);

Reduce(k, values[]):
    int sum = 0;
    int count = 0;
    for all values v
        sum += v;
        count += 1;
    emit(k, sum/count);
11
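To see why in plain Java – a sketch, not from the slides; SumCount is a hypothetical helper type (records need Java 16+). Partial aggregates must carry (sum, count) pairs so they merge associatively:

// Sketch: combiner-safe mean via (sum, count) pairs
record SumCount(long sum, long count) {
    static SumCount of(long v) { return new SumCount(v, 1); }
    SumCount merge(SumCount o) {                    // combine step: associative
        return new SumCount(sum + o.sum, count + o.count);
    }
    double mean() { return (double) sum / count; }  // final reduce step
}

class MeanDemo {
    public static void main(String[] args) {
        SumCount part1 = SumCount.of(2).merge(SumCount.of(4));  // values 2, 4
        SumCount part2 = SumCount.of(9);                        // value 9
        System.out.println(part1.merge(part2).mean());  // prints 5.0 – correct
        // Averaging the partial means instead gives (3.0 + 9.0) / 2 = 6.0
    }
}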
Example – Sorting
• How do we sort an array of 1 million integers using
  Map-Reduce?
• Partial sorts at the mappers & a final sort by the reducers
• Use a locality-preserving (order-preserving) hash function
  to assign keys to reducers (see the partitioner sketch
  after this example)
  ◦ If k1 < k2 then hash(k1) < hash(k2)

Map(k, v):
    int val = read value from v
    emit(val, val);

Reduce(k, values[]):
    emit(k, k);
12
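In Hadoop, this order-preserving assignment is exactly what a custom Partitioner provides (Hadoop also ships a TotalOrderPartitioner for this purpose). A minimal sketch, assuming the keys are known to lie in [0, 1,000,000):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch: a range partitioner so keys in partition i are no larger than keys
// in partition i+1; concatenating the sorted reducer outputs then yields a
// globally sorted result. MAX is an assumed known upper bound on the keys.
public class RangePartitioner extends Partitioner<IntWritable, IntWritable> {
  private static final long MAX = 1_000_000L;

  @Override
  public int getPartition(IntWritable key, IntWritable value, int numPartitions) {
    // Split [0, MAX) into numPartitions contiguous, ordered ranges
    int p = (int) ((long) key.get() * numPartitions / MAX);
    return Math.min(Math.max(p, 0), numPartitions - 1);
  }
}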
Example – Inverted Index
• A normal index is a mapping from documents to terms
• An inverted index is a mapping from terms to documents
• If we have a million documents, how do we build an
  inverted index using Map-Reduce?

Map(docId, text):
    for all words w in text
        emit(w, docId)

Reduce(w, docIds[]):
    emit(w, docIds[]);
13
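The same logic in Hadoop's newer API, as a minimal sketch (assumptions: an InputFormat that produces (docId, text) pairs, & the posting list emitted as comma-separated Text for simplicity):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndex {
  public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
    public void map(Text docId, Text text, Context context)
        throws IOException, InterruptedException {
      for (String w : text.toString().split("\\s+")) {
        context.write(new Text(w), docId);   // emit (word, docId)
      }
    }
  }

  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    public void reduce(Text word, Iterable<Text> docIds, Context context)
        throws IOException, InterruptedException {
      StringBuilder posting = new StringBuilder();
      for (Text id : docIds) {               // collect the posting list
        if (posting.length() > 0) posting.append(',');
        posting.append(id);
      }
      context.write(word, new Text(posting.toString()));
    }
  }
}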
Example – Distributed Grep
• Find all input records that match a given pattern

Map(k, v):
    docId = .. (read the file name)
    if (v matches the pattern)
        emit(k, (pattern, docId))

Reduce(k, values[]):
    emit(k, values);
14
Composition with Map-Reduce
• Map/Reduce is not a tool to be applied as a fixed
  template
• It should be composed with Fork/Join, etc., to build
  solutions
• A solution may have more than one Map/Reduce
  step
15
Composition with Map-Reduce – Example
• Calculate the following for a list of a million integers
  [formula shown on the original slide; omitted from transcript]
16
Map Reduce Client
public class WordCountSample {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException { ….. }
  }

  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException { .. }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCountSample.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    // Word count is associative, so the reducer can double as the combiner
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path("/input"));
    FileOutputFormat.setOutputPath(conf,
        new Path("/output/" + System.currentTimeMillis()));
    JobClient.runJob(conf);
  }
}
17
Example: http://wiki.apache.org/hadoop/WordCount
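The map() and reduce() bodies are elided above. A minimal sketch of what they typically contain, following the classic WordCount example linked above (old org.apache.hadoop.mapred API; uses the one & word fields declared in the Map class, plus java.util.StringTokenizer):

public void map(LongWritable key, Text value,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  // Tokenize the line & emit (term, 1) for every term
  StringTokenizer tokenizer = new StringTokenizer(value.toString());
  while (tokenizer.hasMoreTokens()) {
    word.set(tokenizer.nextToken());
    output.collect(word, one);
  }
}

public void reduce(Text key, Iterator<IntWritable> values,
    OutputCollector<Text, IntWritable> output, Reporter reporter)
    throws IOException {
  // Sum up the counts emitted for this term
  int sum = 0;
  while (values.hasNext()) {
    sum += values.next().get();
  }
  output.collect(key, new IntWritable(sum));
}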
Format to Parse Custom Data
// Add the following to the main method
Job job = new Job(conf, "LogProcessingHitsByLink");
….
job.setInputFormatClass(MboxFileFormat.class);
..
System.exit(job.waitForCompletion(true) ? 0 : 1);

// Write a formatter (an InputFormat)
public class MboxFileFormat extends FileInputFormat<Text, Text> {
  private MBoxFileReader boxFileReader = null;
  public RecordReader<Text, Text> createRecordReader(InputSplit inputSplit,
      TaskAttemptContext attempt) throws IOException, InterruptedException {
    boxFileReader = new MBoxFileReader();
    boxFileReader.initialize(inputSplit, attempt);
    return boxFileReader;
  }
}

// Write a reader; a RecordReader must also implement getCurrentKey(),
// getCurrentValue(), getProgress(), & close()
public class MBoxFileReader extends RecordReader<Text, Text> {
  public void initialize(InputSplit inputSplit, TaskAttemptContext attempt)
      throws IOException, InterruptedException { .. }
  public boolean nextKeyValue() throws IOException, InterruptedException { .. }
}
18
Your Own Partitioner
public class IPBasedPartitioner extends Partitioner<Text, IntWritable> {
  public int getPartition(Text ipAddress, IntWritable value, int numPartitions) {
    // getGeoLocation() maps an IP address to a region (helper, not shown)
    String region = getGeoLocation(ipAddress);
    if (region != null) {
      // Same region -> same partition; the mask keeps the index non-negative
      return (region.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
    return 0;
  }
}

Set the Partitioner class parameter in the Job object:
Job job = new Job(getConf(), "log-analysis");
……
job.setPartitionerClass(IPBasedPartitioner.class);
19
19
Using Distributed File Cache
 Give access to a static file from a Job
Job job = new Job(conf, "word count");
FileSystem fs = FileSystem.get(conf);
fs.copyFromLocalFile(new Path(scriptFileLocation),
new Path("/debug/fail-script"));
DistributedCache.addCacheFile(mapUri, conf);
DistributedCache.createSymlink(conf);
20
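On the task side, a minimal sketch (an assumption, not from the slides) of locating the cached file with the old org.apache.hadoop.mapred API, e.g. inside a Mapper's configure() method:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

// Called once per task; conf is the job configuration
public void configure(JobConf conf) {
  try {
    // Files added via addCacheFile() appear on the task's local disk
    Path[] cached = DistributedCache.getLocalCacheFiles(conf);
    BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
    // ... read the script/lookup data, then close
    in.close();
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
}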