Map Reduce
Muhammad Usman Shahid, Software Engineer
Usman.shahid.st@hotmail.com
10/17/2011
Parallel Programming
Parallel programming is used for performance and efficiency. Processing is broken up into parts that are executed concurrently, with the instructions of each part running on a separate CPU among many connected processors. Identifying the set of tasks that can run concurrently is the important step. The Fibonacci recurrence F_{k+2} = F_k + F_{k+1} is a counterexample: computed this way it cannot be parallelized, because each value depends on the previous two. Now consider instead a huge array that can be broken up into sub-arrays.
Parallel Programming
If each element requires some processing, with no dependencies in the computation, we have an ideal parallel computing opportunity.
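The sub-array idea can be sketched in a few lines of Python. This is only an illustration under assumptions: process_element is a hypothetical stand-in for the per-element work, and a thread pool is used for brevity (for CPU-bound work in CPython a ProcessPoolExecutor would usually be the better fit).

```python
from concurrent.futures import ThreadPoolExecutor

def process_element(x):
    # Hypothetical per-element work; it depends only on x.
    return x * x

def process_chunk(chunk):
    # Each sub-array is processed independently of the others.
    return [process_element(x) for x in chunk]

def split(data, parts):
    # Cut data into at most `parts` contiguous sub-arrays of near-equal size.
    size = max(1, (len(data) + parts - 1) // parts)
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_process(data, parts=4):
    chunks = split(data, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        # map() returns results in the same order as the chunks.
        results = list(pool.map(process_chunk, chunks))
    return [y for chunk in results for y in chunk]

print(parallel_process(list(range(10))))
# [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Because no element depends on another, the chunks can be handed to workers in any order and the results simply concatenated.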
Google Data Center
Google's philosophy is to buy inexpensive computers, but in large numbers, and its data centers are built around parallel processing. Map Reduce is a parallel and distributed approach developed by Google for processing large data sets.
Map Reduce Introduction
Map Reduce has two key components: Map and Reduce. The Map function is applied to the input values to compute a set of intermediate key/value pairs. Reduce aggregates the values that share a key, typically into a single value per key.
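A minimal single-machine sketch of the two components, written in plain Python rather than with any Map Reduce library. The names map_reduce, map_fn and reduce_fn, and the parity example, are illustrative assumptions.

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: every input yields zero or more (key, value) pairs.
    groups = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def parity_map(n):
    # Hypothetical map function: key each number by its parity.
    yield ("even" if n % 2 == 0 else "odd", n)

def sum_reduce(key, values):
    return sum(values)

print(map_reduce([1, 2, 3, 4, 5], parity_map, sum_reduce))
# {'odd': 9, 'even': 6}
```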
Data Distribution
Input files are split into M pieces on a distributed file system. Intermediate files created by the map tasks are written to local disks. Output files are written to the distributed file system.
Data Distribution (figure not reproduced)
Map Reduce Function
Map Reduce can be understood through an example query: “Select Sum(stuMarks) from student group by studentSection”. In this query the “Select” phase plays the role of Map, and the “Group By” phase, together with the Sum aggregate, plays the role of the Reduce phase.
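The query analogy can be made concrete with a small Python sketch. The rows below are invented sample data, and map_row and reduce_section are hypothetical helper names; only the column names come from the query.

```python
from collections import defaultdict

# Hypothetical rows of the student table.
students = [
    {"studentSection": "A", "stuMarks": 70},
    {"studentSection": "B", "stuMarks": 55},
    {"studentSection": "A", "stuMarks": 85},
]

def map_row(row):
    # "Select": pick out the grouping key and the value to aggregate.
    return (row["studentSection"], row["stuMarks"])

def reduce_section(section, marks):
    # "Group By" with Sum: aggregate all marks of one section.
    return sum(marks)

grouped = defaultdict(list)
for row in students:
    section, marks = map_row(row)
    grouped[section].append(marks)

totals = {s: reduce_section(s, m) for s, m in grouped.items()}
print(totals)  # {'A': 155, 'B': 55}
```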
Classical Example
The classical example of Map Reduce is log file analysis. Big log files are split, and the mappers search them for the web pages that were accessed. Every time a web page is found in the log, a key/value pair is emitted to the reducer with key = web page and value = 1. The reducer sums the values for each web page. The result is the total hit count for each web page.
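A sketch of this log analysis in plain Python. The log format (client address followed by page path) and the sample lines are assumptions made for the illustration.

```python
from collections import defaultdict

# Two hypothetical splits of one big log file.
log_split_1 = ["10.0.0.1 /index.html", "10.0.0.2 /about.html"]
log_split_2 = ["10.0.0.3 /index.html", "10.0.0.1 /index.html"]

def map_log(lines):
    # Emit (page, 1) for every access found in the split.
    for line in lines:
        _client, page = line.split()
        yield (page, 1)

def reduce_hits(page, counts):
    # Sum the 1s emitted for one page.
    return (page, sum(counts))

intermediate = defaultdict(list)
for split in (log_split_1, log_split_2):
    for page, one in map_log(split):
        intermediate[page].append(one)

hits = dict(reduce_hits(p, c) for p, c in intermediate.items())
print(hits)  # {'/index.html': 3, '/about.html': 1}
```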
Reverse Web Link Graph
In this example the Map function outputs a (target URL, source) pair for each link to a target URL found in an input web page (source). The Reduce function concatenates the list of all source URLs associated with a given target URL and returns (target, list(source)).
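The same pattern in a short Python sketch. The three pages and their links are made-up data, and sorting the source list is a choice made here only to keep the output deterministic.

```python
from collections import defaultdict

# Hypothetical crawl result: source page -> pages it links to.
pages = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
}

def map_links(source, targets):
    # Emit (target, source) for every outgoing link.
    for target in targets:
        yield (target, source)

def reduce_sources(target, sources):
    # Collect every source that links to this target.
    return (target, sorted(sources))

by_target = defaultdict(list)
for source, targets in pages.items():
    for target, src in map_links(source, targets):
        by_target[target].append(src)

reverse_graph = dict(reduce_sources(t, s) for t, s in by_target.items())
print(reverse_graph["c.example"])  # ['a.example', 'b.example']
```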
Other Examples
Map Reduce can be applied to a wide range of problems. For example, Google has used Map Reduce to calculate page ranks. Counting words in a large set of documents can also be done efficiently with Map Reduce. Google's Map Reduce library is not open source, but Hadoop, an implementation written in Java, is.
Implementation of Example
Word Count is a simple application that counts the number of occurrences of each word in a given set of inputs. The Hadoop library is used for its implementation. The code is given in the attached file.
Usage of Implementation
For example, the input files are:
$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World Bye World
$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop Goodbye Hadoop
Run the application. Word Count is a straightforward problem.
Walk Through Implementation
The Mapper implementation (lines 14-26), via the map method (lines 18-25), processes one line at a time, as provided by the specified TextInputFormat (line 49). It then splits the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair of < <word>, 1>.
For the given sample input the first map emits:
< Hello, 1> < World, 1> < Bye, 1> < World, 1>
The second map emits:
< Hello, 1> < Hadoop, 1> < Goodbye, 1> < Hadoop, 1>
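The map step can be mimicked in plain Python. This is not the Hadoop Mapper from the attached file: map_line is a hypothetical name, and str.split() stands in for Java's StringTokenizer.

```python
def map_line(line):
    # Split on whitespace and emit a (word, 1) pair per token.
    return [(word, 1) for word in line.split()]

print(map_line("Hello World Bye World"))
# [('Hello', 1), ('World', 1), ('Bye', 1), ('World', 1)]
print(map_line("Hello Hadoop Goodbye Hadoop"))
# [('Hello', 1), ('Hadoop', 1), ('Goodbye', 1), ('Hadoop', 1)]
```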
Walk Through Implementation
WordCount also specifies a combiner (line 46). Hence, the output of each map is passed through the local combiner (which is the same as the Reducer, as per the job configuration) for local aggregation, after being sorted on the keys.
The output of the first map:
< Bye, 1> < Hello, 1> < World, 2>
The output of the second map:
< Goodbye, 1> < Hadoop, 2> < Hello, 1>
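A rough Python equivalent of the local combining step, assuming the map output is available as a list of (word, 1) pairs. The function name combine is hypothetical, and this sketch does not use the Hadoop combiner API.

```python
def combine(pairs):
    # Sum the counts per word within one map's output, then sort by key.
    totals = {}
    for word, count in pairs:
        totals[word] = totals.get(word, 0) + count
    return sorted(totals.items())

first_map = [("Hello", 1), ("World", 1), ("Bye", 1), ("World", 1)]
second_map = [("Hello", 1), ("Hadoop", 1), ("Goodbye", 1), ("Hadoop", 1)]

print(combine(first_map))   # [('Bye', 1), ('Hello', 1), ('World', 2)]
print(combine(second_map))  # [('Goodbye', 1), ('Hadoop', 2), ('Hello', 1)]
```

Combining locally means fewer pairs have to travel from each map to the reducer; it is valid here because addition can be applied in any grouping.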
Walk Through Implementation
The Reducer implementation (lines 28-36), via the reduce method (lines 29-35), just sums up the values, which are the occurrence counts for each key (i.e. words in this example).
Thus the output of the job is:
< Bye, 1> < Goodbye, 1> < Hadoop, 2> < Hello, 2> < World, 2>
The run method specifies various facets of the job, such as the input/output paths (passed via the command line), key/value types, input/output formats etc., in the JobConf. It then calls JobClient.runJob (line 55) to submit the job and monitor its progress.
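The final reduce can be sketched the same way. The two input lists are the combined map outputs shown in the walk-through; reduce_counts and the grouping loop are illustrative plain Python, not the Hadoop Reducer.

```python
from collections import defaultdict

def reduce_counts(word, counts):
    # Sum the partial counts that arrived for one word.
    return sum(counts)

combined_first = [("Bye", 1), ("Hello", 1), ("World", 2)]
combined_second = [("Goodbye", 1), ("Hadoop", 2), ("Hello", 1)]

# Group the partial counts by word across both map outputs.
grouped = defaultdict(list)
for word, count in combined_first + combined_second:
    grouped[word].append(count)

job_output = [(w, reduce_counts(w, c)) for w, c in sorted(grouped.items())]
print(job_output)
# [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]
```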
Execution Overview (figure not reproduced)
Map Reduce Execution
The Map Reduce library in the user program first splits the input files into M pieces. It then starts up many copies of the program on a cluster of machines. One of the copies is special: the master. The others are workers. There are M map tasks and R reduce tasks to assign. The master picks idle workers and assigns each one a map task or a reduce task. A worker that is assigned a map task reads the contents of the corresponding input split. It parses key/value pairs out of the input and passes each pair to the user-defined Map function, which generates intermediate key/value pairs that are buffered in memory. Periodically, the buffered pairs are written to local disk. The locations of these pairs on local disk are passed back to the master, which is responsible for forwarding them to the reduce workers.
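The master's task assignment can be illustrated with a toy scheduler. This is a heavily simplified sketch: it assumes every worker becomes idle again at the start of each round, ignores failures and data locality, and the name assign_tasks and the worker labels are invented for the example.

```python
from collections import deque

def assign_tasks(num_map, num_reduce, workers):
    # Queue the M map tasks first, then the R reduce tasks.
    pending = deque(
        [("map", i) for i in range(num_map)]
        + [("reduce", j) for j in range(num_reduce)]
    )
    schedule = []  # one dict per round: worker -> assigned task
    while pending:
        round_assignment = {}
        for worker in workers:  # all workers are idle at the start of a round
            if not pending:
                break
            round_assignment[worker] = pending.popleft()
        schedule.append(round_assignment)
    return schedule

for round_assignment in assign_tasks(4, 2, ["w1", "w2", "w3"]):
    print(round_assignment)
# {'w1': ('map', 0), 'w2': ('map', 1), 'w3': ('map', 2)}
# {'w1': ('map', 3), 'w2': ('reduce', 0), 'w3': ('reduce', 1)}
```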
Map Reduce Execution
When the master notifies a reduce worker about these locations, the worker uses remote procedure calls (RPC) to read the buffered data from the local disks of the map workers, then sorts it by key. The reduce worker iterates over the sorted intermediate data and, for each unique key, passes the key and the corresponding values to the Reduce function. The output is appended to the final output file. The library handles many associated issues, such as parallelization, fault tolerance, data distribution, and load balancing.
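The sort-then-iterate behaviour of a reduce worker can be sketched with itertools.groupby. The remote read is left out: fetched_pairs is assumed to already hold the pairs fetched from the map workers, and run_reduce is a hypothetical name.

```python
from itertools import groupby
from operator import itemgetter

def run_reduce(fetched_pairs, reduce_fn):
    # Sort by key so that equal keys become adjacent.
    ordered = sorted(fetched_pairs, key=itemgetter(0))
    output = []
    # Call the reduce function once per unique key.
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        output.append((key, reduce_fn(key, values)))
    return output

fetched_pairs = [("World", 2), ("Hello", 1), ("Hello", 1), ("Bye", 1)]
print(run_reduce(fetched_pairs, lambda key, values: sum(values)))
# [('Bye', 1), ('Hello', 2), ('World', 2)]
```

Sorting first is what lets the worker see all values of a key together in a single pass.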
Debugging
Map Reduce offers human-readable status information through an HTTP server, where the user can see which jobs are in progress, completed, and so on. It also allows the use of GDB and other debugging tools.
Conclusions
Map Reduce simplifies large-scale computations that fit its model. It allows the user to focus on the problem without worrying about the details. It is used by well-known companies such as Google and Yahoo. Google's library for Map Reduce is not open source, but Hadoop, an Apache project, is an open source library for Map Reduce.

Map reduce and Hadoop on windows
