SlideShare a Scribd company logo
1 of 10
Developing a MapReduce
Application
Dr. C.V. Suresh Babu
(CentreforKnowledgeTransfer)
institute
Phases of Developing a MapReduce
Application
• Configuration API
• Configuring the Development Environment
• GenericOptionsParser, Tool and ToolRunner
• Writing Unit Tests
• Running locally and in a cluster on Test Data
• The MapReduce Web UI
• Hadoop Logs
• Tuning a Job to improve performance
(CentreforKnowledgeTransfer)
institute
Stages 1:Developing a MapReduce
Application
• Writing a program in MapReduce follows a certain pattern.
• You start by writing your map and reduce functions, ideally with unit
tests to make sure they do what you expect.
• Then you write a driver program to run a job, which can run from your
IDE using a small subset of the data to check that it is working.
• If it fails, you can use your IDE’s debugger to find the source of the
problem.
• With this information, you can expand your unit tests to cover this case
and improve your mapper or reducer as appropriate to handle such input
correctly.
(CentreforKnowledgeTransfer)
institute
Stages 2:Developing a MapReduce
Application
• When the program runs as expected against the small dataset, you
are ready to unleash it on a cluster.
• Running against the full dataset is likely to expose some more
issues, which you can fix as before, by expanding your tests and
mapper or reducer to handle the new cases.
• Debugging failing programs in the cluster is a challenge, so we
look at some common techniques to make it easier.
(CentreforKnowledgeTransfer)
institute
Stage 3: Developing a MapReduce
Application
• After the program is working, you may wish to do some tuning,
first by running through some standard checks for making
MapReduce programs faster and then by doing task profiling.
• Profiling distributed programs is not easy, but Hadoop has hooks to
aid the process.
(CentreforKnowledgeTransfer)
institute
Example: Word Count
Task: Counting the word occurances (frequencies) in a text file (or set of files).
<word, count >as <key, value >pair
Mapper: Emits <word, 1 >for each word (no counting at this part).
Shuffle in between: pairs with same keys grouped together and passed to a
single machine.
Reducer: Sums up the values (1s) with the same key value
(CentreforKnowledgeTransfer)
institute
Example:
Word Count
(CentreforKnowledgeTransfer)
institute
Example:
Job Tracker
(CentreforKnowledgeTransfer)
institute
Example:
Job Tracker
(CentreforKnowledgeTransfer)
institute
Example:
Job Tracker
(CentreforKnowledgeTransfer)
institute

More Related Content

What's hot

Issues in knowledge representation
Issues in knowledge representationIssues in knowledge representation
Issues in knowledge representationSravanthi Emani
 
Challenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxChallenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxGovardhanV7
 
Introduction to Distributed System
Introduction to Distributed SystemIntroduction to Distributed System
Introduction to Distributed SystemSunita Sahu
 
Water jug problem ai part 6
Water jug problem ai part 6Water jug problem ai part 6
Water jug problem ai part 6Kirti Verma
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data miningSlideshare
 
Fault tolerance in distributed systems
Fault tolerance in distributed systemsFault tolerance in distributed systems
Fault tolerance in distributed systemssumitjain2013
 
lazy learners and other classication methods
lazy learners and other classication methodslazy learners and other classication methods
lazy learners and other classication methodsrajshreemuthiah
 
I. AO* SEARCH ALGORITHM
I. AO* SEARCH ALGORITHMI. AO* SEARCH ALGORITHM
I. AO* SEARCH ALGORITHMvikas dhakane
 
CS8791 Cloud Computing - Question Bank
CS8791 Cloud Computing - Question BankCS8791 Cloud Computing - Question Bank
CS8791 Cloud Computing - Question Bankpkaviya
 

What's hot (20)

Replication in Distributed Systems
Replication in Distributed SystemsReplication in Distributed Systems
Replication in Distributed Systems
 
Agent architectures
Agent architecturesAgent architectures
Agent architectures
 
Issues in knowledge representation
Issues in knowledge representationIssues in knowledge representation
Issues in knowledge representation
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Distributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query ProcessingDistributed DBMS - Unit 6 - Query Processing
Distributed DBMS - Unit 6 - Query Processing
 
Challenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptxChallenges of Conventional Systems.pptx
Challenges of Conventional Systems.pptx
 
Introduction to Distributed System
Introduction to Distributed SystemIntroduction to Distributed System
Introduction to Distributed System
 
Clustering in Data Mining
Clustering in Data MiningClustering in Data Mining
Clustering in Data Mining
 
Rdbms
RdbmsRdbms
Rdbms
 
Water jug problem ai part 6
Water jug problem ai part 6Water jug problem ai part 6
Water jug problem ai part 6
 
Predicate logic
 Predicate logic Predicate logic
Predicate logic
 
Major issues in data mining
Major issues in data miningMajor issues in data mining
Major issues in data mining
 
Hadoop
HadoopHadoop
Hadoop
 
Taxonomy for bugs
Taxonomy for bugsTaxonomy for bugs
Taxonomy for bugs
 
Fault tolerance in distributed systems
Fault tolerance in distributed systemsFault tolerance in distributed systems
Fault tolerance in distributed systems
 
lazy learners and other classication methods
lazy learners and other classication methodslazy learners and other classication methods
lazy learners and other classication methods
 
I. AO* SEARCH ALGORITHM
I. AO* SEARCH ALGORITHMI. AO* SEARCH ALGORITHM
I. AO* SEARCH ALGORITHM
 
Clustering
ClusteringClustering
Clustering
 
CS8791 Cloud Computing - Question Bank
CS8791 Cloud Computing - Question BankCS8791 Cloud Computing - Question Bank
CS8791 Cloud Computing - Question Bank
 
Soft computing
Soft computingSoft computing
Soft computing
 

Similar to Developing a Map Reduce Application

Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
Download It
Download ItDownload It
Download Itbutest
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online trainingHarika583
 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterEnhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterIRJET Journal
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabadsreehari orienit
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigDataThanusha154
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopGERARDO BARBERENA
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Rusif Eyvazli
 
Debugging of hadoop production errors using open source flow analyzer – semic...
Debugging of hadoop production errors using open source flow analyzer – semic...Debugging of hadoop production errors using open source flow analyzer – semic...
Debugging of hadoop production errors using open source flow analyzer – semic...Mahesh Nair
 
Mapreduce2008 cacm
Mapreduce2008 cacmMapreduce2008 cacm
Mapreduce2008 cacmlmphuong06
 
Optimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola PericOptimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola PericNik Peric
 
Hadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintHadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintijccsa
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfTSANKARARAO
 

Similar to Developing a Map Reduce Application (20)

Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Hadoop
HadoopHadoop
Hadoop
 
Download It
Download ItDownload It
Download It
 
Hadoop live online training
Hadoop live online trainingHadoop live online training
Hadoop live online training
 
Enhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop ClusterEnhancing Performance and Fault Tolerance of Hadoop Cluster
Enhancing Performance and Fault Tolerance of Hadoop Cluster
 
A hadoop map reduce
A hadoop map reduceA hadoop map reduce
A hadoop map reduce
 
Hadoop training-in-hyderabad
Hadoop training-in-hyderabadHadoop training-in-hyderabad
Hadoop training-in-hyderabad
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
Learn what is Hadoop-and-BigData
Learn  what is Hadoop-and-BigDataLearn  what is Hadoop-and-BigData
Learn what is Hadoop-and-BigData
 
Introduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to HadoopIntroduccion a Hadoop / Introduction to Hadoop
Introduccion a Hadoop / Introduction to Hadoop
 
Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...Scaling Application on High Performance Computing Clusters and Analysis of th...
Scaling Application on High Performance Computing Clusters and Analysis of th...
 
Debugging of hadoop production errors using open source flow analyzer – semic...
Debugging of hadoop production errors using open source flow analyzer – semic...Debugging of hadoop production errors using open source flow analyzer – semic...
Debugging of hadoop production errors using open source flow analyzer – semic...
 
Map reduce prashant
Map reduce prashantMap reduce prashant
Map reduce prashant
 
Mapreduce2008 cacm
Mapreduce2008 cacmMapreduce2008 cacm
Mapreduce2008 cacm
 
Optimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola PericOptimizing Performance - Clojure Remote - Nikola Peric
Optimizing Performance - Clojure Remote - Nikola Peric
 
Hadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraintHadoop scheduler with deadline constraint
Hadoop scheduler with deadline constraint
 
Vedic Calculator
Vedic CalculatorVedic Calculator
Vedic Calculator
 
Map reduce
Map reduceMap reduce
Map reduce
 
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdfmodule3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
module3part-1-bigdata-230301002404-3db4f2a4 (1).pdf
 
Big Data.pptx
Big Data.pptxBig Data.pptx
Big Data.pptx
 

More from Dr. C.V. Suresh Babu (20)

Data analytics with R
Data analytics with RData analytics with R
Data analytics with R
 
Association rules
Association rulesAssociation rules
Association rules
 
Clustering
ClusteringClustering
Clustering
 
Classification
ClassificationClassification
Classification
 
Blue property assumptions.
Blue property assumptions.Blue property assumptions.
Blue property assumptions.
 
Introduction to regression
Introduction to regressionIntroduction to regression
Introduction to regression
 
DART
DARTDART
DART
 
Mycin
MycinMycin
Mycin
 
Expert systems
Expert systemsExpert systems
Expert systems
 
Dempster shafer theory
Dempster shafer theoryDempster shafer theory
Dempster shafer theory
 
Bayes network
Bayes networkBayes network
Bayes network
 
Bayes' theorem
Bayes' theoremBayes' theorem
Bayes' theorem
 
Knowledge based agents
Knowledge based agentsKnowledge based agents
Knowledge based agents
 
Rule based system
Rule based systemRule based system
Rule based system
 
Formal Logic in AI
Formal Logic in AIFormal Logic in AI
Formal Logic in AI
 
Production based system
Production based systemProduction based system
Production based system
 
Game playing in AI
Game playing in AIGame playing in AI
Game playing in AI
 
Diagnosis test of diabetics and hypertension by AI
Diagnosis test of diabetics and hypertension by AIDiagnosis test of diabetics and hypertension by AI
Diagnosis test of diabetics and hypertension by AI
 
A study on “impact of artificial intelligence in covid19 diagnosis”
A study on “impact of artificial intelligence in covid19 diagnosis”A study on “impact of artificial intelligence in covid19 diagnosis”
A study on “impact of artificial intelligence in covid19 diagnosis”
 
A study on “impact of artificial intelligence in covid19 diagnosis”
A study on “impact of artificial intelligence in covid19 diagnosis”A study on “impact of artificial intelligence in covid19 diagnosis”
A study on “impact of artificial intelligence in covid19 diagnosis”
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 

Developing a Map Reduce Application

  • 1. Developing a MapReduce Application Dr. C.V. Suresh Babu (CentreforKnowledgeTransfer) institute
  • 2. Phases of Developing a MapReduce Application • Configuration API • Configuring the Development Environment • GenericOptionsParser, Tool and ToolRunner • Writing Unit Tests • Running locally and in a cluster on Test Data • The MapReduce Web UI • Hadoop Logs • Tuning a Job to improve performance (CentreforKnowledgeTransfer) institute
  • 3. Stages 1:Developing a MapReduce Application • Writing a program in MapReduce follows a certain pattern. • You start by writing your map and reduce functions, ideally with unit tests to make sure they do what you expect. • Then you write a driver program to run a job, which can run from your IDE using a small subset of the data to check that it is working. • If it fails, you can use your IDE’s debugger to find the source of the problem. • With this information, you can expand your unit tests to cover this case and improve your mapper or reducer as appropriate to handle such input correctly. (CentreforKnowledgeTransfer) institute
  • 4. Stages 2:Developing a MapReduce Application • When the program runs as expected against the small dataset, you are ready to unleash it on a cluster. • Running against the full dataset is likely to expose some more issues, which you can fix as before, by expanding your tests and mapper or reducer to handle the new cases. • Debugging failing programs in the cluster is a challenge, so we look at some common techniques to make it easier. (CentreforKnowledgeTransfer) institute
  • 5. Stage 3: Developing a MapReduce Application • After the program is working, you may wish to do some tuning, first by running through some standard checks for making MapReduce programs faster and then by doing task profiling. • Profiling distributed programs is not easy, but Hadoop has hooks to aid the process. (CentreforKnowledgeTransfer) institute
  • 6. Example: Word Count Task: Counting the word occurances (frequencies) in a text file (or set of files). <word, count >as <key, value >pair Mapper: Emits <word, 1 >for each word (no counting at this part). Shuffle in between: pairs with same keys grouped together and passed to a single machine. Reducer: Sums up the values (1s) with the same key value (CentreforKnowledgeTransfer) institute