MapReduce as a General Framework to Support Research in Mining Software Repositories (MSR)
Weiyi Shang, Zhen Ming Jiang, Bram Adams, Ahmed Hassan
Software Analysis and Intelligence Lab (SAIL)
School of Computing, Queen’s University
As an MSR researcher, have you ever been in such a situation?
• Analyzing gigabytes of data?
• Waiting hours for experimental results?
• Experiments fail with “out of memory” exceptions?
To overcome these problems, you could …
… buy more powerful machines
… spend weeks to make your tools more efficient
However!
• The data will keep on growing
• Spend time on research, not on speeding up experiments
Debian doubles in size approximately every two years
• Idle computing power is available in every lab
• We can bundle these computers together
• A distributed framework can help us do so
General requirements for a distributed framework:
1. Efficiency: speed up the process significantly
2. Scalability: scale with data size and computing power
3. Adaptability: require only minimal programming effort
4. Flexibility: run in various environments
Google’s MapReduce is a model of distributed computation
• Open-source MapReduce implementation
• Well documented, with many examples available
• Well supported by a large user base and newsgroups
• Straightforward API
Example: counting the frequency of word lengths
Input words: dog, cat, fish, good, hello, night, happy, school
# Words   Length
2         3
2         4
3         5
1         6
1. Deploy data into a distributed file system
Data: dog, cat, fish, hello, good, night, happy, school
[Figure: the data is deployed into a network computing environment]
2. Read data as records
Data: dog, cat, fish, hello, good, night, happy, school
3. Generate a key for each record with Mappers
Value    Key
dog      3
cat      3
fish     4
hello    5
good     4
night    5
happy    5
school   6
[Figure: four Mappers process the records and emit the pairs above]
4. Group and sort records by keys
Value    Key
dog      3
cat      3
fish     4
good     4
hello    5
night    5
happy    5
school   6
5. Send records with the same key to one reducer
Reducer (key 3): dog, cat
Reducer (key 4): fish, good
Reducer (key 5): hello, night, happy
Reducer (key 6): school
6. Generate outputs by Reducers
Value (# words)   Key (length)
2                 3
2                 4
3                 5
1                 6
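The six steps above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not the distributed implementation used in the talk: the function names `mapper`, `shuffle`, `reducer`, and `word_length_frequency` are assumptions made for this sketch, and steps 1 and 2 (deploying and reading the data) are replaced by an in-memory list.

```python
from collections import defaultdict

def mapper(word):
    """Step 3: emit a (key, value) pair; the word length is the key."""
    return (len(word), word)

def shuffle(pairs):
    """Steps 4 and 5: group values by key and sort the groups by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Step 6: count how many words share the same length."""
    return (key, len(values))

def word_length_frequency(words):
    pairs = [mapper(word) for word in words]
    grouped = shuffle(pairs)
    return [reducer(key, values) for key, values in grouped]

words = ["dog", "cat", "fish", "hello", "good", "night", "happy", "school"]
print(word_length_frequency(words))  # [(3, 2), (4, 2), (5, 3), (6, 1)]
```

In a real MapReduce framework the mappers and reducers run on different machines and the framework performs the grouping; only the two small functions `mapper` and `reducer` would need to be written by the researcher.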
A typical MSR analysis
1. Extract all versions of all files
2. Analyze each version
3. Compare versions to each other
We apply MapReduce to a typical MSR tool
Applying MapReduce to typical MSR tools
Repository data: a0.java, a1.java, b0.java, a2.java, b1.java
Mappers generate a key for each file version:
Value     Key
a0.java   a.java
a1.java   a.java
b0.java   b.java
a2.java   a.java
b1.java   b.java
[Figure: three Mappers process the file versions and emit the pairs above]
Records are grouped and sorted by key:
Value     Key
a0.java   a.java
a1.java   a.java
a2.java   a.java
b0.java   b.java
b1.java   b.java
Records with the same key are sent to one reducer:
Reducer (key a.java): a0.java, a1.java, a2.java
Reducer (key b.java): b0.java, b1.java
Reducers generate the outputs:
Value      Key
a.output   a.java
b.output   b.java
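The same pattern can be sketched for file versions. The following is a minimal single-process illustration, assuming the naming scheme of the slides (`a0.java` is revision 0 of `a.java`). The regular expression, the function names, and the reducer body, which simply pairs consecutive versions as a stand-in for the real comparison, are assumptions of this sketch and not the tool's actual logic.

```python
import re
from collections import defaultdict

def mapper(version_name):
    """Emit (file name, (revision, version name)); assumes names like 'a0.java'."""
    match = re.fullmatch(r"(.+?)(\d+)(\.\w+)", version_name)
    if match is None:
        raise ValueError(f"unexpected version name: {version_name}")
    stem, revision, extension = match.groups()
    return (stem + extension, (int(revision), version_name))

def shuffle(pairs):
    """Group all versions of the same file under one key, sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(file_name, versions):
    """Placeholder analysis: pair each version with its successor."""
    ordered = [name for _, name in sorted(versions)]
    return (file_name, list(zip(ordered, ordered[1:])))

data = ["a0.java", "a1.java", "b0.java", "a2.java", "b1.java"]
grouped = shuffle(mapper(name) for name in data)
for file_name, comparisons in (reducer(k, v) for k, v in grouped):
    print(file_name, comparisons)
```

Because every version of a file ends up at the same reducer, the comparison step never needs data held by another reducer, which is what allows the analysis to be spread across machines.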
Case study: J-REX
1. Extraction phase: the snapshot extractor extracts snapshots (Snapshot 1 … Snapshot n) from the CVS repository
2. Parsing phase: Eclipse JDT parses the source code to XML files (XML output 1 … XML output n)
3. Analysis phase: the Evolution Analyzer compares each XML file to generate evolution information (Evolutionary Change Data)
Case study: data
Repository   Size    # Source code files   Length of history   # Revisions
Datatools    394MB   10,552                2 years             2,398
BIRT         810MB   13,002                4 years             19,583
Eclipse      4.2GB   56,851                8 years             82,682
Case study: experimental setup
Machine   CPU type                           # CPU   Memory size   Disk type
Desktop   Intel Quad Core Q6600 @ 2.40 GHz   4       2GB           SATA
Server    Intel Quad Core Q6600 @ 2.40 GHz   4       8GB           RAID5
Server    Intel Core i7 920 @ 2.67 GHz       8       6GB           SSD
Efficiency: significant reduction of running time by using MapReduce
[Figure: bar chart of running time (hours) for Desktop, Server (SSD), and With MapReduce; the bars are annotated "70% less", "64% less", and "59% less"]
Scalability: drastic reduction of run time by adding machines
• When adding machines
  – Time to deploy data increases
  – Time to process decreases
[Figure: running time with 2, 3, and 4 nodes]
Adaptability: little effort to apply MapReduce to an MSR tool
• J-REX logic unchanged
• Only 300-400 LOC to implement Map and Reduce
• Typical MapReduce examples available
• Less than one hour for deployment
Flexibility: run on various environments
Conclusions
• Distributed frameworks are needed to
  – deal with growing data
  – make best use of available computing resources
• A MapReduce solution of a typical MSR analysis is:
  – straightforward
  – scalable
  – efficient

MSR 2009
