Mining Unstructured Software Repositories Using IR Models
Stephen W. Thomas
PhD Candidate
Queen’s University
Stephen W. Thomas
Mining Software Repositories with Topic Models.
ICSE 2011
Stephen W. Thomas, Hadi Hemmati, Ahmed E. Hassan, and Dorothea Blostein
Static Test Case Prioritization Using Topic Models.
Empirical Software Engineering, 2012
Stephen W. Thomas, Nicolas Bettenburg, Ahmed E. Hassan, and Dorothea Blostein
Talk and Work: Recovering the Relationship between Mailing List Discussions and Development Activity.
Empirical Software Engineering, 2nd round
Stephen W. Thomas, Meiyappan Nagappan, Ahmed E. Hassan, and Dorothea Blostein
The Impact of Classifier Configuration and Classifier Combination on Bug Localization.
IEEE Transactions on Software Engineering, 2nd round
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
Validating the Use of Topic Models for Software Evolution.
SCAM 2010
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
Modeling the Evolution of Topics in Source Code Histories.
MSR 2011
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
Studying Software Evolution Using Topic Models.
Science of Computer Programming, 2012
code changes
logs
bugs
email
reqs
bug prediction
traceability linking
feature location
architecture recovery
change pattern detection
00:03:45: E22344, 76, 90.3,
00:03:46: E2f3a4, 82, 95.0,
00:03:56: E22345, 78, 96.6,
00:04:15: E22344, 23, 95.1,
00:04:35: E23348, 65, 95.7,
00:04:37: E2234b, 56, 93.1,
00:04:38: E2234b, 54, 95.0,
00:04:39: E22a34, 98, 95.1,
00:05:42: E353f4, 65, 94.7,
00:05:42: E3556j, 45, 95.2,
00:05:42: E3545g, 63, 92.8,
00:05:42: E354r4, 94, 95.6,
source code comments
bug reports
emails
requirement descriptions
forum and blog posts
commit messages
source code identifiers
NPE caused by no splashscreen handler service available
Provide unittests for link
creation constraints, unit tests
fail in standalone build
[Figure: example terms: Service, pricing, Conference]
The research and practice of using IR models to
mine software repositories can be improved by
(i) considering additional software engineering
tasks, such as prioritizing test cases;
(ii) using advanced IR techniques, such as
combining multiple IR models; and
(iii) better understanding the assumptions and
parameters of IR models.
Test Case Prioritization
Less similar → higher priority
Similarity: identifiers, comments, string literals
Structural-based vs. IR-based
Part 1 [EMSE 2012]
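The "less similar, higher priority" idea can be sketched as a greedy ordering over the text of each test case. This is a minimal illustration, not the method evaluated in the cited paper: it assumes plain term counts and cosine distance as the similarity measure, and the test names and texts are made up.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def cosine_distance(a, b):
    """Return 1 - cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def prioritize(tests):
    """Greedy ordering: repeatedly pick the test farthest from those already chosen."""
    vectors = {name: Counter(tokenize(text)) for name, text in tests.items()}
    remaining = sorted(vectors)

    def avg_dist(name):
        # Average distance from one test to all the other tests.
        others = [m for m in remaining if m != name]
        return sum(cosine_distance(vectors[name], vectors[m]) for m in others) / max(len(others), 1)

    # Seed with the test that is, on average, least similar to the rest.
    first = max(remaining, key=avg_dist)
    order = [first]
    remaining.remove(first)
    while remaining:
        # A candidate's distance to the chosen set is its distance to the nearest chosen test.
        nxt = max(remaining, key=lambda n: min(cosine_distance(vectors[n], vectors[c]) for c in order))
        order.append(nxt)
        remaining.remove(nxt)
    return order

# Hypothetical test cases, represented by words from identifiers, comments, and string literals.
tests = {
    "test_parse_xml": "parse xml document node parser",
    "test_parse_xml_attrs": "parse xml attribute node parser",
    "test_print_page": "print page printer margin",
}
order = prioritize(tests)
```

Here the printing test is scheduled first because it shares no terms with the two XML tests, which are near-duplicates of each other.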
Source code ↔ Email Interaction
Cleaning and preprocessing: identifiers, comments, string literals
Topics in mail and code: XML, printing, installation, GUI
[Figure: activity over time in code and mail for the XML topic]
Monitoring project status
Software explanation
Training and documentation
Part 1 [EMSE 20XX]
Combining Multiple IR Models
Source code: identifiers, comments, string literals
Bug report: title, description
Similarity between bug report and source code
Best individual IR model vs. random subset, combined
Part 2 [TSE 20XX]
sets had improved performance
median improvement
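One simple way to combine the ranked results of several IR models is a Borda count, sketched below. The slides do not say which combination scheme was used, so treat this as an illustrative assumption; the file names and the three model outputs are hypothetical.

```python
from collections import defaultdict

def borda_combine(rankings):
    """Combine several ranked lists of file names into one list via Borda count."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            # The top item of a list of n items earns n points, the last earns 1.
            scores[item] += n - position
    # Highest total first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical rankings of source files for one bug report, one list per IR model.
rankings = [
    ["Parser.java", "Lexer.java", "Printer.java"],
    ["Lexer.java", "Parser.java", "Printer.java"],
    ["Parser.java", "Printer.java", "Lexer.java"],
]
combined = borda_combine(rankings)
```

A rank-based scheme like this needs no score normalization across models, which is convenient when the models report similarity on different scales.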
Concept Evolution Models
Source code: identifiers, comments, string literals
[Figure: popularity over time for the XML, Swing, and Encryption concepts]
Part 2 [SCP 2012] [SCAM 2010]
accuracy of topic evolutions
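A popularity-over-time curve for one concept could be computed along the following lines. This is a sketch under stated assumptions: the per-version topic memberships are taken as already computed by some topic model, popularity is defined here as the mean membership across documents (the cited papers may use a different metric), and all file names and numbers are invented.

```python
def topic_popularity(versions, topic):
    """Mean membership of `topic` across the documents of each version."""
    series = []
    for docs in versions:
        memberships = [doc.get(topic, 0.0) for doc in docs.values()]
        series.append(sum(memberships) / len(memberships) if memberships else 0.0)
    return series

# Hypothetical topic memberships per file, one dict per version of the system.
versions = [
    {"A.java": {"xml": 0.5, "swing": 0.5}, "B.java": {"swing": 1.0}},
    {"A.java": {"xml": 0.75, "swing": 0.25}, "B.java": {"xml": 0.25, "swing": 0.75}},
]
xml_trend = topic_popularity(versions, "xml")
```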
Data Duplication Problem
identical
Part 3 [MSR 2011]
accuracy, sensitivity
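When successive versions of a file are identical or nearly so, one way to avoid feeding the same words to a model again and again is to keep only the words added or removed between versions. The slides do not describe how the duplication problem was overcome, so the sketch below is only one plausible approach, with made-up version contents.

```python
from collections import Counter

def version_deltas(versions):
    """Yield (added, removed) word counts between each version and its predecessor."""
    previous = Counter()
    for text in versions:
        current = Counter(text.split())
        # Counter subtraction keeps only positive counts, so each side is a clean delta.
        yield current - previous, previous - current
        previous = current

# Hypothetical history of one file; the second version is an unchanged copy.
history = [
    "open file read xml",
    "open file read xml",
    "open file read xml validate schema",
]
deltas = list(version_deltas(history))
```

The unchanged second version produces two empty deltas, so it contributes nothing to the modeled data.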
Preprocessing and Parameter Effects
Code representation
identifiers? comments?
past bug reports?
Bug report representation
title? description?
Preprocessing
split identifiers? remove stop words?
word stemming?
IR Model parameters
term weighting?
No. of topics? similarity measure?
No. of iterations?
Configuration matters!
worst:
best:
mean:
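The size of the configuration space, and a worst/best/mean summary over it, can be sketched as follows. The option names and values are illustrative assumptions drawn loosely from the questions above, and `placeholder_score` is a hypothetical stand-in for a real evaluation of a bug localization run.

```python
import itertools
import statistics

# Hypothetical configuration options; a real study would list many more.
options = {
    "code_repr": ["identifiers", "comments", "identifiers+comments"],
    "bug_repr": ["title", "description", "title+description"],
    "stemming": [False, True],
    "stop_words": [False, True],
}

def all_configurations(options):
    """Return every combination of option values as a list of dicts."""
    keys = list(options)
    return [dict(zip(keys, values)) for values in itertools.product(*options.values())]

def summarize(scores):
    """Report the worst, best, and mean score across configurations."""
    return {"worst": min(scores), "best": max(scores), "mean": statistics.mean(scores)}

def placeholder_score(config):
    # Stand-in only: counts enabled preprocessing steps instead of measuring accuracy.
    return int(config["stemming"]) + int(config["stop_words"])

configs = all_configurations(options)
summary = summarize([placeholder_score(c) for c in configs])
```

Even these four options yield 36 configurations, which is why an exhaustive sweep quickly becomes expensive as options are added.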
Part 3
[TSE 20XX]
“configuration”
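The three preprocessing questions (split identifiers, remove stop words, stem words) can each be treated as a switch in a small pipeline. The sketch below is an assumption-laden illustration: the stop-word list is a tiny hypothetical one, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "to", "is", "in", "get", "set"}

def split_identifier(identifier):
    """Split on underscores and camelCase boundaries, e.g. getXMLParser -> get, xml, parser."""
    words = []
    for part in re.split(r"[_\W]+", identifier):
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]

def naive_stem(word):
    """Strip one common suffix; a rough stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ers", "er", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(identifiers, split=True, remove_stop_words=True, stem=True):
    """Apply the enabled preprocessing steps in order and return the tokens."""
    tokens = []
    for ident in identifiers:
        tokens.extend(split_identifier(ident) if split else [ident.lower()])
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    if stem:
        tokens = [naive_stem(t) for t in tokens]
    return tokens

tokens = preprocess(["getXMLParser", "file_name"])
```

Toggling the three flags gives eight preprocessing variants from the same input, each of which may produce a different vocabulary for the IR model.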
Proposed and evaluated a technique to prioritize test cases
Proposed and evaluated a technique to analyze the interaction of source code and mailing lists
Described and evaluated a technique to analyze code histories using topic evolution models
Proposed and evaluated a framework for combining the results of disparate IR models
Overcame the data duplication problem in large source code histories
Analyzed the sensitivity of IR models to data preprocessing and IR model parameters

More Related Content

Viewers also liked

빅데이터와 교육데이터마이닝 (고려대학교 대학원 강의) 6주차
빅데이터와 교육데이터마이닝 (고려대학교 대학원 강의) 6주차빅데이터와 교육데이터마이닝 (고려대학교 대학원 강의) 6주차
빅데이터와 교육데이터마이닝 (고려대학교 대학원 강의) 6주차JM code group
 
Mineograph Mining Automation Software
Mineograph Mining Automation SoftwareMineograph Mining Automation Software
Mineograph Mining Automation SoftwareMineograph Software
 
Mining the Modern Code Review Repositories: A Dataset of People, Process and ...
Mining the Modern Code Review Repositories: A Dataset of People, Process and ...Mining the Modern Code Review Repositories: A Dataset of People, Process and ...
Mining the Modern Code Review Repositories: A Dataset of People, Process and ...Norihiro Yoshida
 
Data mining software comparison
Data mining software comparison Data mining software comparison
Data mining software comparison Esteban Alcaide
 
임태현, software catastrophe
임태현, software catastrophe임태현, software catastrophe
임태현, software catastrophe태현 임
 
Mining Software Archives to Support Software Development
Mining Software Archives to Support Software DevelopmentMining Software Archives to Support Software Development
Mining Software Archives to Support Software DevelopmentThomas Zimmermann
 
Model Comparison for Delta-Compression
Model Comparison for Delta-CompressionModel Comparison for Delta-Compression
Model Comparison for Delta-CompressionMarkus Scheidgen
 
An Empirical Study of Goto in C Code from GitHub Repositories
An Empirical Study of Goto in C Code from GitHub RepositoriesAn Empirical Study of Goto in C Code from GitHub Repositories
An Empirical Study of Goto in C Code from GitHub RepositoriesSAIL_QU
 
MSR mining challenge 2015 - Quick Trigger
MSR mining challenge 2015 - Quick TriggerMSR mining challenge 2015 - Quick Trigger
MSR mining challenge 2015 - Quick TriggerXin Yang
 
[우리가 데이터를 쓰는 법] 온라인 서비스 개선을 위한 데이터 활용법 - 마이크로소프트 김진영 데이터과학자
[우리가 데이터를 쓰는 법] 온라인 서비스 개선을 위한 데이터 활용법 - 마이크로소프트 김진영 데이터과학자[우리가 데이터를 쓰는 법] 온라인 서비스 개선을 위한 데이터 활용법 - 마이크로소프트 김진영 데이터과학자
[우리가 데이터를 쓰는 법] 온라인 서비스 개선을 위한 데이터 활용법 - 마이크로소프트 김진영 데이터과학자Dylan Ko
 
MSR 2016 data showcase - Mining Code Review Repositories
MSR 2016 data showcase - Mining Code Review RepositoriesMSR 2016 data showcase - Mining Code Review Repositories
MSR 2016 data showcase - Mining Code Review RepositoriesXin Yang
 
Software Analytics: Towards Software Mining that Matters
Software Analytics: Towards Software Mining that MattersSoftware Analytics: Towards Software Mining that Matters
Software Analytics: Towards Software Mining that MattersTao Xie
 
연관도 분석을 이용한 데이터마이닝
연관도 분석을 이용한 데이터마이닝연관도 분석을 이용한 데이터마이닝
연관도 분석을 이용한 데이터마이닝Keunhyun Oh
 
고품질 Sw와 개발문화
고품질 Sw와 개발문화고품질 Sw와 개발문화
고품질 Sw와 개발문화도형 임
 
Mining public datasets using opensource tools: Zeppelin, Spark and Juju
Mining public datasets using opensource tools: Zeppelin, Spark and JujuMining public datasets using opensource tools: Zeppelin, Spark and Juju
Mining public datasets using opensource tools: Zeppelin, Spark and Jujuseoul_engineer
 
Software Defect Prediction on Unlabeled Datasets
Software Defect Prediction on Unlabeled DatasetsSoftware Defect Prediction on Unlabeled Datasets
Software Defect Prediction on Unlabeled DatasetsSung Kim
 
Dissertation Defense
Dissertation DefenseDissertation Defense
Dissertation DefenseSung Kim
 
위대한개발문화
위대한개발문화위대한개발문화
위대한개발문화신승환
 
Mining Software Repositories
Mining Software RepositoriesMining Software Repositories
Mining Software RepositoriesIsrael Herraiz
 
Introduce Deep learning & A.I. Applications
Introduce Deep learning & A.I. ApplicationsIntroduce Deep learning & A.I. Applications
Introduce Deep learning & A.I. ApplicationsMario Cho
 

Viewers also liked (20)

빅데이터와 교육데이터마이닝 (고려대학교 대학원 강의) 6주차
빅데이터와 교육데이터마이닝 (고려대학교 대학원 강의) 6주차빅데이터와 교육데이터마이닝 (고려대학교 대학원 강의) 6주차
빅데이터와 교육데이터마이닝 (고려대학교 대학원 강의) 6주차
 
Mineograph Mining Automation Software
Mineograph Mining Automation SoftwareMineograph Mining Automation Software
Mineograph Mining Automation Software
 
Mining the Modern Code Review Repositories: A Dataset of People, Process and ...
Mining the Modern Code Review Repositories: A Dataset of People, Process and ...Mining the Modern Code Review Repositories: A Dataset of People, Process and ...
Mining the Modern Code Review Repositories: A Dataset of People, Process and ...
 
Data mining software comparison
Data mining software comparison Data mining software comparison
Data mining software comparison
 
임태현, software catastrophe
임태현, software catastrophe임태현, software catastrophe
임태현, software catastrophe
 
Mining Software Archives to Support Software Development
Mining Software Archives to Support Software DevelopmentMining Software Archives to Support Software Development
Mining Software Archives to Support Software Development
 
Model Comparison for Delta-Compression
Model Comparison for Delta-CompressionModel Comparison for Delta-Compression
Model Comparison for Delta-Compression
 
An Empirical Study of Goto in C Code from GitHub Repositories
An Empirical Study of Goto in C Code from GitHub RepositoriesAn Empirical Study of Goto in C Code from GitHub Repositories
An Empirical Study of Goto in C Code from GitHub Repositories
 
MSR mining challenge 2015 - Quick Trigger
MSR mining challenge 2015 - Quick TriggerMSR mining challenge 2015 - Quick Trigger
MSR mining challenge 2015 - Quick Trigger
 
[우리가 데이터를 쓰는 법] 온라인 서비스 개선을 위한 데이터 활용법 - 마이크로소프트 김진영 데이터과학자
[우리가 데이터를 쓰는 법] 온라인 서비스 개선을 위한 데이터 활용법 - 마이크로소프트 김진영 데이터과학자[우리가 데이터를 쓰는 법] 온라인 서비스 개선을 위한 데이터 활용법 - 마이크로소프트 김진영 데이터과학자
[우리가 데이터를 쓰는 법] 온라인 서비스 개선을 위한 데이터 활용법 - 마이크로소프트 김진영 데이터과학자
 
MSR 2016 data showcase - Mining Code Review Repositories
MSR 2016 data showcase - Mining Code Review RepositoriesMSR 2016 data showcase - Mining Code Review Repositories
MSR 2016 data showcase - Mining Code Review Repositories
 
Software Analytics: Towards Software Mining that Matters
Software Analytics: Towards Software Mining that MattersSoftware Analytics: Towards Software Mining that Matters
Software Analytics: Towards Software Mining that Matters
 
연관도 분석을 이용한 데이터마이닝
연관도 분석을 이용한 데이터마이닝연관도 분석을 이용한 데이터마이닝
연관도 분석을 이용한 데이터마이닝
 
고품질 Sw와 개발문화
고품질 Sw와 개발문화고품질 Sw와 개발문화
고품질 Sw와 개발문화
 
Mining public datasets using opensource tools: Zeppelin, Spark and Juju
Mining public datasets using opensource tools: Zeppelin, Spark and JujuMining public datasets using opensource tools: Zeppelin, Spark and Juju
Mining public datasets using opensource tools: Zeppelin, Spark and Juju
 
Software Defect Prediction on Unlabeled Datasets
Software Defect Prediction on Unlabeled DatasetsSoftware Defect Prediction on Unlabeled Datasets
Software Defect Prediction on Unlabeled Datasets
 
Dissertation Defense
Dissertation DefenseDissertation Defense
Dissertation Defense
 
위대한개발문화
위대한개발문화위대한개발문화
위대한개발문화
 
Mining Software Repositories
Mining Software RepositoriesMining Software Repositories
Mining Software Repositories
 
Introduce Deep learning & A.I. Applications
Introduce Deep learning & A.I. ApplicationsIntroduce Deep learning & A.I. Applications
Introduce Deep learning & A.I. Applications
 

Similar to Mining Unstructured Software Repositories Using IR Models

Studying Software Quality Using Topic Models
Studying Software Quality Using Topic ModelsStudying Software Quality Using Topic Models
Studying Software Quality Using Topic ModelsSAIL_QU
 
AI-Driven Software Quality Assurance in the Age of DevOps
AI-Driven Software Quality Assurance in the Age of DevOpsAI-Driven Software Quality Assurance in the Age of DevOps
AI-Driven Software Quality Assurance in the Age of DevOpsChakkrit (Kla) Tantithamthavorn
 
TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...
TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...
TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...Hong-Linh Truong
 
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSINGFEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSINGIJCI JOURNAL
 
(Structural) Feature Interactions for Variability-Intensive Systems Testing
(Structural) Feature Interactions for Variability-Intensive Systems Testing (Structural) Feature Interactions for Variability-Intensive Systems Testing
(Structural) Feature Interactions for Variability-Intensive Systems Testing Gilles Perrouin
 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET Journal
 
Large Language Models Bootcamp
Large Language Models BootcampLarge Language Models Bootcamp
Large Language Models BootcampData Science Dojo
 
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...IJCSEA Journal
 
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...IJCSEA Journal
 
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...ACM Chicago
 
Intelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringIntelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringTao Xie
 
Tim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceTim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceCS, NcState
 
PhD defense: David Ameller
PhD defense: David AmellerPhD defense: David Ameller
PhD defense: David AmellerDavid Ameller
 
Not Only Statements: The Role of Textual Analysis in Software Quality
Not Only Statements: The Role of Textual Analysis in Software QualityNot Only Statements: The Role of Textual Analysis in Software Quality
Not Only Statements: The Role of Textual Analysis in Software QualityRocco Oliveto
 
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNING
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNINGIMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNING
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNINGIRJET Journal
 
Full resume dr_russell_john_childs_2016
Full resume dr_russell_john_childs_2016Full resume dr_russell_john_childs_2016
Full resume dr_russell_john_childs_2016Russell Childs
 
Top cited articles 2020 - Advanced Computational Intelligence: An Internation...
Top cited articles 2020 - Advanced Computational Intelligence: An Internation...Top cited articles 2020 - Advanced Computational Intelligence: An Internation...
Top cited articles 2020 - Advanced Computational Intelligence: An Internation...aciijournal
 

Similar to Mining Unstructured Software Repositories Using IR Models (20)

Paper summary
Paper summaryPaper summary
Paper summary
 
Studying Software Quality Using Topic Models
Studying Software Quality Using Topic ModelsStudying Software Quality Using Topic Models
Studying Software Quality Using Topic Models
 
Software bug prediction
Software bug prediction Software bug prediction
Software bug prediction
 
AI-Driven Software Quality Assurance in the Age of DevOps
AI-Driven Software Quality Assurance in the Age of DevOpsAI-Driven Software Quality Assurance in the Age of DevOps
AI-Driven Software Quality Assurance in the Age of DevOps
 
TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...
TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...
TUW- 184.742 Emerging Dynamic Distributed Systems and Challenges for Advanced...
 
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSINGFEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
FEATURES MATCHING USING NATURAL LANGUAGE PROCESSING
 
(Structural) Feature Interactions for Variability-Intensive Systems Testing
(Structural) Feature Interactions for Variability-Intensive Systems Testing (Structural) Feature Interactions for Variability-Intensive Systems Testing
(Structural) Feature Interactions for Variability-Intensive Systems Testing
 
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...IRJET -  	  Conversion of Unsupervised Data to Supervised Data using Topic Mo...
IRJET - Conversion of Unsupervised Data to Supervised Data using Topic Mo...
 
Large Language Models Bootcamp
Large Language Models BootcampLarge Language Models Bootcamp
Large Language Models Bootcamp
 
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...
 
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...
IMPLEMENTATION OF DYNAMIC COUPLING MEASUREMENT OF DISTRIBUTED OBJECT ORIENTED...
 
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
ACM Chicago March 2019 meeting: Software Engineering and AI - Prof. Tao Xie, ...
 
Intelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software EngineeringIntelligent Software Engineering: Synergy between AI and Software Engineering
Intelligent Software Engineering: Synergy between AI and Software Engineering
 
Tim Menzies, directions in Data Science
Tim Menzies, directions in Data ScienceTim Menzies, directions in Data Science
Tim Menzies, directions in Data Science
 
PhD defense: David Ameller
PhD defense: David AmellerPhD defense: David Ameller
PhD defense: David Ameller
 
Not Only Statements: The Role of Textual Analysis in Software Quality
Not Only Statements: The Role of Textual Analysis in Software QualityNot Only Statements: The Role of Textual Analysis in Software Quality
Not Only Statements: The Role of Textual Analysis in Software Quality
 
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNING
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNINGIMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNING
IMAGE TO TEXT TO SPEECH CONVERSION USING MACHINE LEARNING
 
Full resume dr_russell_john_childs_2016
Full resume dr_russell_john_childs_2016Full resume dr_russell_john_childs_2016
Full resume dr_russell_john_childs_2016
 
Aj35198205
Aj35198205Aj35198205
Aj35198205
 
Top cited articles 2020 - Advanced Computational Intelligence: An Internation...
Top cited articles 2020 - Advanced Computational Intelligence: An Internation...Top cited articles 2020 - Advanced Computational Intelligence: An Internation...
Top cited articles 2020 - Advanced Computational Intelligence: An Internation...
 

More from SAIL_QU

Studying the Integration Practices and the Evolution of Ad Libraries in the G...
Studying the Integration Practices and the Evolution of Ad Libraries in the G...Studying the Integration Practices and the Evolution of Ad Libraries in the G...
Studying the Integration Practices and the Evolution of Ad Libraries in the G...SAIL_QU
 
Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...SAIL_QU
 
Improving the testing efficiency of selenium-based load tests
Improving the testing efficiency of selenium-based load testsImproving the testing efficiency of selenium-based load tests
Improving the testing efficiency of selenium-based load testsSAIL_QU
 
Studying User-Developer Interactions Through the Distribution and Reviewing M...
Studying User-Developer Interactions Through the Distribution and Reviewing M...Studying User-Developer Interactions Through the Distribution and Reviewing M...
Studying User-Developer Interactions Through the Distribution and Reviewing M...SAIL_QU
 
Studying online distribution platforms for games through the mining of data f...
Studying online distribution platforms for games through the mining of data f...Studying online distribution platforms for games through the mining of data f...
Studying online distribution platforms for games through the mining of data f...SAIL_QU
 
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...SAIL_QU
 
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...Investigating the Challenges in Selenium Usage and Improving the Testing Effi...
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...SAIL_QU
 
Mining Development Knowledge to Understand and Support Software Logging Pract...
Mining Development Knowledge to Understand and Support Software Logging Pract...Mining Development Knowledge to Understand and Support Software Logging Pract...
Mining Development Knowledge to Understand and Support Software Logging Pract...SAIL_QU
 
Which Log Level Should Developers Choose For a New Logging Statement?
Which Log Level Should Developers Choose For a New Logging Statement?Which Log Level Should Developers Choose For a New Logging Statement?
Which Log Level Should Developers Choose For a New Logging Statement?SAIL_QU
 
Towards Just-in-Time Suggestions for Log Changes
Towards Just-in-Time Suggestions for Log ChangesTowards Just-in-Time Suggestions for Log Changes
Towards Just-in-Time Suggestions for Log ChangesSAIL_QU
 
The Impact of Task Granularity on Co-evolution Analyses
The Impact of Task Granularity on Co-evolution AnalysesThe Impact of Task Granularity on Co-evolution Analyses
The Impact of Task Granularity on Co-evolution AnalysesSAIL_QU
 
A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...
A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...
A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...SAIL_QU
 
How are Discussions Associated with Bug Reworking? An Empirical Study on Open...
How are Discussions Associated with Bug Reworking? An Empirical Study on Open...How are Discussions Associated with Bug Reworking? An Empirical Study on Open...
How are Discussions Associated with Bug Reworking? An Empirical Study on Open...SAIL_QU
 
A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...
A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...
A Study of the Relation of Mobile Device Attributes with the User-Perceived Q...SAIL_QU
 
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...
A Large-Scale Study of the Impact of Feature Selection Techniques on Defect C...SAIL_QU
 
Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...SAIL_QU
 
What Do Programmers Know about Software Energy Consumption?
What Do Programmers Know about Software Energy Consumption?What Do Programmers Know about Software Energy Consumption?
What Do Programmers Know about Software Energy Consumption?SAIL_QU
 
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...
Threshold for Size and Complexity Metrics: A Case Study from the Perspective ...SAIL_QU
 
Revisiting the Experimental Design Choices for Approaches for the Automated R...
Revisiting the Experimental Design Choices for Approaches for the Automated R...Revisiting the Experimental Design Choices for Approaches for the Automated R...
Revisiting the Experimental Design Choices for Approaches for the Automated R...SAIL_QU
 
Measuring Program Comprehension: A Large-Scale Field Study with Professionals
Measuring Program Comprehension: A Large-Scale Field Study with ProfessionalsMeasuring Program Comprehension: A Large-Scale Field Study with Professionals
Measuring Program Comprehension: A Large-Scale Field Study with ProfessionalsSAIL_QU
 

More from SAIL_QU (20)

Studying the Integration Practices and the Evolution of Ad Libraries in the G...
Studying the Integration Practices and the Evolution of Ad Libraries in the G...Studying the Integration Practices and the Evolution of Ad Libraries in the G...
Studying the Integration Practices and the Evolution of Ad Libraries in the G...
 
Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...Studying the Dialogue Between Users and Developers of Free Apps in the Google...
Studying the Dialogue Between Users and Developers of Free Apps in the Google...
 
Improving the testing efficiency of selenium-based load tests
Improving the testing efficiency of selenium-based load testsImproving the testing efficiency of selenium-based load tests
Improving the testing efficiency of selenium-based load tests
 
Studying User-Developer Interactions Through the Distribution and Reviewing M...
Studying User-Developer Interactions Through the Distribution and Reviewing M...Studying User-Developer Interactions Through the Distribution and Reviewing M...
Studying User-Developer Interactions Through the Distribution and Reviewing M...
 
Studying online distribution platforms for games through the mining of data f...
Studying online distribution platforms for games through the mining of data f...Studying online distribution platforms for games through the mining of data f...
Studying online distribution platforms for games through the mining of data f...
 
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...
Understanding the Factors for Fast Answers in Technical Q&A Websites: An Empi...
 
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...Investigating the Challenges in Selenium Usage and Improving the Testing Effi...
Investigating the Challenges in Selenium Usage and Improving the Testing Effi...
 
Mining Development Knowledge to Understand and Support Software Logging Pract...
Mining Development Knowledge to Understand and Support Software Logging Pract...Mining Development Knowledge to Understand and Support Software Logging Pract...
Mining Development Knowledge to Understand and Support Software Logging Pract...
 
Which Log Level Should Developers Choose For a New Logging Statement?
Which Log Level Should Developers Choose For a New Logging Statement?Which Log Level Should Developers Choose For a New Logging Statement?
Which Log Level Should Developers Choose For a New Logging Statement?
 
Towards Just-in-Time Suggestions for Log Changes
Towards Just-in-Time Suggestions for Log ChangesTowards Just-in-Time Suggestions for Log Changes
Towards Just-in-Time Suggestions for Log Changes
 
The Impact of Task Granularity on Co-evolution Analyses
The Impact of Task Granularity on Co-evolution AnalysesThe Impact of Task Granularity on Co-evolution Analyses
The Impact of Task Granularity on Co-evolution Analyses
 
A Framework for Evaluating the Results of the SZZ Approach for Identifying Bu...
Mining Unstructured Software Repositories Using IR Models

  • 1. Mining Unstructured Software Repositories Using IR Models. Stephen W. Thomas, PhD Candidate, Queen’s University
  • 2. Stephen W. Thomas. Mining Software Repositories with Topic Models. ICSE 2011
       Stephen W. Thomas, Hadi Hemmati, Ahmed E. Hassan, and Dorothea Blostein. Static Test Case Prioritization Using Topic Models. Empirical Software Engineering, 2012
       Stephen W. Thomas, Nicolas Bettenburg, Ahmed E. Hassan, and Dorothea Blostein. Talk and Work: Recovering the Relationship between Mailing List Discussions and Development Activity. Empirical Software Engineering, 2nd round
       Stephen W. Thomas, Meiyappan Nagappan, Ahmed E. Hassan, and Dorothea Blostein. The Impact of Classifier Configuration and Classifier Combination on Bug Localization. IEEE Transactions on Software Engineering, 2nd round
       Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Validating the Use of Topic Models for Software Evolution. SCAM 2010
       Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Modeling the Evolution of Topics in Source Code Histories. MSR 2011
       Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein. Studying Software Evolution Using Topic Models. Science of Computer Programming, 2012
  • 3. code changes, logs, bugs, email, reqs; bug prediction, traceability linking, feature location, architecture recovery, change pattern detection
  • 4. 00:03:45: E22344, 76, 90.3, 00:03:46: E2f3a4, 82, 95.0, 00:03:56: E22345, 78, 96.6, 00:04:15: E22344, 23, 95.1, 00:04:35: E23348, 65, 95.7, 00:04:37: E2234b, 56, 93.1, 00:04:38: E2234b, 54, 95.0, 00:04:39: E22a34, 98, 95.1, 00:05:42: E353f4, 65, 94.7, 00:05:42: E3556j, 45, 95.2, 00:05:42: E3545g, 63, 92.8, 00:05:42: E354r4, 94, 95.6; source code comments, bug reports, emails, requirement descriptions, forum and blog posts, commit messages, source code identifiers
  • 5. “NPE caused by no spashscreen handler service available”; “Provide unittests for link creation constraints, unit tests fail in standalone build”
  • 7.
  • 9. The research and practice of using IR models to mine software repositories can be improved by (i) considering additional software engineering tasks, such as prioritizing test cases; (ii) using advanced IR techniques, such as combining multiple IR models; and (iii) better understanding the assumptions and parameters of IR models.
  • 10. Test Case Prioritization: structural-based, IR-based; identifiers, comments, string literals; Similarity; Less similar → Higher priority. Part 1 [EMSE 2012]
  • 11. Source code ↔ Email Interaction: cleaning and preprocessing; identifiers, comments, string literals; mail, code; XML, printing, installation, GUI; Code, Mail; Time, Activity; XML; Monitoring project status; Software explanation; Training and documentation. Part 1 [EMSE 20XX]
  • 13. Combining Multiple IR Models: identifiers, comments, string literals; Bug report: title, description; Similarity; Best individual IR model; Random subset, combined; sets had improved performance; median improvement. Part 2 [TSE 20XX]
  • 14. Concept Evolution Models: XML concept, Swing concept, Encryption concept; Time, Popularity; identifiers, comments, string literals; accuracy of topic evolutions. Part 2 [SCP 2012] [SCAM 2010]
  • 16. Data Duplication Problem: identical; accuracy, sensitivity. Part 3 [MSR 2011]
  • 17. Preprocessing and Parameter Effects. Code representation: identifiers? comments? past bug reports? Bug report representation: title? description? Preprocessing: split identifiers? remove stop words? word stemming? IR model parameters: term weighting? No. of topics? similarity measure? No. of iterations? Configuration matters! worst: best: mean: “configuration”. Part 3 [TSE 20XX]
  • 18. New! Part 1: Proposed and evaluated a technique to prioritize test cases; Proposed and evaluated a technique to analyze the interaction of source code and mailing lists. Part 2: Described and evaluated a technique to analyze code histories using topic evolution models; Proposed and evaluated a framework for combining the results of disparate IR models. Part 3: Overcame the data duplication problem in large source code histories; Analyzed the sensitivity of IR models to data preprocessing and IR model parameters.

Editor's Notes

  1. This diagram describes the field of Mining Software Repositories. The overall goal is to take software repositories (which are readily available datasets about a software project, such as [list a few]), apply data mining and machine learning techniques, and come out with some actionable knowledge that will help developers in some way. For example: bug prediction, traceability linking, feature location, …
  2. In current research, the majority of the repositories that are mined are structured: call graphs, parse trees, execution logs. However, there are also many repositories that are unstructured: [name them] In fact, research has shown that about 80% of the content in software repositories is unstructured, meaning that we need to consider this data if we want to take full advantage of the software repositories.
  3. However, unstructured data brings with it many challenges. Consider these two seemingly-innocent bug reports from one of my case studies. Here we see many difficulties, such as undefined acronyms; spelling errors and typos; inconsistent usages; no labels, vague wording. These problems exist because most unstructured data comes in the form of natural language text written by humans, which is notoriously difficult for a computer to deal with.
  4. In an attempt to deal with unstructured software repositories, researchers have begun to use IR. IR models come from the NLP community, and are a good fit for our problem because they were designed to handle many of the problems of unstructured data. IR models help you search, organize, and provide structure for your unstructured data. IR models use a simplifying assumption about the data, called the “bag of words” approach. This means that word order is not considered in IR models. By ignoring word order, analysis is simpler and faster, and the techniques can scale to large datasets. And we demonstrate that despite this simplifying assumption, IR models actually perform quite well in many scenarios. Initial successes: concept location; document clustering; new code metrics; code search engines; traceability linking
  5. To understand how IR models have been used in MSR, I did a thorough literature review of all papers that use IR models to mine unstructured data. In all, there are about 67 papers. I analyzed the trends and common usages, and found three shortcomings of the state-of-the-art, i.e., some areas where we could improve. My thesis is the proposal of solutions to each of these three shortcomings.
  6. First shortcoming: most papers that use IR models only perform one of two software engineering tasks: concept location, and traceability linking. There’s nothing wrong with these applications, but I propose that we can go beyond these two tasks and use IR models to perform new SE tasks, and help software developers even further. Second shortcoming: most papers use only the most basic IR models, such as the Vector Space Model (1975, 37 years ago). I propose that we use some of the more advanced, Superman-like IR techniques, which may bring better results and new capabilities to software developers. Third shortcoming: most papers use IR models as off-the-shelf black boxes, without fully understanding how their parameters work, what input is required, and what the output means. I propose that we develop a better understanding of how IR models work, which will allow us to take full advantage of their potential, and improve results for software developers.
  7. My thesis statement has a parallel structure: [read]
  8. In TCP, the goal is to take an unordered set of test cases, and provide an ordering such that more bugs are detected earlier in the testing process. By doing so, if the test suite must be stopped early, then you can rest assured that you have detected as many bugs as possible. Typically, TCP is tackled by using some sort of structural code coverage metric, that says: hey, how much code does this test case execute? If it executes a lot of code, then let’s give it a high priority. Otherwise, let’s give it a low priority. This is how it’s traditionally done. However, I propose that we can use IR models to solve the same problem, only with the additional advantage of not having to run the test case to collect the execution information. Here’s how. First, we extract the unstructured information from the source code: identifier names, comments, and string literals. Then, we compute the IR similarity between each pair of test cases. This will tell us if the test cases are textually similar or not. Then, if a test case is not very similar to other test cases, we give it a higher priority. The thought here is: if two test cases are exactly the same, then they will find the same bugs, so we don’t need to execute both. So we’re looking for test cases that are highly unlike any other test case, because they will detect unique bugs. We did a case study on 5 real-world systems, and found that our IR-based approach was as good as or better than existing approaches to prioritizing test cases.
  9. The first advanced technique I propose is that of combining multiple IR models. Let me explain this in the context of bug localization. […] A simple way to combine models is to just add the scores of each file from the various IR models. That way, if a file gets a high score in several models, it will shoot up to the top in the combined model. Another way is expert voting, where only the rank of each file is used, as opposed to the score. Either way, the end goal is to utilize the “expertise” of each model.
  10. If a manager or developer had a dashboard that magically told them what developers were working on, and when, at a high level, they would be very happy. This would keep them informed, allow them to perform retrospective analysis, and maybe even be part of a preemptive maintenance solution that automatically monitored the “health” of the source code over time. To achieve this goal, we use an advanced IR model called a topic evolution model. It works by [explain] We input these versions into an advanced IR model, called a topic evolution model, which gives us exactly what we’re looking for. A case study found that a large majority of the discovered evolutions were in-sync with how developers described the project, and since this technique is automatic, it will be helpful to use in an automatic dashboard setting.
  11. During my research, I came across an issue which I now call the “data duplication problem”. When I tried to analyze the evolution of long-lived systems with many different versions, I found that the IR model was producing unusual and unexpected results. Things just didn’t make sense: the topics were weird, and something was off, but I didn’t know what. Upon further analysis, I learned that the cause of this problem was that in source code, hardly any of the words change between versions. A new version typically contains some bug fixes and some new features, but these only affect at most 1% of the lines of code, meaning that 99% of the data is exactly the same. It’s identical. This was throwing the IR models out of balance, and causing the problems that we experienced. The reason is, IR models weren’t originally designed for source code. They were designed for newspaper articles or books. So version 1 here might contain all the newspaper articles in January, and version 2 contains all the newspaper articles in February. Sure, there might be some overlap, but in general we do not expect that 99% of the articles in February are exact duplicates from January. I believe that someone would be fired from the newspaper if this happened. So I proposed a model that better handles this data duplication inherent to source code. Basically, it inputs only the differences between versions into the IR model. This keeps everything in balance because it meets the implicit assumptions made by the IR model. Our case studies showed that results are better when the duplication is removed.
  12. Another way to better understand IR models is to understand their parameters and configurations. IR models have a lot of dials, knobs, and switches that you can tweak. For example, … Currently, researchers don’t focus on these parameters, and just seem to randomly choose settings without fully understanding the associated consequences. To better understand the parameters, we ran a large, empirical case study. We had 8000 bug reports, and we ran each of them through 3,168 IR model configurations. What we found was that there is a HUGE difference in performance between the various configurations. For example, the worst IR model could only achieve 1% accuracy; the best could get as high as 55%. And the mean was 23%. So the range was quite big, as was the variance. In addition, in this study we were able to determine which configurations were best, so that researchers, tool vendors, and developers could use these when building their own IR-based solutions.
  13. Let me conclude by summarizing the main contributions of this thesis. First, I proposed new applications of IR models in SE: TCP, and measuring the interaction of email and source code. I also proposed that we start using more advanced IR techniques in our work, such as topic evolution models and model combination. Finally, I proposed that if we increase our understanding of IR models, we can further improve results. The two studies have shown that by looking into the details of IR models, instead of treating them as black boxes, we can improve our techniques and get better results. My broader research vision is to provide better tools, techniques, and insights for software development teams, so that they can build better software at lower costs and have happier customers. In this thesis, I have taken a step towards that vision by proposing and evaluating ways to better utilize the unstructured elements of software repositories, which in turn provides new and better capabilities for software developers.
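
The “bag of words” assumption from note 4 can be illustrated with a minimal Python sketch. The function names `bag_of_words` and `cosine_similarity` are hypothetical, and cosine similarity over raw term counts is only one common choice of IR similarity, assumed here for illustration rather than taken from the talk.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count terms in the text; word order is discarded (the bag-of-words assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count bags (0.0 if either bag is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Two orderings of the same words produce the same bag.
report = bag_of_words("unit tests fail in standalone build")
shuffled = bag_of_words("standalone build: in unit tests fail")
print(report == shuffled)
```

Because only counts are kept, the comparison is cheap, which is the scaling argument the note makes.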
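
The prioritization idea in note 8 (extract terms, compare every pair of test cases, rank the least similar first) can be sketched as follows. This is a simplified stand-in, not the technique evaluated in the case study: it assumes Jaccard similarity over term sets and ranks by mean similarity to all other tests, and the names `terms`, `jaccard`, and `prioritize` are hypothetical.

```python
import re


def terms(source):
    """Term set from identifiers, comments, and string literals; camelCase is split."""
    words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", source)
    return {w.lower() for w in words}


def jaccard(a, b):
    """Share of terms the two sets have in common (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def prioritize(tests):
    """tests: dict of test name -> source text. Least similar to the rest comes first."""
    bags = {name: terms(src) for name, src in tests.items()}

    def mean_similarity(name):
        others = [jaccard(bags[name], bags[o]) for o in bags if o != name]
        return sum(others) / len(others) if others else 0.0

    # Ties are broken by name so the ordering is deterministic.
    return sorted(bags, key=lambda n: (mean_similarity(n), n))


tests = {
    "testParseXml": 'parseXml("<a/>")  # parse xml document',
    "testParseXmlAttributes": 'parseXml("<a b=1/>")  # parse xml attributes',
    "testEncryptPassword": 'encrypt("secret")  # encrypt password',
}
print(prioritize(tests))
```

No test is executed at any point, which is the advantage over coverage-based prioritization that the note points out.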
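
Note 9 names two ways to combine models: adding scores, and voting on ranks. A minimal sketch of both follows. The rank-based variant uses a Borda-style count, which is an assumption here since the note does not specify the voting scheme; the file names and scores are made up for illustration.

```python
from collections import defaultdict


def combine_by_score(model_scores):
    """model_scores: list of dicts mapping file -> score for one bug report.
    Adds the raw scores; this assumes the models score on comparable scales."""
    total = defaultdict(float)
    for scores in model_scores:
        for f, s in scores.items():
            total[f] += s
    return sorted(total, key=lambda f: (-total[f], f))


def combine_by_rank(model_scores):
    """Borda-style voting: in each model, a file at position p of n earns n - p points."""
    points = defaultdict(int)
    for scores in model_scores:
        ranked = sorted(scores, key=lambda f: (-scores[f], f))
        n = len(ranked)
        for pos, f in enumerate(ranked):
            points[f] += n - pos
    return sorted(points, key=lambda f: (-points[f], f))


# Hypothetical scores from two models for the same bug report.
model_a = {"Parser.java": 0.9, "Lexer.java": 0.2, "Gui.java": 0.1}
model_b = {"Parser.java": 0.3, "Lexer.java": 0.4, "Gui.java": 0.1}
print(combine_by_score([model_a, model_b]))
print(combine_by_rank([model_a, model_b]))
```

The two schemes can disagree: score addition lets one confident model dominate, while rank voting weighs every model equally.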
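
The fix described in note 11, feeding only the differences between versions into the IR model, can be sketched with Python's standard `difflib`. This is an assumption-laden simplification: it keeps only added lines and works line by line, whereas the model in the thesis may treat changes differently. The name `version_deltas` is hypothetical.

```python
import difflib


def version_deltas(versions):
    """versions: list of source texts, oldest first.
    Returns one document per version: the first version whole, then only the
    lines each later version adds relative to its predecessor."""
    if not versions:
        return []
    docs = [versions[0].splitlines()]
    for old, new in zip(versions, versions[1:]):
        diff = difflib.ndiff(old.splitlines(), new.splitlines())
        # ndiff marks added lines with a "+ " prefix; drop unchanged and removed lines.
        docs.append([line[2:] for line in diff if line.startswith("+ ")])
    return docs


v1 = "int parse()\nint render()"
v2 = "int parse()\nint render()\nint encrypt()"
for doc in version_deltas([v1, v2]):
    print(doc)
```

The second document contains just the new line, so the unchanged 99% no longer appears once per version.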
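
The configuration study in note 12 amounts to enumerating every combination of settings and comparing the outcomes. The sketch below shows that enumeration with `itertools.product`. The option names follow the dimensions on slide 17, but the specific values are hypothetical and do not reproduce the 3,168 configurations of the study; `summarize` expects accuracies measured elsewhere and the numbers passed to it here are placeholders.

```python
from itertools import product

# Hypothetical option values, for illustration only.
OPTIONS = {
    "code_representation": ["identifiers", "comments", "identifiers+comments"],
    "bug_report_representation": ["title", "description", "title+description"],
    "split_identifiers": [False, True],
    "remove_stop_words": [False, True],
    "stemming": [False, True],
    "term_weighting": ["tf", "tf-idf"],
}


def configurations(options):
    """Yield every combination of option values as a dict."""
    keys = list(options)
    for values in product(*(options[k] for k in keys)):
        yield dict(zip(keys, values))


def summarize(results):
    """results: dict mapping a configuration label to its measured accuracy."""
    scores = sorted(results.values())
    return {"worst": scores[0], "best": scores[-1], "mean": sum(scores) / len(scores)}


configs = list(configurations(OPTIONS))
print(len(configs))
print(summarize({"config-a": 0.0, "config-b": 0.5, "config-c": 1.0}))  # placeholder accuracies
```

Even six small dimensions multiply quickly, which is why an exhaustive comparison is needed before picking a configuration.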