2.
Stephen W. Thomas
Mining Software Repositories with Topic Models.
ICSE 2011
Stephen W. Thomas, Hadi Hemmati, Ahmed E. Hassan, and Dorothea Blostein
Static Test Case Prioritization Using Topic Models.
Empirical Software Engineering, 2012
Stephen W. Thomas, Nicolas Bettenburg, Ahmed E. Hassan, and Dorothea Blostein
Talk and Work: Recovering the Relationship between Mailing List Discussions and Development Activity.
Empirical Software Engineering, 2nd round
Stephen W. Thomas, Meiyappan Nagappan, Ahmed E. Hassan, and Dorothea Blostein
The Impact of Classifier Configuration and Classifier Combination on Bug Localization.
IEEE Transactions on Software Engineering, 2nd round
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
Validating the Use of Topic Models for Software Evolution.
SCAM 2010
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
Modeling the Evolution of Topics in Source Code Histories.
MSR 2011
Stephen W. Thomas, Bram Adams, Ahmed E. Hassan, and Dorothea Blostein
Studying Software Evolution Using Topic Models.
Science of Computer Programming, 2012
9.
The research and practice of using IR models to
mine software repositories can be improved by
(i) considering additional software engineering
tasks, such as prioritizing test cases;
(ii) using advanced IR techniques, such as
combining multiple IR models; and
(iii) better understanding the assumptions and
parameters of IR models.
10. Test Case Prioritization
Less similar → Higher priority
Similarity
identifiers
comments
string literals
Part 1
[EMSE 2012]
structural-based vs. IR-based
11. Source code ↔ Email Interaction
cleaning and
preprocessing
identifiers
comments
string literals
mail
code
XML
printing
installation
GUI
Code
Mail
Time
Activity
XML
Monitoring project status
Software explanation
Training and documentation
Part 1
[EMSE 20XX]
13. Combining Multiple IR Models
identifiers
comments
string literals
Bug report
Bug report
Similarity
title
description
Best individual
IR model
Random subset,
combined
Part 2
[TSE 20XX]
combined sets had improved performance (median improvement)
14. XML concept
Swing concept
Encryption concept
Time
Popularity
Concept Evolution Models
identifiers
comments
string literals
Part 2
[SCP 2012]
[SCAM 2010]
accuracy of topic evolutions
17. Preprocessing and Parameter Effects
Code representation
identifiers? comments?
past bug reports?
Bug report representation
title? description?
Preprocessing
split identifiers? remove stop words?
word stemming?
IR Model parameters
term weighting?
No. of topics? similarity measure?
No. of iterations?
Configuration matters!
worst: 1%
best: 55%
mean: 23%
Part 3
[TSE 20XX]
“configuration”
18. New!
Part 1
Part 2
Part 3
Proposed and evaluated a technique to prioritize test cases
Proposed and evaluated a technique to analyze the interaction of source code and mailing lists
Described and evaluated a technique to analyze code histories using topic evolution models
Proposed and evaluated a framework for combining the results of disparate IR models
Overcame the data duplication problem in large source code histories
Analyzed the sensitivity of IR models to data preprocessing and IR model parameters
Editor's Notes
This diagram describes the field of Mining Software Repositories. The overall goal is to take software repositories (which are readily-available datasets about a software project, such as [list a few]), apply data mining and machine learning techniques, and come out with some actionable knowledge that will help developers in some way. For example: bug prediction, traceability linking, feature location, …
In current research, the majority of the repositories that are mined are structured: call graphs, parse trees, execution logs;
However, there are also many repositories that are unstructured: [name them]
In fact, research has shown that about 80% of the content in software repositories is unstructured, meaning that we need to consider this data if we want to take full advantage of the software repositories.
However, unstructured data brings with it many challenges. Consider these two seemingly-innocent bug reports from one of my case studies.
Here we see many difficulties, such as undefined acronyms; spelling errors and typos; inconsistent usages; no labels, vague wording.
These problems exist because most unstructured data comes in the form of natural language text written by humans, which is notoriously difficult for a computer to deal with.
In an attempt to deal with unstructured software repositories, researchers have begun to use IR. IR models come from the NLP community, and are a good fit for our problem because they were designed to handle many of the problems of unstructured data. IR models help you search, organize, and provide structure for your unstructured data.
IR models use a simplifying assumption of the data, called the “bag of words” approach. This means that word order is not considered in IR models. By ignoring word order, analysis is simpler and faster, and the techniques can scale to large datasets. And we demonstrate that despite this simplifying assumption, IR models actually perform quite well in many scenarios.
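The bag-of-words idea can be sketched in a few lines of Python (a toy illustration with naive whitespace tokenization, not the preprocessing used in the actual studies):

```python
from collections import Counter

def bag_of_words(document: str) -> Counter:
    """Represent a document as unordered word counts: word order is discarded."""
    return Counter(document.lower().split())

# Two sentences with different word order map to the exact same bag,
# which is what makes the analysis simple, fast, and scalable.
a = bag_of_words("the parser reads the file")
b = bag_of_words("the file reads the parser")
assert a == b  # identical bags despite different word order
```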
Initial successes: concept location; document clustering; new code metrics; code search engines; traceability linking
To understand how IR models have been used in MSR, I did a thorough literature review of all papers that use IR models to mine unstructured data. In all, there are about 67 papers. I analyzed the trends and common usages, and found three shortcomings of the state-of-the-art, i.e., some areas where we could improve. My thesis is the proposal of solutions to each of these three shortcomings.
First shortcoming: most papers that use IR models only perform one of two software engineering tasks: concept location, and traceability linking. There’s nothing wrong with these applications, but I propose that we can go beyond these two tasks and use IR models to perform new SE tasks, and help software developers even further.
Second shortcoming: most papers use only the most basic IR models, such as the Vector Space Model (1975, 37 years ago). I propose that we use some of the more advanced IR techniques, which may bring better results and new capabilities to software developers.
Third shortcoming: most papers use IR models as off-the-shelf black boxes, without fully understanding how their parameters work, what input is required, and what the output means. I propose that we develop a better understanding of how IR models work, which will allow us to take full advantage of their potential, and improve results for software developers.
My thesis statement has a parallel structure: [read]
In TCP, the goal is to take an unordered set of test cases, and provide an ordering such that more bugs are detected earlier in the testing process. By doing so, if the test suite must be stopped early, then you can rest assured that you have detected as many bugs as possible.
Typically, TCP is tackled by using some sort of structural code coverage metric, that says: hey, how much code does this test case execute? If it executes a lot of code, then let’s give it a high priority. Otherwise, let’s give it a low priority. This is how it’s traditionally done.
However, I propose that we can use IR models to solve the same problem, only with the additional advantage of not having to run the test case to collect the execution information. Here’s how.
First, we extract the unstructured information from the source code: identifier names, comments, and string literals.
Then, we compute the IR similarity between each pair of test cases. This will tell us if the test cases are textually similar or not.
Then, if a test case is not very similar to other test cases, we give it a higher priority. The thought here is: if two test cases are exactly the same, then they will find the same bugs, so we don’t need to execute both. So we’re looking for test cases that are highly unlike any other test case, because it will detect unique bugs.
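The three steps above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical test-case texts and plain term-frequency vectors; the actual approach extracts identifiers, comments, and string literals from each test case:

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def prioritize(test_texts: dict) -> list:
    """Order test cases so the least similar (most unique) come first."""
    vecs = {name: Counter(text.lower().split()) for name, text in test_texts.items()}
    def max_sim(name):
        # How close is this test case to its nearest textual neighbor?
        return max((cosine(vecs[name], vecs[o]) for o in vecs if o != name), default=0.0)
    # Lowest maximum similarity = most unique = highest priority.
    return sorted(vecs, key=max_sim)

tests = {
    "t1": "open file read buffer close file",
    "t2": "open file read buffer close handle",
    "t3": "encrypt key cipher block",
}
print(prioritize(tests))  # t3 shares no words with the others, so it comes first
```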
We did a case study on five real-world systems, and found that our IR-based approach was as good as or better than existing approaches for prioritizing test cases.
The first advanced technique I propose is that of combining multiple IR models.
Let me explain this in the context of bug localization. […]
A simple way to combine models is to just add the scores of each file from the various IR models. That way, if a file gets a high score in several models, it will shoot up to the top in the combined model. Another way is expert voting, where only the rank of each file is used, as opposed to the score. Either way, the end goal is to utilize the “expertise” of each model.
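Both combination strategies can be sketched quickly (the file names and scores here are hypothetical; each dict stands for one IR model's relevance scores for a bug report):

```python
def combine_by_score(model_scores):
    """Sum each file's scores across models; a file ranked high by several models rises."""
    combined = {}
    for scores in model_scores:
        for f, s in scores.items():
            combined[f] = combined.get(f, 0.0) + s
    return sorted(combined, key=combined.get, reverse=True)

def combine_by_rank(model_scores):
    """Expert voting: use each model's ranks only, ignoring its raw scores."""
    points = {}
    for scores in model_scores:
        ranking = sorted(scores, key=scores.get, reverse=True)
        for rank, f in enumerate(ranking):
            # Borda-style: first place gets the most points.
            points[f] = points.get(f, 0) + (len(ranking) - rank)
    return sorted(points, key=points.get, reverse=True)

vsm = {"Parser.java": 0.9, "Lexer.java": 0.4, "Cache.java": 0.1}
lda = {"Parser.java": 0.6, "Lexer.java": 0.7, "Cache.java": 0.2}
print(combine_by_score([vsm, lda]))  # Parser.java first: scored high by both models
```

Score addition keeps the models' confidence information; rank voting is more robust when the models' score scales are not comparable.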
If a manager or developer had a dashboard that magically told them what developers were working on, and when, at a high level, they would be very happy. This would keep them informed, allow them to perform retrospective analysis, and maybe even be part of a preemptive maintenance solution that automatically monitored the “health” of the source code over time.
To achieve this goal, we use an advanced IR model called a topic evolution model. It works by [explain]
We input these versions into an advanced IR model, called a topic evolution model, which gives us exactly what we’re looking for.
A case study found that a large majority of the discovered evolutions were in-sync with how developers described the project, and since this technique is automatic, it will be helpful to use in an automatic dashboard setting.
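The core output of a topic evolution model, a popularity curve per topic over time, can be illustrated with a heavily simplified sketch. Here the topics are given as fixed word sets, whereas a real topic evolution model discovers them automatically from the version history:

```python
def topic_popularity(versions, topics):
    """For each version, compute the share of words belonging to each topic."""
    history = {name: [] for name in topics}
    for words in versions:  # one bag of words per release
        total = len(words) or 1
        for name, topic_words in topics.items():
            hits = sum(1 for w in words if w in topic_words)
            history[name].append(hits / total)
    return history

# Hypothetical word sets and releases, only for illustration.
topics = {"xml": {"xml", "parse", "tag"}, "gui": {"button", "window", "click"}}
versions = [
    ["xml", "parse", "button"],            # v1: XML work dominates
    ["button", "window", "click", "xml"],  # v2: GUI work picks up
]
print(topic_popularity(versions, topics))  # xml declines, gui rises
```

A rising curve for a topic suggests developers are actively working on that concept in that period, which is exactly the signal a dashboard would plot.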
During my research, I came across an issue which I now call the “data duplication problem”.
When I tried to analyze the evolution of long-lived systems with many different versions, I found that the IR model was producing unusual and unexpected results. Things just didn’t make sense: the topics were weird, and something was off, but I didn’t know what.
Upon further analysis, I learned that the cause of this problem was that in source code, hardly any of the words change between versions. A new version typically contains some bug fixes and some new features, but these only affect at most 1% of the lines of code, meaning that 99% of the data is exactly the same. It’s identical. This was throwing the IR models out of balance, and causing the problems that we experienced.
The reason is, IR models weren’t originally designed for source code. They were designed for newspaper articles or books. So version 1 here might contain all the newspaper articles in January, and version 2 contains all the newspaper articles in February. Sure, there might be some overlap, but in general we do not expect that 99% of the articles in February are exact duplicates from January. I believe that someone would be fired from the newspaper if this happened.
So I proposed a model that better handled this data duplication inherent to source code. Basically what it does, is it only inputs the differences between versions into the IR model. This keeps everything in balance because it meets the implicit assumptions made by the IR model. Our case studies showed that results are better when the duplication is removed.
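The essence of the fix, feeding the model only what changed rather than full snapshots, can be sketched as follows (a toy version that diffs whole files; the actual model works at a finer granularity):

```python
def changed_documents(prev_version, next_version):
    """Return only the files that are new or modified between two versions,
    so the IR model never sees the ~99% of content that is duplicated."""
    changed = {}
    for path, text in next_version.items():
        if prev_version.get(path) != text:  # new or modified file
            changed[path] = text
    return changed

# Hypothetical two-file system across two releases.
v1 = {"a.c": "open file read", "b.c": "draw window"}
v2 = {"a.c": "open file read", "b.c": "draw window button"}  # only b.c changed
print(changed_documents(v1, v2))  # {'b.c': 'draw window button'}
```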
Another way to better understand IR models is to understand their parameters and configurations. IR models have a lot of dials, knobs, and switches that you can tweak. For example, …
Currently, researchers don’t focus on these parameters, and just seem to randomly choose settings without fully understanding the associated consequences.
To better understand the parameters, we ran a large, empirical case study. We had 8,000 bug reports, and we ran each of them through 3,168 IR model configurations. What we found was that there is a HUGE difference in performance between the various configurations. For example, the worst IR model could only achieve 1% accuracy; the best could get as high as 55%. And the mean was 23%. So the range was quite big, as was the variance.
In addition, in this study we were able to determine which configurations were best, so that researchers, tool vendors, and developers could use these when building their own IR-based solutions.
Let me conclude by summarizing the main contributions of this thesis.
First, I proposed new applications of IR models in SE: TCP, and measuring the interaction of email and source code.
I also proposed that we start using more advanced IR techniques in our work, such as topic evolution models and model combination.
Finally, I proposed that if we increase our understanding of IR models, we can further improve results. The two studies have shown that by looking into the details of IR models, instead of treating them as black boxes, we can improve our techniques and get better results.
My broader research vision is to provide better tools, techniques, and insights for software development teams, so that they can build better software at lower costs and have happier customers. In this thesis, I have taken a step towards that vision by proposing and evaluating ways to better utilize the unstructured elements of software repositories, which in turn provide new and better capabilities for software developers.