WordPress Websites for Engineers: Elevate Your Brand
Mining Development Repositories to Study the Impact of Collaboration on Software Systems
1. Mining Development Repositories
to Study the Impact of
Collaboration on Software Systems
Nicolas Bettenburg
nicbet@cs.queensu.ca
SOFTWARE ANALYSIS
& INTELLIGENCE LAB
Wednesday, 11 April, 12 1
2. Software Development is a Social Activity
Source Code stands in direct relation to
organizational structure. [Conway:Datamation:1968]
Developers spent large part of work day
communicating with fellow developers. [Begel:ICSE:2010]
Wednesday, 11 April, 12 2
3. Communication is Critical for Success
Communication is the most referenced
problem in distributed development.
[Grinter:GROUP:1999]
[Bird:ACMComm:2009]
Wednesday, 11 April, 12 3
4. Research Hypothesis
“The collaboration between stakeholders
impacts the code quality and the development
community of a software system.”
Wednesday, 11 April, 12 4
5. Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Wednesday, 11 April, 12 5
6. Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Wednesday, 11 April, 12 6
7. Available Knowledge in Data
Version Control Systems Mailing Lists Issue Tracking Systems
Wednesday, 11 April, 12 7
8. Available Knowledge in Data
Version Control Systems Mailing Lists Issue Tracking Systems
Communication Data
Wednesday, 11 April, 12 7
9. Available Knowledge in Data
Version Control Systems Mailing Lists Issue Tracking Systems
Communication Data
• Source Code Comments
• Change-Log Messages
• Developer Emails & Discussions
• Support Dialogues
Wednesday, 11 April, 12 7
10. Communication Data Exists
Mainly as Unstructured Data
In this report, you have defined a parameter named blocksize,
which is given a value of "7|D|1|D". In open script of data set,
there are below lines code:
<script begin>
token=Packages.java.util.StringTokenizer(params["blocksize"],"|");
vec=new Packages.java.util.Vector();
while(token.hasMoreTokens()){
vec.addElement(token.nextToken()); Eclipse #150222
}
params["DateRange"]=java.lang.Integer.parseInt(vec.elementAt(0));
</script end>
Since the value of params["blocksize"] is "7|D|1|D", vec.elementAt(0)
is "7", and then it can not be parsed to int value. In 1.0.1,
the value of params["blocksize"] might be 7|D|1|D, so it can be
parsed to int value of 7.
Extraction and processing of unstructured
data is challenging. [MUD:Workshop:2010]
Wednesday, 11 April, 12 8
11. Mining Collaboration Data
[Bettenburg:ICPC:2011]
chnical Information in Un structured Data
A Lightw eight Approach to Uncover Te
Michel Smidt
ams, Ahmed E. Hassan
Build ID: M20070212-1330
Nicolas Bettenburg, Bram Ad Dept. of Computer Science S)
gence Lab
Software Analysis and Intelli
Steps To Reproduce:
Una des a keytyinof Bremen
ng for "M1+S" (ie. Alt+
1. Create a plugin for eclipse that iversi bindione of the top level
inclu
Queen’s University
• Use Spellchecking
as mnem onic
as Bremen, for Help > any
where S is any letter that is used
the mnemonic Germ &So
ftware Updates,
menus. Since eclipse uses "S"
Kingston, Ontario, Canada Email: michelIDE nformatik.u
"S" is sufficient . @i ni-bremen.de
• Empirical validation
cs.queensu.ca
Email: {nicbet,bram,ahmed}@
2. Laun ch the plugin as part of Eclipse our example in #1)
the Help menu (to go along with
3. Press Alt+H to bring down
tes" is missing its mnemonic.
BUG: Notice "Software Upda
nication through email, cha
t, or
More information:
The code after "if (callback.is
Eclipse's MenuManager.
AcceleratorInUse(SWT
java removes the mnemonic,
.ALT | character))" inside
but it seems like Eclipse
level menumanagers like
• Improved on state of the art
Abstract—Developer commu
eratorInUse" only for top
should be checking "isAccel
s mostly of largely uns tructured
issue report comments consist
,Edit,...,Help, etc. :
rma-
File
text, mixed with technical info
data, i.e., natural language ons, source code
jargon, abbreviati
/* (non-Javadoc) onItem#update(java.l
ang.String)
tion such as project-specific
e.action.IContributi
cal artifacts * @see org.eclipse.jfac
patches, stack traces and identifiers. These techni */
of knowle dge on the technical tring property) {
represent a valuable source
public void update(S
applications from
= getItems();
tributionItem items[]
tem, with a wide range of
ICon
part of the sys vo-
s to creating project-specific items.length; i++) {
establishing traceability link en natural
for (int i = 0; i <
e-style delimiters betwe property);
cabularies. However, the fre
items[i].update(
hnical
tent make the mining of tec }
language and technical con general-purpose
t step towards a
[...]
artifacts challenging. As a firs information
}
technique to extractin g all kinds of technical
present a lightweight approach Any status on this bug?
from unstructured data, we guage text. Our
cal artifacts and natural lan
) [...]
for M6 (API) or M7 (non-API by a prototype
to untangle techni are I'd consider any contributions
nical information uncovered
g spell checking tools, which Figure 1. Examples of tech optionalposed Manager with API (Eclipse Platform
approach is based on existin in Menu in this paper.
and
ms and A 3.5 fix enta be to of the approach pro
available across platfor
that behaviour
implemwouldtion makeand to have the WorkbenchActionBuilder contributed
well-understood, fast, readily gh a
of technical artifacts. Throu
off by#208626).in 3.5,
default early gers turn it on
Bug ions contributed MenuMana
impartial to different kinds
and actionSets/editorAct
our approach
MenuManagers
demonstrate that in the correct place).
handcrafted benchmark, we
(if I can find MenuManagers
technical
is able to successfully uncover a wide range of team to make sure we understan
a
d what the
such, mining unstructured dat
I'd like us to work with the SWT
data.
way
sure that we aren't getting in the
information in unstructured or project-specific terms. As
correct platform behavior is, and make
ormation
onics) seems odd to me, in
ge analysis, unstructured dat
a, the exchange of inf
nt behavior (i.e. turning off mnem
is challenging: it is meant for
of that. The curre
Keywords-text mining, langua
we should fix it properly.
automated processing using
general. If we're going to fix this,
technical information. between humans, rather than
presents an example of tech-
computer machinery. Figure 1
I. I NT RO DU CT ION found in unstructured data.
nical information commonly
a unique history of design ering technical information
Every software system has Recent approaches for discov
Wednesday, 11 April, 12 changes, as well as development and e focussed on recognizing 9
ions, software unstructured data [3]–[5] hav
12. Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Wednesday, 11 April, 12 10
13. Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Wednesday, 11 April, 12 10
14. Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Wednesday, 11 April, 12 11
15. Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Wednesday, 11 April, 12 11
16. Quantify Impact on Quality: Idea
Extracted Communication Data
Wednesday, 11 April, 12 12
17. Quantify Impact on Quality: Idea
Extracted Communication Data
compute
Social Metrics
Wednesday, 11 April, 12 12
18. Quantify Impact on Quality: Idea
Extracted Communication Data
compute
Social Metrics
measure relationships
Post-Release Defects
Wednesday, 11 April, 12 12
19. Discussion Social
CONTENT STRUCTURES
4 Dimensions
of Measures
Measures of Communication
WORKFLOW DYNAMICS
Wednesday, 11 April, 12 13
20. Conceptual Approach
Measure Measure
Discussion Post-Release
Metrics Bugs
6 months 6 months
time
LINK USING STATISTICAL MODELS
Wednesday, 11 April, 12 14
21. Findings of our work
(1) Social metrics explain post-release defects
as good as code metrics.
Wednesday, 11 April, 12 15
22. Findings of our work
(1) Social metrics explain post-release defects
as good as code metrics.
(2) Combination of social metrics and code
metrics is cumulative.
Wednesday, 11 April, 12 15
23. Findings of our work
(1) Social metrics explain post-release defects
as good as code metrics.
(2) Combination of social metrics and code
metrics is cumulative.
(3) Identify factors that have positive and
negative relationships with defects.
Wednesday, 11 April, 12 15
24. Findings of our work
(1) Social metrics explain post-release defects
as good as code metrics.
(2) Combination of social metrics and code
metrics is cumulative.
(3) Identify factors that have positive and
negative relationships with defects.
[ICPC‘2010] (Best Paper)
[JEMSE?]
Wednesday, 11 April, 12 15
25. Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Wednesday, 11 April, 12 16
26. Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Wednesday, 11 April, 12 16
27. Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Wednesday, 11 April, 12 16
28. Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Wednesday, 11 April, 12 17
29. Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Wednesday, 11 April, 12 17
30. Proposed Approach
I. Extraction of communication data
II. Study impact on software quality
III. Study impact on development community
Wednesday, 11 April, 12 17
31. Available Knowledge in Data
Code Review Systems Mailing Lists Issue Tracking Systems
Data on Management
of Code Contributions
Wednesday, 11 April, 12 18
32. Contribution Management
Patch
Project
Feedback
Repository
Feedback
Submission
Review OK Verification OK
Integration
Wednesday, 11 April, 12 19
33. Studying Impact on Community through
Contribution Management
Goal:
Study how contributors, reviewers, verifiers and the
software are impacted by communication (anomalies)
through statistical models.
Example:
Reviewers leaving community due to lack of feedback
Wednesday, 11 April, 12 20
34. Available Knowledge in Data
Version Control Systems Mailing Lists Issue Tracking Systems
Workflow Information
Social Networks
Wednesday, 11 April, 12 21
35. Evolution of Code-Knowledge
Communities
Internet Explorer reed
masayuki
cjcypoi02 dietrich
steve.england corevette
steffen.wilberg
davemgarrett
mmortal03 timeless mano
fittysix
matspal
longsonr
zurtex
matti edilee
mconnor
cwwmozilla beltzner
dveditz
adelfino zeniko
kliu
alice0775
sziadeh mark.finkle robert.bugzilla
philringnalda
sgautherie.bz kev
faaborg
johnath
martijn.martijn
jmjeffery jo.hermans nrthomas gavin.sharp polidobj
m-wada
XML Parser
jbecerra jdarmochwal
john.p.baker jruderman mak77
ria.klaassen
VYV03354 cbook bomfog
dao
elmar.ludwig sdaugherty
vseerror
nightstalkerz l10n highmind63 twalker
mh+mozilla
klaas1988
ehsan stephen.donner
me.at.work
phiw
hskupin
ctalbert
tchung tomer
marcia timwi rotis
uliss
sylvain.pasche
bugzilla
marco.zehe cl-bugs-new2
JavaScript
tonglebeak
abillings info UI
Engine
deletesoftware anselm.meyer
eddy_nigg
matt
RainerStroebel
samuel.sidler+old alex
hasham8888
aarobertxtr
manujsabarwal johnjbarton
myles7897
paulc
shaver
smichaud
mozilla
zhangchunlin dtownsend
jdaggett
kbrosnan
bzbarsky
sdwilsh
Wednesday, 11 April, 12 22
36. Thesis Progress
Tools and techniques Empirical Validation
for mining communication repositories of presented tools and techniques
Empirical Validation Empirical Validation
of relationship between collaboration of relationship between collaboration
and software quality. and development teams.
Wednesday, 11 April, 12 23
37. Thesis Progress
Tools and techniques Empirical Validation
for mining communication repositories of presented tools and techniques
Empirical Validation Empirical Validation
of relationship between collaboration of relationship between collaboration
and software quality. and development teams.
Wednesday, 11 April, 12 23
38. Thesis Progress
Tools and techniques Empirical Validation
for mining communication repositories of presented tools and techniques
Empirical Validation Empirical Validation
of relationship between collaboration of relationship between collaboration
and software quality. and development teams.
Wednesday, 11 April, 12 23
39. Thesis Progress
Tools and techniques Empirical Validation
for mining communication repositories of presented tools and techniques
Empirical Validation Empirical Validation
of relationship between collaboration of relationship between collaboration
and software quality. and development teams.
Wednesday, 11 April, 12 23
40. Thesis Progress
Tools and techniques Empirical Validation
for mining communication repositories of presented tools and techniques
Empirical Validation Empirical Validation
of relationship between collaboration of relationship between collaboration
and software quality. and development teams.
Wednesday, 11 April, 12 23
41. Points for Discussion
• How to do evaluation of code-knowledge
communities? (ground truth)?
• Applicability to industrial settings (almost no
communication data records available)?
• Extend work to defect prediction?
• Practical implications: management,
moderation, staffing, ... ?
Wednesday, 11 April, 12 24