SlideShare a Scribd company logo
1 of 38
Mining Sociotechnical Information
From Software Repositories
Marco Aurélio Gerosa
UNIVERSITY OF SÃO PAULO, BRAZIL

University of California, Irvine
Informatics Seminar
March/2014
Software repositories

SW Architect

Manager

Tester
Programmer

Programmer

Computer Mediated
Tools
Client

Source
Code

User

Issues

Bug
Reports

Message
Archives

Etc.

Current and historical artifacts and interactions are registered in software repositories
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

2
Repositories of repositories
11.3 millions repositories
5.4 millions users
In 2013:
• 3 millions new users
• 152 millions pushes
• 25 millions comments
• 14 millions issue
• 7 millions pull requests

93K projects
1 million users

250K projects

661K projects
29 billions of lines of codes
3 millions users

30K projects
https://launchpad.net

http://www.ohloh.net/

https://github.com/about/press
http://octoverse.github.com/

324K projects
3.4 millions developers

36K projects

33K projects

http://en.wikipedia.org/wiki/CodePlex

200 projects
http://projects.apache.org/indexes/alpha.html

http://sourceforge.net/apps/trac/sourceforge/wiki/What%20is%20SourceForge.net

http://en.wikipedia.org/wiki/Comparison_of_open_source_software_hosting_facilities
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

3
Mining software repositories
Communication

Source code
and artifacts

3C Model*

Discussion lists
Comments on issues
Code comments
User reports
Q&A sites
Social media
Issue trackers
Project management systems
Coordination
Reputation systems

Practitioner

Researcher

Applications

Collaboration and software production

Mining

Information about a
project

Decision making

Information about an
ecosystem

Cooperation

Software
understanding

Information about
Software Engineering

Support maintenance

Empirical validation of
ideas & techniques
Tag cloud from MSR 2014 CFP

“The Mining Software Repositories (MSR) field analyzes the rich data available in
software repositories to uncover interesting and actionable information about
software systems and projects.”
http://2014.msrconf.org/
March/2014

*3C Model: Fuks, H., Raposo, A., Gerosa, M.A., Pimentel, M. & Lucena, C.J.P. (2007) “The 3C Collaboration Model” in: The Encyclopedia of ECollaboration, Ned Kock (org), ISBN 978-1-59904-000-4, pp. 637-644.
Marco Aurélio Gerosa (gerosa@ime.usp.br)

4
Examples
Sentiment analysis on commits

GitHub Data
Challenge 2nd place

http://www.igvita.com/slides/2012/bigquery-github-strata.pdf
March/2014

http://geeksta.net/geeklog/exploring-expressions-emotions-github-commit-messages/

Marco Aurélio Gerosa (gerosa@ime.usp.br)

6
Programming language relations

http://www.igvita.com/slides/2012/bigquery-github-strata.pdf

https://github.com/mjwillson/ProgLangVisualise
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

7
Mining discussion lists
What is the discussion list for?

Christian Bird, Alex Gourley, Prem Devanbu, Michael
Gertz, and Anand Swaminathan. 2006. Mining email
social networks. MSR 2006
March/2014

Anja Guzzi, Alberto Bacchelli, Michele Lanza, Martin
Pinzger, and Arie van Deursen. 2013. Communication in
open source software development mailing lists. MSR
2013

Marco Aurélio Gerosa (gerosa@ime.usp.br)

8
Coordination requirements

Cataldo, M., Dependencies in geographically distributed software
development: Overcoming the limits of modularity, PhD Thesis

Santana, F. et a. “XFlow: An Extensible Tool for Empirical
Analysis of Software Systems Evolution”. ESELAW 2011
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

9
Programmers who changed this function also changed …

Thomas Zimmermann, Peter Weissgerber, Stephan Diehl, and Andreas Zeller. 2005. Mining Version
Histories to Guide Software Changes. IEEE Trans. Software Eng. 31, 6 (June 2005), 429-445.
DOI=10.1109/TSE.2005.72 http://dx.doi.org/10.1109/TSE.2005.72
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

10
Don’t program on Fridays

http://www.slideshare.net/taoxiease/software-analytics-towards-software-mining-that-matters
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

11
Which files are more buggy?

March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

12
Lack of documentation

Deficient Documentation Detection: A Methodology to Locate Deficient Project Documentation using Topic
Analysis, Joshua Charles Campbell, Chenlei Zhang, Zhen Xu, Abram Hindle, and James Miller, MSR 2013
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

13
Some questions from MSR 2013
http://2013.msrconf.org/program.php
• How to recommend developers to bug reports?
• How to filter update notifications?
• What can apps’ permissions reveal?
• How to extract feature requests from apps online reviews?
• How to analyze peer code-review data?

• How to find API usage patterns?
• Is programming knowledge related to age?
• How do patches reach the Linux kernel?
• How do code clones evolve?

• How to process crash reports?
• How to predict defects and effort?
• Etc.
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

14
Some of our work
Some of our work
1.

Change dependencies identification

2.

Design degradation identification

3.

Key developers characterization

4.

Unit tests’ feedback for code quality

5.

Refactoring

6.

Change prediction

7.

Support for OSS projects newcomers

March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

16
1) Change dependencies identification
Dependency?
dF)
oo
Tr
hM
i
s e
(

C
lA
a
s
s

C
lB
a
s
s

A

B

• “A dependency means that a client element has knowledge of the supplier element and a
change in the supplier may affect the client” [Larman, 2004]
• A dependency relation means that the semantics of the depending elements is semantically
or structurally dependent on the definition of the supplier element [UML Formal
Specification]
• Unrecognized dependencies result in a higher number of defects [Herbsleb et al., 2006]
• Structural (or static), dynamic, or logical dependency?

• Dependencies may be hard to identify
• Publisher/subscriber, polymorphism, clones, crosscutting concerns, semantic relations etc.

De
ed
pn
e c
ny
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

C
h
a
n
g
e
17
Change dependencies
Files frequently changed together share some sort of dependency [Gall et al. 1998]

Strong change dependency from B to A
(the opposite is a much weaker dependency)

A logical dependency denotes an implicit and evolutionary
relationship between software artifacts

The concept has proven useful in several different studies:
◦
◦
◦
◦
◦
◦
◦
◦

Software quality analysis [Cataldo & Nambiar, 2010]
Bugs prediction [D’Ambros et al. , 2009a]
Change prediction and change impact analysis [Zimmermann et al., 2005]
Uncover cross-cutting concerns [Adams et al., 2010]
Uncover design flaws and opportunities for refactoring [D’Ambros et al., 2009b]
Understand and evaluate software architecture [Zimmermann et al., 2003]
Requirements traceability [Ali et al., 2013]
Maintain documentation [Kagdi et al., 2006]

March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

18
1.1) Structural x Change Dependencies
Overlap between change dependencies
and structural dependencies

Sta
tc l
ru
u
r
De
ed
pn
e c
ny

Cn
o g
-as
c e
h ?

Gustavo Oliva,
PhD candidate

Analysis of 150K commits of the ASF showed that:
- 93% of the change dependencies did not involve structural dependencies
- 95% of the structural dependencies did not imply in a change dependency

Oliva, G.A., Gerosa, M. A. (2011) “On the Interplay between Structural and Logical Dependencies in Free Software”.
Brazilian Symposium on Software Engineering (SBES 2011)
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

19
1.2) Change dependencies origins
Manual classification of commits to understand the origins of
change dependencies
Category
Refactoring elements that belong to
a same semantic class
Structural dependencies on a
changing semantic class
Cross-cutting concerns
Overloaded revision
Repository operations
Structural dependencies on specific
elements
Other reasons
Total

Joint-changes

Total %

80

19.6%

9

2.2%

165
60
21

40.4%
14.7%
5.1%

66

16.2%

7
408

Gustavo Oliva,
PhD candidate

1.7%

Oliva, G.A., Santana, F., Gerosa, M. A., Souza, C. (2011) “Towards a Classification of Logical Dependencies Origins: A
Case Study”. Proceedings of the 12th International Workshop on Principles of Software Evolution and the 7th annual
ERCIM Workshop on Software Evolution (IWPSE-EVOL '11)
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

20
1.3) Preprocessing commits
Commit = change?
Using the sliding time window approach [Zimmermann & Weißgerber, 2004]
to group SVN commits

N

Sum

Mean

StDev

Skewness

Kurtosis

Before

479,794

3,206,900

6.68

37.84

33.80

1,844.00

After

453,865

3,174,051

6.99

40.79

39.56

Gustavo Oliva,
PhD candidate

2,829.94

Evaluation in the Apache code repository showed that the produced grouping corresponded
to 4.6% of the number of commits
What about commit habits/practices/policies?
Social aspects matter!
Oliva, G. A., Santana, F., Gerosa, M. A., Souza, C. (2012), “Preprocessing Change-Sets to Improve Logical Dependencies
Identification”, 6th Int. Workshop on Software Quality and Maintainability (SQM 2012)
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

21
2) Design degradation identification
Rigidity and fragility [Martin & Martin, 2006] identification based on commit
metadata

Gustavo Oliva,
PhD candidate

5.3

Rigidity => designs difficult to change due to ripple
effects => commit density (number of changed files per
commit)
Fragility => designs break in different areas when a
change is performed => commit dispersion (distance in
the directory tree among file paths included in a
commit)
Oliva, G., Steinmacher, I., Wiese, I.S., Gerosa, M.A. “What Can Commit Metadata Tell Us About Design Degradation?”,
In: 13th International Workshop on Principles on Software Evolution (IWPSE 2013), Saint Petersburg.
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

22
3) Key developers characterization
Key developers participation

Oliva, G., Santana, F.W., da Silva, J. T., Oliveira, K.C.M., Werner, C.M.L., Souza, C.R.B. & Gerosa, M.A., “Evolving the System’s
Core: A Case Study on the Identification and Characterization of Key Developers in Apache Ant”, Computing and Informatics [to
appear].
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

23
4) Unit tests’ feedback for code quality
Mauricio Aniche,
PhD candidate

Number of asserts indicate
- Cyclomatic complexity?
- LOC ?
- Method calls ?

- “Asserted Objects” metric presents
better results than “number of asserts”
- Statistically difference in 20% of the
projects

22 ASF projects
3 industry projects
Aniche, M., Oliva, G.A., Gerosa, M.A., “What Do the Asserts in a Unit Test Tell Us about Code Quality? A Study on Open
Source and Industrial Projects”, 17th European Conference on Software Maintenance and Reengineering (CSMR 2013).
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

24
5) Refactoring
Most part of the documented refactoring does not reduce cyclomatic complexity.
However, 23% of the documented refactoring reduce cyclomatic complexity while
12% of the other commits have the same effect.

Francisco Sokol,
MSc candidate

www.metricminer.org.br

Sokol, F., Aniche, M.F., Gerosa, M.A., “Does the Act of Refactoring Really Make Code Simpler? A Preliminary Study”,. In:
IV Brazilian Workshop of Agile Methods (WBMA 2013).
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

25
6) Change prediction
Using social, process, and architectural metrics to improve change proneness
prediction

Igor Wiese,
PhD candidate

13 projects, 6 classifiers, 11 metrics

Social and architectural metrics improved the prediction model
Similar results when considering projects grouped by change ratio
Dimensions
Process

Social

Architecture

Metric Name
HCM
WHCM
Churn
WChurn
MFD – Modification
File-Developer
DevParticipation
WDevParticipation
Rigidity
Fragility
WCo-Changes
UD – Unworked
Dependencies

Reference
Hassan [13]
D’Ambros et al. [8]
D’Ambros et al. [8]
Nagappan and Ball [23]
New
Rahman & Devanbu [28]
New
Oliva et al. [26]
Oliva et al. [26]
New
New

Wiese, I. S, Nassif Jr, Steinmacher I, Re Reginaldo, Gerosa, M.A “Comparing communication and development networks for
predicting file change proneness: An exploratory study considering process and social metrics”,. In: International Workshop on
Software Quality and Maintainability (SQM 2014).
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

26
7) Supporting Newcomers to OSS projects
Analysis of newcomers dropout reasons
Use of MSR to identify newcomers

Igor Steinmacher,
PhD candidate

Systematic review on awareness in DSD [JCSCW 2013]

comum + ação
Ação de tornar comum
COMUNICAÇÃO

demanda

gera compromissos
gerenciados pela

Awareness

COOPERAÇÃO

COORDENAÇÃO

co + operar + ação
Ação de operar
em conjunto

organiza as tarefas para

co + ordem + ação
Ação de organizar
em conjunto

Steinmacher, I., Wiese, I.S., Chaves, A.P., Gerosa, M.A., Why do newcomers abandon open source software projects?, 6th Int.

Workshop on Cooperative and Human Aspects of Software Engineering (CHASE 2013)
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

27
7) Supporting Newcomers to OSS projects
◦
◦

Systematic Literature Review on barriers for newcomers to OSS
Qualitative analysis of interviews
Igor Steinmacher,
PhD candidate

Steinmacher, I., Wiese, I., Conte, T., Gerosa, M.A., Redmiles, D.F. The Hard Life of Newcomers to OSS Projects, 7th International

Workshop on Cooperative and Human Aspects of Software Engineering

Steinmacher, I.; Graciotto Silva, M. A. ; GEROSA, M.A., Barriers faced by newcomers to open source projects: a systematic
review. 10th International Conference on Open Source Systems (OSS 2014)
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

28
And now?
Some MSR challenges
• Years of sociotechnical data of millions of projects
•

BigData & large-scale computing

• Natural Language processing
•

Hao Zhong, H., Zhang, L., Xie, T. & Mei, H. “Inferring Resource Specifications from Natural
Language API Documentation”, ASE 2009.

• Incomplete and inaccurate information (e.g., commit comments)

• Multiple identities and commits on behalf of others
• Traceability and provenance
•

Davies, German, Godfrey & Hindle. “Software bertillonage: finding the provenance of an entity”,
MSR 2011

• Tools usage change over time
•

March/2014

Guzzi et al. “Communication in open source software development mailing lists”, MSR 2013

Marco Aurélio Gerosa (gerosa@ime.usp.br)

30
What’s next? (1/2)
Towards theories and consolidation
Sharing insights

Sharing patterns
◦ DAPSE http://dapse.unbox.org
◦ MSR Cookbook

http://swag.cs.uwaterloo.ca/~okononen/msr/

Sharing data
◦
◦
◦
◦
◦
◦
◦
◦

[Gaines, 1999]

http://www.githubarchive.org/
http://www.ohloh.net /
Dataset papers published at MSR
MSR Challenges dataset
Promisedata.org Promisedata.org
Flossmetrics.org
Flossmole.org
MetricMiner

Replicating studies
◦ Ghezzi, G. & Gall, H. “Replicating Mining
Studies with SOFAS”, MSR 2013
http://thomas-zimmermann.com/2013/10/software-analytics-sharing-information/
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

31
What’s next? (2/2)
Actionable information in the Software
Environments (new IDEs)

Software Analytics
◦ “The use of analysis, data, and systematic
reasoning to make decisions”

New sources
◦ A lot of information is lost between commits => IDE
& desktop instrumentation and live data collection
◦ Singer, J. “Navtracks: Supporting navigation in
software maintenance”, ICSM'05
◦ Blincoe, K., Valetto, G. & Goggins, S. “Proximity: a
measure to quantify the need for developers'
coordination. CSCW'12
◦ Software execution logs
◦ Social Media
◦ Storey, M., Treude, C., Deursen, A. & Cheng, L.,
“The impact of social media on software
engineering practices and tools” FoSER'10
◦ Bougie, G., Starke, J., Storey, M. & German, D.,
“Towards understanding twitter use in software
engineering: preliminary findings, ongoing
challenges and future questions”, Web2SE'11
◦ Q&A Sites
◦ “Over 92% of Stack Overflow questions about
expert topics are answered - in a median time of
11 minutes” [Mamykina et al., CHI'11]
◦ Campbell et al., “Deficient Documentation
Detection: A Methodology to Locate Deficient
Project Documentation using Topic Analysis”,
MSR'13

http://thomas-zimmermann.com/2013/10/software-analytics-sharing-information/
March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

32
Conferences
◦ MSR: Working Conference on Mining Software Repositories
◦ http://msrconf.org
◦ ICSM: International Conference on Software Maintenance
◦ http://conferences.computer.org/icsm
◦ CSMR-WCRE: Software Evolution Week (joins CSMR and WCRE)
◦ http://ansymo.ua.ac.be/csmr-wcre
◦ ICSE: International Conference on Software Engineering
◦ http://www.icse-conferences.org
◦ SCAM: International Working Conference on Source Code Analysis and Manipulation
◦ http://www.ieee-scam.org/
◦ ESEM: International Symposium on Empirical Software Engineering and Measurement
◦ http://www.esem-conferences.org
◦ PROMISE

March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

33
Thank you!
MARCO AURELIO GEROSA
( G E R O S A @ I M E . U S P. B R ) – O F F I C E 5 2 2 8 @ D B H
@GEROSA_MARCO
H T T P : / / L A P E S S C . I M E . U S P. B R / ( A L L P U B L I C AT I O N S A R E A V A I L A B L E AT T H I S S I T E )
H T T P : / / W W W. I M E . U S P. B R / ~ G E R O S A
H T T P : / / N A P. U S P. B R / N A W E B
Other projects

Smart Audio City Guide

http://www.arquigrafia.org.br

A social mobile system based on georeferenced audio information to support
the mobility of blind people

82,000 images of Brazilian architecture

March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

35
Opportunities for Research in Brazil
More than 38k scholarships implemented in 2013 for sending students abroad
Visiting researcher (PVE Ciencia sem Fronteiras) - http://bit.ly/1fJPfLG
◦ Projects for 2 or 3 years, with a stay of 30 to 90 days per year in Brazil, continuous or not

◦ ≈ US$ 6400 (R$ 14,000)/month stayed in Brazil

CNPq Investment on
Research (in BRL millions)
2,000
1,000

2012

2010

2008

2006

2004

2002

2000

◦ Transportation assistance, among other benefits
Young Talent Attraction (BJT) - http://bit.ly/17T8ycr
◦ 1 to 3 years

1998

0

1996

◦ ≈ US$ 23,000 (R$ 50,000)/year for funding + scholarships

Scholarships abroad (in
thousands BRL)

◦ Min ≈ US$ 3200 (R$ 50,000)/month

250,000

100,000
50,000

2012

2010

2008

2006

2004

0

2002

◦ Support for accommodation and transportation
FAPESP (Sao Paulo state) - http://fapesp.br/147
◦ Visiting professor: ≈US$ 6300, 5000, or 4200 (R$ 13,653, R$ 10,950, R$ 9184)/month
depending on the level, visits of any length up to 12 months + support for transportation and
health insurance

150,000

2000

◦ ≈ US$ 3200 or US$ 4100 (R$ 6900 or R$ 8900)/month depending on the level

200,000

1998

◦ Transportation assistance, support for accommodation
Visiting scholar (CAPES PVE) http://capes.gov.br/cooperacao-internacional/multinacional/pve
◦ Visits from 15 days to 1 year

1996

◦ Min ≈ US$ 9200 (R$ 20,000)/year for funding

◦ Post-doc: ≈US$ 2700 (R$ 5908) + support for accommodation. 6 to 36 months.
USP (University of Sao Paulo) - http://www.usp.br/prp/pagina.php?menu=3&pagina=17
◦ Specific grants for receiving retired researchers or faculty on sabbatical
Bilateral agreements (with several universities)

March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

36
References
Change Dependencies
[Larman, 2004] LARMAN, CRAIG: Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and
Iterative Development. third. ed. : Prentice Hall, 2004
[UML Formal Specification] OMG: Object Management Group: Unified Modeling Language (UML).
http://www.omg.org/spec/UML/
[Ball et al, 1997] BALL, THOMAS ; ADAM, JUNG-MIN KIM ; HARVEY, A. PORTER ; SIY, P.: If Your Version Control System Could
Talk... In: ICSE Workshop on Process Modeling and Empirical Studies of Software Engineering, 1997
[Gall et al., 1998] GALL, HARALD ; HAJEK, KARIN ; JAZAYERI, MEHDI: Detection of Logical Coupling Based on Product Release
History. In: Proceedings of the International Conference on Software Maintenance, ICSM ’98. Washington, DC, USA : IEEE
Computer Society, 1998 — ISBN 0-8186-8779-7, p. 190–
Change Dependencies and Its Use
[Cataldo & Nambiar, 2010] CATALDO, MARCELO ; NAMBIAR, SANGEETH: The impact of geographic distribution and the nature
of technical coupling on the quality of global software development projects. In: Journal of Software Maintenance and
Evolution: Research and Practice, John Wiley & Sons, Ltd. (2010)
[D’Ambros et al., 2009a] D’AMBROS, MARCO ; LANZA, MICHELE ; ROBBES, ROMAIN: On the Relationship Between Change
Coupling and Software Defects. In: ZAIDMAN, A. ; ANTONIOL, G. ; DUCASSE, S. (eds.): 16th Working Conference on Reverse
Engineering, WCRE 2009, 13-16 October 2009, Lille, France : IEEE Computer Society, 2009 — ISBN 978-0-7695-3867-9,
pp. 135–144
[Zimmermann et al., 2005] ZIMMERMANN, THOMAS ; WEISSGERBER, PETER ; DIEHL, STEPHAN ; ZELLER, ANDREAS: Mining Version
Histories to Guide Software Changes. In: IEEE Trans. Softw. Eng. vol. 31. Piscataway, NJ, USA, IEEE Press (2005), Nr. 6,
pp. 429–445

March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

37
References
Change dependencies and its use (continued)
[Adams et al., 2010] ADAMS, BRAM ; JIANG, ZHEN MING ; HASSAN, AHMED E.: Identifying crosscutting concerns using historical
code changes. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE
’10. Cape Town, South Africa : ACM, 2010 — ISBN 978-1-60558-719-6, pp. 305–314
[D’Ambros et al., 2009b] D’AMBROS, MARCO ; LANZA, MICHELE ; LUNGU, MIRCEA: Visualizing Co-Change Information with the
Evolution Radar. In: IEEE Trans. Software Eng vol. 35 (2009), Nr. 5, pp. 720–735
[Zimmermann et al., 2003] ZIMMERMANN, T. ; DIEHL, S. ; ZELLER, A.: How history justifies system architecture (or not). In:
Software Evolution, 2003. Proceedings. Sixth International Workshop on Principles of, 2003, pp. 73–83
[Ali et al., 2013] ALI, N. ; JAAFAR, F. ; HASSAN, A.E.: Leveraging historical co-change information for requirements
traceability. In: Reverse Engineering (WCRE), 2013 20th Working Conference on, 2013, pp. 361–370

[Kagdi et al., 2006] KAGDI, H. ; MALETIC, J.I.: Mining for Co-Changes in the Context of Web Localization. In: Web Site
Evolution, 2006. WSE ’06. Eighth IEEE International Symposium on, 2006, pp. 50–57
Improving change dependencies identification
[Zimmermann et al., 2004] ZIMMERMANN, THOMAS ; WEIßGERBER, PETER: Preprocessing CVS Data for Fine-Grained Analysis.
In: Proceedings 1st International Workshop on Mining Software Repositories (MSR 2004). Los Alamitos CA : IEEE
Computer Society Press, 2004, pp. 2–6

Design Degradation
[Martin & Martin, 2006] MARTIN, ROBERT C. ; MARTIN, MICAH: Agile Principles, Patterns, and Practices in C#. first. ed. :
Prentice Hall, 2006

March/2014

Marco Aurélio Gerosa (gerosa@ime.usp.br)

38

More Related Content

What's hot

Searching for Key Stakeholders in Large-Scale Software Projects
Searching for Key Stakeholders in Large-Scale Software ProjectsSearching for Key Stakeholders in Large-Scale Software Projects
Searching for Key Stakeholders in Large-Scale Software Projects
Soo Ling Lim
 
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
dgarijo
 

What's hot (20)

Towards Reusable Research Software
Towards Reusable Research SoftwareTowards Reusable Research Software
Towards Reusable Research Software
 
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
FOOPS!: An Ontology Pitfall Scanner for the FAIR principlesFOOPS!: An Ontology Pitfall Scanner for the FAIR principles
FOOPS!: An Ontology Pitfall Scanner for the FAIR principles
 
Mining Software Repositories: Using Humans to Better Software
Mining Software Repositories: Using Humans to Better SoftwareMining Software Repositories: Using Humans to Better Software
Mining Software Repositories: Using Humans to Better Software
 
Cser13.ppt
Cser13.pptCser13.ppt
Cser13.ppt
 
SOMEF: a metadata extraction framework from software documentation
SOMEF: a metadata extraction framework from software documentationSOMEF: a metadata extraction framework from software documentation
SOMEF: a metadata extraction framework from software documentation
 
Searching for Key Stakeholders in Large-Scale Software Projects
Searching for Key Stakeholders in Large-Scale Software ProjectsSearching for Key Stakeholders in Large-Scale Software Projects
Searching for Key Stakeholders in Large-Scale Software Projects
 
Eliciting Requirements for Search based Requirements Prioritisation
Eliciting Requirements for Search based Requirements PrioritisationEliciting Requirements for Search based Requirements Prioritisation
Eliciting Requirements for Search based Requirements Prioritisation
 
Scientific Software Registry Collaboration Workshop: From Software Metadata r...
Scientific Software Registry Collaboration Workshop: From Software Metadata r...Scientific Software Registry Collaboration Workshop: From Software Metadata r...
Scientific Software Registry Collaboration Workshop: From Software Metadata r...
 
FAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the FutureFAIR Workflows: A step closer to the Scientific Paper of the Future
FAIR Workflows: A step closer to the Scientific Paper of the Future
 
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
A Controlled Crowdsourcing Approach for Practical Ontology Extensions and Met...
 
Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019Towards Human-Guided Machine Learning - IUI 2019
Towards Human-Guided Machine Learning - IUI 2019
 
ICSE10 StakeNet Talk
ICSE10 StakeNet TalkICSE10 StakeNet Talk
ICSE10 StakeNet Talk
 
ICSE10 StakeSource Talk
ICSE10 StakeSource TalkICSE10 StakeSource Talk
ICSE10 StakeSource Talk
 
Icsm20.ppt
Icsm20.pptIcsm20.ppt
Icsm20.ppt
 
zzhang_resume
zzhang_resumezzhang_resume
zzhang_resume
 
Coming to terms to FAIR semantics
Coming to terms to FAIR semanticsComing to terms to FAIR semantics
Coming to terms to FAIR semantics
 
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
OKG-Soft: An Open Knowledge Graph With Mathine Readable Scientific Software M...
 
CROSSMINER - Developer-Centric Knowledge Mining from Large Open-Source Softwa...
CROSSMINER - Developer-Centric Knowledge Mining from Large Open-Source Softwa...CROSSMINER - Developer-Centric Knowledge Mining from Large Open-Source Softwa...
CROSSMINER - Developer-Centric Knowledge Mining from Large Open-Source Softwa...
 
Analytics of Performance and Data Quality for Mobile Edge Cloud Applications
Analytics of Performance and Data Quality for Mobile Edge Cloud ApplicationsAnalytics of Performance and Data Quality for Mobile Edge Cloud Applications
Analytics of Performance and Data Quality for Mobile Edge Cloud Applications
 
LIVE: a Tool for Checking Licenses Compatibility between Vocabularies and Data
LIVE: a Tool for Checking Licenses Compatibility between Vocabularies and DataLIVE: a Tool for Checking Licenses Compatibility between Vocabularies and Data
LIVE: a Tool for Checking Licenses Compatibility between Vocabularies and Data
 

Viewers also liked

MSR End of Internship Talk
MSR End of Internship TalkMSR End of Internship Talk
MSR End of Internship Talk
Ray Buse
 
A Metric for Code Readability
A Metric for Code ReadabilityA Metric for Code Readability
A Metric for Code Readability
Ray Buse
 

Viewers also liked (20)

Lanubile@SSE2013
Lanubile@SSE2013Lanubile@SSE2013
Lanubile@SSE2013
 
ICSE 2011: Research industry panel
ICSE 2011: Research industry panelICSE 2011: Research industry panel
ICSE 2011: Research industry panel
 
Icpc 2011 storey
Icpc 2011 storeyIcpc 2011 storey
Icpc 2011 storey
 
Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...Mining Development Repositories to Study the Impact of Collaboration on Softw...
Mining Development Repositories to Study the Impact of Collaboration on Softw...
 
ICPE2015
ICPE2015ICPE2015
ICPE2015
 
WCRE2011
WCRE2011WCRE2011
WCRE2011
 
ICSE2013
ICSE2013ICSE2013
ICSE2013
 
ICSME2014
ICSME2014ICSME2014
ICSME2014
 
MSR 2009
MSR 2009MSR 2009
MSR 2009
 
Msr2016 tarek
Msr2016 tarek Msr2016 tarek
Msr2016 tarek
 
ICSE2014
ICSE2014ICSE2014
ICSE2014
 
MSR End of Internship Talk
MSR End of Internship TalkMSR End of Internship Talk
MSR End of Internship Talk
 
Towards the Social Programmer (MSR 2012 Keynote by M. Storey)
Towards the Social Programmer (MSR 2012 Keynote by M. Storey)Towards the Social Programmer (MSR 2012 Keynote by M. Storey)
Towards the Social Programmer (MSR 2012 Keynote by M. Storey)
 
ASE2010
ASE2010ASE2010
ASE2010
 
Empirical Software Engineering at Microsoft Research
Empirical Software Engineering at Microsoft ResearchEmpirical Software Engineering at Microsoft Research
Empirical Software Engineering at Microsoft Research
 
Stackoverflow Data Analysis-Homework3
Stackoverflow Data Analysis-Homework3Stackoverflow Data Analysis-Homework3
Stackoverflow Data Analysis-Homework3
 
A Metric for Code Readability
A Metric for Code ReadabilityA Metric for Code Readability
A Metric for Code Readability
 
StackOverflow Architectural Overview
StackOverflow Architectural OverviewStackOverflow Architectural Overview
StackOverflow Architectural Overview
 
Benevol 2012 Keynote: The Social Software (R)evolution
Benevol 2012 Keynote: The Social Software (R)evolutionBenevol 2012 Keynote: The Social Software (R)evolution
Benevol 2012 Keynote: The Social Software (R)evolution
 
The (R)evolution of Social Media in Software Engineering
The (R)evolution of Social Media in Software EngineeringThe (R)evolution of Social Media in Software Engineering
The (R)evolution of Social Media in Software Engineering
 

Similar to Mining Sociotechnical Information From Software Repositories

Data_Mining_for_Software_Engineering.pdf
Data_Mining_for_Software_Engineering.pdfData_Mining_for_Software_Engineering.pdf
Data_Mining_for_Software_Engineering.pdf
assadabbas22
 
· Discussion Assignments You are required to answer one of t.docx
· Discussion Assignments You are required to answer one of t.docx· Discussion Assignments You are required to answer one of t.docx
· Discussion Assignments You are required to answer one of t.docx
oswald1horne84988
 
Syst biol 2012-burguiere-sysbio sys069
Syst biol 2012-burguiere-sysbio sys069Syst biol 2012-burguiere-sysbio sys069
Syst biol 2012-burguiere-sysbio sys069
Thomas Burguiere
 
24 Reasons Why Variability Models Are Not Yet Universal (24RWVMANYU)
24 Reasons Why Variability Models Are Not Yet Universal (24RWVMANYU)24 Reasons Why Variability Models Are Not Yet Universal (24RWVMANYU)
24 Reasons Why Variability Models Are Not Yet Universal (24RWVMANYU)
University of Rennes, INSA Rennes, Inria/IRISA, CNRS
 
oXabcaARRAAKALSL.docx
oXabcaARRAAKALSL.docxoXabcaARRAAKALSL.docx
oXabcaARRAAKALSL.docx
alfred4lewis58146
 

Similar to Mining Sociotechnical Information From Software Repositories (20)

Data_Mining_for_Software_Engineering.pdf
Data_Mining_for_Software_Engineering.pdfData_Mining_for_Software_Engineering.pdf
Data_Mining_for_Software_Engineering.pdf
 
· Discussion Assignments You are required to answer one of t.docx
· Discussion Assignments You are required to answer one of t.docx· Discussion Assignments You are required to answer one of t.docx
· Discussion Assignments You are required to answer one of t.docx
 
Of Changes and Their History
Of Changes and Their HistoryOf Changes and Their History
Of Changes and Their History
 
Summarization Techniques for Code, Change, Testing and User Feedback
Summarization Techniques for Code, Change, Testing and User FeedbackSummarization Techniques for Code, Change, Testing and User Feedback
Summarization Techniques for Code, Change, Testing and User Feedback
 
PhD Defense Øyvind Hauge
PhD Defense Øyvind HaugePhD Defense Øyvind Hauge
PhD Defense Øyvind Hauge
 
Social and Technical Evolution of the Ruby on Rails Software Ecosystem
Social and Technical Evolution of the Ruby on Rails Software EcosystemSocial and Technical Evolution of the Ruby on Rails Software Ecosystem
Social and Technical Evolution of the Ruby on Rails Software Ecosystem
 
Syst biol 2012-burguiere-sysbio sys069
Syst biol 2012-burguiere-sysbio sys069Syst biol 2012-burguiere-sysbio sys069
Syst biol 2012-burguiere-sysbio sys069
 
Lopez
LopezLopez
Lopez
 
Preserving software workshop - Community engagement workshop
Preserving software workshop - Community engagement workshopPreserving software workshop - Community engagement workshop
Preserving software workshop - Community engagement workshop
 
24 Reasons Why Variability Models Are Not Yet Universal (24RWVMANYU)
24 Reasons Why Variability Models Are Not Yet Universal (24RWVMANYU)24 Reasons Why Variability Models Are Not Yet Universal (24RWVMANYU)
24 Reasons Why Variability Models Are Not Yet Universal (24RWVMANYU)
 
A Primer on High-Quality Identifier Naming [ASE 2022]
A Primer on High-Quality Identifier Naming [ASE 2022]A Primer on High-Quality Identifier Naming [ASE 2022]
A Primer on High-Quality Identifier Naming [ASE 2022]
 
Social and Technical Evolution of the Ruby on Rails Software Ecosystem
Social and Technical Evolution of the Ruby on Rails Software EcosystemSocial and Technical Evolution of the Ruby on Rails Software Ecosystem
Social and Technical Evolution of the Ruby on Rails Software Ecosystem
 
oXabcaARRAAKALSL.docx
oXabcaARRAAKALSL.docxoXabcaARRAAKALSL.docx
oXabcaARRAAKALSL.docx
 
Systems of Systems - Design and Management
Systems of Systems - Design and ManagementSystems of Systems - Design and Management
Systems of Systems - Design and Management
 
Quantitative And Qualitative Evaluation Of F/Oss Volunteer Participation In D...
Quantitative And Qualitative Evaluation Of F/Oss Volunteer Participation In D...Quantitative And Qualitative Evaluation Of F/Oss Volunteer Participation In D...
Quantitative And Qualitative Evaluation Of F/Oss Volunteer Participation In D...
 
A Systematic Mapping Study on Analysis of Code Repositories.pdf
A Systematic Mapping Study on Analysis of Code Repositories.pdfA Systematic Mapping Study on Analysis of Code Repositories.pdf
A Systematic Mapping Study on Analysis of Code Repositories.pdf
 
Data and Knowledge Evolution
Data and Knowledge Evolution  Data and Knowledge Evolution
Data and Knowledge Evolution
 
Read Between The Lines: an Annotation Tool for Multimodal Data
Read Between The Lines: an Annotation Tool for Multimodal DataRead Between The Lines: an Annotation Tool for Multimodal Data
Read Between The Lines: an Annotation Tool for Multimodal Data
 
Better Software, Better Research
Better Software, Better ResearchBetter Software, Better Research
Better Software, Better Research
 
Introduction to MDE
Introduction to MDEIntroduction to MDE
Introduction to MDE
 

Recently uploaded

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Recently uploaded (20)

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Mining Sociotechnical Information From Software Repositories

  • 1. Mining Sociotechnical Information From Software Repositories Marco Aurélio Gerosa UNIVERSITY OF SÃO PAULO, BRAZIL University of California, Irvine Informatics Seminar March/2014
  • 2. Software repositories SW Architect Manager Tester Programmer Programmer Computer Mediated Tools Client Source Code User Issues Bug Reports Message Archives Etc. Current and historical artifacts and interactions are registered in software repositories March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 2
  • 3. Repositories of repositories 11.3 millions repositories 5.4 millions users In 2013: • 3 millions new users • 152 millions pushes • 25 millions comments • 14 millions issue • 7 millions pull requests 93K projects 1 million users 250K projects 661K projects 29 billions of lines of codes 3 millions users 30K projects https://launchpad.net http://www.ohloh.net/ https://github.com/about/press http://octoverse.github.com/ 324K projects 3.4 millions developers 36K projects 33K projects http://en.wikipedia.org/wiki/CodePlex 200 projects http://projects.apache.org/indexes/alpha.html http://sourceforge.net/apps/trac/sourceforge/wiki/What%20is%20SourceForge.net http://en.wikipedia.org/wiki/Comparison_of_open_source_software_hosting_facilities March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 3
  • 4. Mining software repositories Communication Source code and artifacts 3C Model* Discussion lists Comments on issues Code comments User reports Q&A sites Social media Issue trackers Project management systems Coordination Reputation systems Practitioner Researcher Applications Collaboration and software production Mining Information about a project Decision making Information about an ecosystem Cooperation Software understanding Information about Software Engineering Support maintenance Empirical validation of ideas & techniques Tag cloud from MSR 2014 CFP “The Mining Software Repositories (MSR) field analyzes the rich data available in software repositories to uncover interesting and actionable information about software systems and projects.” http://2014.msrconf.org/ March/2014 *3C Model: Fuks, H., Raposo, A., Gerosa, M.A., Pimentel, M. & Lucena, C.J.P. (2007) “The 3C Collaboration Model” in: The Encyclopedia of ECollaboration, Ned Kock (org), ISBN 978-1-59904-000-4, pp. 637-644. Marco Aurélio Gerosa (gerosa@ime.usp.br) 4
  • 6. Sentiment analysis on commits GitHub Data Challenge 2nd place http://www.igvita.com/slides/2012/bigquery-github-strata.pdf March/2014 http://geeksta.net/geeklog/exploring-expressions-emotions-github-commit-messages/ Marco Aurélio Gerosa (gerosa@ime.usp.br) 6
  • 8. Mining discussion lists What is the discussion list for? Christian Bird, Alex Gourley, Prem Devanbu, Michael Gertz, and Anand Swaminathan. 2006. Mining email social networks. MSR 2006 March/2014 Anja Guzzi, Alberto Bacchelli, Michele Lanza, Martin Pinzger, and Arie van Deursen. 2013. Communication in open source software development mailing lists. MSR 2013 Marco Aurélio Gerosa (gerosa@ime.usp.br) 8
  • 9. Coordination requirements Cataldo, M., Dependencies in geographically distributed software development: Overcoming the limits of modularity, PhD Thesis Santana, F. et a. “XFlow: An Extensible Tool for Empirical Analysis of Software Systems Evolution”. ESELAW 2011 March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 9
  • 10. Programmers who changed this function also changed … Thomas Zimmermann, Peter Weissgerber, Stephan Diehl, and Andreas Zeller. 2005. Mining Version Histories to Guide Software Changes. IEEE Trans. Software Eng. 31, 6 (June 2005), 429-445. DOI=10.1109/TSE.2005.72 http://dx.doi.org/10.1109/TSE.2005.72 March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 10
  • 11. Don’t program on Fridays http://www.slideshare.net/taoxiease/software-analytics-towards-software-mining-that-matters March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 11
  • 12. Which files are more buggy? March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 12
  • 13. Lack of documentation Deficient Documentation Detection: A Methodology to Locate Deficient Project Documentation using Topic Analysis, Joshua Charles Campbell, Chenlei Zhang, Zhen Xu, Abram Hindle, and James Miller, MSR 2013 March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 13
  • 14. Some questions from MSR 2013 http://2013.msrconf.org/program.php • How to recommend developers to bug reports? • How to filter update notifications? • What can apps’ permissions reveal? • How to extract feature requests from apps online reviews? • How to analyze peer code-review data? • How to find API usage patterns? • Is programming knowledge related to age? • How do patches reach the Linux kernel? • How do code clones evolve? • How to process crash reports? • How to predict defects and effort? • Etc. March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 14
  • 15. Some of our work
  • 16. Some of our work 1. Change dependencies identification 2. Design degradation identification 3. Key developers characterization 4. Unit tests’ feedback for code quality 5. Refactoring 6. Change prediction 7. Support for OSS projects newcomers March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 16
  • 17. 1) Change dependencies identification Dependency? dF) oo Tr hM i s e ( C lA a s s C lB a s s A B • “A dependency means that a client element has knowledge of the supplier element and a change in the supplier may affect the client” [Larman, 2004] • A dependency relation means that the semantics of the depending elements is semantically or structurally dependent on the definition of the supplier element [UML Formal Specification] • Unrecognized dependencies result in a higher number of defects [Herbsleb et al., 2006] • Structural (or static), dynamic, or logical dependency? • Dependencies may be hard to identify • Publisher/subscriber, polymorphism, clones, crosscutting concerns, semantic relations etc. De ed pn e c ny March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) C h a n g e 17
  • 18. Change dependencies Files frequently changed together share some sort of dependency [Gall et al. 1998] Strong change dependency from B to A (the opposite is a much weaker dependency) A logical dependency denotes an implicit and evolutionary relationship between software artifacts The concept has proven useful in several different studies: ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ Software quality analysis [Cataldo & Nambiar, 2010] Bugs prediction [D’Ambros et al. , 2009a] Change prediction and change impact analysis [Zimmermann et al., 2005] Uncover cross-cutting concerns [Adams et al., 2010] Uncover design flaws and opportunities for refactoring [D’Ambros et al., 2009b] Understand and evaluate software architecture [Zimmermann et al., 2003] Requirements traceability [Ali et al., 2013] Maintain documentation [Kagdi et al., 2006] March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 18
  • 19. 1.1) Structural x Change Dependencies Overlap between change dependencies and structural dependencies Sta tc l ru u r De ed pn e c ny Cn o g -as c e h ? Gustavo Oliva, PhD candidate Analysis of 150K commits of the ASF showed that: - 93% of the change dependencies did not involve structural dependencies - 95% of the structural dependencies did not imply in a change dependency Oliva, G.A., Gerosa, M. A. (2011) “On the Interplay between Structural and Logical Dependencies in Free Software”. Brazilian Symposium on Software Engineering (SBES 2011) March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 19
  • 20. 1.2) Change dependencies origins Manual classification of commits to understand the origins of change dependencies Category Refactoring elements that belong to a same semantic class Structural dependencies on a changing semantic class Cross-cutting concerns Overloaded revision Repository operations Structural dependencies on specific elements Other reasons Total Joint-changes Total % 80 19.6% 9 2.2% 165 60 21 40.4% 14.7% 5.1% 66 16.2% 7 408 Gustavo Oliva, PhD candidate 1.7% Oliva, G.A., Santana, F., Gerosa, M. A., Souza, C. (2011) “Towards a Classification of Logical Dependencies Origins: A Case Study”. Proceedings of the 12th International Workshop on Principles of Software Evolution and the 7th annual ERCIM Workshop on Software Evolution (IWPSE-EVOL '11) March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 20
  • 21. 1.3) Preprocessing commits Commit = change? Using the sliding time window approach [Zimmermann & Weißgerber, 2004] to group SVN commits N Sum Mean StDev Skewness Kurtosis Before 479,794 3,206,900 6.68 37.84 33.80 1,844.00 After 453,865 3,174,051 6.99 40.79 39.56 Gustavo Oliva, PhD candidate 2,829.94 Evaluation in the Apache code repository showed that the produced grouping corresponded to 4.6% of the number of commits What about commit habits/practices/policies? Social aspects matter! Oliva, G. A., Santana, F., Gerosa, M. A., Souza, C. (2012), “Preprocessing Change-Sets to Improve Logical Dependencies Identification”, 6th Int. Workshop on Software Quality and Maintainability (SQM 2012) March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 21
  • 22. 2) Design degradation identification Rigidity and fragility [Martin & Martin, 2006] identification based on commit metadata Gustavo Oliva, PhD candidate 5.3 Rigidity => designs difficult to change due to ripple effects => commit density (number of changed files per commit) Fragility => designs break in different areas when a change is performed => commit dispersion (distance in the directory tree among file paths included in a commit) Oliva, G., Steinmacher, I., Wiese, I.S., Gerosa, M.A. “What Can Commit Metadata Tell Us About Design Degradation?”, In: 13th International Workshop on Principles on Software Evolution (IWPSE 2013), Saint Petersburg. March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 22
  • 23. 3) Key developers characterization Key developers participation Oliva, G., Santana, F.W., da Silva, J. T., Oliveira, K.C.M., Werner, C.M.L., Souza, C.R.B. & Gerosa, M.A., “Evolving the System’s Core: A Case Study on the Identification and Characterization of Key Developers in Apache Ant”, Computing and Informatics [to appear]. March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 23
  • 24. 4) Unit tests’ feedback for code quality Mauricio Aniche, PhD candidate Number of asserts indicate - Cyclomatic complexity? - LOC ? - Method calls ? - “Asserted Objects” metric presents better results than “number of asserts” - Statistically difference in 20% of the projects 22 ASF projects 3 industry projects Aniche, M., Oliva, G.A., Gerosa, M.A., “What Do the Asserts in a Unit Test Tell Us about Code Quality? A Study on Open Source and Industrial Projects”, 17th European Conference on Software Maintenance and Reengineering (CSMR 2013). March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 24
  • 25. 5) Refactoring Most part of the documented refactoring does not reduce cyclomatic complexity. However, 23% of the documented refactoring reduce cyclomatic complexity while 12% of the other commits have the same effect. Francisco Sokol, MSc candidate www.metricminer.org.br Sokol, F., Aniche, M.F., Gerosa, M.A., “Does the Act of Refactoring Really Make Code Simpler? A Preliminary Study”,. In: IV Brazilian Workshop of Agile Methods (WBMA 2013). March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 25
  • 26. 6) Change prediction Using social, process, and architectural metrics to improve change proneness prediction Igor Wiese, PhD candidate 13 projects, 6 classifiers, 11 metrics Social and architectural metrics improved the prediction model Similar results when considering projects grouped by change ratio Dimensions Process Social Architecture Metric Name HCM WHCM Churn WChurn MFD – Modification File-Developer DevParticipation WDevParticipation Rigidity Fragility WCo-Changes UD – Unworked Dependencies Reference Hassan [13] D’Ambros et al. [8] D’Ambros et al. [8] Nagappan and Ball [23] New Rahman & Devanbu [28] New Oliva et al. [26] Oliva et al. [26] New New Wiese, I. S, Nassif Jr, Steinmacher I, Re Reginaldo, Gerosa, M.A “Comparing communication and development networks for predicting file change proneness: An exploratory study considering process and social metrics”,. In: International Workshop on Software Quality and Maintainability (SQM 2014). March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 26
  • 27. 7) Supporting Newcomers to OSS projects Analysis of newcomers dropout reasons Use of MSR to identify newcomers Igor Steinmacher, PhD candidate Systematic review on awareness in DSD [JCSCW 2013] comum + ação Ação de tornar comum COMUNICAÇÃO demanda gera compromissos gerenciados pela Awareness COOPERAÇÃO COORDENAÇÃO co + operar + ação Ação de operar em conjunto organiza as tarefas para co + ordem + ação Ação de organizar em conjunto Steinmacher, I., Wiese, I.S., Chaves, A.P., Gerosa, M.A., Why do newcomers abandon open source software projects?, 6th Int. Workshop on Cooperative and Human Aspects of Software Engineering (CHASE 2013) March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 27
  • 28. 7) Supporting Newcomers to OSS projects ◦ ◦ Systematic Literature Review on barriers for newcomers to OSS Qualitative analysis of interviews Igor Steinmacher, PhD candidate Steinmacher, I., Wiese, I., Conte, T., Gerosa, M.A., Redmiles, D.F. The Hard Life of Newcomers to OSS Projects, 7th International Workshop on Cooperative and Human Aspects of Software Engineering Steinmacher, I.; Graciotto Silva, M. A. ; GEROSA, M.A., Barriers faced by newcomers to open source projects: a systematic review. 10th International Conference on Open Source Systems (OSS 2014) March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 28
  • 30. Some MSR challenges • Years of sociotechnical data of millions of projects • BigData & large-scale computing • Natural Language processing • Hao Zhong, H., Zhang, L., Xie, T. & Mei, H. “Inferring Resource Specifications from Natural Language API Documentation”, ASE 2009. • Incomplete and inaccurate information (e.g., commit comments) • Multiple identities and commits on behalf of others • Traceability and provenance • Davies, German, Godfrey & Hindle. “Software bertillonage: finding the provenance of an entity”, MSR 2011 • Tools usage change over time • March/2014 Guzzi et al. “Communication in open source software development mailing lists”, MSR 2013 Marco Aurélio Gerosa (gerosa@ime.usp.br) 30
  • 31. What’s next? (1/2) Towards theories and consolidation Sharing insights Sharing patterns ◦ DAPSE http://dapse.unbox.org ◦ MSR Cookbook http://swag.cs.uwaterloo.ca/~okononen/msr/ Sharing data ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦ [Gaines, 1999] http://www.githubarchive.org/ http://www.ohloh.net / Dataset papers published at MSR MSR Challenges dataset Promisedata.org Promisedata.org Flossmetrics.org Flossmole.org MetricMiner Replicating studies ◦ Ghezzi, G. & Gall, H. “Replicating Mining Studies with SOFAS”, MSR 2013 http://thomas-zimmermann.com/2013/10/software-analytics-sharing-information/ March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 31
  • 32. What’s next? (2/2) Actionable information in the Software Environments (new IDEs) Software Analytics ◦ “The use of analysis, data, and systematic reasoning to make decisions” New sources ◦ A lot of information is lost between commits => IDE & desktop instrumentation and live data collection ◦ Singer, J. “Navtracks: Supporting navigation in software maintenance”, ICSM'05 ◦ Blincoe, K., Valetto, G. & Goggins, S. “Proximity: a measure to quantify the need for developers' coordination. CSCW'12 ◦ Software execution logs ◦ Social Media ◦ Storey, M., Treude, C., Deursen, A. & Cheng, L., “The impact of social media on software engineering practices and tools” FoSER'10 ◦ Bougie, G., Starke, J., Storey, M. & German, D., “Towards understanding twitter use in software engineering: preliminary findings, ongoing challenges and future questions”, Web2SE'11 ◦ Q&A Sites ◦ “Over 92% of Stack Overflow questions about expert topics are answered - in a median time of 11 minutes” [Mamykina et al., CHI'11] ◦ Campbell et al., “Deficient Documentation Detection: A Methodology to Locate Deficient Project Documentation using Topic Analysis”, MSR'13 http://thomas-zimmermann.com/2013/10/software-analytics-sharing-information/ March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 32
  • 33. Conferences ◦ MSR: Working Conference on Mining Software Repositories ◦ http://msrconf.org ◦ ICSM: International Conference on Software Maintenance ◦ http://conferences.computer.org/icsm ◦ CSMR-WCRE: Software Evolution Week (joins CSMR and WCRE) ◦ http://ansymo.ua.ac.be/csmr-wcre ◦ ICSE: International Conference on Software Engineering ◦ http://www.icse-conferences.org ◦ SCAM: International Working Conference on Source Code Analysis and Manipulation ◦ http://www.ieee-scam.org/ ◦ ESEM: International Symposium on Empirical Software Engineering and Measurement ◦ http://www.esem-conferences.org ◦ PROMISE March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 33
  • 34. Thank you! MARCO AURELIO GEROSA ( G E R O S A @ I M E . U S P. B R ) – O F F I C E 5 2 2 8 @ D B H @GEROSA_MARCO H T T P : / / L A P E S S C . I M E . U S P. B R / ( A L L P U B L I C AT I O N S A R E A V A I L A B L E AT T H I S S I T E ) H T T P : / / W W W. I M E . U S P. B R / ~ G E R O S A H T T P : / / N A P. U S P. B R / N A W E B
  • 35. Other projects Smart Audio City Guide http://www.arquigrafia.org.br A social mobile system based on georeferenced audio information to support the mobility of blind people 82,000 images of Brazilian architecture March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 35
  • 36. Opportunities for Research in Brazil More than 38k scholarships implemented in 2013 for sending students abroad Visiting researcher (PVE Ciencia sem Fronteiras) - http://bit.ly/1fJPfLG ◦ Projects for 2 or 3 years, with a stay of 30 to 90 days per year in Brazil, continuous or not ◦ ≈ US$ 6400 (R$ 14,000)/month stayed in Brazil CNPq Investment on Research (in BRL millions) 2,000 1,000 2012 2010 2008 2006 2004 2002 2000 ◦ Transportation assistance, among other benefits Young Talent Attraction (BJT) - http://bit.ly/17T8ycr ◦ 1 to 3 years 1998 0 1996 ◦ ≈ US$ 23,000 (R$ 50,000)/year for funding + scholarships Scholarships abroad (in thousands BRL) ◦ Min ≈ US$ 3200 (R$ 50,000)/month 250,000 100,000 50,000 2012 2010 2008 2006 2004 0 2002 ◦ Support for accommodation and transportation FAPESP (Sao Paulo state) - http://fapesp.br/147 ◦ Visiting professor: ≈US$ 6300, 5000, or 4200 (R$ 13,653, R$ 10,950, R$ 9184)/month depending on the level, visits of any length up to 12 months + support for transportation and health insurance 150,000 2000 ◦ ≈ US$ 3200 or US$ 4100 (R$ 6900 or R$ 8900)/month depending on the level 200,000 1998 ◦ Transportation assistance, support for accommodation Visiting scholar (CAPES PVE) http://capes.gov.br/cooperacao-internacional/multinacional/pve ◦ Visits from 15 days to 1 year 1996 ◦ Min ≈ US$ 9200 (R$ 20,000)/year for funding ◦ Post-doc: ≈US$ 2700 (R$ 5908) + support for accommodation. 6 to 36 months. USP (University of Sao Paulo) - http://www.usp.br/prp/pagina.php?menu=3&pagina=17 ◦ Specific grants for receiving retired researchers or faculty on sabbatical Bilateral agreements (with several universities) March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 36
  • 37. References Change Dependencies [Larman, 2004] LARMAN, CRAIG: Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and Iterative Development. third. ed. : Prentice Hall, 2004 [UML Formal Specification] OMG: Object Management Group: Unified Modeling Language (UML). http://www.omg.org/spec/UML/ [Ball et al, 1997] BALL, THOMAS ; ADAM, JUNG-MIN KIM ; HARVEY, A. PORTER ; SIY, P.: If Your Version Control System Could Talk... In: ICSE Workshop on Process Modeling and Empirical Studies of Software Engineering, 1997 [Gall et al., 1998] GALL, HARALD ; HAJEK, KARIN ; JAZAYERI, MEHDI: Detection of Logical Coupling Based on Product Release History. In: Proceedings of the International Conference on Software Maintenance, ICSM ’98. Washington, DC, USA : IEEE Computer Society, 1998 — ISBN 0-8186-8779-7, p. 190– Change Dependencies and Its Use [Cataldo & Nambiar, 2010] CATALDO, MARCELO ; NAMBIAR, SANGEETH: The impact of geographic distribution and the nature of technical coupling on the quality of global software development projects. In: Journal of Software Maintenance and Evolution: Research and Practice, John Wiley & Sons, Ltd. (2010) [D’Ambros et al., 2009a] D’AMBROS, MARCO ; LANZA, MICHELE ; ROBBES, ROMAIN: On the Relationship Between Change Coupling and Software Defects. In: ZAIDMAN, A. ; ANTONIOL, G. ; DUCASSE, S. (eds.): 16th Working Conference on Reverse Engineering, WCRE 2009, 13-16 October 2009, Lille, France : IEEE Computer Society, 2009 — ISBN 978-0-7695-3867-9, pp. 135–144 [Zimmermann et al., 2005] ZIMMERMANN, THOMAS ; WEISSGERBER, PETER ; DIEHL, STEPHAN ; ZELLER, ANDREAS: Mining Version Histories to Guide Software Changes. In: IEEE Trans. Softw. Eng. vol. 31. Piscataway, NJ, USA, IEEE Press (2005), Nr. 6, pp. 429–445 March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 37
  • 38. References Change dependencies and its use (continued) [Adams et al., 2010] ADAMS, BRAM ; JIANG, ZHEN MING ; HASSAN, AHMED E.: Identifying crosscutting concerns using historical code changes. In: Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering - Volume 1, ICSE ’10. Cape Town, South Africa : ACM, 2010 — ISBN 978-1-60558-719-6, pp. 305–314 [D’Ambros et al., 2009b] D’AMBROS, MARCO ; LANZA, MICHELE ; LUNGU, MIRCEA: Visualizing Co-Change Information with the Evolution Radar. In: IEEE Trans. Software Eng vol. 35 (2009), Nr. 5, pp. 720–735 [Zimmermann et al., 2003] ZIMMERMANN, T. ; DIEHL, S. ; ZELLER, A.: How history justifies system architecture (or not). In: Software Evolution, 2003. Proceedings. Sixth International Workshop on Principles of, 2003, pp. 73–83 [Ali et al., 2013] ALI, N. ; JAAFAR, F. ; HASSAN, A.E.: Leveraging historical co-change information for requirements traceability. In: Reverse Engineering (WCRE), 2013 20th Working Conference on, 2013, pp. 361–370 [Kagdi et al., 2006] KAGDI, H. ; MALETIC, J.I.: Mining for Co-Changes in the Context of Web Localization. In: Web Site Evolution, 2006. WSE ’06. Eighth IEEE International Symposium on, 2006, pp. 50–57 Improving change dependencies identification [Zimmermann et al., 2004] ZIMMERMANN, THOMAS ; WEIßGERBER, PETER: Preprocessing CVS Data for Fine-Grained Analysis. In: Proceedings 1st International Workshop on Mining Software Repositories (MSR 2004). Los Alamitos CA : IEEE Computer Society Press, 2004, pp. 2–6 Design Degradation [Martin & Martin, 2006] MARTIN, ROBERT C. ; MARTIN, MICAH: Agile Principles, Patterns, and Practices in C#. first. ed. : Prentice Hall, 2006 March/2014 Marco Aurélio Gerosa (gerosa@ime.usp.br) 38