Mining Software Archives to Support Software Development

Job application talk.

Mining Software Archives to Support Software Development

  1. 1. Mining Software Archives to Support Software Development Tom Zimmermann Saarland University
  2. 2. Software Development: Build. Hello Calgary!
  3. 3. Software Development Build
  4. 4. Collaboration
  5. 5. Collaboration
  6. 6. Collaboration Comm. Archive
  7. 7. Collaboration: Version Archive, Comm. Archive
  8. 8. Collaboration: Version Archive, Comm. Archive, Bug Database
  9. 9. Collaboration: Version Archive, Comm. Archive, Bug Database. Mining Software Archives
  10. 10. Mining Software Archives
  11. 11. Mining Software Archives eROSE BugCache Vulture
  12. 12. eROSE Related Changes (ICSE 2004, TSE 2005) Tom Zimmermann • Saarland University Peter Weißgerber • University of Trier Stephan Diehl • University of Trier Andreas Zeller • Saarland University
  13. 13. Developers who changed this function also changed...
  14. 14. eROSE: Guiding Developers Customers who bought this item also bought... Purchase History
  15. 15. eROSE: Guiding Developers Developers who changed this function also changed... (Version Archive) Customers who bought this item also bought... (Purchase History)
  16. 16. eROSE suggests further locations.
  17. 17. eROSE prevents incomplete changes.
  18. 18. Processing CVS data
  19. 19. Processing CVS data
  20. 20. Processing CVS data 1. Comparing files 2. Building transactions
  21. 21. Comparing Files
  22. 22. Comparing Files A() B() C() D() E()
  23. 23. Comparing Files Left column: A() B() C() D() E(). Right column: A() F() B() D() E()
  24. 24. Comparing Files Left column: A() B() C() D() E(). Right column: A() F() B() D() E()
  25. 25. Building Transactions CVS 150,000
  26. 26. Building Transactions CVS (150,000). 2003-02-19 (aweinand): fixed #13332: createGeneralPage(), createTextComparePage(), fKeys[], initDefaults(), buildnotes_compare.html, PatchMessages.properties, plugin.properties
  27. 27. Building Transactions CVS (150,000). Same author + message + time. 2003-02-19 (aweinand): fixed #13332: createGeneralPage(), createTextComparePage(), fKeys[], initDefaults(), buildnotes_compare.html, PatchMessages.properties, plugin.properties
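
The grouping rule on slide 27 (same author + message + time) can be sketched as below. This is a minimal illustration and not eROSE's actual code: the `Commit` record, the function name `build_transactions`, the 200-second window, and the sample timestamps are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Commit:
    """One per-entity CVS check-in (hypothetical record layout)."""
    author: str
    message: str
    timestamp: int  # seconds; the unit is an assumption
    entity: str     # changed file or function


def build_transactions(commits, window=200):
    """Group check-ins with the same author and message whose timestamps
    are at most `window` seconds apart from the previous check-in."""
    ordered = sorted(commits, key=lambda c: (c.author, c.message, c.timestamp))
    transactions = []
    for _, group in groupby(ordered, key=lambda c: (c.author, c.message)):
        current = []
        for commit in group:
            # A gap larger than the window starts a new transaction.
            if current and commit.timestamp - current[-1].timestamp > window:
                transactions.append(current)
                current = []
            current.append(commit)
        transactions.append(current)
    return transactions


# Illustrative data: entity names come from slide 26, timestamps are made up.
commits = [
    Commit("aweinand", "fixed #13332", 0, "fKeys[]"),
    Commit("aweinand", "fixed #13332", 30, "initDefaults()"),
    Commit("aweinand", "fixed #13332", 60, "plugin.properties"),
    Commit("aweinand", "fixed #13332", 5000, "buildnotes_compare.html"),
    Commit("someone_else", "fixed #13332", 10, "Other.java"),
]
transactions = build_transactions(commits)
```

With this sample, the first three check-ins form one transaction, the late check-in forms a second, and the other author's check-in forms a third.
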
  28. 28. Mining Associations User changes fKeys[] and initDefaults()
  29. 29. Mining Associations
  30. 30. Mining Associations EROSE finds past transactions
  31. 31. Mining Associations EROSE finds past transactions: #756, #6721, #21078, #42432, #51345, #59998, #71003, #87264, #91220, #101823, #104223. Each contains fKeys[], initDefaults(), ...; all but one also contain plugin.properties.
  32. 32. Mining Associations EROSE finds past transactions (same 11 transactions as on slide 31). Rule: {fKeys[], initDefaults()} ⇒ {plugin.properties}, Support 10, Confidence 10/11 = 0.909
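
The support and confidence figures on slide 32 can be reproduced with a few lines. This is a sketch under assumptions: the function name `rule_stats` and the set-based transaction encoding are illustrative, and the other entities in each transaction (the "..." on the slide) are left out.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule antecedent => consequent.

    support    = number of transactions containing antecedent and consequent
    confidence = support / number of transactions containing the antecedent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    with_antecedent = [t for t in transactions if antecedent <= set(t)]
    support = sum(1 for t in with_antecedent if consequent <= set(t))
    confidence = support / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# 11 past transactions changed fKeys[] and initDefaults();
# 10 of them also changed plugin.properties.
past = [{"fKeys[]", "initDefaults()", "plugin.properties"}] * 10
past.append({"fKeys[]", "initDefaults()"})

support, confidence = rule_stats(
    past, {"fKeys[]", "initDefaults()"}, {"plugin.properties"}
)
print(support, round(confidence, 3))  # 10 0.909
```
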
  33. 33. Evaluation GIMP PostgreSQL KOffice jEdit
  34. 34. Evaluation (GIMP, PostgreSQL, KOffice, jEdit) EROSE predicts 33% of all changed entities (files: 44%).
  35. 35. Evaluation (GIMP, PostgreSQL, KOffice, jEdit) EROSE predicts 33% of all changed entities (files: 44%). In 70% of all transactions, EROSE’s topmost three suggestions contain a changed entity (files: 72%).
  36. 36. Evaluation (GIMP, PostgreSQL, KOffice, jEdit) EROSE predicts 33% of all changed entities (files: 44%). In 70% of all transactions, EROSE’s topmost three suggestions contain a changed entity (files: 72%). EROSE learns quickly (within 30 days).
  37. 37. eROSE Related Changes (ICSE 2004, TSE 2005) guides developers non-program elements (documentation) learns quickly
  38. 38. BugCache Predicting Defects (ASE 2006, ICSE 2007) Sung Kim • MIT Tom Zimmermann • Saarland University Jim Whitehead • Univ. of California SC Andreas Zeller • Saarland University
  39. 39. The Problem How should we allocate our resources for quality assurance?
  40. 40. One Solution List with elements that (will) have defects List is adaptive, i.e., it changes over time
  41. 41. One Solution List with elements that (will) have defects Cache List is adaptive, i.e., it changes over time
  42. 42. The BugCache Model What is loaded in the cache? Cache size: 2 Hypothesis: Temporal locality between defects
  43. 43. The BugCache Model What is loaded in the cache? Cache size: 2 Hypothesis: Temporal locality between defects
  44. 44. The BugCache Model What is loaded in the cache? Cache size: 2 Hypothesis: Temporal locality between defects
  45. 45. The BugCache Model What is loaded in the cache? Cache size: 2 Hypothesis: Temporal locality between defects
  46. 46. The BugCache Model What is loaded in the cache? Cache size: 2 Hypothesis: Temporal locality between defects
  47. 47. The BugCache Model What is loaded in the cache? Cache size: 2 Miss Hypothesis: Temporal locality between defects
  48. 48. The BugCache Model What is loaded in the cache? Cache size: 2 Miss Hypothesis: Temporal locality between defects
  49. 49. The BugCache Model Cache size: 2 Miss
  50. 50. The BugCache Model Cache size: 2 Miss
  51. 51. The BugCache Model Cache size: 2 Miss Hit
  52. 52. The BugCache Model Cache size: 2 Miss Hit
  53. 53. The BugCache Model Cache size: 2 Miss Hit Miss
  54. 54. The BugCache Model Cache size: 2 Miss Hit Miss
  55. 55. The BugCache Model Cache size: 2 Miss Hit Miss Hit rate = #Hits / #Defects = 33.3%
  56. 56. The BugCache Model Cache size: 2 Miss Hit Miss
  57. 57. The BugCache Model Cache size: 2 Miss Hit Miss
  58. 58. The BugCache Model Cache size: 2 Miss Hit Miss Miss
  59. 59. The BugCache Model Cache size: 2 Miss Hit Miss Miss
  60. 60. The BugCache Model Cache size: 2 Miss Hit Miss Miss
  61. 61. Loading Elements Temporal locality – as shown before Spatial locality – load “nearby” elements (i.e., co-changed before) Changed-entity locality – load changed elements New-entity locality – load new elements Initial pre-fetch – start with a loaded cache
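
The cache walk-through on slides 42 to 60 can be sketched as a small simulation. It covers temporal locality only (an entity is loaded when a defect is found in it). The least-recently-used replacement policy, the function name `simulate_bug_cache`, and the entity names are assumptions; the slides do not say which entry is evicted.

```python
from collections import OrderedDict


def simulate_bug_cache(defects, cache_size=2):
    """Replay a sequence of defective entities against a fixed-size cache.

    Returns the hit/miss outcome per defect and the hit rate
    (#hits / #defects). Eviction is least-recently-used (an assumption).
    """
    cache = OrderedDict()
    outcomes = []
    for entity in defects:
        if entity in cache:
            outcomes.append("hit")
            cache.move_to_end(entity)      # mark as most recently used
        else:
            outcomes.append("miss")
            cache[entity] = True           # temporal locality: load on defect
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used entry
    hit_rate = outcomes.count("hit") / len(outcomes) if outcomes else 0.0
    return outcomes, hit_rate


# Hypothetical defect sequence that reproduces "Miss, Hit, Miss" (slide 55).
outcomes, hit_rate = simulate_bug_cache(["A", "A", "B"], cache_size=2)
print(outcomes, f"{hit_rate:.1%}")  # ['miss', 'hit', 'miss'] 33.3%
```

The other loading strategies on slide 61 (spatial, changed-entity, and new-entity locality, plus the initial pre-fetch) would add further entities to the cache at the points where this sketch only loads the defective entity.
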
  62. 62. Evaluation Mozilla jEdit PostgreSQL Columba
  63. 63. Hit Rates (cache size = 10%)

      Project       Methods                Files
                    BugCache   FixCache    BugCache   FixCache
      Apache 1.3    59.6%      61.5%       83.9%      81.5%
      Columba       58.9%      67.6%       83.5%      83.0%
      Eclipse       64.5%      71.6%       95.1%      95.0%
      JEdit         50.5%      48.9%       85.7%      85.4%
      Mozilla       49.3%      55.0%       93.3%      88.0%
      PostgreSQL    61.9%      59.2%       73.9%      71.0%
      Subversion    68.3%      43.8%       82.0%      81.3%

  65. 65. Reasons for Hits (pie chart) Temporal locality 60%, Initial pre-fetch 18%, Spatial locality 18%; remaining slices: Changed-entity locality, New-entity locality
  66. 66. Warning Developers “Safe” Location (not in FixCache) Risky Location (red, in FixCache)
  67. 67. BugCache Predicting Defects (ASE 2006, ICSE 2007) temporal locality adaptive hit rates of 71%~95%
  68. 68. Vulture Predicting Security Vulnerabilities (Work in Progress) Stephan Neuhaus • Saarland University Tom Zimmermann • Saarland University Andreas Zeller • Saarland University
  69. 69. Firefox/Mozilla >700 developers 228,365 commits 14,368 C/C++ files 1,012,512 revisions (10,452 components)
  70. 70. >700 developers 228,365 commits 14,368 C/C++ files 1,012,512 revisions (10,452 components)
  71. 71. Vulnerabilities
  72. 72. Vulnerabilities
  73. 73. Vulnerabilities 0 Vulnerabilities
  74. 74. Vulnerabilities Security Advisory 2005-12 Title: Livefeed bookmarks can steal cookies Impact: High Products: Firefox Description: Earlier versions of Firefox allowed javascript: and data: URLs as Livefeed bookmarks. When they updated the URL would be run in the context of the current page and could be used to steal cookies or data displayed on the page. If the user were on a page with elevated privileges (for example, about:config) when the Livefeed was updated, the feed URL could potentially run arbitrary code on the user's machine. 0 Vulnerabilities
  75. 75. Vulnerabilities 0 Vulnerabilities
  76. 76. Vulnerabilities Security Advisory 2005-13 Title: Window Injection Spoofing Severity: Low Products: Firefox, Mozilla Suite Description: A website can inject content into a popup opened by another site if the target name of the popup window is known. An attacker who knows you are going to visit that other site could spoof the contents of the popup. 0 Vulnerabilities
  77. 77. Vulnerabilities Security Advisories 2005-14, 2005-15, 2005-16, 2005-41, 2006-76 (several advisories overlaid on one slide). Titles: SSL "secure site" indicator spoofing; Heap overflow possible in UTF8 to Unicode conversion; Spoofing download and security dialogs with overlapping windows; Privilege escalation via DOM property overrides; XSS using outer window's Function object. 0 Vulnerabilities
  78. 78. Vulnerabilities 0 Vulnerabilities
  79. 79. Vulnerabilities 10,452 components 424 vulnerable 4.05% 0 Vulnerabilities
  80. 80. Vulnerabilities What other components are vulnerable? 0 Vulnerabilities
  81. 81. Vulnerabilities 0 Vulnerabilities
  82. 82. Vulnerabilities 0 Vulnerabilities ?
  83. 83. Vulnerabilities Is this new component likely to be vulnerable? 0 Vulnerabilities ?
  84. 84. Vulture (diagram: Vulnerability Database, Version Archive, Code) Redo diagram
  85. 85. Vulture (diagram: Vulnerability Database, Version Archive, Code → Vulture) Redo diagram
  86. 86. Vulture (diagram: Vulnerability Database, Version Archive, Code → Vulture → Component, Component, Component) Redo diagram
  87. 87. Vulture (diagram: Vulnerability Database, Version Archive, Code → Vulture → Predictor; Component, Component, Component) Redo diagram
  88. 88. Vulture (diagram: Vulnerability Database, Version Archive, Code → Vulture → Predictor; Component, Component, Component) Redo diagram
  89. 89. Correlations
  90. 90. Correlations Programmer Code Complexity Language
  91. 91. Correlations Code Complexity Language
  92. 92. Correlations Language
  93. 93. Correlations Language Problem Domain
  94. 94. Imports
  95. 95. Imports GUI Database Certificates OS
  96. 96. Imports GUI Database Certificates OS
  97. 97. Imports GUI Database Certificates OS
  98. 98. Example (1) nsIContent.h nsIContentUtils.h nsIScriptSecurityManager.h
  99. 99. Example (1) nsIContent.h import nsIContentUtils.h nsIScriptSecurityManager.h
  100. 100. Example (1) import nsIContent.h, nsIContentUtils.h, nsIScriptSecurityManager.h: 21 ✘, 1 ✔ (95.5%)
  101. 101. Example (2) nsIPrivateDOMEvent.h nsReadableUtils.h
  102. 102. Example (2) import nsIPrivateDOMEvent.h nsReadableUtils.h
  103. 103. Example (2) import nsIPrivateDOMEvent.h, nsReadableUtils.h: 19 ✘, 0 ✔ (100%)
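
The percentages in Examples (1) and (2) can be read as "of the components that import all of these headers, what share is vulnerable?". The sketch below computes that share, assuming each ✘ on the slide stands for a vulnerable component and ✔ for a non-vulnerable one. The function name `vulnerable_share`, the dictionary layout, and the component names are hypothetical.

```python
def vulnerable_share(components, required_imports):
    """Share of vulnerable components among those importing every header
    in `required_imports`; None if no component imports all of them."""
    required = set(required_imports)
    matching = [c for c in components if required <= c["imports"]]
    if not matching:
        return None
    return sum(1 for c in matching if c["vulnerable"]) / len(matching)


headers = {"nsIContent.h", "nsIContentUtils.h", "nsIScriptSecurityManager.h"}

# Synthetic stand-ins for the 22 marks on slide 100: 21 vulnerable, 1 not.
components = [
    {"name": f"component_{i}", "imports": set(headers), "vulnerable": i < 21}
    for i in range(22)
]
# A component that imports only one of the headers is ignored by the query.
components.append(
    {"name": "unrelated", "imports": {"nsIContent.h"}, "vulnerable": False}
)

share = vulnerable_share(components, headers)
print(f"{share:.1%}")  # 95.5%
```
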
  104. 104. Research Questions • How well do imports predict vulnerabilities? • Can imports be used for − classification (vulnerable or not) and for − regression (number of vulnerabilities)?
  105. 105. Input Data nsCOMArray 0 nsIDocument.h 1 nspr_md.h 0 nsDOMClassInfo 10 EmbedGTKTools 0 MozillaControl.cpp 0 nsDOMClassInfo has had 10 vulnerability-related bug reports
  106. 106. Input Data (one row per component: number of vulnerability-related bug reports, then one 0/1 column per import; import column headers illegible)

       Component            Reports   Imports (0/1)
       nsCOMArray           0         1 0 0 0 1 0 0
       nsIDocument.h        1         0 0 1 0 0 1 0
       nspr_md.h            0         0 1 1 0 0 1 0
       nsDOMClassInfo       10        0 0 1 0 1 0 0
       EmbedGTKTools        0         0 0 0 0 1 0 0
       MozillaControl.cpp   0         0 1 0 1 0 0 0

       nsDOMClassInfo has had 10 vulnerability-related bug reports. nsDOMClassInfo imports “nsIXPConnect.h”.

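
A matrix of this shape can be built from a mapping of components to imported headers. The sketch below is illustrative only: `build_import_matrix` is a made-up name, and apart from nsDOMClassInfo importing "nsIXPConnect.h" (stated on the slide), the import sets are invented. The report counts are those shown on slide 105.

```python
def build_import_matrix(imports_by_component):
    """Return (columns, rows): one 0/1 entry per known header for each
    component, with headers sorted alphabetically."""
    columns = sorted({h for hs in imports_by_component.values() for h in hs})
    rows = {
        name: [1 if header in hs else 0 for header in columns]
        for name, hs in imports_by_component.items()
    }
    return columns, rows


# Hypothetical import sets (only the nsIXPConnect.h import is from the slide).
imports_by_component = {
    "nsDOMClassInfo": {"nsIXPConnect.h", "nsString.h"},
    "nsCOMArray": {"nsString.h"},
    "EmbedGTKTools": set(),
}
# Vulnerability-related bug reports per component (slide 105).
reports = {"nsDOMClassInfo": 10, "nsCOMArray": 0, "EmbedGTKTools": 0}

columns, rows = build_import_matrix(imports_by_component)
# Classification target: vulnerable or not. Regression target: the count.
is_vulnerable = {name: count > 0 for name, count in reports.items()}
```

The rows serve as features for both tasks named in the research questions: `is_vulnerable` for classification and `reports` for regression.
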
  107. 107. Distribution (two histograms) Distribution of MFSAs: Number of Components against Number of MFSAs. Distribution of Bug Reports: Number of Components against Number of Bug Reports.
  108. 108. Experiments • 40 random splits: 6,968 rows in training set, 3,484 rows in validation set • Classification: Train SVM, compute recall and precision • Regression: Train SVM, compute rank correlation on top 1% • SVM: linear kernel with default parameters, R implementation (up to 10GB of main memory)
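
The split sizes and the two classification measures can be sketched as follows. The SVM itself is omitted; the talk used an R implementation, while this sketch uses Python with hypothetical function names and made-up labels, so it shows only the evaluation arithmetic.

```python
import random


def split_rows(rows, seed=0):
    """Randomly split rows into two thirds training, one third validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = (2 * len(shuffled)) // 3
    return shuffled[:cut], shuffled[cut:]


def precision_recall(actual, predicted):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# 10,452 components give the split sizes stated on the slide.
training, validation = split_rows(range(10452))
print(len(training), len(validation))  # 6968 3484

# Made-up labels: 1 = vulnerable, 0 = not vulnerable.
actual = [1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 1, 0, 0]
precision, recall = precision_recall(actual, predicted)
```

Here two of the four predicted components are really vulnerable (precision 0.5) and two of the three vulnerable components are found (recall 2/3).
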
  109. 109. Results (a) Precision and Recall [scatter plot: precision, 0.35 to 0.55, against recall, 0.55 to 0.75] (b) Rank Correlation [cumulative distribution over rank correlation, 0.2 to 0.7]
  110. 110. Results (a) Precision and Recall (b) Rank Correlation. 45% (about 1/2) of predictions correct
  111. 111. Results (a) Precision and Recall (b) Rank Correlation. 2/3 of all vulnerable components detected. 45% (about 1/2) of predictions correct
  112. 112. Results (a) Precision and Recall (b) Rank Correlation. 2/3 of all vulnerable components detected. 45% (about 1/2) of predictions correct
  113. 113. Results (a) Precision and Recall (b) Rank Correlation. 2/3 of all vulnerable components detected. 45% (about 1/2) of predictions correct. Moderately strong correlation (mostly significant at p < 0.01)
  114. 114. Ranking
  115. 115. Ranking

       Rank   Component              Actual Rank
       1      nsDOMClassInfo         3
       2      SGridRowLayout         95
       3      xpcprivate             6
       4      jsxml                  2
       5      nsGenericHTMLElement   8
       6      jsgc                   3
       7      nsISEnvironment        12
       8      jsfun                  1
       9      nsHTMLLabelElement     18
       10     nsHttpTransaction      35
       ...    (3,474 components)

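
How well a predicted ranking agrees with the actual one can be measured with a rank correlation. The sketch below uses Spearman's coefficient with average ranks for ties. Both the choice of Spearman and the function names are assumptions, since the slides say only "rank correlation"; the input is the ten rows shown on the Ranking slide, not the full top 1%.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors.
    Assumes neither input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


predicted_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
actual_rank = [3, 95, 6, 2, 8, 3, 12, 1, 18, 35]  # from the Ranking slide
rho = spearman(predicted_rank, actual_rank)
```

A value of `rho` near 1 means the predicted order closely follows the actual order; a value near 0 means the two orders are unrelated.
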
  119. 119. Similar Results for Bugs Packages + Import relationships (ISESE 2006) Precision: 66.7% Recall: 69.4% Binaries + Dependencies (Internship @ Microsoft Research, 2006) Precision: 64.4% Recall: 75.3%
  120. 120. Vulture Predicting Security Vulnerabilities (Work in Progress) locates past + predicts new vulnerabilities problem domain
  121. 121. Future Work ?
  122. 122. #1: Mining across Projects • Complement source code search engines with mining techniques. • Large-scale mining (144,000 SF projects)
  123. 123. #2: Developer Buddy MOCKUP
  124. 124. eROSE BugCache Vulture
  125. 125. automatic eROSE BugCache Vulture
  126. 126. automatic large-scale eROSE BugCache Vulture
  127. 127. automatic large-scale eROSE BugCache Vulture tool-oriented
  128. 128. automatic large-scale Empirical Software Engineering 2.0 tool-oriented
  129. 129. automatic large-scale Empirical Software Engineering 2.0 tool-oriented Thanks! Questions?