Think Locally, Act Globally: Improving Defect and Effort Prediction Models

Talk given at the 2012 Working Conference on Mining Software Repositories (MSR'12) in Zürich, Switzerland.


  1. 1. Think Locally, Act Globally: Improving Defect and Effort Prediction Models. Nicolas Bettenburg • Meiyappan Nagappan • Ahmed E. Hassan. Software Analysis & Intelligence Lab, Queen’s University, Kingston, ON, Canada. Saturday, 2 June 2012
  4. 4. Data Modelling in Empirical SE. Observations: measured from project data. Model: describes observations mathematically. Prediction: guides decision making. Understanding: guides process optimizations and future research.
  9. 9. Model Building Today. The Whole Dataset is split into Training Data and Testing Data. A Learned Model M is built from the Training Data. M is applied to the Testing Data to produce Predictions Y. Compare: the predictions are checked against the Testing Data.
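The workflow on the "Model Building Today" slide can be sketched in a few lines. This is a minimal illustration, not the authors' setup: it assumes a one-predictor least-squares line as the learned model M and mean absolute error as the comparison. The synthetic data, the two-thirds split, and all function names are hypothetical.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for the learned model M)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def predict(model, xs):
    """Apply the fitted line to new predictor values."""
    intercept, slope = model
    return [intercept + slope * x for x in xs]


def mean_abs_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)


rng = random.Random(0)

# Hypothetical whole dataset: (metric value, outcome) pairs with some noise.
whole = [(float(x), 2.0 * x + 1.0 + rng.gauss(0, 0.5)) for x in range(30)]
rng.shuffle(whole)

# Split the whole dataset into training data and testing data.
split = 2 * len(whole) // 3
training, testing = whole[:split], whole[split:]

# Learn M on the training data, predict Y for the testing data, then compare.
model = fit_line([x for x, _ in training], [y for _, y in training])
predictions = predict(model, [x for x, _ in testing])
error = mean_abs_error([y for _, y in testing], predictions)

print(f"slope={model[1]:.2f} intercept={model[0]:.2f} test MAE={error:.2f}")
```

The testing data never influences `fit_line`, which is what makes the final comparison an estimate of how the model behaves on unseen observations.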
  10. 10. Much research effort on new metrics and new models!
  11. 11. Maybe we need to look more at the data part.
  17. 17. In the Field. "We ran 622 cross-project predictions and found that only 3.4% actually worked." (Tom Zimmermann) "Rather than focus on generalities, empirical SE should focus more on context-specific principles." (Tim Menzies) Taking local properties of data into consideration leads to better models!
  21. 21. Using Locality in Statistical Models. (1) Does this principle work for statistical models? (2) Does it work for prediction? (3) Can we do better?
  26. 26. Building Local Models. The Whole Dataset is split into Training Data and Testing Data. Cluster Data: the Training Data is clustered. Learn Multiple Models: one learned model per cluster (M1, M2, M3). Predict Individually: the models produce Predictions Y for the Testing Data. Compare.
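The "Building Local Models" steps (cluster the data, learn multiple models, predict individually) can be sketched as follows. This is only an illustration under stated assumptions: a tiny one-dimensional k-means stands in for the clustering step (a Figure 3 caption later in the deck names MCLUST), each local model is a least-squares line, and a test point is predicted by the model of its nearest cluster centre. The data and all names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares line y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return mean_y - slope * mean_x, slope


def nearest_cluster(x, centres):
    """Index of the cluster centre closest to x."""
    return min(range(len(centres)), key=lambda i: abs(x - centres[i]))


def kmeans_1d(xs, k, iterations=20):
    """Tiny 1-D k-means; a simple stand-in for the clustering step."""
    lo, hi = min(xs), max(xs)
    centres = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[nearest_cluster(x, centres)].append(x)
        # Keep the old centre when a cluster ends up empty.
        centres = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
    return centres


def fit_local_models(train, k):
    """Cluster the training data, then learn one model per cluster."""
    centres = kmeans_1d([x for x, _ in train], k)
    global_model = fit_line([x for x, _ in train], [y for _, y in train])
    models = []
    for i in range(k):
        members = [(x, y) for x, y in train if nearest_cluster(x, centres) == i]
        if members:
            models.append(fit_line([x for x, _ in members], [y for _, y in members]))
        else:
            models.append(global_model)  # fall back if a cluster is empty
    return centres, models


def predict_local(x, centres, models):
    """Predict with the model that belongs to the nearest cluster."""
    intercept, slope = models[nearest_cluster(x, centres)]
    return intercept + slope * x


# Hypothetical training data with two regions that follow opposite trends.
train = [(float(x), float(x)) for x in range(5)]
train += [(float(x), 30.0 - 2.0 * x) for x in range(10, 15)]

centres, models = fit_local_models(train, k=2)
global_a, global_b = fit_line([x for x, _ in train], [y for _, y in train])

for x in (3.0, 13.0):
    print(x, "local:", predict_local(x, centres, models),
          "global:", round(global_a + global_b * x, 2))
```

On this toy data a single global line averages the two opposing trends, while each local model recovers the trend of its own region.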
  30. 30. Global Statistical Model. [Figure: f(X) plotted against X from 0 to 6, from "Chapter 2. General Aspects of Fitting Regression Models"; caption: "Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5."] Model fit leaves much room for improvement!
  34. 34. Local Statistical Model. [Figure: f(X) plotted against X from 0 to 6, from "Chapter 2. General Aspects of Fitting Regression Models", with two fitted models labelled Model 1 and Model 2; caption: "Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5."] Improved Fit!
  35. 35. How can we use this approach to get an even better fit?
  40. 40. Be Even More Local! [Figure: f(X) plotted against X from 0 to 6, from "Chapter 2. General Aspects of Fitting Regression Models"; caption: "Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5."] Great Fit! BUT: Risk of Overfitting the Data!!
  42. 42. Clustering independent of Fit
  48. 48. Optimize Local Fit with Respect to Minimizing Global Overfit. [Figure: plots of f(X) against X from 0 to 6, from "Chapter 2. General Aspects of Fitting Regression Models"; caption: "Figure 2.1: A linear spline function with knots at a = 1, b = 3, c = 5."] The linear spline is $C(Y \mid X) = f(X) = X\beta$, where $X\beta = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$, and $X_1 = X$, $X_2 = (X - a)_+$, $X_3 = (X - b)_+$, $X_4 = (X - c)_+$, with $(u)_+ = u$ for $u > 0$ and $0$ otherwise. Multivariate Adaptive Regression Splines (MARS): create local knowledge that optimizes process globally.
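The linear spline shown on the "Optimize Local Fit" slide can be evaluated directly from its hinge terms. The sketch below assumes the knots a = 1, b = 3, c = 5 from the figure caption; the coefficient values and function names are hypothetical. It does not implement MARS, which selects its knots from the data.

```python
def hinge(u):
    """Positive part (u)+ : u when u is positive, otherwise 0."""
    return u if u > 0 else 0.0


def linear_spline(x, betas, knots=(1.0, 3.0, 5.0)):
    """Evaluate b0 + b1*X + b2*(X-a)+ + b3*(X-b)+ + b4*(X-c)+."""
    b0, b1, *hinge_betas = betas
    value = b0 + b1 * x
    for beta, knot in zip(hinge_betas, knots):
        value += beta * hinge(x - knot)
    return value


# Hypothetical coefficients: each hinge coefficient changes the slope at its knot.
betas = [0.0, 1.0, -2.0, 3.0, -1.5]

for x in range(7):
    print(x, linear_spline(float(x), betas))
```

With these coefficients the slope is 1 before the first knot, then -1, then 2, then 0.5 after the last knot: one continuous function whose pieces behave like local linear models.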
  52. 52. Case Study. Xalan 2.6 and Lucene 2.4: Post-Release Defects per Class, 20 CK Metrics. CHINA: Total Development Effort in Hours, 14 FP Metrics. NasaCoc: Development Length in Months, 24 COCOMO-II Metrics.
  59. 59. Results: Goodness of Fit. Rank correlation (0 = worst fit, 1 = optimal fit):

          Dataset       Global   Local (Clustered)   MARS
          Xalan 2.6     0.33     0.52                0.69
          Lucene 2.4    0.32     0.60                0.83
          CHINA         0.83     0.89                0.89
          NasaCoc       0.93     0.97                0.99

      [Figure 3: Number of clusters generated by MCLUST in each run of the 10-fold cross validation. Y axis: Number of Clusters (0 to 8); X axis: Fold01 to Fold10; one series per dataset (CHINA, Lucene 2.4, NasaCoc, Xalan 2.6).]

      Paper excerpts visible on the slide, both cut off at the slide edge: "... term for each additional prediction variable entering the regression model [23]. For practical purposes, we use a publicly available implementation of BIC-based model selection, contained in the R package: BMA. The input to the BMA implementation is the dataset itself, as well as a list of all dependent and independent variables that should be considered. In our case study, we always supply a list of all independent variables that were left after VIF analysis. The output of the BMA ..." and "... is too small to continue or until a maximum number of terms is reached. In our case study, the maximum number of terms is automatically determined by the implementation, and is based on the amount of independent variables we give as input. For MARS models, we use all independent variables in a dataset after VIF analysis. The first phase often builds a model that suffers from overfitting. As a result, the second phase, called the backward phase, prunes the model, to increase the model’s gen..."

      Up to 2.5x better fit when using data locality!
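The goodness-of-fit measure on the "Results: Goodness of Fit" slide is a rank correlation between actual and fitted values. The sketch below assumes Spearman's rho with average ranks for ties; the slide says only "Rank-Correlation", so the exact variant is an assumption, and the sample values and function names are hypothetical. The last lines reproduce the arithmetic behind the "up to 2.5x" headline using the largest and smallest values in the Lucene 2.4 row.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(actual, fitted):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(actual), average_ranks(fitted))


# Hypothetical defect counts per class and the values a model fitted for them.
actual = [0, 1, 0, 3, 2, 5]
fitted = [0.1, 0.8, 0.3, 2.5, 2.9, 4.0]
rho = spearman(actual, fitted)
print(f"rank correlation: {rho:.3f}")

# Largest versus smallest rank correlation in the Lucene 2.4 row of the slide.
improvement = 0.83 / 0.32
print(f"improvement factor: {improvement:.2f}")
```

A rank correlation only asks whether the model orders the observations correctly, so it is insensitive to the scale of the fitted values.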
  61. 61. Results: Prediction Error. [Bar charts, one panel per dataset, with bars for Global, Local, and MARS. Bar labels as extracted: Xalan 2.6: 0.64, 0.52, 0.4; Lucene 2.4: 1.15, 1.15, 0.94; CHINA: 765, 552.85, 234.43; NasaCoc: 3.26, 2.14, 1.63.] Up to 4x lower prediction error with Local Models!
  62. 62. Model Interpretation?
  65. 65. Model Interpretation. [Figure 6(a): Part of a global model learned on the Xalan 2.6 dataset. One panel per metric: avg_cc, ca, cam, cbm, ce, dam, dit, ic, lcom, lcom3, loc, max_cc, mfa, moa, noc, npm. Caption, cut off at the slide edge: "Figure 6: Global models report general trends, while global models with local c... describes the response (in this case bugs) while keeping all other prediction variab..."] Traditional Global Model: General Trends. One Curve per metric, run corp on that curve.
  66. 66. Model Interpretation. [Figure 6 caption, cut off at the slide edge: "Global models report general trends, while global models with local considerations give insight ... describes the response (in this case bugs) while keeping all other prediction variables at their median value ..."] [Figure 7: Example of contradicting trends in local models (Xalan 2.6, Cluster 1 and Cluster 6 in Fold 9). Panels for Fold 9, Cluster 1 and Fold 9, Cluster 6, each showing ic, npm, and mfa.] Paper excerpt visible on the slide, cut off at the slide edge: "... model already partition the data into regions with individual properties. For example, we observe that an increase of ic (measuring the inheritance coupling through parent classes) is predicted to only have a negative effect on bug-proneness ..." [A second text column is truncated mid-line; visible heading: "B. Act Globally".]
