Automated parameter optimization should be included in future defect prediction studies
1. Automated Parameter Optimization for Defect Prediction Models
Chakkrit (Kla) Tantithamthavorn, Shane McIntosh, Ahmed E. Hassan, Kenichi Matsumoto
http://chakkrit.com kla@chakkrit.com @klainfo
2. Defect models are used to predict software modules that are likely to be defective in the future
[Diagram: during the pre-release period, a defect prediction model classifies modules A–D as clean or defect-prone; after the release date, the post-release outcomes show which modules actually were clean or defect-prone.]
3. Defect models are trained using classification techniques
- Decision Tree Algorithms
- Regression Algorithms
- Clustering Algorithms
- Ensemble Algorithms
5. Such classification techniques often require parameter settings
e.g., the number of trees in a random forest classifier (an ensemble algorithm)
6. Such classification techniques often require parameter settings
26 of the 30 most commonly used classification techniques require at least one parameter setting
7. Defect models may underperform if they are trained using suboptimal parameter settings
The default settings of random forest, naïve Bayes, and support vector machines are suboptimal
[Jiang et al., DEFECTS'08] [Tosun et al., ESEM'09] [Hall et al., TSE'12]
8. Different toolkits have different default settings for the same classification technique
The default setting for the number of trees in a random forest varies across toolkits (e.g., the randomForest and bigrf packages): 10, 50, 100, 500
9. The parameter space is too large for manual inspection
There are at least 17,000 possible settings to explore when training k-NN classifiers [Kocaguneli et al., TSE'12]
10. How do automated parameter optimization techniques fare when applied to defect prediction?
14. Caret — an off-the-shelf automated parameter optimization technique
Step 1: Generate candidate settings
Step 2: Evaluate the candidate settings, producing a performance score for each setting
Step 3: Identify the optimal setting
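Caret itself is an R package, but its three steps amount to a grid search. A minimal sketch in Python, where `evaluate` is a hypothetical stand-in for the bootstrap-validated performance of a model trained with one candidate setting:

```python
# Minimal grid-search sketch of Caret's three steps (hypothetical Python
# analogue; Caret itself is an R package). `evaluate` stands in for the
# bootstrap-validated AUC of a model trained with one candidate setting.

def generate_candidate_settings():
    # Step 1: candidate values for a hypothetical #trees parameter
    return [10, 20, 30, 40, 50]

def optimize(evaluate):
    settings = generate_candidate_settings()
    # Step 2: evaluate each candidate setting
    scores = {s: evaluate(s) for s in settings}
    # Step 3: the optimal setting achieves the top performance score
    return max(scores, key=scores.get)

# Toy performance function using the illustrative AUC values from the deck
toy_auc = {10: 0.65, 20: 0.68, 30: 0.70, 40: 0.80, 50: 0.86}
print(optimize(toy_auc.get))  # 50
```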
15. (Step 1) Generate a set of candidate settings to evaluate
e.g., #trees for random forest: 10, 20, 30, 40, 50
16. (Step 2) Evaluate the performance of each candidate setting using bootstrap validation
Out-of-sample bootstrap validation with 100 repetitions: generate bootstrap samples from the defect dataset to form a training corpus and a testing corpus, construct a defect model on the training corpus, and calculate its performance on the testing corpus.
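The evaluation step can be sketched as follows. This is a simplified stand-in: `train` and `score` are hypothetical placeholders for defect-model construction and performance calculation (the study uses AUC):

```python
import random

# Sketch of out-of-sample bootstrap validation with 100 repetitions.
# `train` and `score` are hypothetical stand-ins for defect-model
# construction and performance calculation (e.g., AUC).

def bootstrap_validate(dataset, train, score, repetitions=100, seed=42):
    rng = random.Random(seed)
    n = len(dataset)
    performances = []
    for _ in range(repetitions):
        # Training corpus: a bootstrap sample drawn with replacement
        drawn = [rng.randrange(n) for _ in range(n)]
        training = [dataset[i] for i in drawn]
        # Testing corpus: the out-of-sample rows not drawn into the sample
        testing = [dataset[i] for i in range(n) if i not in set(drawn)]
        model = train(training)
        performances.append(score(model, testing))
    return sum(performances) / len(performances)

# Toy usage: the "model" is the mean label; "score" is simple accuracy.
data = [(i, i % 2) for i in range(50)]
mean_perf = bootstrap_validate(
    data,
    train=lambda rows: sum(y for _, y in rows) / len(rows),
    score=lambda m, rows: sum((m > 0.5) == bool(y) for _, y in rows) / max(len(rows), 1),
)
print(round(mean_perf, 2))
```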
17. (Step 3) The optimal setting is the one that achieves the top performance score
e.g., #trees of 10, 20, 30, 40, 50 yield AUC = 0.65, 0.68, 0.70, 0.80, 0.86, so #trees = 50 is selected
18. We study a collection of 18 datasets from 5 open corpora
A threat of bias exists if researchers fixate on studying the same datasets with the same metrics [Tantithamthavorn et al., TSE'16]
19. We study a collection of 18 datasets from 5 open corpora
- 1–7K modules, 21–28% defective rate, 21–38 metrics [Shepperd et al., TSE'13]
- 1–10K modules, 11–44% defective rate, 15–32 metrics [Zimmermann et al., PROMISE'07] [D'Ambros et al., MSR'10] [Kim et al., ICSE'11]
- 600–800 modules, 36–48% defective rate, 20 metrics [Jureczko et al., PROMISE'10]
24. Parameter settings can substantially influence the performance of defect prediction models
[Figure: boxplots of the AUC performance improvement (0.0–0.4) of each studied classification technique (C5.0, AdaBoost, AVNNet, CART, PCANNet, NNet, FDA, MLPWeightDecay, MLP, LMT, GPLS, LogitBoost, KNN, xGBTree, GBM, NB, RBF, SVMRadial, GAM, …), grouped into Large, Medium, and Small improvement. Each boxplot presents the performance improvement for all the 18 studied datasets.]
25. Parameter settings can substantially influence the performance of defect prediction models
9 of the 26 studied classification techniques have a large performance improvement
26. Parameter settings can substantially influence the performance of defect prediction models
C5.0 and AdaBoost have a median improvement of 0.27 and 0.14 AUC, respectively
27. Parameter settings can substantially influence the performance of defect prediction models
The improvements for C5.0 and AdaBoost span up to 0.40 AUC
28. Performance Improvement: Caret improves the AUC performance by up to 40 percentage points
29. Performance Stability
30. Default settings may introduce instability into defect prediction models
Unstable performance estimates may introduce bias into the conclusions of research
[Jorgensen et al., TSE'07] [Menzies and Shepperd, EMSE'12]
42. Prior findings on top-performing classification techniques
17 of 22 classification techniques are statistically indistinguishable [Lessmann et al., TSE'08], yet classification techniques have a large impact on the performance [Ghotra et al., ICSE'15]. However, these studies have not taken parameter optimization into account.
43. Identifying statistically distinct ranks of classification techniques
For each dataset, the 100-repetition AUC performance distributions of techniques 1 through 26 are fed into a Scott-Knott ESD test, which produces a statistically distinct ranking of the techniques for that dataset. Repeating this for dataset 1 through dataset 18 yields one ranking per dataset.
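The real Scott-Knott ESD test partitions techniques using hierarchical clustering and effect sizes; purely for illustration, a greatly simplified ranking that groups techniques whose mean AUCs are within a small threshold might look like this (the threshold and toy data are invented):

```python
import statistics

# Greatly simplified illustration of producing distinct ranks. The real
# Scott-Knott ESD test clusters techniques by statistically distinct,
# non-negligible differences; here we merely group techniques whose mean
# AUCs differ by less than a fixed threshold.

def rank_techniques(perf, threshold=0.02):
    # perf: technique -> list of AUC values (e.g., 100 bootstrap repetitions)
    ordered = sorted(((statistics.mean(v), t) for t, v in perf.items()), reverse=True)
    ranks, rank, prev_mean = {}, 0, None
    for mean, tech in ordered:
        if prev_mean is None or prev_mean - mean > threshold:
            rank += 1  # start a new distinct group
        ranks[tech] = rank
        prev_mean = mean
    return ranks

perf = {"T1": [0.80, 0.82], "T2": [0.81, 0.80], "T3": [0.70, 0.72]}
print(rank_techniques(perf))  # {'T1': 1, 'T2': 1, 'T3': 2}
```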
47. Identifying statistically distinct ranks of classification techniques
The per-dataset Scott-Knott ESD rankings form a pool of rankings, e.g.:

Dataset | T1 | T2 | T3
1       | 2  | 1  | 3
2       | 1  | 2  | 3
3       | 1  | 1  | 2
48. Compute the proportion of datasets where a classifier appears in the top rank
From the pool of rankings, the likelihood for each technique is the fraction of datasets in which it ranks first: T1 = 0.67, T2 = 0.67, T3 = 0
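This likelihood computation can be sketched directly from the pool of rankings shown on the slide:

```python
# Sketch of the likelihood computation: the proportion of datasets in which
# each technique appears at the top rank (rank 1), using the slide's pool
# of rankings.

rankings = [  # one dict of technique -> rank per dataset
    {"T1": 2, "T2": 1, "T3": 3},  # dataset 1
    {"T1": 1, "T2": 2, "T3": 3},  # dataset 2
    {"T1": 1, "T2": 1, "T3": 2},  # dataset 3
]

def top_rank_likelihood(pool):
    techniques = pool[0].keys()
    return {t: sum(r[t] == 1 for r in pool) / len(pool) for t in techniques}

lik = top_rank_likelihood(rankings)
# T1 = 2/3 and T2 = 2/3 (the slide's 0.67); T3 never ranks first
print({t: round(v, 2) for t, v in lik.items()})  # {'T1': 0.67, 'T2': 0.67, 'T3': 0.0}
```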
50. Bootstrap resampling to combat sample selection bias
A bootstrap sample of rankings is drawn with replacement from the pool of rankings, e.g., a sample containing dataset 1 once and dataset 2 twice.
51. Re-compute the likelihood for each sample
For a bootstrap sample containing dataset 1 once and dataset 2 twice, the likelihoods become T1 = 0.67, T2 = 0.33, T3 = 0
52. Repeat the bootstrap 100 times to estimate the confidence interval
Each of the 100 bootstrap samples of rankings yields one likelihood per technique (e.g., T1 = 0.67, T2 = 0.33, T3 = 0; …; T1 = 0.33, T2 = 0, T3 = 0), producing a distribution of likelihood for each technique.
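The bootstrap over datasets can be sketched as follows (a toy 3-dataset pool for illustration; the study uses 18 datasets and 26 techniques):

```python
import random

# Sketch of the bootstrap over datasets: resample the per-dataset rankings
# with replacement, recompute the top-rank likelihood each time, and treat
# the 100 recomputed likelihoods as a distribution from which a confidence
# interval can be estimated.

rankings = [  # toy pool of per-dataset rankings (technique -> rank)
    {"T1": 2, "T2": 1, "T3": 3},
    {"T1": 1, "T2": 2, "T3": 3},
    {"T1": 1, "T2": 1, "T3": 2},
]

def likelihood(pool, technique):
    return sum(r[technique] == 1 for r in pool) / len(pool)

def bootstrap_likelihoods(pool, technique, repetitions=100, seed=1):
    rng = random.Random(seed)
    dist = []
    for _ in range(repetitions):
        sample = [rng.choice(pool) for _ in pool]  # resample with replacement
        dist.append(likelihood(sample, technique))
    return sorted(dist)

dist = bootstrap_likelihoods(rankings, "T1")
print(dist[2], dist[97])  # a rough 95% interval from the 100 replicates
```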
54. Caret optimization can substantially shift the top-ranked classification techniques
[Figure: top-rank likelihood estimates (0.0–1.0) per classification technique, comparing optimized classifiers against default classifiers (C5.0, xGBTree, AVNNet, GBM, RF, GPLS, PDA, NNet, PMR, GAMBoost, PCANNet, MARS, FDA, AdaBoost, SVMRadial, …).]
56. Caret optimization can substantially shift the top-ranked classification techniques
Caret increases the likelihood of appearing in the top rank by up to 83%