The statistical assessment of the empirical comparison of algorithms is an essential step in heuristic optimization. Classically, researchers have relied on statistical tests. However, concerns about their use have recently arisen and, in many fields, other (Bayesian) alternatives are being considered. For a proper analysis, different aspects should be considered. In this talk, we focus on the question: what is the probability of a given algorithm being the best among those compared? To tackle this question, we propose a Bayesian analysis based on the Plackett-Luce model over rankings, which allows several algorithms to be considered at the same time. To illustrate the proposed Bayesian alternative, we examine performance data of 11 evolutionary algorithms (EAs) on a set of 23 discrete optimization problems in several dimensions. Using these data, and following a brief introduction to the relevant Bayesian inference practice, we demonstrate how to derive the algorithms' probabilities of winning.
Bayesian Performance Analysis for Optimization Algorithm Comparison
1. Josu Ceberio
Bayesian Analysis for
Algorithm Performance Comparison
Is it possible to compare optimization
algorithms without hypothesis testing?
2. Is there a reproducibility crisis?
Source: Monya Baker (2016) Is there a reproducibility crisis? Nature, 533, 452-454
3. Hypothesis
Idea for solving a set of problems more efficiently.
Questions
Is my algorithm better than the state-of-the-art?
On which problems is my algorithm better?
Why is my algorithm better (or worse)?
Experimentation
Compare the performance of my algorithm with the state-of-the-art on some benchmark of problems.
The analysis of the results should take into account the associated uncertainty.
Conclusions
What conclusions do we draw from the experimentation?
How do we answer the formulated questions?
4. The Questions
How likely is my proposal to be the best algorithm to solve a problem?
How likely is my proposal to be the best algorithm among the compared ones?
5. The Point
STATISTICAL ANALYSIS OF
EXPERIMENTAL RESULTS
NULL HYPOTHESIS
STATISTICAL TESTING
WHAT NHST COMPUTES
p(t(x) > τ | H₀)
Unknown Behaviour
Observed Sample
7. The controversy with NHST
We assume the null hypothesis: the average performance of the compared methods is the same. Then, the observed difference is computed from the data, and the probability of observing such a difference (or a bigger one) under the null hypothesis is estimated: the p-value.
The p-value is commonly misread as the probability of erroneously assuming that there are differences when actually there are none, and misused as a measure of the magnitude of the difference (since it decreases when the difference increases). Neither reading is correct, which fuels the controversy.
WHAT NHST COMPUTES
p(t(x) > τ | H₀)
1 − p(t(x) > τ | H₀) = p(t(x) < τ | H₀)
WHAT WE WOULD LIKE TO KNOW
p(H₀ | x)
1 − p(H₀ | x) = p(H₁ | x)
8. The Point
Unknown Behaviour
Observed Sample
Many alternatives to handle uncertainty
associated with empirical results:
Statistical Analysis Handbook: A Comprehensive Handbook of Statistical Concepts, Techniques and Software Tools. Dr Michael J de Smith.
9. WHAT NHST COMPUTES
p(t(x) > τ | H₀)
BAYESIAN STATISTICAL
ANALYSIS
The Point
STATISTICAL ANALYSIS OF
EXPERIMENTAL RESULTS
NULL HYPOTHESIS
STATISTICAL TESTING
Unknown Behaviour
Observed Sample
10. The Bayesian Approach
The method focuses on estimating relevant information about the underlying parametric performance distribution, represented by a set of parameters θ.
It assesses the distribution of θ conditioned on a sample s drawn from the performance distribution.
Instead of having a single probability distribution to model the underlying performance, Bayesian statistics considers all possible distributions and assigns a probability to each.
P(θ | s) ∝ P(s | θ) P(θ)
posterior distribution of the parameters ∝ likelihood function × prior distribution of the parameters
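A minimal numerical sketch of this rule (an assumed toy example with a single success-probability parameter on a grid; not from the talk):

```python
import numpy as np

# Toy sketch of P(theta | s) ∝ P(s | theta) P(theta) on a grid, for a single
# hypothetical parameter theta (a success probability).
theta = np.linspace(0.01, 0.99, 99)            # candidate parameter values
prior = np.full_like(theta, 1.0 / theta.size)  # uniform prior P(theta)

successes, runs = 7, 10                        # illustrative sample s
likelihood = theta**successes * (1.0 - theta)**(runs - successes)  # P(s | theta)

posterior = likelihood * prior                 # P(theta | s), unnormalized
posterior /= posterior.sum()                   # normalize over the grid

print(theta[np.argmax(posterior)])             # posterior mode, close to 0.7
```

Every candidate value of θ receives a probability, instead of committing to a single point estimate.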
HOW DO WE COMPARE MULTIPLE
ALGORITHMS?
11. Minimizing several instances of a problem: From Results to Rankings

Observed sample (objective values to be minimized, one column per function):

Algorithm   f1    f2    f3    f4    f5   ...
GA         100   130    37   566   256
PSO         90    80   352   756   125
ILP        135   135    19   101    89
SA         105    30   100    56   369
GP          95   300    10    57    36

Rankings derived per function (1 = best):

Algorithm   σ1   σ2   σ3   σ4   σ5   ...
GA           3    3    3    4    4
PSO          1    2    5    5    3
ILP          5    4    2    3    2
SA           4    1    4    1    5
GP           2    5    1    2    1

Rankings are permutations.
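The conversion from objective values to rankings can be sketched with a double argsort (minimization; the data are the values shown above):

```python
import numpy as np

# Turning observed objective values into rankings, rank 1 = best (minimization).
results = {
    "GA":  [100, 130,  37, 566, 256],
    "PSO": [ 90,  80, 352, 756, 125],
    "ILP": [135, 135,  19, 101,  89],
    "SA":  [105,  30, 100,  56, 369],
    "GP":  [ 95, 300,  10,  57,  36],
}
values = np.array(list(results.values()))      # rows: algorithms, cols: functions
# argsort twice yields each value's rank within its column (0-based); +1 for 1-based.
ranks = values.argsort(axis=0).argsort(axis=0) + 1

print(dict(zip(results, ranks[:, 0])))         # f1 ranking: GA 3, PSO 1, ILP 5, SA 4, GP 2
```

Each column of `ranks` is a permutation of 1..5, i.e., one ranking per function.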
12. Plackett-Luce Model
● Each algorithm in the comparison has an associated weight.
● The weights sum up to 1.
● The weight associated with an algorithm represents its probability of being ranked first.

P(σ) = ∏_{i=1..n} w_{σ(i)} / Σ_{j=i..n} w_{σ(j)}
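A direct implementation of the Plackett-Luce probability of a ranking (the weights are illustrative, my own sketch):

```python
import numpy as np

# Plackett-Luce: probability of a ranking sigma given weights w.
# sigma lists algorithm indices from best (rank 1) to worst; w sums to 1.
def plackett_luce_prob(sigma, w):
    w = np.asarray(w, dtype=float)
    prob = 1.0
    remaining = list(sigma)
    for i in sigma:
        # At each stage, the next-ranked algorithm "wins" against those still unranked.
        prob *= w[i] / w[remaining].sum()
        remaining.remove(i)
    return prob

# Hypothetical weights: algorithm 0 is ranked first with probability 0.5.
w = [0.5, 0.3, 0.2]
print(plackett_luce_prob([0, 1, 2], w))  # 0.5 * (0.3/0.5) * (0.2/0.2) = 0.3
```

Summing this probability over all n! permutations gives 1, which is what makes the weights interpretable as probabilities.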
16. The Bayesian model
Posterior distribution of the weights ∝ likelihood of the sample × prior distribution of the weights.

Given a sample of rankings R = {σ^(1), ..., σ^(N)}:

P(w | R) ∝ [ ∏_{k=1..N} ∏_{i=1..n} w_{σ^(k)(i)} / Σ_{j=i..n} w_{σ^(k)(j)} ] × (1/B) ∏_{i=1..n} w_i^{α_i − 1}

where B = ∏_i Γ(α_i) / Γ(Σ_i α_i) is the normalization constant of the Dirichlet prior.

No way to sample the posterior distribution exactly → MCMC
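Since the posterior cannot be sampled exactly, MCMC is used. A minimal Metropolis-Hastings sketch with toy data and a Dirichlet proposal (my own illustration, not the talk's actual sampler):

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(0)

def log_pl_likelihood(rankings, w):
    """Log Plackett-Luce likelihood of a set of rankings (best first)."""
    logp = 0.0
    for sigma in rankings:
        rest = np.array(sigma)
        while rest.size > 0:
            logp += np.log(w[rest[0]]) - np.log(w[rest].sum())
            rest = rest[1:]
    return logp

def dirichlet_logpdf(x, a):
    """Log density of Dirichlet(a) at a point x on the simplex."""
    return lgamma(a.sum()) - sum(lgamma(ai) for ai in a) + np.sum((a - 1) * np.log(x))

def log_posterior(rankings, w, alpha):
    return log_pl_likelihood(rankings, w) + dirichlet_logpdf(w, alpha)

# Toy data (hypothetical): algorithm 0 is ranked first in 3 of 4 rankings.
R = [[0, 1, 2], [0, 2, 1], [1, 0, 2], [0, 1, 2]]
alpha = np.ones(3)                 # uniform Dirichlet prior
kappa = 200.0                      # proposal concentration
w = np.ones(3) / 3                 # start at the centre of the simplex
samples = []
for _ in range(5000):
    prop = rng.dirichlet(kappa * w)                     # proposal centred on current w
    log_acc = (log_posterior(R, prop, alpha) - log_posterior(R, w, alpha)
               + dirichlet_logpdf(w, kappa * prop)      # Hastings correction:
               - dirichlet_logpdf(prop, kappa * w))     # the proposal is asymmetric
    if np.log(rng.random()) < log_acc:
        w = prop
    samples.append(w)

post_mean = np.mean(samples[1000:], axis=0)             # discard burn-in
print(post_mean)                                        # w[0] should be largest
```

The chain stays on the simplex by construction, since every Dirichlet draw sums to 1.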
18. The Case Study
23 FUNCTIONS TO OPTIMIZE:
• OneMax (F1) and W-model extensions (F4-F10)
• LeadingOnes (F2) and W-model extensions (F11-F17)
• Harmonic (F3)
• LABS: Low Autocorrelation Binary Sequences (F18)
• Ising-Ring (F19)
• Ising-Torus (F20)
• Ising-Triangular (F21)
• MIVS: Maximum Independent Vertex Set (F22)
• NQP: N-Queens Problem (F23)
Problem size: n ∈ {16, 64, 100, 625}
11 metaheuristic algorithms:
• greedy Hill Climber (gHC)
• Randomized Local Search (RLS)
• (1+1) EA
• fast Genetic Algorithm (fGA)
• (1+10) EA
• (1+10) EA_r/2,2r
• (1+10) EA_norm
• (1+10) EA_var
• (1+10) EA_log-n
• (1+(λ,λ)) GA
• “vanilla” GA (vGA)
Results of 11,132 runs are collected (23 × 4 × 11 × 11)
• Aggregation of performances across 11 instances.
• Median performance across 11 repetitions.
Estimate the probability of each algorithm being top-ranked
• as its expected weight in the posterior distribution of weights
Analyze the uncertainty about the probabilities
• By estimating the 90% credible intervals of the posterior distribution of weights (5% and 95% quantiles)
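The two summaries above (expected weight and 90% credible interval) can be computed from posterior samples in a few lines; here the samples are a hypothetical stand-in for real MCMC output:

```python
import numpy as np

# Summarizing posterior samples of the weights (one row per draw, one column
# per algorithm) into expected probabilities and 90% credible intervals.
rng = np.random.default_rng(1)
post_samples = rng.dirichlet([8, 4, 2, 1, 1], size=4000)  # stand-in for MCMC output

expected = post_samples.mean(axis=0)             # probability of being top-ranked
lower = np.quantile(post_samples, 0.05, axis=0)  # 5% quantile
upper = np.quantile(post_samples, 0.95, axis=0)  # 95% quantile

for i, (m, lo, hi) in enumerate(zip(expected, lower, upper)):
    print(f"algorithm {i}: {m:.3f}  [{lo:.3f}, {hi:.3f}]")
```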
19. Inference analyses & results
QUALITATIVE SUMMARY
Similar perf.: (1+(λ,λ)) GA, (1+1) EA, (1+10) EA_var, (1+10) EA_log-n, (1+10) EA_norm, (1+10) EA_r/2,2r and fGA.
Extreme perf.: vGA and gHC.
Easily treated instances: F1-F6, F8, F11-F13 and F15-F16.
Best solutions found for n=625.
20. Inference analyses & results
Fixed-target perspective – Record Running-time
[Figure: probability of winning for each of the 11 algorithms, with credible intervals; panels F17 (n=625, φ=625) and F19 (n=100, φ=100)]
Credible Intervals
Only 11 samples to do inference → high uncertainty is expected!
The more samples, the lower the uncertainty → credible intervals become tighter!
[Plot annotations: expected probability; high uncertainty]
INTERPRETABILITY
21. Inference analyses & results
Fixed-target perspective – Record Running-time – Set of easy functions
[Figure: probability of winning for each algorithm, with credible intervals; panels: n=625, all runs and n=625, median]
Credible Intervals
Set of functions, two paths → (1) take all the runs, (2) take the median of the runs on each instance.
gHC is the best in both cases → with more samples the uncertainty is lower.
22. Inference analyses & results
Fixed-target perspective – Record Running-time – Set of non-easy functions
Credible Intervals
Good estimations → credible intervals smaller than 0.05.
Probabilities are similar → the intervals overlap.
Uncertainty about which algorithm is the best → not due to a limitation of the data, but due to the equivalence of the algorithms.
[Figure: probability of winning for each algorithm, with credible intervals; n=625, all runs; all probabilities between roughly 0.05 and 0.15]
23. Inference analyses & results
Fixed-budget perspective – Evolution of the winning probability – 90% credible intervals
[Figure: winning probability of each of the 11 algorithms as the budget grows from 0 to 900, with 90% credible intervals; F21, n=100]
gHC is the best, but its winning probability decreases while the rest improve as the budget increases.
[Figure: critical-difference-style diagram, average ranks 3 to 11]
Algorithms ranked with average data; Wilcoxon test for pairwise comparisons and Shaffer's method for p-value correction.
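For contrast with the Bayesian analysis, the classical pipeline mentioned above can be sketched as follows; the data are hypothetical, and since Shaffer's correction is not available in scipy, a plain Bonferroni correction stands in for it:

```python
import numpy as np
from scipy.stats import wilcoxon

# Pairwise Wilcoxon signed-rank tests on per-problem results (hypothetical data).
rng = np.random.default_rng(2)
results = rng.normal(loc=[100, 102, 110], scale=5, size=(23, 3))  # 23 problems, 3 algorithms

pairs = [(0, 1), (0, 2), (1, 2)]
pvals = [wilcoxon(results[:, a], results[:, b]).pvalue for a, b in pairs]
adjusted = np.minimum(1.0, np.array(pvals) * len(pairs))  # Bonferroni stand-in for Shaffer
print(dict(zip(pairs, adjusted.round(4))))
```

The output is a set of corrected p-values per pair, rather than a winning probability per algorithm, which is exactly the interpretability gap the Bayesian analysis addresses.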
BAYESIAN ANALYSIS
ESTIMATED PROBABILITY AND
NOTION OF UNCERTAINTY IN THE
FORM OF CREDIBLE INTERVAL
24. Inference analyses & results
Impact of the prior distribution – Comparison of three different priors
[Figure: winning probability per algorithm under three priors (Uniform, Empirical, Deceptive); F9, n=100, φ=100]
Empirical data favours the best-performing algorithms.
Negligible effect (even when median values are considered).
25. Discussion
Bayesian inference using Plackett-Luce for the analysis of algorithms' performance rankings.
Include it in the practical EC performance-comparison tool set → IOHProfiler
Strong points
• Ability to handle multiple algorithms
• Interpretability
• Exact description of the uncertainty
Weaknesses
• By aggregating performances into rankings, we lose information about the magnitude of the differences.
• Limitations of the Plackett-Luce model → from n! to n parameters.
• How do we deal with ties?
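One way ties are often handled in practice (an illustration of a common convention, not a recommendation from the talk) is to assign tied values the average of the ranks they would occupy:

```python
from scipy.stats import rankdata

# Hypothetical example: two algorithms tie at objective value 90.
# Note the Plackett-Luce model itself assumes strict rankings, so tied data
# need a pre-processing convention like this one (or an extended model).
values = [100, 90, 135, 90, 95]
print(rankdata(values, method="average"))  # [4.  1.5 5.  1.5 3. ]
```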