The document discusses probability forecasting from a machine learning perspective. It describes probability forecasting as estimating the conditional probability of each possible label for a new example, rather than just predicting the most likely label, and evaluates several learners on the reliability and resolution criteria. It introduces the Probability Calibration Graph (PCG) as a visual tool for assessing reliability alone, unlike metrics such as log loss that conflate reliability and resolution. Traditional learners are found to be unreliable in their probability forecasts despite being accurate, while the Venn Probability Machine (VPM) framework produces more reliable forecasts.
14. The Online Learning Setting [Diagram of handwritten-digit examples (2 7 6 1 7 …): before the trial the new example's label is withheld (?) and the learning machine makes a prediction for it; after the trial the revealed label updates the training data for the learning machine for the next trial. The process repeats for all examples.]
24. Inspiration for PCG (Meteorology) [Plot: calibration data for precipitation forecasts, from Murphy & Winkler (1977). Reliable points lie close to the diagonal.]
25. A PCG plot of ZeroR on Abdominal Pain [Plot: predicted probability vs. empirical frequency of being correct, showing the line of calibration and the PCG coordinates.] The PCG coordinates lie close to the line of calibration, i.e. ZeroR is not accurate but it is reliable! The plot may not span the whole axis – ZeroR makes no predictions with high probability.
26. PCG: a visualisation tool and measure of reliability [Side-by-side PCG plots.] Naïve Bayes is unreliable: a forecast of 0.9 only has a 0.55 chance of being right (overestimate), and a forecast of 0.1 only has a 0.3 chance of being right (underestimate) – it over- and under-estimates its probabilities, much like real doctors! The VPM is reliable, as its PCG follows the diagonal! Absolute deviation statistics:

                  Naïve Bayes   VPM Naïve Bayes
  Total           2764.5        496.7
  Mean            0.0483        0.0087
  Std deviation   0.0757        0.0112
  Max             0.4203        0.1017
  Min             4.9e-17       9.2e-8
27. Learners predicting like people! [Side-by-side plots: Naïve Bayes vs. people.] Lots of psychological research shows that people make unreliable probability forecasts.
30. Correlations of scores

  Scores                     Corr. Coeff.   Interpretation
  PCG vs. Sqr Reliability     0.76          Direct, strong
  PCG vs. Sqr Resolution      0.04          Direct, none
  PCG vs. Error               0.26          Direct, weak
  ROC vs. Sqr Reliability    -0.1           Inverse, none
  ROC vs. Sqr Resolution      0.67          Direct, strong
  ROC vs. Error              -0.52          Inverse, moderate
33. Volodya's original use of VPM [Plot: error rate and bounds against online trial number. The upper (red) and lower (green) bounds lie above and below the actual number of errors (black) made on the data.]

              Errors   Rate
  Up Error    2216.5   34.7%
  Error       1835     28.9%
  Low Error   1414.1   22.1%
34. Output from VPM compared with that of the original underlying learner. Probability forecast for each class label at each trial (key: predicted label = underlined; actual label = highlighted in yellow on the original slide).

Naïve Bayes
  Trial #  Appx     Div.    Perf.Pept.  Non.Spec  Choli   Intest.obstr  Pancr    Renal.   Dysp.   Up  Low
  5831     0.93     2.9e-9  1.7e-13     0.07      1.3e-9  2.2e-9        4.0e-11  6.3e-10  7.6e-9  NA  NA
  2490     9.4e-5   0.01    0.17        2.3e-5    0.16    0.46          0.2      2.2e-7   2.2e-4  NA  NA
  1653     3.08e-9  4.5e-6  3.3e-6      4.4e-5    0.99    4.2e-3        3.4e-3   4.1e-10  1.3e-4  NA  NA

VPM Naïve Bayes
  Trial #  Appx  Div.  Perf.Pept.  Non.Spec  Choli  Intest.obstr  Pancr  Renal.  Dysp.  Up    Low
  5831     0.53  0.01  0.0         0.42      0.01   0.01          0.0    0.01    0.01   0.68  0.41
  2490     0.02  0.03  0.10        0.07      0.05   0.15          0.08   0.09    0.4    0.71  0.07
  1653     0.03  0.0   0.03        0.08      0.73   0.0           0.04   0.01    0.09   0.82  0.08
Editor's Notes
Hello and welcome to my cake talk; cakes are situated at the front, so please feel free to munch away. The title of my talk is Reliable Probability Forecasting – a Machine Learning Perspective. I have been working on this research for about 9 months. The talk will be quite high level; if anyone wants to find out more low-level detail then you can ask questions at the end or look at my three tech reports, from which some of the material in this talk is taken. I have attempted to make this talk accessible to people outside my field and I hope that you all at least understand some part of my talk. If anything is very confusing on the slides please stop to ask questions.
So let me start by giving an overview of what I am going to talk about today; I will return to this plan as we go along. CLICK Firstly I will introduce the problem of probability forecasting, and describe how the problem is a generalisation of the standard pattern recognition problem studied in machine learning. CLICK Then I will describe the reliability and resolution criteria (proposed by research in statistics and psychology) which can be used for assessing the effectiveness of probability forecasts. CLICK I will follow this by briefly detailing my experimental design. CLICK I will then showcase current methods of assessing probabilities, namely square loss, log loss and ROC curves, and highlight the problems with these approaches for assessing reliability only. CLICK I will introduce the Probability Calibration Graph (PCG), Lindsay (2004), for solely assessing the reliability of probability forecasts. CLICK I will show how many traditional learners are unreliable yet accurate! This can seem a counter-intuitive argument, as we'll see later. CLICK I will show how the newly developed Venn Probability Machine (VPM) meta-learning framework can be extended, Lindsay (2004), and used to correct these problems with traditional learners! CLICK I will summarise which learners have been demonstrated as reliable and which are unreliable. CLICK And finally I will give a theoretical and psychological viewpoint on the reliable learners that my studies have identified.
Let's go through some initial benefits of probability forecasting. CLICK Qualified predictions are important in many real-life applications (especially medicine); it is very handy for the user to know when and how much trust can be placed in a prediction made by a learner. CLICK Having said that, most machine learning algorithms make bare predictions and don't give any indication of how likely it is that the prediction is correct; I think this is why very few learning systems are used in practice. CLICK Those learners that do make qualified predictions make no claims of how effective the measures are! For example the WEKA data mining system provides a load of tweaks to existing algorithms to output probability forecasts, but not many have any theoretical proof of their validity.
So let me just review a general problem which is commonly tackled by machine learning, namely pattern recognition. The goal of pattern recognition is quite simple: find the "best" label for each new test object. CLICK An example that I will use throughout this talk is the Abdominal Pain dataset (mainly because I believe that this research is most applicable to the medical problem domain). CLICK. The data is very noisy and complex: we have roughly 6300 patients' details collected by a hospital in Cardiff. Each patient is described using 135 properties, and associated with each patient is 1 of 9 different abdominal pain diseases (Appendicitis, Dyspepsia, Non-specific abdominal pain, Renal Colic, etc). CLICK. Relating this to the notation and jargon I will use throughout, we think of our examples as information pairs: each example represents a patient, each object x (POINT and CLICK) describes the patient's symptoms etc, and the corresponding label y (POINT and CLICK) is the diagnosis of the abdominal pain disease that they are suffering from. In a machine learning interpretation of the pattern recognition problem, the supervisor (in this case a doctor) provides a training set to learn the all-important relationship between objects and labels. CLICK. The hope is that if the training set is large and clean enough, the user will be able to input the details of a new patient and the learning algorithm will diagnose that patient by predicting 1 of the 9 possible labels. CLICK. Usually we keep back a test set to validate the predictions made by the learner so we can test the performance of the learning algorithm.
A probability forecast is an estimate of the conditional probability of a label given an observed object. CLICK TWICE. I use the hat notation [POINT] to distinguish the predicted value output by the learner from the true value determined by nature. Obviously in real-life applications I do not have access to some higher power to give me direct access to the true probability distribution, so it is awkward to check whether my forecasts are accurate. We want the learner to estimate probabilities for all class labels. CLICK Returning to our example of the abdominal pain dataset, we have our training data CLICK and the unlabelled test object CLICK. Both of these are fed into our learner Gamma. CLICK Our learner Gamma outputs probability forecasts for each possible label (i.e. disease) for that new test object (i.e. patient). CLICK 3 TIMES POINT: Remember, all the predicted probabilities output sum to one. Naturally, we predict the label with the highest associated probability. CLICK
So using the standard notation we have X as the object space and Y as the label space, so that Z = X × Y is the example space. CLICK Our learner makes probability forecasts for all possible labels. CLICK We use these probability forecasts to predict the most likely label. CLICK TWICE
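The setting just described can be sketched in a few lines. This is only an illustration, not the talk's implementation: the dict-based `forecast` and the function name `predict_label` are assumptions; the point is that the learner outputs one probability per label in Y, the forecasts sum to one, and the predicted label is the arg max.

```python
# Minimal sketch: a probability forecast assigns each label in Y a
# probability, and the prediction is the label with the highest forecast.

def predict_label(forecast):
    """forecast: dict mapping each label in Y to its predicted probability."""
    assert abs(sum(forecast.values()) - 1.0) < 1e-9  # forecasts sum to one
    return max(forecast, key=forecast.get)           # most likely label

# Hypothetical forecast for one abdominal-pain patient:
forecast = {"appendicitis": 0.73, "dyspepsia": 0.09, "renal colic": 0.18}
print(predict_label(forecast))  # appendicitis
```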
So hopefully it's clear what probability forecasting is, but how can we assess the quality or effectiveness of these forecasts? We shall now see.
Probability forecasting has been a well-studied area since the 1970s, in: CLICK Psychology, Statistics, Meteorology. These studies assessed two criteria of probability forecasts: CLICK Reliability = the probability forecasts should not lie. Resolution = the probability forecasts are practically useful.
So an informal definition of reliable probability forecasting: an event predicted with probability p should have approximately a 1-p chance of being incorrect. This property is known by many names, such as being well calibrated. CLICK Reliability is normally considered an asymptotic property (as the number of training examples tends to infinity) in statistical studies; however, the work by Volodya was able to generalise this problem to finite data. CLICK In 1985 Dawid proved that no deterministic learner can be reliable for all data – but it is still interesting to investigate the problem of reliable probability forecasting, as the work by Volodya and me shows. CLICK This property is often overlooked in practical studies! This is a real shame as I think many applications would find this property very attractive. If the probability forecasts of a learner were reliable then they would at least be trustworthy. CLICK
Now let's look at the second term, resolution. Resolution demands that the probability forecasts are practically useful, e.g. they can be used to effectively rank the labels in order of likelihood! CLICK It is closely related to classification accuracy, which is commonly studied in machine learning. CLICK It is separate from reliability; one of my papers shows that reliability and classification accuracy/resolution do not go "hand in hand". CLICK
So to recap I have detailed what probability forecasting is. And that lots of studies in different fields have identified that probability forecasting can be assessed using the reliability and resolution criteria Now I will describe how and why I conducted my experiments
I tested several learners on many datasets in the online setting (which I will explain later) CLICK ZeroR. This learner was used as a control: it is the most basic, simple learner that you could consider. We would therefore expect any other learner to have improved probability forecasts over ZeroR. ZeroR is not well-known and it comes as part of the WEKA data-mining system that I will describe later. CLICK K-Nearest Neighbour CLICK Neural Network CLICK C4.5 Decision Tree CLICK Naïve Bayes CLICK Venn Probability Machine Meta Learner. I will discuss this later, as the VPM meta-learner has been applied to all these learners here.
Traditionally most studies in machine learning are carried out in the offline learning setting (where the learning machine is provided with a fixed training and test set to evaluate). For my research I looked primarily at the online learning setting, as it fits nicely with the theory and allows you to see how the learner improves with experience. Having said that, I have conducted all these experiments in the offline setting as well and they come out the same. The crucial difference with the online setting is that we imagine that the training set provided to the learning machine is continually updated. I will now give a quick example using the handwritten digits image dataset. Our images are the objects, and the label says which digit it is. The strict online learning setting works as follows: CLICK First the learning machine makes a prediction for a new test example. CLICK Second the teacher/supervisor of the learning machine provides the true label of the example (in this case 2) and adds it to the training set for the learning machine. CLICK Finally the process is repeated for each example in the dataset, presenting each example as a "trial" in the online process.
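The strict online protocol above can be sketched as a short loop. This is a hedged illustration, not the WEKA extension itself: the names `online_protocol` and `majority_predict` are hypothetical, and a trivial majority-class learner stands in for any learning machine.

```python
from collections import Counter

# Strict online protocol: at each trial the machine predicts the new
# example's label while the label is withheld, then the true label is
# revealed and added to the training data for the next trial.

def online_protocol(examples, predict):
    """examples: list of (object, label); predict(training, x) -> label."""
    training, predictions = [], []
    for x, y in examples:
        predictions.append(predict(training, x))  # label withheld at this point
        training.append((x, y))                   # update training data
    return predictions

def majority_predict(training, x):
    """Predict the most frequent label seen so far (None before any trials)."""
    if not training:
        return None
    return Counter(y for _, y in training).most_common(1)[0][0]

print(online_protocol([(0, "a"), (1, "a"), (2, "b"), (3, "a")], majority_predict))
# → [None, 'a', 'a', 'a']
```

Note that every prediction is made before the corresponding label is seen, which is what makes online error counts (as in the VPM bound plots later) meaningful.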
To get an idea of the kind of data I tested my learning algorithms on I have compiled this slide. As you can see I tested a variety of well known datasets (varying in size, complexity and noise level), mostly benchmark data from the UCI, but I also tested some home grown favourites such as the Abdominal pain dataset. I chose to use a lot of medical datasets as these tend to be quite noisy and I believe the need for reliable probability forecasting is exemplified by this problem domain.
For the programming side, I decided to capitalise on the lovely WEKA data mining system (distributed under the GNU public licence). CLICK This package, implemented in Java, offers an extensive library of well-known machine learning algorithms. Because it's written in an object-oriented programming language, it was very easy for me to extend the existing functionality of the system; this is how I added extra algorithms such as the Venn Probability Machine that I will be talking about today. CLICK I also extended the WEKA system to allow all learners to be tested in the online learning setting (that I mentioned a few slides ago), as not many people test in this mode yet. CLICK To create all the lovely graphs I wrote some handy Matlab scripts, and all these programs are available via my website. CLICK
As I mentioned at the start all of the research I am talking about can be found in the three tech reports (details on the slide) that I have been working on for several months now. I have also tried to publish shortened versions of these papers at some of the big machine learning conferences (unsuccessfully) All tech reports will hopefully be available on the CLRC website and my own, pending review.
So going back to the plan again: we know what probability forecasting is, and that we can intuitively assess the performance of probability forecasts using two criteria, reliability and resolution. I have detailed my experimental design – what learners and data I have tested. But there are methods which are currently used in machine learning for assessing the performance of probability forecasts, and this is what we will look at now, also highlighting the problems with them.
There are many other possible loss functions… square loss CLICK and log loss CLICK. In 1982 DeGroot and Fienberg showed that all loss functions measure a mixture of reliability and resolution. Log loss punishes more harshly, and it is forced to spread its bets.
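The two loss functions can be written down in a few lines. This is a hedged sketch of the standard multi-class forms (Brier-style square loss; log loss on the true label's forecast), which may differ in detail from the exact scoring used in the tech reports.

```python
import math

# Both losses score a full probability forecast against the observed label;
# labels are indices into the forecast vector p.

def square_loss(p, true_idx):
    # Sum of squared deviations from the 0/1 outcome vector.
    return sum((pi - (1.0 if i == true_idx else 0.0)) ** 2
               for i, pi in enumerate(p))

def log_loss(p, true_idx, eps=1e-15):
    # Punishes confident mistakes much more harshly: -log of the
    # probability assigned to the true label (clamped away from zero).
    return -math.log(max(p[true_idx], eps))

p = [0.7, 0.2, 0.1]
print(square_loss(p, 0))  # ≈ 0.14
print(log_loss(p, 0))     # ≈ 0.357
```

Assigning probability 0 to the label that then occurs makes log loss infinite (hence the `eps` clamp), which is why a log-loss-scored forecaster is "forced to spread its bets".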
ROC curves check the proportion of correct versus incorrect predictions made by a learner. See tech report on PCG for more details. CLICK ROC curves are popularly used in machine learning studies to assess probability forecasts ROC is commonly used to measure the tradeoffs between false and true positive classification. We want the ROC curve to be as close to the upper left corner as possible POINT, we want it to deviate from the diagonal as much as possible. CLICK My results show that this graph tests resolution. CLICK The area under the ROC curve is often used as a measure of the quality of the probability forecasts being made. CLICK Still does not tell us how/why probability forecasts are unreliable! This has more to do with accuracy.
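As an aside, the area under the ROC curve mentioned here can be computed without drawing the curve at all, via its rank-statistic interpretation. This is an illustrative sketch for the binary case only (function name `roc_auc` is an assumption; the multi-class handling used in the experiments is not shown).

```python
# AUC as a rank statistic: the probability that a randomly chosen positive
# example is scored above a randomly chosen negative one (ties count half).
# labels: 1 = positive, 0 = negative; scores: predicted probability of positive.

def roc_auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # → 1.0 (perfect ranking)
```

Because AUC depends only on the ranking of the scores, any monotonic distortion of the probabilities leaves it unchanged, which is exactly why it measures resolution and cannot detect unreliability.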
To try and reinforce my point that error rate does not reflect the quality of the probability forecast: traditional studies would get the classification accuracy, or inversely the error rate, of our learners tested on a dataset (in this case Abdominal Pain), producing the kind of league table you see here. CLICK Here you can see that each learner is given a rank in brackets in terms of its error rate. Obviously we want the error rate to be as small as possible. So the Naïve Bayes learners are the most accurate, and the ZeroR learner is the least accurate, as we would expect. This is where the analysis would end, and the user would probably choose the most accurate learners for their practical application. However, if we look at the results of the loss functions and the areas under the ROC curves for each algorithm, we see a different story emerging! CLICK 3 TIMES. We see that these measures rank the learners differently – see how Naïve Bayes is starting to slip down the rankings, from second to seventh! Conversely, ZeroR is starting to rise – from last place to sixth. VPM Naïve Bayes remains high – and I'll discuss this later (point).
Loss functions and ROC give more information than error rate about the quality of probability forecasts. CLICK But as I said earlier, loss functions = mixture of resolution and reliability ROC curve = measures resolution CLICK Don’t have any method of solely assessing reliability CLICK Don’t have method of telling if probability forecasts are over- or under- estimated CLICK This is where I introduce my contribution to this research the Probability Calibration Graph technique.
So we know what probability forecasting is. We can assess the performance of probability forecasts using the reliability and resolution criteria. I have tested various learners on various datasets (mostly medical) We have briefly looked at traditional methods of assessing probabilities and highlighted that none solely assess reliability. Now this has set the scene for me to introduce my Probability Calibration Graph technique for visualising the reliability of probability forecasts output by learners.
So briefly, here is a scan of a graph which served as my inspiration for the PCG graph that I developed for checking the reliability of probability forecasts output by learning algorithms. This is taken from a meteorological study, Murphy and Winkler (1977), which analysed the calibration/reliability of the forecasts of the likelihood of precipitation made by the American national weather service. The graph is pretty simple: on the horizontal axis POINT is the forecast probability. The vertical axis is the observed relative frequency of precipitation (i.e. the prediction being correct). The points plotted have a number next to them indicating how many predictions were made at that predicted probability. CLICK If the forecasts are reliable then they will stick to the diagonal line POINT.
Graphs similar to the PCG plot were first used in the early 1970s by psychological and meteorological studies to assess the reliability of probability forecasts. CLICK. Here is a PCG plot: you have the predicted probability on the horizontal axis (point) versus the empirical frequency of the forecast being correct on the vertical axis (point). For more in-depth detail on its construction see my tech report; I won't bore you with the formulas. CLICK. This red line is the line of calibration, and this is the ideal line that reliable learners will stick to (predicted probability = empirical frequency). CLICK. Here are the PCG coordinates for the ZeroR learner when tested on the Abdominal Pain data. CLICK. The plot may not span the whole axis – ZeroR doesn't make any predictions with high probability (vague predictions). CLICK. The PCG coordinates lie close to the line of calibration, i.e. ZeroR is not accurate but it is reliable!
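A hedged sketch of how such coordinates might be computed; the tech report gives the exact construction, and the equal-width binning and the name `pcg_coordinates` here are my assumptions for illustration.

```python
from collections import defaultdict

# Group forecasts into bins by predicted probability; within each bin,
# pair the mean predicted probability with the empirical frequency of
# those predictions being correct. Reliable forecasts put every pair
# near the diagonal (the line of calibration).

def pcg_coordinates(forecasts, correct, n_bins=10):
    """forecasts: predicted probabilities; correct: 1 if that prediction
    was right, else 0. Returns (mean predicted, empirical frequency) pairs."""
    bins = defaultdict(list)
    for p, c in zip(forecasts, correct):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, c))
    coords = []
    for b in sorted(bins):
        ps, cs = zip(*bins[b])
        coords.append((sum(ps) / len(ps), sum(cs) / len(cs)))
    return coords
```

A forecaster that says 0.05 and is right 5% of the time, and says 0.95 and is right 95% of the time, produces coordinates exactly on the diagonal.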
Here we have two PCG plots side by side: the Naïve Bayes learner CLICK with its VPM counterpart next to it CLICK. For now I'll just give examples of how to interpret them; ideally a reliable learner gives a line close to the diagonal. This is a brief taste of things to come – I'll explain VPM later. Now from the PCG plot on the left POINT we can clearly see that the Naïve Bayes learner is producing unreliable probability forecasts, as the PCG plot (thick black line) deviates quite dramatically from the line of calibration (red diagonal line). For example CLICK unreliable forecasts are made: forecasts of 0.9 actually have only a 0.55 chance of being correct, so the learner is tending to overestimate, or be overconfident in, its predictions. You can imagine this is bad. The Naïve Bayes learner is very accurate on this abdominal pain data, and if I gave this system to a doctor they would get the learner predicting a disease with 0.9 when actually there is a lot less chance of the patient having that disease, which could lead to improper treatment. CLICK On the flip side CLICK forecasts of 0.1 actually have a 0.3 chance of being correct; this is evidence of underestimation, where the learner is being under-confident in its predictions. CLICK And this pattern of over- and under-confidence is actually reported as the behaviour of people CLICK when asked to make estimates of probability. Doctors especially have been known to produce the sort of PCG graphs made by Naïve Bayes, which is a bit worrying I think. The Probability Calibration Graph (PCG) is a useful visualisation technique for seeing how reliable probability forecasts are, and can also be used to calculate useful measures, e.g. statistics about the deviations are given beneath each PCG plot. CLICK X2 These are useful when there is not much between PCG plots to distinguish which learner is more reliable; for this I calculate various statistics such as the total, mean, standard deviation etc.
These statistics are calculated from the absolute deviation of the PCG plot from the diagonal line of calibration. This graph has wide applications and is the first to solely concentrate on reliability. We can see clearly that the Naïve Bayes classifier is unreliable. To understand how to interpret the graphs, [POINT] the horizontal axis is the predicted probability, and the vertical axis is the empirical frequency of the predictions made at that predicted probability being correct. So in a nutshell the Naïve Bayes learner is unreliable, but the VPM Naïve Bayes learner CLICK is reliable, as its PCG plot sticks close to the line of calibration. We can see this also in the statistical measures in the tables below POINT.
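The deviation summaries quoted beneath each PCG plot could be computed along these lines. A hedged sketch: the function name and the exact set of statistics are assumptions based on the totals, means, standard deviations, minima and maxima shown on the slides.

```python
import statistics

# Summarise the absolute deviations of PCG coordinates from the line of
# calibration (predicted probability = empirical frequency). Smaller
# deviations mean a more reliable forecaster.

def pcg_deviation_stats(coords):
    """coords: list of (predicted probability, empirical frequency) pairs."""
    devs = [abs(freq - pred) for pred, freq in coords]
    return {"total": sum(devs),
            "mean": statistics.mean(devs),
            "std": statistics.stdev(devs),
            "min": min(devs),
            "max": max(devs)}
```

For the Naïve Bayes example above, coordinates like (0.1, 0.3) and (0.9, 0.55) give deviations of 0.2 and 0.35 – large, visible departures from the diagonal.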
Returning to the PCG plot of the Naïve Bayes learner CLICK on the abdominal pain data: as I said on the previous slide, there is a lot of psychological research and evidence that doctors and many other people make unreliable probability forecasts. CLICK Here is a PCG-like plot created back in 1977, taken from a psychological journal; notice the similarity in the shape of the graph to the Naïve Bayes PCG plot. CLICK There are lots of graphs like this in psychology research, and lots of interpretation as to why people predict unreliably, and I think these results are interesting for us as practitioners in machine learning.
So we know what probability forecasting is. We can assess the performance of probability forecasts using the reliability and resolution criteria. I have tested various learners on various datasets (mostly medical) We have briefly looked at traditional methods of assessing probabilities and highlighted that none solely assess reliability. We have seen that the PCG technique offers a useful solution to this problem as it gives intuitive visualisation and measures of reliability. Now I will give a brief summary of the results I found, importantly that reliability and classification accuracy do not go hand in hand. i.e. you can have a not very accurate learner that is reliable (ZeroR), and vice versa you can have a learner with good classification accuracy but poor reliability (Naïve Bayes).
Let's return to those results that we saw earlier of various learners on the abdominal pain dataset. This time we will add the PCG total deviation scores. Once again, with each learner's score is a ranking of how good that score was compared to the others; this is given as a number in brackets POINT. As I said earlier, you can use the statistics of the deviation of the PCG plot from the line of calibration as a measure of reliability; in the table above I have given the total absolute deviation. We want the deviation to be as small as possible, and this is how we order the PCG deviations. We can see a very different ordering of the learners from the error rate. Notice that ZeroR CLICK POINT is ranked last in terms of error rate; it gets around 55% of patients misdiagnosed. This is of no surprise as the learner is very simple: it outputs probability forecasts which are just frequency counts of labels in the training data, using no information about the patient to diagnose. Yet ZeroR is quite respectably ranked 3rd in terms of reliability. So ZeroR may not be accurate (or resolute) but it is reliable. Conversely the Naïve Bayes learner CLICK is very accurate, ranked a close 2nd with only 29% errors, but its reliability is ranked 7th – so Naïve Bayes is accurate but not reliable! CLICK Concentrating on the PCG deviation (i.e. reliability) we can see a significant re-ordering of the learners. CLICK. In the top 5 learners are ZeroR, various VPM implementations (I'll explain those later) and a K-NN learner. We shall see later that all these learners have theoretical and psychological justifications of reliability. So in summary the PCG gives us a visualisation and a measure of reliability, and we can see that reliability and accuracy do not go hand in hand.
For more evidence that PCG is a measure of reliability, check out this result taken from my tech reports. Imagine lots and lots of those kinds of tables on the previous slide, over many datasets. In the PCG tech report I gave a broad review of all the traditional methods of assessing probabilities. I decomposed the square loss function into its reliability and resolution components, and then calculated the correlation between these scores and other methods such as PCG, ROC etc. Of most particular interest are the relationships highlighted [CLICK]: the PCG correlates with reliability, and ROC correlates with resolution. These results matter because many in the machine learning field think that ROC measures reliability, when in fact it measures the other useful property, resolution. We also notice that there is only a weak relationship between error rate and PCG, indicating that classification accuracy cannot guarantee that the probability forecasts are reliable. Conversely ROC has CLICK a moderate correlation with error rate, indicating that error rate is more closely related to resolution. These results were submitted to ICML and rejected.
So we know what probability forecasting is. We can assess the performance of probability forecasts using the reliability and resolution criteria. I have tested various learners on various datasets (mostly medical). We have briefly looked at traditional methods of assessing probabilities and highlighted that none solely assess reliability. We have seen that the PCG technique offers a useful solution to this problem as it gives intuitive visualisation and measures of reliability. We have seen that reliability and classification accuracy do not go hand in hand. Now it's time to fill in the gaps and explain how and why the VPM was extended by me for probability forecasting.
The VPM can be applied on top of any existing learning algorithm. CLICK TWICE Vovk introduced the VPM and originally used it to output provably valid bounds for conditional probabilities. CLICK However these bounds had limited practical use because… So I extended the VPM, Lindsay (2004), to: CLICK extract more information from the probability forecasts output by the VPM learner by: CLICK outputting probability forecasts for all possible labels CLICK and predicting a label using these probability forecasts. I should point out that my extended VPM hasn't lost the ability to produce bounds. CLICK
As I said on previous slide, the VPM was originally used to calculate bounds for the probability of a predicted label made by a VPM being correct. If we invert these bounds (eg. 1-p) then this gives us probability bounds for the prediction being incorrect and so Volodya created nice graphs like this CLICK and POINT to validate the incredibly complicated theory behind VPM. Here you can see that the upper bounds in red and the lower bounds in green lie above and below the actual number of errors in black that are made on the data CLICK This is great as the theory is nicely demonstrated by practical experiments, but as we can see these bounds can be quite loose which limits their practical usefulness as we shall see on the next slide.
I am going to show you an example which clearly indicates the practical usefulness of the extended VPM's probability forecasts for each class label, as compared to the predicted bounds discussed previously. Here we have predictions made by the Naïve Bayes learner CLICK and its VPM counterpart CLICK for the same trials in the online process. Actual labels (the true disease for that patient at that trial) are indicated in yellow, and the predicted label made by the learner is emboldened and underlined. At a glance it is obvious that the predicted probabilities output by the Naive Bayes learner are far more extreme (i.e. very close to 0 or 1) than those output by its VPM counterpart. For example, trial 1653 shows a patient object which is predicted correctly by Naive Bayes with p=0.99 and less emphatically by its VPM counterpart with p=0.73 CLICK. Remember this data is very noisy, so it is very unlikely that any prediction can be made with a 0.99 chance of being correct! CLICK Trial 2490 demonstrates the problem of over- and under-estimation by the Naive Bayes learner, where a patient is incorrectly diagnosed with Intestinal obstruction (overestimation), yet the true diagnosis of Dyspepsia is ranked 6th CLICK with a very low predicted probability of p=2.2e-4 (underestimation). In contrast the VPM Naive Bayes learner makes more reliable probability forecasts; for trial 2490 the true class is correctly predicted, albeit with a lower predicted probability of 0.4 CLICK POINT. Trial 5831 demonstrates a situation where both learners encounter an error in their predictions. The Naive Bayes learner gives misleading predicted probabilities of CLICK p=0.93 for the incorrect diagnosis of Appendicitis, and a mere p=0.07 for the true class label of Non-specific abdominal pain.
In contrast, even though the VPM Naive Bayes learner incorrectly predicts Appendicitis, it is with far less certainty CLICK p=0.53 and if the user were to look at all probability forecasts CLICK it would be clear that the true class label should not be ignored with a predicted probability p=0.42.
So to quickly recap on everything so far: we know what probability forecasting is. We can assess the performance of probability forecasts using the reliability and resolution criteria. I described how I tested various learners on various datasets. We have briefly looked at traditional methods of assessing probabilities and highlighted that none solely assess reliability. We have seen that the PCG technique offers a useful solution to this problem as it gives intuitive visualisation and measures of reliability. We have seen that reliability and classification accuracy do not go hand in hand. I have told you how the VPM was extended by me for probability forecasting, and also compared its forecasts with those of the underlying learner. Now I will explain which of the learners I tested have been found to be reliable, and which are unreliable.
Here are some PCG plots assessing the reliability of probability forecasts output by the ZeroR learner on various datasets. CLICK 3 TIMES ZeroR outputs probability forecasts which are mere label frequencies; it predicts the majority class at each trial. It uses no information about the objects in its learning – the simplest of all learners. Accuracy is poor, but reliability is good, as you can see from the PCG plots above: they are tight, albeit over a small range of predicted probabilities. ONLY SAY BELOW BIT IF TIME!! ZeroR acts as a control in my experiments: all learners should at least beat ZeroR in classification accuracy. People often overlook this classifier, which I think is a bit stupid if your data is heavily imbalanced. For example, consider a dataset for a rare disease where 90% are normal and 10% have the disease; a majority classifier like ZeroR can then achieve 90% classification accuracy, so if any significant learning is taking place then a learner must beat this!
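ZeroR as a probability forecaster fits in a few lines. A sketch under the description above (the name `zeror_forecast` is mine): it ignores the object entirely and forecasts the label frequencies seen in the training data.

```python
from collections import Counter

# ZeroR: forecast each label with its relative frequency in the training
# set, using no information about the object at all.

def zeror_forecast(training_labels):
    counts = Counter(training_labels)
    n = len(training_labels)
    return {label: c / n for label, c in counts.items()}

print(zeror_forecast(["a", "a", "a", "b"]))  # → {'a': 0.75, 'b': 0.25}
```

This also shows why ZeroR is reliable but vague: its forecasts track the empirical label frequencies by construction, but it never issues a high-probability prediction unless one class dominates the data.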
K-NN finds the subset of the K closest (nearest neighbouring) examples in the training data using a distance metric, then counts the label frequencies amongst this subset. It acts like a more sophisticated version of ZeroR that uses the information held in the object. An appropriate choice of K must be made to obtain reliable probability forecasts; this choice depends on the size, complexity and noise level of the data, and is mainly found by trial and error! In general, the larger K is, the more reliable the learner, but this can dramatically decrease classification accuracy.
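The K-NN forecasting rule just described can be sketched in a few lines; this is a minimal illustration assuming Euclidean distance (the function name and toy data are my own, not from the experiments).

```python
import math
from collections import Counter

def knn_forecast(train, test_obj, k):
    """K-NN probability forecast: rank training examples by Euclidean
    distance to the test object, keep the k closest, and use the label
    frequencies amongst those neighbours as the forecast."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(train, key=lambda ex: dist(ex[0], test_obj))[:k]
    counts = Counter(label for _, label in neighbours)
    return {label: c / k for label, c in counts.items()}

# Toy data: two clusters; the test point sits in the "A" cluster.
train = [((0.0, 0.0), "A"), ((0.1, 0.0), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
forecast = knn_forecast(train, (0.05, 0.05), k=3)
```

Note how K controls the trade-off mentioned above: with k equal to the training set size this degenerates to ZeroR's label frequencies, while very small k gives sharp but potentially unreliable forecasts.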
Traditional learners can be very unreliable (yet accurate); it really depends on the dataset being used. CLICK My research shows empirically that the VPM consistently outputs reliable probability forecasts. CLICK This extended VPM can also recalibrate a learner's original probability forecasts to make them more reliable! CLICK This improvement in reliability made by VPM is often without detriment to classification accuracy. CLICK For example, look at these PCG plots showing the improvement in reliability before and after VPM implementation CLICK EIGHT TIMES. We have the traditional learners (Naïve Bayes, Neural Net, Decision Tree and 1-NN) on the top row of PCG plots, with the VPM implementations underneath on the second row.
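For readers wondering how such PCG plots are computed, here is one plausible construction, sketched under the assumption that PCG coordinates are built like a reliability diagram: bin the predicted probabilities and pair each bin's mean prediction with the empirical frequency of correct predictions in that bin (the binning scheme and function name are my assumptions, not necessarily the exact method used in the talk).

```python
def pcg_coordinates(forecasts, correct, n_bins=10):
    """Sketch of Probability Calibration Graph coordinates: bin the
    predicted probabilities; for each non-empty bin, pair the mean
    predicted probability with the empirical frequency of being
    correct.  A reliable forecaster yields points near the diagonal."""
    bins = [[] for _ in range(n_bins)]
    for p, ok in zip(forecasts, correct):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[i].append((p, ok))
    coords = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            freq = sum(1 for _, ok in b if ok) / len(b)
            coords.append((mean_p, freq))
    return coords

# An unreliable forecaster: predictions of 0.9 are right only half the time.
coords = pcg_coordinates([0.9, 0.9, 0.1, 0.1],
                         [True, False, False, False])
```

Plotting these (mean predicted probability, empirical frequency) pairs against the diagonal line of calibration gives exactly the kind of visual reliability check shown in the slides.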
So, to quickly recap the later points: we have seen that ZeroR, K-NN and VPM are reliable probability forecasters, and that traditional learners can produce very unreliable probability forecasts! Now I will briefly give psychological and theoretical viewpoints on why these learners are reliable.
There are many psychological studies interested in the problem of making effective judgements under uncertainty. When faced with the difficult task of judging probability, people employ a limited number of heuristics which reduce the judgements to simpler ones. CLICK Many heuristics have been identified, some of which are given here: Availability: an event is judged more likely to occur if it has occurred frequently in the past. Representativeness: one compares the essential features of the event to the structure of previous events. Simulation: the ease with which the simulation of a system of events reaches a particular state can be used to judge the propensity of the (real) system to produce that state. Generally, the more heuristics applied, the more robust and reliable the probability forecasts are.
I showed empirically that the ZeroR, K-NN and VPM learners are reliable probability forecasters. We can identify these heuristics in these learning algorithms. Remember, psychological research states: CLICK more heuristics, more reliable forecasts.
The simplest of all reliable probability forecasters uses 1 heuristic: CLICK the learner merely counts the labels it has observed so far, and uses the frequencies of those labels as its forecasts (Availability).
More sophisticated than the ZeroR learner, the K-NN learner uses 2 heuristics: CLICK it uses the distance metric to find the subset of the K closest examples in the training set (Representativeness). CLICK It then counts the label frequencies in the subset of K nearest neighbours to make its forecasts (Availability).
Even more sophisticated, the VPM meta-learner uses all 3 heuristics: CLICK the VPM tries each new test example with all possible classifications (Simulation). CLICK Then, under each tentative simulation, it clusters similar training examples into groups (Representativeness). CLICK Finally, the VPM calculates the frequency of labels in each of these groups to make its forecasts (Availability).
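The three-step VPM procedure above can be sketched very roughly as follows. This is a deliberately simplified illustration of the Venn-predictor idea only: the taxonomy (grouping) function here is a toy threshold, and a real VPM would turn the resulting frequency rows into lower and upper probability bounds rather than stopping at raw frequencies.

```python
from collections import Counter

def venn_forecast(train, test_obj, labels, taxonomy):
    """Simplified Venn-predictor sketch.  For each tentative label y
    (Simulation), add (test_obj, y) to the training set, group all
    examples with the taxonomy function (Representativeness), and read
    off the label frequencies in the test example's group (Availability).
    Returns one frequency row per tentative label."""
    rows = {}
    for y in labels:
        extended = train + [(test_obj, y)]
        key = taxonomy(test_obj, extended)
        group = [lab for obj, lab in extended
                 if taxonomy(obj, extended) == key]
        counts = Counter(group)
        rows[y] = {lab: counts[lab] / len(group) for lab in labels}
    return rows

# Toy example: group objects by whether their single feature exceeds 0.5.
train = [((0.0,), "A"), ((0.1,), "A"), ((1.0,), "B")]
rows = venn_forecast(train, (0.05,), labels=["A", "B"],
                     taxonomy=lambda obj, ex: obj[0] > 0.5)
```

Each row shows what the label frequencies in the test example's group would be if that tentative label were the truth; the spread across rows is what gives the VPM its built-in reliability guarantees.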
CLICK ZeroR can be proven to be asymptotically reliable (and experiments show it also performs well on finite data). CLICK K-NN has a large body of theory (Stone, 1977) to support its convergence to the true probability distribution. CLICK VPM has a lot of theoretical justification for finite data using martingales; I am still trying to decipher Volodya's proofs!
CLICK Probability forecasting is useful for real-life applications, especially medicine. CLICK We want learners to be reliable and accurate. CLICK The PCG can be used to check reliability. CLICK ZeroR, K-NN and VPM provide consistently reliable probability forecasts. CLICK Traditional learners (Naïve Bayes, Neural Net and Decision Tree) can provide unreliable forecasts. CLICK VPM can be used to improve the reliability of probability forecasts without detriment to classification accuracy.
And finally, I'd like to thank the following people.
Look at applications in bioinformatics and medicine: noisy data really needs reliable probability forecasts so the user can know whether to trust predictions! Obtain results with time-series data. Investigate further relationships with psychology. Explore recursive application of VPM to improve reliability and accuracy.