Reliable Probability Forecasting – a Machine Learning Perspective David Lindsay Supervisors: Zhiyuan Luo, Alex Gammerman, Volodya Vovk
Overview
- The problem of probability forecasting, a generalisation of the standard pattern recognition problem
- The reliability and resolution criteria (proposed by research in statistics and psychology) for assessing probability forecasts
- Experimental design
- Current methods of assessing probabilities: square loss, log loss and ROC curves, and their shortcomings for assessing reliability
- The Probability Calibration Graph (PCG), Lindsay (2004), for solely assessing the reliability of probability forecasts
- Evidence that many traditional learners are unreliable yet accurate
- The Venn Probability Machine (VPM) meta-learning framework, extended in Lindsay (2004), as a way to correct these problems
- A summary of which learners are reliable and which are unreliable
- Theoretical and psychological viewpoints on the reliable learners identified
Probability Forecasting
- Qualified predictions are important in many real-life applications (especially medicine): the user needs to know when, and how much, trust can be placed in a prediction
- Most machine learning algorithms make bare predictions, giving no indication of how likely the prediction is to be correct
- Learners that do make qualified predictions rarely make any claims about how effective those measures are
Probability Forecasting: Generalisation of Pattern Recognition
[Figure: a training set to "learn" from, pairing objects (patient details such as name, sex, height) with labels (diagnoses such as Appendicitis, Dyspepsia, Non-specific), plus a test object whose true label is unknown or withheld from the learner. What is its true label?]
Probability Forecasting: Generalisation of Pattern Recognition
[Figure: the learner is given the training set and a test object, and outputs a probability forecast for each possible label, e.g. 0.1, 0.7, 0.2, ...]
Probability forecasting more formally…
- X is the object space and Y the label space, so Z = X × Y is the example space
- The learner Γ makes probability forecasts for all possible labels
- The probability forecasts are used to predict the most likely label
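Written out, the setup on this slide amounts to the following (a standard formalisation; the exact notation on the original slide was not preserved):

```latex
% A probability forecaster maps a training set and a new object
% to a distribution over the labels:
\Gamma : Z^{*} \times X \to [0,1]^{|Y|}, \qquad
\hat{p}(y \mid x) \ge 0, \quad \sum_{y \in Y} \hat{p}(y \mid x) = 1.
% The predicted label is the one with the largest forecast probability:
\hat{y} = \arg\max_{y \in Y} \hat{p}(y \mid x).
```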
Back to the plan…
Studies of Probability Forecasting
- Probability forecasting has been well studied since the 1970s in:
- Psychology
- Statistics
- Meteorology
- These studies assessed two criteria of probability forecasts:
- Reliability: the probability forecasts should not lie
- Resolution: the probability forecasts are practically useful
Reliability
- Informally: an event predicted with probability p should have approximately a 1-p chance of being incorrect
- Known by many names, e.g. being "well calibrated"
- Normally considered an asymptotic property in statistical studies; Vovk's work generalises the problem to finite data
- Dawid (1985) proved that no deterministic learner can be reliable for all data, yet the problem remains worth investigating
- This property is often overlooked in practical studies
Definition of Reliability
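The detailed content of this slide was not preserved; a standard formalisation of reliability (calibration), consistent with the informal definition above, reads:

```latex
% Among the examples that receive a forecast probability close to p,
% the empirical frequency of the forecast being correct should
% converge to p:
\Pr\bigl( y = \hat{y} \;\bigm|\; \hat{p}(\hat{y} \mid x) = p \bigr)
\approx p \quad \text{for all } p \in [0,1].
```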
Resolution
- Demands that the probability forecasts are practically useful, e.g. that they can effectively rank the labels in order of likelihood
- Closely related to classification accuracy, commonly studied in machine learning
- Separate from reliability: reliability and classification accuracy/resolution do not go "hand in hand"
Back to the plan…
Experimental design
- Several learners tested on many datasets in the online setting (explained later):
- ZeroR (used as a control: the most basic learner one could consider; part of the WEKA data-mining system)
- K-Nearest Neighbours
- Neural Network
- C4.5 decision tree
- Naïve Bayes
- Venn Probability Machine (VPM) meta-learner, applied on top of each of the learners above
The Online Learning Setting
[Figure: a handwritten-digit example. Before: the learning machine makes a prediction for a new example (label withheld). After: the true label (here, 2) is revealed and added to the training data for the next trial. The process repeats for all examples.]
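A minimal sketch of this protocol, assuming a scikit-learn-style learner with fit/predict_proba (the actual experiments used WEKA extensions, so this is illustrative only):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def online_evaluation(X, y, make_learner, warm_start=10):
    """Strict online protocol: predict, reveal the true label, retrain."""
    results = []                                   # (top forecast, correct?) per trial
    for n in range(warm_start, len(X)):
        learner = make_learner()
        learner.fit(X[:n], y[:n])                  # training set grows each trial
        p = learner.predict_proba(X[n:n+1])[0]     # forecast for the new example
        pred = learner.classes_[np.argmax(p)]      # predict the most likely label
        results.append((np.max(p), pred == y[n]))
    return results

# e.g. results = online_evaluation(X, y, GaussianNB)
```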
Lots of benchmark data
- A variety of well-known datasets, varying in size, complexity and noise level
- Mostly benchmark data from the UCI repository
- Plus some home-grown favourites such as the Abdominal Pain dataset
- Many medical datasets were chosen: these tend to be noisy, and exemplify the need for reliable probability forecasting
Programs
- Built on the WEKA data-mining system (distributed under the GNU public licence), an extensive Java library of well-known learning algorithms
- WEKA's object-oriented design made it easy to extend, e.g. to add the Venn Probability Machine
- WEKA was also extended so that all learners can be tested in the online learning setting
- Graphs were produced with Matlab scripts; all programs are available via my website
Results, papers and website
- The research in this talk is detailed in three tech reports
- Shortened versions have been submitted to machine learning conferences
- All tech reports will be available on the CLRC website and my own, pending review
Back to the plan…
Loss Functions
- Many loss functions are possible, e.g. square loss and log loss
- DeGroot and Fienberg (1982) showed that all loss functions measure a mixture of reliability and resolution
- Log loss punishes more harshly than square loss, forcing the learner to spread its bets
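For concreteness, here is a sketch of the two loss functions for a single multiclass forecast, using their standard definitions (the slide itself does not give formulas):

```python
import numpy as np

def square_loss(p, true_idx):
    """Brier-style square loss: sum of squared differences between the
    forecast vector and the one-hot encoding of the true label."""
    target = np.zeros_like(p)
    target[true_idx] = 1.0
    return np.sum((p - target) ** 2)

def log_loss(p, true_idx, eps=1e-15):
    """Log loss: negative log of the probability given to the true label.
    Punishes confident mistakes far more harshly than square loss."""
    return -np.log(max(p[true_idx], eps))

p = np.array([0.7, 0.2, 0.1])          # forecast over three labels
print(square_loss(p, 0), log_loss(p, 0))
```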
ROC Curves: Naïve Bayes on the Abdominal Pain dataset
- ROC curves are popularly used in machine learning studies to assess probability forecasts
- They measure the trade-off between true and false positive classifications; we want the curve as close to the upper-left corner (as far from the diagonal) as possible
- The area under the ROC curve is often used as a measure of forecast quality
- My results show that ROC tests resolution; it still does not tell us how or why probability forecasts are unreliable
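As an illustration of how the ROC curve and its area can be computed from forecast scores, here is a sketch using scikit-learn (a stand-in for the WEKA/Matlab tooling actually used in this work):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# scores: forecast probability for the positive class on each trial
# labels: 1 if the example truly belongs to the positive class, else 0
scores = np.array([0.9, 0.8, 0.7, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   1,   0,   0])

fpr, tpr, _ = roc_curve(labels, scores)   # points of the ROC curve
print(roc_auc_score(labels, scores))      # area under the curve
```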
Table comparing traditional scores (ranks in brackets)

Algorithm       | Error     | Sqr Loss  | Log Loss | ROC Area
VPM C4.5        | 40.7 (8)  | 0.54 (5)  | 0.8 (4)  | 0.76 (1)
Naïve Bayes     | 29.2 (2)  | 0.50 (4)  | 1.3 (7)  | 0.72 (5)
VPM Naïve Bayes | 28.9 (1)  | 0.44 (1)  | 0.6 (1)  | 0.75 (2)
10-NN           | 33.4 (4)  | 1.0 (11)  | 2.6 (10) | 0.54 (10)
20-NN           | 33.4 (4)  | 0.96 (10) | 2.2 (9)  | 0.55 (9)
C4.5            | 39.6 (7)  | 0.67 (7)  | 3.3 (11) | 0.57 (8)
Neural Net      | 30.5 (3)  | 0.45 (2)  | 0.72 (2) | 0.75 (3)
30-NN           | 34.3 (5)  | 0.47 (3)  | 0.73 (3) | 0.74 (4)
VPM 1-NN        | 41.6 (9)  | 0.58 (6)  | 0.9 (5)  | 0.61 (6)
1-NN            | 34.6 (6)  | 0.73 (8)  | 2.1 (8)  | 0.59 (7)
ZeroR           | 55.6 (10) | 0.74 (9)  | 1.1 (6)  | 0.49 (11)
Problems with Traditional Assessment
- Loss functions and ROC give more information than error rate about the quality of probability forecasts, but…
- Loss functions measure a mixture of resolution and reliability
- The ROC curve measures resolution
- There is no method that solely assesses reliability
- There is no method that tells whether probability forecasts are over- or under-estimated
- This motivates the Probability Calibration Graph (PCG)
Back to the plan…
Inspiration for PCG (Meteorology)
Reliable → points lie close to the diagonal
Murphy & Winkler (1977): calibration data for precipitation forecasts
A PCG plot of ZeroR on Abdominal Pain
Reliability → PCG coordinates lie close to the line of calibration, i.e. ZeroR may not be accurate, but it is reliable! The plot may not span the whole axis: ZeroR makes no predictions with high probability.
[Axes: predicted probability vs. empirical frequency of being correct; the diagonal is the line of calibration.]
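A minimal sketch of how such a calibration plot can be computed, assuming simple binning of the forecast probabilities (the smoothing used for the actual PCG in the tech reports may differ):

```python
import numpy as np

def pcg_points(forecasts, correct, n_bins=10):
    """Group forecasts into probability bins and compute, per bin, the
    empirical frequency of the forecast label being correct."""
    forecasts, correct = np.asarray(forecasts), np.asarray(correct)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    points = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (forecasts >= lo) & (forecasts < hi)
        if mask.any():
            points.append((forecasts[mask].mean(),  # mean predicted probability
                           correct[mask].mean()))   # empirical frequency correct
    return points  # reliable learners give points near the diagonal y = x
```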
PCG: a visualisation tool and measure of reliability
Naïve Bayes over- and under-estimates its probabilities, much like real doctors: it is unreliable, as a forecast of 0.9 only has a 0.55 chance of being right (overestimate), and a forecast of 0.1 has a 0.3 chance of being right (underestimate). VPM Naïve Bayes is reliable, as its PCG follows the diagonal.

Deviation of the PCG from the line of calibration:

Statistic          | Naïve Bayes | VPM Naïve Bayes
Total              | 2764.5      | 496.7
Mean               | 0.0483      | 0.0087
Standard deviation | 0.0757      | 0.0112
Max                | 0.4203      | 0.1017
Min                | 4.9e-17     | 9.2e-8
Learners predicting like people!
Lots of psychological research → people make unreliable probability forecasts
[PCG-style plots: Naïve Bayes vs. people]
Back to the plan…
Table comparing scores with PCG (ranks in brackets)

Algorithm       | Error     | Sqr Loss  | Log Loss | ROC Area  | PCG
VPM C4.5        | 40.7 (8)  | 0.54 (5)  | 0.8 (4)  | 0.76 (1)  | 838.1 (4)
Naïve Bayes     | 29.2 (2)  | 0.50 (4)  | 1.3 (7)  | 0.72 (5)  | 2764.5 (7)
VPM Naïve Bayes | 28.9 (1)  | 0.44 (1)  | 0.6 (1)  | 0.75 (2)  | 496.7 (1)
10-NN           | 33.4 (4)  | 1.0 (11)  | 2.6 (10) | 0.54 (10) | 5062.9 (11)
20-NN           | 33.4 (4)  | 0.96 (10) | 2.2 (9)  | 0.55 (9)  | 4492.7 (10)
C4.5            | 39.6 (7)  | 0.67 (7)  | 3.3 (11) | 0.57 (8)  | 3481.2 (8)
Neural Net      | 30.5 (3)  | 0.45 (2)  | 0.72 (2) | 0.75 (3)  | 1320.5 (6)
30-NN           | 34.3 (5)  | 0.47 (3)  | 0.73 (3) | 0.74 (4)  | 921.2 (5)
VPM 1-NN        | 41.6 (9)  | 0.58 (6)  | 0.9 (5)  | 0.61 (6)  | 554.6 (2)
1-NN            | 34.6 (6)  | 0.73 (8)  | 2.1 (8)  | 0.59 (7)  | 4307.5 (9)
ZeroR           | 55.6 (10) | 0.74 (9)  | 1.1 (6)  | 0.49 (11) | 678.6 (3)
Correlations of scores

Scores                  | Corr. Coeff. | Strength | Interpretation
PCG vs. Sqr Reliability | 0.76         | Strong   | Direct
PCG vs. Sqr Resolution  | 0.04         | No       | Direct
PCG vs. Error           | 0.26         | Weak     | Direct
ROC vs. Sqr Reliability | -0.1         | No       | Inverse
ROC vs. Sqr Resolution  | 0.67         | Strong   | Direct
ROC vs. Error           | -0.52        | Moderate | Inverse
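The "Sqr Reliability" and "Sqr Resolution" components above come from decomposing the square loss. The slide does not show the formula; the standard Murphy-style decomposition for binary forecasts (the tech report's multiclass version may differ in detail) is:

```latex
% n_k forecasts fall in bin k with mean forecast f_k and empirical
% frequency of correctness \bar{o}_k; \bar{o} is the overall frequency.
\mathrm{SqrLoss}
  = \underbrace{\tfrac{1}{N}\textstyle\sum_{k} n_k (f_k - \bar{o}_k)^2}_{\text{reliability}}
  - \underbrace{\tfrac{1}{N}\textstyle\sum_{k} n_k (\bar{o}_k - \bar{o})^2}_{\text{resolution}}
  + \underbrace{\bar{o}(1 - \bar{o})}_{\text{uncertainty}}
```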
Back to the plan…
What is the VPM meta-learner?
- The VPM "sits on top" of any existing learner Γ to complement its predictions with probability estimates
- Vovk introduced the VPM and originally used it to output provably valid bounds on conditional probabilities
- Those bounds had limited practical use, so the VPM was extended, Lindsay (2004), to extract more information:
- outputting probability forecasts for all possible labels
- predicting a label using these probability forecasts
- The extended VPM has not lost the ability to produce bounds
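A much-simplified sketch of the Venn prediction idea follows. The taxonomy used here (the label of each example's nearest neighbour) and the use of per-label minima and maxima as bounds are illustrative assumptions; the actual VPM framework is more general:

```python
import numpy as np

def venn_forecast(X_train, y_train, x_test, labels):
    """For each hypothetical label of the test object, categorise all
    examples with a taxonomy (here: the label of each example's nearest
    neighbour) and read off label frequencies in the test object's category."""
    rows = []
    for y_hyp in labels:
        X = np.vstack([X_train, x_test])     # extend data with (x_test, y_hyp)
        y = np.append(y_train, y_hyp)
        cats = []
        for i in range(len(X)):
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                    # an example is not its own neighbour
            cats.append(y[np.argmin(d)])
        cats = np.array(cats)
        in_cat = cats == cats[-1]            # category of the (x_test, y_hyp) pair
        rows.append([np.mean(y[in_cat] == l) for l in labels])
    P = np.array(rows)                       # one forecast row per hypothesis
    return P, P.min(axis=0), P.max(axis=0)   # matrix plus per-label lower/upper bounds
```

A point forecast can then be read off the matrix P, e.g. by taking column means; that is one illustrative choice rather than the thesis' exact rule.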
Volodya’s original use of VPM
[Figure: number of errors vs. online trial number. The upper (red) and lower (green) bounds lie above and below the actual number of errors (black) made on the data: lower bound 22.1% (1414.1 errors), actual error 28.9% (1835), upper bound 34.7% (2216.5).]
Output from VPM compared with that of the original underlying learner
Key: predicted label underlined, actual label highlighted in yellow (formatting not preserved in this text version).

Naïve Bayes, probability forecast for each class label:

Trial # | Appx    | Div.   | Perf. Pept. | Non. Spec | Choli  | Intest obstr | Pancr   | Renal.  | Dysp.  | Up | Low
1653    | 3.08e-9 | 4.5e-6 | 3.3e-6      | 4.4e-5    | 0.99   | 4.2e-3       | 3.4e-3  | 4.1e-10 | 1.3e-4 | NA | NA
2490    | 9.4e-5  | 0.01   | 0.17        | 2.3e-5    | 0.16   | 0.46         | 0.2     | 2.2e-7  | 2.2e-4 | NA | NA
5831    | 0.93    | 2.9e-9 | 1.7e-13     | 0.07      | 1.3e-9 | 2.2e-9       | 4.0e-11 | 6.3e-10 | 7.6e-9 | NA | NA

VPM Naïve Bayes, probability forecast for each class label:

Trial # | Appx | Div. | Perf. Pept. | Non. Spec | Choli | Intest obstr | Pancr | Renal. | Dysp. | Up   | Low
1653    | 0.03 | 0.0  | 0.03        | 0.08      | 0.73  | 0.0          | 0.04  | 0.01   | 0.09  | 0.82 | 0.08
2490    | 0.02 | 0.03 | 0.10        | 0.07      | 0.05  | 0.15         | 0.08  | 0.09   | 0.4   | 0.71 | 0.07
5831    | 0.53 | 0.01 | 0.0         | 0.42      | 0.01  | 0.01         | 0.0   | 0.01   | 0.01  | 0.68 | 0.41
Back to the plan…
ZeroR
- ZeroR's probability forecasts are simply the frequency counts of the labels in the training data; it uses no information about the object
- [PCG plots on the Heart Disease, Lymphography and Diabetes datasets]
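A sketch of the ZeroR forecaster just described; it ignores the test object entirely:

```python
import numpy as np

def zero_r_forecast(y_train, labels):
    """ZeroR's probability forecast: the label frequencies in the
    training data, regardless of the test object."""
    y_train = np.asarray(y_train)
    return np.array([np.mean(y_train == l) for l in labels])

print(zero_r_forecast(["a", "a", "b", "c"], ["a", "b", "c"]))  # [0.5 0.25 0.25]
```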
K-NN: 10-NN, 20-NN, 30-NN
[PCG plots for 10-NN, 20-NN and 30-NN]
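Assuming the usual K-NN forecasting rule, label proportions among the K nearest neighbours (the slide's own bullets were not preserved), a sketch:

```python
import numpy as np

def knn_forecast(X_train, y_train, x_test, labels, k=10):
    """K-NN probability forecast: the fraction of the k nearest
    neighbours of x_test that carry each label."""
    d = np.linalg.norm(np.asarray(X_train) - np.asarray(x_test), axis=1)
    nearest = np.asarray(y_train)[np.argsort(d)[:k]]
    return np.array([np.mean(nearest == l) for l in labels])
```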
Traditional Learners and VPM
[PCG plots comparing each traditional learner with its VPM counterpart: Naïve Bayes vs. VPM Naïve Bayes, C4.5 vs. VPM C4.5, Neural Net vs. VPM Neural Net, 1-NN vs. VPM 1-NN]
Back to the plan…
Psychological Heuristics
Interpretation of reliable learners using heuristics: more heuristics → more reliable forecasts
Psychological Interpretation of ZeroR
Psychological Interpretation of K-NN
Psychological Interpretation of VPM
Theoretical justifications
Take home points
Fin. Acknowledgments
What next?
Editor's Notes

  1. Hello and welcome to my cake talk; cakes are situated at the front, so please feel free to munch away. The title of my talk is Reliable Probability Forecasting – a Machine Learning Perspective. I have been working on this research for about 9 months. The talk will be quite high level; anyone who wants more low-level detail can ask questions at the end or look at my three tech reports, from which some of this talk's material is taken. I have tried to make this talk accessible to people outside my field, and I hope you all understand at least some part of it. If anything on the slides is confusing, please stop to ask questions.
  2. Let me start by giving an overview of what I am going to talk about today; I will return to this plan as we go along. First I will introduce the problem of probability forecasting and describe how it generalises the standard pattern recognition problem studied in machine learning. Then I will describe the reliability and resolution criteria (proposed by research in statistics and psychology) which can be used for assessing the effectiveness of probability forecasts. I will follow this by briefly detailing my experimental design. I will then showcase current methods of assessing probabilities, namely square loss, log loss and ROC curves, and highlight the problems with using these approaches to assess reliability alone. I will introduce the Probability Calibration Graph (PCG), Lindsay (2004), for solely assessing the reliability of probability forecasts, and show that many traditional learners are unreliable yet accurate, which can seem a counterintuitive claim, as we shall see later. I will show how the newly developed Venn Probability Machine (VPM) meta-learning framework can be extended, Lindsay (2004), and used to correct these problems with traditional learners. I will summarise which learners have been demonstrated to be reliable and which are unreliable. Finally, I give theoretical and psychological viewpoints on the reliable learners that my studies have identified.
  3. Let's go through some initial benefits of probability forecasting. Qualified predictions are important in many real-life applications (especially medicine): it is very handy for the user to know when, and how much, trust can be placed in a prediction made by a learner. Having said that, most machine learning algorithms make bare predictions and give no indication of how likely the prediction is to be correct; I think this is why very few learning systems are used in practice. Those learners that do make qualified predictions make no claims about how effective the measures are. For example, the WEKA data mining system provides a load of tweaks to existing algorithms to output probability forecasts, but not many have any theoretical proof of their validity.
  4. Let me review a general problem commonly tackled by machine learning, namely pattern recognition. The goal of pattern recognition is quite simple: find the "best" label for each new test object. An example that I will use throughout this talk is the Abdominal Pain dataset (mainly because I believe this research is most applicable to the medical problem domain). The data is very noisy and complex: we have roughly 6300 patients' details collected by a hospital in Cardiff. Each patient is described using 135 properties, and associated with each patient is 1 of 9 abdominal pain diseases (Appendicitis, Dyspepsia, Non-specific abdominal pain, Renal Colic, etc.). Relating this to the notation and jargon I will use throughout: we think of our examples as information pairs; each example represents a patient, each object x describes the patient's symptoms, and the corresponding label y is the diagnosis of the abdominal pain disease they are suffering from. In a machine learning interpretation of the pattern recognition problem, the supervisor (in this case a doctor) provides a training set from which to learn the all-important relationship between objects and labels. The hope is that if the training set is large and clean enough, the user will be able to input the details of a new patient and the learning algorithm will diagnose that patient by predicting 1 of the 9 possible labels. Usually we keep back a test set to validate the predictions made by the learner, so we can test the performance of the learning algorithm.
  5. A probability forecast is an estimate of the conditional probability of a label given an observed object. I use the hat notation to distinguish the predicted value output by the learner from the true value determined by nature. Obviously in real-life applications I do not have access to some higher power to give me the true probability distribution, so it is awkward to check whether my forecasts are accurate. We want the learner to estimate probabilities for all class labels. Returning to the abdominal pain example: the training data and the unlabelled test object are both fed into our learner Gamma, which outputs a probability forecast for each possible label (i.e. disease) for that new test object (i.e. patient). Remember that the predicted probabilities output sum to one. Naturally, we predict the label with the highest associated probability.
  6. Using the standard notation, we have X as the object space and Y as the label space, so that Z = X × Y is the example space. Our learner Γ makes probability forecasts for all possible labels, and we use these probability forecasts to predict the most likely label.
  7. So hopefully it is clear what probability forecasting is, but how can we assess the quality or effectiveness of these forecasts? We shall now see.
  8. Probability forecasting has been a well-studied area since the 1970s in psychology, statistics and meteorology. These studies assessed two criteria of probability forecasts: reliability (the probability forecasts should not lie) and resolution (the probability forecasts are practically useful).
  9. An informal definition of reliable probability forecasting: an event predicted with probability p should have approximately a 1-p chance of being incorrect. This notion is known by many names, such as being well calibrated. Reliability is normally considered an asymptotic property (as the number of training examples tends to infinity) in statistical studies; however, the work by Volodya generalises this problem to finite data. In 1985 Dawid proved that no deterministic learner can be reliable for all data; it is still interesting to investigate the problem of reliable probability forecasting, as the work by Volodya and me shows. This property is often overlooked in practical studies, which is a real shame, as I think many applications would find it very attractive: if the probability forecasts of a learner were reliable, then they would at least be trustworthy.
  10. Now let's look at the second term, resolution. Resolution demands that the probability forecasts are practically useful, e.g. that they can be used to effectively rank the labels in order of likelihood. It is closely related to classification accuracy, which is commonly studied in machine learning. It is separate from reliability: one of my papers shows that reliability and classification accuracy/resolution do not go "hand in hand".
  11. To recap: I have detailed what probability forecasting is, and that many studies in different fields have identified that probability forecasting can be assessed using the reliability and resolution criteria. Now I will describe how and why I conducted my experiments.
  12. I tested several learners on many datasets in the online setting (which I will explain later): ZeroR (used as a control: it is the most basic, simple learner you could consider, so we would expect any other learner to improve on its probability forecasts; ZeroR is not well known and comes as part of the WEKA data-mining system that I will describe later), K-Nearest Neighbour, Neural Network, the C4.5 decision tree, Naïve Bayes, and the Venn Probability Machine meta-learner, which I will discuss later, as it has been applied to all the learners here.
  13. Traditionally, most studies in machine learning are carried out in the offline learning setting, where the learning machine is provided with a fixed training and test set to evaluate. For my research I looked primarily at the online learning setting, as it fits nicely with the theory and lets you see how the learner improves with experience. Having said that, I have conducted all these experiments in the offline setting as well and they come out the same. The crucial difference in the online setting is that the training set provided to the learning machine is continually updated. A quick example using the handwritten digits image dataset: the images are the objects, and the label says which digit each one is. The strict online learning setting works as follows. First, the learning machine makes a prediction for a new test example. Second, the teacher/supervisor provides the true label of the example (in this case 2) and adds it to the training set. Finally, the process is repeated for each example in the dataset, presenting each example as a "trial" in the online process.
  14. To give an idea of the kind of data I tested my learning algorithms on, I have compiled this slide. As you can see, I tested a variety of well-known datasets (varying in size, complexity and noise level), mostly benchmark data from the UCI repository, but also some home-grown favourites such as the Abdominal Pain dataset. I chose many medical datasets, as these tend to be quite noisy, and I believe the need for reliable probability forecasting is exemplified by this problem domain.
  15. For the programming side, I decided to capitalise on the lovely WEKA data mining system (distributed under the GNU public licence). This package, implemented in Java, offers an extensive library of well-known machine learning algorithms. Because it is written in an object-oriented programming language, it was easy to extend the existing functionality of the system; this is how I added extra algorithms such as the Venn Probability Machine that I will be talking about today. I also extended the WEKA system to allow all learners to be tested in the online learning setting mentioned a few slides ago, as not many people yet test in this mode. To create the graphs I wrote some handy Matlab scripts, and all these programs are available via my website.
  16. As I mentioned at the start, all of the research I am talking about can be found in the three tech reports (details on the slide) that I have been working on for several months. I have also tried, unsuccessfully so far, to publish shortened versions of these papers at some of the big machine learning conferences. All tech reports will hopefully be available on the CLRC website and my own, pending review.
  17. Going back to the plan again: we know what probability forecasting is, and that we can intuitively assess the performance of probability forecasts using the two criteria, reliability and resolution. I have detailed my experimental design and which learners and data I have tested. But there are methods currently used in machine learning for assessing the performance of probability forecasts, and this is what we will look at now, also highlighting their problems.
  18. There are many possible loss functions, such as square loss and log loss. In 1982, DeGroot and Fienberg showed that all loss functions measure a mixture of reliability and resolution. Log loss punishes more harshly, and the learner is forced to spread its bets.
  19. ROC curves check the proportion of correct versus incorrect predictions made by a learner; see the tech report on the PCG for more details. ROC curves are popularly used in machine learning studies to assess probability forecasts, commonly to measure the trade-off between false and true positive classification. We want the ROC curve to be as close to the upper-left corner as possible, i.e. to deviate from the diagonal as much as possible. My results show that this graph tests resolution. The area under the ROC curve is often used as a measure of the quality of the probability forecasts being made. It still does not tell us how or why probability forecasts are unreliable; it has more to do with accuracy.
  20. To try and reinforce my point that error rate does not reflect the quality of the probability forecasts: traditional studies would take the classification accuracy, or inversely the error rate, of our learners on a dataset (in this case Abdominal Pain), producing the kind of league table you see here. Each learner is given a rank, in brackets, in terms of its error rate; obviously we want the error rate to be as small as possible. The Naïve Bayes learners are the most accurate, and the ZeroR learner is the least accurate, as we would expect. This is where the analysis would normally end, and the user would probably choose the most accurate learners for their practical application. However, if we look at the loss functions and the areas under the ROC curves for each algorithm, a different story emerges: these measures rank the learners differently. Naïve Bayes starts to slip down the rankings, from second to seventh, while ZeroR starts to rise, from last place to sixth. VPM Naïve Bayes remains high, and I will discuss this later.
  21. Loss functions and ROC give more information than error rate about the quality of probability forecasts. But, as I said earlier, loss functions measure a mixture of resolution and reliability, and the ROC curve measures resolution. We have no method of solely assessing reliability, and no method of telling whether probability forecasts are over- or under-estimated. This is where I introduce my contribution to this research: the Probability Calibration Graph technique.
  22. So: we know what probability forecasting is; we can assess the performance of probability forecasts using the reliability and resolution criteria; I have tested various learners on various (mostly medical) datasets; and we have briefly looked at traditional methods of assessing probabilities, highlighting that none solely assesses reliability. This sets the scene for me to introduce my Probability Calibration Graph technique for visualising the reliability of the probability forecasts output by learners.
  23. Here is a scan of a graph which served as my inspiration for the PCG that I developed for checking the reliability of probability forecasts output by learning algorithms. It is taken from the meteorological study of Murphy and Winkler (1977), which analysed the calibration/reliability of the forecasts of the likelihood of precipitation made by the American national weather service. The graph is pretty simple: on the horizontal axis is the forecast probability, and on the vertical axis is the observed relative frequency of precipitation (i.e. of the prediction being correct). The points plotted have a number next to them indicating how many predictions were made at that predicted probability. If the forecasts are reliable then they will stick to the diagonal line.
  24. Graphs similar to the PCG plot were first used in the early 1970s by psychological and meteorological studies to assess the reliability of probability forecasts. Here is a PCG plot: the predicted probability on the horizontal axis versus the empirical frequency of the forecast being correct on the vertical axis. For more in-depth detail on its construction see my tech report; I won't bore you with the formulas. The red line is the line of calibration, the ideal line that reliable learners will stick to (predicted probability = empirical frequency). Here are the PCG coordinates for the ZeroR learner when tested on the Abdominal Pain data. The plot may not span the whole axis: ZeroR doesn't make any predictions with high probability (vague predictions). Reliability means the PCG coordinates lie close to the line of calibration, i.e. ZeroR may not be accurate, but it is reliable!
  25. Here we have two PCG plots side by side: the Naïve Bayes learner, with its VPM counterpart next to it. Ideally a reliable learner gives a line close to the diagonal. (This is a brief taste of things to come; I'll explain the VPM later.) From the plot on the left we can clearly see that the Naïve Bayes learner is producing unreliable probability forecasts, as the PCG plot (thick black line) deviates quite dramatically from the line of calibration (red diagonal line). For example, forecasts of 0.9 actually have a 0.55 chance of being correct: the learner tends to overestimate, or be overconfident in, its predictions. You can imagine this is bad. The Naïve Bayes learner is very accurate on this abdominal pain data, but if I gave this system to a doctor, they would get predictions of a disease with probability 0.9 when there is actually much less chance of the patient having that disease, which could lead to improper treatment. On the flip side, forecasts of 0.1 actually have a 0.3 chance of being correct, evidence of underestimation, where the learner is underconfident in its predictions. This pattern of over- and under-confidence is actually reported as the behaviour of people when asked to make estimates of probability; doctors especially have been known to produce the sort of PCG graphs made by Naïve Bayes, which is a bit worrying, I think. The PCG is a useful visualisation technique for seeing how reliable probability forecasts are, and it can also be used to calculate useful measures, e.g. the statistics about the deviations given beneath each PCG plot. These are useful when there is not much between two PCG plots to distinguish which learner is more reliable; for this I calculate various statistics (total, mean, standard deviation, etc.) from the absolute deviation of the PCG plot from the diagonal line of calibration. This graph has wide applications and is the first to concentrate solely on reliability. In a nutshell, the Naïve Bayes learner is unreliable, but the VPM Naïve Bayes learner is reliable, as its PCG plot sticks close to the line of calibration; we can see this also in the statistical measures in the tables below the plots.
  26. Returning to the PCG plot of the Naïve Bayes learner on the abdominal pain data: as I said on the previous slide, there is a lot of psychological research and evidence that doctors, and many other people, make unreliable probability forecasts. Here is a PCG-like plot created back in 1977, taken from a psychological journal; notice the similarity in shape to the Naïve Bayes PCG plot. There are lots of graphs like this in psychology research, and lots of interpretation as to why people predict unreliably, and I think these results are interesting for us as practitioners in machine learning.
  27. So far: we know what probability forecasting is; we can assess probability forecasts using the reliability and resolution criteria; I have tested various learners on various (mostly medical) datasets; traditional assessment methods do not solely assess reliability; and the PCG technique offers a useful solution, giving an intuitive visualisation and measures of reliability. Now I will give a brief summary of the results I found; importantly, reliability and classification accuracy do not go hand in hand: you can have a not-very-accurate learner that is reliable (ZeroR), and vice versa, a learner with good classification accuracy but poor reliability (Naïve Bayes).
  28. Let's return to the results we saw earlier of various learners on the abdominal pain dataset, this time adding the PCG total deviation scores. Once again, each learner's score comes with a ranking, in brackets, of how good that score is compared to the others. As I said earlier, you can use the statistics of the deviation of the PCG plot from the line of calibration as a measure of reliability; in the table above I have given the total absolute deviation. We want the deviation to be as small as possible, and this is how the PCG deviations are ordered. We see a very different ordering of the learners from the error rate. Notice that ZeroR is ranked last in terms of error rate, misdiagnosing around 55% of patients. This is no surprise, as the learner is very simple: it outputs probability forecasts which are just frequency counts of the labels in the training data, using no information about the patient to diagnose. Yet ZeroR is quite respectably ranked 3rd in terms of reliability; so ZeroR may not be accurate (or resolute), but it is reliable. Conversely, the Naïve Bayes learner is very accurate, ranked a close 2nd with only 29% errors, but its reliability is ranked 7th; so Naïve Bayes is accurate but not reliable. Concentrating on the PCG deviation (i.e. reliability), we see a significant re-ordering of the learners: in the top 5 are ZeroR, various VPM implementations (I'll explain those later) and a K-NN learner. We shall see later that all these learners have theoretical and psychological justifications of reliability. In summary, the PCG gives us a visualisation and a measure of reliability, and we can see that reliability and accuracy do not go hand in hand.
29. For more evidence that the PCG is a measure of reliability, check out this result taken from my tech reports; imagine lots and lots of tables like the one on the previous slide, over many datasets. In the PCG tech report I gave a broad review of all the traditional methods of assessing probabilities. I decomposed the square loss function into its reliability and resolution components, and then calculated the correlation between these scores and other methods such as PCG, ROC etc. Of particular interest are the relationships highlighted [CLICK]: the PCG correlates with reliability, and ROC correlates with resolution. These results matter because many in the machine learning field think that ROC measures reliability, when in fact it measures the other useful property, resolution. We also notice that there is only a weak relationship between error rate and PCG, indicating that classification accuracy cannot guarantee that probability forecasts are reliable. Conversely, ROC has CLICK a moderate correlation with error rate, indicating that error rate is more closely related to resolution. These results were submitted to ICML but were rejected.
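For reference, the standard way to split the square loss in this fashion is the Murphy (1973) decomposition of the Brier score; I am assuming the decomposition in the tech report is along these lines. For binary outcomes, with forecasts taking K distinct values f_k, each issued n_k times with observed event frequency \bar{o}_k among those occasions, and overall event frequency \bar{o}:

$$
\frac{1}{N}\sum_{i=1}^{N}(f_i - o_i)^2
= \underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(f_k - \bar{o}_k)^2}_{\text{reliability}}
\;-\;
\underbrace{\frac{1}{N}\sum_{k=1}^{K} n_k\,(\bar{o}_k - \bar{o})^2}_{\text{resolution}}
\;+\;
\underbrace{\bar{o}\,(1 - \bar{o})}_{\text{uncertainty}}
$$

A reliable forecaster makes the first term small (forecasts match observed frequencies); a resolute one makes the second term large (forecasts separate the cases well). The uncertainty term depends only on the data.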
30. So we know what probability forecasting is. We can assess the performance of probability forecasts using the reliability and resolution criteria. I have tested various learners on various datasets (mostly medical). We have briefly looked at traditional methods of assessing probabilities and highlighted that none solely assesses reliability. We have seen that the PCG technique offers a useful solution to this problem, as it gives intuitive visualisation and measures of reliability. We have seen that reliability and classification accuracy do not go hand in hand. Now it's time to fill in the gaps and explain how and why I extended the VPM for probability forecasting.
31. The VPM can be applied on top of any existing learning algorithm. CLICK TWICE Vovk introduced the VPM and originally used it to output provably valid bounds for conditional probabilities. CLICK However, these bounds had limited practical use because… So I extended the VPM (Lindsay, 2004) to CLICK extract more information from the probability forecasts output by the VPM learner, by CLICK outputting probability forecasts for all possible labels and CLICK predicting a label using these probability forecasts. I should point out that my extended VPM hasn't lost the ability to produce bounds CLICK
32. As I said on the previous slide, the VPM was originally used to calculate bounds on the probability of a VPM's predicted label being correct. If we invert these bounds (e.g. 1 − p) then we get probability bounds on the prediction being incorrect, and so Volodya created nice graphs like this CLICK and POINT to validate the incredibly complicated theory behind the VPM. Here you can see that the upper bounds in red and the lower bounds in green lie above and below the actual number of errors in black made on the data. CLICK This is great, as the theory is nicely demonstrated by practical experiments, but as we can see these bounds can be quite loose, which limits their practical usefulness, as we shall see on the next slide.
33. I am going to show you an example which clearly indicates the practical usefulness of the extended VPM's probability forecasts for each class label, compared to the predicted bounds discussed previously. Here we have predictions made by the Naïve Bayes learner CLICK and its VPM counterpart CLICK for the same trials in the online process. Actual labels (the true disease for the patient at that trial) are indicated in yellow, and the predicted label made by each learner is emboldened and underlined. At a glance it is obvious that the predicted probabilities output by the Naïve Bayes learner are far more extreme (i.e. very close to 0 or 1) than those output by its VPM counterpart. For example, trial 1653 shows a patient object which is predicted correctly by Naïve Bayes with p=0.99 and less emphatically by its VPM counterpart with p=0.73 CLICK. Remember this data is very noisy, so it's very unlikely that any prediction can be made with a 0.99 chance of being correct! CLICK Trial 2490 demonstrates the problem of over- and under-estimation by the Naïve Bayes learner, where a patient is incorrectly diagnosed with Intestinal obstruction (overestimation), yet the true diagnosis of Dyspepsia is ranked 6th CLICK with a very low predicted probability of p=22/1000 (underestimation). In contrast, the VPM Naïve Bayes learner makes more reliable probability forecasts: for trial 2490 the true class is correctly predicted, albeit with a lower predicted probability of 0.4 CLICK POINT. Trial 5381 demonstrates a situation where both learners err in their predictions. The Naïve Bayes learner gives misleading predicted probabilities of CLICK p=0.93 for the incorrect diagnosis of Appendicitis, and a mere p=0.07 for the true class label of Non-specific abdominal pain. In contrast, even though the VPM Naïve Bayes learner also incorrectly predicts Appendicitis, it does so with far less certainty CLICK p=0.53, and if the user were to look at all the probability forecasts CLICK it would be clear that the true class label should not be ignored, with a predicted probability of p=0.42.
34. So, to quickly recap everything so far. We know what probability forecasting is. We can assess the performance of probability forecasts using the reliability and resolution criteria. I described how I tested various learners on various datasets. We have briefly looked at traditional methods of assessing probabilities and highlighted that none solely assesses reliability. We have seen that the PCG technique offers a useful solution to this problem, as it gives intuitive visualisation and measures of reliability. We have seen that reliability and classification accuracy do not go hand in hand. I have told you how I extended the VPM for probability forecasting, and compared its forecasts with those of the underlying learner. Now I will explain which of the learners I tested have been found to be reliable, and which are unreliable.
35. Here are some PCG plots assessing the reliability of probability forecasts output by the ZeroR learner on various datasets. CLICK 3 TIMES ZeroR outputs probability forecasts which are mere label frequencies, and it predicts the majority class at each trial. It uses no information about the objects in its learning – the simplest of all learners. Its accuracy is poor, but its reliability is good, as you can see from the PCG plots above: they are tight, albeit over a small range of predicted probabilities. ONLY SAY BELOW BIT IF TIME!! ZeroR acts as a control in my experiments: all learners should at least beat ZeroR in classification accuracy. People often overlook this classifier, which I think is unwise if your data is heavily imbalanced. For example, consider a dataset for a rare disease where 90% of patients are normal and 10% have the disease; a majority classifier like ZeroR can achieve 90% classification accuracy, so if any significant learning is taking place a learner must beat this!
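To make this concrete, here is a minimal sketch of a ZeroR-style probability forecaster in the online setting; the class name and interface are my own illustrative assumptions, not the exact implementation used in the experiments.

```python
# Minimal sketch of a ZeroR probability forecaster (illustrative interface).
from collections import Counter

class ZeroR:
    def __init__(self, labels):
        self.counts = Counter({y: 0 for y in labels})

    def forecast(self, x=None):
        """Ignore the object x entirely; forecast the label frequencies seen so far."""
        total = sum(self.counts.values())
        if total == 0:                                   # no data yet: uniform forecast
            return {y: 1 / len(self.counts) for y in self.counts}
        return {y: n / total for y, n in self.counts.items()}

    def update(self, y_true):
        self.counts[y_true] += 1

# Online run: forecast, observe the true label, then update for the next trial.
zr = ZeroR(labels=["appendicitis", "dyspepsia", "non-specific"])
for y in ["dyspepsia", "dyspepsia", "appendicitis"]:
    print(zr.forecast())
    zr.update(y)
```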
36. K-NN finds the subset of the K closest (nearest neighbouring) examples in the training data using a distance metric, then counts the label frequencies amongst this subset. It acts like a more sophisticated version of ZeroR that uses the information held in the object. An appropriate choice of K must be made to obtain reliable probability forecasts; this choice depends on the size, complexity and noise level of the data, and is mainly found by trial and error! In general, the larger K is, the more reliable the learner, but this can dramatically decrease classification accuracy.
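A minimal sketch of the K-NN forecaster just described, assuming Euclidean distance as the metric; the function name and toy data are illustrative.

```python
# Minimal sketch of a K-NN probability forecaster (Euclidean distance assumed).
import numpy as np
from collections import Counter

def knn_forecast(X_train, y_train, x, k):
    """Forecast probabilities as label frequencies among the k nearest neighbours.
    Labels absent from the neighbourhood implicitly get probability 0."""
    dists = np.linalg.norm(np.asarray(X_train) - np.asarray(x), axis=1)
    nearest = np.argsort(dists)[:k]
    counts = Counter(np.asarray(y_train)[nearest])
    return {label: n / k for label, n in counts.items()}

# Example: larger k smooths the forecasts towards the overall label frequencies.
X = np.random.default_rng(1).normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
print(knn_forecast(X, y, x=[0.1, 0.0, 0.0], k=20))
```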
37. Traditional learners can be very unreliable (yet accurate); it really depends on the dataset being used. CLICK My research shows empirically that the VPM consistently outputs reliable probability forecasts. CLICK This extended VPM can recalibrate a learner's original probability forecasts to make them more reliable! CLICK This improvement in reliability made by the VPM is often without detriment to classification accuracy. CLICK For example, look at these PCG plots showing the improvement in reliability before and after VPM implementation CLICK EIGHT TIMES. We have the traditional learners (Naïve Bayes, Neural Net, Decision Tree and 1-NN) on the top row of PCG plots, with the VPM implementations underneath on the second row.
38. So, to quickly recap the later points. We have seen that ZeroR, K-NN and VPM are reliable probability forecasters, and that traditional learners can produce very unreliable probability forecasts! Now I will briefly detail a psychological and theoretical viewpoint on why these learners are reliable.
39. There are many psychological studies interested in the problem of making effective judgements under uncertainty. When faced with the difficult task of judging probability, people employ a limited number of heuristics which reduce the judgements to simpler ones. CLICK Many heuristics have been identified, some of which are given here: Availability – an event is predicted as more likely to occur if it has occurred frequently in the past. Representativeness – one compares the essential features of the event to the structure of previous events. Simulation – the ease with which the simulation of a system of events reaches a particular state can be used to judge the propensity of the (real) system to produce that state. Generally, the more heuristics applied, the more robust and reliable the probability forecasts are.
40. I showed empirically that the ZeroR, K-NN and VPM learners are reliable probability forecasters. We can identify the above heuristics in these learning algorithms. Remember, psychological research states: CLICK more heuristics → more reliable forecasts.
41. The simplest of all reliable probability forecasters uses 1 heuristic: CLICK the learner merely counts the labels it has observed so far, and uses the frequencies of those labels as its forecasts (Availability).
42. More sophisticated than the ZeroR learner, the K-NN learner uses 2 heuristics: CLICK it uses the distance metric to find the subset of the K closest examples in the training set (Representativeness), CLICK then counts the label frequencies in the subset of K nearest neighbours to make its forecasts (Availability).
43. Even more sophisticated, the VPM meta-learner uses all 3 heuristics: CLICK the VPM tries each new test example with all possible classifications (Simulation), CLICK then under each tentative simulation clusters similar training examples into groups (Representativeness), CLICK and finally calculates the frequency of labels in each of these groups to make its forecasts (Availability).
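Here is a minimal sketch in the spirit of those three steps. The taxonomy used to group examples (the label of each example's nearest other example) is one simple choice among many and is my own assumption; a true VPM also derives lower/upper probability bounds from the full set of distributions obtained across the tried labels, whereas this sketch just reports each tried label's frequency within the test example's group.

```python
# Minimal sketch of a Venn-Probability-Machine-style forecaster (illustrative).
import numpy as np
from collections import Counter

def nn_category(X, y, i):
    """Taxonomy: the category of example i is the label of its nearest other example."""
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf
    return y[int(np.argmin(d))]

def vpm_forecast(X_train, y_train, x, labels):
    probs = {}
    for y_try in labels:                        # Simulation: try every possible label
        X = np.vstack([X_train, x])
        y = np.append(y_train, y_try)
        n = len(y) - 1                          # index of the test example
        cat = nn_category(X, y, n)              # Representativeness: find its group
        members = [i for i in range(len(y)) if nn_category(X, y, i) == cat]
        freq = Counter(y[i] for i in members)   # Availability: label frequencies in group
        probs[y_try] = freq[y_try] / len(members)
    return probs                                # one frequency per tried label

X = np.random.default_rng(2).normal(size=(60, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
print(vpm_forecast(X, y, x=np.array([0.2, 0.1]), labels=[0, 1]))
```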
44. CLICK ZeroR can be proven to be asymptotically reliable (and experiments show it performs well on finite data). CLICK K-NN has a lot of supporting theory (Stone, 1977) for its convergence to the true probability distribution. CLICK The VPM has a lot of theoretical justification for finite data using martingales; I am still trying to decipher Volodya's proofs.
45. CLICK Probability forecasting is useful for real-life applications, especially medicine. CLICK We want learners to be reliable and accurate. CLICK The PCG can be used to check reliability. CLICK ZeroR, K-NN and VPM provide consistently reliable probability forecasts. CLICK The traditional learners Naïve Bayes, Neural Net and Decision Tree can provide unreliable forecasts. CLICK The VPM can be used to improve the reliability of probability forecasts without detriment to classification accuracy.
46. And finally, I'd like to thank the following people.
47. Look at applications in bioinformatics and medicine – noisy data really needs reliable probability forecasts so the user knows whether to trust the predictions! Obtain results with time series data. Investigate further relationships with psychology. Apply the VPM recursively to improve reliability and accuracy.