The document discusses probability forecasting from a machine learning perspective. It describes probability forecasting as estimating the conditional probability of each possible label for a new example, rather than just predicting the most likely label, and evaluates several learners on the reliability and resolution criteria. It introduces the Probability Calibration Graph (PCG) as a visual tool for assessing reliability alone, unlike metrics such as log loss that conflate reliability and resolution. Traditional learners are found to be unreliable in their probability forecasts despite being accurate, while the Venn Probability Machine (VPM) framework produces more reliable forecasts.
14. The Online Learning Setting [Diagram of handwritten-digit examples (2 7 6 1 7 …): before the trial the new example's label is withheld (?) and the learning machine makes a prediction for it; after the trial the revealed label updates the training data for the learning machine for the next trial. The process repeats for all examples.]
24. Inspiration for PCG (Meteorology) [Plot: calibration data for precipitation forecasts, from Murphy & Winkler (1977). Reliable points lie close to the diagonal.]
25. A PCG plot of ZeroR on Abdominal Pain [Plot: predicted probability vs. empirical frequency of being correct, showing the line of calibration and the PCG coordinates.] The PCG coordinates lie close to the line of calibration, i.e. ZeroR is not accurate but it is reliable! The plot may not span the whole axis – ZeroR makes no predictions with high probability.
26. PCG: a visualisation tool and measure of reliability [Side-by-side PCG plots.] Naïve Bayes is unreliable: a forecast of 0.9 only has a 0.55 chance of being right (overestimate), and a forecast of 0.1 only has a 0.3 chance of being right (underestimate) – it over- and under-estimates its probabilities, much like real doctors! The VPM is reliable, as its PCG follows the diagonal! Absolute deviation statistics:

                  Naïve Bayes   VPM Naïve Bayes
  Total           2764.5        496.7
  Mean            0.0483        0.0087
  Std deviation   0.0757        0.0112
  Max             0.4203        0.1017
  Min             4.9e-17       9.2e-8
27. Learners predicting like people! [Side-by-side plots: Naïve Bayes vs. people.] Lots of psychological research shows that people make unreliable probability forecasts.
30. Correlations of scores

  Scores                     Corr. Coeff.   Interpretation
  PCG vs. Sqr Reliability     0.76          Direct, strong
  PCG vs. Sqr Resolution      0.04          Direct, none
  PCG vs. Error               0.26          Direct, weak
  ROC vs. Sqr Reliability    -0.1           Inverse, none
  ROC vs. Sqr Resolution      0.67          Direct, strong
  ROC vs. Error              -0.52          Inverse, moderate
33. Volodya's original use of VPM [Plot: error rate and bounds against online trial number. The upper (red) and lower (green) bounds lie above and below the actual number of errors (black) made on the data.]

              Errors   Rate
  Up Error    2216.5   34.7%
  Error       1835     28.9%
  Low Error   1414.1   22.1%
34. Output from VPM compared with that of the original underlying learner. Probability forecast for each class label at each trial (key: predicted label = underlined; actual label = highlighted in yellow on the original slide).

Naïve Bayes
  Trial #  Appx     Div.    Perf.Pept.  Non.Spec  Choli   Intest.obstr  Pancr    Renal.   Dysp.   Up  Low
  5831     0.93     2.9e-9  1.7e-13     0.07      1.3e-9  2.2e-9        4.0e-11  6.3e-10  7.6e-9  NA  NA
  2490     9.4e-5   0.01    0.17        2.3e-5    0.16    0.46          0.2      2.2e-7   2.2e-4  NA  NA
  1653     3.08e-9  4.5e-6  3.3e-6      4.4e-5    0.99    4.2e-3        3.4e-3   4.1e-10  1.3e-4  NA  NA

VPM Naïve Bayes
  Trial #  Appx  Div.  Perf.Pept.  Non.Spec  Choli  Intest.obstr  Pancr  Renal.  Dysp.  Up    Low
  5831     0.53  0.01  0.0         0.42      0.01   0.01          0.0    0.01    0.01   0.68  0.41
  2490     0.02  0.03  0.10        0.07      0.05   0.15          0.08   0.09    0.4    0.71  0.07
  1653     0.03  0.0   0.03        0.08      0.73   0.0           0.04   0.01    0.09   0.82  0.08
Editor's Notes
Hello and welcome to my cake talk; cakes are situated at the front, so please feel free to munch away. The title of my talk is Reliable Probability Forecasting – a Machine Learning Perspective. I have been working on this research for about 9 months. The talk will be quite high level; if anyone wants to find out more low-level detail then you can ask questions at the end or look at my three tech reports, from which some of the material in this talk is taken. I have attempted to make this talk accessible to people outside my field and I hope that you all at least understand some part of my talk. If anything is very confusing on the slides please stop to ask questions.
So let me start by giving an overview of what I am going to talk about today; I will return to this plan as we go along. CLICK Firstly I will introduce the problem of probability forecasting, and describe how the problem is a generalisation of the standard pattern recognition problem studied in machine learning. CLICK Then I will describe the reliability and resolution criteria (proposed by research in statistics and psychology) which can be used for assessing the effectiveness of probability forecasts. CLICK I will follow this by briefly detailing my experimental design. CLICK I will then showcase current methods of assessing probabilities, namely square loss, log loss and ROC curves, and highlight the problems with these approaches for assessing reliability only. CLICK I will introduce the Probability Calibration Graph (PCG), Lindsay (2004), for solely assessing the reliability of probability forecasts. CLICK I will show how many traditional learners are unreliable yet accurate! This can seem a counter-intuitive argument, as we'll see later. CLICK I will show how the newly developed Venn Probability Machine (VPM) meta-learning framework can be extended, Lindsay (2004), and used to correct these problems with traditional learners! CLICK I will summarise which learners have been demonstrated as reliable and which are unreliable. CLICK And finally I will give a theoretical and psychological viewpoint on the reliable learners that my studies have identified.
Let's go through some initial benefits of probability forecasting. CLICK Qualified predictions are important in many real-life applications (especially medicine); it is very handy for the user to know when and how much trust can be placed in a prediction made by a learner. CLICK Having said that, most machine learning algorithms make bare predictions and don't give any indication of how likely it is that the prediction is correct; I think this is why very few learning systems are used in practice. CLICK Those learners that do make qualified predictions make no claims of how effective the measures are! For example the WEKA data mining system provides a load of tweaks to existing algorithms to output probability forecasts, but not many have any theoretical proof of their validity.
So let me just review a general problem which is commonly tackled by machine learning, namely pattern recognition. The goal of pattern recognition is quite simple: find the "best" label for each new test object. CLICK An example that I will use throughout this talk is the Abdominal Pain dataset (mainly because I believe that this research is most applicable to the medical problem domain). CLICK. The data is very noisy and complex: we have roughly 6300 patients' details collected by a hospital in Cardiff. Each patient is described using 135 properties, and associated with each patient is 1 of 9 different abdominal pain diseases (Appendicitis, Dyspepsia, Non-specific abdominal pain, Renal Colic, etc). CLICK. Relating this to the notation and jargon I will use throughout, we think of our examples as information pairs: each example represents a patient, each object x (POINT and CLICK) describes the patient's symptoms etc, and the corresponding label y (POINT and CLICK) is the diagnosis of the abdominal pain disease that they are suffering from. In a machine learning interpretation of the pattern recognition problem, the supervisor (in this case a doctor) provides a training set to learn the all-important relationship between objects and labels. CLICK. The hope is that if the training set is large and clean enough, the user will be able to input the details of a new patient and the learning algorithm will diagnose that patient by predicting 1 of the 9 possible labels. CLICK. Usually we keep back a test set to validate the predictions made by the learner so we can test the performance of the learning algorithm.
A probability forecast is an estimate of the conditional probability of a label given an observed object. CLICK TWICE. I use the hat notation [POINT] to distinguish the predicted value output by the learner from the true value determined by nature. Obviously in real-life applications I do not have access to some higher power to give me direct access to the true probability distribution, so it is awkward to check whether my forecasts are accurate. We want the learner to estimate probabilities for all class labels. CLICK Returning to our example of the abdominal pain dataset, we have our training data CLICK and the unlabelled test object CLICK. Both of these are fed into our learner Gamma. CLICK Our learner Gamma outputs probability forecasts for each possible label (i.e. disease) for that new test object (i.e. patient). CLICK 3 TIMES POINT: Remember, all the predicted probabilities output sum to one. Naturally, we predict the label with the highest associated probability. CLICK
So using the standard notation we have X as the object space and Y as the label space, so that Z = X × Y is the example space. CLICK Our learner makes probability forecasts for all possible labels. CLICK We use these probability forecasts to predict the most likely label. CLICK TWICE
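The setting just described can be sketched in a few lines. This is only an illustration, not the talk's implementation: the dict-based `forecast` and the function name `predict_label` are assumptions; the point is that the learner outputs one probability per label in Y, the forecasts sum to one, and the predicted label is the arg max.

```python
# Minimal sketch: a probability forecast assigns each label in Y a
# probability, and the prediction is the label with the highest forecast.

def predict_label(forecast):
    """forecast: dict mapping each label in Y to its predicted probability."""
    assert abs(sum(forecast.values()) - 1.0) < 1e-9  # forecasts sum to one
    return max(forecast, key=forecast.get)           # most likely label

# Hypothetical forecast for one abdominal-pain patient:
forecast = {"appendicitis": 0.73, "dyspepsia": 0.09, "renal colic": 0.18}
print(predict_label(forecast))  # appendicitis
```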
So hopefully it's clear what probability forecasting is, but how can we assess the quality or effectiveness of these forecasts? We shall now see.
Probability forecasting has been a well-studied area since the 1970s, in: CLICK Psychology, Statistics, Meteorology. These studies assessed two criteria of probability forecasts: CLICK Reliability = the probability forecasts should not lie. Resolution = the probability forecasts are practically useful.
So an informal definition of reliable probability forecasting: an event predicted with probability p should have approximately a 1-p chance of being incorrect. This property is known by many names, such as being well calibrated. CLICK Reliability is normally considered an asymptotic property (as the number of training examples tends to infinity) in statistical studies; however, the work by Volodya was able to generalise this problem to finite data. CLICK In 1985 Dawid proved that no deterministic learner can be reliable for all data – but it is still interesting to investigate the problem of reliable probability forecasting, as the work by Volodya and me shows. CLICK This property is often overlooked in practical studies! This is a real shame as I think many applications would find this property very attractive. If the probability forecasts of a learner were reliable then they would at least be trustworthy. CLICK
Now let's look at the second term, resolution. Resolution demands that the probability forecasts are practically useful, e.g. they can be used to effectively rank the labels in order of likelihood! CLICK It is closely related to classification accuracy, which is commonly studied in machine learning. CLICK It is separate from reliability; one of my papers shows that reliability and classification accuracy/resolution do not go "hand in hand". CLICK
So to recap I have detailed what probability forecasting is. And that lots of studies in different fields have identified that probability forecasting can be assessed using the reliability and resolution criteria Now I will describe how and why I conducted my experiments
I tested several learners on many datasets in the online setting (which I will explain later) CLICK ZeroR. This learner was used as a control: it is the most basic, simple learner that you could consider. We would therefore expect any other learner to have improved probability forecasts over ZeroR. ZeroR is not well-known and it comes as part of the WEKA data-mining system that I will describe later. CLICK K-Nearest Neighbour CLICK Neural Network CLICK C4.5 Decision Tree CLICK Naïve Bayes CLICK Venn Probability Machine Meta Learner. I will discuss this later, as the VPM meta-learner has been applied to all these learners here.
Traditionally most studies in machine learning are carried out in the offline learning setting (where the learning machine is provided with a fixed training and test set to evaluate). For my research I looked primarily at the online learning setting, as it fits nicely with the theory and allows you to see how the learner improves with experience. Having said that, I have conducted all these experiments in the offline setting as well and they come out the same. The crucial difference with the online setting is that we imagine that the training set provided to the learning machine is continually updated. I will now give a quick example using the handwritten digits image dataset. Our images are the objects, and the label says which digit it is. The strict online learning setting works as follows: CLICK First the learning machine makes a prediction for a new test example. CLICK Second the teacher/supervisor of the learning machine provides the true label of the example (in this case 2) and adds it to the training set for the learning machine. CLICK Finally the process is repeated for each example in the dataset, presenting each example as a "trial" in the online process.
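The strict online protocol above can be sketched as a short loop. This is a hedged illustration, not the WEKA extension itself: the names `online_protocol` and `majority_predict` are hypothetical, and a trivial majority-class learner stands in for any learning machine.

```python
from collections import Counter

# Strict online protocol: at each trial the machine predicts the new
# example's label while the label is withheld, then the true label is
# revealed and added to the training data for the next trial.

def online_protocol(examples, predict):
    """examples: list of (object, label); predict(training, x) -> label."""
    training, predictions = [], []
    for x, y in examples:
        predictions.append(predict(training, x))  # label withheld at this point
        training.append((x, y))                   # update training data
    return predictions

def majority_predict(training, x):
    """Predict the most frequent label seen so far (None before any trials)."""
    if not training:
        return None
    return Counter(y for _, y in training).most_common(1)[0][0]

print(online_protocol([(0, "a"), (1, "a"), (2, "b"), (3, "a")], majority_predict))
# → [None, 'a', 'a', 'a']
```

Note that every prediction is made before the corresponding label is seen, which is what makes online error counts (as in the VPM bound plots later) meaningful.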
To get an idea of the kind of data I tested my learning algorithms on I have compiled this slide. As you can see I tested a variety of well known datasets (varying in size, complexity and noise level), mostly benchmark data from the UCI, but I also tested some home grown favourites such as the Abdominal pain dataset. I chose to use a lot of medical datasets as these tend to be quite noisy and I believe the need for reliable probability forecasting is exemplified by this problem domain.
For the programming side, I decided to capitalise on the lovely WEKA data mining system (distributed under the GNU public licence). CLICK This package, implemented in Java, offers an extensive library of well-known machine learning algorithms. Because it's written in an object-oriented programming language, it was very easy for me to extend the existing functionality of the system; this is how I added extra algorithms such as the Venn Probability Machine that I will be talking about today. CLICK I also extended the WEKA system to allow all learners to be tested in the online learning setting (that I mentioned a few slides ago), as not many people test in this mode yet. CLICK To create all the lovely graphs I wrote some handy Matlab scripts, and all these programs are available via my website. CLICK
As I mentioned at the start all of the research I am talking about can be found in the three tech reports (details on the slide) that I have been working on for several months now. I have also tried to publish shortened versions of these papers at some of the big machine learning conferences (unsuccessfully) All tech reports will hopefully be available on the CLRC website and my own, pending review.
So going back to the plan again: we know what probability forecasting is, and that we can intuitively assess the performance of probability forecasts using two criteria, reliability and resolution. I have detailed my experimental design – what learners and data I have tested. But there are methods which are currently used in machine learning for assessing the performance of probability forecasts, and this is what we will look at now, also highlighting the problems with them.
There are many other possible loss functions… square loss CLICK and log loss CLICK. In 1982 DeGroot and Fienberg showed that all loss functions measure a mixture of reliability and resolution. Log loss punishes more harshly, and it is forced to spread its bets.
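The two loss functions can be written down in a few lines. This is a hedged sketch of the standard multi-class forms (Brier-style square loss; log loss on the true label's forecast), which may differ in detail from the exact scoring used in the tech reports.

```python
import math

# Both losses score a full probability forecast against the observed label;
# labels are indices into the forecast vector p.

def square_loss(p, true_idx):
    # Sum of squared deviations from the 0/1 outcome vector.
    return sum((pi - (1.0 if i == true_idx else 0.0)) ** 2
               for i, pi in enumerate(p))

def log_loss(p, true_idx, eps=1e-15):
    # Punishes confident mistakes much more harshly: -log of the
    # probability assigned to the true label (clamped away from zero).
    return -math.log(max(p[true_idx], eps))

p = [0.7, 0.2, 0.1]
print(square_loss(p, 0))  # ≈ 0.14
print(log_loss(p, 0))     # ≈ 0.357
```

Assigning probability 0 to the label that then occurs makes log loss infinite (hence the `eps` clamp), which is why a log-loss-scored forecaster is "forced to spread its bets".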
ROC curves check the proportion of correct versus incorrect predictions made by a learner. See tech report on PCG for more details. CLICK ROC curves are popularly used in machine learning studies to assess probability forecasts ROC is commonly used to measure the tradeoffs between false and true positive classification. We want the ROC curve to be as close to the upper left corner as possible POINT, we want it to deviate from the diagonal as much as possible. CLICK My results show that this graph tests resolution. CLICK The area under the ROC curve is often used as a measure of the quality of the probability forecasts being made. CLICK Still does not tell us how/why probability forecasts are unreliable! This has more to do with accuracy.
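As an aside, the area under the ROC curve mentioned here can be computed without drawing the curve at all, via its rank-statistic interpretation. This is an illustrative sketch for the binary case only (function name `roc_auc` is an assumption; the multi-class handling used in the experiments is not shown).

```python
# AUC as a rank statistic: the probability that a randomly chosen positive
# example is scored above a randomly chosen negative one (ties count half).
# labels: 1 = positive, 0 = negative; scores: predicted probability of positive.

def roc_auc(scores, labels):
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

print(roc_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # → 1.0 (perfect ranking)
```

Because AUC depends only on the ranking of the scores, any monotonic distortion of the probabilities leaves it unchanged, which is exactly why it measures resolution and cannot detect unreliability.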
To try and reinforce my point that error rate does not reflect the quality of the probability forecast: traditional studies would get the classification accuracy, or inversely the error rate, of our learners tested on a dataset (in this case Abdominal Pain), producing the kind of league table you see here. CLICK Here you can see that each learner is given a rank in brackets in terms of its error rate. Obviously we want the error rate to be as small as possible. So the Naïve Bayes learners are the most accurate, and the ZeroR learner is the least accurate, as we would expect. This is where the analysis would end, and the user would probably choose the most accurate learners for their practical application. However, if we look at the results of the loss functions and the areas under the ROC curves for each algorithm, we see a different story emerging! CLICK 3 TIMES. We see that these measures rank the learners differently – see how Naïve Bayes is starting to slip down the rankings, from second to seventh! Conversely, ZeroR is starting to rise – from last place to sixth. VPM Naïve Bayes remains high – and I'll discuss this later (point).
Loss functions and ROC give more information than error rate about the quality of probability forecasts. CLICK But as I said earlier, loss functions = mixture of resolution and reliability ROC curve = measures resolution CLICK Don’t have any method of solely assessing reliability CLICK Don’t have method of telling if probability forecasts are over- or under- estimated CLICK This is where I introduce my contribution to this research the Probability Calibration Graph technique.
So we know what probability forecasting is. We can assess the performance of probability forecasts using the reliability and resolution criteria. I have tested various learners on various datasets (mostly medical) We have briefly looked at traditional methods of assessing probabilities and highlighted that none solely assess reliability. Now this has set the scene for me to introduce my Probability Calibration Graph technique for visualising the reliability of probability forecasts output by learners.
So briefly, here is a scan of a graph which served as my inspiration for the PCG graph that I developed for checking the reliability of probability forecasts output by learning algorithms. This is taken from a meteorological study, Murphy and Winkler (1977), which analysed the calibration/reliability of the forecasts of the likelihood of precipitation made by the American national weather service. The graph is pretty simple: on the horizontal axis POINT is the forecast probability. The vertical axis is the observed relative frequency of precipitation (i.e. the prediction being correct). The points plotted have a number next to them indicating how many predictions were made at that predicted probability. CLICK If the forecasts are reliable then they will stick to the diagonal line POINT.
Graphs similar to the PCG plot were first used in the early 1970s by psychological and meteorological studies to assess the reliability of probability forecasts. CLICK. Here is a PCG plot: you have the predicted probability on the horizontal axis (point) versus the empirical frequency of the forecast being correct on the vertical axis (point). For more in-depth detail on its construction see my tech report; I won't bore you with the formulas. CLICK. This red line is the line of calibration, and this is the ideal line that reliable learners will stick to (predicted probability = empirical frequency). CLICK. Here are the PCG coordinates for the ZeroR learner when tested on the Abdominal Pain data. CLICK. The plot may not span the whole axis – ZeroR doesn't make any predictions with high probability (vague predictions). CLICK. The PCG coordinates lie close to the line of calibration, i.e. ZeroR is not accurate but it is reliable!
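A hedged sketch of how such coordinates might be computed; the tech report gives the exact construction, and the equal-width binning and the name `pcg_coordinates` here are my assumptions for illustration.

```python
from collections import defaultdict

# Group forecasts into bins by predicted probability; within each bin,
# pair the mean predicted probability with the empirical frequency of
# those predictions being correct. Reliable forecasts put every pair
# near the diagonal (the line of calibration).

def pcg_coordinates(forecasts, correct, n_bins=10):
    """forecasts: predicted probabilities; correct: 1 if that prediction
    was right, else 0. Returns (mean predicted, empirical frequency) pairs."""
    bins = defaultdict(list)
    for p, c in zip(forecasts, correct):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, c))
    coords = []
    for b in sorted(bins):
        ps, cs = zip(*bins[b])
        coords.append((sum(ps) / len(ps), sum(cs) / len(cs)))
    return coords
```

A forecaster that says 0.05 and is right 5% of the time, and says 0.95 and is right 95% of the time, produces coordinates exactly on the diagonal.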
Here we have two PCG plots side by side: the Naïve Bayes learner CLICK with its VPM counterpart next to it CLICK. For now I'll just give examples of how to interpret them; ideally a reliable learner gives a line close to the diagonal. This is a brief taste of things to come – I'll explain VPM later. Now from the PCG plot on the left POINT we can clearly see that the Naïve Bayes learner is producing unreliable probability forecasts, as the PCG plot (thick black line) deviates quite dramatically from the line of calibration (red diagonal line). For example CLICK unreliable forecasts are made: forecasts of 0.9 actually have only a 0.55 chance of being correct, so the learner is tending to overestimate, or be overconfident in, its predictions. You can imagine this is bad. The Naïve Bayes learner is very accurate on this abdominal pain data, and if I gave this system to a doctor they would get the learner predicting a disease with 0.9 when actually there is a lot less chance of the patient having that disease, which could lead to improper treatment. CLICK On the flip side CLICK forecasts of 0.1 actually have a 0.3 chance of being correct; this is evidence of underestimation, where the learner is being under-confident in its predictions. CLICK And this pattern of over- and under-confidence is actually reported as the behaviour of people CLICK when asked to make estimates of probability. Doctors especially have been known to produce the sort of PCG graphs made by Naïve Bayes, which is a bit worrying I think. The Probability Calibration Graph (PCG) is a useful visualisation technique for seeing how reliable probability forecasts are, and can also be used to calculate useful measures, e.g. statistics about the deviations are given beneath each PCG plot. CLICK X2 These are useful when there is not much between PCG plots to distinguish which learner is more reliable; for this I calculate various statistics such as the total, mean, standard deviation etc.
These statistics are calculated from the absolute deviation of the PCG plot from the diagonal line of calibration. This graph has wide applications and is the first to solely concentrate on reliability. We can see clearly that the Naïve Bayes classifier is unreliable. To understand how to interpret the graphs, [POINT] the horizontal axis is the predicted probability, and the vertical axis is the empirical frequency of the predictions made at that predicted probability being correct. So in a nutshell the Naïve Bayes learner is unreliable, but the VPM Naïve Bayes learner CLICK is reliable, as its PCG plot sticks close to the line of calibration. We can see this also in the statistical measures in the tables below POINT.
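The deviation summaries quoted beneath each PCG plot could be computed along these lines. A hedged sketch: the function name and the exact set of statistics are assumptions based on the totals, means, standard deviations, minima and maxima shown on the slides.

```python
import statistics

# Summarise the absolute deviations of PCG coordinates from the line of
# calibration (predicted probability = empirical frequency). Smaller
# deviations mean a more reliable forecaster.

def pcg_deviation_stats(coords):
    """coords: list of (predicted probability, empirical frequency) pairs."""
    devs = [abs(freq - pred) for pred, freq in coords]
    return {"total": sum(devs),
            "mean": statistics.mean(devs),
            "std": statistics.stdev(devs),
            "min": min(devs),
            "max": max(devs)}
```

For the Naïve Bayes example above, coordinates like (0.1, 0.3) and (0.9, 0.55) give deviations of 0.2 and 0.35 – large, visible departures from the diagonal.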
Returning to the PCG plot of the Naïve Bayes learner CLICK on the abdominal pain data: as I said on the previous slide, there is a lot of psychological research and evidence that doctors and many other people make unreliable probability forecasts. CLICK Here is a PCG-like plot created back in 1977, taken from a psychological journal; notice the similarity in the shape of the graph to the Naïve Bayes PCG plot. CLICK There are lots of graphs like this in psychology research, and lots of interpretation as to why people predict unreliably, and I think these results are interesting for us as practitioners in machine learning.
So we know what probability forecasting is. We can assess the performance of probability forecasts using the reliability and resolution criteria. I have tested various learners on various datasets (mostly medical) We have briefly looked at traditional methods of assessing probabilities and highlighted that none solely assess reliability. We have seen that the PCG technique offers a useful solution to this problem as it gives intuitive visualisation and measures of reliability. Now I will give a brief summary of the results I found, importantly that reliability and classification accuracy do not go hand in hand. i.e. you can have a not very accurate learner that is reliable (ZeroR), and vice versa you can have a learner with good classification accuracy but poor reliability (Naïve Bayes).
Let's return to those results that we saw earlier of various learners on the abdominal pain dataset. This time we will add the PCG total deviation scores. Once again, with each learner's score is a ranking of how good that score was compared to the others; this is given as a number in brackets POINT. As I said earlier, you can use the statistics of the deviation of the PCG plot from the line of calibration as a measure of reliability; in the table above I have given the total absolute deviation. We want the deviation to be as small as possible, and this is how we order the PCG deviations. We can see a very different ordering of the learners from the error rate. Notice that ZeroR CLICK POINT is ranked last in terms of error rate; it gets around 55% of patients misdiagnosed. This is of no surprise as the learner is very simple: it outputs probability forecasts which are just frequency counts of labels in the training data, using no information about the patient to diagnose. Yet ZeroR is quite respectably ranked 3rd in terms of reliability. So ZeroR may not be accurate (or resolute) but it is reliable. Conversely the Naïve Bayes learner CLICK is very accurate, ranked a close 2nd with only 29% errors, but its reliability is ranked 7th – so Naïve Bayes is accurate but not reliable! CLICK Concentrating on the PCG deviation (i.e. reliability) we can see a significant re-ordering of the learners. CLICK. In the top 5 learners are ZeroR, various VPM implementations (I'll explain those later) and a K-NN learner. We shall see later that all these learners have theoretical and psychological justifications of reliability. So in summary the PCG gives us a visualisation and a measure of reliability, and we can see that reliability and accuracy do not go hand in hand.
For more evidence that PCG is a measure of reliability, check out this result taken from my tech reports. Imagine lots and lots of those kinds of tables on the previous slide, over many datasets. In the PCG tech report I gave a broad review of all the traditional methods of assessing probabilities. I decomposed the square loss function into its reliability and resolution components, and then calculated the correlation between these scores and other methods such as PCG, ROC etc. Of most particular interest are the relationships highlighted [CLICK]: the PCG correlates with reliability, and ROC correlates with resolution. These results matter because many in the machine learning field think that ROC measures reliability, when in fact it measures the other useful property, resolution. We also notice that there is only a weak relationship between error rate and PCG, indicating that classification accuracy cannot guarantee that the probability forecasts are reliable. Conversely ROC has CLICK a moderate correlation with error rate, indicating that error rate is more closely related to resolution. These results were submitted to ICML and rejected.
So we know what probability forecasting is. We can assess the performance of probability forecasts using the reliability and resolution criteria. I have tested various learners on various datasets (mostly medical). We have briefly looked at traditional methods of assessing probabilities and highlighted that none solely assess reliability. We have seen that the PCG technique offers a useful solution to this problem as it gives intuitive visualisation and measures of reliability. We have seen that reliability and classification accuracy do not go hand in hand. Now it's time to fill in the gaps and explain how and why the VPM was extended by me for probability forecasting.
The VPM can be applied on top of any existing learning algorithm. CLICK TWICE Vovk introduced the VPM and originally used it to output provably valid bounds for conditional probabilities. CLICK However these bounds had limited practical use because… So I extended the VPM, Lindsay (2004), to: CLICK extract more information from the probability forecasts output by the VPM learner by: CLICK outputting probability forecasts for all possible labels CLICK and predicting a label using these probability forecasts. I should point out that my extended VPM hasn't lost the ability to produce bounds. CLICK
As I said on previous slide, the VPM was originally used to calculate bounds for the probability of a predicted label made by a VPM being correct. If we invert these bounds (eg. 1-p) then this gives us probability bounds for the prediction being incorrect and so Volodya created nice graphs like this CLICK and POINT to validate the incredibly complicated theory behind VPM. Here you can see that the upper bounds in red and the lower bounds in green lie above and below the actual number of errors in black that are made on the data CLICK This is great as the theory is nicely demonstrated by practical experiments, but as we can see these bounds can be quite loose which limits their practical usefulness as we shall see on the next slide.
I am going to show you an example which clearly indicates the practical usefulness of the extended VPM's probability forecasts for each class label, as compared to the predicted bounds discussed previously. Here we have predictions made by the Naïve Bayes learner CLICK and its VPM counterpart CLICK for the same trials in the online process. Actual labels (the true disease for that patient at that trial) are indicated in yellow, and the predicted label made by the learner is emboldened and underlined. At a glance it is obvious that the predicted probabilities output by the Naive Bayes learner are far more extreme (i.e. very close to 0 or 1) than those output by its VPM counterpart. For example, trial 1653 shows a patient object which is predicted correctly by Naive Bayes with p=0.99 and less emphatically by its VPM counterpart with p=0.73 CLICK. Remember this data is very noisy, so it is very unlikely that any prediction can be made with a 0.99 chance of being correct! CLICK Trial 2490 demonstrates the problem of over- and under-estimation by the Naive Bayes learner, where a patient is incorrectly diagnosed with Intestinal obstruction (overestimation), yet the true diagnosis of Dyspepsia is ranked 6th CLICK with a very low predicted probability of p=2.2e-4 (underestimation). In contrast the VPM Naive Bayes learner makes more reliable probability forecasts; for trial 2490 the true class is correctly predicted, albeit with a lower predicted probability of 0.4 CLICK POINT. Trial 5831 demonstrates a situation where both learners encounter an error in their predictions. The Naive Bayes learner gives misleading predicted probabilities of CLICK p=0.93 for the incorrect diagnosis of Appendicitis, and a mere p=0.07 for the true class label of Non-specific abdominal pain.
In contrast, even though the VPM Naive Bayes learner incorrectly predicts Appendicitis, it is with far less certainty CLICK p=0.53 and if the user were to look at all probability forecasts CLICK it would be clear that the true class label should not be ignored with a predicted probability p=0.42.
So to quickly recap on everything so far: we know what probability forecasting is. We can assess the performance of probability forecasts using the reliability and resolution criteria. I described how I tested various learners on various datasets. We have briefly looked at traditional methods of assessing probabilities and highlighted that none solely assess reliability. We have seen that the PCG technique offers a useful solution to this problem as it gives intuitive visualisation and measures of reliability. We have seen that reliability and classification accuracy do not go hand in hand. I have told you how the VPM was extended by me for probability forecasting, and also compared its forecasts with those of the underlying learner. Now I will explain which of the learners I tested have been found to be reliable, and which are unreliable.
Here are some PCG plots assessing the reliability of probability forecasts output by the ZeroR learner on various datasets. CLICK 3 TIMES ZeroR outputs probability forecasts which are mere label frequencies; it predicts the majority class at each trial. It uses no information about the objects in its learning – the simplest of all learners. Accuracy is poor, but reliability is good, as you can see from the PCG plots above: they are tight, albeit over a small range of predicted probabilities. ONLY SAY BELOW BIT IF TIME!! ZeroR acts as a control in my experiments: all learners should at least beat ZeroR in classification accuracy. People often overlook this classifier, which I think is a bit stupid if your data is heavily imbalanced. For example, consider a dataset for a rare disease where 90% are normal and 10% have the disease; a majority classifier like ZeroR can then achieve 90% classification accuracy, so if any significant learning is taking place then a learner must beat this!
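ZeroR as a probability forecaster fits in a few lines. A sketch under the description above (the name `zeror_forecast` is mine): it ignores the object entirely and forecasts the label frequencies seen in the training data.

```python
from collections import Counter

# ZeroR: forecast each label with its relative frequency in the training
# set, using no information about the object at all.

def zeror_forecast(training_labels):
    counts = Counter(training_labels)
    n = len(training_labels)
    return {label: c / n for label, c in counts.items()}

print(zeror_forecast(["a", "a", "a", "b"]))  # → {'a': 0.75, 'b': 0.25}
```

This also shows why ZeroR is reliable but vague: its forecasts track the empirical label frequencies by construction, but it never issues a high-probability prediction unless one class dominates the data.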
K-NN finds the subset of the K closest (nearest neighbouring) examples in the training data using a distance metric, then counts the label frequencies amongst this subset. It acts like a more sophisticated version of ZeroR that uses the information held in the object. An appropriate choice of K must be made to obtain reliable probability forecasts; this choice depends on the size, complexity and noise level of the data, and is mainly found by trial and error! In general, the larger K is, the more reliable the learner, but this can dramatically decrease classification accuracy.
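The K-NN forecasting rule just described can be sketched in a few lines; this is a minimal illustration assuming Euclidean distance (the function name and toy data are my own, not from the experiments).

```python
import math
from collections import Counter

def knn_forecast(train, test_obj, k):
    """K-NN probability forecast: rank training examples by Euclidean
    distance to the test object, keep the k closest, and use the label
    frequencies amongst those neighbours as the forecast."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    neighbours = sorted(train, key=lambda ex: dist(ex[0], test_obj))[:k]
    counts = Counter(label for _, label in neighbours)
    return {label: c / k for label, c in counts.items()}

# Toy data: two clusters; the test point sits in the "A" cluster.
train = [((0.0, 0.0), "A"), ((0.1, 0.0), "A"),
         ((1.0, 1.0), "B"), ((0.9, 1.1), "B")]
forecast = knn_forecast(train, (0.05, 0.05), k=3)
```

Note how K controls the trade-off mentioned above: with k equal to the training set size this degenerates to ZeroR's label frequencies, while very small k gives sharp but potentially unreliable forecasts.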
Traditional learners can be very unreliable (yet accurate); it really depends on the dataset being used. CLICK My research shows empirically that the VPM consistently outputs reliable probability forecasts. CLICK This extended VPM can also recalibrate a learner's original probability forecasts to make them more reliable! CLICK This improvement in reliability made by VPM is often without detriment to classification accuracy. CLICK For example, look at these PCG plots showing the improvement in reliability before and after VPM implementation CLICK EIGHT TIMES. We have the traditional learners (Naïve Bayes, Neural Net, Decision Tree and 1-NN) on the top row of PCG plots, with the VPM implementations underneath on the second row.
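For readers wondering how such PCG plots are computed, here is one plausible construction, sketched under the assumption that PCG coordinates are built like a reliability diagram: bin the predicted probabilities and pair each bin's mean prediction with the empirical frequency of correct predictions in that bin (the binning scheme and function name are my assumptions, not necessarily the exact method used in the talk).

```python
def pcg_coordinates(forecasts, correct, n_bins=10):
    """Sketch of Probability Calibration Graph coordinates: bin the
    predicted probabilities; for each non-empty bin, pair the mean
    predicted probability with the empirical frequency of being
    correct.  A reliable forecaster yields points near the diagonal."""
    bins = [[] for _ in range(n_bins)]
    for p, ok in zip(forecasts, correct):
        i = min(int(p * n_bins), n_bins - 1)  # clamp p == 1.0 into last bin
        bins[i].append((p, ok))
    coords = []
    for b in bins:
        if b:
            mean_p = sum(p for p, _ in b) / len(b)
            freq = sum(1 for _, ok in b if ok) / len(b)
            coords.append((mean_p, freq))
    return coords

# An unreliable forecaster: predictions of 0.9 are right only half the time.
coords = pcg_coordinates([0.9, 0.9, 0.1, 0.1],
                         [True, False, False, False])
```

Plotting these (mean predicted probability, empirical frequency) pairs against the diagonal line of calibration gives exactly the kind of visual reliability check shown in the slides.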
So, to quickly recap the later points: we have seen that ZeroR, K-NN and VPM are reliable probability forecasters, and that traditional learners can produce very unreliable probability forecasts! Now I will briefly give psychological and theoretical viewpoints on why these learners are reliable.
There are many psychological studies interested in the problem of making effective judgements under uncertainty. When faced with the difficult task of judging probability, people employ a limited number of heuristics which reduce the judgements to simpler ones. CLICK Many heuristics have been identified, some of which are given here: Availability: an event is judged more likely to occur if it has occurred frequently in the past. Representativeness: one compares the essential features of the event to the structure of previous events. Simulation: the ease with which the simulation of a system of events reaches a particular state can be used to judge the propensity of the (real) system to produce that state. Generally, the more heuristics applied, the more robust and reliable the probability forecasts are.
I showed empirically that the ZeroR, K-NN and VPM learners are reliable probability forecasters. We can identify these heuristics in these learning algorithms. Remember, psychological research states: CLICK more heuristics, more reliable forecasts.
The simplest of all reliable probability forecasters uses 1 heuristic: CLICK the learner merely counts the labels it has observed so far, and uses the frequencies of those labels as its forecasts (Availability).
More sophisticated than the ZeroR learner, the K-NN learner uses 2 heuristics: CLICK it uses the distance metric to find the subset of the K closest examples in the training set (Representativeness). CLICK It then counts the label frequencies in the subset of K nearest neighbours to make its forecasts (Availability).
Even more sophisticated, the VPM meta-learner uses all 3 heuristics: CLICK the VPM tries each new test example with all possible classifications (Simulation). CLICK Then, under each tentative simulation, it clusters similar training examples into groups (Representativeness). CLICK Finally, the VPM calculates the frequency of labels in each of these groups to make its forecasts (Availability).
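The three-step VPM procedure above can be sketched very roughly as follows. This is a deliberately simplified illustration of the Venn-predictor idea only: the taxonomy (grouping) function here is a toy threshold, and a real VPM would turn the resulting frequency rows into lower and upper probability bounds rather than stopping at raw frequencies.

```python
from collections import Counter

def venn_forecast(train, test_obj, labels, taxonomy):
    """Simplified Venn-predictor sketch.  For each tentative label y
    (Simulation), add (test_obj, y) to the training set, group all
    examples with the taxonomy function (Representativeness), and read
    off the label frequencies in the test example's group (Availability).
    Returns one frequency row per tentative label."""
    rows = {}
    for y in labels:
        extended = train + [(test_obj, y)]
        key = taxonomy(test_obj, extended)
        group = [lab for obj, lab in extended
                 if taxonomy(obj, extended) == key]
        counts = Counter(group)
        rows[y] = {lab: counts[lab] / len(group) for lab in labels}
    return rows

# Toy example: group objects by whether their single feature exceeds 0.5.
train = [((0.0,), "A"), ((0.1,), "A"), ((1.0,), "B")]
rows = venn_forecast(train, (0.05,), labels=["A", "B"],
                     taxonomy=lambda obj, ex: obj[0] > 0.5)
```

Each row shows what the label frequencies in the test example's group would be if that tentative label were the truth; the spread across rows is what gives the VPM its built-in reliability guarantees.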
CLICK ZeroR can be proven to be asymptotically reliable (and experiments show it also performs well on finite data). CLICK K-NN has a large body of theory (Stone, 1977) to support its convergence to the true probability distribution. CLICK VPM has a lot of theoretical justification for finite data using martingales; I am still trying to decipher Volodya's proofs!
CLICK Probability forecasting is useful for real-life applications, especially medicine. CLICK We want learners to be reliable and accurate. CLICK The PCG can be used to check reliability. CLICK ZeroR, K-NN and VPM provide consistently reliable probability forecasts. CLICK Traditional learners (Naïve Bayes, Neural Net and Decision Tree) can provide unreliable forecasts. CLICK VPM can be used to improve the reliability of probability forecasts without detriment to classification accuracy.
And finally, I'd like to thank the following people.
Look at applications in bioinformatics and medicine: noisy data really needs reliable probability forecasts so the user can know whether to trust predictions! Obtain results with time-series data. Investigate further relationships with psychology. Explore recursive application of VPM to improve reliability and accuracy.