Motivation
In supervised learning, combining the predictions of several
randomized models often achieves better results than a single
non-randomized model.
Why?
Supervised learning
The inputs are random variables $X = X_1, \ldots, X_p$;
The output is a random variable $Y$.
Data comes as a finite learning set
$${\cal L} = \{(x_i, y_i) \mid i = 0, \ldots, N-1\},$$
where $x_i \in {\cal X} = {\cal X}_1 \times \ldots \times {\cal X}_p$ and $y_i \in {\cal Y}$ are randomly drawn
from $P_{X,Y}$.
The goal is to find a model $\varphi_{\cal L} : {\cal X} \mapsto {\cal Y}$ minimizing
$$\text{Err}(\varphi_{\cal L}) = E_{X,Y}\{L(Y, \varphi_{\cal L}(X))\}.$$
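To make the objective concrete, here is a minimal sketch (not from the slides) that approximates $\text{Err}(\varphi_{\cal L})$ by the average loss on a large held-out sample drawn from the same distribution; the toy distribution in `draw_samples` and the majority-class baseline standing in for $\varphi_{\cal L}$ are assumptions for illustration only.

```python
import numpy as np

rng = np.random.RandomState(0)

def draw_samples(n):
    # Hypothetical joint distribution P_{X,Y} over p = 2 inputs (assumption).
    X = rng.uniform(size=(n, 2))
    y = (X[:, 0] <= 0.7).astype(int)  # deterministic toy relation
    return X, y

X_train, y_train = draw_samples(200)    # the finite learning set L
X_test, y_test = draw_samples(100_000)  # large i.i.d. sample to approximate E_{X,Y}

# Any model phi_L fit on L would do; here, a trivial majority-class baseline.
majority = np.bincount(y_train).argmax()
y_pred = np.full_like(y_test, majority)

# Monte Carlo estimate of E_{X,Y}{1(Y != phi_L(X))}.
err = np.mean(y_pred != y_test)
print(f"estimated Err(phi_L) ~= {err:.3f}")
```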
Classification
Symbolic output (e.g., ${\cal Y} = \{\text{yes}, \text{no}\}$)
Zero-one loss
$$L(Y, \varphi_{\cal L}(X)) = 1(Y \neq \varphi_{\cal L}(X))$$
Regression
Numerical output (e.g., ${\cal Y} = \mathbb{R}$)
Squared error loss
$$L(Y, \varphi_{\cal L}(X)) = (Y - \varphi_{\cal L}(X))^2$$
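Both losses, averaged over a sample, are one-liners; the toy arrays below are illustrative values, not data from the slides.

```python
import numpy as np

# Average zero-one loss on symbolic outputs.
y_true = np.array([1, 0, 1, 1])
y_pred = np.array([1, 1, 1, 0])
print(np.mean(y_true != y_pred))        # 0.5

# Average squared error loss on numerical outputs.
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.5])
print(np.mean((y_true - y_pred) ** 2))  # 0.1
```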
Decision trees
[Figure: a decision tree over two inputs $X_1$ and $X_2$. The root $t_1$ tests $X_1 \leq 0.7$ and node $t_2$ tests $X_2 \leq 0.5$; leaves $t_3$, $t_4$, $t_5$ partition the input space. A sample $\boldsymbol{x}$ is routed from the root to a leaf, which yields $p(Y = c \mid X = \boldsymbol{x})$. Split nodes and leaf nodes are marked.]
$t \in \varphi$: nodes of the tree $\varphi$
$X_t$: split variable at $t$
$v_t \in \mathbb{R}$: split threshold at $t$
$\varphi(x) = \arg\max_{c \in {\cal Y}} p(Y = c \mid X = x)$
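A minimal sketch of how such a tree produces a prediction, hard-coding the two splits from the figure; the leaf probabilities are made-up values for illustration, not learned from data.

```python
def predict_proba(x):
    """Route a sample through the tree of the figure and return p(Y=c|X=x)."""
    if x[0] <= 0.7:          # split node t1: X_1 <= 0.7
        if x[1] <= 0.5:      # split node t2: X_2 <= 0.5
            return {"yes": 0.9, "no": 0.1}  # leaf t3 (illustrative values)
        return {"yes": 0.2, "no": 0.8}      # leaf t4
    return {"yes": 0.4, "no": 0.6}          # leaf t5

def predict(x):
    # phi(x) = arg max_c p(Y = c | X = x)
    proba = predict_proba(x)
    return max(proba, key=proba.get)

print(predict([0.3, 0.2]))  # routed t1 -> t2 -> t3, predicts "yes"
```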
Bias-variance decomposition in regression
Theorem. For the squared error loss, the bias-variance
decomposition of the expected generalization error at $X = x$ is
$$E_{\cal L}\{\text{Err}(\varphi_{\cal L}(x))\} = \text{noise}(x) + \text{bias}^2(x) + \text{var}(x)$$
where, with $\varphi_B$ the Bayes model,
$$\text{noise}(x) = \text{Err}(\varphi_B(x)),$$
$$\text{bias}^2(x) = (\varphi_B(x) - E_{\cal L}\{\varphi_{\cal L}(x)\})^2,$$
$$\text{var}(x) = E_{\cal L}\{(E_{\cal L}\{\varphi_{\cal L}(x)\} - \varphi_{\cal L}(x))^2\}.$$
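The three terms can be estimated by simulation over many learning sets. A minimal sketch, assuming a toy data-generating process ($f = \sin$ plus Gaussian noise, so $\varphi_B(x) = f(x)$ and $\text{noise}(x) = \sigma^2$) and a deliberately simple linear model standing in for $\varphi_{\cal L}$:

```python
import numpy as np

rng = np.random.RandomState(0)
x = 0.5                # the fixed point at which we decompose the error
f = np.sin             # assumed true regression function; phi_B(x) = f(x)
sigma_noise = 0.3

def fit_and_predict(x_point):
    # Draw a fresh learning set L and fit a simple model phi_L (degree-1 fit).
    X = rng.uniform(0, np.pi, size=50)
    y = f(X) + rng.normal(0, sigma_noise, size=50)
    coeffs = np.polyfit(X, y, deg=1)
    return np.polyval(coeffs, x_point)

preds = np.array([fit_and_predict(x) for _ in range(2000)])  # phi_L(x) over many L

noise = sigma_noise ** 2            # Err(phi_B(x)) under Gaussian noise
bias2 = (f(x) - preds.mean()) ** 2  # (phi_B(x) - E_L{phi_L(x)})^2
var = preds.var()                   # E_L{(E_L{phi_L(x)} - phi_L(x))^2}
print(noise, bias2, var, "sum:", noise + bias2 + var)
```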
Diagnosing the generalization error of a decision tree
(Residual error: the lowest achievable error, independent of $\varphi_{\cal L}$.)
Bias: Decision trees usually have low bias.
Variance: They often suffer from high variance.
Solution: Combine the predictions of several randomized trees
into a single model.
Random forests
[Figure: an ensemble of $M$ randomized trees $\varphi_1, \ldots, \varphi_M$. A sample $\boldsymbol{x}$ is fed to every tree, each producing its estimate $p_{\varphi_m}(Y = c \mid X = \boldsymbol{x})$; these are averaged into the ensemble prediction $p_\psi(Y = c \mid X = \boldsymbol{x})$.]
Randomization
Bootstrap samples and random selection of $K \leq p$ split variables → Random Forests
Additionally, random selection of the threshold → Extra-Trees
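Both randomization schemes are available in scikit-learn as `RandomForestClassifier` and `ExtraTreesClassifier`; the synthetic dataset below is a stand-in for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Random Forests: bootstrap samples + random subset of K <= p split variables.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

# Extra-Trees: additionally draw the split thresholds at random.
et = ExtraTreesClassifier(n_estimators=100, max_features="sqrt", random_state=0)

for name, model in [("Random Forest", rf), ("Extra-Trees", et)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(name, scores.mean())
```

Note that scikit-learn's Extra-Trees do not bootstrap by default (`bootstrap=False`), matching the original method: their extra randomness comes from the thresholds, not from resampling.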
Bias-variance decomposition (cont.)
Theorem. For the squared error loss, the bias-variance
decomposition of the expected generalization error
$E_{\cal L}\{\text{Err}(\psi_{{\cal L},\theta_1,\ldots,\theta_M}(x))\}$ at $X = x$ of an ensemble of $M$
randomized models $\varphi_{{\cal L},\theta_m}$ is
$$E_{\cal L}\{\text{Err}(\psi_{{\cal L},\theta_1,\ldots,\theta_M}(x))\} = \text{noise}(x) + \text{bias}^2(x) + \text{var}(x),$$
where
$$\text{noise}(x) = \text{Err}(\varphi_B(x)),$$
$$\text{bias}^2(x) = (\varphi_B(x) - E_{{\cal L},\theta}\{\varphi_{{\cal L},\theta}(x)\})^2,$$
$$\text{var}(x) = \rho(x)\sigma^2_{{\cal L},\theta}(x) + \frac{1 - \rho(x)}{M}\sigma^2_{{\cal L},\theta}(x),$$
and where $\rho(x)$ is the Pearson correlation coefficient between the
predictions of two randomized trees built on the same learning set.
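A quick numeric reading of the variance term: only the $(1 - \rho(x))/M$ part averages away as $M$ grows; the correlated part $\rho(x)\sigma^2_{{\cal L},\theta}(x)$ does not. A small sketch, assuming $\sigma^2_{{\cal L},\theta}(x) = 1$ for illustration:

```python
sigma2 = 1.0  # sigma^2_{L,theta}(x), variance of one randomized model (assumed)

def ensemble_var(rho, M):
    # var(x) = rho * sigma^2 + (1 - rho) / M * sigma^2
    return rho * sigma2 + (1 - rho) / M * sigma2

for rho in (0.1, 0.5, 0.9):
    print(rho, [round(ensemble_var(rho, M), 3) for M in (1, 10, 100, 1000)])
```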
Interpretation of $\rho(x)$ (Louppe, 2014)
Theorem.
$$\rho(x) = \frac{V_{\cal L}\{E_{\theta|{\cal L}}\{\varphi_{{\cal L},\theta}(x)\}\}}{V_{\cal L}\{E_{\theta|{\cal L}}\{\varphi_{{\cal L},\theta}(x)\}\} + E_{\cal L}\{V_{\theta|{\cal L}}\{\varphi_{{\cal L},\theta}(x)\}\}}$$
In other words, it is the ratio between
the variance due to the learning set and
the total variance, accounting for random effects due to both
the learning set and the random perturbations.
$\rho(x) \to 1$ when variance is mostly due to the learning set;
$\rho(x) \to 0$ when variance is mostly due to the random
perturbations;
$\rho(x) \geq 0$.
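This ratio can be estimated by two-level Monte Carlo: resample the learning set in the outer loop and the randomization $\theta$ in the inner loop. A minimal sketch, assuming a toy data-generating process and using scikit-learn's `DecisionTreeRegressor` with `splitter="random"` as the randomized model:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
x = np.array([[0.5]])  # the fixed point at which rho(x) is estimated

def draw_learning_set():
    # Hypothetical P_{X,Y}: sin curve plus Gaussian noise (assumption).
    X = rng.uniform(0, 1, size=(100, 1))
    y = np.sin(4 * X[:, 0]) + rng.normal(0, 0.3, size=100)
    return X, y

n_sets, n_trees = 50, 50
preds = np.empty((n_sets, n_trees))
for i in range(n_sets):                    # outer level: learning sets L
    X, y = draw_learning_set()
    for j in range(n_trees):               # inner level: perturbations theta
        tree = DecisionTreeRegressor(splitter="random",
                                     random_state=i * n_trees + j)
        preds[i, j] = tree.fit(X, y).predict(x)[0]

var_L = preds.mean(axis=1).var()           # V_L{E_{theta|L}{phi_{L,theta}(x)}}
mean_var = preds.var(axis=1).mean()        # E_L{V_{theta|L}{phi_{L,theta}(x)}}
print(f"estimated rho(x) ~= {var_L / (var_L + mean_var):.2f}")
```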
Diagnosing the generalization error of random forests
Bias: identical to the bias of a single randomized tree.
Variance: $\text{var}(x) = \rho(x)\sigma^2_{{\cal L},\theta}(x) + \frac{1 - \rho(x)}{M}\sigma^2_{{\cal L},\theta}(x)$.
As $M \to \infty$, $\text{var}(x) \to \rho(x)\sigma^2_{{\cal L},\theta}(x)$.
The stronger the randomization, $\rho(x) \to 0$ and $\text{var}(x) \to 0$.
The weaker the randomization, $\rho(x) \to 1$ and $\text{var}(x) \to \sigma^2_{{\cal L},\theta}(x)$.
Bias-variance trade-off. Randomization increases bias but makes
it possible to reduce the variance of the corresponding ensemble
model through averaging.
The crux of the problem is to find the right trade-off.
Tips: tune max_features in Random Forests.
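A minimal sketch of that tip with scikit-learn's `GridSearchCV`, scanning `max_features` (the size $K$ of the random subset of split variables) from strong randomization (`1`) to none (`None`, i.e., all $p$ variables); the synthetic dataset is a stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    # K from strong to weak randomization; each K is a different trade-off.
    param_grid={"max_features": [1, "sqrt", 0.5, None]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```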
Bias-variance decomposition in classification
Theorem. In classification, the
expected generalization error $E_{\cal L}\{\text{Err}(\varphi_{\cal L}(x))\}$ at $X = x$
decomposes as follows:
$$E_{\cal L}\{\text{Err}(\varphi_{\cal L}(x))\} = P(\varphi_B(x) \neq Y) + \Phi\left(\frac{0.5 - E_{\cal L}\{\hat{p}_{\cal L}(Y = \varphi_B(x))\}}{\sqrt{V_{\cal L}\{\hat{p}_{\cal L}(Y = \varphi_B(x))\}}}\right)(2P(\varphi_B(x) = Y) - 1)$$
where $\Phi$ is the cumulative distribution function of the standard normal distribution.
For $E_{\cal L}\{\hat{p}_{\cal L}(Y = \varphi_B(x))\} > 0.5$, $V_{\cal L}\{\hat{p}_{\cal L}(Y = \varphi_B(x))\} \to 0$
makes $\Phi \to 0$ and the expected generalization error tends to
the error of the Bayes model.
Conversely, for $E_{\cal L}\{\hat{p}_{\cal L}(Y = \varphi_B(x))\} < 0.5$,
$V_{\cal L}\{\hat{p}_{\cal L}(Y = \varphi_B(x))\} \to 0$ makes $\Phi \to 1$ and the error is
maximal.
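A quick numeric check of both regimes, taking $\Phi$ as the standard normal CDF (`scipy.stats.norm.cdf`) and an assumed $P(\varphi_B(x) = Y) = 0.8$ for illustration:

```python
from scipy.stats import norm

p_bayes_correct = 0.8  # P(phi_B(x) = Y), assumed for illustration

def expected_error(mean_p_hat, var_p_hat):
    # P(phi_B(x) != Y) + Phi((0.5 - E{p_hat}) / sqrt(V{p_hat})) * (2 P(phi_B(x) = Y) - 1)
    phi = norm.cdf((0.5 - mean_p_hat) / var_p_hat ** 0.5)
    return (1 - p_bayes_correct) + phi * (2 * p_bayes_correct - 1)

# E_L{p_hat} > 0.5: shrinking the variance drives the error to the Bayes error (0.2).
for v in (0.1, 0.01, 0.0001):
    print("E{p_hat} = 0.7:", round(expected_error(0.7, v), 4))

# E_L{p_hat} < 0.5: shrinking the variance makes the error maximal (0.8) instead.
for v in (0.1, 0.01, 0.0001):
    print("E{p_hat} = 0.4:", round(expected_error(0.4, v), 4))
```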