XGBoost: the algorithm that wins every competition
Poznań University of Technology; April 28th, 2016
meet.ml #1 - Applied Big Data and Machine Learning
By Jarosław Szymczak
Data science competitions
Some recent competitions from Kaggle:
• (binary classification problem)
• Product Classification Challenge (multi-class classification problem)
• Store Sales (regression problem)
Some recent competitions from Kaggle
Features:
93
Some recent competitions from Kaggle
CTR – click-through rate (approximately equal to the probability of a click)
Some recent competitions from Kaggle
Sales prediction for a store on a given date
Features:
• store id
• date
• school and state holidays
• store type
• assortment
• promotions
• competition
Solutions evaluation
[Chart: logarithmic loss for a positive example as a function of the predicted probability]
• Root Mean Square Percentage Error (RMSPE)
• Binary logarithmic loss (loss is capped at ~35 for a single observation)
• Multi-class logarithmic loss
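A minimal sketch of these measures, assuming numpy (the ~35 cap comes from clipping predictions away from 0 and 1):

```python
import numpy as np

def binary_logloss(y_true, y_pred, eps=1e-15):
    # Clipping to [eps, 1 - eps] caps the per-observation loss
    # at -log(1e-15), roughly 34.5.
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def rmspe(y_true, y_pred):
    # Root Mean Square Percentage Error, as used for the sales prediction task.
    return np.sqrt(np.mean(((y_true - y_pred) / y_true) ** 2))
```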
What algorithms do we tackle these problems with?
Roadmap: bagging, random forest, adaptive boosting, gradient boosting — and the bias-related and variance-related errors they address
Shall we play tennis?
Decision tree standard example – shall we play tennis?
But a single tree is simply not enough for complicated problems, and we will need to reduce errors coming from:
• variance
• bias
Bagging – bootstrap aggregation
Our initial dataset: 1 2 3 4 5 6
Four draws with replacement (e.g. 2 3 3 4 5 5 and 1 1 3 4 4 6) each become the training set for one model (Model 1 … Model 4); the final model is the average of the four.
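A minimal sketch of the procedure above, assuming scikit-learn decision trees as the base model:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def bagging_fit(X, y, n_models=4, seed=0):
    rng = np.random.RandomState(seed)
    models = []
    for _ in range(n_models):
        # each draw samples the dataset with replacement (a bootstrap sample)
        idx = rng.randint(0, len(X), size=len(X))
        models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    # the final model is the average of the individual models
    return np.mean([m.predict(X) for m in models], axis=0)
```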
Random forest: bagging of decision trees, with an additional random subset of features considered at each split
AdaBoost – adaptive boosting
[Diagram: a toy set of + and − examples to be separated by a sequence of weak classifiers]
AdaBoost – adaptive boosting
[Diagram: three weak classifiers h₁, h₂, h₃ fitted to successively reweighted copies of the + / − examples]
h₁: ε₁ = 0.30, α₁ = 0.42
h₂: ε₂ = 0.21, α₂ = 0.65
h₃: ε₃ = 0.14, α₃ = 0.92
H = 0.42·h₁ + 0.65·h₂ + 0.92·h₃
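The classifier weights on the slide follow the standard AdaBoost formula α = ½·ln((1−ε)/ε); a small sketch reproducing them:

```python
import numpy as np

def adaboost_alpha(eps):
    # weight of a weak classifier, given its weighted error rate
    return 0.5 * np.log((1 - eps) / eps)

for i, eps in enumerate([0.30, 0.21, 0.14], start=1):
    print(f"h{i}: eps = {eps:.2f} -> alpha = {adaboost_alpha(eps):.2f}")
# prints alphas of about 0.42, 0.66 and 0.91, matching the slide up to rounding
```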
Another boosting approach
What if, instead of reweighting examples, we made corrections to the prediction errors directly?
We add a new model to the one we already have, so:
Y_pred = Model₁(X) + NEW(X)
NEW(X) ≈ Y − Model₁(X)   (our residual)
Note: the residual is (up to sign) the gradient of a single observation's contribution to one of the most common evaluation measures for regression: root mean squared error (RMSE).
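A minimal sketch of one such correction step for squared error, assuming scikit-learn trees:

```python
from sklearn.tree import DecisionTreeRegressor

def boost_once(X, y, current_pred):
    # the pseudo-residual for squared error is simply y - prediction
    residual = y - current_pred
    new_model = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    return current_pred + new_model.predict(X), new_model
```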
Gradient descent
Gradient boosting
Fit a model to the initial data. It can be:
• the same algorithm as for the further steps
• or something very simple (like uniform probabilities, or the average target in regression)
Fit the pseudo-residuals. This works for any loss function that:
• aggregates the error from examples (e.g. log-loss or RMSE, but not AUC)
• lets you calculate the gradient at the example level (this gradient is called the pseudo-residual)
Finalization: sum up all the models.
XGBoost
eXtreme Gradient Boosting, by Tianqi Chen
Used for:
• classification
• regression
• ranking
with custom loss functions
Interfaces for Python and R; can be executed on YARN
Custom tree building algorithm
XGBoost – handling the features
Numeric values
• for each numeric feature, XGBoost finds the best available split (it is always a binary split)
• the algorithm is designed to work with numeric values only
Nominal values
• need to be converted to numeric ones
• the classic way is to perform one-hot encoding / get dummies (for all values)
• for variables with large cardinality, other, more sophisticated methods may be needed
Missing values
• XGBoost handles missing values separately
• a missing value is always sent to the branch that minimizes the loss (so it is treated as either a very large or a very small value)
• this is a great advantage over, e.g., the scikit-learn gradient boosting implementation
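A hedged sketch of this preprocessing with pandas and the native DMatrix (file and column names are hypothetical):

```python
import numpy as np
import pandas as pd
import xgboost as xgb

df = pd.read_csv("train.csv")                    # hypothetical file
df = pd.get_dummies(df, columns=["store_type"])  # one-hot encode a nominal column
dtrain = xgb.DMatrix(df.drop("target", axis=1),
                     label=df["target"],
                     missing=np.nan)  # NaNs are routed to the loss-minimizing branch
```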
XGBoost – loss functions
Used as optimization objective
• logloss (optimized in a complicated manner) for objectives binary:logistic and reg:logistic (they differ only by evaluation measure; all the rest is the same)
• softmax (logloss transformed to handle multiple classes) for objective multi:softmax
• RMSE (root mean squared error) for objective reg:linear
Only measured
• AUC – area under ROC curve
• MAE – mean absolute error
• error / merror – error rate for binary / multi-class classification
Custom
• only measured: simply implement the evaluation function
• used as optimization objective: provided that your measure is an aggregation of element-wise losses and is differentiable,
• you create a function for the element-wise gradient (first derivative with respect to the prediction)
• and one for the hessian (second derivative with respect to the prediction)
• that's it – the algorithm will approximate your loss function with a Taylor series (the pre-defined loss functions are implemented in the same manner; see the sketch below)
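As a sketch of what such a custom objective looks like in the native Python API, here is plain logistic loss reimplemented by hand (the obj/feval arguments of xgb.train are the standard hooks for this):

```python
import numpy as np
import xgboost as xgb

def logistic_obj(preds, dtrain):
    labels = dtrain.get_label()
    p = 1.0 / (1.0 + np.exp(-preds))  # raw margins -> probabilities
    grad = p - labels                 # element-wise first derivative of logloss
    hess = p * (1.0 - p)              # element-wise second derivative (hessian)
    return grad, hess

def logloss_eval(preds, dtrain):
    labels = dtrain.get_label()
    p = np.clip(1.0 / (1.0 + np.exp(-preds)), 1e-15, 1 - 1e-15)
    return "logloss", float(-np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p)))

# bst = xgb.train(params, dtrain, num_boost_round=100,
#                 obj=logistic_obj, feval=logloss_eval)
```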
XGBoost – setup for Python environment
Installing XGBoost in your environment
Check out this wonderful tutorial to install Docker with the Kaggle Python stack: http://goo.gl/wSCkmC
Official build instructions: http://xgboost.readthedocs.org/en/latest/build.html (expect potential trouble)
Or clone https://github.com/dmlc/xgboost at revision 219e58d453daf26d6717e2266b83fca0071f3129 and follow the instructions in it (it is, of course, an older version)
XGBoost code example on Otto data
Files: train dataset, test dataset, sample submission
XGBoost – simple use example (direct)
Link to script: https://goo.gl/vwQeXs
Notes from the slide:
• for direct use we need to specify the number of classes
• scikit-learn conventional names are used for the data
• to account for example importance, we can assign weights to examples in the DMatrix (not done here)
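Since the short link may rot, here is a representative sketch of the direct API on Otto-style data (file and column names are assumptions, not the linked script):

```python
import pandas as pd
import xgboost as xgb

train = pd.read_csv("train.csv")                  # hypothetical Otto-style file
y = train["target"].astype("category").cat.codes  # class labels -> 0..8
X = train.drop(["id", "target"], axis=1)

dtrain = xgb.DMatrix(X, label=y)  # per-example weights could be passed here
params = {"objective": "multi:softprob", "num_class": 9,
          "eval_metric": "mlogloss", "eta": 0.1, "max_depth": 6}
bst = xgb.train(params, dtrain, num_boost_round=100)
```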
XGBoost – feature importance
Link to script: http://pastebin.com/uPK5aNkf
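The built-in helpers cover the same ground as the linked script; a sketch (matplotlib assumed, bst as trained above):

```python
import matplotlib.pyplot as plt
import xgboost as xgb

scores = bst.get_fscore()  # how often each feature was used in a split
xgb.plot_importance(bst, max_num_features=20)
plt.show()
```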
XGBoost – simple use example (in scikit-learn style)
Link to script: https://goo.gl/IPxKh5
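A sketch of the same model through the scikit-learn wrapper (variable names assumed from the direct example above):

```python
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)
clf = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=6)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_valid)  # one probability column per class
```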
XGBoost – most common parameters for tree booster (a combined configuration sketch follows the regularization parameters below)
subsample
• ratio of instances to take
• example values: 0.6, 0.9
colsample_bytree
• ratio of columns used for whole-tree creation
• example values: 0.1, …, 0.9
colsample_bylevel
• ratio of columns sampled at each split level
• example values: 0.1, …, 0.9
eta
• how fast the algorithm learns (shrinkage)
• example values: 0.01, 0.05, 0.1
max_depth
• maximum number of consecutive splits
• example values: 1, …, 15 (for a dozen or so or more, it needs to be set together with the regularization parameters)
min_child_weight
• minimum total weight of the examples in a leaf; has to be interpreted per loss (the weight of an example is the second derivative of its loss)
• example values: 1 (for squared-error regression each example has weight 1, so this means one example; for logistic classification the weight is p·(1−p))
XGBoost – regularization parameters for tree booster
alpha
• L1 penalty (sum of absolute values) on the leaf weights, added to the objective function
lambda
• L2 penalty (sum of squares) on the leaf weights, added to the objective function
gamma
• an L0-style penalty: multiplied by the number of leaves in a tree, it is used in deciding whether to make a split
Note: lambda ends up in the denominator of the optimal leaf weight because of various transformations; in the original objective it is a "classic" L2 penalty.
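A sketch of a typical tree-booster configuration combining the common and regularization parameters from the last two slides (values are illustrative, not recommendations):

```python
params = {
    "objective": "binary:logistic",
    "eta": 0.05,               # shrinkage / learning rate
    "max_depth": 8,            # maximum number of consecutive splits
    "min_child_weight": 1,
    "subsample": 0.9,          # row sampling per boosting round
    "colsample_bytree": 0.6,   # column sampling when building each tree
    "colsample_bylevel": 0.6,  # column sampling at each split level
    "alpha": 0.0,              # L1 penalty on leaf weights
    "lambda": 1.0,             # L2 penalty on leaf weights
    "gamma": 0.0,              # minimum loss reduction required to split
}
```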
XGBoost – some cool things not necessarily present elsewhere
Watchlist
• You can create a small validation set to prevent overfitting, without the need to guess how many iterations would be enough
Warm start
• Thanks to the nature of gradient boosting, you can take some already-working solution and start the "improvement" from it
Easy cut-off at any round when classifying
• Simple but powerful while tuning the parameters; allows you to recover from an overfitted model in an easy way
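A sketch of all three features in the native API (dvalid is an assumed held-out DMatrix; dtrain and params as in the earlier sketches):

```python
# Watchlist + early stopping: no need to guess the number of iterations.
watchlist = [(dtrain, "train"), (dvalid, "valid")]
bst = xgb.train(params, dtrain, num_boost_round=1000,
                evals=watchlist, early_stopping_rounds=50)

# Warm start: continue boosting from an already working model.
bst2 = xgb.train(params, dtrain, num_boost_round=100, xgb_model=bst)

# Cut off at any round when predicting, to step back from an overfitted model.
preds = bst.predict(dvalid, ntree_limit=bst.best_ntree_limit)
```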
Last, but not least – who won?
Winning solutions
The winning solution consists of over 20 xgboost models
that each need about two hours to train when running three
models in parallel on my laptop. So I think it could be done
within 24 hours. Most of the models individually achieve a
very competitive (top 3 leaderboard) score.
by Gert Jacobusse
I did quite a bit of manual feature engineering and my models
are entirely based on xgboost. Feature engineering is a
combination of “brutal force” (trying different
transformations, etc that I know) and “heuristic” (trying to
think about drivers of the target in real world settings). One
thing I learned recently was entropy based features, which
were useful in my model.
by Owen Zhang
Winning solutions
by Gilberto Titericz Jr and Stanislav Semenov
Even deeper dive
• XGBoost:
• https://github.com/dmlc/xgboost/blob/master/demo/README.md
• More info about machine learning:
• Uczenie maszynowe i sieci neuronowe (Machine Learning and Neural Networks),
Krawiec K., Stefanowski J., Wydawnictwo PP, Poznań, 2003 (2nd edition 2004)
• The Elements of Statistical Learning: Data Mining, Inference, and Prediction
by Trevor Hastie, Robert Tibshirani, Jerome Friedman (available on-line for free)
• scikit-learn and pandas official pages:
• http://scikit-learn.org/
• http://pandas.pydata.org/
• Kaggle blog and forum:
• http://blog.kaggle.com/ (No Free Hunch)
• https://www.datacamp.com/courses/kaggle-python-tutorial-on-machine-learning
• https://www.datacamp.com/courses/intro-to-python-for-data-science
• Python basics:
• https://www.coursera.org/learn/interactive-python-1
• https://www.coursera.org/learn/interactive-python-2