3. If Deep Learning changed the rules of the game in ML ...
XGBoost : Kaggle Winning Solution
Giuliano Janson: Won two games and retired from Kaggle
Persistence: every Kaggler nowadays can put up a great model in a few hours
and usually achieve 95% of the final score. Only persistence will get you the
remaining 5%.
Ensembling: you need to know how to do it "like a pro". Forget about averaging
models. Nowadays many Kagglers build meta-models, and even meta-meta-models.
4. Why is an Ensemble needed?
Occam's Razor
● An explanation of the data should be made as simple as possible, but no simpler.
Simple methods beat complex methods. Simple is good. Any waste is bad.
Combining several simple models can work better than a single complex model.
● Training data might not provide sufficient information for choosing a single best learner.
● The search processes of the learning algorithms might be imperfect (difficult to achieve unique
best hypothesis)
● Hypothesis space being searched might not contain the true target function.
5. What counts as a "simple" method?
ID3, C4.5, CART … tree-based methods
Entropy
e.g. finding people who love to spend money, splitting by gender: 5 spenders (1M, 4F), 9 non-spenders (6M, 3F)
● E_all → -5/14 * log(5/14) - 9/14 * log(9/14)
● Entropy is 1 if 50% - 50%, 0 if 100% - 0%
Information Gain
● After choosing attribute a as the split attribute, how much does the entropy decrease compared to before? (worked through in the sketch below)
● E_gender → P(M) * E(1,6) + P(F) * E(4,3) Gain = E_all - E_gender
http://www.saedsayad.com/decision_tree.htm
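As a quick check of the numbers above, here is a minimal Python sketch of the entropy and information-gain calculation for this example (the helper name entropy is only illustrative):

import math

def entropy(pos, neg):
    # Two-class entropy: 1.0 for a 50%-50% split, 0.0 for a pure node
    total = pos + neg
    e = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            e -= p * math.log2(p)
    return e

E_all = entropy(5, 9)                                   # 5 spenders, 9 non-spenders
E_gender = 7/14 * entropy(1, 6) + 7/14 * entropy(4, 3)  # male (1, 6), female (4, 3)
gain = E_all - E_gender                                 # information gain of the gender split
print(E_all, E_gender, gain)                            # ~0.94, ~0.79, ~0.15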
8. Learn to use an Ensemble in one second
You have probably already tried a few different models
● Decision tree, NN, SVM, Regression ..
Ensemble the Kaggle submission CSV files. → It works!
Majority Voting
● Three models : 70%, 70%, 70%
● Majority vote ensemble will be ~78%.
● Averaging predictions often reduces overfit.
http://mlwave.com/kaggle-ensembling-guide/
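A quick sanity check of the 70% → ~78% claim, under the simplifying assumption that the three models make independent errors (real submissions are usually correlated, so the gain is smaller):

# Majority vote of three independent 70% models is correct when
# all three are right, or exactly two out of three are right.
p = 0.7
p_majority = p**3 + 3 * p**2 * (1 - p)
print(p_majority)   # 0.784 -> roughly 78%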
9. The pitfalls of Ensembling
Will putting Kobe, Curry, and LBJ on one team win you the championship?
Uncorrelated models usually perform better
As accurate as possible, and as diverse as possible
Common mechanisms: Majority Vote, Weighted Averaging
Voting Ensemble → RandomForest → GradientBoostingMachine
Ground truth: 1111111111
Three highly correlated models:
1111111100 = 80% accuracy
1111111100 = 80% accuracy
1011111100 = 70% accuracy
Majority vote: 1111111100 = 80% accuracy (no improvement)
Three more diverse models:
1111111100 = 80% accuracy
0111011101 = 70% accuracy
1000101111 = 60% accuracy
Majority vote: 1111111101 = 90% accuracy
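These bit strings follow the mlwave ensembling guide: with an all-ones ground truth, voting over near-identical models stays at 80%, while voting over more diverse (less correlated) models reaches 90%. A small sketch reproducing the vote:

def majority_vote(predictions):
    # Element-wise majority over equal-length 0/1 prediction strings
    columns = zip(*[map(int, p) for p in predictions])
    return "".join("1" if sum(col) > len(predictions) / 2 else "0" for col in columns)

correlated = ["1111111100", "1111111100", "1011111100"]
diverse    = ["1111111100", "0111011101", "1000101111"]
print(majority_vote(correlated))  # 1111111100 -> still 80% against the all-ones truth
print(majority_vote(diverse))     # 1111111101 -> 90% against the all-ones truth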
10. The Ensemble method you have definitely heard of
● Randomly samples not only the data but also the features
● Majority vote
● Minimal tuning
● Performance surpasses many more complex methods
n: subsample size
m: feature-subset size
tree size, number of trees
http://www.slideshare.net/0xdata/jan-vitek-distributedrandomforest522013
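A minimal scikit-learn sketch of the Random Forest recipe described above: bootstrap row sampling, a random feature subset per split, and majority voting over many lightly-tuned trees. The synthetic dataset and parameter values are only illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# n_estimators ~ number of trees, max_features ~ feature-subset size,
# bootstrap=True ~ a fresh data subsample for every tree
rf = RandomForestClassifier(n_estimators=300, max_features="sqrt",
                            bootstrap=True, n_jobs=-1, random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())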
11. Base Learner: the basic model being ensembled, e.g. a single tree or a simple neural network
● Trained by a base learning algorithm (e.g. decision tree, neural network ..)
The three main training approaches:
● Boosting - boost weak learners into strong learners (sequential learners)
● Bagging - Like RandomForest, sampling from data or features
● Stacking - bundling models together (parallel learners)
● Employing different learning algorithms to train individual learners
● Individual learners then combined by a second-level learner which is
called meta-learner.
Key terms in Ensembling
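A minimal stacking sketch with scikit-learn, matching the description above: different learning algorithms are trained as individual learners, and a second-level meta-learner is fit on their predictions. The choice of models here is purely illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

base_learners = [
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("svm", SVC(probability=True, random_state=0)),
]
# Logistic regression plays the role of the meta-learner on top of the base predictions
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())
print(cross_val_score(stack, X, y, cv=5).mean())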
12. Bagging Ensemble (Bootstrap Aggregating)
Each round, sample m data points (a bootstrap sample) and train a base learner
by calling a base learning algorithm
● The sampling ratio takes some care to choose
● You can even train different models on sub-datasets built from different features
○ Cherkauer (1996), volcano-detection project: 32 NNs, split by different input features
● Add elements of randomness
○ random initialization in backpropagation, random feature selection in trees
● Majority voting
Advantage -- preserves the diverse characteristics of the overall hypothesis
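A minimal bagging sketch with scikit-learn: every base learner (a decision tree by default) is trained on a bootstrap sample of the rows and, optionally, a random subset of the features, and predictions are combined by voting. The sampling ratios are only illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# max_samples ~ bootstrap sample size, max_features ~ per-learner feature subset
bagging = BaggingClassifier(n_estimators=100, max_samples=0.8,
                            max_features=0.8, bootstrap=True, random_state=0)
print(cross_val_score(bagging, X, y, cv=5).mean())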
13. Boost Family
● AdaBoost (Adaptive Boosting)
● Gradient Tree Boosting
● XGBoost
Combination of Additive Models
Good learning convergence
Risk of amplifying noise
● Bagging can significantly reduce the variance
● Boosting can significantly reduce the bias
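A minimal AdaBoost sketch with scikit-learn: high-bias weak learners (by default, depth-1 decision stumps) are trained sequentially, each round re-weighting the examples the previous learners got wrong. The parameters are only illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Default base learner is a depth-1 decision tree ("stump"); boosting reduces its bias
ada = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=0)
print(cross_val_score(ada, X, y, cv=5).mean())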
16. Gradient Boosting
Additive training
● New predictor is optimized by moving in the opposite direction of the
gradient to minimize the loss function.
The decision trees in GBDT are shallow: the depth usually does not exceed 5, and the number of leaf nodes does not exceed about 10
● Boosted Tree: GBDT, GBRT, MART, LambdaMART
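A toy sketch of the additive-training idea for squared loss: the negative gradient is simply the residual, so each new shallow tree is fit to the residuals of the current ensemble and added with a small step. The data, tree depth, and step size are all illustrative:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 6, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=500)

F = np.full_like(y, y.mean())      # start from a constant prediction
lr = 0.1                           # shrinkage / learning rate
for _ in range(100):
    residual = y - F               # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
    F += lr * tree.predict(X)      # additive update: step against the gradient

print(np.mean((y - F) ** 2))       # training MSE shrinks with every round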
17. Gradient Boosting Model Steps
● Leaf weighted cost score
● Additive training: add a new model to the ensemble → choose the
model whose addition reduces the cost error the most (see the formulas below)
● Greedy algorithm to build new tree from a single leaf
● Gradient update weight
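For reference, the leaf score and split gain that these steps rely on, as derived in the XGBoost paper (g_i and h_i are the first- and second-order gradients of the loss, G and H their sums over a leaf, λ and γ the regularization terms):

$$w_j^* = -\frac{G_j}{H_j + \lambda}, \qquad G_j = \sum_{i \in I_j} g_i, \quad H_j = \sum_{i \in I_j} h_i$$

$$\mathrm{Gain} = \frac{1}{2}\left[\frac{G_L^2}{H_L + \lambda} + \frac{G_R^2}{H_R + \lambda} - \frac{(G_L + G_R)^2}{H_L + H_R + \lambda}\right] - \gamma$$

The greedy tree builder starts from a single leaf and keeps taking the split with the largest Gain, stopping when no split clears the γ penalty.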
18. Training Tips
Shrinkage
● Reduces the influence of each individual tree and leaves space for
future trees to improve the model.
● Better to improve the model by many small steps than by a few large steps.
Subsampling, Early Stopping, Post-Pruning
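A small sketch of these tips with the xgboost library: a low eta (shrinkage) combined with early stopping against a validation set. dtrain and dvalid are assumed to be xgb.DMatrix objects built from your own split, and the exact argument placement has shifted between xgboost versions, so treat this as illustrative:

import xgboost as xgb

# dtrain / dvalid: xgb.DMatrix objects for the training and validation splits (assumed)
params = {"objective": "binary:logistic",
          "eta": 0.05,           # shrinkage: many small steps
          "max_depth": 5,
          "subsample": 0.8}      # row subsampling per boosting round
booster = xgb.train(params, dtrain, num_boost_round=2000,
                    evals=[(dvalid, "valid")],
                    early_stopping_rounds=50)  # stop once the validation score stalls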
19. ● In 2015, among 29 challenge winning solutions, 17 used XGBoost (deep
neural nets: 11)
● At KDDCup 2015, every winning solution mentions it.
● Use it and you can go straight into the leaderboard top 10
Scalability enables data scientists to process hundreds of millions of examples
on a desktop.
● OpenMP CPU multi-threading
● DMatrix
● Cache-aware and Sparsity-aware
Why is XGBoost so powerful?
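A minimal sketch of the DMatrix point above: XGBoost's internal data container accepts dense or scipy-sparse input, carries the label with it, and training is multi-threaded through OpenMP (nthread). The random data is only a placeholder:

import numpy as np
import scipy.sparse as sp
import xgboost as xgb

X = sp.random(1000, 50, density=0.1, format="csr", random_state=0)  # sparse features
y = np.random.RandomState(0).randint(0, 2, size=1000)

dtrain = xgb.DMatrix(X, label=y)   # sparsity-aware internal format
params = {"objective": "binary:logistic", "nthread": 8}
booster = xgb.train(params, dtrain, num_boost_round=10)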
20. Column Block for Parallel Learning
The most time consuming part of tree learning is to get the data into sorted
order.
Data is kept in in-memory blocks in a compressed column format, with each column
sorted by the corresponding feature value. Block Compression, Block Sharding.
22. Use it in Python
from xgboost import XGBClassifier

xgb_model = XGBClassifier(learning_rate=0.1, n_estimators=1000,
                          max_depth=5, min_child_weight=1, gamma=0,
                          subsample=0.8, colsample_bytree=0.8,
                          objective='binary:logistic', nthread=8,
                          scale_pos_weight=1, seed=27)
● gamma : Minimum loss reduction required to make a further partition on a
leaf node of the tree.
● min_child_weight : Minimum sum of instance weight(hessian) needed in a
child.
● colsample_bytree : Subsample ratio of columns when constructing each
tree.
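A hypothetical follow-up once the classifier is configured, with X_train, y_train, and X_test standing in for your own splits:

# Fit on the training split, then predict positive-class probabilities on the test split
xgb_model.fit(X_train, y_train)
test_proba = xgb_model.predict_proba(X_test)[:, 1]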
24. Image classification competition
● Voting ensemble of around 30 convnets. The best single model scored
0.93170. Final score 0.94120.
Ensemble in Kaggle
25. No Free Lunch
An ensemble is much better than a single learner.
Bias-variance tradeoff → handle it with Boosting or with averaging / voting.
● Hard to interpret -- like DNNs and non-linear SVMs
● There is no ensemble method which outperforms other ensemble methods
consistently
Selecting some base learners instead of using all of them to compose an
ensemble is a better choice -- selective ensembles
XGBoost (tabular data) vs. Deep Learning (larger, more complex data; harder tuning)
26. Reference
● Gradient boosting machines, a tutorial - Alexey Natekin and Alois Knoll
● XGBoost: A Scalable Tree Boosting System - Tianqi Chen
● NTU cmlab http://www.cmlab.csie.ntu.edu.tw/~cyy/learning/tutorials/
● http://mlwave.com/kaggle-ensembling-guide/