2. 1. Introduction
2. Boosted Tree
3. Tree Ensemble
4. Additive Training
5. Split Algorithm
3. 1 Introduction
• What can XGBoost do?
• Binary classification
• Multiclass classification
• Regression
• Learning to rank
2 March 2017
Scalable, Portable and Distributed Gradient
Boosting (GBDT, GBRT or GBM) Library
Supported languages
• Python
• R
• Java
• Scala
• C++ and more
Supported platforms
• Single machine
• Hadoop
• Spark
• Flink
• DataFlow
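As a hedged illustration of basic usage, the sketch below trains a small binary classifier with the Python package; the synthetic data and the parameter values are assumptions for illustration, not part of the original slides.

```python
import numpy as np
import xgboost as xgb

# synthetic toy data (assumed for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X[:150], label=y[:150])
dtest = xgb.DMatrix(X[150:], label=y[150:])

# a few common parameters; the values are illustrative only
params = {"objective": "binary:logistic", "max_depth": 3, "eta": 0.1}
bst = xgb.train(params, dtrain, num_boost_round=50,
                evals=[(dtest, "test")], verbose_eval=False)
preds = bst.predict(dtest)  # predicted probabilities in [0, 1]
```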
4. 2 Boosted Tree
• Variants:
• GBDT: gradient boosted decision tree
• GBRT: gradient boosted regression tree
• MART: Multiple Additive Regression Trees
• LambdaMART, for ranking task
• ...
5. 2.1 CART
• CART: Classification and Regression Tree
• Classification
• Three Classes
• Two Variables
6. 2.1 CART
Prediction
• Predicting the price of 1993-model cars
• Standardized (zero mean, unit variance)
7. 2.1 CART
• Information Gain
• Gain Ratio
• Gini Index
• Pruning: prevents overfitting
Which variable should be used for the split?
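As a hedged sketch of one of these criteria, the snippet below computes the Gini index and the weighted impurity of a candidate split; the function names and the boolean-mask interface are assumptions for illustration.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity of a label set: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_impurity(labels, left_mask):
    """Weighted Gini impurity after splitting by a boolean mask."""
    left, right = labels[left_mask], labels[~left_mask]
    n = len(labels)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)

# Pick the variable/threshold whose split yields the lowest weighted impurity.
labels = np.array([0, 0, 1, 1, 2])
print(split_impurity(labels, np.array([True, True, False, False, False])))  # ~0.267
```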
8. 2.2 CART
• Input: age, gender, occupation
• Goal: does the person like computer games?
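A minimal sketch of fitting such a tree with scikit-learn; the toy data, the numeric encoding of the inputs, and the feature names are invented for illustration and are not from the slides.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# hypothetical encoded data: [age, is_male, plays_daily] -> likes computer games (1/0)
X = np.array([[12, 1, 1], [15, 0, 1], [35, 1, 0], [60, 0, 0], [25, 1, 1], [70, 1, 0]])
y = np.array([1, 1, 0, 0, 1, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["age", "is_male", "plays_daily"]))
```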
9. 3 Tree Ensemble
• What is a tree ensemble?
• A single tree is not powerful enough
• Benefits of tree ensembles?
• Very widely used
• Invariant to scaling of inputs
• Learn higher-order interactions between features
• Scalable
Tree ensembles include boosted trees and random forests
10. 3 Tree Ensemble
The prediction is the sum of the scores predicted by each tree
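A toy sketch of that statement: each tree maps an instance to a leaf score, and the ensemble prediction is the sum of those scores. The trees and score values below are made up for illustration.

```python
def tree_1(person):
    # hypothetical first tree: splits on age
    return 2.0 if person["age"] < 15 else -1.0

def tree_2(person):
    # hypothetical second tree: splits on daily computer use
    return 0.9 if person["uses_computer_daily"] else -0.9

def ensemble_predict(person):
    # ensemble score = sum of the per-tree leaf scores
    return tree_1(person) + tree_2(person)

print(ensemble_predict({"age": 12, "uses_computer_daily": True}))  # 2.9
```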
11. 3 Tree Ensemble: Elements of Supervised Learning
• Linear model
Optimizing the training loss encourages predictive models
Optimizing the regularization encourages simple models
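In the usual notation (a reconstruction; the slide's formula image is not in the transcript), the objective combines exactly these two terms:

Obj(\Theta) = \sum_i l(y_i, \hat{y}_i) + \Omega(\Theta)

where the first term is the training loss and the second is the regularization; for a linear model, \hat{y}_i = \sum_j w_j x_{ij}.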
12. 3 Tree Ensemble
• Assume we have K trees
• Parameters
• Including the structure of each tree, and the scores in the leaves
• Or simply use functions as parameters
• Instead of learning weights in R^d, we are learning functions (trees)
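A sketch of the model these bullets describe, in the standard notation (reconstructed, since the slide's formula is not in the transcript):

\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \qquad f_k \in \mathcal{F}

where \mathcal{F} is the space of regression trees (CARTs).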
13. 3 Tree Ensemble
• How can we learn functions?
• The parameters of such a step function are the splitting positions and the height in each segment
• Training loss: how well does the function fit the points?
• Regularization: How do we define complexity of the function?
14. 3 Tree Ensemble
Regularization: number of splitting points, and the L2 norm of the leaf weights
Training loss: squared error between the function and the observed points
15. 3 Tree Ensemble
• We define a tree by a vector of scores in the leaves, and a leaf index mapping function that maps an instance to a leaf
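In symbols (reconstructed from this description):

f_t(x) = w_{q(x)}, \qquad w \in \mathbb{R}^{T}, \qquad q: \mathbb{R}^{d} \rightarrow \{1, \dots, T\}

where w is the vector of leaf scores, q maps an instance to a leaf, and T is the number of leaves.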
16. 3 Tree Ensemble
• Objective:
• Definition of complexity
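A sketch of the standard definitions behind these bullets (reconstructed; the slide's formulas are not in the transcript):

Obj = \sum_i l(y_i, \hat{y}_i) + \sum_k \Omega(f_k), \qquad \Omega(f_t) = \gamma T + \tfrac{1}{2}\lambda \sum_{j=1}^{T} w_j^2

i.e. complexity is measured by the number of leaves T and the L2 norm of the leaf scores w.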
17. 4 Additive Training (Boosting)
• We cannot use methods such as SGD to find f (since they are trees, not just numerical vectors)
• Start from constant prediction, add a new function each time.
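In formulas (the standard additive schedule, reconstructed):

\hat{y}_i^{(0)} = 0, \qquad \hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i) = \sum_{k=1}^{t} f_k(x_i)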
18. 4 Additive Training (Boosting)
• How do we decide which f to add?
• The prediction at round t is
• Consider square loss
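Reconstructed in the standard notation, the round-t objective and its form under square loss are:

Obj^{(t)} = \sum_i l\big(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\big) + \Omega(f_t) + \text{const}

and with l(y, \hat{y}) = (y - \hat{y})^2 this becomes

Obj^{(t)} = \sum_i \big[ 2(\hat{y}_i^{(t-1)} - y_i) f_t(x_i) + f_t(x_i)^2 \big] + \Omega(f_t) + \text{const}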
19. 4 Additive Training (Boosting)
• Taylor expansion of the objective
• Objective after expansion
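A reconstruction of the expansion referred to here (second-order Taylor expansion of the loss around \hat{y}^{(t-1)}):

Obj^{(t)} \approx \sum_i \big[ l(y_i, \hat{y}_i^{(t-1)}) + g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \big] + \Omega(f_t)

where g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}) and h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)}).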
20. 4 Additive Training (Boosting)
• Our new goal, with constants removed, is shown below
• Benefits
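Dropping the terms that do not depend on f_t leaves (reconstructed):

\sum_i \big[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \big] + \Omega(f_t)

The learning of f_t now depends on the data only through g_i and h_i, which is why the same machinery applies to any twice-differentiable loss.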
21. 4 Additive Training (Boosting)
• Define the instance set in leaf j as
• Regroup the objective by each leaf
• This is a sum of T independent quadratic functions
• Two facts about a single-variable quadratic function
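Reconstructed: with I_j = \{ i : q(x_i) = j \} and f_t(x_i) = w_{q(x_i)}, the objective regrouped by leaf is

Obj^{(t)} = \sum_{j=1}^{T} \Big[ \big( \sum_{i \in I_j} g_i \big) w_j + \tfrac{1}{2} \big( \sum_{i \in I_j} h_i + \lambda \big) w_j^2 \Big] + \gamma T

and for a single-variable quadratic G x + \tfrac{1}{2} H x^2 with H > 0, the minimizer is x^* = -G/H and the minimum value is -\tfrac{1}{2} G^2 / H.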
22. 4 Additive Training (Boosting)
• Let us define
• Results
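With G_j = \sum_{i \in I_j} g_i and H_j = \sum_{i \in I_j} h_i (the standard shorthand), the results being referenced are:

w_j^{*} = -\frac{G_j}{H_j + \lambda}, \qquad Obj^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{G_j^2}{H_j + \lambda} + \gamma T

The second quantity scores how good a given tree structure is.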
There can be infinitely many possible tree structures
23. 4 Additive Training (Boosting)
• Greedy learning: we grow the tree greedily
24. 5 Splitting Algorithm
• Efficiently finding the best split
• What is the gain of a split rule x_j < a? Say x_j is age
All we need are the sums of g and h on each side, from which we can calculate the gain
• A left-to-right linear scan over the sorted instances is enough to decide the best split
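In the standard form, the gain of splitting a leaf into left and right children is Gain = \tfrac{1}{2}\big[ G_L^2/(H_L+\lambda) + G_R^2/(H_R+\lambda) - (G_L+G_R)^2/(H_L+H_R+\lambda) \big] - \gamma. Below is a hedged Python sketch of the linear scan over one sorted feature; the function name and interface are assumptions, not XGBoost's internal code.

```python
import numpy as np

def best_split_on_feature(x, g, h, lam=1.0, gamma=0.0):
    """Exact greedy split search on one feature: a single left-to-right scan
    over instances sorted by x, accumulating the left sums G_L and H_L."""
    order = np.argsort(x)
    g_sorted, h_sorted = g[order], h[order]
    G, H = g_sorted.sum(), h_sorted.sum()
    G_L = H_L = 0.0
    best_gain, best_threshold = 0.0, None
    for i in range(len(x) - 1):
        G_L += g_sorted[i]
        H_L += h_sorted[i]
        G_R, H_R = G - G_L, H - H_L
        gain = 0.5 * (G_L ** 2 / (H_L + lam)
                      + G_R ** 2 / (H_R + lam)
                      - G ** 2 / (H + lam)) - gamma
        if gain > best_gain:
            best_gain = gain
            # split halfway between consecutive sorted values
            best_threshold = 0.5 * (x[order[i]] + x[order[i + 1]])
    return best_gain, best_threshold
```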
28. References
• http://www.52cs.org/?p=429
• http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf
• http://www.sigkdd.org/node/362
• http://homes.cs.washington.edu/~tqchen/pdf/BoostedTree.pdf
• http://www.stat.wisc.edu/~loh/treeprogs/guide/wires11.pdf
• https://github.com/dmlc/xgboost/blob/master/demo/README.md
• http://datascience.la/xgboost-workshop-and-meetup-talk-with-tianqi-chen/
• http://xgboost.readthedocs.io/en/latest/model.html
• http://machinelearningmastery.com/gentle-introduction-xgboost-applied-machine-learning/
29. Supplementary
• Tree models work very well on tabular data, and are easy to use, interpret, and control
• They cannot extrapolate
• Deep Forest: Towards An Alternative to Deep Neural
Networks, Zhi-Hua Zhou, Ji Feng, Nanjing University
• Submitted on 28 Feb 2017
• Comparable performance and easy to train (fewer parameters)
XGBoost is one of the most frequently used packages for winning machine learning challenges.
XGBoost can solve billion-scale problems with few resources and is widely adopted in industry.
XGBoost is an optimized distributed gradient boosting system designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the gradient boosting framework. XGBoost provides a parallel tree boosting (also known as GBDT or GBM) that solves many data science problems in a fast and accurate way. The same code runs on major distributed environments (Hadoop, SGE, MPI) and can solve problems beyond billions of examples. The most recent version integrates naturally with DataFlow frameworks (e.g. Flink and Spark).
Fitting the training data well at least gets you close to the training distribution, which is hopefully close to the underlying distribution.
Simpler models tend to have smaller variance in future predictions, making predictions stable.
1. Almost half of data mining competitions are won using some variant of tree ensemble methods
2. So you do not need to do careful feature normalization
3. And they are used in industry