Classification with Random Forest

Random Forest for Classification
2016/2/24
Ken'ichi Matsui

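Comparison of Decision Trees and Random Forest
Characteristics
  Decision tree:
  • Prunes the tree, weighing the cost of branches and nodes.
  Random Forest:
  • Does not prune.
  • Resamples the data to generate more training sets, then learns from them.
  • Selects features at random when splitting each node.
Advantages
  Decision tree:
  • The split criteria are visible and easy to understand, so insights can be drawn from them.
  • Relatively fast.
  Random Forest:
  • High predictive accuracy.
  • The built-in randomness keeps variance small.
Disadvantages
  Decision tree:
  • Variance tends to be large.
  Random Forest:
  • Because it is made of many trees, the split criteria are very hard to inspect.
  • Relatively slow.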

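Bootstrap method
Training data: N data points, d-dimensional features
  ↓ random sampling with replacement
Bootstrap samples 1, 2, 3, …, M: N data points each, d-dimensional features
⇒ each sample is drawn from the training data by random sampling with replacement
Number of bootstrap samples: M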
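
The resampling step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library; the function name `bootstrap_samples`, the toy data, and the seed are assumptions made for the example and do not come from the slides.

```python
import random

def bootstrap_samples(data, m, seed=0):
    """Return m bootstrap samples, each drawn from data with replacement.

    Every sample has the same size N as the original data set.
    """
    rng = random.Random(seed)
    n = len(data)
    return [[data[rng.randrange(n)] for _ in range(n)] for _ in range(m)]

# Toy training set: N = 10 row indices stand in for the actual rows.
train = list(range(10))
samples = bootstrap_samples(train, m=4)

for i, s in enumerate(samples, start=1):
    out_of_bag = sorted(set(train) - set(s))  # rows never drawn for this sample
    print(f"sample {i}: {sorted(s)}  out-of-bag: {out_of_bag}")
```

Because the draws are made with replacement, some rows appear several times in a sample and others not at all; the rows left out are the "out-of-bag" rows that a forest can later use for evaluation.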

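Random Forest
Bootstrap samples 1, 2, 3, …, M (N data points each)
Weak learners 1, 2, 3, …, M (one per bootstrap sample)
[Figure: each weak learner drawn as a decision tree, with its vertical extent labeled "tree depth"]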

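Random Forest
[Figure: bootstrap samples 1, 2, 3, …, M (N data points each) and their trees, labeled "tree depth"]
Lots of (decision) trees gathered together, hence a forest!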

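Splitting at each node of a weak learner (2-D case)
[Figure: scatter plot of the state before splitting (horizontal axis 1 to 10, vertical axis 1 to 9) next to a Yes/No decision tree whose leaf nodes are labeled "this node is the pink area" and "this node is the blue area"]
Note: for simplicity, assume that no feature selection is performed.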

Possible splits (2-D example)
The red line is the split that lowers impurity the most.

Possible splits (2-D example)
Impurity calculation (Gini impurity)
ratio_l, ratio_r: fraction of the points on each side (l, r) of the split
gini_l, gini_r: Gini impurity of each side
ave gini: ratio_l × gini_l + ratio_r × gini_r
axis  value  ratio_l  gini_l  ratio_r  gini_r  ave gini
x     1.8    0.111    0.000   0.889    0.469   0.417
x     2.45   0.222    0.000   0.778    0.408   0.317
x     3.0    0.333    0.000   0.667    0.278   0.185
x     4.2    0.444    0.375   0.556    0.320   0.344
x     5.75   0.556    0.480   0.444    0.375   0.433
x     6.8    0.667    0.444   0.333    0.000   0.296
x     7.9    0.778    0.490   0.222    0.000   0.381
x     8.85   0.889    0.500   0.111    0.000   0.444
y     1.05   0.111    0.000   0.889    0.469   0.417
y     1.85   0.222    0.500   0.778    0.490   0.492
y     2.6    0.333    0.444   0.667    0.444   0.444
y     3.6    0.444    0.375   0.556    0.320   0.344
y     4.8    0.556    0.480   0.444    0.375   0.433
y     5.95   0.667    0.500   0.333    0.444   0.481
y     6.65   0.778    0.490   0.222    0.000   0.381
y     7.5    0.889    0.500   0.111    0.000   0.444
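
The calculation behind the table can be sketched as follows. The slides do not list the underlying points, so the label sequence below is an assumption: nine class labels, ordered by x, chosen so that the resulting impurities agree with the x rows of the table. The function names are likewise illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def weighted_gini(left, right):
    """Average of the two sides' impurities, weighted by their share of the points."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Assumed class labels of 9 points sorted by x (consistent with the x rows of the table).
labels_by_x = ["A", "A", "A", "B", "B", "A", "B", "B", "B"]

scores = []
for k in range(1, len(labels_by_x)):
    left, right = labels_by_x[:k], labels_by_x[k:]
    scores.append((weighted_gini(left, right), k))
    print(k, round(gini(left), 3), round(gini(right), 3), round(scores[-1][0], 3))

# The split with the lowest weighted impurity is the one the tree would choose.
best_score, best_k = min(scores)
print("best split: after point", best_k, "with average gini", round(best_score, 3))
```

Under this assumed labeling, the lowest average impurity comes from putting the first three points on one side, which corresponds to the x = 3.0 row of the table.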

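Feature selection in Random Forest
Bootstrap samples 1, 2, 3, …, M (N data points each)
The data has d-dimensional features. At every node split in every weak learner, d' features are sampled from the d dimensions, and the best split point is searched for using only those d' features.
([value of d' not preserved] is commonly used)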

Feature selection in Random Forest
At each node split, only d' features sampled at random from the d dimensions are considered when searching for the best split point.
⇒ This is why Random Forest is called "random".
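
Per-node feature sampling might look like the sketch below. The default d' = floor(sqrt(d)) is an assumption of this example (a widely used choice for classification); the slide's own formula is not preserved. The function name `candidate_features` is hypothetical.

```python
import math
import random

def candidate_features(d, rng, d_prime=None):
    """Pick the feature indices to consider at one node split.

    d_prime defaults to floor(sqrt(d)), which is an assumption of this sketch.
    """
    if d_prime is None:
        d_prime = max(1, math.isqrt(d))
    # Sampling without replacement: each candidate feature appears once.
    return sorted(rng.sample(range(d), d_prime))

rng = random.Random(0)
d = 784  # e.g. MNIST images flattened to 28 x 28 pixels
for node in range(3):
    feats = candidate_features(d, rng)
    print(f"node {node}: {len(feats)} candidate features, first five: {feats[:5]}")
```

A fresh subset is drawn at every node, so different nodes of the same tree usually search different features for their best split.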

Random Forest (Classification)
Input → each tree answers: "It's B!" "It's A!" "It's B!" "It's B!"
⇒ "B" is chosen by majority vote.
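
The voting step can be written as a short sketch. The function name `majority_vote` and the tie-breaking behavior are choices made for this example.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most trees.

    Ties go to the label that was encountered first (an arbitrary choice for this sketch).
    """
    return Counter(predictions).most_common(1)[0][0]

tree_predictions = ["B", "A", "B", "B"]  # the four votes from the slide
print(majority_vote(tree_predictions))
```

Note that scikit-learn's RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so its result can differ from a plain majority vote in close cases.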

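RandomForestClassifier parameters
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
• n_estimators: number of trees to build (one per bootstrap sample)
• criterion: impurity measure used when splitting a node
• max_features: maximum number of features sampled as split candidates

• max_depth: maximum depth of a tree
• min_samples_split: minimum number of samples a node must contain to be split
• min_samples_leaf: minimum number of samples in a leaf node
• max_leaf_nodes: maximum number of leaf nodes
• bootstrap: whether to use bootstrap sampling
• min_weight_fraction_leaf: minimum weighted fraction of the samples required in a leaf node

• verbose: how much detail to print while the trees are built
• warm_start: reuse the previously fitted model when fit is called again
• class_weight: weights applied to each class
• random_state: random seed for bootstrap sampling and feature selection
• n_jobs: number of parallel jobs
• oob_score: whether to use out-of-bag samples for evaluation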

Trying Random Forest with scikit-learn and MNIST
https://github.com/matsuken92/Qiita_Contents/blob/master/General/Decision_tree.ipynb
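from sklearn.ensemble import RandomForestClassifier

# Build the Random Forest model (x_train, y_train: MNIST training data prepared in the linked notebook)
# max_features='sqrt' is what the older 'auto' setting meant for classifiers;
# 'auto' is no longer accepted by recent scikit-learn releases
clf = RandomForestClassifier(n_estimators=50, criterion='gini', max_depth=None, min_samples_split=2,
                             min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='sqrt',
                             max_leaf_nodes=None, bootstrap=True, oob_score=False, n_jobs=2,
                             random_state=None, verbose=0, warm_start=False, class_weight=None)
clf = clf.fit(x_train, y_train)
# Check accuracy on the training data (confirm_result is a helper from the linked notebook)
print("train")
confirm_result(clf, x_train, y_train)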
classification report
             precision    recall  f1-score   support
          0       1.00      1.00      1.00      5923
          1       1.00      1.00      1.00      6742
          2       1.00      1.00      1.00      5958
          3       1.00      1.00      1.00      6131
          4       1.00      1.00      1.00      5842
          5       1.00      1.00      1.00      5421
          6       1.00      1.00      1.00      5918
          7       1.00      1.00      1.00      6265
          8       1.00      1.00      1.00      5851
          9       1.00      1.00      1.00      5949
avg / total       1.00      1.00      1.00     60000
accuracy
0.999983333333
MNIST (handwritten digit data)
The full code is here ↓

https://github.com/matsuken92/Qiita_Contents/blob/master/General/Decision_tree.ipynb
# Check accuracy on the test data
print("test")
confirm_result(clf, x_test, y_test)
test
confusion matrix
[[ 969    0    2    0    0    2    3    1    3    0]
 [   0 1122    3    3    1    1    2    0    3    0]
 [   5    0  999    6    2    0    4    9    7    0]
 [   1    0   10  973    0    7    0    8    8    3]
 [   1    0    1    0  947    0    7    0    4   22]
 [   4    2    1   14    3  854    5    1    7    1]
 [   6    3    1    0    3    5  936    0    4    0]
 [   1    3   20    2    3    0    0  989    3    7]
 [   5    0    5    8    5    7    4    4  929    7]
 [   7    6    3   12   15    3    1    5    4  953]]
classification report
             precision    recall  f1-score   support
          0       0.97      0.99      0.98       980
          1       0.99      0.99      0.99      1135
          2       0.96      0.97      0.96      1032
          3       0.96      0.96      0.96      1010
          4       0.97      0.96      0.97       982
          5       0.97      0.96      0.96       892
          6       0.97      0.98      0.97       958
          7       0.97      0.96      0.97      1028
          8       0.96      0.95      0.95       974
          9       0.96      0.94      0.95      1009
avg / total       0.97      0.97      0.97     10000
accuracy
0.9671
Trying Random Forest with scikit-learn and MNIST
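
The accuracy and the per-class figures in the report follow directly from the confusion matrix. The sketch below recomputes them from the numbers shown above, assuming rows are true digits and columns are predicted digits (the row sums match the "support" column, which is consistent with that reading).

```python
# Confusion matrix copied from the test output (rows: true digit, columns: predicted digit).
confusion = [
    [969, 0, 2, 0, 0, 2, 3, 1, 3, 0],
    [0, 1122, 3, 3, 1, 1, 2, 0, 3, 0],
    [5, 0, 999, 6, 2, 0, 4, 9, 7, 0],
    [1, 0, 10, 973, 0, 7, 0, 8, 8, 3],
    [1, 0, 1, 0, 947, 0, 7, 0, 4, 22],
    [4, 2, 1, 14, 3, 854, 5, 1, 7, 1],
    [6, 3, 1, 0, 3, 5, 936, 0, 4, 0],
    [1, 3, 20, 2, 3, 0, 0, 989, 3, 7],
    [5, 0, 5, 8, 5, 7, 4, 4, 929, 7],
    [7, 6, 3, 12, 15, 3, 1, 5, 4, 953],
]

n_classes = len(confusion)
total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(n_classes))
accuracy = correct / total

# Recall: diagonal over row sum. Precision: diagonal over column sum.
recall = [confusion[i][i] / sum(confusion[i]) for i in range(n_classes)]
precision = [confusion[i][i] / sum(row[i] for row in confusion) for i in range(n_classes)]

print("accuracy:", accuracy)
for digit in range(n_classes):
    print(digit, f"precision={precision[digit]:.2f}", f"recall={recall[digit]:.2f}")
```

The largest off-diagonal entry is 22, for true 4s predicted as 9, which is where this model confuses digits most often.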

[Figure: part of one weak learner (tree) from the Random Forest trained on MNIST, with enlarged views]

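Computing Random Forest proximity and visualizing it in 2-D with MDS
[Figure: two panels, "Plot of the original data (iris)" and "Plot of data proximity"]
Note: I was only able to compute the proximity in R...
# rfPermute provides proximity.plot(); randomForest() comes from the randomForest package it builds on
require(rfPermute)
data(iris)
# Fit a Random Forest on iris and keep the proximity matrix
iris.rf <- randomForest(Species ~ ., data = iris,
                        importance = TRUE, proximity = TRUE)
iris.rf
# Plot the proximity matrix in two dimensions
proximity.plot(iris.rf, legend.loc = "topleft")
http://www.inside-r.org/packages/cran/rfPermute/docs/proximity.plot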
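
For reference, the proximity of two samples is the fraction of trees in which both end up in the same leaf. The sketch below computes it from a table of leaf indices; the leaf assignments are hypothetical and the function name `proximity_matrix` is an assumption of this example. With scikit-learn, `RandomForestClassifier.apply(X)` returns such a table (one row per sample, one column per tree), which may be one way to obtain the input in Python.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[i][t] is the leaf that sample i reaches in tree t.

    Proximity of i and j = fraction of trees in which they share a leaf.
    """
    n = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(1 for t in range(n_trees) if leaf_ids[i][t] == leaf_ids[j][t])
            prox[i][j] = shared / n_trees
    return prox

# Hypothetical leaf assignments: 4 samples, 3 trees.
leaf_ids = [
    [1, 4, 2],
    [1, 4, 3],
    [2, 4, 3],
    [5, 6, 7],
]
prox = proximity_matrix(leaf_ids)
for row in prox:
    print([round(v, 2) for v in row])
```

Using 1 - proximity as a distance and feeding it to an MDS routine would then give a 2-D layout of the kind shown in the plot.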

References
• "Intuition of Random Forest"
https://stat.ethz.ch/education/semesters/ss2012/ams/slides/v10.2.pdf
• Scikit-Learn RandomForestClassifier
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
• 「初めてのパターン認識」 by Yuzo Hirai (in Japanese)
http://www.amazon.co.jp/dp/4627849710
• Python code used in these slides
https://github.com/matsuken92/Qiita_Contents/blob/master/General/Decision_tree.ipynb
