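Association Analysis in R: Uncovering the Rules Behind Combinations and Events (5th R Study Group @ Tokyo)

5th R Study Group @ Tokyo
2010/05/22

Association Analysis in R

Uncovering the rules behind combinations and events

hamadakoichi
Koichi Hamada (濱田晃一)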

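AGENDA
◆ Self-introduction
◆ What is association analysis?
◆ Association rules
◆ Implementation in R
◆ Data
◆ Extracting association rules
◆ Extracting frequent itemsets
◆ Cluster analysis of the extracted results
◆ Closing remarks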

Self-introduction: hamadakoichi (Koichi Hamada, 濱田晃一)
http://iddy.jp/profile/hamadakoichi

I organize the Data Mining + WEB Study Group @ Tokyo.
Please come and join us.

Google Group: http://groups.google.com/group/webmining-tokyo

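5th Data Mining + WEB Study Group @ Tokyo
WEB Behavior Mining for Beginners
Speakers wanted
2010/06/20 (Sun) @Nifty
AGENDA
1. "Bayesian Networks for Beginners" (speaker: @hamadakoichi) (60 min)
2. "WEB Behavior Mining for Beginners" (speaker: @kur) (60 min)
3. "Introduction to the Yahoo! Web API for Beginners" (speaker: @yokkuns) (60 min)
4. "Data Mining of Communications Data" (speaker: @lumin) (60 min)
5. "Introduction to Support Vector Machines Without Any Equations" (speaker: @super_rti) (60 min)
6. "Theory and Practice of the Linear Filtering Method for Beginners: Fast Aggregation with a New Notation" (speaker: @zanjibar) (60 min)
* Lunch LTs (lightning talks while everyone eats lunch together) are also planned
Contact: @hamadakoichi (Twitter)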


PhD in theoretical physics (awarded March 2004)
Quantum statistical field theory
Statistical Field Theory
Spontaneously Time-Reversal Symmetry Breaking
Anisotropic Massless Dirac Fermions

PhD thesis: http://hosi.phys.s.u-tokyo.ac.jp/~koichi/PhD-thesis.pdf


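Praised by a Minister of Education
Mayumi Moriyama (森山眞弓): former Minister of Education and Minister of Justice
Kazuo Sugeno (菅野和夫): author of the Roppo Zensho (compendium of Japanese statutes), former dean of the Graduate Schools for Law and Politics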


Praised by professional dancers in Los Angeles

・13 years of HIP HOP/House dance experience
・Praised by professional dancers in L.A. a year and a half after starting to dance

YouTube Channel: http://www.youtube.com/hamadakoichi


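I coach dance for three hours every weekend

■ In the past I also founded and coached dance clubs in Tokyo and Kyoto
Komaba Physics Dance Club: president and coach
Dance Club of the Yukawa Institute for Theoretical Physics, Kyoto University: president and coach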

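Applying mathematical analysis methods to real business
After receiving my PhD in 2004:
building methodologies for applying mathematical analysis methods to real business
Main areas
◆ Mathematical modeling and analysis methods for activities
◆ Analysis and restructuring methods for activities
◆ Execution control and performance analysis systems for activities
…
Selected work
"Decoupling Executions in Navigating Manufacturing Processes for Shortening Lead Time and Its Implementation to an Unmanned Machine Shop"
"Unified graph representation of processes for scheduling with flexible resource assignment"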

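Applying mathematical analysis methods to real business: example
Building and analyzing a unified graph model of activities
Unified graphical model of processes and resources

[Figure: graph in which Process nodes are linked by "uses" edges to Assign Regions, and Assign Regions are linked by "assigns from" / "assigns" edges to a Resource hierarchy joined by "has" edges (Company 01, Product 01, Organization A, AAA01, AAB02, …, individual staff members). Blue text: assignment-model attributes. [ ]: optional.]

Node
Process: represents a process
 ・priority
 ・duration (planned time)
 [・earliest (earliest start date/time)]
 [・deadline (due date)]
 [・or (number of aggregated conditions)]
 ・attributes: preemptable (whether it can be interrupted), successive (whether it can be handed over)
 ・workload
Assign Region: represents a range over which the same Resource stays assigned
 [process]
 [startDate (start date/time)]
 [endDate (end date/time)]
Resource: represents an element that can be assigned
 ・capacity
 ・calender (calendar)
 ・attributes

Edge
Process Edge: the succeeding process can start after the preceding process has finished
Uses Edge: the Assign Region that a Process uses
Assigns from Edge: the Assign Region is assigned from among the child Resources of the specified Resource
Assigns Edge: the Resource is assigned to the Assign Region between StartDate and EndDate
Has Edge: ownership relationship between Resources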

Association
Connection, relatedness, linkage
Association of ideas, correlation

What is association analysis?

An analysis method that extracts valuable association rules from very large data sets

Example
Data: purchase data
Association rule: 90% of the transactions that include bread and butter also include milk

Also known as
◆ Association Rule Extraction
◆ Association Rule Mining
◆ Association Rule Discovery

Association rules

Rules that describe relationships between items

When A occurs, B occurs

Association Rule

Condition ⇒ Conclusion
A ⇒ B
Example: purchases bread and butter ⇒ purchases milk

Evaluation measures for association rules
Three commonly used measures for an association rule X ⇒ Y

1. Support
The proportion of all records that contain both the condition X and the conclusion Y
Support(X⇒Y) = (number of records containing both X and Y) / (total number of records)
⇒ The probability that the rule appears among all events

2. Confidence
The proportion of the records containing the condition X that also contain the conclusion Y
Confidence(X⇒Y) = (number of records containing both X and Y) / (number of records containing X)
⇒ The probability that the conclusion is Y given that the condition X occurred

3. Lift
The ratio of the joint probability of X and Y to the product of their individual probabilities
Lift(X⇒Y) = Confidence(X⇒Y) / Support(Y)
⇒ Tests whether the two events are independent (lift = 1 means independent), which filters out rules that look strong only because X or Y is frequent on its own
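
To make the three measures concrete, here is a minimal base-R sketch that computes them by hand for the rule {bread, butter} ⇒ {milk}. The five transactions and the helper `contains()` are made up for illustration; they are not part of arules or of the Income data used later.

```r
# Toy transactions (hypothetical data for illustration only)
transactions <- list(
  c("bread", "butter", "milk"),
  c("bread", "butter", "milk"),
  c("bread", "butter"),
  c("bread", "milk"),
  c("milk")
)

# TRUE for each transaction that contains every item in `items`
contains <- function(items) {
  vapply(transactions, function(tr) all(items %in% tr), logical(1))
}

x <- c("bread", "butter")  # condition X
y <- "milk"                # conclusion Y
n <- length(transactions)  # total number of records

support_xy <- sum(contains(c(x, y))) / n                 # share of records with X and Y
confidence <- sum(contains(c(x, y))) / sum(contains(x))  # share of X records that have Y
support_y  <- sum(contains(y)) / n                       # share of records with Y
lift       <- confidence / support_y

print(c(support = support_xy, confidence = confidence, lift = lift))
```

In this toy data the confidence of 2/3 might look meaningful, but milk appears in 80% of all transactions, so the lift comes out below 1: buying bread and butter does not raise the chance of buying milk here.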

Test data (package: arules)

Income
Results of a customer survey at a shopping mall in San Francisco
Yes/No survey answers are recorded as 0/1 values


Income
library(arules)
data(Income)  # customer survey results from a shopping mall in San Francisco
Income
incomematrix <- as(Income, "matrix")
incomematrix[1, ]  # show the contents of the first record of Income

Output
> data(Income)
> Income
transactions in sparse format with
 6876 transactions (rows) and
 50 items (columns)
> incomematrix <- as(Income, "matrix")
> incomematrix[1, ]  # show the contents of the first record of Income

Group            Item                           Value
Income           income=$0-$40,000              0
                 income=$40,000+                1
Sex              sex=male                       1
                 sex=female                     0
Marital status   marital status=married         1
                 marital status=cohabitation    0
                 marital status=divorced        0
                 marital status=widowed         0
                 marital status=single          0
…                …                              …


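Income
No.  Attribute
1    Income (income)
2    Sex (sex)
3    Marital status (marital status)
4    Age (age)
5    Education (education)
6    Occupation (occupation)
7    Years lived in the Bay Area (years in Bay Area)
8    Dual incomes (dual income)
9    Number of people in the household (number in household)
10   Number of children (number of children)
11   Householder status (householder status)
12   Type of home (type of home)
13   Ethnic classification (ethnic classification)
14   Language spoken at home (language in home)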


Income

itemFrequencyPlot(Income)  # plot the relative item frequencies

[Figure: relative frequencies of the Income variables]

Extracting association rules (package: arules)

Function apriori
Extracts association rules

apriori(data, parameter, ...)
data: the data to mine; accepted types include
 transactions, itemMatrix, matrix, data.frame, …

parameter: a list that sets thresholds for the evaluation measures
 support, confidence,
 maxlen (maximum size of mined frequent item set)

(* Defaults in the arules version used for these slides: support=0.1, confidence=0.8, maxlen=5)
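
As a sketch of how the parameter list is passed, the call below spells out the thresholds quoted above instead of relying on defaults. It assumes the arules package is installed and uses the Income data set bundled with it; the variable name `rules_explicit` is only illustrative.

```r
library(arules)   # assumes the arules package is installed
data(Income)      # survey data set bundled with arules

# Thresholds are passed as a named list through the `parameter` argument
rules_explicit <- apriori(
  Income,
  parameter = list(support = 0.1, confidence = 0.8, maxlen = 5)
)

length(rules_explicit)          # number of rules found
head(quality(rules_explicit))   # quality measures of the first few rules
```

Raising support or confidence shrinks the rule set, and lowering them grows it, often quickly, so it is worth checking `length()` before inspecting the rules.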

Function apriori
Procedure
install.packages("arules", dependencies = TRUE)  # install the arules package
library(arules)

result <- apriori(Income)  # extract association rules

Execution log
> result <- apriori(Income)  # extract association rules

parameter specification:    (run parameters)
 confidence minval smax arem  aval originalSupport support minlen maxlen target   ext
        0.8    0.1    1 none FALSE            TRUE     0.1      1      5  rules FALSE

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

apriori - find association rules with the apriori algorithm
version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].


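Function inspect
Displaying the extracted rules
inspect(head(sort(result, by = "support"), n = 20))  # show the 20 rules with the highest support

Output
Columns: lhs = condition (left-hand side), rhs = conclusion (right-hand side), support, confidence, lift
   lhs rhs support confidence lift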
1 {} => {language in home=english} 0.9128854 0.9128854 1.0000000
2 {ethnic classification=white} => {language in home=english} 0.6595404 0.9847991 1.0787763
3 {number in household=1} => {language in home=english} 0.6495055 0.9388270 1.0284171
4 {education=no college graduate} => {language in home=english} 0.6343805 0.8995669 0.9854106
5 {years in bay area=10+} => {language in home=english} 0.6013671 0.9300495 1.0188020
6 {number of children=0} => {language in home=english} 0.5801338 0.9328812 1.0219040
7 {income=$0-$40,000} => {language in home=english} 0.5578825 0.8962617 0.9817899
8 {number of children=0} => {number in household=1} 0.5532286 0.8896165 1.2858951
9 {type of home=house} => {language in home=english} 0.5446481 0.9129693 1.0000919
10 {dual incomes=not married} => {language in home=english} 0.5426120 0.9069033 0.9934470
11 {age=14-34} => {language in home=english} 0.5248691 0.8966460 0.9822109
12 {number in household=1,
number of children=0} => {language in home=english} 0.5213787 0.9424290 1.0323629
13 {number of children=0,
language in home=english} => {number in household=1} 0.5213787 0.8987215 1.2990559
14 {[first item missing in source],
language in home=english} => {number of children=0} 0.5213787 0.8027318 1.2908287
15 {sex=female} => {language in home=english} 0.5122164 0.9246521 1.0128896
16 {income=$0-$40,000} => {education=no college graduate} 0.5018906 0.8063084 1.1433649
17 {[first item missing in source],
ethnic classification=white} => {language in home=english} 0.4941827 0.9880779 1.0823680
18 {number of children=0,
ethnic classification=white} => {language in home=english} 0.4474985 0.9868505 1.0810235
19 {income=$0-$40,000,
education=no college graduate} => {language in home=english} 0.4454625 0.8875688 0.9722675
20 {education=no college graduate,
ethnic classification=white} => {language in home=english} 0.4384817 0.9827249 1.0765041


Function summary
Computes summary statistics
summary(result)  # compute summary statistics

Output
> summary(result)
set of 6346 rules                              (number of rules)

rule length distribution (lhs + rhs):sizes
   1    2    3    4    5
   1   56  615 2287 3387

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   4.000   5.000   4.419   5.000   5.000

summary of quality measures:                   (summary statistics of the evaluation measures)
    support          confidence          lift
 Min.   :0.1001   Min.   :0.8000   Min.   :0.897
 1st Qu.:0.1129   1st Qu.:0.8420   1st Qu.:1.075
 Median :0.1313   Median :0.9040   Median :1.314
 Mean   :0.1471   Mean   :0.9010   Mean   :1.387
 3rd Qu.:0.1617   3rd Qu.:0.9553   3rd Qu.:1.480
 Max.   :0.9129   Max.   :1.0000   Max.   :4.331

mining info:                                   (run parameters)
   data ntransactions support confidence
 Income          6876     0.1        0.8


Function subset
Extracting part of the results
# Extract the rules whose rhs is "income=$40,000+" and whose lift is greater than 2
sub <- subset(result, subset = rhs %in% "income=$40,000+" & lift > 2)
inspect(sort(sub)[1:3])  # show the top 3 rules by support

Output
  lhs                                      rhs                 support   confidence lift
1 {occupation=professional/managerial,
   householder status=own}              => {income=$40,000+} 0.1384526 0.8074640  2.138722
2 {occupation=professional/managerial,
   householder status=own,
   language in home=english}            => {income=$40,000+} 0.1336533 0.8075571  2.138969
3 {dual incomes=yes,
   householder status=own}              => {income=$40,000+} 0.1260908 0.8156162  2.160315

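Extracting frequent itemsets (package: arules)

Function eclat
Extracts frequent combinations of items
Equivalence class transformation

eclat(data, parameter, ...)
data: the data to mine

parameter: a list that sets thresholds for the evaluation measures
 support,
 maxlen (maximum size of mined frequent item set)

* There is no confidence parameter
(* Defaults in the arules version used for these slides: support=0.1, maxlen=5)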
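
A minimal sketch of an eclat call with explicit thresholds, assuming the arules package is installed. The values 0.5 and 3 are arbitrary choices to keep the output short, and `frequent` is an illustrative variable name.

```r
library(arules)   # assumes the arules package is installed
data(Income)      # survey data set bundled with arules

# Only a support threshold applies: an itemset has no condition/conclusion split,
# so there is nothing for confidence to measure
frequent <- eclat(Income, parameter = list(support = 0.5, maxlen = 3))

# Show the five itemsets with the highest support
inspect(head(sort(frequent, by = "support"), n = 5))
```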

Function eclat
result_items <- eclat(Income)  # extract frequent itemsets

Execution log
> result_items <- eclat(Income)  # extract frequent itemsets

parameter specification:    (run parameters)
 tidLists support minlen maxlen            target   ext
    FALSE     0.1      1      5 frequent itemsets FALSE

algorithmic control:
 sparse sort verbose
      7   -2    TRUE

eclat - find frequent item sets with the eclat algorithm
version 2.6 (2004.08.16) (c) 2002-2004 Christian Borgelt
create itemset ...
set transactions ...[50 item(s), 6876 transaction(s)] done [0.00s].
sorting and recoding items ... [30 item(s)] done [0.01s].
creating bit matrix ... [30 row(s), 6876 column(s)] done [0.00s].
writing ... [4925 set(s)] done [0.02s].
Creating S4 object ... done [0.00s].


Function eclat

Output
Columns: items = combination of items, support
   items                              support
1  {language in home=english}         0.9128854
2  {education=no college graduate}    0.7052065
3  {number in household=1}            0.6918266
4  {ethnic classification=white}      0.6697208
5  {ethnic classification=white,
    language in home=english}         0.6595404
6  [missing in source]
7  {years in bay area=10+}            0.6465969
8  {education=no college graduate,
    [rest of itemset missing in source]
9  {income=$0-$40,000}                0.6224549
10 {number of children=0}             0.6218732


Function summary
summary(result_items)  # compute summary statistics

Output
set of 4925 itemsets                           (number of itemsets)

most frequent items:
      language in home=english          number in household=1
                          2018                           1305
  ethnic classification=white education=no college graduate
                          1291                           1278
      dual incomes=not married                        (Other)
                          1194                          12405

element (itemset/transaction) length distribution:sizes
   1    2    3    4    5
  30  293 1113 1909 1580

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   3.000   4.000   3.958   5.000   5.000

summary of quality measures:                   (summary statistics of the evaluation measure)
    support
 Min.   :0.1001
 1st Qu.:0.1136
 Median :0.1339
 Mean   :0.1549
 3rd Qu.:0.1707
 Max.   :0.9129

includes transaction ID lists: FALSE

mining info:                                   (run parameters)
   data ntransactions support
 Income          6876     0.1


Function subset
# Extract the itemsets that contain "income=$40,000+" and have more than 3 items
sitems <- subset(result_items, subset = items %in% "income=$40,000+" & size(items) > 3)
inspect(sort(sitems)[1:3])  # show the top 3 item combinations

Output
Columns: items = combination of items, support
  items                              support
1 {income=$40,000+,
   type of home=house,
   ethnic classification=white,
   [rest of itemset missing in source]
2 {income=$40,000+,
   number in household=1,
   [rest of itemset missing in source]
3 {income=$40,000+,
   number in household=1,
   number of children=0,
   language in home=english}         0.2033159

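Implementation in R: cluster analysis of rules and items

Function dissimilarity
Computes distances (dissimilarities) between rules, itemsets, or transactions

dissimilarity(data, method, ...)
data: the data (rules, itemsets, or transactions)

method: the distance measure to use
 "jaccard" (default), "matching", "dice", "affinity"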
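
For intuition about the default "jaccard" method: the Jaccard distance between two item sets A and B is 1 - |A ∩ B| / |A ∪ B|. The base-R sketch below computes it on a made-up 0/1 matrix. It illustrates the measure itself and is not a reimplementation of arules' dissimilarity(); the matrix `m` and the helper `jaccard()` are hypothetical.

```r
# Rows = hypothetical item sets, columns = items (1 = item present)
m <- rbind(
  s1 = c(bread = 1, butter = 1, milk = 0),
  s2 = c(bread = 1, butter = 1, milk = 1),
  s3 = c(bread = 0, butter = 0, milk = 1)
)

# Jaccard distance: 1 - (items in both) / (items in at least one)
jaccard <- function(a, b) 1 - sum(a & b) / sum(a | b)

d12 <- jaccard(m["s1", ], m["s2", ])  # s1 and s2 share 2 of 3 items
d13 <- jaccard(m["s1", ], m["s3", ])  # s1 and s3 share nothing
print(c(d12 = d12, d13 = d13))

# Base R's dist() with method = "binary" computes the same quantity for all pairs
d_all <- dist(m, method = "binary")
print(d_all)
```

Sets with no items in common are at the maximum distance of 1, and identical sets are at distance 0.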

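Cluster analysis: reference material

2nd Data Mining + WEB Study Group @ Tokyo
"Cluster Analysis in R for Beginners: Grouping Similar Things"
http://d.hatena.ne.jp/hamadakoichi/20100320/p1

3rd Data Mining + WEB Study Group @ Tokyo
"Cluster Analysis: Practical Applications"
http://d.hatena.ne.jp/hamadakoichi/20100428/p1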

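Hierarchical cluster analysis
Overview in 5 slides

Cluster analysis: overview (1/5)

Clustering
Dividing data into groups according to similarity

[Figure: Clustering A and Clustering B]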


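Hierarchical methods
Algorithm

① Treat each data point as its own cluster, giving as many clusters as there are data points

② Measure the distances between clusters and build the inter-cluster distance matrix

③ Merge the two closest clusters

④ Rebuild the inter-cluster distance matrix

⑤ Merge the two closest clusters

* Repeat until only one cluster remains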
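
The steps above can be sketched as a naive loop in base R. This is an illustration under stated assumptions, not how hclust() is implemented: the five 1-D data points are made up, and the distance between clusters is taken as the minimum pairwise distance (single linkage), which is only one of several possible choices.

```r
# Hypothetical 1-D data points, named so that the merges are easy to read
pts <- c(a = 1, b = 2, c = 6, d = 7, e = 15)
d <- as.matrix(dist(pts))            # pairwise distances between data points

clusters <- as.list(names(pts))      # step 1: one cluster per data point
heights <- numeric(0)                # distance at which each merge happens

while (length(clusters) > 1) {
  # steps 2 and 4: find the closest pair of clusters (single linkage)
  best <- c(NA, NA)
  best_dist <- Inf
  for (i in seq_len(length(clusters) - 1)) {
    for (j in (i + 1):length(clusters)) {
      dij <- min(d[clusters[[i]], clusters[[j]]])
      if (dij < best_dist) {
        best_dist <- dij
        best <- c(i, j)
      }
    }
  }
  # steps 3 and 5: merge that pair into a single cluster
  merged <- c(clusters[[best[1]]], clusters[[best[2]]])
  cat("merge:", paste(merged, collapse = " "), "at distance", best_dist, "\n")
  heights <- c(heights, best_dist)
  clusters <- c(clusters[-best], list(merged))
}
```

The recorded merge distances are the heights at which the corresponding bars would appear in a dendrogram.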
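Clustering method              Distance between clusters
Group average method           Mean of the distances over all pairs of data points between the two clusters
Single linkage method          Distance of the closest pair of data points between the two clusters
Complete linkage method        Distance of the farthest pair of data points between the two clusters
Ward's method                  Increase in the within-cluster sum of squares
Centroid method                Squared distance between the cluster centroids
Median method                  Same as the centroid method, except that when clusters are merged the new centroid is placed at the midpoint of the two original centroids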

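Dendrogram
Shows the cluster structure
The position of each horizontal bar on the vertical axis is the distance between the clusters it joins

[Figure: dendrogram; vertical axis: distance between clusters]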


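Function for hierarchical clustering
Hierarchical Clustering

hclust(d, method = "complete", members = NULL, ...)

d: distance matrix
method: the hierarchical clustering method to use
members: normally left unspecified.
 * Used when you want to start clustering from partway up a dendrogram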
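
A small base-R sketch of calling hclust() with two different methods and cutting the resulting tree into groups with cutree(). The six 2-D points are made up for illustration and form two well-separated groups.

```r
# Hypothetical 2-D points forming two loose groups
pts2 <- rbind(c(0, 0), c(0, 1), c(1, 0), c(5, 5), c(5, 6), c(6, 5))
rownames(pts2) <- paste0("p", 1:6)

d2 <- dist(pts2)                              # Euclidean distance matrix

hc_complete <- hclust(d2, method = "complete")  # complete linkage (the default)
hc_average  <- hclust(d2, method = "average")   # group average method

# plot(hc_complete) would draw the dendrogram

groups <- cutree(hc_complete, k = 2)          # cut the tree into two clusters
print(groups)
```

With groups this clearly separated, both methods give the same two clusters; on less clear-cut data the choice of method can change the result.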

Implementation in R (package: arules)

Hierarchical clustering of association rules
rules <- apriori(Income)  # extract association rules
# Extract the subset of rules whose rhs is "income=$40,000+" and whose lift is greater than 2
subrules <- subset(rules, subset = rhs %in% "income=$40,000+" & lift > 2)
d <- dissimilarity(subrules)  # compute Jaccard distances
plot(hclust(d, "ward.D"))  # cluster with Ward's method (named "ward" in R versions before 3.1.0)

Output
[Figure: dendrogram of the rules; vertical axis: Jaccard distance]

Roughly four clusters
Class 1: leaves 1-7; Class 2: leaves 8-13; Class 3: leaves 14-23; Class 4: leaves 24-29

order
Extracting the leaf elements
class1 <- hclust(d, "ward.D")$order[1:7]  # take leaves 1-7 (Class 1)
inspect(subrules[class1])  # print the rules on leaves 1-7

Output
  lhs                          rhs                 support   confidence lift
2 {marital status=married,
   dual incomes=yes,
   type of home=house}      => {income=$40,000+} 0.1121291 0.8193411  2.170181
  [next rule truncated in source: type of home=house, …]
…
⇒ A cluster of association rules linking housing status (home ownership) with high income

order
Extracting the leaf elements
class2 <- hclust(d, "ward.D")$order[8:13]  # take leaves 8-13 (Class 2)
inspect(subrules[class2])  # print the rules on leaves 8-13

Output
  lhs                              rhs                 support   confidence lift
1 {marital status=married,
   education=college graduate,
   [rest of rule missing in source]
2 {education=college graduate,
   ethnic classification=white} => {income=$40,000+} 0.1007853 0.8086348  2.141823
…

⇒ A cluster of association rules linking education (college graduate) with high income

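Recommended reading

Rによるデータサイエンス: データ解析の基礎から最新手法まで (Data Science with R: From the Basics of Data Analysis to the Latest Methods)

Rで学ぶクラスタ解析 (Learning Cluster Analysis with R)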

Closing remarks

I want us to make effective use of the data that has been accumulated

Google Group: http://groups.google.com/group/webmining-tokyo

The Data Mining + WEB Study Group is looking for speakers

Contact
Twitter: http://twitter.com/hamadakoichi

Thank you for your attention


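Purpose: Data Mining + WEB Study Group @ Tokyo
Learn how to make effective use of accumulated data using data mining methodologies

[Figure: topic map with three areas]
Web API: Amazon Web Service, 楽天 (Rakuten) Web Service, Twitter API, Recruit Web Service, Yahoo! Web Service, はてな (Hatena) Web Service (Bookmark/Graph/Keyword, …), Google Data API (Calendar/Maps/BookSearch/FinancePortfolioData, …), …
Statistical analysis and data mining: correspondence analysis, time series analysis, regression analysis, cluster analysis, discriminant analysis, principal component analysis, factor analysis, kernel methods, tree models, neural networks, support vector machines, …
Optimal-solution search algorithms: immune-type optimization, Particle Swarm, Memetic, Ant Colony, genetic, thermodynamic, simulated annealing, optimization with dynamical models, tabu search, graph, …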
