Data Intensive Text Processing with MapReduce - #3 MapReduce Algorithm Design -

@just_do_neet

Data Intensive...(snip)
The book

#3 MapReduce Algorithm Design
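Chapter 3: MapReduce Algorithm Design

•MapReduce is simple and scalable
(Mapper / Reducer)

•Because it is so simple, it is heavily constrained: only a limited set of
techniques can be used.

•Within those constraints, the chapter introduces design-pattern-like idioms
for MapReduce and techniques for solving problems.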


#3 MapReduce Algorithm Design
Chapter 3: MapReduce Algorithm Design

•Local aggregation

•Pairs and stripes

•Computing relative frequencies

•Secondary sort

•Relational joins


Local Aggregation


Local Aggregation

•In Hadoop, intermediate data is written to disk when it is handed from
Map to Reduce

•This overhead is large

•Reducing the amount of intermediate data improves processing efficiency


Local Aggregation

•Problem: extract the most frequent words from the lyrics of Masashi Sada
(さだまさし).

•Data source: http://www.cai-insect.jp/sada/


Local Aggregation

•Standard MapReduce processing:
emit every time a word occurs in a document

• https://gist.github.com/3475182
https://gist.github.com/3475195
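
As a rough illustration of the standard approach (not the code in the gists above), here is a minimal in-memory sketch. The function names, the `run_job` helper, and the sample input are all hypothetical, and the text is assumed to be already tokenized into whitespace-separated words.

```python
from collections import defaultdict


def map_word_count(doc_id, text):
    """Mapper: emit (word, 1) every time a word occurs."""
    for word in text.split():
        yield word, 1


def reduce_word_count(word, counts):
    """Reducer: sum all counts received for one word."""
    return word, sum(counts)


def run_job(docs, mapper, reducer):
    """In-memory stand-in for the shuffle: group mapper output by key."""
    grouped = defaultdict(list)
    emitted = 0
    for doc_id, text in docs.items():
        for key, value in mapper(doc_id, text):
            grouped[key].append(value)
            emitted += 1
    result = dict(reducer(key, values) for key, values in grouped.items())
    return result, emitted


docs = {"d1": "a b a", "d2": "b a c"}  # hypothetical, pre-tokenized input
result, emitted = run_job(docs, map_word_count, reduce_word_count)
print(result, emitted)
```

The `emitted` counter is only there to show how many intermediate records cross the Map→Reduce boundary: one per word occurrence.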


Local Aggregation

•Tally the word counts for each document in an associative array, then
emit (in-mapper combining)
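
A minimal sketch of per-document aggregation, assuming whitespace-tokenized text; `map_per_document` is a hypothetical name, not taken from the slides.

```python
from collections import Counter


def map_per_document(doc_id, text):
    """Mapper: count words inside one document, then emit one pair per distinct word."""
    counts = Counter(text.split())
    for word, n in counts.items():
        yield word, n


# "a" occurs twice but is emitted only once, with its count.
emitted = list(map_per_document("d1", "a b a"))
print(emitted)
```

The saving depends on how often words repeat within a single document.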



Local Aggregation

•Keep the associative array as a field of the class, and emit only after
the word counts of all documents have been tallied
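
A sketch of this variant, under the assumption that the framework calls a hook once after the last input record (in Hadoop's `org.apache.hadoop.mapreduce.Mapper` that role is played by `cleanup`). The class and method names below are hypothetical.

```python
from collections import Counter


class InMapperCombiningMapper:
    """Keeps counts across map() calls and emits them once at the end."""

    def __init__(self):
        self.counts = Counter()  # state shared by all documents of this task

    def map(self, doc_id, text):
        # Nothing is emitted per document; only the in-memory counts change.
        self.counts.update(text.split())

    def close(self):
        # Called once after the last document: emit the aggregated counts.
        yield from sorted(self.counts.items())


mapper = InMapperCombiningMapper()
mapper.map("d1", "a b a")
mapper.map("d2", "b a c")
emitted = list(mapper.close())
print(emitted)
```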



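Local Aggregation

•Advantages of in-mapper combining

•Fewer records are handed from Map to Reduce, so a performance
improvement can be expected.

•Disadvantages

•Watch out for memory exhaustion in the Map task

•Depending on how the data is distributed, it may not help much.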

Local Aggregation

•A naive (memory-related) improvement to in-mapper combining

•https://gist.github.com/3475348

•Periodically flush the contents of the map (associative array)
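
One way to bound memory is to flush whenever the associative array grows past a limit. The sketch below is not the code in the gist; `FlushingMapper` and its `max_keys` threshold are hypothetical.

```python
from collections import Counter


class FlushingMapper:
    """In-mapper combining that flushes once the table holds max_keys words."""

    def __init__(self, max_keys=1000):
        self.max_keys = max_keys  # hypothetical threshold; tune to the task's memory
        self.counts = Counter()

    def map(self, doc_id, text):
        self.counts.update(text.split())
        if len(self.counts) >= self.max_keys:
            yield from self.flush()

    def flush(self):
        items = list(self.counts.items())
        self.counts.clear()
        return items

    def close(self):
        # Emit whatever is left after the last document.
        yield from self.flush()


mapper = FlushingMapper(max_keys=2)
emitted = []
for doc_id, text in [("d1", "a b a"), ("d2", "b a c")]:
    emitted.extend(mapper.map(doc_id, text))
emitted.extend(mapper.close())

totals = Counter()
for word, n in emitted:
    totals[word] += n
print(dict(totals))
```

A word can now be emitted more than once, so the Reducer still has to sum the partial counts; the final totals are unchanged.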


pairs and stripes


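pairs and stripes

•A technique for aggregating on composite keys

•Example: computing the co-occurrence frequency of words in a text

•Co-occurrence: when a certain word appears in a text, a particular set of
other words frequently appears in the same text. (Wikipedia)

•"私はさだまさしが好きです。" ("I like Masashi Sada.")
→ "私:さだまさし", "私:好き", ...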


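pairs and stripes

•The amount of data produced by co-occurrence extraction is basically O(n^2)

•"私はさだまさしが好きです。" ("I like Masashi Sada.")
→ "私:は", "私:さだまさし", "私:が", ...
"好き:です"

Each row pairs its first word with the words that follow it; words in
parentheses were already paired in an earlier row.

私 は さだまさし が 好き です
は (私) さだまさし が 好き です
さだまさし (私) (は) が 好き です
が (私) (は) (さだまさし) 好き です
好き (私) (は) (さだまさし) (が) です
です (私) (は) (さだまさし) (が) (好き)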
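
To make the quadratic growth concrete, assume each of the n tokens is paired once with every token that follows it. The number of pairs is then:

```latex
\[
\binom{n}{2} = \frac{n(n-1)}{2},
\qquad
n = 6 \;\Rightarrow\; \frac{6 \cdot 5}{2} = 15 \text{ pairs}
\]
```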


pairs and stripes

•Problem: extract frequently co-occurring words from the lyrics of
Masashi Sada (さだまさし).

•Data source: http://www.cai-insect.jp/sada/


pairs and stripes

•pairs: for a word w, take each co-occurring word u, form the composite
key (w, u), and emit the composite key with its occurrence count
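
A minimal sketch of the pairs approach. It assumes the co-occurrence window is "every later word in the same sentence" and that the sentence is already tokenized; `map_pairs` and `reduce_pairs` are hypothetical names.

```python
from collections import Counter


def map_pairs(tokens):
    """Mapper: emit ((w, u), 1) for every word u that follows w."""
    for i, w in enumerate(tokens):
        for u in tokens[i + 1:]:
            yield (w, u), 1


def reduce_pairs(emitted):
    """Reducer: sum the counts of each composite key (w, u)."""
    totals = Counter()
    for key, n in emitted:
        totals[key] += n
    return totals


tokens = ["私", "は", "さだまさし", "が", "好き", "です"]
emitted = list(map_pairs(tokens))
totals = reduce_pairs(emitted)
print(len(emitted))
```

With six tokens the mapper emits one record for each of the 15 pairs.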



pairs and stripes

•stripes: use the first word w of the window as the key; hold the frequency
of each co-occurring word u in a hash and emit it
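
A minimal sketch of the stripes approach, under the same assumptions (window = every later word in the sentence, pre-tokenized input). `map_stripes` and `reduce_stripes` are hypothetical names.

```python
from collections import Counter


def map_stripes(tokens):
    """Mapper: emit (w, {u: count}) once per word w that has followers."""
    for i, w in enumerate(tokens):
        stripe = Counter(tokens[i + 1:])
        if stripe:
            yield w, stripe


def reduce_stripes(emitted):
    """Reducer: merge the stripes of each w by element-wise addition."""
    merged = {}
    for w, stripe in emitted:
        merged.setdefault(w, Counter()).update(stripe)
    return merged


tokens = ["私", "は", "さだまさし", "が", "好き", "です"]
emitted = list(map_stripes(tokens))
merged = reduce_stripes(emitted)
print(len(emitted))
```

Six tokens produce five stripes, because the last word has no followers. Each record is larger than a pair, which is the trade-off for emitting fewer of them.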



pairs and stripes

•"私はさだまさしが好きです" ("I like Masashi Sada")

•pairs
• {(私, は): 1}, {(私, さだまさし): 1}, {(私, が): 1}, {(私, 好き): 1}, {(私, です): 1},
{(は, さだまさし): 1}, {(は, が): 1}, ...

•stripes
• {私: {は: 1, さだまさし: 1, が: 1, 好き: 1, です: 1}},
{は: {さだまさし: 1, が: 1, ...}}, ...

•Number of Map→Reduce emits: pairs > stripes


pairs and stripes

•Occurrence frequencies of co-occurring words


Relative Frequency


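Computing Relative Freq.

•Sometimes we want not only the frequency with which u co-occurs with a
word w, but also the relative frequency (the conditional probability?).

•For that, the occurrence frequency of word w (the denominator of the
formula) has to be computed.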
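
The formula itself did not survive on this slide. Assuming it is the usual definition of relative frequency, with N(w, u) the number of times u co-occurs with w, it can be written as:

```latex
\[
f(u \mid w) = \frac{N(w, u)}{\sum_{u'} N(w, u')}
\]
```

The denominator is the marginal count of w, summed over every word u' that co-occurs with it.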


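Computing Relative Freq.

•stripes: https://gist.github.com/3475934

•For a word w, every co-occurring word u and its frequency are passed to
the Reducer together, so it is enough to sum the frequencies and divide.

•pairs: https://gist.github.com/3475992

•Not possible as is. The Partitioner has to be modified so that every key
whose first element is w is sent to the same Reducer.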
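
Both variants can be sketched in memory as follows. This is not the code in the gists: the function names are hypothetical, and `zlib.crc32` merely stands in for whatever hash a real Partitioner would use.

```python
import zlib
from collections import Counter, defaultdict


def relative_freq_from_stripe(w, stripe):
    """Stripes: the reducer sees every u for w, so the marginal is a plain sum."""
    total = sum(stripe.values())
    return {(w, u): n / total for u, n in stripe.items()}


def partition_by_left_word(key, num_reducers):
    """Pairs: route on w only, so all (w, *) keys reach the same reducer."""
    w, _u = key
    return zlib.crc32(w.encode("utf-8")) % num_reducers


def relative_freq_from_pairs(pair_counts, num_reducers=3):
    # Simulate the shuffle with the custom partitioner.
    partitions = defaultdict(dict)
    for key, n in pair_counts.items():
        partitions[partition_by_left_word(key, num_reducers)][key] = n
    result = {}
    for part in partitions.values():
        # Inside one reducer: first the marginal count of each w, then divide.
        marginals = Counter()
        for (w, _u), n in part.items():
            marginals[w] += n
        for (w, u), n in part.items():
            result[(w, u)] = n / marginals[w]
    return result


stripe_result = relative_freq_from_stripe("w", Counter({"a": 1, "b": 3}))
pair_result = relative_freq_from_pairs({("w", "a"): 1, ("w", "b"): 3, ("x", "a"): 2})
print(stripe_result, pair_result)
```

This pairs sketch buffers a whole partition in memory before dividing. The book's "order inversion" pattern avoids that by emitting a special marginal key per w that sorts ahead of the ordinary pairs; it is not shown here.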


Secondary Sort


Secondary Sort

•We want to sort by Value as well as by Key

1. Sort inside Reduce

2. When handing data from Map to Reduce, put the Value to be sorted
into the Key.
(value-to-key conversion)


Secondary Sort

•Problem: analyze a list of Masashi Sada's concert venues

•Sort 1: concert venue
Sort 2: year of the concert


Secondary Sort

•Sort inside Reduce


• Map→Reduce
{"東京厚生年金会館" : "2000\t1"}
{"東京厚生年金会館" : "2000\t1"}
{"東京厚生年金会館" : "2001\t1"}
Reduce→Result
{"東京厚生年金会館" : "2000\t2"}
{"東京厚生年金会館" : "2001\t1"} ← sorted by year inside Reduce
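
A sketch of sorting inside the reducer. As a simplification, the mapper output here is a bare (venue, year) pair rather than the "year\t1" string shown above, and `reduce_sort_in_memory` is a hypothetical name.

```python
from collections import Counter, defaultdict


def reduce_sort_in_memory(venue, years):
    """Reducer: buffer all values of one key, count per year, sort by year."""
    counts = Counter(years)
    return [(venue, f"{year}\t{n}") for year, n in sorted(counts.items())]


records = [
    ("東京厚生年金会館", 2001),
    ("東京厚生年金会館", 2000),
    ("東京厚生年金会館", 2000),
]

# Stand-in for the shuffle: group values by key.
grouped = defaultdict(list)
for venue, year in records:
    grouped[venue].append(year)

output = [
    row
    for venue in sorted(grouped)
    for row in reduce_sort_in_memory(venue, grouped[venue])
]
print(output)
```

The reducer has to hold every value of a key in memory, which becomes a problem when a single key has very many values.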


Secondary Sort

•value-to-key conversion

• Map→Reduce
{"東京厚生年金会館\t2000" : 1}
{"東京厚生年金会館\t2000" : 1}
{"東京厚生年金会館\t2001" : 1} ← the year is included in the Key
Reduce→Result
{"東京厚生年金会館\t2000" : 2}
{"東京厚生年金会館\t2001" : 1}
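
A sketch of value-to-key conversion, using a (venue, year) tuple as the composite key instead of a tab-joined string. Python's `sorted` stands in for the framework's sort phase, and `map_value_to_key` is a hypothetical name.

```python
from itertools import groupby


def map_value_to_key(venue, year):
    """Mapper: move the year into the key so the framework sorts on it."""
    return (venue, year), 1


records = [
    ("東京厚生年金会館", 2001),
    ("東京厚生年金会館", 2000),
    ("東京厚生年金会館", 2000),
]

# The sort phase orders by composite key: venue first, then year.
emitted = sorted(map_value_to_key(venue, year) for venue, year in records)

# Reducer: sum the counts of each composite key; no in-memory sort is needed.
output = [
    (key, sum(n for _key, n in group))
    for key, group in groupby(emitted, key=lambda kv: kv[0])
]
print(output)
```

With more than one reducer, a real job would typically also partition on the venue alone, so that all years of one venue reach the same reducer.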


Relational Join


Relational Join

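•Only the methods are introduced here

•Reduce Side Join
→ Join on the Reduce side

•Map Side Join
→ Join on the Map side

•Memory-Backed Join
→ Hold one dataset in memory, either in the Mapper or in an external
store (memcached, etc.), and join against it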
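
Two of these methods can be sketched in memory as follows. The employee and department tables are hypothetical sample data, and the function names are not taken from the referenced sources.

```python
from collections import defaultdict


def reduce_side_join(employees, departments):
    """Tag records by source, group them by join key, and join per key."""
    grouped = defaultdict(lambda: {"dept": [], "emp": []})
    for dept_id, dept_name in departments:
        grouped[dept_id]["dept"].append(dept_name)
    for emp_name, dept_id in employees:
        grouped[dept_id]["emp"].append(emp_name)
    # "Reducer": for each key, combine the records of both sources.
    for dept_id in sorted(grouped):
        for dept_name in grouped[dept_id]["dept"]:
            for emp_name in grouped[dept_id]["emp"]:
                yield emp_name, dept_name


def memory_backed_join(employees, departments):
    """Load the smaller table into a dict and probe it for each record."""
    lookup = dict(departments)
    for emp_name, dept_id in employees:
        if dept_id in lookup:
            yield emp_name, lookup[dept_id]


employees = [("alice", 10), ("bob", 20), ("carol", 10)]
departments = [(10, "sales"), (20, "dev")]

reduce_side = list(reduce_side_join(employees, departments))
memory_backed = list(memory_backed_join(employees, departments))
print(reduce_side, memory_backed)
```

The memory-backed version needs no shuffle at all, but it only works when one side of the join fits in memory (or in the external store).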


Relational Join

•Reduce Side Join

•Reference:
http://code.google.com/p/try-hadoop-mapreduce-java/source/browse/trunk/try-mapreduce/src/main/java/jp/gr/java_conf/n3104/try_mapreduce/JoinWithDeptNameUsingReduceSideJoin.java

•Map Side Join

•Reference:
http://code.google.com/p/try-hadoop-mapreduce-java/source/browse/trunk/try-mapreduce/src/main/java/jp/gr/java_conf/n3104/try_mapreduce/JoinWithDeptNameUsingReduceSideJoin.java


Relational Join

•Memory-Backed Join

•Reference:
http://d.hatena.ne.jp/wyukawa/20110818/1313670105


Bibliography
References (other than the book)

•http://www.slideshare.net/nokuno/hadoopreading05-data-intensive3

•http://d.hatena.ne.jp/wyukawa/20111002/1317550750


Thank you for your attention.

