Latest Trends in Distributed Graph Processing with Spark GraphFrames and openCypher

加嵜長門, Big Data Department
March 8, 2016

About the author
• 加嵜長門
• At DMM.com Labo since April 2014
  • Building the Hadoop platform
  • Developing recommendations with Spark MLlib and GraphX
• Favorite languages
  • SQL
  • Cypher

What is GraphFrames?
• GraphFrames
  • http://graphframes.github.io/
  • An Apache Spark package for distributed graph processing
  • Integrates Spark GraphX with DataFrames (Spark SQL)
  • Released by Databricks on March 3, 2016

• Spark Summit East 2016
• 2016/2/18
https://spark-summit.org/east-2016/events/graphframes-graph-queries-in-spark-sql/

• Spark package
• 2016/2/25
http://spark-packages.org/package/graphframes/graphframes

• Introducing GraphFrames
• 2016/3/3
https://databricks.com/blog/2016/03/03/introducing-graphframes.html

GraphFrames features
• Graph search with openCypher
• Graph processing with Pregel
• Distributed processing

• Distributed processing

Graph search with openCypher
• Graph analytics and graph search
Source: http://www.slideshare.net/SparkSummit/graphframes-graph-queries-in-spark-sql-by-ankur-dave

openCypher
• An open-source graph query language
• Derived from Neo4j's Cypher
• Allows declarative, SQL-like queries
MATCH (cypher:QueryLanguage)-[:QUERIES]->(graphs)
MATCH (cypher)<-[:USES]-(u:User) WHERE u.name IN ['Oracle', 'Apache Spark', 'Tableau', 'Structr']
MATCH (openCypher)-[:MAKES_AVAILABLE]->(cypher)
RETURN cypher.attributes
-----------
['awesome',…]
http://www.opencypher.org/

Trying GraphFrames
• Usage
  • As with Spark, APIs are available for Scala, Java, and Python
• Installation
  • Try it interactively in the Spark shell
  • Use build.sbt

• Try it interactively in the Spark shell
  • Supports Spark 1.4 and later
  • The latest version is recommended to take full advantage of DataFrames
# Download Spark
$ wget http://ftp.jaist.ac.jp/pub/apache/spark/spark-1.6.0/spark-1.6.0-bin-hadoop2.6.tgz
$ tar xzvf spark-1.6.0-bin-hadoop2.6.tgz
# Start spark-shell with the graphframes package
$ spark-1.6.0-bin-hadoop2.6/bin/spark-shell --packages graphframes:graphframes:0.1.0-spark1.6

• Use build.sbt
resolvers += "Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.0",
  "org.apache.spark" %% "spark-sql" % "1.6.0",
  "org.apache.spark" %% "spark-graphx" % "1.6.0",
  "graphframes" % "graphframes" % "0.1.0-spark1.6"
)

GraphFrames: item recommendation example
// Import the graphframes package
scala> import org.graphframes._
import org.graphframes._
// Create the DataFrame that holds the vertices
scala> val v = sqlContext.createDataFrame(List(
     |   (0L, "user", "u1"),
     |   (1L, "user", "u2"),
     |   (2L, "item", "i1"),
     |   (3L, "item", "i2"),
     |   (4L, "item", "i3"),
     |   (5L, "item", "i4")
     | )).toDF("id", "type", "name")
v: org.apache.spark.sql.DataFrame = [id: bigint, type: string, name: string]
[Figure: user vertices u1, u2 and item vertices i1, i2, i3, i4]

// Create the DataFrame that holds the edges
scala> val e = sqlContext.createDataFrame(List(
     |   (0L, 2L, "purchase"),
     |   (1L, 5L, "purchase")
     | )).toDF("src", "dst", "type")
e: org.apache.spark.sql.DataFrame = [src: bigint, dst: bigint, type: string]
// Create the GraphFrame
scala> val g = GraphFrame(v, e)
g: org.graphframes.GraphFrame = GraphFrame(v:[id: bigint, type: string, name: string],
e:[src: bigint, dst: bigint, type: string])
[Figure: purchase-log edges from the users u1, u2 to the items i1, i2, i3, i4]

// Example query for recommended items
scala> g.find(
     |   " (user1)-[]->(item1); (user2)-[]->(item1);" +
     |   " (user2)-[]->(item2); !(user1)-[]->(item2)"
     | ).groupBy(
     |   "user1.name", "item2.name"
     | ).count().show()
name  name  count
u1    i4    2
u2    i1    2
[Figure: find users who purchased a common item, then recommend the items that the other user has not purchased yet]
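The motif above can be read as a join over the edge list: two users share item1, user2 also has item2, and user1 does not. The following is a minimal sketch in plain Scala (no Spark), not the GraphFrames implementation. The names `MotifSketch`, `purchases`, and `recommend` are illustrative, and the purchase log is a hypothetical one (u1 bought i1, i2, i3; u2 bought i2, i3, i4) chosen because it is consistent with the counts shown above; the two-edge log created earlier contains no co-purchases.

```scala
object MotifSketch {
  // Hypothetical purchase log (user -> item); an assumption, not the slides' edge list
  val purchases: Set[(String, String)] = Set(
    "u1" -> "i1", "u1" -> "i2", "u1" -> "i3",
    "u2" -> "i2", "u2" -> "i3", "u2" -> "i4"
  )

  // Evaluates (user1)->(item1); (user2)->(item1); (user2)->(item2); !(user1)->(item2)
  // and counts the matches per (user1, item2) pair
  def recommend(edges: Set[(String, String)]): Map[(String, String), Int] = {
    val log = edges.toSeq
    val rows = for {
      (user1, item1) <- log
      (user2, shared) <- log if shared == item1   // user2 bought the same item1
      (buyer, item2) <- log if buyer == user2     // user2 also bought item2
      if !edges.contains((user1, item2))          // user1 has not bought item2
    } yield (user1, item2)
    rows.groupBy(identity).map { case (pair, hits) => pair -> hits.size }
  }

  def main(args: Array[String]): Unit =
    recommend(purchases).foreach { case ((user, item), n) => println(s"$user $item $n") }
}
```

With this assumed log, the sketch yields i4 for u1 and i1 for u2, each supported by two shared items.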

• Distributed processing

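BSP, Pregel, Graph
[Figure: relationships among BSP, Pregel, Apache Hama, Open Graph, Graph Search, and Knowledge Graph, with arrows labeled "graph-specialized", "developed", "implemented", "used", "inherited", and "influenced"]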

Bulk synchronous parallel (BSP)

[Figure: one BSP superstep, consisting of concurrent computation, communication, and barrier synchronisation]

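Question: how to measure the distance between A and C with Pregel
[Figure: vertices A, B, C in a chain; edge AB has length ab, edge BC has length bc, and the distance from A to C is ab + bc]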

Question: how to measure the distance between A and C with Pregel
[Figure: chain A, B, C with edge lengths ab and bc; how can ab + bc be computed?]

A1. Iter=1, send message
[Figure: chain A, B, C with edge lengths ab and bc; the message ab is sent from A to B]

A1. Iter=1, vertex program
[Figure: chain A, B, C with edge lengths ab and bc; B holds the value ab]

A1. Iter=2, send message
[Figure: chain A, B, C with edge lengths ab and bc; B holds ab, and the message ab + bc is sent from B to C]

A1. Iter=2, vertex program
[Figure: chain A, B, C with edge lengths ab and bc; B holds ab and C holds ab + bc]
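The two iterations above can be simulated without Spark. The following is a minimal sketch in plain Scala, not the GraphX implementation: each pass of the loop is one superstep made of a send-message phase, a merge of the messages per destination, and a vertex program that keeps the smaller distance. The names `PregelSketch` and `shortestDistances`, and the example lengths ab = 2.0 and bc = 3.0, are assumptions for illustration.

```scala
object PregelSketch {
  final case class Edge(src: String, dst: String, length: Double)

  def shortestDistances(vertices: Seq[String], edges: Seq[Edge],
                        source: String, maxIter: Int = 10): Map[String, Double] = {
    // Initial state: 0 at the source, infinity everywhere else
    var dist: Map[String, Double] =
      vertices.map(v => v -> (if (v == source) 0.0 else Double.PositiveInfinity)).toMap
    var active = true
    var iter = 0
    while (active && iter < maxIter) {
      iter += 1
      // Send message: an edge offers dist(src) + length when that improves dst
      val messages = edges.collect {
        case Edge(s, d, len) if dist(s) + len < dist(d) => d -> (dist(s) + len)
      }
      // Merge messages: keep the minimum per destination vertex
      val merged = messages.groupBy(_._1).map { case (v, ms) => v -> ms.map(_._2).min }
      // Vertex program: update the stored distance; this ends the superstep
      dist = dist ++ merged.map { case (v, m) => v -> math.min(dist(v), m) }
      active = merged.nonEmpty // halt when no vertex received a message
    }
    dist
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq(Edge("A", "B", 2.0), Edge("B", "C", 3.0)) // ab = 2.0, bc = 3.0 (assumed)
    println(shortestDistances(Seq("A", "B", "C"), edges, "A"))
  }
}
```

In this sketch B receives ab in the first superstep and C receives ab + bc in the second, matching the walk-through; a third superstep sends nothing and the loop stops.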

GraphX Pregel API

• Distributed processing

Data structure of GraphFrames (GraphX)
• Distributed graph
http://spark.apache.org/docs/latest/graphx-programming-guide.html

Partition Strategy
• Degree: 10000
• Number of partitions: 100
[Figure: vertex Vn is connected to V1, V2, …, V10000; which of Partition 1, Partition 2, …, Partition 100 should each edge be placed in?]

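Partition Strategy
• RandomVertexCut
  • Hash(src, dst)
[Figure: the edges (Vn, V1), (Vn, V2), …, (Vn, V10000) are scattered across Partition 1, Partition 2, …, Partition 100]
Each partition holds 100 of these edges on average, so reading the edges of Vn touches every partition and I/O efficiency is poor.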
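The scattering can be checked with a small simulation. The following is a minimal sketch in plain Scala, assuming a stand-in hash of the (src, dst) pair rather than GraphX's own code; `RandomVertexCutSketch`, `partitionOf`, and `hubEdgeCounts` are illustrative names.

```scala
object RandomVertexCutSketch {
  // Stand-in for Hash(src, dst): hash the pair, then take a non-negative modulus
  def partitionOf(src: Long, dst: Long, numParts: Int): Int =
    java.lang.Math.floorMod((src, dst).hashCode, numParts)

  // Partition id -> number of the hub's out-edges placed in that partition
  def hubEdgeCounts(hub: Long, degree: Int, numParts: Int): Map[Int, Int] =
    (1L to degree.toLong)
      .map(dst => partitionOf(hub, dst, numParts))
      .groupBy(identity)
      .map { case (part, edges) => part -> edges.size }

  def main(args: Array[String]): Unit = {
    val counts = hubEdgeCounts(hub = 0L, degree = 10000, numParts = 100)
    println(s"partitions touched: ${counts.size}")
    println(s"average edges per touched partition: ${10000.0 / counts.size}")
  }
}
```

Because the destination id takes part in the hash, the edges of a single high-degree vertex end up spread over most or all partitions.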

Partition Strategy
• EdgePartition1D
  • Hash(src)
[Figure: the edges (Vn, V1), (Vn, V2), …, (Vn, V10000) all land in the same one of Partition 1, Partition 2, …, Partition 100]
The partition is determined by src alone, so the partitions that incur I/O can be limited.

Partition Strategy
• EdgePartition1D
  • Hash(src)
[Figure: the edges (Vn, V1), (Vn, V2), …, (Vn, V10000) and Partition 1, Partition 2, …, Partition 100]
The partition is determined by src alone, so the benefit applies only in the forward direction of an edge.
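The asymmetry between the two directions can be seen in a small simulation. The following is a minimal sketch in plain Scala, modeled on hashing src alone (the vertex id is multiplied by a large prime before the modulus); it is a simplified illustration rather than a copy of GraphX, and `EdgePartition1DSketch`, `outSpan`, and `inSpan` are illustrative names.

```scala
object EdgePartition1DSketch {
  // Large prime used to mix vertex ids before taking the modulus
  private val mixingPrime = 1125899906842597L

  // Only src decides the partition; dst is ignored
  def partitionOf(src: Long, dst: Long, numParts: Int): Int =
    (math.abs(src * mixingPrime) % numParts).toInt

  // Partitions holding edges hub -> neighbour (forward direction)
  def outSpan(hub: Long, neighbours: Seq[Long], numParts: Int): Set[Int] =
    neighbours.map(dst => partitionOf(hub, dst, numParts)).toSet

  // Partitions holding edges neighbour -> hub (reverse direction)
  def inSpan(hub: Long, neighbours: Seq[Long], numParts: Int): Set[Int] =
    neighbours.map(src => partitionOf(src, hub, numParts)).toSet

  def main(args: Array[String]): Unit = {
    val neighbours = 1L to 10000L
    println(s"out-edges touch ${outSpan(0L, neighbours, 100).size} partition(s)")
    println(s"in-edges touch ${inSpan(0L, neighbours, 100).size} partition(s)")
  }
}
```

All out-edges of the hub share one partition, while its in-edges are placed by their 10000 different sources and therefore spread across the partitions.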

Partition Strategy
• EdgePartition2D
[Figure: the adjacency matrix is divided among Partition 1 to Partition 100; the entries for the edges between Vn and V1, V2, …, V10000 fall in 10 of the 100 partitions]

Partition Strategy
• EdgePartition2D
[Figure: the same matrix with Vn, V1, V2, …, V10000 on both axes; the entries for Vn along the other axis also fall in 10 of the 100 partitions]

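Partition Strategy
• EdgePartition2D
[Figure: the matrix with Vn, V1, V2, …, V10000 and further vertices Vi, Vj, Vk on both axes; the row and the column of Vn are highlighted]
The edges of Vn span at most 20 of the 100 partitions (20%). With 10000 partitions the bound is 200/10000, or 2%.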
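The bound can be reproduced with a small simulation. The following is a minimal sketch in plain Scala that assumes the number of partitions is a perfect square and lays the partitions out as a square grid, with src choosing the column and dst choosing the row; it is a simplified illustration rather than a copy of GraphX, and `EdgePartition2DSketch` and `touchedPartitions` are illustrative names.

```scala
object EdgePartition2DSketch {
  // Large prime used to mix vertex ids before taking the modulus
  private val mixingPrime = 1125899906842597L

  // Assumption: numParts is a perfect square (100 -> 10 x 10 grid)
  def partitionOf(src: Long, dst: Long, numParts: Int): Int = {
    val side = math.sqrt(numParts.toDouble).round.toInt
    require(side * side == numParts, "this sketch only handles perfect squares")
    val col = (math.abs(src * mixingPrime) % side).toInt // chosen by src
    val row = (math.abs(dst * mixingPrime) % side).toInt // chosen by dst
    col * side + row
  }

  // Partitions holding any edge that touches the hub, in either direction
  def touchedPartitions(hub: Long, neighbours: Seq[Long], numParts: Int): Set[Int] = {
    val out = neighbours.map(dst => partitionOf(hub, dst, numParts)) // one grid column
    val in = neighbours.map(src => partitionOf(src, hub, numParts))  // one grid row
    (out ++ in).toSet
  }

  def main(args: Array[String]): Unit = {
    val neighbours = 1L to 10000L
    println(s"100 partitions: ${touchedPartitions(0L, neighbours, 100).size} touched")
    println(s"10000 partitions: ${touchedPartitions(0L, neighbours, 10000).size} touched")
  }
}
```

The out-edges of the hub stay in one column and its in-edges in one row, so the number of touched partitions is bounded by twice the grid side: at most 20 for 100 partitions and at most 200 for 10000.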

GraphFrames vs. Neo4j

GraphFrames × Spark 2.0
Source: http://www.slideshare.net/databricks/2016-spark-summit-east-keynote-matei-zaharia

References
• 複雑ネットワーク―基礎から応用まで (Complex Networks: From Fundamentals to Applications), 増田直紀, 今野紀雄. http://www.amazon.co.jp/dp/4764903636
• Cypherクエリー言語の事例で学ぶグラフデータベースNeo4j (Learning the Neo4j Graph Database through Cypher Query Language Examples), 李昌桓
• Neo4j Webinar, http://neo4j.com/webinars/ (Bootstrapping Recommendations with Neo4j; Fraud Detection with Neo4j; Natural Language Processing with Graphs; etc.)
• Apache Spark Graph Processing, Rindra Ramamonjison
• Graph Mining: Laws, Tools, and Case Studies, Deepayan Chakrabarti, Christos Faloutsos. http://www.amazon.com/dp/B00AF2CVE6
