VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
1. VENUS: Vertex-Centric Streamlined Graph Computation on a Single PC
Jiefeng Cheng¹, Qin Liu², Zhenguo Li¹, Wei Fan¹, John C.S. Lui², Cheng He¹
¹ Huawei Noah’s Ark Lab
² The Chinese University of Hong Kong
ICDE’15
2. Graph is everywhere
We have large graphs
• Web graph
• Social graph
• User-movie ratings graph
• …
Graph Computation
• PageRank
• Community detection
• ALS for collaborative filtering
• …
3. Mining from Big Graphs: two feasible ways
Distributed systems
• Pregel [SIGMOD’10], GraphLab [OSDI’12], GraphX [OSDI’14], Giraph, ...
• Require an expensive cluster, complex setup, and writing distributed programs
Single-machine system
• Disk: GraphChi[OSDI’12], X-Stream[SOSP’13]
• SSD: TurboGraph[KDD’13], FlashGraph[FAST’15]
• Computation time close to distributed systems
  • PageRank on Twitter graph (41M nodes, 1.4B edges)
  • Spark: 8.1 min with 50 machines (each with 2 CPUs, 7.5 GB RAM) [Stanton KDD’12]
  • VENUS: 8 min on a single machine with a quad-core CPU and 16 GB RAM
• Affordable, easy to program/debug
4. Existing Systems
Vertex-centric programming model: popularized by Pregel / GraphLab / GraphChi
• Each vertex updates itself based on its neighborhood
GraphChi
• Updated data on each vertex must be propagated to its neighbors through disk
• Extensive disk I/O
X-Stream
• Different API: edge-centric programming
• Less expressive; common algorithms must be re-implemented
• Also uses disk to propagate updates
5. Our Contributions
Design and implement a disk-based system, VENUS
• A new vertex-centric streamlined processing model
• Separates mutable vertex data from immutable edge data
• Reads and writes less data than other systems
Evaluation on large graphs
• Outperforms GraphChi and X-Stream
• Verifies that our design reduces data access
6. Vertex-Centric Programming
Consider GraphChi
for each iteration
    for each vertex v
        update(v)

void update(v)
    fetch data from each in-edge
    update data on v
    spread data to each out-edge

[Figure: vertex v, labeled "Duplicated data"]
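The loop above can be sketched in Python as a toy, in-memory PageRank in which values live on edges, as in the GraphChi-style model described here. This is only an illustration under stated assumptions: the function name, the edge list and the `edge_val` dictionary are hypothetical and are not GraphChi's API. It shows where the duplicated data comes from: every out-edge of a vertex stores the same value.

```python
from collections import defaultdict


def edge_value_pagerank(edges, num_iters=10, damping=0.85):
    """Toy model: each edge carries a value, vertices talk through edges.

    Assumes `edges` is a list of distinct (src, dst) pairs and that every
    vertex has at least one out-edge (no dangling vertices).
    """
    vertices = sorted({u for u, _ in edges} | {v for _, v in edges})
    in_edges, out_edges = defaultdict(list), defaultdict(list)
    for e in edges:
        out_edges[e[0]].append(e)
        in_edges[e[1]].append(e)

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    # Each out-edge of u stores rank[u] / out_degree(u): the same number
    # is written once per out-edge, which is the duplicated data.
    edge_val = {e: rank[e[0]] / len(out_edges[e[0]]) for e in edges}

    for _ in range(num_iters):              # for each iteration
        for v in vertices:                  # for each vertex v: update(v)
            # fetch data from each in-edge
            incoming = sum(edge_val[e] for e in in_edges[v])
            # update data on v
            rank[v] = (1.0 - damping) / n + damping * incoming
            # spread data to each out-edge
            for e in out_edges[v]:
                edge_val[e] = rank[v] / len(out_edges[v])
    return rank, edge_val


rank, edge_val = edge_value_pagerank([(1, 2), (1, 3), (2, 3), (3, 1)], num_iters=200)
```

In a disk-based system the `edge_val` entries are what has to be read and written back on every iteration, one copy per edge.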
7. Vertex-Centric Programming
VENUS:
• Only store mutable values on vertices
Pros
• Less data access
• Enables "streamlined" processing
Cons
• Limited expressiveness
void update(v)
    fetch data from each in-neighbor (was: each in-edge)
    update data on v
    (no need to spread data to each out-edge)

[Figure: vertex v and an in-neighbor]
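A minimal Python sketch of the vertex-only variant: mutable values are stored once per vertex, and update(v) reads the values of v's in-neighbors directly. The function and variable names are hypothetical and this is not the VENUS implementation; it only illustrates why there is nothing to spread to out-edges.

```python
from collections import defaultdict


def vertex_value_pagerank(edges, num_iters=10, damping=0.85):
    """Toy model: one mutable value per vertex, edges are immutable.

    Assumes every vertex has at least one out-edge (no dangling vertices).
    """
    vertices = sorted({u for u, _ in edges} | {v for _, v in edges})
    in_nbrs = defaultdict(list)
    out_deg = defaultdict(int)
    for u, v in edges:
        in_nbrs[v].append(u)
        out_deg[u] += 1

    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}   # the only mutable state

    for _ in range(num_iters):
        for v in vertices:
            # fetch data from each in-neighbor (read-only access)
            incoming = sum(rank[u] / out_deg[u] for u in in_nbrs[v])
            # update data on v; nothing is written to edges
            rank[v] = (1.0 - damping) / n + damping * incoming
    return rank


rank = vertex_value_pagerank([(1, 2), (1, 3), (2, 3), (3, 1)], num_iters=200)
```

The trade-off named on the slide shows up here: an algorithm that needs a distinct mutable value per edge does not fit this shape.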
9. VENUS Architecture
Disk storage (offline)
• Sharding
• Separation of edge data and vertex data
Computing model (online)
• Load edge data sequentially
• Execute the update function on each vertex
• How to load vertex data and propagate updates
10. Sharding
Graph cannot fit in RAM?
• Split the graph into shards
Each shard corresponds to an interval of vertices:
• G-shard: immutable structure of graph
• In-edges of nodes in the interval
• V-shard: mutable vertex values
• Vertex values of all vertices in the shard
Structure table: all g-shards
Value table: all vertex data
[Figure: value table with a "Vertex ID" row (1 to 12) and a "Data" row]
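The sharding step can be sketched as follows. The sketch assumes that vertex IDs are contiguous integers, that a g-shard holds the in-edges of its interval sorted by destination, and that a v-shard lists the interval's vertices together with the sources of those in-edges. The last point is an inference from the merge-join example later in the deck, not a definition given on this slide, and `build_shards` is a hypothetical helper name.

```python
def build_shards(edges, intervals):
    """Split an edge list into per-interval g-shards and v-shards.

    edges:     list of (src, dst) pairs
    intervals: list of inclusive (lo, hi) vertex-ID ranges
    """
    shards = []
    for lo, hi in intervals:
        # G-shard: in-edges of the vertices in [lo, hi], sorted by
        # destination so each vertex's in-edges are stored together.
        g_shard = sorted(
            (e for e in edges if lo <= e[1] <= hi),
            key=lambda e: (e[1], e[0]),
        )
        # V-shard: every vertex whose value is needed to process the
        # g-shard (interval vertices plus their in-neighbors).
        v_shard = sorted(set(range(lo, hi + 1)) | {src for src, _ in g_shard})
        shards.append({"interval": (lo, hi), "g_shard": g_shard, "v_shard": v_shard})
    return shards


edges = [(1, 2), (3, 1), (5, 2), (2, 3), (1, 4), (6, 4)]
shards = build_shards(edges, [(1, 2), (3, 4), (5, 6)])
```

For this toy graph the first shard has g-shard `[(3, 1), (1, 2), (5, 2)]` and v-shard `[1, 2, 3, 5]`.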
12. Vertex-Centric Streamlined Processing
V-shards are much smaller than g-shards
• Load each v-shard entirely into memory
Scan each g-shard sequentially
• Execute the update function in parallel
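One way to picture the processing loop is the sketch below: for each shard, the small v-shard is loaded into memory, the g-shard is scanned once in order, and the update function runs on each destination vertex as its in-edges stream by. Everything here is an assumption made for illustration: the shard layout, the in-memory `value_table` dictionary standing in for the on-disk table, and the sequential (not parallel) execution. The example update propagates the minimum label along edges.

```python
from itertools import groupby


def streamlined_iteration(shards, value_table, update):
    """Run one iteration over all shards.

    shards:      list of ((lo, hi), g_shard); g_shard is a list of
                 (src, dst) in-edges sorted by dst
    value_table: dict vertex_id -> value (stands in for the disk table)
    update:      function(own_value, in_neighbor_values) -> new value
    """
    for (lo, hi), g_shard in shards:
        # Load the whole v-shard into memory.
        ids = set(range(lo, hi + 1)) | {src for src, _ in g_shard}
        v_shard = {vid: value_table[vid] for vid in ids}

        # Scan the g-shard sequentially; edges of one destination are
        # adjacent, so each vertex is updated as its edges stream by.
        for dst, group in groupby(g_shard, key=lambda e: e[1]):
            in_values = [v_shard[src] for src, _ in group]
            v_shard[dst] = update(v_shard[dst], in_values)

        # Write back only the values of the shard's own interval.
        for vid in range(lo, hi + 1):
            value_table[vid] = v_shard[vid]
    return value_table


def min_label(own, in_values):
    return min([own] + in_values)


shards = [((1, 2), [(1, 2)]), ((3, 4), [(2, 3), (3, 4)])]
labels = streamlined_iteration(shards, {1: 1, 2: 2, 3: 3, 4: 4}, min_label)
```

Because the sketch updates the loaded v-shard in place, a vertex may see a value that was refreshed earlier in the same scan.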
14. Load and Update v-shards
Two I/O efficient algorithms
• Algorithm 1: Extension of PSW in GraphChi (skip)
• Algorithm 2: Merge-Join
• Load: merge-join between value table and v-shard
• Update: write the updated values of interval [1,4] back to the value table
Use value buffer to cache value table
[Figure: merge-join example]
Value table on disk: ID 1 to 12, each with Data
Vertices in v-shard 1 on disk: ID 1 2 3 4 6 7 9 10
Loaded v-shard 1: ID 1 2 3 4 6 7 9 10, each with Data
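The load and update steps of the merge-join can be sketched with two sorted sequences and one sequential pass. The data values below are made up for illustration, the helper names are hypothetical, and plain Python lists stand in for the on-disk value table and the value buffer.

```python
def merge_join_load(value_table, v_shard_ids):
    """Load: join the ID-sorted value table with the sorted v-shard IDs.

    value_table: list of (vertex_id, value) sorted by vertex_id
    v_shard_ids: sorted list of vertex IDs in the v-shard
    """
    loaded = []
    i = j = 0
    while i < len(value_table) and j < len(v_shard_ids):
        vid, val = value_table[i]
        if vid == v_shard_ids[j]:       # match: keep the value
            loaded.append((vid, val))
            i += 1
            j += 1
        elif vid < v_shard_ids[j]:      # vertex not in this v-shard
            i += 1
        else:                           # ID missing from the table
            j += 1
    return loaded


def write_back(value_table, v_shard, lo, hi):
    """Update: copy values of the interval [lo, hi] back to the table."""
    updated = {vid: val for vid, val in v_shard if lo <= vid <= hi}
    return [(vid, updated.get(vid, val)) for vid, val in value_table]


# Hypothetical data: 12 vertices, value = 10 * ID.
value_table = [(k, 10 * k) for k in range(1, 13)]
v_shard_1 = merge_join_load(value_table, [1, 2, 3, 4, 6, 7, 9, 10])
```

Only vertices in the interval are written back; the in-neighbors from other intervals (6, 7, 9, 10 in the example) are read-only for this shard.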
15. Evaluation of VENUS
Setup: a commodity PC
• quad-core 3.4GHz CPU
• 16GB RAM and 4TB hard disk
Main competitors:
• GraphChi and X-Stream
Applications:
• PageRank
• WCC: weakly connected components
• CD: community detection
• ALS: alternating least squares for collaborative filtering
• Shortest path, label propagation, etc.
18. Applications: WCC, CD, ALS
CD could not be implemented on X-Stream because of its edge-centric programming model
19. Web-Scale Graph
Clueweb12: web scale graph
• 978 million nodes, 42.5 billion edges
• 402 GB on disk
• 2 iterations of PageRank
Computation time
• GraphChi: 4.3 hours
• X-Stream: 7.4 hours
• VENUS-I: 2 hours
• VENUS-II: 1.8 hours
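To put the reported times on one scale, the short calculation below expresses each system's time as a multiple of the VENUS-II time. It uses only the numbers listed above; the variable names are arbitrary.

```python
# Reported computation times (hours) for 2 PageRank iterations on Clueweb12.
hours = {"GraphChi": 4.3, "X-Stream": 7.4, "VENUS-I": 2.0, "VENUS-II": 1.8}

baseline = hours["VENUS-II"]
ratio = {name: t / baseline for name, t in hours.items()}

for name, r in ratio.items():
    print(f"{name}: {r:.2f}x the VENUS-II time")
```

GraphChi takes about 2.4 times and X-Stream about 4.1 times as long as VENUS-II on this workload.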
20. Conclusion
Presented VENUS, a disk-based graph computation system
Its design of graph storage and execution reduces data access and I/O
Evaluations show that it outperforms GraphChi and X-Stream
VENUS can also handle billion-scale problems