GraSU: A Fast Graph Update Library for FPGA-based Dynamic
Graph Processing
Qinggang Wang, Long Zheng, Yu Huang, Pengcheng Yao, Chuangyi Gui, Xiaofei Liao, Hai Jin,
Wenbin Jiang, and Fubing Mao
National Engineering Research Center for Big Data Technology and System/Service Computing Technology and System
Lab/Cluster and Grid Computing Lab, Huazhong University of Science and Technology, China
{qgwang,longzh,yuh,pcyao,chygui,xfliao,hjin,wenbinjiang,fbmao}@hust.edu.cn
ABSTRACT
Existing FPGA-based graph accelerators, typically designed for
static graphs, rarely handle dynamic graphs that often involve sub-
stantial graph updates (e.g., edge/node insertion and deletion) over
time. In this paper, we aim to fill this gap. The key innovation of this
work is to build an FPGA-based dynamic graph accelerator easily
from any off-the-shelf static graph accelerator with minimal hard-
ware engineering efforts (rather than from scratch). We observe spatial similarity in dynamic graph updates, in the sense that most graph updates involve only a small fraction of vertices.
We therefore propose an FPGA library, called GraSU, to exploit
spatial similarity for fast graph updates. GraSU uses a differential
data management, which retains the high-value data (that will be
frequently accessed) in the specialized on-chip UltraRAM while
the overwhelming majority of low-value ones reside in the off-chip
memory. Thus, GraSU can transform most of off-chip communica-
tions arising in dynamic graph updates into fast on-chip memory
accesses. Our experience shows that GraSU can be easily integrated into existing state-of-the-art static graph accelerators with only 11 lines of code modifications. Our implementation atop AccuGraph using a Xilinx Alveo™ U250 board outperforms two state-of-the-art CPU-based dynamic graph systems, Stinger and Aspen, by an average of 34.24× and 4.42× in terms of update throughput, further improving overall efficiency by 9.80× and 3.07× on average.
CCS CONCEPTS
• Hardware → Reconfigurable logic and FPGAs.
KEYWORDS
Accelerators; Dynamic Graph; Library
ACM Reference Format:
Qinggang Wang, Long Zheng, Yu Huang, Pengcheng Yao, Chuangyi Gui, Xi-
aofei Liao, Hai Jin, Wenbin Jiang, and Fubing Mao. 2021. GraSU: A Fast Graph
Update Library for FPGA-based Dynamic Graph Processing. In Proceedings
of the 2021 ACM/SIGDA International Symposium on Field Programmable
Gate Arrays (FPGA ’21), February 28-March 2, 2021, Virtual Event, USA. ACM,
New York, NY, USA, 11 pages. https://doi.org/10.1145/3431920.3439288
1 INTRODUCTION
Graph processing has been widely used for relationship analy-
sis in a large variety of domains, such as social network analyt-
ics [41], financial fraud detection [34], and coronavirus transmis-
sion tracking [47]. In recent years, hardware acceleration has been
demonstrated to effectively boost the performance of graph pro-
cessing [19, 20, 22, 33]. In particular, FPGA can be regarded as one
of most promising hardware platforms for graph processing due to
its fine-grained parallelism, low power consumption, and flexible
configurability, showing impressive results [4, 7, 10–12, 15, 24, 31,
32, 39, 44, 50–52, 55].
Unfortunately, most existing FPGA-based graph accelerators are designed for handling static graphs [19]. In actuality,
real-world graphs are often subject to change over time [27, 38],
a.k.a. dynamic graphs. For instance, the Twitter graph may gain or lose edges at a rate of 6,000 tweets per second [36]. Processing a dynamic graph often relies on an iterative process with two basic kernels:
graph update and graph computation [1, 5]. Graph update aims to
generate a new graph by modifying a graph snapshot with addition
and deletion operations upon a vertex or an edge. Graph computa-
tion (which can be a graph algorithm or an ad-hoc query) is then
performed based on this new graph to extract useful information.
In this context, graph computation has been well studied in existing
FPGA-based graph processing accelerators [10, 11, 39, 44, 50, 55],
while graph update remains significantly understudied [1, 19].
Prior studies show that graph update can function as importantly
as graph computation [1]. An inefficient graph update may lead to unexpected outcomes [38]. A typical example is financial fraud
detection, which aims to find fake transactions using the ring anal-
ysis on evolving graphs. In this case, a slow-rate update may cause
the ring detection to operate upon an outdated graph such that the
analysis results can be inaccurate or even useless [34]. Although
earlier studies on traditional architectures [6, 13, 14, 16, 29, 38, 46]
adopt novel graph representations to improve the concurrency of
dynamic graph updates, their underlying efficiency is still limited
due to the inflexibility of traditional memory subsystems, yielding
excessive off-chip communications [1, 19]. In this paper, we focus
on filling the gap between graph update and graph computation
using FPGA. We expect to develop a specialized FPGA-based graph
update library that can be easily integrated into any off-the-shelf
static graph accelerator so as to support dynamic graph processing
with minimal hardware engineering efforts.
However, performing graph updates efficiently on FPGAs re-
mains challenging. For a typical FPGA-based graph accelerator
architecture (as shown in Figure 1(a)), vertices are often stored
Figure 1: Handling dynamic graph updates under existing static graph accelerators with (a) the naive update scheme and (b) the GraSU library. The edge data associated with different vertices are marked in different colors.
in the on-chip BRAMs to reduce substantial random access over-
heads of vertex data, while edges, massive in quantity, have to
be loaded in a streaming-apply fashion from the off-chip mem-
ory [10, 11, 39, 44, 50, 55]. In this context, a graph update operation
can be split into three basic steps. Consider an edge update. First,
the source vertex index of the to-be-updated edge is read. Its as-
sociated edge array will be loaded from the off-chip memory, and
stored into the registers attached to each processing element (PE).
Second, the to-be-updated edge is then inserted into (or deleted
from) the loaded edge array, which is finally written back to the
off-chip memory. In the real world, a large number of updates are often applied. The off-chip edge data may be repeatedly
and redundantly accessed by each separate PE, thus resulting in
expensive off-chip communication overheads (as discussed in §2.2).
In this paper, we focus on whether and how most of the off-chip communications arising in dynamic graph updates can be transformed into on-chip memory accesses. Note that we con-
sider only edge updates in this work since vertex updates can be
represented by a series of edge insertions and/or deletions [38].
We observe that dynamic graph updates operating upon real-
world graphs show significant spatial similarity for off-chip edge
data access, in the sense that most random off-chip memory re-
quests come from accessing the edges associated with a few ‘valu-
able' vertices (as discussed in §2.3). Consider the realistic dynamic graph wiki-talk-temporal [28]: most (>99.04%) of its edge updates are associated with only 5% of the vertices. This motivates us to reduce most of the off-chip communication overhead by applying differential data management, in which, for each batch of graph updates, the edges of high-value vertices reside in the specialized on-chip UltraRAM [48] while the low-value majority are stored in the off-chip memory. Note that the value of a vertex represents the number of edges updating it during an update batch; differential data management decides, based on this value priority, which data are stored on-chip and which off-chip. However, achieving this goal for accelerating graph updates on FPGA is still challenging. First, the data that are valuable for a given update batch change dynamically over time. An offline value measurement designed for static graphs may identify valuable data inaccurately [53]. Yet,
performing the exhaustive value computation on the fly is also
expensive. Second, the differential memory architecture makes data
addressing more complex. Both on-chip UltraRAM and off-chip
DRAM can store edge data. We must accurately know which data
Figure 2: The basic workflow of (a) static graph processing and (b) dynamic graph processing
is in which memory location in a space-efficient manner, which is
also difficult.
In this paper, we propose an FPGA library, called GraSU, for
fast graph updates. As shown in Figure 1(b), GraSU consists of an
incremental value measurer and a value-aware differential memory
manager to fully exploit spatial similarity of dynamic graph up-
dates. Considering that the value of a vertex differs significantly across update batches of a dynamic graph, we propose to quantify the value incrementally based on the update history, capturing important changes across batches so that detection accuracy improves dynamically. GraSU also overlaps incremental value measurement with normal graph computation, so its runtime overheads can be fully hidden. To
make a better tradeoff between space overhead and efficiency for
differential memory addressing, we present a value-aware data
management, which reduces unnecessary memory consumption
by leveraging a bitmap and implements the fast yet accurate data
addressing via bitmap-assisted address resolution.
This paper has the following main contributions:
• We observe spatial similarity arising in dynamic graph up-
dates on real-world graphs for improving memory efficiency
of dynamic graph processing.
• We present an FPGA library, namely GraSU, which uses a
differential data management to exploit spatial similarity for
fast graph updates. GraSU can be easily integrated into any
FPGA-based static graph accelerator with only a few lines
of code modifications so as to handle dynamic graphs.
• Our implementation on a Xilinx Alveo™ U250 card outper-
forms Stinger [14] and Aspen [13] by an average of 34.24×
and 4.42× in terms of update throughput, and further improves overall efficiency by 9.80× and 3.07× on average.
The rest of this paper is organized as follows. §2 introduces the
background and motivation. The overview of GraSU is presented
in §3. §4 and §5 elaborate the value-aware differential data man-
agement. §6 discusses the results. The related work is surveyed in
§7. §8 concludes the paper.
2 BACKGROUND AND MOTIVATION
We first review preliminaries of dynamic graph processing. We
then identify memory inefficiency of existing FPGA-based graph
accelerators for dynamic graphs, finally motivating our approach.
2.1 Dynamic Graph Processing
Figure 2 depicts the basic workflows of static and dynamic graph
processing, respectively. Static graph processing (Figure 2(a)) often
works on topology-fixed graphs that can be organized in different
graph representations, e.g., compressed sparse row (CSR) [50], com-
pressed sparse column (CSC) [44], and coordinate list (COO) [55].
Dynamic graph processing can be understood as performing a se-
ries of graph computations upon different graph versions that are
successively generated by a sequence of graph update batches (as
Figure 3: A simplified example for illustrating the PMA format and how it supports the CSR format. (a) An example graph. (b) CSR-based static graph format. (c) PMA-based dynamic graph format. 'S' is a sentinel entry indicating each vertex's edge range for maintaining the offset array. When a sentinel is changed, the corresponding entry in the offset array will be modified.
shown in Figure 2(b)). To increase the concurrency of graph updates,
earlier studies [6, 8, 13, 14, 16, 18, 21, 25, 29, 37, 38, 45, 46] present
various dynamic graph representations based on Packed Memory
Array (PMA) [38, 45], tree [6, 13], CSR variants [14, 16, 29, 37], ad-
jacency list (AL) [8, 25, 46], and hash table [18, 21]. Compared to the others, each of which applies to a specific graph format, PMA can flexibly adapt to different graph representations [38]. In this work, we
architect GraSU based on the PMA format for the purpose of sup-
porting a wide variety of FPGA-based static graph accelerators that
may adopt different underlying graph representations.
Figure 3 depicts an example graph with the PMA format and
how it supports the traditional CSR format. As shown in Figure 3(c),
the PMA format maintains sorted edges in a partially-consecutive
manner by leaving some gaps for possible edge updates where ‘𝑆’ is
a sentinel entry indicating each vertex’s edge range for maintaining
the offset array. Logically, the PMA separates the whole edge array
into a series of leaf segments and keeps an implicit tree for locating
the position of edge updates quickly. When a leaf segment becomes full or empty, all edges under its parent are redistributed to rebalance the edge array. If all gaps are exhausted, the root segment is doubled and rebalancing is re-invoked.
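To make the format concrete, the following C++ sketch models a PMA-style per-vertex edge store under the simplifications adopted in this paper (a segment holds edges of only one vertex; the implicit tree and gap redistribution are reduced to appending a pre-allocated segment). It is an illustrative software model, not GraSU's Verilog implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative software model of a PMA-style edge store. Assumptions:
// SEG_LEN matches the paper's segment length of 8; each segment holds
// edges of a single vertex; the implicit rebalancing tree is simplified
// to appending a fresh (pre-allocated) segment when all gaps are used.
constexpr std::size_t SEG_LEN = 8;

struct Segment {
    std::vector<uint32_t> edges;  // neighbor ids, kept sorted within the segment
    bool full() const { return edges.size() == SEG_LEN; }
};

struct PmaVertex {
    std::vector<Segment> segs;    // this vertex's leaf segments (with gaps)

    void insert(uint32_t neighbor) {
        for (auto& s : segs) {
            if (!s.full()) {      // use the first remaining gap
                s.edges.insert(
                    std::lower_bound(s.edges.begin(), s.edges.end(), neighbor),
                    neighbor);
                return;
            }
        }
        // All gaps exhausted: grow the space. A real PMA would double the
        // edge array and redistribute edges across the implicit tree.
        segs.push_back({});
        segs.back().edges.push_back(neighbor);
    }

    void erase(uint32_t neighbor) {
        for (auto& s : segs) {
            auto it = std::find(s.edges.begin(), s.edges.end(), neighbor);
            if (it != s.edges.end()) { s.edges.erase(it); return; }
        }
    }
};
```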
2.2 Off-Chip Communication Overheads
As described in Figure 1(a), in an effort to reduce the random vertex
access overheads, existing FPGA-based graph accelerators often
store vertex data in the BRAMs while edges reside in the off-chip
memory. In this setting, many edges may be repeatedly loaded from
the off-chip memory by different update operations. In addition,
these off-chip edge accesses closely depend on the sequence of the edges to be updated and are therefore essentially random, incurring excessive off-chip communication overheads that degrade the overall efficiency of graph updates.
To demonstrate this, we conduct a set of experiments to break
down the real computation time and off-chip communication time
of graph updates operating over AccuGraph [50]. Figure 4 shows
the results on five real-world dynamic graphs (i.e., sx-askubuntu
(AU), sx-superuser (SU), wiki-talk-temporal (WK), sx-stackoverflow
(SO), and soc-bitcoin (BC), more details as shown in Table 2) with a
varying number of edge updates. We see that communication over-
heads dominate the overall execution time of graph updates for all
dynamic graphs. In particular, communication overheads increase
Figure 4: Execution time breakdown of graph updates operating upon AccuGraph by applying different proportions (20%/40%/80%/100%) of edge updates on five real-world dynamic graphs. All results are normalized to the case of applying 20% edge updates.
Figure 5: An example illustrating the spatial similarity of edge updates ((a) edge updates; (b) graph data; (c) graph topology). The blue solid (dashed) lines indicate edge insertions (deletions).
significantly as edge update sizes increase. For example, when 20% of the edge updates are applied for WK, communication overheads account for 65.80% of the overall execution time. When this proportion increases to 100%, communication overheads rise to 83.26%, exerting more pressure on the off-chip memory bandwidth. Overall, communication overheads account for most (85.62% on average) of the overall execution time of dynamic graph updates.
2.3 Differential Data Management
In this work, we find that not all off-chip communications arising
in graph updates for real-world graphs are contributed equally. The
reason behind this is complex. The nature of “rich get richer” [26]
and the power-law degree feature1 of real-world graphs can help a
partial understanding. The "rich get richer" nature indicates that a
new edge update can be more likely operated on a high-degree ‘rich’
vertex. As a result, the minority of high-degree vertices get involved
by most graph updates. In a real scenario of online shopping, users
are more inclined to buy (i.e., edge update) those high-sale products
which are often in the minority among all products. In summary,
we have the following observation:
Observation: Graph updates on real-world graphs show significant
spatial similarity, indicating that most of the off-chip random
memory accesses root from requesting a few vertices’ edge data.
Figure 5 shows an example that helps to understand spatial similarity. The initial graph (Figure 5(c)) is modified by 10 edge update operations (Figure 5(a)). We can see that 8 out of the 10 edge updates are associated with only two vertices, v5 and v8. Figure 6 further investigates the percentage of edge updates involving different scales (1%~5%) of top vertices. All results are collected from an
¹Most vertices in a graph have few neighbors while a few have many [17].
Figure 6: Quantitative relationship between edge updates and vertices for five real-world dynamic graphs
offline trace analysis. Basically, we see that most edge updates operate on a few vertices. Consider SO: 58.28% of its edge updates operate upon the top-1% vertices. The percentage growth then gradually slows and saturates as the vertex ratio reaches 5%. Overall, 71.26%~99.04% of edge updates access only the top-5% vertices. Note that spatial similarity is not limited to power-law graphs. For USA-road [35], which has a relatively even degree distribution, 5% accident-prone vertices may frequently cause 66.74% of road congestion.
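The offline trace analysis behind Figure 6 is simple to reproduce. The sketch below (the trace layout, one source/destination pair per update, is our assumption) counts per-vertex update frequencies and reports what fraction of a trace touches the top-k% vertices.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Offline trace analysis: what fraction of edge updates touches the top-k%
// most frequently updated vertices? This mirrors the measurement behind
// Figure 6; the trace format ((src, dst) per update) is an assumption.
double top_k_coverage(const std::vector<std::pair<uint32_t, uint32_t>>& updates,
                      uint32_t num_vertices, double k) {
    std::vector<uint64_t> freq(num_vertices, 0);
    for (const auto& e : updates) ++freq[e.first];  // updates per source vertex

    std::sort(freq.begin(), freq.end(), std::greater<uint64_t>());
    const std::size_t top = static_cast<std::size_t>(k * num_vertices);
    uint64_t covered = 0;
    for (std::size_t i = 0; i < top; ++i) covered += freq[i];
    return static_cast<double>(covered) / static_cast<double>(updates.size());
}
// e.g., top_k_coverage(trace, |V|, 0.05) should approach 0.99 for WK.
```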
Spatial similarity motivates us to classify the entire edge data
into high-value data (if it is requested by many edge updates) and
low-value data (otherwise). As shown in Figure 5(b), the colored edges in the edge array are high-value data because they are associated with the frequently accessed v5 and v8. This further calls for a differential data management, in which high-value data reside in the on-chip memory while low-value data are stored in the off-chip memory. In this way, most of the random off-chip memory accesses arising in graph updates can be transformed into fast on-chip accesses.
However, realizing the differential data management on FPGA
remains challenging and needs to meet at least two requirements:
• Accuracy: We must accurately know the value of each vertex so as to place its associated edges into the appropriate memory device. However, unlike that of static graphs, the data value of dynamic graphs varies over time, making accurate and efficient measurement difficult.
• Space-Efficiency: In the differential memory architecture, both on-chip and off-chip memories hold a copy of edge data. This requires a new data addressing mechanism for locating the data. However, ensuring the space efficiency of data addressing is also difficult.
To address these issues, we propose GraSU that can exploit spa-
tial similarity effectively and efficiently.
3 GRASU OVERVIEW
In an effort to reduce excessive random off-chip communications
induced by redundant memory accesses arising in graph updates,
GraSU uses "value" to characterize the importance of data (i.e., high-accuracy value measurement) and treats data differentially (i.e., value-aware differential management), based on the key insight that not all off-chip accesses are created equal.
3.1 Architecture
Figure 7 shows the overall architecture of GraSU, which consists
of five components: dynamic graph storage, incremental value
Figure 7: The GraSU architecture: (a) the overall architecture of GraSU; (b) the architecture of the update handling logic; (c) the dynamic graph organization.
measurer, edge updates dispatcher, edge updates handling logic,
and value-aware memory manager.
Dynamic Graph Organization. As discussed in §2.1, we fol-
low the PMA representation [38, 45]. In GraSU, both on-chip and
off-chip memories can store edge data. A segment might otherwise contain edges with different value levels, which would make data organization extremely difficult. To avoid this, we enforce that each segment contains edges from only one vertex (Figure 7(c)). In addition, in the traditional PMA format, the edge array space is doubled whenever it becomes full; however, FPGAs currently do not support dynamic memory allocation effectively [49]. To achieve this functionality, we physically pre-allocate the off-chip memory into many spaces and logically use the segment space for doubling when necessary.
Incremental Value Measurer (IVM). The IVM module is re-
sponsible for quantifying the value of each vertex for graph updates,
and further notifying the value-aware memory manager (VMM) to
dispatch edges of high-value vertices into the on-chip UltraRAM
(❶). Since data values change dynamically, IVM adopts an incremental value measurement based on the graph update history to continually improve measurement accuracy. IVM is invoked every
time a batch of graph updates are completed (❻). Note that value
measurement overheads can be fully hidden behind normal graph
computations. More details are discussed in §4.
Edge Update Dispatcher (EUD). When high-value data reside
in the on-chip UltraRAM, EUD starts (❷). It reads a batch of edge updates from the off-chip memory and dispatches them to the appropriate graph update PEs in timestamp order (❸).
Update Handling Logic (UHL). The UHL module makes sure
that each edge update can be correctly inserted into or deleted from
the edge array. UHL is equipped with a three-stage pipeline: edge
read, edge update, and edge write (Figure 7(b)). The read stage loads
the requested data of the to-be-updated edge by sending a request
to the VMM (❹), discussed below. Afterwards, the update stage
performs the insertion or deletion operations. Finally, the write
stage writes back the updated edge data into the off-chip memory
(or the UltraRAM) through VMM.
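Functionally (abstracting away the hardware pipelining), one pass of the UHL behaves like the following sketch; the function names standing in for the VMM interactions are illustrative, not GraSU's actual module ports.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Functional model of UHL's three-stage pipeline (edge read, edge update,
// edge write). vmm_read/vmm_write are illustrative stand-ins for the VMM
// requests (steps 4 and 5 in Figure 7), not GraSU's real interfaces.
struct EdgeUpdate { uint32_t src, dst; bool is_insert; };

template <typename ReadFn, typename WriteFn>
void handle_update(const EdgeUpdate& u, ReadFn vmm_read, WriteFn vmm_write) {
    // Stage 1 (edge read): fetch the target segment via the VMM.
    std::vector<uint32_t> segment = vmm_read(u.src, u.dst);

    // Stage 2 (edge update): insert or delete within the loaded segment.
    if (u.is_insert) {
        segment.push_back(u.dst);  // simplified; real segments stay sorted
    } else {
        segment.erase(std::remove(segment.begin(), segment.end(), u.dst),
                      segment.end());
    }

    // Stage 3 (edge write): write the segment back through the VMM.
    vmm_write(u.src, segment);
}
```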
Table 1: Programming interfaces of GraSU

  Interface              Description
  GraSU_alloc_UltraRAM   UltraRAM allocation
  GraSU_DGS              Transform a static graph into the PMA format
  GraSU_Init_Value       Initialize the value of each vertex
  GraSU_LHD              Load the high-value data into the UltraRAM
  GraSU_Update_Start     Handle edge updates
  GraSU_WHD              Write high-value data back into off-chip RAM
  GraSU_Quantify_Value   Quantify the value of each vertex
  GraSU_Overlap          Overlap value measurement with graph computation
 1   GraSU_alloc_UltraRAM( );
 2   GraSU_DGS( );
 3   GraSU_Init_Value( );
 4   DynamicGraphProcessing( ){
 5     for( each update batch ){
 6       GraSU_LHD( update_batch_valid, LHD_valid );
 7       GraSU_Update_Start( LHD_valid, GUS_valid );
 8       GraSU_WHD( GUS_valid, WHD_valid );
 9       /* Overlap graph computation with value measurement */
10       GraSU_Overlap( WHD_valid ){
11         /* Notify AccuGraph to start graph computation */
12         AccuGraph( WHD_valid, computation_valid );
13         GraSU_Quantify_Value( WHD_valid, quantify_valid );
14       }
15       /* The signal indicates whether graph computation and value measurement are completed */
16       update_batch_valid = computation_valid & quantify_valid;
17     }
18   }
Figure 8: A uniform programming framework for illustrat-
ing how to integrate GraSU into an existing static graph ac-
celerator AccuGraph [50] for handling dynamic graphs.
Value-Aware Memory Manager (VMM). The VMM module
aims to locate the requested edge data in an accurate and efficient
manner. To make a good tradeoff between memory space over-
heads and data addressing efficiency in the differential memory
architecture, VMM adopts a bitmap-indexed structure to minimize
space consumption and uses a bitmap-assisted addressing resolu-
tion mechanism to enable fast yet accurate differential data accesses.
When receiving a read request (❹), which consists of the source
and destination vertex indices of a to-be-updated edge, VMM will
capture the edge array address of the source vertex as an initial
on-chip (or off-chip) address. Afterward, VMM locates the target
segment in the edge data and also loads the segment (as well as
its address) to the update-relevant registers coupled with UHL (❺).
When VMM receives a write request from UHL, the data residing
in the UHL-attached buffer will be written back according to the
address of the segment. More details are described in §5.
3.2 Programming Interfaces
Table 1 shows the programming interfaces of GraSU for graph up-
dates. With GraSU, we do not need to modify the upper-level graph algorithm program. Figure 8 shows an example of how GraSU is
integrated into an existing state-of-the-art graph accelerator Accu-
Graph [50] effectively. Two parameters of each interface indicate an
input and an output signal, respectively. For each update batch, we
use GraSU_LHD, GraSU_Update_Start, and GraSU_WHD to complete
Figure 9: Accuracy of the top 5% vertices identified by three different schemes (ideal, degree-based, and incremental) over update batches for WK. Ideal results are obtained through an offline trace analysis.
a graph update operation. After the graph update is finished, the out-
put signal WHD_valid of GraSU_WHD will be valid to simultaneously
activate graph computation (i.e., AccuGraph) and value measure-
ment of the next graph update (i.e., GraSU_Quantify_Value). The
parameter update_batch_valid is used to ensure the semantic
correctness between different update batches.
More generally, all code in Figure 8 (excluding Line 12) represents a uniform programming template for performing graph updates. The whole code of an existing graph accelerator is treated as a module for graph computation. For integration, users only need to
instantiate the accelerator module (Line 12) inside the top mod-
ule, and then connect accelerator’s input/output signals with other
GraSU’s modules for coordinating when it is launched/finished.
Thus, GraSU can be easily used and integrated into existing static
graph accelerators for handling dynamic graphs with only 11 lines
of code modifications. GraSU is implemented in Verilog; integrating it with an HLS-based accelerator is also viable by converting the accelerator into a Verilog program.
4 VALUE MEASUREMENT
We first describe how to accurately quantify the value of a vertex
to distinguish high-value data. We then discuss how the overheads
of value measurement can be hidden in an overlapping manner.
4.1 Quantifying the Value of a Vertex
According to the "rich get richer" conjecture, an intuitive way to quantify the value of a vertex is to use its degree. This approach is useful, but its accuracy is still far from ideal. Figure 9 shows the accuracy of the top 5% vertices identified
in the ideal case and the degree-based approach for the real-world
dynamic graph WK. Accuracy is the quantitative ratio of the top-5%
vertices’ edge updates to the total edge updates. Ideal case means
that all the top-5% vertices’ edge updates are precisely-identified.
We see that the accuracy gap widens as the update batches proceed. In particular, for the 9th update batch, the top 5% vertices identified by the degree-based approach are involved in only 67.33% of the edge updates, while the accuracy in the ideal case is as high as 99.04%.
The reasons are twofold. First, some low-degree vertices in the base graph may have many edges inserted and thus gradually become high-degree as update batches proceed. This is common in real scenarios; for example, an obscure actor may gain many fans and become a superstar when a film succeeds.
Figure 10: The overlapping opportunity between value measurement and normal graph computation
Second, the edges of some high-degree vertices may grow more slowly than others. For example, when a "superspreader" is isolated and the chains of virus transmission are cut off, it soon becomes normal. Both phenomena indicate that the value of a vertex depends not only on its degree but also on its historical update frequency. Thus, we propose to quantify the value of a vertex as follows:
\[
Value_i(v) =
\begin{cases}
Deg(v) \times F_{i-1}(v), & 0 < i \le N \\
Deg(v), & i = 0
\end{cases} \tag{1}
\]

where N is the number of update batches and Deg(v) is the number of out-going edges of vertex v. F_{i-1}(v) represents the number of edges updated to vertex v (also called the update frequency of v) in the (i-1)-th update batch. The value of v is initialized to Deg(v) before the 0-th update batch starts (i.e., Value_0(v)). Value_i(v) denotes the value of v after applying the (i-1)-th update batch.
In Equation (1), we see that the value is incrementally quantified
and dynamically adjusted to gradually improve the prediction accu-
racy of high-value edge data. Figure 9 shows the superiority of our
incremental approach: compared to the degree-based approach, the top 5% vertices it identifies capture 88.15% of the updates.
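A minimal software sketch of the IVM logic follows, directly transcribing Equation (1); the container layout is our assumption, while the hardware module operates on BRAM-resident per-vertex state.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Software transcription of Equation (1): Value_0(v) = Deg(v), and
// Value_i(v) = Deg(v) * F_{i-1}(v) for 0 < i <= N, where F_{i-1}(v) is
// v's update frequency in the previous batch. Data layout is assumed.
struct IncrementalValueMeasurer {
    std::vector<uint64_t> degree;     // Deg(v): current out-degree
    std::vector<uint64_t> last_freq;  // F_{i-1}(v): updates to v in batch i-1
    std::vector<uint64_t> value;      // Value_i(v), consumed by the VMM

    void init() { value = degree; }   // before batch 0: Value_0(v) = Deg(v)

    // Invoked after each batch completes (overlapped with graph computation)
    // to produce the values that steer data placement for the next batch.
    void quantify() {
        for (std::size_t v = 0; v < value.size(); ++v)
            value[v] = degree[v] * last_freq[v];
    }
};
```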
4.2 Overlapping Value Measurement and
Graph Computation
As in Equation (1), measuring the value of a vertex needs to obtain
its degree and update frequency, both of which are dynamically
changed during graph updates. Thus, we have to compute them
on the fly, which can introduce potential runtime overheads. For-
tunately, the interleaving between graph update and graph com-
putation for each update batch (as shown in Figure 2(b)) offers the
potential opportunity to hide the overheads of value measurement.
Figure 10 illustrates an overlapping diagram. When the 𝑖-th
graph update is completed, the edge data resident in the on-chip
memory will be written back to the off-chip memory. Afterwards,
graph computation engine starts working. In the meantime, the
Incremental Value Measurer starts quantifying the vertex value for
the (i+1)-th graph update. Fortunately, the time spent on value measurement is usually less than that spent on graph computation, because value measurement computes on each vertex only once while graph computation iterates over the vertices [53]. Thus, the overheads of value measurement can usually be fully hidden under the execution time of graph computation.
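As a software analogue of Figure 10 (on the FPGA, graph computation and the IVM are independent hardware modules; the threads below merely model that concurrency), the overlap can be sketched as:

```cpp
#include <functional>
#include <thread>

// Software analogue of the overlap in Figure 10: once update batch i has
// been written back, the graph algorithm and the value measurement for
// batch i+1 run concurrently; joining both threads models the batch
// barrier (the update_batch_valid signal in Figure 8).
void overlap_phase(const std::function<void()>& graph_computation,
                   const std::function<void()>& value_measurement) {
    std::thread compute(graph_computation);   // e.g., the AccuGraph kernel
    std::thread measure(value_measurement);   // IVM quantifying vertex values
    compute.join();
    measure.join();
}
```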
5 VALUE-AWARE MEMORY MANAGEMENT
This section elaborates how to make full use of the quantified vertex values to identify high-value edge data accurately and to further achieve a differential data management that maximizes value benefits.
5.1 High-Value Data Identification
Based on the quantified vertex value, the next step is naturally to
identify the high-value data that should be stored on-chip. This
raises two questions: (1) which data is high value for the on-chip
storage? and (2) which on-chip memory (BRAM or UltraRAM) is
expected for caching high-value data?
High-Value Data Computation. GraSU tries to store as much
high-value data as possible on-chip. Since the value and edge count of each vertex change over time and the on-chip memory capacity differs across FPGAs, we must dynamically determine the high-value data as per the on-chip memory capacity, the value of each vertex, and the edge size of each vertex. We can therefore compute the high-value data as follows:
\[
\tau = \arg\max_k \left\{ \sum_{i=0}^{k} Size(EdgeOf(v_i)) \right\}, \quad \text{where } k \in [0, |V|),\ v_i \in VSet,\ \sum_{i=0}^{k} Size(EdgeOf(v_i)) \le |OnchipMem| \tag{2}
\]

where EdgeOf(v_i) indicates the edges of v_i, Size(S) represents the total memory size of the set S, and |V| is the number of vertices. VSet is the set of vertices sorted by their value from largest to smallest in the VMM module, \sum_{i=0}^{k} Size(EdgeOf(v_i)) is the total edge size of {v_0, ..., v_k}, and |OnchipMem| denotes the on-chip memory capacity. Equation (2) finds the largest \tau such that the total edge size of {v_0, ..., v_\tau} is nearly equal to the on-chip memory capacity. After \tau is computed, we can easily obtain the set of high-value data (denoted S_HVD) as follows:

\[
S_{HVD} = \{\, EdgeOf(v_i) \mid i \in [0, \tau],\ v_i \in VSet \,\} \tag{3}
\]

Equation (3) defines the edge data of {v_0, ..., v_\tau} as the high-value data.
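Equations (2) and (3) amount to a greedy prefix selection over the value-sorted vertex list; a sketch in C++ (names and data layout are our assumptions):

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Greedy selection per Equations (2) and (3): sort vertices by value
// (VSet), then take the longest prefix v_0..v_tau whose total edge size
// fits in the on-chip memory capacity.
std::vector<uint32_t> select_high_value(const std::vector<uint64_t>& value,
                                        const std::vector<uint64_t>& edge_bytes,
                                        uint64_t onchip_capacity) {
    std::vector<uint32_t> vset(value.size());
    std::iota(vset.begin(), vset.end(), 0u);
    std::sort(vset.begin(), vset.end(),        // VSet: descending by value
              [&](uint32_t a, uint32_t b) { return value[a] > value[b]; });

    std::vector<uint32_t> s_hvd;               // vertices contributing to S_HVD
    uint64_t used = 0;
    for (uint32_t v : vset) {                  // prefix up to tau
        if (used + edge_bytes[v] > onchip_capacity) break;
        used += edge_bytes[v];
        s_hvd.push_back(v);
    }
    return s_hvd;                              // edges of these vertices go on-chip
}
```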
UltraRAM vs. BRAM. We select UltraRAM to store high-value
data for the following reasons. First, the edge data of a graph is often much larger than its vertex data, so edges need a larger block size than vertices. Thus, UltraRAM, with its coarse-grained block size, is better suited to storing edges, while BRAM is often used to store vertex data (as adopted in existing graph accelerators [10, 11, 39, 50]). Second, UltraRAM offers a larger memory size (e.g., 1280 × 288Kb for a Xilinx U250) than BRAM (2000 × 36Kb), allowing more edges to be stored on-chip.
5.2 Value-Aware Memory Access
So far, we have the following settings of GraSU for dynamic graph
processing. The vertex data is stored in the on-chip BRAM. The
high-value edge data reside in the on-chip UltraRAM while the low-
value one is in the off-chip memory. In this differential memory
architecture, both the UltraRAM and the off-chip memory hold edge data. Thus, when a vertex is being processed, we must know in which memory, and where, its edge data are located. This makes memory addressing complex. A naive approach is to use another offset array to record the on-chip edge data, but it incurs extra space overhead of more than N × 4B, where N is the number of vertices.
Space-Efficient Memory Addressing. We present a simple yet
effective bitmap-based method to yield a good tradeoff between
space overhead and memory addressing efficiency. Each vertex in
Figure 11: An example of value-aware data access management, where both the UltraRAM and the off-chip memory maintain a piece of edge data.
a bitmap occupies only 1 bit. The bitmap is stored in the on-chip
BRAMs with 1 (0) indicating that the edge data of a corresponding
vertex is stored on-chip (off-chip). Based on the bitmap, we can
access the high-value data easily. When all the edge data of a vertex
is loaded into UltraRAM, its bit in the bitmap is set to 1 and the
corresponding entry in the offset array is set to the new offset in
the UltraRAM. Also, the current end address of this vertex in the
UltraRAM is recorded. Figure 11 shows an example of a graph with
5 vertices. The edge data of 𝑉 1 and 𝑉 3 are high-value data that
need to be loaded into the UltraRAM. Their new offsets (i.e., 0 and
7) in the UltraRAM will be written in the corresponding entries
of the offset array. The entries of 𝑉 1 and 𝑉 3 in the bitmap are
also set to 1. The starting address of the UltraRAM is a constant
and the current end address of V3 is kept in a current_end_addr
register. Note that the vertex bitmap can still be partitioned on a multi-FPGA platform for scalability, and the overhead induced by bitmap construction can be amortized over multiple executions of a variety of graph algorithms operating on the same graph.
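A minimal model of this bitmap-indexed scheme is sketched below (struct and field names are ours): loading a vertex on-chip sets its bit, rewrites its offset-array entry to the UltraRAM offset, and advances the current_end_addr register.

```cpp
#include <cstdint>
#include <vector>

// Minimal model of the bitmap-indexed differential addressing: one bit per
// vertex (1 = edges resident in UltraRAM, 0 = off-chip). Struct and field
// names are illustrative, not GraSU's actual Verilog interfaces.
struct DifferentialIndex {
    std::vector<bool> bitmap;        // held in on-chip BRAM on the FPGA
    std::vector<uint64_t> offset;    // shared offset array (|V|+1 entries)
    uint64_t current_end_addr = 0;   // next free UltraRAM slot (a register)

    // Mark vertex v (owning `deg` edges) as resident in the UltraRAM:
    // its offset-array entry now points into the UltraRAM.
    void load_on_chip(uint32_t v, uint64_t deg) {
        bitmap[v] = true;
        offset[v] = current_end_addr;
        current_end_addr += deg;
    }

    bool on_chip(uint32_t v) const { return bitmap[v]; }
};
```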
High-Value Data WriteBack. As described above, the addresses
of high-value vertices in the offset array are overwritten by the
on-chip UltraRAM offsets. When these high-value data are written
back into the off-chip memory for data consistency, we need to com-
pute their original offsets in the offset array. Specifically, starting
from a given vertex 𝑣, we scan the bitmap and find the first vertex
marked with ‘0’ (denoted as 𝑣0) and the first vertex marked with ‘1’
(denoted as 𝑣1) in the bitmap. Then, the number of out-going edges
of 𝑣 can be obtained by computing the offset difference between
𝑣1 and 𝑣. Finally, the original off-chip memory offset of 𝑣 can be
computed as follows:
\[
offset(v) = offset(v_0) - Deg(v) \tag{4}
\]

where offset(v) denotes the off-chip memory offset of vertex v. Figure 11 shows an example of how to calculate the original off-chip memory offset of vertex V1. First, we scan the bitmap and find v_0 = V2 and v_1 = V3. Then, we compute Deg(V1) = 7 - 0 = 7. Thus, the original offset of V1 is offset(V1) = offset(V2) - Deg(V1) = 9 - 7 = 2.
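The write-back offset recovery transcribes directly from Equation (4); the sketch below reproduces the V1 example from Figure 11 (boundary handling for the last on-chip vertex, which uses current_end_addr instead, is omitted for brevity).

```cpp
#include <cstdint>
#include <vector>

// Equation (4): scan forward from v for the first off-chip vertex v0 and
// the first on-chip vertex v1; Deg(v) = offset(v1) - offset(v) (UltraRAM
// offsets), and the original off-chip offset is offset(v0) - Deg(v).
// The last on-chip vertex would use current_end_addr instead (omitted).
uint64_t original_offset(const std::vector<bool>& bitmap,
                         const std::vector<uint64_t>& offset, uint32_t v) {
    uint32_t v0 = v + 1;
    while (bitmap[v0]) ++v0;          // first vertex still off-chip
    uint32_t v1 = v + 1;
    while (!bitmap[v1]) ++v1;         // first on-chip vertex after v
    const uint64_t deg = offset[v1] - offset[v];
    return offset[v0] - deg;
}
// Figure 11: bitmap = {0,1,0,1,0}, offset = {0,0,9,7,14,17}; for V1,
// v0 = V2 (offset 9), v1 = V3 (offset 7), Deg(V1) = 7 - 0 = 7, so the
// original offset is 9 - 7 = 2, matching the text.
```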
Handling Read Requests. A read request to a vertex 𝑣 needs
to load all the edges of 𝑣 into the UHL. To achieve this goal, we
need to know three pieces of information. First, we have to know
where the edge data of v is stored. This can be easily obtained by accessing the bitmap, with 1 (0) indicating on-chip (off-chip) storage. Second, we compute the initial address of the edge data of
𝑣 (denoted as 𝑎𝑑𝑑𝑟𝑒𝑠𝑠(𝑣)), which can be obtained by adding the
starting memory address and the corresponding offset of 𝑣 in the
Table 2: Real-world dynamic graph datasets

  Dataset                        # Vertices   # Edges    # BEdges   Avg. Degree
  sx-askubuntu (AU) [28]         0.16M        0.96M      0.59M      6.05
  sx-superuser (SU) [28]         0.19M        1.44M      0.92M      7.44
  wiki-talk-temporal (WK) [28]   1.14M        7.83M      3.31M      6.87
  sx-stackoverflow (SO) [28]     2.60M        63.50M     36.23M     24.40
  soc-bitcoin (BC) [35]          24.58M       122.95M    60.49M     5.00
offset array. Third, we need to obtain the number of out-going
edges of v (i.e., Deg(v)). Two cases should be considered. If the edge data is on-chip, Deg(v) can be obtained by computing the offset difference between v_1 (defined above) and v. If it is off-chip, we find v's first subsequent vertex (denoted v_s) and compute offset(v_s) using Equation (4) (if v_s is on-chip). Deg(v) can therefore be obtained by computing the offset difference between v_s and v. Finally, we return the edge data of v (denoted data(v)) to the UHL as a tuple ⟨v, address(v), data(v)⟩.
Handling Write Requests. When a write request with a tuple ⟨v, address(v), data(v)⟩ arrives, the updated edge data of v in the UHL's register are written back to the UltraRAM (or the off-chip memory). Similarly, we access the bitmap to identify whether the vertex is stored in the UltraRAM or the off-chip memory, with 1 (0) indicating that data(v) is written to the UltraRAM (off-chip memory) at address(v).
6 EXPERIMENTAL EVALUATION
This section evaluates the efficiency of GraSU and shows the effec-
tiveness for its integration into existing static graph accelerators.
6.1 Experimental Setup
GraSU Settings. We implement GraSU upon a Xilinx Alveo™ U250
accelerator card, which is equipped with an XCU250 FPGA chip
and four 16GB DDR4 (Micron MTA18ASF2G72PZ-2G3B1). The
target FPGA chip provides 11.81MB on-chip BRAMs, 45MB on-chip
UltraRAMs, 1.68M LUTs, and 3.37M registers.
In our evaluation, GraSU is set with 32 graph update PEs. Each
segment in the PMA representation has a length of 8, and we use 4 bytes to represent a vertex; thus, the update-relevant register buffer attached to each PE is 32 bytes. To evaluate efficiency, we
integrate GraSU into a state-of-the-art static graph accelerator Ac-
cuGraph [50] to perform dynamic graph processing, referred to
as AccuGraph-D. To demonstrate the usability of GraSU, we also
integrate GraSU into three state-of-the-art FPGA-based graph accel-
erators: FabGraph [39], WaveScheduler [44], and ForeGraph [11].
Graph Datasets and Applications. As shown in Table 2, we
use five real-world dynamic graphs publicly available from the Stan-
ford Large Network Dataset Collection [28] and Network Reposi-
tory [35]. Every edge in a dynamic graph has a timestamp that indi-
cates when it should appear. For example, a directed edge ⟨u, v, t⟩ in WK indicates that Wikipedia user u edited the talk page of user v at time t. "BEdges" in Table 2 denotes the edge set of the initial base graph before graph updates start; the number of edges to be updated is thus "#Edges − #BEdges". In
our evaluation, all graphs are considered directed, and their edge
updates are divided into 10 batches by default.
We evaluate three representative graph applications: Breadth
First Search (BFS), PageRank (PR), and Weakly Connected Compo-
nents (WCC). In a dynamic graph processing scenario, every time
Table 3: Resource utilization and clock rates

                       BFS       PR        WCC
  LUT                  12.28%    14.19%    12.89%
  Register             5.64%     9.96%     6.73%
  BRAM                 72.77%    82.18%    82.18%
  UltraRAM             62.50%    62.50%    62.50%
  Maximal clock rate   246MHz    211MHz    245MHz
Table 4: Update time (in seconds) and update throughput (in million edges per second) of graph updates for Stinger, Aspen, and AccuGraph-D. Each entry is update time / update throughput; the ×Stinger and ×Aspen columns give the speedups achieved by AccuGraph-D against Stinger and Aspen, respectively.

  Graph   Stinger       Aspen         AccuGraph-D       ×Stinger   ×Aspen
  AU      0.031/4.18    0.004/32.43   0.00068/190.77    45.58×     5.88×
  SU      0.053/3.47    0.006/30.64   0.00125/147.08    42.40×     4.80×
  WK      0.647/4.31    0.071/39.31   0.01375/202.97    47.05×     5.16×
  SO      3.614/3.23    0.452/25.81   0.209/55.83       17.29×     2.16×
  BC      6.752/9.25    1.473/42.40   0.358/174.46      18.86×     4.11×
an update batch is finished, we will perform a graph algorithm
on the newly-updated graph. Note that we run 10 iterations for
PageRank, and perform BFS and WCC until convergence.
Baselines. We compare AccuGraph-D with two state-of-the-
art CPU-based dynamic graph systems, Aspen [13] and Stinger
with its latest version 15.10 [14]. Both run on a high-end server
configured with 2×14-core Intel Xeon E5-2680 v4 CPUs operating
at 2.40GHz, 256GB of memory, and a 2TB HDD. We evaluate update throughput as the number of edges successfully updated per second. The overall efficiency of dynamic graph processing is measured as the sum of update time and graph computation time.
Resource Utilization. Table 3 shows the resource utilization
and clock rate of AccuGraph-D. All results are obtained via Xilinx
SDAccel 2019.2. To preserve the correctness, we conservatively set
200MHz as the clock rate in our experiments.
6.2 Graph Update Efficiency
Table 4 shows the update time (in seconds) and update throughput
(in million edges per second) of graph updates for Stinger [14],
Aspen [13], and AccuGraph-D, respectively.
AccuGraph-D vs. Stinger: Stinger uses both adjacency list and
CSR to maintain dynamic graph data: it divides the edges of each vertex into multiple blocks, which preserve some gaps and are organized like an adjacency list, while within each block edges are stored in a CSR representation. When an edge update arrives, Stinger traverses the block list and inserts the to-be-updated edge into an empty position (or deletes it from the corresponding position).
Overall, AccuGraph-D performs graph updates with through-
puts of 55.83~202.97M edges/second while Stinger achieves only
3.23~9.25M edges/second. We can therefore see that AccuGraph-D
outperforms Stinger by 17.29×~47.05× (34.24× on average) in terms of update time, for the following two reasons. First, traversing the
block list under Stinger incurs a low LLC hit ratio (often less than
20%) [1]. This is particularly serious for handling edge updates on
high-degree vertices, because a large number of edge updates are
applied on only a few high-degree vertices in real-world dynamic
graphs, thus significantly worsening update efficiency. In contrast,
Figure 12: The total running time of AccuGraph-D against Stinger and Aspen. Each bar represents a system or an accelerator; patterned parts indicate graph computation time while unpatterned parts show graph update time.
AccuGraph-D identifies the high-value edge data associated with a large fraction of edge updates and caches them in the on-chip memory. Thus,
excessive off-chip memory accesses are avoided. Second, when two
edges are simultaneously updated on the same vertex, Stinger uses
a locking mechanism to ensure correctness, further decreasing the parallelism of edge updates. AccuGraph-D adopts a PMA-based format
variant, which can chunk the edge array into many fine-grained
segments and ensures lock-free segment updates [38].
AccuGraph-D vs. Aspen: Aspen is developed based on a purely-
functional search tree, which greatly facilitates the search of target
position corresponding to the to-be-updated edge. However, Aspen
needs to repeatedly load the same edge data into on-chip caches
at different times due to the extremely irregular memory access
features of graph update and the limitation of typical replacement
policies [3, 9, 23] on traditional architectures, while GraSU retains
high-value edge data on-chip during graph update. Thus, GraSU
does not need to frequently transfer edge data between on-chip and
off-chip memory, avoiding redundant communication overheads.
Compared with Aspen, AccuGraph-D can therefore improve the
update efficiency by 2.16×~5.88× (4.42× on average).
In particular, SO exhibits the smallest improvement, only 2.16×. The reason is simple: SO has the highest average degree, 24.40 (Table 2), so the average edge data size per vertex is larger than in the other graphs. Given the limited, fixed-size UltraRAM, a larger vertex degree generally implies that fewer vertices' edge data can be stored in the on-chip memory. As a result, relatively few spatial similarity opportunities can be exploited by GraSU.
6.3 Overall Efficiency
We also evaluate the total running time (i.e., update time plus graph
computation time) of AccuGraph-D against Stinger and Aspen.
Stinger and Aspen are in-memory systems (disk-loading time is excluded). To make an apples-to-apples comparison, we likewise exclude the CPU-FPGA transfer time.
Figure 12 shows the overall performance results, where each
bar consists of two parts for a graph system. The patterned part
represents graph computation time while the unpatterned part
indicates graph update time. Overall, compared with Stinger and
Aspen, AccuGraph-D has the fastest execution time for all three
graph algorithms by average speedups of 9.80× and 3.07×, respec-
tively. This is because of not only graph update efficiency improved
but also graph computation performance accelerated by FPGA.
Figure 13: Update efficiency of GraSU with and without value-aware memory management (VMM), incremental value measurement (IVM), and the overlapping technique (OT)

Figure 14: Update efficiency of GraSU with different UltraRAM sizes (100×288Kb to 800×288Kb). All results are normalized to GraSU with an UltraRAM size of 800×288Kb.

Figure 15: Update efficiency of GraSU with varying batch sizes (0.1%, 1%, 10%) of graph updates. All results are normalized to GraSU with an update batch size of 10%.
Yet, we can also observe that graph computation can gradually
dominate the overall performance as update efficiency is signifi-
cantly improved. For instance, consider BFS on WK: the graph computation proportion for Aspen is 60.16%, which increases to 80.76% when graph update efficiency is improved by GraSU. In this
work, we focus on improving graph update efficiency; improving graph computation performance to further boost overall performance is left as interesting future work.
6.4 Effectiveness
We further investigate the benefit breakdown for different compo-
nents of GraSU, including value-aware memory management (VMM),
incremental value measurement (IVM), and overlapping technique
(OT). Figure 13 shows the breakdown results, where the baseline applies none of VMM, IVM, or OT. All results are normalized to the version with VMM, IVM, and OT all applied. Note
that AU and SU are small enough to fit all edges into the on-chip
UltraRAM. In this case, IVM and OT are therefore disabled at run-
time and the baseline is GraSU with only VMM used. Overall, our
GraSU improves the baseline by 6.14× on average.
VMM. By preserving the high-value data (identified by the
degree-based approach) on-chip, a large number of off-chip memory accesses are transformed into on-chip accesses. Therefore, VMM contributes a significant speedup of 4.69× (on average) over the baseline, accounting for 65.56% of the overall graph update performance improvement. In particular, for the small graphs AU and SU, VMM delivers substantial improvements of 9.89× and 7.37× over the baseline, respectively. The reason is clear: all their data are stored on-chip, so no off-chip communication occurs.
IVM. As shown in Figure 9, IVM can improve the prediction
accuracy of high-value data significantly. We next discuss IVM's runtime overheads. Compared to the baseline, IVM offers only a 1.58× speedup on average. For WK in particular, IVM may even cause a significant slowdown, making overall performance poorer than the baseline: the benefit of improved accuracy is offset by the overheads of repeated value measurement across update batches. Fortunately,
such overheads can be fully overlapped with the normal graph
computation phase, as discussed below.
OT. By applying OT, we see that the IVM-induced extra over-
head can be fully hidden behind the normal graph computation,
further improving the total execution time significantly. OT makes dynamic graph updates run 4.21× faster than otherwise, demonstrating its effectiveness. Results show that IVM and OT jointly contribute 34.44% of the overall performance improvement.
Figure 16: Update efficiency of GraSU with different numbers (2/4/8/16/32) of PEs. All results are normalized to the update time with 2 PEs.
6.5 Sensitivity Study
Let us examine the sensitivity of GraSU's performance to (1) the UltraRAM size, (2) the update batch size, and (3) the number of PEs.
UltraRAM Size. Figure 14 illustrates the performance of graph
updates with different UltraRAM sizes ranging from 100×288Kb
to 800×288Kb. Overall, the larger the UltraRAM size is, the better
the performance is. This is because a larger size implies more high-
value edge data that can be cached on-chip and fewer edge data
transfers between on-chip and off-chip memory. In particular, AU
can be stored entirely under the UltraRAM sizes of 400×288Kb and
800×288Kb, and therefore, its performance is not changed in these
two cases. Also, when the UltraRAM size is scaled from 200×288Kb to 400×288Kb, AU exhibits a significant performance improvement (by up to 43.38%), because the 400×288Kb UltraRAM caches high-value data that the 200×288Kb size misses, thereby reducing the off-chip edge data accesses significantly.
Update Batch Size. Figure 15 characterizes the update perfor-
mance of GraSU with different update batch sizes. For each graph,
we divide the edges that need an update into 1000, 100, and 10
batches according to their timestamp range. These correspond re-
spectively to the cases where each update batch contains 0.1%, 1%, and 10% of the updated edges. We see that the batch size does not significantly affect update efficiency, since spatial similarity is not destroyed by the update batch scale. In addition, as the number of edge updates per batch decreases from 10% to 0.1%, the average update performance slightly increases from 1.0× to 1.23×. This is because more high-value data are mined through value measurement.
PE Number. Figure 16 further plots update performance with a
varying number of (2/4/8/16/32) graph update PEs. All results are
normalized to the update time in case of using 2-PE. Overall, more
PEs can introduce an increasing performance improvement but
the growth rate is gradually decreasing. When the number of PEs
increases from 16 to 32, the update performance increases by 1.38×
only, on average. This is because more PEs issue more simultaneous memory requests, placing significant memory
Figure 17: Overall dynamic graph processing performance of FabGraph, WaveScheduler, and ForeGraph (with GraSU integrated) against Aspen
pressure on the value-aware memory manager, with potential access conflicts. We leave addressing this problem as future work.
6.6 Integration with Other Graph Accelerators
We finally explore integrating GraSU into three other state-of-the-art FPGA-based graph accelerators: FabGraph [39], WaveScheduler [44], and ForeGraph [11] (its single-FPGA version).
Similar to AccuGraph, with only 11 lines of code modifications,
GraSU can be easily integrated with FabGraph, WaveScheduler,
and ForeGraph to support dynamic graph processing, thanks to the
following highlighted designs. First, GraSU adopts a PMA-based dynamic graph organization, which can support existing accelerators regardless of their underlying graph formats. Second,
GraSU uses a lightweight bitmap-based method to implement dif-
ferential memory access, thereby avoiding significant modifications
on the underlying memory subsystem and lowering the integration
obstacles. Third, GraSU provides a uniform integration framework
with easy-to-use programming interfaces.
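To make the second design point concrete, here is a minimal behavioral
sketch of bitmap-assisted differential addressing. It is a software model
under our own naming assumptions, not GraSU's actual Verilog:

    class DifferentialMemory:
        # One bit per vertex records whether its edge segments currently
        # reside in on-chip UltraRAM; compact tables map vertex IDs to
        # segment base offsets in each memory. (Layout assumed for
        # illustration only.)
        def __init__(self, num_vertices, dram_base):
            self.high_value = [False] * num_vertices  # the bitmap
            self.uram_base = {}                       # vid -> on-chip offset
            self.dram_base = dram_base                # vid -> off-chip offset

        def promote(self, vid, uram_offset):
            # Called per batch after value measurement to pin a
            # high-value vertex's edges on-chip.
            self.high_value[vid] = True
            self.uram_base[vid] = uram_offset

        def resolve(self, vid):
            # A single bitmap lookup decides on-chip vs. off-chip placement.
            if self.high_value[vid]:
                return ('URAM', self.uram_base[vid])
            return ('DRAM', self.dram_base[vid])

Because the bitmap costs only one bit per vertex, this resolution adds
negligible space overhead while keeping data addressing a constant-time
operation.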
Figure 17 shows the overall performance results of FabGraph-D,
WaveScheduler-D, and ForeGraph-D against Aspen. The experimental
environment is the same as for AccuGraph-D. Since FabGraph
uses some UltraRAM resources as its shared vertex buffer, we allocate
only 21.09MB (600×288Kb) of UltraRAM to buffer high-value data
for FabGraph-D. The results show that the dynamic graph versions of
FabGraph, WaveScheduler, and ForeGraph integrated with GraSU
also outperform Aspen, with geomean speedups of 2.93×, 3.09×,
and 1.63×, demonstrating the generality and practicability of GraSU.
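As a small aside, these summary numbers are geometric means,
presumably taken over the fifteen graph/algorithm pairs shown in
Figure 17. For reference (with placeholder inputs, not measured data):

    import math

    def geomean(speedups):
        # Geometric mean of per-workload speedups.
        return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

    print(geomean([2.1, 3.4, 2.9]))  # illustrative inputs only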
7 RELATED WORK
Graph Processing Accelerators. Due to its random access pattern,
graph processing generally suffers from a low compute-to-memory
ratio. To improve memory efficiency, existing FPGA-
based graph accelerators typically focus on optimizing on-chip
memory access [10, 11, 39, 44, 50] and off-chip memory access [2,
12, 24, 31, 32, 51, 52, 54, 55]. For on-chip memory optimizations,
some consider mitigating the performance overheads caused by
data conflicts in the on-chip BRAM [44, 50]. Some [10, 11] use
on-chip data reuse mechanisms to improve the locality of graph
computation. There are also works that try to hide the delay of data
loading from off-chip memory to BRAM [39]. A number of graph
processing accelerators [12, 31, 32, 54, 55] focus on improving the
bandwidth utilization between the on-chip and the off-chip mem-
ory with sophisticated pipeline designs. Others use emerging
memory technologies (e.g., the hybrid memory cube) to further
improve external memory access [24, 51, 52]. Unfortunately, these
earlier studies are limited to handling static graphs [19]. To the best
of our knowledge, there are currently few dynamic FPGA-based
graph accelerators. In this work, we aim to fill the gap between
static graph computation and dynamic graph update. We restrict
ourselves to building a fast graph update library, which can be inte-
grated easily into any existing FPGA-based static graph accelerator
for handling dynamic graphs.
Dynamic Graph Processing Systems. Most existing dynamic
graph systems fall into two categories [6, 8, 13, 14, 16, 18,
21, 25, 29, 37, 38, 45, 46]. The first category develops new dynamic
graph representations based on different static data structures in
terms of CSR variants [14, 16, 29, 37], adjacency list [8, 25, 46], hash
table [18, 21], tree [6, 13], and Packed Memory Array (PMA) [38, 45].
These earlier studies improve the concurrency of graph updates,
but their underlying efficiency is still limited by excessive
off-chip memory accesses [1]. GraSU identifies spatial similarity op-
portunities and presents a differential data management to improve
the memory efficiency of dynamic graph processing.
A number of studies improve the efficiency of graph computation
in dynamic graph processing scenarios by leveraging recent (rather
than initial) vertex property values to accelerate the convergence of
graph computation [8, 30, 37, 40, 42, 43]. In particular, accelerating
graph computation using FPGAs stands out for yielding impressive
results in both performance and energy-efficiency [19, 39, 44, 50].
In this work, we focus on accelerating dynamic graph updates
on FPGA, and develop an FPGA library that can be integrated
easily with minimal hardware engineering efforts.
8 CONCLUSION
In this paper, we introduce a graph update library (called GraSU)
for high-throughput graph updates on FPGA. GraSU can be easily
integrated with any existing FPGA-based static graph accelerator
with only a few lines of code modifications to handle dynamic
graphs. GraSU features two key designs: an incremental value
measurement and a value-aware differential memory management.
The former quantifies data value accurately while its overheads
can be fully hidden behind normal graph computation. The latter
exploits the spatial similarity of graph updates by retaining
high-value data on-chip so that most off-chip data communications
arising in graph updates can be transformed into fast on-chip
memory accesses. We integrate GraSU into a state-of-the-art static
graph accelerator, AccuGraph, to drive dynamic graph processing.
Our implementation on a Xilinx Alveo™ U250 board demonstrates
that the dynamic graph version of AccuGraph outperforms two
state-of-the-art CPU-based dynamic graph systems, Stinger and
Aspen, by an average of 34.24× and 4.42× in terms of update
throughput, further improving overall efficiency by 9.80× and
3.07× on average.
ACKNOWLEDGMENTS
This work is supported by the National Key Research and Develop-
ment Program of China under Grant No. 2018YFB1003502 and the
National Natural Science Foundation of China under Grant Nos.
62072195, 61825202, and 61832006. Correspondence should be addressed
to Long Zheng.
REFERENCES
[1] Abanti Basak, Jilan Lin, Ryan Lorica, Xinfeng Xie, Zeshan Chishti, Alaa
Alameldeen, and Yuan Xie. 2020. SAGA-Bench: Software and Hardware Charac-
terization of Streaming Graph Analytics Workloads. In ISPASS. IEEE, 12–23.
[2] Andrew Bean, Nachiket Kapre, and Peter Y. K. Cheung. 2015. G-DMA: improving
memory access performance for hardware accelerated sparse graph computation.
In ReConFig. IEEE, 1–6.
[3] Nathan Beckmann and Daniel Sánchez. 2015. Talus: A simple way to remove
cliffs in cache performance. In HPCA. IEEE, 64–75.
[4] Maciej Besta, Marc Fischer, Tal Ben-Nun, Johannes de Fine Licht, and Torsten
Hoefler. 2019. Substream-Centric Maximum Matchings on FPGA. In FPGA. ACM,
152–161.
[5] Maciej Besta, Marc Fischer, Vasiliki Kalavri, Michael Kapralov, and Torsten Hoe-
fler. 2019. Practice of Streaming and Dynamic Graphs: Concepts, Models, Systems,
and Parallelism. CoRR abs/1912.12740 (2019). arXiv:1912.12740
[6] Federico Busato, Oded Green, Nicola Bombieri, and David A. Bader. 2018. Hornet:
An Efficient Data Structure for Dynamic Sparse Graphs and Matrices on GPUs.
In HPEC. IEEE, 1–7.
[7] Xinyu Chen, Ronak Bajaj, Yao Chen, Jiong He, Bingsheng He, Weng-Fai Wong,
and Deming Chen. 2019. On-The-Fly Parallel Data Shuffling for Graph Processing
on OpenCL-Based FPGAs. In FPL. 67–73.
[8] Raymond Cheng, Ji Hong, Aapo Kyrola, Youshan Miao, Xuetian Weng, Ming Wu,
Fan Yang, Lidong Zhou, Feng Zhao, and Enhong Chen. 2012. Kineograph: taking
the pulse of a fast-changing and connected world. In EuroSys. ACM, 85–98.
[9] Asaf Cidon, Assaf Eisenman, Mohammad Alizadeh, and Sachin Katti. 2016.
Cliffhanger: Scaling Performance Cliffs in Web Memory Caches. In NSDI. USENIX,
379–392.
[10] Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. 2016. FPGP: Graph
Processing Framework on FPGA A Case Study of Breadth-First Search. In FPGA.
ACM, 105–110.
[11] Guohao Dai, Tianhao Huang, Yuze Chi, Ningyi Xu, Yu Wang, and Huazhong
Yang. 2017. ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA
Architecture. In FPGA. ACM, 217–226.
[12] Michael DeLorimier, Nachiket Kapre, Nikil Mehta, Dominic Rizzo, Ian Eslick,
Raphael Rubin, Tomás E. Uribe, Thomas F. Knight Jr., and André DeHon. 2006.
GraphStep: A System Architecture for Sparse-Graph Algorithms. In FCCM. IEEE,
143–151.
[13] Laxman Dhulipala, Guy E. Blelloch, and Julian Shun. 2019. Low-latency graph
streaming using compressed purely-functional trees. In PLDI. ACM, 918–934.
[14] David Ediger, Robert McColl, E. Jason Riedy, and David A. Bader. 2012. STINGER:
High performance data structure for streaming graphs. In HPEC. IEEE, 1–5.
[15] Nina Engelhardt and Hayden Kwok-Hay So. 2016. Gravf: A vertex-centric dis-
tributed graph processing framework on fpgas. In FPL. IEEE, 1–4.
[16] Guoyao Feng, Xiao Meng, and Khaled Ammar. 2015. DISTINGER: A distributed
graph data structure for massive dynamic graph processing. In BigData. IEEE,
1814–1822.
[17] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin.
2012. PowerGraph: Distributed Graph-parallel Computation on Natural Graphs.
In OSDI. USENIX, 17–30.
[18] Xiangyang Gou, Lei Zou, Chenxingyu Zhao, and Tong Yang. 2019. Fast and
Accurate Graph Stream Summarization. In ICDE. IEEE, 1118–1129.
[19] Chuang-Yi Gui, Long Zheng, Bingsheng He, Cheng Liu, Xin-Yu Chen, Xiao-Fei
Liao, and Hai Jin. 2019. A Survey on Graph Processing Accelerators: Challenges
and Opportunities. J. Comput. Sci. Technol. 34, 2 (2019), 339–371.
[20] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret
Martonosi. 2016. Graphicionado: A High-Performance and Energy-Efficient
Accelerator for Graph Analytics. In MICRO. IEEE, 1–13.
[21] Keita Iwabuchi, Scott Sallinen, Roger A. Pearce, Brian Van Essen, Maya B. Gokhale,
and Satoshi Matsuoka. 2016. Towards a Distributed Large-Scale Dynamic Graph
Data Store. In IPDPS. IEEE, 892–901.
[22] Hai Jin, Pengcheng Yao, Xiaofei Liao, Long Zheng, and Xianliang Li. 2017. To-
wards Dataflow-Based Graph Accelerator. In ICDCS. IEEE, 1981–1992.
[23] Theodore Johnson and Dennis E. Shasha. 1994. 2Q: A Low Overhead High
Performance Buffer Management Replacement Algorithm. In VLDB. Morgan
Kaufmann, 439–450.
[24] Soroosh Khoram, Jialiang Zhang, Maxwell Strange, and Jing Li. 2018. Accelerat-
ing Graph Analytics by Co-Optimizing Storage and Access on an FPGA-HMC
Platform. In FPGA. ACM, 239–248.
[25] Pradeep Kumar and H. Howie Huang. 2019. GraphOne: A Data Store for Real-time
Analytics on Evolving Graphs. In FAST. USENIX, 249–263.
[26] Ravi Kumar, Jasmine Novak, and Andrew Tomkins. 2006. Structure and Evolution
of Online Social Networks. In KDD. ACM, 611–617.
[27] Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. 2005. Graphs over time:
densification laws, shrinking diameters and possible explanations. In KDD. ACM,
177–187.
[28] Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network
Dataset Collection. http://snap.stanford.edu/data.
[29] Peter Macko, Virendra Marathe, Daniel Margo, and Margo Seltzer. 2015. LLAMA:
Efficient graph analytics using Large Multiversioned Arrays. In ICDE. IEEE,
363–374.
[30] Mugilan Mariappan and Keval Vora. 2019. GraphBolt: Dependency-Driven Syn-
chronous Processing of Streaming Graphs. In EuroSys. ACM, 25:1–25:16.
[31] Eriko Nurvitadhi, Gabriel Weisz, Yu Wang, Skand Hurkat, Marie Nguyen, James C.
Hoe, José F. Martínez, and Carlos Guestrin. 2014. GraphGen: An FPGA Framework
for Vertex-Centric Graph Computation. In FCCM. IEEE, 25–28.
[32] Tayo Oguntebi and Kunle Olukotun. 2016. GraphOps: A Dataflow Library for
Graph Analytics Acceleration. In FPGA. ACM, 111–117.
[33] Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, John Greth,
Steven Burns, and Ozcan Ozturk. 2016. Energy Efficient Architecture for Graph
Analytics Accelerators. In ISCA. IEEE, 166–177.
[34] Xiafei Qiu, Wubin Cen, Zhengping Qian, You Peng, Ying Zhang, Xuemin Lin, and
Jingren Zhou. 2018. Real-time Constrained Cycle Detection in Large Dynamic
Graphs. Proc. VLDB Endow. 11, 12 (2018), 1876–1888.
[35] Ryan A. Rossi and Nesreen K. Ahmed. 2015. The Network Data Repository with
Interactive Graph Analytics and Visualization. In AAAI. http://networkrepository.
com
[36] David Sayce. 2020. The Number of tweets per day in 2020. https://www.dsayce.
com/social-media/tweets-day/.
[37] Dipanjan Sengupta, Narayanan Sundaram, Xia Zhu, Theodore L. Willke, Jeffrey S.
Young, Matthew Wolf, and Karsten Schwan. 2016. GraphIn: An Online High
Performance Incremental Graph Processing Framework. In Euro-Par. Springer,
319–333.
[38] Mo Sha, Yuchen Li, Bingsheng He, and Kian-Lee Tan. 2017. Accelerating Dynamic
Graph Analytics on GPUs. Proc. VLDB Endow. 11, 1 (2017), 107–120.
[39] Zhiyuan Shao, Ruoshi Li, Diqing Hu, Xiaofei Liao, and Hai Jin. 2019. Improving
Performance of Graph Processing on FPGA-DRAM Platform by Two-level Vertex
Caching. In FPGA. ACM, 320–329.
[40] Feng Sheng, Qiang Cao, Haoran Cai, Jie Yao, and Changsheng Xie. 2018. GraPU:
Accelerate Streaming Graph Analysis through Preprocessing Buffered Updates.
In SoCC. ACM, 301–312.
[41] Shuang Song, Xu Liu, Qinzhe Wu, Andreas Gerstlauer, Tao Li, and Lizy K. John.
2018. Start Late or Finish Early: A Distributed Graph Processing System with
Redundancy Reduction. Proc. VLDB Endow. 12, 2 (2018), 154–168.
[42] Keval Vora, Rajiv Gupta, and Guoqing (Harry) Xu. 2016. Synergistic Analysis of
Evolving Graphs. ACM Trans. Archit. Code Optim. 13, 4 (2016), 32:1–32:27.
[43] Keval Vora, Rajiv Gupta, and Guoqing (Harry) Xu. 2017. KickStarter: Fast and
Accurate Computations on Streaming Graphs via Trimmed Approximations. In
ASPLOS. ACM, 237–251.
[44] Qinggang Wang, Long Zheng, Jieshan Zhao, Xiaofei Liao, Hai Jin, and Jingling
Xue. 2020. A Conflict-free Scheduler for High-performance Graph Processing on
Multi-pipeline FPGAs. ACM Trans. Archit. Code Optim. 17, 2 (2020), 14:1–14:26.
[45] Brian Wheatman and Helen Xu. 2018. Packed Compressed Sparse Row: A Dy-
namic Graph Representation. In HPEC. IEEE, 1–7.
[46] Martin Winter, Daniel Mlakar, Rhaleb Zayer, Hans-Peter Seidel, and Markus
Steinberger. 2018. faimGraph: high performance management of fully-dynamic
graphs under tight memory constraints on the GPU. In SC. ACM, 60:1–60:13.
[47] Alex Woodie, Tiffany Trader, George Leopold, John Russell, Oliver Peckham,
James Kobielus, and Steve Conway. 2020. Tracking the Spread of Coronavirus with
Graph Databases. datanami. https://www.datanami.com/2020/03/12/tracking-
the-spread-of-coronavirus-with-graph-databases/.
[48] Xilinx. 2019. UltraScale Architecture Memory Resources User Guide.
https://www.xilinx.com/support/documentation/user_guides/ug573-ultrascale-
memory-resources.pdf.
[49] Xilinx. 2020. Vivado Design Suite User Guide High-Level Synthesis. https:
//www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug902-
vivado-high-level-synthesis.pdf.
[50] Pengcheng Yao, Long Zheng, Xiaofei Liao, Hai Jin, and Bingsheng He. 2018. An
Efficient Graph Accelerator with Parallel Data Conflict Management. In PACT.
ACM, 8:1–8:12.
[51] Jialiang Zhang, Soroosh Khoram, and Jing Li. 2017. Boosting the Performance of
FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth
First Search. In FPGA. ACM, 207–216.
[52] Jialiang Zhang and Jing Li. 2018. Degree-aware Hybrid Graph Traversal on
FPGA-HMC Platform. In FPGA. ACM, 229–238.
[53] Long Zheng, Xianliang Li, Yaohui Zheng, Yu Huang, Xiaofei Liao, Hai Jin, Jingling
Xue, Zhiyuan Shao, and Qiang-Sheng Hua. 2020. Scaph: Scalable GPU-Accelerated
Graph Processing with Value-Driven Differential Scheduling. In ATC. USENIX,
573–588.
[54] Shijie Zhou, Charalampos Chelmis, and Viktor K Prasanna. 2016. High-
Throughput and Energy-Efficient Graph Processing on FPGA. In FCCM. IEEE,
103–110.
[55] Shijie Zhou, Rajgopal Kannan, Viktor K. Prasanna, Guna Seetharaman, and Qing
Wu. 2019. HitGraph: High-throughput Graph Processing Framework on FPGA.
IEEE Trans. Parallel Distrib. Syst. 30, 10 (2019), 2249–2264.
Session 3: Machine Learning and Supporting Algorithms FPGA ’21, February 28–March 2, 2021, Virtual Event, USA
159

More Related Content

More from Subhajit Sahu

Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Subhajit Sahu
 
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESFast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESSubhajit Sahu
 
Can you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTESCan you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTESSubhajit Sahu
 
HITS algorithm : NOTES
HITS algorithm : NOTESHITS algorithm : NOTES
HITS algorithm : NOTESSubhajit Sahu
 
Basic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTESBasic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTESSubhajit Sahu
 
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDESDynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDESSubhajit Sahu
 
Are Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTESAre Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTESSubhajit Sahu
 
Taxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTESTaxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTESSubhajit Sahu
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESSubhajit Sahu
 
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...Subhajit Sahu
 
Income Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTESIncome Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTESSubhajit Sahu
 
Youngistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTESYoungistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTESSubhajit Sahu
 
Youngistaan: Voting awarness-campaign : NOTES
Youngistaan: Voting awarness-campaign : NOTESYoungistaan: Voting awarness-campaign : NOTES
Youngistaan: Voting awarness-campaign : NOTESSubhajit Sahu
 
Cost Efficient PageRank Computation using GPU : NOTES
Cost Efficient PageRank Computation using GPU : NOTESCost Efficient PageRank Computation using GPU : NOTES
Cost Efficient PageRank Computation using GPU : NOTESSubhajit Sahu
 
Rank adjustment strategies for Dynamic PageRank : REPORT
Rank adjustment strategies for Dynamic PageRank : REPORTRank adjustment strategies for Dynamic PageRank : REPORT
Rank adjustment strategies for Dynamic PageRank : REPORTSubhajit Sahu
 
Variadic CRTP : NOTES
Variadic CRTP : NOTESVariadic CRTP : NOTES
Variadic CRTP : NOTESSubhajit Sahu
 
Proceedings Scholar Metrics : NOTES
Proceedings Scholar Metrics : NOTESProceedings Scholar Metrics : NOTES
Proceedings Scholar Metrics : NOTESSubhajit Sahu
 
Top CSE conferences list (IIIT Hyderabad) : NOTES
Top CSE conferences list (IIIT Hyderabad) : NOTESTop CSE conferences list (IIIT Hyderabad) : NOTES
Top CSE conferences list (IIIT Hyderabad) : NOTESSubhajit Sahu
 
Policy on stipend support for research students (IIIT Hyderabad) : NOTES
Policy on stipend support for research students (IIIT Hyderabad) : NOTESPolicy on stipend support for research students (IIIT Hyderabad) : NOTES
Policy on stipend support for research students (IIIT Hyderabad) : NOTESSubhajit Sahu
 
Submitting the Thesis Evaluation Request by MS/PhD Students (IIIT Hyderabad) ...
Submitting the Thesis Evaluation Request by MS/PhD Students (IIIT Hyderabad) ...Submitting the Thesis Evaluation Request by MS/PhD Students (IIIT Hyderabad) ...
Submitting the Thesis Evaluation Request by MS/PhD Students (IIIT Hyderabad) ...Subhajit Sahu
 

More from Subhajit Sahu (20)

Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
 
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESFast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTES
 
Can you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTESCan you fix farming by going back 8000 years : NOTES
Can you fix farming by going back 8000 years : NOTES
 
HITS algorithm : NOTES
HITS algorithm : NOTESHITS algorithm : NOTES
HITS algorithm : NOTES
 
Basic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTESBasic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTES
 
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDESDynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
 
Are Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTESAre Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTES
 
Taxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTESTaxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTES
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTES
 
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
 
Income Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTESIncome Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTES
 
Youngistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTESYoungistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTES
 
Youngistaan: Voting awarness-campaign : NOTES
Youngistaan: Voting awarness-campaign : NOTESYoungistaan: Voting awarness-campaign : NOTES
Youngistaan: Voting awarness-campaign : NOTES
 
Cost Efficient PageRank Computation using GPU : NOTES
Cost Efficient PageRank Computation using GPU : NOTESCost Efficient PageRank Computation using GPU : NOTES
Cost Efficient PageRank Computation using GPU : NOTES
 
Rank adjustment strategies for Dynamic PageRank : REPORT
Rank adjustment strategies for Dynamic PageRank : REPORTRank adjustment strategies for Dynamic PageRank : REPORT
Rank adjustment strategies for Dynamic PageRank : REPORT
 
Variadic CRTP : NOTES
Variadic CRTP : NOTESVariadic CRTP : NOTES
Variadic CRTP : NOTES
 
Proceedings Scholar Metrics : NOTES
Proceedings Scholar Metrics : NOTESProceedings Scholar Metrics : NOTES
Proceedings Scholar Metrics : NOTES
 
Top CSE conferences list (IIIT Hyderabad) : NOTES
Top CSE conferences list (IIIT Hyderabad) : NOTESTop CSE conferences list (IIIT Hyderabad) : NOTES
Top CSE conferences list (IIIT Hyderabad) : NOTES
 
Policy on stipend support for research students (IIIT Hyderabad) : NOTES
Policy on stipend support for research students (IIIT Hyderabad) : NOTESPolicy on stipend support for research students (IIIT Hyderabad) : NOTES
Policy on stipend support for research students (IIIT Hyderabad) : NOTES
 
Submitting the Thesis Evaluation Request by MS/PhD Students (IIIT Hyderabad) ...
Submitting the Thesis Evaluation Request by MS/PhD Students (IIIT Hyderabad) ...Submitting the Thesis Evaluation Request by MS/PhD Students (IIIT Hyderabad) ...
Submitting the Thesis Evaluation Request by MS/PhD Students (IIIT Hyderabad) ...
 

Recently uploaded

What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number SystemsJheuzeDellosa
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataBradBedford3
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfkalichargn70th171
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackVICTOR MAESTRE RAMIREZ
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Modelsaagamshah0812
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...soniya singh
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about usDynamic Netsoft
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 

Recently uploaded (20)

What is Binary Language? Computer Number Systems
What is Binary Language?  Computer Number SystemsWhat is Binary Language?  Computer Number Systems
What is Binary Language? Computer Number Systems
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer DataAdobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
Adobe Marketo Engage Deep Dives: Using Webhooks to Transfer Data
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Exploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the ProcessExploring iOS App Development: Simplifying the Process
Exploring iOS App Development: Simplifying the Process
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Cloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStackCloud Management Software Platforms: OpenStack
Cloud Management Software Platforms: OpenStack
 
Unlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language ModelsUnlocking the Future of AI Agents with Large Language Models
Unlocking the Future of AI Agents with Large Language Models
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
Russian Call Girls in Karol Bagh Aasnvi ➡️ 8264348440 💋📞 Independent Escort S...
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
DNT_Corporate presentation know about us
DNT_Corporate presentation know about usDNT_Corporate presentation know about us
DNT_Corporate presentation know about us
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 

GraSU: A Fast Graph Update Library for FPGA-based Dynamic Graph Processing

  • 1. GraSU: A Fast Graph Update Library for FPGA-based Dynamic Graph Processing Qinggang Wang, Long Zheng, Yu Huang, Pengcheng Yao, Chuangyi Gui, Xiaofei Liao, Hai Jin, Wenbin Jiang, and Fubing Mao National Engineering Research Center for Big Data Technology and System/Service Computing Technology and System Lab/Cluster and Grid Computing Lab, Huazhong University of Science and Technology, China {qgwang,longzh,yuh,pcyao,chygui,xfliao,hjin,wenbinjiang,fbmao}@hust.edu.cn ABSTRACT Existing FPGA-based graph accelerators, typically designed for static graphs, rarely handle dynamic graphs that often involve sub- stantial graph updates (e.g., edge/node insertion and deletion) over time. In this paper, we aim to fill this gap. The key innovation of this work is to build an FPGA-based dynamic graph accelerator easily from any off-the-shelf static graph accelerator with minimal hard- ware engineering efforts (rather than from scratch). We observe spatial similarity of dynamic graph updates in the sense that most of graph updates get involved with only a small fraction of vertices. We therefore propose an FPGA library, called GraSU, to exploit spatial similarity for fast graph updates. GraSU uses a differential data management, which retains the high-value data (that will be frequently accessed) in the specialized on-chip UltraRAM while the overwhelming majority of low-value ones reside in the off-chip memory. Thus, GraSU can transform most of off-chip communica- tions arising in dynamic graph updates into fast on-chip memory accesses. Our experiences show that GraSU can be easily integrated into existing state-of-the-art static graph accelerators with only 11 lines of code modifications. Our implementation atop AccuGraph using a Xilinx Alveo™ U250 board outperforms two state-of-the-art CPU-based dynamic graph systems, Stinger and Aspen, by an aver- age of 34.24× and 4.42× in terms of update throughput, improving further overall efficiency by 9.80× and 3.07× on average. CCS CONCEPTS • Hardware → Reconfigurable logic and FPGAs. KEYWORDS Accelerators; Dynamic Graph; Library ACM Reference Format: Qinggang Wang, Long Zheng, Yu Huang, Pengcheng Yao, Chuangyi Gui, Xi- aofei Liao, Hai Jin, Wenbin Jiang, and Fubing Mao. 2021. GraSU: A Fast Graph Update Library for FPGA-based Dynamic Graph Processing. In Proceedings of the 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA ’21), February 28-March 2, 2021, Virtual Event, USA. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3431920.3439288 Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. FPGA ’21, February 28-March 2, 2021, Virtual Event, USA © 2021 Association for Computing Machinery. ACM ISBN 978-1-4503-8218-2/21/02...$15.00 https://doi.org/10.1145/3431920.3439288 1 INTRODUCTION Graph processing has been widely used for relationship analy- sis in a large variety of domains, such as social network analyt- ics [41], financial fraud detection [34], and coronavirus transmis- sion tracking [47]. 
In recent years, hardware acceleration has been demonstrated to effectively boost the performance of graph pro- cessing [19, 20, 22, 33]. In particular, FPGA can be regarded as one of most promising hardware platforms for graph processing due to its fine-grained parallelism, low power consumption, and flexible configurability, showing impressive results [4, 7, 10–12, 15, 24, 31, 32, 39, 44, 50–52, 55]. Unfortunately, most of existing FPGA-based graph accelerators are typically designed for handling static graphs [19]. In actuality, real-world graphs are often subject to change over time [27, 38], a.k.a. dynamic graphs. For instance, the twitter graph may be added or deleted with 6,000 tweets per second [36]. To process a dynamic graph, it often relies on an iterative process with two basic kennels: graph update and graph computation [1, 5]. Graph update aims to generate a new graph by modifying a graph snapshot with addition and deletion operations upon a vertex or an edge. Graph computa- tion (which can be a graph algorithm or an ad-hoc query) is then performed based on this new graph to extract useful information. In this context, graph computation has been well studied in existing FPGA-based graph processing accelerators [10, 11, 39, 44, 50, 55], while graph update is still understudied significantly [1, 19]. Prior studies show that graph update can function as importantly as graph computation [1]. An inefficient graph update may lead to the unexpected outcome [38]. A typical example is financial fraud detection, which aims to find fake transactions using the ring anal- ysis on evolving graphs. In this case, a slow-rate update may cause the ring detection to operate upon an outdated graph such that the analysis results can be inaccurate or even useless [34]. Although earlier studies on traditional architectures [6, 13, 14, 16, 29, 38, 46] adopt novel graph representations to improve the concurrency of dynamic graph updates, their underlying efficiency is still limited due to the inflexibility of traditional memory subsystems, yielding excessive off-chip communications [1, 19]. In this paper, we focus on filling the gap between graph update and graph computation using FPGA. We expect to develop a specialized FPGA-based graph update library that can be easily integrated into any off-the-shelf static graph accelerator so as to support dynamic graph processing with minimal hardware engineering efforts. However, performing graph updates efficiently on FPGAs re- mains challenging. For a typical FPGA-based graph accelerator architecture (as shown in Figure 1(a)), vertices are often stored Session 3: Machine Learning and Supporting Algorithms FPGA ’21, February 28–March 2, 2021, Virtual Event, USA 149
  • 2. Off-the-shelf Graph Computation PEs Graph Update PEs FPGA chip Off-chip Memory Edge Data … Vertex data are stored in BRAMs Register Update Handling Logic ① ② ③ Expensive off-chip communication overhead (a) Existing accelerators with naive graph update scheme Graph Update PEs Off-chip Memory Vertex data are stored in BRAMs FPGA chip Edge Data … UltraRAMs Register Edge Update Handling Logic Value-Aware Memory Manager Incremental Value Measurer Off-the-shelf Graph Computation PEs (b) Existing accelerators integrated with GraSU Figure 1: Handling dynamic graph updates under existing static graph accelerators with (a) naive update scheme and (b) GraSU library. The edge data associated with different vertices are marked in different colors. in the on-chip BRAMs to reduce substantial random access over- heads of vertex data, while edges, massive in quantity, have to be loaded in a streaming-apply fashion from the off-chip mem- ory [10, 11, 39, 44, 50, 55]. In this context, a graph update operation can be split into three basic steps. Consider an edge update. First, the source vertex index of the to-be-updated edge is read. Its as- sociated edge array will be loaded from the off-chip memory, and stored into the registers attached to each processing element (PE). Second, the to-be-updated edge is then inserted into (or deleted from) the loaded edge array, which is finally written back to the off-chip memory. In the real world, there often are a large num- ber of updates applied. The off-chip edge data may be repeatedly and redundantly accessed by each separate PE, thus resulting in expensive off-chip communication overheads (as discussed in §2.2). In this paper, we focus on addressing whether and how the most off-chip communications arising in dynamic graph updates can be transformed into on-chip memory accesses. Note that we con- sider only edge updates in this work since vertex updates can be represented by a series of edge insertions and/or deletions [38]. We observe that dynamic graph updates operating upon real- world graphs show significant spatial similarity for off-chip edge data access, in the sense that most random off-chip memory re- quests come from accessing the edges associated with a few ‘valu- able’ vertices (as discussed in §2.3). Consider the realistic dynamic graph wiki-talk-temporal [28], most (>99.04%) of edge updates are associated with only 5% of vertices. This motivates us to reduce the most off-chip communication overheads by applying a differ- ential data management, in which for each batch of graph update, high-value vertices’ edges are resident on the specialized on-chip UltraRAM [48] while most of low-value ones are stored in the off- chip memory. Note that the value of a vertex represents the number of edges updating it during an update batch, differential data man- agement distinguishes the data based on value priority for being stored on-chip or off-chip. However, achieving this goal for accel- erating graph updates on FPGA is still challenging. First, the data that are valuable for each batch of graph update are dynamically changed over time. An offline value measurement towards static graphs may lead to inaccurate detection on valuable data [53]. Yet, performing the exhaustive value computation on the fly is also expensive. Second, the differential memory architecture makes data addressing more complex. Both on-chip UltraRAM and off-chip DRAM can store edge data. 
We must accurately know which data Time Graph Computation Time Graph Update #0 Store graph in a format Graph Computation Graph Update #1 Graph Computation (a) Static Graph Processing (b) Dynamic Graph Processing Graph Update #n Graph Computation … Figure 2: The basic workflow of (a) static graph processing and (b) dynamic graph processing is in which memory location in a space-efficient manner, which is also difficult. In this paper, we propose an FPGA library, called GraSU, for fast graph updates. As shown in Figure 1(b), GraSU consists of an incremental value measurer and a value-aware differential memory manager to fully exploit spatial similarity of dynamic graph up- dates. Consider the value of a vertex exhibits significant difference across different update batches for dynamic graphs, we propose to quantify the value incrementally based on the update history to capture important changes across batches such that the detec- tion accuracy can be improved dynamically. GraSU also enables to overlap incremental value measurement with normal graph com- putation and thus its runtime overheads can be fully hidden. To make a better tradeoff between space overhead and efficiency for differential memory addressing, we present a value-aware data management, which reduces unnecessary memory consumption by leveraging a bitmap and implements the fast yet accurate data addressing via bitmap-assisted address resolution. This paper has the following main contributions: • We observe spatial similarity arising in dynamic graph up- dates on real-world graphs for improving memory efficiency of dynamic graph processing. • We present an FPGA library, namely GraSU, which uses a differential data management to exploit spatial similarity for fast graph updates. GraSU can be easily integrated into any FPGA-based static graph accelerator with only a few lines of code modifications so as to handle dynamic graphs. • Our implementation on a Xilinx Alveo™ U250 card outper- forms Stinger [14] and Aspen [13] by an average of 34.24× and 4.42× in terms of update throughput, and improve fur- ther overall efficiency by 9.80× and 3.07× on average. The rest of this paper is organized as follows. §2 introduces the background and motivation. The overview of GraSU is presented in §3. §4 and §5 elaborate the value-aware differential data man- agement. §6 discusses the results. The related work is surveyed in §7. §8 concludes the paper. 2 BACKGROUND AND MOTIVATION We first review preliminaries of dynamic graph processing. We then identify memory inefficiency of existing FPGA-based graph accelerators for dynamic graphs, finally motivating our approach. 2.1 Dynamic Graph Processing Figure 2 depicts the basic workflows of static and dynamic graph processing, respectively. Static graph processing (Figure 2(a)) often works on topology-fixed graphs that can be organized in different graph representations, e.g., compressed sparse row (CSR) [50], com- pressed sparse column (CSC) [44], and coordinate list (COO) [55]. Dynamic graph processing can be understood as performing a se- ries of graph computations upon different graph versions that are successively generated by a sequence of graph update batches (as Session 3: Machine Learning and Supporting Algorithms FPGA ’21, February 28–March 2, 2021, Virtual Event, USA 150
  • 3. (a) Graph Example 0 2 5 6 1 2 0 1 2 0 Vertex IDs Offset Array Edge Array (Neighbour IDs) 0 1 2 (b) CSR-based Format 0 1 2 [0,3] S 1 2 S 0 1 2 S 0 [12,15] [8,11] [4,7] [0,8] [8,15] [0,15] Edge Array 0 5 13 16 Offset Array (c) PMA-based Dynamic Graph Representation Vertex IDs 0 1 2 Segment #0 Segment #1 Segment #2 Segment #3 Figure 3: A simplified example for illustrating the PMA for- mat and how it supports CSR format. (a) An example graph. (b) CSR-based static graph format. (c) PMA-based dynamic graph format. ‘𝑆’ is a sentinel entry indicating each vertex’s edge range for maintaining the offset array. When a sentinel is changed, the corresponding entry in the offset array will be modified. shown in Figure 2(b)). To increase the concurrency of graph updates, earlier studies [6, 8, 13, 14, 16, 18, 21, 25, 29, 37, 38, 45, 46] present various dynamic graph representations based on Packed Memory Array (PMA) [38, 45], tree [6, 13], CSR variants [14, 16, 29, 37], ad- jacency list (AL) [8, 25, 46], and hash table [18, 21]. Compared to others applying on a specific graph format, PMA can be flexibly adaptive to different graph representations [38]. In this work, we architect GraSU based on the PMA format for the purpose of sup- porting a wide variety of FPGA-based static graph accelerators that may adopt different underlying graph representations. Figure 3 depicts an example graph with the PMA format and how it supports the traditional CSR format. As shown in Figure 3(c), the PMA format maintains sorted edges in a partially-consecutive manner by leaving some gaps for possible edge updates where ‘𝑆’ is a sentinel entry indicating each vertex’s edge range for maintaining the offset array. Logically, the PMA separates the whole edge array into a series of leaf segments and keeps an implicit tree for locating the position of edge updates quickly. When a leaf segment becomes full or empty, all edges of its parent will be redistributed to rebalance the edges array. Suppose all gaps are exhausted, the root segment will be doubled with workload rebalance re-invoked. 2.2 Off-Chip Communication Overheads As described in Figure 1(a), in an effort to reduce the random vertex access overheads, existing FPGA-based graph accelerators often store vertex data in the BRAMs while edges reside in the off-chip memory. In this setting, many edges may be repeatedly loaded from the off-chip memory by different update operations. In addition, these off-chip edge accesses closely depend on the sequence of the edges that need to be updated, and therefore are totally random with excessive off-chip communication overheads, further slowing down the overall efficiency of graph updates. To demonstrate this, we conduct a set of experiments to break down the real computation time and off-chip communication time of graph updates operating over AccuGraph [50]. Figure 4 shows the results on five real-world dynamic graphs (i.e., sx-askubuntu (AU), sx-superuser (SU), wiki-talk-temporal (WK), sx-stackoverflow (SO), and soc-bitcoin (BC), more details as shown in Table 2) with a varying number of edge updates. We see that communication over- heads dominate the overall execution time of graph updates for all dynamic graphs. In particular, communication overheads increase 0 2 4 6 8 10 Commu. 
Overhead Real Computation A U S U W K S O B C 20% A U S U W K S O B C 40% A U S U W K S O B C 80% A U S U W K S O B C 100% Normalized execution time Figure 4: Execution time breakdown of graph updates op- erating upon AccuGraph by applying different proportion (20%/40%/80%/100%) of edge updates on five real-world dy- namic graphs. All results are normalized to the case of ap- plying 20% edge updates. 2 5 4 9 8 7 3 1 6 0 Time 0 1 8 6 5 7 5 0 8 2 8 7 2 4 5 2 8 9 5 4 Edge Updates Applying 1 3 6 5 7 2 0 2 5 7 9 2 4 5 7 9 Edge Array Vertex IDs Offset Array 0 1 2 0 2 3 5 6 7 13 13 13 19 19 3 4 5 6 7 8 9 (a) Edge Updates (c) Graph Topology (b) Graph Data Edge data of vertex V5 Figure 5: An example for illustrating spatial similarity of edge updates. The blue solid (dashed) lines indicate edge in- sertions (deletions). significantly as edge update sizes increase. For example, when 20% of edge updates are applied for WK, communication overheads take 65.80% of overall performance. However, as this proportion is increased to 100%, communication overheads will be up to 83.26%, exerting more pressure on the off-chip memory bandwidth. Overall, communication overheads can take most (85.62% on average) of overall performance of dynamic graph updates. 2.3 Differential Data Management In this work, we find that not all off-chip communications arising in graph updates for real-world graphs are contributed equally. The reason behind this is complex. The nature of “rich get richer” [26] and the power-law degree feature1 of real-world graphs can help a partial understanding. The "rich get richer" nature indicates that a new edge update can be more likely operated on a high-degree ‘rich’ vertex. As a result, the minority of high-degree vertices get involved by most graph updates. In a real scenario of online shopping, users are more inclined to buy (i.e., edge update) those high-sale products which are often in the minority among all products. In summary, we have the following observation: Observation: Graph updates on real-world graphs show significant spatial similarity, indicating that most of the off-chip random memory accesses root from requesting a few vertices’ edge data. Figure 5 shows an example for helping understand spatial sim- ilarity. The initial graph (Figure 5(c)) will be modified by 10 edge update operations (Figure 5(a)). We can see that 8 out of 10 edge updates are associated to only 2 vertices, 5 and 8 . Figure 6 further investigates the percentage of edge updates that get involved by dif- ferent scale (1%∼5%) of top vertices. All results are collected from an 1Most vertices in a graph have a few neighbors while a few have many ones [17]. Session 3: Machine Learning and Supporting Algorithms FPGA ’21, February 28–March 2, 2021, Virtual Event, USA 151
  • 4. 0% 20% 40% 60% 80% 100% 1% 2% 3% 4% 5% AU SU WK SO BC Percentage of vertices Percentage of edge updates Figure 6: Quantitative relationship between edge updates and vertices for five real-world dynamic graphs offline trace analysis. Basically, we see that most edge updates are operated on a few vertices. Consider SO, 58.28% of edge updates are operated upon the top-1% vertices. However, the percentage growth becomes gradually slow and saturated in the case that vertex ratio is 5%. Overall, we can see that 71.26%~99.04% of edge updates are focused on accessing the top-5% vertices. Note that spatial similar- ity is not assumed on power-law graphs. For USA-road [35] with a relatively-even degree distribution, 5% accident-prone vertices may frequently cause 66.74% road congestion. Spatial similarity motivates us to classify the entire edge data into high-value data (if it is requested by many edge updates) and low-value data (otherwise). As shown in Figure 5(b), the colored edges in the edge array are high-value data due to the association with frequently accessed 5 and 8 . This further requires to invent a differential data management, in which high-value data reside on the on-chip memory while low-value data are stored in the off-chip memory. In this case, most of random off-chip memory accesses arising in graph updates can be therefore transformed into fast on-chip accesses. However, realizing the differential data management on FPGA remains challenging and needs to meet at least two requirements: • Accuracy: We must accurately know the value of each ver- tex so as to place its associated edges into certain memory device. However, unlike static graphs, data value of dynamic graphs is dynamically various such that measuring it accu- rately and efficiently becomes difficult. • Space-Efficiency: In the differential memory architecture, both on-chip and off-chip memories have a copy of edge data. This introduces a new data addressing mechanism for positioning the data location. However, how to ensure the space efficiency of data addressing is also difficult. To address these issues, we propose GraSU that can exploit spa- tial similarity effectively and efficiently. 3 GRASU OVERVIEW In an effort to reduce excessive random off-chip communications induced by redundant memory accesses arising in graph updates, GraSU uses "value" to characterize the importance of data (i.e., high- accuracy value measurement) and treats them differentially (i.e., value-aware differential management) based on the key insight that not all off-chip accesses are created equally. 
3.1 Architecture Figure 7 shows the overall architecture of GraSU, which consists of five components: dynamic graph storage, incremental value (a) Overall Architecture of GraSU Graph Data (stored in PMA- based dynamic graph format) Edge Updates Batch #1 Batch #2 Batch #3 …… Off-chip Memory FPGA Chip Update Handling Logic Update-relevant Data Register Edge Updates Dispatcher Graph Update PEs Value-Aware Memory Manager Off-chip Memory Controller High-value Data Buffer Incremental Value Measurer Updates Buffer (processed in batches) Edge Read Edge Write Edge Update Edge Update Send access request to Value-Aware Memory Manager Edge Insertion/Deletion on Update-relevant Data Register (b) Architecture of Update Handling Logic Edge Array 0 5 13 16 Offset Array Vertex IDs 0 1 2 S12 S0 12 S0 [0,3][4,7][8,11][12,15] [8,15] [0,7] [0,15] Segment (c) Dynamic Graph Organization ➏ ➊ ➋ ➌ UltraRAM ➍ ➎ §4 §5 Figure 7: The GraSU architecture measurer, edge updates dispatcher, edge updates handling logic, and value-aware memory manager. Dynamic Graph Organization. As discussed in §2.1, we fol- low the PMA representation [38, 45]. In GraSU, both on-chip and off-chip memories can store edge data. There may be the case that a segment contains edges with different value levels, making it extremely difficult for data organization. To avoid this, we propose to enforce each segment to contain edges from only one vertex (Fig- ure 7(c)). In addition, for the traditional PMA format, the edge array space will be doubled if it becomes full. However, FPGA currently does not support dynamic memory allocation effectively [49]. To achieve this functionality, we pre-allocate the off-chip memory into many spaces physically, and use the segment space logically for space doubling when necessary. Incremental Value Measurer (IVM). The IVM module is re- sponsible for quantifying the value of each vertex for graph updates, and further notifying the value-aware memory manager (VMM) to dispatch edges of high-value vertices into the on-chip UltraRAM (❶). Since the data value is dynamically changing, IVM adopts an incremental value measurement based on graph update history to improve measurement accuracy constantly. IVM is invoked every time a batch of graph updates are completed (❻). Note that value measurement overheads can be fully hidden behind normal graph computations. More details are discussed in §4. Edge Update Dispatcher (EUD). When high-value data reside in the on-chip UltraRAM, EUD gets started (❷). It reads a batch of edge updates from the off-chip memory and further dispatches them to appropriate graph update PEs orderly according to the timestamp order of each edge update (❸). Update Handling Logic (UHL). The UHL module makes sure that each edge update can be correctly inserted into or deleted from the edge array. UHL is equipped with a three-stage pipeline: edge read, edge update, and edge write (Figure 7(b)). The read stage loads the requested data of the to-be-updated edge by sending a request to the VMM (❹), discussed below. Afterwards, the update stage performs the insertion or deletion operations. Finally, the write stage writes back the updated edge data into the off-chip memory (or the UltraRAM) through VMM. Session 3: Machine Learning and Supporting Algorithms FPGA ’21, February 28–March 2, 2021, Virtual Event, USA 152
3.2 Programming Interfaces

Table 1 shows the programming interfaces of GraSU for graph updates. Using GraSU, the upper-level graph algorithm programming does not need to be modified.

Table 1: Programming interfaces of GraSU

Interface              Description
GraSU_alloc_UltraRAM   UltraRAM allocation
GraSU_DGS              Transform a static graph into the PMA format
GraSU_Init_Value       Initialize the value of each vertex
GraSU_LHD              Load the high-value data into the UltraRAM
GraSU_Update_Start     Handle edge updates
GraSU_WHD              Write high-value data back into off-chip RAM
GraSU_Quantify_Value   Quantify the value of each vertex
GraSU_Overlap          Overlap value measurement with graph computation

Figure 8 shows an example of how GraSU is integrated into AccuGraph [50], an existing state-of-the-art graph accelerator. The two parameters of each interface indicate an input signal and an output signal, respectively.

```
 1  GraSU_alloc_UltraRAM( );
 2  GraSU_DGS( );
 3  GraSU_Init_Value( );
 4  DynamicGraphProcessing( ) {
 5    for ( each update batch ) {
 6      GraSU_LHD( update_batch_valid, LHD_valid );
 7      GraSU_Update_Start( LHD_valid, GUS_valid );
 8      GraSU_WHD( GUS_valid, WHD_valid );
 9      /* Overlap graph computation with value measurement */
10      GraSU_Overlap( WHD_valid ) {
11        /* Notify AccuGraph to start graph computation */
12        AccuGraph( WHD_valid, computation_valid );
13        GraSU_Quantify_Value( WHD_valid, quantify_valid );
14      }
15      /* Signals whether graph computation and value measurement are completed */
16      update_batch_valid = computation_valid & quantify_valid;
17    }
18  }
```

Figure 8: A uniform programming framework illustrating how to integrate GraSU into an existing static graph accelerator, AccuGraph [50], for handling dynamic graphs.

For each update batch, we use GraSU_LHD, GraSU_Update_Start, and GraSU_WHD to complete a graph update operation. After the graph update is finished, the output signal WHD_valid of GraSU_WHD becomes valid, simultaneously activating graph computation (i.e., AccuGraph) and value measurement for the next graph update (i.e., GraSU_Quantify_Value). The parameter update_batch_valid ensures semantic correctness across different update batches.
More generally, all the code in Figure 8 (excluding Line 12) constitutes a uniform programming template for performing graph updates. The whole code of the existing graph accelerator is treated as a module for graph computation. For integration, users only need to instantiate the accelerator module (Line 12) inside the top module and connect the accelerator's input/output signals with GraSU's other modules to coordinate when the accelerator is launched and finished. GraSU can thus be easily integrated into existing static graph accelerators for handling dynamic graphs with only 11 lines of code modifications. GraSU is implemented in Verilog; integrating GraSU with an HLS-based accelerator is also viable by converting the HLS design into a Verilog program.

4 VALUE MEASUREMENT

We first describe how to accurately quantify the value of a vertex so as to distinguish high-value data. We then discuss how the overheads of value measurement can be hidden through overlapping.

4.1 Quantifying the Value of a Vertex

According to the "rich get richer" conjecture, an intuitive way to quantify the value of a vertex is to use its degree. This approach is useful, but its accuracy is still far from ideal. Figure 9 shows the accuracy of the top-5% vertices identified in the ideal case and by the degree-based approach for the real-world dynamic graph WK. Accuracy is the ratio of the identified top-5% vertices' edge updates to the total edge updates; the ideal case means that all of the top-5% vertices' edge updates are precisely identified.

Figure 9: Accuracy of the top-5% vertices identified by three different schemes over update batches for WK. Ideal results are obtained through an offline trace analysis.

We see that the accuracy gap widens as the update batches proceed. In particular, for the 9th update batch, the top-5% vertices identified by the degree-based approach cover only 67.33% of edge updates, while the accuracy in the ideal case is as high as 99.04%. The reasons are twofold. First, some low-degree vertices in the base graph may have many edges inserted and thus gradually become high-degree as update batches proceed. This is common in real scenarios: for example, an obscure actor may gain many fans and become a superstar when his film succeeds.
Second, the edges of some high-degree vertices may grow more slowly than others. For example, when a "superspreader" is isolated and the chains of virus transmission are cut off, it soon returns to normal. Both phenomena indicate that the value of a vertex depends not only on its degree but also on its historical update frequency. We therefore quantify the value of a vertex as follows:

$$
Value_i(v) =
\begin{cases}
Deg(v) \times F_{i-1}(v), & 0 < i < N \\
Deg(v), & i = 0
\end{cases}
\qquad (1)
$$

where N is the number of update batches and Deg(v) is the number of out-going edges of vertex v. F_{i-1}(v) is the number of edge updates applied to v (also called the update frequency of v) in the (i-1)-th update batch. The value of v is initialized to Deg(v) before the 0-th update batch starts (i.e., Value_0(v)), and Value_i(v) denotes the value of v after applying the (i-1)-th update batch.

With Equation (1), the value is incrementally quantified and dynamically adjusted, gradually improving the prediction accuracy for high-value edge data. Figure 9 shows the superiority of this incremental approach: it captures 88.15% of the updates for the identified top-5% vertices, compared to 67.33% for the degree-based approach.

4.2 Overlapping Value Measurement and Graph Computation

As Equation (1) shows, measuring the value of a vertex requires its degree and update frequency, both of which change dynamically during graph updates. We thus have to compute them on the fly, which introduces potential runtime overheads. Fortunately, the interleaving between graph update and graph computation in each update batch (as shown in Figure 2(b)) offers the opportunity to hide these overheads.

Figure 10: The overlapping opportunity between value measurement and normal graph computation.

Figure 10 illustrates the overlapping scheme. When the i-th graph update completes, the edge data resident in the on-chip memory is written back to the off-chip memory. Afterwards, the graph computation engine starts working; in the meantime, the incremental value measurer starts quantifying vertex values for the (i+1)-th graph update. The time spent on value measurement is usually less than that spent on graph computation, because value measurement visits each vertex only once while graph computation iterates over the vertices [53]. Thus, the overheads of value measurement can usually be hidden fully under the execution time of graph computation.
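As a concrete reading of Equation (1), the sketch below recomputes every vertex's value in a single pass once a batch completes. It is our own illustration; the deg and freq_prev array layout is an assumption, not GraSU's RTL:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Behavioral sketch of Equation (1). deg[v] is the current out-degree of v;
// freq_prev[v] counts the edge updates applied to v in the previous batch.
std::vector<uint64_t> measure_values(const std::vector<uint32_t>& deg,
                                     const std::vector<uint32_t>& freq_prev,
                                     int batch_i) {
  std::vector<uint64_t> value(deg.size());
  for (std::size_t v = 0; v < deg.size(); ++v) {
    if (batch_i == 0)
      value[v] = deg[v];                               // Value_0(v) = Deg(v)
    else
      value[v] = uint64_t{deg[v]} * freq_prev[v];      // Deg(v) * F_{i-1}(v)
  }
  return value;
}
```

Because this is a single pass over the vertices, it is exactly the kind of work that Figure 10 overlaps with the iterative graph computation.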
5 VALUE-AWARE MEMORY MANAGEMENT

This section elaborates how to make full use of the quantified vertex values to identify high-value edge data accurately and to realize a differential data management that maximizes the value benefits.

5.1 High-Value Data Identification

Given the quantified vertex values, the next step is naturally to identify the high-value data that should be stored on-chip. This raises two questions: (1) which data is high-value for on-chip storage? and (2) which on-chip memory (BRAM or UltraRAM) should cache the high-value data?

High-Value Data Computation. GraSU tries to store as much high-value data as possible on-chip. Since the value and the edge count of each vertex change over time, and the on-chip memory capacity differs across FPGAs, we must dynamically compute the high-value data according to the on-chip memory capacity, the value of each vertex, and the edge size of each vertex (a greedy software sketch of this selection is given at the end of this subsection). We compute the high-value data as follows:

$$
\tau = \arg\max_k \left( \sum_{i=0}^{k} Size(EdgeOf(v_i)) \right), \quad
\text{where } k \in [0, |V|),\; v_i \in VSet,\; \sum_{i=0}^{k} Size(EdgeOf(v_i)) \le |OnchipMem|
\qquad (2)
$$

where EdgeOf(v_i) denotes the edges of v_i, Size(S) is the total memory size of the set S, and |V| is the number of vertices. VSet is the set of vertices sorted by their value from largest to smallest in the VMM module, \sum_{i=0}^{k} Size(EdgeOf(v_i)) is the total edge size of {v_0, ..., v_k}, and |OnchipMem| is the on-chip memory capacity. Equation (2) finds the largest \tau such that the total edge size of {v_0, ..., v_\tau} is nearly equal to the on-chip memory capacity. Once \tau is computed, the set of high-value data (denoted S_HVD) follows directly:

$$
S_{HVD} = \{\, EdgeOf(v_i) \mid i \in [0, \tau],\; v_i \in VSet \,\}
\qquad (3)
$$

That is, Equation (3) takes the edge data of {v_0, ..., v_\tau} as the high-value data.

UltraRAM vs. BRAM. We select UltraRAM to store high-value data for the following reasons. First, edges usually occupy larger blocks than vertices, since the edge data of a graph is often much larger than its vertex data. UltraRAM, with its coarse-grained block size, is therefore better suited to storing edges, while BRAM is typically used to store vertex data (as adopted in existing graph accelerators [10, 11, 39, 50]). Second, UltraRAM has a larger capacity (e.g., 1280 x 288Kb on a Xilinx U250) than BRAM (2000 x 36Kb), allowing more edges to be kept on-chip.
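Operationally, Equations (2) and (3) amount to a greedy prefix scan over the vertices sorted by value. A minimal sketch under our own naming, with sizes in bytes:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Greedy realization of Equations (2)-(3): sort vertices by value (descending)
// and take the longest prefix whose total edge size fits in on-chip memory.
// value[v] follows Equation (1); edge_bytes[v] is Size(EdgeOf(v)).
std::vector<uint32_t> select_high_value(const std::vector<uint64_t>& value,
                                        const std::vector<uint64_t>& edge_bytes,
                                        uint64_t onchip_capacity) {
  std::vector<uint32_t> vset(value.size());
  std::iota(vset.begin(), vset.end(), 0);
  std::sort(vset.begin(), vset.end(),
            [&](uint32_t a, uint32_t b) { return value[a] > value[b]; });

  std::vector<uint32_t> hvd;        // vertices whose edges go to the UltraRAM
  uint64_t used = 0;
  for (uint32_t v : vset) {         // the prefix {v_0, ..., v_tau} of VSet
    if (used + edge_bytes[v] > onchip_capacity) break;
    used += edge_bytes[v];
    hvd.push_back(v);
  }
  return hvd;
}
```

The returned set corresponds to S_HVD; everything outside it stays in the off-chip memory.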
5.2 Value-Aware Memory Access

So far, GraSU has the following memory settings for dynamic graph processing: vertex data is stored in the on-chip BRAM, high-value edge data resides in the on-chip UltraRAM, and low-value edge data resides in the off-chip memory. In this differential memory architecture, both the UltraRAM and the off-chip memory hold edge data; when a vertex is processed, we must therefore know in which memory, and at which location, its edge data resides. This complicates memory addressing. A naive approach is to use another offset array to record the on-chip edge data, but that incurs an extra space overhead of more than N x 4B, where N is the number of vertices.

Space-Efficient Memory Addressing. We present a simple yet effective bitmap-based method that yields a good tradeoff between space overhead and memory-addressing efficiency (a behavioral sketch of the addressing logic is given at the end of this subsection). Each vertex occupies only 1 bit in the bitmap. The bitmap is stored in the on-chip BRAMs, with 1 (0) indicating that the edge data of the corresponding vertex is stored on-chip (off-chip). Based on the bitmap, we can access the high-value data easily. When all the edge data of a vertex is loaded into the UltraRAM, its bit in the bitmap is set to 1, the corresponding entry in the offset array is set to the new offset within the UltraRAM, and the current end address of this vertex in the UltraRAM is recorded.

Figure 11: An example of value-aware data access management, where both the UltraRAM and the off-chip memory maintain a piece of edge data.

Figure 11 shows an example graph with 5 vertices. The edge data of V1 and V3 is high-value data to be loaded into the UltraRAM. Their new offsets in the UltraRAM (i.e., 0 and 7) are written into the corresponding entries of the offset array, and the bits of V1 and V3 in the bitmap are set to 1. The starting address of the UltraRAM is a constant, and the current end address of V3 is kept in a current_end_addr register. Note that the vertex bitmap can still be partitioned on a multi-FPGA platform for scalability, and the overhead of bitmap construction can be amortized across multiple executions of different graph algorithms operating on the same graph.

High-Value Data WriteBack. As described above, the offset array entries of high-value vertices are overwritten with their on-chip UltraRAM offsets. When this high-value data is written back to the off-chip memory for data consistency, we must recompute the original offsets in the offset array. Specifically, starting from a given vertex v, we scan the bitmap to find the first subsequent vertex marked with '0' (denoted v_0) and the first subsequent vertex marked with '1' (denoted v_1). The number of out-going edges of v is then the offset difference between v_1 and v, and the original off-chip memory offset of v is computed as:

$$
offset(v) = offset(v_0) - Deg(v)
\qquad (4)
$$

where offset(v) denotes the off-chip memory offset of vertex v. Figure 11 shows how the original off-chip offset of vertex V1 is calculated. First, we scan the bitmap and find v_0 = V2 and v_1 = V3. Then we compute Deg(V1) = 7 - 0 = 7. Thus, the original offset of V1 is offset(V1) = 9 - 7 = 2.

Handling Read Requests. A read request for a vertex v loads all the edges of v into the UHL, which requires three pieces of information. First, we must know where the edge data of v is stored; this is obtained from the bitmap, with 1 (0) indicating on-chip (off-chip) storage. Second, we compute the initial address of the edge data of v (denoted address(v)) by adding the starting memory address and the corresponding offset of v in the offset array. Third, we need the number of out-going edges of v (i.e., Deg(v)), for which two cases must be considered. If the edge data is on-chip, Deg(v) is the offset difference between v_1 (defined above) and v. If it is off-chip, we find v's first subsequent vertex (denoted v_s) and compute offset(v_s) via Equation (4) (if v_s is on-chip); Deg(v) is then the offset difference between v_s and v. Finally, the edge data of v (denoted data(v)) is returned to the UHL as a tuple ⟨v, address(v), data(v)⟩.

Handling Write Requests. When a write request with a tuple ⟨v, address(v), data(v)⟩ arrives, the updated edge data of v in the UHL's register is written back to the UltraRAM (or the off-chip memory). As with reads, we consult the bitmap to determine whether the vertex's data resides in the UltraRAM or the off-chip memory, with 1 (0) indicating that data(v) is written to the UltraRAM (off-chip memory) at address(v).
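The bitmap scan, the writeback offset of Equation (4), and the degree resolution used by read requests can be sketched together. The model below is ours, not the RTL; it assumes the offset array carries a trailing sentinel entry marking the end of the edge array (in hardware, the current_end_addr register plays this role for the last on-chip vertex):

```cpp
#include <cstdint>
#include <vector>

struct AddrState {
  std::vector<bool>     bitmap;  // 1 = edges in UltraRAM, 0 = off-chip; |V| bits
  std::vector<uint32_t> offset;  // |V|+1 entries; the last one is an end sentinel
};

// First vertex at or after `v` whose bit equals `bit`; returns |V| if none.
static uint32_t first_with_bit(const AddrState& s, uint32_t v, bool bit) {
  while (v < s.bitmap.size() && s.bitmap[v] != bit) ++v;
  return v;
}

// Equation (4): original off-chip offset of an on-chip vertex v.
// Boundary simplification: the sentinel stands in for current_end_addr.
uint32_t offchip_offset(const AddrState& s, uint32_t v) {
  uint32_t v1 = first_with_bit(s, v + 1, true);   // next on-chip vertex
  uint32_t v0 = first_with_bit(s, v + 1, false);  // next off-chip vertex
  uint32_t deg = s.offset[v1] - s.offset[v];      // Deg(v): on-chip offset gap
  return s.offset[v0] - deg;                      // offset(v0) - Deg(v)
}

// Degree resolution for a read request (the two cases of the text above).
uint32_t degree(const AddrState& s, uint32_t v) {
  if (s.bitmap[v]) {                              // on-chip: gap to next on-chip vertex
    uint32_t v1 = first_with_bit(s, v + 1, true);
    return s.offset[v1] - s.offset[v];
  }
  uint32_t vs = v + 1;                            // off-chip: use first subsequent vertex
  uint32_t vs_off = (vs < s.bitmap.size() && s.bitmap[vs])
                        ? offchip_offset(s, vs)   // convert via Equation (4)
                        : s.offset[vs];           // already an off-chip offset
  return vs_off - s.offset[v];
}
```

For Figure 11 (V1 and V3 on-chip with offsets 0 and 7; V2 off-chip with offset 9), offchip_offset(s, 1) scans to v_1 = V3 and v_0 = V2 and returns 9 - 7 = 2, matching the worked example above.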
6 EXPERIMENTAL EVALUATION

This section evaluates the efficiency of GraSU and the effectiveness of its integration into existing static graph accelerators.

6.1 Experimental Setup

GraSU Settings. We implement GraSU on a Xilinx Alveo™ U250 accelerator card, which is equipped with an XCU250 FPGA chip and four 16GB DDR4 modules (Micron MTA18ASF2G72PZ-2G3B1). The FPGA chip provides 11.81MB of on-chip BRAM, 45MB of on-chip UltraRAM, 1.68M LUTs, and 3.37M registers. In our evaluation, GraSU is configured with 32 graph update PEs. Each segment in the PMA representation has a length of 8, and a vertex is represented with 4 bytes, so the update-relevant register buffer attached to each PE is 32 bytes. To evaluate efficiency, we integrate GraSU into the state-of-the-art static graph accelerator AccuGraph [50] to perform dynamic graph processing, referred to as AccuGraph-D. To demonstrate the usability of GraSU, we also integrate it into three other state-of-the-art FPGA-based graph accelerators: FabGraph [39], WaveScheduler [44], and ForeGraph [11].

Graph Datasets and Applications. As shown in Table 2, we use five real-world dynamic graphs publicly available from the Stanford Large Network Dataset Collection [28] and the Network Repository [35]. Every edge in a dynamic graph carries a timestamp indicating when it appears. For example, a directed edge ⟨u, v, t⟩ in WK indicates that Wikipedia user u edited the talk page of user v at time t. "#BEdges" in Table 2 denotes the edge set of the initial base graph before graph updates start; the number of edges to be updated is therefore #Edges - #BEdges. In our evaluation, all graphs are treated as directed, and their edge updates are divided into 10 batches by default.

Table 2: Real-world dynamic graph datasets

Datasets                      # Vertices   # Edges    # BEdges   Avg. degree
sx-askubuntu (AU) [28]        0.16M        0.96M      0.59M      6.05
sx-superuser (SU) [28]        0.19M        1.44M      0.92M      7.44
wiki-talk-temporal (WK) [28]  1.14M        7.83M      3.31M      6.87
sx-stackoverflow (SO) [28]    2.60M        63.50M     36.23M     24.40
soc-bitcoin (BC) [35]         24.58M       122.95M    60.49M     5.00

We evaluate three representative graph applications: Breadth-First Search (BFS), PageRank (PR), and Weakly Connected Components (WCC). In a dynamic graph processing scenario, every time an update batch finishes, we run a graph algorithm on the newly-updated graph. Note that we run 10 iterations for PageRank, and run BFS and WCC until convergence.
Baselines. We compare AccuGraph-D with two state-of-the-art CPU-based dynamic graph systems: Aspen [13] and Stinger in its latest version 15.10 [14]. Both run on a high-end server configured with two 14-core Intel Xeon E5-2680 v4 CPUs at 2.40GHz, 256GB of memory, and a 2TB HDD. We measure the update throughput of graph updates as the number of edges successfully updated per second. The overall efficiency of dynamic graph processing is the total of update time and graph computation time.

Table 3: Resource utilization and clock rates

                     BFS       PR        WCC
LUT                  12.28%    14.19%    12.89%
Register             5.64%     9.96%     6.73%
BRAM                 72.77%    82.18%    82.18%
UltraRAM             62.50%    62.50%    62.50%
Maximal clock rate   246MHz    211MHz    245MHz

Resource Utilization. Table 3 shows the resource utilization and clock rates of AccuGraph-D. All results are obtained via Xilinx SDAccel 2019.2. To preserve correctness, we conservatively set the clock rate to 200MHz in our experiments.

6.2 Graph Update Efficiency

Table 4 shows the update time (in seconds) and update throughput (in million edges per second) of graph updates for Stinger [14], Aspen [13], and AccuGraph-D.

Table 4: Update time (in seconds) and update throughput (in million edges per second) of graph updates for Stinger, Aspen, and AccuGraph-D. The ×Stinger and ×Aspen columns give the speedups of AccuGraph-D over Stinger and Aspen, respectively.

Graph   Stinger       Aspen         AccuGraph-D       ×Stinger   ×Aspen
AU      0.031/4.18    0.004/32.43   0.00068/190.77    45.58×     5.88×
SU      0.053/3.47    0.006/30.64   0.00125/147.08    42.40×     4.80×
WK      0.647/4.31    0.071/39.31   0.01375/202.97    47.05×     5.16×
SO      3.614/3.23    0.452/25.81   0.209/55.83       17.29×     2.16×
BC      6.752/9.25    1.473/42.40   0.358/174.46      18.86×     4.11×

AccuGraph-D vs. Stinger: Stinger maintains dynamic graph data with both an adjacency list and CSR. It divides the edges of each vertex into multiple blocks that preserve some gaps and are organized like an adjacency list; within each block, edges are stored in CSR form. When an edge update arrives, Stinger traverses the block list and inserts the to-be-updated edge into an empty position (or deletes it from the corresponding position).

Overall, AccuGraph-D performs graph updates with throughputs of 55.83~202.97M edges/second, while Stinger achieves only 3.23~9.25M edges/second. AccuGraph-D thus outperforms Stinger by 17.29×~47.05× (34.24× on average) in update throughput, for two reasons. First, traversing Stinger's block list incurs a low LLC hit ratio (often below 20%) [1]. This is particularly serious for handling edge updates on high-degree vertices, because a large number of edge updates are applied to only a few high-degree vertices in real-world dynamic graphs, significantly worsening update efficiency.

Figure 12: The total running time of AccuGraph-D against Stinger and Aspen. Each bar represents a system or an accelerator; patterned parts indicate graph computation time, unpatterned parts graph update time.
In contrast, AccuGraph-D associates the high-value edge data, which attracts a large fraction of edge updates, with the on-chip memory by caching it there, thereby avoiding excessive off-chip memory accesses. Second, when two edges are simultaneously updated on the same vertex, Stinger uses a locking mechanism to ensure correctness, further decreasing the parallelism of edge updates. AccuGraph-D instead adopts a PMA-based format variant, which chunks the edge array into many fine-grained segments and ensures lock-free segment updates [38].

AccuGraph-D vs. Aspen: Aspen is built on a purely-functional search tree, which greatly facilitates searching for the target position of a to-be-updated edge. However, Aspen repeatedly loads the same edge data into on-chip caches at different times, owing to the extremely irregular memory accesses of graph update and the limitations of typical replacement policies [3, 9, 23] on traditional architectures, whereas GraSU retains high-value edge data on-chip throughout graph update. GraSU thus avoids frequently transferring edge data between on-chip and off-chip memory and the redundant communication overheads that this entails. Compared with Aspen, AccuGraph-D improves update efficiency by 2.16×~5.88× (4.42× on average). SO exhibits the smallest improvement, only 2.16×, for a simple reason: SO has the highest average degree, 24.40 (Table 2), so the average edge data size per vertex is larger than in the other graphs. Given the limited, fixed-size UltraRAM, a larger average degree generally implies that fewer vertices' edge data can be stored on-chip; as a result, GraSU can exploit relatively few spatial-similarity opportunities.

6.3 Overall Efficiency

We also evaluate the total running time (i.e., update time plus graph computation time) of AccuGraph-D against Stinger and Aspen. Stinger and Aspen are in-memory systems, so disk-loading time is excluded; to make an apples-to-apples comparison, we also exclude the CPU-FPGA transfer time.

Figure 12 shows the overall performance results, where each bar consists of two parts: the patterned part represents graph computation time and the unpatterned part graph update time. Overall, compared with Stinger and Aspen, AccuGraph-D achieves the fastest total execution time for all three graph algorithms, with average speedups of 9.80× and 3.07×, respectively. This stems not only from the improved graph update efficiency but also from graph computation being accelerated by the FPGA.

We also observe that graph computation gradually comes to dominate overall performance as update efficiency improves. For instance, for BFS on WK, the graph computation proportion under Aspen is 60.16%, which rises to 80.76% once graph update efficiency is improved by GraSU. In this work, we focus on improving graph update efficiency; improving the performance of graph computation to further boost overall performance is left as interesting future work.
6.4 Effectiveness

We further investigate the benefit breakdown of GraSU's components: value-aware memory management (VMM), incremental value measurement (IVM), and the overlapping technique (OT). Figure 13 shows the breakdown results, where the baseline applies none of VMM, IVM, or OT; all results are normalized to the version with VMM, IVM, and OT applied. Note that AU and SU are small enough to fit all edges into the on-chip UltraRAM, in which case IVM and OT are disabled at runtime and the baseline is GraSU with only VMM. Overall, GraSU improves over the baseline by 6.14× on average.

Figure 13: Update efficiency of GraSU with and without value-aware memory management (VMM), incremental value measurement (IVM), and the overlapping technique (OT).

VMM. By keeping the high-value data (identified by the degree-based approach) on-chip, a large number of off-chip memory accesses are transformed into on-chip accesses. VMM therefore contributes a significant average speedup of 4.69× over the baseline, accounting for 65.56% of the overall graph update performance improvement. In particular, for the small graphs AU and SU, VMM improves over the baseline by 9.89× and 7.37×, respectively, for a clear reason: all their data is stored on-chip, so no off-chip communication occurs.

IVM. As shown in Figure 9, IVM significantly improves the prediction accuracy of high-value data. We next examine IVM's runtime overheads. Compared to the baseline, IVM offers only a 1.58× average speedup; for WK, IVM even causes a significant slowdown, making overall performance poorer than the baseline. The reason is that the benefit of the accuracy improvement is offset by the overheads of repeated value measurement across update batches. Fortunately, these overheads can be fully overlapped with the normal graph computation phase, as discussed below.

OT. With OT applied, the IVM-induced extra overhead is fully hidden behind normal graph computation, further improving the total execution time significantly: OT makes dynamic graph updates run 4.21× faster than without it, demonstrating its effectiveness. Overall, IVM and OT jointly contribute 34.44% of the performance improvement.
6.5 Sensitivity Study

We examine the sensitivity of GraSU's performance to (1) the UltraRAM size, (2) the update batch size, and (3) the number of PEs.

Figure 14: Update efficiency of GraSU with different UltraRAM sizes. All results are normalized to GraSU with an UltraRAM size of 800×288Kb.

UltraRAM Size. Figure 14 illustrates graph update performance with UltraRAM sizes ranging from 100×288Kb to 800×288Kb. Overall, the larger the UltraRAM, the better the performance, because a larger size allows more high-value edge data to be cached on-chip and requires fewer edge data transfers between on-chip and off-chip memory. In particular, AU can be stored entirely in the UltraRAM at both 400×288Kb and 800×288Kb, so its performance is unchanged between these two sizes. Moreover, when the UltraRAM size is scaled from 200×288Kb to 400×288Kb, AU exhibits a significant performance improvement (by up to 43.38%): the 400×288Kb UltraRAM caches high-value data that the 200×288Kb size misses, significantly reducing off-chip edge data accesses.

Figure 15: Update efficiency of GraSU with varying batch sizes of graph updates. All results are normalized to GraSU with an update batch size of 10%.

Update Batch Size. Figure 15 characterizes GraSU's update performance with different update batch sizes. For each graph, we divide the edges to be updated into 1000, 100, and 10 batches according to their timestamp range, corresponding to update batches containing 0.1%, 1%, and 10% of the updated edges, respectively. The batch size does not significantly affect update efficiency, since spatial similarity is not destroyed by the update batch scale. In addition, as the share of edge updates per batch decreases from 10% to 0.1%, the average update performance slightly increases from 1.0× to 1.23×, because more high-value data is mined through value measurement.

Figure 16: Update efficiency of GraSU with different numbers of PEs. All results are normalized to the update time with 2 PEs.

PE Number. Figure 16 plots update performance with 2/4/8/16/32 graph update PEs, normalized to the update time with 2 PEs. Overall, more PEs bring increasing performance improvements, but the growth rate gradually decreases. When the number of PEs increases from 16 to 32, update performance improves by only 1.38× on average, because more PEs issue more simultaneous memory requests, putting significant memory pressure on the value-aware memory manager and causing potential access conflicts. We leave addressing this problem as future work.
6.6 Integration with Other Graph Accelerators

Finally, we explore integrating GraSU into three other state-of-the-art FPGA-based graph accelerators: FabGraph [39], WaveScheduler [44], and ForeGraph [11] (in its single-FPGA version). As with AccuGraph, GraSU can be integrated with FabGraph, WaveScheduler, and ForeGraph to support dynamic graph processing with only 11 lines of code modifications, thanks to the following design choices. First, GraSU adopts a PMA-based dynamic graph organization, which can serve existing accelerators even though they use different underlying graph formats. Second, GraSU uses a lightweight bitmap-based method to implement differential memory access, avoiding significant modifications to the underlying memory subsystem and lowering the integration barrier. Third, GraSU provides a uniform integration framework with easy-to-use programming interfaces.

Figure 17: Overall dynamic graph processing performance of FabGraph, WaveScheduler, and ForeGraph (each integrated with GraSU) against Aspen.

Figure 17 shows the overall performance results of FabGraph-D, WaveScheduler-D, and ForeGraph-D against Aspen; the experimental environment is the same as for AccuGraph-D. Since FabGraph uses some of the UltraRAM resources as a shared vertex buffer, we allocate only 21.09MB (600×288Kb) of UltraRAM to buffer high-value data for FabGraph-D. The results show that the dynamic graph versions of FabGraph, WaveScheduler, and ForeGraph integrated with GraSU also outperform Aspen, with geometric-mean speedups of 2.93×, 3.09×, and 1.63×, demonstrating the generality and practicality of GraSU.

7 RELATED WORK

Graph Processing Accelerators. Owing to their random-access patterns, graph processing workloads generally suffer from a low compute-to-memory ratio. To improve memory efficiency, existing FPGA-based graph accelerators typically focus on optimizing on-chip memory accesses [10, 11, 39, 44, 50] and off-chip memory accesses [2, 12, 24, 31, 32, 51, 52, 54, 55]. For on-chip memory, some works mitigate the performance overheads caused by data conflicts in the on-chip BRAM [44, 50], some [10, 11] use on-chip data reuse to improve the locality of graph computation, and others hide the latency of loading data from off-chip memory into BRAM [39]. A number of accelerators [12, 31, 32, 54, 55] focus on improving the bandwidth utilization between on-chip and off-chip memory with sophisticated pipeline designs, while alternatives use emerging memory technologies (e.g., the hybrid memory cube) to further improve external memory accesses [24, 51, 52]. Unfortunately, these earlier studies are limited to handling static graphs [19]. To the best of our knowledge, there are currently few FPGA-based dynamic graph accelerators. In this work, we aim to fill the gap between static graph computation and dynamic graph update, and we restrict ourselves to building a fast graph update library that can be integrated easily into any existing FPGA-based static graph accelerator for handling dynamic graphs.

Dynamic Graph Processing Systems. Most existing dynamic graph systems fall into two categories [6, 8, 13, 14, 16, 18, 21, 25, 29, 37, 38, 45, 46].
The first category develops new dynamic graph representations based on different static data structures: CSR variants [14, 16, 29, 37], adjacency lists [8, 25, 46], hash tables [18, 21], trees [6, 13], and the Packed Memory Array (PMA) [38, 45]. These studies improve the concurrency of graph updates, but their efficiency remains limited by excessive off-chip memory accesses [1]. GraSU instead identifies spatial-similarity opportunities and presents a differential data management scheme to improve the memory efficiency of dynamic graph processing. The second category improves the efficiency of graph computation in dynamic graph processing scenarios by leveraging recent (rather than initial) vertex property values to accelerate the convergence of graph computation [8, 30, 37, 40, 42, 43]. In particular, accelerating graph computation using FPGAs stands out for yielding impressive results in both performance and energy efficiency [19, 39, 44, 50]. In this work, we place the emphasis on accelerating dynamic graph updates on FPGA, and we develop an FPGA library that can be integrated easily with minimal hardware engineering effort.

8 CONCLUSION

In this paper, we introduce GraSU, a graph update library for high-throughput graph updates on FPGA. GraSU can be integrated easily with any existing FPGA-based static graph accelerator, with only a few lines of code modifications, to handle dynamic graphs. GraSU features two key designs: incremental value measurement and value-aware differential memory management. The former quantifies data value accurately while its overheads are fully hidden behind normal graph computation; the latter exploits the spatial similarity of graph updates by retaining high-value data on-chip, so that most of the off-chip data communications arising in graph updates are transformed into fast on-chip memory accesses. We integrate GraSU into the state-of-the-art static graph accelerator AccuGraph to drive dynamic graph processing. Our implementation on a Xilinx Alveo™ U250 board demonstrates that the dynamic graph version of AccuGraph outperforms two state-of-the-art CPU-based dynamic graph systems, Stinger and Aspen, by an average of 34.24× and 4.42× in update throughput, and improves overall efficiency by 9.80× and 3.07× on average.

ACKNOWLEDGMENTS

This work is supported by the National Key Research and Development Program of China under Grant No. 2018YFB1003502 and the National Natural Science Foundation of China under Grant Nos. 62072195, 61825202, and 61832006. The correspondence should be addressed to Long Zheng.
REFERENCES

[1] Abanti Basak, Jilan Lin, Ryan Lorica, Xinfeng Xie, Zeshan Chishti, Alaa Alameldeen, and Yuan Xie. 2020. SAGA-Bench: Software and Hardware Characterization of Streaming Graph Analytics Workloads. In ISPASS. IEEE, 12–23.
[2] Andrew Bean, Nachiket Kapre, and Peter Y. K. Cheung. 2015. G-DMA: Improving Memory Access Performance for Hardware Accelerated Sparse Graph Computation. In ReConFig. IEEE, 1–6.
[3] Nathan Beckmann and Daniel Sánchez. 2015. Talus: A Simple Way to Remove Cliffs in Cache Performance. In HPCA. IEEE, 64–75.
[4] Maciej Besta, Marc Fischer, Tal Ben-Nun, Johannes de Fine Licht, and Torsten Hoefler. 2019. Substream-Centric Maximum Matchings on FPGA. In FPGA. ACM, 152–161.
[5] Maciej Besta, Marc Fischer, Vasiliki Kalavri, Michael Kapralov, and Torsten Hoefler. 2019. Practice of Streaming and Dynamic Graphs: Concepts, Models, Systems, and Parallelism. CoRR abs/1912.12740 (2019). arXiv:1912.12740.
[6] Federico Busato, Oded Green, Nicola Bombieri, and David A. Bader. 2018. Hornet: An Efficient Data Structure for Dynamic Sparse Graphs and Matrices on GPUs. In HPEC. IEEE, 1–7.
[7] Xinyu Chen, Ronak Bajaj, Yao Chen, Jiong He, Bingsheng He, Weng-Fai Wong, and Deming Chen. 2019. On-The-Fly Parallel Data Shuffling for Graph Processing on OpenCL-Based FPGAs. In FPL. IEEE, 67–73.
[8] Raymond Cheng, Ji Hong, Aapo Kyrola, Youshan Miao, Xuetian Weng, Ming Wu, Fan Yang, Lidong Zhou, Feng Zhao, and Enhong Chen. 2012. Kineograph: Taking the Pulse of a Fast-Changing and Connected World. In EuroSys. ACM, 85–98.
[9] Asaf Cidon, Assaf Eisenman, Mohammad Alizadeh, and Sachin Katti. 2016. Cliffhanger: Scaling Performance Cliffs in Web Memory Caches. In NSDI. USENIX, 379–392.
[10] Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. 2016. FPGP: Graph Processing Framework on FPGA: A Case Study of Breadth-First Search. In FPGA. ACM, 105–110.
[11] Guohao Dai, Tianhao Huang, Yuze Chi, Ningyi Xu, Yu Wang, and Huazhong Yang. 2017. ForeGraph: Exploring Large-Scale Graph Processing on Multi-FPGA Architecture. In FPGA. ACM, 217–226.
[12] Michael DeLorimier, Nachiket Kapre, Nikil Mehta, Dominic Rizzo, Ian Eslick, Raphael Rubin, Tomás E. Uribe, Thomas F. Knight Jr., and André DeHon. 2006. GraphStep: A System Architecture for Sparse-Graph Algorithms. In FCCM. IEEE, 143–151.
[13] Laxman Dhulipala, Guy E. Blelloch, and Julian Shun. 2019. Low-Latency Graph Streaming Using Compressed Purely-Functional Trees. In PLDI. ACM, 918–934.
[14] David Ediger, Robert McColl, E. Jason Riedy, and David A. Bader. 2012. STINGER: High Performance Data Structure for Streaming Graphs. In HPEC. IEEE, 1–5.
[15] Nina Engelhardt and Hayden Kwok-Hay So. 2016. GraVF: A Vertex-Centric Distributed Graph Processing Framework on FPGAs. In FPL. IEEE, 1–4.
[16] Guoyao Feng, Xiao Meng, and Khaled Ammar. 2015. DISTINGER: A Distributed Graph Data Structure for Massive Dynamic Graph Processing. In BigData. IEEE, 1814–1822.
[17] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In OSDI. USENIX, 17–30.
[18] Xiangyang Gou, Lei Zou, Chenxingyu Zhao, and Tong Yang. 2019. Fast and Accurate Graph Stream Summarization. In ICDE. IEEE, 1118–1129.
[19] Chuang-Yi Gui, Long Zheng, Bingsheng He, Cheng Liu, Xin-Yu Chen, Xiao-Fei Liao, and Hai Jin. 2019. A Survey on Graph Processing Accelerators: Challenges and Opportunities. J. Comput. Sci. Technol. 34, 2 (2019), 339–371.
[20] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret Martonosi. 2016. Graphicionado: A High-Performance and Energy-Efficient Accelerator for Graph Analytics. In MICRO. IEEE, 1–13.
[21] Keita Iwabuchi, Scott Sallinen, Roger A. Pearce, Brian Van Essen, Maya B. Gokhale, and Satoshi Matsuoka. 2016. Towards a Distributed Large-Scale Dynamic Graph Data Store. In IPDPS. IEEE, 892–901.
[22] Hai Jin, Pengcheng Yao, Xiaofei Liao, Long Zheng, and Xianliang Li. 2017. Towards Dataflow-Based Graph Accelerator. In ICDCS. IEEE, 1981–1992.
[23] Theodore Johnson and Dennis E. Shasha. 1994. 2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm. In VLDB. Morgan Kaufmann, 439–450.
[24] Soroosh Khoram, Jialiang Zhang, Maxwell Strange, and Jing Li. 2018. Accelerating Graph Analytics by Co-Optimizing Storage and Access on an FPGA-HMC Platform. In FPGA. ACM, 239–248.
[25] Pradeep Kumar and H. Howie Huang. 2019. GraphOne: A Data Store for Real-Time Analytics on Evolving Graphs. In FAST. USENIX, 249–263.
[26] Ravi Kumar, Jasmine Novak, and Andrew Tomkins. 2006. Structure and Evolution of Online Social Networks. In KDD. ACM, 611–617.
[27] Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. 2005. Graphs over Time: Densification Laws, Shrinking Diameters and Possible Explanations. In KDD. ACM, 177–187.
[28] Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data.
[29] Peter Macko, Virendra Marathe, Daniel Margo, and Margo Seltzer. 2015. LLAMA: Efficient Graph Analytics Using Large Multiversioned Arrays. In ICDE. IEEE, 363–374.
[30] Mugilan Mariappan and Keval Vora. 2019. GraphBolt: Dependency-Driven Synchronous Processing of Streaming Graphs. In EuroSys. ACM, 25:1–25:16.
[31] Eriko Nurvitadhi, Gabriel Weisz, Yu Wang, Skand Hurkat, Marie Nguyen, James C. Hoe, José F. Martínez, and Carlos Guestrin. 2014. GraphGen: An FPGA Framework for Vertex-Centric Graph Computation. In FCCM. IEEE, 25–28.
[32] Tayo Oguntebi and Kunle Olukotun. 2016. GraphOps: A Dataflow Library for Graph Analytics Acceleration. In FPGA. ACM, 111–117.
[33] Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, John Greth, Steven Burns, and Ozcan Ozturk. 2016. Energy Efficient Architecture for Graph Analytics Accelerators. In ISCA. IEEE, 166–177.
[34] Xiafei Qiu, Wubin Cen, Zhengping Qian, You Peng, Ying Zhang, Xuemin Lin, and Jingren Zhou. 2018. Real-Time Constrained Cycle Detection in Large Dynamic Graphs. Proc. VLDB Endow. 11, 12 (2018), 1876–1888.
[35] Ryan A. Rossi and Nesreen K. Ahmed. 2015. The Network Data Repository with Interactive Graph Analytics and Visualization. In AAAI. http://networkrepository.com.
[36] David Sayce. 2020. The Number of Tweets per Day in 2020. https://www.dsayce.com/social-media/tweets-day/.
[37] Dipanjan Sengupta, Narayanan Sundaram, Xia Zhu, Theodore L. Willke, Jeffrey S. Young, Matthew Wolf, and Karsten Schwan. 2016. GraphIn: An Online High Performance Incremental Graph Processing Framework. In Euro-Par. Springer, 319–333.
[38] Mo Sha, Yuchen Li, Bingsheng He, and Kian-Lee Tan. 2017. Accelerating Dynamic Graph Analytics on GPUs. Proc. VLDB Endow. 11, 1 (2017), 107–120.
[39] Zhiyuan Shao, Ruoshi Li, Diqing Hu, Xiaofei Liao, and Hai Jin. 2019. Improving Performance of Graph Processing on FPGA-DRAM Platform by Two-Level Vertex Caching. In FPGA. ACM, 320–329.
[40] Feng Sheng, Qiang Cao, Haoran Cai, Jie Yao, and Changsheng Xie. 2018. GraPU: Accelerate Streaming Graph Analysis through Preprocessing Buffered Updates. In SoCC. ACM, 301–312.
[41] Shuang Song, Xu Liu, Qinzhe Wu, Andreas Gerstlauer, Tao Li, and Lizy K. John. 2018. Start Late or Finish Early: A Distributed Graph Processing System with Redundancy Reduction. Proc. VLDB Endow. 12, 2 (2018), 154–168.
[42] Keval Vora, Rajiv Gupta, and Guoqing (Harry) Xu. 2016. Synergistic Analysis of Evolving Graphs. ACM Trans. Archit. Code Optim. 13, 4 (2016), 32:1–32:27.
[43] Keval Vora, Rajiv Gupta, and Guoqing (Harry) Xu. 2017. KickStarter: Fast and Accurate Computations on Streaming Graphs via Trimmed Approximations. In ASPLOS. ACM, 237–251.
[44] Qinggang Wang, Long Zheng, Jieshan Zhao, Xiaofei Liao, Hai Jin, and Jingling Xue. 2020. A Conflict-Free Scheduler for High-Performance Graph Processing on Multi-Pipeline FPGAs. ACM Trans. Archit. Code Optim. 17, 2 (2020), 14:1–14:26.
[45] Brian Wheatman and Helen Xu. 2018. Packed Compressed Sparse Row: A Dynamic Graph Representation. In HPEC. IEEE, 1–7.
[46] Martin Winter, Daniel Mlakar, Rhaleb Zayer, Hans-Peter Seidel, and Markus Steinberger. 2018. faimGraph: High Performance Management of Fully-Dynamic Graphs under Tight Memory Constraints on the GPU. In SC. ACM, 60:1–60:13.
[47] Alex Woodie, Tiffany Trader, George Leopold, John Russell, Oliver Peckham, James Kobielus, and Steve Conway. 2020. Tracking the Spread of Coronavirus with Graph Databases. datanami. https://www.datanami.com/2020/03/12/tracking-the-spread-of-coronavirus-with-graph-databases/.
[48] Xilinx. 2019. UltraScale Architecture Memory Resources User Guide. https://www.xilinx.com/support/documentation/user_guides/ug573-ultrascale-memory-resources.pdf.
[49] Xilinx. 2020. Vivado Design Suite User Guide: High-Level Synthesis. https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug902-vivado-high-level-synthesis.pdf.
[50] Pengcheng Yao, Long Zheng, Xiaofei Liao, Hai Jin, and Bingsheng He. 2018. An Efficient Graph Accelerator with Parallel Data Conflict Management. In PACT. ACM, 8:1–8:12.
[51] Jialiang Zhang, Soroosh Khoram, and Jing Li. 2017. Boosting the Performance of FPGA-Based Graph Processor Using Hybrid Memory Cube: A Case for Breadth-First Search. In FPGA. ACM, 207–216.
[52] Jialiang Zhang and Jing Li. 2018. Degree-Aware Hybrid Graph Traversal on FPGA-HMC Platform. In FPGA. ACM, 229–238.
[53] Long Zheng, Xianliang Li, Yaohui Zheng, Yu Huang, Xiaofei Liao, Hai Jin, Jingling Xue, Zhiyuan Shao, and Qiang-Sheng Hua. 2020. Scaph: Scalable GPU-Accelerated Graph Processing with Value-Driven Differential Scheduling. In ATC. USENIX, 573–588.
[54] Shijie Zhou, Charalampos Chelmis, and Viktor K. Prasanna. 2016. High-Throughput and Energy-Efficient Graph Processing on FPGA. In FCCM. IEEE, 103–110.
[55] Shijie Zhou, Rajgopal Kannan, Viktor K. Prasanna, Guna Seetharaman, and Qing Wu. 2019. HitGraph: High-Throughput Graph Processing Framework on FPGA. IEEE Trans. Parallel Distrib. Syst. 30, 10 (2019), 2249–2264.