Highlighted notes on GraSU: A Fast Graph Update Library for FPGA-based Dynamic Graph Processing.
While doing research work under Prof. Kishore Kothapalli.
In this paper the researchers focus on optimizing graph updates through a caching technique using the on-chip UltraRAM of an FPGA.
For dynamic graphs one can work on graph updates or on graph algorithms; they work on graph updates. They report geomean speedups.
For the graph data structure they use a Packed Memory Array (PMA) with a CSR representation, which supports fast edge addition and removal (I think they also mentioned fine-grained locking instead of a per-vertex lock). A bitmap technique avoids storing both UltraRAM and off-chip DRAM offsets. The algorithm that decides which vertices' edges to cache runs overlapped with the graph algorithm (its overhead is hidden). It is based on "rich gets richer": high-degree and recently updated vertices receive the most updates. Vertex data is stored in on-chip BRAM. Graph updates are performed in batches. GraSU can be integrated with existing FPGA graph accelerators for static graphs.
Latest FPGA graph-accelerators for static graphs:
AccuGraph, FabGraph, WaveScheduler, ForeGraph.
Latest CPU dynamic graphs systems:
Stinger: CSR with blocked linked lists (slow traversal for high-degree vertices), per-vertex lock
Aspen: based on purely functional search trees (a typical cache policy is not good enough)
Real-world dynamic graphs in <u, v, t> format:
sx-askubuntu, sx-superuser, wiki-talk-temporal, sx-stackoverflow, soc-bitcoin.
How to allocate space for dynamic files? For dynamic vectors (in RAM)? For dynamic vertices? These problems seem closely related, also to disk defragmentation.
Figure 1: Handling dynamic graph updates under existing
static graph accelerators with (a) naive update scheme and
(b) GraSU library. The edge data associated with different
vertices are marked in different colors.
in the on-chip BRAMs to reduce substantial random access over-
heads of vertex data, while edges, massive in quantity, have to
be loaded in a streaming-apply fashion from the off-chip mem-
ory [10, 11, 39, 44, 50, 55]. In this context, a graph update operation
can be split into three basic steps. Consider an edge update. First,
the source vertex index of the to-be-updated edge is read. Its as-
sociated edge array will be loaded from the off-chip memory, and
stored into the registers attached to each processing element (PE).
Second, the to-be-updated edge is then inserted into (or deleted
from) the loaded edge array, which is finally written back to the
off-chip memory. In the real world, a large number of updates are often applied, and the off-chip edge data may be repeatedly and redundantly accessed by each separate PE, resulting in expensive off-chip communication overheads (as discussed in §2.2).
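The three-step update path can be mirrored in software; below is a minimal Python model of the naive scheme (the CSR arrays and the function are illustrative stand-ins for the hardware PE logic, not GraSU code):

```python
def naive_edge_insert(offsets, edges, src, dst):
    """Model of the naive update scheme on a CSR graph: load the
    source vertex's edge array, modify it, and write it back."""
    # Step 1: read the source vertex index and load its edge array
    # (an off-chip read in hardware, repeated for every update).
    start, end = offsets[src], offsets[src + 1]
    edge_array = edges[start:end]
    # Step 2: insert the new edge into the loaded array.
    edge_array.append(dst)
    edge_array.sort()                      # keep neighbours sorted
    # Step 3: write the array back (an off-chip write in hardware)
    # and shift the offsets of all later vertices.
    edges[start:end] = edge_array
    for v in range(src + 1, len(offsets)):
        offsets[v] += 1

# CSR arrays of the 3-vertex example graph in Figure 3(b).
offsets = [0, 2, 5, 6]
edges = [1, 2, 0, 1, 2, 0]
naive_edge_insert(offsets, edges, 2, 1)    # insert edge 2 -> 1
```

Because each update reloads and rewrites a whole edge array, repeating this per update is exactly the redundant off-chip traffic the paper targets.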
In this paper, we focus on whether and how most of the off-chip communication arising in dynamic graph updates can be transformed into on-chip memory accesses. Note that we consider only edge updates in this work, since vertex updates can be represented by a series of edge insertions and/or deletions [38].
We observe that dynamic graph updates on real-world graphs show significant spatial similarity in off-chip edge data accesses, in the sense that most random off-chip memory requests come from accessing the edges associated with a few 'valuable' vertices (as discussed in §2.3). Consider the realistic dynamic graph wiki-talk-temporal [28]: most (>99.04%) of its edge updates are associated with only 5% of the vertices. This motivates us to reduce most of the off-chip communication overhead through differential data management, in which, for each batch of graph updates, the edges of high-value vertices reside in the specialized on-chip UltraRAM [48] while most low-value ones stay in the off-chip memory. Note that the value of a vertex represents the number of edge updates it receives during an update batch; differential data management uses this value to decide which data are stored on-chip and which off-chip. However, achieving this goal for accelerating graph updates on FPGA is still challenging. First, the data that are valuable for each batch of graph updates change dynamically over time. An offline value measurement designed for static graphs may detect valuable data inaccurately [53], yet performing exhaustive value computation on the fly is also expensive. Second, the differential memory architecture makes data addressing more complex: both the on-chip UltraRAM and the off-chip
DRAM can store edge data. We must accurately know which data is in which memory location, in a space-efficient manner, which is also difficult.

Figure 2: The basic workflow of (a) static graph processing and (b) dynamic graph processing.
In this paper, we propose an FPGA library, called GraSU, for
fast graph updates. As shown in Figure 1(b), GraSU consists of an
incremental value measurer and a value-aware differential memory
manager to fully exploit spatial similarity of dynamic graph up-
dates. Since the value of a vertex differs significantly across update batches for dynamic graphs, we propose to quantify the value incrementally, based on the update history, to capture important changes across batches so that detection accuracy improves dynamically. GraSU also overlaps incremental value measurement with normal graph computation, so its runtime overheads can be fully hidden. To strike a better tradeoff between space overhead and efficiency in differential memory addressing, we present a value-aware data management scheme that reduces unnecessary memory consumption by leveraging a bitmap and implements fast yet accurate data addressing via bitmap-assisted address resolution.
This paper has the following main contributions:
• We observe spatial similarity arising in dynamic graph up-
dates on real-world graphs for improving memory efficiency
of dynamic graph processing.
• We present an FPGA library, namely GraSU, which uses a
differential data management to exploit spatial similarity for
fast graph updates. GraSU can be easily integrated into any
FPGA-based static graph accelerator with only a few lines
of code modifications so as to handle dynamic graphs.
• Our implementation on a Xilinx Alveo™ U250 card outper-
forms Stinger [14] and Aspen [13] by an average of 34.24×
and 4.42× in terms of update throughput, and improves overall efficiency further by 9.80× and 3.07× on average.
The rest of this paper is organized as follows. §2 introduces the
background and motivation. The overview of GraSU is presented
in §3. §4 and §5 elaborate the value-aware differential data man-
agement. §6 discusses the results. The related work is surveyed in
§7. §8 concludes the paper.
2 BACKGROUND AND MOTIVATION
We first review preliminaries of dynamic graph processing. We
then identify memory inefficiency of existing FPGA-based graph
accelerators for dynamic graphs, finally motivating our approach.
2.1 Dynamic Graph Processing
Figure 2 depicts the basic workflows of static and dynamic graph
processing, respectively. Static graph processing (Figure 2(a)) often
works on topology-fixed graphs that can be organized in different
graph representations, e.g., compressed sparse row (CSR) [50], com-
pressed sparse column (CSC) [44], and coordinate list (COO) [55].
Dynamic graph processing can be understood as performing a se-
ries of graph computations upon different graph versions that are
successively generated by a sequence of graph update batches (as
Session 3: Machine Learning and Supporting Algorithms FPGA ’21, February 28–March 2, 2021, Virtual Event, USA
Figure 3: A simplified example for illustrating the PMA for-
mat and how it supports CSR format. (a) An example graph.
(b) CSR-based static graph format. (c) PMA-based dynamic
graph format. ‘𝑆’ is a sentinel entry indicating each vertex’s
edge range for maintaining the offset array. When a sentinel
is changed, the corresponding entry in the offset array will
be modified.
shown in Figure 2(b)). To increase the concurrency of graph updates,
earlier studies [6, 8, 13, 14, 16, 18, 21, 25, 29, 37, 38, 45, 46] present
various dynamic graph representations based on Packed Memory
Array (PMA) [38, 45], tree [6, 13], CSR variants [14, 16, 29, 37], ad-
jacency list (AL) [8, 25, 46], and hash table [18, 21]. Unlike the others, each of which applies to a specific graph format, PMA can flexibly adapt to different graph representations [38]. In this work, we architect GraSU on the PMA format so as to support a wide variety of FPGA-based static graph accelerators that may adopt different underlying graph representations.
Figure 3 depicts an example graph with the PMA format and
how it supports the traditional CSR format. As shown in Figure 3(c),
the PMA format maintains sorted edges in a partially-consecutive manner by leaving some gaps for possible edge updates; '𝑆' is a sentinel entry indicating each vertex's edge range, used to maintain the offset array. Logically, the PMA separates the whole edge array
into a series of leaf segments and keeps an implicit tree for locating
the position of edge updates quickly. When a leaf segment becomes
full or empty, all edges under its parent are redistributed to rebalance the edge array. If all gaps are exhausted, the root segment is doubled and the rebalancing is re-invoked.
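As a software sketch of this behavior, the following simplified Python PMA insert keeps gaps inside fixed-size segments and falls back to a global redistribution when the target segment is full (a real PMA instead rebalances only the smallest ancestor region whose density is within bounds; the segment size and density here are illustrative):

```python
GAP = None  # an empty slot left inside a segment for future insertions

def pma_insert(array, seg_size, key):
    """Insert 'key' into a gapped sorted array (a PMA leaf view).
    Simplification: when the target segment is full, the entire array
    is redistributed at ~50% density; a real PMA rebalances upward
    level by level and doubles the root when all gaps are exhausted."""
    for s in range(0, len(array), seg_size):
        seg = [x for x in array[s:s + seg_size] if x is not GAP]
        is_last = s + seg_size >= len(array)
        if not (is_last or not seg or key <= seg[-1]):
            continue                       # key belongs in a later segment
        if len(seg) < seg_size:            # a gap is free: rewrite locally
            seg = sorted(seg + [key])
            array[s:s + seg_size] = seg + [GAP] * (seg_size - len(seg))
            return array
        # Target segment is full: global redistribution with fresh gaps.
        keys = sorted([x for x in array if x is not GAP] + [key])
        grown = [GAP] * (2 * len(keys))
        for i, k in enumerate(keys):       # place one key per two slots
            grown[2 * i] = k
        return grown
    return array

arr = pma_insert([2, 5, GAP, GAP, 7, 9, GAP, GAP], 4, 6)
# -> [2, 5, None, None, 6, 7, 9, None]
```

The local rewrite is the common, cheap case; redistribution amortizes over many insertions, which is what makes PMA attractive for batched edge updates.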
2.2 Off-Chip Communication Overheads
As described in Figure 1(a), in an effort to reduce the random vertex
access overheads, existing FPGA-based graph accelerators often
store vertex data in the BRAMs while edges reside in the off-chip
memory. In this setting, many edges may be repeatedly loaded from
the off-chip memory by different update operations. In addition, these off-chip edge accesses depend closely on the sequence of edges to be updated and are therefore essentially random, incurring excessive off-chip communication overheads that further slow down the overall efficiency of graph updates.
To demonstrate this, we conduct a set of experiments to break
down the real computation time and off-chip communication time
of graph updates operating over AccuGraph [50]. Figure 4 shows
the results on five real-world dynamic graphs (i.e., sx-askubuntu
(AU), sx-superuser (SU), wiki-talk-temporal (WK), sx-stackoverflow
(SO), and soc-bitcoin (BC), more details as shown in Table 2) with a
varying number of edge updates. We see that communication over-
heads dominate the overall execution time of graph updates for all
dynamic graphs. In particular, communication overheads increase significantly as edge update sizes increase.

Figure 4: Execution time breakdown of graph updates operating upon AccuGraph by applying different proportions (20%/40%/80%/100%) of edge updates on five real-world dynamic graphs. All results are normalized to the case of applying 20% edge updates.

Figure 5: An example illustrating the spatial similarity of edge updates. The blue solid (dashed) lines indicate edge insertions (deletions).

For example, when 20%
of edge updates are applied for WK, communication overheads account for 65.80% of the overall execution time. As this proportion increases to 100%, communication overheads grow to 83.26%, exerting more pressure on the off-chip memory bandwidth. Overall, communication overheads account for most (85.62% on average) of the overall execution time of dynamic graph updates.
2.3 Differential Data Management
In this work, we find that not all off-chip communications arising in graph updates on real-world graphs contribute equally. The reason behind this is complex; the "rich get richer" nature [26] and the power-law degree feature1 of real-world graphs offer a partial explanation. The "rich get richer" nature indicates that a new edge update is more likely to be applied to a high-degree 'rich' vertex. As a result, a minority of high-degree vertices are involved in most graph updates. In a real online-shopping scenario, users are more inclined to buy (i.e., an edge update) best-selling products, which are often a minority among all products. In summary,
we have the following observation:
Observation: Graph updates on real-world graphs show significant
spatial similarity, indicating that most of the off-chip random
memory accesses root from requesting a few vertices’ edge data.
Figure 5 shows an example to help understand spatial similarity. The initial graph (Figure 5(c)) is modified by 10 edge update operations (Figure 5(a)). We can see that 8 out of the 10 edge updates are associated with only two vertices, 5 and 8. Figure 6 further investigates the percentage of edge updates involving different scales (1%~5%) of top vertices. All results are collected from an
1Most vertices in a graph have a few neighbors while a few have many ones [17].
Figure 6: Quantitative relationship between edge updates
and vertices for five real-world dynamic graphs
offline trace analysis. Basically, we see that most edge updates are operated on a few vertices. Consider SO: 58.28% of edge updates are operated upon the top-1% vertices. The growth in this percentage gradually slows and saturates as the vertex ratio reaches 5%. Overall, 71.26%~99.04% of edge updates are focused on accessing the top-5% vertices. Note that spatial similarity is not restricted to power-law graphs: for USA-road [35], which has a relatively even degree distribution, the 5% accident-prone vertices account for 66.74% of road congestion.
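The kind of offline trace analysis behind Figure 6 can be sketched in a few lines of Python (counting updates by source vertex; the toy trace below is invented for illustration):

```python
from collections import Counter

def update_concentration(edge_updates, top_fraction):
    """Fraction of edge updates that touch the most-updated
    'top_fraction' of active vertices in a <src, dst> trace."""
    per_vertex = Counter(src for src, _dst in edge_updates)
    k = max(1, int(len(per_vertex) * top_fraction))
    hot = sum(cnt for _v, cnt in per_vertex.most_common(k))
    return hot / len(edge_updates)

# Toy trace echoing Figure 5: vertices 5 and 8 absorb 8 of 10 updates.
trace = [(5, 0), (5, 1), (5, 8), (5, 6), (8, 2), (8, 7),
         (8, 4), (8, 9), (1, 3), (2, 0)]
frac = update_concentration(trace, 0.5)   # -> 0.8
```

Running this over a real <u, v, t> trace, batch by batch, reproduces the concentration curves plotted in Figure 6.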
Spatial similarity motivates us to classify the entire edge data
into high-value data (if it is requested by many edge updates) and
low-value data (otherwise). As shown in Figure 5(b), the colored
edges in the edge array are high-value data due to the association
with the frequently accessed vertices 5 and 8. This calls for a differential data management, in which high-value data reside in the on-chip memory while low-value data are stored in the off-chip memory. In this way, most of the random off-chip memory accesses arising in graph updates can be transformed into fast on-chip accesses.
However, realizing the differential data management on FPGA
remains challenging and needs to meet at least two requirements:
• Accuracy: We must accurately know the value of each ver-
tex so as to place its associated edges into certain memory
device. However, unlike static graphs, the data value of dynamic graphs varies over time, which makes measuring it accurately and efficiently difficult.
• Space-Efficiency: In the differential memory architecture,
both on-chip and off-chip memories have a copy of edge
data. This requires a new data-addressing mechanism for locating each piece of data, and ensuring the space efficiency of that addressing is also difficult.
To address these issues, we propose GraSU, which exploits spatial similarity effectively and efficiently.
3 GRASU OVERVIEW
In an effort to reduce excessive random off-chip communications
induced by redundant memory accesses arising in graph updates,
GraSU uses "value" to characterize the importance of data (i.e., high-accuracy value measurement) and treats data differentially (i.e., value-aware differential management), based on the key insight that not all off-chip accesses are created equal.
3.1 Architecture
Figure 7 shows the overall architecture of GraSU, which consists
of five components: dynamic graph storage, incremental value measurer, edge update dispatcher, update handling logic, and value-aware memory manager.

Figure 7: The GraSU architecture
Dynamic Graph Organization. As discussed in §2.1, we fol-
low the PMA representation [38, 45]. In GraSU, both on-chip and
off-chip memories can store edge data. A segment could then contain edges with different value levels, which would make data organization extremely difficult. To avoid this, we enforce that each segment contains edges from only one vertex (Figure 7(c)). In addition, in the traditional PMA format the edge array space is doubled when it becomes full, but FPGAs currently do not support dynamic memory allocation effectively [49]. To achieve this functionality, we physically pre-allocate the off-chip memory into many spaces and logically use the segment space for doubling when necessary.
Incremental Value Measurer (IVM). The IVM module is re-
sponsible for quantifying the value of each vertex for graph updates,
and further notifying the value-aware memory manager (VMM) to
dispatch edges of high-value vertices into the on-chip UltraRAM
(❶). Since data values change dynamically, IVM adopts an incremental value measurement based on the graph update history to constantly improve measurement accuracy. IVM is invoked every time a batch of graph updates is completed (❻). Note that value
measurement overheads can be fully hidden behind normal graph
computations. More details are discussed in §4.
Edge Update Dispatcher (EUD). When high-value data reside
in the on-chip UltraRAM, EUD gets started (❷). It reads a batch
of edge updates from the off-chip memory and dispatches them to the appropriate graph update PEs in the timestamp order of each edge update (❸).
Update Handling Logic (UHL). The UHL module makes sure
that each edge update can be correctly inserted into or deleted from
the edge array. UHL is equipped with a three-stage pipeline: edge
read, edge update, and edge write (Figure 7(b)). The read stage loads
the requested data of the to-be-updated edge by sending a request
to the VMM (❹), discussed below. Afterwards, the update stage
performs the insertion or deletion operations. Finally, the write
stage writes back the updated edge data into the off-chip memory
(or the UltraRAM) through VMM.
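In software terms, the three-stage UHL pipeline behaves like the following sketch (the `ToyVMM` class is a hypothetical in-memory stand-in for the VMM, not its real interface):

```python
def handle_edge_update(vmm, op, src, dst):
    """Software model of the UHL three-stage pipeline."""
    # Stage 1 (edge read): ask the VMM for the target segment.
    memory, segment = vmm.read_segment(src, dst)
    # Stage 2 (edge update): insert or delete within the segment.
    if op == "insert":
        segment.append(dst)
        segment.sort()
    else:
        segment.remove(dst)
    # Stage 3 (edge write): write the segment back through the VMM,
    # which routes it to the UltraRAM or the off-chip memory.
    vmm.write_segment(src, memory, segment)

class ToyVMM:
    """Hypothetical in-memory stand-in for the VMM."""
    def __init__(self, segments):
        self.segments = segments            # vertex -> sorted neighbours
    def read_segment(self, src, dst):
        return "offchip", self.segments[src]
    def write_segment(self, src, memory, segment):
        self.segments[src] = segment

vmm = ToyVMM({0: [1, 2], 1: [0, 2]})
handle_edge_update(vmm, "insert", 0, 3)     # 0's segment becomes [1, 2, 3]
```

In hardware the three stages run as a pipeline across consecutive updates, so stage 2 of one update overlaps stage 1 of the next.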
Table 1: Programming interfaces of GraSU
GraSU_alloc_UltraRAM: Allocate UltraRAM
GraSU_DGS: Transform a static graph into the PMA format
GraSU_Init_Value: Initialize the value of each vertex
GraSU_LHD: Load high-value data into the UltraRAM
GraSU_Update_Start: Handle edge updates
GraSU_WHD: Write high-value data back into off-chip RAM
GraSU_Quantify_Value: Quantify the value of each vertex
GraSU_Overlap: Overlap value measurement with graph computation
 1  GraSU_alloc_UltraRAM( );
 2  GraSU_DGS( );
 3  GraSU_Init_Value( );
 4  DynamicGraphProcessing( ){
 5    for( each update batch ){
 6      GraSU_LHD( update_batch_valid, LHD_valid );
 7      GraSU_Update_Start( LHD_valid, GUS_valid );
 8      GraSU_WHD( GUS_valid, WHD_valid );
 9      /* Overlap graph computation with value measurement */
10      GraSU_Overlap( WHD_valid ){
11        /* Notify AccuGraph to start graph computation */
12        AccuGraph( WHD_valid, computation_valid );
13        GraSU_Quantify_Value( WHD_valid, quantify_valid );
14      }
15      /* The signal indicates whether graph computation and
          value measurement are completed */
16      update_batch_valid = computation_valid & quantify_valid;
17    }
18  }
Figure 8: A uniform programming framework for illustrat-
ing how to integrate GraSU into an existing static graph ac-
celerator AccuGraph [50] for handling dynamic graphs.
Value-Aware Memory Manager (VMM). The VMM module
aims to locate the requested edge data in an accurate and efficient
manner. To make a good tradeoff between memory space over-
heads and data addressing efficiency in the differential memory
architecture, VMM adopts a bitmap-indexed structure to minimize
space consumption and uses a bitmap-assisted addressing resolu-
tion mechanism to enable fast yet accurate differential data accesses.
When receiving a read request (❹), which consists of the source
and destination vertex indices of a to-be-updated edge, VMM will
capture the edge array address of the source vertex as an initial
on-chip (or off-chip) address. Afterward, VMM locates the target
segment in the edge data and also loads the segment (as well as
its address) to the update-relevant registers coupled with UHL (❺).
When VMM receives a write request from UHL, the data residing
in the UHL-attached buffer will be written back according to the
address of the segment. More details are described in §5.
3.2 Programming Interfaces
Table 1 shows the programming interfaces of GraSU for graph up-
dates. Using GraSU, we do not need to modify the upper-level graph algorithm program. Figure 8 shows an example of how GraSU is
integrated into an existing state-of-the-art graph accelerator Accu-
Graph [50] effectively. Two parameters of each interface indicate an
input and an output signal, respectively. For each update batch, we
use GraSU_LHD, GraSU_Update_Start, and GraSU_WHD to complete
Figure 9: Accuracy of the top 5% vertices identified by three
different schemes over update batches for WK. Ideal results
are obtained through an offline trace analysis.
a graph update operation. After the graph update is finished, the out-
put signal WHD_valid of GraSU_WHD will be valid to simultaneously
activate graph computation (i.e., AccuGraph) and value measure-
ment of the next graph update (i.e., GraSU_Quantify_Value). The
parameter update_batch_valid is used to ensure the semantic
correctness between different update batches.
More generally, all code in Figure 8 except Line 12 represents a uniform programming template for performing graph updates. The whole code of an existing graph accelerator is treated as a module for graph computation. For integration, users only need to instantiate the accelerator module (Line 12) inside the top module, and then connect the accelerator's input/output signals with GraSU's other modules to coordinate when it is launched and finished. Thus, GraSU can be easily integrated into existing static graph accelerators to handle dynamic graphs with only 11 lines of code modifications. GraSU is implemented in Verilog. Integrating GraSU with an HLS-based accelerator is also viable by converting the HLS design into a Verilog program.
4 VALUE MEASUREMENT
We first describe how to accurately quantify the value of a vertex
to distinguish high-value data. We then discuss how the overheads
of value measurement can be hidden in an overlapping manner.
4.1 Quantifying the Value of a Vertex
According to the “rich get richer” conjecture, an intuitive way to quantify the value of a vertex is to use its degree. This approach is useful, but its accuracy is still far from the ideal situation. Figure 9 shows the accuracy of the top-5% vertices identified in the ideal case and by the degree-based approach for the real-world dynamic graph WK. Accuracy is the ratio of the top-5% vertices' edge updates to the total edge updates; the ideal case means that all the top-5% vertices' edge updates are precisely identified. We see that the accuracy gap widens as the update batches proceed. In particular, for the 9th update batch, the top-5% vertices identified by the degree-based approach are involved in only 67.33% of edge updates, while the accuracy in the ideal case is as high as 99.04%.
The reasons are twofold. First, some low-degree vertices in a
basic graph can be inserted with many edges and thus they may
gradually become high-degree ones as update batches go. This is
common in real scenarios. For example, an obscure actor may gain
many fans to become a superstar when his film gets successful.
Figure 10: The overlapping opportunity between value mea-
surement and normal graph computation
Second, the edges of some high-degree vertices may have a slower
growth than others. For example, when a “superspreader” is isolated
with the chains of virus transmission cut off, it will become normal
soon. Both phenomena indicate that the value of a vertex depends not only on its degree but also on its historical update frequency. Thus, we propose to quantify the value of a vertex as follows.
$$Value_i(v) = \begin{cases} Deg(v) \times F_{i-1}(v), & 0 < i \le N \\ Deg(v), & i = 0 \end{cases} \qquad (1)$$
where $N$ is the number of update batches and $Deg(v)$ is the number of out-going edges of vertex $v$. $F_{i-1}(v)$ represents the number of edges updated to vertex $v$ (also called the update frequency of $v$) in the $(i-1)$-th update batch. The value of $v$ is initialized to $Deg(v)$ before the 0-th update batch starts (i.e., $Value_0(v)$). $Value_i(v)$ denotes the value of $v$ after applying the $(i-1)$-th update batch.
In Equation (1), we see that the value is incrementally quantified
and dynamically adjusted to gradually improve the prediction accu-
racy of high-value edge data. Figure 9 shows the superiority of our
incremental approach. Compared to the degree-based approach, we
can capture 88.15% updates for the top 5% vertices identified.
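Equation (1) is straightforward to express in code; a Python sketch (the degree and frequency numbers below are toy values):

```python
def quantify_values(degree, freq_prev, batch_index):
    """Equation (1): before the 0-th batch a vertex's value is its
    degree; afterwards it is the degree times the vertex's update
    frequency in the previous batch."""
    if batch_index == 0:
        return dict(degree)
    return {v: degree[v] * freq_prev.get(v, 0) for v in degree}

deg = {0: 1, 5: 6, 8: 2}            # out-degrees (toy values)
f0  = {0: 0, 5: 4, 8: 4}            # update frequencies in batch 0
vals = quantify_values(deg, f0, 1)  # values used for batch 1
# Vertex 8, low-degree but heavily updated, now outranks vertex 0.
```

Note how a heavily updated low-degree vertex rises in the ranking, which is exactly the correction the degree-only scheme misses.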
4.2 Overlapping Value Measurement and
Graph Computation
As in Equation (1), measuring the value of a vertex needs to obtain
its degree and update frequency, both of which are dynamically
changed during graph updates. Thus, we have to compute them
on the fly, which can introduce potential runtime overheads. For-
tunately, the interleaving between graph update and graph com-
putation for each update batch (as shown in Figure 2(b)) offers the
potential opportunity to hide the overheads of value measurement.
Figure 10 illustrates an overlapping diagram. When the 𝑖-th
graph update is completed, the edge data resident in the on-chip
memory will be written back to the off-chip memory. Afterwards,
graph computation engine starts working. In the meantime, the
Incremental Value Measurer starts quantifying the vertex value for
the (𝑖 + 1)-th graph update. Fortunately, the time spent on value
measurement is often less than that spent on graph computation.
This is because value measurement needs to compute upon a vertex
only once while graph computation has an iterative process over
vertex [53]. Thus, the overheads of value measurement can be
usually hidden fully under the execution time of graph computation.
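In software, this schedule amounts to running the single-pass value measurement concurrently with the iterative computation phase; a minimal threading sketch of the overlap (purely illustrative of the timing, not of the RTL):

```python
import threading

results = {}

def graph_computation(batch):
    # Stands in for the accelerator's iterative computation phase.
    results["computation"] = f"computed graph version {batch}"

def value_measurement(batch):
    # A single pass over the vertices; typically cheaper than the
    # iterative computation, so it finishes within its shadow.
    results["values"] = f"values ready for batch {batch + 1}"

# After update batch 0 completes, launch both phases together.
t1 = threading.Thread(target=graph_computation, args=(0,))
t2 = threading.Thread(target=value_measurement, args=(0,))
t1.start(); t2.start()
t1.join(); t2.join()   # both done: update batch 1 may begin
```

The join on both "phases" corresponds to the `update_batch_valid = computation_valid & quantify_valid` signal in Figure 8.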
5 VALUE-AWARE MEMORY MANAGEMENT
This section elaborates how to make full use of the quantified vertex values to identify high-value edge data accurately and to further achieve a differential data management that maximizes value benefits.
5.1 High-Value Data Identification
Based on the quantified vertex value, the next step is naturally to
identify the high-value data that should be stored on-chip. This
raises two questions: (1) which data is high value for the on-chip
storage? and (2) which on-chip memory (BRAM or UltraRAM) is
expected for caching high-value data?
High-Value Data Computation. GraSU tries to store as much
high-value data as possible on-chip. Since the value and the edge count of each vertex change over time, and the on-chip memory capacity differs across FPGAs, we must dynamically compute the high-value data according to the on-chip memory capacity, the value of each vertex, and the edge size of each vertex. We can therefore compute high-value data as follows:
$$\tau = \arg\max_{k} \left( \sum_{i=0}^{k} Size(EdgeOf(v_i)) \right), \text{ where } k \in [0, |V|),\ v_i \in VSet,\ \sum_{i=0}^{k} Size(EdgeOf(v_i)) \le |OnchipMem| \qquad (2)$$
where $EdgeOf(v_i)$ denotes the edges of $v_i$, and $Size(S)$ represents the total memory size of a set $S$. $|V|$ is the number of vertices. $VSet$ is the set of vertices sorted by value from largest to smallest in the VMM module. $\sum_{i=0}^{k} Size(EdgeOf(v_i))$ is the total edge size of $\{v_0, \cdots, v_k\}$, and $|OnchipMem|$ denotes the on-chip memory capacity. Equation (2) finds the largest $\tau$ such that the total edge size of $\{v_0, \cdots, v_\tau\}$ is nearly equal to (but does not exceed) the on-chip memory capacity. After $\tau$ is computed, we can then easily obtain the set of high-value data (denoted $S_{HVD}$) as follows:
$$S_{HVD} = \{EdgeOf(v_i) \mid i \in [0, \tau],\ v_i \in VSet\} \qquad (3)$$
Equation (3) designates the edge data of $\{v_0, \cdots, v_\tau\}$ as the high-value data.
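Since VSet is already sorted by value, Equations (2) and (3) reduce to a greedy prefix scan; a Python sketch (the sizes and budget below are invented for illustration):

```python
def select_high_value(vset_sorted, edge_bytes, onchip_capacity):
    """Equations (2)-(3): walk vertices in descending value order and
    take their edge arrays until the on-chip budget is exhausted."""
    chosen, used = [], 0
    for v in vset_sorted:                  # VSet: sorted by value, desc.
        if used + edge_bytes[v] > onchip_capacity:
            break                           # tau = index of last chosen vertex
        chosen.append(v)
        used += edge_bytes[v]
    return chosen                           # the vertices whose edges form S_HVD

vset  = [5, 8, 1, 2, 0]                     # toy value ranking
sizes = {5: 40, 8: 32, 1: 16, 2: 8, 0: 8}   # edge-array sizes in bytes
hvd   = select_high_value(vset, sizes, 80)  # 40 + 32 fits; adding 16 would not
```

The scan is linear in the number of vertices, so it fits comfortably inside the value-measurement window that is overlapped with graph computation.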
UltraRAM vs. BRAM. We select UltraRAM to store high-value
data for the following reasons. First, a graph's edge data is often much larger than its vertex data, so the coarse-grained blocks of UltraRAM are better suited to edges, while BRAM is typically used for vertex data (as adopted in existing graph accelerators [10, 11, 39, 50]). Second, UltraRAM offers a larger memory size (e.g., 1280 × 288 Kb on a Xilinx U250) than BRAM (2000 × 36 Kb), allowing more edges to be stored on-chip.
5.2 Value-Aware Memory Access
So far, we have the following settings of GraSU for dynamic graph
processing. The vertex data is stored in the on-chip BRAM. The
high-value edge data reside in the on-chip UltraRAM while the low-value data are in the off-chip memory. In this differential memory architecture, both the UltraRAM and the off-chip memory hold edge data. Thus, when a vertex is being processed, we must know in which memory, and where within it, the corresponding edge data are located. This makes memory addressing complex. A naive approach is to use another offset array to record the on-chip edge data, but it incurs extra space overhead, which can exceed $N \times 4$ bytes, where $N$ is the number of vertices.
Space-Efficient Memory Addressing. We present a simple yet effective bitmap-based method that yields a good tradeoff between space overhead and memory addressing efficiency.

Session 3: Machine Learning and Supporting Algorithms FPGA ’21, February 28–March 2, 2021, Virtual Event, USA

Figure 11: An example of value-aware data access management, where both UltraRAM and the off-chip memory need to maintain a piece of edge data.

Each vertex in
a bitmap occupies only 1 bit. The bitmap is stored in the on-chip
BRAMs with 1 (0) indicating that the edge data of a corresponding
vertex is stored on-chip (off-chip). Based on the bitmap, we can
access the high-value data easily. When all the edge data of a vertex
is loaded into UltraRAM, its bit in the bitmap is set to 1 and the
corresponding entry in the offset array is set to the new offset in
the UltraRAM. Also, the current end address of this vertex in the
UltraRAM is recorded. Figure 11 shows an example of a graph with
5 vertices. The edge data of 𝑉 1 and 𝑉 3 are high-value data that
need to be loaded into the UltraRAM. Their new offsets (i.e., 0 and
7) in the UltraRAM will be written in the corresponding entries
of the offset array. The entries of 𝑉 1 and 𝑉 3 in the bitmap are
also set to 1. The starting address of the UltraRAM is a constant
and the current end address of V3 is kept in a current_end_addr
register. Note that the vertex bitmap can still be partitioned on a multi-FPGA platform for scalability, and the overhead of bitmap construction can be amortized across multiple executions of different graph algorithms on the same graph.
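As a toy software model of this scheme (hypothetical Python class; the actual design is an on-chip circuit, and V3's degree below is an assumed value not stated in the paper), the bitmap, offset array, and current_end_addr register interact as follows:

```python
class ValueAwareAddressing:
    """Toy model of GraSU's bitmap-based addressing. bitmap[v] == 1 means
    vertex v's edges are cached on-chip and offset[v] is an UltraRAM
    offset; bitmap[v] == 0 means offset[v] is an off-chip offset."""

    def __init__(self, offsets):
        self.offset = list(offsets)   # one entry per vertex
        self.bitmap = [0] * len(offsets)
        self.current_end_addr = 0     # models the current_end_addr register

    def load_on_chip(self, v, degree):
        """Cache vertex v's edge data on-chip: flip its bit and overwrite
        its offset-array entry with the new UltraRAM offset."""
        self.bitmap[v] = 1
        self.offset[v] = self.current_end_addr
        self.current_end_addr += degree

    def locate(self, v):
        """Return which memory holds v's edges and the offset within it."""
        return ("onchip" if self.bitmap[v] else "offchip"), self.offset[v]

# Replaying Figure 11: five vertices with original off-chip offsets, then
# V1 (7 edges) and V3 (assumed 3 edges) are cached at UltraRAM offsets 0 and 7.
mem = ValueAwareAddressing([0, 2, 9, 11, 14])
mem.load_on_chip(1, 7)
mem.load_on_chip(3, 3)
print(mem.locate(1), mem.locate(2))
```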
High-Value Data WriteBack. As described above, the addresses
of high-value vertices in the offset array are overwritten by the
on-chip UltraRAM offsets. When these high-value data are written
back into the off-chip memory for data consistency, we need to com-
pute their original offsets in the offset array. Specifically, starting
from a given vertex 𝑣, we scan the bitmap and find the first vertex
marked with ‘0’ (denoted as 𝑣0) and the first vertex marked with ‘1’
(denoted as 𝑣1) in the bitmap. Then, the number of out-going edges
of 𝑣 can be obtained by computing the offset difference between
𝑣1 and 𝑣. Finally, the original off-chip memory offset of 𝑣 can be
computed as follows:
offset(v) = offset(v0) − Deg(v) (4)
where offset(v) denotes the off-chip memory offset of the vertex v. Figure 11 shows an example of how to calculate the original off-chip memory offset of the vertex V1. First, we scan the bitmap and find v0 = V2 and v1 = V3. Then, we compute Deg(V1) = 7 − 0 = 7. Thus, the original offset of V1 is offset(V1) = offset(V2) − Deg(V1) = 9 − 7 = 2.
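Equation (4) and the bitmap scan can be sketched as follows (Python; the helper name is hypothetical, and the data replays Figure 11's example):

```python
def original_offset(v, bitmap, offset):
    """Recover the original off-chip offset of a cached vertex v
    (Equation 4): scan forward for the first vertex marked '0' (v0)
    and the first marked '1' (v1); Deg(v) is the UltraRAM offset
    difference offset[v1] - offset[v], and the original offset is
    offset[v0] - Deg(v)."""
    v0 = next(u for u in range(v + 1, len(bitmap)) if bitmap[u] == 0)
    v1 = next(u for u in range(v + 1, len(bitmap)) if bitmap[u] == 1)
    deg = offset[v1] - offset[v]
    return offset[v0] - deg

# Figure 11's walk-through for V1: v0 = V2, v1 = V3,
# Deg(V1) = 7 - 0 = 7, original offset = 9 - 7 = 2.
bitmap = [0, 1, 0, 1, 0]
offset = [0, 0, 9, 7, 14]
print(original_offset(1, bitmap, offset))  # -> 2
```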
Handling Read Requests. A read request to a vertex 𝑣 needs
to load all the edges of 𝑣 into the UHL. To achieve this goal, we
need to know three pieces of information. First, we have to know
where the edge data of v is stored, which is easily obtained from the bitmap: 1 (0) indicates on-chip (off-chip) storage. Second, we compute the initial address of the edge data of
v (denoted as address(v)), which can be obtained by adding the starting memory address and the corresponding offset of v in the offset array.

Table 2: Real-world dynamic graph datasets
Datasets # Vertices # Edges # BEdges Avg. Degree
sx-askubuntu (AU) [28] 0.16M 0.96M 0.59M 6.05
sx-superuser (SU) [28] 0.19M 1.44M 0.92M 7.44
wiki-talk-temporal (WK) [28] 1.14M 7.83M 3.31M 6.87
sx-stackoverflow (SO) [28] 2.60M 63.50M 36.23M 24.40
soc-bitcoin (BC) [35] 24.58M 122.95M 60.49M 5.00

Third, we need to obtain the number of out-going
edges of v (i.e., Deg(v)). Two cases should be considered. If the edge data is on-chip, Deg(v) can be obtained by computing the offset difference between v1 (defined above) and v. If it is off-chip, we find v's first subsequent vertex (denoted as vs) and compute offset(vs) using Equation (4) (if vs is on-chip). Deg(v) is then the offset difference between vs and v. Finally, we return the edge data of v (denoted as data(v)) to the UHL as a tuple ⟨v, address(v), data(v)⟩.
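The three steps can be put together in a small software model (Python; the sentinel offset entry, memory contents, and end_addr argument are modeling assumptions, not the paper's hardware interface):

```python
def handle_read(v, bitmap, offset, uram, dram, end_addr):
    """Toy read-request handler: (1) the bitmap picks the memory,
    (2) the offset array gives address(v), (3) Deg(v) comes from
    offset differences, using Equation (4) to recover off-chip
    offsets of cached vertices. `offset` carries one sentinel entry
    past the last vertex; `end_addr` models current_end_addr."""
    n = len(bitmap)

    def uram_end(u):
        # End of u's edges in UltraRAM: the next cached vertex's offset,
        # or the current end address if u is the last cached vertex.
        v1 = next((w for w in range(u + 1, n) if bitmap[w] == 1), None)
        return offset[v1] if v1 is not None else end_addr

    def dram_offset(u):
        # Off-chip offset of u, via Equation (4) when u is cached.
        if u == n or bitmap[u] == 0:
            return offset[u]
        v0 = next(w for w in range(u + 1, n + 1)
                  if w == n or bitmap[w] == 0)
        return dram_offset(v0) - (uram_end(u) - offset[u])

    addr = offset[v]
    if bitmap[v]:                                   # edges are on-chip
        return v, ("uram", addr), uram[addr:uram_end(v)]
    deg = dram_offset(v + 1) - addr                 # edges are off-chip
    return v, ("dram", addr), dram[addr:addr + deg]

# Figure 11's layout: V1 and V3 cached; the offset array carries a
# sentinel entry (17) past the last vertex.
bitmap = [0, 1, 0, 1, 0]
offset = [0, 0, 9, 7, 14, 17]
uram = list(range(100, 110))     # 10 cached edge slots (placeholder data)
dram = list(range(17))           # 17 off-chip edge slots (placeholder data)
print(handle_read(0, bitmap, offset, uram, dram, end_addr=10))
print(handle_read(1, bitmap, offset, uram, dram, end_addr=10))
```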
Handling Write Requests. When a write request with a tuple ⟨v, address(v), data(v)⟩ arrives, the updated edge data of v in the UHL's register is written back to the UltraRAM (or off-chip memory). As with reads, we access the bitmap to identify whether the vertex's edge data resides in the UltraRAM or off-chip memory, with 1 (0) indicating that data(v) is written to the UltraRAM (off-chip memory) at address(v).
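The write path is symmetric; a minimal sketch (Python, with hypothetical in-memory lists standing in for UltraRAM and DRAM):

```python
def handle_write(v, address, data, bitmap, uram, dram):
    """Toy write-request handler: the bitmap selects the target memory,
    then the updated edge data arriving from the UHL's register is
    written back at address(v)."""
    target = uram if bitmap[v] else dram
    target[address:address + len(data)] = data

# Updating two of V1's cached edges in place (V1 is on-chip at offset 0).
bitmap = [0, 1, 0, 1, 0]
uram, dram = [0] * 10, [0] * 17
handle_write(1, 0, ["e0", "e1"], bitmap, uram, dram)
print(uram[:2])  # -> ['e0', 'e1']
```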
6 EXPERIMENTAL EVALUATION
This section evaluates the efficiency of GraSU and the effectiveness of its integration into existing static graph accelerators.
6.1 Experimental Setup
GraSU Settings. We implement GraSU upon a Xilinx Alveo™ U250
accelerator card, which is equipped with an XCU250 FPGA chip
and four 16GB DDR4 (Micron MTA18ASF2G72PZ-2G3B1). The
target FPGA chip provides 11.81MB on-chip BRAMs, 45MB on-chip
UltraRAMs, 1.68M LUTs, and 3.37M registers.
In our evaluation, GraSU is configured with 32 graph update PEs. Each segment in the PMA representation has a length of 8, and we use 4 bytes to represent a vertex; thus, the update-relevant register buffer attached to each PE is 32 bytes. To evaluate efficiency, we integrate GraSU into a state-of-the-art static graph accelerator, AccuGraph [50], to perform dynamic graph processing, referred to as AccuGraph-D. To demonstrate the usability of GraSU, we also integrate it into three other state-of-the-art FPGA-based graph accelerators: FabGraph [39], WaveScheduler [44], and ForeGraph [11].
Graph Datasets and Applications. As shown in Table 2, we
use five real-world dynamic graphs publicly available from the Stan-
ford Large Network Dataset Collection [28] and Network Reposi-
tory [35]. Every edge in a dynamic graph has a timestamp that indicates when it should appear. For example, a directed edge ⟨u, v, t⟩ in WK indicates that Wikipedia user u edited the talk page of user v at time t. “BEdges” in Table 2 denotes the edge set of the initial base graph before graph updates start; the number of edges to be updated is therefore “#Edges − #BEdges”. In our evaluation, all graphs are considered directed, and their edge updates are divided into 10 batches by default.
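One plausible reading of this batching (equal-width timestamp ranges; the exact policy is not spelled out here, so treat this Python sketch as an assumption) is:

```python
def make_batches(updates, num_batches=10):
    """Split timestamped edge updates <u, v, t> into `num_batches`
    batches by dividing the overall timestamp range into equal-width
    intervals (an assumed policy; edges keep their arrival order)."""
    ts = [t for _, _, t in updates]
    lo, hi = min(ts), max(ts)
    width = (hi - lo) / num_batches
    batches = [[] for _ in range(num_batches)]
    for u, v, t in updates:
        # Clamp the last timestamp into the final batch.
        i = min(int((t - lo) / width), num_batches - 1) if width else 0
        batches[i].append((u, v, t))
    return batches

# Ten updates over timestamps 0..9 split into 2 batches of 5 each.
updates = [(0, i, i) for i in range(10)]
print([len(b) for b in make_batches(updates, 2)])  # -> [5, 5]
```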
We evaluate three representative graph applications: Breadth
First Search (BFS), PageRank (PR), and Weakly Connected Compo-
nents (WCC). In a dynamic graph processing scenario, every time
Table 3: Resource utilization and clock rates
BFS PR WCC
LUT 12.28% 14.19% 12.89%
Register 5.64% 9.96% 6.73%
BRAM 72.77% 82.18% 82.18%
UltraRAM 62.50% 62.50% 62.50%
Maximal clock rate 246MHz 211MHz 245MHz
Table 4: Update time (in seconds) and update throughput (in million edges per second) of graph updates for Stinger, Aspen, and AccuGraph-D. Each cell reads as update time/update throughput. The highlighted columns (i.e., ×Stinger and ×Aspen) represent speedup results achieved by AccuGraph-D against Stinger and Aspen, respectively.
Graph Stinger Aspen AccuGraph-D ×Stinger ×Aspen
AU 0.031/4.18 0.004/32.43 0.00068/190.77 45.58× 5.88×
SU 0.053/3.47 0.006/30.64 0.00125/147.08 42.40× 4.80×
WK 0.647/4.31 0.071/39.31 0.01375/202.97 47.05× 5.16×
SO 3.614/3.23 0.452/25.81 0.209/55.83 17.29× 2.16×
BC 6.752/9.25 1.473/42.40 0.358/174.46 18.86× 4.11×
an update batch is finished, we perform a graph algorithm on the newly updated graph. Note that we run 10 iterations for PageRank, and run BFS and WCC until convergence.
Baselines. We compare AccuGraph-D with two state-of-the-
art CPU-based dynamic graph systems, Aspen [13] and Stinger
with its latest version 15.10 [14]. Both run on a high-end server
configured with 2×14-core Intel Xeon E5-2680 v4 CPUs operating
at 2.40 GHz, 256GB of memory, and a 2TB HDD. We measure the update throughput of graph updates as the number of edges successfully updated per second. The overall efficiency of dynamic graph processing is measured as the sum of update time and graph computation time.
Resource Utilization. Table 3 shows the resource utilization
and clock rate of AccuGraph-D. All results are obtained via Xilinx
SDAccel 2019.2. To preserve correctness, we conservatively set the clock rate to 200MHz in our experiments.
6.2 Graph Update Efficiency
Table 4 shows the update time (in seconds) and update throughput
(in million edges per second) of graph updates for Stinger [14],
Aspen [13], and AccuGraph-D, respectively.
AccuGraph-D vs. Stinger: Stinger maintains dynamic graph data with a hybrid of adjacency list and CSR. It divides the edges of each vertex into multiple blocks that preserve some gaps and are chained together like an adjacency list; within each block, edges are stored in a CSR representation. When an edge update arrives, Stinger traverses the block list and inserts the to-be-updated edge into an empty position (or deletes it from the corresponding position).
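To make the traversal cost concrete, here is a toy model of the block chain described above (Python; the block length and names are hypothetical):

```python
class StingerVertexBlocks:
    """Toy model of Stinger's per-vertex edge storage as described
    here: edge blocks with gaps, chained like an adjacency list.
    Every update walks the chain, so a high-degree vertex with a long
    chain pays a pointer-chasing (cache-unfriendly) cost."""
    BLOCK_LEN = 4  # hypothetical block size

    def __init__(self):
        self.blocks = [[None] * self.BLOCK_LEN]
        self.hops = 0  # blocks visited so far: a proxy for traversal cost

    def insert(self, edge):
        for block in self.blocks:          # walk the chain for a gap
            self.hops += 1
            for i, slot in enumerate(block):
                if slot is None:
                    block[i] = edge
                    return
        # No gap anywhere: grow the chain with a fresh block.
        self.blocks.append([edge] + [None] * (self.BLOCK_LEN - 1))

    def delete(self, edge):
        for block in self.blocks:          # walk the chain for the edge
            self.hops += 1
            for i, slot in enumerate(block):
                if slot == edge:
                    block[i] = None        # leave a gap for future inserts
                    return True
        return False

v = StingerVertexBlocks()
for e in range(9):          # a "high-degree" vertex grows a block chain
    v.insert(e)
print(len(v.blocks))        # -> 3
```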
Overall, AccuGraph-D performs graph updates with through-
puts of 55.83~202.97M edges/second while Stinger achieves only
3.23~9.25M edges/second. We can therefore see that AccuGraph-D
outperforms Stinger by 17.29×~47.05× (34.24× on average) in terms
of update time, for two reasons. First, traversing the
block list under Stinger incurs a low LLC hit ratio (often less than
20%) [1]. This is particularly serious for handling edge updates on
high-degree vertices, because a large number of edge updates are
applied on only a few high-degree vertices in real-world dynamic
graphs, thus significantly worsening update efficiency.

Figure 12: The total running time of AccuGraph-D against Stinger and Aspen. Each bar represents a system or an accelerator, where patterned parts indicate graph computation time while unpatterned ones show graph update time.

In contrast, AccuGraph-D associates the high-value edge data with a large fraction of edge updates and caches it in the on-chip memory. Thus,
excessive off-chip memory accesses are avoided. Second, when two
edges are simultaneously updated on the same vertex, Stinger uses a locking mechanism to ensure correctness, further limiting the parallelism of edge updates. AccuGraph-D adopts a PMA-based format variant, which chunks the edge array into many fine-grained segments and ensures lock-free segment updates [38].
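A minimal sketch of the idea (Python; segment length 8 per the setup in Section 6.1, everything else hypothetical) shows why an update stays local to one segment:

```python
class PMASegments:
    """Sketch of a PMA-style edge array: fixed-length segments keep
    gaps, so an edge insertion or deletion touches exactly one
    segment -- which is what allows different segments to be updated
    in parallel without locks."""
    SEG_LEN = 8  # segment length used in GraSU's evaluation setup

    def __init__(self, num_segments):
        self.segs = [[None] * self.SEG_LEN for _ in range(num_segments)]

    def insert(self, seg_id, edge):
        seg = self.segs[seg_id]
        for i, slot in enumerate(seg):
            if slot is None:
                seg[i] = edge      # fill a gap; other segments untouched
                return True
        return False               # full: a real PMA would rebalance here

    def delete(self, seg_id, edge):
        seg = self.segs[seg_id]
        for i, slot in enumerate(seg):
            if slot == edge:
                seg[i] = None      # leave a gap for future insertions
                return True
        return False

pma = PMASegments(num_segments=2)
pma.insert(0, (1, 2))
pma.insert(1, (3, 4))         # lands in a different segment: no contention
print(pma.delete(0, (1, 2)))  # -> True
```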
AccuGraph-D vs. Aspen: Aspen is built on a purely-functional search tree, which greatly facilitates the search for the target position of a to-be-updated edge. However, Aspen
needs to repeatedly load the same edge data into on-chip caches
at different times due to the extremely irregular memory access
features of graph update and the limitation of typical replacement
policies [3, 9, 23] on traditional architectures, while GraSU retains
high-value edge data on-chip during graph update. Thus, GraSU
does not need to frequently transfer edge data between on-chip and
off-chip memory, avoiding redundant communication overheads.
Compared with Aspen, AccuGraph-D can therefore improve the
update efficiency by 2.16×~5.88× (4.42× on average).
In particular, SO exhibits the smallest improvement (only 2.16×). The reason is simple: SO has the highest average degree (24.40, Table 2), so the average edge data size per vertex in SO is larger than in the other graphs. Given the limited, fixed-size UltraRAM, a larger average degree generally means that fewer vertices' edge data can be stored in the on-chip memory. As a result, relatively few spatial-similarity opportunities can be exploited by GraSU.
6.3 Overall Efficiency
We also evaluate the total running time (i.e., update time plus graph
computation time) of AccuGraph-D against Stinger and Aspen.
Stinger and Aspen are in-memory systems, so disk-loading time is excluded. To make an apples-to-apples comparison, we also exclude the CPU-FPGA transfer time.
Figure 12 shows the overall performance results, where each
bar consists of two parts for a graph system. The patterned part
represents graph computation time while the unpatterned part
indicates graph update time. Overall, compared with Stinger and
Aspen, AccuGraph-D has the fastest execution time for all three
graph algorithms, with average speedups of 9.80× and 3.07×, respectively. This is because not only is graph update efficiency improved, but graph computation is also accelerated by the FPGA.
Figure 13: Update efficiency of GraSU with and without value-aware memory management (VMM), incremental value measurement (IVM), and overlapping technique (OT).

Figure 14: Update efficiency of GraSU with different UltraRAM sizes. All results are normalized to GraSU with the UltraRAM size of 800×288Kb.

Figure 15: Update efficiency of GraSU with varying batch sizes of graph updates. All results are normalized to GraSU with the update batch size of 10%.
Yet, we also observe that graph computation gradually comes to dominate the overall performance as update efficiency is significantly improved. For instance, for BFS on WK, the graph computation proportion is 60.16% for Aspen, and rises to 80.76% when graph update efficiency is improved by GraSU. In this work, we focus on improving graph update efficiency; improving graph computation performance to further boost overall performance is left as interesting future work.
6.4 Effectiveness
We further investigate the benefit breakdown for different compo-
nents of GraSU, including value-aware memory management (VMM),
incremental value measurement (IVM), and overlapping technique
(OT). Figure 13 shows the breakdown results, where the baseline applies none of VMM, IVM, or OT. All results are normalized to the version with VMM, IVM, and OT all applied. Note that AU and SU are small enough to fit all edges into the on-chip UltraRAM; in this case, IVM and OT are disabled at runtime and the baseline is GraSU with only VMM used. Overall, GraSU improves on the baseline by 6.14× on average.
VMM. By preserving the high-value data (identified by the
degree-based approach) on-chip, a large number of off-chip mem-
ory accesses are transformed to be on-chip accesses. Therefore, we
see that VMM contributes a significant speedup of 4.69× (on aver-
age) over the baseline, occupying 65.56% of overall performance
improvement of graph update. In particular, for the small graphs AU and SU, VMM yields substantial performance improvements of 9.89× and 7.37× over the baseline, respectively. The reason is clear: all their data are stored on-chip, so no off-chip communication occurs in these cases.
IVM. As shown in Figure 9, IVM can improve the prediction
accuracy of high-value data significantly. We next discuss the IVM’s
runtime overheads. Compared to the baseline, IVM alone offers only a 1.58× speedup on average. In particular, for WK, IVM can even cause a significant slowdown, making overall performance poorer than the baseline: the benefit of the improved accuracy is offset by the overhead of repeated value measurement across update batches. Fortunately, this overhead can be fully overlapped with the normal graph computation phase, as discussed below.
OT. By applying OT, we see that the IVM-induced extra over-
head can be fully hidden behind the normal graph computation,
further improving the total execution time significantly. OT makes dynamic graph updates run 4.21× faster than otherwise, demonstrating its effectiveness. Results show that IVM and OT jointly contribute 34.44% of the overall performance improvement.
Figure 16: Update efficiency of GraSU with different numbers of PEs. All results are normalized to the update time with 2 PEs.
6.5 Sensitivity Study
We examine the sensitivity of GraSU's performance to (1) UltraRAM sizes, (2) update batch sizes, and (3) PE counts.
UltraRAM Size. Figure 14 illustrates the performance of graph
updates with different UltraRAM sizes ranging from 100×288Kb
to 800×288Kb. Overall, the larger the UltraRAM size is, the better
the performance is. This is because a larger size implies more high-
value edge data that can be cached on-chip and fewer edge data
transfers between on-chip and off-chip memory. In particular, AU
can be stored entirely under the UltraRAM sizes of 400×288Kb and
800×288Kb, and therefore, its performance is not changed in these
two cases. Also, when the UltraRAM size is scaled from 200×288Kb to 400×288Kb, AU exhibits a significant performance improvement (up to 43.38%). This is because the 400×288Kb UltraRAM caches high-value data that the 200×288Kb configuration misses, thereby significantly reducing off-chip edge data accesses.
Update Batch Size. Figure 15 characterizes the update perfor-
mance of GraSU with different update batch sizes. For each graph,
we divide the edges that need an update into 1000, 100, and 10
batches according to their timestamp range. These correspond re-
spectively to the cases that each update batch contains 0.1%, 1%, and
10% of the updated edges. We see that the batch size does not significantly affect the update efficiency, since spatial similarity is not destroyed by the update batch scale. In addition, as the share of edge updates per batch decreases from 10% to 0.1%, the average update performance slightly increases from 1.0× to 1.23×, because more high-value data are mined through value measurement.
PE Number. Figure 16 further plots update performance with a varying number of graph update PEs (2/4/8/16/32). All results are normalized to the update time with 2 PEs. Overall, more PEs bring increasing performance improvements, but with diminishing returns: when the number of PEs increases from 16 to 32, the update performance improves by only 1.38× on average. This is because more PEs issue more simultaneous memory requests, putting significant pressure on the value-aware memory manager and creating potential access conflicts. We leave addressing this problem as future work.

Figure 17: Overall dynamic graph processing performance of FabGraph, WaveScheduler, and ForeGraph (each with GraSU integrated) against Aspen.
6.6 Integration with Other Graph Accelerators
We finally integrate GraSU into three other state-of-the-art FPGA-based graph accelerators: FabGraph [39], WaveScheduler [44], and ForeGraph [11] (its single-FPGA version). As with AccuGraph, only 11 lines of code modifications are needed to integrate GraSU with FabGraph, WaveScheduler, and ForeGraph for dynamic graph processing, thanks to the following designs. First, GraSU adopts a PMA-based dynamic graph organization, which can serve existing accelerators even though they use different underlying graph formats. Second, GraSU uses a lightweight bitmap-based method to implement differential memory access, thereby avoiding significant modifications to the underlying memory subsystem and lowering the integration obstacles. Third, GraSU provides a uniform integration framework with easy-to-use programming interfaces.
Figure 17 shows the overall performance results of FabGraph-D, WaveScheduler-D, and ForeGraph-D against Aspen. The experimental environment is the same as for AccuGraph-D. Since FabGraph uses some UltraRAM resources as a shared vertex buffer, we allocate only 21.09MB (600×288Kb) of UltraRAM to buffer high-value data for FabGraph-D. Results show that the dynamic graph versions of FabGraph, WaveScheduler, and ForeGraph also outperform Aspen, with geomean speedups of 2.93×, 3.09×, and 1.63×, demonstrating the generality and practicability of GraSU.
7 RELATED WORK
Graph Processing Accelerators. Due to its random-access nature, graph processing generally suffers from a low compute-to-memory ratio. To improve memory efficiency, existing FPGA-based graph accelerators typically focus on optimizing on-chip memory access [10, 11, 39, 44, 50] and off-chip memory access [2, 12, 24, 31, 32, 51, 52, 54, 55]. For on-chip memory optimizations, some works mitigate the performance overheads caused by data conflicts in the on-chip BRAM [44, 50]. Others [10, 11] use on-chip data reuse to improve the locality of graph computation. There are also works that hide the delay of loading data from off-chip memory into BRAM [39]. A number of graph processing accelerators [12, 31, 32, 54, 55] improve the bandwidth utilization between on-chip and off-chip memory with sophisticated pipeline designs. Others use emerging memory technologies (e.g., the hybrid memory cube) to further improve external memory access [24, 51, 52]. Unfortunately, these earlier studies are limited to handling static graphs [19]. To the best of our knowledge, there are currently few dynamic FPGA-based graph accelerators. In this work, we aim to fill the gap between static graph computation and dynamic graph update. We restrict ourselves to building a fast graph update library that can be integrated easily into any existing FPGA-based static graph accelerator for handling dynamic graphs.
Dynamic Graph Processing Systems. Most existing dynamic graph systems fall into two categories [6, 8, 13, 14, 16, 18, 21, 25, 29, 37, 38, 45, 46]. The first category develops new dynamic graph representations based on different static data structures, including CSR variants [14, 16, 29, 37], adjacency lists [8, 25, 46], hash tables [18, 21], trees [6, 13], and the Packed Memory Array (PMA) [38, 45]. These earlier studies improve the concurrency between graph updates, but their efficiency is still limited by excessive off-chip memory accesses [1]. GraSU identifies spatial-similarity opportunities and presents a differential data management scheme to improve the memory efficiency of dynamic graph processing.
A number of studies improve the efficiency of graph computation in dynamic graph processing scenarios by leveraging recent (rather than initial) vertex property values to accelerate the convergence of graph computation [8, 30, 37, 40, 42, 43]. In particular, accelerating graph computation using FPGAs stands out for yielding impressive results in both performance and energy efficiency [19, 39, 44, 50]. In this work, we focus instead on accelerating dynamic graph updates on FPGA, and develop an FPGA library that can be integrated easily with minimal hardware engineering effort.
8 CONCLUSION
In this paper, we introduce a graph-structured update library (called
GraSU) for high-throughput updates on FPGA. GraSU can be easily
integrated with any existing FPGA-based static graph accelerator with only a few lines of code modifications to handle dynamic graphs. GraSU features two key designs: an incremental value measurement and a value-aware differential memory management. The former quantifies data value accurately, while its overhead can be fully hidden behind normal graph computation. The latter exploits the spatial similarity of graph updates by retaining high-value data on-chip, so that most of the off-chip data communication arising in graph updates is transformed into fast on-chip memory accesses. We integrate GraSU into a state-of-the-art static graph accelerator, AccuGraph, to drive dynamic graph processing. Our implementation on a Xilinx Alveo™ U250 board demonstrates that the dynamic graph version of AccuGraph outperforms two state-of-the-art CPU-based dynamic graph systems, Stinger and Aspen, by an average of 34.24× and 4.42× in terms of update throughput, further improving overall efficiency by 9.80× and 3.07× on average.
ACKNOWLEDGMENTS
This work is supported by the National Key Research and Develop-
ment Program of China under Grant No. 2018YFB1003502, National
Natural Science Foundation of China under Grant No. 62072195,
61825202, and 61832006. The correspondence should be addressed
to Long Zheng.
REFERENCES
[1] Abanti Basak, Jilan Lin, Ryan Lorica, Xinfeng Xie, Zeshan Chishti, Alaa
Alameldeen, and Yuan Xie. 2020. SAGA-Bench: Software and Hardware Charac-
terization of Streaming Graph Analytics Workloads. In ISPASS. IEEE, 12–23.
[2] Andrew Bean, Nachiket Kapre, and Peter Y. K. Cheung. 2015. G-DMA: improving
memory access performance for hardware accelerated sparse graph computation.
In ReConFig. IEEE, 1–6.
[3] Nathan Beckmann and Daniel Sánchez. 2015. Talus: A simple way to remove
cliffs in cache performance. In HPCA. IEEE, 64–75.
[4] Maciej Besta, Marc Fischer, Tal Ben-Nun, Johannes de Fine Licht, and Torsten
Hoefler. 2019. Substream-Centric Maximum Matchings on FPGA. In FPGA. ACM,
152–161.
[5] Maciej Besta, Marc Fischer, Vasiliki Kalavri, Michael Kapralov, and Torsten Hoe-
fler. 2019. Practice of Streaming and Dynamic Graphs: Concepts, Models, Systems,
and Parallelism. CoRR abs/1912.12740 (2019). arXiv:1912.12740
[6] Federico Busato, Oded Green, Nicola Bombieri, and David A. Bader. 2018. Hornet:
An Efficient Data Structure for Dynamic Sparse Graphs and Matrices on GPUs.
In HPEC. IEEE, 1–7.
[7] Xinyu Chen, Ronak Bajaj, Yao Chen, Jiong He, Bingsheng He, Weng-Fai Wong,
and Deming Chen. 2019. On-The-Fly Parallel Data Shuffling for Graph Processing
on OpenCL-Based FPGAs. In FPL. 67–73.
[8] Raymond Cheng, Ji Hong, Aapo Kyrola, Youshan Miao, Xuetian Weng, Ming Wu,
Fan Yang, Lidong Zhou, Feng Zhao, and Enhong Chen. 2012. Kineograph: taking
the pulse of a fast-changing and connected world. In EuroSys. ACM, 85–98.
[9] Asaf Cidon, Assaf Eisenman, Mohammad Alizadeh, and Sachin Katti. 2016.
Cliffhanger: Scaling Performance Cliffs in Web Memory Caches. In NSDI. USENIX,
379–392.
[10] Guohao Dai, Yuze Chi, Yu Wang, and Huazhong Yang. 2016. FPGP: Graph
Processing Framework on FPGA A Case Study of Breadth-First Search. In FPGA.
ACM, 105–110.
[11] Guohao Dai, Tianhao Huang, Yuze Chi, Ningyi Xu, Yu Wang, and Huazhong
Yang. 2017. ForeGraph: Exploring Large-scale Graph Processing on Multi-FPGA
Architecture. In FPGA. ACM, 217–226.
[12] Michael DeLorimier, Nachiket Kapre, Nikil Mehta, Dominic Rizzo, Ian Eslick,
Raphael Rubin, Tomás E. Uribe, Thomas F. Knight Jr., and André DeHon. 2006.
GraphStep: A System Architecture for Sparse-Graph Algorithms. In FCCM. IEEE,
143–151.
[13] Laxman Dhulipala, Guy E. Blelloch, and Julian Shun. 2019. Low-latency graph
streaming using compressed purely-functional trees. In PLDI. ACM, 918–934.
[14] David Ediger, Robert McColl, E. Jason Riedy, and David A. Bader. 2012. STINGER:
High performance data structure for streaming graphs. In HPEC. IEEE, 1–5.
[15] Nina Engelhardt and Hayden Kwok-Hay So. 2016. Gravf: A vertex-centric dis-
tributed graph processing framework on fpgas. In FPL. IEEE, 1–4.
[16] Guoyao Feng, Xiao Meng, and Khaled Ammar. 2015. DISTINGER: A distributed
graph data structure for massive dynamic graph processing. In BigData. IEEE,
1814–1822.
[17] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin.
2012. PowerGraph: Distributed Graph-parallel Computation on Natural Graphs.
In OSDI. USENIX, 17–30.
[18] Xiangyang Gou, Lei Zou, Chenxingyu Zhao, and Tong Yang. 2019. Fast and
Accurate Graph Stream Summarization. In ICDE. IEEE, 1118–1129.
[19] Chuang-Yi Gui, Long Zheng, Bingsheng He, Cheng Liu, Xin-Yu Chen, Xiao-Fei
Liao, and Hai Jin. 2019. A Survey on Graph Processing Accelerators: Challenges
and Opportunities. J. Comput. Sci. Technol. 34, 2 (2019), 339–371.
[20] Tae Jun Ham, Lisa Wu, Narayanan Sundaram, Nadathur Satish, and Margaret
Martonosi. 2016. Graphicionado: A High-Performance and Energy-Efficient
Accelerator for Graph Analytics. In MICRO. IEEE, 1–13.
[21] Keita Iwabuchi, Scott Sallinen, Roger A. Pearce, Brian Van Essen, Maya B. Gokhale,
and Satoshi Matsuoka. 2016. Towards a Distributed Large-Scale Dynamic Graph
Data Store. In IPDPS. IEEE, 892–901.
[22] Hai Jin, Pengcheng Yao, Xiaofei Liao, Long Zheng, and Xianliang Li. 2017. To-
wards Dataflow-Based Graph Accelerator. In ICDCS. IEEE, 1981–1992.
[23] Theodore Johnson and Dennis E. Shasha. 1994. 2Q: A Low Overhead High
Performance Buffer Management Replacement Algorithm. In VLDB. Morgan
Kaufmann, 439–450.
[24] Soroosh Khoram, Jialiang Zhang, Maxwell Strange, and Jing Li. 2018. Accelerat-
ing Graph Analytics by Co-Optimizing Storage and Access on an FPGA-HMC
Platform. In FPGA. ACM, 239–248.
[25] Pradeep Kumar and H. Howie Huang. 2019. GraphOne: A Data Store for Real-time
Analytics on Evolving Graphs. In FAST. USENIX, 249–263.
[26] Ravi Kumar, Jasmine Novak, and Andrew Tomkins. 2006. Structure and Evolution
of Online Social Networks. In KDD. ACM, 611–617.
[27] Jure Leskovec, Jon M. Kleinberg, and Christos Faloutsos. 2005. Graphs over time:
densification laws, shrinking diameters and possible explanations. In KDD. ACM,
177–187.
[28] Jure Leskovec and Andrej Krevl. 2014. SNAP Datasets: Stanford Large Network
Dataset Collection. http://snap.stanford.edu/data.
[29] Peter Macko, Virendra Marathe, Daniel Margo, and Margo Seltzer. 2015. LLAMA:
Efficient graph analytics using Large Multiversioned Arrays. In ICDE. IEEE,
363–374.
[30] Mugilan Mariappan and Keval Vora. 2019. GraphBolt: Dependency-Driven Syn-
chronous Processing of Streaming Graphs. In EuroSys. ACM, 25:1–25:16.
[31] Eriko Nurvitadhi, Gabriel Weisz, Yu Wang, Skand Hurkat, Marie Nguyen, James C.
Hoe, José F. Martínez, and Carlos Guestrin. 2014. GraphGen: An FPGA Framework
for Vertex-Centric Graph Computation. In FCCM. IEEE, 25–28.
[32] Tayo Oguntebi and Kunle Olukotun. 2016. GraphOps: A Dataflow Library for
Graph Analytics Acceleration. In FPGA. ACM, 111–117.
[33] Muhammet Mustafa Ozdal, Serif Yesil, Taemin Kim, Andrey Ayupov, John Greth,
Steven Burns, and Ozcan Ozturk. 2016. Energy Efficient Architecture for Graph
Analytics Accelerators. In ISCA. IEEE, 166–177.
[34] Xiafei Qiu, Wubin Cen, Zhengping Qian, You Peng, Ying Zhang, Xuemin Lin, and
Jingren Zhou. 2018. Real-time Constrained Cycle Detection in Large Dynamic
Graphs. Proc. VLDB Endow. 11, 12 (2018), 1876–1888.
[35] Ryan A. Rossi and Nesreen K. Ahmed. 2015. The Network Data Repository with Interactive Graph Analytics and Visualization. In AAAI. http://networkrepository.com
[36] David Sayce. 2020. The Number of tweets per day in 2020. https://www.dsayce.com/social-media/tweets-day/.
[37] Dipanjan Sengupta, Narayanan Sundaram, Xia Zhu, Theodore L. Willke, Jeffrey S.
Young, Matthew Wolf, and Karsten Schwan. 2016. GraphIn: An Online High
Performance Incremental Graph Processing Framework. In Euro-Par. Springer,
319–333.
[38] Mo Sha, Yuchen Li, Bingsheng He, and Kian-Lee Tan. 2017. Accelerating Dynamic
Graph Analytics on GPUs. Proc. VLDB Endow. 11, 1 (2017), 107–120.
[39] Zhiyuan Shao, Ruoshi Li, Diqing Hu, Xiaofei Liao, and Hai Jin. 2019. Improving
Performance of Graph Processing on FPGA-DRAM Platform by Two-level Vertex
Caching. In FPGA. ACM, 320–329.
[40] Feng Sheng, Qiang Cao, Haoran Cai, Jie Yao, and Changsheng Xie. 2018. GraPU:
Accelerate Streaming Graph Analysis through Preprocessing Buffered Updates.
In SoCC. ACM, 301–312.
[41] Shuang Song, Xu Liu, Qinzhe Wu, Andreas Gerstlauer, Tao Li, and Lizy K. John.
2018. Start Late or Finish Early: A Distributed Graph Processing System with
Redundancy Reduction. Proc. VLDB Endow. 12, 2 (2018), 154–168.
[42] Keval Vora, Rajiv Gupta, and Guoqing (Harry) Xu. 2016. Synergistic Analysis of
Evolving Graphs. ACM Trans. Archit. Code Optim. 13, 4 (2016), 32:1–32:27.
[43] Keval Vora, Rajiv Gupta, and Guoqing (Harry) Xu. 2017. KickStarter: Fast and
Accurate Computations on Streaming Graphs via Trimmed Approximations. In
ASPLOS. ACM, 237–251.
[44] Qinggang Wang, Long Zheng, Jieshan Zhao, Xiaofei Liao, Hai Jin, and Jingling
Xue. 2020. A Conflict-free Scheduler for High-performance Graph Processing on
Multi-pipeline FPGAs. ACM Trans. Archit. Code Optim. 17, 2 (2020), 14:1–14:26.
[45] Brian Wheatman and Helen Xu. 2018. Packed Compressed Sparse Row: A Dy-
namic Graph Representation. In HPEC. IEEE, 1–7.
[46] Martin Winter, Daniel Mlakar, Rhaleb Zayer, Hans-Peter Seidel, and Markus
Steinberger. 2018. faimGraph: high performance management of fully-dynamic
graphs under tight memory constraints on the GPU. In SC. ACM, 60:1–60:13.
[47] Alex Woodie, Tiffany Trader, George Leopold, John Russell, Oliver Peckham,
James Kobielus, and Steve Conway. 2020. Tracking the Spread of Coronavirus with
Graph Databases. datanami. https://www.datanami.com/2020/03/12/tracking-
the-spread-of-coronavirus-with-graph-databases/.
[48] Xilinx. 2019. UltraScale Architecture Memory Resources User Guide.
https://www.xilinx.com/support/documentation/user_guides/ug573-ultrascale-
memory-resources.pdf.
[49] Xilinx. 2020. Vivado Design Suite User Guide High-Level Synthesis. https:
//www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug902-
vivado-high-level-synthesis.pdf.
[50] Pengcheng Yao, Long Zheng, Xiaofei Liao, Hai Jin, and Bingsheng He. 2018. An
Efficient Graph Accelerator with Parallel Data Conflict Management. In PACT.
ACM, 8:1–8:12.
[51] Jialiang Zhang, Soroosh Khoram, and Jing Li. 2017. Boosting the Performance of
FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth
First Search. In FPGA. ACM, 207–216.
[52] Jialiang Zhang and Jing Li. 2018. Degree-aware Hybrid Graph Traversal on
FPGA-HMC Platform. In FPGA. ACM, 229–238.
[53] Long Zheng, Xianliang Li, Yaohui Zheng, Yu Huang, Xiaofei Liao, Hai Jin, Jingling
Xue, Zhiyuan Shao, and Qiang-Sheng Hua. 2020. Scaph: Scalable GPU-Accelerated
Graph Processing with Value-Driven Differential Scheduling. In ATC. USENIX,
573–588.
[54] Shijie Zhou, Charalampos Chelmis, and Viktor K Prasanna. 2016. High-
Throughput and Energy-Efficient Graph Processing on FPGA. In FCCM. IEEE,
103–110.
[55] Shijie Zhou, Rajgopal Kannan, Viktor K. Prasanna, Guna Seetharaman, and Qing
Wu. 2019. HitGraph: High-throughput Graph Processing Framework on FPGA.
IEEE Trans. Parallel Distrib. Syst. 30, 10 (2019), 2249–2264.
Session 3: Machine Learning and Supporting Algorithms FPGA ’21, February 28–March 2, 2021, Virtual Event, USA