Highlighted notes on Parallel algorithms for multi-source graph traversal and its applications.
While doing research work under Prof. Kishore Kothapalli.
Seema is working on multi-source BFS with hybrid-CSR, with applications in APSP, diameter, centrality, and reachability.
BFS can be either top-down (from visited nodes, mark the next frontier) or bottom-up (from unvisited nodes, check for a visited neighbour to join the next frontier). She mentioned that the hybrid approach is more efficient. EtaGraph uses unified degree cut (UDC) graph partitioning, and also overlaps data transfer with kernel execution. iCENTRAL uses biconnected components for betweenness centrality on dynamic graphs.
Hybrid CSR uses an additional value array storing packed "has edge/neighbour" bits. This can give a better memory access pattern if many bits are set, but cause many threads to wait if many bits are zero. She mentioned the Volta architecture has an independent PC and stack per thread (similar to a CPU?). Does it then not matter if the threads in a block diverge?
(BFS = G*v, multi-source BFS = G*vs)
Parallel algorithms for multi-source graph traversal and its applications
Comprehensive report on "Parallel algorithms for multi-source graph traversal and its applications"
by
Seema Simoliya
Program: Ph.D. (CSE)
Advisor: Dr. Kishore Kothapalli
Center for Security, Theory, and Algorithmic Research (C-STAR)
International Institute of Information Technology-Hyderabad, Gachibowli,
Hyderabad, India – 500 032.
seema.simoliya@research.iiit.ac.in
Table of Contents
1 Introduction
2 Graphics Processing Unit
2.1 Modern GPU Computing Architecture
2.2 Memory Layout
2.3 CUDA
3 Literature Survey
3.1 Representation of Sparse Graphs
3.2 List of selected Research Papers
3.3 Single Source Breadth-First Search
3.4 Multi-source Breadth First Search
4 Proposed Research Plan
4.1 Hybrid CSR Representation
4.2 In-core Multi-source BFS
4.2.1 Optimizations
4.3 Out-of-core Multi-source BFS
4.4 Applications
4.4.1 Applications to Static Graph Problems
4.4.2 Applications to Dynamic Graph Problems
5 Concluding Remarks
References
1 Introduction
Many scientific and technical problems involve large datasets with a networked nature; such problems can be naturally represented as graph data structures. Graphs have shown their importance in many varied fields, but especially in computer science applications such as data mining [1], image processing [2], networking [3], resource allocation [4], etc. This promotes the development of new algorithms and theorems that can be used across a tremendous range of applications. Graphs provide a simple and flexible abstract view of the discrete objects in the problem domain.
In many problem environments, the graph is so large that a different approach is needed to process, store, and represent it. Large graph datasets contain billions of vertices and trillions of edges, as seen in social networks, citation networks, web graphs, road networks, etc. For computations on these graphs, parallel computing appears necessary to overcome the restrictions of limited resources and large data. Parallel, data-intensive algorithms often require fast computational resources for bulk processing. This can be achieved by pairing general-purpose computers with specialized hardware such as field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), vector processors, or graphics processing units (GPUs).
GPUs have become common in personal computers, supercomputers, and datacenters; hence researchers and application developers are motivated to design algorithms that take advantage of the GPU architecture. Although the GPU evolved because of the gaming world's increasing demand for realistic graphics rendering, it has become a platform of choice for data-parallel and computationally expensive workloads. GPUs provide highly scalable and cost-effective solutions for high-performance computing. GPGPU (general-purpose computing on graphics processing units) refers to the programmable use of the GPU beyond its traditional purpose of computation for computer graphics. Running graph algorithms on large datasets is challenging on a GPU because of their irregular memory access patterns and throughput-intensive requirements. Combining the power of both GPU and CPU can increase the performance of graph algorithms.
Real-world graphs are sparse [5]; therefore, as opposed to dense graphs, sparse graphs are of more practical interest to researchers. Graph algorithms designed for dense graphs may not be suitable for sparse graphs because, on sparse graphs, they often exhibit poor locality of memory access, uneven workload per vertex, and a degree of parallelism that varies dynamically over the course of execution. A better graph representation and proper load balancing can address these issues in parallel settings. Graphs are usually stored in adjacency matrix or adjacency list format. It is observed that when graphs are represented as matrices, they are more manageable, concise, and expressive, with a reduced memory footprint compared to vertex-centric frameworks. A few noted representations for sparse graphs are discussed in Section 3.1.
Traversal on real-world sparse graphs is one of the core techniques used in the area of graph analytics. Depth-first search and breadth-first search are the two most common traversal algorithms used in these areas, of which breadth-first search is considered well suited to GPUs because all the vertices on a single level can be processed independently in parallel. [6] describes that BFS from a single source can be computed by multiplying the sparse adjacency matrix with a vector (SpMV). GraphBLAS is an API specification based on the notion that various graph problems can be solved using linear algebra. With the popularity of GraphBLAS, various parallel breadth-first search techniques have been developed on different platforms by utilizing different architectural and programming optimizations. Different variants of BFS are studied to solve graph problems. Problems like all-pairs shortest paths (APSP), diameter computation, betweenness centrality, reachability querying, etc. require one or more executions of such a breadth-first search. For example, APSP requires a BFS from each node to find the shortest distances between every pair of vertices. Multi-source BFS is another variant in which BFS is performed simultaneously from a set of source nodes.
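This linear-algebraic view of BFS can be sketched in a few lines (Python; illustrative, with the graph given as out-neighbour lists and the frontier kept as a sparse set of indices standing in for the sparse vector x):

```python
def bfs_levels_spmv(adj, source):
    """BFS as repeated Boolean sparse matrix-vector products y = A^T x.

    adj[u] lists the out-neighbours of u; the frontier set x plays the
    role of the sparse vector, and one product advances one BFS level.
    """
    level = [-1] * len(adj)
    level[source] = 0
    x = {source}                      # sparse vector: indices of nonzeros
    d = 0
    while x:
        d += 1
        # y = A^T x over the (OR, AND) semiring: union of neighbour lists,
        # restricted to vertices not reached at an earlier level
        y = {v for u in x for v in adj[u] if level[v] == -1}
        for v in y:
            level[v] = d
        x = y
    return level
```

For the toy digraph 0->{1,2}, 1->3, 2->3, the call from source 0 returns the levels [0, 1, 1, 2].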
A few parallel solutions exist [7], [8], [9], [10] that have addressed the simultaneous execution of graph traversal from more than one source node on CPU and GPU architectures. Each approach is discussed in detail in the Section 3 literature survey. These algorithms run on different platforms and have a few limitations, which are addressed in our approach suggested in Section 4. The problem area of parallel multi-source breadth-first search encourages finding alternate solutions, given its various applications and the limitations of existing solutions.
In the literature survey, I have studied the various versions of graph traversal problems, such as single-source BFS and multi-source BFS, and their applications. These problems have sequential as well as parallel solutions. Inspired by the significance of the functionalities offered by GraphBLAS, in my research work I aim to design parallel algorithms (on GPUs) for the problems listed below:
- Multi-source BFS using the linear-algebra method in the following scenarios:
  o When the graph can or cannot fit in the GPU memory
  o Possibility of overlapped execution between data transfer and BFS computation
  o Memory-saving compact representation of the graph, like hybrid-CSR
  o On static graphs and dynamic graphs
- Shortest path computation from multiple sources.
- Graph centrality calculation: betweenness centrality of a few k nodes, which can effectively use our proposed hybrid-CSR representation and multi-source BFS.
- Diameter calculation in static and dynamic graphs.
This report is organized as follows: Section 2 describes the GPU architecture and the CUDA programming model used in our research work. Section 3 gives details about the literature in the problem areas listed above. Section 4 gives more detail about these problems and our approach to solving them.
2 Graphics Processing Unit
2.1 Modern GPU Computing Architecture
Fig. 1 illustrates the high-level architecture of the NVIDIA Tesla V100 [11]. It is an extremely power-efficient processor, containing 21.1 billion transistors yet delivering exceptional performance per watt. The GPU is composed of multiple GPU Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers. Each GPC consists of a group of TPCs, and each TPC is made up of several SMs, a texture unit, and control logic.
Fig. 1 Volta GV100 Full GPU with 84 SM Units [11]
The Streaming Multiprocessor in the Volta architecture, unlike in predecessor architectures, is additionally provided with tensor cores, a combined L1 data cache and shared memory, and a set of registers. Fig. 2 shows the architecture of a Streaming Multiprocessor. Tensor cores are an important feature developed to train large neural networks, in which matrix-matrix multiplication is the core operation. Each tensor core can compute a 4x4 matrix fused multiply-add, i.e., 64 FMA operations, per clock cycle, which significantly boosts the performance of these operations. Shared memory provides high bandwidth and low latency, and combining the L1 data cache with shared memory allows L1 cache operations to attain the benefits of shared-memory performance.
The Volta architecture also comes with independent thread scheduling, which means each thread has its own program counter and call stack. This makes programming on these GPUs more productive and easier.
Fig. 2 Streaming Multiprocessor of V100 [11]
GPUs execute a group of 32 threads (called a warp) in SIMT (Single Instruction, Multiple Threads) fashion. In Volta, every thread in a warp maintains its own execution state, enabling concurrency among all the threads regardless of warp. Fig. 3 illustrates the independent thread scheduling. A schedule optimizer is used to group active threads from the same warp together into SIMT units.
Fig. 3 Independent thread scheduling architecture block diagram [11]
2.2 Memory Layout
When writing programs, we group threads into a "block". Each SM has its own capacity for the number of threads in a thread block. A thread has its own local memory, and all the threads in a thread block can access high-performance on-chip shared memory. Thread blocks are arranged in a grid, and all threads of a grid have access to global memory, to/from which we copy data from the CPU memory. Fig. 4 shows the high-level memory architecture with respect to a program. There are two additional read-only memory spaces: constant memory and texture memory. These memory spaces are accessible to all the threads and can be used according to the memory usage needs.
Fig.4 Memory Hierarchy [12]
2.3 CUDA
"CUDA (Compute Unified Device Architecture) [12] is a parallel computing platform and programming model developed by NVIDIA that makes general-purpose computing simple on GPUs." Programmers still write in familiar languages like C, C++, or the other languages supported by the NVIDIA CUDA toolkit. CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. With the help of these abstractions, a problem can be divided into sub-problems that can be solved independently by thread blocks, and each sub-problem can be further solved cooperatively in parallel by the threads within a block. In CUDA C++, a function which runs on the GPU is called a kernel, which when launched is executed in parallel by CUDA threads. The number of threads in a block and the number of blocks in a grid are defined by the programmer. Each thread and each block have a unique id within its block and grid, respectively. threadIdx is a 3-component vector, so that a thread can be identified in a one-, two-, or three-dimensional thread block. Similarly, blockIdx is a 3-component vector identifying a block in a one-, two-, or three-dimensional grid.
When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed across SMs. Once a thread block is launched on an SM, all its warps remain resident until they all finish their execution. All the threads of a block execute in parallel on one SM.
3 Literature Survey
3.1 Representation of Sparse Graphs
Substantial memory reductions can be realized when only the non-zero entries of sparse graphs are stored. Consider the sparse graph in Fig. 5 and its adjacency matrix; below are a few commonly used sparse graph representations:
Fig. 5. An undirected, unweighted graph
1. Adjacency List: This format stores the list of edges for each vertex. The memory requirement for this representation is O(|V|+|E|). The programmer can decide which data structure to use for this representation; for example, one can use a list of lists or an array of lists to store the graph in Fig. 5.
2. Coordinate Format (COO): The sparse matrix A is assumed to be stored in row-major order. The matrix is represented with 3 arrays: cooVal, cooRowInd, and cooColInd. The size of each of these arrays equals the number of non-zero (nnz) values in A. cooVal stores the nnz values in row-major order, cooRowInd stores the row indices of these values, and cooColInd stores their column indices. With this format, we can quickly reconstruct the matrix.
3. Compressed Sparse Row Format (CSR): The matrix is represented with 3 arrays: csrVal, csrRowPtr, and csrColInd. csrVal stores the nnz values in row-major order, csrRowPtr points to the indices in csrVal where each row begins, and csrColInd stores the column indices of the values. This provides fast row access and is particularly good for matrix-vector multiplication.
4. Compressed Sparse Column Format (CSC): This format is similar to CSR except that the matrix is stored in column-major order.
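As a small illustration of the COO and CSR formats above, the following sketch builds the three COO arrays and the three CSR arrays from a dense 0/1 adjacency matrix (Python; array names follow the text):

```python
def dense_to_coo(A):
    """COO: three parallel arrays of the nonzeros, scanned in row-major order."""
    cooVal, cooRowInd, cooColInd = [], [], []
    for i, row in enumerate(A):
        for j, v in enumerate(row):
            if v:
                cooVal.append(v)
                cooRowInd.append(i)
                cooColInd.append(j)
    return cooVal, cooRowInd, cooColInd

def dense_to_csr(A):
    """CSR: csrRowPtr[i] .. csrRowPtr[i+1] delimits row i in csrVal/csrColInd."""
    csrVal, csrColInd, csrRowPtr = [], [], [0]
    for row in A:
        for j, v in enumerate(row):
            if v:
                csrVal.append(v)
                csrColInd.append(j)
        csrRowPtr.append(len(csrVal))  # running count of nonzeros closes the row
    return csrVal, csrRowPtr, csrColInd
```

For the adjacency matrix of a 3-vertex path graph, dense_to_csr yields csrRowPtr = [0, 1, 3, 4], i.e., one edge in row 0, two in row 1, and one in row 2.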
3.2 List of selected Research Papers
Summaries of the 9 publications, selected from reputed conferences/journals for the potential of their work in the chosen area of research, are listed below, together with the reason for selection and the citation count:

1. Liu, H., Huang, H. H., & Hu, Y. (2016, June). iBFS: Concurrent breadth-first search on GPUs. In Proceedings of the 2016 International Conference on Management of Data (pp. 403-416). [7]
   Reason: Sharing and grouping of frontier nodes across BFS instances; use of bitwise optimization when concurrently executing BFS from many source nodes. (62 citations)

2. McLaughlin, A., & Bader, D. A. (2015, December). Fast execution of simultaneous breadth-first searches on sparse graphs. In 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS) (pp. 9-18). IEEE. [10]
   Reason: Describes a simple parallel multi-search abstraction which can be complemented with other graph analytics applications on a single GPU. (9 citations)

3. Then, M., Kaufmann, M., Chirigati, F., Hoang-Vu, T. A., Pham, K., Kemper, A., ... & Vo, H. T. (2014). The more the merrier: Efficient multi-source graph traversal. Proceedings of the VLDB Endowment, 8(4), 449-460. [8]
   Reason: Proposes an algorithm to process multi-source BFS concurrently on a multicore CPU, leveraging the property of small-world graphs that most vertices are shared in the initial levels of the BFSs. (54 citations)

4. Kaufmann, M., Then, M., Kemper, A., & Neumann, T. (2017). Parallel array-based single- and multi-source breadth first searches on large dense graphs. In EDBT (pp. 1-12). [9]
   Reason: Suggests two-phase parallelism on the top-down approach and two-phase parallelism on the bottom-up approach used in [14]; single-source and multi-source parallel BFS on multi-socket NUMA-aware systems. (5 citations)

5. Wang, P., Zhang, L., Li, C., & Guo, M. (2019, May). Excavating the potential of GPU for accelerating graph traversal. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 221-230). IEEE. [13]
   Reason: In the EtaGraph algorithm, overlapping data transfer with kernel execution during graph traversal improves the total execution time; a unique "Unified Degree Cut" scheme is used for load balancing, and prefetching is used to improve memory access latency. (3 citations)

6. McLaughlin, A., & Bader, D. (2014). Scalable and high performance betweenness centrality on the GPU. In SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 572-583). [14]
   Reason: Suggests alternating between two parallelization methods, edge-parallel and work-efficient, based on how significantly the frontier vertex set changes in each iteration. (102 citations)

7. Sarıyüce, A. E., Kaya, K., Saule, E., & Çatalyürek, Ü. V. (2013, March). Betweenness centrality on GPUs and heterogeneous architectures. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (pp. 76-85). [15]
   Reason: Various techniques to accelerate betweenness centrality computations, such as vertex virtualization, strided memory access, and a unique graph compression method in which degree-1 vertices are removed; the algorithm makes use of both GPU and CPU. (79 citations)

8. Jamour, F., Skiadopoulos, S., & Kalnis, P. (2017). Parallel algorithm for incremental betweenness centrality on large graphs. IEEE Transactions on Parallel and Distributed Systems, 29(3), 659-672. [16]
   Reason: Designed for evolving graphs; the incremental betweenness centrality is computed on only the affected biconnected components. (29 citations)

9. Crescenzi, P., Grossi, R., Habib, M., Lanzi, L., & Marino, A. (2013). On computing the diameter of real-world undirected graphs. Theoretical Computer Science, 514, 84-95. [17]
   Reason: Diameter computation usually requires BFS from all nodes; this paper defines an algorithm which does BFS from a few selected nodes and terminates the evaluation early if a certain condition is met. (61 citations)
3.3 Single Source Breadth-First Search
A typical sequential breadth-first search begins from a source and then, at each level, explores all the adjacent vertices of the "frontier" nodes. This direction of search is called top-down search. As opposed to this, there is "bottom-up" search, in which every unvisited node is examined as a candidate for the next frontier; this search skips the nodes which cannot be part of the frontier at the next level. A hybrid approach [13] using both top-down and bottom-up search turns out to be much more efficient than either direction alone, since the bidirectional combination explores fewer candidate nodes per frontier. The approach works well for single-source BFS on a multicore CPU (according to the experiments in [13]). Numerous graph analytics algorithms need to execute multiple BFS traversals on the same graph from different source nodes. With the advances in hardware architectures and in the applications of BFS, the scope for exploring efficient multi-source BFS (MS-BFS) algorithms has widened.
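The two search directions and the switch between them can be sketched as follows (Python; an undirected graph is assumed so the same adjacency lists serve both directions, and the frontier-fraction switch rule is a simplistic stand-in for the published heuristic):

```python
def hybrid_bfs(adj, source, threshold=0.25):
    """Direction-optimizing BFS: top-down expands the frontier's edges,
    bottom-up lets each unvisited vertex scan for a frontier neighbour."""
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    frontier = {source}
    d = 0
    while frontier:
        d += 1
        if len(frontier) < threshold * n:
            # top-down: from visited frontier nodes, mark the next frontier
            nxt = {v for u in frontier for v in adj[u] if level[v] == -1}
        else:
            # bottom-up: each unvisited node checks whether one of its
            # neighbours is in the current frontier
            nxt = {v for v in range(n)
                   if level[v] == -1 and any(u in frontier for u in adj[v])}
        for v in nxt:
            level[v] = d
        frontier = nxt
    return level
```

Both branches compute the same levels; the payoff of the bottom-up branch is that a vertex can stop scanning as soon as it finds one frontier neighbour, which saves many edge checks when the frontier is large.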
3.4 Multi-source Breadth First Search
[iBFS] A recent work in the area of parallel multi-source BFS on GPUs is proposed by Liu et al. [7]. The proposed algorithm iBFS utilizes three novel techniques: joint traversal, GroupBy, and bitwise optimization. Joint traversal shares the frontiers of different BFS instances, which reduces memory latency. Grouping BFS instances based on the sharing ratio of nodes among them can also save memory; GroupBy selectively groups BFS instances which have the maximum number of common nodes. In the initial levels of two BFS instances the frontiers are small, and if such nodes are shared then all their edges are checked only once. This ensures maximum sharing among levels of different BFS instances. Lastly, to handle billions of nodes, bitwise storage is used for the status arrays and the bitwise operations are performed using thousands of GPU threads. A great improvement is observed in their work when the bitwise optimizations are applied.
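The bitwise status-array idea can be imitated on the CPU as follows (Python; illustrative only — one integer per vertex whose bit i records whether instance i has visited that vertex, so a whole batch of BFS instances advances with word-wide OR/AND operations; the GPU version in [7] spreads these word operations over thousands of threads):

```python
def concurrent_bfs_bitwise(adj, sources):
    """Run len(sources) BFS instances at once with bitwise status arrays."""
    n = len(adj)
    visited = [0] * n             # bit i set: instance i has reached v
    frontier = [0] * n            # bit i set: v is in instance i's frontier
    levels = [[-1] * n for _ in sources]
    for i, s in enumerate(sources):
        visited[s] |= 1 << i
        frontier[s] |= 1 << i
        levels[i][s] = 0
    d = 0
    while any(frontier):
        d += 1
        nxt = [0] * n
        for u in range(n):
            if frontier[u]:
                for v in adj[u]:
                    # propagate every instance's frontier bit that has
                    # not yet visited v -- one edge check serves them all
                    nxt[v] |= frontier[u] & ~visited[v]
        for v in range(n):
            if nxt[v]:
                visited[v] |= nxt[v]
                bits, i = nxt[v], 0
                while bits:           # record the level per instance
                    if bits & 1:
                        levels[i][v] = d
                    bits >>= 1
                    i += 1
        frontier = nxt
    return levels
```

The joint-traversal benefit shows in the inner loop: a shared edge (u, v) is scanned once for all instances whose frontier bit is set at u.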
[multi-BFS] McLaughlin et al. [10] proposed a multi-search abstraction which can be applied to problems that require many simultaneous BFSs. The algorithm is shown to work for APSP and betweenness centrality on a single GPU. Like the Gather-Apply-Scatter (GAS) paradigm, their approach can be complemented with other applications. The multi-search abstraction has five major functions: init(), prior(), visitvertex(), post(), and finalise(). init() initializes the data structures to begin a search from source i; this step runs in parallel on the streaming multiprocessors. prior() is a pre-processing step to handle any computations needed prior to a search iteration. visitvertex() updates information about a vertex, e.g., the distance of the vertex from source i, and the next frontier; this step needs cooperation between the threads of a warp to atomically update the distance between two nodes. post() performs any post-processing step, and lastly finalise() handles any final computation of the search. Since it is an abstraction, one can benefit from code reuse when solving any problem that uses multiple BFSs.
Coarse-grained parallelism is achieved by running i multiple BFSs on i streaming multiprocessors, and fine-grained parallelism is achieved by assigning warps to the active frontier vertices.
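A sequential stand-in for this abstraction might look as follows (Python; the function names mirror the five hooks, but the per-SM searches of [10] become an ordinary outer loop, and the BFS-distance callbacks below are our own illustrative instantiation):

```python
def multi_search(adj, sources, init, visit_vertex, finalise,
                 prior=None, post=None):
    """Skeleton of the five-function multi-search abstraction."""
    results = []
    for s in sources:                         # one search per source
        state = init(s)                       # per-search data structures
        frontier = [s]
        while frontier:
            if prior:
                prior(state, frontier)        # pre-iteration hook
            nxt = []
            for u in frontier:
                for v in adj[u]:
                    if visit_vertex(state, u, v):
                        nxt.append(v)         # v was newly discovered
            if post:
                post(state, nxt)              # post-iteration hook
            frontier = nxt
        results.append(finalise(state))       # final per-search computation
    return results

# Instantiating the abstraction to compute BFS distances:
def bfs_init(s):
    return {"dist": {s: 0}}

def bfs_visit(state, u, v):
    if v not in state["dist"]:
        state["dist"][v] = state["dist"][u] + 1
        return True
    return False

distances = multi_search([[1], [0, 2], [1]], [0, 2],
                         bfs_init, bfs_visit, lambda st: st["dist"])
```

Here distances[0] maps each vertex of the 3-vertex path to its BFS distance from vertex 0; swapping the callbacks (e.g., accumulating path counts) reuses the same traversal skeleton, which is the code-reuse point made above.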
[The more the merrier] Multi-source BFS (MS-BFS) on a multi-core CPU is proposed by Then et al. [8], which leverages the properties of small-world graphs. The algorithm allows the sharing of frontier nodes among BFS instances. Initially each source marks itself as discovered, and all its adjacent vertices become the frontier. By merging all the BFS instances whose frontiers contain a vertex v, v is explored only once for all those BFSs. A word of size equal to the register width, or the width of a cache line, is used per vertex for space optimization. Aggregated neighbor processing, in which all the vertices to be explored in the next level are first collected, is used to reduce random memory accesses, and the frontier nodes for the next level are prefetched.
Since the register width can be smaller than the number of BFS instances that need to be executed, the use of multiple registers is advised for multi-source BFS. A good heuristic for maximum sharing is to group BFSs according to their connected components. The algorithm is compared with non-parallel direction-optimized BFS [13] and textbook BFS.
[Parallel array-based MS-BFS] This algorithm [9] extends the idea of the array-based BFS of [8] and parallelizes it on a multi-socket NUMA-aware server. Cache hits are improved by relabeling the vertices according to their degrees, so that the states of the higher-degree nodes are located close together. Work partitioning among the workers is decided on this basis, so that all threads are doing some work at any time and all cores are utilized; hence, synchronization is needed between threads. Since the vertex degrees in real-world graphs follow a power-law distribution, with a few high-degree vertices and many low-degree vertices, each thread cannot be given the same number of nodes to process. Instead, each thread is given its own queue; for example, threads working on higher-degree vertices have a smaller queue while threads working on low-degree vertices have a larger queue.
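The degree-based relabeling step can be sketched as follows (Python; illustrative, relying on a stable sort so equal-degree vertices keep their original order):

```python
def relabel_by_degree(adj):
    """Relabel vertices in decreasing-degree order so that the state of
    high-degree vertices is stored contiguously at the front of the arrays."""
    order = sorted(range(len(adj)), key=lambda v: -len(adj[v]))
    new_id = {old: new for new, old in enumerate(order)}
    # rewrite the adjacency lists in the new numbering
    return [[new_id[v] for v in adj[old]] for old in order], new_id
```

After relabeling, the hub vertices of a power-law graph occupy the lowest indices, so the per-vertex state words that MS-BFS touches most often share cache lines.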
Breadth-first search has a variety of applications in graph analytics problems. For example, in the computation of transitive closure, diameter, centrality metrics, etc., BFS from one or more vertices is performed to derive the desired metric. We discuss below a few research works that have used multi-source BFS to obtain betweenness centrality solutions.
[EtaGraph] The algorithm proposed by Wang et al. [13] is EtaGraph, which overlaps data transfer with kernel execution. To balance the workload among threads for small-world graphs, a graph partitioning method called "Unified Degree Cut" (UDC) is used. This scheme sets an upper bound on the out-degree of each vertex so that no thread gets disproportionately more work. The graph, in a transformed CSR format, is brought into the GPU on the fly, and data required for the next iteration is prefetched into shared memory. Though the paper uses datasets which fit into memory, the algorithm can be applied to graphs which do not fit in GPU memory.
[Scalable and high performance Betweenness centrality on GPUs] A work-efficient algorithm for betweenness centrality is proposed by McLaughlin and Bader [14] that works well for networks with large diameters. The approach uses an explicit queue for graph traversal and discards the predecessor array used in previous algorithms [17], [18]. Threads are assigned to the elements of the frontier queue, and atomic compare-and-swap operations ensure that no two threads insert the same vertex into the next frontier queue. Since this approach may still suffer from load imbalance among threads because of the properties of scale-free networks, a hybrid method is used for selecting the parallelization strategy: if the size of the next frontier queue is greater than a threshold, the edge-parallel method is used; otherwise the work-efficient method is the better approach. The strategy is changed only when there is a significant difference between the frontier queues of the current level and the next level.
[Betweenness Centrality on GPUs] The betweenness centrality (BC) metric is of importance in different types of networks, such as social networks and knowledge networks, and in tasks such as finding the best store locations in a city. Sarıyüce et al. [15] describe several techniques to speed up BC computations, with experiments on a 2-node NUMA cluster. GPU parallelism is applied to the vertex-based and edge-based variants of the two baseline methods. Vertices with higher degrees are divided into virtual vertices, each having at most "mdeg" outgoing edges. This vertex virtualization is applied to both the vertex-based and edge-based algorithms, whose parallelism otherwise suffers from load imbalance and higher memory usage, respectively. In addition, vertices with degree 1 are removed from the graph and their information is kept with their predecessors, which saves a lot of space and computation time.
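The virtualization step can be sketched as follows (Python; illustrative names, simply splitting each adjacency list into chunks of at most mdeg edges):

```python
def virtualize_vertices(adj, mdeg):
    """Split each vertex of degree > mdeg into virtual copies with at most
    mdeg outgoing edges each, so one thread per (virtual) vertex sees a
    bounded amount of work; owner maps each copy back to its real vertex."""
    virt_edges, owner = [], []
    for u, nbrs in enumerate(adj):
        for k in range(0, max(len(nbrs), 1), mdeg):
            owner.append(u)                    # real vertex behind this copy
            virt_edges.append(nbrs[k:k + mdeg])
    return virt_edges, owner
```

A hub of degree 4 with mdeg = 2 becomes two virtual vertices of degree 2 each; results computed per copy are then accumulated onto the owning real vertex.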
[iCENTRAL] Many real-world graphs, representing networks of transactions of various natures, social interactions, and engagements, are dynamic: the data in these graphs change over time, i.e., a few nodes and edges are added or deleted with time. The iCENTRAL algorithm by Jamour et al. [16] is a parallel algorithm for betweenness centrality on evolving graphs. The graph is decomposed into biconnected components, and the betweenness centrality values are updated only for the vertices which are affected by the insertion/deletion of an edge. Betweenness centrality needs all-pairs shortest path information, which requires the breadth-first search DAGs of all the nodes; with the help of biconnected components, the authors identify the DAGs that remain intact. The overall complexity of iCENTRAL is O(|Q|·|E_B∪e|) time and O(|V|+|E|) space, where Q is the set of all the nodes whose betweenness values change with the insertion of edge e and E_B∪e is the set of edges of the affected biconnected component after the insertion/deletion. The parallel version of iCENTRAL is implemented using MPI and evaluated on various graph datasets on Intel Xeon CPUs. Results are shown for a distributed system of 19 machines, where it is compared with other state-of-the-art algorithms.
[iFUB] The diameter of a graph is defined as the maximum of the eccentricities over all nodes. Hence, the general approach requires an all-pairs shortest path computation, which is computationally expensive. The algorithm defined in [17] selects a few nodes for BFS computation and provides a termination condition under which these BFSs suffice for diameter computation. The termination condition rests on the following observation: if the eccentricity of a node is greater than 2(BFS_level - 1), then the levels above this level need not be evaluated. The algorithm thus maintains a lower bound and an upper bound on the diameter.
The k nodes are selected by one of the following methods: 1) random selection, 2) nodes with the highest degree, 3) the 4-sweep method. The 4-sweep method selects a node which is central in the graph, that is, one with low eccentricity.
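To make the bounds concrete: a single BFS from a node s already sandwiches the diameter as ecc(s) <= D <= 2*ecc(s); iFUB's contribution is to tighten such bounds with BFSs from a few well-chosen nodes until they meet. A minimal sketch of these starting bounds (Python; a connected, unweighted graph is assumed, and this is not the full iFUB refinement):

```python
from collections import deque

def ecc(adj, s):
    """Eccentricity of s: the largest BFS distance from s."""
    dist = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return max(dist.values())

def diameter_bounds(adj, s):
    """One BFS gives ecc(s) <= diameter <= 2 * ecc(s)."""
    e = ecc(adj, s)
    return e, 2 * e
```

On the path graph 0-1-2-3, starting from the central vertex 1 gives the bounds (2, 4) around the true diameter 3, which is why iFUB prefers central (low-eccentricity) starting nodes such as those found by 4-sweep.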
4 Proposed Research Plan
4.1 Hybrid CSR Representation
For large real-world sparse graphs, the common approach to storing the adjacency matrix is the CSR format. To store graphs which are unweighted or have uniform weights, we have designed a new representation called hybrid CSR. The format is motivated by the observation that in an unweighted graph an edge is represented by a '0' or a '1'; we therefore represent each edge as a single bit in the hybrid-CSR format.
Consider a word size of 4 bits for the graph in Fig. 5. Since the graph has 7 vertices, we must pad the adjacency matrix with an all-zero 8th column so that each row packs into exactly two 4-bit words. The storage requirement for this compact matrix is 7*2*4 = 56 bits, whereas the adjacency matrix, with one 4-bit word per entry, would require 7*7*4 = 196 bits. The compact matrix of this graph G is as follows:
Table 2. Compact matrix of graph G
Now we store only the nonzero words of the compact matrix, just as in the CSR format.
Working with large graph datasets that cannot fit in GPU memory at once is a challenge. Hybrid-CSR is an alternative graph representation that reduces the memory footprint by a factor of the word size. This factor is significant when the GPU has limited memory in which to store an entire graph of millions of nodes. For example, with 32-bit integers, as on virtually any current architecture, and a sparse graph of size 40 x 40, the total space required by the adjacency matrix is 40 * 40 * 4 Bytes = 6400 Bytes, whereas the compact matrix requires only 40 * 2 * 4 Bytes = 320 Bytes. If NNZ and nnz are the number of nonzeros in the adjacency matrix and the compact matrix respectively, then the space complexity of hybrid-CSR is O(2*nnz + n + 1), where nnz << NNZ.
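The packing just described can be sketched as a host-side construction; this is an illustrative Python version (array names mirror the CSR convention; the exact layout of our implementation may differ). Each row of the adjacency matrix is packed into w-bit words, and only the nonzero words are stored in CSR form:

```python
def to_hybrid_csr(adj_rows, n_cols, w=32):
    """Pack each adjacency-matrix row into w-bit words, then keep only
    the nonzero words in CSR form: (rowptr, colind, val).
    adj_rows: one neighbor list per row; columns are padded to a
    multiple of the word size w."""
    n_words = (n_cols + w - 1) // w
    rowptr, colind, val = [0], [], []
    for nbrs in adj_rows:
        words = [0] * n_words
        for j in nbrs:                 # set bit j of this row
            words[j // w] |= 1 << (j % w)
        for wc, word in enumerate(words):
            if word:                   # store nonzero words only
                colind.append(wc)      # word-column index
                val.append(word)       # packed edge bits
        rowptr.append(len(colind))
    return rowptr, colind, val
```

For instance, a 3-node graph with 4-bit words packs the row with neighbors {1, 2} into the single word 0b0110, and rows whose words are all zero contribute nothing to `colind`/`val`.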
The hybrid-CSR format gives a good performance gain through memory coalescing, since more informative data can be brought into shared memory per transaction. The bitwise operations described in our algorithm in section 4.2 further improve the execution time of the SpMM computation.
4.2 In-core Multi-source BFS
Our approach to multi-source BFS is inspired by linear algebra-based sparse matrix-matrix multiplication (SpMM). As suggested in [6], repeated multiplication of the graph matrix G with a sparse vector x yields the BFS traversal of the graph, where x(i) = 1 and x(j) = 0 for j != i, with i the start node. Similarly, multi-source BFS can be computed using iterative SpMM between the adjacency matrix and a matrix X that holds one source node per column. Then Y = G^T * X picks out, for each column of X, those rows of G that contain the neighbors of the corresponding source node; multiplying Y with G^T gives the nodes two steps away, and so on.
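The idea can be sketched with node sets held as integer bitsets, so that each level-synchronous step is a boolean matrix-vector product y_i = OR_j (G[i][j] AND x_j). This is a hedged sequential illustration of the principle, not the GPU kernel:

```python
def multi_source_bfs(adj, sources):
    """Level-synchronous BFS from several sources, phrased as repeated
    boolean products over int bitsets. adj: list of neighbor lists.
    Returns levels[k][v] = distance from sources[k] to v (-1 if unreached)."""
    n = len(adj)
    rows = [sum(1 << j for j in adj[i]) for i in range(n)]  # row i of G as a bitset
    levels = []
    for s in sources:
        dist = [-1] * n
        dist[s] = 0
        visited = frontier = 1 << s
        d = 0
        while frontier:
            d += 1
            nxt = 0
            for i in range(n):
                # node i joins the next frontier iff row i intersects the frontier
                if not (visited >> i) & 1 and rows[i] & frontier:
                    nxt |= 1 << i
                    dist[i] = d
            visited |= nxt
            frontier = nxt
        levels.append(dist)
    return levels
```

Each pass over the rows corresponds to one SpMM step; the frontier bitset plays the role of one column of X.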
The algorithm below is our initial implementation of multi-BFS using the hybrid-CSR format:
Algorithm: Hybrid-CSR-based Multi-BFS algorithm
Input: Graph G^T of size N x M (in hybrid-CSR format), matrix X of size N x M (in compact-matrix format)
Output: Matrix Y (in compact-matrix format)
1. procedure Multi-BFS
2. for each row rowG of G^T in parallel do // one thread per row
3. bit <-- rowG * M // flat bit offset of row rowG in Y
4. for each row rowX in X do
5. sum <-- 0
6. for k from ROWPTR[rowG] to ROWPTR[rowG+1]-1 do
7. col <-- COLIND[k]
8. sum <-- sum OR (G[rowG][col] AND X[rowX][col])
9. if (sum != 0) then
10. SetBit Y[bit]
11. break
12. end if
13. end for
14. bit <-- bit + 1
15. end for
16. end for
17. end procedure
The kernel above is executed by one GPU thread per row of G^T. To compute one bit of Y, a row of G^T is ANDed bitwise with a row of X, but at line 6 we perform the bitwise AND only at the positions of the nonzero entries. Lines 9 to 12 check whether sum has become nonzero at any point; if so, we set the corresponding bit of Y and need not do any further computation for that row.
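A sequential emulation of this per-thread work may make the early-exit behavior concrete; the word width, data layout, and names below are illustrative assumptions rather than our CUDA implementation:

```python
def multi_bfs_step(rowptr, colind, val, X, w=32):
    """One Multi-BFS step. G^T is given in hybrid-CSR form: val[k] is a
    packed w-bit word, colind[k] its word-column. X and Y are lists of
    packed-word rows. Bit rowX of Y[rowG] is set iff row rowG of G^T
    shares a set bit with row rowX of X."""
    n_rows, n_x = len(rowptr) - 1, len(X)
    Y = [[0] * ((n_x + w - 1) // w) for _ in range(n_rows)]
    for rowG in range(n_rows):                  # one "thread" per row of G^T
        for rowX in range(n_x):
            for k in range(rowptr[rowG], rowptr[rowG + 1]):
                if val[k] & X[rowX][colind[k]]:  # AND only at nonzero words
                    Y[rowG][rowX // w] |= 1 << (rowX % w)
                    break                        # early exit: one hit suffices
    return Y
```

The `break` mirrors lines 9-12 of the algorithm: once any word-wise AND is nonzero, the output bit is decided and the rest of the row is skipped.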
In each iteration, the number of bitwise operations is the same, because the A matrix has a fixed number of nnz elements and our algorithm performs the AND operation only on the non-zero elements. Therefore, even as the B matrix gets denser with each iteration, the number of bitwise operations does not increase, unlike the cuSPARSE matrix multiplication routine cusparseScsrmm() [21], which takes more execution time as matrix B gets denser.
4.2.1 Optimizations
Our Multi-BFS algorithm uses a single bit to represent an edge in graph G. The bitwise optimization used in our algorithm has shown a performance improvement over iterative matrix-matrix multiplication (using the cuSPARSE library). Bitwise operations are extremely simple and thus typically at least as fast as arithmetic operations, which are themselves built from bitwise logic in hardware. Our approach performs a minimal number of bitwise and logical operations, because they are applied only to the non-zero elements. Moreover, a full row-column multiplication is not needed, because a single true condition at an nnz entry is sufficient to set a bit.
The performance of the GPU can be maximized by exposing sufficient parallelism, coalesced memory access, and coherent execution within a warp [22]. Our proposed hybrid-CSR format improves memory coalescing by packing the bits that represent edges. Features of CUDA programming can help achieve further optimizations in our approach; for example, warp-level primitives such as __syncwarp() make communication among the threads of a warp faster. Since global memory access is slower than the other memory types available on the GPU, we maximize performance by efficiently using on-chip shared memory, which has higher bandwidth and lower latency than global memory. Shared memory is limited compared to the size of the graph dataset; coherence is achieved by proper thread organization within a warp.
4.3 Out-of-core Multi-source BFS
The size of a GPU's device memory is typically much smaller than the CPU's main memory, and this capacity limits large-scale graph processing. The significant issue in processing a very large graph that exceeds the capacity of device memory is managing the data on the GPU so that the solution remains coherent with minimum performance overhead. Another issue with traversal of a very large graph is the data access pattern, because these algorithms generally involve irregular accesses; progress can stall when only part of the graph data is resident. The term out-of-core used in this report signifies that the graph cannot be stored in GPU global memory.
Many state-of-the-art algorithms that we have studied in the literature (discussed in section 3) are in-core or use a cluster of GPUs for graph processing. The performance of multi-source graph traversal on very large graphs that cannot fit in GPU memory is not well investigated. For such graphs, the data transfer itself is a time-consuming process. In [16], the algorithm overlaps data transfer with kernel execution, which is a good approach to solving the problem asynchronously, but it has not been investigated for graphs that do not fit in GPU memory.
4.4 Applications
The problem of concurrent execution of multi-source BFS is interesting, as it has wide applications in many graph-analytic algorithms. We aim to use our proposed method for these applications. There is ample scope to examine problems that can benefit from our novel graph representation, hybrid-CSR, which is particularly suited to unweighted graphs. The approach can also be extended to graph algorithms on weighted static graphs and on weighted/unweighted dynamic graphs.
4.4.1 Applications to Static Graph Problems
Most current research on betweenness centrality targets unweighted static graphs [16], [15], where BFS is performed from all nodes. In practice, in static graphs, only a few nodes are of high importance, and one would like to know the betweenness centrality of only these k nodes. In such a case, BFS from all nodes underutilizes resources and memory. Given the limited memory of GPUs and the need to compute BFS from k nodes, the problem is relevant to many practical graph-analytics applications such as diameter computation, shortest-path computation, centrality-metrics computation, etc.
The application of finding shortest paths from multiple source nodes can be seen in information systems for rerouting emergency services or in road-traffic systems for finding alternate routes. This problem can be modeled as a multi-source Dijkstra computation, where we need to find the shortest paths from many source nodes.
A typical approach to multi-source Dijkstra is to connect all source nodes to a virtual source node and assign zero weights to the connecting edges. This can be an expensive operation when we are already dealing with large graphs. The faster, parallel approach used in multi-source BFS can optimize the solution of k-source Dijkstra's algorithm.
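A standard alternative to materializing the virtual node is to seed Dijkstra's priority queue with every source at distance zero, which is equivalent to the virtual-source construction without the extra edges. A minimal sequential Python sketch (adjacency lists of (neighbor, weight) pairs; names are illustrative):

```python
import heapq

def multi_source_dijkstra(adj, sources):
    """Dijkstra from many sources at once: seed the heap with every
    source at distance 0 instead of adding a virtual node with
    zero-weight edges. adj[u] = list of (v, weight) pairs.
    Returns the distance from each reachable node to its nearest source."""
    dist = {s: 0 for s in sources}
    heap = [(0, s) for s in sources]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                      # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```

Seeding the heap touches each source once, so no graph modification is needed even for very large inputs.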
4.4.2 Applications to Dynamic Graph Problems
Dynamic graphs can be viewed as a discrete sequence of static graphs. Dynamic graphs are ubiquitous
in computer science and other real-world applications. These graphs can be studied by specifying the
properties that remain invariant with time. Below are a few applications where multi-source BFS can be applied on dynamic graphs:
• Fault tolerance: Fault tolerance deals with maintaining a multiprocessor computer architecture and network when nodes/edges occasionally fail to operate. A reliable fault-tolerance system is required to keep the computer network reconfigurable and robust when a node/edge fails.
• Graph connectivity: Graph connectivity is the smallest number of nodes whose removal results in a disconnected graph. Conditional connectivity offers an exceptional field of further research on dynamic graphs.
• Centrality computation: Finding the nodes that are of most importance in a network is another application where BFS is needed. In social networks, which are dynamic in nature, the centrality values of nodes keep changing.
• Diameter computation: As edges or nodes keep arriving or departing in a computer network, natural graph parameters such as diameter, radius, and eccentricities are affected. Algorithms for these problems require shortest-path computations.
5 Concluding Remarks
We have identified a set of interesting and coherent problems in the space of graph algorithms on GPUs.
These problems have applications to important computations, such as diameter, centrality metrics, and shortest paths, which arise in domains including transportation networks, social network analysis, and the like. Therefore, we believe that investigating the above-mentioned problems is of high relevance in
the current context.
References
[1] T. Washio and H. Motoda, "State of the art of graph-based data mining," ACM SIGKDD Explorations
Newsletter, vol. 5, p. 59–68, 2003.
[2] R. Baeza-Yates and G. Valiente, "An image similarity measure based on graph matching," in Proceedings
Seventh International Symposium on String Processing and Information Retrieval. SPIRE 2000, 2000.
[3] F. Chierichetti, A. Epasto, R. Kumar, S. Lattanzi and V. Mirrokni, "Efficient algorithms for public-private
social networks," in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2015.
[4] S. Stuijk, T. Basten, M. C. W. Geilen and H. Corporaal, "Multiprocessor resource allocation for throughput-
constrained synchronous dataflow graphs," in Proceedings of the 44th annual Design Automation
Conference, 2007.
[5] J. Kepner, D. Bader, A. Buluç, J. Gilbert, T. Mattson and H. Meyerhenke, "Graphs, matrices, and the
GraphBLAS: Seven good reasons," Procedia Computer Science, vol. 51, p. 2453–2462, 2015.
[6] J. Kepner and J. Gilbert, Graph algorithms in the language of linear algebra, SIAM, 2011.
[7] H. Liu, H. H. Huang and Y. Hu, "ibfs: Concurrent breadth-first search on gpus," in Proceedings of the 2016
International Conference on Management of Data, 2016.
[8] M. Then, M. Kaufmann, F. Chirigati, T.-A. Hoang-Vu, K. Pham, A. Kemper, T. Neumann and H. T. Vo, "The
more the merrier: Efficient multi-source graph traversal," Proceedings of the VLDB Endowment, vol. 8, p.
449–460, 2014.
[9] M. Kaufmann, M. Then, A. Kemper and T. Neumann, "Parallel Array-Based Single-and Multi-Source Breadth
First Searches on Large Dense Graphs.," in EDBT, 2017.
[10] A. McLaughlin and D. A. Bader, "Fast execution of simultaneous breadth-first searches on sparse graphs,"
in 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), 2015.
[11] "Volta GPU architecture whitepaper. http://www.nvidia.com/object/volta-architecture-whitepaper.html".
[12] "CUDA development Toolkit. https://developer.nvidia.com/cuda-toolkit".
[13] S. Beamer, K. Asanovic and D. Patterson, "Direction-optimizing breadth-first search," in SC'12: Proceedings
of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.
[14] P. Wang, L. Zhang, C. Li and M. Guo, "Excavating the potential of gpu for accelerating graph traversal," in
2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019.
[15] A. E. Sariyüce, K. Kaya, E. Saule and Ü. V. Çatalyürek, "Betweenness centrality on GPUs and heterogeneous
architectures," in Proceedings of the 6th Workshop on General Purpose Processor Using Graphics
Processing Units, 2013.
[16] A. McLaughlin and D. A. Bader, "Scalable and high performance betweenness centrality on the GPU," in
SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage
and Analysis, 2014.
[17] Y. Jia, V. Lu, J. Hoberock, M. Garland and J. C. Hart, "Edge v. node parallelism for graph centrality metrics,"
in GPU Computing Gems Jade Edition, Elsevier, 2012, p. 15–28.
[18] Z. Shi and B. Zhang, "Fast network centrality analysis using GPUs," BMC bioinformatics, vol. 12, p. 1–7,
2011.
[19] F. Jamour, S. Skiadopoulos and P. Kalnis, "Parallel algorithm for incremental betweenness centrality on
large graphs," IEEE Transactions on Parallel and Distributed Systems, vol. 29, p. 659–672, 2017.
[20] P. Crescenzi, R. Grossi, M. Habib, L. Lanzi and A. Marino, "On computing the diameter of real-world
undirected graphs," Theoretical Computer Science, vol. 514, p. 84–95, 2013.
[21] "cuSPARSE library. https://developer.nvidia.com/cusparse".
[22] "CUDA Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html".
[23] P. Mahonen, J. Riihijarvi and M. Petrova, "Automatic channel allocation for small wireless local area
networks using graph colouring algorithm approach," in 2004 IEEE 15th International Symposium on
Personal, Indoor and Mobile Radio Communications (IEEE Cat. No. 04TH8754), 2004.