Comprehensive report on “Parallel algorithms for multi-source graph traversal and its
applications”
by
Seema Simoliya
Program: Ph.D. (CSE)
Advisor: Dr. Kishore Kothapalli
Center for Security, Theory, and Algorithmic Research (C-STAR)
International Institute of Information Technology-Hyderabad, Gachibowli,
Hyderabad, India – 500 032.
seema.simoliya@research.iiit.ac.in
Table of Contents
1 Introduction
2 Graphics Processing Unit
2.1 Modern GPU Computing Architecture
2.2 Memory Layout
2.3 CUDA
3 Literature Survey
3.1 Representation of Sparse Graphs
3.2 List of Selected Research Papers
3.3 Single-Source Breadth-First Search
3.4 Multi-source Breadth-First Search
4 Proposed Research Plan
4.1 Hybrid CSR Representation
4.2 In-core Multi-source BFS
4.2.1 Optimizations
4.3 Out-of-core Multi-source BFS
4.4 Applications
4.4.1 Applications to Static Graph Problems
4.4.2 Applications to Dynamic Graph Problems
5 Concluding Remarks
References
1 Introduction
Many scientific and technical problems involve large volumes of data with a networked structure, and such
problems can be naturally represented as graph data structures. Graphs have proven their importance in many
varied fields, especially in computer science applications such as data mining [1], image processing [2],
networking [3], resource allocation [4], etc. This motivates the development of new algorithms and new
theorems that can be applied across a tremendous range of applications. Graphs provide a simple and flexible
abstract view of the discrete objects in the problem domain.
In many problem settings the graph is so large that a different approach is required to process, store, and
represent it. Large graph datasets, with billions of vertices and trillions of edges, arise in social networks,
citation networks, web graphs, road networks, etc. For computations on these graphs, parallel computing
appears to be necessary to overcome the restrictions of limited resources and large data volumes. Parallel and
data-intensive algorithms often require fast computational resources for bulk processing. This can be achieved
by augmenting general-purpose computers with specialized hardware such as field-programmable gate arrays
(FPGAs), application-specific integrated circuits (ASICs), vector processors, or graphics processing units (GPUs).
GPUs have become common in personal computers, supercomputers, and datacenters; hence researchers
and application developers are motivated to design algorithms that exploit the architectural advantages of the
GPU. Although the GPU evolved from the gaming world's increasing demand for realistic graphics rendering, it
has become a platform of choice for data-parallel and computationally expensive workloads. GPUs provide
highly scalable and cost-effective solutions for high-performance computing. GPGPU (general-purpose
computing on graphics processing units) refers to the programmable use of the GPU beyond its traditional
purpose of computation for computer graphics. Running graph algorithms on large datasets is challenging on
GPUs because such algorithms have irregular memory access patterns and throughput-intensive requirements.
Combining the power of both GPU and CPU can increase the performance of graph algorithms.
Real-world graphs are sparse [5]; therefore, as opposed to dense graphs, sparse graphs are of more practical
interest to researchers. Graph algorithms that work well on dense graphs may not be suitable for sparse graphs,
because on sparse graphs these algorithms often exhibit poor locality of memory access, uneven workload per
vertex, and a degree of parallelism that varies over the course of execution. A better graph representation and
proper load balancing can address these issues in parallel settings. Graphs are usually stored in adjacency-matrix
or adjacency-list format. It is observed that when graphs are represented as matrices, they tend to be more
manageable, concise, and expressive, with a reduced memory footprint compared to vertex-centric frameworks.
A few notable representations for sparse graphs are discussed in Section 3.1.
Traversal of real-world sparse graphs is one of the core techniques used in the area of graph analytics.
Depth-first search and breadth-first search are the two most common traversal algorithms in this area; of the
two, breadth-first search is considered well suited to GPUs because all the vertices on a single level can be
processed independently in parallel. [6] describes how a BFS from a single source can be computed by
multiplying the sparse adjacency matrix with a vector (SpMV). GraphBLAS is an API specification based on the
notion that various graph problems can be solved using linear algebra. With the popularity of GraphBLAS,
various parallel breadth-first search techniques have been developed on different platforms by exploiting
different architectural and programming optimizations. Different variants of BFS are studied to solve graph
problems: problems like all-pairs shortest paths (APSP), diameter computation, betweenness centrality,
reachability querying, etc., require one or more executions of breadth-first search. For example, APSP requires
a BFS from each node to find the shortest distances between every pair of vertices. Multi-source BFS is another
variant in which BFS is performed simultaneously from multiple source nodes (possibly far fewer than all vertices).
A few parallel solutions exist [7], [8], [9], [10] that address the simultaneous execution of graph traversals
from more than one source node on CPU and GPU architectures. Each approach is discussed in detail in the
literature survey of Section 3. These algorithms run on different platforms and have a few limitations, which
are addressed by our approach suggested in Section 4. The many applications of parallel multi-source
breadth-first search, together with the limitations of existing solutions, encourage the search for alternate
solutions.
In the literature survey, I have studied the various versions of graph traversal problems, such as single-source
BFS and multi-source BFS, and their applications. These problems have sequential as well as parallel solutions.
Inspired by the significance of the functionality offered by GraphBLAS, in my research work I aim to design
parallel algorithms (on GPUs) for the problems listed below:
• Multi-source BFS using the linear-algebra method in the following scenarios:
   o When the graph can or cannot fit in the GPU memory
   o Possibility of overlapped execution between data transfer and BFS computation
   o Memory-saving compact representation of the graph, such as hybrid-CSR
   o On static graphs and dynamic graphs
• Shortest path computation from multiple sources.
• Graph centrality calculation: betweenness centrality of a few k nodes, which can effectively use our
proposed hybrid-CSR representation and multi-source BFS.
• Diameter calculation in static and dynamic graphs.
This report is organized as follows: Section 2 describes the architecture of GPUs and the CUDA programming
model used in our research work. Section 3 gives details about the literature in the above-listed problem
areas. Section 4 gives more detail about these problems and our approach to solving them.
2 Graphics Processing Unit
2.1 Modern GPU Computing Architecture
Fig. 1 illustrates a high-level architectural overview of the NVIDIA Tesla V100 [11]. It is an extremely power-
efficient processor that packs 21.1 billion transistors yet delivers exceptional performance per watt. The
GPU is composed of multiple GPU Processing Clusters (GPCs), Texture Processing Clusters (TPCs),
Streaming Multiprocessors (SMs), and memory controllers. Each GPC consists of a group of TPCs, and each
TPC is made up of several SMs, a texture unit, and control logic.
Fig.1 Volta GV100 Full GPU with 84 SM Units [11]
The Streaming Multiprocessor in the Volta architecture, unlike its predecessors, is additionally provided
with tensor cores, a combined L1 data cache and shared memory, and a set of registers. Fig. 2 shows
the architecture of a Streaming Multiprocessor. Tensor cores are an important feature developed to train
large neural networks, for which matrix-matrix multiplication is the core operation. Each tensor core can
perform a 4x4 matrix multiply-accumulate, i.e., 64 floating-point FMA operations, in one clock cycle, which
significantly boosts the performance of these operations. Shared memory provides high bandwidth and low
latency. Combining the L1 data cache with shared memory allows L1 cache operations to attain the benefits
of shared-memory performance.
The Volta architecture also comes with independent thread scheduling, which means each thread has its own
program counter. This makes programming these GPUs easier and more productive.
Fig. 2 Streaming Multiprocessor of V100 [11]
GPUs execute a group of 32 threads (called a warp) in SIMT (Single Instruction, Multiple Threads) fashion.
In Volta, every thread in a warp maintains its own execution state, enabling concurrency among all the
threads regardless of warp. Fig. 3 illustrates the independent thread scheduling. A schedule optimizer
groups active threads from the same warp together into SIMT units.
Fig. 3 Independent thread scheduling architecture block diagram [11]
2.2 Memory Layout
When writing programs, we group threads into a "block". Each SM has its own limit on the number of
threads in a thread block. A thread has its own local memory. All the threads in a thread block can access
high-performance on-chip shared memory. Thread blocks are arranged in a grid, and all threads of a grid
have access to global memory, to and from which we copy data from and to the CPU memory. Fig. 4 shows
the high-level memory architecture from a program's perspective. There are two additional read-only memory
spaces: constant memory and texture memory. These memory spaces are accessible to all the threads
and can be used according to the memory access pattern of the application.
Fig.4 Memory Hierarchy [12]
2.3 CUDA
"CUDA (Compute Unified Device Architecture) [12] is a parallel computing platform and programming
model developed by NVIDIA that makes general-purpose computing simple on GPUs". Programmers write in
familiar languages such as C and C++, or in the other languages supported by the NVIDIA CUDA toolkit.
CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier
synchronization. With the help of these abstractions, a problem can be divided into sub-problems that
can be solved independently by thread blocks, and each sub-problem can be further solved cooperatively
in parallel by the threads within a block. In CUDA C++, a function which runs on the GPU is called a kernel;
when launched, it is executed in parallel by CUDA threads. The number of threads in a block and the number
of blocks in a grid are defined by the programmer. Each thread and each block have a unique ID within their
block and grid, respectively. threadIdx is a 3-component vector, so that a thread can be identified within a
one-, two-, or three-dimensional thread block. Similarly, blockIdx is a 3-component vector identifying a block
within a one-, two-, or three-dimensional grid.
When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated
and distributed across SMs. Once a thread block is launched on an SM, all its warps remain resident until
all of them finish their execution. The threads of a block execute concurrently on that SM.
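To make these notions concrete, the following minimal CUDA C++ sketch (an illustrative toy example, not taken from our implementation) launches a kernel that adds two vectors; each thread derives a unique global index from blockIdx, blockDim, and threadIdx, and the <<<blocks, threads>>> syntax defines the grid and block sizes.

// Minimal CUDA C++ sketch (illustrative only): each thread computes one element.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global thread id
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);    // unified memory, for brevity
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocksPerGrid, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();         // wait for the kernel to finish

    printf("c[0] = %f\n", c[0]);     // expected 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}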
3 Literature Survey
3.1 Representation of Sparse Graphs
Substantial memory reductions can be realized when only the non-zero entries of a sparse graph are
stored. Consider the sparse graph in Fig. 5 and its adjacency matrix; below are a few sparse graph
representations which are commonly used:
Fig. 5. An Undirected unweighted graph
1. Adjacency List: This format stores the list of edges (neighbors) for each vertex. The memory
requirement for this representation is O(|V|+|E|). The programmer can decide which data
structure to use for this representation; for example, one can use a list of lists or an
array of lists to store the graph in Fig. 5.
2. Coordinate Format (COO): The sparse matrix A is assumed to be stored in row-major
order. The matrix can be represented with three arrays: cooVal, cooRowInd, cooColInd. The
size of each of these arrays is equal to the number of non-zero (nnz) values in the matrix A.
cooVal stores the nnz values in row-major order, cooRowInd stores the row indices of
these values, and cooColInd stores their column indices. With this format, we can quickly
reconstruct the matrix.
3. Compressed Sparse Row Format (CSR): The matrix can be represented with three
arrays: csrVal, csrRowPtr, csrColInd. csrVal stores the nnz values in row-major order,
csrRowPtr stores the indices at which each row begins in the csrVal array, and
csrColInd stores the column indices of these values. This provides fast row access
and is particularly good for matrix-vector multiplication. (A small worked example of the
COO and CSR arrays is sketched after this list.)
4. Compressed Sparse Column Format (CSC): This format is similar to CSR format
except that the matrix storage is in column major order.
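To make the formats concrete, the following C++ sketch builds the COO and CSR arrays for a small hypothetical 4-vertex unweighted graph (chosen purely for illustration; it is not the graph of Fig. 5).

// Illustrative sketch: COO and CSR arrays for a tiny 4-vertex unweighted graph.
// Undirected edges 0-1, 0-2, 1-2, 2-3 are stored symmetrically.
#include <vector>
#include <cstdio>

int main() {
    const int n = 4;
    // Adjacency lists of the hypothetical example graph.
    std::vector<std::vector<int>> adj = {{1, 2}, {0, 2}, {0, 1, 3}, {2}};

    // COO: one (row, col, val) triple per nonzero, in row-major order.
    std::vector<int> cooRowInd, cooColInd, cooVal;
    // CSR: csrRowPtr[r] is the index in csrColInd/csrVal where row r begins.
    std::vector<int> csrRowPtr = {0}, csrColInd, csrVal;

    for (int r = 0; r < n; ++r) {
        for (int c : adj[r]) {
            cooRowInd.push_back(r); cooColInd.push_back(c); cooVal.push_back(1);
            csrColInd.push_back(c); csrVal.push_back(1);
        }
        csrRowPtr.push_back((int)csrColInd.size());   // row r ends here
    }

    printf("nnz = %zu, csrRowPtr =", csrColInd.size());
    for (int p : csrRowPtr) printf(" %d", p);         // prints: 0 2 4 7 8
    printf("\n");
    return 0;
}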
3.2 List of selected Research Papers
A summary of the 9 publications selected from reputed conferences/journals, which exhibit notable work
done in the chosen area of research, is given below:
1. Reference: Liu, H., Huang, H. H., & Hu, Y. (2016, June). iBFS: Concurrent breadth-first search on GPUs. In Proceedings of the 2016 International Conference on Management of Data (pp. 403-416). [7]
   Reason: Sharing and grouping of frontier nodes across BFS instances; use of bitwise optimization when concurrently executing BFS from many source nodes.
   #Citations: 62

2. Reference: McLaughlin, A., & Bader, D. A. (2015, December). Fast execution of simultaneous breadth-first searches on sparse graphs. In 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS) (pp. 9-18). IEEE. [10]
   Reason: Describes a simple parallel multi-search abstraction which can be complemented with other graph analytics applications on a single GPU.
   #Citations: 9

3. Reference: Then, M., Kaufmann, M., Chirigati, F., Hoang-Vu, T. A., Pham, K., Kemper, A., ... & Vo, H. T. (2014). The more the merrier: Efficient multi-source graph traversal. Proceedings of the VLDB Endowment, 8(4), 449-460. [8]
   Reason: Proposes an algorithm to process multi-source BFS concurrently on a multicore CPU. The work leverages the properties of small-world graphs, where most of the vertices share commonality in the initial levels of the BFSs.
   #Citations: 54

4. Reference: Kaufmann, M., Then, M., Kemper, A., & Neumann, T. (2017). Parallel Array-Based Single- and Multi-Source Breadth First Searches on Large Dense Graphs. In EDBT (pp. 1-12). [9]
   Reason: Suggests two-phase parallelism for the top-down approach and for the bottom-up approach used in [13]. Single-source and multi-source parallel BFS are proposed for multi-socket NUMA-aware systems.
   #Citations: 5

5. Reference: Wang, P., Zhang, L., Li, C., & Guo, M. (2019, May). Excavating the potential of GPU for accelerating graph traversal. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 221-230). IEEE. [14]
   Reason: In the EtaGraph algorithm, overlapping data transfer with kernel execution during graph traversal improves the total execution time. A unique "Unified Degree Cut" scheme is used for load balancing, and prefetching is used to hide memory access latency.
   #Citations: 3

6. Reference: McLaughlin, A., & Bader, D. (2014). Scalable and high performance betweenness centrality on the GPU. In SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 572-583). [16]
   Reason: Suggests alternating between two methods of parallelism, edge-parallel and work-efficient, based on how significantly the frontier vertex set changes in each iteration.
   #Citations: 102

7. Reference: Sariyüce, A. E., Kaya, K., Saule, E., & Çatalyürek, Ü. V. (2013, March). Betweenness centrality on GPUs and heterogeneous architectures. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (pp. 76-85). [15]
   Reason: Various techniques to accelerate betweenness centrality computations, such as vertex virtualization, strided memory access, and a graph compression method in which vertices with degree 1 are removed. The algorithm makes use of both GPU and CPU.
   #Citations: 79

8. Reference: Jamour, F., Skiadopoulos, S., & Kalnis, P. (2017). Parallel algorithm for incremental betweenness centrality on large graphs. IEEE Transactions on Parallel and Distributed Systems, 29(3), 659-672. [19]
   Reason: Designed for evolving graphs; incremental betweenness centrality is computed only on the affected biconnected components.
   #Citations: 29

9. Reference: Crescenzi, P., Grossi, R., Habib, M., Lanzi, L., & Marino, A. (2013). On computing the diameter of real-world undirected graphs. Theoretical Computer Science, 514, 84-95. [20]
   Reason: Diameter computation usually requires a BFS from every node; this paper defines an algorithm which performs BFS from a few selected nodes and terminates the evaluation early when a certain condition is met.
   #Citations: 61
3.3 Single Source Breadth-First Search
A typical sequential breadth-first search begins from a source and then, at each level, explores all the vertices
adjacent to the "frontier" nodes. This direction of search is called top-down search. As opposed to
this, there is a "bottom-up" search, in which every unvisited node is examined as a candidate for
the next frontier; this search skips the nodes which cannot be part of the frontier in
the next level. A hybrid approach [13] using both top-down and bottom-up search turns out to be much
more efficient than either direction alone, since the combination of directions lets each level examine
fewer edges. The approach works well for single-source BFS on a multicore
CPU (according to the experiments in [13]). Numerous graph analytics algorithms require executing multiple
BFS traversals on the same graph from different source nodes. With advances in hardware
architectures and in the applications of BFS, the scope for exploring efficient multi-source BFS (MS-BFS)
algorithms has widened.
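For reference, a simplified sequential C++ sketch of one level of direction-optimized BFS is given below. It only illustrates the idea in [13]; the real algorithm chooses the direction with a heuristic that compares the number of edges incident to the frontier with the number incident to the unexplored part, and the helper name expandLevel is ours.

#include <vector>

// One BFS level over a CSR graph, expanded in either direction.
// dist[v] == -1 means "not yet visited"; returns the next frontier.
std::vector<int> expandLevel(const std::vector<int>& rowPtr,
                             const std::vector<int>& colInd,
                             std::vector<int>& dist,
                             const std::vector<int>& frontier,
                             int level, bool topDown) {
    std::vector<int> next;
    int n = (int)rowPtr.size() - 1;
    if (topDown) {
        // Top-down: scan only the edges of the current frontier.
        for (int u : frontier)
            for (int k = rowPtr[u]; k < rowPtr[u + 1]; ++k) {
                int v = colInd[k];
                if (dist[v] == -1) { dist[v] = level + 1; next.push_back(v); }
            }
    } else {
        // Bottom-up: every unvisited vertex looks for a parent in the frontier.
        for (int v = 0; v < n; ++v) {
            if (dist[v] != -1) continue;
            for (int k = rowPtr[v]; k < rowPtr[v + 1]; ++k)
                if (dist[colInd[k]] == level) {          // neighbor is a frontier node
                    dist[v] = level + 1; next.push_back(v); break;
                }
        }
    }
    return next;
}
// A hybrid BFS would call expandLevel with topDown = true while the frontier is
// small and switch to bottom-up once the frontier covers a large part of the graph.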
3.4 Multi-source Breadth First Search
[iBFS] A recent work in the area of parallel multi-source BFS on GPUs is proposed by Liu et al.
[7]. The proposed algorithm, iBFS, uses three novel techniques: joint traversal, GroupBy, and bitwise
optimization. Joint traversal allows the frontiers of different BFS instances to be shared, which reduces
memory latency. Grouping BFS instances based on the sharing ratio of nodes among them can also save
memory; here GroupBy selectively groups the BFS instances which have the most nodes in common.
In the initial levels of two BFS instances the frontier nodes are few, and if such nodes are shared then all
their edges are checked only once. This ensures maximum sharing among levels of different BFS
instances. Lastly, to handle billions of nodes, bitwise storage is used for the status arrays and bitwise
operations are performed using thousands of GPU threads. A large improvement is observed in
their work when the bitwise optimizations are applied.
[multi-BFS] McLaughlin et al. [10] proposed a multi-search abstraction which can be applied to
problems that require many simultaneous BFSs. The algorithm is shown to work for APSP and betweenness
centrality on a single GPU. Like the Gather-Apply-Scatter (GAS) paradigm, their approach can be
complemented with other applications. The multi-search abstraction has five major functions:
init(), prior(), visitvertex(), post(), and finalise(). init() initializes the data structures to begin with source i;
this step runs in parallel on the streaming multiprocessors. prior() is a pre-processing step that handles
any computation needed before a search iteration. visitvertex() updates information about a
vertex, e.g., its distance from source i, and the next vertex; this step needs cooperation
between the threads of a warp to atomically update the distance between two nodes. post() performs any
post-processing step, and lastly finalise() handles any final computation in the search. Since it is an
abstraction, one can benefit from code reuse when solving any problem which uses multiple BFSs.
Coarse-grained parallelism is achieved by running i multiple BFSs on i streaming multiprocessors, and
fine-grained parallelism is achieved by assigning warps to the active frontier vertices.
[The more the merrier] Multi-source BFS (MS-BFS) on a multi-core CPU is proposed by Then et al. [8],
which leverages the properties of small-world graphs. The algorithm allows the sharing of frontier nodes
among BFS instances. Initially each source marks itself as discovered, and all the adjacent vertices that
become visited enter the frontier. If all the BFS instances that have vertex v in their frontier are merged,
then v needs to be explored only once for all of those BFSs. A word of size equal to the register width, or to
the width of a cache line, is used for this per-vertex state as a space optimization. Aggregated neighbor
processing is used to reduce random memory accesses: all the vertices that need to be explored in the next
level are collected together, and the frontier nodes for the next level are prefetched.
Since the width of a register can be smaller than the number of BFS instances that need to be executed,
multiple registers are used for the multi-source BFS. A good heuristic for maximizing sharing is to
group BFSs according to their connected components. The algorithm is compared with non-parallel
direction-optimized BFS [13] and with textbook BFS.
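The bit-parallel core of this idea can be sketched as follows (a simplified sequential C++ illustration assuming 64 concurrent BFS instances packed in one 64-bit word; the aggregated neighbor processing and prefetching of [8] are omitted, and the function name msbfsLevel is ours).

#include <vector>
#include <cstdint>

// One MS-BFS level over a CSR graph: bit b of a word corresponds to BFS instance b.
void msbfsLevel(const std::vector<int>& rowPtr, const std::vector<int>& colInd,
                std::vector<uint64_t>& seen,       // BFS instances that have reached v
                std::vector<uint64_t>& frontier,   // BFS instances with v in the frontier
                std::vector<uint64_t>& next) {     // next frontier (output)
    int n = (int)rowPtr.size() - 1;
    for (int v = 0; v < n; ++v) {
        if (frontier[v] == 0) continue;            // v is in no frontier at this level
        for (int k = rowPtr[v]; k < rowPtr[v + 1]; ++k) {
            int w = colInd[k];
            uint64_t newBits = frontier[v] & ~seen[w];   // BFSs discovering w just now
            if (newBits) { next[w] |= newBits; seen[w] |= newBits; }
        }
    }
}
// Initialization: for BFS instance b with source s_b, set bit b of both seen[s_b]
// and frontier[s_b]; then iterate, swapping frontier and next, until every frontier
// word is zero. One word of work per visited edge serves up to 64 BFS instances.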
[Parallel array-based MS-BFS] This algorithm [9] extends the idea of the array-based BFS of [8] and
parallelizes it on a multi-socket NUMA-aware server. Cache hits are improved by relabeling the vertices
according to their degrees, so that the states of the higher-degree nodes are located close together. Work
partitioning among the workers is decided on this basis, so that all the threads are doing some work at any
time and all cores are utilized; this requires synchronization between threads. Since the vertex degrees of
real-world graphs follow a power-law distribution, there are a few vertices with high degree and many
vertices with low degree, so each thread cannot simply be given the same number of nodes to process.
Instead, each thread is given its own queue: for example, threads working on higher-degree vertices have a
smaller queue and threads working on lower-degree vertices have a larger queue.
Breadth-first search has a variety of applications in many graph analytics problems. For example, in the
calculation of transitive closure, diameter, centrality metrics, etc., a BFS from one or more vertices is
performed to compute the desired metric. We now discuss a few research works that have used
multi-source BFS to obtain betweenness centrality solutions.
[EtaGraph] The algorithm proposed by Wang et al. [14] is EtaGraph, which overlaps data transfer with
kernel execution. To balance the workload among threads for small-world graphs, a graph partition method
called "Unified Degree Cut" (UDC) is used; this scheme sets an upper bound on the out-degree assigned to
each thread so that no thread gets disproportionately more work. The graph, in a transformed CSR format,
is brought into the GPU on the fly, and the data required for the next iteration is prefetched into shared
memory. Though the paper uses datasets which fit into GPU memory, the technique could be applied to
graphs which do not fit in GPU memory.
[Scalable and high-performance betweenness centrality on GPUs] A work-efficient algorithm for
betweenness centrality is proposed by McLaughlin and Bader [16] that also works for networks with large
diameter. This approach uses an explicit queue for graph traversal and discards the predecessor array
used in previous algorithms [17], [18]. Threads are assigned to the elements of the frontier queue, and
atomic compare-and-swap operations ensure that no two threads insert the same vertex into the next
frontier queue. Because of the properties of scale-free networks, this approach may still suffer from load
imbalance among threads, so a hybrid method is used to select the parallelization strategy: if the size of
the next frontier queue is greater than a threshold, the edge-parallel method is used; otherwise the
work-efficient method is the better choice. The strategy is changed only when there is a significant
difference between the frontier queues of the current level and the next level.
[Betweenness centrality on GPUs] The betweenness centrality (BC) metric is of importance in different
types of networks, such as social networks, knowledge networks, finding the best store locations in cities,
etc. Sariyüce et al. [15] describe many techniques to speed up BC computations, evaluated on a cluster
with 2 NUMA nodes. GPU parallelism is applied to the vertex-based and edge-based variants of the two
baseline methods. Vertices with higher degrees are divided into virtual nodes, each having at most 'mdeg'
degree; this vertex virtualization is applied to both the vertex-based and the edge-based algorithms, since
the former suffers from load imbalance and the latter from higher memory usage. Vertices with degree 1
are removed from the graph and their information is folded into their predecessors. This approach saves a
lot of space and computation time.
[iCENTRAL] Many real-world graphs, which represent networks of transactions of various kinds, social
interactions, and engagements, are dynamic: the data of these graphs change over time, that is, nodes and
edges are added or deleted. The iCENTRAL algorithm by Jamour et al. [19] is a parallel algorithm for
betweenness centrality on evolving graphs. The graph is decomposed into biconnected components, and
the betweenness centrality values are updated only for the vertices that are affected by the insertion or
deletion of edges. Betweenness centrality needs all-pairs shortest path information, which requires the
breadth-first search DAGs of all the nodes; with the help of the biconnected components they identify the
DAGs that remain intact. The overall complexity of iCENTRAL is O(|Q|·|E_{B'e}|) time and O(|V|+|E|) space,
where Q is the set of nodes whose betweenness values change with the insertion of edge e and E_{B'e} is
the edge set of the affected biconnected component after the insertion/deletion. The parallel version of
iCENTRAL is implemented using MPI and evaluated on various graph datasets on Intel Xeon CPUs; results
are shown for a distributed system of 19 machines, where it is compared with other state-of-the-art
algorithms.
[iFUB] The diameter of a graph is defined as the maximum of the eccentricities over all nodes. Hence, a
general approach requires an all-pairs shortest path computation, which is computationally expensive. The
algorithm defined in [20] selects a few nodes for BFS computation and gives a termination condition under
which these BFSs suffice for the diameter computation. The termination condition rests on the following
observation: if the eccentricity found so far is greater than 2(BFS_level - 1), then the levels above this level
need not be evaluated. The algorithm thus maintains a lower bound and an upper bound on the diameter.
The k nodes are selected by one of the following methods: 1) random selection, 2) nodes with higher
degree, 3) the 4-Sweep method. The 4-Sweep method selects a node which is central in the graph, that is,
a node with low eccentricity.
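To illustrate the bounding idea only (a simplified sketch, not iFUB itself, which organizes the BFSs around a central node and processes vertices level by level), the following C++ fragment computes eccentricities from a few sampled nodes: the largest value is a lower bound on the diameter, and twice the smallest value is an upper bound, since dist(u,w) <= dist(u,v) + dist(v,w) <= 2*ecc(v) for any v.

#include <vector>
#include <queue>
#include <algorithm>
#include <utility>

// Eccentricity of 'src' in an unweighted CSR graph via a plain BFS.
int eccentricity(const std::vector<int>& rowPtr, const std::vector<int>& colInd, int src) {
    int n = (int)rowPtr.size() - 1, ecc = 0;
    std::vector<int> dist(n, -1);
    std::queue<int> q;
    dist[src] = 0; q.push(src);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        ecc = std::max(ecc, dist[u]);
        for (int k = rowPtr[u]; k < rowPtr[u + 1]; ++k) {
            int v = colInd[k];
            if (dist[v] == -1) { dist[v] = dist[u] + 1; q.push(v); }
        }
    }
    return ecc;                        // assumes the graph is connected
}

// Diameter bounds from a few sampled sources (simplified illustration).
std::pair<int, int> diameterBounds(const std::vector<int>& rowPtr,
                                   const std::vector<int>& colInd,
                                   const std::vector<int>& samples) {
    int lb = 0, ub = 1 << 30;
    for (int s : samples) {
        int e = eccentricity(rowPtr, colInd, s);
        lb = std::max(lb, e);          // some pair of nodes is at least this far apart
        ub = std::min(ub, 2 * e);      // diameter <= 2 * ecc(s) for any node s
    }
    return {lb, ub};                   // if lb == ub, the diameter is known exactly
}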
4 Proposed Research Plan
4.1 Hybrid CSR Representation
For large real-world sparse graphs, the common approach to storing the adjacency matrix is the CSR
format. To store graphs which are unweighted or have uniform weight, we have designed a new
representation called hybrid-CSR. The format is motivated by the observation that in an unweighted graph
an edge is represented by a '0' or a '1'; we therefore represent an edge as a single bit in the hybrid-CSR
format.
Consider a word size of 4 bits for the graph in Fig. 5. We must pad the adjacency matrix with an all-zero
8th column so that each row packs into whole words. The storage requirement for this compact matrix is
7*2*4 = 56 bits, whereas the (word-per-entry) adjacency matrix requires 7*7*4 = 196 bits. The compact
matrix of this graph G is as follows:
Table 2. Compact Matrix of graph G
Now, we store only the nonzero entries of the compact matrix, just as in the CSR format.
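The construction can be sketched as follows (an illustrative host-side C++ sketch assuming 32-bit words; the struct and function names are ours and not a finalized implementation): each adjacency-matrix row is packed into words, and CSR is then built over the nonzero words only.

#include <vector>
#include <cstdint>
#include <algorithm>

// Sketch of building a hybrid-CSR structure from adjacency lists.
struct HybridCSR {
    std::vector<int>      rowPtr;   // size n+1, offsets into val/colInd
    std::vector<uint32_t> val;      // nonzero 32-bit words (packed edge bits)
    std::vector<int>      colInd;   // word index of each nonzero word
};

HybridCSR buildHybridCSR(const std::vector<std::vector<int>>& adj) {
    int n = (int)adj.size();
    int wordsPerRow = (n + 31) / 32;               // the last word is zero-padded
    HybridCSR h;
    h.rowPtr.push_back(0);
    std::vector<uint32_t> row(wordsPerRow);
    for (int u = 0; u < n; ++u) {
        std::fill(row.begin(), row.end(), 0u);
        for (int v : adj[u]) row[v / 32] |= 1u << (v % 32);   // set the edge bit
        for (int w = 0; w < wordsPerRow; ++w)
            if (row[w]) { h.val.push_back(row[w]); h.colInd.push_back(w); }
        h.rowPtr.push_back((int)h.val.size());     // end of row u
    }
    return h;
}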
Working with large graph datasets which cannot fit in the memory of the GPU at once is a challenge.
Hybrid-CSR is an alternative graph representation which saves memory space by a factor of the word size.
This factor is significant when the GPU has limited memory relative to a graph of millions of nodes, so the
compact hybrid-CSR representation is an attractive way to save memory. For example, consider 32-bit
integer entries, standard on any architecture available today, and a sparse graph of size 40 x 40: the total
space required by the adjacency matrix is 40 * 40 * 4 Bytes = 6400 Bytes, whereas the compact matrix
requires only 40 * 2 * 4 Bytes = 320 Bytes. If NNZ and nnz are the numbers of nonzeros in the adjacency
matrix and the compact matrix respectively, then the space complexity of hybrid-CSR is O(2*nnz + n + 1),
where nnz << NNZ.
The hybrid-CSR format gives a good performance gain through memory coalescing, as more informative
data can be brought into shared memory per transaction. The bitwise operation described in our algorithm
in Section 4.2 further improves the execution time of the SpMM computation.
4.2 In-core Multi-source BFS
Our approach to multi-source BFS is inspired by linear-algebra-based sparse matrix-matrix multiplication
(SpMM). As suggested in [6], repeated multiplication of the graph matrix G with a sparse vector x yields the
BFS traversal of the graph, where x(i) = 1 and x(j) = 0 for j ≠ i, and i is the start node. Similarly, multi-source
BFS can be computed by iterating SpMM between the adjacency matrix and a matrix X whose columns
represent the source nodes. Then Y = G^T * X picks out, for each column of X, those rows of G which
contain the neighbors of the corresponding source node; for instance, if a column of X has a 1 only at
position i, the matching column of Y has 1s exactly at the neighbors of node i. Multiplying Y with G^T again
gives the nodes two steps away, and so on.
The algorithm below is our initial implementation of multi-BFS using the hybrid-CSR format:
Algorithm: Hybrid-CSR-based Multi-BFS algorithm
Input: Graph G^T of size N x M (in hybrid-CSR format), matrix X of size N x M (in compact-matrix format)
Output: Matrix Y (in compact-matrix format)
1.  Procedure Multi-BFS
2.    for each thread 'rowG' in G^T in parallel do
3.      for all 'rowX' in X do
4.        sum <-- 0
5.        bit <-- rowIndex * M
6.        for k from ROWPTR[rowG] to ROWPTR[rowG+1] do
7.          col <-- COLIND[k]
8.          sum OR= G^T[rowG][col] AND X[rowX][col]
9.          if (sum != 0) then
10.           SetBit Y[rowG][bit]
11.           break
12.         end if
13.       end for
14.       bit <-- bit + 1
15.     end for
16.   end for
17. end Procedure
This kernel is executed by each thread on the GPU. For the calculation of one bit of Y, one row of G^T is
ANDed bitwise with one row of X; at line 6, however, we only perform the bitwise AND at the positions of
nonzero entries. Lines 9 to 12 check whether the sum has become nonzero; if so, we set the corresponding
bit of Y and need not do any further computation on that row.
In each iteration the number of bitwise operations is the same, because the number of nnz elements in the
graph matrix (the A operand, G^T) is fixed and our algorithm only performs AND operations on non-zero
elements. Therefore, even as the frontier matrix (the B operand, X) gets denser in each iteration, the
number of bitwise operations does not increase, unlike the cuSPARSE matrix multiplication routine
cusparseScsrmm() [21], which takes more execution time as the matrix B gets denser.
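A hedged CUDA C++ sketch of this kernel is shown below. It assumes G^T is stored in the hybrid-CSR arrays described in Section 4.1, that X is stored with one packed row of bits per BFS instance (i.e., one mathematical column of X laid out contiguously), and that Y holds one packed row per graph vertex with one bit per BFS instance; parameter names such as numWordsX are illustrative, not part of our finalized code.

#include <cstdint>

// One thread per row of G^T. For every BFS instance j, the thread ANDs the
// nonzero packed words of its row with the matching words of X's row j and
// sets bit j of Y's row as soon as any AND is nonzero (illustrative sketch).
__global__ void multiBfsStep(const int* __restrict__ rowPtr,     // hybrid-CSR of G^T
                             const int* __restrict__ colInd,     // word indices
                             const uint32_t* __restrict__ val,   // packed edge words
                             const uint32_t* __restrict__ X,     // numBfs x numWordsX, packed
                             uint32_t* Y,                        // numRows x numWordsY, packed
                             int numRows, int numBfs,
                             int numWordsX, int numWordsY) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= numRows) return;
    for (int j = 0; j < numBfs; ++j) {
        uint32_t sum = 0;
        for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k) {
            sum |= val[k] & X[j * numWordsX + colInd[k]];   // only nonzero words of G^T
            if (sum) break;                                 // one hit is enough to set the bit
        }
        if (sum) Y[row * numWordsY + j / 32] |= 1u << (j % 32);
    }
}
// Each thread writes only to its own row of Y, so no atomics are needed here.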
4.2.1 Optimizations
Our Multi-BFS algorithm uses a single bit to represent an edge of the graph G. The bitwise optimization
used in our algorithm has shown a performance improvement over an iterative matrix-matrix multiplication
using the cuSPARSE library. Bitwise operations are very cheap, and thus no slower than arithmetic
operations, since arithmetic operations are themselves built from bitwise logic. Our approach uses a
minimal number of bitwise and logical operations because the operation is only performed on the non-zero
elements; moreover, an entire row-column multiplication is not needed, because a single true condition at
an nnz entry is sufficient to set a bit.
The performance of the GPU can be maximized by exposing sufficient parallelism, coalesced memory
access, and coherent execution within a warp [22]. Our proposed hybrid-CSR format improves memory
coalescing by packing the bits representing edges. Features of CUDA programming can help achieve
various optimizations in our approach; for instance, warp-level synchronization primitives (such as
__syncwarp()) can make communication among the threads of a warp faster. Since global memory access
is slower than the other memory types available on the GPU, to maximize performance we make efficient
use of on-chip shared memory, which has higher bandwidth and lower latency than global memory. Shared
memory is limited compared to the size of the graph dataset, so coherence is achieved by proper thread
organization within a warp.
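As one possible illustration of this shared-memory reuse (a sketch under our assumptions, not our final kernel; MAX_WORDS and the per-instance staging scheme are illustrative choices), each block can stage the packed row of X for the current BFS instance in shared memory so that all of its threads read it from on-chip storage instead of global memory.

#include <cstdint>

// Sketch: stage X's row for BFS instance j in shared memory for the whole block.
// Assumes numWordsX <= MAX_WORDS; a real kernel would tile larger rows.
#define MAX_WORDS 1024

__global__ void multiBfsStepShared(const int* rowPtr, const int* colInd,
                                   const uint32_t* val, const uint32_t* X,
                                   uint32_t* Y, int numRows, int numBfs,
                                   int numWordsX, int numWordsY) {
    __shared__ uint32_t xRow[MAX_WORDS];
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    for (int j = 0; j < numBfs; ++j) {
        // Cooperative load of X's row j into shared memory.
        for (int w = threadIdx.x; w < numWordsX; w += blockDim.x)
            xRow[w] = X[j * numWordsX + w];
        __syncthreads();                       // the row is ready for the whole block
        if (row < numRows) {
            uint32_t sum = 0;
            for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k) {
                sum |= val[k] & xRow[colInd[k]];
                if (sum) break;
            }
            if (sum) Y[row * numWordsY + j / 32] |= 1u << (j % 32);
        }
        __syncthreads();                       // before the next instance overwrites xRow
    }
}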
4.3 Out-of-core Multi-source BFS
The size of a GPU's device memory is typically much smaller than the CPU's main memory, and this
capacity limits the processing of large-scale graphs. The significant issue in processing a very large graph
is the management, on the GPU, of data that exceeds the capacity of device memory, while still maintaining
a correct solution with minimum performance overhead. Another issue with graph traversal on a very large
graph is the data access pattern, because these algorithms generally involve irregular data accesses; this
causes delays when only part of the graph data is available on the device. The term out-of-core used in
this report signifies that the graph cannot be stored entirely in the GPU's global memory.
Many of the state-of-the-art algorithms that we have studied in the literature (discussed in Section 3) are
in-core or use a cluster of GPUs for graph processing. The performance of multi-source graph traversal on
very large graphs that cannot fit in the GPU memory is not well investigated. For such very large graphs,
the data transfer itself is a time-consuming process. In [14], the algorithm overlaps the data transfer with
the kernel execution, which is a good approach to solving the problem asynchronously, but this approach
has not been investigated for graphs which do not fit in the GPU memory.
4.4 Applications
The problem of concurrently executing BFS from multiple sources is interesting because it has wide
applications in many graph analytics algorithms, and we aim to use our proposed method for these
applications. There is a lot of scope for problems which can benefit from our novel graph representation
technique, hybrid-CSR, which is particularly suitable for unweighted graphs. The approach can also be
extended to graph algorithms on weighted static graphs and on weighted/unweighted dynamic graphs.
4.4.1 Applications to Static Graph Problems
Most of the current research work on betweenness centrality targets unweighted static graphs [16], [15],
where the BFS is performed from all the nodes. In practice, in static graphs only a few nodes are of high
importance, and one would like to know the betweenness centrality of only these few k nodes. For such a
case, a BFS from all nodes underutilizes resources and memory. Given the limited memory of GPUs and
the need to compute BFS from only k nodes, the problem is relevant to many practical graph analytics
applications such as diameter computation, shortest path computation, centrality metrics computation, etc.
Finding the shortest paths from multiple source nodes has applications in information systems for rerouting
emergency services and in road traffic systems for finding alternate routes. This problem can be cast as a
multi-source Dijkstra computation, where we need to find the shortest path from many source nodes.
A typical approach to the multi-source Dijkstra problem is to connect all the source nodes to a virtual
source node and assign zero weights to the connecting edges (a sketch of this classical reduction follows
below). This can be an expensive operation when we are already dealing with large graphs. A faster,
parallel approach based on multi-source BFS can optimize the solution of the k-source Dijkstra problem.
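For completeness, a sequential C++ sketch of this classical virtual-source reduction is given below (illustrative only; our proposed approach replaces it with a parallel multi-source formulation on the GPU).

#include <vector>
#include <queue>
#include <limits>
#include <utility>

// Sketch of the virtual-source reduction for multi-source shortest paths:
// add a node connected to every source with weight 0 and run one Dijkstra.
using Edge = std::pair<int, int>;   // (neighbor, weight)

std::vector<long long> multiSourceDijkstra(std::vector<std::vector<Edge>> adj,   // copied so we can extend it
                                           const std::vector<int>& sources) {
    int n = (int)adj.size();
    adj.emplace_back();                               // virtual source = node n
    for (int s : sources) adj[n].push_back({s, 0});   // zero-weight edges to all sources
    const long long INF = std::numeric_limits<long long>::max();
    std::vector<long long> dist(n + 1, INF);
    std::priority_queue<std::pair<long long, int>,
                        std::vector<std::pair<long long, int>>,
                        std::greater<>> pq;
    dist[n] = 0; pq.push({0, n});
    while (!pq.empty()) {
        auto [d, u] = pq.top(); pq.pop();
        if (d != dist[u]) continue;                   // stale queue entry
        for (auto [v, w] : adj[u])
            if (d + w < dist[v]) { dist[v] = d + w; pq.push({dist[v], v}); }
    }
    dist.pop_back();                                  // drop the virtual node
    return dist;                                      // distance of each vertex to its nearest source
}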
4.4.2 Applications to Dynamic Graph Problems
Dynamic graphs can be viewed as a discrete sequence of static graphs. They are ubiquitous in computer
science and other real-world applications, and can be studied by specifying the properties that remain
invariant over time. Below are a few applications where multi-source BFS can be applied on dynamic graphs:
• Fault tolerance: Fault tolerance deals with maintaining a multiprocessor computer
architecture and its network when nodes or edges occasionally fail to operate. A reliable fault-
tolerance system is required to keep the computer network reconfigurable and robust when
there is a failure of a node or edge.
• Graph connectivity: Graph connectivity is the smallest number of nodes whose removal
results in a disconnected graph. Conditional connectivity offers an exceptional field of further
research on dynamic graphs.
• Centrality computation: Finding the nodes which are of higher importance in the network
is another application where BFS is needed. In social networks, which are dynamic in nature,
the centrality values of nodes keep changing.
• Diameter computation: When new edges or nodes keep appearing or disappearing in a
computer network, natural graph parameters such as the diameter, radius, and eccentricities
are affected. Algorithms for these problems require shortest path computations.
5 Concluding Remarks
We have identified a set of interesting and coherent problems in the space of graph algorithms on GPUs.
These problems have applications to important computations such as the diameter, centrality metrics,
shortest paths, etc., which arise in domains including transportation networks, social network analysis, and
the like. Therefore, we believe that investigating the above-mentioned problems is of high relevance in
the current context.
References
[1] T. Washio and H. Motoda, "State of the art of graph-based data mining," Acm Sigkdd Explorations
Newsletter, vol. 5, p. 59–68, 2003.
[2] R. Baeza-Yates and G. Valiente, "An image similarity measure based on graph matching," in Proceedings
Seventh International Symposium on String Processing and Information Retrieval. SPIRE 2000, 2000.
[3] F. Chierichetti, A. Epasto, R. Kumar, S. Lattanzi and V. Mirrokni, "Efficient algorithms for public-private
social networks," in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2015.
[4] S. Stuijk, T. Basten, M. C. W. Geilen and H. Corporaal, "Multiprocessor resource allocation for throughput-
constrained synchronous dataflow graphs," in Proceedings of the 44th annual Design Automation
Conference, 2007.
[5] J. Kepner, D. Bader, A. Buluç, J. Gilbert, T. Mattson and H. Meyerhenke, "Graphs, matrices, and the
GraphBLAS: Seven good reasons," Procedia Computer Science, vol. 51, p. 2453–2462, 2015.
[6] J. Kepner and J. Gilbert, Graph algorithms in the language of linear algebra, SIAM, 2011.
[7] H. Liu, H. H. Huang and Y. Hu, "ibfs: Concurrent breadth-first search on gpus," in Proceedings of the 2016
International Conference on Management of Data, 2016.
[8] M. Then, M. Kaufmann, F. Chirigati, T.-A. Hoang-Vu, K. Pham, A. Kemper, T. Neumann and H. T. Vo, "The
more the merrier: Efficient multi-source graph traversal," Proceedings of the VLDB Endowment, vol. 8, p.
449–460, 2014.
[9] M. Kaufmann, M. Then, A. Kemper and T. Neumann, "Parallel Array-Based Single-and Multi-Source Breadth
First Searches on Large Dense Graphs.," in EDBT, 2017.
[10] A. McLaughlin and D. A. Bader, "Fast execution of simultaneous breadth-first searches on sparse graphs,"
in 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), 2015.
[11] "Volta GPU architecture whitepaper. http://www.nvidia.com/object/volta-architecture-whitepaper.html".
[12] "CUDA development Toolkit. https://developer.nvidia.com/cuda-toolkit".
[13] S. Beamer, K. Asanovic and D. Patterson, "Direction-optimizing breadth-first search," in SC'12: Proceedings
of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.
[14] P. Wang, L. Zhang, C. Li and M. Guo, "Excavating the potential of gpu for accelerating graph traversal," in
2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019.
[15] A. E. Sariyüce, K. Kaya, E. Saule and Ü. V. Çatalyürek, "Betweenness centrality on GPUs and heterogeneous
architectures," in Proceedings of the 6th Workshop on General Purpose Processor Using Graphics
Processing Units, 2013.
[16] A. McLaughlin and D. A. Bader, "Scalable and high performance betweenness centrality on the GPU," in
SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage
and Analysis, 2014.
[17] Y. Jia, V. Lu, J. Hoberock, M. Garland and J. C. Hart, "Edge v. node parallelism for graph centrality metrics,"
in GPU Computing Gems Jade Edition, Elsevier, 2012, p. 15–28.
[18] Z. Shi and B. Zhang, "Fast network centrality analysis using GPUs," BMC bioinformatics, vol. 12, p. 1–7,
2011.
[19] F. Jamour, S. Skiadopoulos and P. Kalnis, "Parallel algorithm for incremental betweenness centrality on
large graphs," IEEE Transactions on Parallel and Distributed Systems, vol. 29, p. 659–672, 2017.
[20] P. Crescenzi, R. Grossi, M. Habib, L. Lanzi and A. Marino, "On computing the diameter of real-world
undirected graphs," Theoretical Computer Science, vol. 514, p. 84–95, 2013.
[21] "cuSPARSE library. https://developer.nvidia.com/cusparse".
[22] "CUDA Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html".
[23] P. Mahonen, J. Riihijarvi and M. Petrova, "Automatic channel allocation for small wireless local area
networks using graph colouring algorithm approach," in 2004 IEEE 15th International Symposium on
Personal, Indoor and Mobile Radio Communications (IEEE Cat. No. 04TH8754), 2004.
 
An Efficient FPGA Implemenation of MRI Image Filtering and Tumour Characteriz...
An Efficient FPGA Implemenation of MRI Image Filtering and Tumour Characteriz...An Efficient FPGA Implemenation of MRI Image Filtering and Tumour Characteriz...
An Efficient FPGA Implemenation of MRI Image Filtering and Tumour Characteriz...
 
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
A Survey on Data Mapping Strategy for data stored in the storage cloud  111A Survey on Data Mapping Strategy for data stored in the storage cloud  111
A Survey on Data Mapping Strategy for data stored in the storage cloud 111
 
Graphics processing unit ppt
Graphics processing unit pptGraphics processing unit ppt
Graphics processing unit ppt
 
A REVIEW ON PARALLEL COMPUTING
A REVIEW ON PARALLEL COMPUTINGA REVIEW ON PARALLEL COMPUTING
A REVIEW ON PARALLEL COMPUTING
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce FrameworkBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce Framework
 
Heterogenous system architecture(HSA)
Heterogenous system architecture(HSA)Heterogenous system architecture(HSA)
Heterogenous system architecture(HSA)
 

More from Subhajit Sahu

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESDyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESSubhajit Sahu
 
Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Subhajit Sahu
 
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESA Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESSubhajit Sahu
 
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESScalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESSubhajit Sahu
 
Application Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTESApplication Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTESSubhajit Sahu
 
Community Detection on the GPU : NOTES
Community Detection on the GPU : NOTESCommunity Detection on the GPU : NOTES
Community Detection on the GPU : NOTESSubhajit Sahu
 
Survey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTESSurvey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTESSubhajit Sahu
 
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERDynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERSubhajit Sahu
 
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Subhajit Sahu
 
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESFast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESSubhajit Sahu
 
Can you x farming by going back 8000 years : NOTES
Can you x farming by going back 8000 years : NOTESCan you x farming by going back 8000 years : NOTES
Can you x farming by going back 8000 years : NOTESSubhajit Sahu
 
HITS algorithm : NOTES
HITS algorithm : NOTESHITS algorithm : NOTES
HITS algorithm : NOTESSubhajit Sahu
 
Basic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTESBasic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTESSubhajit Sahu
 
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDESDynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDESSubhajit Sahu
 
Are Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTESAre Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTESSubhajit Sahu
 
Taxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTESTaxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTESSubhajit Sahu
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESSubhajit Sahu
 
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...Subhajit Sahu
 
Income Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTESIncome Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTESSubhajit Sahu
 
Youngistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTESYoungistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTESSubhajit Sahu
 

More from Subhajit Sahu (20)

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESDyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES
 
Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)Shared memory Parallelism (NOTES)
Shared memory Parallelism (NOTES)
 
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESA Dynamic Algorithm for Local Community Detection in Graphs : NOTES
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES
 
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESScalable Static and Dynamic Community Detection Using Grappolo : NOTES
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES
 
Application Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTESApplication Areas of Community Detection: A Review : NOTES
Application Areas of Community Detection: A Review : NOTES
 
Community Detection on the GPU : NOTES
Community Detection on the GPU : NOTESCommunity Detection on the GPU : NOTES
Community Detection on the GPU : NOTES
 
Survey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTESSurvey for extra-child-process package : NOTES
Survey for extra-child-process package : NOTES
 
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERDynamic Batch Parallel Algorithms for Updating PageRank : POSTER
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER
 
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...
 
Fast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTESFast Incremental Community Detection on Dynamic Graphs : NOTES
Fast Incremental Community Detection on Dynamic Graphs : NOTES
 
Can you x farming by going back 8000 years : NOTES
Can you x farming by going back 8000 years : NOTESCan you x farming by going back 8000 years : NOTES
Can you x farming by going back 8000 years : NOTES
 
HITS algorithm : NOTES
HITS algorithm : NOTESHITS algorithm : NOTES
HITS algorithm : NOTES
 
Basic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTESBasic Computer Architecture and the Case for GPUs : NOTES
Basic Computer Architecture and the Case for GPUs : NOTES
 
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDESDynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES
 
Are Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTESAre Satellites Covered in Gold Foil : NOTES
Are Satellites Covered in Gold Foil : NOTES
 
Taxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTESTaxation for Traders < Markets and Taxation : NOTES
Taxation for Traders < Markets and Taxation : NOTES
 
A Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTESA Generalization of the PageRank Algorithm : NOTES
A Generalization of the PageRank Algorithm : NOTES
 
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...
 
Income Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTESIncome Tax Calender 2021 (ITD) : NOTES
Income Tax Calender 2021 (ITD) : NOTES
 
Youngistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTESYoungistaan Foundation: Annual Report 2020-21 : NOTES
Youngistaan Foundation: Annual Report 2020-21 : NOTES
 

Recently uploaded

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Cizo Technology Services
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
2.pdf Ejercicios de programaciĂłn competitiva
2.pdf Ejercicios de programaciĂłn competitiva2.pdf Ejercicios de programaciĂłn competitiva
2.pdf Ejercicios de programaciĂłn competitivaDiego IvĂĄn Oliveros Acosta
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commercemanigoyal112
 

Recently uploaded (20)

Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
Global Identity Enrolment and Verification Pro Solution - Cizo Technology Ser...
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
2.pdf Ejercicios de programaciĂłn competitiva
2.pdf Ejercicios de programaciĂłn competitiva2.pdf Ejercicios de programaciĂłn competitiva
2.pdf Ejercicios de programaciĂłn competitiva
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort ServiceHot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
Hot Sexy call girls in Patel Nagar🔝 9953056974 🔝 escort Service
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 

1 Introduction

Many scientific and technical problems involve large, networked data, and such problems are naturally represented with graph data structures. Graphs have shown their importance in many fields, especially in computer science applications such as data mining [1], image processing [2], networking [3], and resource allocation [4]. This motivates the development of new algorithms and results that can be applied across these domains. Graphs provide a simple and flexible abstract view of the discrete objects in a problem domain. In many settings the graph is so large that it requires a different approach to processing, storage, and representation. Large graph datasets, with billions of vertices and trillions of edges, arise in social networks, citation networks, web graphs, road networks, and so on. For computations on such graphs, parallel computing becomes necessary to overcome the restrictions of limited resources and large data volumes.

Parallel, data-intensive algorithms often require fast computational resources for bulk processing. This can be achieved by pairing general-purpose computers with specialized hardware such as field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), vector processors, or graphics processing units (GPUs). GPUs have become common in personal computers, supercomputers, and datacenters, which motivates researchers and application developers to design algorithms that exploit their architecture. Although the GPU evolved to meet the gaming world's demand for realistic graphics rendering, it has become a platform of choice for data-parallel and computationally expensive workloads, offering highly scalable and cost-effective solutions for high-performance computing. GPGPU (general-purpose computing on GPUs) refers to the programmable use of GPUs beyond their traditional role in computer graphics. Running graph algorithms on large datasets is challenging on a GPU because of irregular memory access patterns and throughput-intensive requirements; combining the power of the GPU and the CPU can further increase the performance of graph algorithms.

Real-world graphs are sparse [5]; therefore, as opposed to dense graphs, sparse graphs are of more practical interest to researchers. Algorithms designed for dense graphs may not suit sparse graphs, because on sparse inputs they often exhibit poor locality of memory access, uneven workload per vertex, and a degree of parallelism that varies over the course of execution. A better graph representation and proper load balancing can address these issues in parallel settings. Graphs are usually stored in adjacency-matrix or adjacency-list format. When graphs are represented as matrices, they tend to be more manageable, concise, and expressive, with a reduced memory footprint compared to vertex-centric frameworks. Several well-known representations for sparse graphs are discussed in Section 3.1.

Traversal of real-world sparse graphs is one of the core techniques in graph analytics. Depth-first search and breadth-first search are the two most common traversal algorithms, of which breadth-first search (BFS) is considered well suited to GPUs because all the vertices on a single level can be processed independently in parallel. [6] describes how BFS from a single source can be computed by multiplying the sparse adjacency matrix with a vector (SpMV). GraphBLAS is an API specification based on the notion that many graph problems can be solved using linear algebra; with its popularity, various parallel breadth-first search techniques have been developed on different platforms using different architectural and programming optimizations. Different variants of BFS are studied to solve graph problems. Problems such as all-pairs shortest paths (APSP), diameter computation, betweenness centrality, and reachability querying require one or more executions of breadth-first search. For example, APSP requires a BFS from each node to find the shortest distances between every pair of vertices. Multi-source BFS is a variant in which BFS is performed from a subset of the nodes. A few parallel solutions [7], [8], [9], [10] have addressed the simultaneous execution of graph traversal from more than one source node on CPU and GPU architectures. Each approach is discussed in detail in the literature survey in Section 3. These algorithms run on different platforms and have limitations that are addressed by the approach suggested in Section 4.

The problem of parallel multi-source breadth-first search invites alternative solutions, given its wide range of applications and the limitations of existing work. In the literature survey, I have studied various versions of the graph traversal problem, such as single-source BFS, multi-source BFS, and their applications; these problems have both sequential and parallel solutions. Inspired by the functionality offered by GraphBLAS, in my research work I aim to design parallel algorithms (on GPUs) for the problems listed below:
• Multi-source BFS using the linear-algebra method, in the following scenarios:
  o when the graph can or cannot fit in the GPU memory;
  o with the possibility of overlapping data transfer and BFS computation;
  o with a memory-saving compact graph representation such as hybrid-CSR;
  o on static graphs and dynamic graphs.
• Shortest-path computation from multiple sources.
• Graph centrality calculation: betweenness centrality of a few k nodes, which can effectively use our proposed hybrid-CSR representation and multi-source BFS.
• Diameter calculation in static and dynamic graphs.

This report is organized as follows: Section 2 describes the architecture of GPUs and the CUDA programming model used in our research work. Section 3 details the literature in the problem areas listed above. Section 4 gives more detail about these problems and our approach to solving them.

2 Graphics Processing Unit

2.1 Modern GPU Computing Architecture

Fig. 1 illustrates the high-level architecture of the NVIDIA Tesla V100 [11]. It is an extremely power-efficient processor that packs 21.1 billion transistors yet delivers exceptional performance per watt. The GPU is composed of multiple GPU Processing Clusters (GPCs), Texture Processing Clusters (TPCs), Streaming Multiprocessors (SMs), and memory controllers. Each GPC consists of a group of TPCs, and each TPC is made up of several SMs, a texture unit, and control logic.
Fig. 1 Volta GV100 full GPU with 84 SM units [11]

The Streaming Multiprocessor in the Volta architecture, unlike its predecessors, additionally provides tensor cores, a combined L1 data cache and shared memory, and a set of registers. Fig. 2 shows the architecture of a Streaming Multiprocessor. Tensor cores are an important feature developed to train large neural networks, for which matrix-matrix multiplication is the core operation; each tensor core can perform 64 fused multiply-add (FMA) operations per clock cycle, which significantly boosts the performance of these operations. Shared memory provides high bandwidth and low latency, and combining the L1 data cache with shared memory allows L1 cache operations to attain the benefits of shared-memory performance. The Volta architecture also comes with independent thread scheduling, which means each thread has its own program counter. This makes programming these GPUs more productive and easier.
Fig. 2 Streaming Multiprocessor of the V100 [11]

GPUs execute a group of 32 threads (called a warp) in SIMT (Single Instruction, Multiple Threads) fashion. In Volta, every thread in a warp maintains its own execution state, enabling concurrency among all threads regardless of warp boundaries. Fig. 3 illustrates this independent thread scheduling: a schedule optimizer groups active threads from the same warp into SIMT units.

Fig. 3 Independent thread scheduling architecture block diagram [11]
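As an illustration of the SIMT execution and independent thread scheduling described above, the following minimal CUDA sketch (not taken from the report; the kernel name divergeAndReconverge and the single-warp launch are illustrative) lets the two halves of a warp take different branches and then uses __syncwarp() as an explicit reconvergence point before the lanes read each other's shared-memory writes.

// Minimal sketch (not from the report): two branches inside one warp.
// Under Volta's independent thread scheduling the two sides may interleave,
// so __syncwarp() marks an explicit reconvergence point before lanes
// exchange data through shared memory.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergeAndReconverge(int *out)
{
    __shared__ int buf[32];
    int lane = threadIdx.x & 31;             // lane index within the warp

    if (lane < 16) buf[lane] = lane;         // first half-warp takes this branch
    else           buf[lane] = -lane;        // second half-warp takes this one

    __syncwarp();                            // all 32 lanes reconverge here

    out[threadIdx.x] = buf[31 - lane];       // safe: buf[] fully written above
}

int main()
{
    int *d_out, h_out[32];
    cudaMalloc(&d_out, 32 * sizeof(int));
    divergeAndReconverge<<<1, 32>>>(d_out);
    cudaMemcpy(h_out, d_out, 32 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("out[0] = %d, out[31] = %d\n", h_out[0], h_out[31]);
    cudaFree(d_out);
    return 0;
}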
2.2 Memory Layout

When writing programs, we group threads into a "block"; each SM has a limit on the number of threads per thread block. A thread has its own local memory, and all the threads in a thread block can access high-performance on-chip shared memory. Thread blocks are arranged in a grid, and all threads of a grid have access to global memory, to and from which we copy data in CPU memory. Fig. 4 shows this high-level memory architecture from a program's point of view. There are two additional read-only memory spaces, constant memory and texture memory; they are accessible to all threads and can be used according to the memory-usage needs of the application.

Fig. 4 Memory hierarchy [12]
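The following minimal CUDA sketch (illustrative only; blockSum and the sizes used are assumptions, not part of the report) touches each level of the hierarchy just described: data is copied from CPU memory into global memory, staged per block in on-chip shared memory, and accumulated through per-thread values held in registers.

// Minimal sketch (not from the report): the memory spaces described above.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void blockSum(const int *g_in, int *g_out, int n)
{
    __shared__ int tile[256];                     // shared: visible to one thread block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? g_in[i] : 0;    // global -> shared
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) g_out[blockIdx.x] = tile[0];   // one partial sum per block
}

int main()
{
    const int n = 1 << 10;
    int *h_in = new int[n], *d_in, *d_out, h_out[4];
    for (int i = 0; i < n; ++i) h_in[i] = 1;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, 4 * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);  // CPU -> global memory
    blockSum<<<4, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, 4 * sizeof(int), cudaMemcpyDeviceToHost);
    printf("block sums: %d %d %d %d\n", h_out[0], h_out[1], h_out[2], h_out[3]);
    cudaFree(d_in); cudaFree(d_out); delete[] h_in;
    return 0;
}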
2.3 CUDA

"CUDA (Compute Unified Device Architecture) [12] is a parallel computing platform and programming model developed by NVIDIA that makes general-purpose computing simple on GPUs." Programmers still write in familiar languages such as C and C++, or in the other languages supported by the NVIDIA CUDA toolkit. CUDA exposes three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. With these abstractions, a problem can be divided into sub-problems that are solved independently by thread blocks, and each sub-problem can be solved cooperatively in parallel by the threads within a block. In CUDA C++, a function that runs on the GPU is called a kernel; when launched, it is executed in parallel by CUDA threads. The number of threads per block and the number of blocks per grid are defined by the programmer. Each thread and each block has a unique id within its block and grid, respectively. threadIdx is a 3-component vector, so a thread can be identified within a one-, two-, or three-dimensional thread block; similarly, blockIdx is a 3-component vector identifying a block within a one-, two-, or three-dimensional grid. When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed across SMs. Once a thread block is launched on an SM, all of its warps remain resident until all of them finish execution, and its threads execute in parallel on that SM.
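A minimal CUDA sketch of these abstractions is given below (illustrative; the kernel fillMatrix and the matrix dimensions are assumptions): a kernel indexed through threadIdx and blockIdx is launched over a two-dimensional grid of two-dimensional blocks whose sizes are chosen by the programmer.

// Minimal sketch (not from the report): kernel launch and built-in indexing.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fillMatrix(float *a, int rows, int cols)
{
    int r = blockIdx.y * blockDim.y + threadIdx.y;   // global row of this thread
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // global column of this thread
    if (r < rows && c < cols)
        a[r * cols + c] = r * 100.0f + c;            // each thread writes one element
}

int main()
{
    const int rows = 37, cols = 53;
    float *d_a, h_last;
    cudaMalloc(&d_a, rows * cols * sizeof(float));

    dim3 block(16, 16);                               // threads per block (programmer-defined)
    dim3 grid((cols + block.x - 1) / block.x,         // enough blocks to cover the matrix
              (rows + block.y - 1) / block.y);
    fillMatrix<<<grid, block>>>(d_a, rows, cols);

    cudaMemcpy(&h_last, d_a + (rows * cols - 1), sizeof(float), cudaMemcpyDeviceToHost);
    printf("a[%d][%d] = %.0f\n", rows - 1, cols - 1, h_last);
    cudaFree(d_a);
    return 0;
}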
3 Literature Survey

3.1 Representation of Sparse Graphs

Substantial memory reductions can be realized when only the non-zero entries of sparse graphs are stored. Consider the sparse graph in Fig. 5 and its adjacency matrix; below are a few commonly used sparse-graph representations.

Fig. 5 An undirected, unweighted graph

1. Adjacency List: This format stores the list of edges for each vertex. The memory requirement for this representation is O(|V| + |E|). The programmer can decide which data structure to use; for example, a list of lists or an array of lists can store the graph in Fig. 5.
2. Coordinate Format (COO): The sparse matrix A is assumed to be stored in row-major order and is represented with three arrays: cooVal, cooRowInd, and cooColInd. The size of each array equals the number of non-zero (nnz) values in A. cooVal stores the nnz values in row-major order, cooRowInd stores their row indices, and cooColInd stores their column indices. With this format the matrix can be reconstructed quickly.
3. Compressed Sparse Row Format (CSR): The matrix is represented with three arrays: csrVal, csrRowPtr, and csrColInd. csrVal stores the nnz values in row-major order, csrRowPtr points to the positions in csrVal where each row begins, and csrColInd stores the column indices of the values. This provides fast row access and is particularly good for matrix-vector multiplication (a small construction sketch follows this list).
4. Compressed Sparse Column Format (CSC): This format is similar to CSR except that the matrix is stored in column-major order.
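Since the edge list of the graph in Fig. 5 is not reproduced here, the sketch below builds the CSR arrays for a small hypothetical undirected graph instead (4 vertices, edges 0-1, 0-2, 1-2, 2-3); the counting and prefix-sum steps are one standard way to construct csrRowPtr and csrColInd from an edge list, not the report's implementation.

// Minimal sketch (not from the report): CSR construction for a hypothetical graph.
#include <cstdio>
#include <vector>
#include <utility>

int main()
{
    const int n = 4;
    // Each undirected edge stored in both directions (COO-style edge list).
    std::vector<std::pair<int,int>> edges = {
        {0,1},{1,0},{0,2},{2,0},{1,2},{2,1},{2,3},{3,2}};

    std::vector<int> rowPtr(n + 1, 0), colInd(edges.size());

    for (auto &e : edges) rowPtr[e.first + 1]++;             // count the degree of each row
    for (int i = 0; i < n; ++i) rowPtr[i + 1] += rowPtr[i];  // prefix sum -> row offsets

    std::vector<int> next(rowPtr.begin(), rowPtr.end() - 1);
    for (auto &e : edges) colInd[next[e.first]++] = e.second; // scatter column indices

    // For an unweighted graph csrVal would be all ones, so it can be omitted.
    printf("csrRowPtr:"); for (int i = 0; i <= n; ++i) printf(" %d", rowPtr[i]);
    printf("\ncsrColInd:"); for (size_t k = 0; k < colInd.size(); ++k) printf(" %d", colInd[k]);
    printf("\n");
    return 0;
}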
3.2 List of selected Research Papers

A summary of the 9 publications, selected from reputed conferences and journals, which exhibit notable work in the chosen area of research is given below (original table columns: S.No., Reference Paper, Reason, #Citations).

1. Liu, H., Huang, H. H., & Hu, Y. (2016). iBFS: Concurrent breadth-first search on GPUs. In Proceedings of the 2016 International Conference on Management of Data (pp. 403-416). [7]
   Reason: Sharing and grouping of frontier nodes across BFS instances; bitwise optimization when concurrently executing BFS from many source nodes.
   Citations: 62
2. McLaughlin, A., & Bader, D. A. (2015). Fast execution of simultaneous breadth-first searches on sparse graphs. In 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS) (pp. 9-18). IEEE. [10]
   Reason: Describes a simple parallel multi-search abstraction that can be complemented with other graph analytics applications on a single GPU.
   Citations: 9
3. Then, M., Kaufmann, M., Chirigati, F., Hoang-Vu, T. A., Pham, K., Kemper, A., ... & Vo, H. T. (2014). The more the merrier: Efficient multi-source graph traversal. Proceedings of the VLDB Endowment, 8(4), 449-460. [8]
   Reason: Proposes an algorithm to process multi-source BFS concurrently on a multicore CPU; leverages the properties of small-world graphs, where most vertices share commonality in the initial levels of the BFSs.
   Citations: 54
4. Kaufmann, M., Then, M., Kemper, A., & Neumann, T. (2017). Parallel array-based single- and multi-source breadth first searches on large dense graphs. In EDBT (pp. 1-12). [9]
   Reason: Suggests two-phase parallelism on the top-down approach and on the bottom-up approach used in [14]; single-source and multi-source parallel BFS on multi-socket NUMA-aware systems.
   Citations: 5
5. Wang, P., Zhang, L., Li, C., & Guo, M. (2019). Excavating the potential of GPU for accelerating graph traversal. In 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (pp. 221-230). IEEE. [13]
   Reason: In the EtaGraph algorithm, overlapping data transfer with kernel execution during graph traversal improves the total execution time; a unique "Unified Degree Cut" scheme is used for load balancing, and prefetching hides memory access latency.
   Citations: 3
6. McLaughlin, A., & Bader, D. (2014). Scalable and high performance betweenness centrality on the GPU. In SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 572-583). [14]
   Reason: Alternates between two methods of parallelism, edge-parallel and work-efficient, based on how significantly the frontier vertex set changes in each iteration.
   Citations: 102
7. Sarıyüce, A. E., Kaya, K., Saule, E., & Çatalyürek, Ü. V. (2013). Betweenness centrality on GPUs and heterogeneous architectures. In Proceedings of the 6th Workshop on General Purpose Processing Using Graphics Processing Units (pp. 76-85). [15]
   Reason: Various techniques to accelerate betweenness centrality computations, such as vertex virtualization, strided memory access, and a graph compression method that removes degree-1 vertices; the algorithm makes use of both the GPU and the CPU.
   Citations: 79
8. Jamour, F., Skiadopoulos, S., & Kalnis, P. (2017). Parallel algorithm for incremental betweenness centrality on large graphs. IEEE Transactions on Parallel and Distributed Systems, 29(3), 659-672. [16]
   Reason: Designed for evolving graphs; incremental betweenness centrality is computed only on the affected biconnected components.
   Citations: 29
9. Crescenzi, P., Grossi, R., Habib, M., Lanzi, L., & Marino, A. (2013). On computing the diameter of real-world undirected graphs. Theoretical Computer Science, 514, 84-95. [17]
   Reason: Diameter computation usually requires a BFS from every node; this paper defines an algorithm that runs BFS from a few selected nodes and terminates the evaluation early when a certain condition is met.
   Citations: 61

3.3 Single Source Breadth-First Search

A typical sequential breadth-first search begins from a source and then, at each level, explores all vertices adjacent to the "frontier" nodes; this direction of search is called top-down search. As opposed to this, in "bottom-up" search all unvisited nodes are examined as candidates for the next frontier, which skips the nodes that cannot be part of the frontier in the next level. A hybrid approach [13] using both top-down and bottom-up search turns out to be much more efficient than either direction alone, because the bidirectional combination explores fewer nodes for the frontier. The approach works well for single-source BFS on a multicore CPU (according to the experiments in [13]). Numerous graph analytics algorithms need to execute multiple BFS traversals on the same graph from different source nodes; with advances in hardware architectures and in the applications of BFS, the scope for exploring efficient multi-source BFS (MS-BFS) algorithms has widened.
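The two CUDA kernels below sketch one level of top-down and one level of bottom-up traversal over a CSR graph (illustrative names such as topDownStep and bottomUpStep; this is not the implementation of the cited hybrid approach). A host loop would run the top-down kernel while the frontier is small and switch to the bottom-up kernel when the frontier grows large; the small driver exercises only the top-down step on the hypothetical 4-vertex graph used earlier.

// Sketch under stated assumptions: one BFS level in each direction over CSR.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void topDownStep(const int *rowPtr, const int *colInd,
                            const int *frontier, int frontierSize,
                            int *dist, int *nextFrontier, int *nextSize, int level)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontierSize) return;
    int u = frontier[i];
    for (int e = rowPtr[u]; e < rowPtr[u + 1]; ++e) {
        int v = colInd[e];
        if (atomicCAS(&dist[v], -1, level) == -1)            // first visitor claims v
            nextFrontier[atomicAdd(nextSize, 1)] = v;
    }
}

__global__ void bottomUpStep(const int *rowPtr, const int *colInd, int n,
                             int *dist, const int *inFrontier, int *inNext, int level)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || dist[v] != -1) return;                      // only unvisited vertices search
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e)
        if (inFrontier[colInd[e]]) {                          // any parent on the current frontier?
            dist[v] = level;
            inNext[v] = 1;
            break;
        }
}

int main()
{
    // 4-vertex graph with edges {0-1, 0-2, 1-2, 2-3}, BFS from vertex 0, one level.
    int h_rowPtr[] = {0, 2, 4, 7, 8};
    int h_colInd[] = {1, 2, 0, 2, 0, 1, 3, 2};
    int h_dist[]   = {0, -1, -1, -1}, h_frontier[] = {0};
    int *rowPtr, *colInd, *dist, *frontier, *nextFrontier, *nextSize;
    cudaMalloc(&rowPtr, sizeof(h_rowPtr));       cudaMalloc(&colInd, sizeof(h_colInd));
    cudaMalloc(&dist, sizeof(h_dist));           cudaMalloc(&frontier, 4 * sizeof(int));
    cudaMalloc(&nextFrontier, 4 * sizeof(int));  cudaMalloc(&nextSize, sizeof(int));
    cudaMemcpy(rowPtr, h_rowPtr, sizeof(h_rowPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(colInd, h_colInd, sizeof(h_colInd), cudaMemcpyHostToDevice);
    cudaMemcpy(dist, h_dist, sizeof(h_dist), cudaMemcpyHostToDevice);
    cudaMemcpy(frontier, h_frontier, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(nextSize, 0, sizeof(int));

    topDownStep<<<1, 32>>>(rowPtr, colInd, frontier, 1, dist, nextFrontier, nextSize, 1);
    cudaMemcpy(h_dist, dist, sizeof(h_dist), cudaMemcpyDeviceToHost);
    printf("distances after level 1: %d %d %d %d\n", h_dist[0], h_dist[1], h_dist[2], h_dist[3]);
    return 0;
}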
3.4 Multi-source Breadth First Search

[iBFS] A recent work on parallel multi-source BFS on the GPU is proposed by Liu et al. [7]. The proposed algorithm, iBFS, employs three novel techniques: joint traversal, GroupBy, and bitwise optimization. Joint traversal shares the frontiers of different BFS instances, which reduces memory latency. Grouping BFS instances based on the ratio of nodes they share can also save memory; GroupBy selectively places together BFS instances that have the most nodes in common. In the initial levels of two BFS instances the frontier nodes are few, and if such nodes are shared then all their edges are checked only once, which ensures maximum sharing across the levels of different BFS instances. Lastly, to handle billions of nodes, bitwise storage is used for the status arrays and bitwise operations are performed by thousands of GPU threads. A large improvement is observed in their work when the bitwise optimizations are applied.

[multi-BFS] McLaughlin et al. [10] proposed a multi-search abstraction that can be applied to problems requiring many simultaneous BFSs; the algorithm is shown to work for APSP and betweenness centrality on a single GPU. Like the Gather-Apply-Scatter (GAS) paradigm, their approach can be complemented with other applications. The multi-search abstraction has five major functions: init(), prior(), visitvertex(), post(), and finalise(). init() initializes the data structures to begin a search from source i; this step runs in parallel on the Streaming Multiprocessors. prior() is a pre-processing step that handles any computation needed before a search iteration. visitvertex() updates information about a vertex, e.g., its distance from source i and the next vertex; this step needs cooperation between the threads of a warp to atomically update the distance between two nodes. post() performs any post-processing step, and finalise() handles any final computation in the search. Since it is an abstraction, one benefits from code reuse when solving problems that use multiple BFSs. Coarse-grained parallelism is achieved by running i concurrent BFSs on i Streaming Multiprocessors, and fine-grained parallelism by assigning warps to the active frontier vertices.

[The more the merrier] Multi-source BFS (MS-BFS) on a multi-core CPU is proposed by Then et al. [8], leveraging the properties of small-world graphs. The algorithm allows frontier nodes to be shared among BFS instances: initially each source marks itself as discovered, and all adjacent vertices that become visited join the frontier.
The algorithm merges all the BFS instances that have vertex v in their frontier, so that v is explored only once for all of those BFSs. A word whose size equals the register width, or the width of a cache line, is used for this status information as a space optimization. Aggregated neighbor processing reduces random memory accesses: all the vertices that need to be explored in the next level are collected together, and the frontier nodes for the next level are prefetched.
Since the register width can be smaller than the number of BFS instances to be executed, the authors advise using multiple registers for multi-source BFS. A good heuristic for maximum sharing is to group BFSs according to their connected components. The algorithm compares itself with non-parallel direction-optimized BFS [13] and with textbook BFS.

[Parallel array-based MS-BFS] This algorithm [9] extends the array-based BFS of [8] and parallelizes it on a multi-socket NUMA-aware server. Cache hits are improved by relabeling the vertices according to their degrees, so that the states of the higher-degree nodes are located close together. Work partitioning among the workers is decided on this basis, so that all threads are doing useful work at any time and all cores are utilized; hence synchronization is needed between threads. Since the vertex degrees of real-world graphs follow a power-law distribution (a few vertices have high degree and many have low degree), each thread cannot be given the same number of nodes to process. Instead, each thread is given its own queue: for example, threads working on higher-degree vertices have a smaller queue, and threads working on low-degree vertices have a larger queue.

Breadth-first search has a variety of applications in graph analytics problems. For example, in the calculation of transitive closure, diameter, and centrality metrics, a BFS from one or more vertices is performed to derive the target metric. We next discuss research that has used multi-source BFS to compute betweenness centrality.

[EtaGraph] The algorithm presented by Wang et al. [14] is EtaGraph, which overlaps data transfer with kernel execution. To balance the workload among threads for small-world graphs, a graph partitioning method called "Unified Degree Cut" (UDC) is used; this scheme sets an upper bound on the out-degree of each vertex so that no thread gets too much work. The graph, in a transformed CSR format, is brought into the GPU on the fly, and the data required for the next iteration is prefetched into shared memory. Although the paper uses datasets that fit into memory, the technique could be applied to graphs that do not fit in GPU memory.

[Scalable and high performance Betweenness centrality on GPUs] A work-efficient algorithm for betweenness centrality that works for networks with large diameter is proposed by McLaughlin and Bader [16]. The approach uses an explicit queue for graph traversal and discards the predecessor array used in previous algorithms [17], [18]. Threads are assigned to the elements of the frontier queue, and atomic compare-and-swap operations ensure that no two threads insert the same vertex into the next frontier queue. Because this approach may still suffer from load imbalance among threads due to the properties of scale-free networks, a hybrid method is used to select the parallelization strategy: if the size of the next frontier queue is greater than a threshold, the edge-parallel method is used, otherwise the work-efficient method is preferred. The strategy is switched only when there is a significant change between the frontier queue of the current level and that of the next level.

[Betweenness Centrality on GPUs] The betweenness centrality (BC) metric is important in different types of networks, such as social networks, knowledge networks, and finding the best store locations in cities. Sarıyüce et al. [15] describe many techniques to speed up BC computation, evaluated on a cluster with two NUMA nodes.
GPU parallelism is applied to the vertex-based and edge-based versions of the two baseline methods. Vertices with high degree are divided into virtual nodes, each with degree at most 'mdeg'; this vertex virtualization is applied to both the vertex-based and the edge-based algorithms, whose parallelism otherwise suffers from load imbalance and higher memory usage, respectively. Vertices with degree 1 are removed from the graph and their information is kept with their predecessors, which saves a lot of space and computation time.

[iCENTRAL] Many real-world graphs, representing networks of transactions, social interactions, and engagements of various kinds, are dynamic: their data change over time as nodes and edges are added or deleted. The iCENTRAL algorithm by Jamour et al. [19] is a parallel algorithm for betweenness centrality on evolving graphs. The graph is decomposed into biconnected components, and the betweenness centrality values are updated only for the vertices affected by the insertion or deletion of edges. Betweenness centrality needs all-pairs shortest-path information, which requires the breadth-first search DAGs of all the nodes; with the help of biconnected components, they identify the DAGs that remain intact. The overall complexity of iCENTRAL is O(|Q| * |E_B'e|) time and O(|V| + |E|) space, where Q is the set of all nodes whose betweenness values change with the insertion of edge e and E_B'e is the edge set of the affected biconnected component after the insertion/deletion. The parallel version of iCENTRAL is implemented using MPI and evaluated on various graph datasets on Intel Xeon CPUs; results are reported for a distributed system of 19 machines, where it is compared with other state-of-the-art algorithms.

[iFUB] The diameter of a graph is the maximum of the eccentricities over all nodes, so the general approach requires an all-pairs shortest path computation, which is computationally expensive. The algorithm defined in [20] selects a few nodes for BFS computation and gives a termination condition under which these BFSs suffice for diameter computation. The termination condition rests on the following observation: if the eccentricity of a node is greater than 2(BFS_level - 1), then the levels above this level need not be evaluated. The algorithm thus maintains a lower bound and an upper bound on the diameter. The k nodes are selected by one of the following methods: 1) random selection, 2) nodes with higher degree, or 3) the 4-Sweep method, which selects a node that is central in the graph, i.e., has low eccentricity.

4 Proposed Research Plan

4.1 Hybrid CSR Representation

For large real-world sparse graphs, the common approach to storing the adjacency matrix is the CSR format. To store graphs that are unweighted or have a uniform weight, we have designed a new representation called hybrid-CSR. The format is motivated by the observation that in an unweighted graph an edge is represented by a '0' or a '1', so in hybrid-CSR we represent an edge as a single bit. Consider a word size of 4 bits for the graph in Fig. 5: we must pad the adjacency matrix with an all-zero 8th column, and the compact matrix then occupies 7 * 2 * 4 = 56 bits, whereas the adjacency matrix (at the same word size per entry) requires 7 * 7 * 4 = 196 bits. The compact matrix of this graph G is as follows:
Table 2. Compact matrix of graph G

Now, as in CSR, we store only the nonzero entries (here, nonzero words). Working with large graph datasets that cannot fit in GPU memory at once is a challenge; hybrid-CSR is an alternative graph representation that reduces the memory footprint by a factor of the word size. This factor matters when the GPU has limited memory in which to store an entire graph of millions of nodes, so a compact hybrid-CSR representation is an attractive way to save memory. For example, with 32-bit integers (available on essentially any current architecture) and a sparse graph of size 40 x 40, the adjacency matrix requires 40 * 40 * 4 bytes = 6400 bytes, whereas the compact matrix requires only 40 * 2 * 4 bytes = 320 bytes. If NNZ and nnz are the numbers of nonzeros in the adjacency matrix and the compact matrix respectively, then the space complexity of hybrid-CSR is O(2*nnz + n + 1), where nnz << NNZ. The hybrid-CSR format also improves memory coalescing, since more useful data can be brought into shared memory at once, and the bitwise operations described in our algorithm in Section 4.2 further improve the execution time of the SpMM computation.

4.2 In-core Multi-source BFS

Our approach to multi-source BFS is inspired by linear-algebra-based sparse matrix-matrix multiplication (SpMM). As suggested in [6], repeated multiplication of the graph matrix G with a sparse vector x yields the BFS traversal of the graph, where x(i) = 1 and x(j) = 0 for j != i, with i the start node. Similarly, multi-source BFS can be computed using iterative SpMM between the adjacency matrix and another matrix X that represents one source node per column. Then Y = G^T * X picks out, for each column of X, the rows of G containing the neighbors of the corresponding source node; multiplying Y with G^T again gives the nodes two steps away, and so on.
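One possible way to build the compact matrix of Section 4.1 is sketched below (an assumption for illustration, using 32-bit words and the hypothetical 4-vertex graph from the Section 3.1 sketch, not the report's code): each adjacency-matrix row is packed into ceil(n/32) words with one bit per edge, after which a hybrid-CSR layout would keep only the nonzero words together with a row-pointer array.

// Sketch under stated assumptions: bit-packing adjacency rows into 32-bit words.
#include <cstdio>
#include <cstdint>
#include <vector>
#include <utility>

int main()
{
    const int n = 4;                                    // hypothetical graph from the earlier sketch
    std::vector<std::pair<int,int>> edges = {
        {0,1},{1,0},{0,2},{2,0},{1,2},{2,1},{2,3},{3,2}};

    const int wordsPerRow = (n + 31) / 32;              // columns padded to a word boundary
    std::vector<uint32_t> compact(n * wordsPerRow, 0);

    for (auto &e : edges) {
        int r = e.first, c = e.second;
        compact[r * wordsPerRow + c / 32] |= 1u << (c % 32);   // one bit per edge
    }

    // A hybrid-CSR layout would then keep only the nonzero words of `compact`
    // together with a row-pointer array, analogous to csrRowPtr/csrColInd.
    for (int r = 0; r < n; ++r)
        printf("row %d packed word(s): 0x%08x\n", r, compact[r * wordsPerRow]);
    return 0;
}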
The algorithm below is our initial implementation of multi-BFS using the hybrid-CSR format:

Algorithm: Hybrid-CSR-based Multi-BFS
Input: Graph G^T of size N x M (in hybrid-CSR format), matrix X of size N x M (in compact-matrix format)
Output: Matrix Y (in compact-matrix format)
1.  Procedure Multi-BFS
2.    for each thread 'rowG' in G in parallel do
3.      for all 'rowX' in X do
4.        sum <-- 0
5.        bit <-- rowIndex * M
6.        for k from ROWPTR[rowG] to ROWPTR[rowG+1]
7.          col <-- COLIND[k]
8.          sum OR= G[rowG][col] AND X[rowX][col]
9.          if (sum != 0) then
10.           SetBit Y[rowG][bit]
11.           break
12.         end if
13.       end for
14.       bit <-- bit + 1
15.     end for
16.   end for
17. end Procedure

The kernel above is executed by each GPU thread. To compute one bit of Y, one row of G^T is bitwise-ANDed with one row of X, but at line 6 we only perform the AND at the positions of nonzero entries. Lines 9 to 12 check whether sum has become nonzero; if so, we set the corresponding bit of Y and need not do any further computation on that row. In each iteration the number of bitwise operations is the same, because the number of nonzero elements in the graph matrix is fixed and our algorithm performs AND operations only on those nonzero elements. Therefore, even as the frontier matrix becomes denser in each iteration, the number of bitwise operations does not increase, unlike the cuSPARSE matrix-multiplication routine cusparseScsrmm() [21], whose execution time grows as matrix B becomes denser.
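To make the bitwise idea concrete, the simplified CUDA sketch below (an assumption, not the report's kernel) keeps the graph in plain CSR and packs the frontier and visited status of 32 concurrent BFS instances into one 32-bit word per vertex, so a single OR per neighbour advances all 32 searches; the report's hybrid-CSR additionally bit-packs the adjacency matrix itself.

// Simplified sketch (assumption): the bitwise core of a multi-source BFS step.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void msBfsStep(const int *rowPtr, const int *colInd, int n,
                          const unsigned *frontier, unsigned *visited,
                          unsigned *nextFrontier, int *changed)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    unsigned gathered = 0;
    for (int e = rowPtr[v]; e < rowPtr[v + 1]; ++e)
        gathered |= frontier[colInd[e]];          // one OR advances all 32 searches
    unsigned fresh = gathered & ~visited[v];      // instances reaching v for the first time
    if (fresh) {
        visited[v] |= fresh;
        nextFrontier[v] = fresh;
        *changed = 1;                             // termination flag for the host loop
    }
}

int main()
{
    // Hypothetical 4-vertex graph; BFS instance 0 starts at vertex 0,
    // instance 1 starts at vertex 3 (bits 0 and 1 of the status words).
    int h_rowPtr[] = {0, 2, 4, 7, 8}, h_colInd[] = {1, 2, 0, 2, 0, 1, 3, 2};
    unsigned h_frontier[] = {1u, 0, 0, 2u}, h_visited[] = {1u, 0, 0, 2u}, h_next[4] = {0};
    int *rowPtr, *colInd, *changed; unsigned *frontier, *visited, *next;
    cudaMalloc(&rowPtr, sizeof(h_rowPtr));     cudaMalloc(&colInd, sizeof(h_colInd));
    cudaMalloc(&frontier, sizeof(h_frontier)); cudaMalloc(&visited, sizeof(h_visited));
    cudaMalloc(&next, sizeof(h_next));         cudaMalloc(&changed, sizeof(int));
    cudaMemcpy(rowPtr, h_rowPtr, sizeof(h_rowPtr), cudaMemcpyHostToDevice);
    cudaMemcpy(colInd, h_colInd, sizeof(h_colInd), cudaMemcpyHostToDevice);
    cudaMemcpy(frontier, h_frontier, sizeof(h_frontier), cudaMemcpyHostToDevice);
    cudaMemcpy(visited, h_visited, sizeof(h_visited), cudaMemcpyHostToDevice);
    cudaMemset(next, 0, sizeof(h_next)); cudaMemset(changed, 0, sizeof(int));

    msBfsStep<<<1, 32>>>(rowPtr, colInd, 4, frontier, visited, next, changed);
    cudaMemcpy(h_next, next, sizeof(h_next), cudaMemcpyDeviceToHost);
    printf("next frontier words: %u %u %u %u\n", h_next[0], h_next[1], h_next[2], h_next[3]);
    return 0;
}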
GPU performance is maximized by exposing sufficient parallelism, coalescing memory accesses, and keeping execution coherent within a warp [22]. Our proposed hybrid-CSR format improves memory coalescing by packing the bits that represent edges into words. The features of CUDA programming help achieve further optimizations in our approach; for example, warp-level primitives such as __syncwarp() can make communication among the threads of a warp faster. Because global memory access is slower than the other memory types available on a GPU, we make efficient use of on-chip shared memory, which has higher bandwidth and lower latency than global memory. Shared memory is small compared to the size of the graph dataset, so coherence is achieved through proper thread organization within a warp.

4.3 Out-of-core Multi-source BFS

A GPU's device memory is typically much smaller than the CPU's main memory, and its capacity limits the processing of large-scale graphs. The central issue in processing a very large graph is managing, on the GPU, data that exceeds the capacity of device memory while still producing a coherent solution with minimal performance overhead. Another issue with traversal of very large graphs is the data access pattern, because these algorithms generally involve irregular accesses; progress can be delayed when only part of the graph data is resident on the device. The term out-of-core in this report signifies that the graph cannot be stored in GPU global memory. Many of the state-of-the-art algorithms studied in the literature (discussed in Section 3) are in-core or use a cluster of GPUs for graph processing. We observe that the performance of multi-source graph traversal on very large graphs that cannot fit in GPU memory is not well investigated; for such graphs, the data transfer itself is time-consuming. In [16], the algorithm overlaps data transfer with kernel execution, which is a good approach to solving the problem asynchronously, but it has not been investigated for graphs that do not fit in GPU memory.

4.4 Applications

The problem of concurrently executing BFS from multiple sources is interesting because it has wide applications in many graph-analytic algorithms, and we aim to use our proposed method for these applications. There is considerable scope to examine problems that can benefit from our novel graph representation technique, hybrid-CSR, which is particularly suited to unweighted graphs. The approach can also be extended to graph algorithms on weighted static graphs and on weighted or unweighted dynamic graphs.

4.4.1 Applications to Static Graph Problems

Most current research on betweenness centrality targets unweighted static graphs [15], [16], where BFS is performed from every node. In practice, only a few nodes of a static graph are of particular importance, and one would like to know the betweenness centrality of just these k nodes. In such a case, running BFS from all nodes underutilizes resources and memory. Given the limited memory of GPUs and the need to compute BFS from k nodes, the problem is relevant to many practical graph-analytics tasks such as diameter computation, shortest-path computation, and centrality metrics computation.
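As a small illustration of how the output of a k-source traversal feeds such analytics, the host-side sketch below takes the BFS levels obtained from k chosen sources and derives their eccentricities together with a lower bound on the graph diameter (the diameter is the maximum eccentricity over all vertices, so the maximum over any subset of sources bounds it from below). The level-matrix layout and the function names are assumptions made for this illustration.

Listing (illustrative sketch): eccentricities and a diameter bound from k-source BFS levels

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // levels[s][v] holds the BFS level of vertex v from source s, or -1 if v is unreachable.
    static int eccentricity(const std::vector<int> &levels_from_source)
    {
        int ecc = 0;
        for (int d : levels_from_source)
            if (d >= 0) ecc = std::max(ecc, d);
        return ecc;
    }

    int main()
    {
        // Levels for the path graph 0-1-2-3 with the two sources {0, 3}.
        std::vector<std::vector<int>> levels = { {0, 1, 2, 3},    // from source 0
                                                 {3, 2, 1, 0} };  // from source 3
        int diameter_lower_bound = 0;
        for (std::size_t s = 0; s < levels.size(); ++s) {
            int ecc = eccentricity(levels[s]);
            diameter_lower_bound = std::max(diameter_lower_bound, ecc);
            std::printf("eccentricity of source %zu: %d\n", s, ecc);
        }
        std::printf("diameter >= %d\n", diameter_lower_bound);    // prints 3, exact for this path graph
        return 0;
    }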
The application of finding shortest paths from multiple source nodes arises in information systems that reroute emergency services and in road-traffic systems that search for alternate routes. This problem amounts to a multi-source Dijkstra computation, in which the shortest paths from many source nodes are required.
A typical approach to multi-source Dijkstra is to connect all the source nodes to a virtual source node and assign zero weights to the connecting edges; this can be an expensive operation when the graph is already large. A faster, parallel approach of the kind used in multi-source BFS can improve the solution of the k-source Dijkstra problem.

4.4.2 Applications to Dynamic Graph Problems

Dynamic graphs can be viewed as a discrete sequence of static graphs. They are ubiquitous in computer science and in real-world applications, and they can be studied by specifying the properties that remain invariant over time. Below are a few applications where multi-source BFS can be applied to dynamic graphs:
• Fault tolerance: Fault tolerance is concerned with keeping a multiprocessor architecture or network operational when nodes or edges occasionally fail. A reliable fault-tolerance mechanism is required to keep the network reconfigurable and robust under node or edge failures.
• Graph connectivity: The connectivity of a graph is the smallest number of nodes whose removal disconnects it. Conditional connectivity offers an interesting direction for further research on dynamic graphs.
• Centrality computation: Identifying the most important nodes in a network is another application that requires BFS. In social networks, which are dynamic in nature, the centrality values of nodes keep changing.
• Diameter computation: As edges and nodes are added to or removed from a network, natural graph parameters such as the diameter, radius, and eccentricities change. Algorithms for these problems require shortest-path computations.

5 Concluding Remarks

We have identified a set of interesting and coherent problems in the space of graph algorithms on GPUs. These problems have applications to important computations such as diameter, centrality metrics, and shortest paths, which arise in domains including transportation networks and social-network analysis. We therefore believe that investigating the problems described above is highly relevant in the current context.

References

[1] T. Washio and H. Motoda, "State of the art of graph-based data mining," ACM SIGKDD Explorations Newsletter, vol. 5, p. 59–68, 2003.

[2] R. Baeza-Yates and G. Valiente, "An image similarity measure based on graph matching," in Proceedings of the Seventh International Symposium on String Processing and Information Retrieval (SPIRE 2000), 2000.
[3] F. Chierichetti, A. Epasto, R. Kumar, S. Lattanzi and V. Mirrokni, "Efficient algorithms for public-private social networks," in Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015.

[4] S. Stuijk, T. Basten, M. C. W. Geilen and H. Corporaal, "Multiprocessor resource allocation for throughput-constrained synchronous dataflow graphs," in Proceedings of the 44th Annual Design Automation Conference, 2007.

[5] J. Kepner, D. Bader, A. Buluç, J. Gilbert, T. Mattson and H. Meyerhenke, "Graphs, matrices, and the GraphBLAS: Seven good reasons," Procedia Computer Science, vol. 51, p. 2453–2462, 2015.

[6] J. Kepner and J. Gilbert, Graph Algorithms in the Language of Linear Algebra, SIAM, 2011.

[7] H. Liu, H. H. Huang and Y. Hu, "iBFS: Concurrent breadth-first search on GPUs," in Proceedings of the 2016 International Conference on Management of Data, 2016.

[8] M. Then, M. Kaufmann, F. Chirigati, T.-A. Hoang-Vu, K. Pham, A. Kemper, T. Neumann and H. T. Vo, "The more the merrier: Efficient multi-source graph traversal," Proceedings of the VLDB Endowment, vol. 8, p. 449–460, 2014.

[9] M. Kaufmann, M. Then, A. Kemper and T. Neumann, "Parallel array-based single- and multi-source breadth first searches on large dense graphs," in EDBT, 2017.

[10] A. McLaughlin and D. A. Bader, "Fast execution of simultaneous breadth-first searches on sparse graphs," in 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), 2015.

[11] "Volta GPU architecture whitepaper. http://www.nvidia.com/object/volta-architecture-whitepaper.html".

[12] "CUDA development Toolkit. https://developer.nvidia.com/cuda-toolkit".

[13] S. Beamer, K. Asanovic and D. Patterson, "Direction-optimizing breadth-first search," in SC'12: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, 2012.

[14] P. Wang, L. Zhang, C. Li and M. Guo, "Excavating the potential of GPU for accelerating graph traversal," in 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2019.

[15] A. E. Sarıyüce, K. Kaya, E. Saule and Ü. V. Çatalyürek, "Betweenness centrality on GPUs and heterogeneous architectures," in Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, 2013.

[16] A. McLaughlin and D. A. Bader, "Scalable and high performance betweenness centrality on the GPU," in SC'14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2014.

[17] Y. Jia, V. Lu, J. Hoberock, M. Garland and J. C. Hart, "Edge v. node parallelism for graph centrality metrics," in GPU Computing Gems Jade Edition, Elsevier, 2012, p. 15–28.
[18] Z. Shi and B. Zhang, "Fast network centrality analysis using GPUs," BMC Bioinformatics, vol. 12, p. 1–7, 2011.

[19] F. Jamour, S. Skiadopoulos and P. Kalnis, "Parallel algorithm for incremental betweenness centrality on large graphs," IEEE Transactions on Parallel and Distributed Systems, vol. 29, p. 659–672, 2017.

[20] P. Crescenzi, R. Grossi, M. Habib, L. Lanzi and A. Marino, "On computing the diameter of real-world undirected graphs," Theoretical Computer Science, vol. 514, p. 84–95, 2013.

[21] "cuSPARSE library. https://developer.nvidia.com/cusparse".

[22] "CUDA Programming Guide. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html".

[23] P. Mahonen, J. Riihijarvi and M. Petrova, "Automatic channel allocation for small wireless local area networks using graph colouring algorithm approach," in 2004 IEEE 15th International Symposium on Personal, Indoor and Mobile Radio Communications (IEEE Cat. No. 04TH8754), 2004.