This is a lecture about the paper "Optimization of Collective Communication in MPICH". Department of Computer Science, Ca' Foscari University of Venice, Italy
1. Optimization of Collective Communication Operations in MPICH
Possamai Lino, 800509
Parallel Computing Lecture – February 2006
2. Introduction
Solving many scientific problems requires high computational power.
Parallel architectures were created to increase the speed of computation.
Computational speed can also be increased by optimizing the operations of the message-passing interface.
Profiling showed that more than 40% of the time spent in MPI functions was spent in the functions Reduce and Allreduce,
and 25% of that time involved a non-power-of-two number of processes.
Possamai Lino Parallel Computing Lecture 2
3. Cost model
The time taken to send a message from node i to node j is modeled as α + nβ for bidirectional communication.
α is the latency, β is the transfer cost per byte, and n is the number of bytes sent during the communication.
γ is the per-byte cost of the reduction operation computed locally.
For unidirectional communication the cost is modeled as αuni + nβuni.
The ratio fα = αuni/α characterizes the type of network; fβ = βuni/β is defined analogously for the bandwidth term.
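As an illustrative sketch (not from the slides), the α + nβ model can be evaluated numerically; the parameter values below are assumptions chosen only to show how the latency and bandwidth terms trade off between two allgather algorithms discussed later in this lecture.

```python
import math

def ring_allgather_cost(p, n, alpha, beta):
    # Ring allgather: p-1 steps, each moving n/p bytes per process.
    return (p - 1) * alpha + (p - 1) / p * n * beta

def recursive_doubling_allgather_cost(p, n, alpha, beta):
    # Recursive doubling: log2(p) steps, same total bytes per process.
    return math.log2(p) * alpha + (p - 1) / p * n * beta

# Assumed, illustrative parameters: 10 microsecond latency, 1 ns per byte.
p, n = 64, 1 << 20
alpha, beta = 10e-6, 1e-9
ring = ring_allgather_cost(p, n, alpha, beta)
rdbl = recursive_doubling_allgather_cost(p, n, alpha, beta)
# Both move the same data; recursive doubling saves only latency terms.
```

With these assumed numbers the bandwidth terms are identical, so the entire difference is (p − 1 − log2 p) latency terms.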
4. *-to-all operations
Allgather (all-to-all)
Gathers data from all nodes and distributes the combined result to all of them.
Broadcast (one-to-all)
A root node sends its data to every other node.
All-to-all
Each node sends distinct data to every other process. It differs from allgather because the data owned by each node is not part of a single shared vector.
5. Allgather
The old algorithm uses a ring method.
At each step (p−1 in total), node i sends its data to node i+1 and receives data from node i−1 (with wrap-around).
It is currently used for large/medium messages and for non-power-of-two numbers of processes.
A first optimization consists of using recursive vector doubling with distance doubling, as in the figure.
The amount of data sent by each process at step k is 2^k·n/p, where k ranges from 0 to log2 p − 1.
So the total cost is:
α log2 p + nβ(p−1)/p
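The recursive doubling scheme above can be sketched as a simulation; this is our own illustration (function name and data layout are assumptions), verifying that after log2 p exchange rounds every process holds every block.

```python
def allgather_recursive_doubling(blocks):
    # blocks: one data block per process; p assumed a power of two.
    p = len(blocks)
    data = [{i: blocks[i]} for i in range(p)]  # each rank starts with its own block
    d = 1
    steps = 0
    while d < p:
        snapshot = [dict(x) for x in data]     # simultaneous exchange
        for i in range(p):
            partner = i ^ d                    # partner at distance d
            data[i].update(snapshot[partner])  # receive everything it has so far
        d *= 2                                 # distance doubling
        steps += 1
    return data, steps

result, steps = allgather_recursive_doubling(["a", "b", "c", "d"])
```

Each round doubles both the exchange distance and the amount of data held, matching the 2^k·n/p term in the cost formula.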
6. Broadcast
The binomial tree algorithm is the old algorithm used in MPICH.
It is good for short messages because of the latency term.
Van de Geijn has proposed an algorithm for long messages that divides the message, scatters the pieces among the nodes, and finally collects them back to every node (allgather).
The total cost is:
[α log2 p + nβ(p−1)/p] + [(p−1)α + nβ(p−1)/p] = α(log2 p + p − 1) + 2nβ(p−1)/p
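The scatter-plus-allgather structure can be sketched as follows; this is a hedged illustration of the scheme (the function name is ours, and the scatter is modeled simply as splitting the root's message), not MPICH's implementation.

```python
def scatter_allgather_bcast(msg, p):
    # Van de Geijn-style broadcast sketch: split the root's message into
    # p pieces (length assumed divisible by p), scatter one piece per
    # process, then run a ring allgather to rebuild the message everywhere.
    size = len(msg) // p
    pieces = [msg[i * size:(i + 1) * size] for i in range(p)]
    have = [[None] * p for _ in range(p)]
    for i in range(p):
        have[i][i] = pieces[i]              # state after the scatter phase
    for s in range(p - 1):                  # ring allgather: p-1 steps
        for i in range(p):
            recv_idx = (i - 1 - s) % p      # piece arriving from rank i-1
            have[i][recv_idx] = pieces[recv_idx]
    return ["".join(h) for h in have]

out = scatter_allgather_bcast("abcdefgh", 4)
```

The scatter contributes the log2 p latency term; the ring allgather contributes the p−1 latency term, but each phase moves only (p−1)/p of the message per process.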
7. Reduce operations
Reduce
A root node computes a reduction function over the data gathered from all processes.
Reduce-scatter (all-to-all reduction)
A reduction in which, at the end, the result vector is scattered among the processes.
Allreduce
A reduction followed by an allgather of the resulting vector.
8. Terminology
Recursive vector halving: the vector to be reduced is
recursively halved in each step.
Recursive vector doubling: small pieces of the vector
scattered across processes are recursively gathered or
combined to form the large vector
Recursive distance halving: the distance over which
processes communicate is recursively halved at each step (p
/ 2, p / 4, ... , 1).
Recursive distance doubling: the distance over which
processes communicate is recursively doubled at each step
(1, 2, 4, ... , p / 2).
9. Reduce-scatter operation 1/2
The old algorithm implements this operation as a binomial tree reduction to rank 0, followed by a linear scatterv.
The total cost is the sum of the binomial tree reduction plus the cost of the linear scatterv:
(log2 p + p − 1)α + (log2 p + (p−1)/p)nβ + (log2 p)nγ
The choice of the best algorithm depends on whether the reduction operation is commutative or non-commutative.
For commutative reduction operations and short messages, the recursive-halving algorithm is used.
For non-commutative operations, recursive doubling is used.
10. Recursive-halving (commutative)
The implementation differs depending on whether p is a power of two.
In the first case, log2 p steps are taken, each performing bidirectional communication.
The data sent is halved at each step.
Each process sends the data needed by all processes in the other half and receives the data needed by all processes in its own half.
In the second case, the number of processes is first reduced to the nearest lower power of two, and then the recursive-halving algorithm is applied to the remaining nodes.
Finally, the result vector is distributed to the r = p − p' excluded processes.
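The power-of-two case can be sketched as a simulation; this is our own illustration (sum as the commutative operation, index bookkeeping via explicit lo/hi ranges), checking that rank i ends up with the reduced value of entry i.

```python
def reduce_scatter_halving(vectors):
    # Recursive vector/distance halving sketch for a commutative
    # reduction (sum); p assumed a power of two.
    p = len(vectors)
    data = [list(v) for v in vectors]
    lo = [0] * p          # lo[i], hi[i]: index range rank i still owns
    hi = [p] * p
    d = p // 2
    while d >= 1:
        snapshot = [list(v) for v in data]  # simultaneous exchange
        for i in range(p):
            mid = (lo[i] + hi[i]) // 2
            partner = i ^ d                 # partner at distance d
            if i & d:
                lo[i] = mid                 # keep the upper half
            else:
                hi[i] = mid                 # keep the lower half
            for k in range(lo[i], hi[i]):   # reduce the half we keep
                data[i][k] += snapshot[partner][k]
        d //= 2                             # distance halving
    return [data[i][i] for i in range(p)]
```

At each step a process adds the partner's contributions only for the half it keeps, so the data volume halves along with the distance.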
11. Recursive-doubling (non commutative)
Similar to the optimized allgather algorithm.
At each step k, from 0 to log2 p − 1, processes communicate n − 2^k·n/p data.
12. Reduce-scatter for long messages
The previous algorithms work well if the messages are short.
Otherwise, the pairwise exchange algorithm is used.
It needs p − 1 steps; in each step i, each process sends data to process (rank + i) mod p, receives data from process (rank − i) mod p, and then performs a local reduction.
The amount of data sent at each step is n/p.
It has the same bandwidth requirement as the recursive-halving algorithm.
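The pairwise exchange steps can be sketched as follows; this is our own illustration with sum as the reduction, and it works for any p, power of two or not.

```python
def reduce_scatter_pairwise(vectors):
    # Pairwise exchange sketch: p-1 steps; in step i each rank sends the
    # block destined for (rank + i) mod p, receives from (rank - i) mod p,
    # and reduces locally. Each step moves n/p data per process.
    p = len(vectors)
    acc = [vectors[r][r] for r in range(p)]   # start from own contribution
    for i in range(1, p):
        for r in range(p):
            src = (r - i) % p                 # sender to rank r in step i
            acc[r] += vectors[src][r]
    return acc
```

Over the p−1 steps every rank receives exactly one block from every other rank, for the same (p−1)/p·n bandwidth total as recursive halving.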
14. Reduce
The old algorithm uses a binomial tree that takes log2 p steps.
It is good for short messages but not the best for long messages.
The authors propose an optimized algorithm, due to Rabenseifner, that uses less bandwidth.
It is a reduce-scatter (recursive halving) followed by a binomial tree gather to the root node.
The cost is the sum of the reduce-scatter and the gather.
Good for a power-of-two number of processes.
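The two-phase structure can be sketched as below; this is our own simplified illustration (the reduce-scatter phase is computed directly rather than via recursive halving, and p is assumed a power of two), showing the binomial-tree gather of the scattered pieces to rank 0.

```python
def reduce_to_root(vectors):
    # Rabenseifner-style reduce sketch: reduce-scatter (done directly
    # here for brevity), then a binomial-tree gather to rank 0.
    p = len(vectors)
    scattered = [sum(vectors[i][k] for i in range(p)) for k in range(p)]
    pieces = [{r: scattered[r]} for r in range(p)]
    s = 1
    while s < p:                      # log2(p) gather steps
        for r in range(p):
            if r % (2 * s) == s:      # rank r sends its pieces to rank r - s
                pieces[r - s].update(pieces[r])
        s *= 2
    return [pieces[0][k] for k in range(p)]
```

In step s only ranks at odd multiples of s send, so the sender set halves each step, exactly as in a binomial tree.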
15. Reduce (non-power of two nodes)
In this case, before using the above algorithm, the number of processes must be adjusted.
It is reduced to the nearest lower power of two, so p' = 2^⌊log2 p⌋,
and the number of nodes removed is r = p − p'.
The reduction is obtained by combining half of the data of each of the first 2r nodes onto the even-ranked node of its pair.
After this step, the r even-ranked nodes of the first 2r, together with the last p − 2r nodes, form p' processes, a power of two, so the power-of-two algorithm can be applied.
Cost of this step: (1+fα)α + (1+fβ)βn/2 + γn/2
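The fold from p to p' processes can be sketched as follows. This is a deliberately simplified illustration: each odd rank folds its whole vector into the preceding even rank, whereas the slide's scheme exchanges halves in both directions; only the process-counting logic is the point here.

```python
import math

def fold_to_power_of_two(vectors):
    # Preprocessing sketch: p' = 2^floor(log2 p), r = p - p'. Among the
    # first 2r ranks, each odd rank folds its data into the even rank
    # before it; the survivors number p', a power of two.
    p = len(vectors)
    pp = 1 << int(math.log2(p))
    r = p - pp
    survivors = [[a + b for a, b in zip(vectors[i], vectors[i + 1])]
                 for i in range(0, 2 * r, 2)]
    survivors += [list(v) for v in vectors[2 * r:]]
    return survivors

reduced = fold_to_power_of_two([[1], [2], [3], [4], [5], [6]])
```

For p = 6 this gives p' = 4 and r = 2: ranks 0 and 2 absorb ranks 1 and 3, while ranks 4 and 5 continue unchanged.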
17. AllReduce (power of two nodes)
A recursive doubling algorithm is used for short messages, and for long messages with user-defined reduction operations.
For long messages with predefined reduction operations, Rabenseifner's algorithm is used.
Similar to the reduce implementation, it starts with a reduce-scatter and is followed by an allgather.
Total cost: 2α log2 p + 2nβ(p−1)/p + nγ(p−1)/p
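The reduce-scatter-plus-allgather structure can be sketched end to end; this is our own simplified illustration (the reduce-scatter phase is computed directly for brevity, the allgather uses recursive doubling, p is assumed a power of two, sum is the reduction).

```python
def allreduce_two_phase(vectors):
    # Allreduce sketch with the two-phase structure described above.
    p = len(vectors)
    # Phase 1 (reduce-scatter): rank r holds the reduced value of entry r.
    scattered = [sum(vectors[i][r] for i in range(p)) for r in range(p)]
    # Phase 2 (allgather via recursive doubling).
    have = [{r: scattered[r]} for r in range(p)]
    d = 1
    while d < p:                          # log2(p) allgather steps
        snap = [dict(h) for h in have]    # simultaneous exchange
        for r in range(p):
            have[r].update(snap[r ^ d])   # partner at distance d
        d *= 2
    return [[h[k] for k in range(p)] for h in have]
```

Every rank ends up with the full reduced vector, which is why the cost is twice the latency and bandwidth of a single phase plus one γ term.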
18. AllReduce (non power of two nodes)
Similar to the reduce implementation: after the reduction onto a power-of-two number of nodes and the recursive algorithm, an allgather operation follows.
The allgather is implemented using recursive vector doubling and distance halving among the p' = p − r participating nodes.
The r excluded nodes then require an additional step to receive the result data.
This step takes αuni + nβuni.
21. References
Thakur, R., Rabenseifner, R., and Gropp, W., "Optimization of Collective Communication Operations in MPICH", The International Journal of High Performance Computing Applications, Vol. 19, No. 1, Spring 2005, pp. 49–66.