Dear MPI implementors on MPP systems,
If you want to speed up your MPI_Reduce/MPI_Allreduce implementation
then this page may be for you.
I have seen that at least two implementations (MPICH, T3E) use
only one protocol for REDUCE, which has minimal latency but a _bad_
bandwidth for send buffers larger than about 512 elements.
For large buffers I have implemented a second protocol that performs
the same numerical computation as MPICH.
The new protocol distributes the computation over all nodes, and _not_,
like a binary tree, over 1/2, 1/4, ... of the nodes.
Each node therefore handles only a fraction of the buffer in each step,
so the total per-node work stays proportional to count and the
execution time goes with
O(log(size)) + O(count).
For comparison, the binary tree in MPICH goes with
O(log(size)) * O(count).
Its bandwidth is about 2..7 times better than the bandwidth of the old
protocol, but its latency is worse. Therefore I use it only for counts
larger than a given limit.
To guarantee the requirement of MPI 1.1, page 114, lines 20-23, the
algorithm can be used in combination with the vendor's algorithm (for
count < limit) only if the vendor's algorithm computes, e.g. for 13 nodes,
  { [(a+b)+(c+d)] + [(e+f)+(g+h)] } + { [(i+j)+k] + [l+m] }
where a, b, c, ... are the buffers of rank 0, 1, 2, ...
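
One way to read the protocol described above is as a reduce-scatter by
recursive halving followed by an allgather by recursive doubling. The
sketch below only illustrates that reading; it is not the shipped code.
It assumes a power-of-two communicator size, a count divisible by the
size, MPI_DOUBLE with MPI_SUM, no error handling, and it does not
reproduce the exact bracketing required above; the name
allreduce_sketch is made up for this example.

#include <mpi.h>
#include <stdlib.h>

static void allreduce_sketch(double *buf, int count, MPI_Comm comm)
{
    int rank, size;
    MPI_Status st;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    double *tmp = (double *) malloc((count / 2) * sizeof(double));
    int seg_start = 0, seg_count = count;  /* segment this rank still works on */

    /* Reduce-scatter by recursive halving: keep one half of the current
       segment, send the other half to the partner, add the received half.
       After log2(size) steps rank r holds the sum of block r of length
       count/size.                                                        */
    for (int mask = size / 2; mask > 0; mask /= 2) {
        int partner  = rank ^ mask;
        int half     = seg_count / 2;
        int keep_low = (rank < partner);      /* lower rank keeps lower half */
        int send_off = seg_start + (keep_low ? half : 0);
        int keep_off = seg_start + (keep_low ? 0 : half);

        MPI_Sendrecv(buf + send_off, half, MPI_DOUBLE, partner, 0,
                     tmp,            half, MPI_DOUBLE, partner, 0,
                     comm, &st);
        for (int i = 0; i < half; i++)
            buf[keep_off + i] += tmp[i];

        seg_start = keep_off;
        seg_count = half;
    }

    /* Allgather by recursive doubling: exchange the finished blocks until
       every rank holds the complete result (blocks of length count/size). */
    int blk = count / size, first_block = rank, nblocks = 1;
    for (int mask = 1; mask < size; mask *= 2) {
        int partner       = rank ^ mask;
        int partner_first = first_block ^ mask;

        MPI_Sendrecv(buf + first_block   * blk, nblocks * blk, MPI_DOUBLE, partner, 1,
                     buf + partner_first * blk, nblocks * blk, MPI_DOUBLE, partner, 1,
                     comm, &st);
        if (partner_first < first_block)
            first_block = partner_first;
        nblocks *= 2;
    }
    free(tmp);
}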
Limits and timings on our T3E for MPI_SUM with MPI_DOUBLE:
Reduce:                              communicator size
                                  2       3          2**n          others
  limit [count value]           896    1536           576             736
  bandwidths [MB/s]  new prot. 75.4    34.9   49.1 .. 33.9    27.3 .. 22.4
                     old prot. 28.8    16.2   16.3 ..  4.6    11.6 ..  4.6
  ratio new/old                 2.6     2.1    3.0 ..  7.4     2.4 ..  4.8
Allreduce:                           communicator size
                                  2       3          2**n          others
  limit [count value]           448    1280           512             512
  bandwidths [MB/s]  new prot. 74.0    29.4   42.4 .. 29.0    23.3 .. 18.7
                     old prot. 20.9    14.4   13.2 ..  3.8     9.7 ..  3.4
  ratio new/old                 3.5     2.0    3.0 ..  7.7     2.4 ..  5.5
The measurements were done for communicator sizes
2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, and 256.
My implementation exports MPI_MYreduce and MPI_MYallreduce.
Internally it uses MPI_Send, MPI_Recv, and MPI_Sendrecv, and for
count < limit it falls back to the vendor's MPI_Reduce / MPI_Allreduce
(a minimal usage sketch follows the file list below).
The software consists of:
- myreduce_README.asc
  gives hints on compiling and porting this algorithm to your system,
- myreduce.c
  is for MPI implementations whose handles are implemented as constants
  (usable in switch/case),
- myreduce_std.c
  is for MPI implementations whose handles are implemented as pointers
  (not usable in switch/case).
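
A minimal usage sketch, assuming that MPI_MYreduce and MPI_MYallreduce
take the same argument lists as MPI_Reduce and MPI_Allreduce; the
prototypes below are written from that assumption, and the README and
the source files in the package are authoritative:

#include <mpi.h>
#include <stdio.h>

/* Assumed prototypes, mirroring MPI_Reduce / MPI_Allreduce;
   the real declarations come with the package.              */
int MPI_MYreduce(void *sendbuf, void *recvbuf, int count,
                 MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);
int MPI_MYallreduce(void *sendbuf, void *recvbuf, int count,
                    MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);

int main(int argc, char **argv)
{
    enum { N = 4096 };            /* large enough to use the new protocol */
    double in[N], out[N];
    int rank, i;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (i = 0; i < N; i++)
        in[i] = rank + i;

    /* Same calling sequences as the standard collectives. */
    MPI_MYreduce(in, out, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    MPI_MYallreduce(in, out, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)
        printf("out[0] = %f\n", out[0]);
    MPI_Finalize();
    return 0;
}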
The software is free, but I would like to get feedback if you are using it.
Kind regards
Rolf Rabenseifner
PS: This algorithm may be referenced under
The MPI_Allreduce algorithm is used in ScaMPI.
Performance measurements with the new MPI_Allreduce algorithm were published by Scali AS:
Lars Paul Huse:
Collective Communication on Dedicated Clusters of Workstations.
In J. Dongarra et al. (eds.),
Recent Advances in Parallel Virtual Machine
and Message Passing Interface,
proceedings of the
6th European PVM/MPI Users' Group Meeting (EuroPVM/MPI'99),
Barcelona, Spain,
LNCS 1697, pp. 469-476, Sept. 1999.
A few pointers to other activities on optimizing collective operations:
- D. Payne, L. Shuler, R. van de Geijn, J. Watts:
Interprocessor Collective Communications Library (iCC)
(Thanks to Diana Wooley at MPIDC 2000 for this pointer)
- IPDPS 2000, Proceedings, IEEE, Cancun, Mexico, May 1-5, 2000:
- P. Liu and D-W. Wang:
Reduction Optimization in Heterogeneous Cluster Environments.
- J. Nolte, M.Sato, and Y. Ishikawa:
Template Based Structured Collections.
- T. Kielmann, H. Bal, and S. Gorlatch:
Bandwidth-efficient Collective Communication for Clustered Wide Area Systems.
- H.A. Chen, Y. O. Carrasco, and A.W. Apon:
MPI Collective Operations over IP Multicast.
IPDPS 2000 Workshops, J. Rolim et al. (Eds.), LNCS 1800, pp. 51-60, 2000.
- J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, and D. Weathersby:
Efficient Algorithms for All-to-All Communications in Multiport
Message-Passing Systems.
IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 11,
Nov. 1997.
- E.K. Blum, X. Whang, P. Leung:
Architectures and message-passing algorithms for cluster computing:
Design and performance.
Parallel Computing 26 (2000) pp. 313-332.
- X. Whang, E.K. Blum, D.S. Parker, D. Massey:
The dance party problem and its application to collective communication in
computer networks.
Parallel Computing 23 (1997) pp. 1144-1156.
- E. Gabriel, M. Resch, and R. Rühle:
Implementing MPI with Optimized Algorithms for Metacomputing.
Third MPI Developer's and User's Conference,
MPIDC'99, A. Skjellum et al. (Eds.), March 10-12, Atlanta, GA, USA, 1999.
- G. E. Fagg, S. S. Vadhiyar, J. Dongarra:
ACCT: Automatic Collective Communications Tuning.
EuroPVM/MPI 2000, LNCS 1908, Springer, pp. 354-362.
Longer version:
www.netlib.org/utk/people/JackDongarra/PAPERS/coll-lacsi-2001.pdf
- S. S. Vadhiyar, G. E. Fagg, J. Dongarra:
Automatically Tuned Collective Communications.
Supercomputing 2000.
www.supercomp.org/sc2000/techpapr/papers/pap.pap270.pdf
- S. S. Vadhiyar, G. E. Fagg, J. Dongarra:
Towards an Accurate Model for Collective Communications.
Computational Science - ICCS 2001, International Conference,
San Francisco, CA, USA, May 28-30, 2001. Proceedings, Part I, pp. 41-50.
www.cs.berkeley.edu/~richie/tuning/readings/vadhiyar2001-coll.ps.gz
- Nicholas T. Karonis, Bronis de Supinski, William Gropp and Ewing Lusk, Sebastien Lacour:
A Multilevel Approach to Topology-Aware Collective Operations in Computational Grids.
http://arxiv.org/pdf/cs.DC/0206038
- Nicholas T. Karonis, Brian Toonen, Ian Foster:
MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface.
http://arxiv.org/pdf/cs.DC/0206040