New Optimized MPI Reduce Algorithm

Dear MPI implementors on MPP systems,

If you want to speed up your MPI_Reduce/MPI_Allreduce implementation, this page may be for you.

I have seen that at least two implementations (MPICH, T3E) use only one protocol for REDUCE, one that has minimal latency but _bad_ bandwidth for send buffers larger than about 512 elements.

For large buffers I have implemented a second protocol with the same numerical computation as in MPICH.

The new protocol distributes the computation over all nodes, _not_ over 1/2, 1/4, ... of the nodes as a binary tree does. Therefore the execution time scales as O(log(size)) + O(count). For comparison, the binary tree in MPICH scales as O(log(size)) * O(count).
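
The structure can be illustrated with standard MPI collectives: a reduce-scatter phase in which every rank computes the reduced values of one slice of the buffer, followed by a gather of the slices to the root. The sketch below only illustrates this idea for MPI_DOUBLE; the function name reduce_large and the slice layout are my own assumptions, and the actual myreduce.c codes the two phases by hand with MPI_Send/MPI_Recv/MPI_Sendrecv instead of calling MPI_Reduce_scatter and MPI_Gatherv.

  /* Sketch only: large-count reduce as reduce-scatter + gather.
     Every rank reduces one slice of the buffer (the O(count) work is
     done in parallel), then the slices are gathered at the root. */
  #include <mpi.h>
  #include <stdlib.h>

  int reduce_large(double *sendbuf, double *recvbuf, int count,
                   MPI_Op op, int root, MPI_Comm comm)
  {
      int size, rank, i, off = 0;
      int *counts, *displs;
      double *slice;

      MPI_Comm_size(comm, &size);
      MPI_Comm_rank(comm, &rank);

      /* One contiguous slice per rank; the first (count % size) ranks
         get one element more. */
      counts = (int *) malloc(size * sizeof(int));
      displs = (int *) malloc(size * sizeof(int));
      for (i = 0; i < size; i++) {
          counts[i] = count / size + (i < count % size ? 1 : 0);
          displs[i] = off;
          off += counts[i];
      }

      /* Phase 1: after this call, rank i holds the fully reduced
         values of slice i of the buffer. */
      slice = (double *) malloc((counts[rank] ? counts[rank] : 1)
                                * sizeof(double));
      MPI_Reduce_scatter(sendbuf, slice, counts, MPI_DOUBLE, op, comm);

      /* Phase 2: collect the reduced slices into recvbuf at the root. */
      MPI_Gatherv(slice, counts[rank], MPI_DOUBLE,
                  recvbuf, counts, displs, MPI_DOUBLE, root, comm);

      free(slice); free(displs); free(counts);
      return MPI_SUCCESS;
  }

For an allreduce, the gather phase is replaced by an allgather, so every rank ends up with the complete result.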

Its bandwidth is about 2..7 times better than the bandwidth of the old protocol, but its latency is worse. Therefore I use it only for counts larger than a given limit.
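
The switch-over itself is just a count test. The sketch below is illustrative only: the wrapper and function names are assumptions, and the limit value is the measured T3E switch-over for MPI_DOUBLE on 2**n processes from the table further down.

  #include <mpi.h>

  /* New bandwidth-optimized protocol (see the sketch above); the name
     and the hard-coded limit are illustrative assumptions. */
  int reduce_large(double *sendbuf, double *recvbuf, int count,
                   MPI_Op op, int root, MPI_Comm comm);

  int my_reduce_double(double *sendbuf, double *recvbuf, int count,
                       MPI_Op op, int root, MPI_Comm comm)
  {
      const int limit = 576;    /* e.g. T3E, MPI_DOUBLE, 2**n processes */
      if (count < limit)        /* small buffers: latency-optimal tree reduce */
          return MPI_Reduce(sendbuf, recvbuf, count, MPI_DOUBLE,
                            op, root, comm);
      return reduce_large(sendbuf, recvbuf, count, op, root, comm);
  }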

To guarantee the requirement of MPI 1.1, page 114, lines 20-23, the algorithm can be combined with the vendor's algorithm (for count < limit) only if the vendor's algorithm computes, e.g. for 13 nodes,

   { [(a+b)+(c+d)] + [(e+f)+(g+h)] }  +  { [(i+j)+k] + [l+m] } 
where a, b, c, ... are the buffers of ranks 0, 1, 2, ...

Limits and timings on our T3E for MPI_SUM with MPI_DOUBLE:

Reduce:                                    communicator size
                                2    3         2**n         others
 
  limit [count value]          896  1536       576            736 
  bandwidths [MB/s] new prot. 75.4  34.9   49.1 .. 33.9   27.3 .. 22.4
                    old prot. 28.8  16.2   16.3 ..  4.6   11.6 ..  4.6
  ratio new/old                2.6   2.1    3.0 ..  7.4    2.4 ..  4.8

Allreduce:                                 communicator size
                                2    3         2**n         others
 
  limit [count value]          448  1280       512            512 
  bandwidths [MB/s] new prot. 74.0  29.4   42.4 .. 29.0   23.3 .. 18.7
                    old prot. 20.9  14.4   13.2 ..  3.8    9.7 ..  3.4
  ratio new/old                3.5   2.0    3.0 ..  7.7    2.4 ..  5.5
The measurements were done for communicator sizes = 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256.

My implementation exports MPI_MYreduce and MPI_MYallreduce; a usage sketch follows the list below. It uses MPI_Send, MPI_Recv, and MPI_Sendrecv, and for count < limit it falls back to the vendor's MPI_Reduce/MPI_Allreduce. The software consists of:

  • myreduce_README.asc, which gives hints on compiling and porting this algorithm to your system;
  • myreduce.c, for MPI implementations whose handles are implemented as constants (usable in switch/case);
  • myreduce_std.c, for MPI implementations whose handles are implemented as pointers (not usable in switch/case).
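
A minimal usage sketch, assuming that MPI_MYreduce and MPI_MYallreduce take the same argument lists as MPI_Reduce and MPI_Allreduce and that the header is named myreduce.h (the actual interface is documented in myreduce_README.asc):

  #include <mpi.h>
  #include "myreduce.h"          /* assumed header name */

  int main(int argc, char **argv)
  {
      enum { N = 4096 };         /* well above the switch-over limit */
      double local[N], global[N];
      int i, rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      for (i = 0; i < N; i++)
          local[i] = (double) rank;

      /* Drop-in replacements for the standard collectives (assumed
         signatures); below the limit they use the vendor's algorithm. */
      MPI_MYreduce(local, global, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      MPI_MYallreduce(local, global, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

      MPI_Finalize();
      return 0;
  }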

The software is free, but I would like to get feedback if you are using it.

Kind regards
Rolf Rabenseifner

PS: This algorithm may be referenced under

The new MPI_Allreduce algorithm is used in ScaMPI. Performance measurements with it were published by Scali AS:

    Lars Paul Huse: Collective Communication on Dedicated Clusters of Workstations. In J. Dongarra et al. (eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface, proceedings of the 6th European PVM/MPI Users' Group Meeting (EuroPVM/MPI'99), Barcelona, Spain, LNCS 1697, pp. 469-476, Sept. 1999.

A few pointers to other activities on optimizing collective operations:

  • D. Payne, L. Shuler, R. van de Geijn, J. Watts: Interprocessor Collective Communications Library (iCC)
    (Thanks to Diana Wooley at MPIDC 2000 for this pointer)
  • IPDPS 2000, Proceedings, IEEE, Cancun, Mexico, May 1-5, 2000:
    • P. Liu and D.-W. Wang: Reduction Optimization in Heterogeneous Cluster Environments.
    • J. Nolte, M. Sato, and Y. Ishikawa: Template Based Structured Collections.
    • T. Kielmann, H. Bal, and S. Gorlatch: Bandwidth-efficient Collective Communication for Clustered Wide Area Systems.
  • H.A. Chen, Y. O. Carrasco, and A.W. Apon: MPI Collective Operations over IP Multicast. IPDPS 2000 Workshops, J. Rolim et al. (Eds.), LNCS 1800, pp. 51-60, 2000.
  • J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, and D. Weathersby: Efficient Algorithms for All-to-All Communications in Multiport Message-Passing Systems. IEEE Transactions on Parallel and Distributed Systems, Vol. 8, No. 11, Nov. 1997.
  • E.K. Blum, X. Whang, P. Leung: Architectures and message-passing algorithms for cluster computing: Design and performance. Parallel Computing 26 (2000) pp. 313-332.
  • X. Whang, E.K. Blum, D.S. Parker, D. Massey: The dance party problem and its application to collective communication in computer networks. Parallel Computing 23 (1997) pp. 1144-1156.
  • E. Gabriel, M. Resch, and R. Rühle: Implementing MPI with Optimized Algorithms for Metacomputing. Third MPI Developer's and User's Conference, MPIDC'99, A. Skjellum et al. (Eds.), March 10-12, Atlanta, GA, USA, 1999.
  • G. E. Fagg, S. S. Vadhiyar, J. Dongarra: ACCT: Automatic Collective Communications Tuning. EuroPVM/MPI 2000, LNCS 1908, Springer, pp. 354-362.
    Longer version: www.netlib.org/utk/people/JackDongarra/PAPERS/coll-lacsi-2001.pdf
  • S. S. Vadhiyar, G. E. Fagg, J. Dongarra: Automatically Tuned Collective Communications. Supercomputing 2000.
    www.supercomp.org/sc2000/techpapr/papers/pap.pap270.pdf
  • S. S. Vadhiyar, G. E. Fagg, J. Dongarra: Towards an Accurate Model for Collective Communications. Computational Science - ICCS 2001, International Conference, San Francisco, CA, USA, May 28-30, 2001. Proceedings, Part I, pp. 41-50.
    www.cs.berkeley.edu/~richie/tuning/readings/vadhiyar2001-coll.ps.gz
  • Nicholas T. Karonis, Bronis de Supinski, William Gropp, Ewing Lusk, and Sebastien Lacour: A Multilevel Approach to Topology-Aware Collective Operations in Computational Grids.
    http://arxiv.org/pdf/cs.DC/0206038
  • Nicholas T. Karonis, Brian Toonen, and Ian Foster: MPICH-G2: A Grid-Enabled Implementation of the Message Passing Interface.
    http://arxiv.org/pdf/cs.DC/0206040
