Collective Reduction Operation on Cray X1 and Other Platforms

Rolf Rabenseifner and Panagiotis Adamidis
High-Performance Computing-Center Stuttgart (HLRS)
University of Stuttgart
Allmandring 30
D-70550 Stuttgart
Germany

http://www.hlrs.de/people/rabenseifner/ | http://www.hlrs.de/people/adamidis/

ABSTRACT:
Five years of profiling in production mode at the University of Stuttgart have shown that more than 40% of the execution time of Message Passing Interface (MPI) routines is spent in the collective communication routines MPI_Allreduce and MPI_Reduce. Although MPI implementations have been available for about 10 years and all vendors are committed to the MPI standard, the vendors' and the publicly available reduction algorithms could be accelerated with new algorithms by a factor between 3 (IBM, sum) and 100 (Cray T3E, maxloc) for long vectors. This paper presents five algorithms optimized for different choices of vector size and number of processes. The focus is on bandwidth-dominated protocols for power-of-two and non-power-of-two numbers of processes, optimizing the load balance in communication and computation. The new algorithms are also compared on the Cray X1 with the current development version of Cray's MPI library (mpt.2.4.0.0.13).
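The abstract does not spell out the five algorithms. As a rough illustration of what a bandwidth-oriented allreduce for long vectors can look like, the following sketch simulates one well-known scheme in plain Python, without any MPI library: a reduce-scatter by recursive halving followed by an allgather by recursive doubling. The function name `allreduce_sum_halving_doubling` is hypothetical, and the sketch assumes a power-of-two number of ranks, a vector length divisible by that number, and summation as the reduction operation. It is not claimed to match the paper's implementations.

```python
def allreduce_sum_halving_doubling(vectors):
    """Simulate a sum-allreduce over len(vectors) ranks (illustrative only).

    Assumptions (not taken from the paper): the number of ranks is a power
    of two and the vector length is divisible by the number of ranks.
    """
    p = len(vectors)
    if p == 0 or p & (p - 1) != 0:
        raise ValueError("number of ranks must be a power of two")
    n = len(vectors[0])
    if n % p != 0:
        raise ValueError("vector length must be divisible by the number of ranks")

    bufs = [list(v) for v in vectors]   # one private buffer per simulated rank
    lo = [0] * p                        # start of the range each rank is responsible for
    hi = [n] * p                        # end (exclusive) of that range

    # Phase 1: reduce-scatter by recursive halving.
    # In each step a rank keeps one half of its current range, adds the
    # partner's data for that half, and gives up the other half.
    d = p // 2
    while d >= 1:
        old = [list(b) for b in bufs]   # snapshot: data as it was before this step
        for r in range(p):
            partner = r ^ d
            mid = (lo[r] + hi[r]) // 2
            if r & d == 0:
                keep_lo, keep_hi = lo[r], mid   # keep the lower half
            else:
                keep_lo, keep_hi = mid, hi[r]   # keep the upper half
            for i in range(keep_lo, keep_hi):
                bufs[r][i] = old[r][i] + old[partner][i]
            lo[r], hi[r] = keep_lo, keep_hi
        d //= 2

    # Phase 2: allgather by recursive doubling.
    # Each rank now owns n/p fully reduced elements; partners exchange their
    # owned ranges, so the owned range doubles in every step.
    d = 1
    while d < p:
        old = [list(b) for b in bufs]
        old_lo, old_hi = list(lo), list(hi)
        for r in range(p):
            partner = r ^ d
            for i in range(old_lo[partner], old_hi[partner]):
                bufs[r][i] = old[partner][i]
            lo[r] = min(old_lo[r], old_lo[partner])
            hi[r] = max(old_hi[r], old_hi[partner])
        d *= 2

    return bufs


if __name__ == "__main__":
    # Four simulated ranks, each contributing a vector of eight integers.
    inputs = [[rank + i for i in range(8)] for rank in range(4)]
    for rank, result in enumerate(allreduce_sum_halving_doubling(inputs)):
        print(rank, result)
```

In this scheme every rank sends and receives half of its remaining range per step, so the amount of data moved per rank stays close to twice the vector length regardless of the number of ranks. That is why schemes of this kind are attractive when bandwidth rather than latency dominates.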

KEYWORDS:
Message Passing, MPI, Collective Operations, Reduction.

GLOBAL LINKS:
Full paper as reference, PDF document, postscript, gzip'ed postscript.
Slides as reference, PDF document. (Cray X1 results can be found only in the paper)
Benchmark used, MPI_Allreduce/Reduce and analysis software
Information about MPI from the author
Rolf Rabenseifner's list of publications