Dear MPI implementors on MPP systems,

If you want to speed up your MPI_Reduce/MPI_Allreduce implementation, this page may be for you.

I have seen that at least two implementations (MPICH, T3E) use only one protocol for REDUCE, which has minimal latency but a _bad_ bandwidth for send buffers larger than about 512 elements. For large buffers I have implemented a second protocol with the same numerical computation as in MPICH. The new protocol distributes the computation over _all_ nodes, not over 1/2, 1/4, ... of the nodes as a binary tree does. Therefore its execution time is O(log(size)) + O(count); for comparison, the binary tree in MPICH is O(log(size)) * O(count). Its bandwidth is about 2..7 times better than that of the old protocol, but its latency is worse. Therefore I use it only for counts larger than a given limit.

To guarantee the requirement of MPI 1.1, page 114, lines 20-23, the algorithm can be combined with the vendor's algorithm (for count < limit) only if the vendor's algorithm computes, e.g. for 13 nodes,

    { [(a+b)+(c+d)] + [(e+f)+(g+h)] } + { [(i+j)+k] + [l+m] }

where a,b,c,... are the buffers of rank 0,1,2,...

Limits and timings on our T3E for MPI_SUM with MPI_DOUBLE:

Reduce:
    communicator size      2      3      2**n          others
    limit [count value]    896    1536   576           736
    bandwidths [MB/s]
      new protocol         75.4   34.9   49.1 .. 33.9  27.3 .. 22.4
      old protocol         28.8   16.2   16.3 .. 4.6   11.6 .. 4.6
      ratio new/old        2.6    2.1    3.0 .. 7.4    2.4 .. 4.8

Allreduce:
    communicator size      2      3      2**n          others
    limit [count value]    448    1280   512           512
    bandwidths [MB/s]
      new protocol         74.0   29.4   42.4 .. 29.0  23.3 .. 18.7
      old protocol         20.9   14.4   13.2 .. 3.8    9.7 .. 3.4
      ratio new/old        3.5    2.0    3.0 .. 7.7    2.4 .. 5.5

The measurements were done for communicator sizes 2, 3, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256.
My implementation exports MPI_MYreduce and MPI_MYallreduce.
It uses MPI_Send, MPI_Recv, MPI_Sendrecv, and for count
The software is free, but I would like to get feedback if you are using it.
Kind regards
PS: This algorithm may be referenced under
The MPI_Allreduce algorithm is used in ScaMPI.
Performance measurements with the new MPI_Allreduce algorithm were published by Scali AS:
A few pointers to other activities on optimizing collective operations: