For each message length L=1B, 2B, 4B, ... 1 MB:
Suggestion: Using L=1B, 2B, 4B, ... 2kB, 4kB, 4kB*(a**1), 4kB*(a**2), ... 4kB*(a**8) with 4kB*(a**8) = L_max and L_max = (memory per processor) / 128
Reasons: The values from 1B to 4kB reflect application messages with a fixed small length; the 8 values from 4kB to L_max reflect application messages with a length proportional to the amount of application data.
Suggestion for the averaging method: The average can be computed on a logarithmic scale.
(A worse suggestion would be to compute the average on the time spent in communication, i.e. on the reciprocal of the bandwith. This is worse because it would report on a cluster of two large fully connected multi-processor nodes mainly the interconnection bandwidth and would ignore the intraconnection on each node.)
Suggestion: The worst grouping pattern can be implemented as the slowest case of some random pattern.
The random generator should use a time-based seed. Tests on a real system (T3E) showed that this solution can produce with 10 patterns reproducible results with differences of about (to fill in later)% (3-4 tests for each 8, 16, ... 512 PEs). Weighting the worst pattern only with 20% and the cartesian pattern with 80% will reduce the differences in the results to (to fill in later)% / 5 = (to fill in later)%. This is an acceptable reproducibility.
Suggestion: Each node sends sends in each measurement messages to one or more nodes. Cyclic cartesian topologies are used and for each communication pattern.
The following communication pattern are used:
Based on different prime decompositions for successive numbers
of processors, the beff algorithm cannot guarantee
that the effective bandwidth is monotonous with the number of
processors,
see "Table of the first results with version 3.1"
below.
Using ring patterns as described in
the "Details" section.
These patterns generate nearly monotonous values.
The following table shows the results on a T3E for 52-68 processors.
Differentiating between the results of the ring patterns and the
results of the random patterns one can see that problems with
the monotony arise mainly from the small number of random
patterns.
The bold problem-values have a bandwidth
that is more than bring/size
(i.e. the maximum network performance of more than one node)
less than a former value (marked italic).
Increasing the number of random patterns
from 10 to 30 would reduce the problem significantly.
size
ring patterns
random patterns
beff
52
3844.962
2262.899
2949.705
MByte/s
53
3810.363
2253.909
2930.565
MByte/s
54
4009.288
2275.080
3020.174
MByte/s
55
3876.571
2306.710
2990.339
MByte/s
56
4013.754
2242.202
2999.941
MByte/s
57
4159.792
2258.213
3064.914
MByte/s
58
4183.254
2165.269
3009.629
MByte/s
59
4235.665
2272.635
3102.599
MByte/s
60
4231.334
2123.823
2997.767
MByte/s
61
4281.197
2137.067
3024.765
MByte/s
62
4419.872
2350.914
3223.467
MByte/s
63
4468.495
2041.979
3020.691
MByte/s
64
4583.545
2176.582
3158.554
MByte/s
65
4672.244
2287.434
3269.166
MByte/s
66
4692.184
2093.303
3134.033
MByte/s
67
4797.521
2359.445
3364.445
MByte/s
68
4752.635
2114.683
3170.223
MByte/s
Suggestion: Cyclic usage of a buffer that is larger than the cache.
E.g. send buffer length + receive buffer length = 12 * L_max = 9.4 % * (memory size per processor).
Suggestion: With MPI_Sendrecv, with MPI_Alltoallv and with non-blocking MPI_Irecv, MPI_Isend and MPI_Wait.
The bandwidth for each pattern is defined as the maximum of the three methods. The methods with MPI_Alltoallv and non-blocking communication allow to use duplex communication for the two messages of the "ping-pong". This means that in the case of the last cartesian pattern, the 6 messages in all 6 directions (+x, -x, +y, -y, +z, -z) can be sent in parallel.
Using the maximum bandwidth of all three methods, we can measure the communication network and not only the quality of the MPI implementation.
UP Effective Bandwidth Benchmark Pallas Effective Bandwidth Benchmark MPI at HLRS HLRS Navigation HLRS