- First approach from Karl Solchenbach,
Hans-Joachim Plum and Gero Ritzenhoefer
[1,2]:
For each message length L=1B, 2B, 4B, ... 1 MB:

- Start p/2 "Ping-Pong"s simultaneously to compute the bi-section bandwidth. All p processes are involved, p/2 as sender and p/2 as receiver.
- Calculate the bandwidth (volume=L*p/2, time of slowest pair).
- Average over the two grouping patterns:
- best case grouping: process 0 with 1, 2 with 3, ...
- worst case grouping: process rank 0 communicates with process rank (size-1), 1 with size-2, ...

- This first approach has a few disadvantages:
- The message lengths are fixed to the 21 values given above.
The maximum message length is fixed with 1 MByte.
This should be redefined to reflect the encreasing
memory sizes per processor of new systems.
**Suggestion:**Using L=1B, 2B, 4B, ... 2kB, 4kB, 4kB*(a**1), 4kB*(a**2), ... 4kB*(a**8) with 4kB*(a**8) = L_max and L_max = (memory per processor) / 128Reasons: The values from 1B to 4kB reflect application messages with a fixed small length; the 8 values from 4kB to L_max reflect application messages with a length proportional to the amount of application data.

- The average of the two grouping pattern implies:
- Due to the arithmetic average, a system with
a Ethernet-connected-cluster of dual-bord-PCs
would report a b
_{eff}= (dual-bord-bandwidth)/2 although the system can never be used for communication-intensive applications with cartesian communication patterns. The grouping patterns and the averaging method should reflect typical application patterns.**Suggestion**for the averaging method: The average can be computed on a logarithmic scale.(A worse suggestion would be to compute the average on the time spent in communication, i.e. on the reciprocal of the bandwith. This is worse because it would report on a cluster of two large fully connected multi-processor nodes mainly the interconnection bandwidth and would ignore the intraconnection on each node.)

- If one uses the worst case grouping defined above,
then vendors can modify their MPI implementations
to suit the MPI_COMM_WORLD ranking to the predefined
grouping patterns of the benchmark.
Since the worst case grouping never reflects a communication
pattern of a real application, such a modification of
an MPI implementation would slow down real applications
while speeding up this benchmark.
But a benchmark should never report better results
with MPI implementations that are worse for normal
real applications
[3].
**Suggestion:**The worst grouping pattern can be implemented as the slowest case of some random pattern.The random generator should use a time-based seed. Tests on a real system (T3E) showed that this solution can produce with 10 patterns reproducible results with differences of about

*(to fill in later)*% (3-4 tests for each 8, 16, ... 512 PEs). Weighting the worst pattern only with 20% and the cartesian pattern with 80% will reduce the differences in the results to*(to fill in later)*% / 5 =*(to fill in later)*%. This is an acceptable reproducibility.

- Due to the arithmetic average, a system with
a Ethernet-connected-cluster of dual-bord-PCs
would report a b
- The bi-section communication pattern sends a message only on
each second process.
In typical communication-intensive applications,
each process communicates with several neighbors
in each communication phase.
The bi-section pattern does not measure hardware-communication
facilities with more than a "half" link.
**Suggestion:**Each node sends sends in each measurement messages to one or more nodes. Cyclic cartesian topologies are used and for each communication pattern.The following communication pattern are used:

- 1-dimensional x-direction
- 2-dimensional x-direction
- 2-dimensional y-direction
- 2-dimensional x-direction and y-direction
- 3-dimensional x-direction
- 3-dimensional y-direction
- 3-dimensional z-direction
- 3-dimensional x-direction and y-direction and z-direction

- To reduce the problem with prime numbers, the topology selection
should be modified according to the algorithm described
in the "Details" section.
**Open problem:**Based on different prime decompositions for successive numbers of processors, the b

_{eff}algorithm cannot guarantee that the effective bandwidth is monotonous with the number of processors, see "Table of the first results" below.

- The message lengths are fixed to the 21 values given above.
The maximum message length is fixed with 1 MByte.
This should be redefined to reflect the encreasing
memory sizes per processor of new systems.
- Additional problems:
- The bandwidth should reflect the memory-to-memory bandwidth
and not the cache-to-cache bandwidth
[4].
**Suggestion:**Cyclic usage of a buffer that is larger than the cache.E.g. send buffer length + receive buffer length = 12 * L_max = 9.4 % * (memory size per processor).

- How to implement parallel sending in each process.
**Suggestion:**With MPI_Sendrecv, with MPI_Alltoallv and with non-blocking MPI_Irecv, MPI_Isend and MPI_Wait.The bandwidth for each pattern is defined as the maximum of the three methods. The methodis with MPI_Alltoallv and non-blocking communication allow to use duplex communication for the two messages of the "ping-pong". This means that in the case of the last cartesian pattern, the 6 messages in all 6 directions (+x, -x, +y, -y, +z, -z) can be sent in parallel.

- Why not only one method?
- MPI_Sendrecv may prohibit duplex communication.
- MPI_Alltoallv may not scale on large systems because then most of the entries in the matrix are empty.
- Non-blocking communication is the most general way of implementing this communication pattern and therefore the hardest way for the implementors of the MPI library.
- MPI_Sendrecv and MPI_Alltoallv may indroduce an additional overhead if they are implemented on top of the non-blocking point-to-point routines.

Using the maximum bandwidth of all three methods, we can measure the communication network and not only the quality of the MPI implementation.

- The bandwidth should reflect the memory-to-memory bandwidth
and not the cache-to-cache bandwidth
[4].

The benchmarks were done on the same partitions starting at BasePE 0xb8. At least three other applications were running on the system: 120 PEs at 0x0, 64 PEs at 0x78, and 8 PEs at 0x1f8.

PEs b_eff bi-section bi-section version 3.1 even nodes all nodes (extrapolated) 32 1869.713 1253.264 1253.264 result_3.1_032a 33 1786.342 1213.136 1251.046 result_3.1_033a 34 1896.591 1370.455 1370.455 result_3.1_034a 35 1873.812 1344.275 1383.812 result_3.1_035 " 1823.632 1338.750 1378.125 result_3.1_035a 36 1780.339 1368.684 1368.684 result_3.1_036 " 1825.823 1371.348 1371.348 result_3.1_036a 37 1868.371 1335.060 1372.145 result_3.1_037 " 1855.221 1332.900 1369.925 result_3.1_037a 38 1842.249 1442.594 1442.594 result_3.1_038 " 1838.336 1447.135 1447.135 result_3.1_038a 39 1891.028 1406.456 1443.468 result_3.1_039 " 1806.554 1405.525 1442.512 result_3.1_039a 40 1834.020 1460.460 1460.460 result_3.1_040 41 1920.499 1422.660 1458.226 result_3.1_041 42 2019.678 1596.252 1596.252 result_3.1_042 43 2052.604 1561.686 1598.869 result_3.1_043 44 2223.587 1675.564 1675.564 result_3.1_044 45 2127.454 1631.718 1668.802 result_3.1_045 46 2205.946 1865.806 1865.806 result_3.1_046 47 2211.938 1836.090 1876.005 result_3.1_047 48 2070.615 1861.824 1861.824 result_3.1_048 " 2081.355 1872.696 1872.696 result_3.1_048a 128 5225.635 4857.472 4857.472 result_3.1_128a 256 9709.549 8268.800 8268.800 result_3.1_256a

The results show that b_{eff} version 3.1 is not monotonic
(see 44-48 PEs).
Further investigation on a dedicated sytem can
show, whether this is an artefact due to the non-zero basePE.

PEs b_eff bi-section bi-section version 3.1 even nodes all nodes (extrapolated) 8 5161.896 3215.200 3215.200 result_sx4_3.1_008 16 9996.344 6368.176 6368.176 result_sx4_3.1_016

These results show clearly, that the former bi-section benchmark could not make full use of the communication network.

UP Effective Bandwidth Benchmark Pallas Effective Bandwidth Benchmark MPI at HLRS HLRS Navigation HLRS