Effective Bandwidth (b_eff) Benchmark - Background

Background

First approach from Karl Solchenbach, Hans-Joachim Plum and Gero Ritzenhoefer [1,2]:
For each message length L=1B, 2B, 4B, ... 1 MB:
1. Start p/2 "Ping-Pong"s simultaneously to compute the bi-section bandwidth. All p processes are involved, p/2 as sender and p/2 as receiver.
2. Calculate the bandwidth (volume=L*p/2, time of slowest pair).
3. Average over the two grouping patterns:
  1. best case grouping: process 0 with 1, 2 with 3, ...
  2. worst case grouping: process rank 0 communicates with process rank (size-1), 1 with size-2, ...
This first approach has a few disadvantages:
- The message lengths are fixed to the 21 values given above. The maximum message length is fixed with 1 MByte. This should be redefined to reflect the encreasing memory sizes per processor of new systems.
  Suggestion: Using L=1B, 2B, 4B, ... 2kB, 4kB, 4kB*(a**1), 4kB*(a**2), ... 4kB*(a**8) with 4kB*(a**8) = L_max and L_max = (memory per processor) / 128
  Reasons: The values from 1B to 4kB reflect application messages with a fixed small length; the 8 values from 4kB to L_max reflect application messages with a length proportional to the amount of application data.
- The average of the two grouping pattern implies:
  1. Due to the arithmetic average, a system with a Ethernet-connected-cluster of dual-bord-PCs would report a b_eff = (dual-bord-bandwidth)/2 although the system can never be used for communication-intensive applications with cartesian communication patterns. The grouping patterns and the averaging method should reflect typical application patterns.
    Suggestion for the averaging method: The average can be computed on a logarithmic scale.
    (A worse suggestion would be to compute the average on the time spent in communication, i.e. on the reciprocal of the bandwith. This is worse because it would report on a cluster of two large fully connected multi-processor nodes mainly the interconnection bandwidth and would ignore the intraconnection on each node.)
  2. If one uses the worst case grouping defined above, then vendors can modify their MPI implementations to suit the MPI_COMM_WORLD ranking to the predefined grouping patterns of the benchmark. Since the worst case grouping never reflects a communication pattern of a real application, such a modification of an MPI implementation would slow down real applications while speeding up this benchmark. But a benchmark should never report better results with MPI implementations that are worse for normal real applications [3].
    Suggestion: The worst grouping pattern can be implemented as the slowest case of some random pattern.
    The random generator should use a time-based seed. Tests on a real system (T3E) showed that this solution can produce with 10 patterns reproducible results with differences of about (to fill in later)% (3-4 tests for each 8, 16, ... 512 PEs). Weighting the worst pattern only with 20% and the cartesian pattern with 80% will reduce the differences in the results to (to fill in later)% / 5 = (to fill in later)%. This is an acceptable reproducibility.
- The bi-section communication pattern sends a message only on each second process. In typical communication-intensive applications, each process communicates with several neighbors in each communication phase. The bi-section pattern does not measure hardware-communication facilities with more than a "half" link.
  Suggestion: Each node sends sends in each measurement messages to one or more nodes. Cyclic cartesian topologies are used and for each communication pattern.
  The following communication pattern are used:
  - 1-dimensional x-direction
  - 2-dimensional x-direction
  - 2-dimensional y-direction
  - 2-dimensional x-direction and y-direction
  - 3-dimensional x-direction
  - 3-dimensional y-direction
  - 3-dimensional z-direction
  - 3-dimensional x-direction and y-direction and z-direction
  If the dimension of a direction is 1 then this direction is omitted. Patterns without directions longer than 1 are omitted.
- To reduce the problem with prime numbers, the topology selection should be modified according to the algorithm described in the "Details" section.
- Open problem:
  Based on different prime decompositions for successive numbers of processors, the b_eff algorithm cannot guarantee that the effective bandwidth is monotonous with the number of processors, see "Table of the first results" below.
Additional problems:
- The bandwidth should reflect the memory-to-memory bandwidth and not the cache-to-cache bandwidth [4].
  Suggestion: Cyclic usage of a buffer that is larger than the cache.
  E.g. send buffer length + receive buffer length = 12 * L_max = 9.4 % * (memory size per processor).
- How to implement parallel sending in each process.
  Suggestion: With MPI_Sendrecv, with MPI_Alltoallv and with non-blocking MPI_Irecv, MPI_Isend and MPI_Wait.
  The bandwidth for each pattern is defined as the maximum of the three methods. The methodis with MPI_Alltoallv and non-blocking communication allow to use duplex communication for the two messages of the "ping-pong". This means that in the case of the last cartesian pattern, the 6 messages in all 6 directions (+x, -x, +y, -y, +z, -z) can be sent in parallel.
- Why not only one method?
  - MPI_Sendrecv may prohibit duplex communication.
  - MPI_Alltoallv may not scale on large systems because then most of the entries in the matrix are empty.
  - Non-blocking communication is the most general way of implementing this communication pattern and therefore the hardest way for the implementors of the MPI library.
  - MPI_Sendrecv and MPI_Alltoallv may indroduce an additional overhead if they are implemented on top of the non-blocking point-to-point routines.
  Using the maximum bandwidth of all three methods, we can measure the communication network and not only the quality of the MPI implementation.

First Results

T3E-900/512 (sn6715 hwwt3e 2.0.4.71 unicosmk CRAY T3E)

The benchmarks were done on the same partitions starting at BasePE 0xb8. At least three other applications were running on the system: 120 PEs at 0x0, 64 PEs at 0x78, and 8 PEs at 0x1f8.

PEs  b_eff          bi-section   bi-section
     version 3.1    even nodes   all nodes (extrapolated)

 32  1869.713       1253.264     1253.264   result_3.1_032a
 33  1786.342       1213.136     1251.046   result_3.1_033a
 34  1896.591       1370.455     1370.455   result_3.1_034a
 35  1873.812       1344.275     1383.812   result_3.1_035 
  "  1823.632       1338.750     1378.125   result_3.1_035a
 36  1780.339       1368.684     1368.684   result_3.1_036 
  "  1825.823       1371.348     1371.348   result_3.1_036a
 37  1868.371       1335.060     1372.145   result_3.1_037 
  "  1855.221       1332.900     1369.925   result_3.1_037a
 38  1842.249       1442.594     1442.594   result_3.1_038 
  "  1838.336       1447.135     1447.135   result_3.1_038a
 39  1891.028       1406.456     1443.468   result_3.1_039 
  "  1806.554       1405.525     1442.512   result_3.1_039a
 40  1834.020       1460.460     1460.460   result_3.1_040 
 41  1920.499       1422.660     1458.226   result_3.1_041 
 42  2019.678       1596.252     1596.252   result_3.1_042 
 43  2052.604       1561.686     1598.869   result_3.1_043 
 44  2223.587       1675.564     1675.564   result_3.1_044 
 45  2127.454       1631.718     1668.802   result_3.1_045 
 46  2205.946       1865.806     1865.806   result_3.1_046 
 47  2211.938       1836.090     1876.005   result_3.1_047 
 48  2070.615       1861.824     1861.824   result_3.1_048 
  "  2081.355       1872.696     1872.696   result_3.1_048a
128  5225.635       4857.472     4857.472   result_3.1_128a
256  9709.549       8268.800     8268.800   result_3.1_256a

The results show that b_eff version 3.1 is not monotonic (see 44-48 PEs). Further investigation on a dedicated sytem can show, whether this is an artefact due to the non-zero basePE.

SX-4/32 (SUPER-UX hwwsx4 9.1 Rev1 SX-4)

The benchmarks were done on a 8 and 16 processors dedicated to the b_eff program by a resource block. The other CPUs were used for other applications at the same time.

PEs  b_eff          bi-section   bi-section
     version 3.1    even nodes   all nodes (extrapolated)

  8  5161.896       3215.200     3215.200   result_sx4_3.1_008
 16  9996.344       6368.176     6368.176   result_sx4_3.1_016

These results show clearly, that the former bi-section benchmark could not make full use of the communication network.

Links

UP Effective Bandwidth Benchmark Pallas Effective Bandwidth Benchmark MPI at HLRS HLRS Navigation HLRS