Effective Bandwidth (b_eff) Benchmark - Background

Background

First approach from Karl Solchenbach, Hans-Joachim Plum and Gero Ritzenhoefer [1,2]:
For each message length L=1B, 2B, 4B, ... 1 MB:
1. Start p/2 "Ping-Pong"s simultaneously to compute the bi-section bandwidth. All p processes are involved, p/2 as sender and p/2 as receiver.
2. Calculate the bandwidth (volume=L*p/2, time of slowest pair).
3. Average over the two grouping patterns:
  1. best case grouping: process 0 with 1, 2 with 3, ...
  2. worst case grouping: process rank 0 communicates with process rank (size-1), 1 with size-2, ...

This first approach has a few disadvantages:

The message lengths are fixed to the 21 values given above. The maximum message length is fixed with 1 MByte. This should be redefined to reflect the encreasing memory sizes per processor of new systems.
Suggestion: Using L=1B, 2B, 4B, ... 2kB, 4kB, 4kB*(a**1), 4kB*(a**2), ... 4kB*(a**8) with 4kB*(a**8) = L_max and L_max = (memory per processor) / 128
Reasons: The values from 1B to 4kB reflect application messages with a fixed small length; the 8 values from 4kB to L_max reflect application messages with a length proportional to the amount of application data.
The average of the two grouping pattern implies:
1. Due to the arithmetic average, a system with a Ethernet-connected-cluster of dual-bord-PCs would report a b_eff = (dual-bord-bandwidth)/2 although the system can never be used for communication-intensive applications with cartesian communication patterns. The grouping patterns and the averaging method should reflect typical application patterns.
  Suggestion for the averaging method: The average can be computed on a logarithmic scale.
  (A worse suggestion would be to compute the average on the time spent in communication, i.e. on the reciprocal of the bandwith. This is worse because it would report on a cluster of two large fully connected multi-processor nodes mainly the interconnection bandwidth and would ignore the intraconnection on each node.)
2. If one uses the worst case grouping defined above, then vendors can modify their MPI implementations to suit the MPI_COMM_WORLD ranking to the predefined grouping patterns of the benchmark. Since the worst case grouping never reflects a communication pattern of a real application, such a modification of an MPI implementation would slow down real applications while speeding up this benchmark. But a benchmark should never report better results with MPI implementations that are worse for normal real applications [3].
  Suggestion: The worst grouping pattern can be implemented as the slowest case of some random pattern.
  The random generator should use a time-based seed. Tests on a real system (T3E) showed that this solution can produce with 10 patterns reproducible results with differences of about (to fill in later)% (3-4 tests for each 8, 16, ... 512 PEs). Weighting the worst pattern only with 20% and the cartesian pattern with 80% will reduce the differences in the results to (to fill in later)% / 5 = (to fill in later)%. This is an acceptable reproducibility.
The bi-section communication pattern sends a message only on each second process. In typical communication-intensive applications, each process communicates with several neighbors in each communication phase. The bi-section pattern does not measure hardware-communication facilities with more than a "half" link.
Suggestion: Each node sends sends in each measurement messages to one or more nodes. Cyclic cartesian topologies are used and for each communication pattern.
The following communication pattern are used:
- 1-dimensional x-direction
- 2-dimensional x-direction
- 2-dimensional y-direction
- 2-dimensional x-direction and y-direction
- 3-dimensional x-direction
- 3-dimensional y-direction
- 3-dimensional z-direction
- 3-dimensional x-direction and y-direction and z-direction
If the dimension of a direction is 1 then this direction is omitted. Patterns without directions longer than 1 are omitted.
To reduce the problem with prime numbers, the topology selection should be modified according to the following algorithm: If the number of nodes is larger than 8 then the size of the each topology is reduced until each dimension is larger than 1. If the number of nodes is larger than 4 then the size for the 2- and 3-dimensional topology is reduced to the next even number. If the dimension of a direction is 1 then this direction is omitted. Patterns without directions longer than 1 are omitted.
Which patterns:
Based on different prime decompositions for successive numbers of processors, the b_eff algorithm cannot guarantee that the effective bandwidth is monotonous with the number of processors, see "Table of the first results with version 3.1" below.

Second proposal -- realized in version 3.2 now:

Using ring patterns as described in the "Details" section. These patterns generate nearly monotonous values. The following table shows the results on a T3E for 52-68 processors.

Differentiating between the results of the ring patterns and the results of the random patterns one can see that problems with the monotony arise mainly from the small number of random patterns.

size ring patterns random patterns b_eff
52 3844.962 2262.899 2949.705 MByte/s
53 3810.363 2253.909 2930.565 MByte/s
54 4009.288 2275.080 3020.174 MByte/s
55 3876.571 2306.710 2990.339 MByte/s
56 4013.754 2242.202 2999.941 MByte/s
57 4159.792 2258.213 3064.914 MByte/s
58 4183.254 2165.269 3009.629 MByte/s
59 4235.665 2272.635 3102.599 MByte/s
60 4231.334 2123.823 2997.767 MByte/s
61 4281.197 2137.067 3024.765 MByte/s
62 4419.872 2350.914 3223.467 MByte/s
63 4468.495 2041.979 3020.691 MByte/s
64 4583.545 2176.582 3158.554 MByte/s
65 4672.244 2287.434 3269.166 MByte/s
66 4692.184 2093.303 3134.033 MByte/s
67 4797.521 2359.445 3364.445 MByte/s
68 4752.635 2114.683 3170.223 MByte/s

The bold problem-values have a bandwidth that is more than b_ring/size (i.e. the maximum network performance of more than one node) less than a former value (marked italic). Increasing the number of random patterns from 10 to 30 would reduce the problem significantly.

Additional problems:
- The bandwidth should reflect the memory-to-memory bandwidth and not the cache-to-cache bandwidth [4].
  Suggestion: Cyclic usage of a buffer that is larger than the cache.
  E.g. send buffer length + receive buffer length = 12 * L_max = 9.4 % * (memory size per processor).
- How to implement parallel sending in each process.
  Suggestion: With MPI_Sendrecv, with MPI_Alltoallv and with non-blocking MPI_Irecv, MPI_Isend and MPI_Wait.
  The bandwidth for each pattern is defined as the maximum of the three methods. The methods with MPI_Alltoallv and non-blocking communication allow to use duplex communication for the two messages of the "ping-pong". This means that in the case of the last cartesian pattern, the 6 messages in all 6 directions (+x, -x, +y, -y, +z, -z) can be sent in parallel.
- Why not only one method?
  - MPI_Sendrecv may prohibit duplex communication.
  - MPI_Alltoallv may not scale on large systems because then most of the entries in the matrix are empty.
  - Non-blocking communication is the most general way of implementing this communication pattern and therefore the hardest way for the implementors of the MPI library.
  - MPI_Sendrecv and MPI_Alltoallv may indroduce an additional overhead if they are implemented on top of the non-blocking point-to-point routines.
  Using the maximum bandwidth of all three methods, we can measure the communication network and not only the quality of the MPI implementation.
- On some systems Isend and Irecv have different latencies. Therefore it might be better to split this method into two methods:
  - MPI_Irecv + MPI_Send + MPI_Waitall, and
  - MPI_Isend _ MPI_Recv + MPI_Waitall.
  But this approach does not allow the transfer of all messages at the same time in the background, i.e. the calls to MPI_Send (and MPI_Recv) are serialized if the implementation uses synchronous transfer mode.

Links

UP Effective Bandwidth Benchmark Pallas Effective Bandwidth Benchmark MPI at HLRS HLRS Navigation HLRS

Rolf Rabenseifner

size	b_eff	b_eff/size	summary	full protocol
68	3170.223	46.621	result_3.2_t3e_068a.shrt	result_3.2_t3e_068a.gz
67	3364.445	50.216	result_3.2_t3e_067a.shrt	result_3.2_t3e_067a.gz
66	3134.033	47.485	result_3.2_t3e_066a.shrt	result_3.2_t3e_066a.gz
65	3269.166	50.295	result_3.2_t3e_065a.shrt	result_3.2_t3e_065a.gz
64	3158.554	49.352	result_3.2_t3e_064a.shrt	result_3.2_t3e_064a.gz
63	3020.691	47.947	result_3.2_t3e_063a.shrt	result_3.2_t3e_063a.gz
62	3223.467	51.991	result_3.2_t3e_062a.shrt	result_3.2_t3e_062a.gz
61	3024.765	49.586	result_3.2_t3e_061a.shrt	result_3.2_t3e_061a.gz
60	2997.767	49.963	result_3.2_t3e_060a.shrt	result_3.2_t3e_060a.gz
59	3102.599	52.586	result_3.2_t3e_059a.shrt	result_3.2_t3e_059a.gz
58	3009.629	51.890	result_3.2_t3e_058a.shrt	result_3.2_t3e_058a.gz
57	3064.914	53.770	result_3.2_t3e_057a.shrt	result_3.2_t3e_057a.gz
56	2999.941	53.570	result_3.2_t3e_056a.shrt	result_3.2_t3e_056a.gz
55	2990.339	54.370	result_3.2_t3e_055a.shrt	result_3.2_t3e_055a.gz
54	3020.174	55.929	result_3.2_t3e_054a.shrt	result_3.2_t3e_054a.gz
53	2930.565	55.294	result_3.2_t3e_053a.shrt	result_3.2_t3e_053a.gz
52	2949.705	56.725	result_3.2_t3e_052a.shrt	result_3.2_t3e_052a.gz

size	ring patterns	random patterns	b_eff
52	3844.962	2262.899	2949.705	MByte/s
53	3810.363	2253.909	2930.565	MByte/s
54	4009.288	2275.080	3020.174	MByte/s
55	3876.571	2306.710	2990.339	MByte/s
56	4013.754	2242.202	2999.941	MByte/s
57	4159.792	2258.213	3064.914	MByte/s
58	4183.254	2165.269	3009.629	MByte/s
59	4235.665	2272.635	3102.599	MByte/s
60	4231.334	2123.823	2997.767	MByte/s
61	4281.197	2137.067	3024.765	MByte/s
62	4419.872	2350.914	3223.467	MByte/s
63	4468.495	2041.979	3020.691	MByte/s
64	4583.545	2176.582	3158.554	MByte/s
65	4672.244	2287.434	3269.166	MByte/s
66	4692.184	2093.303	3134.033	MByte/s
67	4797.521	2359.445	3364.445	MByte/s
68	4752.635	2114.683	3170.223	MByte/s

Effective Bandwidth (beff) Benchmark - Background

Background

Links

Effective Bandwidth (b_eff) Benchmark - Background