Effective Bandwidth (beff) Benchmark


The algorithm of beff (version 3.2)

The effective bandwidth beff measures the accumulated bandwidth of the communication network of a parallel and/or distributed computing system. Several message sizes, communication patterns and methods are used. The algorithm uses an average to take into account that in real applications short and long messages result in different bandwidth values.

Definition of the effective bandwidth beff:

beff = logavg( logavg_(ring patterns)  ( sum_L( max_mthd( max_rep( b(ring pat.,   L, mthd, rep) ))) / 21 ),
               logavg_(random patterns)( sum_L( max_mthd( max_rep( b(random pat., L, mthd, rep) ))) / 21 ) )

with

  b(pat, L, mthd, rep)   the bandwidth measured for one communication pattern pat,
                         one message size L, one programming method mthd, and one
                         repetition rep,
  sum_L                  the sum over the 21 message sizes L used by the benchmark
                         (hence the division by 21),
  max_mthd, max_rep      the maximum over the programming methods and over the
                         repetitions of each measurement,
  logavg                 the average on the logarithmic scale (geometric mean).
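The reduction expressed by this formula can be illustrated with a small C sketch. This is not the original b_eff.c code: it assumes that the maxima over methods and repetitions have already been taken and that the resulting bandwidths are stored per pattern and message size; the names logavg, pattern_value, b_eff_reduce and N_SIZES are illustrative only.

  #include <math.h>

  #define N_SIZES 21                       /* the 21 message sizes L */

  /* average on the logarithmic scale ("logavg"), i.e. the geometric mean */
  static double logavg(const double *v, int n)
  {
      double s = 0.0;
      for (int i = 0; i < n; i++)
          s += log(v[i]);
      return exp(s / n);
  }

  /* value of one pattern: sum over the 21 message sizes, divided by 21 */
  static double pattern_value(const double b[N_SIZES])
  {
      double sum = 0.0;
      for (int l = 0; l < N_SIZES; l++)
          sum += b[l];
      return sum / N_SIZES;
  }

  /* beff = logavg( logavg over ring patterns, logavg over random patterns ) */
  double b_eff_reduce(const double b_ring[][N_SIZES],   int n_ring,
                      const double b_random[][N_SIZES], int n_random)
  {
      double ring[64], rnd[64];            /* assumes at most 64 patterns each */
      for (int p = 0; p < n_ring;   p++) ring[p] = pattern_value(b_ring[p]);
      for (int p = 0; p < n_random; p++) rnd[p]  = pattern_value(b_random[p]);
      double classes[2] = { logavg(ring, n_ring), logavg(rnd, n_random) };
      return logavg(classes, 2);
  }

Because the average is taken on the logarithmic scale, the patterns contribute multiplicatively, so a single very fast pattern cannot dominate the result.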

Details of the algorithm:

Programming methods:
The communication is programmed with several methods. This allows the effective bandwidth to be measured independently of which MPI methods are optimized on a given platform. The maximum bandwidth over the following methods is used (a short sketch of two of them follows the list):
  1. MPI_Sendrecv
  2. MPI_Alltoallv
  3. non-blocking MPI_Irecv and MPI_Isend.
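As an illustration of methods 1 and 3, the following sketch (not taken from b_eff.c) exchanges a message of L bytes with the left and right neighbour in a ring; the timing loop, the error checking and the MPI_Alltoallv variant (method 2) are omitted, and the function name ring_exchange is illustrative only.

  #include <mpi.h>
  #include <stdlib.h>

  void ring_exchange(int L, MPI_Comm comm)
  {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);
      int right = (rank + 1) % size;
      int left  = (rank - 1 + size) % size;

      char *sbuf = malloc(L), *rbuf = malloc(L);

      /* method 1: combined send and receive with MPI_Sendrecv */
      MPI_Sendrecv(sbuf, L, MPI_BYTE, right, 0,
                   rbuf, L, MPI_BYTE, left,  0, comm, MPI_STATUS_IGNORE);

      /* method 3: non-blocking MPI_Irecv and MPI_Isend, completed with MPI_Waitall */
      MPI_Request req[2];
      MPI_Irecv(rbuf, L, MPI_BYTE, left,  1, comm, &req[0]);
      MPI_Isend(sbuf, L, MPI_BYTE, right, 1, comm, &req[1]);
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

      free(sbuf);
      free(rbuf);
  }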
Communication patterns:
To produce a balanced measurement on any network topology, different communication patterns are used. As the definition above shows, they fall into two classes: ring patterns and random patterns (a rough, illustrative sketch follows).
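The exact set of patterns is defined in b_eff.c. As a rough illustration only, a one-dimensional ring pattern and a random pattern (a random permutation of the ranks, identical on all processes) could be set up as follows; ring_partners and random_partners are hypothetical helper names, not functions of the benchmark.

  #include <stdlib.h>

  /* ring pattern: process i communicates with its neighbours i-1 and i+1 (mod size) */
  void ring_partners(int rank, int size, int *left, int *right)
  {
      *left  = (rank - 1 + size) % size;
      *right = (rank + 1) % size;
  }

  /* random pattern: a random permutation p of the ranks; all processes must use
     the same seed so that they agree on the same permutation */
  void random_partners(int size, unsigned seed, int *p)
  {
      for (int i = 0; i < size; i++)
          p[i] = i;
      srand(seed);
      for (int i = size - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
          int j = rand() % (i + 1);
          int tmp = p[i]; p[i] = p[j]; p[j] = tmp;
      }
  }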


Background

The first approach, by Karl Solchenbach, Hans-Joachim Plum and Gero Ritzenhoefer [1,2], was based on the bisection bandwidth.

Due to several problems, a redesign was done. The redesign tries not to violate the rules defined by Rolf Hempel in [3] and by William Gropp and Ewing Lusk in [4].


Output of the beff Benchmark

Each run of the benchmark on a particular system results in an output file. The last line of this output file reports, for example:

b_eff = 9709.549 MB/s = 37.928 * 256 PEs with 128 MB/PE on sn6715 hwwt3e 2.0.4.71 unicosmk CRAY T3E

This line reports the effective bandwidth b_eff (here 9709.549 MB/s), the resulting bandwidth per processor (9709.549 MB/s / 256 PEs = 37.928 MB/s), the number of PEs, the memory per PE, and the system on which the benchmark was run.

The preceding sections of the output file contain mainly all measured values b(pat,L,mthd,rep) and some analysis tables. A full description of the output file is available here.


Sourcecode

b_eff.c (version 3.2)


Benchmarking

If you use this benchmark, please send us back the following information:

Additionally, for your own convenience, b_eff.c also writes the last summary line to stderr.

Some examples of how to compile and start b_eff.c are given in the first lines of b_eff.c.

Please send the mail to rabenseifner@rus.uni-stuttgart.de.


First Results

On a Cray T3E-900 with 512+32 processors and 128 MByte/processor

On Nov. 7, 1999, on sn6715 hwwt3e 2.0.4.71 unicosmk CRAY T3E, with 128 MB/PE: The measurements with 2 to 256 PEs were done while another application was running on the first 256 PEs. Currently, the 512-PE value must be computed on the basis of former measurements with release 3.1, using the one-dimensional cyclic and the random values. The MPI implementation mpt.1.3.0.2 was used, and the environment variable MPI_BUFFER_MAX=4099 was set.

size    beff [MByte/s]    beff/size [MByte/s]    summary    full protocol
512 ~ 19482. 38.05 extrapolation based on result_3.1_t3e_512b.gz
384 15526.600 40.434 result_3.2_t3e_384a.shrt result_3.2_t3e_384a.gz
256 10056.033 39.281 result_3.2_t3e_256a.shrt result_3.2_t3e_256a.gz
192 7871.336 40.997 result_3.2_t3e_192a.shrt result_3.2_t3e_192a.gz
128 5620.345 43.909 result_3.2_t3e_128a.shrt result_3.2_t3e_128a.gz
96 4180.723 43.549 result_3.2_t3e_096a.shrt result_3.2_t3e_096a.gz
64 3158.554 49.352 result_3.2_t3e_064a.shrt result_3.2_t3e_064a.gz
48 2725.891 56.789 result_3.2_t3e_048a.shrt result_3.2_t3e_048a.gz
32 1893.872 59.183 result_3.2_t3e_032a.shrt result_3.2_t3e_032a.gz
24 1522.225 63.426 result_3.2_t3e_024a.shrt result_3.2_t3e_024a.gz
16 1063.217 66.451 result_3.2_t3e_016a.shrt result_3.2_t3e_016a.gz
12 918.109 76.509 result_3.2_t3e_012a.shrt result_3.2_t3e_012a.gz
8 612.815 76.602 result_3.2_t3e_008a.shrt result_3.2_t3e_008a.gz
6 509.359 84.893 result_3.2_t3e_006a.shrt result_3.2_t3e_006a.gz
4 355.045 88.761 result_3.2_t3e_004a.shrt result_3.2_t3e_004a.gz
3 278.898 92.966 result_3.2_t3e_003a.shrt result_3.2_t3e_003a.gz
2 182.989 91.495 result_3.2_t3e_002a.shrt result_3.2_t3e_002a.gz

 
Used commands: module switch mpt mpt.1.3.0.2 
               cc -o b_eff -D MEMORY_PER_PROCESSOR=128 b_eff.c 
               export MPI_BUFFER_MAX=4099 
               mpirun -np size ./b_eff > result_3.2_t3e_size


On a NEC SX-4/32 with 32 processors and 256 MByte/processor

On Nov. 9, 1999, on SUPER-UX hwwsx4 9.1 Rev1 SX-4, with 256 MB/PE: The measurement was done on a (dedicated) resource block with 16 processors while other applications were running on the other processors (exception: the benchmark on 4 processors was done interactively with time-sharing).

size    beff [MByte/s]    beff/size [MByte/s]    summary    full protocol
16 9670.150 604.384 result_3.2_sx4_016.shrt result_3.2_sx4_016.gz
15 9493.817 632.921 result_3.2_sx4_015.shrt result_3.2_sx4_015.gz
14 9007.233 643.374 result_3.2_sx4_014.shrt result_3.2_sx4_014.gz
13 8301.263 638.559 result_3.2_sx4_013.shrt result_3.2_sx4_013.gz
12 7738.770 644.898 result_3.2_sx4_012.shrt result_3.2_sx4_012.gz
11 7129.367 648.124 result_3.2_sx4_011.shrt result_3.2_sx4_011.gz
10 6401.344 640.134 result_3.2_sx4_010.shrt result_3.2_sx4_010.gz
9 5765.670 640.630 result_3.2_sx4_009.shrt result_3.2_sx4_009.gz
8 5162.575 645.322 result_3.2_sx4_008.shrt result_3.2_sx4_008.gz
7 4535.283 647.898 result_3.2_sx4_007.shrt result_3.2_sx4_007.gz
6 3920.267 653.378 result_3.2_sx4_006.shrt result_3.2_sx4_006.gz
5 3261.534 652.307 result_3.2_sx4_005.shrt result_3.2_sx4_005.gz
4 2622.012 655.503 result_3.2_sx4_004.shrt result_3.2_sx4_004.gz
3 1983.912 661.304 result_3.2_sx4_003.shrt result_3.2_sx4_003.gz
2 1316.320 658.160 result_3.2_sx4_002.shrt result_3.2_sx4_002.gz

 
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=256 b_eff.c -lm
               mpirun -np size ./b_eff > result_3.2_sx4_size


On a NEC SX-5/8B with 8 processors and 8 GByte/processor

On Nov. 9, 1999, on SUPER-UX sx5 9.2 k SX-5/8B, preliminary measurements were done with 256 MB/PE and without the non-blocking communication method:

size    beff [MByte/s]    beff/size [MByte/s]    summary    full protocol
4 5439.199 1359.800 result_3.2_sx5_256MB_004.shrt result_3.2_sx5_256MB_004.gz
2 2662.468 1331.234 result_3.2_sx5_256MB_002.shrt result_3.2_sx5_256MB_002.gz

 
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=256 b_eff_mthd1+2.c -lm
               mpirun -np size ./b_eff > result_3.2_sx5_256MB_size


On a HP-V 9000/800/V2250 with 8 processors and 1024 MByte/processor

On Nov. 9, 1999, on HP-UX hwwhpv B.11.00 A 9000/800, with 1024 MB/PE: The measurement was done while another application was running, but with reduced priority (nice=39).

size    beff [MByte/s]    beff/size [MByte/s]    summary    full protocol
7 435.041 62.149 result_3.2_hpv_007c.shrt result_3.2_hpv_007c.gz

 
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=1024 b_eff.c -lm
               mpirun -np size ./b_eff > result_3.2_hpv_size


On a Hitachi SR 2201 with 32+8 processors and 256 MByte/processor

On Nov. 9, 1999, on HI-UX/MPP hitachi 02-03 0 SR2201, with 256 MB/PE: The measurement was done while another application was running on the other 16 PEs. All PEs were used as dedicated PEs.

size    beff [MByte/s]    beff/size [MByte/s]    summary    full protocol
16 527.805 32.988 result_3.2_SR2201_016.shrt result_3.2_SR2201_016.gz
8 276.903 34.613 result_3.2_SR2201_008.shrt result_3.2_SR2201_008.gz
4 151.928 37.982 result_3.2_SR2201_004.shrt result_3.2_SR2201_004.gz
2 80.086 40.043 result_3.2_SR2201_002.shrt result_3.2_SR2201_002.gz

 
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=1024 b_eff.c -lm
               mpirun -n size ./b_eff > result_3.2_SR2201_size


On a SwissTX T1-baby with 6*2 processors and 512 MByte/processor

On Nov. 15, 1999, on OSF1 toneb7 V5.0 910 alpha, with 512 MB/PE: The measurement was done while other applications were running on PEs that weren't used by this benchmark. All PEs were used as dedicated PEs.

size    beff [MByte/s]    beff/size [MByte/s]    summary    full protocol
12 ???.??? ??.??? result_3.3_SwissTX1baby_012a.shrt result_3.3_SwissTX1baby_012a.gz
8 97.497 12.187 result_3.3_SwissTX1baby_008a.shrt result_3.3_SwissTX1baby_008a.gz
4 49.394 12.348 result_3.3_SwissTX1baby_004a.shrt result_3.3_SwissTX1baby_004a.gz
2 25.792 12.896 result_3.3_SwissTX1baby_002a.shrt result_3.3_SwissTX1baby_002a.gz

 
Used commands: tnetcc -o b_eff -DMEMORY_PER_PROCESSOR=512 b_eff.c -lm
               bsub -Is -n size txrun b_eff > result_3.3_SwissTX1baby_size


Todo


References:

[1]
Karl Solchenbach: Benchmarking the Balance of Parallel Computers. SPEC Workshop on Benchmarking Parallel and High-Performance Computing Systems (copy of the slides), Wuppertal, Germany, Sept. 13, 1999.
[2]
Karl Solchenbach, Hans-Joachim Plum and Gero Ritzenhoefer: Pallas Effective Bandwidth Benchmark - source code and sample results ( EFF_BW.tar.gz, 43 KB)
[3]
Rolf Hempel: Basic message passing benchmarks, methodology and pitfalls. SPEC Workshop on Benchmarking Parallel and High-Performance Computing Systems (copy of the slides), Wuppertal, Germany, Sept. 13, 1999.
[4]
William Gropp and Ewing Lusk: Reproducible Measurement of MPI Performance Characteristics. In J. Dongarra et al. (eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface, proceedings of the 6th European PVM/MPI Users' Group Meeting, EuroPVM/MPI'99, Barcelona, Spain, Sept. 26-29, 1999, LNCS 1697, pp 11-18. (Summary on the web)

Links

Pallas Effective Bandwidth Benchmark     MPI at HLRS
This page: www.hlrs.de/mpi/b_eff/b_eff_3.2/

Rolf Rabenseifner