Effective Bandwidth (beff) Benchmark


The algorithm of beff (version 3.2)

The effective bandwidth beff measures the accumulated bandwidth of the communication network of a parallel and/or distributed computing system. Several message sizes, communication patterns and methods are used. The algorithm uses an average to take into account that in real applications short and long messages result in different bandwidth values.

Definition of the effective bandwidth beff:

beff = logavg( logavg_(ring patterns)  ( sum_L( max_mthd( max_rep( b(ring pat.,   L, mthd, rep) ))) / 21 ),
               logavg_(random patterns)( sum_L( max_mthd( max_rep( b(random pat., L, mthd, rep) ))) / 21 ) )

with

  b(pat, L, mthd, rep)   the bandwidth measured for one communication pattern pat,
                         one message size L, one programming method mthd, and one
                         repetition rep,
  sum_L                  the sum over the 21 message sizes L used by the benchmark
                         (hence the division by 21),
  max_mthd, max_rep      the maximum over the programming methods and over the
                         repetitions of each measurement,
  logavg                 the average on the logarithmic scale (geometric mean).
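The reduction expressed by this formula can be illustrated with a small C sketch. This is not the original b_eff.c code: it assumes that the maxima over methods and repetitions have already been taken and that the resulting bandwidths are stored per pattern and message size; the names logavg, pattern_value, b_eff_reduce and N_SIZES are illustrative only.

  #include <math.h>

  #define N_SIZES 21                       /* the 21 message sizes L */

  /* average on the logarithmic scale ("logavg"), i.e. the geometric mean */
  static double logavg(const double *v, int n)
  {
      double s = 0.0;
      for (int i = 0; i < n; i++)
          s += log(v[i]);
      return exp(s / n);
  }

  /* value of one pattern: sum over the 21 message sizes, divided by 21 */
  static double pattern_value(const double b[N_SIZES])
  {
      double sum = 0.0;
      for (int l = 0; l < N_SIZES; l++)
          sum += b[l];
      return sum / N_SIZES;
  }

  /* beff = logavg( logavg over ring patterns, logavg over random patterns ) */
  double b_eff_reduce(const double b_ring[][N_SIZES],   int n_ring,
                      const double b_random[][N_SIZES], int n_random)
  {
      double ring[64], rnd[64];            /* assumes at most 64 patterns each */
      for (int p = 0; p < n_ring;   p++) ring[p] = pattern_value(b_ring[p]);
      for (int p = 0; p < n_random; p++) rnd[p]  = pattern_value(b_random[p]);
      double classes[2] = { logavg(ring, n_ring), logavg(rnd, n_random) };
      return logavg(classes, 2);
  }

Because the average is taken on the logarithmic scale, the patterns contribute multiplicatively, so a single very fast pattern cannot dominate the result.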

Details of the algorithm:

Programming methods:
The communication is programmed with several methods. This allows the effective bandwidth to be measured independently of which MPI methods are optimized on a given platform. The maximum bandwidth over the following methods is used (a short sketch of two of them follows the list):
  1. MPI_Sendrecv
  2. MPI_Alltoallv
  3. non-blocking MPI_Irecv and MPI_Isend.
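As an illustration of methods 1 and 3, the following sketch (not taken from b_eff.c) exchanges a message of L bytes with the left and right neighbour in a ring; the timing loop, the error checking and the MPI_Alltoallv variant (method 2) are omitted, and the function name ring_exchange is illustrative only.

  #include <mpi.h>
  #include <stdlib.h>

  void ring_exchange(int L, MPI_Comm comm)
  {
      int rank, size;
      MPI_Comm_rank(comm, &rank);
      MPI_Comm_size(comm, &size);
      int right = (rank + 1) % size;
      int left  = (rank - 1 + size) % size;

      char *sbuf = malloc(L), *rbuf = malloc(L);

      /* method 1: combined send and receive with MPI_Sendrecv */
      MPI_Sendrecv(sbuf, L, MPI_BYTE, right, 0,
                   rbuf, L, MPI_BYTE, left,  0, comm, MPI_STATUS_IGNORE);

      /* method 3: non-blocking MPI_Irecv and MPI_Isend, completed with MPI_Waitall */
      MPI_Request req[2];
      MPI_Irecv(rbuf, L, MPI_BYTE, left,  1, comm, &req[0]);
      MPI_Isend(sbuf, L, MPI_BYTE, right, 1, comm, &req[1]);
      MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

      free(sbuf);
      free(rbuf);
  }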
Communication patterns:
To produce a balanced measurement on any network topology, different communication patterns are used. As the definition above shows, they fall into two classes: ring patterns and random patterns (a rough, illustrative sketch follows).
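The exact set of patterns is defined in b_eff.c. As a rough illustration only, a one-dimensional ring pattern and a random pattern (a random permutation of the ranks, identical on all processes) could be set up as follows; ring_partners and random_partners are hypothetical helper names, not functions of the benchmark.

  #include <stdlib.h>

  /* ring pattern: process i communicates with its neighbours i-1 and i+1 (mod size) */
  void ring_partners(int rank, int size, int *left, int *right)
  {
      *left  = (rank - 1 + size) % size;
      *right = (rank + 1) % size;
  }

  /* random pattern: a random permutation p of the ranks; all processes must use
     the same seed so that they agree on the same permutation */
  void random_partners(int size, unsigned seed, int *p)
  {
      for (int i = 0; i < size; i++)
          p[i] = i;
      srand(seed);
      for (int i = size - 1; i > 0; i--) {      /* Fisher-Yates shuffle */
          int j = rand() % (i + 1);
          int tmp = p[i]; p[i] = p[j]; p[j] = tmp;
      }
  }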


Background

The first approach, by Karl Solchenbach, Hans-Joachim Plum and Gero Ritzenhoefer [1,2], was based on the bisection bandwidth.

Due to several problems, a redesign was done. The redesign tries not to violate the rules defined by Rolf Hempel in [3] and by William Gropp and Ewing Lusk in [4].


Output of the beff Benchmark

Each run of the benchmark on a particular system results in an output file. The last line of this output file reports, for example:

b_eff = 9709.549 MB/s = 37.928 * 256 PEs with 128 MB/PE on sn6715 hwwt3e 2.0.4.71 unicosmk CRAY T3E

This line reports the effective bandwidth b_eff (here 9709.549 MB/s), the resulting bandwidth per processor (9709.549 MB/s / 256 PEs = 37.928 MB/s), the number of PEs, the memory per PE, and the system on which the benchmark was run.

The preceding sections of the output file contain mainly all measured values b(pat,L,mthd,rep) and some analysis tables. A full description of the output file is available here.


Sourcecode

b_eff.c (version 3.2)


Benchmarking

If you use this benchmark, please send us back the following information:

Additionally, for your own convenience, b_eff.c also writes the last summary line to stderr.

Some examples of how to compile and start b_eff.c are given in the first lines of b_eff.c.

Please send the mail to rabenseifner@rus.uni-stuttgart.de.


First Results

On a Cray T3E-900 with 512+32 processors and 128 MByte/processor

On Nov. 7, 1999, on sn6715 hwwt3e 2.0.4.71 unicosmk CRAY T3E, with 128 MB/PE: The measurements with 2 to 256 PEs were done while another application was running on the first 256 PEs. Currently, the 512-PE value must be computed on the basis of former measurements with release 3.1, using the one-dimensional cyclic and the random values. The MPI implementation mpt.1.3.0.2 was used, and the environment variable MPI_BUFFER_MAX=4099 was set.

size    beff [MByte/s]    beff/size [MByte/s]    summary    full protocol
512 ~ 19482. 38.05 extrapolation based on result_3.1_t3e_512b.gz
384 15526.600 40.434 result_3.2_t3e_384a.shrt result_3.2_t3e_384a.gz
256 10056.033 39.281 result_3.2_t3e_256a.shrt result_3.2_t3e_256a.gz
192 7871.336 40.997 result_3.2_t3e_192a.shrt result_3.2_t3e_192a.gz
128 5620.345 43.909 result_3.2_t3e_128a.shrt result_3.2_t3e_128a.gz
96 4180.723 43.549 result_3.2_t3e_096a.shrt result_3.2_t3e_096a.gz
64 3158.554 49.352 result_3.2_t3e_064a.shrt result_3.2_t3e_064a.gz
48 2725.891 56.789 result_3.2_t3e_048a.shrt result_3.2_t3e_048a.gz
32 1893.872 59.183 result_3.2_t3e_032a.shrt result_3.2_t3e_032a.gz
24 1522.225 63.426 result_3.2_t3e_024a.shrt result_3.2_t3e_024a.gz
16 1063.217 66.451 result_3.2_t3e_016a.shrt result_3.2_t3e_016a.gz
12 918.109 76.509 result_3.2_t3e_012a.shrt result_3.2_t3e_012a.gz
8 612.815 76.602 result_3.2_t3e_008a.shrt result_3.2_t3e_008a.gz
6 509.359 84.893 result_3.2_t3e_006a.shrt result_3.2_t3e_006a.gz
4 355.045 88.761 result_3.2_t3e_004a.shrt result_3.2_t3e_004a.gz
3 278.898 92.966 result_3.2_t3e_003a.shrt result_3.2_t3e_003a.gz
2 182.989 91.495 result_3.2_t3e_002a.shrt result_3.2_t3e_002a.gz

 
Used commands: module switch mpt mpt.1.3.0.2 
               cc -o b_eff -D MEMORY_PER_PROCESSOR=128 b_eff.c 
               export MPI_BUFFER_MAX=4099 
               mpirun -np size ./b_eff > result_3.2_t3e_size


On a NEC SX-4/32 with 32 processors and 256 MByte/processor

On Nov. 9, 1999, on SUPER-UX hwwsx4 9.1 Rev1 SX-4, with 256 MB/PE: The measurement was done on a (dedicated) resource block with 16 processors while other applications were running on the other processors (exception: the benchmark on 4 processors was done interactively with time-sharing).

size    beff [MByte/s]    beff/size [MByte/s]    summary    full protocol
16 9670.150 604.384 result_3.2_sx4_016.shrt result_3.2_sx4_016.gz
15 9493.817 632.921 result_3.2_sx4_015.shrt result_3.2_sx4_015.gz
14 9007.233 643.374 result_3.2_sx4_014.shrt result_3.2_sx4_014.gz
13 8301.263 638.559 result_3.2_sx4_013.shrt result_3.2_sx4_013.gz
12 7738.770 644.898 result_3.2_sx4_012.shrt result_3.2_sx4_012.gz
11 7129.367 648.124 result_3.2_sx4_011.shrt result_3.2_sx4_011.gz
10 6401.344 640.134 result_3.2_sx4_010.shrt result_3.2_sx4_010.gz
9 5765.670 640.630 result_3.2_sx4_009.shrt result_3.2_sx4_009.gz
8 5162.575 645.322 result_3.2_sx4_008.shrt result_3.2_sx4_008.gz
7 4535.283 647.898 result_3.2_sx4_007.shrt result_3.2_sx4_007.gz
6 3920.267 653.378 result_3.2_sx4_006.shrt result_3.2_sx4_006.gz
5 3261.534 652.307 result_3.2_sx4_005.shrt result_3.2_sx4_005.gz
4 2622.012 655.503 result_3.2_sx4_004.shrt result_3.2_sx4_004.gz
3 1983.912 661.304 result_3.2_sx4_003.shrt result_3.2_sx4_003.gz
2 1316.320 658.160 result_3.2_sx4_002.shrt result_3.2_sx4_002.gz

 
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=256 b_eff.c -lm
               mpirun -np size ./b_eff > result_3.2_sx4_size


On a NEC SX-5/8B with 8 processors and 8 GByte/processor

On Nov. 9, 1999, on SUPER-UX sx5 9.2 k SX-5/8B, preliminary measurements were done with 256 MB/PE and without the non-blocking communication method:

size    beff [MByte/s]    beff/size [MByte/s]    summary    full protocol
4 5439.199 1359.800 result_3.2_sx5_256MB_004.shrt result_3.2_sx5_256MB_004.gz
2 2662.468 1331.234 result_3.2_sx5_256MB_002.shrt result_3.2_sx5_256MB_002.gz

 
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=256 b_eff_mthd1+2.c -lm
               mpirun -np size ./b_eff > result_3.2_sx5_256MB_size


On a HP-V 9000/800/V2250 with 8 processors and 1024 MByte/processor

On Nov. 9, 1999, on HP-UX hwwhpv B.11.00 A 9000/800, with 1024 MB/PE: The measurement was done while another application was running, but with reduced priority (nice=39).

size    beff [MByte/s]    beff/size [MByte/s]    summary    full protocol
7 435.041 62.149 result_3.2_hpv_007c.shrt result_3.2_hpv_007c.gz

 
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=1024 b_eff.c -lm
               mpirun -np size ./b_eff > result_3.2_hpv_size


On a Hitachi SR 2201 with 32+8 processors and 256 MByte/processor

On Nov. 9, 1999, on HI-UX/MPP hitachi 02-03 0 SR2201, with 256 MB/PE: The measurement was done while another application was running on the other 16 PEs. All PEs were used as dedicated PEs.

size    beff [MByte/s]    beff/size [MByte/s]    summary    full protocol
16 527.805 32.988 result_3.2_SR2201_016.shrt result_3.2_SR2201_016.gz
8 276.903 34.613 result_3.2_SR2201_008.shrt result_3.2_SR2201_008.gz
4 151.928 37.982 result_3.2_SR2201_004.shrt result_3.2_SR2201_004.gz
2 80.086 40.043 result_3.2_SR2201_002.shrt result_3.2_SR2201_002.gz

 
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=1024 b_eff.c -lm
               mpirun -n size ./b_eff > result_3.2_SR2201_size


On a SwissTX T1-baby with 6*2 processors and 512 MByte/processor

On Nov. 15, 1999, on OSF1 toneb7 V5.0 910 alpha, with 512 MB/PE: The measurement was done while other applications were running on PEs that weren't used by this benchmark. All PEs were used as dedicated PEs.

size    beff [MByte/s]    beff/size [MByte/s]    summary    full protocol
12 ???.??? ??.??? result_3.3_SwissTX1baby_012a.shrt result_3.3_SwissTX1baby_012a.gz
8 97.497 12.187 result_3.3_SwissTX1baby_008a.shrt result_3.3_SwissTX1baby_008a.gz
4 49.394 12.348 result_3.3_SwissTX1baby_004a.shrt result_3.3_SwissTX1baby_004a.gz
2 25.792 12.896 result_3.3_SwissTX1baby_002a.shrt result_3.3_SwissTX1baby_002a.gz

 
Used commands: tnetcc -o b_eff -DMEMORY_PER_PROCESSOR=512 b_eff.c -lm
               bsub -Is -n size txrun b_eff > result_3.3_SwissTX1baby_size


Todo


References:

[1]
Karl Solchenbach: Benchmarking the Balance of Parallel Computers. SPEC Workshop on Benchmarking Parallel and High-Performance Computing Systems (copy of the slides), Wuppertal, Germany, Sept. 13, 1999.
[2]
Karl Solchenbach, Hans-Joachim Plum and Gero Ritzenhoefer: Pallas Effective Bandwidth Benchmark - source code and sample results ( EFF_BW.tar.gz, 43 KB)
[3]
Rolf Hempel: Basic message passing benchmarks, methodology and pitfalls. SPEC Workshop on Benchmarking Parallel and High-Performance Computing Systems (copy of the slides), Wuppertal, Germany, Sept. 13, 1999.
[4]
William Gropp and Ewing Lusk: Reproducible Measurement of MPI Performance Characteristics. In J. Dongarra et al. (eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface, proceedings of the 6th European PVM/MPI Users' Group Meeting, EuroPVM/MPI'99, Barcelona, Spain, Sept. 26-29, 1999, LNCS 1697, pp 11-18. (Summary on the web)

Links

Pallas Effective Bandwidth Benchmark     MPI at HLRS
This page: www.hlrs.de/mpi/b_eff/b_eff_3.2/

Rolf Rabenseifner