The effective bandwidth b_eff measures the accumulated bandwidth of the communication network of parallel and/or distributed computing systems. Several message sizes, communication patterns and methods are used. The algorithm uses an average to take into account that in real applications short and long messages are transferred with different bandwidths.
b_eff = logavg( logavg_{ring patterns}  ( (sum_L max_mthd max_rep b(ring pattern, L, mthd, rep))   / 21 ),
                logavg_{random patterns}( (sum_L max_mthd max_rep b(random pattern, L, mthd, rep)) / 21 ) )
with logavg denoting the logarithmic average (i.e. the geometric mean), logavg(b_1, ..., b_n) = exp( (1/n) * sum_i ln(b_i) ), and L denoting the message length.
In each ring, the processes are sorted by their ranks in the topology mentioned above.
Further details are described in the technical section.
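A minimal sketch of this logarithmic average in C (the language of b_eff.c); the function name and interface here are ours, for illustration only:

#include <math.h>

/* Logarithmic average (geometric mean) of n positive bandwidth values;
   illustrative only, not the identifier used in b_eff.c. */
double logavg(const double *b, int n)
{
    double sum = 0.0;
    int i;
    for (i = 0; i < n; i++)
        sum += log(b[i]);
    return exp(sum / n);
}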
The effective bandwidth is the number of MPI processes, multiplied by the asymptotic bandwidth, multiplied by the ratio of the area under the curve "bandwidth over message length" to the area under the constant asymptotic-bandwidth curve in the same diagram.
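As a purely hypothetical numeric illustration (invented values, not measurements): with 16 processes, an asymptotic bandwidth of 100 MByte/s per process, and an area ratio of 0.4, this interpretation yields b_eff = 16 * 100 MByte/s * 0.4 = 640 MByte/s.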
To measure the bandwidth, several communication patterns
are applied.
The patterns are based on rings and on random distributions.
The logarithmic average over all ring patterns and over all random patterns is computed, and b_eff is the logarithmic average of these two values.
The communication is implemented in three different ways with MPI, and for each single measurement the maximum bandwidth over all three methods is used.
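As an illustration of what "three different ways" can mean in MPI, here is a sketch of one neighbor exchange in a ring. The choice of MPI_Sendrecv, non-blocking MPI_Isend/MPI_Irecv and MPI_Alltoallv is an assumption based on hints later in this document (the "non-blocking communication method" and the remark on the Alltoallv latency); the actual code in b_eff.c may differ in detail:

#include <mpi.h>
#include <stdlib.h>

void exchange(char *snd, char *rcv, int L, int left, int right,
              int method, MPI_Comm comm)
{
    if (method == 0) {                   /* blocking send-receive */
        MPI_Sendrecv(snd, L, MPI_BYTE, right, 0,
                     rcv, L, MPI_BYTE, left, 0, comm, MPI_STATUS_IGNORE);
    } else if (method == 1) {            /* non-blocking */
        MPI_Request req[2];
        MPI_Irecv(rcv, L, MPI_BYTE, left, 0, comm, &req[0]);
        MPI_Isend(snd, L, MPI_BYTE, right, 0, comm, &req[1]);
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
    } else {                             /* collective: counts are zero
                                            except for the two neighbors */
        int size, *scnt, *rcnt, *sdsp, *rdsp;
        MPI_Comm_size(comm, &size);
        scnt = calloc(size, sizeof(int));  rcnt = calloc(size, sizeof(int));
        sdsp = calloc(size, sizeof(int));  rdsp = calloc(size, sizeof(int));
        scnt[right] = L;  rcnt[left] = L;  /* both use displacement 0 */
        MPI_Alltoallv(snd, scnt, sdsp, MPI_BYTE,
                      rcv, rcnt, rdsp, MPI_BYTE, comm);
        free(scnt); free(rcnt); free(sdsp); free(rdsp);
    }
}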
For the ratio mentioned above, the bandwidth is plotted over the message length, with the message-length values placed equidistantly on the abscissa, i.e. along two logarithmic scales: one from 1 byte to 4 kbyte (12 intervals), and the next from 4 kbyte to the maximum message length (8 further values, giving the 21 message sizes that explain the divisor 21 in the formula above).
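A sketch of how such a set of message lengths can be generated; the value of Lmax below is an arbitrary example, in b_eff the maximum depends on the memory per processor:

#include <math.h>
#include <stdio.h>

/* Sketch: 13 lengths from 1 byte to 4 kbyte (12 logarithmic intervals),
   then 8 further logarithmically equidistant lengths up to Lmax,
   i.e. 21 message sizes in total. Lmax = 1 MByte is an assumed
   example value only. */
int main(void)
{
    double Lmax = 1048576.0;
    int i;
    for (i = 0; i <= 12; i++)
        printf("%.0f\n", pow(2.0, (double)i));   /* 1, 2, 4, ..., 4096 */
    for (i = 1; i <= 8; i++)
        printf("%.0f\n", 4096.0 * pow(Lmax / 4096.0, i / 8.0));
    return 0;
}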
A first approach by Karl Solchenbach, Hans-Joachim Plum and Gero Ritzenhoefer [1,2] was based on the bisection bandwidth.
Due to several problems, the benchmark was redesigned.
The redesign tries not to violate the rules defined by Rolf Hempel in [3] and by William Gropp and Ewing Lusk in [4].
Each run of the benchmark on a particular system results in
an output file.
The last line of this output file reports, e.g.:
b_eff = 9709.549 MB/s = 37.928 * 256 PEs
with 128 MB/PE
on sn6715 hwwt3e 2.0.4.71 unicosmk CRAY T3E
This line reports the effective bandwidth of the whole system (9709.549 MB/s), the same value expressed per process (37.928 MB/s on each of the 256 PEs), the memory per processor (128 MB/PE), and the system on which the benchmark was run.
If you use this benchmark, please send back the resulting output file.
Additionally, for your convenience, b_eff.c also writes the last summary line to stderr.
Some examples on how to compile and start b_eff.c are given
in the next sections.
In all cases one has to choose the correct memory size value
(in MBytes).
The syntax for setting the CPP macro MEMORY_PER_PROCESSOR may differ,
e.g. with or without a blank after the -D option.
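For example, both forms occur in the command listings below:

cc -o b_eff -DMEMORY_PER_PROCESSOR=128 b_eff.c
mpicc -o b_eff -D MEMORY_PER_PROCESSOR=256 b_eff.c -lm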
Please send the mail to
rabenseifner@rus.uni-stuttgart.de.
Size and b_eff values are highlighted if the measurement evaluates the whole system.
On Nov. 7, 1999, on sn6715 hwwt3e 2.0.4.71 unicosmk CRAY T3E, with 128 MB/PE:
The measurements with 2 to 256 PEs were done while another application was running on the first 256 PEs.
Currently, the 512 PE value must be computed on the basis of former measurements with release 3.1, using the 1-dimensional cyclic and the random values.
The MPI implementation mpt.1.3.0.2 was used, and the environment variable MPI_BUFFER_MAX=4099 was set.
On Nov. 29, 1999, on HI-UX/MPP himiko 03-00 ad2b0 SR8000, with 1 GB/PE:
The measurements were done with exclusively used PEs.
The ping-pong measurement is done with the first two MPI processes in each b_eff configuration.
The MPI implementation does not use the topology information given by the b_eff benchmark program and by default allocates the process ranks in round-robin order over the nodes (so neighboring ranks almost always reside on different nodes).
This results in a bad efficiency, see the second part of the table.
Two measurements were taken twice, see ..._b.shrt files.
By using the multi-command interface of mpiexec, one can explicitly allocate contiguous intervals of process ranks in each SR8000 node, see the first part of the table.
On Nov. 9, 1999, on HI-UX/MPP hitachi 02-03 0 SR2201, with 256 MB/PE:
The measurement was done while another application was running on the
other 16 PEs. All PEs were used as dedicated PEs.
On Nov. 15, 1999, on OSF1 toneb7 V5.0 910 alpha, with 512 MB/PE:
The measurement was done while other applications were running on
PEs that weren't used by this benchmark.
All PEs were used as dedicated PEs.
On Nov. 25/26, 1999, two measurements were done on dedicated processors:
The b_eff benchmark is not well-suited for shared memory systems:
on hierarchical systems, OpenMP should be used inside the shared memory nodes and MPI should be used between the shared memory nodes.
It is currently under discussion to extend the b_eff benchmark with an OpenMP-based memory copying between the processors inside a shared memory node.
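As a purely illustrative sketch of what such an OpenMP-based memory copy could look like (this is not part of b_eff.c; the function and its interface are our assumption):

#include <omp.h>
#include <string.h>

/* Illustrative only: copy n bytes between two buffers located in the
   shared memory of one node, with the work split across OpenMP threads. */
void omp_copy(char *dst, const char *src, long n)
{
    #pragma omp parallel
    {
        long nt = omp_get_num_threads();
        long t  = omp_get_thread_num();
        long chunk = (n + nt - 1) / nt;            /* bytes per thread */
        long lo  = t * chunk;
        long len = (lo >= n) ? 0 : (lo + chunk > n ? n - lo : chunk);
        if (len > 0)
            memcpy(dst + lo, src + lo, (size_t)len);
    }
}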
On Nov. 9, 1999, on SUPER-UX hwwsx4 9.1 Rev1 SX-4, with 256 MB/PE:
The measurement was done on a (dedicated) resource block with 16 processors, while other applications were running on the other processors (exception: the benchmark on 4 processors was run interactively with time-sharing).
On Nov. 9, 1999, on SUPER-UX sx5 9.2 k SX-5/8B,
preliminary measurements were done with 256 MB/PE and without
the non-blocking communication method:
On Nov. 9, 1999, on HP-UX hwwhpv B.11.00 A 9000/800, with 1024 MB/PE:
The measurement was done while another application was running, but
with reduced priority (nice=39).
On Nov. 17, 1999, on sn9626 athos 10.0.0.6 eth.3 CRAY SV1, with 512 MB/PE:
The measurement was done in time-sharing mode
while other applications were running on the system.
Looking at the results, one can see that for 2, 4 and 8 nodes the benchmark was scheduled nearly as on dedicated processors.
For 12 and 15 nodes, one can see that taking the maximum over all methods and repetitions yields reproducible bandwidth values.
The measurement on 15 processors was done with looplengthmax reduced to 30 in order to shorten the total execution time.
A measurement on all 16 PEs was not possible due to other
applications running on the system with lower priority.
Summary:
Background
Output of the b_eff Benchmark
The preceding sections of the output file contain mainly all measurement values b(pat, L, mthd, rep) and some analysis tables.
A full description of the output file is available
here.
Sourcecode
Benchmarking
First Results
Distributed Memory Systems
On a Cray T3E-900 with 512+32 processors and 128 MByte/processor
Used commands: module switch mpt mpt.1.3.0.2
cc -o b_eff -DMEMORY_PER_PROCESSOR=128 b_eff.c
export MPI_BUFFER_MAX=4099
mpirun -np size ./b_eff result_3.2_t3e_size
MPI release: mpt.1.3.0.2
Execution time < 225 sec
On a Hitachi SR 8000 with 24 processors on 3 nodes and 1 GByte/processor
Used commands: mpicc -o b_eff -DMEMORY_PER_PROCESSOR=1024 b_eff.c -lm
hpstatus
limit datasize 500000
explicit allocation of contiguous ranks on 3 nodes:
mpiexec -p NODE0 -N 1 -n size/3 ./b_eff result_3.3_SR8000_1GB_003nodes_sizePEs_c \
: -p NODE1 -N 1 -n size/3 ./b_eff \
: -p NODE2 -N 1 -n size/3 ./b_eff
contiguous ranks on 2 nodes:
mpiexec -p NODE0 -N 1 -n size/2 ./b_eff result_3.3_SR8000_1GB_002nodes_sizePEs_c \
: -p NODE1 -N 1 -n size/2 ./b_eff
contiguous ranks on 1 node:
mpiexec -p NODE0 -N 1 -n size/1 ./b_eff result_3.3_SR8000_1GB_001nodes_sizePEs_c
special additional options
64: option -lp64, used on mpicc
SS: environment variable JOBTYPE=SS, set while executing mpiexec
default round-robin allocation:
mpiexec -p ALL -N nodes -n size ./b_eff result_3.3_SR8000_1GB_nodesnodes_sizePEs
MPI release: P-1811-1113, HI-UX/MPP
Execution time < 150 sec
On a Hitachi SR 2201 with 32+8 processors and 256 MByte/processor
size   b_eff (MByte/s)   b_eff/size (MByte/s)   summary                      full protocol
 16    527.805           32.988                 result_3.2_SR2201_016.shrt   result_3.2_SR2201_016.gz
  8    276.903           34.613                 result_3.2_SR2201_008.shrt   result_3.2_SR2201_008.gz
  4    151.928           37.982                 result_3.2_SR2201_004.shrt   result_3.2_SR2201_004.gz
  2     80.086           40.043                 result_3.2_SR2201_002.shrt   result_3.2_SR2201_002.gz
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=1024 b_eff.c -lm
mpirun -n size ./b_eff result_3.2_SR2201_size
MPI release: P-1811-1112, HI-UX/MPP 02-01 (based on MPICH Version 1.0.11)
On a SwissTX T1-baby with 6*2 processors and 512 MByte/processor
size   b_eff (MByte/s)   b_eff/size (MByte/s)   summary                             full protocol
 12    ???.???           ??.???                 result_3.3_SwissTX1baby_012a.shrt   result_3.3_SwissTX1baby_012a.gz
  8     97.497           12.187                 result_3.3_SwissTX1baby_008a.shrt   result_3.3_SwissTX1baby_008a.gz
  4     49.394           12.348                 result_3.3_SwissTX1baby_004a.shrt   result_3.3_SwissTX1baby_004a.gz
  2     25.792           12.896                 result_3.3_SwissTX1baby_002a.shrt   result_3.3_SwissTX1baby_002a.gz
Used commands: tnetcc -o b_eff -DMEMORY_PER_PROCESSOR=512 b_eff.c -lm
bsub -Is -n size txrun b_eff result_3.3_SwissTX1baby_size
MPI release: T-NET: 0.17 (see http://service.scs.ch/gb2/fci/revision/)
(sysconfig -q tnet Version)
Execution time < 163 sec
On IBM SP2
size   b_eff (MByte/s)   b_eff/size (MByte/s)   summary                       full protocol               platform
128    2241.885          17.515                 result_SP2_512MB_128PE.shrt   result_SP2_512MB_128PE.gz   P2SC (120 MHz) processors with 512 MB each
 32     568.227          17.757                 result_SP2_256MB_32PE.shrt    result_SP2_256MB_32PE.gz    POWER2 (77 MHz) processors with 256 MB each
Used commands: mpcc -o b_eff -DMEMORY_PER_PROCESSOR=512 b_eff.c -lm
poe b_eff result_SP2_512MB_sizePE -procs size
MPI release: ???
Execution time < 980 sec (very long due to the extremely high Alltoallv latency)
Shared Memory Systems
On a NEC SX-4/32 with 32 processors and 256 MByte/processor
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=256 b_eff.c -lm
mpirun -np size ./b_eff result_3.2_sx4_size
MPI release: 9.1
On a NEC SX-5/8B with 8 processors and 8 GByte/processor
size   b_eff (MByte/s)   b_eff/size (MByte/s)   summary                         full protocol
  4    5439.199          1359.800               result_3.2_sx5_256MB_004.shrt   result_3.2_sx5_256MB_004.gz
  2    2662.468          1331.234               result_3.2_sx5_256MB_002.shrt   result_3.2_sx5_256MB_002.gz
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=256 b_eff_mthd1+2.c -lm
mpirun -np size ./b_eff result_3.2_sx5_256MB_size
MPI release: 9.2
On a HP-V 9000/800/V2250 with 8 processors and 1024 MByte/processor
size   b_eff (MByte/s)   b_eff/size (MByte/s)   summary                    full protocol
  7    435.041           62.149                 result_3.2_hpv_007c.shrt   result_3.2_hpv_007c.gz
Used commands: mpicc -o b_eff -D MEMORY_PER_PROCESSOR=1024 b_eff.c -lm
mpirun -np size ./b_eff result_3.2_hpv_size
MPI release: HP MPI 1.4 implementation
On a SGI Cray SV1-B/16-8 with 16 processors and 512 MByte/processor
size   b_eff (MByte/s)   b_eff/size (MByte/s)   summary                       full protocol
 16    ~ 1487.200        ~ 92.950               extrapolation based on the b_eff/size column and the 15, 12 and 8 PE lines
 15    1444.958          96.331                 result_3.3_SV1B16_015a.shrt   result_3.3_SV1B16_015a.gz
 12    1283.318          106.943                result_3.3_SV1B16_012a.shrt   result_3.3_SV1B16_012a.gz
  8     958.823          119.853                result_3.3_SV1B16_008a.shrt   result_3.3_SV1B16_008a.gz
  4     626.880          156.720                result_3.3_SV1B16_004a.shrt   result_3.3_SV1B16_004a.gz
  2     359.071          179.535                result_3.3_SV1B16_002a.shrt   result_3.3_SV1B16_002a.gz
Used commands: cc -h taskprivate $LIBCM -o b_eff -D MEMORY_PER_PROCESSOR=512 b_eff.c
qsub -eo -q nqebatch jobsize
with jobsize:
#!/bin/sh
export MPI_GROUP_MAX=64                            # raise the limit on MPI groups
ja                                                 # start job accounting
mpirun -nt size ~/b_eff ~/result_3.3_SV1B16_size
ja -st                                             # write accounting summary and terminate accounting
MPI release: mpt.1.3.0.2
Execution time < 360 sec (expected on a dedicated system)
References: