Effective I/O Bandwidth (beff_io) Benchmark
This page refers to the old release, b_eff_io version 1.3.
The effective I/O bandwidth benchmark (b_eff_io) covers two goals:
(1) to determine a characteristic average number for the I/O bandwidth
achievable with parallel MPI-I/O applications, and (2) to get
detailed information about several access patterns and buffer lengths.
The benchmark examines "first write", "rewrite" and "read" access,
strided (individual and shared pointers) and segmented collective
patterns on one file per application, and non-collective access
to one file per process. The number of parallel accessing processes
is also varied, and wellformed I/O is compared with non-wellformed I/O.
On systems meeting the rule that the total memory can be written to
disk in 10 minutes, the benchmark should not need more than
15 minutes for a first pass of all patterns.
The benchmark is designed analogously to the effective
bandwidth benchmark for message passing (b_eff)
that characterizes the message passing capabilities of a system in
a few minutes.
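As an illustration only (this is not code taken from b_eff_io itself), the
following minimal MPI program sketches two of the access styles the
benchmark distinguishes: a collective write at explicit offsets
(individual file pointers) and a collective write through the shared
file pointer; the file name, buffer size and use of MPI_BYTE are
arbitrary assumptions for the sketch.

/* Sketch only -- not taken from b_eff_io: a collective write at
   explicit offsets (individual file pointers), followed by a
   collective write through the shared file pointer. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int len = 1 << 20;                 /* 1 MByte chunk per process */
    char *buf;
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    buf = calloc(len, 1);

    MPI_File_open(MPI_COMM_WORLD, "scratch_file",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* segment 1: each process writes one chunk at an explicit offset */
    MPI_File_write_at_all(fh, (MPI_Offset)rank * len, buf, len,
                          MPI_BYTE, MPI_STATUS_IGNORE);

    /* segment 2: the same chunks, written in rank order through the
       shared file pointer, appended behind segment 1 */
    MPI_File_seek_shared(fh, (MPI_Offset)size * len, MPI_SEEK_SET);
    MPI_File_write_ordered(fh, buf, len, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}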
The latest releases:
- Current release as gzip'ed tar archive:
b_eff_io_v1.3.tar.gz
- Files of this release:
b_eff_io.c,
b_eff_io_eps,
man page (formatted),
man/man1/b_eff_io.1
Helper for b_eff_io_eps:
b_eff_io_eps.gnuplot,
b_eff_io_eps_on1page.dvi
Source of "on 1 page":
b_eff_io_eps_on1page.tex
- Old releases: 1.2, 1.1, 1.0, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1.
A detailed report with first results of b_eff_io release 0.5
is also available.
Latest publications:
Rolf Rabenseifner and
Alice E. Koniges:
Effective File-I/O Bandwidth Benchmark.
In proceedings,
Arndt Bode, Thomas Ludwig, Roland Wismüller (editors),
Euro-Par
2000 -- Parallel Processing,
Aug. 29 - Sept. 1, 2000,
München, Germany,
LNCS 1900,
pp 1273-1283.
Files:
full paper as
PDF and
gzip'ed postscript,
slides as
gzip'ed postscript.
Rolf Rabenseifner and
Alice E. Koniges:
The Effective I/O Bandwidth Benchmark (b_eff_io).
In proceedings of the
Message Passing Interface and High-Performance
Clusters Developer's and User's Conference
(MPIDC 2000),
March 20-23, 2000, Ithaca, NY, USA.
Files:
full paper as
US paper, gzip'ed postscript,
A4 gzip'ed postscript;
slides as
gzip'ed postscript.
Usage
Installation and the first test
- Download the tar file of the current release
b_eff_io_v1.3.tar.gz
- Unpack with: gunzip -c b_eff_io_v1.3.tar.gz | tar -xvf -
- Change directory: cd b_eff_io
- Compile it:
mpicc -o b_eff_io b_eff_io.c -lm
If you are using an old ROMIO version without shared file pointers,
then for a first test you can compile with:
mpicc -o b_eff_io -D WITHOUT_SHARED b_eff_io.c -lm
However, to achieve valid b_eff_io results,
you have to use an MPI library that supports shared file pointers
(a small check program is sketched at the end of this section).
- Test it:
mpirun -np 4 ./b_eff_io -MB 256 -MT 1024 -T 30 -p /my/fast/scratch/dir
This means that you are using 4 MPI processes,
that your system has 256 MBytes of memory per processor
and a total memory of 1024 MBytes,
and that you want to run only a quick test, scheduled to run for at least
30 seconds, i.e., it should finish in no more than 2 minutes.
This I/O benchmark uses large scratch files; they are stored
in /my/fast/scratch/dir.
You will get back:
- on standard output -- the b_eff_io value
- on b_eff_io.sum -- a human readable summary
- on b_eff_io.prot -- the full benchmark protocol
- Print the summary, e.g., with:
a2ps -C -1 -l 120 b_eff_io.sum
- If there are serious problems, e.g., the benchmark takes more than
2 minutes on a dedicated system, then please feel free to contact
the author; send him all the commands you used and attach the files
b_eff_io.sum and b_eff_io.prot.
- If gnuplot and dvips are available,
you can generate some plots from the summary:
b_eff_io_eps 4
- Print the summary sheet, e.g., with:
lpr b_eff_io_on1page.ps
CAUTION: Because this first test is scheduled with only
30 seconds, the results will not tell you
anything about the I/O bandwidth of your system.
The test should only tell you whether the benchmark
runs on your system.
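As mentioned above, valid results require shared file pointers. The
following throw-away program (a hypothetical helper, not part of the
b_eff_io distribution; file name and output text are arbitrary) can be
used to check whether the shared file pointer of your MPI library works:

/* Hypothetical helper, not part of b_eff_io: quick check whether
   the MPI library provides working shared file pointers.
   Compile with: mpicc -o check_shared check_shared.c
   Run with:     mpirun -np 2 ./check_shared                        */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, err;
    char byte = 'x';
    MPI_File fh;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_File_open(MPI_COMM_WORLD, "shared_ptr_test",
                  MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
    MPI_File_set_errhandler(fh, MPI_ERRORS_RETURN);

    /* non-collective write through the shared file pointer */
    err = MPI_File_write_shared(fh, &byte, 1, MPI_BYTE, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("shared file pointers %s\n",
               err == MPI_SUCCESS ? "seem to work" : "are NOT supported");

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}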
To get a first b_eff_io impression
- Before you start with a realistic scheduled time,
you should use correct values for the memory sizes and at least
1/8 of the real number of nodes of your system, but still
30 seconds of scheduled time,
e.g., on a 64-processor system with 32 GB of memory:
mpirun -np 8 ./b_eff_io -MB 512 -MT 32768 -T 30 -p /my/fast/scratch/dir -f my_system_08pe_0030sec
The last option (-f) defines the prefix of your protocol files.
- Now, you can test larger scheduled time frames, e.g. 15 minutes (=900 sec):
mpirun -np 8 ./b_eff_io -MB 512 -MT 32768 -T 900 -p /my/fast/scratch/dir -f my_system_08pe_0900sec
b_eff_io_eps 8 my_system_08pe_0900sec
- #MPROC is used in the next section to abbreviate the memory size
of each MPI process (in MBytes), e.g., in our example #MPROC=512.
- #MTOTAL is used to abbreviate the total memory size of the system
(in MBytes), e.g., in our example #MTOTAL=32768.
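- Note that in the example above, #MPROC is simply #MTOTAL divided by
the number of processors: 32768 / 64 = 512.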
The official execution of b_eff_io
The b_eff_io benchmark has to be run with three mandatory
process counts and one optional process count:
- First, #MPI_PROC_PER_SMP_NODE must be chosen by the
person running the benchmark. This is the number
of MPI processes that should run on each SMP node.
This number is tunable but must not exceed the
number of processors of each SMP node.
On systems that are not a cluster of SMP nodes,
#MPI_PROC_PER_SMP_NODE is 1.
- Next, #NODES_FULL is defined as the number of SMP nodes
of the system.
On systems that are not a cluster of SMP nodes,
#NODES_FULL is the number of processors available for
parallel computation.
- Thus you have to compute the following numbers
(a small sketch that evaluates these formulas follows after this list):
#NODES_MEDIUM = 2 ** ( round ( log_2(#NODES_FULL) * 0.70 ) )
#NODES_SMALL = 2 ** ( round ( log_2(#NODES_FULL) * 0.35 ) )
Examples: The following table shows the MEDIUM and SMALL values
for given FULL values
#NODES_FULL = 2048 1024 512 256 128 64 32 16 8 4 2 1
==> #NODES_MEDIUM = 256 128 64 64 32 16 16 8 4 2 2 1
==> #NODES_SMALL = 16 16 8 8 4 4 4 2 2 2 1 1
- And last, you can freely choose #NODES_TUNED as the
number of nodes that gives the best b_eff_io value.
- Each of these b_eff_io measurements should be done with
30 minutes scheduled time (-T 1800),
i.e., we would like to see the results of
mpirun 'on #NODES_SMALL with #MPI_PROC_PER_SMP_NODE' \
./b_eff_io -MB #MPROC -MT #MTOTAL -T 1800 \
-p /my/fast/scratch/dir \
-f my_system_#NODES_SMALL_1800sec
mpirun 'on #NODES_MEDIUM with #MPI_PROC_PER_SMP_NODE' \
./b_eff_io -MB #MPROC -MT #MTOTAL -T 1800 \
-p /my/fast/scratch/dir \
-f my_system_#NODES_MEDIUM_1800sec
mpirun 'on #NODES_FULL with #MPI_PROC_PER_SMP_NODE' \
./b_eff_io -MB #MPROC -MT #MTOTAL -T 1800 \
-p /my/fast/scratch/dir \
-f my_system_#NODES_FULL_1800sec
mpirun 'on #NODES_TUNED with #MPI_PROC_PER_SMP_NODE' \
./b_eff_io -MB #MPROC -MT #MTOTAL -T 1800 \
-p /my/fast/scratch/dir \
-f my_system_#NODES_TUNED_1800sec
#NODES_TUNED may be chosen as the one of the
three other values that gave the best b_eff_io result
(i.e., #NODES_TUNED is optional).
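The two formulas above can be evaluated with a few lines of C. The
following stand-alone sketch (a hypothetical helper, not part of
b_eff_io; compile, e.g., with "cc -o nodes nodes.c -lm") reproduces the
table above:

/* Hypothetical helper: evaluates the formulas for #NODES_MEDIUM and
   #NODES_SMALL for a given #NODES_FULL.
   Example: ./nodes 64  prints  #NODES_MEDIUM=16  #NODES_SMALL=4      */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int full = (argc > 1) ? atoi(argv[1]) : 64;

    /* 2 ** round( log_2(full) * factor ) */
    int medium = 1 << (int)lround(log2((double)full) * 0.70);
    int small  = 1 << (int)lround(log2((double)full) * 0.35);

    printf("#NODES_FULL=%d  #NODES_MEDIUM=%d  #NODES_SMALL=%d\n",
           full, medium, small);
    return 0;
}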
Filesystem parameters should be chosen as for normal users.
The size of data written to /my/fast/scratch/dir
by each of these four benchmarks is about
real_execution_time * accumulated_write_bandwidth / 3.
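For example, assuming purely for illustration an accumulated write
bandwidth of 300 MBytes/s and a real execution time of 1800 seconds,
roughly 1800 * 300 / 3 = 180,000 MBytes (about 180 GBytes) of scratch
data would be written.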
The real execution time may differ from the scheduled time (30 min.)
for the following reasons:
- The sync operation is outside of the time-driven loop
and may consume time after the scheduled iterations.
- The loop is finished by an iteration that consumed
much more time than the previous iterations.
- The pattern types 3 (segmented) and 4 (seg-coll)
are not time-driven. The estimation of adequate
repeating factors is based on results with pattern
types 0-2. This estimation may be too high if the
implementation of pattern types 3 and 4 is worse
than that of pattern types 0-2.
- The same reason is valid for all patterns with the access
methods "rewrite" and "read".
Publishing the results
Now, you can publish these four b_eff_io values together
with all commands and parameters you have used
to run these benchmarks and together with the protocol files
(my_system_..._1800sec.prot)
and the summary files
(my_system_..._1800sec.sum).
The
Top Cluster initiative
of the
TFCC Open Forum
has nominated this benchmark for evaluating the I/O performance of
clusters (see discussion
archive).
It is planned to include the b_eff_io results in the
TOPClusters list.