
Name

b_eff_io - Effective parallel MPI file I/O benchmark

Description

The effective I/O bandwidth benchmark (b_eff_io) covers two goals:
(1) to achieve a characteristic average number for the I/O bandwidth achievable with parallel MPI-I/O applications, and
(2) to get detailed information about several access patterns and buffer lengths.

The benchmark examines "first write", "rewrite" and "read" access, strided (individual and shared pointers) and segmented collective patterns on one file per application, as well as non-collective access to one file per process. The number of parallel accessing processes is also varied, and well-formed I/O is compared with non-well-formed I/O. On systems that meet the rule that the total memory can be written to disk in 10 minutes, the benchmark should not need more than 15 minutes for a first pass of all patterns. The benchmark is designed analogously to the effective bandwidth benchmark for message passing (b_eff), which characterizes the message-passing capabilities of a system in a few minutes.
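
The following minimal sketch (not code taken from b_eff_io itself; "scratch_file" is a placeholder name) illustrates, for orientation, two of the access kinds the pattern types are built on: writing through the shared file pointer and writing collectively at explicit per-process offsets into one common file.

     /* Minimal sketch, not b_eff_io source. */
     #include <mpi.h>
     #include <string.h>

     int main(int argc, char **argv)
     {
         MPI_File fh;
         char     buf[1024];
         int      rank;

         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         memset(buf, rank & 0xff, sizeof(buf));

         /* one file per application, opened by all processes */
         MPI_File_open(MPI_COMM_WORLD, "scratch_file",
                       MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

         /* strided access through the shared file pointer (cf. pattern type 1) */
         MPI_File_write_shared(fh, buf, (int)sizeof(buf), MPI_BYTE,
                               MPI_STATUS_IGNORE);

         /* collective access at explicit per-process offsets
            (cf. the segmented pattern types) */
         MPI_File_write_at_all(fh, (MPI_Offset)rank * (MPI_Offset)sizeof(buf),
                               buf, (int)sizeof(buf), MPI_BYTE,
                               MPI_STATUS_IGNORE);

         MPI_File_close(&fh);
         MPI_Finalize();
         return 0;
     }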

Synopsis

mpicc -o b_eff_io [-D WITHOUT_SHARED] b_eff_io.c -lm

mpirun -np number_of_MPI_processes ./b_eff_io -MB number_of_megabytes_memory_per_node -MT number_of_megabytes_memory_of_the_total_system [-noshared] [-nounique] [-rewrite] [-keep] [-N number_of_processes[,number_of_processes[,...]]] [-T scheduled_time] [-p path_of_fast_filesystem] [-i info_file] [-e number_of_errors] [-f protocol_files'_prefix]

or

mpiexec -n number_of_MPI_processes ./b_eff_io -MB number_of_megabytes_memory_per_node -MT number_of_megabytes_memory_of_the_total_system [ other options as for mpirun above ]

General Compile Time Options

-D WITHOUT_SHARED
to replace the shared file pointer with individual file pointers (implies the runtime option -noshared)
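
As an illustration (a sketch only, not the benchmark's actual code; write_chunk and its arguments are hypothetical names), the substitution amounts to exchanging the shared-pointer write for an individual-pointer write at an explicit per-process offset:

     /* Sketch of the kind of substitution -D WITHOUT_SHARED implies. */
     #include <mpi.h>

     static void write_chunk(MPI_File fh, const char *buf, int len,
                             MPI_Offset my_offset)   /* hypothetical helper */
     {
     #ifdef WITHOUT_SHARED
         /* individual file pointer: explicit, per-process offset */
         MPI_File_write_at(fh, my_offset, buf, len, MPI_BYTE, MPI_STATUS_IGNORE);
     #else
         /* shared file pointer, advanced jointly by all processes */
         MPI_File_write_shared(fh, buf, len, MPI_BYTE, MPI_STATUS_IGNORE);
     #endif
     }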

Runtime Options

-np number_of_MPI_processes
(mpirun option, see man mpirun) defines the number of MPI processes started for this benchmark.

-MB number_of_megabytes_memory_per_node
(mandatory) A node is defined as the unit used by or usable for one MPI process. This value is used to compute the maximum chunk size for the patterns 1, 10, 18, 26 and 35. The maximum chunk size is defined as max(2 MB, memory_per_node / 128).
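For example, with -MB 512 the maximum chunk size is max(2 MB, 512 MB / 128) = 4 MB; with -MB 128 it stays at the 2 MB floor, because 128 MB / 128 = 1 MB < 2 MB.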

-MT number_of_megabytes_memory_of_the_total_system
(mandatory) This value is used to compute the ratio of transferred bytes to the size of the total memory.

-noshared
to replace the shared file pointer with individual file pointers in pattern type 1 (implied by the compile-time option -D WITHOUT_SHARED).

-nounique
to remove MPI_MODE_UNIQUE_OPEN from each file opening (on some systems, this option allows additional MPI optimizations)

-rewrite
to do a rewrite between the write and the read access for all patterns

-keep
to keep all benchmarking files when they are closed after the last pattern test

-N number_of_processes[,number_of_processes[,...]]
defines the partition sizes used for this benchmark (default: see Default Partition Sizes)

-T scheduled_time
scheduled time for all partitions of processes N (default = 1800 seconds; see also option -N).

-p path_of_fast_filesystem
path of the filesystem that should be benchmarked, i.e., where this benchmark should write its scratch files (default is the current directory).

-i info_file
file containing file hints, see section Info File Format below (default is to use no hints, i.e., to use only MPI_INFO_NULL). When -i is used, the hints actually in effect are printed in the prefix.prot protocol file. The default hints can be viewed by using -i with an empty info file.
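
Internally, such hints end up in an MPI_Info object that is passed when the benchmark files are opened. The following sketch shows the mechanism only; it is not b_eff_io's own info-file parsing code, and "cb_buffer_size" is used merely as an example of a reserved MPI-I/O hint key.

     #include <mpi.h>

     static MPI_File open_with_hints(const char *path)   /* hypothetical helper */
     {
         MPI_Info info;
         MPI_File fh;

         MPI_Info_create(&info);
         MPI_Info_set(info, "cb_buffer_size", "4194304");  /* 4 MB collective buffer */
         MPI_File_open(MPI_COMM_WORLD, path,
                       MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
         MPI_Info_free(&info);
         return fh;
     }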

-e number_of_errors
maximum of errors printed in each pattern (default = 1).

-f protocol_files'_prefix
prefix of the protocol file and the summary file. The protocol and summary will be named prefix.prot and prefix.sum (default prefix = b_eff_io).

Remarks

Already existing scratch files are automatically removed before benchmarking is started.

If the result is to be used for comparing different systems, the benchmark is only valid if the following criteria are met:

  1. T >= 1800 sec,
  2. the option -noshared is NOT used, and
  3. no errors are reported.

Resources

Time:
The user might expect this benchmark to need the scheduled time defined with the option -T, or the sum of the scheduled times of several partitions defined by the options -T and -N. However, for several reasons the benchmark can need considerably more time (2x - 4x):

  • The sync operation is outside the time-driven loop and might consume additional time after the scheduled iterations.
  • The loop may be finished by an iteration that consumes much more time than the previous iterations.
  • The pattern types 3 (segmented) and 4 (seg-coll) are not time-driven. The estimation of adequate repeating factors is based on the results for pattern types 0-2. This estimation might be too high if the implementation of pattern types 3 and 4 is worse than that of pattern types 0-2.
  • The same applies to all patterns with the access methods "rewrite" and "read".

Size of scratch data on disk:
On the disk defined with the option -p, the size of the data written by this benchmark is about real_execution_time * accumulated_write_bandwidth / 3.

Memory buffer space:
b_eff_io needs N * max(4 MBytes, memory_per_node/64) memory for its buffers.
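As an illustrative example (the bandwidth value is hypothetical): with -np 64 and -MB 512, the buffer memory is 64 * max(4 MB, 512 MB / 64) = 64 * 8 MB = 512 MB, and a run of 1800 seconds at an accumulated write bandwidth of 100 MB/s leaves about 1800 * 100 / 3 = 60000 MB (roughly 60 GB) of scratch data on disk.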

Postprocessing

beffio_eps generates diagrams from the given summary protocol file.

Synopsis: beffio_eps [ protocol_file_prefix ] (default = b_eff_io)

Output:

black/white diagrams, e.g., for publication:
prefix_(np)_(write, rewrt, read, type0_sca, type1_sha, type2_sep, type3_seg and type4_sgc)_mono.eps, prefix_(write, rewrt, read)_mono.eps

if dvips is available, a summary sheet of these diagrams:
prefix_(np)_on1page.ps

colored diagrams with thick lines, e.g., for slides:
prefix_(np)_(write, rewrt, read, type0_sca, type1_sha, type2_sep, type3_seg and type4_sgc)_color.(eps and png), prefix_(write, rewrt, read)_color.(eps and png)
Examples


     CRAY T3E:  
       Prerequisites: using module mpt
       Compilation on T3E:
         cc -o b_eff_io -D WITHOUT_SHARED b_eff_io.c
         cc -o b_eff_io -D WITHOUT_SHARED b_eff_io.c \
            ../ufs_t3e/ad_ufs_open.o ../ufs_t3e/ad_ufs_read.o \
            ../ufs_t3e/ad_ufs_write.o
         cc -o b_eff_io -D WITHOUT_SHARED b_eff_io.c \
            ../ufs_t3e/ad_ufs_*.o
       Execution: export MPI_BUFFER_MAX=4099
       T3E-900 with 128 MB/processor and 512 PEs:
         mpirun -np 64 ./b_eff_io -MB 128 -MT 65536 \
                -p $SCRDIR -f b_eff_io_T3E900_064PE
       T3E-1200 with 512 MB/processor and 512 PEs:
         mpirun -np 64 ./b_eff_io -MB 512 -MT 262144 \
                -p $SCRDIR -f b_eff_io_T3E1200_064PE
     
     SX-4:
       Prerequisites: -
       Compilation on SX-4/32 with 256 MB/processor:
         mpicc -o b_eff_io b_eff_io.c -lm
       Execution:
         mpirun -np 8 ./b_eff_io -MB 256 -MT 8192 \
                -p $SCRDIR -f b_eff_io_SX4_08PE
     
     Postprocessing (on local workstation):
         b_eff_io_eps 64 b_eff_io_T3E900_064PE
         b_eff_io_eps 64 b_eff_io_T3E1200_064PE
         b_eff_io_eps  8 b_eff_io_SX4_08PE
     
     Output files:
         b_eff_io_T3E900_064PE.sum       human readable summary
                              .prot      full benchmark protocol
                              _*_mono.eps   diagrams black/white
                              _*_color.eps  colored, for slides
         Same for b_eff_io_T3E1200_064PE and b_eff_io_SX4_08PE.
    

See Also

mpi(1), mpirun(1), mpiexec(1), mpicc(1), www.hlrs.de/mpi/b_eff_io/, www.hlrs.de/mpi/b_eff/, www.hlrs.de/mpi/mpi_t3e.html#StripedIO

