
Name

b_eff_io - Effective parallel MPI file I/O benchmark

Description

The effective I/O bandwidth benchmark (b_eff_io) covers two goals:
(1) to achieve a characteristic average number for the I/O bandwidth achievable with parallel MPI-I/O applications, and
(2) to get detailed information about several access patterns and buffer lengths.

The benchmark examines "first write", "rewrite" and "read" access, strided (individual and shared pointers) and segmented collective patterns on one file per application, as well as non-collective access to one file per process. The number of parallel accessing processes is also varied, and well-formed I/O is compared with non-well-formed I/O. On systems that meet the rule that the total memory can be written to disk in 10 minutes, the benchmark should not need more than 15 minutes for a first pass of all patterns. The benchmark is designed analogously to the effective bandwidth benchmark for message passing (b_eff), which characterizes the message-passing capabilities of a system in a few minutes.
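
The following minimal sketch (not code taken from b_eff_io itself; "scratch_file" is a placeholder name) illustrates, for orientation, two of the access kinds the pattern types are built on: writing through the shared file pointer and writing collectively at explicit per-process offsets into one common file.

     /* Minimal sketch, not b_eff_io source. */
     #include <mpi.h>
     #include <string.h>

     int main(int argc, char **argv)
     {
         MPI_File fh;
         char     buf[1024];
         int      rank;

         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         memset(buf, rank & 0xff, sizeof(buf));

         /* one file per application, opened by all processes */
         MPI_File_open(MPI_COMM_WORLD, "scratch_file",
                       MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

         /* strided access through the shared file pointer (cf. pattern type 1) */
         MPI_File_write_shared(fh, buf, (int)sizeof(buf), MPI_BYTE,
                               MPI_STATUS_IGNORE);

         /* collective access at explicit per-process offsets
            (cf. the segmented pattern types) */
         MPI_File_write_at_all(fh, (MPI_Offset)rank * (MPI_Offset)sizeof(buf),
                               buf, (int)sizeof(buf), MPI_BYTE,
                               MPI_STATUS_IGNORE);

         MPI_File_close(&fh);
         MPI_Finalize();
         return 0;
     }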

Synopsis

mpicc -o b_eff_io [-D WITHOUT_SHARED] b_eff_io.c -lm

mpirun -np number_of_MPI_processes ./b_eff_io -MB number_of_megabytes_memory_per_node -MT number_of_megabytes_memory_of_the_total_system [-noshared] [-nounique] [-rewrite] [-keep] [-N number_of_processes[,number_of_processes[,...]]] [-T scheduled_time] [-p path_of_fast_filesystem] [-i info_file] [-e number_of_errors] [-f protocol_files'_prefix]

or

mpiexec -n number_of_MPI_processes ./b_eff_io -MB number_of_megabytes_memory_per_node -MT number_of_megabytes_memory_of_the_total_system [ other options as for mpirun above ]

General Compile Time Options

-D WITHOUT_SHARED
to replace the shared file pointer with individual file pointers (implies the runtime option -noshared)
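
As an illustration (a sketch only, not the benchmark's actual code; write_chunk and its arguments are hypothetical names), the substitution amounts to exchanging the shared-pointer write for an individual-pointer write at an explicit per-process offset:

     /* Sketch of the kind of substitution -D WITHOUT_SHARED implies. */
     #include <mpi.h>

     static void write_chunk(MPI_File fh, const char *buf, int len,
                             MPI_Offset my_offset)   /* hypothetical helper */
     {
     #ifdef WITHOUT_SHARED
         /* individual file pointer: explicit, per-process offset */
         MPI_File_write_at(fh, my_offset, buf, len, MPI_BYTE, MPI_STATUS_IGNORE);
     #else
         /* shared file pointer, advanced jointly by all processes */
         MPI_File_write_shared(fh, buf, len, MPI_BYTE, MPI_STATUS_IGNORE);
     #endif
     }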

Runtime Options

-np number_of_MPI_processes
(mpirun option, see man mpirun) defines the number of MPI processes started for this benchmark.

-MB number_of_megabytes_memory_per_node
(mandatory) A node is defined as the unit used by or usable for one MPI process. This value is used to compute the maximum chunk size for the patterns 1, 10, 18, 26 and 35. The maximum chunk size is defined as max(2 MB, memory_per_node / 128).
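For example, with -MB 512 the maximum chunk size is max(2 MB, 512 MB / 128) = 4 MB; with -MB 128 it stays at the 2 MB floor, because 128 MB / 128 = 1 MB < 2 MB.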

-MT number_of_megabytes_memory_of_the_total_system
(mandatory) This value is used to compute the ratio of transferred bytes to the size of the total memory.

-noshared
to replace the shared file pointer with individual file pointers in pattern type 1 (implied by the compile-time option -D WITHOUT_SHARED).

-nounique
to remove MPI_MODE_UNIQUE_OPEN from each file opening (on some systems, this option allows additional MPI optimizations)

-rewrite
to do a rewrite between the write and the read access for all patterns

-keep
to keep all benchmarking files when they are closed after the last pattern test

-N number_of_processes[,number_of_processes[,...]]
defines the partition sizes used for this benchmark (default: see Default Partition Sizes)

-T scheduled_time
scheduled time for all partitions of processes N (default = 1800 seconds; see also option -N).

-p path_of_fast_filesystem
path of the filesystem that should be benchmarked, i.e., where this benchmark should write its scratch files (default is the current directory).

-i info_file
file containing file hints, see section Info File Format below (default is to use no hints, i.e., to use only MPI_INFO_NULL). When -i is used, the hints actually in effect are printed in the prefix.prot protocol file. The default hints can be viewed by using -i with an empty info file.
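
Internally, such hints end up in an MPI_Info object that is passed when the benchmark files are opened. The following sketch shows the mechanism only; it is not b_eff_io's own info-file parsing code, and "cb_buffer_size" is used merely as an example of a reserved MPI-I/O hint key.

     #include <mpi.h>

     static MPI_File open_with_hints(const char *path)   /* hypothetical helper */
     {
         MPI_Info info;
         MPI_File fh;

         MPI_Info_create(&info);
         MPI_Info_set(info, "cb_buffer_size", "4194304");  /* 4 MB collective buffer */
         MPI_File_open(MPI_COMM_WORLD, path,
                       MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
         MPI_Info_free(&info);
         return fh;
     }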

-e number_of_errors
maximum of errors printed in each pattern (default = 1).

-f protocol_files'_prefix
prefix of the protocol file and the summary file. The protocol and summary will be named prefix.prot and prefix.sum (default prefix = b_eff_io).

Remarks

Already existing scratch files are automatically removed before benchmarking is started.

If the result is to be used for comparing different systems, the benchmark is only valid if the following criteria are met:

  1. T >= 1800 sec,
  2. the option -noshared is NOT used, and
  3. no errors are reported.

Resources

Time:
The user might expect this benchmark to need the scheduled time defined with the option -T, or the sum of the scheduled times of several partitions defined by the options -T and -N. However, for several reasons the benchmark can need considerably more time (2x - 4x):

  • The sync operation is outside the time-driven loop and might consume additional time after the scheduled iterations.
  • The loop may be finished by an iteration that consumes much more time than the previous iterations.
  • The pattern types 3 (segmented) and 4 (seg-coll) are not time-driven. The estimation of adequate repeating factors is based on the results for pattern types 0-2. This estimation might be too high if the implementation of pattern types 3 and 4 is worse than that of pattern types 0-2.
  • The same applies to all patterns with the access methods "rewrite" and "read".

Size of scratch data on disk:
On the disk defined with the option -p, the size of the data written by this benchmark is about real_execution_time * accumulated_write_bandwidth / 3.

Memory buffer space:
b_eff_io needs N * max(4 MBytes, memory_per_node/64) memory for its buffers.
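As an illustrative example (the bandwidth value is hypothetical): with -np 64 and -MB 512, the buffer memory is 64 * max(4 MB, 512 MB / 64) = 64 * 8 MB = 512 MB, and a run of 1800 seconds at an accumulated write bandwidth of 100 MB/s leaves about 1800 * 100 / 3 = 60000 MB (roughly 60 GB) of scratch data on disk.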

Postprocessing

beffio_eps generates diagrams from the given summary protocol file.

Synopsis: beffio_eps [ protocol_file_prefix ] (default = b_eff_io)

Output:

black/white diagrams, e.g., for publication:
prefix_(np)_(write, rewrt, read, type0_sca, type1_sha, type2_sep, type3_seg and type4_sgc)_mono.eps, prefix_(write, rewrt, read)_mono.eps

if dvips is available, a summary sheet of these diagrams:
prefix_(np)_on1page.ps

colored diagrams with thick lines, e.g., for slides:
prefix_(np)_(write, rewrt, read, type0_sca, type1_sha, type2_sep, type3_seg and type4_sgc)_color.(eps and png), prefix_(write, rewrt, read)_color.(eps and png)
Examples


     CRAY T3E:  
       Prerequisites: using module mpt
       Compilation on T3E:
         cc -o b_eff_io -D WITHOUT_SHARED b_eff_io.c
         cc -o b_eff_io -D WITHOUT_SHARED b_eff_io.c \
            ../ufs_t3e/ad_ufs_open.o ../ufs_t3e/ad_ufs_read.o \
            ../ufs_t3e/ad_ufs_write.o
         cc -o b_eff_io -D WITHOUT_SHARED b_eff_io.c \
            ../ufs_t3e/ad_ufs_*.o
       Execution: export MPI_BUFFER_MAX=4099
       T3E-900 with 128 MB/processor and 512 PEs:
         mpirun -np 64 ./b_eff_io -MB 128 -MT 65536 \
                -p $SCRDIR -f b_eff_io_T3E900_064PE
       T3E-1200 with 512 MB/processor and 512 PEs:
         mpirun -np 64 ./b_eff_io -MB 512 -MT 262144 \
                -p $SCRDIR -f b_eff_io_T3E1200_064PE
     
     SX-4:
       Prerequisites: -
       Compilation on SX-4/32 with 256 MB/processor:
         mpicc -o b_eff_io b_eff_io.c -lm
       Execution:
         mpirun -np 8 ./b_eff_io -MB 256 -MT 8192 \
                -p $SCRDIR -f b_eff_io_SX4_08PE
     
     Postprocessing (on local workstation):
         b_eff_io_eps 64 b_eff_io_T3E900_064PE
         b_eff_io_eps 64 b_eff_io_T3E1200_064PE
         b_eff_io_eps  8 b_eff_io_SX4_08PE
     
     Output files:
         b_eff_io_T3E900_064PE.sum       human readable summary
                              .prot      full benchmark protocol
                              _*_mono.eps   diagrams black/white
                              _*_color.eps  colored, for slides
         Same for b_eff_io_T3E1200_064PE and b_eff_io_SX4_08PE.
    

See Also

mpi(1), mpirun(1), mpiexec(1), mpicc(1), www.hlrs.de/mpi/b_eff_io/, www.hlrs.de/mpi/b_eff/, www.hlrs.de/mpi/mpi_t3e.html#StripedIO

