b_eff_io - Effective parallel MPI file I/O benchmark
The effective I/O bandwidth benchmark (b_eff_io) has two goals:
(1) to produce a characteristic average number for the I/O bandwidth achievable
with parallel MPI-I/O applications, and
(2) to get detailed information about several access patterns and buffer
lengths. The benchmark examines "first write", "rewrite" and "read" access,
strided (individual and shared pointers) and segmented collective patterns
on one file per application and non-collective access to one file per process.
The number of parallel accessing processes is also varied, and well-formed
I/O is compared with non-well-formed I/O. On systems that meet the rule that the
total memory can be written to disk in 10 minutes, the benchmark should
not need more than 15 minutes for a first pass of all patterns. The benchmark
is designed analogously to the effective bandwidth benchmark for message
passing (b_eff) that characterizes the message passing capabilities of
a system in a few minutes.
- mpicc
- -o b_eff_io [-D WITHOUT_SHARED] b_eff_io.c -lm
- mpirun
- -np number_of_MPI_processes ./b_eff_io -MB number_of_megabytes_memory_per_node
-MT number_of_megabytes_memory_of_the_total_system [-noshared] [-nounique]
[-rewrite] [-keep] [-N number_of_processes[,number_of_processes[,...]]] [-T scheduled_time]
[-p path_of_fast_filesystem] [-i info_file] [-e number_of_errors] [-f protocol_files'_prefix]
or
- mpiexec
- -n number_of_MPI_processes ./b_eff_io -MB number_of_megabytes_memory_per_node
-MT number_of_megabytes_memory_of_the_total_system [other options: see mpirun above]
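As an illustration of the synopsis above, the following minimal sketch assembles a b_eff_io command line from the two mandatory options (-MB, -MT) and the optional -p and -f options. The helper name build_command and its parameter names are hypothetical and not part of b_eff_io; only the flags themselves come from the synopsis.

```python
def build_command(nprocs, mb_per_node, mb_total,
                  path=None, prefix=None, launcher="mpirun"):
    """Assemble an argument list for b_eff_io (hypothetical helper)."""
    # The synopsis shows -np for mpirun and -n for mpiexec.
    count_flag = "-np" if launcher == "mpirun" else "-n"
    argv = [launcher, count_flag, str(nprocs), "./b_eff_io",
            "-MB", str(mb_per_node),   # mandatory: memory per node in MB
            "-MT", str(mb_total)]      # mandatory: total system memory in MB
    if path is not None:
        argv += ["-p", path]           # filesystem to benchmark
    if prefix is not None:
        argv += ["-f", prefix]         # prefix of protocol/summary files
    return argv


if __name__ == "__main__":
    # The scratch path here is a made-up example value.
    print(" ".join(build_command(64, 128, 65536, "/scratch", "b_eff_io_run")))
```

The list form can be passed directly to a process launcher, which avoids shell quoting issues with the path and prefix arguments.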
- -D WITHOUT_SHARED
- (compile-time option) replaces the shared file pointer with individual
file pointers (implies the runtime option -noshared)
- -np number_of_MPI_processes
- (mpirun option, see man mpirun) defines the number of MPI processes
started for this benchmark.
- -MB number_of_megabytes_memory_per_node
- (mandatory) A node is defined as the unit used by or usable for one MPI
process. This value is used to compute the maximum chunk size for
patterns 1, 10, 18, 26 and 35. The maximum chunk size is defined as
max(2 MB, memory per node / 128).
- -MT number_of_megabytes_memory_of_the_total_system
- (mandatory)
This value is used to compute the ratio of transferred bytes to the size
of the total memory.
- -noshared
- replaces the shared file pointer with
individual file pointers in pattern type 1 (implied by the compile-time
option -D WITHOUT_SHARED).
- -nounique
- removes MPI_MODE_UNIQUE_OPEN from each
file open (on some systems, this option allows some MPI optimizations)
- -rewrite
- performs a rewrite between write and read for all patterns
- -keep
- keeps all benchmark files on close after the last pattern test
- -N number_of_processes[,number_of_processes[,...]]
- defines the partition sizes used for this benchmark (default: see Default
Partition Sizes)
- -T scheduled_time
- scheduled time for all partitions of processes N
(default = 1800 [seconds], see also option -N).
- -p path_of_fast_filesystem
- path of the filesystem that should be benchmarked, i.e. where this benchmark
should write its scratch files (default is the current directory).
- -i info_file
- file containing file hints, see section Info File Format below (default
is to use no hints, i.e., to use only MPI_INFO_NULL). With -i, the hints
actually used are printed in the prefix.prot protocol file. The default hints
can be viewed by using -i with an empty info file.
- -e number_of_errors
- maximum number of errors printed in each pattern (default = 1).
- -f protocol_files'_prefix
- prefix of the protocol file and the summary file. The protocol and summary
will be named prefix.prot and prefix.sum (default prefix = b_eff_io).
Already existing scratch files are automatically removed before
benchmarking is started.
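The maximum chunk size rule given under -MB, max(2 MB, memory per node / 128), can be sketched as follows. The function name is hypothetical, and the sketch assumes that 1 MB means 2**20 bytes, which the text does not state.

```python
MB = 1024 * 1024  # assumption: 1 MB = 2**20 bytes


def max_chunk_size_bytes(mb_per_node):
    """Maximum chunk size per the -MB rule: max(2 MB, memory per node / 128)."""
    return max(2 * MB, (mb_per_node * MB) // 128)


if __name__ == "__main__":
    for mem in (128, 256, 512, 1024):
        print(mem, "MB per node ->", max_chunk_size_bytes(mem) // MB, "MB chunk")
```

Under this rule, the 2 MB floor applies to any node with up to 256 MB of memory; larger nodes get proportionally larger chunks.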
If the result is to be used for comparing different systems,
the benchmark is only valid if the following criteria are met:
- T >= 1800 sec,
- the option -noshared is NOT used, and
- no errors are reported.
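The three validity criteria can be expressed as a simple check. This is only a sketch: the function and parameter names are hypothetical, and the values would have to be taken from how the run was started and from its protocol file. Since -D WITHOUT_SHARED implies -noshared, a binary built that way would presumably count as using -noshared.

```python
def is_valid_for_comparison(scheduled_time_s, noshared_used, errors_reported):
    """Return True if a run meets the criteria for cross-system comparison."""
    return (scheduled_time_s >= 1800      # -T of at least 1800 seconds
            and not noshared_used         # -noshared must NOT be used
            and errors_reported == 0)     # no errors reported


if __name__ == "__main__":
    print(is_valid_for_comparison(1800, False, 0))
    print(is_valid_for_comparison(600, False, 0))
```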
- Time:
- The user might expect this benchmark to need the
scheduled time, defined with the option -T, or the sum of the scheduled
time values of several partitions, defined by the options -T and -N. However,
for several reasons, this benchmark can need much more time (2x - 4x).
The reasons are:
- The sync operation is outside of the time-driven loop and
might consume time after the scheduled iterations.
- The loop may be finished by an iteration that consumed much more time
than the previous iterations.
- The pattern types 3 (segmented) and 4 (seg-coll) are not time-driven. The
estimation of adequate repeating factors is based on results with pattern
types 0-2. This estimation might be too high if the implementation of pattern
types 3 and 4 is worse than that of pattern types 0 and 2.
- The same applies to all patterns with the access methods "rewrite" and "read".
- Size of scratch data on disk:
- On the disk defined with the option -p, the
size of the data written by this benchmark is about
real_execution_time * accumulated_write_bandwidth / 3.
- Memory buffer space:
- b_eff_io needs N * max(4 MBytes, memory_per_node/64) of memory for its buffers.
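The three resource estimates above (time, scratch space, buffer memory) can be combined into a small planning sketch. All function names are hypothetical. The sketch reads N as the number of MPI processes, which is an assumption, and it takes the scheduled time as the lower bound and 4x the scheduled time as the upper bound of the real execution time.

```python
def wall_time_range_s(scheduled_time_s):
    """Scheduled time as lower bound, 4x as upper bound (per the 2x - 4x note)."""
    return (scheduled_time_s, 4 * scheduled_time_s)


def scratch_size_mb(real_execution_time_s, write_bandwidth_mb_s):
    """About real_execution_time * accumulated_write_bandwidth / 3."""
    return real_execution_time_s * write_bandwidth_mb_s / 3


def buffer_memory_mb(n_procs, mb_per_node):
    """N * max(4 MBytes, memory_per_node / 64); N assumed to be the process count."""
    return n_procs * max(4, mb_per_node / 64)


if __name__ == "__main__":
    # Illustrative inputs only; the bandwidth value is made up.
    low, high = wall_time_range_s(1800)
    print("wall time between", low, "and", high, "seconds")
    print("scratch space about", scratch_size_mb(high, 300), "MB")
    print("buffer memory", buffer_memory_mb(64, 128), "MB")
```

Using the upper time bound for the scratch estimate gives a conservative figure for how much free space the filesystem given with -p should have.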
beffio_eps generates diagrams from the given summary protocol file.
Synopsis: beffio_eps [ protocol_file_prefix ] (default=b_eff_io)
Output:
- black/white diagrams, e.g., for publication:
- prefix_(np)_(write, rewrt, read, type0_sca, type1_sha,
type2_sep, type3_seg and type4_sgc)_mono.eps, prefix_(write, rewrt, read)_mono.eps
- if dvips is available, a summary sheet of these diagrams:
- prefix_(np)_on1page.ps
- colored diagrams with thick lines, e.g., for slides:
- prefix_(np)_(write, rewrt, read, type0_sca, type1_sha,
type2_sep, type3_seg and type4_sgc)_color.(eps and png),
prefix_(write, rewrt, read)_color.(eps and png)
CRAY T3E:
Prerequisites: using module mpt
Compilation on T3E :
cc -o b_eff_io -D WITHOUT_SHARED b_eff_io.c
cc -o b_eff_io -D WITHOUT_SHARED b_eff_io.c \
../ufs_t3e/ad_ufs_open.o ../ufs_t3e/ad_ufs_read.o \
../ufs_t3e/ad_ufs_write.o
cc -o b_eff_io -D WITHOUT_SHARED b_eff_io.c \
../ufs_t3e/ad_ufs_*.o
Execution: export MPI_BUFFER_MAX=4099
T3E-900 with 128 MB/processor and 512 PEs:
mpirun -np 64 ./b_eff_io -MB 128 -MT 65536 \
-p $SCRDIR -f b_eff_io_T3E900_064PE
T3E-1200 with 512 MB/processor and 512 PEs:
mpirun -np 64 ./b_eff_io -MB 512 -MT 262144 \
-p $SCRDIR -f b_eff_io_T3E1200_064PE
SX-4:
Prerequisites: -
Compilation on SX-4/32 with 256 MB/processor:
mpicc -o b_eff_io b_eff_io.c -lm
Execution:
mpirun -np 8 ./b_eff_io -MB 256 -MT 8192 \
-p $SCRDIR -f b_eff_io_SX4_08PE
Postprocessing (on local workstation):
b_eff_io_eps 64 b_eff_io_T3E900_064PE
b_eff_io_eps 64 b_eff_io_T3E1200_064PE
b_eff_io_eps 8 b_eff_io_SX4_08PE
Output files:
b_eff_io_T3E900_064PE.sum          human readable summary
b_eff_io_T3E900_064PE.prot         full benchmark protocol
b_eff_io_T3E900_064PE_*_mono.eps   diagrams black/white
b_eff_io_T3E900_064PE_*_color.eps  colored, for slides
Same for b_eff_io_T3E1200_064PE and b_eff_io_SX4_08PE.
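In the execution examples above, the -MT values are consistent with the memory per processor multiplied by the number of processors of the whole system, not by the -np value of the run. The following sketch reproduces that arithmetic; the function name is hypothetical, and the rule is inferred from the examples rather than stated by the option description.

```python
def total_memory_mb(mb_per_processor, processors_in_system):
    """-MT value inferred from the examples: per-processor memory times system size."""
    return mb_per_processor * processors_in_system


if __name__ == "__main__":
    print("T3E-900 :", total_memory_mb(128, 512))   # 128 MB/processor, 512 PEs
    print("T3E-1200:", total_memory_mb(512, 512))   # 512 MB/processor, 512 PEs
    print("SX-4/32 :", total_memory_mb(256, 32))    # 256 MB/processor, 32 processors
```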
mpi(1), mpirun(1), mpiexec(1), mpicc(1),
www.hlrs.de/mpi/b_eff_io/,
www.hlrs.de/mpi/b_eff/,
www.hlrs.de/mpi/mpi_t3e.html#StripedIO