MPI on the CRAY T3E/512

This page gives important information about MPI on our
CRAY T3E. Currently the CRAY version mpt 1.4.0.2.p is installed; it corresponds to the MPI standard MPI 1.2 plus the MPI-2 chapters on one-sided communication and on parallel file I/O. It is upward compatible with the former releases because the include files mpi.h and mpif.h of the current release are unchanged with respect to the former release. This means that files compiled under older mpt releases can be linked together with files compiled now and with the current library.

The release mpt 1.4.0.2.p is the same as module mpt.1.4.0.2, except that automatic MPI profiling is included, i.e. MPI_Finalize writes a statistical summary of the MPI calls to a syslog file. Further information is given in the profiling section below.

Like mpt.1.3.0.1, the current release includes a subset of MPI-I/O (-> details about mpt.1.3.0.1). It allows parallel file I/O with up to 200 MByte/sec in your directory $SCRDIR on the striped file system /hwwt3e_tmp.
 
Initialization of the MPI environment
With the ksh shell the following (already existing) lines must be activated in the .profile provided by the HLRS:
 
USE_PROG_ENV=1 # Programming development compiler linker ...
USE_PROG_MPI=1 # MPI development

At the moment only ksh is supported. After modifying .profile, one must log in again.
 
The command  module load mpt  described in the man page MPI(1) under GETTING STARTED is no longer necessary.
 
 
Compile, Link and Start
man MPI and man mpirun on hwwt3e.hww.de give a detailed description. The initialization described above is always necessary, i.e. for reading man MPI, for compiling, linking and starting the MPI application, and for using utilities such as xmppview.
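
As an illustration only (the exact compiler and mpirun options are not reproduced here, please consult man MPI and man mpirun): a trivial MPI program such as the following sketch can typically be compiled and linked with the usual compiler drivers after the initialization above and started with mpirun and the desired number of PEs.

    #include <stdio.h>
    #include <mpi.h>

    /* Minimal MPI program: every process reports its rank. */
    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("Hello from process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }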
 
Implementation decisions and the standard send mode
The MPI standard allows different implementations. For example, message passing in standard mode may be implemented buffered or synchronously. Most implementations transfer short messages in a buffered way and longer ones synchronously. Applications should therefore always assume that a standard send may block (i.e. behave synchronously) in order to avoid deadlocks.

Modified due to new protocol strategy in mpt.1.4.0.0 and newer

Short messages are buffered internally and automatically; long messages are transferred synchronously. With the environment variable MPI_BUFFER_MAX, one can modify the maximum length of short messages.
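
The practical consequence can be shown with a small sketch (ranks and buffer sizes are arbitrary): if two processes first call MPI_Send towards each other and only then MPI_Recv, the program deadlocks as soon as the message exceeds the short-message limit. An exchange written with MPI_Sendrecv is safe regardless of internal buffering:

    #include <mpi.h>

    /* Pairwise exchange between ranks 0 and 1 that does not rely on
       internal buffering of standard-mode sends. */
    void safe_exchange(double *sendbuf, double *recvbuf, int n, MPI_Comm comm)
    {
        int rank, partner;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        partner = 1 - rank;            /* assumes exactly the ranks 0 and 1 */

        /* MPI_Sendrecv cannot deadlock, whether the standard send is
           buffered internally or transferred synchronously. */
        MPI_Sendrecv(sendbuf, n, MPI_DOUBLE, partner, 0,
                     recvbuf, n, MPI_DOUBLE, partner, 0,
                     comm, &status);
    }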
 
 
Latency and Throughput
Modified due to new protocol strategy in mpt.1.4.0.0 and newer

With this protocol, CRAY's MPI achieves with standard MPI_Send and MPI_Recv a latency of 6 microseconds; for messages longer than 8 kbytes a bandwidth of more than 220 Mbytes/sec, for messages longer than 64 kbytes more than 300 Mbytes/sec, and for messages longer than 256 kbytes about 315 Mbytes/sec.
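
Such figures are typically measured with a ping-pong benchmark; the following sketch (message length and repetition count are arbitrary) shows the principle with MPI_Wtime:

    #include <stdio.h>
    #include <mpi.h>

    #define NBYTES (256*1024)   /* message length, arbitrary */
    #define NREP   100          /* repetitions, arbitrary */

    /* Ping-pong between ranks 0 and 1; run with exactly 2 processes. */
    int main(int argc, char **argv)
    {
        static char buf[NBYTES];
        int rank, i;
        double t0, t1;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        t0 = MPI_Wtime();
        for (i = 0; i < NREP; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, NBYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();

        /* one iteration transfers NBYTES in each direction */
        if (rank == 0)
            printf("bandwidth: %f MByte/s\n",
                   2.0 * NREP * NBYTES / ((t1 - t0) * 1e6));

        MPI_Finalize();
        return 0;
    }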
 
 
Profiling
The release mpt 1.4.0.2.p is the same as mpt.1.4.0.2, except that automatic MPI profiling is included, i.e. MPI_Finalize writes a statistical summary of the MPI calls, by default only to a syslog file.

The user can also obtain this summary by setting the environment variable MPIPROFOUT to stdout, stderr or any filename, e.g.

with export MPIPROFOUT=stdout
or     export MPIPROFOUT=my_mpi_profiling_summary
or     export MPIPROFOUT=~/profiling_sum_on_my_home

Then the summary is written to that stream or file. Its format is explained here.

The automatic MPI profiling needs about 200 kb of memory, about 0.1 sec for writing the statistics, and about 0.3E-6 sec for each MPI call, i.e. benchmarking of longer running programs should not be affected. The automatic profiling was included to obtain information for deciding which MPI functions should be optimized in the future. This profiling does not cover any shmem routines. Therefore communication implemented with shmem or with libraries that use shmem (e.g. ScaLAPACK, BLAS or PBLAS) is not profiled.

The profiling is implemented internally in the MPI libraries libmpi.a and libpmpi.a. Therefore the standardized external profiling interface PMPI_mpi_function can be used as usual, i.e. by adding -lpmpi to the link command. If you use -lpmpi, the profiling counters for the routines MPI_Address, MPI_Type_struct, MPI_Wtick and MPI_Wtime may be increased because of the internal use of MPI routines inside the PMPI library.
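
Independently of the counters built into the CRAY library, the standard profiling interface can also be used for your own wrappers: a routine with the MPI_ name shadows the library version and forwards to the PMPI_ entry point. A minimal sketch (the counter and its output are purely illustrative, not part of the CRAY library):

    #include <stdio.h>
    #include <mpi.h>

    static long num_sends = 0;    /* illustrative counter */

    /* User wrapper that shadows MPI_Send and forwards to PMPI_Send. */
    int MPI_Send(void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        num_sends++;
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }

    /* Report the counter at the end of the run. */
    int MPI_Finalize(void)
    {
        fprintf(stderr, "MPI_Send was called %ld times\n", num_sends);
        return PMPI_Finalize();
    }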

Additionally you can make intermediate prints on your profiling output file (see MPIPROFOUT above). Any MPI process can print the current state of the profiling database in C with MPI_Pcontrol(77001) or in Fortran with CALL MPI_PCONTROL(77001).
An intermediate print of the average over all processes can be done with the collective call MPI_Pcontrol(77007) or CALL MPI_PCONTROL(77007), which must be called by all processes in MPI_COMM_WORLD. MPI_Finalize produces the same average printing.
Both calls to MPI_Pcontrol do not change the database. The overhead produced by these calls is not accumulated. Calls to MPI_Pcontrol with other arguments are ignored. The intermediate prints are not written to the syslog file.
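
In C, such intermediate prints at a given point of the program could look like the following sketch (77001 and 77007 are the values documented above):

    /* snapshot of the profiling database of this process (any process may call this) */
    MPI_Pcontrol(77001);

    /* ... further computation and communication ... */

    /* average over all processes; collective, i.e. every process
       in MPI_COMM_WORLD must call it */
    MPI_Pcontrol(77007);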

The syslog file is analysed each weekend and a weekly summary of your jobs is mailed to you. It includes a table about your usage of all MPI routines. The format of this table is described here.

First results are published in the BI. 6/7 1998. A preprint is available. Details about the results of the first two months were presented at the ZKI-AK Supercomputing on Oct. 2, 1998; the slides are available as gzipped PostScript. A summary of the first three months was presented at the 16. UNICOS AK on Oct. 30, 1998; slides as gzipped PostScript. Both PostScript files cannot be viewed with ghostview, but they can be printed on most PostScript printers. Due to an error in the timing intrinsic _rtc() of the T3E, the results in the first presentation were partially incorrect; the analysis used for the second presentation tried to correct the invalid _rtc() timestamps. The statistical information is based on an automatically generated analysis; an example with anonymous user ids is shown here. The following papers have been published: "Automatic MPI Counter Profiling of All Users: First Results on a CRAY T3E 900-512" at MPIDC'99, "Effective Performance Problem Detection of MPI Programs on MPP Systems: From the Global View to the Detail" at ParCo99, "Automatic Profiling of MPI Applications with Hardware Performance Counters" at EuroPVM/MPI'99, and "Automatic MPI Counter Profiling" at CUG Summit 2000.

Survey results, gathered in May 1999, are published here.

Restriction: In the current version, only the MPI-1 routines are profiled. MPI-2 routines are executed without profiling.
 

 
Striped MPI-I/O with mpt.1.3.0.1 and newer

The MPI library contains the MPI-I/O implementation ROMIO 1.0.1, which enables parallel file I/O. With the modifications mentioned at the end of the following list, one can achieve a parallel disk rate of about 200 MByte/sec.
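
A minimal sketch of the supported subset (the file name is illustrative; individual file pointers and explicit offsets are used because shared file pointers are not available, see the limitations below):

    #include <mpi.h>

    #define N 1024    /* number of doubles per process, arbitrary */

    /* Each process writes its own contiguous block of a common file. */
    void write_blocks(double *buf, MPI_Comm comm)
    {
        MPI_File fh;
        MPI_Offset offset;
        MPI_Status status;
        int rank;

        MPI_Comm_rank(comm, &rank);

        MPI_File_open(comm, "/hwwt3e_tmp/your_scr_dir/data.out",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        offset = (MPI_Offset)(rank * N * sizeof(double));
        MPI_File_write_at(fh, offset, buf, N, MPI_DOUBLE, &status);
        /* note: with mpt.1.3.0.1 / ROMIO 1.0.1 the status argument is
           not filled in (see the limitations below) */

        MPI_File_close(&fh);
    }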

mpt.1.3.0.1 and ROMIO 1.0.1 have the following limitations:

  • ROMIO 1.0.1 includes everything defined in the MPI-2 I/O chapter except shared file pointer functions (Sec. 9.4.4), split collective data access functions (Sec. 9.4.5), support for file interoperability (Sec. 9.5), I/O error handling (Sec. 9.7), and I/O error classes (Sec. 9.8).
  • Since shared file pointer functions are not supported, the MPI_MODE_SEQUENTIAL amode to MPI_File_open is also not supported.
  • The subarray and distributed array datatype constructor functions from Chapter 4 (Sec. 4.14.4 & 4.14.5) have been implemented. They are useful for accessing arrays stored in files. The functions MPI_File_f2c and MPI_File_c2f (Sec. 4.12.4) are also implemented.
  • The "status" argument is not filled in any function. Consequently, MPI_Get_count and MPI_Get_elements will not work when passed the status object from an MPI-IO operation.
  • All nonblocking I/O functions use a ROMIO-defined "MPIO_Request" object instead of the usual "MPI_Request" object. Accordingly, two functions, MPIO_Test and MPIO_Wait, are provided to wait and test on these MPIO_Request objects. They have the same semantics as MPI_Test and MPI_Wait:
    int MPIO_Test(MPIO_Request *request, int *flag, MPI_Status *status);
    int MPIO_Wait(MPIO_Request *request, MPI_Status *status);
    The usual functions MPI_Test, MPI_Wait, MPI_Testany, etc., will not work for nonblocking I/O.
  • This version works only on a homogeneous cluster of machines, and only the "native" file data representation is supported.
  • All functions return only two possible error codes -- MPI_SUCCESS on success and MPI_ERR_UNKNOWN on failure.
  • End-of-file is not detected, i.e. the individual file pointer is increased by the requested amount of data and not by the amount of data actually read. Therefore MPI_FILE_GET_POSITION returns a wrong offset after EOF has been reached.
  • The MPI routines must not be called with an empty Fortran90 array as buf (and a zero count argument). This may be a problem if MPI_DISTRIBUTE_BLOCK produces empty blocks (e.g. with ndims=1, gsize=5, psize=4, the local array lengths are 2,2,1,0 on the four processors).
    Work around: Use a dummy buffer in the case of an empty Fortran90 array. See example exa_block.f in directory ufs_t3e.
  • The newtype of MPI_TYPE_CREATE_DARRAY is an invalid datatype in those processes that have empty blocks.
    Work around: Use etype instead of newtype if the local block is empty. See Example exa_block.f in directory ufs_t3e.
  • mpt.1.3.0.1 uses neither striped nor parallel file I/O. Therefore the total bandwidth of all processes is limited to the speed of one RAID (about 30 MByte/sec).
    Work around: To achieve full striped and parallel performance (up to 200 MByte/sec), one can use the three modified device driver routines in ufs_t3e or www.hlrs.de/mpi/ufs_t3e.tar.gz. For details, see the README.html file.
  • The man pages do not document these limitations.
 
MPI release, documentation and known bugs
CRAY's MPI on the T3E meets the standard MPI 1.2:
The standard document "MPI: A Message-Passing Interface Standard", Rev. June 12, 1995 is available as mpi-11.ps.gz (355757 bytes) or mpi-11.ps.Z (506895 bytes). The MPI 1.2 extensions are published as part of the MPI-2 document: mpi-20.ps.gz (595560 bytes) or mpi-20.ps.Z (870935 bytes).
 
MPI is documented in the man pages "man MPI" and "man mpirun" and in the MPI Programmer's Manual.
 
If you have problems with CRAY's MPI then please look first at the list of known bugs.
 
The usage of the performance analyser VAMPIRtrace is documented here.
 
Further information about MPI can be found here.
 
 
MPI Benchmark Service
 
Responsible for MPI at the HLRS are
Rolf Rabenseifner, rabenseifner@rus.uni-stuttgart.de, Tel. 0711/685-5530, and
Matthias Müller, mueller@hlrs.de, Tel. 0711/685-8038
 
 