Known Problems with MPI/SX on the NEC SX-4/32


At the moment the following bugs and problems are known with MPI/SX 8.1:
 
Documentation errors and problems
  • None.
 
Problems with the MPI/SX implementation
  • When a program contains procedures written in FORTRAN77/SX or FORTRAN90/SX that select an 8-byte numeric storage unit, no procedure in the program may be written in C.

  • When a 4-byte numeric storage unit is selected in FORTRAN77/SX or FORTRAN90/SX programming, several MPI procedures that take an address value as an argument, for example MPI_ADDRESS, cannot be used.

  • MPI/SX does not support the "-h int64" option in C/SX programming. If this option is specified, references to undefined procedures are reported as errors at link time.

  • MPI/SX cannot be used in C++ programming. If it is used, an error occurs at link time.


At the moment the following bugs and problems are known with MPI/SX 7.2:
 
Documentation errors and problems
  • The MPI/SX User's Guide describes MPI/SX 7.1 but not MPI/SX 7.2.

 
Problems with the MPI/SX implementation
  • There is no family scheduling with MPI/SX 7.2.
    Workaround: Use the batch queue NP2GMPI with dedicated processors.
  • The f77 option -acct in combination with F_PROGINF=DETAIL in the environment, and the cc option -hacct with C_PROGINF=DETAIL, produce confusing output because the outputs of all processes are intermixed.

    Workaround: One can create and use the following shell script mpisep.sh

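    #!/bin/sh
    # Redirect stderr of this MPI process to its own file.
    NNUM=`cat /etc/nodenum`
    FILE=stderr.$$.$NNUM
    # "$@" keeps arguments that contain blanks intact.
    exec "$@" 2> "$FILE"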

    Usage:
    mpisx -p 2 -e mpisep.sh a.out

    With this script, the PROGINF output of each MPI process goes into a separate file.
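
The redirection idea behind mpisep.sh can be tried locally without mpisx. The following is a minimal sketch under stated assumptions, not part of MPI/SX: sep_run is a hypothetical helper, the node number is passed by hand instead of being read from /etc/nodenum, and two sh -c commands stand in for MPI processes.

```shell
#!/bin/sh
# Local dry run of the stderr separation used by mpisep.sh.
# Assumptions: sep_run is a hypothetical helper; the node number is an
# explicit argument here, whereas mpisep.sh reads it from /etc/nodenum.

# sep_run NNUM COMMAND [ARGS...]
# Runs COMMAND with its stderr redirected to stderr.demo.NNUM.
sep_run() {
    nnum=$1
    shift
    "$@" 2> "stderr.demo.$nnum"
}

# Work in a scratch directory so the demo files do not clutter anything.
cd "$(mktemp -d)" || exit 1

# Two stand-in "processes", each writing one line to stderr.
sep_run 0 sh -c 'echo "PROGINF of process 0" >&2'
sep_run 1 sh -c 'echo "PROGINF of process 1" >&2'

# Each process now has its own file.
cat stderr.demo.0
cat stderr.demo.1
```

Each stand-in process ends up with its own stderr.demo.N file, which is the same separation that mpisep.sh applies to the PROGINF output.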

At the moment the following bugs and problems are known with MPI/SX 6.2 (partially corrected in 7.1):
 
Documentation errors and problems
  • The MPI/SX User's Guide on page 2-13, section 2.6, 1st topic states: "All input/output statements must be issued after the MPI_INIT procedure is called."

    This does not conform to the MPI 1.1 standard and should be read only as a hint for the user:

    IO on the same file from different MPI processes is issued by the runtime system in a nondeterministic order, so the user should handle it very carefully. Deterministic IO is possible only once the processes know their ranks, that is, after the call to MPI_INIT.

 
Problems with the MPI/SX implementation
  • With MPI/SX based on threads, the different Fortran MPI processes (i.e. threads with f77 -P multi -G local) cannot use the same file units. Applications that use the same unit number on different MPI processes for different files do not run. Applications modified for MPI/SX (i.e. using different unit numbers) can use at most 3 files in each of their 32 threads, because at most 99 Fortran unit numbers are available.

    The option "-G local" should turn all process-global objects into thread-local objects. These are SAVE variables, initialized variables, common blocks, data declared in Fortran 90 modules, and the mapping of Fortran unit numbers to filenames/descriptors.
    Only then can MPI processes be generated through mapping by a thread task (see MPI/SX User's Guide, page 3-2, last sentence of the "-G local" description).

    This problem is already known to NEC (see page 2-13, section 2.6, 2nd topic). It will not be fixed, but starting with MPI/SX release 7.2, thread-based implementations are no longer available.

    In the meantime the user can choose between the following workarounds:

    • modify the application so that it uses different unit numbers for different files in different MPI processes;
    • use MPI/SX based on processes (instead of threads).
    This is not a bug, because MPI_Attr_get(MPI_COMM_WORLD,MPI_IO,(void *)&mpi_io,&flag) returns on all processes that only the process with rank 0 in MPI_COMM_WORLD has regular IO facilities. It is only a quality-of-implementation issue, because on the NEC SX-4 each MPI process could have regular IO facilities.
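
The limit of 3 files per thread quoted above follows from dividing the 99 available unit numbers among 32 MPI processes. As a rough sketch of that arithmetic, and of one possible disjoint numbering for the workaround with different unit numbers: the unit_for helper and the block layout are assumptions for illustration, and preconnected units such as standard input and output are ignored.

```shell
#!/bin/sh
# Hypothetical numbering scheme, not taken from the MPI/SX documentation.
MAX_UNIT=99   # highest Fortran unit number (from the text above)
NPROCS=32     # MPI processes (threads) on the SX-4/32

# Integer division: how many distinct units each process can own.
FILES_PER_PROC=$((MAX_UNIT / NPROCS))

# unit_for RANK K
# Prints the unit number of the K-th file (0-based) of process RANK.
# Each rank owns a contiguous block of FILES_PER_PROC unit numbers.
unit_for() {
    echo $(( $1 * FILES_PER_PROC + $2 + 1 ))
}

echo "files per process: $FILES_PER_PROC"
echo "rank 0 first unit: $(unit_for 0 0)"
echo "rank 31 last unit: $(unit_for 31 $((FILES_PER_PROC - 1)))"
```

A real program would also have to skip any unit numbers that the Fortran runtime reserves, which reduces the usable range further.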

  • MPI_IO returns myrank instead of MPI_ANY_SOURCE with MPI/SX based on processes.

  • MPI_Finalize returns only in the process with rank 0 in MPI_COMM_WORLD with MPI/SX based on threads.
    MPI/SX is the only MPI implementation that does not return from MPI_Finalize in all processes. The MPI-2 Forum decided that this is allowed.

  • The f77 option -acct produces confusing output if mpisx starts processes (not threads), because the outputs of all processes are intermixed.

    Workaround: One can create and use the following shell script mpisep.sh

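    #!/bin/sh
    # Redirect stderr of this MPI process to its own file.
    NNUM=`cat /etc/nodenum`
    FILE=stderr.$$.$NNUM
    # "$@" keeps arguments that contain blanks intact.
    exec "$@" 2> "$FILE"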

    Usage:

    mpisx -p 2 -e mpisep.sh a.out

    With this script, the PROGINF output of each MPI process goes into a separate file.

  • Mixing C and FORTRAN is not possible when an 8-byte numeric storage unit is specified or float2 mode is used. In that case, the language mix causes undefined symbols for MPI procedures in the linking phase. When languages are mixed, a 4-byte numeric storage unit in float0 or float1 mode must be specified.

  • Man pages for the MPI procedures are not provided. They will be provided in R7.1.

  • Tools: FANALIZER and CANALIZER cannot be used for MPI programs.

  • Debug mode execution: When an MPI program is executed in debug mode (the -debug option is specified), the run command of pdbx can be executed only once. A second run command causes the program to stall.

  • When MPI_ADDRESS is used in a FORTRAN program, the option -Nw cannot be specified in float0 or float1 mode. To use MPI_ADDRESS in a FORTRAN program, either float2 mode must be used, or the option -w is required in float0 and float1 modes.
 
Problems with the MPICH 1.0.13 test suite using MPI/SX
Problems with the pt2pt test:
  1. structf aborted abnormally.
    It uses MPI_Address. Because the NEC SX-4 uses 32-bit Fortran INTEGERs by default but 64-bit addresses, this test can be compiled only with f77 -ew ... -lmpiw
  2. testall reports in many cases failing to free a request.
    This test assumes that MPI_REQUEST_NULL equals 0. This is a bug in the test, because the standard does not require this.
  3. overtake does not proceed after MPI_Finalize in the process (i.e. the NEC thread) with rank=1, see above.
Problems with the coll test:
  1. redscat aborted abnormally.
    After replacing memory.h and malloc.h with stdlib.h, this test passes.
  2. alltoallv aborted abnormally.
    After replacing memory.h and malloc.h with stdlib.h, this test passes.
(Not yet tested: lederman)

We have not tested MPI/SX based on processes. The errors above probably also exist there.
 
The following bugs in MPI/SX are now fixed:
  • The test pt2pt/cancel aborted abnormally due to a bug in MPI_CANCEL.
  • The tests coll/scatterv and topol/cartf failed due to a bug in MPI_Dims_create:
    MPI_Dims_create(4,2,dims) returned dims = (4,1) instead of (2,2).
  • MPI_WTIME_IS_GLOBAL returned 0 instead of 1.
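
To illustrate the expected MPI_Dims_create result in the case above: the MPI standard asks for dimensions that are as close to each other as possible, in non-increasing order. The sketch below is not the MPI/SX algorithm. dims2 is a hypothetical helper that handles only two dimensions, by picking the largest divisor that does not exceed the square root of the node count.

```shell
#!/bin/sh
# dims2 NNODES
# Prints a balanced two-dimensional factorization "A B" with A >= B
# and A * B = NNODES. Illustration only, limited to two dimensions.
dims2() {
    n=$1
    d=1   # largest divisor found so far with d * d <= n
    i=1
    while [ $((i * i)) -le "$n" ]; do
        if [ $((n % i)) -eq 0 ]; then
            d=$i
        fi
        i=$((i + 1))
    done
    echo "$((n / d)) $d"
}

dims2 4
dims2 6
```

For 4 nodes the helper prints 2 2, matching the corrected behavior, where the faulty version returned (4,1).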
 
General workarounds for the problems above
One can use MPI/SX 7.2.  
 
Rolf Rabenseifner, Rabenseifner@rus.uni-stuttgart.de