Intel MPI problems (Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory)

I have a small cluster of six nodes (Dell R650 and R750 blades). It had been running CentOS 7, but as end of life was approaching, I asked the vendor who sold and set up the cluster to install Rocky Linux 9.2. Since my main use of the cluster is running large scientific codes (typically written in Fortran), I installed the latest version of Intel oneAPI (both the Base and HPC toolkits) using DNF; the version reported by mpiifort is 2021.10.0 20230609. The compiler seems to work fine, but MPI jobs (including binaries built with oneAPI 2022.1.1.119 that I used on the earlier CentOS 7.9 system) do not run; they terminate with the error "Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory". The system is equipped with InfiniBand, and a simple test (mpiexec.hydra -n 32 hostname) works fine both on a single host and across nodes.

To rule out complications from the scientific codes themselves, I reproduced the same class of error with the oneAPI MPI test program test.f90. It compiled without error using mpiifort -o testf90 test.f90, and running it produced the error output shown below.
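For reference, the compiler environment was set up with the usual setvars.sh script before compiling (the path assumes the default DNF install location under /opt/intel/oneapi; adjust if yours differs):

>. /opt/intel/oneapi/setvars.sh
>mpiifort -o testf90 test.f90
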
Strangely enough, the program runs fine as root (the vendor thought of this, and I reproduced the result). The same problem also occurs with a newly created, non-privileged account, so the .bashrc and .bash_profile files (which have a lot of setup in them) are not relevant. Any suggestions as to what the cause might be and, even better, how to fix it? My guess is that it is a permission-related problem, but I cannot be sure.
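
The most likely culprit I can think of, given that root works while ordinary users fail, is the locked-memory (memlock) limit: registering a memory region on mlx5_0 pins memory, root is normally unlimited, and ordinary users often inherit a small default. A quick check as the affected user on a compute node would be:

>ulimit -l

A small value such as 64 instead of "unlimited" would explain the MR failure. The limit could then be raised in /etc/security/limits.conf (or a file under /etc/security/limits.d/); the entries below are only an illustration:

*    soft    memlock    unlimited
*    hard    memlock    unlimited

If the ranks are started through a systemd-managed daemon, that unit's LimitMEMLOCK setting would also need raising, since limits.conf is not applied there. I have not yet confirmed that this is the cause here.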

>mpirun -genv I_MPI_DEBUG=+5 -n 32 ./testf90 
[0#307613:307613@muon] MPI startup(): Intel(R) MPI Library, Version 2021.10  Build 20230619 (id: c2e19c2f3e)
[0#307613:307613@muon] MPI startup(): Copyright (C) 2003-2023 Intel Corporation.  All rights reserved.
[0#307613:307613@muon] MPI startup(): library kind: release
[0#307613:307613@muon] MPI startup(): libfabric version: 1.18.0-impi
[0#307613:307613@muon] MPI startup(): libfabric provider: psm3
muon:rank6.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank6.testf90: Unable to allocate UD send buffer pool
muon:rank15.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank15.testf90: Unable to allocate UD send buffer pool
muon:rank18.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank18.testf90: Unable to allocate UD send buffer pool
muon:rank12.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank12.testf90: Unable to allocate UD send buffer pool
muon:rank29.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank29.testf90: Unable to allocate UD send buffer pool
muon:rank16.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank16.testf90: Unable to allocate UD send buffer pool
muon:rank3.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank3.testf90: Unable to allocate UD send buffer pool
muon:rank30.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank30.testf90: Unable to allocate UD send buffer pool
muon:rank24.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank24.testf90: Unable to allocate UD send buffer pool
muon:rank14.testf90: Unable to alloc send buffer MR on mlx5_0: Cannot allocate memory
muon:rank14.testf90: Unable to allocate UD send buffer pool
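
One detail from the debug output that may or may not matter: the libfabric provider selected is psm3, even though the adapter is an mlx5 (Mellanox/NVIDIA) card. If the provider choice is suspected, it can be overridden for a test run, for example (I have not verified which providers are built into this libfabric, so treat the name as a suggestion):

>mpirun -genv FI_PROVIDER=verbs -genv I_MPI_DEBUG=5 -n 32 ./testf90

or with I_MPI_OFI_PROVIDER set instead of FI_PROVIDER.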

The test.f90 MPI test program is reproduced below for clarity.

!
! Copyright Intel Corporation.
! 
! This software and the related documents are Intel copyrighted materials, and
! your use of them is governed by the express license under which they were
! provided to you (License). Unless the License provides otherwise, you may
! not use, modify, copy, publish, distribute, disclose or transmit this
! software or the related documents without Intel's prior written permission.
! 
! This software and the related documents are provided as is, with no express
! or implied warranties, other than those that are expressly stated in the
! License.
!
        program main
        use mpi
        implicit none

        integer i, size, rank, namelen, ierr
        character (len=MPI_MAX_PROCESSOR_NAME) :: name
        integer stat(MPI_STATUS_SIZE)

        call MPI_INIT (ierr)

        call MPI_COMM_SIZE (MPI_COMM_WORLD, size, ierr)
        call MPI_COMM_RANK (MPI_COMM_WORLD, rank, ierr)
        call MPI_GET_PROCESSOR_NAME (name, namelen, ierr)

        if (rank.eq.0) then

            print *, 'Hello world: rank ', rank, ' of ', size, ' running on ', name

            do i = 1, size - 1
                call MPI_RECV (rank, 1, MPI_INTEGER, i, 1, MPI_COMM_WORLD, stat, ierr)
                call MPI_RECV (size, 1, MPI_INTEGER, i, 1, MPI_COMM_WORLD, stat, ierr)
                call MPI_RECV (namelen, 1, MPI_INTEGER, i, 1, MPI_COMM_WORLD, stat, ierr)
                name = ''
                call MPI_RECV (name, namelen, MPI_CHARACTER, i, 1, MPI_COMM_WORLD, stat, ierr)
                print *, 'Hello world: rank ', rank, ' of ', size, ' running on ', name
            enddo

        else

            call MPI_SEND (rank, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, ierr)
            call MPI_SEND (size, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, ierr)
            call MPI_SEND (namelen, 1, MPI_INTEGER, 0, 1, MPI_COMM_WORLD, ierr)
            call MPI_SEND (name, namelen, MPI_CHARACTER, 0, 1, MPI_COMM_WORLD, ierr)

        endif

        call MPI_FINALIZE (ierr)

        end
