Using OpenMPI

Overview

Per the OpenMPI consortium website: "The Open MPI Project is an open source Message Passing Interface implementation that is developed and maintained by a consortium of academic, research, and industry partners." OpenMPI is one of the most commonly used MPI implementations for High Performance Computing (HPC).

This document attempts to provide general guidance and best practices for using OpenMPI applications on Matilda. Several OpenMPI versions are currently available for your use on Matilda. Versions >= 4.1.1c are strongly recommended to ensure the best performance and widest range of available features.

Since OpenMPI (like any MPI implementation) is complex and highly customizable, we do not try to advise here on the creation, compilation, or linking of code. Nor is this an exhaustive list of all of the runtime options that one could use when launching an OpenMPI application. See the OpenMPI documentation site for more information.

Runtime Best Practices

mpirun Specification

When running an OpenMPI-based application from within a job script, it is strongly recommended to use "mpirun <application name>" without any processor count (-np #) specification. The number of nodes, tasks, and cores/cpus should be specified by the job script itself. Because the OpenMPI builds on Matilda have been compiled with SLURM support, SLURM will automatically launch the necessary number of processes based on the job specification requirements.

Using something like "mpirun -np 40 <application name>" (for example) can create a massive over-subscription situation - essentially, if 40 cores are requested and the "-np 40" flag is also used, the result can be 40x40=1,600 processes competing for 40 cores, increasing the load by a factor of 40. While SLURM will permit this, it will cause considerable process switching, run up the load level on the node, and as a result will significantly slow down your job.

From OpenMPI:

When mpirun is launched in a Slurm job, mpirun will automatically utilize the Slurm infrastructure for launching and controlling the individual MPI processes. Hence, it is unnecessary to specify the --hostfile, --host, or -n options to mpirun.
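
For example, a minimal batch script might look like the following sketch (the node and task counts are illustrative, and "myapp" is a placeholder for your application):

#!/bin/bash
#SBATCH --job-name=mpi_job
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=40
#SBATCH --time=2:00:00

module load OpenMPI/4.1.1c
mpirun myapp

Note that mpirun is given no -np, --host, or --hostfile options; the process count comes from the SLURM job specification.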

srun versus mpirun

It is similarly recommended to AVOID using "srun" to launch OpenMPI applications. The OpenMPI developers strongly recommend against this practice (although SLURM documentation differs). Experience and testing have shown that using something like "srun <application name>" can result in significant resource (cpu) over-utilization and sometimes other anomalies.

From OpenMPI:

Unless there is a strong reason to use srun for direct launch, the Open MPI team recommends using mpirun for launching under Slurm jobs.
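
In practice, inside a job script or allocation on Matilda the launch line should look like the first example below rather than the second ("myapp" is again a placeholder):

# Recommended: mpirun uses the SLURM infrastructure automatically
mpirun myapp

# Discouraged on Matilda: direct launch with srun
srun myapp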

salloc Recommendations

For users who use "salloc" to allocate job resources prior to an interactive run, the recommendations provided above still apply. For example, if we were to request 4 nodes and 40 tasks using salloc:

salloc -N 4 -n 40 -t 2:00:00
module load OpenMPI/4.1.1c
mpirun myapp

As a reminder, when running "salloc", resources are allocated when they become available, and the user is given a new login session on the hpc-login node. From that new prompt, simply load the desired modulefiles (including OpenMPI) and then launch the application using "mpirun", just as in a batch job script. The actual processes will be dispatched by SLURM to the allocated nodes, while a lightweight process runs in the background on the login node.

Modular Component Architecture

The Modular Component Architecture (MCA) is a mechanism that may be used to fine-tune runtime parameters when using "mpirun". Users sometimes tweak runtime behavior by specifying MCA parameters, including selection of specific network communication protocols. OpenMPI on Matilda has been compiled with the Unified Communication X (UCX) communication library; since InfiniBand is present on all Matilda compute nodes, UCX is used by default to select the optimal network communication transport.

UCX currently supports:

  • OpenFabrics Verbs (including InfiniBand and RoCE)
  • Cray’s uGNI (not applicable on Matilda)
  • TCP
  • Shared memory
  • NVIDIA CUDA drivers (applicable on GPU nodes)

While users can manually select any of the above transports at run time, if none are provided OpenMPI will select a default transport as follows:

  1. If InfiniBand devices are available, use the UCX PML.
  2. If PSM, PSM2, or other tag-matching-supporting Libfabric transport devices are available (e.g., Cray uGNI), use the cm PML and a single appropriate corresponding mtl module.
  3. Otherwise, use the ob1 PML and one or more appropriate btl modules.

While some users may choose to set MCA transports manually, most will likely achieve optimal performance by allowing OpenMPI to utilize UCX at runtime (although this is not guaranteed).
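
For users who do want to control transport selection explicitly, MCA parameters can be passed on the mpirun command line. The lines below are sketches only ("myapp" is a placeholder), and the defaults described above are usually the better choice:

# Explicitly request the UCX PML (normally selected automatically on InfiniBand nodes)
mpirun --mca pml ucx myapp

# Force the ob1 PML with shared-memory and TCP BTLs (e.g., for troubleshooting)
mpirun --mca pml ob1 --mca btl self,vader,tcp myapp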

CUDA Support

CUDA support has been built into Matilda's OpenMPI versions >= 4.1.1-cuda12.1. OpenMPI versions compiled with CUDA support have a modulefile suffix of "-cudaXX.x" (use "module av OpenMPI" to see available versions). These versions may of course be used on non-CUDA-capable nodes without any errors or performance issues, but users desiring GPU support should purposefully select one of these versions if their job will run on the GPU nodes, in order to leverage the installed CUDA devices. As a reminder, make sure to specify "--gres=gpu:#" (where # is an integer from 1 to 4) when requesting GPU-capable nodes.
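
For example, a GPU job script might look like the following sketch (the module name, task count, and GPU count are illustrative; use "module av OpenMPI" to confirm the CUDA-enabled versions actually installed):

#!/bin/bash
#SBATCH --job-name=gpu_mpi_job
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:2
#SBATCH --time=2:00:00

module load OpenMPI/4.1.1-cuda12.1
mpirun myapp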


CategoryHPC