Differences between revisions 40 and 57 (spanning 17 versions)
Revision 40 as of 2021-03-08 10:45:48
Size: 5393
Editor: jbjohnston
Comment:
Revision 57 as of 2022-11-28 11:45:26
Size: 9351
Editor: jbjohnston
Comment:
== Job Scripts ==

=== Serial Single Threaded ===

This example illustrates a job script designed to run a simple single-threaded process on a single compute node:

{{{
# ==== Sample Job Script ====
#!/bin/bash
#SBATCH --job-name=mySerialjob
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0-00:20:00
#SBATCH --mem=3102

cd ${SLURM_SUBMIT_DIR}

module load someApp
someApp
}}}

==== Explanation ====

A single-process run requires only 1 node, 1 CPU core, and a single task. These are reflected in the example script. We change to the directory from which we submitted the job (${SLURM_SUBMIT_DIR}) so that our output is produced there. Then we load the module "someApp" and execute the application.

Note that ${SLURM_SUBMIT_DIR} is one of many environment variables available from within a SLURM job script. For a comprehensive list, please refer to the SLURM documentation.
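Since these are ordinary environment variables, a job script can reference them with standard shell expansions. The sketch below (nothing beyond a bash shell is assumed; the variable names are standard SLURM exports) shows a defensive pattern that also lets a script be dry-run outside of a job:

```shell
#!/bin/bash
# Print a few of the environment variables SLURM exports inside a job.
# Outside of a job these are unset, so ${VAR:-fallback} supplies a default,
# which makes the script testable on a workstation as well.
echo "Job ID:        ${SLURM_JOB_ID:-not-in-a-job}"
echo "Submit dir:    ${SLURM_SUBMIT_DIR:-$PWD}"
echo "CPUs per task: ${SLURM_CPUS_PER_TASK:-1}"

# Defensive version of the "cd" from the script above: fall back to the
# current directory when running outside of SLURM.
cd "${SLURM_SUBMIT_DIR:-$PWD}"
```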

=== Multi-Threaded Single Node ===

In this example we are running an application capable of utilizing multiple process threads on a single node (BLAST):

{{{
# ==== Sample Job Script ====
#!/bin/bash
#SBATCH --job-name=myBLASTjob
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8
#SBATCH --time=0-01:00:00
#SBATCH --mem=3102

cd ${SLURM_SUBMIT_DIR}

module load BLAST
blastn -num_threads 8 <...>
}}}

==== Explanation ====

In this case we still have a single task (our blastn run), but we require 8 CPU cores to accommodate the 8 threads we've specified on the command line. The ellipsis between the angle brackets represents the balance of our command-line arguments.
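One common refinement (a sketch, not part of the original script) is to derive the thread count from SLURM's SLURM_CPUS_PER_TASK export rather than hard-coding 8 in two places, so the resource request and the command line cannot drift apart:

```shell
#!/bin/bash
# Keep the application's thread count in sync with the --cpus-per-task
# request. SLURM exports SLURM_CPUS_PER_TASK inside the job when
# --cpus-per-task is set; the :-1 fallback covers runs outside SLURM.
THREADS=${SLURM_CPUS_PER_TASK:-1}
echo "running with ${THREADS} threads"
# blastn -num_threads "${THREADS}" <...>   # balance of BLAST arguments elided
```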

=== Multiple Serial Jobs ===

Here we demonstrate that it is possible to run multiple copies of the same application, leveraging SLURM's "srun" command to distribute tasks across multiple nodes:

{{{
# ==== Sample Job Script ====
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --time=0-01:00:00
#SBATCH --mem=3102

module load someApp
srun -n 2 python myScript.py &
srun -n 2 someApp &
wait
}}}

==== Explanation ====

We specify 2 nodes and 2 tasks per node (4 tasks total). The "srun" command directs that 2 copies of each application should be run; srun works with SLURM to launch and schedule each task across our assigned nodes. The ampersand (&) causes each task to run "in the background" so that all tasks may be launched in parallel rather than blocking while earlier tasks complete. The "wait" builtin tells the shell to wait until all background tasks have completed before the job script exits.
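The launch-in-background-then-wait pattern is plain bash and can be tried on any machine; the sketch below substitutes trivial commands for the "srun" launches:

```shell
#!/bin/bash
# Same shape as the job script: start each task with "&" so it runs in the
# background, then use the shell builtin "wait" to block until every
# background child has exited.
(sleep 0.2; echo "task A done") > taskA.out &
(sleep 0.1; echo "task B done") > taskB.out &
wait   # returns only after both background tasks finish
cat taskA.out taskB.out
```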

== MPI Jobs ==

=== Important Notes ===

When compiling an application that uses OpenMPI, it is recommended that users utilize OpenMPI/4.1.1c or higher (versions 4.1.1 and 4.0.5 should be avoided). In addition, when running any application that uses or was compiled with OpenMPI, '''DO NOT USE SRUN'''. The developers of OpenMPI specifically [[https://docs.open-mpi.org/en/v5.0.x/running-apps/slurm.html|discourage the use of "srun"]], and experience has shown it can cause over-utilization issues (load higher than the number of cores specified). Instead, use "mpirun" '''WITHOUT''' the "-np <# cores>" flag.

Because OpenMPI has been compiled using SLURM, specifying the number of tasks in the job script is sufficient, and adding the "-np" flag may cause performance issues, including over-utilization. Indeed, one may choose to specify only "ntasks", independent of the number of nodes or tasks per node, since SLURM will automatically assign the requisite number of nodes and distribute cores accordingly when "mpirun" is invoked.
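An "ntasks-only" script of the kind described above might look like the following sketch (the module version and application name are placeholders mirroring the example in the next section; this fragment only runs under SLURM):

```shell
# ==== Hypothetical ntasks-only MPI Job Script ====
#!/bin/bash
#SBATCH --ntasks=28          # total MPI ranks; SLURM chooses the nodes
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=0-10:00:00

module load OpenMPI/4.1.1c

# No "-np" flag: mpirun takes the rank count from the SLURM allocation.
mpirun ./my_mpi_app
```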

=== MPI Examples ===

This is an example of a job script that runs a single MPI application across multiple nodes with distributed memory:

{{{
# ==== Sample MPI Job Script ====
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=14
#SBATCH --ntasks=28
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=1G
#SBATCH --time=0-10:00:00

module load OpenMPI/4.1.1c

mpirun ./my_mpi_app
}}}

==== Explanation ====

Two nodes are assigned with 14 tasks per node (28 tasks total). One GB of RAM is allocated per CPU, and mpirun is used to launch our MPI-based application. Because OpenMPI has been compiled with the SLURM libraries, the "mpirun" command acts as a wrapper, making sure to allocate the proper number of tasks to the application based upon the "ntasks" specification provided in the job script.

== Interactive Jobs ==

=== srun ===

It is possible to schedule an interactive job on the cluster. This can be accomplished using "srun" and specifying resource parameters on the command line:

{{{
srun -N 1 -c 1 -t 30:00 --pty /bin/bash --login
}}}

==== Explanation ====

Here 1 node and 1 core are specified, with a walltime of 30 minutes. The balance of the command gives us a bash login shell that will be scheduled by SLURM on one of the compute nodes.

Upon launching srun and the job resources being allocated, you will be connected interactively to the session on the compute node. Once you exit the compute node, the job session is terminated.

=== salloc ===

If you want to create an interactive session and connect to and disconnect from that job without terminating it before the walltime limit is reached, you can use "salloc":

{{{
salloc -N 1 -c 1 -t 30:00
}}}

This will return something like the following:

{{{
salloc: Granted job allocation 18187
}}}

Now, to connect to the allocated resources on the compute node, use "srun":

{{{
srun --jobid=18187 --pty /bin/bash
}}}

Of course, substitute in your actual job number returned by salloc.

==== Explanation ====

Using a combination of salloc and srun separates the resource allocation from the actual interactive compute node session. This allows the user to allocate resources, then connect to or disconnect from the session without killing the job (so long as the walltime isn't exceeded).
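Because the job number is the only thing srun needs, the salloc step can also be scripted. The sketch below parses the job number out of the "Granted job allocation" message shown above (the salloc invocation in the comment, including its "--no-shell" flag, is an assumption for illustration; the parsing itself runs anywhere):

```shell
#!/bin/bash
# Extract the job number from salloc's status line so the follow-up srun
# can be issued without retyping it. A canned message is parsed here; on
# the cluster you would capture the real salloc output, e.g.:
#   msg=$(salloc -N 1 -c 1 -t 30:00 --no-shell 2>&1)
msg="salloc: Granted job allocation 18187"
jobid=$(echo "$msg" | sed -n 's/.*Granted job allocation \([0-9][0-9]*\).*/\1/p')
echo "$jobid"
# srun --jobid="$jobid" --pty /bin/bash   # connect to the allocation
```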

=== X11 Forwarding ===

Matilda now permits X11 forwarding on its nodes for interactive runs. To use forwarding, you must first establish an X11 session with the login node:

{{{
ssh -X [email protected]
}}}

or alternatively:

{{{
ssh -Y [email protected]
}}}

'''Note:''' "-X" treats Matilda as an "untrusted" host, which may be more secure but could result in more errors. "-Y" treats Matilda as a "trusted" host and is less likely to generate errors (but could be somewhat less secure). As a general rule of thumb, use "-X" unless you find your application doesn't run well with it, in which case use "-Y". Sometimes "-X" will crash certain GUI applications, because it restricts functionality that may be needed for the application to work properly.

Once an X11 session is established on Matilda, you can start an interactive job using "srun" as follows (to allow X11 forwarding to the node):

{{{
srun -N 1 -c 1 -t 30:00 --x11 --pty /bin/bash --login
}}}

Similarly, if you are using the "salloc-srun" method, you can use:

{{{
salloc -N 1 -c 1 -t 30:00 --x11
srun --jobid=<jobnumber> --x11 --pty /bin/bash
}}}

==== Explanation ====

The commands above are almost exactly the same as those presented under the sections on [[https://kb.oakland.edu/uts/HPCJobScripts#srun|srun]] and [[https://kb.oakland.edu/uts/HPCJobScripts#salloc|salloc]], except that the "--x11" flag is added to each command, AND you must first establish an X session with the login node.
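A quick way to confirm that forwarding actually reached your session is to check the DISPLAY variable, which ssh sets when X11 forwarding is active (a sketch; the example value is illustrative):

```shell
#!/bin/bash
# With X11 forwarding active, ssh sets DISPLAY (e.g. "localhost:10.0").
# If it is empty, GUI programs will fail to open a window.
if [ -n "${DISPLAY:-}" ]; then
    echo "X11 display available: $DISPLAY"
else
    echo "no X11 display; reconnect with ssh -X or ssh -Y"
fi
```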

== Job Arrays ==

Job arrays are a convenient way to perform the same set of procedures or tasks on multiple data sets without having to launch more than one job. This reduces the number of job scripts required and allows the jobs to run in parallel from a single script. In the example below, we are executing the same process on 4 different input files:

{{{
# ==== Sample Job Script ====
#!/bin/bash
#SBATCH --job-name=myArrayTest
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0-10:00:00
#SBATCH --array=1-4

file=$(awk "NR==${SLURM_ARRAY_TASK_ID}" file_list.txt)
python /home/someUser/myscript.py $file > myoutput_${SLURM_ARRAY_TASK_ID}.out
}}}

=== Explanation ===

  1. The line "#SBATCH --array=1-4" specifies that we are running 4 tasks, numbered 1-4.
  2. The line beginning "file=" uses "awk" to read the line number corresponding to the SLURM_ARRAY_TASK_ID (1-4) from the file "file_list.txt", which is contained in the working directory.
  3. The python script "myscript.py" operates on the value returned for "$file" (the filename) and stores the output in a file named "myoutput_#.out", where "#" corresponds to the job array ID for the SLURM task.
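Steps 1 and 2 can be exercised locally: the sketch below builds a small file_list.txt (the file names are made up), sets SLURM_ARRAY_TASK_ID by hand (inside a real array job SLURM sets it for you), and runs the same awk line-selection:

```shell
#!/bin/bash
# Reproduce the file= line from the array script. awk's NR==N pattern
# prints only line N of the input file.
printf 'input1.dat\ninput2.dat\ninput3.dat\ninput4.dat\n' > file_list.txt

SLURM_ARRAY_TASK_ID=3      # set by SLURM (1-4) inside a real array job
file=$(awk "NR==${SLURM_ARRAY_TASK_ID}" file_list.txt)
echo "$file"               # prints the 3rd line: input3.dat
```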

== GPU Jobs ==

To run a GPU-based job, we simply need to add an SBATCH request for the generic resource ("gres") of type "gpu", as shown below:

{{{
# ==== Sample Job Script ====
#!/bin/bash
#SBATCH --job-name=myGPUjob
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=1
#SBATCH --gres=gpu:1

./myGPUapp
}}}

=== Explanation ===

In the example above, "--gres=gpu:1" requests one GPU. The job will automatically be assigned to an HPC GPU node.
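When SLURM's GPU plugin is configured (as is typical for gres=gpu nodes), the job sees only its assigned devices through the CUDA_VISIBLE_DEVICES variable. The sketch below counts them; the fallback value stands in for what a "--gres=gpu:1" job would commonly receive:

```shell
#!/bin/bash
# Count the GPUs visible to this job. SLURM commonly exports
# CUDA_VISIBLE_DEVICES (e.g. "0" for a --gres=gpu:1 job); the fallback
# lets the sketch run on a machine without SLURM.
devices=${CUDA_VISIBLE_DEVICES:-0}
IFS=',' read -r -a gpus <<< "$devices"
echo "visible GPUs: ${#gpus[@]}"
```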


CategoryHPC