Using Job Steps
In SLURM, Job Steps are a way to launch distinct tasks, most commonly in parallel but also sequentially, from within a single job script. Job Steps are launched with the SLURM command "srun".
By default, every job consists of at least one Job Step. When a job is launched, a single step is created automatically even if "srun" is not used to launch distinct tasks. Recall from the discussion of Job Scripts that a job script contains directives to the SLURM resource manager that specify the resources to be used by the job, such as the number of nodes, cores, tasks, cpus per task, etc. When Job Steps are launched within a job using "srun", each task generated by "srun" uses all or a portion of those job-defined resources.
Job Step Example 1: Parallel Tasks, Multiple Nodes
Imagine we had the following job script:
#!/bin/bash --login
## Job Step Example 1
#SBATCH --job-name=jobStepEx1
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=40
#SBATCH --time=01:00:00

## Launch Parallel Job Steps
srun -N 1 -n 1 -c 40 ./myScript.py -i dataset.dat -o output1.dat &
srun -N 1 -n 1 -c 40 ./anotherScript.py -i input2.dat -o output2.dat &
wait
./postProcessing.py -i output1.dat output2.dat
echo "End Job"
In the example above, we allocate 2 nodes, 2 tasks, and 40 cpus-per-task. A minimum of 2 tasks is required since we intend to run 2 parallel (simultaneous) tasks, each launched by its own "srun" command. Since we'd like to use 40 cpu cores for each task, we need to allocate at least 2 nodes (compute nodes on Matilda typically have 40 cores each). Note the ampersand "&" at the end of each "srun" command line. In Linux, this tells the shell to run the command "in the background" (i.e., control returns to the script immediately after the command is issued). This allows the two tasks to run at the same time.
The "wait" statement is a bash builtin which instructs the job script to pause until all background steps have completed. In this example, we want to perform some post-processing on the output of the 2 parallel tasks (output1.dat and output2.dat). Thus, we do not want to proceed with the "postProcessing.py" step until those tasks have completely finished.
What would happen if we omitted the "&" at the end of the "srun" commands? In that case, the first "srun" task would run in the foreground, and the second would not start until the first had finished. The two steps would run one after the other instead of at the same time.
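The effect of "&" and "wait" can be tried outside of SLURM. The following is a minimal sketch in which a hypothetical helper, fake_step, stands in for an "srun" Job Step by simply sleeping for a given number of seconds. It is not part of SLURM and only illustrates the shell behavior described above.

```shell
# Stand-in for an srun job step (hypothetical helper): sleep, then report.
fake_step() {
    sleep "$2"
    echo "step $1 done"
}

# Background both "steps" so they run at the same time, then block on wait.
start=$(date +%s)
fake_step A 2 &
fake_step B 2 &
wait
parallel_elapsed=$(( $(date +%s) - start ))

# Without "&" the second step starts only after the first has finished.
start=$(date +%s)
fake_step A 2
fake_step B 2
sequential_elapsed=$(( $(date +%s) - start ))

echo "parallel:   ${parallel_elapsed}s"
echo "sequential: ${sequential_elapsed}s"
```

The backgrounded pair should finish in roughly the time of one step, while the foreground pair takes roughly the sum of both.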
The output file for our job script (slurm-<jobid>.out) would contain something like the following:
srun: lua: Submitted: 50434.0 component: 0 using: srun
srun: lua: Submitted: 50434.1 component: 0 using: srun
End Job
Here, the job ID is 50434. The suffixes ".0" and ".1" identify the 2 Job Steps.
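If a script needs the job ID and the step suffix separately, a step ID of this form can be split with shell parameter expansion. This is a small sketch using the example step ID "50434.1" from the output above; the variable names are illustrative.

```shell
step_id="50434.1"          # example step ID from the job output
job_id="${step_id%%.*}"    # drop everything from the first "." onward: the job ID
step="${step_id#*.}"       # drop everything up to the first ".": the step suffix
echo "job=${job_id} step=${step}"
```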
Job Step Example 2: Parallel Tasks, One Node
Suppose we run the following:
#!/bin/bash --login
## Job Step Example 2
#SBATCH --job-name=jobStepEx1
#SBATCH --nodes=1
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=20
#SBATCH --time=01:00:00

## Launch Parallel Job Steps
srun -N 1 -n 1 -c 40 --overlap ./myScript.py -i dataset.dat -o output1.dat &
srun -N 1 -n 1 -c 40 --overlap ./anotherScript.py -i input2.dat -o output2.dat &
wait
./postProcessing.py -i output1.dat output2.dat
echo "End Job"
Note we have reduced the number of nodes to 1 and "--cpus-per-task" to 20, keeping the other resource parameters the same. We have also added the "--overlap" flag to the "srun" commands. This flag permits the resources allocated to the job (nodes, CPUs, etc.) to be shared between Job Steps. In this case, both tasks will be able to use up to 40 cores on a single node. This may not be desirable, since it could oversubscribe the node: if both tasks are busy, their threads compete for the same cores. However, if one of the parallel tasks is fairly lightweight, this could increase our efficiency without significantly impacting performance.
What would happen in this case if we omitted the "--overlap" flags? Since each task calls for the use of 40 cpu cores, and we have not specified that resources can be shared, the effect would be the same as omitting the "&" in the previous example. That is, the two "srun" steps would run sequentially.
If, however, we were to reduce the cores per "srun" Job Step to "20", both could run at the same time, since each would use exactly half of the total cpu cores available on the node.
Note that the value "--cpus-per-task=20" only affects how resources are scheduled for the job; it does not directly limit the number of cpus each "srun" task can use. That is, resources are scheduled based upon:
- 1 node * 2 tasks * 20 cpus-per-task = 40 cpu cores total
Given that fact, what would happen if we set "--cpus-per-task=40"? The job would then request 80 cores on one node. Since no single node on Matilda has 80 cores, the job would not be scheduled and an error would be returned.
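The core-count reasoning above can be checked with shell arithmetic. The sketch below rests on stated assumptions: "--ntasks" is treated as the job-wide task count, tasks are spread evenly over the requested nodes, and each node has 40 cores as described in the text. The function name fits_on_nodes is hypothetical, and the real scheduler considers much more than this.

```shell
# Report whether a request fits, assuming tasks are spread evenly over the nodes.
# Arguments: nodes ntasks cpus_per_task cores_per_node
fits_on_nodes() {
    total=$(( $2 * $3 ))                    # cores requested by the whole job
    per_node=$(( (total + $1 - 1) / $1 ))   # cores needed per node, rounded up
    if [ "$per_node" -le "$4" ]; then
        echo "fits"
    else
        echo "does not fit"
    fi
}

fits_on_nodes 1 2 20 40   # Example 2 as written: 40 cores on one node
fits_on_nodes 1 2 40 40   # with --cpus-per-task=40: 80 cores on one node
fits_on_nodes 2 2 40 40   # Example 1: 80 cores over two nodes
```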
The job output for the job script above would look something like the following:
srun: Job step's --cpus-per-task value exceeds that of job (40 > 20). Job step may never run.
srun: Job step's --cpus-per-task value exceeds that of job (40 > 20). Job step may never run.
srun: lua: Submitted: 50442.0 component: 0 using: srun
srun: lua: Submitted: 50442.1 component: 0 using: srun
End Job
Note we have 2 warnings here: one for each Job Step. However, we can see that the Job Steps run anyway. That's because we have used the "--overlap" flag which permits sharing. Also consider the following from the SLURM documentation:
"NOTE: Beginning with 22.05, srun will not inherit the --cpus-per-task value requested by salloc or sbatch. It must be requested again with the call to srun or set with the SRUN_CPUS_PER_TASK environment variable if desired for the task(s)."
Checking Job Step Status
To see the detailed status of individual Job Steps, we can use the following:
squeue -s -u <username>
This will yield a result like the following:
50442.0       myScr   general-s  username  0:02  hpc-bigmem-p01
50442.1       anoth   general-s  username  0:02  hpc-bigmem-p02
50442.batch   batch   general-s  username  0:05  hpc-bigmem-p01
50442.extern  extern  general-s  username  0:05  hpc-bigmem-p[01-02]
Here we see each of the parallel Job Steps (50442.0 and 50442.1) in addition to the standard batch script output we normally see with the "squeue -u <username>" command.
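To pick out only the numbered Job Steps from such a listing, the output can be filtered with awk. The sketch below works on sample lines copied from the listing above rather than on a live queue, and it assumes the six-column layout shown there.

```shell
# Sample "squeue -s" lines copied from the listing (no header line).
sample='50442.0 myScr general-s username 0:02 hpc-bigmem-p01
50442.1 anoth general-s username 0:02 hpc-bigmem-p02
50442.batch batch general-s username 0:05 hpc-bigmem-p01
50442.extern extern general-s username 0:05 hpc-bigmem-p[01-02]'

# Keep only steps whose suffix is numeric; print the step ID and node list.
numbered_steps=$(printf '%s\n' "$sample" |
    awk '$1 ~ /^[0-9]+\.[0-9]+$/ { print $1, $6 }')
echo "$numbered_steps"
```

On a cluster, the same awk filter could be fed from "squeue -h -s -u <username>", where "-h" suppresses the header line.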