SLURM Monitoring Tools
Overview
The SLURM resource manager on Matilda comes with several built-in tools that are useful for monitoring jobs and the status of nodes on the cluster. Some of these commands have fairly complex (but powerful) formatting options, so only a brief overview is presented below. References are provided at the end of this document for users who wish to expand upon the information provided herein.
The Goals of Monitoring and a Few Examples
While UTS staff monitor Matilda for functionality, compliance, and availability of resources, self-monitoring of your own jobs can be very useful for improving the efficiency of your work on the cluster. Some of the benefits of self-monitoring include:
Determining how much of a resource your jobs are actually using: For example, if you guess you might need 300GB of memory (RAM) but the job actually requires only 50GB, specifying 300GB unnecessarily confines your job to one of the 4 "bigmem" nodes. Because the "bigmem" nodes are in high demand (and few in number), your job may spend much more time in a queued state than is actually necessary.
Evaluating job history: It is possible to list all of the jobs you've run over a given time period. This information can be used to determine time spent on various tasks, estimate future job resource requirements, or determine the status of a job that ended unexpectedly (e.g., did it fail or complete, and at what time?).
Assessing currently available resources: When planning your work, it may be helpful to assess what resources are currently available on the cluster. The cluster "occupancy rate" varies considerably, even over short periods of time. For example, it is not uncommon for the cluster to go from being only 5-10% "occupied" to well over 80%, and then a week or so later, back down to 5-10%.
Evaluating job performance and correctness: Suppose you believe you've correctly specified 40 CPUs for a job - and thus believe you'll have the whole node to yourself - only to discover another user has a job running on the same node. Worse, that additional job is now slowing your job down. This is often caused by specifying parameters such as "ntasks" in a way that does not match the number of threads or processes actually used (e.g., ntasks=1, but the job runs mpirun -np 40). In that case, SLURM assigns your job to the node and reserves one (1) CPU core for you, but you are actually using 40. SLURM may then assign another job to that node because, by its accounting, the node still has 39 CPU cores available. The node is now "over-utilized", and this slows down your job. By using monitoring tools, you can check whether your job is actually set up correctly and make changes, if necessary, before a problem like this occurs (see the sketch following this list).
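To illustrate the last point, here is a minimal sketch of a batch script in which the requested task count matches the number of MPI ranks actually launched. The job name, time limit, and executable are placeholders, not recommendations for any particular application:

#!/bin/bash
#SBATCH --job-name=mpi_example     # hypothetical job name
#SBATCH --ntasks=40                # ask SLURM to reserve 40 CPU cores (one per task)
#SBATCH --time=02:00:00            # example walltime

# Launch exactly as many MPI ranks as tasks requested above, so SLURM's
# accounting matches the job's real CPU usage.
mpirun -np $SLURM_NTASKS ./my_program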
Useful SLURM Monitoring Commands
sstat
The SLURM command sstat is useful for obtaining information on your currently running jobs. Simply running "sstat <jobid>" will produce many metrics, but the output can be a bit messy. You can control the formatting of the sstat output by using specifiers with the "--format=" flag. For example:
sstat <jobid> --format=JobID,MaxRSS,AveCPU,NTasks,NodeList,TRESUsageInTot%40
...will provide the maximum memory used (MaxRSS), the average CPU time consumed by the job's tasks (AveCPU), the number of tasks, the list of nodes, and the total trackable resource usage so far (TRESUsageInTot). The "%40" appended to "TRESUsageInTot" controls the formatted field width of that column. To see all of the available format specifier options, you may run:
sstat --helpformat
AveCPU               AveCPUFreq           AveDiskRead          AveDiskWrite
AvePages             AveRSS               AveVMSize            ConsumedEnergy
ConsumedEnergyRaw    JobID                MaxDiskRead          MaxDiskReadNode
MaxDiskReadTask      MaxDiskWrite         MaxDiskWriteNode     MaxDiskWriteTask
MaxPages             MaxPagesNode         MaxPagesTask         MaxRSS
MaxRSSNode           MaxRSSTask           MaxVMSize            MaxVMSizeNode
MaxVMSizeTask        MinCPU               MinCPUNode           MinCPUTask
Nodelist             NTasks               Pids                 ReqCPUFreq
ReqCPUFreqMin        ReqCPUFreqMax        ReqCPUFreqGov        TRESUsageInAve
TRESUsageInMax       TRESUsageInMaxNode   TRESUsageInMaxTask   TRESUsageInMin
TRESUsageInMinNode   TRESUsageInMinTask   TRESUsageInTot       TRESUsageOutAve
TRESUsageOutMax      TRESUsageOutMaxNode  TRESUsageOutMaxTask  TRESUsageOutMin
TRESUsageOutMinNode  TRESUsageOutMinTask  TRESUsageOutTot
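Note that sstat reports on job steps. For a job launched as a simple batch script (with no srun steps), the usage statistics are typically attached to the ".batch" step, so it can help to query that step explicitly. A minimal sketch, assuming a running job with the hypothetical id 123456:

sstat -j 123456.batch --format=JobID,MaxRSS,AveCPU,NodeList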
For more information on sstat, refer to the SLURM documentation.
squeue
While the squeue command is well known to most SLURM users, it can provide considerably more information than its default output suggests. For example, running:
squeue --format=%10i%15u%15j%5t%15M%15l%8C%30N%10D
...will produce formatted output containing the job id, username, job name, job state, elapsed time, walltime limit, the number of CPUs, the list of nodes, and the number of nodes utilized. For instance:
JOBID     USER           NAME           ST   TIME           TIME_LIMIT     CPUS    NODELIST                      NODES
76856     someuser       is_lslf        R    1:19:55        20:10:00       8       hpc-throughput-p07            1
76855     someuser       is_rsf         R    1:30:55        20:10:00       8       hpc-throughput-p06            1
76854     someuser       is_rlf         R    1:41:25        20:10:00       8       hpc-throughput-p05            1
76850     otheruser      dfly_p18       R    2:14:03        2-10:10:00     32      hpc-bigmem-p02                1
76833     newuser        mohiL-3PR      R    15:03:44       6-16:00:00     1       hpc-throughput-p01            1
76832     newuser        mohiL-4PR      R    15:04:51       6-16:00:00     1       hpc-bigmem-p01                1
The "%#" specifiers control the field width, and the letter suffixes (e.g. "%10i") reference the format field (JobID width=10). Although the format specifiers for squeue are a bit obscure, if you find a format that is particularly useful, you can define the format to use whenever you login to Matilda by setting the value of "SQUEUE_FORMAT" in your ".bashrc" file. For example:
export SQUEUE_FORMAT="%10i%15u%15j%5t%15M%15l%8C%30N%10D"
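With SQUEUE_FORMAT exported, the custom layout is applied automatically. For example, to show only your own jobs in that format (the -u/--user flag is standard squeue):

squeue -u $USER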
Refer to the SLURM squeue documentation for more information.
sacct
The sacct command is useful for reviewing the status of running or completed jobs. In its simplest form, you need only use "sacct -j <jobid>" for any running or completed/failed job. Like "squeue" and "sstat", the sacct command can be used with format modifiers/specifiers to obtain additional information (or to filter it). For example:
sacct -j 999888 --format=JobID%12,State,User,Account%30,TimeLimit,ReqTRES%45,Partition%15
       JobID      State      User                        Account  Timelimit                                       ReqTRES       Partition
------------ ---------- --------- ------------------------------ ---------- --------------------------------------------- ---------------
      999888     FAILED someuser+                 myjobName-here   20:10:00            billing=8,cpu=8,mem=772512M,node=1    general-long
999888.batch     FAILED                           myjobName-here
999888.extern COMPLETED                           myjobName-here
This shows the job's state, its time limit, the trackable resources requested (cpu, gpu, billing, etc.), and the partition. Note that for this job there is a primary job number, as well as entries with the ".batch" and ".extern" suffixes. These are "job steps" created by SLURM for every batch job (MPI jobs may have many more steps, one for each srun launch). The ".batch" step tracks the resources used inside the batch script itself, while the ".extern" step accounts for resource usage associated with the job but external to its steps.
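sacct can also summarize your job history over a time window, which supports the "evaluating job history" use case described earlier. A minimal sketch (the start date and field choices are only examples):

sacct -u $USER -S 2023-03-01 -X --format=JobID%12,JobName%20,State,Elapsed,Partition,NodeList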
There are many possible format specifiers that can be used with the sacct command. To see a list, use:
sacct --helpformat
Account              AdminComment         AllocCPUS            AllocNodes
AllocTRES            AssocID              AveCPU               AveCPUFreq
AveDiskRead          AveDiskWrite         AvePages             AveRSS
AveVMSize            BlockID              Cluster              Comment
Constraints          Container            ConsumedEnergy       ConsumedEnergyRaw
CPUTime              CPUTimeRAW           DBIndex              DerivedExitCode
Elapsed              ElapsedRaw           Eligible             End
ExitCode             Flags                GID                  Group
JobID                JobIDRaw             JobName              Layout
MaxDiskRead          MaxDiskReadNode      MaxDiskReadTask      MaxDiskWrite
MaxDiskWriteNode     MaxDiskWriteTask     MaxPages             MaxPagesNode
MaxPagesTask         MaxRSS               MaxRSSNode           MaxRSSTask
MaxVMSize            MaxVMSizeNode        MaxVMSizeTask        McsLabel
MinCPU               MinCPUNode           MinCPUTask           NCPUS
NNodes               NodeList             NTasks               Priority
Partition            QOS                  QOSRAW               Reason
ReqCPUFreq           ReqCPUFreqMin        ReqCPUFreqMax        ReqCPUFreqGov
ReqCPUS              ReqMem               ReqNodes             ReqTRES
Reservation          ReservationId        Reserved             ResvCPU
ResvCPURAW           Start                State                Submit
SubmitLine           Suspended            SystemCPU            SystemComment
Timelimit            TimelimitRaw         TotalCPU             TRESUsageInAve
TRESUsageInMax       TRESUsageInMaxNode   TRESUsageInMaxTask   TRESUsageInMin
TRESUsageInMinNode   TRESUsageInMinTask   TRESUsageInTot       TRESUsageOutAve
TRESUsageOutMax      TRESUsageOutMaxNode  TRESUsageOutMaxTask  TRESUsageOutMin
TRESUsageOutMinNode  TRESUsageOutMinTask  TRESUsageOutTot      UID
User                 UserCPU              WCKey                WCKeyID
WorkDir
For more information, refer to the SLURM documentation on sacct.
scontrol
The scontrol command is helpful for viewing detailed information about a node or a running job. To see the state of a particular node (e.g. hpc-compute-p01):
scontrol show node hpc-compute-p01
NodeName=hpc-compute-p01 Arch=x86_64 CoresPerSocket=20
   CPUAlloc=32 CPUTot=40 CPULoad=32.01
   AvailableFeatures=local
   ActiveFeatures=local
   Gres=(null)
   NodeAddr=hpc-compute-p01 NodeHostName=hpc-compute-p01 Version=21.08.8-2
   OS=Linux 4.18.0-372.19.1.el8_6.x86_64 #1 SMP Mon Jul 18 11:14:02 EDT 2022
   RealMemory=191895 AllocMem=0 FreeMem=185655 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=1 TmpDisk=0 Weight=100 Owner=N/A MCS_label=N/A
   Partitions=general-short,general-long,rusakov
   BootTime=2023-03-09T00:18:27 SlurmdStartTime=2023-03-09T00:20:33
   LastBusyTime=2023-03-20T05:06:59
   CfgTRES=cpu=40,mem=191895M,billing=40
   AllocTRES=cpu=32
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
Here, we can see how many cores the node has (CPUTot=40); how many cores are allocated (CPUAlloc=32); details about system memory; and the node state (in this case, MIXED means some, but not all, of the node's resources are allocated to jobs).
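If you only need a quick check of a node's allocation before planning a submission, the output can be filtered with standard shell tools; for example (the grep pattern is just an illustration):

scontrol show node hpc-compute-p01 | grep -E 'CPUAlloc|RealMemory|State'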
We can also see job details using:
scontrol show job 999888
JobId=999888 JobName=slurm_omp.sh
   UserId=someuser(123456) GroupId=faculty(1002) MCS_label=N/A
   Priority=1 Nice=0 Account=rusakov-research-group QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=20:37:19 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2023-03-21T14:11:51 EligibleTime=2023-03-21T14:11:51
   AccrueTime=2023-03-21T14:11:51
   StartTime=2023-03-21T14:11:51 EndTime=2023-03-23T14:11:52 Deadline=N/A
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2023-03-21T14:11:51 Scheduler=Main
   Partition=general-long AllocNode:Sid=hpc-login-p01:1446834
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=hpc-compute-p[33-35],hpc-throughput-p01
   BatchHost=hpc-compute-p33
   NumNodes=4 NumCPUs=100 NumTasks=100 CPUs/Task=1 ReqB:S:C:T=0:0:*:1
   TRES=cpu=100,node=4,billing=100
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/projects/some-research-group/AtX3/AtBr3_profile/CCSD_T_TZ_profile/LRC-wPBEh-D4_SO_profile/slurm_omp.sh
   WorkDir=/projects/some-research-group/AtX3/AtBr3_profile/CCSD_T_TZ_profile/LRC-wPBEh-D4_SO_profile
   StdErr=/projects/some-research-group/AtX3/AtBr3_profile/CCSD_T_TZ_profile/LRC-wPBEh-D4_SO_profile/slurm-76813.out
   StdIn=/dev/null
   StdOut=/projects/some-research-group/AtX3/AtBr3_profile/CCSD_T_TZ_profile/LRC-wPBEh-D4_SO_profile/slurm-76813.out
   Power=
The result contains a plethora of detail on the cores, memory, and nodes being utilized, along with other job information.
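scontrol can also retrieve the batch script that was submitted for a job SLURM still has a record of, which is handy when checking whether a job was set up the way you intended (999888 is the example job id from above, and the output filename is arbitrary):

scontrol write batch_script 999888 999888_submitted.sh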
For more information, please refer to the SLURM documentation on scontrol.
seff
The SLURM seff command is useful for assessing job efficiency. Please note that seff results for running jobs may be incorrect or misleading. However, running seff on a job id that has already completed can be very useful for assessing the job's performance. For example:
seff 999888
Job ID: 999888
Cluster: slurm
User/Group: someuser/students
State: FAILED (exit code 134)
Nodes: 1
Cores per node: 8
CPU Utilized: 00:05:15
CPU Efficiency: 12.15% of 00:43:12 core-walltime
Job Wall-clock time: 00:05:24
Memory Utilized: 398.96 MB
Memory Efficiency: 0.00% of 0.00 MB
In the example above, the job failed. You can use seff in conjunction with the "-d" (debug) flag for additional information:
seff -d 999888
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 999888  someuser students FAILED slurm 8 1 1 0 1 315 324 408540 134
Job ID: 999888
Cluster: slurm
User/Group: someuser/students
State: FAILED (exit code 134)
Nodes: 1
Cores per node: 8
CPU Utilized: 00:05:15
CPU Efficiency: 12.15% of 00:43:12 core-walltime
Job Wall-clock time: 00:05:24
Memory Utilized: 398.96 MB
Memory Efficiency: 0.00% of 0.00 MB
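If you want an efficiency summary for several recent jobs at once, seff can be combined with sacct in a short shell loop. A minimal sketch, assuming a bash shell and that the jobs have already finished:

# List your job ids from the last week and run seff on each one.
for jid in $(sacct -u $USER -S now-7days -X -n -o JobID); do
    seff "$jid"
done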
seff is a contributed script distributed with SLURM. For more information, refer to the contribs repo for the seff command.
Integrating SLURM Commands
While the SLURM monitoring commands are very useful on their own, they can be even more powerful when used in conjunction with other resources and techniques. Make sure to check out the documents on HPC Powertools and On-Node Monitoring for more information.
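As a simple example of combining these tools, the sketch below (an illustration only, assuming you have at least one running job) lists the nodes your jobs are currently using, so you can follow up with on-node monitoring on those hosts:

# Print "jobid nodelist" for each of your running jobs, without a header.
squeue -u $USER -h -t RUNNING -o "%i %N"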
More Information and References