Node Monitoring

Overview

Monitoring your job by examining the node(s) on which it is running can be a very useful tool for assessing performance and for troubleshooting. This document (while not exhaustive) outlines some of the more useful methods.

Please keep in mind that users are NOT able to monitor nodes where they do not have a job currently running. This is a security feature enforced on Matilda that is intended to minimize the potential for accidental interference with the processes of other users.

On-Node Monitoring

Your Job Number

Once a job is launched, you will be provided with a job number, which is returned immediately after submitting the job. For example:

[someuser@hpc-login-p01 ~]$ sbatch myjob.sh
sbatch: lua: Submitted: 126352.4294967294 component: 0 using: sbatch
Submitted batch job 126352

In the example above, "126352" is our job number. If you miss recording that number, there are other ways you can find it. One is to use "squeue":

[someuser@hpc-login-p01 ~]$ squeue -u someuser
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
            126352 general-l myjob.sh someuser  R       3:35      2 hpc-compute-p[07-08]

In the example above, we specify the user filter "-u" to limit the returned jobs to only those that we own. Here we can again see "JOBID 126352". We can also see the column "NODELIST", which shows that we are running on nodes hpc-compute-p07 and hpc-compute-p08. Knowing which nodes our job is using is essential for monitoring job performance.

Your Job Nodes

If we already have our job number, we can also figure out which nodes we're using by leveraging some other methods (we've already demonstrated one method using squeue above). For example:

[someuser@hpc-login-p01 ~]$ sacct -X -j 126352 -o NodeList%25
                 NodeList
-------------------------
     hpc-compute-p[07-08]

In this example we use the "sacct" command with the flags "-X" (simplified results) and "-j" (the job number), along with the output format specifier "-o NodeList%25" (show NodeList with a field width of 25).

We could also just run "squeue" by itself, which will return a list of all running and queued jobs, and then look for our job based on User ID and/or Job Name.
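For instance, one way to narrow down the full "squeue" listing is to filter it with "grep" on your username (the username, job name, and output below are just illustrative placeholders carried over from the earlier examples):

[someuser@hpc-login-p01 ~]$ squeue | grep someuser
            126352 general-l myjob.sh someuser  R       3:35      2 hpc-compute-p[07-08]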

Node Access

Once you know which node(s) your job is running on, you may access them directly using SSH. For example, if your job is running on hpc-compute-p06:

ssh hpc-compute-p06

This will put you in your home directory on the node where your job is running. If you do not have a job running on the node you attempt to log in to, your access will be denied.

Node Monitoring

Overview

The most direct way to monitor your running job is to log in to the relevant node using SSH and then use various command-line tools to examine things like load, memory usage, threading behavior, etc. It is also possible to issue a command in-line with SSH to briefly log in, run the desired command, return the result, and log out (all in one line); some SLURM commands can also be useful for assessing performance. These techniques and commands are covered in the following sections.
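As a quick sketch of both approaches (node name and job number taken from the earlier examples; for a batch job, SLURM's "sstat" typically needs the batch step, e.g. "126352.batch", and the exact fields available may vary by site):

[someuser@hpc-login-p01 ~]$ ssh hpc-compute-p07 uptime
[someuser@hpc-login-p01 ~]$ sstat -j 126352.batch --format=AveCPU,AveRSS,MaxRSS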

Useful Command Line Tools

Many commands can be accessed directly from the shell, and some others are provided as module file installations. Some commonly used commands include:

  • top - shows a list of running processes, with summary of CPU utilization, memory usage, etc.
  • free - shows memory utilization. Add a flag like "-g" to show memory usage in gigabytes.
  • htop - similar to "top" but with much more readable output. REQUIRES MODULE LOAD (module load htop)
  • iostat - displays average CPU usage as well as reads/writes for various storage devices
  • vmstat - displays memory usage statistics
  • lsof - "list open files" is useful for showing open files that are being read and/or written to
  • mpstat - statistics for all system CPUs
  • uptime - shows system loads for 1, 5, and 15 minute averages, respectively
  • ps -e - shows actively running processes
  • pstree - shows a schematic process tree

Each of the commands above may be issued as-is or with various modifying flags. Other commands like "grep" may be useful for filtering command output to isolate items of interest. The options for how these commands may be used are virtually limitless, and this document cannot hope to cover them all. You can find the options for any of these commands by using "man <command name>". A few examples that may prove helpful are presented below.
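As one illustration of combining these tools (the username is a placeholder, as in the earlier examples), you might check memory in gigabytes and then look for your own processes on a node:

[someuser@hpc-compute-p07 ~]$ free -g
[someuser@hpc-compute-p07 ~]$ ps -ef | grep someuser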

Example: uptime and Load

The uptime command is useful for determining the "load" on a compute node. Most of the Matilda compute nodes have 40 processing cores. If the processes running on a node match the capacity of one processing core, the load would be "1" (or 100%). Therefore, if the processes on a node are fully utilizing all processing cores, we would expect a maximum load of about "40". An example of the use of the "uptime" command:

[someuser@hpc-compute-p22 ~]# uptime
 14:50:51 up 119 days,  1:17,  1 user,  load average: 40.09, 40.13, 40.09

In the example above, the load averages for 1, 5, and 15 minutes are all right around "40", so this node is fully utilized but not over-utilized.
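If you are unsure how many cores a particular node has, you can check directly on the node with the standard Linux utilities "nproc" or "lscpu" (the count shown is just an example):

[someuser@hpc-compute-p22 ~]$ nproc
40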

Explanation

UTS will sometimes notify users about a problem of node "over-utilization". Over-utilization can take two forms:

  • The total load on a node exceeds "40" - as the total load grows higher, performance is degraded for all users.
  • The total load of a user's job exceeds the number of cores requested. For example, you request 10 cores but the load is higher, say "15". This degrades your job's performance and the performance of other users' jobs that may be running on the node.

When the load of a node grows higher than the number of available cores, processes are placed in a run queue where they must wait for time on the CPU. The operating system will swap processes in and out to distribute run time to each process. When a process is in the run queue waiting for CPU time, no useful work is being performed on that process. Thus, the higher the load climbs over the maximum, the more processing is slowed down. If one user is utilizing more cores than they requested, and a second user is using an appropriate number of cores, the processing time of both jobs may be impacted.

For example, imagine a case where one user has requested 20 cores, and a second user has also requested 20 cores. SLURM can grant this request to run on a single node, since the available resource is "40". However, if the first user actually utilizes 30 cores (load 30) and the second user utilizes no more than the requested 20 cores (load 20), the total load is "50", which is too high for the node to accommodate efficiently.
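One simple way to see how much CPU your own processes are consuming on a node is to list them sorted by CPU usage (the column selection here is just one possibility; "nlwp" shows the thread count per process, and "someuser" is a placeholder):

[someuser@hpc-compute-p22 ~]$ ps -u someuser -o pid,pcpu,nlwp,comm --sort=-pcpu | head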

The uptime command is thus a simple and invaluable tool for monitoring your job's performance.

Example: top

The "top" and "htop" commands are particularly useful when trying to determine how many processors your jobs are using. For instance, suppose you see multiple jobs running on node (including yours) and your check of the load indicates that node is being over-utilized. How could you tell whether it was your process or someone else's? The "top" command helps breakdown CPU utilization by process. For example, look at the output from the "top" command in the image below:

[someuser@hpc-compute-p22 ~]# top

attachment:top_output.png

A similar but clearer view using "htop":

[someuser@hpc-compute-p22 ~]# module load htop
[someuser@hpc-compute-p22 ~]# htop

attachment:htop_output.png
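Both tools can also be restricted to a single user's processes, which makes it easier to pick your own job out of a busy node (the "-u" option is supported by both top and htop; "someuser" is again a placeholder):

[someuser@hpc-compute-p22 ~]$ top -u someuser
[someuser@hpc-compute-p22 ~]$ htop -u someuser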


CategoryHPC