Matilda HPC Powertools

Overview

Matilda HPC "Powertools" have been developed to assist users in accessing useful system information and functionality, but abstracting away the complexities associated with the SLURM into user-friendly scripts. While Powertools is not intended as a full replacement for SLURM, it is hoped it makes common functions more accessible for most user's requirements.

Several tools are also provided to assist users with the monitoring of their jobs, including resource utilization and performance.

This document contains information on the Powertools modulefile and its various features. Watch this space, as it will be updated as more tools are developed and/or revised.

[Comments and suggestions are always welcome.]

Accessing Powertools

You may access Powertools as you would any other modulefile:

module load powertools

To view a summary of Powertools, simply enter:

powertools

This will produce man-page-like output. You can also enter:

man powertools

Powertools Description

acctBalance

The acctBalance script permits the user to check the usage and allocation balances for the current year for any group of which they are a member. If no time period is specified, year-to-date (YTD) is assumed. You can specify "all" if you want to see all groups. In addition, you may specify a time period, or request verbose output (a breakdown of all the users in your group). Please note that a YTD account summary is always provided. By default, the number of "billing" hours is displayed, where billing-hours = cpu-hours + (10 * gpu-hours). You may also specify just cpu-hours or gpu-hours if desired.
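
For example, a group that has consumed 500 cpu-hours and 20 gpu-hours over the reporting period would show 500 + (10 * 20) = 700 billing-hours.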

To see the available options, use:

acctBalance -h

Usage: /cm/shared/apps/powertools/1.0/bin/acctBalance -g groupname [-s YYYY-MM-DD] [-e YYYY-MM-DD | now] [-t c|g|b] [-l] [-h]

        where:
                -g: specific project group name, or 'all'
                -s: start time format in YYYY-MM-DD (optional)
                -t: report type (optional)
                        where:
                          c = cpu-hours
                          g = gpu-hours
                          b = billing-hours (cpu-hours + (gpu-hours * 10))
                          DEFAULT=billing-hours
                -e: end time format in YYYY-MM-DD or now (optional)
                -l: list all groups of which user is a member (optional)
                -v: provide details on group users
                -h: display usage

        NOTE: time reporting period defaults to YTD
        NOTE: selecting '-l' provides a group list only

Examples

Display billing-hours for all member groups from March 1, 2023 through to the present:

acctBalance -g all -s 2023-03-01 -e now

Start Time: 2023-03-01  End Time:  now
Group: contrived  Type:  billing  Detail:  Summary

Requested Accounting Report for CONTRIVED

        CONTRIVED group BILLING time for requested period:  434 hours
...

Group: hpc-workshop  Type:  billing  Detail:  Summary

Requested Accounting Report for HPC-WORKSHOP

        HPC-WORKSHOP group BILLING time for requested period:  27838 hours
...

Display gpu-hours YTD with group user details:

acctBalance -g all -t g -v

Start Time: 2023-01-01  End Time:  2023-12-31
Group: contrived  Type:  gres/gpu  Detail:  Breakdown

Requested Accounting Report for CONTRIVED

        CONTRIVED group GRES/GPU time for requested period:  13 hours

                User Breakdown

                hpctester01                  13 hours
                hpctester02                   0 hours

List all groups of which the user is a member:

acctBalance -l

User: hpctester02 is a member of the following ACCOUNT groups:

  - contrived
  - hpc-workshop

acctBalanceAll

This is a simplified version of "acctBalance" that simply lists a YTD summary of billing-hours for all member groups (no user breakdown).

Examples

acctBalanceAll

Group: contrived  Type:    Detail:

Requested Accounting Report for CONTRIVED

        CONTRIVED group Billing time for requested period 2025-01-01 to now:  55 hours

Account Summary for CONTRIVED 2025

        CONTRIVED group billing alottment for 2025: 1500000 hours
        CONTRIVED group billing time used in 2025:  55 hours
        CONTRIVED group billing time remaining in 2025:  1499945 hours

Group: hpc-workshop  Type:    Detail:

Requested Accounting Report for HPC-WORKSHOP

        HPC-WORKSHOP group Billing time for requested period 2025-01-01 to now:  0 hours

Account Summary for HPC-WORKSHOP 2025

        HPC-WORKSHOP group billing alottment for 2025: 1500000 hours
        HPC-WORKSHOP group billing time used in 2025:  0 hours
        HPC-WORKSHOP group billing time remaining in 2025:  1500000 hours

** Report Complete!

acctWarn

Using a default threshold of 85% of allocated hours, "acctWarn" lets the user check whether they are approaching the end of their current-year allotment. The threshold is adjustable via the "-t <threshold>" option. If you are a member of more than one accounting group, you may also use the "-g <group name>" option to specify one particular account. The "acctWarn" script can be run manually, or a user could create their own cron job to perform an automated daily check and send an email (a sketch follows the examples below).

Examples

acctWarn
acctWarn -t 50  (use a threshold of 50%)
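
A minimal crontab sketch of the automated daily check mentioned above (illustrative only; the module initialization path and email address are assumptions to adapt for your environment):

# m h dom mon dow: run acctWarn every morning at 7:00 and mail the output
0 7 * * * . /etc/profile.d/modules.sh; module load powertools; acctWarn 2>&1 | mail -s "Matilda allocation check" your_email@example.edu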

checkGPUS

This powertool provides a quick report of GPU availability on the Matilda GPU nodes.

Examples

checkGPUS

HOSTNAMES           CPUS(A/I/O/T)       GRES_USED
hpc-gpu-p01         40/8/0/48           gpu:0
hpc-gpu-p02         40/8/0/48           gpu:0
hpc-gpu-p03         40/8/0/48           gpu:0
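
In this output, CPUS(A/I/O/T) follows the standard SLURM allocated/idle/other/total convention, and GRES_USED shows how many GPUs are currently in use on each node (gpu:0 above means all GPUs on that node are free).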

hostLoad

"hostLoad" is a tool designed to assist with user job monitoring. Running "hostLoad" with no arguments will provide the user with a list of all their running jobs, and the loads they are placing on the associated runtime nodes. "hostLoad" can also be run with the option "-j <jobid>" (for cases where the user has more than one running job) to report loads for one specific job.

Examples

hostLoad

Node Name                     Load-15m    Capacity%  Notes
hpc-largemem-p01                  1.06        2.65%  OK
hpc-throughput-p01                2.01        5.03%  OK

hostLoad -j 123456

Node Name                     Load-15m    Capacity%  Notes
hpc-largemem-p01                  1.06        2.65%  OK

If the node load is over 103%, a warning will appear under Notes: "Overutilized". Please note that this reflects the total load on the node, which may include just the user's job (if it is the only one running) or the sum of all jobs on the node from all users.
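
Capacity% expresses the 15-minute load average as a percentage of the node's core count; in the sample output above, a Load-15m of 1.06 on a (presumed) 40-core node yields 1.06 / 40 = 2.65%.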

jobHist

The jobHist script displays any of the user's jobs that have completed (this includes successfully completed, failed, cancelled, etc.). If no arguments are provided, all jobs from the start of the calendar year to the present are displayed. Alternatively, the user may specify a time period, or a number of days, weeks, or months prior to the current date. To see input options, please use the "-h" flag (jobHist -h).

Examples

Display all non-running user jobs that have terminated in the last week:

jobHist -p 1w

Start Time: 2025-01-17  End Time:  now

Running Job History Report for hpctester02 Fri Jan 24 11:16:45 EST 2025

                 Start                    End      JobID         JobName      Elapsed NNode  NCPUS         CPUTime      State                                 NodeList
---------------------- ---------------------- ---------- --------------- ------------ ----- ------ --------------- ---------- ----------------------------------------
   2025-01-24T09:34:15    2025-01-24T10:26:28     149826       primes.sh     00:52:13     1      1        00:52:13  COMPLETED                         hpc-largemem-p01
   2025-01-24T09:58:06    2025-01-24T10:57:12     149828       primes.sh     00:59:06     2      2        01:58:12  COMPLETED                  hpc-throughput-p[01-02]

** Report Complete!

Display all of the user's non-running jobs over specified time period:

jobHist -s 2025-01-01 -e 2025-01-10

Start Time: 2025-01-01  End Time:  2025-01-10

Running Job History Report for hpctester02 Fri Jan 24 11:18:36 EST 2025

                 Start                    End      JobID         JobName      Elapsed NNode  NCPUS         CPUTime      State                                 NodeList
---------------------- ---------------------- ---------- --------------- ------------ ----- ------ --------------- ---------- ----------------------------------------
   2025-01-08T15:01:44    2025-01-08T15:54:06     148716       primes.sh     00:52:22     2      2        01:44:44  COMPLETED                  hpc-throughput-p[05-06]
   2025-01-08T16:24:39    2025-01-08T17:15:33     148732       primes.sh     00:50:54     2      2        01:41:48  COMPLETED                  hpc-throughput-p[06-07]
   2025-01-09T07:59:27    2025-01-09T08:52:47     148788       primes.sh     00:53:20     2      2        01:46:40  COMPLETED                  hpc-throughput-p[01-02]
   2025-01-09T09:28:11    2025-01-09T10:18:56     148789       primes.sh     00:50:45     2      2        01:41:30  COMPLETED                  hpc-throughput-p[01-02]
   2025-01-09T11:23:22    2025-01-09T12:16:04     148793       primes.sh     00:52:42     2      2        01:45:24  COMPLETED                  hpc-throughput-p[01-02]
   2025-01-09T12:18:05    2025-01-09T13:08:53     148794       primes.sh     00:50:48     2      2        01:41:36  COMPLETED                  hpc-throughput-p[01-02]
   2025-01-09T12:22:15    2025-01-09T13:16:35     148795       primes.sh     00:54:20     1      1        00:54:20  COMPLETED                       hpc-throughput-p05

** Report Complete!

jobLoads

The "jobLoads" tool is designed for user monitoring to assist in assessing details about job utilization on each node assigned to all of the user's running jobs. It is a more detailed (and slower running) variation of the "hostLoad" tool, which is primarily designed to check for aggregate node over-utilization. The jobLoads tool is particularly useful for determining if a user's job is responsible for node over-utilization (as identified by "hostLoad"), or whether the user's job is actually grossly under-utilizing an assigned node.

The tool also flags cases where a node is assigned to a user job but the job is not actually using the node (an unused resource). These are loosely termed "bad jobs" because the assigned resources sit completely unutilized, meaning they cannot be used by other users.

Examples

[hpctester02@hpc-login-p01 ~]$ jobLoads
hpctester02, 149826, hpc-largemem-p01, .99, 1
OK

In the example above, user job "149826" is using 0.99 of the 1 core assigned to it. The "OK" signifies that the job is using its assigned resources (but does not assess under- or over-utilization).

Let's look at a case where the user has requested (and has been assigned) 2 nodes, but is actually only running processes on one of those nodes:

[hpctester02@hpc-login-p01 ~]$ jobLoads
hpctester02, 149826, hpc-largemem-p01, .99, 1
hpctester02, 149828, hpc-throughput-p[01-02], .98, 2

The results above for Job 149828 should give the user pause: only ~1 core (.98) of the 2 assigned cores is in use, and more importantly, the user is assigned 2 nodes (1 core per node). This suggests the user is only using 1 core on 1 node. In this case, it is strongly recommended that the user check the assigned nodes (see: Node Monitoring) to verify whether the processes they expect to be running are in fact running.
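
As a quick sketch of such a check (assuming ssh access to your assigned compute nodes is permitted), you can list your own processes on a suspect node:

ssh hpc-throughput-p02 "ps -u $USER -o pid,pcpu,pmem,etime,cmd"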

jobStart

The "jobStart" command will provide the user with an estimated start time for their job, plus a list of the nodes that are scheduled to be used by that job.

[Note: These values may change depending on new job priorities, administrative job extensions, and unplanned node events. In some cases, it may take some time for nodes to be scheduled for recently queued jobs.]

Examples

jobStart -j <jobid>

Estimated start time and scheduled nodes for JobID: 123456

Start:  25 Jan 2025 17:12:38
Nodes:  hpc-throughput-p01

jobStats

The jobStats script displays information on the user's running jobs. By default, jobStats displays summarized information (no job steps). If you desire job-step breakdowns, use the "-v" flag.

Examples

Display running job statistics, including start and elapsed time, number of nodes and cores, maximum memory used (MaxRSS), and a list of nodes the job is running on:

jobStats

Running Job Report for hpctester02 Fri Jan 24 11:20:32 EST 2025

              Start           JobID    JobName    Elapsed NNode  NCPUS    CPUTime     ReqMem        State               NodeList             AllocTRES     MaxRSS
------------------- --------------- ---------- ---------- ----- ------ ---------- ---------- ------------ ---------------------- ---------------------- ----------
2025-01-24T11:20:26          149829  primes.sh   00:00:06     1      1   00:00:06    772508M      RUNNING       hpc-largemem-p01  billing=1,cpu=1,node=1   12912K

** Report Complete!
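
In the report above, MaxRSS is the peak resident memory observed for the job (12912K, roughly 12.6 MiB); comparing MaxRSS against ReqMem is a quick way to spot jobs that request far more memory than they actually use.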

md5direct

The "md5direct" tool computes an md5 checksum for an entire directory. This can be very useful when transferring subdirectories with multiple files to verify that the contents of the source and destination directories are the same, or to determine if two subdirectories are duplicates of one another.

Examples

md5direct myDirectory

74186da3e014e081cd34137942ac47ed  -
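
To verify a transfer, run md5direct on both the source and the destination and compare the results (the destination path below is a placeholder):

md5direct myDirectory
md5direct /path/to/destinationCopy

If both checksums match, the directory contents are identical.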

powertools

The "powertools" command lists all of the powertools available.

Examples

powertools

NAME
       powertools
DESCRIPTION
       User and Admin utility scripts and tools for the Matilda HPC cluster
LAST MODIFIED
       01/24/2025
POWERTOOLS
              1.     powertools - list this help file
              2.     acctBalance - group-based accounting of billing, cpu, and/or gpu hours used over specified period
             .....

quotaCheck

quotaCheck accepts no input arguments. It simply returns the file usage statistics for the user's home directory, and any project directories for groups of which they are a member. Please note that for shared directories (e.g. those under /projects), the value returned only covers files whose group ownership corresponds to the project group. If there are files present that do not have a group ownership equivalent to the project group (e.g. a file with a group ownership of 'students' or 'faculty'), those files will not be counted in the total.

Examples

quotaCheck

Home Directory Usage:  14.01GiB  50.00GiB  28.02%

Project Directory Usage:   [Note: accuracy depends on group ownership of files]
   /projects/contrived: 62.57GiB  1024.00GiB  6.11% 
   /projects/hpc-workshop: 0GiB  1024.00GiB  0% 
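
If project usage looks lower than expected, files with a non-project group ownership are the likely cause. As a sketch using standard tools (with the group name from the example above), you can locate such files with:

find /projects/contrived ! -group contrived -ls

Group ownership can then be corrected with chgrp.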

scratchQuota

Similar to quotaCheck, the scratchQuota script takes no input arguments and simply reports file usage in the user's scratch space and in their affiliated /scratch/projects space. Once again, please note that files in the /scratch/projects space that have a group ownership other than the project group will not be counted as part of the total.

Examples

scratchQuota

Total User Scratch Usage (user and projects):  0GB  10240.00GB  0%

Project Scratch Directory Usage (group):   [Note: accuracy depends on group ownership of files]
   /scratch/projects/contrived: 0GB  10240.00GB  0% 
   /scratch/projects/hpc-workshop: 0GB  10240.00GB  0%

scratchScript

The scratchScript script examines all files by access time in the user's /scratch/users and /scratch/projects directory spaces, and lists those files that are slated for deletion. If no input arguments are provided, the list will contain only those files slated to be deleted tomorrow morning. If the "-d D" option is used (where 'D' is an integer number of days, e.g. 5), then files with access times of 45 - D days or older will be listed. This provides the desired amount of warning to the user about which files will be deleted in the coming "D" days.

Examples

To see a list and the number of files that will be deleted in 5 days or less:

scratchScript -d 5

/scratch/users/hpctester02
No files found >=40 days in /scratch/users/hpctester02

The "-d" value is applied to both the /scratch/users and /scratch/projects directories.
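
For a rough manual equivalent using standard tools (a sketch assuming the 45-day purge window described above):

find /scratch/users/$USER -type f -atime +40

Here "-atime +40" matches files last accessed more than 40 days ago, corresponding to "scratchScript -d 5".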


CategoryHPC