Matilda HPC Powertools
Overview
Matilda HPC "Powertools" have been developed to assist users in accessing useful system information and functionality, but abstracting away the complexities associated with the SLURM into user-friendly scripts. While Powertools is not intended as a full replacement for SLURM, it is hoped it makes common functions more accessible for most user's requirements.
Several tools are also provided to assist users with the monitoring of their jobs, including resource utilization and performance.
This document contains information on the Powertools modulefile and its various features. Watch this space, as it will be updated as more tools are developed and/or revised.
[Comments and suggestions are always welcome.]
Accessing Powertools
You may access Powertools as you would any other modulefile:
module load powertools
To view a summary of powertools, simply enter:
powertools
This will produce man-page-style output. You can also enter:
man powertools
Powertools Description
acctBalance
The acctBalance script permits the user to check the usage and allocation balances for the current year for any group of which they are a member. If no time period is specified, YTD is assumed. You can specify "all" if you want to see all of your groups. In addition, you may specify a time period, or request verbose output (a breakdown of all the users in your group). Please note a YTD account summary is always provided. By default the number of "billing" hours is displayed, where: billing-hours = cpu-hours + (10 * gpu-hours). You may also specify just cpu-hours or gpu-hours if desired.
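For example, a job that consumes 100 cpu-hours and 4 gpu-hours is charged 100 + (10 * 4) = 140 billing-hours against the group's allocation.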
To see the available options, use:
acctBalance -h
Usage: /cm/shared/apps/powertools/1.0/bin/acctBalance -g groupname [-s YYYY-MM-DD] [-e YYYY-MM-DD | now] [-t c|g|b] [-l] [-h]
where:
  -g: specific project group name, or 'all'
  -s: start time format in YYYY-MM-DD (optional)
  -t: report type (optional) where:
      c = cpu-hours
      g = gpu-hours
      b = billing-hours (cpu-hours + (gpu-hours * 10)) DEFAULT=billing-hours
  -e: end time format in YYYY-MM-DD or now (optional)
  -l: list all groups of which user is a member (optional)
  -v: provide details on group users
  -h: display usage
NOTE: time reporting period defaults to YTD
NOTE: selecting '-l' provides a group list only
Examples
Display billing-hours for all member groups from March 1, 2023 through to the present:
acctBalance -g all -s 2023-03-01 -e now
Start Time: 2023-03-01
End Time: now
Group: contrived
Type: billing
Detail: Summary
Requested Accounting Report for CONTRIVED
CONTRIVED group BILLING time for requested period: 434 hours
...
Group: hpc-workshop
Type: billing
Detail: Summary
Requested Accounting Report for HPC-WORKSHOP
HPC-WORKSHOP group BILLING time for requested period: 27838 hours
...
Display gpu-hours YTD with group user details:
acctBalance -g all -t g -v
Start Time: 2023-01-01
End Time: 2023-12-31
Group: contrived
Type: gres/gpu
Detail: Breakdown
Requested Accounting Report for CONTRIVED
CONTRIVED group GRES/GPU time for requested period: 13 hours
User Breakdown
hpctester01    13 hours
hpctester02     0 hours
List all groups of which the user is a member:
acctBalance -l
User: hpctester02 is a member of the following ACCOUNT groups:
- contrived
- hpc-workshop
acctBalanceAll
This is a simplified version of "acctBalance" which simply lists a YTD summary of billing-hours for all member groups (no user breakdown).
Examples
acctBalanceAll
Group: contrived
Type:
Detail:
Requested Accounting Report for CONTRIVED
CONTRIVED group Billing time for requested period 2025-01-01 to now: 55 hours
Account Summary for CONTRIVED 2025
CONTRIVED group billing alottment for 2025: 1500000 hours
CONTRIVED group billing time used in 2025: 55 hours
CONTRIVED group billing time remaining in 2025: 1499945 hours
Group: hpc-workshop
Type:
Detail:
Requested Accounting Report for HPC-WORKSHOP
HPC-WORKSHOP group Billing time for requested period 2025-01-01 to now: 0 hours
Account Summary for HPC-WORKSHOP 2025
HPC-WORKSHOP group billing alottment for 2025: 1500000 hours
HPC-WORKSHOP group billing time used in 2025: 0 hours
HPC-WORKSHOP group billing time remaining in 2025: 1500000 hours
** Report Complete!
acctWarn
Using a default threshold of 85% of allocated hours, "acctWarn" lets the user check whether they are approaching the end of their current year allotment. The threshold can be adjusted with the "-t <threshold>" option. If you are a member of more than one accounting group, you may also use the option "-g <group name>" to specify one particular account. The "acctWarn" script can be run manually, or a user could choose to create their own cron job to perform an automated daily check and send an email (see the crontab sketch under Examples below).
Examples
acctWarn
acctWarn -t 50    (use a threshold of 50%)
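For an automated daily check, a crontab entry (crontab -e) along the following lines could be used. This is only a sketch: the 7:00 AM schedule and the email address are illustrative, and the path assumes acctWarn lives in the same bin directory as acctBalance shown earlier.

# cron mails any command output to MAILTO, so a warning from acctWarn arrives by email;
# adjust the address, threshold, schedule, and path as needed for your setup.
MAILTO=user@example.edu
0 7 * * * /cm/shared/apps/powertools/1.0/bin/acctWarn -t 85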
checkGPUS
This powertool provides a quick report of GPU availability on the Matilda GPU nodes. In the output, CPUS(A/I/O/T) lists the allocated/idle/other/total CPU counts on each node, and GRES_USED shows how many GPUs are currently in use.
Examples
checkGPUS
HOSTNAMES      CPUS(A/I/O/T)   GRES_USED
hpc-gpu-p01    40/8/0/48       gpu:0
hpc-gpu-p02    40/8/0/48       gpu:0
hpc-gpu-p03    40/8/0/48       gpu:0
hostLoad
"hostLoad" is a tool designed to assist with user job monitoring. Running "hostLoad" with no arguments will provide the user with a list of all their running jobs, and the loads they are placing on the associated runtime nodes. "hostLoad" can also be run with the option "-j <jobid>" (for cases where the user has more than one running job) to report loads for one specific job.
Examples
hostLoad
Node Name             Load-15m   Capacity%   Notes
hpc-largemem-p01      1.06       2.65%       OK
hpc-throughput-p01    2.01       5.03%       OK

hostLoad -j 123456
Node Name             Load-15m   Capacity%   Notes
hpc-largemem-p01      1.06       2.65%       OK
If the node load exceeds 103%, a warning will appear under Notes: "Overutilized". Please note that this reflects the total load on the node, which may include just the user's job (if it is the only one running) or the sum of all jobs on the node from all users.
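As a rough illustration (an assumption consistent with the sample output above, not an official formula), Capacity% appears to be the 15-minute load divided by the node's total core count:

# Hypothetical check: a 15-minute load of 1.06 on a 40-core node
echo "scale=2; 1.06 * 100 / 40" | bc    # => 2.65, matching the Capacity% column above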
jobHist
The jobHist script displays any of the user's jobs that have completed (this includes jobs that completed successfully, failed, were cancelled, etc.). If no arguments are provided, all jobs from the start of the calendar year to the present are displayed. Alternately, the user may specify a time period, or a number of days, weeks, or months prior to the current date. To see input options, please use the "-h" flag (jobHist -h).
Examples
Display all non-running user jobs that have terminated in the last week:
jobHist -p 1w
Start Time: 2025-01-17
End Time: now
Running Job History Report for hpctester02 Fri Jan 24 11:16:45 EST 2025

Start                End                  JobID    JobName     Elapsed    NNode  NCPUS  CPUTime    State      NodeList
-------------------  -------------------  -------  ----------  ---------  -----  -----  ---------  ---------  -----------------------
2025-01-24T09:34:15  2025-01-24T10:26:28  149826   primes.sh   00:52:13   1      1      00:52:13   COMPLETED  hpc-largemem-p01
2025-01-24T09:58:06  2025-01-24T10:57:12  149828   primes.sh   00:59:06   2      2      01:58:12   COMPLETED  hpc-throughput-p[01-02]
** Report Complete!
Display all of the user's non-running jobs over specified time period:
jobHist -s 2025-01-01 -e 2025-01-10
Start Time: 2025-01-01
End Time: 2025-01-10
Running Job History Report for hpctester02 Fri Jan 24 11:18:36 EST 2025

Start                End                  JobID    JobName     Elapsed    NNode  NCPUS  CPUTime    State      NodeList
-------------------  -------------------  -------  ----------  ---------  -----  -----  ---------  ---------  -----------------------
2025-01-08T15:01:44  2025-01-08T15:54:06  148716   primes.sh   00:52:22   2      2      01:44:44   COMPLETED  hpc-throughput-p[05-06]
2025-01-08T16:24:39  2025-01-08T17:15:33  148732   primes.sh   00:50:54   2      2      01:41:48   COMPLETED  hpc-throughput-p[06-07]
2025-01-09T07:59:27  2025-01-09T08:52:47  148788   primes.sh   00:53:20   2      2      01:46:40   COMPLETED  hpc-throughput-p[01-02]
2025-01-09T09:28:11  2025-01-09T10:18:56  148789   primes.sh   00:50:45   2      2      01:41:30   COMPLETED  hpc-throughput-p[01-02]
2025-01-09T11:23:22  2025-01-09T12:16:04  148793   primes.sh   00:52:42   2      2      01:45:24   COMPLETED  hpc-throughput-p[01-02]
2025-01-09T12:18:05  2025-01-09T13:08:53  148794   primes.sh   00:50:48   2      2      01:41:36   COMPLETED  hpc-throughput-p[01-02]
2025-01-09T12:22:15  2025-01-09T13:16:35  148795   primes.sh   00:54:20   1      1      00:54:20   COMPLETED  hpc-throughput-p05
** Report Complete!
jobLoads
The "jobLoads" tool is designed for user monitoring to assist in assessing details about job utilization on each node assigned to all of the user's running jobs. It is a more detailed (and slower running) variation of the "hostLoad" tool, which is primarily designed to check for aggregate node over-utilization. The jobLoads tool is particularly useful for determining if a user's job is responsible for node over-utilization (as identified by "hostLoad"), or whether the user's job is actually grossly under-utilizing an assigned node.
The tool also identifies cases where a node is assigned to a user job but the job is not actually using the node (an unused resource). These are loosely termed "bad jobs" because the assigned resources sit completely unused and cannot be used by other users.
Examples
[hpctester02@hpc-login-p01 ~]$ jobLoads
hpctester02, 149826, hpc-largemem-p01, .99, 1  OK
In the example above, the user job "149826" is using 0.99 of the 1 core assigned to the job. The "OK" signifies that the job is using its assigned resources (but does not assess under- or over-utilization).
Let's look at a case where the user has requested (and has been assigned) 2 nodes, but is actually only running processes on one of those nodes:
[hpctester02@hpc-login-p01 ~]$ jobLoads
hpctester02, 149826, hpc-largemem-p01, .99, 1  OK
hpctester02, 149828, hpc-throughput-p[01-02], .98, 2
The results above for Job 149828 should give the user pause: only ~1 core (.98) of the 2 assigned cores is being used, and more importantly, the user is assigned 2 nodes (1 core per node). This suggests the user is only using 1 core on 1 node. In this case, it is strongly recommended that the user check the assigned nodes (see: Node Monitoring, and the sketch below) to verify whether the processes they expect to be running are in fact running.
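A quick manual check might look like the following, assuming (as is typical on SLURM clusters) that you may SSH into nodes running your own jobs; the job ID and node name are taken from the example above:

# Confirm which nodes are assigned to the job (squeue is standard SLURM).
squeue -j 149828 -o "%N"

# Log into one of the assigned nodes and look for your own processes.
ssh hpc-throughput-p02
ps -u $USER -o pid,pcpu,pmem,etime,comm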
jobStart
The "jobStart" command will provide the user with an estimated start time for their job, plus a list of the nodes that are scheduled to be used by that job.
[Note: These values may change depending on new job priority, administrative job extensions, and unplanned node events. In some cases, it may take some time for nodes to be scheduled for recently queued jobs.]
Examples
jobStart -j <jobid>
Estimated start time and scheduled nodes for JobID: 123456
Start: 25 Jan 2025 17:12:38
Nodes: hpc-throughput-p01
jobStats
The jobStats script displays information on the user's running jobs. By default, jobStats displays summarized information (no job steps). If you want job-step breakdowns, use the "-v" flag.
Examples
Display running job statistics, including start and elapsed time, number of nodes and cores, maximum memory used (MaxRSS), and a list of nodes the job is running on:
jobStats
Running Job Report for hpctester02 Fri Jan 24 11:20:32 EST 2025

Start                JobID    JobName    Elapsed   NNode  NCPUS  CPUTime   ReqMem   State    NodeList          AllocTRES               MaxRSS
-------------------  -------  ---------  --------  -----  -----  --------  -------  -------  ----------------  ----------------------  -------
2025-01-24T11:20:26  149829   primes.sh  00:00:06  1      1      00:00:06  772508M  RUNNING  hpc-largemem-p01  billing=1,cpu=1,node=1  12912K
** Report Complete!
md5direct
The "md5direct" tool computes an md5 checksum for an entire directory. This can be very useful when transferring subdirectories with multiple files to verify that the contents of the source and destination directories are the same, or to determine if two subdirectories are duplicates of one another.
Examples
md5direct myDirectory
74186da3e014e081cd34137942ac47ed  -
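If md5direct is not available on the other end of a transfer, a comparable directory checksum can be computed with standard tools. This is a generic approach, not necessarily how md5direct itself works, so only compare results produced by the same method on both sides:

# Checksum every file, sort by path for a deterministic order,
# then checksum the combined listing; run identically on source and destination.
cd myDirectory && find . -type f -exec md5sum {} + | sort -k 2 | md5sum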
powertools
The "powertools" command lists all of the powertools available.
Examples
powertools
NAME
    powertools
DESCRIPTION
    User and Admin utility scripts and tools for the Matilda HPC cluster
LAST MODIFIED
    01/24/2025
POWERTOOLS
    1. powertools - list this help file
    2. acctBalance - group-based accounting of billing, cpu, and/or gpu hours used over specified period
    .....
quotaCheck
quotaCheck accepts no input arguments. It simply returns the file usage statistics for the user's home directory and for any project directories for groups of which they are a member. Please note that for shared directories (e.g. those under /projects), the value returned covers only files whose group ownership matches the project group. Files that do not have a group ownership matching the project group (e.g. a file with a group ownership of 'students' or 'faculty') will not be counted in the total (see the example following the output below).
Examples
quotaCheck
Home Directory Usage:
14.01GiB    50.00GiB    28.02%
Project Directory Usage:
[Note: accuracy depends on group ownership of files]
/projects/contrived:      62.57GiB    1024.00GiB    6.11%
/projects/hpc-workshop:   0GiB        1024.00GiB    0%
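To locate files in a project directory whose group ownership does not match the project group (and which are therefore excluded from the totals above), a standard find command can be used; the path and group name below come from the example and should be replaced with your own:

# List files under /projects/contrived that are NOT group-owned by 'contrived'.
find /projects/contrived ! -group contrived -ls

# If appropriate, reset the group ownership of your own files:
# chgrp -R contrived /projects/contrived/<your_subdirectory>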
scratchQuota
Similar to quotaCheck, the scratchQuota script takes no input arguments and simply reports file usage in the user's scratch space and in their affiliated /scratch/projects space. Once again, please note that files in the /scratch/projects space with group ownerships other than the project group will not be counted as part of the total.
Examples
scratchQuota
Total User Scratch Usage (user and projects):
0GB    10240.00GB    0%
Project Scratch Directory Usage (group):
[Note: accuracy depends on group ownership of files]
/scratch/projects/contrived:     0GB    10240.00GB    0%
/scratch/projects/hpc-workshop:  0GB    10240.00GB    0%
scratchScript
The scratchScript script examines all files by access time in the user's /scratch/users and /scratch/projects directory spaces, and lists those files that are slated to be deleted. If no input arguments are provided, the list contains only those files slated to be deleted tomorrow morning. If the "-d D" option is used (where 'D' is an integer number of days, e.g. 5), then files with access times of 45 - D days or older are listed (e.g. -d 5 lists files not accessed in 40 or more days). This gives the user D days' warning of which files will be deleted.
Examples
To see a list and the number of files that will be deleted in 5 days or less:
scratchScript -d 5
/scratch/users/hpctester02
No files found >=40 days in /scratch/users/hpctester02
The value of D applies to both the /scratch/users and /scratch/projects directories. For an independent check of file access times, see the example below.
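A standard find command by access time gives a comparable view of which files are at risk; the 40-day cutoff corresponds to the -d 5 example above and is separate from scratchScript's own implementation:

# List files under your scratch space last accessed 40 or more days ago
# (-atime +39 matches files older than 39 whole 24-hour periods).
find /scratch/users/$USER -type f -atime +39 -ls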
CategoryHPC