Matilda HPC Rate FAQ

Last Revised: August 12, 2024

Overview

This article presents answers to some commonly asked questions regarding the Matilda HPC base allocation and additional computation time purchases. For details on base allocation, rates, and buy-in node options, please refer to the KB article on Matilda HPC Base Allocation and Rates.

Base Allocation

What is my base allocation?

All OU-affiliated researchers and their lab members receive an individual allocation of 50 GB of home directory storage and 10 TB of scratch (/scratch/users) storage on the Matilda cluster. This allocation gives OU-affiliated researchers access to the Matilda cluster and allows them to submit jobs as part of a PI project/group.

PIs are also provided with shared project (and project scratch) space for research projects or group projects. These allocations are assigned to the PI and can be used by members of their group. Base allocations are as follows:

  • CPU hours1: 1,000,000 per year, refreshed annually

  • GPU hours2: 50,000 per year, refreshed annually

  • Shared project/group storage: 1 TB (/projects directory)

  • Shared project/group scratch storage3: 10 TB (/scratch/projects)

CPU and GPU hours are convertible, so researchers can use their allocation in whatever way makes the most sense for their specific needs. This is accomplished by assigning 1,500,000 "General Compute" hours, which are calculated using billing weights for CPU and GPU hours as shown below:

  • General Compute hours: 1,500,000

  • CPU hours: cpu-hours x 1.0 general compute hours

  • GPU hours: gpu-hours x 10.0 general compute hours

Annually allocated General Compute hours are calculated as follows:

  • (1,000,000 cpu-hours x 1.0) + (50,000 gpu-hours x 10.0) = 1,500,000

For example, if all of a PI's workloads on Matilda used CPU hours only, the researcher would have an effective allocation of 1,500,000 cpu-hours (100% of the General Compute allocation). However, if a PI were to use 100,000 gpu-hours, this would translate into 1,000,000 General Compute hours (100,000 x 10.0), leaving a balance of 500,000 General Compute hours which could be used for cpu- or gpu-hours [1,500,000 general compute hours - (100,000 gpu-hours x 10.0) = 500,000 general compute hours].

How are "Compute Hours" calculated?

CPU compute hours are calculated based on the actual usage time per core. So for a job that uses 5 CPUs for a period of 100 hours, the billing/usage hours consumed would be 500 general compute hours (5 cpus x 100 hours). Occasionally, researchers may reserve more CPUs (or GPUs) than their job actually uses. Any CPUs reserved but not actually used will still be billed at the same rate, since there is no practical way to determine whether a CPU is being utilized, and it remains reserved by the job scheduler and cannot be used by other users. It is strongly advised that researchers test their applications and ensure that the CPUs reserved for a job are actually being used by the job.

GPU compute hours are calculated similarly to CPU hours, but a multiplier of "10" is used to convert them into "General Compute" (billing) hours as discussed previously. So for a job that calls for 1 CPU and 2 GPUs used for 10 hours, the general compute hours will total 210 [(2 GPUs x 10 hours x 10.0) + (1 CPU x 10 hours)].

If CPUs or GPUs are reserved for a particular walltime (total estimated run time) but the job finishes before the walltime expires, users will only be billed for the actual job run time (the utilized portion of the estimated walltime). So for example, if a user reserves 5 CPUs for a walltime of 100 hours, but the job only runs for 70 hours, the total billed time will be 350 cpu-hours (5 CPUs x 70 hours).
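
To illustrate, the sketch below reserves 5 CPU cores for a 100-hour walltime. The job name and program are hypothetical; the comments show how the billing described above would apply.

#!/bin/bash
#SBATCH --job-name=billing-example   # hypothetical job name
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=5            # 5 CPU cores reserved for the job
#SBATCH --time=100:00:00             # 100-hour walltime (upper bound on run time)
# Maximum possible charge: 5 cores x 100 hours = 500 cpu-hours (= 500 general compute hours).
# If the job actually finishes in 70 hours, the charge is 5 x 70 = 350 cpu-hours.
srun ./my_program                    # hypothetical application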

What are my storage limitations?

There are important limitations on the backup and replication of Matilda HPC cluster storage. These include:

  • /scratch/user spaces are not backed up or replicated. Files are automatically deleted if they have not been accessed in the last 45 days

  • /scratch/project spaces are not backed up or replicated. Files are automatically deleted if they have not been accessed in the last 45 days

  • PIs will not have access to user /home directory spaces even after the lab member leaves OU, unless they have secured permission in writing from the user

  • Data stored in /project and /scratch/project spaces will be accessible by the PI, even if the lab member leaves OU

  • /home and /project space backups may be accessed by traversing to the ".snapshot" subdirectory of any directory or subdirectory
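
For example, assuming a hypothetical home directory of /home/jonesk (snapshot directory names and retention are set by UTS and may differ), an earlier copy of a file could be retrieved like this:

cd /home/jonesk/.snapshot                          # hypothetical path; .snapshot exists under any backed-up directory
ls                                                 # snapshots are typically organized by date/time
cp <snapshot-name>/myfile.txt ~/myfile.restored    # hypothetical file; copy the version you need back out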

For more details regarding on-cluster storage, please refer to the KB article on Matilda HPC Storage.

When will the usage of my Base Allocation be reset?

Matilda HPC compute hour usage is reset every year on January 1st at 12:00 a.m. At that point, your usage returns to "0" and you will again have access to the full balance of your Base Allocation.

Do unused portions of the Base Allocation roll-over into the next year?

No. Unused portions of the Base Allocation do not roll-over into the next calendar year (use-it-or-lose-it).

How can I check my Matilda HPC usage?

UTS has developed the utility suite "powertools" which includes tools for checking your account balance and other information on your jobs including history. Reports can be generated for various periods, and can be produced in summary or detailed formats. For information on powertools and how to use it, please refer to the KB article on Matilda HPC Powertools.

A quick year-to-date summary can be generated for the accounts/groups of which you are a member by using:

module load powertools
acctBalanceAll

What are my options if I run out of Base Allocation time?

If you have funding to support the purchase of additional compute time, you may purchase it by contacting UTS.

If you do not have the funds, UTS provides the "scavenger queue" on Matilda, which permits a limited number of jobs (and limited resources per job) to be run without accruing time against your Base Allocation. This can, of course, be used to run jobs if your Base Allocation is exhausted and you do not have the funds to purchase additional compute hours. Please be aware, however, that the scavenger queue significantly reduces the number of jobs you can run on Matilda at any one time, and the resources that can be dedicated to each job.

How can I conserve my Base Allocation time?

A good way to help preserve your Base Allocation time is to train new users or run tests on the scavenger queue. These jobs do not accrue resource time against your Base Allocation, but offer only limited resources (which works well for very small jobs, tests, and training).
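
A minimal sketch of directing a small test job to the scavenger queue is shown below; the partition name "scavenger" and the test program are assumptions here, so check the Matilda documentation or with UTS for the actual queue name and its limits.

#SBATCH --partition=scavenger   # assumed name of the scavenger queue partition
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1       # keep test/training jobs small
#SBATCH --time=01:00:00         # keep test/training jobs short
srun ./test_case                # hypothetical test application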

Another best practice is to avoid launching many complex new jobs when minimal or no testing has been performed. For example, if you want to run a multi-node, many-core job but are unsure of how it will perform (and perhaps it is too large to run on the scavenger queue), we suggest running a single job to gauge performance, and then applying any necessary changes. Once you have your job configured to your satisfaction, you can launch several similar jobs on different data sets. This prevents wasting Base Allocation time by launching several high-resource jobs at once, only to discover they are not set up properly.

Additional Compute Time

What are the rates for additional computational resources?

Researchers who need additional computational time beyond the annual base allocation can purchase additional resources. Current costs (which will be revised every two years) are:

  • CPU hours: $0.024 per hour

  • GPU hours: $0.24 per hour
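
As a hypothetical worked example, a lab expecting to need an additional 500,000 CPU hours and 10,000 GPU hours in a year would budget (500,000 x $0.024) + (10,000 x $0.24) = $12,000 + $2,400 = $14,400 at the current rates.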

How should I estimate how much additional compute time to purchase?

A good way to do this is to use the Matilda HPC Powertools suite to generate usage reports, particularly by user. This will enable you to see who is using which resources, how much, and for what types of jobs. This information can be used to identify typical resource requirements for different types of jobs, and/or to determine whether some of your users might be wasting resources due to an inadequate understanding of how to use the cluster or set up computational problems.

Does unused additional compute time roll-over into the next year?

Yes. Additional compute time is allocated in a separate account, and unused balances roll-over until exhausted.

How do I use my additional compute time?

Additional compute time is placed in one or more separate accounts (your default account is the same as your project/group/lab name). When you wish to use this time, you must explicitly specify the account in your job script or on the command line when submitting a job. For example, let's assume we have placed purchased time in the account "joneslabExtra":

#SBATCH -A joneslabExtra

or

#SBATCH --account=joneslabExtra

or

srun -N 1 -n 1 -c 1 -t 1:00:00 -A joneslabExtra --pty /bin/bash

For more information on these and other options, refer to the SBATCH Options KB and/or the Job Script Examples KB.

Is there a way to separate out additional compute time into separate categories for each grant?

If you have received multiple grant awards where separate funding has been set aside for Matilda HPC compute time, UTS can create an account for each grant award upon request. In this way, additional time can be placed into each grant account. However, you must make sure that all of the users in your lab specify the correct account when running jobs as shown above.
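
For instance, if UTS created the hypothetical accounts "joneslabNSF" and "joneslabNIH" for two separate awards, each job script would name the account whose funds should be charged:

#SBATCH --account=joneslabNSF   # hypothetical account for the first award

or

#SBATCH --account=joneslabNIH   # hypothetical account for the second award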

Can access to additional compute time accounts be restricted to select users?

Yes. Access to each account (including your default project account) is restricted to specific users. It is also possible to set limits on how much time a particular user may use from any given account (the default is unlimited, up to the Base Allocation or maximum additional compute time). Please notify UTS if you would like to restrict access to any of your accounts, with details on the users, account specifics, and usage limits (if any).

Additional Storage Space

What are the rates for additional storage resources?

Researchers or groups who need additional storage beyond the annual base allocation can purchase additional space, depending on their specific storage needs. There are two base storage types: storage on the Matilda HPC cluster itself, or storage in one or more OU data centers without direct access to/from the Matilda cluster. Current costs (which will be revised every two years) for each storage tier are listed below, per TiB for one year:

  • Matilda project or home directory quota ($260 per TiB per year): Storage quota on Matilda is increased and immediately available for use on Matilda. Two locations (snapshot & replication between NFH and DH) plus DR-AWS/Deep Archive; offers the best protection for data (2+1 locations).

  • Matilda scratch space quota ($72 per TiB per year): Scratch space on Matilda is increased and immediately available for use on Matilda. High-speed Lustre parallel scratch; files are not backed up and are purged automatically 45 days from the time of last access. Offers increased working storage for large data files.

  • Performance tier ($170 per TiB per year): Single local (snapshot) storage location (NFH), high speed. Good for storing data you need occasionally where file loss is not catastrophic.

  • Archive tier4 ($90 per TiB per year): Single local5 (snapshot) storage location (DH). Good for archiving data you need to keep but not access often, and where file loss is not catastrophic.

  • Replicated performance tier ($250 per TiB per year): Two locations (snapshot & replication between NFH & DH). Good for storing data you need occasionally where file loss would be catastrophic.

  • Replicated performance tier with deep archive ($260 per TiB per year): Two locations (snapshot & replication between NFH & DH) and DR-AWS/Deep Archive. Best protection for storing data you need occasionally where file loss would be catastrophic.

  • Archive tier4 with deep archive ($90 per TiB per year): Single location and DR-AWS/Deep Archive. Good for archiving data with infrequent access that requires off-site data protection.

How can I check my storage space usage?

When you log in to Matilda, your usage statistics are displayed for your /home, /project, and /scratch spaces. After login, you may use the various "quota" tools in Matilda HPC Powertools to check on file usage. Please note that for shared spaces like "/projects", the accuracy of the usage statistics will depend on the group ownership of the files in these directories. Although new files are assigned group ownership matching the project name by default, users sometimes transfer files from other spaces (e.g., /home) and preserve group ownership, which can throw off the true usage calculation. Make sure your users assign the correct group to any files transferred to these spaces.
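
As an illustrative sketch (the project name "joneslab" and the paths are hypothetical), files with the wrong group ownership can be found and corrected like this:

# List files in the project space that do not belong to the project's group
find /projects/joneslab -not -group joneslab -ls

# Reassign group ownership so usage statistics are attributed correctly
chgrp -R joneslab /projects/joneslab/mydata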

Are there any best practices for conserving storage space?

Some suggested techniques include:

  • Gzipping files/directories where work has been completed and will not need to be accessed for a while (see the example after this list).
  • Proactively deleting temporary files from /home and /project spaces
  • Using /scratch spaces for temporary or "test" work products (there is a limit on scratch space, but it is significantly higher and files are deleted automatically after 45 days)
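
A minimal sketch of the first technique, assuming a hypothetical completed results directory named completed_run inside a project space:

# Bundle and compress the finished directory, then remove the original to free space
tar -czf completed_run.tar.gz completed_run/
rm -rf completed_run/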

Buy-In Nodes

Researchers who need hardware capacity beyond what is currently available on the Matilda cluster can purchase additional nodes. UTS staff will add purchased nodes to the cluster and manage them together with the rest of the cluster. Buy-in users and their research groups will have priority access on all cluster resources they purchase. They will also receive additional compute time (CPU or GPU, as needed or desired) in the calendar year they purchase resources, based on rates in effect at the time of purchase.

Additional compute time allocations will be divided equally over a period of 5 years and added to the base allocation unless otherwise specified. Additional compute time added to a researcher's account from a buy-in purchase is provided as a use-it-or-lose-it allocation, just like the base allocation. No additional allocations will be provided beyond 5 years from the date the buy-in node was placed into service.

Buy-In node priority will be provided to the researcher as follows:

  • The buy-in researcher will be permitted to run jobs up to the maximum walltime of 168 hours on their buy-in nodes
  • Other researchers may use the buy-in node, but only for a maximum walltime of 4 hours, and only if the resources are available.
  • Additional compute time allocations provided as part of a buy-in purchase will apply to any job on any node (not just the buy-in node)
  • Buy-in priority will last for a period of 5 years from the date when the node was placed into service

To purchase a node, contact UTS at [email protected] to discuss your needs and get a quote. The exact price will depend on the hardware chosen, plus any incidentals that may be needed to connect the new hardware to the cluster.



  1. Compute Hours are measured per CPU core used in a job; thus a job running on 40 CPU cores for 1 hour would consume 40 Compute Hours.

  2. GPU Hours are measured per GPU requested, as typically only a single job can run on a GPU at a time. For example, a job requesting 2 GPU resources and running for 1 hour would consume 2 GPU Hours.

  3. Scratch storage is short-term storage used for working files only. It is not backed up or mirrored, and inactive files (determined by last access time) are deleted after 45 days.

  4. Can be made available through Globus to move data in and out of the Matilda HPC Cluster.

  5. Local refers to storage within one of the data centers on the OU campus, either in Dodge Hall (DH) or North Foundation Hall (NFH).