HPC Change Log

A log of software installations, updates, planned maintenance, and configuration changes.

Globus Maintenance - [Completed]

Date: November 2, 2024 9:00am - 10:30am EST

Cluster Matilda Globus Services

All Globus services will be unavailable including Auth, Transfer, Flows, Search, Timer, Compute, and Globus ID. This will impact the availability of the Globus Endpoint hosted on Matilda.

We expect a relatively brief disruption in Globus services during this time.

For more information see: Globus Maintenance Downtime.

Cluster Upgrades - [Completed]

Date: August 28, 2024 7:00am EDT - August 29, 2024 5:00pm EDT

Cluster Matilda

The cluster will be down as we upgrade the cluster management software, apply OS updates and firmware upgrades, and replace batteries on the DDN Lustre Storage appliance. We are also planning to move the physical network links that connect the HPC to the campus data center network over to the new ACI data center network.

We expect significant downtime for the cluster as we perform this maintenance.

As a reminder, we also have HPC resources at MSU iCER which may be of use during this downtime. See MSU_iCER_HPCC_Research_Cluster for details on requesting access if you don't have it already.

Firewall Migration - [Completed]

Date: July 24, 2024 from 06:00 - 08:00 EDT

Cluster Matilda

The UTS Security Team will be migrating network security services to a new firewall for the login and data nodes.

Users may experience dropped remote SSH/SFTP/SCP sessions, issues logging in, and possible interruptions to Globus file sharing during this time. Consider using a terminal multiplexer such as tmux so that you can reconnect to an existing SSH session if your connection is interrupted.
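
As a rough sketch of the tmux suggestion above, the snippet below starts a named session, or reattaches to it if it already exists. The session name matilda-work and the helper start_or_resume are placeholders chosen for illustration, not site conventions.

```shell
#!/usr/bin/env bash
# Hypothetical session name; any label works.
SESSION="matilda-work"

# Hypothetical helper: create the named tmux session, or attach to it if it
# already exists (-A makes new-session behave like attach-session in that case).
start_or_resume() {
    tmux new-session -A -s "$1"
}

# Only launch tmux from an interactive terminal, when tmux is installed and
# we are not already inside a tmux session.
if command -v tmux >/dev/null 2>&1 && [ -t 0 ] && [ -z "${TMUX:-}" ]; then
    start_or_resume "$SESSION"
fi
```

A tmux session lives on the host where it was started, so after a dropped connection you would log back in to the same login node and run the snippet again to pick up where you left off.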

Firewall Migration - [Completed - Reverted changes, will reschedule]

Date: July 10, 2024 from 06:00 - 08:00 EDT

Cluster Matilda

The UTS Security Team will be migrating network security services to a new firewall for the login and data nodes.

Users may experience dropped remote SSH/SFTP/SCP sessions, issues logging in, and possible interruptions to Globus file sharing during this time. Consider using a terminal multiplexer such as tmux so that you can reconnect to an existing SSH session if your connection is interrupted.

Cluster Upgrades - [Completed]

Date: April 24-26, 2024

Cluster Matilda

The cluster will be down as we upgrade the cluster management software (migrating from Bright Cluster Manager to NVIDIA Base Command Manager) and apply OS updates and firmware upgrades.

We expect significant downtime for the cluster as we perform this maintenance.

As a reminder, we also have HPC resources at MSU iCER which may be of use during this downtime. See MSU_iCER_HPCC_Research_Cluster for details on requesting access if you don't have it already.

Compute node firmware updates - rolling - [Completed]

Date: March 25 - April 26, 2024

Cluster Matilda

We will be draining nodes and applying Dell firmware updates. This should not be disruptive to jobs but will reduce some of the capacity of the cluster as we take idle nodes out of service to apply updates.

Cluster Upgrades - [Completed]

In addition to standard software and firmware upgrades, a "Prolog" script was added that checks whether a mount point (e.g. /scratch or /projects) is down before a job lands on a node. If it is, the node is set to "drain" (taken offline) and the user's job is re-queued.
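
The Prolog script itself is not reproduced in this log. A minimal sketch of the idea, assuming the two mount points named above and a hypothetical helper check_mounts, could look like the following. It relies on Slurm's documented handling of a Prolog that exits non-zero, which is to drain the node and requeue the job.

```shell
#!/usr/bin/env bash
# Hypothetical list of mount points to verify, taken from the examples above.
REQUIRED_MOUNTS="/scratch /projects"

# check_mounts DIR...: return 0 only if every DIR is an active mount point.
check_mounts() {
    local dir
    for dir in "$@"; do
        if ! mountpoint -q "$dir"; then
            echo "mount check failed: $dir" >&2
            return 1
        fi
    done
    return 0
}

# Slurm sets SLURM_JOB_ID in the Prolog environment; this guard keeps the
# script inert when it is sourced or run by hand outside of a job launch.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    # Word splitting of the list is intentional here.
    check_mounts $REQUIRED_MOUNTS || exit 1
fi
```

Keeping the check in a function makes it easy to try by hand on a node (for example, check_mounts /scratch) before wiring it into the scheduler configuration.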

Date: August 23-24, 2023

Cluster Matilda

The cluster will be down as we upgrade the cluster management software and apply OS updates and firmware upgrades.

We expect significant downtime for the cluster as we perform this maintenance.

As a reminder, we also have HPC resources at MSU iCER which may be of use during this downtime. See MSU_iCER_HPCC_Research_Cluster for details on requesting access if you don't have it already.

Cluster Upgrades - [Completed]

Date: April 26-27, 2023

Cluster Matilda

The cluster will be down as we upgrade the cluster management software and apply OS updates and firmware upgrades.

We expect significant downtime for the cluster as we perform this maintenance.

As a reminder, we also have HPC resources at MSU iCER which may be of use during this downtime. See MSU_iCER_HPCC_Research_Cluster for details on requesting access if you don't have it already.

Globus Archives and Guest Collections - [Completed]

Date: November 3, 2022

Cluster Matilda

We will be upgrading to a subscription-based version of Globus that will permit users to share Guest Collections with collaborators outside of OU.

Please refer to the Globus Archives documentation for more information.

Important OpenMPI Upgrade and Changes - [Completed]

Date: November 3, 2022

Cluster Matilda

We will be evaluating OpenMPI performance and optimization and making any necessary changes.

OpenMPI version 4.1.1c was tested and installed - module load OpenMPI/4.1.1c

Please refer to the MPI Job Script documentation for important changes when compiling or running OpenMPI capable applications.

MATLAB Installation - [Completed]

Date: November 3, 2022

Cluster Matilda

MATLAB version R2022b was installed - module load MATLAB/R2022b

For the latest information, please refer to the MATLAB HPC reference. Please update your off-cluster scripts if you'd like to run the new version of MATLAB remotely.

Cluster Upgrades - [Completed]

Several new features and configuration changes were applied during this maintenance window. Please see HPC Feature Updates for more information.

Date: August 24 - August 25, 2022

Cluster Matilda

The cluster will be down as we upgrade the cluster management software, apply OS updates, and apply firmware upgrades to the scratch storage appliance.

We expect significant downtime for the cluster as we perform this maintenance.

As a reminder, we also have HPC resources at MSU iCER which may be of use during this downtime. See MSU_iCER_HPCC_Research_Cluster for details on requesting access if you don't have it already.

SLURM Upgrade - We upgraded to version 21.08

SLURM Configuration - New queues were added for short (<=4 hrs) and long jobs, as well as for buy-in users. For a complete list of all changes, see the Feature Updates.

OS Upgrade - RHEL 8.6

Firmware and BIOS updates

Driver Updates - Infiniband networking, Lustre (/scratch) and NVIDIA GPUs

Infiniband Network, Scratch Storage, MPI Outage - [Completed]

Date: May 07, 2022

Cluster Matilda

An NVIDIA/Mellanox switch failed, making the Infiniband network, MPI jobs, and /scratch storage unavailable. We are working with Dell and NVIDIA to get a replacement as soon as possible.

May 12, 2022 - We rearranged the remaining functioning switches and restored the Infiniband network to about two-thirds of the cluster. We are still waiting on a replacement switch.

May 17, 2022 - The replacement switch has been installed and configured. All compute nodes are back online with /scratch available.

Home Directory Quota Increase - [Completed]

Date: April 29, 2022

Cluster Matilda

Home directory quotas were increased from 20GB to 50GB.

Cluster Upgrades - [Completed]

Date: April 27 - April 29, 2022

Cluster Matilda

The cluster will be down as we upgrade the cluster management software, apply OS updates, and apply firmware upgrades to the scratch storage appliance.

We expect significant downtime for the cluster as we perform this maintenance.

As a reminder, we also have HPC resources at MSU iCER which may be of use during this downtime. See MSU_iCER_HPCC_Research_Cluster for details on requesting access if you don't have it already.

Primary storage system upgrade - [Completed]

Date: October 20, 2021 from 06:00 - 10:00

Cluster Matilda

The primary storage system (Dell EMC Isilon) for home directories, project space, and software will be upgraded. This is a rolling upgrade of the storage system cluster nodes, so it should not be disruptive. However, there is a possibility that jobs and the cluster will be unavailable.

Scratch space file purging - [Completed]

Date: April 28 - April 30, 2021

Cluster Matilda

Files on scratch space that have not been accessed in 45 days are currently not being purged as expected. We will purge these files and implement a script to perform the purge on an ongoing basis.
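
The purge script itself is not shown in this log. One way such a purge might be sketched, assuming a find-based approach and the hypothetical helpers list_stale and purge_stale, is below. Note that find's -atime +45 counts whole 24-hour periods, so it matches files last accessed at least 46 days ago.

```shell
#!/usr/bin/env bash
# list_stale DIR DAYS: print regular files under DIR whose last access time
# is more than DAYS full days in the past (dry run, nothing is deleted).
list_stale() {
    find "$1" -type f -atime +"$2" -print
}

# purge_stale DIR DAYS: delete the same set of files that list_stale reports.
purge_stale() {
    find "$1" -type f -atime +"$2" -delete
}

# Example (not executed here): review first, then purge.
#   list_stale /scratch 45
#   purge_stale /scratch 45
```

Separating the dry-run listing from the deletion makes it possible to review what would be removed before anything is purged.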

Cluster Upgrades - [Completed]

Date: April 28 - April 29, 2021

Cluster Matilda

The cluster will be down as we upgrade the cluster management software, apply OS updates, and apply firmware upgrades to the scratch storage appliance.

Compute node firmware updates - rolling - [Completed]

Date: April 9 - April 30, 2021

Cluster Matilda

We will be draining nodes and applying Dell firmware updates. This should not be disruptive to jobs but will reduce some of the capacity of the cluster as we take idle nodes out of service to apply updates.

Change from TMod to LMod module environment - [Completed]

Date: January 13, 2021 - 07:00-08:00

Cluster Matilda

The system will be reconfigured to use the LMod module system instead of the TMod module system. LMod is backwards compatible with TMod, so users should not experience any issues with existing job submission scripts.
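
To confirm which module system a shell is using after the change, a small check along these lines may help. It is a sketch that assumes Lmod exports LMOD_VERSION and that the Tcl-based module system sets MODULESHOME; the function name module_system is made up for illustration.

```shell
#!/usr/bin/env bash
# module_system: print "lmod", "tmod", or "none" based on the environment.
# Assumption: Lmod exports LMOD_VERSION; Tcl Environment Modules sets
# MODULESHOME (Lmod may set it too, so LMOD_VERSION is checked first).
module_system() {
    if [ -n "${LMOD_VERSION:-}" ]; then
        echo "lmod"
    elif [ -n "${MODULESHOME:-}" ]; then
        echo "tmod"
    else
        echo "none"
    fi
}
```

Existing job scripts that call module load should not need this check; it is only a quick way to see which system is active in a given session.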


CategoryHPC