HPC Change Log

Log of software installations, updates, planned maintenance, and configuration changes.

Cluster Upgrades - [Scheduled]

Date: August 24 - August 25, 2022

Cluster Matilda

The cluster will be down while we upgrade the cluster management software and apply OS updates and firmware upgrades. We expect significant downtime for the cluster during this maintenance. As a reminder, we also have HPC resources at MSU iCER, which may be of use during this downtime. See MSU_iCER_HPCC_Research_Cluster for details on requesting access if you don't have it already.

Infiniband Network, Scratch Storage, MPI Outage - [Completed]

Date: May 07, 2022

Cluster Matilda

An NVIDIA/Mellanox switch failed, making the Infiniband network, MPI jobs, and /scratch storage inaccessible. We are working with Dell and NVIDIA to get a replacement ASAP.

May 12, 2022 - We rearranged the remaining functioning switches and restored the Infiniband network to about two-thirds of the cluster. We are still waiting on a replacement switch.

May 17, 2022 - The replacement switch has been installed and configured. All compute nodes are back online with /scratch available.

Home Directory Quota Increase - [Completed]

Date: April 29, 2022

Cluster Matilda

Home directory quotas were increased from 20GB to 50GB.

Cluster Upgrades - [Completed]

Date: April 27 - April 29, 2022

Cluster Matilda

The cluster will be down while we upgrade the cluster management software, apply OS updates, and apply firmware upgrades to the scratch storage appliance. We expect significant downtime for the cluster during this maintenance. As a reminder, we also have HPC resources at MSU iCER, which may be of use during this downtime. See MSU_iCER_HPCC_Research_Cluster for details on requesting access if you don't have it already.

Primary storage system upgrade - [Completed]

Date: Oct 20, 2021: 06:00 - 10:00

Cluster Matilda

The primary storage system (Dell EMC Isilon) for home directories, project space, and software will be upgraded. This is a rolling upgrade of the storage system cluster nodes, so it should not be disruptive. However, there is a possibility that jobs and the cluster will be unavailable during the upgrade window.

Scratch space file purging - [Completed]

Date: April 28 - April 30, 2021

Cluster Matilda

Files on scratch space that have not been accessed in 45 days are currently not being purged as expected. We will purge these files and implement a script to perform the purge on an ongoing basis.
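
For illustration only, here is a minimal Python sketch of what an access-time-based purge might look like. It is not the script deployed on Matilda. The function names, the dry-run default, and the use of /scratch as the root path in the usage note are all assumptions made for this example.

```python
import os
import stat
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def find_stale_files(root, max_age_days=45, now=None):
    """Return regular files under root not accessed within max_age_days.

    Hypothetical helper for illustration; not the production purge script.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    # os.walk does not follow symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.lstat()  # lstat: inspect the entry itself, not a symlink target
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_atime < cutoff:
                stale.append(path)
    return sorted(stale)


def purge_stale_files(root, max_age_days=45, dry_run=True):
    """Delete stale files under root; with dry_run=True only report them."""
    handled = []
    for path in find_stale_files(root, max_age_days):
        if not dry_run:
            try:
                path.unlink()
            except OSError:
                continue  # could not delete; leave it out of the report
        handled.append(path)
    return handled
```

Calling purge_stale_files("/scratch") with the default dry_run=True would only list candidate files, and passing dry_run=False would delete them. Note that access times are only meaningful if the file system records them; mount options such as noatime or relatime change how often atime is updated, so a real purge policy would need to account for that.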

Cluster Upgrades - [Completed]

Date: April 28 - April 29, 2021

Cluster Matilda

The cluster will be down while we upgrade the cluster management software, apply OS updates, and apply firmware upgrades to the scratch storage appliance.

Compute node firmware updates - rolling - [Completed]

Date: April 9 - April 30, 2021

Cluster Matilda

We will be draining nodes and applying Dell firmware updates. This should not disrupt running jobs, but it will temporarily reduce cluster capacity as we take idle nodes out of service to apply updates.

Change from TMod to LMod module environment - [Completed]

Date: January 13, 2021 - 07:00-08:00

Cluster Matilda

The system will be reconfigured to use the LMod module system instead of the TMod module system. LMod is backwards compatible with TMod, so users should not experience any issues with existing job submission scripts.


CategoryHPC