Maintenance Notifications
This page provides general notifications about the Aoraki Research Cluster, including its maintenance schedule.
To carry out critical updates while minimising downtime, the Aoraki Cluster undergoes planned maintenance windows. During these periods, logins are disabled and jobs are paused or stopped, depending on the scope of the maintenance. Maintenance notifications are sent by email to all users, and updates are posted on this page.
Upcoming scheduled maintenance dates
Upcoming Maintenance: September 5–7, 2025
We will be performing scheduled maintenance on the Aoraki Research Cluster and associated services from:
5:00pm Friday, 5 September – 7:00am Monday, 8 September 2025 (NZST)
During this time, all cluster services including OnDemand and CryoSPARC will be unavailable, and the login node will be temporarily disabled.
Planned Work
The following system upgrades and changes will be performed:
Enable SLURM QOS (Quality of Service): Introduce SLURM QOS levels to support differentiated scheduling policies, job prioritisation, and fair-share configurations (see the submission example after this list).
Install PowerScale Isilon multipath drivers: Improve fault tolerance and performance for storage connections by enabling multipath support for Isilon volumes.
Set the SLURM temporary directory on Weka: Redirect SLURM job temporary storage to a high-performance Weka filesystem location for better I/O performance.
Install and enable DCGMI: The NVIDIA Data Center GPU Manager (DCGM) and its command-line tool, dcgmi, monitor and manage NVIDIA GPUs in data centres. They provide real-time health, power, performance, and utilisation metrics, helping administrators optimise GPU workloads and ensure system reliability (see the dcgmi sketch after this list).
Enable SLURM Performance Statistics: SLURM Performance Statistics (SPS) is a plugin that collects and reports detailed job and system performance metrics, such as CPU, memory, and GPU usage. It enables advanced monitoring and accounting for optimising workload efficiency and system utilisation.
Set `/opt/weka` as a bind mount (currently a symlink): Update `/opt/weka` to use a proper bind mount for improved stability.
Restart the login node to apply configuration changes: Reboot the login node to apply system-level changes and complete the maintenance.
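As noted for the QOS item above, jobs will be able to request a QOS at submission time once the levels are enabled. A minimal sketch using standard SLURM client commands; the QOS name `long` is a placeholder, not a confirmed level:

```bash
# List the QOS levels defined on the cluster and their limits:
sacctmgr show qos format=Name,Priority,MaxWall

# Request a specific QOS when submitting a job ("long" is a placeholder name):
sbatch --qos=long --time=48:00:00 my_job.sh
```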
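For the DCGMI item, the dcgmi tool can be used to inspect GPU health and utilisation once DCGM is installed. A brief sketch using standard dcgmi subcommands; the field IDs are assumed to be the usual utilisation and memory counters:

```bash
# List the GPUs visible to the DCGM host engine:
dcgmi discovery -l

# Sample GPU utilisation (field 203) and framebuffer memory used (field 252) five times:
dcgmi dmon -e 203,252 -c 5
```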
Completed: 25–28 April 2025 Maintenance Details
Date and Time: 5:00pm Friday, 25 April – 7:00am Monday, 28 April 2025 (NZST)
Note: All jobs will be stopped at 5:00pm Friday, 25 April.
Impact on Services
The Aoraki cluster and OnDemand services will be unavailable throughout the maintenance window.
Access to the login node will be temporarily disabled.
Other dependent services, including CryoSPARC, will also be affected.
Action Required
Please plan your computational tasks accordingly to ensure that any critical jobs are completed before 5:00pm Friday, 25 April.
Save your work and log out of the Aoraki cluster and CryoSPARC nodes before the maintenance begins.
Purpose
This maintenance is essential to implement several key upgrades. Planned work includes:
Expansion of the cluster with additional nodes
Storage software upgrade
Network migration
Enabling cgroups on all GPUs for better resource control
Storage configuration updates, including:
Unmounting the current /scratch directory to finalise its deprecation.
Aoraki Research Cluster November 2024 Maintenance Complete
The scheduled maintenance for the Aoraki Research Cluster was successfully completed on November 10, 2024. The upgrade included several significant improvements aimed at enhancing performance, reliability, and resource management for all users.
Key Enhancements:
Fairshare Scheduling with SLURM Accounting: SLURM Accounting is now enabled to monitor resource usage effectively, supporting fairshare scheduling for a balanced distribution of workloads.
Enhanced SLURM Clustered Database and Secondary Controller: A new SLURM clustered database and secondary controller have been implemented to improve resource management and redundancy. This addition strengthens failover capabilities, ensuring uninterrupted SLURM operations in the event of a single-point failure.
GPU Visibility with Cgroups Device Constraints: Cgroups device constraints have been added to GPU nodes, allowing users to access only the GPUs they have requested. This change minimizes conflicts and promotes fair GPU sharing across workloads.
AUKS SLURM Plugin for Kerberos Authentication: The AUKS SLURM plugin has been integrated, enabling jobs to access Kerberos-protected network resources for up to 7 days without user intervention, simplifying authentication needs during job execution.
Change Log 10-11 November 2024
SLURM Controller Daemon Upgrade: Upgraded the slurmctld daemon to version 23.02.0-json, enabling both Lua and JSON capabilities.
Secondary SLURM Control Node: Added a secondary SLURM control node.
SLURM Database Migration: Migrated the SLURM database to a Galera cluster.
Enhanced SLURM Database Backup and Monitoring: Added new backup and monitoring features for the SLURM database.
GPU Cgroups Constraints: Implemented cgroups constraints on GPU resources to ensure fair resource distribution.
Weka Storage Upgrade: Upgraded Weka storage to version 4.2.17.
Weka Container Configuration Update: Configured Weka containers to use 2 CPUs on each node.
CPU Reservation for Weka in SLURM Configuration: Updated the SLURM configuration to reserve CPU resources specifically for Weka operations.
SLURM Accounting User and Group Import: Imported all users and groups into SLURM accounting for enhanced tracking and management.
AUKS Installation and Configuration: Installed and configured AUKS to enable Kerberos authentication across cluster nodes.
Node Renaming for Active Directory Compliance: Renamed all nodes to comply with the Active Directory naming scheme and rejoined the domain.
OnDemand Reconfiguration: Reconfigured OnDemand to utilise the new node names.
L40 GPU Node Activation for OnDemand Desktop Use: Enabled the L40 GPU nodes for use in OnDemand desktop environments.
9-10 November 2024 - Aoraki Research Cluster Maintenance Updates
As part of the scheduled maintenance for the Aoraki Research Cluster over the weekend of 9-10 November 2024, we will be implementing several updates to enhance system performance, resource allocation, and overall reliability. Below are the key changes that will be applied during this outage:
Implementation of New SLURM Clustered Database and Secondary Controller
To improve resource management and increase redundancy, we will be introducing a new SLURM clustered database along with a secondary controller. This upgrade is critical to ensure better failover capabilities and maintain SLURM operations in the event of any single-point failure.
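Once the secondary controller is in place, its status can be checked from any node with a standard SLURM command:

```bash
# Reports whether the primary and backup slurmctld daemons are responding:
scontrol ping
```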
Enabling SLURM Accounting for Fairshare Scheduling
SLURM Accounting will be enabled to track resource usage and ensure that fairshare scheduling is properly implemented. This will help balance workload distribution across users, giving fair access to cluster resources based on usage history.
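After accounting is enabled, you will be able to review your own usage and fair-share standing with the standard SLURM reporting tools. A brief sketch; the start date and format fields are examples only:

```bash
# Show fair-share factors for your account associations:
sshare -u $USER

# Summarise your recent jobs from the accounting database:
sacct -u $USER --starttime=2024-11-11 --format=JobID,JobName,Partition,Elapsed,State
```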
CPU Reservation for Weka (Recommended for 100G)
Each node will now have 2 CPUs reserved specifically for Weka filesystem network services. This is in line with the recommended configuration for 100G networking and will help to optimize I/O performance and system responsiveness, particularly for jobs involving large data.
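This kind of reservation is typically expressed in slurm.conf via core specialisation (for example `CoreSpecCount`), which keeps the reserved cores out of the pool SLURM allocates to jobs; the exact parameter used on Aoraki is an assumption here. One way to see the effect on a node (the node name is a placeholder):

```bash
# Compare the total CPUs on a node with what SLURM will actually allocate to jobs:
scontrol show node aoraki-cn01 | grep -E 'CPUTot|CPUAlloc|CoreSpecCount'
```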
Cgroup Device Constraints for GPU Visibility
Cgroup device constraints will be added to all GPU nodes. This change is essential for improving GPU allocation and visibility, ensuring that users can only see and use the GPUs they have specifically requested. Currently, users can see or attempt to use all GPUs on a node, even if those GPUs are allocated to other jobs. This update will prevent such conflicts and ensure fair resource sharing.
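In practice this relies on SLURM's cgroup device enforcement. From a user's point of view, the effect can be checked by running nvidia-smi inside a job allocation; a quick sketch, assuming the usual GPU GRES setup:

```bash
# Request one GPU and look at what the job can see; on a multi-GPU node,
# only the single allocated GPU should be listed once device constraints are active.
srun --gres=gpu:1 --pty nvidia-smi
```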
Implementing GPU Sharding on L40 Nodes
To further optimize GPU resource management, we will be introducing GPU sharding on the L40 nodes. This will allow better partitioning of GPU resources, giving users more flexibility and control over their workloads while minimizing GPU contention between jobs.
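SLURM exposes this partitioning through the `shard` GRES, which lets several jobs share one physical GPU. A hedged submission sketch; the partition name and the number of shards configured per L40 are assumptions:

```bash
# Request a fraction of an L40 GPU via the shard GRES (names and values are placeholders):
sbatch --partition=gpu --gres=shard:1 --time=04:00:00 my_job.sh
```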
Implementing AUKS SLURM Plugin
The AUKS SLURM plugin extends the Kerberos authentication system in HPC environments. By integrating AUKS with SLURM, jobs can access network resources (HCS) that require Kerberos authentication without user intervention during job execution. After the initial user’s Kerberos ticket is obtained on the login node and added to the SLURM AUKS repository, network resources are accessible on all cluster nodes for 7 days.
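In practice, the workflow from the login node looks roughly like the following; the auks client options shown are assumed to be the standard ones:

```bash
kinit              # obtain your Kerberos ticket on the login node
auks -a            # add (push) the ticket to the AUKS repository
sbatch my_job.sh   # jobs submitted afterwards can reach Kerberos-protected resources for up to 7 days
```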
We appreciate your understanding and cooperation during this maintenance period. If you have any questions or concerns, please contact our support team at rtis.solutions@otago.ac.nz.
30th May 2024 - Research Cluster Maintenance
CHANGES AND FIXES IMPLEMENTED:
Breaking changes:
Memory usage is now limited to the amount specified in your SLURM job script.
Partitions now have time limits; all jobs must now specify a time limit (see the example script below).
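A minimal job script reflecting both changes might look like the following; the partition name and resource values are placeholders only:

```bash
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=aoraki       # placeholder partition name
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G                 # memory is now enforced at the requested amount
#SBATCH --time=02:00:00          # a time limit is now required on every job

srun ./my_program
```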
Other changes/fixes:
Storage has been updated to allow permissions to be applied more effectively.
A more robust login node has been introduced, featuring the same CPU architecture as the cluster nodes, which is ideal for compiling software.
If prompted during SSH login about a changed host key, run `ssh-keygen -f ~/.ssh/known_hosts -R "aoraki-login.otago.ac.nz"` to remove the old key, then reconnect.
New compute and GPU nodes have been added, including 5 high-memory (2TB) nodes, 4 high-CPU nodes, 4 A100 GPU nodes, and 1 H100 GPU node, in addition to our existing compute nodes.
New Slurm partitions have been created.