Maintenance Notifications
This page provides general notifications about the Aoraki Research Cluster, including its maintenance schedule.
To carry out critical updates while minimising downtime, the Aoraki Cluster undergoes planned maintenance windows. During these periods, logins are disabled and jobs are paused or stopped, depending on the scope of the maintenance. Maintenance notifications are sent by email to all users, and updates are posted on this page.
Upcoming scheduled maintenance dates
Upcoming Maintenance: September 5–7, 2025
We will be performing scheduled maintenance on the Aoraki Research Cluster and associated services from:
5:00pm Friday, 5 September – 7:00am Monday, 8 September 2025 (NZST)
During this time, all cluster services including OnDemand and CryoSPARC will be unavailable, and the login node will be temporarily disabled.
Planned Work
The following system upgrades and changes will be performed:
Enable SLURM QOS (Quality of Service): Introduce SLURM QOS levels to support differentiated scheduling policies, job prioritisation, and fair-share configurations (see the submission example after this list).
Install PowerScale Isilon multipath drivers: Improve fault tolerance and performance for storage connections by enabling multipath support for Isilon volumes.
Set the SLURM temporary directory on Weka: Redirect SLURM job temporary storage to a high-performance Weka filesystem location for better I/O performance.
Install and enable DCGMI: The NVIDIA Data Center GPU Manager (DCGM) and its command-line tool, dcgmi, monitor and manage NVIDIA GPUs in data centres. They provide real-time health, power, performance, and utilisation metrics, helping administrators optimise GPU workloads and ensure system reliability (see the dcgmi sketch after this list).
Enable SLURM Performance Statistics: SLURM Performance Statistics (SPS) is a plugin that collects and reports detailed job and system performance metrics, such as CPU, memory, and GPU usage. It enables advanced monitoring and accounting for optimising workload efficiency and system utilisation.
Set `/opt/weka` as a bind mount (currently a symlink): Update `/opt/weka` to use a proper bind mount for improved stability.
Restart the login node to apply configuration changes: Reboot the login node to apply system-level changes and complete the maintenance.
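As noted for the QOS item above, jobs will be able to request a QOS at submission time once the levels are enabled. A minimal sketch using standard SLURM client commands; the QOS name `long` is a placeholder, not a confirmed level:

```bash
# List the QOS levels defined on the cluster and their limits:
sacctmgr show qos format=Name,Priority,MaxWall

# Request a specific QOS when submitting a job ("long" is a placeholder name):
sbatch --qos=long --time=48:00:00 my_job.sh
```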
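For the DCGMI item, the dcgmi tool can be used to inspect GPU health and utilisation once DCGM is installed. A brief sketch using standard dcgmi subcommands; the field IDs are assumed to be the usual utilisation and memory counters:

```bash
# List the GPUs visible to the DCGM host engine:
dcgmi discovery -l

# Sample GPU utilisation (field 203) and framebuffer memory used (field 252) five times:
dcgmi dmon -e 203,252 -c 5
```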
Completed: 25–28 April 2025 Maintenance Details
Date and Time: 5:00pm Friday, 25 April – 7:00am Monday, 28 April 2025 (NZST)
Note: All jobs will be stopped at 5:00pm Friday, 25 April.
Impact on Services
The Aoraki cluster and OnDemand services will be unavailable throughout the maintenance window.
Access to the login node will be temporarily disabled.
Other dependent services, including CryoSPARC, will also be affected.
Action Required
Please plan your computational tasks accordingly to ensure that any critical jobs are completed before 5:00pm Friday, 25 April.
Save your work and log out of the Aoraki cluster and CryoSPARC nodes before the maintenance begins.
Purpose
This maintenance is essential to implement several key upgrades. Planned work includes:
Expansion of the cluster with additional nodes
Storage software upgrade
Network migration
Enabling cgroups on all GPUs for better resource control
Storage configuration updates, including:
Unmounting the current /scratch directory to finalise its deprecation.
Aoraki Research Cluster November 2024 Maintenance Complete
The scheduled maintenance for the Aoraki Research Cluster was successfully completed on November 10, 2024. The upgrade included several significant improvements aimed at enhancing performance, reliability, and resource management for all users.
Key Enhancements:
Fairshare Scheduling with SLURM Accounting: SLURM Accounting is now enabled to monitor resource usage effectively, supporting fairshare scheduling for a balanced distribution of workloads.
Enhanced SLURM Clustered Database and Secondary Controller: A new SLURM clustered database and secondary controller have been implemented to improve resource management and redundancy. This addition strengthens failover capabilities, ensuring uninterrupted SLURM operations in the event of a single-point failure.
GPU Visibility with Cgroups Device Constraints: Cgroups device constraints have been added to GPU nodes, allowing users to access only the GPUs they have requested. This change minimizes conflicts and promotes fair GPU sharing across workloads.
AUKS SLURM Plugin for Kerberos Authentication: The AUKS SLURM plugin has been integrated, enabling jobs to access Kerberos-protected network resources for up to 7 days without user intervention, simplifying authentication needs during job execution.
Change Log 10-11 November 2024
SLURM Controller Daemon Upgrade: Upgraded the slurmctld daemon to version 23.02.0-json, enabling both Lua and JSON capabilities.
Secondary SLURM Control Node: Added a secondary SLURM control node.
SLURM Database Migration: Migrated the SLURM database to a Galera cluster.
Enhanced SLURM Database Backup and Monitoring: Added new backup and monitoring features for the SLURM database.
GPU Cgroups Constraints: Implemented cgroups constraints on GPU resources to ensure fair resource distribution.
Weka Storage Upgrade: Upgraded Weka storage to version 4.2.17.
Weka Container Configuration Update: Configured Weka containers to use 2 CPUs on each node.
CPU Reservation for Weka in SLURM Configuration: Updated the SLURM configuration to reserve CPU resources specifically for Weka operations.
SLURM Accounting User and Group Import: Imported all users and groups into SLURM accounting for enhanced tracking and management.
AUKS Installation and Configuration: Installed and configured AUKS to enable Kerberos authentication across cluster nodes.
Node Renaming for Active Directory Compliance: Renamed all nodes to comply with the Active Directory naming scheme and rejoined the domain.
OnDemand Reconfiguration: Reconfigured OnDemand to utilise the new node names.
L40 GPU Node Activation for OnDemand Desktop Use: Enabled the L40 GPU nodes for use in OnDemand desktop environments.
9-10 November 2024 - Aoraki Research Cluster Maintenance Updates
As part of the scheduled maintenance for the Aoraki Research Cluster over the weekend of 9-10 November 2024, we will be implementing several updates to enhance system performance, resource allocation, and overall reliability. Below are the key changes that will be applied during this outage:
Implementation of New SLURM Clustered Database and Secondary Controller
To improve resource management and increase redundancy, we will be introducing a new SLURM clustered database along with a secondary controller. This upgrade is critical to ensure better failover capabilities and maintain SLURM operations in the event of any single-point failure.
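Once the secondary controller is in place, its status can be checked from any node with a standard SLURM command:

```bash
# Reports whether the primary and backup slurmctld daemons are responding:
scontrol ping
```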
Enabling SLURM Accounting for Fairshare Scheduling
SLURM Accounting will be enabled to track resource usage and ensure that fairshare scheduling is properly implemented. This will help balance workload distribution across users, giving fair access to cluster resources based on usage history.
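After accounting is enabled, you will be able to review your own usage and fair-share standing with the standard SLURM reporting tools. A brief sketch; the start date and format fields are examples only:

```bash
# Show fair-share factors for your account associations:
sshare -u $USER

# Summarise your recent jobs from the accounting database:
sacct -u $USER --starttime=2024-11-11 --format=JobID,JobName,Partition,Elapsed,State
```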
CPU Reservation for Weka (Recommended for 100G)
Each node will now have 2 CPUs reserved specifically for Weka filesystem network services. This is in line with the recommended configuration for 100G networking and will help to optimize I/O performance and system responsiveness, particularly for jobs involving large data.
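This kind of reservation is typically expressed in slurm.conf via core specialisation (for example `CoreSpecCount`), which keeps the reserved cores out of the pool SLURM allocates to jobs; the exact parameter used on Aoraki is an assumption here. One way to see the effect on a node (the node name is a placeholder):

```bash
# Compare the total CPUs on a node with what SLURM will actually allocate to jobs:
scontrol show node aoraki-cn01 | grep -E 'CPUTot|CPUAlloc|CoreSpecCount'
```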
Cgroup Device Constraints for GPU Visibility
Cgroup device constraints will be added to all GPU nodes. This change is essential for improving GPU allocation and visibility, ensuring that users can only see and use the GPUs they have specifically requested. Currently, users can see or attempt to use all GPUs on a node, even if those GPUs are allocated to other jobs. This update will prevent such conflicts and ensure fair resource sharing.
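In practice this relies on SLURM's cgroup device enforcement. From a user's point of view, the effect can be checked by running nvidia-smi inside a job allocation; a quick sketch, assuming the usual GPU GRES setup:

```bash
# Request one GPU and look at what the job can see; on a multi-GPU node,
# only the single allocated GPU should be listed once device constraints are active.
srun --gres=gpu:1 --pty nvidia-smi
```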
Implementing GPU Sharding on L40 Nodes
To further optimize GPU resource management, we will be introducing GPU sharding on the L40 nodes. This will allow better partitioning of GPU resources, giving users more flexibility and control over their workloads while minimizing GPU contention between jobs.
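SLURM exposes this partitioning through the `shard` GRES, which lets several jobs share one physical GPU. A hedged submission sketch; the partition name and the number of shards configured per L40 are assumptions:

```bash
# Request a fraction of an L40 GPU via the shard GRES (names and values are placeholders):
sbatch --partition=gpu --gres=shard:1 --time=04:00:00 my_job.sh
```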
Implementing AUKS SLURM Plugin
The AUKS SLURM plugin extends the Kerberos authentication system in HPC environments. By integrating AUKS with SLURM, jobs can access network resources (HCS) that require Kerberos authentication without user intervention during job execution. After the initial user’s Kerberos ticket is obtained on the login node and added to the SLURM AUKS repository, network resources are accessible on all cluster nodes for 7 days.
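In practice, the workflow from the login node looks roughly like the following; the auks client options shown are assumed to be the standard ones:

```bash
kinit              # obtain your Kerberos ticket on the login node
auks -a            # add (push) the ticket to the AUKS repository
sbatch my_job.sh   # jobs submitted afterwards can reach Kerberos-protected resources for up to 7 days
```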
We appreciate your understanding and cooperation during this maintenance period. If you have any questions or concerns, please contact our support team at rtis.solutions@otago.ac.nz.
30th May 2024 - Research Cluster Maintenance
CHANGES AND FIXES IMPLEMENTED:
Breaking changes:
Memory usage is now limited to the amount specified in your SLURM job script.
Partitions now have time limits; all jobs must now specify a time limit (see the example script below).
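A minimal job script reflecting both changes might look like the following; the partition name and resource values are placeholders only:

```bash
#!/bin/bash
#SBATCH --job-name=example
#SBATCH --partition=aoraki       # placeholder partition name
#SBATCH --cpus-per-task=2
#SBATCH --mem=8G                 # memory is now enforced at the requested amount
#SBATCH --time=02:00:00          # a time limit is now required on every job

srun ./my_program
```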
Other changes/fixes:
Storage has been updated to allow permissions to be applied more effectively.
A more robust login node has been introduced, featuring the same CPU architecture as the cluster nodes, which is ideal for compiling software.
If prompted during SSH login about a changed host key, run `ssh-keygen -f ~/.ssh/known_hosts -R "aoraki-login.otago.ac.nz"` to remove the old key, then reconnect.
New compute and GPU nodes have been added, including 5 high-memory (2TB) nodes, 4 high-CPU nodes, 4 A100 GPU nodes, and 1 H100 GPU node, in addition to our existing compute nodes.
New Slurm partitions have been created.