Cluster FAQ
What should I do if my home directory is full?
When your home directory is full on the cluster, you may encounter issues with logging in, running jobs, or saving files. Follow the steps below to diagnose and resolve the issue.
Check what’s using space:
Use the following command to see which files and directories are taking up the most space:
du -ahx --max-depth=2 $HOME | sort -k1 -rh
du -ahx: Lists all files and directories with human-readable sizes.
--max-depth=2: Limits the depth of directory traversal (you can increase it if needed).
sort -k1 -rh: Sorts the results from largest to smallest in human-readable format.
Clean up unnecessary files:
Once you’ve identified the largest files or folders, you can:
Delete unnecessary files:
rm large_file.bam
Compress and archive old directories:
tar -czf old_data.tar.gz old_data/
Move or migrate data:
Research data and large files should not be stored in your home directory. Instead, request a project directory on shared or scratch storage.
Contact the RTIS support team to request project or scratch storage.
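For example, once you have been allocated project or scratch space, you could copy a large directory out of your home directory with rsync and remove the original after checking the copy (the project path below is only a placeholder; use the path you are given):
rsync -av $HOME/old_data/ /projects/my_project/old_data/
# Verify the copy completed successfully before deleting the original
rm -r $HOME/old_data/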
Manage Conda environments:
Conda environments can consume significant space in your home directory. We recommend managing them carefully to reduce storage usage.
Consider creating environments in a project directory instead of the default location in $HOME/.conda.
Periodically remove unused environments.
For detailed advice, see our guide:
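As a minimal sketch (the project path is a placeholder), you can create environments outside your home directory with --prefix, remove environments you no longer use, and clear the package cache:
# Create an environment under project storage instead of $HOME/.conda
conda create --prefix /projects/my_project/envs/analysis python=3.11
# Activate it by path
conda activate /projects/my_project/envs/analysis
# Remove an environment you no longer need
conda env remove --prefix /projects/my_project/envs/old_env
# Clear cached packages and tarballs from your home directory
conda clean --all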
Prevent future issues:
Regularly monitor your disk usage.
Avoid saving large SLURM job outputs in your home directory.
Use project or scratch space for compute-intensive or large-scale data workloads.
How long do I have to wait for my job to start?
There is no guaranteed way to know the exact wait time for a SLURM job, but you can get a good estimate using SLURM commands and by checking cluster utilisation.
Check your job in the queue:
Use the following command to see your jobs and their status:
squeue -u $USER
Look for the ST (state) and START_TIME columns. PD means "pending," and R means "running."
Estimated start time:
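Depending on how the scheduler is configured, squeue can also report an expected start time for pending jobs directly; the estimate may be blank if SLURM has not yet planned the job:
squeue -u $USER --start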
After submitting a job, you can check its estimated start time (if available):
scontrol show job <jobid> | grep StartTime
This shows SLURM’s estimated start time, but be aware that this can change as other jobs are submitted or finish.
Tip:
If you need your jobs to start quickly, consider requesting fewer resources (such as fewer CPUs, less memory, or a shorter time limit) or being flexible with which partition you use.
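As an illustration (the partition name and resource values are examples only, not site recommendations), a smaller request supplied on the command line might look like this:
sbatch --partition=short --time=01:00:00 --cpus-per-task=2 --mem=4G myscript.sh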
This command can be useful to list configured and allocated resources for all nodes:
for i in {01..09} {11..12} {14..43}; do echo aoraki$i; scontrol show node aoraki$i | grep TRES; done
Can I extend the time limit of a running job?
SLURM jobs cannot be extended by users past their requested wall time. If your job exceeds the time limit, SLURM will terminate it.
You cannot extend the session yourself while the job is running.
If your code does not save progress regularly, you may lose data.
SLURM will send a termination signal (SIGTERM, then SIGKILL) when the time limit is reached.
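If your software can save state on demand, one possible sketch (not a guaranteed recipe) is to ask SLURM for an early warning signal and trap it in the batch script; the program name and checkpoint step below are placeholders for whatever your application provides:
#!/bin/bash
#SBATCH --time=02:00:00
#SBATCH --signal=B:USR1@120   # send USR1 to the batch shell 120 seconds before the time limit

# Placeholder handler: replace with your application's own save/checkpoint command
trap 'echo "Time limit approaching, saving checkpoint"' USR1

./my_long_computation &   # hypothetical long-running program
wait                      # run the work in the background and wait so the trap can fire promptly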
Best practices:
Always write output or checkpoints regularly (for example, save to a file every few minutes or after each iteration).
Use software that supports checkpointing or periodic saves.
If you need more time, cancel the job and resubmit it with a longer time limit before the original limit is reached.
Recommendation:
Estimate your required run time generously and always save your work periodically. There is no way to automatically extend the job run time while it is running.
Only an administrator can extend the time limit of a running job.
If you need an extension, contact an administrator; they can adjust the job's time limit if resources are available and it does not conflict with other jobs.
Summary:
You can check your job's estimated wait time with squeue and scontrol show job <jobid>, but these are only estimates and can change.
SLURM jobs will be killed when they exceed their time limit, so regular saving/checkpointing is essential.
There is no way to automatically extend a running job's time limit.
How do I check the status of my job(s)?
You can use the squeue command to check the status of all your jobs:
squeue -u $USER
This will display your jobs, including job ID, partition, name, and status.
The ST column shows the state: PD = Pending, R = Running, CG = Completing, etc.
How do I cancel a running or pending job?
To cancel a job, use:
scancel <jobid>
You can find your job ID using squeue -u $USER.
How do I see why my job is pending?
Use:
scontrol show job <jobid>
Look for the Reason= field in the output. It will tell you why the job isn’t running (e.g., resources unavailable, priority, dependency).
How do I get notified when my job finishes or fails?
Add these lines to your SLURM script:
#SBATCH --mail-type=END,FAIL
#SBATCH --mail-user=your.email@domain.com
This will send an email to you when your job ends or fails.
Where do I find the output and error logs for my job?
By default, SLURM writes standard output and error to a file named slurm-<jobid>.out in the directory where you submitted the job.
You can customize the output and error file names with:
#SBATCH --output=myjob.out
#SBATCH --error=myjob.err
How do I set job dependencies (e.g., run a job after another finishes)?
Use the --dependency option when submitting your job:
sbatch --dependency=afterok:<jobid> myscript.sh
This will start your job only after the specified job completes successfully.
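A common pattern is to capture the first job's ID with --parsable and use it for the dependent submission (the script names are just examples):
jid=$(sbatch --parsable step1.sh)        # --parsable prints only the job ID
sbatch --dependency=afterok:$jid step2.sh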
How do I check cluster/node usage and availability?
Use:
sinfo
This displays the status of partitions and nodes.
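For a more detailed per-node view (node state, CPUs, memory), you can also try:
sinfo -N -l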
Why did my job fail?
Check the job’s output and error logs (e.g., slurm-<jobid>.out) for error messages.
You can also view a summary of job exit codes and status with:
sacct -j <jobid> --format=JobID,State,ExitCode
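If the logs are inconclusive, adding elapsed time and peak memory fields can help distinguish a time-limit kill from an out-of-memory failure:
sacct -j <jobid> --format=JobID,State,ExitCode,Elapsed,MaxRSS,ReqMem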
How do I change or cancel a job after it starts running?
To cancel a job, use:
scancel <jobid>
To change resources, you must cancel the job and resubmit it with new resource requests.
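For example, you might cancel the job and resubmit with larger requests on the command line (the values are illustrative; command-line options override the #SBATCH directives in the script):
scancel <jobid>
sbatch --cpus-per-task=8 --mem=32G --time=12:00:00 myscript.sh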