Connecting to the Cluster
The Otago cluster's compute and GPU resources are accessible via SSH or via a web browser. These resources are managed by the SLURM job scheduling system.
For access to the cluster please click the link below to go to the Otago research computing and cluster registration form (Aoraki Cluster).
Sign Up for Access
Connecting
There are two ways to access the Aoraki Cluster: via SSH or the OnDemand web portal.
To use SSH, open a terminal on your device and run:
ssh <otago-username>@aoraki-login.otago.ac.nz
Accept the host key fingerprint when prompted (answer yes), then enter your password when prompted.
Note
If you are using Student Wi-Fi or the VPN, you will need to use the alternate address aoraki-login-stu:
ssh <otago-username>@aoraki-login-stu.otago.ac.nz
If this doesn't work, check that bash is installed. For instructions on how to do this, go to https://www.w3schools.com/bash/bash_getstarted.php.
See Helpful commands in bash for tips on how to use the bash shell.
To connect to the Otago OnDemand web portal, go to https://ondemand.otago.ac.nz and log in with your Otago email address and password. See the OnDemand instructions at the OnDemand documentation page.
Slurm Job Scheduling
Slurm Overview
The Otago cluster uses the Simple Linux Utility for Resource Management (SLURM) for job management. Slurm is an open-source resource manager and job scheduling system for HPC (High-Performance Computing) which manages jobs, job steps, nodes, partitions (groups of nodes), and other entities on the cluster. Its main purpose is to allocate and manage resources on a cluster so that jobs can be run in a way that maximizes utilization of the available resources. To allow SLURM to manage resources effectively, jobs are submitted to a queue and the scheduler runs them, based on the requested resources, to make the best use of the cluster.
In contrast to the usual interactive mode of running commands on your computer, the main way of interacting with slurm is to ‘batch’ your jobs up and submit them. This ‘batching’ is usually in the form of a bash script which also includes meta information about the resource requirements such as the amount of time it’s expected to take, along with the number of CPUs and RAM needed. At a minimum, a time-limit needs to be specified for your job at submission.
The following is a summary of how to submit jobs and interact with the scheduler. Full documentation for slurm is available at https://slurm.schedmd.com/documentation.html
Slurm Workflow
What follows is a high-level overview of how Slurm works:
Users submit jobs to Slurm using the sbatch command. A job is a set of instructions for running a particular program or set of programs.
Slurm assigns resources to the job based on the requested resources and the availability of resources on the cluster. This includes CPU cores, memory, GPUs, and other resources.
Slurm creates a job step for each task in the job. A job step is a subunit of a job that runs on a single node.
Slurm schedules the job steps to run on the assigned resources. It takes into account the job’s dependencies, priorities, and other factors to ensure that jobs run in the most efficient way possible.
Once a job step is running, Slurm monitors it and manages it throughout its lifecycle. This includes managing I/O, handling errors, and collecting job statistics.
When a job is complete Slurm notifies the user and provides them with the output and any error messages that were generated.
Slurm provides a powerful set of tools for managing large-scale compute resources which allows users to run complex simulations and data analyses more efficiently and effectively.
Interacting with the SLURM scheduler
The following commands are used to find out information about the status of the scheduler and the jobs that have been submitted:
sinfo: View the status of the cluster's compute nodes. The output includes how many nodes, and of what types, are currently available for running jobs.
squeue: Check the current jobs in the batch queue system. Use squeue --me to view your own jobs.
scancel: Cancel a job based on its job ID. scancel 123 would cancel the job with ID 123. It is only possible to cancel your own jobs. scancel --me will cancel all of your jobs.
sacct: Display the job usage metrics after a job has run. This is useful to see the resource usage of a job, or to determine if it failed.
sacct -j <jobid>
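Taken together, a quick status check from the login node might look like this (the job ID is a placeholder):
sinfo                 # state of the nodes in each partition
squeue                # all queued and running jobs
squeue --me           # only your own jobs
scancel 123456        # cancel one of your jobs by ID
scancel --me          # cancel all of your jobs
sacct -j 123456       # usage metrics for a finished job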
Hint
sinfo will quickly tell you the state of the cluster and squeue
will show you all of the jobs running and in the queue.
Submitting Jobs
The following are commands that are used for submission of jobs:
sbatch: Submit a job to the batch queue system. Use sbatch myjob.sh to submit the Slurm job script myjob.sh. You can also provide or override the slurm parameters by supplying them at submission, e.g. sbatch --job-name=my_job myjob.sh. The values supplied on the commandline will override the values inside your script.
scancel: Cancel a job based on its job ID. scancel 123 would cancel the job with ID 123. It is only possible to cancel your own jobs. scancel --me will cancel all of your jobs.
Here we give details on job submission for various kinds of jobs in both batch (i.e., non-interactive or background) mode and interactive mode.
In addition to the key options of account, partition, and time limit (see below), your job script files can also contain options to request various numbers of cores, nodes, and/or computational tasks. There are also a variety of additional options you can specify in your batch files, if desired, such as email notification options when a job has completed. These are all described further below.
At a minimum, a time limit must be provided when submitting a job with --time=hh:mm:ss (replacing hh, mm, and ss with numeric values). This can be provided either as part of your job script or as a command-line parameter.
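For example, here is a minimal sketch of supplying the time limit on the command line (myjob.sh is a placeholder script name):
# submit myjob.sh with a 2 hour limit, overriding any --time set inside the script
sbatch --time=02:00:00 myjob.sh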
Defining Jobs
In order to submit a job to the scheduler using sbatch you first need to define the job through a script.
A job script specifies where and how you want to run your job on the cluster
and ends with the actual command(s) needed to run your job.
The job script file looks much like a standard shell script (.sh) file, but at the top also includes one or more lines that specify options for the SLURM scheduler.
These lines take the form of
#SBATCH --some-option-here
Although these lines start with hash signs (#), and thus are regarded as
comments by the shell, they are nonetheless read and interpreted by the SLURM
scheduler.
It is through these #SBATCH lines that the system resources are requested for the allocation that will run your job.
These parameters can also be supplied when calling sbatch at job submission. Parameters supplied in this way will
override the values in your job script.
Common parameters include:
- Meta
--time= (required) Time limit to be applied to the job. Supplied in format hh:mm:ss.
--job-name= / -J Custom job name.
--partition= aoraki (default) or aoraki_gpu.
--output= / -o File to save output from stdout.
--error= / -e File to save output from stderr.
--dependency= / -d Depends on a specified job ID finishing. Can be modified by completion status. See documentation.
--chdir= / -D Directory to change into before running the job.
- Memory (only supply one of these)
--mem= (default 8GB) Total memory for the job per node. Specify with units (MB, GB).
--mem-per-cpu= Amount of memory for each CPU (Slurm will total this). Specify with units (MB, GB).
- Parallelism
--cpus-per-task= / -c Number of cores being requested (default = 1).
--ntasks= Number of tasks (default = 1).
--array= Defines an array job.
--nodes= / -N Number of nodes to run the job across (default = 1).
The full list of parameters and their descriptions is available at https://slurm.schedmd.com/sbatch.html
Here is an example slurm script that would request a single cpu with an allocation of 4 GB of memory, and run for a maximum of 1 minute:
#!/bin/bash
#SBATCH --job-name=my_job # define the job name
#SBATCH --mem=4GB # request an allocation with 4GB of ram
#SBATCH --time=00:01:00 # job time limit of 1 minute (hh:mm:ss)
#SBATCH --partition=aoraki # 'aoraki' or 'aoraki_gpu' (for gpu access)
# usual bash commands go below here:
echo "my example script will now start"
sleep 10 # pretend to do something
echo "my example script has finished"
Hint
Finding Output
Output from running a SLURM batch job is, by default, placed in a file named
slurm-%j.out, where the job’s ID is substituted for %j; e.g.
slurm-478012.out.
This file will be created in your current directory; i.e. the directory from
within which you entered the sbatch command.
Also by default, both command output and error output (to stdout and stderr,
respectively) are combined in this file.
To specify alternate files for command and error output use:
--output: destination file for stdout
--error: destination file for stderr
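For example, to keep stdout and stderr in separate per-job files (the file names are just an illustration; %j expands to the job ID):
#SBATCH --output=myjob-%j.out   # stdout goes here
#SBATCH --error=myjob-%j.err    # stderr goes here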
Slurm Scheduler Example
Here is a minimal example of a job script file.
It will run unattended for up to 30 seconds on one of the compute nodes in the
aoraki partition, and will simply print out the text hello world.
#!/bin/bash
# Job name:
#SBATCH --job-name=test
#
# Partition:
#SBATCH --partition=aoraki
#
# Request one node:
#SBATCH --nodes=1
#
# Specify one task:
#SBATCH --ntasks-per-node=1
#
# Number of processors for single task needed for use case (example):
#SBATCH --cpus-per-task=4
#
# Wall clock limit:
#SBATCH --time=00:00:30
#
echo "hello world"
If the text of this file is stored in hello_world.sh you could run and
retrieve the result at the Linux prompt as follows
$ sbatch hello_world.sh
Submitted batch job 716
$ cat slurm-716.out
hello world
Note
By default the output will be stored in a file called slurm-<number>.out
where <number> is the job ID assigned by Slurm
Interactive Jobs
In some instances, you may need to use software that requires user interaction
rather than running programs or scripts in batch mode.
To do so, you must first start an interactive shell on a compute node from an Otago
login node, within which you can then run your software.
To run such an interactive job on a compute node, you'll use srun.
Here is a basic example that launches an interactive bash shell on a compute node
and includes the required partition and time options:
[user@aoraki-login ~]$ srun --pty --partition=aoraki --time=00:05:30 -N 1 -n 4 /bin/bash
Once your job starts, the prompt will change and indicate you are on a compute node rather than a login node:
srun: job 669120 queued and waiting for resources
srun: job 669120 has been allocated resources
[user@aoraki13 ~]
Memory Available
Also note that in all partitions except for GPU and HTC partitions, by default the full memory on the node(s) will be available to your job.
You should specify the amount using either the total memory required per node with --mem, or the amount of RAM required per CPU with --mem-per-cpu.
On the GPU and HTC partitions you get an amount of memory proportional to the number of cores your job requests relative to the number of cores on the node. For example, if the node has 64 GB and 8 cores, and you request 2 cores, you’ll have access to 16 GB memory.
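For example, either of the following lines (but not both) could be added to a job script; the amounts shown are arbitrary:
#SBATCH --mem=16GB          # 16 GB in total per node for the job
#SBATCH --mem-per-cpu=4GB   # or 4 GB for each requested CPU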
Parallelization
When submitting parallel code, you usually need to specify the number of tasks, nodes, and CPUs to be used by your job in various ways. For any given use case, there are generally multiple ways to set the options to achieve the same effect; these examples try to illustrate what we consider to be best practices.
The key options for parallelization are:
--nodes (or -N): indicates the number of nodes to use
--ntasks-per-node: indicates the number of tasks (i.e., processes) one wants to run on each node
--cpus-per-task (or -c): indicates the number of CPUs to be used for each task
In addition, in some cases it can make sense to use the --ntasks (or
-n) option to indicate the total number of tasks and let the scheduler
determine how many nodes and tasks per node are needed.
In general --cpus-per-task will be 1 except when running threaded code.
Note that if these options are not set, SLURM will in some cases infer the value
of an option from the other options that are set, and
in other cases will treat the value as being 1.
So some of the options set in the example below are not strictly necessary, but
we give them explicitly to be clear.
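As a minimal sketch of the --ntasks approach (the program name is a placeholder), you can request a total number of tasks and let Slurm decide the node layout:
#!/bin/bash
#SBATCH --ntasks=8          # eight tasks in total; Slurm decides how many nodes to use
#SBATCH --cpus-per-task=1
#SBATCH --time=00:05:00
# srun launches one copy of the program for each task in the allocation
srun ./my_program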
Here’s an example script that requests an entire Otago node and specifies 20 cores per task.
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --account=account_name
#SBATCH --partition=aoraki
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=20
#SBATCH --time=00:00:30
## Command(s) to run:
echo "hello world"
Only the partition, time, and account flags are required.
GPU Jobs
Please see the example job scripts below for such jobs.
The key things to remember are:
Submit to a partition with nodes with GPUs.
Include the --gres flag.
Request at least two CPUs for each GPU requested, using --cpus-per-task.
You can request multiple GPUs with syntax like this (in this case for two GPUs): --gpus-per-node=2
- The partition is used to specify a specific GPU, or how much GPU memory is needed
aoraki_gpu will get you any free GPU
aoraki_gpu_H100 will get you an entire H100 with 80 GB of GPU memory
aoraki_gpu_L40 will get you an entire L40 with 48GB of GPU memory
aoraki_gpu_A100_80GB will get you an A100 with 80GB of GPU memory to use
aoraki_gpu_A100_40GB will get you an A100 with 40GB of GPU memory to use
Note
You may see some scripts use the command-line option
--gres=gpu:2 to specify two GPUs. This way of specifying the number of GPUs to use is in the process of being deprecated.
Running a GPU job on Slurm involves specifying the required resources and submitting the job to the scheduler. Here are the basic steps to run a GPU job on Slurm:
Request the required resources. In order to run a GPU job on Slurm, you need to specify the number of GPUs and the amount of memory required. For example, to request a single GPU with 16GB of CPU memory, you would add the following lines to your Slurm job script:
#SBATCH --gpus-per-node=1
#SBATCH --mem=16GB # 16 GB CPU memory
Load the necessary modules. Depending on the software and libraries you are using, you may need to load additional modules to access the GPU resources. This can usually be done using the module load command. For example, to load the CUDA toolkit:
module load cuda
Write the job script. Create a job script that specifies the commands and arguments needed to run your GPU job. This can include running a CUDA program, a TensorFlow script, or any other GPU-accelerated code.
Submit the job. Use the sbatch command to submit the job script to the Slurm scheduler. For example:
sbatch my_gpu_job.sh
Once your job is submitted, Slurm will allocate the requested resources and schedule the job to run on a node with the appropriate GPU. You can monitor the status of your job using the squeue command and view the output using the sacct command once the job completes.
Here’s an example script that will return the information on the GPU available
on aoraki_gpu:
#!/bin/bash
#SBATCH --account=account_name
#SBATCH --partition=aoraki_gpu
#SBATCH --gpus-per-node=1
#SBATCH --mem=4GB
#SBATCH --time=00:00:30
nvidia-smi
Hint
If you want to run a GPU job interactively, you can create a Slurm session on a GPU node (partition aoraki_gpu_L40 in this example) using the following command, which simply adds the --gres=gpu:1 flag to srun:
srun --ntasks=1 --partition=aoraki_gpu_L40 --gres=gpu:1 --cpus-per-task=4 --time=0-03:00 --mem=50G --pty /bin/bash
For a slightly more involved example, consider the following CUDA C code.
#include<stdio.h>
#define BLOCKS 2
#define WIDTH 16
__global__ void whereami() {
printf("I'm thread %d in block %d\n", threadIdx.x, blockIdx.x);
}
int main() {
whereami<<<BLOCKS, WIDTH>>>();
cudaDeviceSynchronize();
return 0;
}
If this is stored in the file whereami.cu and compiled with
nvcc whereami.cu -o whereami we can use the Slurm job script
#!/bin/bash
#SBATCH --account=account_name
#SBATCH --partition=aoraki_gpu
#SBATCH --gpus-per-node=1
#SBATCH --mem=4GB
#SBATCH --time=00:00:30
./whereami
to obtain output such as the following (ordering of lines may differ):
I'm thread 0 in block 1
I'm thread 1 in block 1
I'm thread 2 in block 1
I'm thread 3 in block 1
I'm thread 4 in block 1
I'm thread 5 in block 1
I'm thread 6 in block 1
I'm thread 7 in block 1
I'm thread 8 in block 1
I'm thread 9 in block 1
I'm thread 10 in block 1
I'm thread 11 in block 1
I'm thread 12 in block 1
I'm thread 13 in block 1
I'm thread 14 in block 1
I'm thread 15 in block 1
I'm thread 0 in block 0
I'm thread 1 in block 0
I'm thread 2 in block 0
I'm thread 3 in block 0
I'm thread 4 in block 0
I'm thread 5 in block 0
I'm thread 6 in block 0
I'm thread 7 in block 0
I'm thread 8 in block 0
I'm thread 9 in block 0
I'm thread 10 in block 0
I'm thread 11 in block 0
I'm thread 12 in block 0
I'm thread 13 in block 0
I'm thread 14 in block 0
I'm thread 15 in block 0
Job Accounting / Efficiency
To view your job information you can use the sacct command.
To view detailed job information:
sacct --format="JobID,JobName,Elapsed,AveCPU,MinCPU,TotalCPU,Alloc,NTask,MaxRSS,State" -j <job_id_number>
e.g.
sacct --format="JobID,JobName,Elapsed,AveCPU,MinCPU,TotalCPU,Alloc,NTask,MaxRSS,State" -j 321746
JobID JobName Elapsed AveCPU MinCPU TotalCPU AllocCPUS NTasks MaxRSS State
------------ ---------- ---------- ---------- ---------- ---------- ---------- -------- ---------- ----------
321746 ondemand/+ 23:11:07 00:05.337 4 CANCELLED+
321746.batch batch 23:11:08 00:00:00 00:02:56 00:05.337 4 1 683648K CANCELLED
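To quickly check whether a job succeeded, a shorter format string also works; for example (using the job ID from above):
sacct -j 321746 --format=JobID,JobName,State,ExitCode,Elapsed,MaxRSS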
SLURM Examples
These examples can be found at https://appsgit.otago.ac.nz/projects/RTIS-SP/repos/slurm-code-examples/browse
Or downloaded and browsed on the cluster by:
git clone https://appsgit.otago.ac.nz/scm/rtis-sp/slurm-code-examples.git
Examples of using R with SLURM
If you have downloaded the repository, the following code examples are in the r_examples subdirectory:
cd slurm-code-examples/r_examples/
SLURM job calling an R script
This pair of scripts shows how an R script can be run using the SLURM scheduler.
Create the R script hello_rscript.R with the following contents:
print("hello")
Create the slurm script run_hello_rscript.sh with the following contents:
#!/bin/bash
#SBATCH --mem=512MB
#SBATCH --time=00:01:00
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
# load R v4.2.2 through spack
# `spack install r@4.2.2` will need to have been run first
spack load r@4.2.2
# run the rscript
Rscript hello_rscript.R
The job can now be submitted to the scheduler with:
[user@aoraki-login r_examples]$ sbatch run_hello_rscript.sh
Example of using multiple cores within a single Rscript
The following are two examples of creating a job that uses multiple cores on a single node, with R managing the parallelism. In R, either the parallel package or the future package is (mainly) used to do this. The parallelly package provides some bonus functionality to both.
The key difference from the previous example is that we need to define how to parallelise the code. This is done through parallel or future, and we need to set the number of cores/CPUs to use for the parallelism within the SLURM script through --cpus-per-task, which gets stored in the bash environment variable $SLURM_CPUS_PER_TASK.
There are multiple approaches to how this can be done; two are demonstrated in this example.
The overall task demonstrated is to calculate the means for sub-groups of a dataset in parallel.
The first approach uses the packages parallel and doParallel to create and manage the parallelism within R.
It will create and register a ‘cluster’ of ‘workers’ within R and then pass each sub-task to a worker.
r_multicore_example-parallel.R:
library(parallelly)
library(parallel)
library(doParallel)
# automatically detect number of available cores
ncpu <- parallelly::availableCores()
doParallel::registerDoParallel(cores = ncpu)
# calculate the mean sepal length for each iris species in parallel
mean_petal_lengths <- foreach(species = unique(iris$Species), .combine = 'c') %dopar% {
m <- mean(iris[iris$Species == species, "Sepal.Length"])
names(m) <- species
return(m)
}
print(mean_petal_lengths)
run_multicore_r_example-parallel.sh:
#!/bin/bash
#SBATCH --mem=512MB
#SBATCH --time=00:01:00
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1
# load R v4.2.2 through spack
spack load r@4.2.2
# run the rscript
Rscript r_multicore_example-parallel.R
Submitting the job:
[user@aoraki-login r_examples]$ sbatch run_multicore_r_example-parallel.sh
The following is the same example as above, but instead implemented using the furrr package to parallelise purrr code using future.
r_multicore_example-furrr.R:
library(parallelly)
library(furrr)
ncpus <- parallelly::availableCores()
plan("multisession", workers = ncpus)
mean_species <- function(x){
mean(iris[iris$Species == x, "Sepal.Length"])
}
species <- unique(iris$Species)
mean_petal_lengths <- furrr::future_map_dbl(species, mean_species)
names(mean_petal_lengths) <- species
print(mean_petal_lengths)
run_multicore_r_example-furrr.sh
#!/bin/bash
#SBATCH --mem=512MB
#SBATCH --time=00:01:00
#SBATCH --cpus-per-task=4
#SBATCH --ntasks=1
# load R v4.2.2 through spack
spack load r@4.2.2
# run the rscript
Rscript r_multicore_example-furrr.R
Submitting the job:
[user@aoraki-login r_examples]$ sbatch run_multicore_r_example-furrr.sh
SLURM array job with R
An array job allows you to define the resources for a single job but run many instances of it.
Instead of submitting many individual jobs, it is best to use an array job as it is more efficient for the scheduler.
A common use case would be to define a job for a sample, and then run the job on all samples in parallel.
SLURM facilitates this through the array job type. For an array job there are two bash environment variables that you can
make use of: SLURM_JOBID which is the overall job, and SLURM_ARRAY_TASK_ID which is the id assigned to
that specific “subjob”. A common use for the task id would be to use it as an index on a sample sheet.
A second benefit of using an array over individually submitting many of the same job is that if you want to rerun a job, you can specify specific array indexes instead of many different job IDs.
To make use of these variables inside your R task, you can either pass them through as a command-line argument or you can access them from within R with Sys.getenv().
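The task ID can also be used directly in the job script itself; a minimal bash sketch (%A and %a in the output filename expand to the array job ID and task ID):
#SBATCH --array=1-3
#SBATCH --output=array_%A_%a.out    # one output file per array task
echo "this is array task ${SLURM_ARRAY_TASK_ID} of job ${SLURM_JOBID}"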
The scenario for the example will again be calculating the mean for each species in the iris data set, but this time, instead of using a single job with multiple cores, we'll use a single job per species utilising a SLURM array.
Here is an example of running 3 jobs in parallel using a slurm array, passing the array index as a command-line argument to R.
r_array_job_example-args.R:
args <- commandArgs(trailingOnly = TRUE)
# args come in as strings so need to convert to numeric type
id <- as.numeric(args[1])
message(paste("SLURM_ARRAY_TASK_ID was: ", id))
# determine which of the species is the target for the job
job_species <- unique(iris$Species)[id]
species_mean_sepal_length <- mean(iris[iris$Species == job_species, "Sepal.Length"])
results <- data.frame(species = job_species, mean_sepal_length = species_mean_sepal_length)
write.csv(results, paste0(job_species,".csv"), row.names = FALSE)
run_array_job_example-args.sh:
#!/bin/bash
#SBATCH --mem=512MB
#SBATCH --time=00:01:00
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --array=1-3 # run three of this job with the indexes 1,2,3
# load R v4.2.2 through spack
spack load r@4.2.2
# run the rscript
Rscript r_array_job_example-args.R ${SLURM_ARRAY_TASK_ID}
Submitting the job:
[user@aoraki-login r_examples]$ sbatch run_array_job_example-args.sh
And here is an example of the same job but instead accessing the system environment variable from within R:
r_array_job_example-env.R:
# read the environment variable SLURM_ARRAY_TASK_ID and convert to numeric data type
id <- as.numeric(Sys.getenv(x = "SLURM_ARRAY_TASK_ID"))
message(paste("SLURM_ARRAY_TASK_ID was: ", id))
# determine which of the species is the target for the job
job_species <- unique(iris$Species)[id]
species_mean_sepal_length <- mean(iris[iris$Species == job_species, "Sepal.Length"])
results <- data.frame(species = job_species, mean_sepal_length = species_mean_sepal_length)
write.csv(results, paste0(job_species,".csv"), row.names = FALSE)
run_array_job_example-env.sh:
#!/bin/bash
#SBATCH --mem=512MB
#SBATCH --time=00:01:00
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
#SBATCH --array=1-3 # run three of this job with the indexes 1,2,3
# load R v4.2.2 through spack
spack load r@4.2.2
# run the rscript
Rscript r_array_job_example-env.R
Submitting the job:
[user@aoraki-login r_examples]$ sbatch run_array_job_example-env.sh
One disadvantage of this approach is that it can be harder to develop and debug the R code, as it relies on environment variables being set prior to execution rather than on values passed as command-line arguments at runtime.
SLURM job dependencies with R
SLURM job dependencies allow you to link jobs together so that subsequent jobs only run depending on the status of their prerequisite jobs. This allows you to create workflows with different job requirements for the different stages.
When submitting, the -d (or --dependency=) option is supplied to sbatch with the ID of the job to depend on, e.g. sbatch -d 1234.
The main advantage of using dependencies is that SLURM will coordinate the running (or cancelling, if a prerequisite failed) of downstream dependent jobs.
There are extra options for the dependencies such as -d afterok:<jobid> and -d afterany:<job_id>. See https://slurm.schedmd.com/sbatch.html for more information regarding the -d option.
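For example (the job ID and script name are placeholders):
# run second_step.sh only if job 1234 finishes successfully
sbatch --dependency=afterok:1234 second_step.sh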
In the previous example we output the mean for each group, but it would be good to have a single output that summarises the results. We could create a second job that is dependent on the previous job completing before it runs, to take the results and combine them together. This type of workflow is often referred to as a scatter-gather, as there is a scattering phase to calculate results per group and a gathering phase to combine them back together.
r_combine_results.R:
results_list <-list()
for(species in unique(iris$Species)){
results_list[[species]] <- read.csv(paste0(species,".csv"), header=TRUE)
}
combined_results <- do.call(rbind, results_list)
write.csv(combined_results, "combined_results.csv", row.names=FALSE)
run_combine_example.sh
#!/bin/bash
#SBATCH --mem=512MB
#SBATCH --time=00:01:00
#SBATCH --cpus-per-task=1
#SBATCH --ntasks=1
# load R v4.2.2 through spack
spack load r@4.2.2
# run the rscript
Rscript r_combine_results.R
Now, in order to run this as a dependency, when submitting we supply sbatch with -d <jobid>, where <jobid> is the ID of the job that must run first.
e.g.
[user@aoraki-login r_examples]$ sbatch run_array_job_example-args.sh
Submitted batch job 323173
[user@aoraki-login r_examples]$ sbatch -d 323173 run_combine_example.sh
Submitted batch job 323174
This requires you to pay attention to the job number, however, so instead we can wrap the submission in a bash script that automatically grabs the job ID:
run_scatter_gather.sh
# store the output from sbatch into a variable
FIRST_JOB=$(sbatch run_array_job_example-args.sh)
# extract out the job id from the variable
# use space as a delimiter and take the 4th field
FIRST_JOB_ID=$(echo "${FIRST_JOB}" | cut -d" " -f 4)
# submit the second job setting the dependency on the first
sbatch -d ${FIRST_JOB_ID} run_combine_example.sh
Run the script to do the submission:
bash run_scatter_gather.sh
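As an alternative to parsing the "Submitted batch job" message, sbatch's --parsable flag prints just the job ID (followed by the cluster name, if set, separated by a semicolon); a minimal sketch of the same workflow:
#!/bin/bash
# --parsable makes sbatch print only the job ID, so no text parsing is needed
FIRST_JOB_ID=$(sbatch --parsable run_array_job_example-args.sh)
# submit the gather step with a dependency on the scatter job
sbatch -d ${FIRST_JOB_ID} run_combine_example.sh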
Examples of using python with SLURM
When running conda inside a SLURM job, SLURM doesn't automatically read your .bashrc file, which is where conda init inserts the block of code that runs the conda profile scripts and makes conda available on your path. To get around this, you need to source the conda profile script within your SLURM script.
Assuming you have installed conda as per this guide, an example SLURM script for running it is shown below.
For a path-based conda environment (it is recommended to install environments into project directories instead of using named environments):
#!/bin/bash
#SBATCH --job-name='conda path env test'
#SBATCH --account=INSERT_YOUR_ACCOUNT_NAME
#SBATCH --time=00:05:00
#SBATCH --mem=512MB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --output=path_env_test.log
source ~/miniforge3/etc/profile.d/conda.sh
export PYTHONNOUSERSITE=1 # don't add python user site library to path
conda activate ~/test_env
python ~/python_version.py
Or, if you had created a named environment:
#!/bin/bash
#SBATCH --job-name='conda named env test'
#SBATCH --account=INSERT_YOUR_ACCOUNT_NAME
#SBATCH --time=00:05:00
#SBATCH --mem=512MB
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --output=named_env_test.log
source ~/miniforge3/etc/profile.d/conda.sh
export PYTHONNOUSERSITE=1 # don't add python user site library to path
conda activate named_env
python ~/python_version.py
The python script used in this example was:
import sys
print(sys.version)