
GPU

Spartan has 31 GPU nodes, each with four 80GB Nvidia A100 GPUs, 495,000MB of RAM, and 32 CPU cores. They are available to all University of Melbourne researchers with a Spartan account.

Access

Unlike the old LIEF GPGPU platform, you do not need to specify a QoS in your Slurm submit scripts. Remove any QoS directive before you submit, or set it to "normal". For example:

#SBATCH -p gpu-a100
#SBATCH --gres=gpu:1

This will request 1 GPU on the gpu-a100 partition.
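Putting this together, a complete single-GPU submit script might look like the sketch below. The job name, CPU and time figures, module version, and program name are placeholders to substitute with your own (run module avail to see which CUDA toolkits are actually installed):

#!/bin/bash
#SBATCH -p gpu-a100
#SBATCH --gres=gpu:1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=02:00:00
#SBATCH --job-name=gpu-example

# Load a CUDA toolkit (version is illustrative; check module avail).
module load CUDA/11.7.0

# Run your GPU application (./my_gpu_program is a placeholder).
./my_gpu_program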

Specialist partitions, such as feit-gpu-a100, still require the appropriate QoS.
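For example, a job on a specialist partition adds a QoS directive. A minimal sketch, assuming a QoS named feit (the name is illustrative; use the QoS your faculty or project has actually been granted):

#SBATCH -p feit-gpu-a100
#SBATCH -q feit
#SBATCH --gres=gpu:1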

Maximum job length

We have two partitions: gpu-a100-short, which supports jobs using at most 1 GPU and up to 4 hours of walltime, and gpu-a100, which supports jobs of up to 7 days of walltime.
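For example, a short test job might target the short partition explicitly; the time limit shown is the partition maximum of 4 hours:

#SBATCH -p gpu-a100-short
#SBATCH --gres=gpu:1
#SBATCH --time=04:00:00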

Comparative speeds

We have benchmarked the A100 nodes against the old P100 nodes and found the A100 nodes to be approximately 3 to 4 times as fast. The exact speedup will vary depending on how your application uses the GPUs.

Known issues

This section will be updated regularly as more researchers report issues.

Example: cuDNN version support

The Nvidia A100 (Ampere series) has a restriction on which cuDNN versions it supports. From the cuDNN release notes:

"Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture. Furthermore, if used, training operations can succeed with a NaN loss for every epoch."
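In practice this means loading a cuDNN build from the 8.x series or later before running training on the A100 nodes. A minimal sketch using the module system (the module name spelling and version string below are assumptions for illustration; run module avail to see what is actually installed on Spartan):

# List the cuDNN builds installed on Spartan.
module avail cuDNN

# Load an 8.x-series cuDNN together with a matching CUDA toolkit
# (version string is illustrative, not a specific recommendation).
module load cuDNN/8.4.1.50-CUDA-11.7.0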