GPU

Spartan has 41 GPU nodes in the public partitions, which are available to all University of Melbourne researchers with a Spartan account.

31 nodes, each with four 80GB Nvidia A100 GPUs, 495000MB RAM and 32 CPU cores
10 nodes, each with four 80GB Nvidia H100 GPUs, 950000MB RAM and 64 CPU cores

Access

Unlike the old LIEF GPGPU platform, the public GPU partitions do not require a QoS in your Slurm submit scripts. Remove any existing QoS line before you submit, or set it to "normal". To request a GPU, specify the partition and the number of GPUs:

#SBATCH -p gpu-a100
#SBATCH --gres=gpu:1

This will request 1 GPU on the gpu-a100 partition.
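For context, the two directives above might sit in a complete submit script along the following lines. This is a minimal sketch, not an official template: the one-hour walltime is an assumed value, and the script relies on Slurm exporting CUDA_VISIBLE_DEVICES for the allocated GPUs, which is its usual behaviour for GPU gres requests.

```shell
#!/bin/bash
#SBATCH -p gpu-a100
#SBATCH --gres=gpu:1
#SBATCH --time=01:00:00   # assumed walltime; adjust for your job

# Slurm normally sets CUDA_VISIBLE_DEVICES to the GPUs allocated to the job.
# Fall back to "none" so the script also runs outside a GPU allocation.
gpus="${CUDA_VISIBLE_DEVICES:-none}"
echo "Allocated GPUs: ${gpus}"

# Show GPU details if the driver tools are available on this node.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi could not query a GPU"
fi
```

Note that the script contains no QoS line, in keeping with the advice above.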

Specialist partitions, such as feit-gpu-a100, still require the appropriate QoS.

Maximum job length

There are three public GPU partitions:

gpu-a100-short, which supports jobs using at most 1 GPU and up to 4 hours of walltime
gpu-a100, which supports jobs of up to 7 days of walltime
gpu-h100, which supports jobs of up to 7 days of walltime
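As a rough illustration of how these limits interact, the sketch below picks an A100 partition from a job's GPU count and walltime. The function name pick_a100_partition is hypothetical and is not a Spartan or Slurm tool; it only encodes the limits listed above.

```shell
# Hypothetical helper: suggest an A100 partition from the limits listed above.
# Arguments: number of GPUs, walltime in whole hours.
pick_a100_partition() {
    local gpus="$1" hours="$2"
    if [ "$gpus" -le 1 ] && [ "$hours" -le 4 ]; then
        echo "gpu-a100-short"
    elif [ "$hours" -le 168 ]; then   # 7 days = 168 hours
        echo "gpu-a100"
    else
        echo "walltime exceeds the 7-day limit" >&2
        return 1
    fi
}

pick_a100_partition 1 2     # prints gpu-a100-short
pick_a100_partition 4 48    # prints gpu-a100
```

The returned name can then be used as the value of the -p option in a submit script.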

Comparative speeds

We have benchmarked the H100 and A100 nodes against the older V100 nodes. In general, the applications we tested were 25% to 200% faster on H100 nodes than on A100 nodes, and approximately 300% to 400% faster on A100 nodes than on the older V100 nodes. The performance gains varied between applications. Even different versions of the same application showed considerable performance variation on H100 nodes, due to the rapidly evolving software layers.
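To get a feel for what these figures could mean for walltime, the sketch below converts a percentage speedup into an estimated runtime. It assumes that "N% faster" means throughput is (1 + N/100) times higher, which is one reading of the benchmark figures and not something the benchmarks state. The 10-hour baseline and the function name h100_runtime are made up for illustration.

```shell
# Hypothetical baseline: a job that takes 10 hours on an A100 node.
a100_hours=10

# Assumption: "N% faster" means throughput is (1 + N/100) times higher,
# so the runtime is divided by that factor.
h100_runtime() {
    awk -v t="$1" -v pct="$2" 'BEGIN { printf "%.2f\n", t / (1 + pct / 100) }'
}

echo "H100, 25% faster:  $(h100_runtime "$a100_hours" 25) hours"
echo "H100, 200% faster: $(h100_runtime "$a100_hours" 200) hours"
```

Under this assumption the 10-hour job would take between roughly 3.33 and 8 hours on an H100 node. Treat such numbers as a guide for choosing a walltime, not as a guarantee.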

Known issues

This list will be updated regularly as more researchers report issues.

Example

cuDNN 7

The Nvidia A100 (Ampere series) does not support cuDNN versions older than 8.0. From the cuDNN release notes:

"Versions of cuDNN before the 8.0 release series do not support the NVIDIA Ampere Architecture and will generate incorrect results if used on that architecture. Furthermore, if used, training operations can succeed with a NaN loss for every epoch."

Minimum CUDA version for Nvidia H100 GPUs

The Nvidia H100 GPUs are supported from CUDA v11.8.0 onwards. Any software modules built with earlier CUDA versions (e.g. CUDA 11.7.0) are not supported on nodes with H100 GPUs and may fail or crash randomly.
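Both version requirements can be checked before submitting a job. The sketch below is one way to do it, assuming you already know the CUDA and cuDNN versions your software was built with. The helper version_ge is hypothetical, the version values are placeholders, and the comparison relies on the version sort option (sort -V) found in GNU coreutils.

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Relies on "sort -V" (version sort), available in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n' "$1" "$2" | sort -V | head -n1)" = "$2" ]
}

# Placeholder values; substitute the versions your software was built with.
cuda_version="11.7.0"
cudnn_version="8.0.5"

# H100 nodes need CUDA 11.8.0 or later.
if version_ge "$cuda_version" "11.8.0"; then
    echo "CUDA $cuda_version: OK for H100 nodes"
else
    echo "CUDA $cuda_version: too old for H100 nodes"
fi

# A100 nodes need cuDNN from the 8.0 release series or later.
if version_ge "$cudnn_version" "8.0"; then
    echo "cuDNN $cudnn_version: OK for A100 nodes"
else
    echo "cuDNN $cudnn_version: too old for A100 nodes"
fi
```

With the placeholder values above, the CUDA check reports that 11.7.0 is too old for H100 nodes, while cuDNN 8.0.5 passes the A100 check.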