GPU
Spartan has 41 GPU nodes in the public partitions, which are available to all University of Melbourne researchers with a Spartan account.
- 31 nodes, each with four 80GB Nvidia A100 GPUs, 495000MB RAM and 32 CPU cores
- 10 nodes, each with four 80GB Nvidia H100 GPUs, 950000MB RAM and 64 CPU cores
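You can see the current state of these nodes with Slurm's sinfo, using the partition names described under "Maximum job length" below:

```
# List the public GPU nodes and their state, one line per node
sinfo -p gpu-a100,gpu-a100-short,gpu-h100 -N -l
```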
Access
Unlike the old LIEF GPGPU platform, you do not need to specify a QoS in your Slurm submit scripts. Remove any QoS before you submit, or set it to "normal".
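A minimal sketch of a submission script is shown below; the job name, resource requests and modules are placeholders to adjust for your own workload.

```
#!/bin/bash
#SBATCH --job-name=gpu-test        # placeholder job name
#SBATCH --partition=gpu-a100       # public A100 partition
#SBATCH --gres=gpu:1               # request a single GPU
#SBATCH --cpus-per-task=8          # adjust to your workload
#SBATCH --mem=100G                 # adjust to your workload
#SBATCH --time=1:00:00             # walltime

# Load whatever toolchain/application modules your job needs, e.g.
# module load <your modules>

nvidia-smi                         # replace with your GPU application
```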
This will request 1 GPU on the gpu-a100 partition.
Specialist partitions, such as feit-gpu-a100, still require the appropriate QoS.
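A hedged sketch of the relevant directives (the QoS name is a placeholder; use the value your faculty or project has been granted):

```
#SBATCH --partition=feit-gpu-a100
#SBATCH --qos=<qos-name>      # placeholder: the QoS granted to your project
#SBATCH --gres=gpu:1
```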
Maximum job length
We have 3 partitions:
- gpu-a100-short, which supports jobs of up to 1 GPU and 4 hours of walltime
- gpu-a100, which supports jobs of up to 7 days of walltime
- gpu-h100, which supports jobs of up to 7 days of walltime
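For example, a quick test job on the short partition might look like the following sketch (resource requests are placeholders):

```
#!/bin/bash
#SBATCH --partition=gpu-a100-short
#SBATCH --gres=gpu:1          # the short partition allows at most 1 GPU
#SBATCH --time=4:00:00        # and at most 4 hours of walltime
#SBATCH --cpus-per-task=4     # adjust to your workload

nvidia-smi                    # quick sanity check; replace with your test workload
```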
Comparative speeds
We have done some benchmarking of the H100 and A100 nodes against the older V100 nodes. In general, the applications we tested were 25-200% faster on H100 nodes than on A100 nodes, and A100 nodes were approximately 300-400% faster than the older V100 nodes. The performance gains varied between applications; even different versions of the same application showed considerable performance variation on H100 nodes due to the rapidly evolving software layers.
Known issues
This list will be updated regularly as more researchers report issues.
Example
The Nvidia A100 (Ampere series) has a restriction on which cuDNN versions it supports: according to the cuDNN release notes, cuDNN 8.0 is the first release with Ampere support, so older cuDNN 7.x builds will not run on the A100.
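One way to check which cuDNN version your environment provides, assuming a PyTorch build is loaded (module names vary):

```
# Load your CUDA / cuDNN / PyTorch modules first, then:
python -c "import torch; print(torch.backends.cudnn.version())"
# Values below 8000 (i.e. cuDNN < 8.0) indicate a build that will not run on the A100.
```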