15-16/02/2021
The February maintenance is now complete. The main tasks completed were:
- Upgraded to Slurm 20.02.6. This was a minor point release update from our previous version.
- Changed the email template for completed Slurm jobs. Emails sent on job completion now include the job's efficiency.
- Upgraded Lmod, the module command.
The updated Lmod now requires you to explicitly load the toolchain you would like to use. For example, to load python/3.8.2, you need to run:
module load gcccore/8.3.0
module load python/3.8.2
You can see which toolchain you need by running module av:
module av python/3.8.2
----- Toolchain: gcccore/8.3.0 Compiler: gcccore 8.3.0 -----
python/3.8.2 (D)
Where:
D: Default Module
Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
- Removed the vccc, physics-gpu and ashley partitions
- Updated the NVidia driver to support CUDA 11.2
- Package updates of all operating system packages
- Reduced the maximum memory available on the gpgpu nodes. The maximum memory per node is now 111000MB.
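As an illustrative sketch only (the partition name, GPU request and application name are assumptions, not a verified configuration), a whole-node memory request on these nodes should now stay at or below the new limit:

#!/bin/bash
#SBATCH --partition=gpgpu     # illustrative partition name
#SBATCH --gres=gpu:1          # illustrative GPU request
#SBATCH --mem=111000          # must not exceed 111000MB per node after this change

./my_gpu_application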
Please submit a ticket if you notice that things aren't working normally after the maintenance.
20-22/07/2020
The July maintenance is now complete. The main tasks completed were:
- Moved from CephFS to GPFS for project and scratch filesystems
The new absolute locations for the filesystems are /data/gpfs/projects and /data/scratch/projects.
Common datasets are now in /data/gpfs/datasets.
Symlinks have been made so that scripts referencing /data/cephfs, /data/projects and /scratch will still work.
If you require files from CephFS, the old projects filesystem is mounted on the login nodes at /ceph/projects and the old scratch filesystem at /ceph/scratch. Old CephFS will be available for approximately 1 month.
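For example, to copy a directory from the old CephFS mount to the new GPFS location from a login node (punim0000 and the directory name below are placeholders, substitute your own project and paths):

# Copy a directory from the old CephFS projects mount to the new GPFS projects filesystem
rsync -av --progress /ceph/projects/punim0000/results/ /data/gpfs/projects/punim0000/results/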
- Changed all users to use our new software build system by default
We started installing new software into our new software system from the beginning of February, as we wanted to stop researchers from loading incompatible software together and also base our common software on newer compilers. More information can be found at https://dashboard.hpc.unimelb.edu.au/software/#the-new-modules-system
If your scripts use the old software system, you have two choices:
- Permanently set your default software system to the old one (you only have to do this once). To do this, run:
toggle-default-software-stack.sh
- Switch an individual script to the old software system. To do this, add
source /usr/local/module/spartan_old.sh
to your script before you load the required modules (a short example follows below).
We hope that most researchers will use the new software system. It is based on newer compilers and has fewer duplicate versions of software.
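As a minimal sketch of the second option (the Slurm directives, module name and script name below are placeholders, not modules we have verified exist):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Switch this job to the old software stack before loading any modules
source /usr/local/module/spartan_old.sh

module load Python/3.7.4-GCC-8.3.0   # placeholder old-stack module name
python my_analysis.py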
- Removed the cloud, bigmem and msps partitions
We have replaced the cloud and bigmem partitions with 3500 new CPU cores in the physical partition. This will bring more reliable job performance, and allow larger and more numerous MPI jobs. It also provides much faster and lower latency access to our new storage.
- Node memory limits updated
We've had to reduce the memory available to jobs on the nodes due to the memory GPFS needs for file caching and other activities. Please see the updated table at https://dashboard.hpc.unimelb.edu.au/status_specs/
- Slurm upgraded to 20.02.3
- Changed OpenMPI to use UCX as default
UCX is a new communication library and is recommended by OpenMPI from version 4 onwards. MPI jobs now use UCX by default.
If you notice problems with this, you can add
export USE_UCX=0
to your scripts before the execution of your application, and it will revert to the old behaviour.
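A minimal sketch of reverting to the old behaviour in a batch script (the task count, module versions and application name below are placeholders):

#!/bin/bash
#SBATCH --ntasks=32
#SBATCH --time=02:00:00

module load gcc/8.3.0 openmpi/4.0.3   # placeholder module versions

# Revert OpenMPI to the pre-UCX transport for this job only
export USE_UCX=0

srun ./my_mpi_application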
- Updated the NVidia driver version to support CUDA 11
- Package and security updates
This is a very large change, and has brought in new functionality as well as an entirely new storage system. We have thoroughly tested the storage, but we can't test everything. Once our normal workload is back, we may see things that we have never seen before. Please contact us if you notice any issues with the storage and we'll check it out.