Status

15-16/02/2021

The February maintenance is now complete. The main tasks completed were:

  • Upgraded to Slurm 20.02.6. This was a minor point-release update from our previous version.

  • Changed the email template for completed Slurm jobs: the job's efficiency is now included in the completion email.

  • Upgraded Lmod, the module command.

The updated Lmod now requires you to explicitly request the toolchain you would like to use. For example, to load python/3.8.2, you need to run:

module load gcccore/8.3.0
module load python/3.8.2

You can see which toolchain you need by running module av:

module av python/3.8.2

----- Toolchain: gcccore/8.3.0 Compiler: gcccore 8.3.0 -----

python/3.8.2 (D)

Where:
D: Default Module

Use "module spider" to find all possible modules and extensions.
Use "module keyword key1 key2 ..." to search for all possible modules matching any of the "keys".
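
If you are not sure which toolchain provides a package, module spider with the full module name will list the modules you need to load first. A quick illustration (the exact output will differ between packages):

module spider python/3.8.2
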
  • Removed the vccc, physics-gpu and ashley partitions

  • Updated the NVIDIA driver to support CUDA 11.2

  • Updated all operating system packages

  • Reduced the maximum memory available on the gpgpu nodes. The maximum memory per node is now 111000 MB.
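
As an illustration, a minimal, hedged sketch of a job requesting the full memory of a gpgpu node under the new limit (the GPU count and walltime are placeholders, and your project account and any partition-specific QOS options are omitted):

#SBATCH --partition=gpgpu
#SBATCH --nodes=1
#SBATCH --gres=gpu:1
#SBATCH --mem=111000
#SBATCH --time=01:00:00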

Please submit a ticket if you notice anything that isn't working normally after the maintenance.

20-22/07/2020

The July maintenance is now complete. The main tasks completed were:

  • Moved from CephFS to GPFS for project and scratch filesystems
    The new absolute locations for the filesystems are /data/gpfs/projects and /data/scratch/projects
    Common datasets are now in /data/gpfs/datasets

Symlinks have been created so that scripts referencing /data/cephfs, /data/projects and /scratch will continue to work.

If you require files from CephFS, the old projects filesystem is mounted at /ceph/projects and the old scratch filesystem at /ceph/scratch on the login nodes. Old CephFS will remain available for approximately one month.
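
If you still have files that exist only on old CephFS, a hedged example of copying them across on a login node (punim0000 and mydata are placeholders; substitute your own project ID and directory):

rsync -av /ceph/projects/punim0000/mydata/ /data/gpfs/projects/punim0000/mydata/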

  • Changed all users to use our new software build system by default
    We have been installing new software into the new system since the beginning of February, both to stop researchers from loading incompatible software together and to base our common software on newer compilers. More information can be found at https://dashboard.hpc.unimelb.edu.au/software/#the-new-modules-system

If your scripts use the old software system, you have two choices:

  • Permanently set your default software system to the old one (you only have to do this once): run toggle-default-software-stack.sh

  • Switch an individual script to the old software system: add source /usr/local/module/spartan_old.sh to the script before you load the required modules, as in the sketch below.
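
For the second option, a minimal hedged sketch of a job script switched to the old stack (the resource requests, module name and command are placeholders):

#!/bin/bash
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Use the old software stack for this job before loading modules
source /usr/local/module/spartan_old.sh
module load myapplication/1.2.3   # placeholder module name

my_application   # placeholder command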

We hope that most researchers will use the new software system. It is based on newer compilers and has fewer duplicate versions of software.

  • Removed the cloud, bigmem and msps partitions
    We have replaced the cloud and bigmem partitions with 3500 new CPU cores in the physical partition. This brings more reliable job performance and allows for larger and more numerous MPI jobs. It also provides much faster, lower-latency access to our new storage.

  • Node memory limits updated
    We've had to reduce the memory available to jobs on the nodes because of the memory GPFS needs for file caching and other activities. Please see the updated table at https://dashboard.hpc.unimelb.edu.au/status_specs/

  • Slurm upgraded to 20.02.3

  • Changed OpenMPI to use UCX by default
    UCX is the new communication library recommended by OpenMPI for versions 4 and above, and MPI jobs now use it by default.

If you notice problems with this, you can add export USE_UCX=0 to your script before the execution of your application and it will revert to the old behaviour, as in the sketch below.
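
For example, a hedged sketch of reverting a single job (the program name is a placeholder):

export USE_UCX=0
srun ./my_mpi_program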

  • Updated the NVIDIA driver version to support CUDA 11

  • Package and security updates

This is a very large change that brings in new functionality as well as an entirely new storage system. We have thoroughly tested the storage, but we can't test everything, and once our normal workload returns we may see issues that haven't appeared before. Please contact us if you notice any problems with the storage and we'll check them out.