Changes
Upcoming
Spartan will be offline for scheduled maintenance from 9am Monday 1 June until 6pm Friday 5 June.
== Why is Spartan going offline?
The primary reason for the outage is to perform some essential power work in the datacentre. This will allow Spartan to continue to expand to meet the needs of researchers.
Note that this power work only affects Spartan. It will not affect the other RCS services, such as Mediaflux and the Research Cloud.
== What other work are you doing?
We will also be performing regular maintenance on the cluster, upgrading packages and Nvidia GPU drivers.
== How will this affect jobs I submit to the system?
Spartan will not start any jobs which, based on your requested job wallclock limit, will not finish before the maintenance window starts. Jobs that won't finish will be held in the system until the maintenance is over.
Jobs that have been held for this reason will have a status of "(ReqNodeNotAvail, Reserved for maintenance)" in the squeue output.
Any jobs that, based on the job wallclock limit, will finish before the maintenance starts, will continue to run as normal.
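For example, if you want a job to start before the window, request a wall time short enough that the job finishes first. All values below are illustrative, not a prescription:

```shell
#!/bin/bash
# Hypothetical submit script: the requested wall time must end
# before 9am Monday 1 June for the job to start ahead of maintenance.
#SBATCH --time=1-12:00:00   # 1 day 12 hours (illustrative value)
#SBATCH --ntasks=1

srun ./my_program           # placeholder for your actual workload
```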
== Will I be able to gain access to my files when Spartan is offline?
No. The login nodes, Slurm and OnDemand will be inaccessible to all users during the outage.
Previous
Spartan has been upgraded to RedHat Enterprise Linux 9.6, Slurm 25.05.3 and Spectrum Scale 5.2.3.2.
Work done during the maintenance was:
- OnDemand upgraded to 4.0.7
- Slurm upgraded to 25.05.3
- Set the OMP_NUM_THREADS environment variable to 1 by default. If you use multithreaded applications (using OpenMP), you will need to set this to the number of CPUs you have requested by adding the appropriate export line to your Slurm submit script.
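For example, one common approach (a sketch: `SLURM_CPUS_PER_TASK` is set by Slurm when you request CPUs with `--cpus-per-task`) is to add:

```shell
# Match OpenMP threads to the CPUs Slurm allocated to the job.
# SLURM_CPUS_PER_TASK is only set when --cpus-per-task is requested,
# so fall back to 1 when it is absent.
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}
```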
Spartan has been upgraded to RedHat Enterprise Linux 9.4, Slurm 24.05.6 and Spectrum Scale 5.2.2.0.
The main work done during the maintenance was:
- Nvidia driver change
The Nvidia driver has been updated to 550.144.03
- Removal of out-of-warranty privately owned hardware and partitions
The partitions that have been removed are:
* mig
* mig-gpu
* turbsim
* argali
* gpu-v100-preempt
This is because the hardware in those partitions was no longer under warranty.
- Removal of out-of-warranty hardware from the cascade partition
Approximately 20 of the oldest nodes in the cascade partition have been removed due to being out of warranty.
We strongly suggest moving to the sapphire partition in preference to the cascade partition, as the sapphire partition has more resources and newer hardware.
- OnDemand upgraded to version 4.0
This has resulted in the loss of form history for applications like Cryosparc and Relion. You will need to re-enter the values you had in the launcher forms.
- Changing permissions on /var/local/tmp
Previously we allowed users to write directly to /var/local/tmp in jobs. Because that location was not being cleaned up, we have removed the ability to write directly to /var/local/tmp.
If you would like to use temporary storage on the node, please write to /tmp instead. See Local Temp Space for details.
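One common pattern for using node-local temporary space (a sketch; `mktemp` and the trap-based cleanup are standard shell, not Spartan-specific):

```shell
# Create a private scratch directory under /tmp for this job.
SCRATCH=$(mktemp -d /tmp/myjob.XXXXXX)

# Remove the scratch directory automatically when the script exits.
trap 'rm -rf "$SCRATCH"' EXIT

# Write intermediate files to $SCRATCH, then copy anything you need
# to keep back to project storage before the job ends.
echo "intermediate data" > "$SCRATCH/work.dat"
```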
Spartan has been upgraded to RedHat Enterprise Linux 9.4, Slurm 23.11.8 and Spectrum Scale 5.1.9.2.
- Nvidia driver change
The Nvidia driver has been updated to support CUDA 12.4
- gpu-h100 partition added
We have recently purchased 10 H100 GPU nodes, and have added them to the gpu-h100 partition.
The H100 GPUs are the latest generation of Nvidia GPU, and have a significant advantage over the A100. For information about how to use them, see GPU.
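As a sketch, a submit script requesting an H100 might include directives like these (the GRES count and wall time are placeholders; see the GPU documentation for the exact options Spartan expects):

```shell
#SBATCH --partition=gpu-h100
#SBATCH --gres=gpu:1        # one H100 GPU; placeholder count
#SBATCH --time=04:00:00     # illustrative wall time
```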
Spartan has been upgraded to RedHat Enterprise Linux 9, Slurm 23.02.5 and Spectrum Scale 5.1.8.1.
There are many changes to the system which you should read and become familiar with before submitting jobs.
**Please be patient with us when the system comes back online. The number, size and complexity of the changes we have made means that there will probably be things that don't quite work, despite our extensive preparation and testing. Please submit a ticket if things aren't working well, and describe your issue in as much detail as you can (including modules loaded, job number and error message seen).**
- New software system
The old software systems in RedHat 7 are no longer available. The new software system is based on hierarchies, where you can only see the software in the toolchain you have loaded. The new software system is case sensitive. See Modules for details.
On that page, you will see suggested module load statements for different workflows.
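As an illustration of how a hierarchy works, you load a toolchain first and only then do its dependent modules become visible. The module names and versions below are hypothetical; run `module avail` on Spartan to see the real ones:

```shell
# Hypothetical module names -- check `module avail` for the real ones.
module load foss/2022a      # load a toolchain first...
module load Python/3.10.4   # ...then its dependent modules become visible
```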
- Software changes
fosscuda has been removed. To use GPU software, load a CUDA version, and look for modules with CUDA in the name.
Singularity has been removed. The Singularity project forked into two: SingularityCE and Apptainer. We have chosen to install Apptainer, which can be used in an identical fashion to Singularity. Load the Apptainer module if you wish to use containers on Spartan.
- Partition renaming
In anticipation of new hardware arriving this year, which will have different CPUs than our current hardware, we have renamed physical to cascade (the current CPUs are Cascade Lake CPUs). Please see Specifications for details.
- Removal of FastX
We have removed the old FastX system for remote desktops. Remote Desktops are now available through Open OnDemand, including a GPU enabled desktop option. See Open OnDemand for details.
Recommendations for returning users:
- Delete your R libraries and Python environments, and recreate.
R libraries are by default stored in $HOME/R.
Python environments are stored in $HOME/.local (for pip install --user), and $HOME/venvs for virtual envs. Delete these directories, and recreate the environments you require.
- Never use pip install --user to install Python modules. We highly recommend you move to using virtualenvs and/or Conda environments for Python module installation. This is much neater and allows you to separate Python modules for different tasks. See Python for details.
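A minimal sketch of that workflow (the environment name and path are examples only):

```shell
# Create a virtual environment under $HOME/venvs (example path).
python3 -m venv "$HOME/venvs/myproject"

# Activate it: python and pip now point inside the environment,
# so installed packages stay isolated from other projects.
source "$HOME/venvs/myproject/bin/activate"
pip --version               # shows pip from inside the venv
deactivate
```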