Batch Submission and Schedulers
A batch system tracks the resources available on a system and determines when jobs can run on compute nodes. This is often handled by two separate applications: a resource manager (which tracks what resources are available on each compute node) and a job scheduler (which determines when jobs can run). On the Spartan HPC system we use the Slurm Workload Manager, which combines both tasks in a single application.
To submit jobs to the cluster one needs to provide a job submission script. The script consists of two sets of directives. The first set consists of the resource requests being made to the scheduler: how many nodes are needed, how many cores per node, what partition the job will run on, and how long these resources are required for (the walltime). These scheduler directives must come first. The second set consists of the commands understood by the computer's operating system environment, the shell: any modules that are being loaded, and the commands, including invocations of other scripts, that will be run.
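As a minimal sketch, a submission script for a single-core job could look like the following (the job name, partition, walltime, module, and program names are illustrative placeholders; substitute your own):

```
#!/bin/bash
# Scheduler directives: the resource requests read by Slurm
#SBATCH --job-name=example
#SBATCH --partition=physical
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --time=0-01:00:00

# Shell commands: environment setup and the work to be done
module load GCC/11.3.0   # illustrative module; check `module avail` for installed versions
./myprogram input.dat
```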
When a job is submitted to Slurm, it will go to the scheduler, which receives information from the resource manager daemons that run on the compute nodes. The resource requests of the job are compared with the resources available and evaluated by a policy-based "Fair Share" system. When the requested resources are available on the partition and the job has sufficient priority, it will run for up to the time that the resources have been requested for. When the job completes (or aborts) the scheduler will write an output file, and the application may do so as well.
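As a brief usage sketch (the script name `job.slurm` is a placeholder), a job is submitted with `sbatch` and its state can be checked with `squeue`:

```
$ sbatch job.slurm    # submit the script; sbatch reports the job ID it has been assigned
$ squeue -u $USER     # list your queued and running jobs
# By default, job output is written to slurm-<jobid>.out in the submission directory.
```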
Spartan's Partitions
HPC systems are often built around queues or partitions representing homogeneous hardware or administrative restrictions. With Slurm on Spartan, one can view the list of partitions with the `sinfo -s` command, like the following:
```
$ sinfo -s
PARTITION AVAIL TIMELIMIT NODES(A/I/O/T) NODELIST
physical* up 30-00:00:0 46/0/7/53 spartan-bm[055-066,085-125]
long up 90-00:00:0 2/0/0/2 spartan-snowy[030-031]
msps2 up 30-00:00:0 0/2/0/2 spartan-bm[047-048]
punim0396 up 30-00:00:0 0/2/0/2 spartan-bm[051-052]
shortgpgpu up 1:00:00 0/2/0/2 spartan-gpgpu[001-002]
gpgpu up 7-00:00:00 31/33/0/64 spartan-gpgpu[003-015,020-069,076]
longgpgpu up 30-00:00:0 1/1/0/2 spartan-gpgpu[070-071]
deeplearn up 30-00:00:0 5/7/0/12 spartan-gpgpu[072-075,078-082,086-088]
interactive up 2-00:00:00 2/0/0/2 spartan-bm[083-084]
snowy up 30-00:00:0 11/18/0/29 spartan-snowy[001-029]
mig up 30-00:00:0 5/5/0/10 spartan-bm[067-076]
turbsim up 30-00:00:0 0/6/0/6 spartan-bm[077-082]
mig-gpu up 30-00:00:0 0/2/0/2 spartan-gpgpu[084-085]
gpgputest up 30-00:00:0 0/6/0/6 spartan-gpgpu[016-019,089-090]
physicaltest up 30-00:00:0 0/2/0/2 spartan-bm[053-054]
adhoc up 7-00:00:00 0/2/0/2 spartan-bm[126-127]
debug up 30-00:00:0 103/88/7/198 spartan-bm[047-048,051-127],spartan-gpgpu[001-076,078-082,084-090],spartan-snowy[001-031]
```
This provides the partition name (e.g., physical, snowy, gpgpu, etc). Some of the partitions are restricted to particular projects (e.g., punim0396, deeplearn, etc) and require membership of a relevant group plus an additional scheduler directive to access; others (e.g., physical, snowy) are open to general access. In addition to the name, the output lists the availability of the partition (up, down) and the maximum walltime available for job submission. Normally this is 30-00:00:0, i.e., 30 days, for most partitions. A value of less than 30 days indicates that a planned outage is pending, with the maximum walltime decrementing as the day of the outage approaches.
Following this is a summary status of the nodes in the partition, in the form Allocated/Idle/Other/Total (A/I/O/T). Finally, there is the list of nodes in the partition, given as hostnames and ranges. Note that individual nodes can belong to multiple partitions.
Partition Utilisation
Before submitting a job it may be worthwhile to check the utilisation of a partition. Whilst the `sinfo -s` command gives a high-level overview of the status of all the partitions, the `showq` command can be used to show the status of a particular partition (e.g., `showq -p physical`, `showq -p gpgpu`, etc). For example:
```
$ showq -p physical
SUMMARY OF JOBS FOR QUEUE: <physical>
ACTIVE JOBS--------------------
JOBID JOBNAME USERNAME STATE CORE REMAINING STARTTIME
==================================================================================
17325890 run-experi vivekkatial Running 1 263:35:45 Wed Jul 22 15:23:00
17325891 run-experi vivekkatial Running 1 263:35:46 Wed Jul 22 15:23:01
17325892 run-experi vivekkatial Running 1 263:35:46 Wed Jul 22 15:23:01
17325893 run-experi vivekkatial Running 1 263:35:46 Wed Jul 22 15:23:01
17325901 run-experi vivekkatial Running 1 263:35:46 Wed Jul 22 15:23:01
...
850 active jobs : 2335 of 3912 cores ( 59.69 %): 43 of 56 nodes ( 76.79 %)
...
```
As can be seen from the example, whilst most nodes are busy there are still some cores available. If a full-node job were launched at this point it would almost certainly go into the queue.
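One way to check whether any node has a full complement of idle cores is `sinfo`'s node-oriented output (the format string below is one possible choice):

```
# %N prints the node name, %C the CPU counts as allocated/idle/other/total
$ sinfo -p physical -N -o "%N %C"
```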
Monitoring job memory, CPU and GPU utilisation
As a result of feedback obtained from the 2020 Spartan HPC user survey, a job monitoring system was developed. This allows users to monitor the memory, CPU, and GPU usage of their jobs via a simple command-line script. For more details, please see Job Monitoring.
Job Priority and Limits
Spartan is a very busy system, often with 100% worker node allocation. Because no system has an infinite number of cores, there needs to be some method of establishing the order in which jobs run; demand for HPC resources typically outweighs supply. By default, the scheduler allocates jobs on a simple "first-in, first-out" (FIFO) basis. However, the application of rules and policies can change the priority of a job, which is expressed as a number to the scheduler. Some of the main factors are:
- Job size : The number of nodes, cores, or memory that a job is requesting. A higher priority is given to larger jobs.
- Wait time : The priority of a job increases the longer it has been in the queue.
- Fairshare : The difference between the portion of the computing resource that has been promised to a group or user and the amount of resources that has been consumed. It takes into account the resources used by a user's jobs in the last 14 days; the more resources used by a user's or group's jobs in that period, the lower the priority of their new jobs.
- Backfilling : Where there is a gap in the allocated resources that a smaller job can fit into, the scheduler will fill that gap to maximise resource utilisation.
- Partition and QoS : A factor associated with each node partition and quality of service.
On Spartan, the calculated priority is dominated by the fairshare component (aside from QoS restrictions), so the most common reason for a job taking a long time to start is the amount of resources the project has consumed in the last 14 days.
You can see your job's priority, and what makes up that priority, by using the `sprio` command:
```
# sprio -j 12409951
JOBID PARTITION PRIORITY AGE FAIRSHARE JOBSIZE PARTITION QOS
12409951 physical 4240 3000 1233 6 1 0
```
To ensure fair use of the GPGPU partition on Spartan, quotas are implemented so that no one participant in the GPGPU project (La Trobe, Deakin, St Vincent's, UoM, UoM Engineering, UoM MDHS) can use all of the available GPUs. If a GPGPU sponsor is using more than their quota of GPUs, further jobs will be held with the message "QOSGrpGRES"; similarly, if a GPGPU sponsor is using more than their quota of CPUs, jobs will be held with the message "QOSGrpCpuLimit". Held jobs will run once enough currently running jobs end.
Likewise, on the public partitions of Spartan (physical, snowy, interactive), CPU and memory quotas have been implemented so that no one project can use all the resources in these partitions. The limits are currently set at 19.5% of the resources in each partition. If a job is not running due to "MaxCpuPerAccount", the project's running jobs exceed the current CPU quota for that partition; if a job is not running due to "MaxMemoryPerAccount", the project's running jobs exceed the current memory quota for that partition.
Partition | CPU Quota (CPU cores) | Memory Quota (MB RAM) |
---|---|---|
physical | 750 | 9585888 |
snowy | 200 | 1493750 |
interactive | 8 | 98304 |
long | 32 | 239000 |
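If a job is being held for one of these reasons, the reason code appears in the last column of `squeue` output; a format string along the following lines (one possible choice) makes it easy to see:

```
# %t is the job state (PD = pending) and %R the reason a pending job is waiting,
# e.g. (Priority), (QOSGrpGRES), (MaxCpuPerAccount) or (MaxMemoryPerAccount)
$ squeue -u $USER --format="%.12i %.10P %.15j %.2t %.10M %R"
```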
Note: Users with COVID-19 projects may gain additional priority with the directive `--qos=covid19` at job submission.
CPU and Memory Quotas
"MaxCpuPerAccount" and "MaxMemoryPerAccount"
To ensure fair use of the public partitions on Spartan (physical, snowy), we have implemented CPU and memory quotas. This ensures no one project can use all the resources in these partitions. The limits are currently set at 15% of the resources in each partition.
If your job is not running due to "MaxCpuPerAccount", it means that your project's running jobs exceed the current CPU quota for that partition.
If your job is not running due to "MaxMemoryPerAccount", it means that your project's running jobs exceed the current memory quota for that partition.
See the table above for the current limits per project.
GPU Partitions
Spartan hosts a GPGPU service, developed in conjunction with Research Platform Services, the Melbourne School of Engineering, Melbourne Bioinformatics, RMIT, La Trobe University, St Vincent's Institute of Medical Research and Deakin University. It was funded through ARC LIEF grant LE170100200. It consists of 72 nodes, each with four NVIDIA P100 graphics cards, which can provide a theoretical maximum of around 900 teraflops.
The GPGPU cluster is available to University researchers, as well as external institutions that partnered through the ARC LIEF grant.
Jobs submitted to the GPU partition require the partition information in the script (e.g., `#SBATCH --partition=gpgpu`), along with a generic resource (gres) request, for example `#SBATCH --gres=gpu:2`, which will request two GPUs for the job. A range of GPU-accelerated software such as TensorFlow is available on Spartan, as well as CUDA for developing your own GPU applications; these are available at `/usr/local/common`. Finally, these jobs also require a `#SBATCH --qos=` setting to denote under which participating group the project has been granted authority to use the partition. A table of valid QOS entries and their corresponding groups is as follows (an example job script is given after the table):
Group | QOS setting | Notes |
---|---|---|
General UoM | gpgpuresplat | Users can apply for this access on a case by case basis |
MSE | gpgpumse | MSE users can apply for access |
MDHS | gpgpumdhs | MDHS users can apply for access |
St Vincent's | gpgpusvi | |
RMIT | gpgpurmit | |
Deakin | gpgpudeakin | |
La Trobe | gpgpultu | |
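Putting these directives together, a minimal sketch of a GPU job script might look like the following (the project ID `punim0000`, the QOS `gpgpuresplat`, the walltime, and the module and program names are placeholders; use your own project ID and the QOS appropriate to your group from the table above):

```
#!/bin/bash
#SBATCH --partition=gpgpu
#SBATCH --qos=gpgpuresplat     # QOS from the table above
#SBATCH -A punim0000           # your project ID
#SBATCH --gres=gpu:2           # request two GPUs
#SBATCH --time=0-04:00:00

# Illustrative environment setup; check `module avail` for installed versions
module load CUDA/11.7.0
./my_gpu_program
```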
Note that the `deeplearn` partition also requires a QOS setting (`#SBATCH --qos=gpgpudeeplearn`) and a gres setting (`#SBATCH --gres=gpu:n`). This partition is limited to specific projects from the CIS department.
To ensure fair use of the GPGPU partition on Spartan, quotas are implemented so that no one participant in the GPGPU project (La Trobe, Deakin, St Vincent's, UoM, UoM Engineering, UoM MDHS) can use all of the available GPUs. If a GPGPU sponsor is using more than their quota of GPUs, their jobs will be held with the message "QOSGrpGRES"; similarly, if a GPGPU sponsor is using more than their quota of CPUs, their jobs will be held with the message "QOSGrpCpuLimit". Held jobs will run once enough currently running jobs end.
If you have a GPU project and you're at the University of Melbourne, you can access the CPU partitions as well by adding `#SBATCH -q normal` to your job submission script, along with the CPU partition name, e.g., `#SBATCH -p physical`.
Use Cases
DeepLearn
Spartan's deeplearn partition consists of hardware purchased by the Computing and Information Systems (CIS) department: 13 nodes, each with four NVIDIA V100 graphics cards. The deeplearn partition is available to specific Engineering projects, especially those in CS and SE. You can request access to it at the time of creating a Spartan account.
To access the deeplearn partition:
```
#SBATCH --partition deeplearn
#SBATCH --qos gpgpudeeplearn
#SBATCH -A projectID
#SBATCH --gres=gpu:v100:4
```
This will request four V100 GPUs using the project `projectID`.
You can see the specifications of the deeplearn nodes on the Status and Specifications page.