Scheduler

Batch Submission and Schedulers

A batch system tracks the resources available on a system and determines when jobs can run on compute nodes. These tasks are often handled by two separate applications: a resource manager (which tracks what resources are available on each compute node) and a job scheduler (which determines when jobs can run).

On the Spartan HPC system we use the Slurm Workload Manager, which combines both tasks into a single application.

To submit jobs to the cluster one needs to provide a job submission script.

The script consists of two sets of directives:

The first set consists of the resource requests made to the scheduler: how many nodes are needed, how many cores per node, which partition the job will run on, and how long these resources are required (walltime).

Note

These scheduler directives must come first.

The second set is the batch of commands that are understood by the computer's operating system environment, the shell. This includes any modules that are being loaded, and the commands, including invoking other scripts, that will be run.
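As a minimal sketch of this two-part layout, the script below places the `#SBATCH` resource requests first and the shell commands second. The job name, core count, memory and walltime are illustrative assumptions rather than recommended values, and the module line is left as a commented placeholder because the right module depends on the application.

```shell
#!/bin/bash
# Part 1: resource requests. Slurm reads these lines; the shell treats them as comments.
# All values below are illustrative assumptions.
#SBATCH --job-name=example
#SBATCH --partition=cascade
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=8G
#SBATCH --time=0-01:00:00

# Part 2: commands run by the shell on the compute node.
# module load <module-name>    (placeholder: load whatever the application needs)

# Slurm sets SLURM_CPUS_PER_TASK inside a job; fall back to 1 outside of one.
CORES="${SLURM_CPUS_PER_TASK:-1}"
message="Running on $(hostname) with ${CORES} core(s)"
echo "$message"
```

Saved under a hypothetical name such as `job.slurm`, a script like this would be submitted with `sbatch job.slurm`.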

When a job is submitted to Slurm, it goes to the scheduler, which receives information from the resource manager daemons running on the compute nodes. The resource requests of the job are compared with the resources available and evaluated by a policy-based "Fair Share" system. When the resources are available on the requested partition and the job has priority, it will run for up to the walltime that was requested. When the job completes (or aborts), the scheduler will write an output file, and the application may write its own output as well.

Partitions and limits

Spartan has a number of partitions accessible to all users, as well as a number of private partitions that are only accessible to specific faculties and/or research groups. You can see all partitions by running sinfo -s on Spartan.

The publicly accessible partitions, with their quotas and limits, are listed below:

Partition	Walltime	Running jobs	CPU Quota (CPU cores) - per user	Memory Quota (MB RAM) - per user	GPUs - per user	CPU Quota (CPU cores) - per project	Memory Quota (MB RAM) - per project	GPUs - per project
cascade,sapphire	30 days	No limit	1400	14486111		1400	14486111
interactive	2 days	1	8	73728
long	90 days	No limit	36	372500		36	372500
bigmem	21 days	No limit	256	8120000		256	8120000
gpu-a100-short	4 hrs	1	16	247500	2
gpu-a100	7 days	No limit	384	5940000	48	384	5940000	48
gpu-h100	7 days	No limit	192	2850000	12	192	2850000	48
gpu-l40s	7 days	No limit	192	2850000	12	192	2850000	48

On public partitions of Spartan (cascade, sapphire, interactive, bigmem, long, gpu-a100, gpu-h100) CPU, memory and GPU quotas have been implemented. This ensures no one user or project can use all the resources in these partitions. The limits are currently set at a percentage of the resources in each partition.

Note

If a job is not running due to "QOSMaxCpuPerUserLimit", it means that the user's running jobs have reached the current per-user CPU quota for that partition. If a job is not running due to "QOSMaxMemPerUserLimit", it means that the user's running jobs have reached the current per-user memory quota for that partition. Similarly, if a job is not running due to "MaxCpuPerAccountLimit", it means that the project's running jobs have reached the current per-account (per-project) CPU quota for that partition. The same applies to other "Account" based statuses. In this context, a Slurm "Account" is equivalent to a Spartan project (e.g. punimxxxx).

The CPU type, number of nodes, CPUs per node etc of the partitions can be found here

Partition Utilisation

Before submitting a job it may be worthwhile to check the utilisation of a partition. Whilst the sinfo -s command gives a high-level overview of the status of all the partitions, the sinfo -O cpusstate command can be used to show the CPU status of a particular partition (e.g. sinfo -O cpusstate -p sapphire, sinfo -O cpusstate -p gpu-a100).

For example:

$ sinfo -p cascade -O cpusstate
CPUS(A/I/O/T)       
5648/256/0/5904

The four fields are allocated/idle/other/total. In this example, cascade has 5904 CPU cores in total, of which 5648 are allocated to jobs, 256 are currently idle, and none are in another state.
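The same figures can be turned into a utilisation percentage. The sketch below parses the sample value from the example above; the variable names are illustrative. On a login node the string could instead be captured with `sinfo -h -p cascade -O cpusstate | tr -d ' '`, which drops the header and the column padding.

```shell
# Sample value copied from the example above (allocated/idle/other/total).
state="5648/256/0/5904"

# Split the four slash-separated fields into variables.
IFS=/ read -r allocated idle other total <<EOF
$state
EOF

# Integer percentage of cores currently allocated to jobs.
percent_used=$(( allocated * 100 / total ))
echo "allocated=${allocated} idle=${idle} other=${other} total=${total} (${percent_used}% in use)"
```

For the sample value this reports 95% in use (integer division rounds down).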

Job Priority

Spartan is a very busy system, with 100% worker node allocation on most days. Demand for HPC resources typically surpasses supply. Because no system has an infinite number of cores, there needs to be a method that establishes the order in which jobs run.
By default, the scheduler allocates resources on a simple "first-in, first-out" (FIFO) basis. However, rules and policies can change the priority of a job, which is expressed to the scheduler as a number. Some of the main factors are:

Job size: The number of nodes, cores, or memory that a job is requesting. A higher priority is given to larger jobs.
Wait time: The priority of a job increases the longer it has been in the queue.
Fairshare: Fairshare takes into account the resources used by a project's jobs in the last 14 days. The more resources used by a project's jobs in the last 14 days, the lower the priority of new jobs for that project.
Backfilling: This allows lower priority jobs to run as long as the batch system knows they will finish before the higher priority job needs the resources. This makes it very important that users specify their CPU, memory and walltime requirements accurately, to make best use of the backfilling system.
Partition and QoS: Factors associated with the partition and the quality of service (QoS) that the job runs under.

On Spartan, the calculated priority is dominated by the fairshare component (aside from QoS restrictions), so the most common reason for a job taking a long time to start is because of the amount of resources consumed in the last 14 days.

You can see your job priority, and what makes up the priority, by using the sprio command:

$ sprio -j 12409951
          JOBID PARTITION   PRIORITY        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
       12409951 cascade        4240       3000       1233          6          1          0
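In the sample output above, the component columns (AGE, FAIRSHARE, JOBSIZE, PARTITION, QOS) add up to the PRIORITY column. The sketch below checks this for the sample line with awk. It assumes the column order shown above and does not account for any other factors a site may configure.

```shell
# Data line copied from the sprio example above.
line="       12409951 cascade        4240       3000       1233          6          1          0"

# Fields: 1=JOBID 2=PARTITION(name) 3=PRIORITY 4=AGE 5=FAIRSHARE 6=JOBSIZE 7=PARTITION 8=QOS
result=$(echo "$line" | awk '{ sum = $4 + $5 + $6 + $7 + $8
                               printf "reported=%d sum_of_factors=%d", $3, sum }')
echo "$result"
```

Here 3000 + 1233 + 6 + 1 + 0 = 4240, matching the reported priority, with the age and fairshare components contributing almost all of it.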

Note: Users with COVID-19 projects may gain additional priority by adding the directive --qos=covid19 at job submission.

Common job statuses and what they mean

When your job is not running, the reason is shown in the NODELIST(REASON) column of the squeue output. Common reasons are:

Status Reason	What it means
`(Priority)`	Higher priority jobs are ahead of your job in the queue
`(Resources)`	Your job is waiting for enough resources to become available before it can run
`(MaxMemoryPerAccount)`	The sum of the RAM used by the running jobs of your project has hit the RAM quota. Note that this includes use by other project members
`(MaxCpuPerAccount)`	The sum of the CPU used by the running jobs of your project has hit the CPU quota. Note that this includes use by other project members
`(MaxGRESPerAccount)`	The sum of the GPU used by the running jobs of your project has hit the GPU quota. Note that this includes use by other project members
`(ReqNodeNotAvail, UnavailableNodes:`	Based on the walltime requested, your job would not finish before a node it needs is taken offline, so it can't start yet. Normally this means there's an upcoming maintenance window scheduled.
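The table above can be folded into a small lookup for use on the command line. This is only a sketch: `explain_reason` is a hypothetical helper, not part of Slurm, and the wording of each message is paraphrased from the table.

```shell
# Hypothetical helper: map a squeue reason (without parentheses) to a short explanation.
explain_reason() {
    case "$1" in
        Priority)            echo "Higher priority jobs are ahead in the queue" ;;
        Resources)           echo "Waiting for enough resources to become available" ;;
        MaxMemoryPerAccount) echo "Project RAM quota reached" ;;
        MaxCpuPerAccount)    echo "Project CPU quota reached" ;;
        MaxGRESPerAccount)   echo "Project GPU quota reached" ;;
        ReqNodeNotAvail*)    echo "Requested walltime overlaps a node outage, likely maintenance" ;;
        *)                   echo "Not covered by the table: $1" ;;
    esac
}

explain_reason "MaxCpuPerAccount"

# On a login node, the reasons for your pending jobs could be fed in like this:
# squeue -h -u "$USER" -t PENDING -o "%i %r" | while read -r jobid reason; do
#     echo "$jobid: $(explain_reason "$reason")"
# done
```

The commented loop relies on the `%r` format field of squeue, which prints the reason on its own, so the parentheses from the default NODELIST(REASON) column do not need to be stripped.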