Batch Submission and Schedulers
A batch system tracks the resources available on a system and determines when jobs can run on compute nodes. This is often conducted through two separate applications: a resource manager, which tracks what resources are available on each compute node, and a job scheduler, which determines when jobs can run.
On the Spartan HPC system we use the Slurm Workload Manager, which combines both tasks into a single application.
To submit jobs to the cluster one needs to provide a job submission script.
The script consists of two sets of directives:
- The first set consists of the resource requests being made to the scheduler. This includes how many nodes are needed, how many cores per node, what partition the job will run on, and how long these resources are required (the walltime).
These scheduler directives must come first.
- The second set is the batch of commands that are understood by the computer's operating system environment, the shell. This includes any modules that are being loaded, and the commands, including invoking other scripts, that will be run.
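As an illustration, a minimal submission script combining the two sets of directives might look like the following. The job name, partition, module, and program here are placeholders, not a prescribed configuration; substitute values appropriate to your own project:

```shell
#!/bin/bash
# --- Scheduler directives: the resource requests (must come first) ---
#SBATCH --job-name=example          # a name for the job (placeholder)
#SBATCH --partition=cascade         # partition the job will run on
#SBATCH --nodes=1                   # number of nodes
#SBATCH --ntasks-per-node=4         # cores per node
#SBATCH --time=0-01:00:00           # walltime (days-hours:minutes:seconds)

# --- Shell commands: the job itself ---
module load foss/2022a              # load required modules (placeholder)
./my_program input.dat              # run the application (placeholder)
```

Saved as, say, job.slurm, the script would be submitted with sbatch job.slurm.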
When a job is submitted to Slurm, it goes to the scheduler, which receives information from the resource manager daemons that run on the compute nodes. The resource requests of the job are compared with the resources available and evaluated by a policy-based "Fair Share" system. When resources are available on the requested partition and the job has sufficient priority, it will run for up to the walltime that was requested. When the job completes (or aborts) the scheduler will write an output file, and the application may do so as well.
HPC systems are often built around queues or partitions representing homogeneous hardware or administrative restrictions. With Slurm on Spartan, one can view the list of partitions with the sinfo -s command, like the following:
PARTITION         AVAIL  TIMELIMIT   NODES(A/I/O/T)  NODELIST
cascade*          up     30-00:00:0  3/78/0/81       spartan-bm[001-029,039-046,049,053-066,087-115]
rhel7             down   30-00:00:0  0/10/0/10       spartan-bm[116-125]
long              up     90-00:00:0  0/2/0/2         spartan-bm[031-032]
bigmem            up     14-00:00:0  0/5/0/5         spartan-bm[030,033-034,037-038]
argali            up     30-00:00:0  0/21/2/23       spartan-argali[01-23]
msps2             up     30-00:00:0  0/2/0/2         spartan-bm[047-048]
punim0396         up     30-00:00:0  0/1/0/1         spartan-bm050
gpu-a100          up     7-00:00:00  0/29/0/29       spartan-gpgpu[099-127]
gpu-a100-short    up     4:00:00     0/2/0/2         spartan-gpgpu[128-129]
gpu-a100-preempt  up     7-00:00:00  0/23/0/23       spartan-gpgpu[098,131,144-159,161-165]
gpu-v100-preempt  up     7-00:00:00  0/4/0/4         spartan-gpgpu[084-085,089-090]
feit-gpu-a100     up     7-00:00:00  0/21/0/21       spartan-gpgpu[144-159,161-165]
deeplearn         up     30-00:00:0  0/31/7/38       spartan-gpgpu[065-071,078-082,086-088,091-096,132-143,160,166-169]
interactive       up     2-00:00:00  0/4/0/4         spartan-bm[083-086]
extremecfd        up     14-00:00:0  0/1/0/1         spartan-gpgpu097
feit-geoandco     up     14-00:00:0  0/1/0/1         spartan-gpgpu131
mig               up     30-00:00:0  0/10/0/10       spartan-bm[067-076]
turbsim           up     30-00:00:0  0/6/0/6         spartan-bm[077-082]
mig-gpu           up     30-00:00:0  0/2/0/2         spartan-gpgpu[084-085]
gpgputest         up     30-00:00:0  0/4/0/4         spartan-gpgpu[089-090,098,130]
physicaltest      up     30-00:00:0  0/2/0/2         spartan-bm[035-036]
physicaltest-amd  down   30-00:00:0  0/1/0/1         spartan-bm128
adhoc             down   14-00:00:0  0/0/0/0
debug             up     30-00:00:0  3/233/9/245     spartan-argali[01-23],spartan-bm[001-050,053-125,128],spartan-gpgpu[065-071,078-082,084-169]
This provides the partition name (e.g., cascade, long, gpu-a100, etc.).
Some of the partitions are restricted to particular projects (e.g., punim0396, deeplearn, etc) and will require membership to a relevant group and an additional scheduler directive to access.
Others (e.g., cascade, long) have general access.
In addition to the name, the output lists the availability of the partition (up or down) and the maximum walltime available for job submission. For most partitions this is normally 30-00:00:0, i.e., 30 days. A value of less than 30 days indicates that a planned outage is pending, with the maximum walltime decrementing as the day of the outage approaches.
Following this is a summary of the state of the nodes in the partition, in the form Allocated/Idle/Other/Total (A/I/O/T). Finally, there is the list of nodes in the partition, given as hostnames and ranges. Note that individual nodes can belong to multiple partitions.
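As a sketch of how these figures can be read programmatically, the A/I/O/T column of a sinfo -s row can be split on "/" with awk. The echoed line below is a sample taken from the listing above; on a live system you would pipe the output of sinfo -s itself into awk instead:

```shell
# Extract the idle-node count (the "I" in A/I/O/T) from a `sinfo -s` row.
# The sample row is hard-coded here; substitute `sinfo -s | tail -n +2`
# on a real cluster to process every partition.
echo "cascade* up 30-00:00:0 3/78/0/81 spartan-bm[001-029]" |
  awk '{ split($4, n, "/"); print $1, "idle nodes:", n[2] }'
```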
Before submitting a job it may be worthwhile to check the utilisation of a partition. Whilst the sinfo -s command gives a high-level overview of the status of all the partitions, the sinfo -O cpusstate command reports the CPU usage of a particular partition, e.g., sinfo -O cpusstate -p cascade or sinfo -O cpusstate -p gpu-a100.
For example, at the time of writing the cascade partition had 5904 CPU cores in total, of which 5648 were allocated and 256 were idle.
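The A/I/O/T string that sinfo -O cpusstate prints can be turned into a utilisation percentage. In this sketch the string is hard-coded to match the cascade figures quoted above, rather than read from a live system:

```shell
# Compute the percentage of allocated cores from a CPUS(A/I/O/T) string.
# 5648/256/0/5904 matches the cascade figures quoted in the text; on a
# live system, pipe `sinfo -O cpusstate -p cascade` output in instead.
echo "5648/256/0/5904" |
  awk -F/ '{ printf "%.1f%% of %d cores allocated\n", 100 * $1 / $4, $4 }'
```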
Spartan is a very busy system, with 100% worker node allocation on most days. Demand for HPC resources typically surpasses supply. Because no system has an infinite number of cores, there needs to be some method that establishes an order in which jobs can run.
By default, the scheduler allocates on a simple "first-in, first-out" (FIFO) approach. However, the application of rules and policies can change the priority of a job, which is expressed as a number to the scheduler. Some of the main factors are:
- Job size : The number of nodes, cores, or memory that a job is requesting. A higher priority is given to larger jobs.
- Wait time : The priority of a job increases the longer it has been in the queue.
- Fairshare : The difference between the portion of the computing resource that has been promised to a group or user and the amount of resources that has been consumed. It takes into account the resources used by a user's jobs in the last 14 days. The more resources used by a user's or group's jobs in the last 14 days, the lower the priority of their new jobs.
- Backfilling: This allows lower priority jobs to run as long as the batch system knows they will finish before the higher priority job needs the resources. This makes it very important that the users specify their CPU, memory and walltime requirements accurately, to make best use of the backfilling system.
- Partition and QoS: A factor associated with each node partition.
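Slurm combines these factors with its multifactor priority plugin: each factor is normalised to a value between 0 and 1 and multiplied by a site-configured weight, and the weighted terms are summed. The following is a hedged sketch with invented factor values and weights; the actual weights on Spartan are set by the administrators and can be inspected with scontrol show config:

```shell
# Sketch of Slurm's multifactor priority sum (illustrative numbers only).
# Each factor is normalised to the range 0..1 by the scheduler; the
# weights 1000/10000/500 below are invented, not Spartan's real values.
age=0.5; fairshare=0.2; jobsize=0.1
awk -v a="$age" -v f="$fairshare" -v j="$jobsize" \
  'BEGIN { print int(1000*a + 10000*f + 500*j) }'
```

Note how a large fairshare weight, as used in this sketch, makes recent resource consumption dominate the final priority number.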
On Spartan, the calculated priority is dominated by the fairshare component (aside from QoS restrictions), so the most common reason for a job taking a long time to start is because of the amount of resources consumed in the last 14 days.
You can see your job priority, and what makes up the priority, by using the sprio command (e.g., sprio -j <jobid> for a specific job). An additional quality of service, such as --qos=covid19, may also be applied at job submission