
Spartan: Frequently Asked Questions

1. What is Spartan?
2. Why do I need it?
3. How do I get an account on Spartan?
4. How do I access it?
5. What applications are there?
6. How do I submit a job?
7. How do I submit a job efficiently?
8. How do I submit a job differently?
9. How do I specify particular resources or reserved partitions?
10. Where are example job submission scripts?
11. Where do I get help?
12. Acknowledgements

1. What is Spartan?

Spartan is a high performance computing (HPC) and research cloud (Melbourne Research Cloud, MRC) system, with attached Research Data Storage Services (RDSS).

Spartan consists of (a) a management node for system administrators, (b) a login node for users to connect to the system and submit jobs, (c) a small number of 'bare metal' compute nodes for multinode tasks, (d) any 'bare metal' user-procured hardware (e.g., departmental nodes), (e) vHPC cloud compute nodes for overflow and GPGPU tasks, and (f) general cloud compute nodes.

The aim of the University is to provide a more unified experience for researchers accessing compute services, whether generic compute or HPC, including specialised processing for graphics and imaging (general purpose graphics processing units, GPGPUs).

2. Why do I need it?

There are a number of common reasons why a researcher may find that a standard user computer (desktop, laptop, etc.) is not up to particular computational tasks. They may find that the tasks they're running are taking too long, that there are not enough cores, that their dataset is too big, that the application is too difficult to install, or that it's inefficient to purchase licenses for each user.

Any of these are good reasons to make use of the resources that Spartan offers.

3. How do I get an account on Spartan?

Access to Spartan requires an account, which requires an association with a Project; Projects are subject to approval by the Head of Research Compute Services. Projects must demonstrate an approved research goal or goals, or demonstrate potential to support research activity. Projects require a Principal Investigator and may have additional Research Collaborators.

Projects and accounts may be established through Karaage.

4. How do I access it?

Access to Spartan, like nearly all other HPC systems, is through SSH (secure shell). Linux and MacOS X users will usually have this built into their shell environment. MS-Windows users will have to install an SSH client, such as PuTTY (http://putty.org). The first login will create your home directory. For example:

[lev@cricetomys ~]$ ssh lev@spartan.hpc.unimelb.edu.au
lev@spartan.hpc.unimelb.edu.au's password:
Creating home directory for lev.
[lev@spartan ~]$

To make logins faster you may wish to add an entry in ~/.ssh/config on your local machine and/or create an ~/.ssh/authorized_keys file in your Spartan home directory. Combined, these allow an abbreviated alias for login and passwordless (key-authenticated) logins. For example:

[lev@cricetomys ~]$ ssh spartan
Last login: Fri May 13 13:28:41 2016 from 128.250.116.164
[lev@spartan ~]$
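
The abbreviated login above relies on a config entry on the local machine; a minimal sketch, assuming the username lev and a key pair at the default location:

Host spartan
    HostName spartan.hpc.unimelb.edu.au
    User lev
    IdentityFile ~/.ssh/id_rsa

The public key can then be appended to ~/.ssh/authorized_keys on Spartan, for example with ssh-copy-id:

[lev@cricetomys ~]$ ssh-copy-id lev@spartan.hpc.unimelb.edu.au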

Spartan uses Red Hat Enterprise Linux Server release 7.2, and knowledge of the Linux command line with this distribution is absolutely required for effective use of the system.

5. What applications are there?

There are a number of applications installed on Spartan, typically built from source with optimisation to improve performance. Spartan uses a modules system (lmod) which sets the environment paths for the user when invoked, allowing multiple versions of the same software to be installed. This means that consistent environments can be maintained throughout a software project, or that more recent versions with newer features can be introduced without conflicting with existing versions.

The following command lists all applications available in the module system:

[lev@spartan ~]$ module avail
..

To include the module paths in the environment, the user invokes the `module load` command, and removes the paths with the `module unload` command. A brief description of a module is available with the `module whatis` command. e.g.,

[lev@spartan ~]$ module whatis GCC/4.9.2
[lev@spartan ~]$ module load GCC/4.9.2
[lev@spartan ~]$ module unload GCC/4.9.2

Unless the user is running an interactive job (see section 8, below), modules are not usually invoked on the command line but rather as part of a job script. Do note, however, that the job scheduler will copy the user's environment into the batch job, so modules loaded when submitting will still be loaded when the job runs.
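
If this inheritance is not wanted (for example, to keep batch jobs reproducible regardless of what happens to be loaded at submission time), sbatch's --export option can restrict it. A minimal sketch; whether this is appropriate depends on local configuration:

# Do not copy the submission environment into the job; load modules explicitly instead
#SBATCH --export=NONE
module load my-app-compiler/version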

6. How do I submit a job?

Because the login node is a shared user environment it is very important that significant computational tasks are not run on this node. If this happens it will restrict access and usage for others, and the Spartan system administrators will almost certainly kill your job.

Instead, computational tasks should be submitted to the batch system (Simple Linux Utility for Resource Management, or SLURM), which tracks resources throughout the cluster and builds a queue of jobs waiting to run. When the resources are available the job scheduler will direct the task to a compute node to run.

Batch scripts are generated by writing a short text file of the resource requests and the commands that are desired. The following provides a skeleton of a typical set of resource requests and commands.

#!/bin/bash
#SBATCH -p cloud
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
module load my-app-compiler/version
my-app data

The script first invokes a shell environment, followed by the partition the job will run on (the default is 'cloud'). The next four lines are resource requests, specifically for one hour of walltime, one compute node, one task, and one CPU core per task. After these resources are allocated, the script loads a module and then runs the executable against the dataset specified.

A user may wish to receive email notifications on how their job has run. If this is the case, the following optional directives can be added:

#SBATCH --mail-user=example@example.com
#SBATCH --mail-type=ALL

The option 'ALL' here sends an email when the job begins, ends, or fails. The individual options, if desired, are BEGIN, END, and FAIL.

SLURM will generate names for error and output files based on the job ID. If alternative names are desired, the following can be used (the short forms -o and -e, or the more elaborate --output and --error):

#SBATCH -o outputfile.out
#SBATCH -e errorfile.err

By default, SLURM captures both stdout and stderr into a single output file called slurm-<jobid>.out.

Assuming the script above is called `testjob1.sh` and the application and data parameters have been incorporated, it could be submitted as follows:

[lev@spartan ~]$ sbatch testjob1.sh

7. How do I submit a job efficiently?

The first example (section 6) is a simple single core job request for one hour. Knowing what sort of resources are needed aids efficient resource allocation, speeds up both job submission and execution, and leaves resources available for other users. In some cases the impact can be quite significant (e.g., requesting multicore resources for a single core job is pointless), whilst in other circumstances determining the right request may require some testing (e.g., how long a dataset of a particular size takes to run).

Modifying resource allocation requests improves job efficiency. For single core jobs, use the allocations suggested in section 6, modifying the time if necessary.

For a shared-memory multithreaded job (e.g., OpenMP), modify --cpus-per-task up to a maximum of 8, which is the maximum number of cores on a single instance:

#SBATCH --cpus-per-task=8
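
Putting this together, a minimal sketch of an OpenMP job script, where my-openmp-app stands in for your own threaded application:

#!/bin/bash
#SBATCH -p cloud
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
module load my-app-compiler/version
# Use as many threads as cores allocated by SLURM
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}
my-openmp-app data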

For a distributed-memory multicore job using message passing, the multinode partition has to be invoked and the resource requests altered. e.g.,

#!/bin/bash
#SBATCH -p physical
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=1
module load my-app-compiler/version
srun my-mpi-app

Note that there is only 1 CPU per task, which is typical with code written with message passing.

Please note that we do not recommend running multinode jobs using the cloud partition; it will be very slow. The cloud partition is very good for single node jobs of up to eight cores ("ntasks").

8. How do I submit a job differently?

The examples given in sections 6 and 7 refer to standard batch submissions. Alternative job submissions include batch arrays, batch dependencies, and interactive sessions.

In the first case, the same batch script, and therefore the same resource requests, is used multiple times. A typical example is to apply the same task across multiple datasets. The following example submits an array of 10 tasks, with myapp running against datasets dataset1.csv, dataset2.csv, ... dataset10.csv:

#SBATCH --array=1-10
myapp dataset"${SLURM_ARRAY_TASK_ID}".csv
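
As a complete sketch, combining this with the skeleton from section 6 (myapp and the dataset names are placeholders):

#!/bin/bash
#SBATCH -p cloud
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --array=1-10
module load my-app-compiler/version
# SLURM_ARRAY_TASK_ID takes the values 1 to 10, one per array task
myapp dataset"${SLURM_ARRAY_TASK_ID}".csv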

In the second case, a dependency condition is established on which the launching of a batch script depends, creating a conditional pipeline. The dependency types include `after`, `afterany`, `afterok`, and `afternotok`. A typical use case is where the output of one job is required as the input of the next job.

#SBATCH --dependency=afterok:<myfirstjobid>
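
Alternatively, the dependency can be given on the command line at submission time, capturing the first job's ID and passing it to the second submission. A minimal sketch, where myfirstjob.sh and mysecondjob.sh are placeholder scripts:

# --parsable makes sbatch print only the job ID
JOBID=$(sbatch --parsable myfirstjob.sh)
# The second job will only start if the first completes successfully
sbatch --dependency=afterok:${JOBID} mysecondjob.sh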

In the third case SLURM, based on the resource requests made on the command line, puts the user on to a compute node. This is typically done if the user wants to run a large script (which shouldn't be done on the login node), or wants to test or debug a job. The following command would launch one node with two processors for ten minutes:

[lev@spartan ~]$ sinteractive --time=00:10:00 --nodes=1 --ntasks=2
srun: job 64 queued and waiting for resources
srun: job 64 has been allocated resources

9. How do I specify particular resources or reserved partitions?

Sometimes specific nodes need to be specified or excluded. To exclude nodes, use the -x option followed by the nodelist. To specify particular nodes, use the -w option followed by the nodelist. e.g.,

#SBATCH -x spartan-bm001,spartan-bm002
#SBATCH -w spartan-bm001,spartan-bm002

Some partitions on the cluster are reserved for particular groups. A user may belong to multiple groups, some of which may not have access to a given partition. Therefore, when making a job submission for a reserved partition, both the partition (-p) and the account group (-A) need to be specified.

[lev@spartan ~]$ sinteractive -p water -A punim0006
srun: job 211916 queued and waiting for resources
srun: job 211916 has been allocated resources
[lev@spartan-water01 ~]$

Whilst we have specified GPU partitions on Spartan, on other Slurm-based systems this may be implemented as a Generic Resource (GRES). This can be invoked in a manner similar to the following:

#SBATCH --gres=gpu
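
As an illustrative sketch only, a job script requesting a single GPU might look like the following; the partition name 'gpgpu' and the application are placeholders, and the actual GPU partition names and GRES configuration on Spartan may differ:

#!/bin/bash
# 'gpgpu' is a placeholder; use the actual GPU partition name on Spartan
#SBATCH -p gpgpu
#SBATCH --time=01:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
# On GRES-based systems, request one GPU
#SBATCH --gres=gpu:1
module load my-gpu-app/version
my-gpu-app data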

10. Where are example job submission scripts?

There is a collection of example job submission scripts stored in a shared directory on Spartan. Change to the /usr/local/common/ directory.
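
For example, to list the available examples and copy them into your home directory for modification:

[lev@spartan ~]$ ls /usr/local/common/
[lev@spartan ~]$ cp -r /usr/local/common/ ~/examples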

11. Where do I get help?

If a user has problems with submitting a job, needs a new application or an extension to an existing application installed, or finds that their submissions are generating unexpected errors, etc., an email can be sent to: hpc-support@unimelb.edu.au

Please provide as much information as possible in your email, such as the specific software that you require to be installed, a download link, the job ID of anything you were running, the location of the batch script, etc.

The University will also be running training courses on using Spartan for researchers as part of the ResBaz program.

12. Acknowledgements

This guide was written by Lev Lafayette for Research Platforms, University of Melbourne, with contributions from Chris Samuel (Victorian Life Sciences Computation Initiative), Bernard Meade (Head of Research Compute Services), and Tim Rice (Research Platform Services). Version 0.7, December 12 2016.