Managing Data

Where to Store Your Data on Spartan

Many HPC jobs have large datasets. There are a number of places to store data on Spartan and different ways to get data in and out. Some directories have a faster interconnect than others. If a job performs a lot of I/O on a large dataset, a slow connection between the compute device and the storage device will reduce job performance. In all but the smallest jobs, it is best to have data close (physically, with a fast connection) to compute.

Warning

Note that /home, /data/gpfs and /data/scratch are all network-based storage that can be accessed by multiple nodes and processes at the same time across the whole of Spartan. Take care that you don't inadvertently write to the same file from multiple jobs at the same time.
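One way to avoid such collisions is to give each job its own output file. The following is a minimal sketch, assuming jobs run under Slurm (which sets SLURM_JOB_ID and SLURM_ARRAY_TASK_ID inside a job); the function name job_output_path and the myproject path are hypothetical.

```shell
#!/bin/bash
# Build an output file name that is unique to each job (and array task),
# so that concurrent jobs never write to the same file on shared storage.
# SLURM_JOB_ID and SLURM_ARRAY_TASK_ID are set by Slurm inside a job;
# the fallbacks below only apply when the script is run outside Slurm.
job_output_path() {
    local dir="$1"
    local job_id="${SLURM_JOB_ID:-nojob}"
    local task_id="${SLURM_ARRAY_TASK_ID:-0}"
    printf '%s/result_%s_%s.out\n' "$dir" "$job_id" "$task_id"
}

# Hypothetical project path; substitute your own project ID.
job_output_path "/data/gpfs/projects/myproject/output"
```

Inside a job script, the printed path can be captured with command substitution and used as the target of any output redirection.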

Warning

While it's often essential to have fast nearby storage while working on your data, please don't use Spartan as a long-term data repository.

Spartan is not a data storage platform, and the project filesystem is only to be used for computational storage (i.e. data actively being analysed). Any data for archiving or not being used should be uploaded to Mediaflux and removed from Spartan.

Home Directory

A user's home directory, i.e. /home/$username, can be used to store small amounts of data, but this is generally discouraged. It's best suited to short-lived and non-critical data that is only for the user. Others in a project won't have access to this data, and it is limited to 50GB of storage. You can check home quota and usage with the command check_home_usage, e.g.,

[new-user@spartan ~]$ check_home_usage
new-user has used 4GB out of 50GB in /home/new-user

Project Directories

Your project directory is the best place to store research data while you're working on it. It's located at /data/gpfs/projects/$projectID. Project directories are backed up nightly, but do not have a snapshot ability.

All members of a project can access this datastore, and 500 GB of storage is available per project. If more storage is required, please contact the helpdesk. In general, for University of Melbourne users, 1 TB of project storage is available upon request, and up to 10 TB is possible after consultation on needs and data management strategies. Project storage beyond 10 TB will generally require some sort of co-investment.

You can check project quota and usage with the command check_project_usage e.g.,

[new-user@spartan ~]$ check_project_usage
myproject has used 3997GB out of 8000GB in /data/gpfs/projects/myproject
myproject1 has used 265GB out of 500GB in /data/gpfs/projects/myproject1
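If you want to see at a glance how close each project is to its quota, the output can be post-processed. This is only a sketch that assumes the output format is exactly as shown above; the helper name usage_percent is hypothetical and not part of Spartan.

```shell
#!/bin/bash
# Read lines of the form
#   NAME has used 265GB out of 500GB in PATH
# and print NAME with its usage as a whole-number percentage.
usage_percent() {
    awk '$2 == "has" && $3 == "used" {
        used = $4 + 0     # "265GB" -> 265 (awk keeps the numeric prefix)
        quota = $7 + 0    # "500GB" -> 500
        if (quota > 0) printf "%s %d%%\n", $1, (used * 100) / quota
    }'
}

# On Spartan: check_project_usage | usage_percent
# Here the sample output from above is fed in directly.
printf '%s\n' \
    'myproject has used 3997GB out of 8000GB in /data/gpfs/projects/myproject' \
    'myproject1 has used 265GB out of 500GB in /data/gpfs/projects/myproject1' \
    | usage_percent
```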

It is important to pay extra attention to file and directory ownership and permissions in project directories. Members of a project will only have access to files and directories for which the owners have granted group permission. If a file or directory is meant to be shared, the onus is on its owner to ensure that permission is granted, using the standard UNIX command chmod. Project leaders are strongly encouraged to ensure that project members grant each other access to files inside the shared project directory.
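As an illustration of granting group access with chmod, the sketch below gives group members read access to a directory tree. It runs on a throwaway directory; the helper name share_with_group is hypothetical, and it assumes the files already belong to the project's group.

```shell
#!/bin/bash
# Grant group read access to a directory tree.
# g+r : group members may read files
# g+X : group members may enter directories (and execute files that are
#       already executable by someone), without marking data files executable
share_with_group() {
    chmod -R g+rX "$1"
}

# Demonstration on a throwaway directory rather than a real project path.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/shared/results"
touch "$demo_dir/shared/results/data.csv"
chmod -R g-rwx "$demo_dir/shared"      # start with no group access
share_with_group "$demo_dir/shared"
ls -lR "$demo_dir/shared"
```

Use g+rwX instead of g+rX if other project members should also be able to modify the files.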

Scratch Directories

The scratch filesystem is a very fast, NVMe flash based filesystem, suitable for temporary data needed for jobs.

It's located at /data/scratch/projects/$projectID, which is shared across multiple nodes.

Warning

Note that the scratch directory is not backed up, and does not have a snapshot ability.

Warning

Files in the /data/scratch directory have a nominal lifetime of 60 days. Files can and will be deleted by Spartan admins to ensure that /data/scratch does not fill up.

If you wish to use /data/scratch, please submit a request for a scratch filesystem directory.
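Because files in /data/scratch have a nominal lifetime of 60 days, it can help to list files that are getting old so they can be copied elsewhere in time. The sketch below uses modification time as a proxy for age, which is an assumption (the criterion used by the Spartan admins is not stated here); the helper name list_old_files and the myproject path are hypothetical.

```shell
#!/bin/bash
# List regular files under a directory whose last modification is more
# than the given number of days in the past.
list_old_files() {
    local dir="$1"
    local days="$2"
    find "$dir" -type f -mtime +"$days"
}

# Hypothetical scratch path; substitute your own project ID.
# A threshold of 50 days leaves some margin before the nominal 60 days.
scratch_dir="/data/scratch/projects/myproject"
if [ -d "$scratch_dir" ]; then
    list_old_files "$scratch_dir" 50
fi
```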

Local temp space

On each of the physical nodes there is a fast NVMe PCI-E card which will provide the fastest filesystem for your jobs. The normal capacity per node is 1.8TB, which is shared between all jobs running on that node.

You can use it by writing to /tmp on each node. Note that /tmp is local to each job and each worker node. It is automatically cleaned once the job has finished.

If you are using /tmp on the node, add

#SBATCH --tmp=XGB

to your submit script, to request the usage of XGB of /tmp space. This is not a quota, but rather just to ensure that jobs aren't scheduled to nodes that have already had other jobs using up the /tmp space.
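Since the requested amount is not enforced as a quota, a job may want to confirm that enough space is actually free before writing to /tmp. A minimal sketch, with a hypothetical helper name free_gb and a hypothetical requirement of 20GB:

```shell
#!/bin/bash
# Print the free space, in whole GiB, of the filesystem holding a directory.
# df -P gives stable POSIX columns; -k reports sizes in 1024-byte blocks,
# and column 4 is the available space.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / (1024 * 1024) }'
}

required_gb=20   # hypothetical; match the value given to --tmp
available_gb="$(free_gb /tmp)"
if [ "$available_gb" -lt "$required_gb" ]; then
    echo "Only ${available_gb}GB free in /tmp; ${required_gb}GB needed" >&2
fi
```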

Shared datasets

Spartan has some commonly used datasets set up in a shared location so you don't need to store them in your own area.

To access them, please join the Software group using Karaage (go to Karaage, click Software Agreements->Add software).

Dataset	Description	Location on Spartan
Connectome	The Human Connectome HCP dataset	/data/gpfs/datasets/connectome
Imagenet	The Imagenet blurred dataset - Imagenet ILSVRC 2012–2017 face obscured	/data/gpfs/datasets/Imagenet
Imagenet	The Imagenet CLS-LOC Dataset - Imagenet CLS-LOC	/data/scratch/datasets/Imagenet
Objectnet	The Objectnet Dataset - Objectnet	/data/gpfs/datasets/Objectnet
Alphafold	The Alphafold datasets - Last update was 1/Feb/2022	/data/scratch/datasets/alphafold
KRAKEN2	The Kraken2 database	/data/gpfs/datasets/KRAKEN2
KRAKEN	The Kraken database	/data/gpfs/datasets/KRAKEN
GTDB-Tk	The GTDB-TK Genome Taxonomy database - version 202 and 207	/data/gpfs/datasets/GTDBtk
Colabfold	The Colabfold database - downloaded 1/05/2023	/data/gpfs/datasets/mmseqs

Staging

Local disk is typically faster than shared disk. Spartan has /home (slower), /data/gpfs/projects/$projectID (faster), /data/scratch/projects/$projectID for temporary data (even faster), and local disk at /tmp (fastest, not shared). To use /tmp, you will need to copy data to and from it within a job script.
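A staging job script might look like the sketch below: copy input from the project directory to /tmp, run the computation there, then copy results back. The paths, job name, time limit and --tmp value are hypothetical, and the line-counting loop is only a placeholder for a real program.

```shell
#!/bin/bash
#SBATCH --job-name=staging-example
#SBATCH --time=01:00:00
#SBATCH --tmp=20GB

# Copy input to node-local disk, run there, and copy results back.
# Arguments: input directory, local working directory, results directory.
stage_and_run() {
    local input_dir="$1"
    local work_dir="$2"
    local results_dir="$3"

    mkdir -p "$work_dir/input" "$work_dir/output" "$results_dir"

    # Stage in: shared storage -> local disk
    cp -r "$input_dir/." "$work_dir/input/"

    # Placeholder for the real computation: count the lines in each
    # input file. Replace this loop with your own program.
    for f in "$work_dir"/input/*; do
        [ -f "$f" ] || continue
        wc -l < "$f" > "$work_dir/output/$(basename "$f").count"
    done

    # Stage out: local disk -> shared storage
    cp -r "$work_dir/output/." "$results_dir/"
}

# Hypothetical paths; substitute your own project ID and directories.
project_dir="/data/gpfs/projects/myproject"
if [ -d "$project_dir/input" ]; then
    stage_and_run "$project_dir/input" "/tmp/staging-example" "$project_dir/results"
fi
```

Results must be copied back before the job ends, since /tmp is cleaned automatically once the job has finished.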

How to Transfer Data In and Out of Spartan

There are a few common tools that can be used to copy data from your local laptop/desktop to Spartan.

Note that you must run the data transfer client on your local machine. Spartan can be reached from anywhere, but your laptop/desktop usually cannot be reached from Spartan. So you need to open the file transfer client on your local machine, not on Spartan.

Secure Copy (scp)

The scp command can be used to copy data from your local machine to Spartan. For example, to copy local.dat from the current working directory on a local machine to a project directory on Spartan:

On your local machine:

$ scp local.dat myusername@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/myproject/remote.dat

Files can be transferred from Spartan to a local machine by reversing the order of the arguments.

On your local machine:

$ scp myusername@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/myproject/remote.dat local.dat

Entire directories can be copied with the -r flag.

On your local machine:

$ scp -r myusername@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/myproject/remotedir/ .

For Windows users, PuTTY provides an equivalent tool called pscp. If data is located on a remote machine, SSH into that system first, and then use scp from that machine to transfer data to Spartan.

For a graphical interface, applications like FileZilla (cross-platform) or Cyberduck (macOS & Windows) are suggested.

rsync

Repeatedly transferring large files in and out of Spartan via scp can be inefficient. A good alternative is rsync, which only transfers the parts that have changed. It can work on single files or whole directories, and the syntax is much the same as for scp.

On your local machine:

$ rsync local.dat myusername@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/myproject/remote.dat

Note that the first argument is the source, and the second is the destination which will be modified to match the source.

The rsync application can also copy whole directories and, with the --update flag, protect the destination by skipping files that are newer there than at the source:

$ rsync -avz --update source/ username@remotemachine:/path/to/destination

To force a destination to synchronise absolutely with the source, use the --delete flag, which removes files at the destination that do not exist at the source. Try this with the -n (or --dry-run) option first!

$ rsync -avz --delete source/ username@remotemachine:/path/to/destination
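A dry run of the --delete command can be wrapped in a small helper so the preview is always done the same way. This is a sketch only; the function name preview_sync is hypothetical, and -i (--itemize-changes) is added so that each file that would be transferred or deleted is listed on its own line.

```shell
#!/bin/bash
# Preview what "rsync --delete" would change, without changing anything.
# -n (--dry-run) makes rsync report its actions only; -i (--itemize-changes)
# prints one line per file that would be transferred or deleted.
preview_sync() {
    local src="$1"
    local dest="$2"
    rsync -avzn -i --delete "$src" "$dest"
}

# Example (placeholder names as in the commands above):
# preview_sync source/ username@remotemachine:/path/to/destination
```

Lines containing "deleting" in the output mark files that would be removed from the destination. Once the preview looks right, run the same rsync command without -n.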

Mediaflux Integration

Research Computing Services provides a data management service utilising the Mediaflux platform. This platform provides a persistent location for research data and metadata. To aid integration between Mediaflux and Spartan, Java clients are available on Spartan, allowing data to be downloaded from and uploaded to Mediaflux. Details on Mediaflux integration with Spartan can be found in the Mediaflux support wiki.

S3-compatible storage

Research Computing Services provides an object storage service with an S3-compatible layer. Data can be archived from Spartan to this service, and retrieved to be analysed later. For more information, please see our wiki.

Data and Storage Solutions Beyond Spartan

The University offers a range of other data storage and management solutions to meet your needs, beyond the short-term storage available on Spartan.

In some cases it's possible to integrate these resources with your account on Spartan to streamline workflows. Get in touch.