Where to Store Your Data on Spartan
Many HPC jobs have large datasets. There are a number of places to store data on Spartan and different ways to get data in and out. Some directories have a faster interconnect than others. If a job performs substantial I/O on a large dataset, a slow connection between the compute node and the storage device will reduce job performance. For all but the smallest jobs, it is best to keep data close to compute, both physically and via a fast connection.
/home, /data/gpfs/projects and /data/scratch are all network-based storage that can be accessed by multiple nodes and processes at the same time across the whole of Spartan. Take care that you don't inadvertently write to the same file from multiple jobs at the same time.
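One way to avoid clashing writes on shared storage is to give each job its own output file. The sketch below assumes jobs run under Slurm, which sets SLURM_JOB_ID (and SLURM_ARRAY_TASK_ID for array jobs). The function name, the file naming pattern and the `my_analysis` program are hypothetical.

```shell
# Build an output path that is unique to the current job, so that two jobs
# never write to the same file on shared storage.
# Assumption: the scheduler is Slurm and sets SLURM_JOB_ID (and
# SLURM_ARRAY_TASK_ID for array jobs). Outside a job, the shell's PID is used.
unique_output_path() {
    local out_dir="$1"   # directory on shared storage
    local prefix="$2"    # file name prefix chosen by the user
    local job_id="${SLURM_JOB_ID:-pid$$}"
    local task_id="${SLURM_ARRAY_TASK_ID:-0}"
    printf '%s/%s_%s_%s.out\n' "$out_dir" "$prefix" "$job_id" "$task_id"
}

# Example usage inside a job script (my_analysis is a placeholder program):
#   outfile=$(unique_output_path "/data/gpfs/projects/$projectID" results)
#   ./my_analysis > "$outfile"
```

Because the job ID is part of the file name, reruns of the same script also leave earlier results untouched.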
While it's often essential to have fast nearby storage while working on your data, please don't use Spartan as a long-term data repository.
Spartan is not a data storage platform, and the project filesystem is only to be used for computational storage (i.e. data actively being analysed). Any data for archiving or not being used should be uploaded to Mediaflux and removed from Spartan.
A user's home directory, i.e.
/home/$username, can be used to store small amounts of data; however, this is generally discouraged. It's best suited to short-lived and non-critical data that is only for the user. Others in a project won't have access to this data, and it is limited to 50GB of storage. You can check home directory quota and usage with the command
Your projects directory is the best place to store research data while you're working on it. It's located at
/data/gpfs/projects/$projectID. The projects directory is backed up nightly, but does not have a snapshot ability.
All members of a project can access this datastore, and 500 GB of storage is available per project. If more storage is required than this, contact the helpdesk. In general, for University of Melbourne users, 1 TB of project storage is available upon request, and up to 10 TB is possible after consultation on needs and data management strategies. Project storage beyond 10 TB will generally require some sort of co-investment.
You can check project quota and usage with the command
The scratch filesystem is a very fast, NVMe flash-based filesystem, suitable for temporary data needed for jobs.
It's located at
/data/scratch/projects/$projectID, which is shared across multiple nodes. Note that the scratch directory is not backed up, and does not have a snapshot ability.
Files in the
/data/scratch directory have a nominal lifetime of 60 days. Files can and will be deleted by Spartan admins to ensure that
/data/scratch does not fill up.
If you wish to use /data/scratch, please submit a request for a scratch filesystem directory.
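Given the nominal 60-day lifetime of scratch files, it can help to list files that are getting old before they are removed. The sketch below uses the standard `find` options `-type f` and `-mtime`. The function name and the 50-day default are hypothetical, and it assumes modification time is a reasonable proxy for file age; the criteria used by the Spartan admins may differ.

```shell
# List regular files under a directory that were last modified more than
# a given number of days ago (default: 50).
# Assumption: modification time is a reasonable proxy for file age; the
# actual cleanup policy on /data/scratch may use different criteria.
list_old_files() {
    local dir="$1"
    local days="${2:-50}"
    find "$dir" -type f -mtime +"$days" -print
}

# Example usage:
#   list_old_files "/data/scratch/projects/$projectID" 50
```

Anything the listing shows that is still needed can be moved to the project directory or archived before it is deleted.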
Local temp space
On each of the physical nodes, there is a fast NVMe PCI-E card, which will provide the fastest filesystem for your jobs. The normal capacity per node is 1.8TB, which is shared between all jobs running on that node.
You can use it by writing to
/tmp on each node. Note that
/tmp is local to each job and each worker node. It is automatically cleaned once the job has finished.
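The usual pattern for node-local space is to copy input in, work locally, and copy results back to shared storage. The sketch below is one way this might look inside a job script. The function name and the `input`/`output` sub-directory layout are hypothetical, and the command to run is passed as the remaining arguments.

```shell
# Copy input to node-local temporary space, run a command there, and copy
# the results back to shared storage.
# Assumptions: the "input" and "output" sub-directory names are illustrative;
# the command is expected to read from input/ and write to output/.
run_on_local_tmp() {
    local src_dir="$1"    # input directory on shared storage
    local dest_dir="$2"   # directory on shared storage for results
    shift 2

    # Create a private working directory under /tmp (or $TMPDIR if set).
    local work_dir
    work_dir=$(mktemp -d "${TMPDIR:-/tmp}/job.XXXXXX") || return 1

    mkdir -p "$work_dir/input" "$work_dir/output" "$dest_dir"
    cp -r "$src_dir/." "$work_dir/input/"

    # Run the command from inside the local working directory.
    local rc=0
    ( cd "$work_dir" && "$@" ) || rc=$?

    # Copy results back even if the command failed, then tidy up.
    cp -r "$work_dir/output/." "$dest_dir/"
    rm -rf "$work_dir"
    return "$rc"
}

# Example usage (my_analysis is a placeholder program):
#   run_on_local_tmp "/data/gpfs/projects/$projectID/in" \
#                    "/data/gpfs/projects/$projectID/out" \
#                    ./my_analysis input output
```

Copying results back before cleanup matters because anything left only in the local temp space is lost once the job ends.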
Spartan has some commonly used datasets set up in a shared location so you don't need to store them in your own area.
To access them, please join the Software group using Karaage (go to Karaage, click
Software Agreements->Add software).
|Dataset||Description||Location on Spartan|
|Connectome||The Human Connectome HCP dataset||/data/gpfs/datasets/connectome|
|Imagenet||The Imagenet blurred dataset - Imagenet ILSVRC 2012–2017 face obscured||/data/gpfs/datasets/Imagenet|
|Alphafold||The Alphafold datasets - Last update was 1/Feb/2022||/data/scratch/datasets/alphafold|
|KRAKEN2||The Kraken2 database||/data/gpfs/datasets/KRAKEN2|
|KRAKEN||The Kraken database||/data/gpfs/datasets/KRAKEN|
|GTDB-Tk||The GTDB-TK Genome Taxonomy database - version 202 and 207||/data/gpfs/datasets/GTDBtk|
|Colabfold||The Colabfold database - downloaded 1/05/2023||/data/gpfs/datasets/mmseqs|
Local disk is typically faster than shared disks. Spartan has
/home for home (slower),
/data/scratch/projects/$projectID for temporary data (even faster), and as local disk,
/var/local/tmp (fastest, not shared). To use the latter, you will need to copy data between these locations within your job script.
How to Transfer Data In and Out of Spartan
Secure Copy (scp)
The scp command can be used to move data from your local machine to Spartan. For example, to move
local.dat from the current working directory on a local machine to a project directory on Spartan:
$ scp local.dat username@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/$myproject/remote.dat
Files can be transferred from Spartan to a local machine by reversing the order of the arguments:
$ scp username@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/$myproject/remote.dat local.dat
Entire directories can be copied with the -r flag:
$ scp -r username@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/$myproject/$remotedir/ .
These examples assume the user is on the local machine. One cannot normally initiate a copy from Spartan to a local system, as local systems typically do not have a public fixed IP address.
For Windows users, PuTTY provides an equivalent tool called
pscp. If data is located on a remote machine, SSH into that system first, and then use
scp from that machine to transfer data to Spartan.
Repeatedly transferring large files in and out of Spartan via
scp can be inefficient. A good alternative is rsync, which only transfers the parts that have changed. It can work on single files or whole directories, and the syntax is much the same as for scp:
$ rsync local.dat username@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/$myproject/remote.dat
Note that the first argument is the source, and the second is the destination which will be modified to match the source.
The rsync application can also copy directories and protect the destination, ensuring that files that have been modified more recently at the destination are not overwritten:
$ rsync -avz --update source/ username@remotemachine:/path/to/destination
To force a destination to synchronise absolutely with the source, use the
--delete flag, which removes files at the destination that are not present in the source. Try this with the
--dry-run option first!
$ rsync -avz --delete source/ username@remotemachine:/path/to/destination
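The advice to preview a `--delete` run can be folded into a small wrapper. The sketch below always does a `--dry-run` pass first and only performs the real transfer when explicitly confirmed. The function name and the `yes` confirmation argument are hypothetical; the rsync options are the ones used above.

```shell
# Preview an rsync --delete run, and only perform it when confirmed.
# Assumption: this wrapper and its "yes" argument are illustrative only;
# -a, -v, -z, --delete and --dry-run are standard rsync options.
mirror_with_preview() {
    local src="$1"
    local dest="$2"
    local confirm="${3:-no}"

    # First pass: report what would be transferred or deleted, change nothing.
    rsync -avz --delete --dry-run "$src" "$dest" || return 1

    # Second pass: run for real only when explicitly confirmed.
    if [ "$confirm" = "yes" ]; then
        rsync -avz --delete "$src" "$dest"
    fi
}

# Example usage:
#   mirror_with_preview source/ username@remotemachine:/path/to/destination
#   mirror_with_preview source/ username@remotemachine:/path/to/destination yes
```

Running the first form and reading the list of deletions before repeating the call with `yes` guards against a mistyped source or destination path.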
Research Computing Services provides a data management service utilising the Mediaflux platform. This platform provides a persistent location for research data and metadata. To aid integration between Mediaflux and Spartan, Java clients are available on Spartan, allowing data to be downloaded from and uploaded to Mediaflux. Details on Mediaflux integration with Spartan can be found in the Mediaflux support wiki.
Research Computing Services provides an object storage service with an S3-compatible layer. Data can be archived from Spartan to this service, and retrieved to be analysed later. For more information, please see our wiki.
Data and Storage Solutions Beyond Spartan
The University offers a range of other data storage and management solutions to meet your needs, beyond the short-term storage available on Spartan.
In some cases it's possible to integrate these resources with your account on Spartan to streamline workflows. Get in touch.