Managing Data
Where to Store Your Data on Spartan
Many HPC jobs have large datasets. There are a number of places to store data on Spartan, and different ways to get data in and out. Some storage locations have a faster interconnect than others. If a job involves a large dataset with significant I/O, a slow connection between the compute node and the storage device will hurt job performance. In all but the smallest jobs, it is best to have data close to compute (physically, with a fast connection).
Warning
Note that /home, /data/gpfs and /data/scratch are all network-based storage that can be accessed by multiple nodes and processes at the same time across the whole of Spartan. Take care that you don't inadvertently write to the same file from multiple jobs at the same time.
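One simple way to avoid such collisions is to build the job ID into each output path. A minimal sketch (the file name and fallback value are illustrative; Slurm sets SLURM_JOB_ID inside a job):

```shell
# Give each job its own output file so concurrent jobs never write to
# the same path. Slurm sets SLURM_JOB_ID automatically inside a job;
# the "manual" fallback only applies when run outside a job.
OUT="results_${SLURM_JOB_ID:-manual}.txt"
echo "writing results to $OUT"
```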
Warning
While it's often essential to have fast nearby storage while working on your data, please don't use Spartan as a long-term data repository.
Spartan is not a data storage platform, and the project filesystem is only to be used for computational storage (i.e. data actively being analysed). Any data for archiving or not being used should be uploaded to Mediaflux and removed from Spartan.
Home Directory
A user's home directory, i.e. /home/$username, can be used to store small amounts of data, though this is generally discouraged. It is best suited to short-lived, non-critical data specific to the user. Other project members won't have access to this data, and the quota is 50 GB. You can check quota and usage with the command check_home_usage, e.g.,
[new-user@spartan ~]$ check_home_usage
Project Directories
Your project directory is the best place to store research data while you're working on it. It's located at /data/gpfs/projects/$projectID. The project directories are backed up nightly, but do not have snapshot capability.
All members of a project can access this datastore, and 500 GB of storage is available per project. If more storage is required, contact the helpdesk. In general, for University of Melbourne users, 1 TB of project storage is available upon request, and up to 10 TB is possible after consultation on needs and data management strategies. Project storage beyond 10 TB will generally require some form of co-investment.
You can check project quota and usage with the command check_project_usage, e.g.,
[new-user@spartan ~]$ check_project_usage
myproject has used 3997GB out of 8000GB in /data/gpfs/projects/myproject
myproject1 has used 265GB out of 500GB in /data/gpfs/projects/myproject1
Scratch Directories
The scratch filesystem is a very fast, NVMe flash-based filesystem, suitable for temporary data needed by jobs.
It's located at /data/scratch/projects/$projectID, which is shared across multiple nodes. Note that the scratch directory is not backed up and does not have snapshot capability.
Files in the /data/scratch directory have a nominal lifetime of 60 days. Files can and will be deleted by Spartan admins to ensure that /data/scratch does not fill up.
If you wish to use /data/scratch, please submit a request for a scratch filesystem directory.
Local temp space
On each of the physical nodes there is a fast NVMe PCIe card, which provides the fastest filesystem available to your jobs. The usual capacity per node is 1.8 TB, which is shared between all jobs running on that node.
You can use it by writing to /tmp on each node. Note that /tmp is local to each job and each worker node. It is automatically cleaned once the job has finished.
If you are using /tmp on the node, add the appropriate resource request to your submit script to request X GB of /tmp space. This is not a quota; it simply ensures that jobs aren't scheduled to nodes where other jobs have already used up the /tmp space.
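As a sketch, a submit script using node-local /tmp might look like the following, assuming Slurm's standard --tmp option, which requests a minimum amount of temporary disk per node (the 100G figure and time limit are just example values):

```shell
#!/bin/bash
# Sketch of a submit script using node-local /tmp.
# --tmp asks the scheduler to place the job on a node with at least
# this much temporary disk free (100G is an arbitrary example value).
#SBATCH --tmp=100G
#SBATCH --time=01:00:00

cd /tmp                      # work on the fast node-local filesystem
echo "working in $PWD"
```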
Shared datasets
Spartan has some commonly used datasets set up in a shared location, so you don't need to store them in your own area.
To access them, please join the Software group using Karaage (go to Karaage, click Software Agreements -> Add software).
| Dataset | Description | Location on Spartan |
|---|---|---|
| Connectome | The Human Connectome HCP dataset | /data/gpfs/datasets/connectome |
| Imagenet | The Imagenet blurred dataset (ILSVRC 2012–2017, face obscured) | /data/gpfs/datasets/Imagenet |
| Imagenet | The Imagenet CLS-LOC dataset | /data/scratch/datasets/Imagenet |
| Alphafold | The Alphafold datasets (last updated 1/Feb/2022) | /data/scratch/datasets/alphafold |
| KRAKEN2 | The Kraken2 database | /data/gpfs/datasets/KRAKEN2 |
| KRAKEN | The Kraken database | /data/gpfs/datasets/KRAKEN |
| GTDB-Tk | The GTDB-Tk Genome Taxonomy database (versions 202 and 207) | /data/gpfs/datasets/GTDBtk |
| Colabfold | The Colabfold database (downloaded 1/05/2023) | /data/gpfs/datasets/mmseqs |
Staging
Local disk is typically faster than shared storage. Spartan has /home for home directories (slower), /data/gpfs/projects/$projectID (faster), /data/scratch/projects/$projectID for temporary data (even faster), and, as local disk, /tmp (fastest, not shared). To use the latter you will need to copy data between these locations within a job script.
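The staging pattern can be sketched as follows. The mktemp directories merely stand in for the shared project directory and node-local /tmp so the sketch can run anywhere, and the tr step stands in for the real computation; in a job script you would use the actual paths above:

```shell
# Stage in -> compute -> stage out, simulated with temporary
# directories standing in for the project directory and /tmp.
PROJECT=$(mktemp -d)   # stand-in for /data/gpfs/projects/<project>
LOCAL=$(mktemp -d)     # stand-in for /tmp on the compute node
echo "input" > "$PROJECT/input.dat"

cp "$PROJECT/input.dat" "$LOCAL/"                          # stage in
tr 'a-z' 'A-Z' < "$LOCAL/input.dat" > "$LOCAL/output.dat"  # compute
cp "$LOCAL/output.dat" "$PROJECT/"                         # stage out
cat "$PROJECT/output.dat"
```

Only the staged-back copy in the project directory survives once /tmp is cleaned at job end, so always copy results off local disk before the job finishes.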
How to Transfer Data In and Out of Spartan
Secure Copy (scp)
The scp command can be used to move data from your local machine to Spartan. For example, to move local.dat from the current working directory on a local machine to a project directory on Spartan:
$ scp local.dat myusername@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/$myproject/remote.dat
Files can be transferred from Spartan to a local machine by reversing the order of the arguments:
$ scp myusername@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/$myproject/$remote.dat local.dat
Entire directories can be copied with the -r flag:
$ scp -r myusername@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/$myproject/$remotedir/ .
Both these examples assume the commands are run from the local machine. One cannot normally initiate a copy from Spartan to a local system, as local systems typically do not have a public fixed IP address.
For Windows users, PuTTY provides an equivalent tool called pscp. If data is located on another remote machine, SSH into that system first, and then use scp from that machine to transfer data to Spartan.
For a GUI interface, applications like FileZilla (cross-platform) or CyberDuck (OS X & Windows) are suggested.
rsync
Repeatedly transferring large files in and out of Spartan via scp can be inefficient. A good alternative is rsync, which only transfers the parts of files that have changed. It can work on single files or whole directories, and the syntax is much the same as for scp.
$ rsync local.dat myusername@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/$myproject/remote.dat
Note that the first argument is the source, and the second is the destination which will be modified to match the source.
The rsync application can copy whole directories, and the --update flag protects destination directories, ensuring that files modified more recently at the destination are not overwritten:
$ rsync -avz --update source/ username@remotemachine:/path/to/destination
To force the destination to synchronise exactly with the source, use the --delete flag. Consider running this with the -n (--dry-run) option first, which reports what would be deleted without changing anything!
$ rsync -avz --delete source/ username@remotemachine:/path/to/destination
Mediaflux Integration
Research Computing Services provides a data management service built on the Mediaflux platform. This platform provides a persistent location for research data and metadata. To aid integration between Mediaflux and Spartan, Java clients are available on Spartan, allowing data to be downloaded from and uploaded to Mediaflux. Details on Mediaflux integration with Spartan can be found in the Mediaflux support wiki.
S3-compatible storage
Research Computing Services provides an object storage service with an S3-compatible layer. Data can be archived from Spartan to this service and retrieved for later analysis. For more information, please see our wiki.
Data and Storage Solutions Beyond Spartan
The University offers a range of other data storage and management solutions to meet your needs beyond the short-term storage available on Spartan.
In some cases it's possible to integrate these resources with your account on Spartan to streamline workflows. Get in touch with the helpdesk to find out more.