Where to Store Your Data on Spartan

Many HPC jobs have large datasets. There are a number of places to store data on Spartan and different ways to get data in and out. Some directories have a faster interconnect than others. If a job performs I/O on a large dataset, a slow connection between the compute device and the storage device will affect job performance. In all but the smallest jobs, it is best to have data close (physically, with a fast connection) to compute.

Note that /home, /data/gpfs and /data/scratch are all network-based storage that can be accessed by multiple nodes and processes at the same time across the whole of Spartan. Take care that you don't inadvertently write to the same file from multiple jobs at the same time.
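One way to avoid such collisions is to give each job its own output file. The sketch below assumes the scheduler is Slurm, which sets SLURM_JOB_ID inside a running job; the function name and the "results" prefix are hypothetical and only for illustration.

```shell
#!/bin/bash
# Build an output file name that is unique to the current job.
# Assumption: SLURM_JOB_ID is set by the scheduler inside a job.
# Outside a job we fall back to the shell's process ID so the
# sketch still runs.
unique_output_name() {
    local prefix="$1"
    local job_id="${SLURM_JOB_ID:-$$}"
    echo "${prefix}_${job_id}.out"
}

outfile="$(unique_output_name results)"
echo "Writing to ${outfile}"
```

Two jobs that both call unique_output_name with the same prefix will then write to different files, even when they share a directory.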

While it's often essential to have fast nearby storage while working on your data, please don't use Spartan as a long-term data repository. It's not designed for that, and may not conform to the requirements set by an institution or funding body.

Home Directory

A user's home directory, i.e. /home/$username, can be used to store small amounts of data; however, this is generally discouraged. It's best suited to short-lived and non-critical data that is only for the user. Others in a project won't have access to this data, and it is limited to 50GB of storage. Check quota and usage with the command check_home_usage, e.g.,

[new-user@spartan ~]$ check_home_usage
new-user has used 4GB out of 50GB in /home/new-user

Projects Directories

Your projects directory is the best place to store research data while you're working on it. It's located at /data/gpfs/projects/$projectID. The projects directory is backed up nightly, but does not have a snapshot ability.

All members of a project can access this datastore, and 500 GB of storage is available per project. If more storage is required, contact the helpdesk. In general, for University of Melbourne users, 1 TB of project storage is available upon request, and up to 10 TB is possible if needed. Project storage beyond 10 TB will generally require some sort of co-investment.

Check project quota and usage with the command check_project_usage, e.g.,

[new-user@spartan ~]$ check_project_usage
myproject has used 3997GB out of 8000GB in /data/gpfs/projects/myproject
myproject1 has used 265GB out of 500GB in /data/gpfs/projects/myproject1

Scratch Space

Temporary working data can be stored at /tmp while your job is running. This maps to a job-specific directory on the faster scratch network storage, which is cleaned up once the job is complete. It's also possible to write directly to /data/scratch/projects/$projectID, which is shared across multiple nodes. Note that the scratch directory is not backed up, and does not have a snapshot ability.

Files in the /data/scratch directory have a nominal lifetime of 60 days. Files can be deleted to ensure that /data/scratch does not fill up.
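To see which files are getting close to that lifetime, something like the following may help. It is a minimal sketch that assumes file age is judged by modification time, which is not stated here; the function name, the 50-day threshold and the project ID in the usage comment are hypothetical.

```shell
#!/bin/bash
# List regular files under a directory that have not been modified
# for more than the given number of days.
# Assumption: "lifetime" is measured from the last modification time.
list_old_files() {
    local dir="$1"
    local days="$2"
    find "$dir" -type f -mtime +"$days" -print
}

# Hypothetical usage: files untouched for more than 50 days.
# list_old_files /data/scratch/projects/myproject 50
```

Anything the listing shows that is still needed can then be copied to the projects directory before it is removed from scratch.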

If you wish to use /data/scratch, please submit a request for a scratch filesystem directory.

Shared datasets

Spartan has some commonly used datasets set up in a shared location so you don't need to store them in your own area.

To access them, please join the Software group using Karaage (go to Karaage, click Software Agreements->Add software).

Connectome: The Human Connectome HCP dataset is stored on Spartan in /data/gpfs/datasets/connectome

Imagenet: The Imagenet blurred dataset is stored on Spartan in /data/gpfs/datasets/Imagenet

Staging

Local disk is typically faster than shared disks. Spartan has /home (slower), /data/gpfs/projects/$projectID (faster), /data/scratch/projects/$projectID for temporary data (even faster), and, as local disk, /var/local/tmp (fastest, not shared). To use the latter, you will need to copy data between these locations within a job script.
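The staging pattern can be sketched as a shell function for use inside a job script: copy input to local disk, run the work there, then copy results back to shared storage. This is an illustration under stated assumptions rather than a recommended implementation; the function name, the input/ and output/ layout, and the my_analysis command in the usage comment are all hypothetical.

```shell
#!/bin/bash
# Stage input to node-local disk, run a command there, copy results back.
# All paths are supplied by the caller; nothing here is Spartan-specific.
stage_and_run() {
    local src_dir="$1"     # shared storage holding the input data
    local dest_dir="$2"    # shared storage that should receive results
    local local_base="$3"  # node-local disk, e.g. /var/local/tmp
    shift 3                # remaining arguments are the command to run

    # Private working directory on local disk.
    local work_dir
    work_dir="$(mktemp -d "${local_base}/stage.XXXXXX")" || return 1
    mkdir -p "${work_dir}/input" "${work_dir}/output"

    # Stage in: copy the contents of the source directory.
    cp -r "${src_dir}/." "${work_dir}/input/"

    # Run the command from inside the local working directory.
    ( cd "${work_dir}" && "$@" )
    local status=$?

    # Stage out, then free the local disk for other jobs.
    mkdir -p "${dest_dir}"
    cp -r "${work_dir}/output/." "${dest_dir}/"
    rm -rf "${work_dir}"
    return "${status}"
}

# Hypothetical usage inside a job script (my_analysis is assumed to be
# on the PATH and to read from input/ and write to output/):
# stage_and_run /data/gpfs/projects/myproject/input \
#               /data/gpfs/projects/myproject/results \
#               /var/local/tmp \
#               my_analysis input output
```

Because local disk is not shared, the copy back to shared storage has to happen before the job ends, or the results are lost with the working directory.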

How to Transfer Data In and Out of Spartan

Secure Copy (scp)

The scp command can be used to move data from your local machine to Spartan. For example, to copy local.dat from the current working directory on a local machine to a project directory on Spartan:

$ scp local.dat myusername@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/$myproject/remote.dat

Files can be transferred from Spartan to a local machine by reversing the order of the arguments:

$ scp myusername@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/$myproject/remote.dat local.dat

Entire directories can be copied with the -r flag.

$ scp -r myusername@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/$myproject/$remotedir/ .

These examples assume the commands are run on the local machine. One cannot normally initiate a transfer from Spartan to a local system, as local systems typically do not have a public fixed IP address.

For Windows users, PuTTY provides an equivalent tool called pscp. If data is located on a remote machine, SSH into that system first, and then use scp from that machine to transfer data to Spartan.

For a graphical interface, applications like FileZilla (cross-platform) or Cyberduck (macOS & Windows) are suggested.

rsync

Repeatedly transferring large files in and out of Spartan via scp can be inefficient. A good alternative is rsync, which only transfers the parts that have changed. It can work on single files or whole directories, and the syntax is much the same as for scp.

$ rsync local.dat myusername@spartan.hpc.unimelb.edu.au:/data/gpfs/projects/$myproject/remote.dat

Note that the first argument is the source, and the second is the destination which will be modified to match the source.

The rsync application can also copy directories and protect the destination, ensuring that files that are newer at the destination are not overwritten:

$ rsync -avz --update source/ username@remotemachine:/path/to/destination

To force a destination to synchronise absolutely with the source, use the --delete flag. Try this with the -n (or --dry-run) option first!

$ rsync -avz --delete source/ username@remotemachine:/path/to/destination
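A dry run of the --delete case can be wrapped in a small helper, sketched below. The function name is hypothetical; the rsync options used (-n, --delete, --itemize-changes) are standard, and the destination can be a local path or a username@remotemachine:/path target.

```shell
#!/bin/bash
# Preview what `rsync --delete` would do without changing anything.
preview_sync() {
    local src="$1"
    local dest="$2"
    # -n (--dry-run) reports actions only; --itemize-changes prints
    # one line per affected file, including files that would be deleted.
    rsync -avzn --delete --itemize-changes "${src}/" "${dest}"
}

# Hypothetical usage:
# preview_sync source username@remotemachine:/path/to/destination
```

Lines in the output that mention "deleting" name files that exist only at the destination and would be removed by the real run.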

Mediaflux Integration

Research Computing Services provides a data management service utilising the Mediaflux platform. This platform provides a persistent location for research data and metadata. To aid integration between Mediaflux and Spartan, Java clients are available on Spartan, allowing data to be downloaded from and uploaded to Mediaflux. Details on Mediaflux integration with Spartan can be found in Section 4 of the Mediaflux support wiki.

S3-compatible storage

Research Computing Services provides an object storage service with an S3-compatible layer. Data can be archived from Spartan to this service, and retrieved to be analysed later. For more information, please see our wiki.

Data and Storage Solutions Beyond Spartan

The University offers a range of other data storage and management solutions to meet your needs, beyond the short-term storage available on Spartan.

In some cases it's possible to integrate these resources with your account on Spartan to streamline workflows. Get in touch.