

Checkpointing is the action of saving the state of a running job to a file, with no modifications to user code or to the operating system. The job can later be restarted from the checkpoint file, and it will continue running from where it left off.

On Spartan we use Distributed MultiThreaded Checkpointing (DMTCP) to achieve this.


1: Add the following lines to your existing Slurm script (currently it only works for GCCcore/8.3.0 or foss/2019b):

module load GCCcore/11.3.0
module load DMTCP/3.0.0 

2: Before the executable:

. start_coordinator
start_coordinator -i 180 (180 seconds; replace 180 with the interval at which you would like to checkpoint)

3: Launch the job with:

dmtcp_launch -j YOUR_EXECUTABLE
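
Steps 1 to 3 can be collected into a single job script. The sketch below writes such a script to a file so it can be inspected before submission. The file name checkpoint_job.slurm, the #SBATCH values and the executable ./my_app are placeholders chosen for illustration, not part of the Spartan setup.

```shell
#!/bin/bash
# Write a minimal Slurm job script that runs an application under DMTCP.
# Hypothetical placeholders: checkpoint_job.slurm, the #SBATCH values, ./my_app.
JOB_SCRIPT=checkpoint_job.slurm

cat > "$JOB_SCRIPT" <<'EOF'
#!/bin/bash
#SBATCH --job-name=dmtcp-example
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Step 1: load the toolchain and DMTCP modules
module load GCCcore/11.3.0
module load DMTCP/3.0.0

# Step 2: source the helper, then start a coordinator
# that checkpoints every 180 seconds
. start_coordinator
start_coordinator -i 180

# Step 3: launch the application under DMTCP, joining that coordinator
dmtcp_launch -j ./my_app
EOF

echo "Wrote $JOB_SCRIPT"
```

The generated file can then be submitted with sbatch checkpoint_job.slurm.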

4: Once the job has checkpointed, you will find three files:




Repeat steps 1 and 2 above if you would like to checkpoint again.

Restart the application from the checkpoint files using the dmtcp_restart command.
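
A job script can decide for itself whether to start fresh or restart. The helper below is a minimal sketch: it assumes the checkpoint images follow DMTCP's usual ckpt_*.dmtcp naming and sit in the directory you pass in, and that the paths contain no spaces. The function name choose_dmtcp_command and the executable ./my_app are hypothetical. It only prints the command so that it can be checked first; depending on how the coordinator was started, dmtcp_restart may need additional coordinator options.

```shell
#!/bin/sh
# Print the DMTCP command a job should run:
# a restart if checkpoint images exist, otherwise a fresh launch.
# Assumption: images are named ckpt_*.dmtcp inside the given directory.
choose_dmtcp_command() {
    ckpt_dir=$1
    app=$2
    images=""
    for f in "$ckpt_dir"/ckpt_*.dmtcp; do
        # An unmatched glob stays literal, so keep only paths that exist.
        if [ -e "$f" ]; then
            images="$images $f"
        fi
    done
    if [ -n "$images" ]; then
        echo "dmtcp_restart$images"
    else
        echo "dmtcp_launch -j $app"
    fi
}
```

Calling choose_dmtcp_command . ./my_app in the job's working directory prints the launch command on the first run and a dmtcp_restart command once checkpoint images are present.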


Please see /apps/examples/checkpoint for examples.