Checkpoint

Checkpointing is the action of saving the state of a running job's process to a file, with no modifications to user code or to the OS. You can later restart the job from the checkpoint file, and it will continue running from where it left off.

On Spartan we use Distributed MultiThreaded Checkpointing (DMTCP) to do this.

Checkpointing:

1: Add the following lines to your existing Slurm script (currently it only works with GCCcore/8.3.0 or foss/2019b):

module load GCCcore/11.3.0
module load DMTCP/3.0.0 

2: Before the executable:

. start_coordinator
start_coordinator -i 180

(180 is the checkpoint interval in seconds; replace 180 with the interval at which you would like to checkpoint.)

3: Launch the job with:

dmtcp_launch -j YOUR_EXECUTABLE

4: Once the job has checkpointed, you will find three files:

ckpt_count_*.dmtcp

dmtcp_restart_script_*.sh

dmtcp_restart_script.sh
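
The checkpointing steps above might be combined into one Slurm script along these lines. This is only a sketch under assumptions: the file name checkpoint_job.slurm, the #SBATCH values and the program name ./my_program are hypothetical placeholders, while the module versions and the 180-second interval are taken from the steps above.

```shell
# Write an example batch script to checkpoint_job.slurm (hypothetical file name).
cat > checkpoint_job.slurm <<'EOF'
#!/bin/bash
# Placeholder resource requests; adjust these for your own job.
#SBATCH --job-name=ckpt-demo
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Step 1: load the toolchain and DMTCP modules.
module load GCCcore/11.3.0
module load DMTCP/3.0.0

# Step 2: start the DMTCP coordinator, checkpointing every 180 seconds.
. start_coordinator
start_coordinator -i 180

# Step 3: launch the program under DMTCP (./my_program is a placeholder).
dmtcp_launch -j ./my_program
EOF

# Submit it on the cluster with:
#   sbatch checkpoint_job.slurm
```

Choose the interval with the job's run time in mind: a shorter interval loses less work after a failure but writes checkpoint files more often.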

Restarting:

Repeat steps 1 and 2 above if you would like the restarted job to keep checkpointing.

Restart the application from the checkpoint files using the generated restart script:

./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
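
If the same job script is submitted repeatedly, it may be convenient to let it decide between a fresh launch and a restart. The sketch below assumes that the presence of dmtcp_restart_script.sh in the working directory means a checkpoint is available; choose_mode is a hypothetical helper written for this illustration, not part of DMTCP, and ./my_program is a placeholder.

```shell
#!/bin/bash
# Hypothetical helper: prints "restart" if dmtcp_restart_script.sh exists in
# the given directory (default: current directory), otherwise "launch".
choose_mode() {
    local dir="${1:-.}"
    if [ -f "$dir/dmtcp_restart_script.sh" ]; then
        echo "restart"
    else
        echo "launch"
    fi
}

# Example use inside a job script, after steps 1 and 2 (left as comments
# because it needs the cluster environment):
#   if [ "$(choose_mode)" = "restart" ]; then
#       ./dmtcp_restart_script.sh -h $DMTCP_COORD_HOST -p $DMTCP_COORD_PORT
#   else
#       dmtcp_launch -j ./my_program
#   fi
```

With this pattern, old checkpoint files should be removed from the directory before starting an unrelated run, or the script would restart the previous one.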

Please see /apps/examples/checkpoint for examples.