Checkpoint
Checkpointing is the action of saving the state of a running job's processes to a file, with no modifications to user code or to the operating system. The job can later be restarted from the checkpoint file, and it will continue running from where it left off.
On Spartan we use Distributed MultiThreaded Checkpointing (DMTCP) for this.
Checkpointing:
1: Add the following lines to your existing Slurm script (currently this only works with gcccore/8.3.0 or foss/2019b):
2: Before the executable:
. start_coordinator
start_coordinator -i 180 (the checkpoint interval in seconds; replace 180 with the interval you would like between checkpoints)
3: Launch the job with:
4: Once the job has checkpointed, you will find three files:
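As a rough sketch of how steps 1 to 3 might fit together, the snippet below writes a hypothetical job script, checkpoint_job.slurm, to the current directory. The job name, resource requests, the module name dmtcp, and the application name ./my_app are assumptions made for illustration. The sketch also assumes that start_coordinator makes the coordinator's address available to dmtcp_launch. The maintained examples live under /apps/examples/checkpoint.

```shell
#!/bin/bash
# Write a hypothetical Slurm job script that checkpoints ./my_app every 180 s.
# Assumptions (not taken from the Spartan documentation): the job name,
# the resource requests, the dmtcp module name, and the application ./my_app.
cat > checkpoint_job.slurm <<'EOF'
#!/bin/bash
#SBATCH --job-name=ckpt-demo
#SBATCH --ntasks=1
#SBATCH --time=01:00:00

# Step 1: load a supported toolchain and DMTCP (the dmtcp module name is assumed).
module load foss/2019b
module load dmtcp

# Step 2: source start_coordinator, then start a coordinator with a
# 180-second checkpoint interval.
. start_coordinator
start_coordinator -i 180

# Step 3: run the application under DMTCP control.
dmtcp_launch ./my_app
EOF

echo "wrote checkpoint_job.slurm"
```

The generated script would then be submitted with sbatch checkpoint_job.slurm.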
Restarting:
Repeat steps 1 and 2 above if you would like the restarted job to keep checkpointing.
Restart the application from the checkpoint files using the dmtcp_restart command.
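One way to do this is sketched below as a small shell function. It assumes the checkpoint images follow DMTCP's default naming, ckpt_*.dmtcp, and sit in a single directory. The function name restart_from_checkpoint is hypothetical; only dmtcp_restart itself comes from DMTCP.

```shell
#!/bin/sh
# Hypothetical helper: restart from every DMTCP checkpoint image
# (files matching ckpt_*.dmtcp, DMTCP's default image naming) in a directory.
restart_from_checkpoint() {
    dir="${1:-.}"
    # Expand the glob into the positional parameters; if nothing matches,
    # "$1" is the unexpanded pattern (or empty) and the -e test fails.
    set -- "$dir"/ckpt_*.dmtcp
    if [ ! -e "$1" ]; then
        echo "no checkpoint images found in $dir" >&2
        return 1
    fi
    # Pass all images to dmtcp_restart, one argument per file.
    dmtcp_restart "$@"
}
```

In a restart job script this might be called as restart_from_checkpoint "$SLURM_SUBMIT_DIR", placed after the start_coordinator lines if the restarted job should keep checkpointing.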
Please see /apps/examples/checkpoint for examples.