DMTCP Checkpointing

Overview

The application DMTCP (Distributed MultiThreaded Checkpointing) is available on Matilda to assist users in checkpointing jobs when the application being used does not have checkpointing built in. This document explains the principles of Checkpoint/Restart, describes how it can be used in a high-performance computing environment, and provides examples for a couple of basic use cases that can be used as-is or modified for use on Matilda.

To see what versions of DMTCP are available, you may use the command:

module av DMTCP

To load DMTCP, simply use the command:

module load DMTCP/<version>
or
module load DMTCP (for the default)
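
As a quick check after loading the module, you can confirm the DMTCP tools are on your path and see which version was loaded (a minimal sketch; the exact version string depends on the module loaded):

[someuser@hpc-login-p01 ~]$ module load DMTCP
[someuser@hpc-login-p01 ~]$ which dmtcp_launch
[someuser@hpc-login-p01 ~]$ dmtcp_launch --version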

Checkpoint Restart Basics

The basic principle behind Checkpoint/Restart involves taking a snapshot of a job's running processes and memory state (the Checkpoint) and writing it to a "Restart" file prior to the job's termination. The user may then "Restart" the job from the point where the most recent checkpoint was created (or any older checkpoint, if desired) using the Restart file. Some applications have checkpointing built in (such as NWChem), but many do not. This is where applications like DMTCP can prove useful. While DMTCP is not capable of providing Checkpoint/Restart for all applications (for example, MPI jobs), it can provide this capability for many serial and multi-threaded (OpenMP) jobs run on a single node.

The primary benefit of Checkpoint/Restart (C/R) is fault-tolerance, especially in HPC environments like Matilda. The fault-tolerance provided by DMTCP can be beneficial in situations like:

  • Exceeding scheduled walltime
  • Exceeding other allocated resources such as memory
  • Loss of connectivity to an I/O device (networking issues)
  • Unexpected failure of one process thread in a multi-threaded run
  • Failure of a worker in a distributed node environment (MPI)
  • Hardware failures

C/R is especially useful in HPC environments, where jobs generally place high demands on compute resources and run for long periods of time. For example, if a user allocates 80 hours for a job and the job cannot complete within the allotted time, then without C/R all 80 hours of work are lost, and the run must be repeated with a longer walltime that will hopefully be sufficient to finish the job.

At this time, DMTCP does not provide C/R for MPI (distributed) jobs, but MPI capability is being worked on by the developers. Thus, it is only appropriate for serial and multi-threaded jobs run on a single node.

Using DMTCP

The following subsections provide a couple of basic examples involving the use of DMTCP on Matilda.

DMTCP and Batch Jobs

Checkpointing

The job script below could be used to initiate the first run of a job for which checkpointing is desired:

### dmtcpStart.sh example job script
#!/bin/bash
#SBATCH --job-name=DMTCP_Test
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=1

export CHKDIR=/scratch/users/someuser/chkpts
mkdir -p $CHKDIR
module load DMTCP/3.1.2
dmtcp_launch --ckptdir $CHKDIR -i 300 ~/myapplication <args>

where:

  • We create a directory CHKDIR to hold our job checkpoints
  • The DMTCP modulefile is loaded
  • The command "dmtcp_launch" is used to start the job
  • The option "--ckptdir" specifies where checkpoints should be saved
  • The option "-i" specifies the interval (in seconds) at which new checkpoints will be created

Launching a job in this way will automatically create a "dmtcp_coordinator" process on the node. This coordinator controls the creation of checkpoints and can also be created manually if desired.

Manually creating a coordinator using a separate terminal affords the user real-time control over checkpointing and other behavior, and generally works best for interactive jobs, or combination job scripts that control initial starts and restarts (covered later in this article).
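
For example, a coordinator could be started by hand and the launch pointed at it through the DMTCP_COORD_PORT environment variable, as in the sketch below. This follows the same pattern used in the combined start/restart script later in this article; the /tmp/port location is just an example:

## Start a coordinator in the background on a random port and record the port
dmtcp_coordinator --daemon --port 0 --port-file /tmp/port
export DMTCP_COORD_PORT=$(</tmp/port)
## Launch the application; it connects to the coordinator on the recorded port
dmtcp_launch --ckptdir $CHKDIR -i 300 ~/myapplication <args>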

Restarting

If we take a look at our checkpoint directory after the run above concludes, we might see something like the following:

[someuser@hpc-login-p01 ~]$ ls -1 /scratch/users/someuser/chkpts
ckpt_platform-python3.6_5d66181938e54cfc-40000-1fad1569e43d92.dmtcp
dmtcp_restart_script_5d66181938e54cfc-40000-1fad15683ee132.sh
dmtcp_restart_script.sh

where:

  • The file beginning "ckpt_platform-" and ending in ".dmtcp" is our checkpoint file
  • The file beginning "dmtcp_restart_script_5d..." is a restart script created to expedite relaunching the job
  • The file "dmtcp_restart_script.sh" is a symbolic link to the long file name above.

PLEASE NOTE: On Matilda, do NOT use the "dmtcp_restart_script" files for batch jobs, as these scripts do not contain the necessary SLURM resource manager directives. However, these scripts CAN be used when running interactive jobs (covered later).

To restart the job, we might use a job script like the following:

### dmtcpRestart.sh example job script
#!/bin/bash
#SBATCH --job-name=DMTCP_Restart_Test
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=1

export CHKDIR=/scratch/users/someuser/chkpts
module load DMTCP/3.1.2
restartFile=$(ls -lart $CHKDIR/ckpt_*.dmtcp | tail -1 | tr -s ' ' | cut -d ' ' -f9)
dmtcp_restart -i 300 --ckptdir $CHKDIR $restartFile

sleep 30

where:

  • We export the location of CHKDIR originally specified in the initial start script
  • The command that saves "restartFile" ensures that the latest checkpoint is used (if there is more than one)
  • The command "dmtcp_restart" is used to relaunch the job using the checkpoint (*.dmtcp)
  • The option "-i 300" is used to specify that checkpointing should continue at an interval of 300 seconds
  • The "sleep" command is used to delay termination so that any checkpoint that might be in-progress can complete

Note that it is important to make sure our restarted job continues to checkpoint in case it still does not finish (otherwise, if it fails again, we would have to restart from the same checkpoint we just used).

The restart job script above can be used as many times as necessary until the job is completed, each time using the most recent checkpoint file as the starting point.
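
Putting these two scripts together, a typical workflow might look like the following (using the example script names from above):

[someuser@hpc-login-p01 ~]$ sbatch dmtcpStart.sh      # initial run, writes checkpoints to $CHKDIR
[someuser@hpc-login-p01 ~]$ sbatch dmtcpRestart.sh    # if the job did not finish, resume from the latest checkpoint
[someuser@hpc-login-p01 ~]$ sbatch dmtcpRestart.sh    # repeat as needed until the job completes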

Checkpoint and Restart

It may be convenient to combine the initial checkpoint run and the subsequent restarts into a single script that can be requeued until the job completes. Here is an example:

### Sample Job Script dmtcpStRest.sh
#!/bin/bash
#SBATCH --job-name=DMTCP_Start_Restart
#SBATCH --requeue           # the "requeue" flag is important to tell SLURM this is a requeuable job
#SBATCH --signal=B:10@30    # send the signal `10` at 30s before job times out
#SBATCH --open-mode=append  # append output from restarted jobs to the same output file
#SBATCH --time=10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

## Requeue the preempted job when Signal 10 is issued (see --signal above)
## The "trap" command captures signal 10 and implements "scontrolrequeue"
##
trap "echo -n 'TIMEOUT @ '; date; echo 'Resubmitting...'; scontrol requeue ${SLURM_JOBID}  " 10

## Specify an environmental variable for checkpoints, create the directory if it doesn't exist
##
export CHKDIR=/scratch/users/someuser/chkpts
mkdir -p $CHKDIR
module load DMTCP/3.1.2

## The variable "cnt" tracks how many times the job is restarted, using
## the SLURM env variable "SLURM_RESTART_COUNT" which is incremented every time
## the job is restarted.
##
cnt=${SLURM_RESTART_COUNT-0}
echo "SLURM_RESTART_COUNT = $cnt"

## Start the DMTCP coordinator manually at the beginning of each run,
## and specify a random port
##
dmtcp_coordinator -i 5 --daemon --port 0 --port-file /tmp/port
export DMTCP_COORD_PORT=$(</tmp/port)

## On initial start, cnt=0, so we use "launch"
## On subsequent restarts (cnt > 0) we use "restart"
##
if [[ $cnt -eq 0 ]]
then
    echo "doing launch"
    rm -f $CHKDIR/*.dmtcp
    dmtcp_launch --ckptdir $CHKDIR -i 60 -j ~/myapplication <args> &
elif [[ $cnt -gt 0 ]]; then
    echo "doing restart"
    restartFile=$(ls -lart $CHKDIR/ckpt_*.dmtcp | tail -1 | tr -s ' ' | cut -d ' ' -f9)
    dmtcp_restart --ckptdir $CHKDIR -i 60 -j $restartFile &
else
    echo "Failed to restart the job, exit"; exit
fi
wait

where:

  • We leverage the SLURM environment variable SLURM_RESTART_COUNT to track restarts
  • The "--requeue" flag tells SLURM this job may be requeued. This permits repeated restarts until the job completes.
  • The "--signal" flag sends a signal (here, signal 10) to the batch script a specified number of seconds (here, 30) before the job's walltime expires.
  • By using "--signal" and "--requeue" together, we can trigger automatic requeuing of the job if it does not finish.
  • Note that we run the "dmtcp_launch" and "dmtcp_restart" commands in the background using "&" and then "wait" on them. There is a difference between preemption and timing out: requeuing happens on preemption, so we want the job to be preempted (via the trapped signal and "scontrol requeue") about 30 seconds before the walltime is reached, rather than letting it time out. Running the application in the background allows the batch script to act on the trapped signal while the application is still running.
  • Please note the comments in the job script example above

As written, the job script example above will perform an initial launch of our job. The job is designed to permit preemption before it is timed out, so that it can be requeued and then restarted using the checkpoints. Restarts will continue automatically until the job completes.
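
To use this approach, submit the combined script once; SLURM will requeue it automatically when the trapped signal fires. A brief sketch of submitting and monitoring it (assuming the script is saved as dmtcpStRest.sh) is shown below:

[someuser@hpc-login-p01 ~]$ sbatch dmtcpStRest.sh
[someuser@hpc-login-p01 ~]$ squeue -u someuser                          # the same job ID reappears after each requeue
[someuser@hpc-login-p01 ~]$ grep SLURM_RESTART_COUNT slurm-<jobid>.out  # restart count echoed by the script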

DMTCP and Interactive Jobs

In addition to batch processing, DMTCP can be run interactively. An additional layer of control can be added by manually launching the dmtcp_coordinator in one terminal, while launching checkpointable runs using "srun" from another.

Initial Job Allocation

We can start our DMTCP interactive job using a method we might use for any other SLURM interactive job.

[someuser@hpc-login-p01 ~]$ srun -N 1 -c 1 -t 30:00 --pty /bin/bash --login
[someuser@hpc-throughput-p10 ~]$ module load DMTCP

Then open a second SSH terminal to our interactive job node (here, hpc-throughput-p10), and start the coordinator:

[someuser@hpc-login-p01 ~]$ ssh hpc-throughput-p10
[someuser@hpc-throughput-p10 ~]$ module load DMTCP
[someuser@hpc-throughput-p10 ~]$ dmtcp_coordinator --port 0 --port-file /tmp/port
dmtcp_coordinator starting...
    Host: hpc-throughput-p10.cm.cluster (172.30.0.100)
    Port: 36889
    Checkpoint Interval: disabled (checkpoint manually instead)
    Exit on last client: 0
Type '?' for help.

[2024-12-19T14:23:50.502, 2158496, 2158496, Note] at coordinatorplugin.h:205 in tick; REASON='No active clients; starting stale timeout
     theStaleTimeout = 28800
dmtcp>

Notice that we are left with a prompt "dmtcp>". From here we can execute various commands to the coordinator.

**IMPORTANT**: Note the Port Number provided in the output from the dmtcp_coordinator command.

Initial Job Launch

Now, moving back to the first terminal we opened using "srun", we can perform our initial job launch, passing the coordinator's port number (noted above) with the "-p" option:

[someuser@hpc-throughput-p10 ~]$ dmtcp_launch -j -p 36889 ./someprogram.py <args>
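
Alternatively, since the coordinator above was started with "--port-file /tmp/port", the port can be picked up from that file rather than typed by hand; dmtcp_launch reads the coordinator port from the DMTCP_COORD_PORT environment variable. A minimal sketch, assuming both terminals are on the same node:

[someuser@hpc-throughput-p10 ~]$ export DMTCP_COORD_PORT=$(</tmp/port)
[someuser@hpc-throughput-p10 ~]$ dmtcp_launch -j ./someprogram.py <args>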

Manual Coordination (optional)

In the case above where we manually run the coordinator in a separate terminal, we can issue various commands in the "dmtcp>" prompt to control our job. These include:

  • c - checkpoint
  • s - status
  • q - kill all jobs and quit

So for example, to checkpoint:

dmtcp> c
[2024-12-19T14:31:27.018, 2159473, 2159473, Note] at dmtcp_coordinator.cpp:1164 in startCheckpoint; REASON='starting checkpoint; incrementing generation; suspending all nodes
     s.numPeers = 1
     compId.computationGeneration() = 1
[2024-12-19T14:31:27.777, 2159473, 2159473, Note] at dmtcp_coordinator.cpp:496 in releaseBarrier; REASON='Checkpoint complete; all workers running
[2024-12-19T14:31:27.902, 2159473, 2159473, Note] at dmtcp_coordinator.cpp:559 in recordCkptFilename; REASON='Checkpoint complete. Wrote restart script

Or to check the status:

dmtcp> s
Status...
Host: hpc-throughput-p10.cm.cluster (172.30.0.100)
Port: 36889
Checkpoint Interval: disabled (checkpoint manually instead)
Exit on last client: 0
Kill after checkpoint: 0
Computation Id: 5d66181938e54bb6-40000-22729b72dc0a17
Checkpoint Dir: /home/s/someuser
NUM_PEERS=1
RUNNING=yes

Restarting

To restart the job in this example, we could issue a new "srun" to create a new interactive job (and open a new terminal as shown above), or if the current interactive job is still running we could simply use:

dmtcp_restart -j -p 36889 ckpt_a.out_*.dmtcp
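
If the original allocation has already ended, the restart follows the same pattern as the initial interactive launch: request a new allocation, start a coordinator in a second terminal on the new node, note its port, and restart from the most recent checkpoint. A brief sketch (the node name and port shown are hypothetical for the new allocation):

[someuser@hpc-login-p01 ~]$ srun -N 1 -c 1 -t 30:00 --pty /bin/bash --login
[someuser@hpc-throughput-p11 ~]$ module load DMTCP
## (start dmtcp_coordinator in a second terminal on this node, as shown earlier, and note its port)
[someuser@hpc-throughput-p11 ~]$ dmtcp_restart -j -p <port> ckpt_a.out_*.dmtcp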

More Information

More information about DMTCP, including full documentation and additional usage examples, is available from the DMTCP project site and manual pages.


CategoryHPC