DMTCP Checkpointing
Overview
The application DMTCP (Distributed MultiThreaded Checkpointing) is available on Matilda to assist users in checkpointing jobs when the application in use does not have a checkpointing ability built in. This document explains Checkpoint/Restart principles and how they can be used to assist users in their workflow in a high-performance computing environment. Examples are provided for a couple of basic use cases which can be used as-is, or modified for use on Matilda.
To see what versions of DMTCP are available, you may use the command:
module av DMTCP
To load DMTCP, simply use the command:
module load DMTCP/<version> or module load DMTCP (for the default)
Checkpoint Restart Basics
The basic principle behind Checkpoint/Restart involves taking a snapshot of a job's running processes and memory state (the Checkpoint) and writing it to a "Restart" file prior to the job's termination. The user may then "Restart" the job from the point where the most recent checkpoint was created (or from any older checkpoint, if desired) using the Restart file. Some applications have checkpointing built in (such as NWChem), but many do not. This is where applications like DMTCP can prove useful. While DMTCP is not capable of providing Checkpoint/Restart for all applications (for example, MPI jobs), it can provide this capability for many serial and OpenMP (multi-threaded) jobs run on a single node.
The primary benefit of Checkpoint/Restart (C/R) is fault-tolerance, especially in HPC environments like Matilda. The fault-tolerance provided by DMTCP can be beneficial in situations like:
- Exceeding scheduled walltime
- Exceeding other allocated resources such as memory
- Loss of connectivity to an I/O device (networking issues)
- Unexpected failure of one process thread in a multi-threaded run
- Failure of a worker in a distributed node environment (MPI)
- Hardware failures
C/R is especially useful in HPC environments, where jobs generally involve high compute resource demands and run for long periods of time. For example, if a user allocates 80 hours for a job and the job cannot complete within the allotted time, then without C/R all 80 hours of work are lost and the run must be repeated with a longer walltime that will hopefully be sufficient to finish the job.
At this time, DMTCP does not provide C/R for MPI (distributed) jobs, but MPI capability is being worked on by the developers. Thus, it is only appropriate for serial and multi-threaded jobs run on a single node.
Using DMTCP
The following subsections provide a couple of basic examples involving the use of DMTCP on Matilda.
DMTCP and Batch Jobs
Checkpointing
The job script below could be used to initiate the first run of a job for which checkpointing is desired:
### dmtcpStart.sh example job script
#!/bin/bash
#SBATCH --job-name=DMTCP_Test
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=1

export CHKDIR=/scratch/users/someuser/chkpts
mkdir -p $CHKDIR

module load DMTCP/3.1.2

dmtcp_launch --ckptdir $CHKDIR -i 300 ~/myapplication <args>
where:
- We create a directory CHKDIR to hold our job checkpoints
- The DMTCP modulefile is loaded
- The command "dmtcp_launch" is used to start the job
- The option "--chkptdir" specifies where checkpoints should be saved
- The option "-i" specifies the interval (in seconds) when a new checkpoint will be created
Launching a job in this way will automatically create a "dmtcp_coordinator" process on the node. This coordinator controls the creation of checkpoints and can also be created manually if desired.
Manually creating a coordinator using a separate terminal affords the user real-time control over checkpointing and other behavior, and generally works best for interactive jobs, or combination job scripts that control initial starts and restarts (covered later in this article).
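For reference, a minimal sketch of a manually-started coordinator in a batch context might look like the following (this uses the same options that appear in the combined start/restart script later in this article; the /tmp/port location is just an example):

## Start a coordinator in the background on a random port and record the port in a file
dmtcp_coordinator --daemon --port 0 --port-file /tmp/port

## Point dmtcp_launch at that coordinator, then start the application as before
export DMTCP_COORD_PORT=$(</tmp/port)
dmtcp_launch --ckptdir $CHKDIR -i 300 ~/myapplication <args>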
Restarting
If we take a look at our checkpoint directory after the run above concludes, we might see something like the following:
[someuser@hpc-login-p01 ~]$ ls -1 /scratch/users/someuser/chkpts
ckpt_platform-python3.6_5d66181938e54cfc-40000-1fad1569e43d92.dmtcp
dmtcp_restart_script_5d66181938e54cfc-40000-1fad15683ee132.sh
dmtcp_restart_script.sh
where:
- The file beginning "ckpt_platform-" and ending in ".dmtcp" is our checkpoint file
- The file beginning "dmtcp_restart_script_5d..." is a restart script created to expedite relaunching the job
- The file "dmtcp_restart_script.sh" is a symbolic link to the long file name above.
PLEASE NOTE: On Matilda, do NOT use the "dmtcp_restart_script" files for batch jobs, as these scripts do not contain the necessary SLURM resource manager information. However, these scripts CAN be used when running interactive jobs (covered later).
To restart the job, we might use a job script like the following:
### dmtcpRestart.sh example job script
#!/bin/bash
#SBATCH --job-name=DMTCP_Restart_Test
#SBATCH --time=4:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=40
#SBATCH --cpus-per-task=1

export CHKDIR=/scratch/users/someuser/chkpts

module load DMTCP/3.1.2

restartFile=$(ls -lart $CHKDIR/ckpt_*.dmtcp | tail -1 | tr -s ' ' | cut -d ' ' -f9)

dmtcp_restart -i 300 --ckptdir $CHKDIR $restartFile

sleep 30
where:
- We export the location of CHKDIR originally specified in the initial start script
- The command that saves "restartFile" ensures that the latest checkpoint is used (if there is more than one)
- The command "dmtcp_restart" is used to relaunch the job using the checkpoint (*.dmtcp)
- The option "-i 300" is used to specify that checkpointing should continue at an interval of 300 seconds
- The "sleep" command is used to delay termination so that any checkpoint that might be in-progress can complete
Note that it is important to make sure our restarted job continues to checkpoint in the event the job does not finish (otherwise we will have to restart from the first initial start checkpoint).
The restart job script above can be used as many times as necessary until the job is completed, each time using the most recent checkpoint file as the starting point.
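As an aside, the "newest checkpoint" selection in the restart script can also be written more compactly; a sketch, assuming only this job's ckpt_*.dmtcp files are present in $CHKDIR:

## "ls -t" sorts by modification time, newest first; "head -1" keeps the most recent checkpoint
restartFile=$(ls -t $CHKDIR/ckpt_*.dmtcp | head -1)
dmtcp_restart -i 300 --ckptdir $CHKDIR $restartFile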
Checkpoint and Restart
It may be convenient to combine the initial checkpoint run and the subsequent restarts into a single script that can be requeued until the job completes. Here is an example:
### Sample Job Script dmtcpStRest.sh
#!/bin/bash
#SBATCH --job-name=DMTCP_Start_Restart
#SBATCH --requeue              # the "requeue" flag is important to tell SLURM this is a requeuable job
#SBATCH --signal=B:10@30       # send signal 10 at 30s before the job times out
#SBATCH --open-mode=append     # append output from restarted jobs to the same output file
#SBATCH --time=10:00
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

## Requeue the preempted job when signal 10 is issued (see --signal above)
## The "trap" command captures signal 10 and runs "scontrol requeue"
##
trap "echo -n 'TIMEOUT @ '; date; echo 'Resubmitting...'; scontrol requeue ${SLURM_JOBID}" 10

## Specify an environmental variable for checkpoints, create the directory if it doesn't exist
##
export CHKDIR=/scratch/users/someuser/chkpts
mkdir -p $CHKDIR

module load DMTCP/3.1.2

## The variable "cnt" tracks how many times the job is restarted, using
## the SLURM env variable "SLURM_RESTART_COUNT" which is incremented every time
## the job is restarted.
##
cnt=${SLURM_RESTART_COUNT-0}
echo "SLURM_RESTART_COUNT = $cnt"

## Start the DMTCP coordinator manually at the beginning of each run,
## and specify a random port
##
dmtcp_coordinator -i 5 --daemon --port 0 --port-file /tmp/port
export DMTCP_COORD_PORT=$(</tmp/port)

## On initial start, cnt=0, so we use "launch"
## On subsequent restarts (cnt > 0) we use "restart"
##
if [[ $cnt == 0 ]]; then
    echo "doing launch"
    rm -f $CHKDIR/*.dmtcp
    dmtcp_launch --ckptdir $CHKDIR -i 60 -j ~/myapplication <args> &
elif [[ $cnt -gt 0 ]]; then
    echo "doing restart"
    restartFile=$(ls -lart $CHKDIR/ckpt_*.dmtcp | tail -1 | tr -s ' ' | cut -d ' ' -f9)
    dmtcp_restart --ckptdir $CHKDIR -i 60 -j $restartFile &
else
    echo "Failed to restart the job, exit"; exit
fi

wait
where:
- We leverage the SLURM environment variable SLURM_RESTART_COUNT to track restarts
- The "--requeue" flag tells SLURM this job may be requeued. This permits repeated restarts until the job completes.
- The "--signal" flag sends a signal (here, signal 10) to the batch script a specified number of seconds (here, 30) before the job walltime expires.
- By using "--signal" and "--requeue" together, we can trigger an automatic requeue of the job if it doesn't finish
Note that we run the "dmtcp_launch" and "dmtcp_restart" commands in the background using "&" (with a "wait" at the end of the script). This is because there is a difference between preemption and timeouts: requeuing happens on preemption, so we want the job to be preempted (via the trapped signal) 30 seconds before walltime is reached rather than letting it time out.
- Please note important comments in the job script example above
As written, the job script example above will perform an initial launch of our job. The job is designed to permit preemption before it is timed out, so that it can be requeued and then restarted using the checkpoints. Restarts will continue automatically until the job completes.
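A typical workflow for the combined script might look like the sketch below (the output file name follows SLURM's default slurm-<jobid>.out pattern; substitute your actual job ID):

sbatch dmtcpStRest.sh                          # initial submission; the job requeues itself as needed
squeue -u $USER                                # watch the job run, get requeued, and restart
grep SLURM_RESTART_COUNT slurm-<jobid>.out     # check how many restarts have occurred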
DMTCP and Interactive Jobs
In addition to batch processing, DMTCP can be run interactively. An additional layer of control can be added by manually launching the dmtcp_coordinator in one terminal, while launching checkpointable runs using "srun" from another.
Initial Job Allocation
We can start our DMTCP interactive job using the same method we would use for any other SLURM interactive job.
[someuser@hpc-login-p01 ~]$ srun -N 1 -c 1 -t 30:00 --pty /bin/bash --login
[someuser@hpc-throughput-p10 ~]$ module load DMTCP
Then open a second SSH terminal to our interactive job node (here, hpc-throughput-p10), and start the coordinator:
[someuser@hpc-login-p01 ~]$ ssh hpc-throughput-p10
[someuser@hpc-throughput-p10 ~]$ module load DMTCP
[someuser@hpc-throughput-p10 ~]$ dmtcp_coordinator --port 0 --port-file /tmp/port
dmtcp_coordinator starting...
    Host: hpc-throughput-p10.cm.cluster (172.30.0.100)
    Port: 36889
    Checkpoint Interval: disabled (checkpoint manually instead)
    Exit on last client: 0
Type '?' for help.

[2024-12-19T14:23:50.502, 2158496, 2158496, Note] at coordinatorplugin.h:205 in tick; REASON='No active clients; starting stale timeout
     theStaleTimeout = 28800
dmtcp>
Notice that we are left with a prompt "dmtcp>". From here we can execute various commands to the coordinator.
IMPORTANT: Note the port number provided in the output from the dmtcp_coordinator command; it is needed for the dmtcp_launch and dmtcp_restart commands that follow.
Initial Job Launch
Now moving back to the first terminal we launched using "srun", we can perform our initial job start:
[someuser@hpc-throughput-p10 ~]$ dmtcp_launch -j -p 36889 ./someprogram.py <args>
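Alternatively, since the coordinator was started with "--port-file /tmp/port" on the same node, the port can be read from that file rather than typed by hand; a small sketch:

## Read the coordinator's port from the port file and export it for dmtcp_launch
export DMTCP_COORD_PORT=$(</tmp/port)
dmtcp_launch -j ./someprogram.py <args>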
Manual Coordination (optional)
In the case above where we manually run the coordinator in a separate terminal, we can issue various commands in the "dmtcp>" prompt to control our job. These include:
- c - checkpoint
- s - status
- q - kill all jobs and quit
So for example, to checkpoint:
dmtcp> c
[2024-12-19T14:31:27.018, 2159473, 2159473, Note] at dmtcp_coordinator.cpp:1164 in startCheckpoint; REASON='starting checkpoint; incrementing generation; suspending all nodes
     s.numPeers = 1
     compId.computationGeneration() = 1
[2024-12-19T14:31:27.777, 2159473, 2159473, Note] at dmtcp_coordinator.cpp:496 in releaseBarrier; REASON='Checkpoint complete; all workers running
[2024-12-19T14:31:27.902, 2159473, 2159473, Note] at dmtcp_coordinator.cpp:559 in recordCkptFilename; REASON='Checkpoint complete. Wrote restart script
Or to check the status:
dmtcp> s
Status...
    Host: hpc-throughput-p10.cm.cluster (172.30.0.100)
    Port: 36889
    Checkpoint Interval: disabled (checkpoint manually instead)
    Exit on last client: 0
    Kill after checkpoint: 0
    Computation Id: 5d66181938e54bb6-40000-22729b72dc0a17
    Checkpoint Dir: /home/s/someuser
NUM_PEERS=1
RUNNING=yes
Restarting
To restart the job in this example, we could issue a new "srun" to create a new interactive job (and open a new terminal as shown above), or if the current interactive job is still running we could simply use:
dmtcp_restart -j -p 36889 ckpt_a.out_*.dmtcp
More Information
The following references may be useful for learning more about how to use DMTCP.
DMTCP Manual (this may be slightly out of date)
CategoryHPC