Some HPC simulations take a long time to complete (days or even weeks) and run across multiple nodes. For these long jobs, the failure rate and availability of the underlying infrastructure can become an issue. One solution for recovering from such failures is checkpointing during the simulation, which provides a restart point in case the simulation is terminated for any reason.
When running compute jobs in Azure, most issues related to running on bare metal compute nodes still remain. Hardware can fail, memory can have issues, and the InfiniBand network may encounter problems. On top of that, due to security and functional updates, the (virtualization) software stack may not stay static throughout the duration of the job.
A way to limit the impact of these potential failures is to save intermediate results in such a way that the calculation can be restarted from that point. Creating this intermediate result is called creating a checkpoint.
Checkpointing can be implemented at different levels: within the application and at the OS level. If available, checkpointing within the application is the most stable option, and many applications support checkpointing natively.
One significant downside of checkpointing is the amount of I/O added to the job during a checkpoint and the job delay incurred while it is written. Therefore, the time between checkpoints needs to be balanced against the risk of job failure.
The actual overhead of checkpointing depends on the size of the model in memory and the resulting file size that has to be written to disk. For smaller models, the file size is in the range of hundreds of megabytes, and a single checkpoint per hour may not even be visible in the simulation speed. For larger models, it is in the range of gigabytes (GBs), and a reasonably fast I/O system is needed to keep the overhead under roughly 5 percent. Depending on the total runtime of the simulation, it makes sense to checkpoint at an interval somewhere between hourly and daily.
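As a rough, back-of-the-envelope illustration, the following small script estimates the write time per checkpoint and the resulting overhead for an hourly checkpoint interval. The 21-GB file size and the 1-GB/s write bandwidth used here are assumed values, not measurements:
#!/bin/bash
# Back-of-the-envelope checkpoint overhead estimate (assumed values, not measured)
CKPT_SIZE_GB=21      # size of one checkpoint file in GB (assumption)
WRITE_BW_GBPS=1      # sustained write bandwidth of the I/O system in GB/s (assumption)
INTERVAL_SEC=3600    # one checkpoint per hour

# Time spent writing a single checkpoint
WRITE_SEC=$(( CKPT_SIZE_GB / WRITE_BW_GBPS ))
echo "Write time per checkpoint: ${WRITE_SEC} s"

# Overhead as a percentage of the checkpoint interval
awk -v w="$WRITE_SEC" -v i="$INTERVAL_SEC" \
    'BEGIN { printf "Estimated overhead: %.2f %%\n", 100 * w / i }'
In practice, the measured delay per checkpoint tends to be larger than this raw write-time estimate, because the data first has to be gathered from all MPI ranks and the effective bandwidth to shared storage is often lower, as the measurements later in this post show.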
In the examples below, an Azure CycleCloud-managed IBM LSF cluster is used. All jobs are submitted from a regular user account with a shared home directory hosted on Azure NetApp Files. The base configuration of this cluster is set up using the Azure HPC resources on GitHub.
In this setup, the user's home directory is used for all job storage, including job files, input files, output files, and checkpoint files. All applications are compiled and installed using the EasyBuild framework.
This post focuses on node failures and preventive measures at the job level. Another potential failure point, however, is the LSF master node of the cluster. Under normal circumstances, a failure of the master node does not affect running jobs; that said, one option is to run a set of master nodes in a high availability (HA) setup.
LAMMPS is a molecular dynamics application written at Sandia National Laboratories and is open source. It supports MPI and can run large, long-lasting jobs. LAMMPS provides native support for checkpointing, which is enabled with the restart command in the input file.
For this example, a simple input file based on Lennard-Jones is used:
# 3d Lennard-Jones melt
variable x index 1
variable y index 1
variable z index 1
variable xx equal 400*$x
variable yy equal 400*$y
variable zz equal 400*$z
units lj
atom_style atomic
lattice fcc 0.8442
region box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box 1 box
create_atoms 1 box
mass 1 1.0
velocity all create 1.44 87287 loop geom
pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
neigh_modify delay 0 every 20 check no
fix 1 all nve
restart 1000 lj.restart
run 100000 every 1000 NULL
The job script creates a working directory based on the job-id, copies in the executable and the input file, and starts mpirun with the number of cores assigned to the job:
#!/bin/bash
# Create a working directory named after the LSF job-id
mkdir $LSB_JOBID
cd $LSB_JOBID
# Copy in the LAMMPS executable and the input file
cp ~/apps/lammps-7Aug19-mpi/src/lmp_mpi .
cp ../in.lj .
# Load the toolchain and start LAMMPS on all cores assigned to the job
module load foss/2018b
mpirun -n $LSB_DJOB_NUMPROC ./lmp_mpi -in in.lj
The job is submitted using the LSF bsub command, and four HC nodes are requested, using a total of 176 cores:
bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./lammpsjob.sh
Once the nodes are acquired and started up, the simulation kicks off. The job is assigned job-id 25, so we can follow the progress in the corresponding job directory. Because of the restart command in the input file, a restart file containing all 256 million atoms is written every 1,000 simulation steps. In this case, a 21-GB file is created roughly every half hour.
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 12:20 lj.restart.1000
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 12:51 lj.restart.2000
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 13:22 lj.restart.3000
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 13:53 lj.restart.4000
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 14:24 lj.restart.5000
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 14:56 lj.restart.6000
-rw-rw-r--. 1 ccadmin ccadmin 21G Jan 8 15:26 lj.restart.7000
Now let’s assume that the job exited due to some failure.
To restart the simulation, a modified input file has to be made, mostly because the read_restart and create_box commands conflict: both try to create the simulation box with its atoms. A working restart input file is the following:
# 3d Lennard-Jones melt
read_restart lj.restart.*
velocity all create 1.44 87287 loop geom
pair_style lj/cut 2.5
pair_coeff 1 1 1.0 1.0 2.5
neighbor 0.3 bin
neigh_modify delay 0 every 20 check no
fix 1 all nve
restart 1000 lj.restart
run 100000 upto every 1000 NULL
First, read_restart is defined. With the wildcard, it picks the checkpoint file with the largest timestep and uses it to recreate the atoms. Next, the run command is modified to include upto, which tells LAMMPS not to run an additional 100,000 steps on top of the restart state, but only up to the initially defined endpoint.
Since the working directory already contains all the required information, the job submission script also needs to change, to something like this:
#!/bin/bash
# Reuse the working directory of the original job (job-id 25)
cd ~/jobs/25
module load foss/2018b
# Start LAMMPS with the modified restart input file
mpirun -n $LSB_DJOB_NUMPROC ./lmp_mpi -in in.lj.restart
And we can submit this again using the LSF bsub command:
bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./restart-lammpsjob.sh
This starts a new job that picks up the latest checkpoint file and continues the simulation from there.
Checkpointing can also be implemented with significantly less storage by alternating between two restart files. In the version above, a new multi-gigabyte checkpoint file is written and kept every half hour. This can be optimized by using the following restart entry in the input file, which makes LAMMPS alternate between the two file names:
restart 1000 lj.restart.1 lj.restart.2
When restarting from this state, check which of the two files was written last and reference that file explicitly in the read_restart entry of the restart input file (a small helper is sketched below). Also, do not forget to add the upto keyword to the run command. After resubmitting, the simulation continues from the last checkpoint.
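As a minimal sketch, assuming the job directory and the in.lj.restart input file from the earlier restart example, the selection of the newest of the two rotating files can be scripted before resubmitting:
#!/bin/bash
# Pick the newer of the two rotating LAMMPS restart files and point
# read_restart at it. Run this in the job directory before resubmitting.
latest=$(ls -t lj.restart.1 lj.restart.2 2>/dev/null | head -n 1)
if [ -z "$latest" ]; then
    echo "No restart file found" >&2
    exit 1
fi
echo "Restarting from: $latest"
# Rewrite the read_restart line in the restart input file (in.lj.restart)
sed -i "s|^read_restart .*|read_restart ${latest}|" in.lj.restart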
To quantify the checkpoint overhead, multiple jobs were submitted, each scheduled to simulate 2,000 steps, last approximately 1 hour, and use 176 cores. One job performs a single checkpoint (at step 1,400), one performs two checkpoints (every 800 steps), and one has checkpointing disabled.
| Checkpoints/run | Runtime | Steps | Lost time | Overhead | Extrapolated for 24-hour job |
|---|---|---|---|---|---|
| 0 | 3,469 sec | 2,000 | 0 sec | 0% | 0% |
| 1 | 3,630 sec | 2,000 | 161 sec | 4.6% | 0.18% |
| 2 | 3,651 sec | 2,000 | 182 sec | 5.2% | 0.21% |
The checkpoint overhead for this model is quite small, with around a 5-percent impact for an hourly checkpoint. The 21-GB checkpoint file is written in full at each checkpoint.
A second example of a molecular dynamics application is NAMD, written and maintained at the University of Illinois. The application is open source, although registration is required to download it.
NAMD also supports checkpointing natively. This demonstration uses the STMV benchmark, which models about 1,000,000 atoms.
To enable checkpointing, the following lines need to be added to the stmv.namd input file:
outputName ./output
outputEnergies 20
outputTiming 20
numsteps 21600
restartName ./checkpoint
restartFreq 100
restartSave yes
And that input file can be used by the following job script:
#!/bin/bash
mkdir $LSB_JOBID
cd $LSB_JOBID
cp ../stmv/* .
module load NAMD/2.13-foss-2018b-mpi
mpirun -n $LSB_DJOB_NUMPROC namd2 stmv.namd
The job is submitted with the following LSF bsub command:
bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./namd-mpi-job
Once started, this job creates checkpoint files in the working directory:
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:58 checkpoint.9500.coor
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:58 checkpoint.9500.vel
-rw-r--r--. 1 ccadmin ccadmin 261 Jan 15 17:58 checkpoint.9500.xsc
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:58 checkpoint.9600.coor
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:58 checkpoint.9600.vel
-rw-r--r--. 1 ccadmin ccadmin 261 Jan 15 17:58 checkpoint.9600.xsc
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:58 checkpoint.9700.coor
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:58 checkpoint.9700.vel
-rw-r--r--. 1 ccadmin ccadmin 261 Jan 15 17:58 checkpoint.9700.xsc
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:59 checkpoint.9800.coor
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:59 checkpoint.9800.vel
-rw-r--r--. 1 ccadmin ccadmin 261 Jan 15 17:58 checkpoint.9800.xsc
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:59 checkpoint.9900.coor
-rw-r--r--. 1 ccadmin ccadmin 25M Jan 15 17:59 checkpoint.9900.vel
-rw-r--r--. 1 ccadmin ccadmin 258 Jan 15 17:59 checkpoint.9900.xsc
If the job fails at this point, the checkpoint files can be used to restart the job. To do this, the input file needs two changes. The latest checkpoint files available are for timestep 9,900, so first, tell the simulation to restart at timestep 9,900, and second, tell it to read and reinitialize the simulation from the checkpoint.9900 files (a small helper to find the latest step is sketched after the input file below):
outputName ./output
outputEnergies 20
outputTiming 20
firsttimestep 9900
numsteps 21600
restartName ./checkpoint
restartFreq 100
restartSave yes
reinitatoms ./checkpoint.9900
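Finding the latest timestep by hand works, but it can also be scripted. The following is a minimal sketch; it only prints the two lines to add, and any existing firsttimestep entry in the input file should be adjusted accordingly:
#!/bin/bash
# Determine the latest NAMD checkpoint step from the checkpoint.<step>.coor
# files and print the two input-file lines that reference it.
# Run this in the job directory.
last_step=$(ls checkpoint.*.coor 2>/dev/null \
            | sed 's/checkpoint\.\([0-9]*\)\.coor/\1/' \
            | sort -n | tail -n 1)
if [ -z "$last_step" ]; then
    echo "No checkpoint files found" >&2
    exit 1
fi
echo "firsttimestep ${last_step}"
echo "reinitatoms ./checkpoint.${last_step}"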
After resubmitting the job with bsub, the simulation continues from the checkpoint and stops at the originally intended step, as this example output shows:
TCL: Reinitializing atom data from files with basename ./checkpoint.9900
Info: Reading from binary file ./checkpoint.9900.coor
Info: Reading from binary file ./checkpoint.9900.vel
[…]
TIMING: 21600 CPU: 1882.08, 0.159911/step Wall: 1882.08, 0.159911/step, 0 hours remaining, 1192.902344 MB of memory in use.
ETITLE: TS BOND ANGLE DIHED IMPRP ELECT VDW BOUNDARY MISC KINETIC TOTAL TEMP POTENTIAL TOTAL3 TEMPAVG PRESSURE GPRESSURE VOLUME PRESSAVG GPRESSAVG
ENERGY: 21600 366977.1834 278844.5518 81843.4582 5043.3226 -4523413.8827 385319.1671 0.0000 0.0000 942875.0190 -2462511.1806 296.5584 -3405386.1995 -2454036.3329 296.7428 15.0219 19.7994 10190037.8159 -1.4609 -1.0415
WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 21600
WRITING COORDINATES TO RESTART FILE AT STEP 21600
FINISHED WRITING RESTART COORDINATES
WRITING VELOCITIES TO RESTART FILE AT STEP 21600
FINISHED WRITING RESTART VELOCITIES
WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 21600
WRITING COORDINATES TO OUTPUT FILE AT STEP 21600
WRITING VELOCITIES TO OUTPUT FILE AT STEP 21600
====================================================
WallClock: 1894.207520 CPUTime: 1894.207520 Memory: 1192.902344 MB
[Partition 0][Node 0] End of program
To quantify the checkpoint overhead, multiple jobs were submitted, each lasting approximately one hour and using 176 cores. One job performs a single checkpoint (at 40,000 steps), one performs two checkpoints (every 20,000 steps), and one has checkpointing disabled.
| Checkpoints/run | Runtime | Steps | Lost time | Overhead | Extrapolated for 24-hour job |
|---|---|---|---|---|---|
| 0 | 3,300 sec | 72,000 | 0 sec | 0% | 0% |
| 1 | 3,395 sec | 72,000 | 95 sec | 2.8% | 0.11% |
| 2 | 3,417 sec | 72,000 | 117 sec | 3.5% | 0.13% |
The checkpoint overhead in this model is around 3 percent with a combined checkpoint file size of 50 MB.
Another molecular dynamics package is Gromacs, originally developed at the University of Groningen in the Netherlands and now maintained at Uppsala University in Sweden. This open-source package is available from the Gromacs website.
The data set used for this work is the open benchmark benchPEP. This data set has around 12 million atoms and is roughly 293 MB in size.
Gromacs supports checkpointing natively. To enable it, use the following option on the mdrun command line:
mpirun gmx_mpi mdrun -s benchPEP.tpr -nsteps -1 -maxh 1.0 -cpt 10
The -cpt 10 option initiates a checkpoint every 10 minutes. This is shown in the log files as:
step 320: timed with pme grid 320 320 320, coulomb cutoff 1.200: 32565.0 M-cycles
step 480: timed with pme grid 288 288 288, coulomb cutoff 1.302: 34005.8 M-cycles
step 640: timed with pme grid 256 256 256, coulomb cutoff 1.465: 43496.5 M-cycles
step 800: timed with pme grid 280 280 280, coulomb cutoff 1.339: 35397.0 M-cycles
step 960: timed with pme grid 288 288 288, coulomb cutoff 1.302: 33611.2 M-cycles
step 1120: timed with pme grid 300 300 300, coulomb cutoff 1.250: 32671.0 M-cycles
step 1280: timed with pme grid 320 320 320, coulomb cutoff 1.200: 31965.1 M-cycles
optimal pme grid 320 320 320, coulomb cutoff 1.200
Writing checkpoint, step 3200 at Thu Mar 19 08:24:14 2020
After killing a single MPI process on one of the nodes, the job stops:
--------------------------------------------------------------------------
mpirun noticed that process rank 71 with PID 17851 on node ip-0a000214 exited on signal 9 (Killed).
--------------------------------------------------------------------------
This leaves us with a state.cpt file that was written during the last checkpoint. To restart from this state file, the following mdrun command can be used:
mpirun gmx_mpi mdrun -s benchPEP.tpr -cpi state.cpt
When starting this run, the computation continues from the checkpoint:
step 3520: timed with pme grid 320 320 320, coulomb cutoff 1.200: 31738.8 M-cycles
step 3680: timed with pme grid 288 288 288, coulomb cutoff 1.310: 33970.4 M-cycles
step 3840: timed with pme grid 256 256 256, coulomb cutoff 1.474: 44232.6 M-cycles
step 4000: timed with pme grid 280 280 280, coulomb cutoff 1.348: 36289.3 M-cycles
step 4160: timed with pme grid 288 288 288, coulomb cutoff 1.310: 34398.6 M-cycles
step 4320: timed with pme grid 300 300 300, coulomb cutoff 1.258: 32844.2 M-cycles
step 4480: timed with pme grid 320 320 320, coulomb cutoff 1.200: 32005.9 M-cycles
optimal pme grid 320 320 320, coulomb cutoff 1.200
Writing checkpoint, step 8080 at Thu Mar 19 09:03:05 2020
And the new checkpoint can be seen:
-rw-rw-r--. 1 ccadmin ccadmin 287M Mar 19 09:03 state.cpt
-rw-rw-r--. 1 ccadmin ccadmin 287M Mar 19 09:03 state_prev.cpt
To quantify the checkpoint overhead, multiple jobs were submitted, each lasting 1 hour (using -maxh 1.0) and using 176 cores. One job performs a single checkpoint after 45 minutes, one performs two checkpoints (every 20 minutes), and one has checkpointing disabled.
| Checkpoints/run | Steps | Lost steps | Overhead | Extrapolated for 24-hour job |
|---|---|---|---|---|
| 0 | 38,000 | 0 | 0% | 0% |
| 1 | 38,000 | 0 | 0% | 0% |
| 2 | 36,640 | 1,360 | 3.6% | 0.16% |
As you can see, the overhead of a single checkpoint per hour is again very small, in this case not even visible in the number of steps completed. This is mainly because the checkpoint file size is only 287 MB.
As these results show, checkpointing involves some overhead, and it varies with the application and the size of the model that has to be written to disk. So, you need to weigh the tradeoffs: how often to checkpoint and what overhead is acceptable, versus the risk of losing computation time when a job is lost. For example, an hourly checkpoint limits the loss from a failure to at most about one hour of computation, at the cost of a few percent of extra runtime.
As a general guideline, with an overhead of less than 5 percent per hour of a job, it is usually worth enabling checkpointing.