Running long HPC jobs on Azure with checkpointing using LAMMPS, NAMD and Gromacs

Some HPC simulations take a long time to run (up to days and weeks) and run over multiple nodes. When running these long jobs, failure rate and availability of the underlying infrastructure may become an issue. One solution for recovering from these failures is using checkpointing during the simulation to provide a starting point in case the simulation is terminated for any reason.

 

When running compute jobs in Azure, most issues related to running on bare metal compute nodes still remain. Hardware can fail, memory can have issues, and the InfiniBand network may encounter problems. On top of that, due to security and functional updates, the (virtualization) software stack may not stay static throughout the duration of the job.

 

A way to limit the impact of these potential failures is to save intermediate results in such a way that the calculation can be restarted from that point. Creating this intermediate result is called creating a checkpoint.

 

Checkpointing can be implemented at different levels: within the application and at the OS level. If available, checkpointing within the application is the most stable option, and many applications support checkpointing natively.

 

One significant downside of checkpointing is the extra I/O added to the job during each checkpoint and the delay incurred while the checkpoint is written. The time between checkpoints therefore needs to be balanced against the risk of job failure.

 

Summary

The actual overhead of checkpointing depends on the size of the model in memory and the resulting file size that has to be written to disk. For smaller models, this file can be in the range of hundreds of megabytes, and a single checkpoint per hour may not even be visible in the simulation speed. For larger models, the file will be in the range of gigabytes (GB), and a reasonably fast I/O system lets you keep the overhead under 5 percent. Depending on the total runtime of the simulation, it makes sense to checkpoint at an hourly up to daily interval.
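
As a rough back-of-the-envelope check, the overhead can be estimated from the checkpoint file size, the sustained write bandwidth of the storage, and the checkpoint interval. The sketch below only illustrates the arithmetic; the bandwidth value is an assumption for illustration, not a measured figure for any particular storage configuration.

#!/bin/bash
# Rough estimate of checkpoint overhead. The write bandwidth below is an
# assumed value for illustration; substitute a measured number for your storage.
CKPT_SIZE_GB=21       # size of a single checkpoint file
WRITE_BW_GBPS=0.15    # assumed sustained write bandwidth in GB/s
INTERVAL_SEC=3600     # one checkpoint per hour

WRITE_TIME=$(echo "$CKPT_SIZE_GB / $WRITE_BW_GBPS" | bc -l)
OVERHEAD=$(echo "100 * $WRITE_TIME / $INTERVAL_SEC" | bc -l)
printf "Checkpoint write time: ~%.0f sec, overhead: ~%.1f%%\n" "$WRITE_TIME" "$OVERHEAD"

With a 21-GB checkpoint, as in the LAMMPS example below, this lands in the same few-percent range as the measured results.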

 

Cluster setup

In the examples below, an Azure CycleCloud-managed IBM LSF cluster is used. All the jobs are submitted from a regular user account, which has a shared homedir, hosted on Azure NetApp Files. The base configuration of this cluster is set up using Azure HPC resources on GitHub.

 

In this setup, the homedir of the user is used for all job storage. This includes job files, input files, output files and checkpoint files. All applications are compiled and installed using the EasyBuild framework.
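
As an illustration, installing one of the applications with EasyBuild can look like the following. The easyconfig name is an assumption and may differ between EasyBuild releases; it is shown here only because it matches the module name used in the NAMD job script later on.

# Example only: the easyconfig filename is an assumption and may vary per release.
module load EasyBuild
eb NAMD-2.13-foss-2018b-mpi.eb --robot   # build NAMD and resolve its dependencies
module avail NAMD                        # the resulting module is loaded in the job scripts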

 

This post focuses on node failures and preventive measures at the job level. However, another potential failure point is the LSF master node of the cluster. Under normal circumstances, a failure on the master node does not influence the running job. That being said, one option is to run a set of master nodes in a high-availability (HA) setup.

 

LAMMPS

LAMMPS is an open-source molecular dynamics application developed at Sandia National Laboratories. It supports MPI and can run large, long-lasting jobs. LAMMPS provides native support for checkpointing, which is enabled with the restart command in the input file.

 

For this example, a simple input file based on Lennard-Jones is used:

 

# 3d Lennard-Jones melt
variable        x index 1
variable        y index 1
variable        z index 1
variable        xx equal 400*$x
variable        yy equal 400*$y
variable        zz equal 400*$z
units           lj
atom_style      atomic
lattice         fcc 0.8442
region          box block 0 ${xx} 0 ${yy} 0 ${zz}
create_box      1 box
create_atoms    1 box
mass            1 1.0
velocity        all create 1.44 87287 loop geom
pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5
neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no
fix             1 all nve
restart         1000 lj.restart
run             100000 every 1000 NULL

The job script creates a working directory based on the job ID, copies in the executable and the input file, and starts mpirun with the number of cores assigned to the job:

 

#!/bin/bash
mkdir $LSB_JOBID
cd $LSB_JOBID
cp ~/apps/lammps-7Aug19-mpi/src/lmp_mpi .
cp ../in.lj .
module load foss/2018b
mpirun -n $LSB_DJOB_NUMPROC ./lmp_mpi -in in.lj

 

The job is submitted using the LSF bsub command, and four HC nodes are requested, using a total of 176 cores:

 

bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./lammpsjob.sh

 

Once the nodes are acquired and started up, the simulation kicks off. The job ID assigned is 25, so we can look at the job directory to follow the progress. Due to the restart command in the input file, a checkpoint file covering all 256 million atoms is written every 1,000 simulation steps. In this case, a 21-GB file is created roughly every half hour.

 

-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 12:20 lj.restart.1000
-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 12:51 lj.restart.2000
-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 13:22 lj.restart.3000
-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 13:53 lj.restart.4000
-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 14:24 lj.restart.5000
-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 14:56 lj.restart.6000
-rw-rw-r--.  1 ccadmin ccadmin  21G Jan  8 15:26 lj.restart.7000

Now let’s assume that the job exited due to some failure.

 

To restart the simulation, a modified input file has to be made, mostly because the read_restart and create_box commands conflict: both try to create the simulation box with its atoms. A working restart input file is the following:

 

# 3d Lennard-Jones melt
read_restart    lj.restart.*
velocity        all create 1.44 87287 loop geom
pair_style      lj/cut 2.5
pair_coeff      1 1 1.0 1.0 2.5
neighbor        0.3 bin
neigh_modify    delay 0 every 20 check no
fix             1 all nve
restart         1000 lj.restart
run             100000 upto every 1000 NULL

First, read_restart is defined. It looks for the latest checkpoint file and uses it to recreate the atoms. Next, the run command is modified to include upto, which tells LAMMPS not to run an additional 100,000 steps from the restart point, but only up to the originally defined endpoint.

 

Since we are already in a working directory with all the required information, we also need to change the job submission script to look something like this:

 

#!/bin/bash
cd ~/jobs/25
module load foss/2018b
mpirun -n $LSB_DJOB_NUMPROC ./lmp_mpi -in in.lj.restart

And we can submit this again using the LSF bsub command:

 

bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./restart-lammpsjob.sh

 

This starts a new job that takes the latest checkpoint file and continues the simulation from the defined checkpoint.

 

Another way of implementing checkpointing, which uses significantly less storage, is to alternate between two restart files. The version above writes a considerable amount of data every half hour to store the checkpoint files. This can be optimized by using the following restart entry in the input file:

 

restart         1000 lj.restart.1 lj.restart.2

 

When restarting from this state, check which of the two files was written last and reference that file in the read_restart entry of the input file. Also, do not forget to add the upto keyword. After resubmitting the simulation, it continues from the last checkpoint.
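
Picking the newest of the two files can also be scripted, so the restart submission does not require manual edits. The following is a minimal sketch that assumes the two-file naming shown above, the job directory from the earlier example, and a restart input file named in.lj.restart:

#!/bin/bash
# Sketch: restart a LAMMPS job from whichever of the two alternating restart
# files was written last (assumes the file names and job directory used above).
cd ~/jobs/25
LATEST=$(ls -t lj.restart.1 lj.restart.2 2>/dev/null | head -n 1)
if [ -z "$LATEST" ]; then
    echo "No restart file found" >&2
    exit 1
fi
# Point the read_restart line in the restart input at the newest file.
sed -i "s|^read_restart.*|read_restart    $LATEST|" in.lj.restart
module load foss/2018b
mpirun -n $LSB_DJOB_NUMPROC ./lmp_mpi -in in.lj.restart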

 

To quantify the checkpoint overhead, multiple jobs were submitted, each scheduled to simulate 2,000 steps, last approximately one hour, and use 176 cores. One job writes a single checkpoint (at step 1,400), one writes two checkpoints (every 800 steps), and one has checkpointing disabled.

 

| Checkpoints per run | Runtime   | Steps | Lost time | Overhead | Extrapolated for 24-hour job |
|---------------------|-----------|-------|-----------|----------|------------------------------|
| 0                   | 3,469 sec | 2,000 | 0         | 0%       | 0%                           |
| 1                   | 3,630 sec | 2,000 | 161 sec   | 4.6%     | 0.18%                        |
| 2                   | 3,651 sec | 2,000 | 182 sec   | 5.2%     | 0.21%                        |

 

The checkpoint overhead for this model is quite small, with around a 5-percent impact for an hourly checkpoint. The checkpoint file is 21 GB and is written in full during each checkpoint.

 

NAMD

A second example of a molecular dynamics application is NAMD, which is written and maintained at the University of Illinois. The application is open source, although registration is required to download it.

 

NAMD also supports checkpointing natively. This demonstration uses the STMV benchmark, which models about 1,000,000 atoms.

 

To enable checkpointing, the following lines need to be added to the stmv.namd input file:

 

outputName          ./output
outputEnergies      20
outputTiming        20
numsteps            21600
restartName         ./checkpoint
restartFreq         100
restartSave         yes

And that input file can be used by the following job script:

 

#!/bin/bash
mkdir $LSB_JOBID
cd $LSB_JOBID
cp ../stmv/* .
module load NAMD/2.13-foss-2018b-mpi
mpirun -n $LSB_DJOB_NUMPROC namd2 stmv.namd

The job is submitted with the following LSF bsub command:

 

bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./namd-mpi-job

 

Once started, this job creates checkpoint files in the working directory:

 

-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:58 checkpoint.9500.coor
-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:58 checkpoint.9500.vel
-rw-r--r--.  1 ccadmin ccadmin  261 Jan 15 17:58 checkpoint.9500.xsc
-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:58 checkpoint.9600.coor
-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:58 checkpoint.9600.vel
-rw-r--r--.  1 ccadmin ccadmin  261 Jan 15 17:58 checkpoint.9600.xsc
-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:58 checkpoint.9700.coor
-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:58 checkpoint.9700.vel
-rw-r--r--.  1 ccadmin ccadmin  261 Jan 15 17:58 checkpoint.9700.xsc
-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:59 checkpoint.9800.coor
-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:59 checkpoint.9800.vel
-rw-r--r--.  1 ccadmin ccadmin  261 Jan 15 17:58 checkpoint.9800.xsc
-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:59 checkpoint.9900.coor
-rw-r--r--.  1 ccadmin ccadmin  25M Jan 15 17:59 checkpoint.9900.vel
-rw-r--r--.  1 ccadmin ccadmin  258 Jan 15 17:59 checkpoint.9900.xsc

 

If the job fails at this point, the checkpoint files can be used to restart it. To do this, the input file needs to be adjusted. The latest available checkpoint is at timestep 9,900, so two changes are needed in the input file: first, tell the simulation to restart at timestep 9,900, and second, tell it to read and initialize the simulation from the checkpoint.9900 files.

 

outputName          ./output
outputEnergies      20
outputTiming        20
firsttimestep       9900
numsteps            21600
restartName         ./checkpoint
restartFreq         100
restartSave         yes
reinitatoms        ./checkpoint.9900
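
The same two edits can be generated automatically. The sketch below looks up the highest-numbered checkpoint in the working directory and writes a restart input based on the original stmv.namd; the file names match this example, but treat it as a starting point rather than a finished script.

#!/bin/bash
# Sketch: find the latest NAMD checkpoint step and generate a restart input file.
# Assumes the checkpoint naming and the stmv.namd input used in this example.
cd "$1"   # pass the working directory of the failed job
STEP=$(ls checkpoint.*.coor | sed 's/checkpoint\.\([0-9]*\)\.coor/\1/' | sort -n | tail -n 1)
{
  grep -v -e '^firsttimestep' -e '^reinitatoms' stmv.namd
  echo "firsttimestep       $STEP"
  echo "reinitatoms         ./checkpoint.$STEP"
} > stmv-restart.namd
echo "Restart input written for timestep $STEP: stmv-restart.namd"

The restart job can then be submitted with the same bsub command, pointing namd2 at stmv-restart.namd instead of stmv.namd.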

After resubmitting the job with bsub, the simulation continues from the checkpoint and stops at the original intended step, as this example output shows:

 

TCL: Reinitializing atom data from files with basename ./checkpoint.9900
Info: Reading from binary file ./checkpoint.9900.coor
Info: Reading from binary file ./checkpoint.9900.vel
[…]
TIMING: 21600  CPU: 1882.08, 0.159911/step  Wall: 1882.08, 0.159911/step, 0 hours remaining, 1192.902344 MB of memory in use.
ETITLE:      TS           BOND          ANGLE          DIHED          IMPRP               ELECT            VDW       BOUNDARY           MISC        KINETIC               TOTAL           TEMP      POTENTIAL         TOTAL3        TEMPAVG            PRESSURE      GPRESSURE         VOLUME       PRESSAVG      GPRESSAVG
ENERGY:   21600    366977.1834    278844.5518     81843.4582      5043.3226       -4523413.8827    385319.1671         0.0000         0.0000    942875.0190       -2462511.1806       296.5584  -3405386.1995  -2454036.3329       296.7428             15.0219        19.7994  10190037.8159        -1.4609        -1.0415
WRITING EXTENDED SYSTEM TO RESTART FILE AT STEP 21600
WRITING COORDINATES TO RESTART FILE AT STEP 21600
FINISHED WRITING RESTART COORDINATES
WRITING VELOCITIES TO RESTART FILE AT STEP 21600
FINISHED WRITING RESTART VELOCITIES
WRITING EXTENDED SYSTEM TO OUTPUT FILE AT STEP 21600
WRITING COORDINATES TO OUTPUT FILE AT STEP 21600
WRITING VELOCITIES TO OUTPUT FILE AT STEP 21600
====================================================
WallClock: 1894.207520  CPUTime: 1894.207520  Memory: 1192.902344 MB
[Partition 0][Node 0] End of program

 

To quantify the checkpoint overhead, multiple jobs were submitted, each lasting approximately one hour and using 176 cores. One job writes a single checkpoint (at step 40,000), one writes two checkpoints (every 20,000 steps), and one has checkpointing disabled.

 

| Checkpoints per run | Runtime   | Steps  | Lost time | Overhead | Extrapolated for 24-hour job |
|---------------------|-----------|--------|-----------|----------|------------------------------|
| 0                   | 3,300 sec | 72,000 | 0         | 0%       | 0%                           |
| 1                   | 3,395 sec | 72,000 | 95 sec    | 2.8%     | 0.11%                        |
| 2                   | 3,417 sec | 72,000 | 117 sec   | 3.5%     | 0.13%                        |

 

The checkpoint overhead in this model is around 3 percent with a combined checkpoint file size of 50 MB.

Gromacs

Another molecular dynamics package is Gromacs, originally started at the University of Groningen in the Netherlands and now developed at Uppsala University in Sweden. This open-source package is available from the Gromacs website.

 

The data set used for this work is the open benchmark benchPEP. This data set has around 12 million atoms and is roughly 293 MB in size.

 

Gromacs supports checkpointing natively. To enable it, use the following option on the mdrun command line:

 

mpirun gmx_mpi mdrun -s benchPEP.tpr -nsteps -1 -maxh 1.0 -cpt 10

 

The -cpt 10 option initiates a checkpoint every 10 minutes. This is shown in the log files as:

 

step  320: timed with pme grid 320 320 320, coulomb cutoff 1.200: 32565.0 M-cycles
step  480: timed with pme grid 288 288 288, coulomb cutoff 1.302: 34005.8 M-cycles
step  640: timed with pme grid 256 256 256, coulomb cutoff 1.465: 43496.5 M-cycles
step  800: timed with pme grid 280 280 280, coulomb cutoff 1.339: 35397.0 M-cycles
step  960: timed with pme grid 288 288 288, coulomb cutoff 1.302: 33611.2 M-cycles
step 1120: timed with pme grid 300 300 300, coulomb cutoff 1.250: 32671.0 M-cycles
step 1280: timed with pme grid 320 320 320, coulomb cutoff 1.200: 31965.1 M-cycles
              optimal pme grid 320 320 320, coulomb cutoff 1.200
Writing checkpoint, step 3200 at Thu Mar 19 08:24:14 2020
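
For completeness, a full job script for this Gromacs run, following the same pattern as the LAMMPS and NAMD examples, could look like the sketch below. The module name is an assumption; use whatever name your EasyBuild installation provides.

#!/bin/bash
# Sketch of an LSF job script for the Gromacs run; the module name is assumed.
mkdir $LSB_JOBID
cd $LSB_JOBID
cp ../benchPEP.tpr .
module load GROMACS/2019-foss-2018b
mpirun -n $LSB_DJOB_NUMPROC gmx_mpi mdrun -s benchPEP.tpr -nsteps -1 -maxh 1.0 -cpt 10

It is submitted in the same way as the earlier examples, for instance with bsub -q hc-mpi -n 176 -R "span[ptile=44]" -o ./%J.out ./gromacsjob.sh.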

After killing a single process (MPI rank) on one of the machines, the job is stopped:

 

--------------------------------------------------------------------------
mpirun noticed that process rank 71 with PID 17851 on node ip-0a000214 exited on signal 9 (Killed).
--------------------------------------------------------------------------

 

This leaves us with a state.cpt file that was written during the last checkpoint. To restart from this state file, the following mdrun command can be used:

 

mpirun gmx_mpi mdrun -s benchPEP.tpr -cpi state.cpt


When starting this run, the computation continues from the checkpoint:

 

step 3520: timed with pme grid 320 320 320, coulomb cutoff 1.200: 31738.8 M-cycles
step 3680: timed with pme grid 288 288 288, coulomb cutoff 1.310: 33970.4 M-cycles
step 3840: timed with pme grid 256 256 256, coulomb cutoff 1.474: 44232.6 M-cycles
step 4000: timed with pme grid 280 280 280, coulomb cutoff 1.348: 36289.3 M-cycles
step 4160: timed with pme grid 288 288 288, coulomb cutoff 1.310: 34398.6 M-cycles
step 4320: timed with pme grid 300 300 300, coulomb cutoff 1.258: 32844.2 M-cycles
step 4480: timed with pme grid 320 320 320, coulomb cutoff 1.200: 32005.9 M-cycles
              optimal pme grid 320 320 320, coulomb cutoff 1.200
Writing checkpoint, step 8080 at Thu Mar 19 09:03:05 2020

 

And the new checkpoint can be seen:

-rw-rw-r--.  1 ccadmin ccadmin 287M Mar 19 09:03 state.cpt
-rw-rw-r--.  1 ccadmin ccadmin 287M Mar 19 09:03 state_prev.cpt

 

To verify the checkpoint overhead, multiple jobs were submitted, each lasting one hour (using -maxh 1.0) and using 176 cores. One job performs a single checkpoint after 45 minutes, one performs two checkpoints (every 20 minutes), and one has checkpointing disabled.

 

| Checkpoints per run | Steps  | Lost steps | Overhead | Extrapolated for 24-hour job |
|---------------------|--------|------------|----------|------------------------------|
| 0                   | 38,000 | 0          | 0%       | 0%                           |
| 1                   | 38,000 | 0          | 0%       | 0%                           |
| 2                   | 36,640 | 1,360      | 3.6%     | 0.16%                        |

 

As you can see, the overhead of doing a single checkpoint per hour is again very small, not even visible in the application run times. This is mainly because the checkpoint file size is only 287 MB.

Checkpointing interval

As these results show, checkpointing involves some overhead, and it varies by application and the size of the model that has to be written to disk. So, you need to weigh the tradeoffs—how often to checkpoint and what overhead is acceptable versus the risk of losing computation time due to a lost job.

 

As a general guideline, I advise the following:

  • Keep checkpointing disabled for any jobs running for one hour or less.
  • For longer running jobs, start enabling checkpointing at an interval of 1 to 24 hours, ideally resulting in a checkpoint every 5 to 10 percent of the job's expected runtime (see the sketch after this list). For example, if a job is expected to run for 4 days (96 hours), enable a checkpoint every 5 to 10 hours. If a job is expected to run for 10 days, enable a checkpoint every 12 to 24 hours.
  • For the longest-running jobs, more fine-grained checkpointing is worth considering.
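
A minimal sketch of that guideline, assuming the expected runtime is known in whole hours:

#!/bin/bash
# Sketch: translate an expected job runtime (whole hours) into a checkpoint
# interval using the 5 to 10 percent guideline above.
EXPECTED_HOURS=${1:?usage: $0 <expected runtime in hours>}
if [ "$EXPECTED_HOURS" -le 1 ]; then
    echo "Job runs for an hour or less: leave checkpointing disabled."
    exit 0
fi
MIN_INT=$(printf "%.0f" "$(echo "$EXPECTED_HOURS * 0.05" | bc -l)")
MAX_INT=$(printf "%.0f" "$(echo "$EXPECTED_HOURS * 0.10" | bc -l)")
# Keep the interval within the hourly-to-daily range discussed earlier.
[ "$MIN_INT" -lt 1 ] && MIN_INT=1
[ "$MAX_INT" -gt 24 ] && MAX_INT=24
echo "Checkpoint roughly every $MIN_INT to $MAX_INT hours."

For example, a 96-hour job lands at a checkpoint roughly every 5 to 10 hours, matching the guideline above.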

 

That being said, with an overhead of less than 5 percent per hour of runtime, enabling checkpointing is usually worth it.

 
