High Performance Computing July 4, 2008
UKy HPC LSF Checkpointing Jobs


   LSF Checkpointing  
HP Superdome Cluster


Checkpointing is the name for any of several techniques that ensure you don't lose all your work up to that point if the machine running your program crashes or is taken down.

Use of some form of checkpointing is STRONGLY recommended. Research Computing cannot guarantee 100% uptime on its machines. The longer a job takes to run, the more likely it is that some problem will interfere, and the more valuable the data or computations that would be lost.

There are three basic types of checkpointing available on this cluster. One is relevant to pre-packaged software; the other two are relevant to programs written locally.


Type 1 - Handled by the program

The software package has an option to write out data periodically, or when it finishes a particular step. In some cases, a special file is written out that lets you restart the computation with little human intervention; in some cases raw data is written out and some thought is required to restart the program.


Type 2 - Program periodically writes out its state

The program is locally written, and periodically writes out information to a file that can be used to recover the then-current state of the computation and proceed from there. This is good practice even if you use the next variety of checkpointing as well: if nothing else, it allows the programmer to check that the program is proceeding correctly while it is being tested.

Unfortunately, this is also the only method that works with user-written 64-bit programs.

NOTE: This is the only effective form of checkpointing for MPI programs.


Type 3 - Program has been re-linked with a special library provided by LSF

LSF replaces selected low-level system calls with special routines; when signalled, it writes out the program's internal state. This can be done periodically, or by a special command, or both. This lets programs be checkpointed periodically (as insurance against system crashes) or under special circumstances (e.g. when the machine is taken down for maintenance). The drawback is that the files produced are large and are not human-readable: they are only useful for restarting the jobs that created them once the system comes back up. Unfortunately, only single-process jobs compiled in 32-bit format can currently be checkpointed in this way.


The remainder of this document will cover this third type of checkpointing; the first two types depend entirely on the package used and/or the programmer writing the application.

Type 3 checkpointing does not require you to replace any of the standard statements normally used in your program, or to add any extra statements or procedure/subroutine/function calls. Instead, all you have to do is re-link your program with one of the provided commands. Separate commands are provided to link C and Fortran programs. The command to re-link C programs is  ckpt_ld; the command to re-link Fortran programs is  ckpt_ld_f.

Here's an example of compiling and re-linking a C program:

1.  Compile the program, using the  -c  option to produce an object (.o) file instead of an executable.

cc -c myjob.c

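2.  Link with ckpt_ld.

ckpt_ld -o myjob myjob.o

If the program needs additional libraries, list them after the object file. For example, to link with the math library:

ckpt_ld -o myjob myjob.o -lm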

Re-linking Fortran programs is similar, but uses the ckpt_ld_f command instead of ckpt_ld.


Limitations of Type 3 checkpointing:

  • Only single-process jobs can be checkpointed.


  • Only 32-bit programs can be checkpointed.


  • Processes with open sockets or pipes may not restart properly, because pipes and sockets are not reopened on restart. If you are piping data to or from STDIN, STDOUT, or STDERR, all data in the pipes will be lost on restart.


  • The program must use statically linked libraries.


  • Checkpointed programs should not use private stacks.


  • Checkpointed programs should not use internal timers.

How to submit checkpointed jobs

When you submit a checkpointed job, you need to specify a checkpoint directory: a directory into which the checkpoint files will be written. Checkpoint files are large, so it is probably a bad idea to write them in or under your home directory. The best place to write them is under your scratch directory (/scratch/userid, where userid is your userid). Each job is checkpointed in a subdirectory identified by the job's id number, so you do not have to specify a separate subdirectory for each job.

In addition, you may want to specify a checkpoint period. Checkpoint periods are specified in minutes: your job is checkpointed each time that many minutes have elapsed.

The checkpoint period is optional. The checkpoint directory, however, is required for checkpointed jobs.

For example, assume that the program myjob has been re-linked with ckpt_ld, and that it doesn't use stdin or stdout.

The following command will submit the job to the serial queue, specifying  /scratch/myuserid  as the checkpoint directory:

bsub -q serial -k /scratch/myuserid myjob

The following command will submit the job to the serial queue, specifying  /scratch/myuserid  as the checkpoint directory and a checkpoint period of 6 hours (360 minutes):

bsub -q serial -k "/scratch/myuserid 360" myjob

NOTE:  The quotation marks ( " ) are required.




Send comments/questions to: help-hpc@uky.edu
Last modified: July 25 2002 14:32:00.