Checkpointing is a name for any of a series of techniques for ensuring that,
if the machine on which your program is running crashes or is taken down, you
don't lose all your work up to that point.
Use of some form of checkpointing is STRONGLY recommended. Research Computing
cannot guarantee 100% uptime on its machines; the longer a job takes to run,
the more likely it is that some problem will interfere, and the more valuable
the data or computations that would be lost.
There are three basic types of checkpointing available on this cluster. One
is relevant to pre-packaged software; the other two are relevant to programs
written locally.
Type 1 - Handled by the program
The software package has an option to write out data periodically, or when it
finishes a particular step. In some cases, a special file is written out that
lets you restart the computation with little human intervention; in others,
raw data is written out and some thought is required to restart the program.
Type 2 - Program periodically writes out its state
The program is locally written, and periodically writes out information to a
file that can be used to recover the then-current state of the computation and
proceed from there. This is good practice even if you use the next variety of
checkpointing as well: if nothing else, it allows the programmer to check that
the program is proceeding correctly while it is being tested.
Unfortunately, this is also the only method that works with user-written
64-bit programs.
NOTE: This is the only effective form of checkpointing for MPI programs.
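As a rough illustration of Type 2 checkpointing, the following C sketch shows
one way a locally written program might save and restore its own state. The
State structure, the function names (save_state, load_state, run_steps), and
the summing loop are all hypothetical and stand in for a real computation.
Writing to a temporary file and then renaming it is one common approach,
assumed here; nothing on this cluster requires it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical program state: everything needed to resume the loop. */
typedef struct {
    long   next_step;  /* first step that has NOT been done yet */
    double sum;        /* partial result accumulated so far */
} State;

/* Write the state to "<path>.tmp", then rename it over <path>, so that a
   crash in the middle of a write cannot damage the last good checkpoint
   (on POSIX systems rename() replaces the old file in one step). */
static int save_state(const char *path, const State *s)
{
    char tmp[512];
    FILE *f;
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);

    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;                      /* path too long for the buffer */
    f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    if (fwrite(s, sizeof *s, 1, f) != 1) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    if (fclose(f) != 0) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read a saved state. Returns 0 on success, -1 if there is no usable
   checkpoint; *s is modified only on success. */
static int load_state(const char *path, State *s)
{
    State loaded;
    FILE *f = fopen(path, "rb");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fread(&loaded, sizeof loaded, 1, f) == 1);
    fclose(f);
    if (!ok)
        return -1;
    *s = loaded;
    return 0;
}

/* Do steps up to (but not including) stop_before, saving a checkpoint
   after every `every` steps. Resumes from <path> if a checkpoint exists. */
State run_steps(const char *path, long stop_before, long every)
{
    State s = {0, 0.0};

    assert(every > 0);
    load_state(path, &s);               /* leaves s at step 0 on failure */
    while (s.next_step < stop_before) {
        s.sum += (double)s.next_step;   /* stand-in for the real work */
        s.next_step++;
        if (s.next_step % every == 0)
            save_state(path, &s);       /* a real program would check this */
    }
    return s;
}
```

A program built this way would call run_steps with the same checkpoint file
name each time it starts; after a crash, rerunning it picks up from the last
saved step rather than from step 0. Because the file is a raw copy of the
structure, it should be read back only on the same kind of machine, and by
the same build of the program, that wrote it.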
Type 3 - Program has been re-linked with a special library provided by LSF
LSF replaces selected low-level system calls with special routines; when
signalled, these routines write out the program's internal state. This can be
done periodically (as insurance against system crashes), by a special command
(e.g. when the machine is taken down for maintenance), or both. The drawback
is that the files produced are large and are not human-readable; they are
useful only for restarting the jobs that created them when the system comes
back up. Unfortunately, only single-process jobs compiled in 32-bit format can
currently be checkpointed in this way.
The remainder of this document will cover this third type of
checkpointing; the first two types depend entirely on the
package used and/or the programmer writing the application.
Type 3 checkpointing does not require the programmer to replace any of the
standard statements normally used in the program, or to add any extra
statements or procedure/subroutine/function calls. Instead, all that the
programmer has to do is re-link the program with one of the provided commands.
Separate commands are provided for C and Fortran programs: the command to
re-link C programs is ckpt_ld; the command to re-link Fortran programs is
ckpt_ld_f.
Here's an example of compiling and re-linking a C program:
1. Compile the program, using the -c option to produce an object (.o) file
instead of an executable.
cc -c myjob.c
2. Link with ckpt_ld. The second form below also links the math library
(-lm), for programs that need it.
ckpt_ld -o myjob myjob.o
ckpt_ld -o myjob myjob.o -lm
Relinking Fortran programs is similar, but uses the ckpt_ld_f command instead
of ckpt_ld.
Limitations of Type 3 checkpointing:
- Only single-process jobs can be checkpointed.
- Only 32-bit programs can be checkpointed.
- Processes with open sockets or pipes may not restart properly, as pipes and
  sockets are not reopened on restart. If you are piping data to or from
  STDOUT, STDIN, or STDERR, all data in the pipes will be lost on restart.
- The program must use statically linked libraries.
- Checkpointed programs should not use private stacks.
- Checkpointed programs should not use internal timers.
How to submit checkpointed jobs
When you submit a checkpointed job, you need to specify a checkpoint
directory; in other words, a directory into which the checkpoint files will be
written. Checkpoint files are large; it is probably a bad idea to write them
in or under your home directory. The best place to write them is under your
scratch directory (/scratch/userid, where userid is your userid). Each job is
checkpointed in a subdirectory identified by the job's id number, so you do
not have to specify a separate subdirectory for each job.
In addition, you may want to specify a checkpoint period. Checkpoint periods
are specified in minutes (your job will be checkpointed each time that many
minutes have elapsed). The checkpoint period is optional. However, the
checkpoint directory is not optional and must be specified for checkpointed
jobs.
For example, assume that the program myjob has been re-linked with ckpt_ld,
and that it doesn't use stdin or stdout. The following command will submit the
job to the serial queue, specifying /scratch/myuserid as the checkpoint
directory:
bsub -q serial -k /scratch/myuserid myjob
The following command will submit the job to the serial queue, specifying
/scratch/myuserid as the checkpoint directory and a checkpoint period of
6 hours (360 minutes):
bsub -q serial -k "/scratch/myuserid 360" myjob
NOTE: The quotation marks ( " ) are required whenever a checkpoint period is
given.
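If jobs are submitted from a driver program, the hours-to-minutes arithmetic
and the quoting can be sketched as follows. This is only an illustration:
format_bsub is a hypothetical helper, not part of LSF, and it merely builds
the text of the two command forms shown above without running anything.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build a bsub command line for a checkpointed job.
   period_hours <= 0 means "no checkpoint period". Returns the length of
   the command, or -1 if an argument is missing or buf is too small. */
int format_bsub(char *buf, size_t size, const char *queue,
                const char *ckpt_dir, int period_hours, const char *program)
{
    int n;

    if (buf == NULL || queue == NULL || program == NULL)
        return -1;
    /* The checkpoint directory is required for checkpointed jobs. */
    if (ckpt_dir == NULL || strlen(ckpt_dir) == 0)
        return -1;

    if (period_hours > 0) {
        /* The period is given in minutes, and the directory and period
           must be quoted together as a single -k argument. */
        n = snprintf(buf, size, "bsub -q %s -k \"%s %d\" %s",
                     queue, ckpt_dir, period_hours * 60, program);
    } else {
        n = snprintf(buf, size, "bsub -q %s -k %s %s",
                     queue, ckpt_dir, program);
    }
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

Called with the queue serial, the directory /scratch/myuserid, a period of 6
hours, and the program myjob, the helper produces the same text as the second
bsub command above.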
Send comments/questions to:
help-hpc@uky.edu
Last modified: July 25 2002 14:32:00.