High Performance Computing, November 23, 2009
UKy HPC Frequently-Asked Questions

Frequently-Asked Questions related to:

For questions about this FAQ, please contact: help-hpc@uky.edu


FAQ Revised: Thursday 09 March 2006 10:59:24

Table of Contents

1. Disk Space
2. Scratch Space

1. Disk Space

1.1. How do I check my disk quota?

To check your disk quota, run the command:
	quota -v
The second field is how much space you are currently using; the third field is your quota. Both are in kilobytes.

Note: even though the quota command may show all the /home dirs it is monitoring, each user has only one home dir.
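
As a rough illustration of reading those two fields, the sketch below computes the percentage of quota in use. The helper name quota_percent is hypothetical, and it assumes each data line of quota -v output carries the filesystem in field 1, usage in field 2, and quota in field 3. The exact layout on a given system may differ.

```shell
# Hypothetical helper (not a system command): reads `quota -v` style lines
# on stdin and prints "<filesystem> <percent used>%".
# Assumes field 2 = current usage and field 3 = quota, both in kilobytes.
quota_percent() {
    # Skip header lines and entries with no quota set
    # (field 3 is not a positive number).
    awk '$3 ~ /^[0-9]+$/ && $3 > 0 { printf "%s %d%%\n", $1, ($2 * 100) / $3 }'
}

# Typical use on the cluster:
#   quota -v | quota_percent
```

The percentage is truncated to a whole number, which is enough to see at a glance whether a large job run from /home is likely to hit the limit.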

1.2. How do I check the available disk space on a given filesystem?

Run the command:
	bdf
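
The bdf command is specific to HP-UX. The sketch below wraps it in a hypothetical show_space function that falls back to the POSIX df -Pk command where bdf is not installed. The fallback is an assumption for other systems and is not part of this cluster's documentation.

```shell
# Hypothetical wrapper: report free space for the filesystem holding a path.
# Uses bdf where it exists (HP-UX); otherwise falls back to POSIX `df -Pk`,
# which is an assumption about non-HP-UX systems, not part of this FAQ.
show_space() {
    if command -v bdf >/dev/null 2>&1; then
        bdf "$1"
    else
        df -Pk "$1"
    fi
}

# Example: show_space /scratch
```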


1.3. When I run my jobs, I get this message. What does this message mean?

    vxfs: mesg 042: vx_bsdquotaupdate - warning: /home file system user some_user disk quota exceeded

Each user is given a default allocation of local disk space at the time the account is activated. Depending on the amount of home directory storage you are already using, running a relatively large job from your /home directory can push you over your disk quota. To check this, run the command:
	quota -v
The second field is how much space you are currently using; the third field is your quota. Both are in kilobytes.

In order to run large jobs without exceeding your disk quota you should create any large files in a scratch directory named /scratch/your_userid. The scratch area available on each node should exceed 500GB, depending on how much is being used by all jobs. It is unlikely that the disk space requirement for your job will exceed the space available in your /scratch directory.

Precaution:
    Files placed in a /scratch directory are NOT backed up. Consequently, you can lose the files that you have placed in your /scratch directory if the system crashes or if certain other types of problems develop. Furthermore, any files left in a /scratch directory for more than 30 days are subject to being purged. Therefore, files placed in a /scratch directory should be placed there on a temporary basis only. If these files need to be stored for an extended time and are too large to store on your /home directory, they may be transferred to the UniTree tape archive system (see the University of Kentucky UniTree - Tape Archival/Management System page).

For more information on the /scratch filesystem, see the Scratch Directories/Files section of the New User's Document and the FAQ section concerning Scratch Space below.


2. Scratch Space

2.1. What is scratch space?

"Scratch space" is a filesystem intended primarily for actively running jobs. There are no /scratch quotas, so your jobs are free to write data up to the limits of the filesystem (within reason). Currently, it is locally attached, so it performs better than the NFS-mounted HOME dirs. Your scratch space has the pathname /scratch/your_userid. However, the amount of space is finite. If the filesystem fills up unchecked, the necessary steps will be taken to eliminate the risk to actively running jobs. Please note that files in scratch space are NOT backed up; keeping your only copy of an important file in your scratch area is NOT a good idea. Files to be kept long-term may be copied to the HOME dir, space permitting, or to the mass storage system (see the local web page on DataStorage).

2.2. Is my scratch space the same on all machines in the cluster?

No. Each node in the cluster has a separate scratch area. To make things easier for users, when LSF starts a job (on any SDX host in the cluster except the login node) the contents of the owner's scratch directory are copied to his or her scratch area on the machine the job is run on; when the job finishes, any new files and files that have been changed are copied back to the login node. Some exceptions should be noted here.

2.3. Why is the scratch space not the same on all machines in the cluster?

The SDX cluster currently does not have the benefit of a specialized "Global Filesystem" for shared scratch space. It is technically possible to have one filesystem mounted on all hosts via commonly available NFS. However, heavy I/O over NFS often has performance issues. Therefore, the current configuration does not use NFS for the scratch filesystem, in order to provide the highest possible I/O performance. That is, each node has its own locally attached hardware for the /scratch filesystem, and all nodes use the same mount-point name for uniformity across the cluster.

Note: This could change in the future with the deployment of new filesystem solutions.

2.4. Would storing large files or large numbers of files in my scratch area affect the staging (copying) of /scratch files?

Yes. Often there are good reasons to have relevant files, such as input files, available to an active job. However, you should be aware that when you run an SDX batch (LSF) job, EVERY FILE in your scratch directory that isn't already present on the execution host is copied down to the execution host UNLESS you have taken action to avoid this (see below). There are some exceptions. If the job is scheduled to the login node, there is no need for the copy. Some file types, e.g. HTML or PostScript, are skipped. If both the login node and the execution host have versions of a file, the file is copied down only if the version on the login node is newer than the one on the execution host.

This copying overhead costs time from your allocation and can SIGNIFICANTLY SLOW DOWN the start of new batch jobs (while this potentially unnecessary file transfer takes place). This is particularly wasteful with a dedicated queue, as the job's slots are reserved (but largely idle) while the file staging takes place.

If you have files in your scratch area that don't need to be copied down to the execution host for use by your jobs, mv them under the lsf_nocopy sub-dir and they will not be auto-copied. (This sub-dir is automatically created for you in your topmost scratch directory.) If you wish to accomplish the same thing on a per-directory basis, create (touch) a file called .lsf_nocopy (dot-lsf_nocopy) in that directory. The .lsf_nocopy file keeps that directory from being copied. Either approach accomplishes the same thing. One tip is to "touch .lsf_nocopy" in the job's post phase so that the dir is not copied by subsequent jobs.
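
The two approaches can be sketched as small shell functions. The function names park_in_nocopy and mark_nocopy are hypothetical; only the lsf_nocopy directory name and the .lsf_nocopy file name come from the text above.

```shell
# Hypothetical helpers illustrating the two approaches described above.

# Approach 1: move files or directories under <scratch>/lsf_nocopy.
# $1 is the top-level scratch directory, e.g. /scratch/your_userid;
# the remaining arguments are the items to move.
park_in_nocopy() {
    scratch=$1
    shift
    mkdir -p "$scratch/lsf_nocopy"   # normally already present; harmless if so
    mv "$@" "$scratch/lsf_nocopy/"
}

# Approach 2: mark a single directory so that it is skipped.
mark_nocopy() {
    touch "$1/.lsf_nocopy"
}

# Examples:
#   park_in_nocopy /scratch/your_userid /scratch/your_userid/old_results.tar
#   mark_nocopy /scratch/your_userid/source_tree
```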

Storing files in your scratch area is not a good idea for the long term. The scratch diskspace is NOT backed up; if you don't have a copy of the file elsewhere, you risk losing it.

2.5. Are there classes of files or directories which should always be put under the lsf_nocopy directory, or have an ".lsf_nocopy" file added to the top directory?

Yes. Any large groups of files, or large individual files, that are not needed to run your jobs should be in a directory either under the lsf_nocopy directory or which contains an ".lsf_nocopy" file - the idea is to prevent LSF from having to copy the files back and forth (which can take significant time depending on system load and how much is being copied).

Some kinds of files or directories to watch for and handle as mentioned above:
  • Source trees (directories containing source code and subdirectories used in building software packages but not in running them).
  • Tar files
  • Documentation files and directories
  • Currently unused data files and directories
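
To spot candidates, a find command along the following lines may help. This is a minimal sketch: the function name and the default threshold are assumptions, sizes are given in 512-byte blocks as in POSIX find, and only the top-level lsf_nocopy directory is excluded.

```shell
# Hypothetical helper: list files above a size threshold that are still
# eligible for staging.
# $1 = top-level scratch directory, e.g. /scratch/your_userid
# $2 = threshold in 512-byte blocks (default 2048000, roughly 1 GB)
# Only the top-level lsf_nocopy directory is skipped; directories marked with
# a .lsf_nocopy file are NOT detected by this simple sketch.
list_staging_candidates() {
    scratch=$1
    blocks=${2:-2048000}
    find "$scratch" -path "$scratch/lsf_nocopy" -prune -o \
        -type f -size +"$blocks" -print
}

# Example: list_staging_candidates /scratch/your_userid
```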


2.6. If I am running a job that uses more than one machine, is my scratch directory copied to all the machines the job uses?

Yes. However, be careful about writing output to your scratch area on multiple machines. If you write several files with the same filename (one per machine your job is executing on), then when the files are copied back to the main scratch area on hpc.uky.edu you will end up with only one version of the file: the one that was written to last. There is an inherent problem in trying to combine several scratch areas with active output files into one area on the login node; the best solution is to write output files on one machine only.
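
Where writing from a single machine is not practical, one possible alternative (not prescribed by this FAQ) is to give each machine's output a distinct filename. The sketch below uses a hypothetical helper, host_tagged_name, that appends the node name reported by uname -n.

```shell
# Hypothetical helper: append the node name to an output file name so that
# files written on different machines do not share a pathname.
host_tagged_name() {
    printf '%s.%s\n' "$1" "$(uname -n)"
}

# Example (my_program is a placeholder for your own executable):
#   ./my_program > "$(host_tagged_name result.out)"
```

Because each file then has a different pathname, the copy-back step has nothing to overwrite.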

2.7. Can two of my jobs interfere with each other (cause files from one job to be overwritten by files from the other)?

This can happen, but only if the files have the exact same pathname. The easiest and safest way to deal with this is to use separate scratch directories for the files for each job. If you don't want to do it by hand, the mktemp command is useful. It creates a unique directory name.
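
A per-job directory could be created along these lines. This is a sketch under assumptions: the function name make_job_dir is hypothetical, and it relies on a mktemp that accepts "-d TEMPLATE" and creates the directory (GNU/BSD style). Some systems ship a mktemp with different options that only prints a name, so check the local man page first.

```shell
# Hypothetical helper: create a unique per-job directory and print its path.
# $1 = parent directory; the default follows the /scratch/your_userid convention.
# Assumes `mktemp -d TEMPLATE` (GNU/BSD style); adapt on systems whose
# mktemp behaves differently.
make_job_dir() {
    parent=${1:-/scratch/$(id -un)}
    mktemp -d "$parent/job.XXXXXX"
}

# Example inside a job script:
#   jobdir=$(make_job_dir) && cd "$jobdir"
```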

Note: this problem cannot happen with Gaussian jobs - the script that sets up the job creates a separate, unique scratch directory for each job using the mktemp command. The only way to get a file to be overwritten would be to run two jobs from the same directory using Gaussian command files with the same name - in that case, the output file from the first job to finish would be overwritten. The solution is not to run jobs with the same command file name from the same directory.

2.8. I cleaned up my /scratch directory, then a job that had been running finished and all the old scratch files I had previously deleted reappeared. How do I avoid this?

This happens because the files are effectively mirrored on the execution host while the active job is running and you are deleting only the copies on the login node. So, when an active job finishes the files will be automatically copied back and "reappear".

The easiest and safest approach is to let all your active LSF jobs finish, let LSF copy any files they created back to the login host and then clean your scratch directory before submitting any new jobs.
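
One way to check that no jobs are still active before cleaning up is to count the job lines that bjobs prints. The helper below is a hypothetical sketch and assumes that every job line begins with a numeric job ID.

```shell
# Hypothetical helper: count job lines in `bjobs` output read from stdin.
# Assumes every job line starts with a numeric job ID; header lines and
# "no job" messages do not.
active_job_count() {
    awk '$1 ~ /^[0-9]+$/ { n++ } END { print n + 0 }'
}

# Example guard before cleaning up:
#   [ "$(bjobs | active_job_count)" -eq 0 ] && echo "no active jobs listed"
```

A count of zero only shows that no jobs are listed. It does not by itself confirm that LSF has finished copying files back, so allow time for that before deleting anything.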

2.9. I want to look at an intermediate result of my job, but the file doesn't seem to exist in my scratch directory until after the job finishes. How can I look at the file while my job is running?

Despite appearances, the HP Superdome cluster does not have one single scratch directory. There are four, one for each machine. When a job is run on a machine other than the login host, the contents of your scratch directory are copied to the equivalent directory on the machine the job is run on. When the job finishes, the files from the scratch directory on the execution host are copied back up to the scratch directory on the login host. The file you want to look at doesn't exist on the login host until the relevant job finishes. To get around this, use the bjobs command to find out which host(s) the job is running on (e.g. node_name); then use the ssh command to switch to that machine (e.g. ssh node_name), and check your file as normal. To get back to the login host, type exit. Do not abuse the ssh command by running interactive jobs on the other hosts in the cluster (check the Policies document for more on this).
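
The lookup step can be sketched as follows. The helper name running_job_hosts is hypothetical, and it assumes the default bjobs column order (JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME); a customized output format would need different field numbers.

```shell
# Hypothetical helper: print "<job id> <execution host>" for running jobs,
# reading `bjobs` output on stdin. Assumes the default column order
# JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME.
running_job_hosts() {
    awk '$3 == "RUN" { print $1, $6 }'
}

# Example session:
#   bjobs | running_job_hosts
#   ssh node_name        # then inspect the file under /scratch/your_userid
#   exit                 # return to the login host
```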

2.10. If I kill a batch job, do I need to clean up my scratch directory on the machine the job was running on?

No, unless the job ran on the login node. If the job didn't run on the login node, LSF will copy the files in your scratch directory on that machine back to the login node, and will delete them from the machine the job ran on once you have no other batch jobs running on it. Wait until any other LSF jobs you have running have stopped and the files from them have been copied back; then clean your scratch directory on the login node. If the job ran on the login node, the files were only copied to another machine if you started an LSF job which started up on another machine while the first job was running.

2.11. Is there a limit on how long files can remain in my scratch directory?

Yes. Any files left unused in the scratch directories on the login node for more than thirty (30) days are subject to being purged. Since the scratch directories of the other machines are cleaned as jobs finish and the contents of users' scratch directories on those machines are copied back, this will eventually purge scratch directories on the whole cluster. Except in filesystem-full circumstances, users should be warned by email more than one day before any files in their scratch directories are purged, and notified by email when the files in question are deleted. Please recall that files in the scratch directories are NOT backed up; once a file is deleted from a scratch directory, it is gone. It is the user's responsibility to maintain copies (either in the home directory or at some other storage location) of any files in the scratch area which need to be kept on a more permanent basis. Local users are encouraged to investigate using hsm.uky.edu for file storage. If the scratch disk becomes too full despite the automatic deletion procedure, the administrators will take whatever steps are necessary to ensure proper filesystem availability for all users. If you have any questions about this action, please contact help-hpc@uky.edu.
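
To see which of your files might be affected, a find command along these lines can help. This is only a sketch: the helper name is hypothetical, and it approximates "unused" by modification time, since the exact purge criteria are not spelled out here.

```shell
# Hypothetical helper: list regular files not modified for more than N days.
# $1 = directory to scan, $2 = age in days (default 30).
# Assumption: "unused" is approximated by modification time; the real purge
# criteria are not spelled out in this FAQ.
list_stale_files() {
    find "$1" -type f -mtime +"${2:-30}" -print
}

# Example: list_stale_files /scratch/your_userid
```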


FAQ generated by: makefaq