The wall-clock limit for the GPU queue has been reduced to 24 hours.
You will need to add -t 24:00:00 (or less), or the equivalent long form --time=24:00:00, to your sbatch script or command line. If you omit it, your job will remain in a blocked state and will not dispatch.
As a reminder, any sbatch command-line argument can be embedded in your job script with an #SBATCH pragma (for example, #SBATCH -t 24:00:00) for your convenience. See man sbatch for more info.
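As a minimal sketch of the embedded-pragma approach, the following writes a job script that requests the GPU partition and a 24-hour limit. The file name gpu_job.sh and the program ./my_gpu_app are hypothetical placeholders, and loading the cuda module inside the script is an assumption about your workflow.

```shell
# Create a job script with the sbatch options embedded as #SBATCH pragmas.
# gpu_job.sh and ./my_gpu_app are hypothetical names used for illustration.
cat > gpu_job.sh <<'EOF'
#!/bin/bash
# Run in the GPU partition.
#SBATCH -p GPU
# Wall-clock limit: 24 hours or less.
#SBATCH -t 24:00:00

module load cuda
./my_gpu_app
EOF

# Submit on the cluster with:
#   sbatch gpu_job.sh
```

The pragmas must come before the first executable line of the script, since sbatch stops scanning for them once it reaches a command.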
The GPU queue limits are subject to change as usage patterns dictate.
The Portland (PGI) Fortran license has been updated to version 11.10, which includes NVIDIA GPU support.
Both the cuda and pgi modules should be loaded to establish the appropriate environment.
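A rough sketch of setting up that environment and compiling a Fortran source for the GPU follows. The file name saxpy.f90 is hypothetical. -ta=nvidia and -Minfo=accel are the PGI Accelerator options for targeting NVIDIA GPUs and for reporting what the compiler offloaded. The guards let the snippet run harmlessly on a machine without the modules or the compiler.

```shell
# saxpy.f90 is a hypothetical source file containing PGI Accelerator directives.
SRC=saxpy.f90
PGI_FLAGS="-ta=nvidia -Minfo=accel"

# Load both modules when the module command is available (i.e. on the cluster).
if command -v module >/dev/null 2>&1; then
    module load cuda pgi
fi

# Compile only if the compiler and the source file are present.
if command -v pgfortran >/dev/null 2>&1 && [ -f "$SRC" ]; then
    pgfortran $PGI_FLAGS -o saxpy "$SRC"
else
    echo "pgfortran or $SRC not found; would run: pgfortran $PGI_FLAGS -o saxpy $SRC"
fi
```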
Some example codes can be found on the system in /share/cluster/examples/dlx/GPU/Nvidia/PGI
Portland’s overview of the above sample code can be found here.
Other useful PGI GPU-related links:
PGI Accelerator Overview, Resources and FAQ
PGI CUDA Fortran
PGI GPU Forum
Example speedups for specific domains
PGI General Docs
Useful NVIDIA GPU links:
NVIDIA Software Development Tools
NVIDIA GPU SDK code samples
We have added four new C6100 compute nodes to the cluster, each currently configured with four GPUs (NVIDIA Tesla M2070). The GPU nodes are identical to the basic compute nodes (12 cores and 36 GB), except for the externally attached GPUs.
For extensive NVIDIA GPU developer info, see their software development page.
The NVIDIA Toolkit is the primary development environment, which you establish by loading a local module:
module load cuda
Queue your batch jobs as usual, but to the GPU partition (-p GPU). Please do NOT run non-GPU code on these nodes!
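For illustration, a sketch of that workflow: load the toolkit, compile with nvcc, and submit to the GPU partition. The names vecadd.cu and run_vecadd.sh are hypothetical, and the 24-hour wall-clock request is included on the assumption that the GPU partition enforces a time limit. When sbatch or the files are missing, the guard simply prints the command instead of running it.

```shell
# Hypothetical names: vecadd.cu (CUDA source) and run_vecadd.sh (job script).
SRC=vecadd.cu
JOB=run_vecadd.sh
SUBMIT_CMD="sbatch -p GPU -t 24:00:00 $JOB"

if command -v sbatch >/dev/null 2>&1 && [ -f "$SRC" ] && [ -f "$JOB" ]; then
    module load cuda          # set up the NVIDIA Toolkit environment
    nvcc -o vecadd "$SRC"     # compile with the toolkit's compiler driver
    $SUBMIT_CMD               # queue the job to the GPU partition
else
    echo "sbatch or input files not found; would run: $SUBMIT_CMD"
fi
```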
Also, see the post regarding a new update to the Portland Fortran compiler that adds OpenMP-like features that may provide substantially easier GPU program development in some cases.
0730 The system is currently experiencing issues with the cluster file-systems. This may affect login sessions as the HOME dirs may be impacted. There is no ETR yet as root cause is still being identified. Check this page for more updates.
Normal operations have resumed.
Notes: some IB router interfaces stopped responding.
1200 System is off-line for scheduled maint (scheduler paused; running jobs canceled).
1400 Infiniband managed switch firmware updated.
1500 Infiniband unmanaged switch firmware updated.
1600 Infiniband HCA firmware updated (cnodes).
1630 Infiniband HCA firmware updated (fnodes).
1700 System returned to production.
HSM near-line storage may be unavailable while undergoing scheduled maintenance on Thurs, May 26th from 8am to 6pm.
05/15/11 0800 We have experienced a critical network outage. There is no ETR for root cause at this time.
05/15/11 2230 All production jobs have been aborted. We regret the inconvenience and are working on mitigation of future occurrences.
05/16/11 0930 We are still experiencing network issues. We are actively engaged with the vendor for trouble-shooting; thank you for your patience.
05/17/11 0800 After the issue was thought to be mitigated, we continue to have sporadic issues with the internal networking.
05/17/11 1630 We are engaged with Dell’s escalated support, but we do not have an ETR at this time.
05/17/11 2100 The problems have been isolated to Dell switch/port issues. Firmware updates are being applied and other HW may be replaced Wed. We hope to have service restored sometime (late) Wed. Thank you for your patience.
05/18/11 We have received replacement Dell switches and are working on installation.
05/18/11 2000 All switch firmware updated and two switches replaced. Checking cluster network connectivity.
05/18/11 2359 System is released back to production.
The next quarterly scheduled cluster maintenance is 7/11/2011 from 8am to 4pm.
All services may be unavailable during this time.
Should the maintenance time not be necessary, or finish early, an update will be posted.
13:15 We are currently investigating an issue with the servers that provide critical cluster services.
This may affect batch commands (sinfo, showq, etc.) and access to the compiler license manager, among other services. You may also see “Socket timed out” messages from your running jobs.
15:30 Normal operations should be restored. Most jobs were unaffected, but approximately 20% may have been. Please check them at your earliest convenience.
1830 We are currently experiencing high packet loss on the cluster’s internal network. This may slow down normal operations such as batch commands, compiler execution, etc.
2200 Most operations have been restored. However, many jobs were disrupted and aborted.