The HPC system will be shut down during the maintenance detailed below. HPC jobs that are still running on December 28th will be canceled by 8am. If you have any questions, please contact email@example.com.
University of Kentucky Academic Planning, Analytics and Technology (APAT) has scheduled electrical maintenance for the McVey Hall Data Center’s Uninterruptible Power Supply (UPS) systems and the building switchgear on Saturday, December 28, 2013, beginning at approximately 6:00am. The entire building will experience several short power interruptions over a couple of hours while UK’s Physical Plant Division (PPD) electricians perform required maintenance on the automated distribution switchgear. At the same time, two of the Data Center’s UPS systems will be taken down for electrical system changes. The third UPS system will remain up to maintain network, Active Directory (AD), Domain Name System (DNS), and F5 availability.
The UPS system downtimes are expected to last up to 8 hours. During the downtime, most systems including SAP, Exchange, the VM-Farms, Blackboard and its peripheral systems will be unavailable. This means individuals will be unable to access data in SAP, receive or send email from UK Exchange Accounts or view course information on Blackboard. The functional outage time will vary depending on the system.
For questions about the upcoming electrical maintenance, please contact Butch Adkins at firstname.lastname@example.org or 859-218-1716.
The wall-clock limit for the GPU queue has been reduced to 24 hours.
You will need to add -t 24:00:00 (or less) to your sbatch script or command line. If you omit this option, your job will remain in a blocked state and will not dispatch.
As a reminder, any sbatch command-line option can be embedded in your job script with an #SBATCH directive for your convenience. See man sbatch for more info.
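As a sketch, a job script that honors the new limit might look like the following. The job name, output file, and executable name are placeholders; adjust them for your own work.

```shell
#!/bin/bash
# Hypothetical job script illustrating the new GPU wall-clock limit.
#SBATCH -t 24:00:00       # wall-clock limit: must be 24 hours or less
#SBATCH -p GPU            # submit to the GPU partition
#SBATCH -J my_gpu_job     # job name (placeholder)
#SBATCH -o my_gpu_job.out # stdout/stderr file (placeholder)

module load cuda          # establish the CUDA environment
./my_gpu_program          # placeholder for your GPU executable
```

Each #SBATCH line is equivalent to passing the same option on the sbatch command line; options given on the command line override those in the script.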
The GPU queue limits are subject to change as usage patterns dictate.
The Portland (PGI) Fortran license has been updated to version 11.10 which includes NVIDIA GPU support.
Both the cuda and pgi modules should be loaded to establish the appropriate environment.
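A minimal sketch of setting up that environment and compiling an accelerator-enabled Fortran source follows. The source file name is a placeholder, and -ta=nvidia is the PGI Accelerator target flag for this compiler generation; consult the PGI documentation for the options current on your system.

```shell
# Load the CUDA and PGI environments (order shown is illustrative).
module load cuda
module load pgi

# Compile a Fortran source for the NVIDIA GPU target.
# -Minfo=accel reports which loops the compiler offloaded.
pgfortran -ta=nvidia -Minfo=accel myprog.f90 -o myprog
```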
Some example codes can be found on the system in /share/cluster/examples/dlx/GPU/Nvidia/PGI
Portland’s overview of the above sample code is available on the PGI website.
Other useful PGI GPU-related links:
PGI Accelerator Overview, Resources and FAQ
PGI CUDA Fortran
PGI GPU Forum
Example speedups for specific domains
PGI General Docs
Useful NVIDIA GPU links:
NVIDIA Software Development Tools
NVIDIA GPU SDK code samples
We have added four new C6100 compute nodes to the cluster, each currently configured with four NVIDIA Tesla M2070 GPUs. The GPU nodes are identical to the basic compute nodes (12 cores, 36 GB RAM), except for the externally attached GPUs.
For extensive NVIDIA GPU developer info, see their software development page.
The NVIDIA Toolkit is the primary development environment, which you establish by loading a local module:
module load cuda
Queue your batch jobs as usual, but to the GPU partition (-p GPU). Please do NOT run non-GPU code on these nodes!
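For example, a submission to the GPU partition might look like this ("myjob.sh" is a placeholder for your own job script):

```shell
# Submit a job script to the GPU partition.
sbatch -p GPU myjob.sh

# Verify that the job is queued or running in the GPU partition.
squeue -u $USER
```

Equivalently, the partition can be fixed inside the script with an "#SBATCH -p GPU" directive rather than on the command line.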
Also, see the post regarding a new update to the Portland Fortran compiler that adds OpenMP-like features that may provide substantially easier GPU program development in some cases.
0730 The system is currently experiencing issues with the cluster file systems. This may affect login sessions, as the HOME directories may be impacted. There is no estimated time of repair (ETR) yet, as the root cause is still being identified. Check this page for updates.
Normal operations have resumed.
Notes: some IB router interfaces stopped responding.
1200 System is off-line for scheduled maint (scheduler paused; running jobs canceled).
1400 Infiniband managed switch firmware updated.
1500 Infiniband unmanaged switch firmware updated.
1600 Infiniband HCA firmware updated (cnodes).
1630 Infiniband HCA firmware updated (fnodes).
1700 System returned to production.
HSM near-line storage may be unavailable while undergoing scheduled maintenance on Thurs, May 26th from 8am to 6pm.
05/15/11 0800 We have experienced a critical network outage. There is no ETR at this time; the root cause is under investigation.
05/15/11 2230 All production jobs have been aborted. We regret the inconvenience and are working on mitigation of future occurrences.
05/16/11 0930 We are still experiencing network issues. We are actively engaged with the vendor for trouble-shooting; thank you for your patience.
05/17/11 0800 Although the issue was thought to be mitigated, we continue to see sporadic problems with the internal networking.
05/17/11 1630 We are engaged with Dell’s escalated support, but we do not have an ETR at this time.
05/17/11 2100 The problems have been isolated to Dell switch/port issues. Firmware updates are being applied and other HW may be replaced Wed. We hope to have service restored sometime (late) Wed. Thank you for your patience.
05/18/11 We have received replacement Dell switches and are working on installation.
05/18/11 2000 All switch firmware updated and two switches replaced. Checking cluster network connectivity.
05/18/11 2359 System is released back to production.
The next quarterly scheduled cluster maintenance is 7/11/2011 from 8am to 4pm.
All services may be unavailable during this time.
Should the maintenance prove unnecessary or finish early, an update will be posted.