UK HPC News

December 27, 2013

Data Center power outage 12/28/13

Filed under: Scheduled Outage — Jerry Grooms @ 2:41 pm

For status updates on the McVey Hall power outage and the subsequent system downtime for firmware and software updates, please see:

UKy HPC Twitter feed

There have been some unexpected issues with firmware updates; the plan is to be online again no later than Sunday evening.

December 10, 2013

IB network bandwidth issues

Filed under: Scheduled Outage — Jerry Grooms @ 10:27 am

We are currently experiencing performance degradation on the Infiniband network. Jobs that generate large amounts of multi-node traffic may be impacted. The issue is due to Mellanox firmware defects that cause inter-switch links (ISLs) to drop and not retrain. Although the ISLs are redundant, losing paths degrades performance. The resolution is intrusive and requires cluster downtime.

We plan to update the firmware on all IB switches in the cluster at the next scheduled downtime, 12/28/13.

November 15, 2013

Scheduled (power) outage – Saturday, Dec 28th, 2013

Filed under: Scheduled Outage — Jerry Grooms @ 3:29 pm

The HPC system will be shut down during the maintenance detailed below. HPC jobs that are still running on December 28th will be canceled by 8am. If you have any questions, please contact help-hpc@uky.edu.

University of Kentucky Academic Planning, Analytics and Technology (APAT) has scheduled electrical maintenance for the McVey Hall Data Center’s Uninterruptible Power Supply (UPS) systems and the building switchgear on Saturday, December 28, 2013 [~ 6:00am]. The entire building will experience several short interruptions in power over a couple of hours while UK’s Physical Plant Division (PPD) electricians perform required maintenance on the automated distribution switchgear. At the same time, two of the Data Center’s UPS systems will be taken down for electrical system changes. The third UPS system will remain on to maintain network, Active Directory (AD), Domain Name System (DNS), and F5 availability.

The UPS system downtimes are expected to last up to 8 hours. During the downtime, most systems, including SAP, Exchange, the VM-Farms, and Blackboard and its peripheral systems, will be unavailable. This means individuals will be unable to access data in SAP, receive or send email from UK Exchange accounts, or view course information on Blackboard. The functional outage time will vary depending on the system.

For questions about the upcoming electrical maintenance, please contact Butch Adkins at butch@uky.edu or 859-218-1716.

April 18, 2012

Wall-clock limits for GPU queue

Filed under: News — Jerry Grooms @ 1:00 pm

The wall-clock limit for the GPU queue has been reduced to 24 hours.

You will need to add -t 24:00:00 (or a shorter limit) to your sbatch command line or job script. If you omit this, your job will remain in a blocked state and will not dispatch.

As a reminder, any sbatch command-line option can be embedded in your job script as an #SBATCH directive for your convenience (see the example script below). See man sbatch for more info.
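
For reference, a minimal GPU job script might look like the sketch below; the job name and executable are placeholders, and the partition name follows the -p GPU convention from the GPU Compute Nodes post.

#!/bin/bash
#SBATCH -p GPU
#SBATCH -t 24:00:00
#SBATCH -J gpu_example

module load cuda
./my_gpu_program    # placeholder executable name

Submit the script with sbatch as usual.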

The GPU queue limits are subject to change as usage patterns dictate.

December 6, 2011

Portland Fortran compiler update

Filed under: News — Jerry Grooms @ 1:43 pm

The Portland (PGI) Fortran license has been updated to version 11.10, which includes NVIDIA GPU support.

Both the cuda and pgi modules should be loaded to establish the appropriate environment.
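
A rough sketch of the workflow, assuming the pgi and cuda modules above (the source-file and output names are illustrative placeholders, not a tested recipe):

module load cuda
module load pgi

# CUDA Fortran sources use the .cuf suffix (placeholder file names)
pgfortran -o saxpy saxpy.cuf

# Directive-based PGI Accelerator code targets the GPU with -ta=nvidia
pgfortran -ta=nvidia -Minfo=accel -o accel_demo accel_demo.f90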

Some example codes can be found on the system in /share/cluster/examples/dlx/GPU/Nvidia/PGI

Portland’s overview of the above sample code can be found here.

Other useful PGI GPU-related links:

PGI Accelerator Overview, Resources and FAQ
PGI CUDA Fortran
PGI GPU Forum
Example speedups for specific domains

PGI General Docs

Useful NVIDIA GPU links:

NVIDIA Software Development Tools
NVIDIA GPU SDK code samples

November 10, 2011

GPU Compute Nodes

Filed under: News — Jerry Grooms @ 5:10 pm

We have added four new C6100 compute nodes to the cluster, each currently configured with four GPUs (NVIDIA Tesla M2070). The GPU nodes are identical to the basic compute nodes (12 cores and 36 GB of memory), except for the externally attached GPUs.

For extensive NVIDIA GPU developer info, see their software development page.

The NVIDIA Toolkit is the primary development environment, which you establish by loading a local module:

module load cuda

Queue your batch jobs as usual, but submit them to the GPU partition (-p GPU). Please do NOT run non-GPU code on these nodes!
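
For example, a typical sequence might look like the following sketch (vecadd.cu, vecadd, and vecadd.sh are placeholder names):

module load cuda            # set up the NVIDIA toolkit environment
nvcc -o vecadd vecadd.cu    # compile a CUDA source file (placeholder names)
sbatch -p GPU vecadd.sh     # submit the job script to the GPU partition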

Also, see the post regarding the update to the Portland Fortran compiler, which adds OpenMP-like directives that may make GPU program development substantially easier in some cases.

July 21, 2011

Cluster file-system issues

Filed under: Downtime — Jerry Grooms @ 2:29 am

0730 The system is currently experiencing issues with the cluster file systems. This may affect login sessions, as the HOME directories may be impacted. There is no estimated time to restore (ETR) yet, as the root cause is still being identified. Check this page for more updates.

Normal operations have resumed.

Note: some IB router interfaces stopped responding.

July 11, 2011

DLX Scheduled Outage

Filed under: Downtime — Jerry Grooms @ 9:07 am

1200 System is offline for scheduled maintenance (scheduler paused; running jobs canceled).

1400 Infiniband managed switch firmware updated.

1500 Infiniband unmanaged switch firmware updated.

1600 Infiniband HCA firmware updated (cnodes).

1630 Infiniband HCA firmware updated (fnodes).

1700 System returned to production.

May 25, 2011

HSM Scheduled outage

Filed under: Scheduled Outage — Jerry Grooms @ 9:54 am

HSM near-line storage may be unavailable while undergoing scheduled maintenance on Thursday, May 26th, from 8am to 6pm.
