UK HPC News

October 9, 2009

hsm outage Thursday 15 October 2009 13:00-15:00 EDT

Filed under: Downtime, News — Chuck Fisher @ 7:41 am

The IBM3494 tape library will be down for maintenance Thursday 15 October 13:00-15:00 EDT.  This means no retrieval of files from hsm.uky.edu.

September 28, 2009

Gaussian default version updated to E.01

Filed under: News — Jerry Grooms @ 6:46 pm

After an extended testing period, Gaussian version E.01 has been made the default.  The previous version, Gaussian D.02, is still available with batchg03.D02.

August 19, 2009

Unscheduled outage - UPS failure

Filed under: Downtime — Jerry Grooms @ 8:14 pm

Around 2:00PM, during scheduled maintenance of a data center UPS, the HPC complex lost power to critical hardware (along with many other non-HPC servers).

All jobs on both the Intel and P-series clusters were lost due to loss of SAN storage. With the exception of a few servers with damaged hardware, the resources are now back online and accepting jobs.

August 4, 2009

Unscheduled outage - UPS failure

Filed under: Downtime — Jerry Grooms @ 5:52 pm

A power outage due to today's storm damaged a UPS in McVey Hall.  Over 200 compute blades lost power and went down.  Most are back online and jobs are dispatching normally.  Please check your jobs.

July 23, 2009

hsm.uky.edu outages 13:00-16:00 EDT 29 July 2009

Filed under: Uncategorized — Chuck Fisher @ 10:26 am

There will be short (30-40 minute) downtimes on hsm.uky.edu between 13:00 and 16:00 EDT on 29 July 2009 for maintenance.  There will also be periods during this window when storage and retrieval of data are delayed even though the machine is up.

June 23, 2009

Unscheduled GPFS outage

Filed under: Downtime — Jerry Grooms @ 7:03 am

06/23/09 0745

One GPFS filesystem server is down with a bad RAID stripe; HOME dirs, the scheduler, and the compiler are not available; resolution is in progress, but there is no ETA.

Update: 1140

While one storage server is still offline, most user-visible services have been restored.

June 8, 2009

Unscheduled outage - login-1

Filed under: Downtime — Jerry Grooms @ 3:07 pm

Login-1 has dropped its GPFS filesystem mounts, so HOME dirs are currently not visible to this host; debug traces are being collected, and service will be restored once that has completed.  Estimated time to restore:  1-2 hrs.

Update: 6:00pm service restored; IBM support case:  30905.082.000

April 28, 2009

Unscheduled outage - Intel cluster

Filed under: Downtime — Jerry Grooms @ 9:58 pm

The BCX Intel cluster had an unscheduled outage that adversely affected most jobs.  The issue appears to be related to events that affected the global filesystem (GPFS).  Production has resumed at this time.  Jobs on the P-series cluster, which also shares this filesystem, were unaffected.

Update: 05/05/09  Software updates for a known GPFS defect have been applied to critical servers to address the frequent hangs.

Update: 05/14/09  No new hangs thus far.
Update: 05/15/09  New hang at 17:15; root cause TBD.

April 9, 2009

hsm outage 12:00 - 15:00 EDT Monday April 13th

Filed under: Downtime — Chuck Fisher @ 11:56 am

Backups.uky.edu will be down for essential maintenance 12:00-15:00 EDT Monday April 13th 2009.  During this time, data cannot be stored to or retrieved from tape by hsm.uky.edu.  If the maintenance takes less time than scheduled, the system will be brought back into service as soon as practicable.
