hsm outage Thursday 15 October 2009 13:00-15:00 EDT
The IBM3494 tape library will be down for maintenance Thursday 15 October 13:00-15:00 EDT. This means no retrieval of files from hsm.uky.edu.
After an extended testing period, Gaussian version E.01 has been made the default. The previous version, Gaussian D.02, is still available with batchg03.D02
Around 2:00PM, during scheduled maintenance of a data center UPS, the HPC complex lost power to critical hardware (along with many other non-HPC servers).
All jobs on both the Intel and P-series clusters were lost due to loss of SAN storage. With the exception of a few servers with damaged hardware, the resources are now back online and accepting jobs.
A power outage due to today's storm damaged a UPS in McVey Hall. Over 200 compute blades lost power and went down.
Most are back online and jobs are dispatching normally. Please check your jobs.
There will be short (30-40 minute) downtimes on hsm.uky.edu between 13:00 and 16:00 on 29 July 2009 for maintenance. There will also be periods during this window when storage and retrieval of data will be delayed even though the machine is up.
06/23/09 0745
One GPFS filesystem server down with bad RAID stripe; HOME dirs not available; scheduler, compiler not available; resolution in progress, but no ETA.
Update: 1140
While one storage server is still offline, most user-visible services have been restored.
Login-1 has dropped its GPFS filesystem mounts, so HOME dirs are currently not visible to this host; debug traces are being collected, and service will be restored once that has completed. ETA: 1-2 hrs.
Update: 6:00pm service restored; IBM support case: 30905.082.000
The BCX Intel cluster had an unscheduled outage that adversely affected most jobs. The issue appears to be related to events that affected the global filesystem (GPFS). Production has resumed at this time. Jobs on the P-series cluster, which also shares this filesystem, were unaffected.
Update: 05/05/09 Software updates regarding a known GPFS defect have been applied to critical servers to address the frequent hangs.
Update: 05/14/09 No new hangs thus far.
Update: 05/15/09 New hang at 17:15; root cause TBD.
Backups.uky.edu will be down for essential maintenance noon-15:00 EDT Monday April 13th 2009. During this time, hsm.uky.edu will not be able to store data to or retrieve data from tape. If the maintenance takes less time than scheduled, the system will be brought back into service as soon as practicable.