Supercomputer Hardware Problems -
Over the past few weeks, we have experienced a number of processor failures on Helios (the Sun E10000 supercomputer). Helios has 64 processors: 32 from the original installation and 32 from the upgrade done last year. Some of the older processors have started to fail, and each failure causes Helios to reboot. After a failure, we are able to dynamically unconfigure the system board that has the bad CPU, have Sun Microsystems replace it, and then configure the board back into the running system. We are working with Sun to replace ALL of the older processors ASAP so that these unscheduled reboots stop.
LSF Queue Changes - Effective Monday, November 12, 2001 -
LSF is the queue management software used to manage the workload on the real workhorse servers, Helios and Norfolk. We are making some changes to the queue configurations to better manage the workload and to make queue names and configurations a little more meaningful and consistent between Helios and Norfolk.
A few new queues will be created, and some old queues will be phased out and deleted. The following table outlines the major settings for each queue. A (*) indicates a new or changed setting value, and (**) indicates that the previous value has been removed.
| Queue Name | Batch / Interactive | Run Limit (min) | Disp Prio | Nice | Max Proc | Max Jobs | Job Limit Per User | Job Limit Per Host | Load Sched UT | Load Stop UT | Host |
| short | Both | 15 | 50* | 0* | 1 | ** | 90%* | Norfolk | |||
| normal | Both | 600* | 30* | 5* | 1 | ** | ** | 90% | Norfolk | ||
| long | Batch* | 20* | 10 | 1 | ** | ** | 80%* | Norfolk | |||
| SAS | Both | 600* | 30 | 5* | 1 | ** | 2 | Norfolk | |||
| SPSS | Both | 600* | 30 | 5* | 1 | 2 | 2* | Norfolk | |||
| SAS-long | Batch* | 20* | 10* | 1* | 2* | 1* | 80%* | Norfolk | |||
| bigjobs DELETE* | X | X | X | X | X | X | X | X | X | X | Norfolk |
| hpc-short* | Both* | 15* | 50* | 0* | Helios | ||||||
| hpc-normal | Both | 600 | 30* | 5* | 90%* | ** | Helios | ||||
| hpc-long | Batch* | 20* | 10* | 90% | ** | Helios | |||||
| hpc-mpi | Batch* | 20* | 10* | 80% | ** | Helios | |||||
| hpc-mpi-short | Both* | 15 | 50* | 0 | 4 | Helios | |||||
| hpc PHASE OUT* | Interactive | 600* | 1* | 15* | 80%* | ** | Helios | ||||
| hpc-batch PHASE OUT* | Batch | 1* | 15* | 80%* | ** | Helios | |||||
| Host Name | Max Jobs | Job Limit Per User | MEM Limit | Load Sched UT | Load Stop UT |
| Norfolk | ** | 3 | 5MB* | 100%* | 100%* |
| Helios | | | | 100% | 100% |
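
The (*) and (**) markers used in the tables above can also be decoded mechanically, for example when copying the settings into a script or spreadsheet. The following is a minimal Python sketch of that convention only. The names `Setting` and `parse_cell` are hypothetical and are not part of LSF, and the sketch assumes each table cell has already been split out as a plain string.

```python
from typing import NamedTuple, Optional


class Setting(NamedTuple):
    """One table cell, decoded (hypothetical helper type, not an LSF structure)."""
    value: Optional[str]  # None when the cell is empty or the old value was removed
    changed: bool         # True when the cell ends in "*" (new or changed value)
    removed: bool         # True when the cell is "**" (previous value removed)


def parse_cell(cell: str) -> Setting:
    """Decode one cell using the (*) / (**) convention described in this notice."""
    text = cell.strip()
    if text == "**":
        # Whatever value was there before has been removed.
        return Setting(None, False, True)
    if text.endswith("*"):
        # A trailing "*" marks a new or changed value; strip the marker.
        return Setting(text[:-1].strip(), True, False)
    # Unmarked cell: the value is unchanged, or the cell is blank.
    return Setting(text or None, False, False)
```

For example, `parse_cell("50*")` reports the value `50` as changed, while `parse_cell("**")` reports a removed value with no replacement.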