I just read a mailing list posting in which a friend explained how he had to shut down a number of servers in a data center for facilities maintenance for the first time in six years -- the machines had all been running non-stop for the whole time. On restart a significant number failed because -- it is suspected -- the power supplies, which could still supply power for normal operation, could no longer handle startup conditions! The suspicion is that electrolyte evaporation in the PSU capacitors is the most likely cause!
Wow. Is there a disaster waiting to happen in your data center? I did a quick search on this topic but, not being an electrical engineer, I may not have been using the right search terms. Does anyone know anything about this problem?
Emergency powerdown
Having been involved in an emergency shutdown the other day, I saw not only power supplies fail but hard disks as well. Would the underlying cause be similar?
electronics degradation
More commonly we have had hard drives fail following a restart after many months of uninterrupted service. We have seen power supplies go as well, not necessarily following prolonged uptime, but from power-on surges. Electrical components degrade over time, and it is certainly reasonable that they may become more sensitive to power surges and spikes as they age.
This degradation affects all components, however, not just power supplies. When we have to power down a large number of servers at the same time, we make sure we have some replacement drives, memory, CPUs, and power supplies (and a valid set of recent backups) immediately available should we need them on restart.
Obviously, environmental factors play a large part in component longevity. The less benign the environment, the lower the longevity of the components. Stressed components fail quicker than non-stressed components, and since stress is cumulative, the longer a component runs the greater the cumulative stress. And no matter what you do, the power-on sequence is stressful, if for no other reason than the reheating of circuitry that has cooled considerably during the off time.
The military has done some in-depth research (no pun intended) with regard to electronic component degradation.
A couple of web sites that might provide some explanation (way over my head, but I'm sure valuable to engineers) are listed below. Some may require purchase.
http://www.bmpcoe.org/bestpractices/internal/calce/calce_11.html
http://stinet.dtic.mil/oai/oai?&verb=getRecord&metadataPrefix=html&identifier=AD0726923
Degradation Based Long-Term Reliability Assessment for Electronic Components in Submarine Applications
Stagger those platter spinups
The effects of rotational inertia plus back EMF presented by a motor on startup from zero RPM are often overlooked. Spinning up a whole bunch of disks simultaneously puts a massive short-term load on the power supply rails and can overwhelm even a moderately healthy PSU.
Most decent RAID controllers allow variable delay timings for disk spinup. It's a good feature -- use it.
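To put rough numbers on that, here is a small Python sketch that compares the peak rail current when all disks spin up together against a staggered start. The function name and every figure in it (2.0 A per disk during spinup, 0.5 A once spinning, 8 seconds to reach speed) are made-up assumptions for illustration, not measurements from any real drive or controller.

```python
def peak_current(n_disks, stagger_s, spinup_s=8.0, spinup_a=2.0,
                 idle_a=0.5, step_s=0.5):
    """Return the peak total rail current (amps) while n_disks spin up.

    Disk i is powered at i * stagger_s seconds, draws spinup_a amps for
    spinup_s seconds, then settles to idle_a amps. All default values
    are illustrative assumptions, not data-sheet figures.
    """
    # Simulate until the last disk has finished spinning up.
    end_s = (n_disks - 1) * stagger_s + spinup_s
    steps = int(end_s / step_s) + 1
    peak = 0.0
    for k in range(steps):
        t = k * step_s
        total = 0.0
        for i in range(n_disks):
            start = i * stagger_s
            if t < start:
                continue              # this disk is not powered yet
            if t < start + spinup_s:
                total += spinup_a     # still accelerating the platters
            else:
                total += idle_a       # already at speed
        peak = max(peak, total)
    return peak


if __name__ == "__main__":
    print(peak_current(8, 0.0))   # all eight at once: 16.0
    print(peak_current(8, 8.0))   # one at a time: 5.5
```

With these assumed figures, eight disks started together pull about three times the peak current of the same disks started one at a time. The real ratio depends on the drives, so check their data sheets before sizing anything from this.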
component failure
My father did some research at his university for a company a while ago. Leaving power on to a component caused it to fail sooner than cycling DC power on and off to the same component.
HOWEVER - another professor in the same department found that picosecond spikes of 15 volts or more were enough to fry silicon transistors. I found such spikes using a very fast-reacting oscilloscope when I turned on an electronic device at the AC mains.
Conclusion: If the power supply is sufficiently filtered, turning equipment on and off will get you longer life for silicon devices.
Mechanical devices are a different story. The stresses on mechanical devices are greatest at turn-on because you are overcoming inertia and friction.
one more piece to ponder
My guess for the cause of the picosecond spikes is that the application of voltage to transformers generates a back EMF into the circuit boards that regular filtering can't deal with - because it is a spike with all frequencies embedded in it.
Old Power Supplies
Hey,
They are commonly known as Switch Mode Power Supplies.
Funny thing is they can be faulty whilst running "without you knowing" until you turn the system off.
There is no way to tell until you go for a cold start.