Google research: Cranking up the heat may not harm your disk drives

Study of Google's computing infrastructure contradicts previous research on drive failures.

Temperatures exceeding 100 degrees Fahrenheit may not be damaging to disk drives, according to new research by Google engineers that casts doubt on previous findings linking heat to elevated failure rates.

After studying five years' worth of monitoring statistics from Google’s massive data centers, researchers say they could find no consistent pattern linking failure rates to high temperatures or high utilization levels. Temperature, they write, is often called the most important environmental factor affecting disk drive reliability.

“This is a fairly surprising result, which could indicate that data-center or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives,” write Google engineers Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz Andre Barroso. “We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do.”

The Google researchers are more optimistic about the impact of heat on computer systems than a Forrester Research analyst who, in a Webinar for IT professionals last month, said the increasingly fine features of new chips must be protected by lowering maximum operating temperatures.

The Google research, presented this month in San Jose, Calif., at the 5th USENIX Conference on File and Storage Technologies, examined data center performance at temperatures from 15 to 45 degrees Celsius, or 59 to 113 degrees Fahrenheit.

The researchers found negative effects from high temperatures only at the upper end of the range (104 degrees Fahrenheit or more), and even at those temperatures the effects were observed only in drives at least 3 years old.
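For readers who want to check the unit conversions behind these figures, here is a minimal Python sketch of the standard Celsius-to-Fahrenheit formula. The function name is illustrative only, and the 40-degree Celsius value is back-computed from the 104-degree Fahrenheit threshold rather than quoted from the paper.

```python
def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# 15 and 45 are the ends of the studied range; 40 is an assumed Celsius
# equivalent of the 104-degree Fahrenheit threshold mentioned in the article.
for celsius in (15, 40, 45):
    print(f"{celsius} C = {celsius_to_fahrenheit(celsius):.0f} F")
```

Run as written, this reproduces the 59, 104 and 113 degree Fahrenheit figures cited in the article.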

By contrast, a software and hardware manufacturer known as AVTECH Software says the “optimal” temperature range to maintain data center reliability is between 68 and 75 degrees Fahrenheit.

The Google engineers do report seeing a “modest increase” in failure rates at the lowest end of the temperature distribution they studied.

The engineers did not see a consistent correlation between high utilization and high failure rates, a finding they say also contradicts previous literature on the subject. Frequent utilization seems to lead to problems in drives that are less than a year old, and also in drives that are at least five years old, but not in drives that are in the middle of the age range, they found. This may happen because drives that perform poorly under frequent use do not survive past their first year.

More than 90 percent of new information produced today is stored on magnetic media, mostly hard disk drives, according to an estimate cited in the Google paper. Drive manufacturers say yearly failure rates are below 2 percent, but user studies have found rates as high as 6 percent, the paper states.

The Google researchers did find several measures useful for predicting drive failure. The measures, known as SMART (Self-Monitoring, Analysis and Reporting Technology) parameters, include scan errors, which are reported as drives scan the disk surface in the background.

“After their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors,” the Google researchers write.

But more than half of Google’s failed drives did not exhibit scan errors or any of the four most prominent SMART signals. This makes it difficult to develop a comprehensive model for predicting failure.

“It is possible, however, that models that use parameters beyond those provided by SMART could achieve significantly better accuracies,” the Google engineers write. “For example, performance anomalies and other application or operating system signals could be useful in conjunction with SMART data to create more powerful models. We plan to explore this possibility in our future work.”

Although the Google data showed higher failure rates in older disk drives, the numbers do not prove there is a correlation between age and failure rates because there were many different models of disk drives observed in the study. “These data are not directly useful in understanding the effects of disk age on failure rates,” the engineers write.
