It isn't smart to rely on SMART
By Chris Mellor, TechWorld, 02/21/2007
Google research has shown that built-in disk drive diagnostics predict only about half of the drive failures that occur.
Modern disk drives have a built-in self-test and diagnostic facility termed Self-Monitoring, Analysis and Reporting Technology
-- SMART. The drive firmware monitors a range of drive parameters, things like the number of seek errors and the disk spin-up
time. If these parameters degrade over time it may indicate the unit is heading for a breakdown. With advance warning of an
impending disk failure you will have a chance to move files and/or replace the unit before you lose any data.
Google's study looked at more than 100,000 disk drives, a combination of serial and parallel ATA consumer-grade
hard disk drives ranging in speed from 5400 to 7200 rpm and in size from 80 to 400 GB. The observed annualized
failure rates ranged from 1.7 percent, for drives in their first year of operation, to over 8.6 percent, for drives
in their third year.
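The article does not spell out how the annualized failure rate was calculated. As a rough sketch, assuming the common definition of failures divided by drive-years of operation, the arithmetic looks like this (the function name and the sample numbers are illustrative, not taken from the study):

```python
def annualized_failure_rate(failures, drive_days):
    """Failures per drive-year, assuming the common AFR definition.

    failures   -- number of drives that failed during the observation window
    drive_days -- total days of operation summed over all observed drives
    """
    if drive_days <= 0:
        raise ValueError("drive_days must be positive")
    drive_years = drive_days / 365.0
    return failures / drive_years


# Illustrative numbers only: 1,000 drives each observed for a full year,
# 17 of which failed, gives a rate of 1.7 percent.
rate = annualized_failure_rate(failures=17, drive_days=1000 * 365)
print(f"{rate:.1%}")
```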
The Google researchers found that SMART diagnostics are not as useful as they are supposed to be. They note that there is
little independent research into drive life and diagnostics, stating 'Most of the available information comes from the disk
manufacturers themselves. Their data are typically based on extrapolation from accelerated life test data of small populations
or from returned unit databases.'
They note 'detailed studies of very large populations (of hard drives) are the only way to collect enough failure statistics
to enable meaningful conclusions. In this paper we present one such study by examining the population of hard drives under
deployment within Google’s computing infrastructure.' Google has 'built an infrastructure that collects vital information
about all Google’s systems every few minutes, and a repository that stores these data in time-series format (essentially forever)
for further analysis.'
The researchers mined this data and analyzed it looking for correlations between hard drive sensor and SMART readings and
failure events. Their findings were:
-- Very little correlation between failure rates and either raised temperature or activity levels.
-- Some SMART parameters (scan errors, reallocation counts, offline reallocation counts, and probational counts) have a large
impact on failure probability. Others do not. Out of all failed drives, over 56 percent of them had no count in any of these
four strong SMART signals.
-- There was a lack of failure-predicting SMART signals on a large proportion of failed drives.
-- Taking all SMART signals and temperature readings into account they found about 36 percent of all failed drives had no
predictive failure signals at all.
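To make the 56 percent figure concrete, here is a minimal sketch of how such a share could be counted. It assumes each failed drive is represented by a dictionary of counts; the key names are hypothetical labels for the four signals named above, not real SMART attribute identifiers, and this is not the researchers' own code.

```python
# Hypothetical labels for the four strong signals named in the article.
STRONG_SIGNALS = (
    "scan_errors",
    "reallocation_count",
    "offline_reallocation_count",
    "probational_count",
)


def strong_signals_raised(counts):
    """Return the strong signals that have a non-zero count for one drive."""
    return [name for name in STRONG_SIGNALS if counts.get(name, 0) > 0]


def fraction_without_warning(failed_drives):
    """Fraction of failed drives with no count in any strong signal."""
    if not failed_drives:
        raise ValueError("need at least one failed drive")
    silent = sum(1 for counts in failed_drives
                 if not strong_signals_raised(counts))
    return silent / len(failed_drives)


# Made-up sample: two of these four failed drives gave no strong warning.
sample = [
    {"scan_errors": 1},
    {},
    {"probational_count": 0},
    {"reallocation_count": 3},
]
print(fraction_without_warning(sample))
```

A drive with an empty list from strong_signals_raised is not necessarily healthy; by the study's figures, more than half of the drives that failed would have looked exactly like that.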
Their conclusion was that 'it is unlikely that an accurate predictive failure model can be built based on these signals alone.'
Further, 'models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures.'
Google's researchers hope that predictive models that 'use parameters beyond those provided by SMART could achieve significantly
better accuracies. For example, performance anomalies and other application or operating system signals could be useful in
conjunction with SMART data to create more powerful models.'