Network World

Jimmy Ray Purser

Is MTBF data useless?

By JimmyRay on Tue, 11/18/08 - 9:50am.

I was at a customer site the other day and we were going back and forth on MTBF numbers like George Costanza trying to split a check for lunch. Reading MTBF numbers is kinda like believing you can actually catch fish with the Ronco Pocket Fisherman. Wow!! According to my Mean Time Between Failures data, my network gear will last for 30 years!! Sorta like a furniture store going out of business...for five years. Normally, most folks look at MTBF data like this: "Hmmm, a year contains 8,766 hours. My switch has an MTBF score of 292Khrs. Now, let's convert that to years. I divide 292,000/8,766 = a little over 33 years! The higher the MTBF score, the longer the years between failures. For example, an MTBF of 438Khrs: 438,000/8,766 = almost 50 years (49.9). Holy smokes!!! Whatta deal! I’ll take two!" Then marketing grabs this number and then it’s: MTBF apply directly to the RFP, MTBF apply directly to the RFP, MTBF apply directly to the RFP, MTBF apply directly to the RFP, MTBF apply directly to the RFP...
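For what it's worth, here is that napkin math as a small Python sketch. The function name `naive_years` is made up for illustration; the 8,766-hour year and the two MTBF figures come from the paragraph above. It only reproduces the naive reading of the number, not a real lifetime prediction.

```python
HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours, the figure used above


def naive_years(mtbf_hours: float) -> float:
    """Naive reading of MTBF: rated hours divided by hours in a year."""
    return mtbf_hours / HOURS_PER_YEAR


for mtbf in (292_000, 438_000):
    # Prints roughly 33.31 and 49.97 years for the two example ratings
    print(f"MTBF {mtbf:,} h -> {naive_years(mtbf):.2f} 'years'")
```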

The first time I had to talk about MTBF with a customer, I was surprised. Normally, we used this data internally in the prototyping process to determine the best materials to use in mass assembly or for cost reduction. When it is used in the field, one of two opposite things usually happens: either the person actually thinks the device will last half a century or longer, or they realize this is crazy and write off the entire MTBF figure as an obvious exaggeration and therefore useless. The real answer, of course, is neither. MTBF is great to know in network planning and design. It can truly provide me with excellent SLA data points.

It is obviously impossible for any individual switch to be tested for anywhere near the amount of time required to produce an MTBF figure of even 100,000 hours, never mind 500,000. So where do these numbers come from? A whole device rarely fails. It is actually a component IN a device that bites it. MTBF is basically a composite score based upon all the components in a device. Now, that is a ton of math, but not for a good ole software package! The manufacturers use a benchmark called Telcordia TR-332 to set this number. We feed the components used into the system, make an adjustment or two for heat, placement, etc., and BLAM! Out come the results. The software actually says "Working". I can always hear the Star Trek computer when I see that.
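A rough sketch of the composite-score idea, and emphatically not the actual TR-332 procedure: assume every component has a constant failure rate, any single component failure takes the device down, and rates are expressed in FITs (failures per billion device-hours). The part names, FIT values, and stress multipliers below are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Part:
    name: str
    fit: float           # base failure rate in FITs (hypothetical values)
    quantity: int = 1
    stress: float = 1.0  # combined multiplier for heat, placement, etc. (hypothetical)


def system_mtbf_hours(parts):
    """Series model: component failure rates add up; MTBF is the inverse of the sum."""
    total_fit = sum(p.fit * p.quantity * p.stress for p in parts)
    return 1e9 / total_fit


# Made-up bill of materials for an imaginary switch
bom = [
    Part("power supply", 1200.0),
    Part("fan", 800.0, quantity=2, stress=1.25),  # fans run hot, so bump the rate
    Part("switch ASIC", 300.0),
    Part("PHY", 50.0, quantity=4),
]

print(f"Predicted MTBF: {system_mtbf_hours(bom):,.0f} hours")
```

Notice how the weakest parts dominate the total: in this made-up example the fans and power supply account for most of the failure rate, which is why a "whole device" number can hide where the real risk sits.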

There are in fact two different types of MTBF figures. When a manufacturer is introducing a new switch to the market, it obviously has not been field proven, so they have no data on how the switch will perform. Still, they can't just shrug and say "beats me Dude, but buy some anyway, look at the cool paint job." Hey, I am designing a data center here, not picking up a '67 RS Camaro, although that would be cool. Because many customers want to know what the reliability of the switch is likely to be, manufacturers calculate what is called a theoretical MTBF figure. This is the number that comes from the TR-332 analytic software and the analysis of historical data: for example, the historical failure rate of other switches similar to the one being placed on the market, and the failure rate of the components used in the new model. It's important to realize that these MTBF figures are estimates based on a theoretical model of reality (much like normal marketing...) and are limited by the known boundaries of that model. Of course, we have to make assumptions like correct installation and correct environment, and we do not account for manufacturing issues like bad fans or defective PHYs. No device is goober proof.

After a particular model of switch has been in the market for a while, say a year, the actual failures of the switch can be analyzed and a calculation made to determine the switch's operational MTBF. That’s the good stuff because it has been field tested. That’s why I tell folks to ask for the date of calculation for the switch MTBF numbers. Like the freshness date on beer. This figure is derived by analyzing field returns for a switch model and comparing them to the installed base for the model and how long the average switch in the field has been running. Operational MTBFs are typically lower than theoretical MTBFs because they include some "human element" (you know Cisco marketing is effective if you hear the song "Teenage Wasteland" in your head about now…) and the goober factor, which is problems not accounted for in the theoretical model. This is big time more accurate. However, operational MTBF is rarely discussed as a reliability specification because most manufacturers don't provide it as a specification, and because most people only look at the MTBFs of new switches, for which operational figures are not yet available.
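The field calculation described above boils down to total operating hours divided by failures. Here is a minimal sketch, assuming field returns are a fair stand-in for failures; the function name and the fleet numbers are hypothetical, not data from any real switch.

```python
def operational_mtbf_hours(installed_units: int,
                           avg_hours_per_unit: float,
                           field_returns: int) -> float:
    """Total fleet operating hours divided by the number of field returns."""
    if field_returns <= 0:
        raise ValueError("need at least one field return to estimate MTBF")
    return installed_units * avg_hours_per_unit / field_returns


# Hypothetical fleet: 10,000 switches, each running about half a year on average,
# with 150 units returned as failed.
mtbf = operational_mtbf_hours(10_000, 4_383, 150)
print(f"Operational MTBF: {mtbf:,.0f} hours")
```

No single switch in this example has run longer than a few thousand hours, yet the fleet as a whole supports a six-figure MTBF. That is also why the calculation date matters: the number shifts as the installed base ages and more returns come in.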

The key point to remember when looking at any MTBF figure is that it is meant to be an average, based on testing done on many switches over a smaller period of time. In the end it is basic statistics: the M in MTBF is a mean, and a mean is an average that accounts for failures before and after the center number. If it is a new switch and you are working with theoretical MTBF, you also include a risk factor in your calculation. I believe it is important to know this data for high availability planning, BUT more important is to factor that with the calculation date and the annual failure rate per volume shipped. I'm fixin' to head out to lunch with a group of folks. I just figured out the Mean Time to Avoid the Check and it does not look good for me...and that is based on operational data...
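To put the "it's a mean" point in planning terms, here is a hedged sketch that turns an MTBF into an annual failure rate. It assumes a constant failure rate (an exponential model), which is a common simplification and not something the post or any vendor datasheet guarantees; the function names are made up for illustration.

```python
import math

HOURS_PER_YEAR = 8766  # same year length used earlier in the post


def annual_failure_rate(mtbf_hours: float) -> float:
    """Chance that one unit fails within a year, assuming a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


def survival_probability(hours: float, mtbf_hours: float) -> float:
    """Chance that one unit is still running after `hours`, same assumption."""
    return math.exp(-hours / mtbf_hours)


mtbf = 292_000
print(f"Annual failure rate: {annual_failure_rate(mtbf):.2%}")
print(f"Expected failures per year in a fleet of 1,000: {1000 * HOURS_PER_YEAR / mtbf:.1f}")
print(f"Chance one unit runs the full {mtbf:,} hours: {survival_probability(mtbf, mtbf):.1%}")
```

Under this assumption a 292Khrs switch is not a 33-year switch. It is roughly a 3-percent-per-year switch, and only about a third of units would be expected to reach the MTBF figure itself.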

Jimmy Ray Purser

About Networking Geek to Geek

Jimmy Ray Purser is the technical co-host for Cisco's TechWise and BizWise TV. Jimmy Ray also conducts advanced training for engineers across North America and Europe and regularly speaks at industry conferences such as VON, CeBIT, N+I, and Networkers. As a field engineer, Jimmy Ray experiences networking first hand behind the console or in the rack. He is an active member in the IEEE and the Ethernet Alliance and has designed, installed and tested numerous networks for Fortune 500 companies, the United States military and other institutions worldwide. He holds 3 U.S. patents for Ethernet security algorithms with two others pending and one defensive publication, as well as numerous other vendor certifications in networking and security.

Purser holds a Bachelor of Science degree in electrical engineering from Southern Illinois University and is currently pursuing a Master of Science degree in electrical engineering.

 
