How much of your SLA focuses on reliability?

* Availability vs. reliability

Writing effective and meaningful service-level agreements is always a challenge. It is relatively easy to pick a few important technical measurements and treat them as a guarantee of satisfactory performance. However, SLAs should focus on the end-user experience, not on any single measurement. A technical target can be met while the business metric, such as the end-user experience, is still left wanting.

Let's say you have an availability standard for network uptime of 99.5%. This still leaves almost 4 hours per month of allowable downtime (about 3.6 hours in a 30-day month). A couple of one- or two-hour outages, particularly during off hours, may have little impact on the end user. Spread those same four hours over 9 hours in 10- to 20-second increments, however, and they can represent over 700 up-and-down events. The network was available 99.5% of the time, but because of poor reliability, end users were seriously affected over an 8- or 9-hour period.
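The arithmetic behind that example can be checked with a short sketch. The function names and the 30-day month are assumptions chosen for illustration; they are not part of any SLA standard.

```python
def allowed_downtime_hours(availability_pct, period_hours=30 * 24):
    """Downtime budget (in hours) implied by an availability target.

    Assumes a 30-day month (720 hours) unless another period is given.
    """
    return (1 - availability_pct / 100) * period_hours


def outage_events(downtime_hours, seconds_per_outage):
    """Number of separate outages if the downtime is split into
    equal interruptions of the given length."""
    return int(downtime_hours * 3600 // seconds_per_outage)


budget = allowed_downtime_hours(99.5)
print(f"Monthly downtime budget at 99.5%: {budget:.1f} hours")

# The article's rounded figure of four hours, split into short interruptions.
print("20-second outages:", outage_events(4, 20))
print("10-second outages:", outage_events(4, 10))
```

Under these assumptions, four hours of downtime in 20-second pieces is 720 separate interruptions, and in 10-second pieces it is 1,440. The availability figure is identical in every case; only the number of interruptions changes.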

I spent a number of years working in the automated teller machine (ATM) industry. People have high expectations when it comes to accessing their cash, so the ATM networks set availability standards. When I was first involved with the operations committee of a certain ATM network, the availability standard was 97%. Given the state of telecommunications in the late 1980s, this was a fair number and most customers were able to get cash when needed. Over time, this number was more easily achieved and the standard was raised to 98%, then 99% and ultimately into the multiple 9s (99.99%). This was possible as the technology and systems improved.

The story doesn't end at 99.99%. The reason availability was measured in the first place was to ensure that a customer would have a high likelihood of receiving cash from an ATM whenever they attempted a transaction. It takes more than an available network to approve a cash withdrawal. ATM networks are groups of distributed terminals connected to a central network switch. To validate your PIN and bank balance, the network switch needs either its own database of bank balances and PINs or an online connection to the bank. If your bank's system is down but the network is up, you can go through the entire process of performing a transaction and still receive no cash. So the ATM networks began to add performance measures for each bank's connection to the central switch. They shifted their focus from availability to reliability. The user experience depended not on the network being up, but on the customer receiving cash. Here, two availability metrics needed to be combined to ensure the user experience.
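One common way to combine such metrics, sketched here under the assumption that the two components fail independently and that both must be up for a withdrawal to succeed, is to multiply their availabilities. The 99.5% figure for the bank connection is a hypothetical value for illustration, not one from the ATM networks described here.

```python
from math import prod


def end_to_end_availability(component_availabilities):
    """Availability of a chain in which every component must be up.

    Assumes failures are independent, so the fractions multiply.
    """
    return prod(component_availabilities)


network_switch = 0.9999  # the 99.99% network standard
bank_link = 0.995        # hypothetical availability of one bank's connection

combined = end_to_end_availability([network_switch, bank_link])
print(f"Chance a withdrawal can be approved: {combined:.4%}")
```

With these illustrative numbers, the combined figure is lower than either component on its own, which is why measuring the network alone overstates what the customer experiences.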

At Enterprise Management Associates, we regularly consult with companies to establish, review and improve their SLAs. What we are really doing is making sure that business goals are fully supported by the IT infrastructure. To ensure a successful SLA, define the business metric first, then identify the combination of technical measurements needed to achieve it.


Copyright © 2005 IDG Communications, Inc.