Is it just me or are organizations placing continuously higher expectations on the availability of their networks? For the past few weeks I have been helping an organization prepare for a network change window that was only going to be two hours long. I had 60-some pages of changes that all needed to happen in a short period of time without any mistakes. There is a definite trend for organizations to reduce the number and duration of change windows as a way to increase network availability.
The typical industries have high expectations of network availability: financial, online retail, airlines, and of course service providers. However, these days I also see schools, research organizations and traditional businesses looking for their networks to be five-nines capable. It is remarkable to me that school districts think that kids can’t learn if the Internet is down.
The impacts of a network outage can be tangible like lost revenue and reduced employee productivity. However, there are also intangible impacts such as customer satisfaction/loyalty/retention, tarnished brand image, internal company communication, teamwork, innovation, employee job satisfaction, and so on. The cost of an hour of network downtime can be staggeringly high depending on what you are using a computer network for. It is not uncommon for financial organizations to suffer millions of dollars in losses per hour of downtime. Some typical organizations may have many thousands of dollars of lost revenue per hour of network downtime.
Years ago, I remember spending a few hours preparing for a network change to ensure that it went well. I recall that network change windows would run for many hours, if not entire days over a weekend. Therefore we had time to flesh out the last-minute details during the change window. This past year I have found myself spending weeks preparing for a scheduled change window that lasts only a few hours. Getting change control approval is like getting a congressional order or a pardon from the president himself because the different lines of business within an organization don’t agree on the least impactful time for the change.
Some organizations are so afraid of any downtime that they make it extremely difficult for network administrators to perform required upgrades. I have encountered a few organizations that have only one change window a month, and invariably one or two get canceled each year. That leaves only 10 change windows each year, which makes it nearly impossible to get everything done that is required. That leads to changes that overlap or impact each other, to more difficult troubleshooting if there are problems, or to network administrators going around the process to get their work done. That is like only bringing your automobile into the shop every 100,000 miles for a tune-up. Organizations need to allow for a reasonable number of change windows. I would say that it would be better to have more, but smaller, change windows even if they have to take place from 10 PM to midnight.
I find that more organizations are looking toward having networks that are supremely redundant. Customers are highly interested in solutions like Cisco’s Virtual Switching System (VSS) and In-Service Software Upgrades (ISSU) with software modularity. Server Load Balancing (SLB) systems and Geographic Server Load Balancing (GSLB) systems are still popular purchases for organizations even in these tough economic conditions. They feel that if they have solutions like this they will be able to take down half the network for maintenance. Their designs are like having an “A” and a “B” network that are fully redundant. Organizations with vision and a good grasp of the network’s role in supporting their existing business are continuing to invest in their IT infrastructure.
I have also encountered organizations that are moving data centers, and they also have high expectations of network availability during a data center or office move. You wouldn’t consider moving into a building that didn’t have running water or electricity, and the same goes for a data network. It is a condition of worker productivity. If the water, electricity, or network were to stop, then the building would be uninhabitable and the workers would be sent home until repairs could be made. However, even though buildings don’t have redundant power cables or water pipes, the network is expected to provide a higher level of redundancy.
There are 86400 seconds/day, 604800 seconds/week, 31536000 seconds/year and 525600 minutes/year. Therefore, five nines would allow for only 5.256 minutes per year (525600 X .00001).
Who can say that their network has been down for only 5 minutes in 2008? If you have only 5 problems each year that means going from 99.9% availability down to 99% availability.
- 99.9% = 10 minutes of downtime/week (3 nines, about 9 hours of total downtime/yr)
- 99.99% = 1 minute of downtime/week (4 nines, about 52 minutes of total downtime/yr)
- 99.999% = 6 seconds of downtime/week (5 nines, about 5.25 minutes of total downtime, planned or unplanned, in a given year)
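The downtime budgets above are simple arithmetic, and it can help to have them in a form you can rerun for other targets. Here is a minimal Python sketch; the function name `downtime_budget` is just something I made up for illustration, and it assumes a 365-day year like the figures above.

```python
MINUTES_PER_YEAR = 365 * 24 * 60        # 525,600
SECONDS_PER_WEEK = 7 * 24 * 60 * 60     # 604,800


def downtime_budget(availability_percent):
    """Return allowed downtime as (minutes per year, seconds per week)."""
    unavailable_fraction = 1 - availability_percent / 100
    return (
        MINUTES_PER_YEAR * unavailable_fraction,
        SECONDS_PER_WEEK * unavailable_fraction,
    )


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        per_year, per_week = downtime_budget(target)
        print(f"{target}%: {per_year:.2f} min/year, {per_week:.1f} s/week")
```

Plugging in your own target (say 99.95%) shows quickly how much room a given change window actually leaves in the yearly budget.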
You are only as good as the power to the equipment. Your power sources must be more reliable than your overall availability requirement. Raw AC may only provide 99.96% availability. However, adding a one-hour Uninterruptible Power Supply (UPS) will get you close to 99.998% (assuming a 1 hour repair time), and combining a UPS with a backup generator can get you to 99.99998% uptime.
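To see why the power has to beat your overall target, here is a rough sketch of how availabilities combine when one thing depends on another. It assumes the network and its power fail independently and that the network is down whenever power is down (a simple series model). The 99.999% figure for the network on its own is a hypothetical number I picked for illustration; the power figures are the ones quoted above.

```python
def series_availability(*percents):
    """Combined availability (%) of components that must all be up.

    Assumes independent failures; this is a simplification.
    """
    combined = 1.0
    for percent in percents:
        combined *= percent / 100
    return combined * 100


if __name__ == "__main__":
    network_alone = 99.999  # hypothetical: the network gear by itself
    power_options = [
        ("raw AC", 99.96),
        ("AC + UPS", 99.998),
        ("AC + UPS + generator", 99.99998),
    ]
    for label, power in power_options:
        overall = series_availability(network_alone, power)
        print(f"{label}: overall {overall:.5f}%")
```

Under this model the combined figure can never be higher than the weakest component, so a five-nines network on raw AC power is really a three-nines network.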
It has been documented in several studies that the average number of outages sufficient to cause IT system malfunction per year at a typical site is approximately 15. 90% of the outages are less than five minutes in duration and 99% of the outages are less than one hour in duration. Therefore, the total cumulative outage duration is approximately 100 minutes per year. Based on these averages you should look at what you can do to rise above this average and strive for five-nines of reliability.
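For a sense of where that average leaves you, the 100 minutes per year can be turned back into an availability percentage. This is a small sketch under the same 365-day-year assumption; `availability_percent` and `nines` are names I invented, and treating "number of nines" as a continuous value (the negative base-10 logarithm of the unavailable fraction) is a common convention rather than something from the studies.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def availability_percent(downtime_minutes_per_year):
    """Convert yearly downtime in minutes to an availability percentage."""
    return 100 * (1 - downtime_minutes_per_year / MINUTES_PER_YEAR)


def nines(availability_pct):
    """Express availability as a (possibly fractional) number of nines."""
    return -math.log10(1 - availability_pct / 100)


if __name__ == "__main__":
    average = availability_percent(100)
    print(f"{average:.3f}% availability, about {nines(average):.1f} nines")
```

By this measure the average site sits somewhere between three and four nines, which gives you a baseline to improve on.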
What does it take for a network to be five-nines capable? Here are some ideas:
- There should be no single point of hardware failure that can take down the entire system.
- You should have spare hardware on site and readily accessible.
- There should be a minimum of time for recovery or switchover to backup facilities.
- There should be no reliance on manual operations for failover. All procedures must detect and correct or quickly converge and route around the failure.
- As I have mentioned before, having detailed network documentation also helps support the goal of greater network availability.
- Solid change control and configuration management practices are needed.
- You should have up-to-date network management systems. The sooner you can recognize an issue the faster you can react.
- You should have well-documented troubleshooting procedures and automated processes to test all facets of your network for availability and performance.
Hopefully your organization allows for a reasonable number of change windows that are sufficiently long to keep the network running in top condition. A little bit of preventative maintenance will help your network run smoothly and actually increase its overall annual availability percentage.