Today, most business applications are network-centric. These real-time applications require the highest level of network availability. Downtime can cause lost revenue, reduced productivity, and a tarnished image. However, few executives realize how important the underlying network is to their sustained operations. In his recent blog post “Taking the Art Out of Networking”, Jeff Doyle talks about trying to quantify things that most network professionals don’t bother to calculate. One of them is the overall annual network availability percentage.
If you are striving for 99.999% (five-nines) annual network availability then you can only have about 5.26 minutes of total planned or unplanned downtime in a given year. That comes to only 6 seconds of downtime each week. That doesn’t give a human any time to react to a network issue and make a correction, so this is extremely difficult for most enterprises to achieve. I see most high-quality organizations in the range of four nines (99.99% availability, about 52 minutes of total downtime/year, about 1 minute of downtime/week) or three nines (99.9% availability, about 9 hours of total downtime/year, about 10 minutes of downtime/week). Some quick mathematical formulas for computing availability and downtime that use Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR) are as follows:
Availability = MTBF / (MTBF + MTTR)
Availability = 1 - (total outage time) / (total in-service time)
Downtime = (1 - Availability) x 525,600 minutes/yr
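As a quick way to check these numbers yourself, here is a minimal Python sketch of the three formulas. The function names are my own for illustration, and the only assumption is that MTBF and MTTR are given in the same unit (hours here).

```python
MINUTES_PER_YEAR = 525_600  # 365 days x 24 hours x 60 minutes


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR), both in the same unit."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def availability_from_outages(outage_minutes: float,
                              in_service_minutes: float = MINUTES_PER_YEAR) -> float:
    """Availability = 1 - (total outage time) / (total in-service time)."""
    return 1 - outage_minutes / in_service_minutes


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime = (1 - Availability) x 525,600 minutes/yr."""
    return (1 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    levels = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, availability in levels:
        per_year = downtime_minutes_per_year(availability)
        per_week_seconds = per_year * 60 / 52  # spread evenly over 52 weeks
        print(f"{label}: {per_year:.2f} min/yr, {per_week_seconds:.1f} s/week")
```

Running it reproduces the three-, four-, and five-nines downtime figures quoted above, and you can plug in your own MTBF and MTTR estimates to see where your network lands.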
As you can see, your ability to fix the network quickly will help drive down your MTTR. You may not have as much influence on the MTBF other than having highly disciplined network change control and configuration management. However, you can improve your ability to troubleshoot the network and get it back online in a hurry.
I do a lot of assessments of organizations’ networks to help them determine where improvements can be made. The goal is to determine what could be keeping the network from performing at its full potential and to put together a plan to optimize performance, reliability, security, and cost effectiveness. I am disconcerted by the fact that few of our customers have good quality network documentation. Furthermore, I have encountered several organizations that don’t have any network documentation at all, and I have been completely shocked. Some clients have diagrams, but they are high level and lack the details required to help with troubleshooting. Some diagrams are old and inaccurate, and are essentially useless as an aid to network troubleshooting.
When I visualize network diagrams that are useful for troubleshooting I see diagrams that have both physical and logical information. I want to see network diagrams that have physical port/interface numbers, IP addresses, and accurate topology information. Microsoft Visio is a very popular tool for documenting networks. My company is also using Netformx software in the design phase. This figure is an example of a piece of a network that has such useful information for troubleshooting.
Whenever you embark on a troubleshooting exercise the first thing you check is the current status against a baseline. If you don’t have a baseline then you don’t know what is different about your environment from the ideal settings. You want to know how the network is behaving during normal operations when the network is quiescent and fully converged. If you don’t have that then you don’t have anything to compare the problem configurations or network device state information to. Having that network documentation created is simply the practice of being a good custodian of a network.
I often make the statement that having good documentation on hand will save approximately 10 minutes per troubleshooting incident. Having a good baseline or a solid understanding of the current network topology will speed your ability to effect a change that will bring the network back online during a failure. If you don’t have good documentation and you have only 5 network problems each year, that adds up to roughly 50 minutes of extra downtime, which is nearly the entire annual budget for 99.99% availability (about 52 minutes) and enough to pull a four-nines network down to about 99.98%. That means that you can reduce your Mean Time to Repair (MTTR) with Microsoft Visio.
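To put the 10-minutes-per-incident figure in context, the following sketch compares the accumulated troubleshooting time against the annual downtime budget at each availability level. The helper names are hypothetical, and the 10-minute saving is my rule of thumb rather than a measured value.

```python
MINUTES_PER_YEAR = 525_600  # 365 days x 24 hours x 60 minutes


def downtime_budget_minutes(availability: float) -> float:
    """Annual downtime allowed at a given availability level."""
    return (1 - availability) * MINUTES_PER_YEAR


def extra_downtime_minutes(incidents_per_year: int,
                           minutes_lost_per_incident: float = 10) -> float:
    """Extra downtime caused by slower troubleshooting (rule-of-thumb figure)."""
    return incidents_per_year * minutes_lost_per_incident


def availability_with_extra(base_availability: float, extra_minutes: float) -> float:
    """Availability after adding extra downtime on top of the base level."""
    total_downtime = downtime_budget_minutes(base_availability) + extra_minutes
    return 1 - total_downtime / MINUTES_PER_YEAR


if __name__ == "__main__":
    extra = extra_downtime_minutes(incidents_per_year=5)
    for availability in (0.999, 0.9999, 0.99999):
        budget = downtime_budget_minutes(availability)
        print(f"{availability:.3%}: budget {budget:.1f} min/yr, "
              f"extra {extra:.0f} min is {extra / budget:.0%} of it")
    print(f"four nines plus {extra:.0f} min: "
          f"{availability_with_extra(0.9999, extra):.4%}")
```

The point of the comparison is that the tighter your availability target, the larger the share of your downtime budget that gets spent simply figuring out what the network is supposed to look like.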
I have a case study for you. I was recently consulting for a customer who didn’t have very good network documentation. We had meetings with the customer about conserving costs while still reaching for operational excellence and minimizing downtime. The customer has had several network failures and is very concerned, because they suffer financial penalties for network downtime. We provided the customer a quick and dirty ROI analysis on the benefits of a project to document the network and remediate problems identified in a recent network assessment. As we strove to quantify the results of the project in actual dollars, we estimated that our recommendations could conservatively help avoid 6 failures in one year. Their financial penalties are $5k for a late file transfer and $1k for every hour the network is down. Preventing 6 failures a year would therefore save $30,000 in late-transfer penalties plus $6,000 in downtime fines (assuming about one hour of downtime per failure). Improving their network and its stability would also reduce MTTR by 10 minutes per incident; at 6 incidents a year this equates to 60 minutes. Assuming the customer generates $1.2B/yr in revenue ($2,283/min), 60 minutes of downtime would equal $136,980 in lost revenue. Therefore, we could quantify that the total remediation project benefits would be in the neighborhood of $172,980.
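The arithmetic behind that estimate can be sketched in a few lines of Python. The function and parameter names are hypothetical. The one-hour-of-downtime-per-failure figure is an assumption implied by the $6,000 total, and the code uses the six failures that the dollar totals are based on.

```python
MINUTES_PER_YEAR = 525_600  # 365 days x 24 hours x 60 minutes


def project_benefit(failures_avoided: int,
                    late_transfer_penalty: int,
                    hourly_downtime_penalty: int,
                    hours_down_per_failure: int,
                    annual_revenue: int,
                    mttr_minutes_saved_per_incident: int) -> dict:
    """Rough yearly benefit of the remediation project, in dollars."""
    transfer_penalties = failures_avoided * late_transfer_penalty
    downtime_penalties = (failures_avoided * hours_down_per_failure
                          * hourly_downtime_penalty)
    # Revenue per minute, rounded to whole dollars as in the text ($2,283/min).
    revenue_per_minute = round(annual_revenue / MINUTES_PER_YEAR)
    minutes_saved = failures_avoided * mttr_minutes_saved_per_incident
    revenue_protected = minutes_saved * revenue_per_minute
    return {
        "transfer_penalties": transfer_penalties,
        "downtime_penalties": downtime_penalties,
        "revenue_protected": revenue_protected,
        "total": transfer_penalties + downtime_penalties + revenue_protected,
    }


if __name__ == "__main__":
    benefit = project_benefit(
        failures_avoided=6,
        late_transfer_penalty=5_000,
        hourly_downtime_penalty=1_000,
        hours_down_per_failure=1,       # assumption: about one hour per failure
        annual_revenue=1_200_000_000,
        mttr_minutes_saved_per_incident=10,
    )
    for name, dollars in benefit.items():
        print(f"{name}: ${dollars:,}")
```

Changing any one input, such as the number of failures avoided or the minutes saved per incident, shows how sensitive the estimate is to each assumption.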
That’s a pretty good ROI for having good quality and up-to-date network documentation. Microsoft Visio to the rescue.