The criticality of networks is constantly increasing as the applications that utilize the infrastructure grow in importance. Instead of striving for five-9s of reliability (99.999%), why not aim for 100% availability if that is what the requirements dictate? That raises the question: how would you design a data network if the goal were to achieve 100% availability?
Recently I have seen more networks that are of the utmost criticality. The business has an expectation that the network will not go down. Any downtime would have financial implications or put human life in danger, so network failure is simply not tolerated. Many feel that these are unrealistic expectations of modern data communications networks. However, if these are the requirements, then we need to think differently about how we design networks and fund networking projects to meet the goal of 100% uptime.
Culturally we have designed networks around the "Internet" concept that if one device fails, then after a period of reconvergence the network will automatically restore itself. The original Advanced Research Projects Agency Network (ARPANET) was designed with this goal in mind because it was conceivable that an entire facility could be destroyed yet the integrity of the overall communication system between sites could remain operational. The network would simply route around failures during a war.
The problem is that convergence time hasn't been very fast. With traditional Spanning Tree Protocol (IEEE 802.1D) timers and protocols like RIP and IGRP, convergence was slow. We have been able to achieve faster layer-3 convergence with EIGRP, OSPF, and IS-IS; however, BGP convergence using the default timers is still not very fast. We can turn down hello and dead timers to speed convergence, but that doesn't get us to SONET/optical failover times. These IGPs and EGPs have also had algorithm optimizations to help speed convergence. Rapid Spanning Tree (802.1w) is a prime example of how an aging protocol (802.1D) has had to speed up to keep pace with modern networking goals.
I have written about some of these concepts in an earlier blog post titled "High Expectations of Network Availability". Network downtime is typically related to a hardware failure (environmental or man-made), a software failure (definitely man-made), poor design (human error), or misconfigurations and accidents (us again). I have also written about how creating good documentation of your network will increase your network's overall availability. I don't mean to beat a dead horse, but I wanted to discuss further why MTBF and MTTR are important concepts when designing a high-availability network using redundant components.
Mathematics Behind the Problem
When we think about high-availability network design, the design must consider the reliability of the components that make up the entire communications system. Redundancy is having multiple interchangeable components to reduce the probability of a total failure. Failover is the process of turning over control to the backup system. The time it takes a network to automatically detect an issue, determine the backup traffic path, and switch to that backup device needs to be minimized. That is why the Mean Time Between Failures (MTBF) of components and the Mean Time To Repair (MTTR) of the failover mechanisms should be calculated whenever possible. The simple formulas below show how these values determine the system's overall availability.
Availability = MTBF / (MTBF + MTTR)
Availability = 1 - (total outage time) / (total in service time)
For example, if a device has a single power supply that is rated with a MTBF of 40,000 hours (4.5 years) and it takes 8 hours on average to repair it (MTTR), then the availability can be calculated at (40,000)/(40,000+8) = 99.98%.
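The power supply example can be checked with a short script. This is a minimal sketch of the MTBF/MTTR formula above; the function name is my own.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR): the fraction of time
    a component is expected to be in service."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Power supply: MTBF of 40,000 hours, 8 hours average time to repair.
print(f"{availability(40_000, 8):.4%}")  # -> 99.9800%
```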
The equation is different if we consider multiple devices that are arranged in serial (in a line) with each other or if devices are in parallel (alongside each other). Below is the equation to calculate serial availability based on the availability percentages of the devices in the series. In this equation "i" is the component number and "n" is the number of components.

Availability = Avail_1 X Avail_2 X ... X Avail_n
Therefore, if we had a device that had a power supply with 99.994% availability and used a processor that had 99.999% availability then the total system availability would be (0.99994 X 0.99999) approximately 99.993%. With serial topologies you can see that the availability is reduced because if either device fails then the entire communication path is broken.
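The serial arithmetic is just a product of the component availabilities; a quick sketch using Python's math.prod (the helper name is my own):

```python
from math import prod

def serial_availability(availabilities):
    """Components in series: the path works only if every component works."""
    return prod(availabilities)

# Power supply at 99.994% availability, processor at 99.999%.
print(f"{serial_availability([0.99994, 0.99999]):.5f}")  # -> 0.99993
```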
Below is the equation for the overall system availability based on the availability percentages of the devices in parallel. Notice how this equation calculates 1 minus the product of each component's unavailability (1 minus its availability percentage).

Availability = 1 - [(1 - Avail_1) X (1 - Avail_2) X ... X (1 - Avail_n)]
Therefore, if we had a device with parallel redundant processors, each with an availability of 99%, then the total system availability would be (1 - [(1 - .99) X (1 - .99)]) = approximately 99.99%. You can see how components with low availability can be combined in parallel to create a system with greater availability than any of its individual parts.
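The parallel case can be sketched the same way: the system is down only when every redundant component is down. The helper name is my own.

```python
def parallel_availability(availabilities):
    """Redundant components: the system fails only if all components fail."""
    unavailability = 1.0
    for a in availabilities:
        unavailability *= (1.0 - a)  # probability this component is down
    return 1.0 - unavailability

# Two redundant processors, each only 99% available.
print(f"{parallel_availability([0.99, 0.99]):.4f}")  # -> 0.9999
```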
For example, consider the diagram below of some theoretical network devices connected together in this configuration. One would think that if any one of these systems failed, the entire system would remain operational; however, there is still a remote chance that the entire system might fail. If that were to happen, the entire system would be offline until the network was repaired manually.
Each device type (1, 2, 3) has an availability of 0.999, 0.995, and 0.995 respectively. Below are the calculations for the overall system availability for devices in this configuration.
Availability = (1-(1-Avail1)^2) X (1-(1-Avail2)^2) X (1-(1-Avail3)^3)
Availability = (1-(1-.999)^2) X (1-(1-.995)^2) X (1-(1-.995)^3)
Availability = 0.999999 X 0.999975 X 0.999999875
Availability ≈ 0.999973875
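The whole calculation — parallel groups combined in series — can be verified with a few lines of Python; the helper name and the device counts (two, two, and three, matching the exponents above) are taken from the equation:

```python
from math import prod

def parallel(avail: float, n: int) -> float:
    """Availability of n redundant copies of a component."""
    return 1.0 - (1.0 - avail) ** n

# Series of three parallel groups: device types 1, 2, and 3.
system = prod([parallel(0.999, 2), parallel(0.995, 2), parallel(0.995, 3)])
print(f"{system:.9f}")  # -> 0.999973875
```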
As we see, the availability increased to over 99.97% by combining devices. With this example we realize that we can create a network that has a higher overall availability by designing a redundant topology out of lower-reliability components. However, we have also created a more complicated configuration that would theoretically take longer to troubleshoot. We should not overlook the MTTR component when designing high availability systems.
If you are interested in reading more about these types of calculations, there was a good Cisco Press book on the subject written back in 2001 by Chris Oggerino titled "High Availability Network Fundamentals" (ISBN 1587130173). There is also an even older book I found with lots of good concepts: "Mathematical Theory of Reliability" by Richard E. Barlow and Frank Proschan, with Larry C. Hunter (SIAM, 1996, ISBN 0898713692).
New Cisco Solutions
Vendors are striving to develop solutions that provide greater levels of redundancy and faster failover times. One example is stacking a pair of Cisco 3750 switches: there is hardware-level redundancy when devices are "stacked," and failover is lightning fast when a system is connected to multiple switches in the stack. Another example is the Cisco Virtual Switching System (VSS). Failover occurs at the hardware layer, and VSS utilizes the Virtual Switch Link (VSL) for failover and synchronization. Throughput is certainly increased when the 6500s are placed into a VSS configuration. One advantage of using stacking or VSS is that you no longer need STP and HSRP configurations on core switches, yet these solutions still support using "layer-3 in the core."
With a redundant system you must have good management to determine whether a failure has occurred on one of the devices. It is conceivable that a device failure could occur and you wouldn't know about it because the whole system is still functioning. With redundant solutions like VSS and stacked 3750 switches, you should also know which switch is the "master." This redundancy allows for simpler configuration because both switches share the same configuration. However, you must also test that the failover times of the "master" live up to your expectations.
Striving for 100%
If we are to try to construct a network that has 100% uptime, then we must change the way we think about network design. We should look toward industries that have systems designed for maximum availability. In power companies there are often A and B systems. The space shuttle and other space vehicles have primary and backup systems. We should think about designing two parallel networks that are not connected to each other. That way either network could be used, but the operation of one network is independent of the other. When designing a network like this we must make sure that the parallel systems use diverse traffic paths and have separate power sources. If both networks rely on the same support system, then a failure of that physical infrastructure will take down both networks, completely erasing any advantage gained from a parallel network design.
This goes beyond the dual-chassis-single-supervisor versus single-chassis-dual-supervisor debate. Your costs will double by deploying two parallel networks. Furthermore, the way that end systems use the network must also change. The trick with having an A and a B network is how the end-system computers connect to both networks and how the applications utilize both networks. The computers need to be connected to both networks with multiple network interface cards. The applications must also communicate over both networks simultaneously so that there is virtually no time between when the A network fails and when all traffic is utilizing the B network. The tricky application issue is that if a single transaction is sent over both networks, then there must be a mechanism to handle duplicate packets and prevent duplicate transactions.
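One hypothetical way an application could suppress the duplicate that arrives over the second network is to track transaction IDs it has already processed. This is only an illustrative sketch — the class name and IDs are made up, and a real system would also need to expire old IDs:

```python
class DuplicateSuppressor:
    """Accept each transaction ID once; reject copies that arrive
    over the redundant network path."""

    def __init__(self):
        self._seen = set()

    def accept(self, transaction_id: str) -> bool:
        """Return True the first time an ID is seen, False for duplicates."""
        if transaction_id in self._seen:
            return False
        self._seen.add(transaction_id)
        return True

dedup = DuplicateSuppressor()
print(dedup.accept("txn-42"))  # copy from network A -> True
print(dedup.accept("txn-42"))  # copy from network B -> False
```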
These decisions are made far above the pay grade of a simple network plumber like me, but you can see how the requirement for 100% availability changes the way we design networks. The next time an executive leader tells you that they want 100% availability, you can explain why costs increase dramatically as expectations of service availability increase. Many organizations expect their networks to operate 100% of the time but have failed to budget for and design a network that supports that goal.