“High availability” has been a technical and marketing buzzword for a number of years, and lately infrastructure equipment vendors have made “HA” a feature set. In that regard HA has come to mean a combination of hardware and software that reduces device downtime. In this age of “five nines” reliability and stringent Service Level Agreements, pretty much any downtime is unacceptable: If a device is out of service for more than about 315 seconds in a year, it is below the 99.999% threshold.
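For reference, here’s the back-of-the-envelope arithmetic behind that 315-second figure (a quick illustrative calculation, assuming a non-leap 365-day year):

```python
# Downtime budget per year for common availability targets.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # 31,536,000 seconds in a 365-day year

for label, availability in [("three nines", 0.999),
                            ("four nines", 0.9999),
                            ("five nines", 0.99999)]:
    budget = SECONDS_PER_YEAR * (1 - availability)
    print(f"{label} ({availability:.3%}): {budget:,.0f} seconds "
          f"(~{budget / 60:.1f} minutes) of allowable downtime per year")
```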
The biggest hardware vulnerability is the power supply; heat makes it the single most failure-prone component of any router or switch. The cooling fans are a close second, because they can fail simply by virtue of having moving parts. You should therefore expect any mid-range or higher router or switch to have redundant power supplies and fans.
These are reasonably simple. Both redundant power supplies are on (and hopefully connected to separate electrical circuits!) and supplying power to the system, so if one fails the other simply carries the load. Fan redundancy is usually just a matter of putting enough fans in the system so that if one fails, the remaining fans still provide sufficient cooling. So the cost that redundant power supplies and fans add to a device is mostly just the cost of the components themselves.
As you approach high-end equipment, you start finding redundant control and forwarding planes. These components are far more expensive than power supplies and fans, and so adding redundancy here makes the cost of a networking device soar.
Let’s look at a control plane: the Route Processor (RP) on a typical Cisco router or the Routing Engine (RE) on a typical Juniper router. In its most basic implementation, control plane redundancy means that one RP or RE runs in active mode while the other is in standby. If the active one fails, operational intervention is required to switch over to the backup. Downtime is reduced because you do not have to wait for an on-site technician to replace the failed component. It’s certainly not ideal, though, because the system is still down while someone in operations detects the failure and performs the switchover.
An automatic switchover on failure sounds like it can significantly reduce recovery time, but it can also open a can of worms: How do you define a failure? A cold piece of circuitry with no electrons moving through it certainly meets the criteria. (As a Munchkin would say, it’s not only merely dead, it’s really most sincerely dead.) But what about a control plane that is still performing almost all its duties but is, say, incrementing its OSPF sequence number by a large value each time, quickly driving it to its maximum value so that OSPF must reset itself (bringing all its adjacencies down in the process)? Is this a control plane failure meriting a switchover to the backup, or just a protocol failure that, while service-impacting, is not as disruptive as a full control plane switchover? And what about a software bug that causes a control plane failure? If the bug is in one processor, it is probably in the other also. Do you allow one to fail and switch to the other, only for that one to fail and switch back to the first, and so on endlessly until someone intervenes? You might as well not have a backup control plane at all in such a situation. How do you set rules around when a switchover is helpful and when it is not? How do you determine the thresholds of failure? How do you ensure that a system does not go into a perpetual flip-flop between control planes?
Perhaps even more important, how do you design a failure detection mechanism reliable enough that it does not mistakenly declare a failure and switch over to the backup control plane when the primary is working just fine? You’d better make your choices well, because 315 seconds of downtime per year can get used up very quickly.
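To make those questions concrete, here is a rough sketch of the kind of policy logic a switchover mechanism needs: require several consecutive missed health checks before declaring a failure, and cap the number of automatic switchovers within a time window so the system cannot flip-flop forever. This is purely illustrative Python, not any vendor’s actual implementation, and every name and threshold in it is made up:

```python
import time

# Illustrative switchover policy -- the names and thresholds are hypothetical.
MISSED_CHECKS_TO_FAIL = 3    # require several consecutive missed keepalives, not one
SWITCHOVER_LIMIT = 2         # maximum automatic switchovers allowed...
SWITCHOVER_WINDOW = 3600     # ...within this many seconds

class SwitchoverPolicy:
    def __init__(self):
        self.missed_checks = 0
        self.switchover_times = []

    def record_health_check(self, healthy: bool) -> None:
        # Reset the counter on any successful check so a transient hiccup
        # does not accumulate into a false failure declaration.
        self.missed_checks = 0 if healthy else self.missed_checks + 1

    def should_switch_over(self) -> bool:
        if self.missed_checks < MISSED_CHECKS_TO_FAIL:
            return False    # not yet confident the active RP/RE has failed
        now = time.time()
        recent = [t for t in self.switchover_times if now - t < SWITCHOVER_WINDOW]
        if len(recent) >= SWITCHOVER_LIMIT:
            return False    # probably a bug affecting both planes; wait for a human
        self.switchover_times = recent + [now]
        return True
```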
Another way redundant control planes increase system availability is by reducing intentional downtime. Operating system software must be upgraded now and then to improve security, fix a bug, add a new feature, or simply keep versions current. Traditionally, a software upgrade meant loading the new image and then restarting the system. Right there, even if everything goes well, you probably use up your 5 minutes of allowable yearly downtime.
This situation is the driver behind in-service software upgrades (ISSU): the capability to upgrade software without taking the system out of service. The key to making ISSU work is to have redundant control planes, both of which are physically separate from the forwarding plane. Rather than sitting in simple standby mode, the backup control plane “pays attention” to what the active control plane is doing: copies of the various databases and states used by the active control plane are kept on the standby. To perform a software upgrade, you first switch over to the standby control plane. Because it has been tracking databases and states, it can take control of the system much faster than if it had to come up from a passive mode. Then you perform the upgrade on the previously active control plane and restart it. When that component is back up and stable, and has again synchronized its databases and states, you switch back to it. You can then upgrade and restart the backup.
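Laid out as a sequence, the procedure looks something like the sketch below. This is only an abstract model of the steps just described; the class, the helper functions, and the version strings are hypothetical, not a real vendor API or CLI:

```python
# An abstract model of the basic ISSU sequence with redundant control planes.
# Nothing here is a real vendor API; it only makes the ordering explicit.

class ControlPlane:
    def __init__(self, name, version):
        self.name, self.version, self.active = name, version, False

    def install_and_restart(self, new_version):
        print(f"{self.name}: installing {new_version} and restarting")
        self.version = new_version

    def resynchronize(self):
        print(f"{self.name}: back up and stable, databases and state re-synced")


def switch_over(from_cp, to_cp):
    from_cp.active, to_cp.active = False, True
    print(f"switchover: {to_cp.name} is now the active control plane")


def in_service_upgrade(active_cp, standby_cp, new_version):
    switch_over(active_cp, standby_cp)           # standby takes over; it has been tracking state
    active_cp.install_and_restart(new_version)   # upgrade the formerly active control plane
    active_cp.resynchronize()                    # wait until it is stable and in sync again
    switch_over(standby_cp, active_cp)           # switch back to the upgraded control plane
    standby_cp.install_and_restart(new_version)  # finally, upgrade and restart the backup
    standby_cp.resynchronize()


rp0, rp1 = ControlPlane("RP0", "v1"), ControlPlane("RP1", "v1")
rp0.active = True
in_service_upgrade(rp0, rp1, "v2")
```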
While this basic version of ISSU can reduce the amount of downtime needed for a software upgrade, it doesn’t eliminate it entirely. And in fact the switchover can still be severely disruptive to the network. For example, even though the backup control plane has copies of interface states, neighbor states, routing tables, and so on, routing adjacencies are broken when the switchover happens. When the standby becomes active, it must follow protocol procedures to bring the adjacencies back up. While that is happening, routing protocol neighbors will detect that the node is down and tell their own neighbors, causing topological changes throughout the network. Then, when the new control plane has re-established its adjacencies, the neighbors again tell their neighbors and there is a second topology recalculation across the network.
This problem can be solved too, by implementing software that anchors routing adjacencies to the active control plane while keeping the standby aware of those adjacencies. When a switchover occurs, the standby can immediately take over the existing adjacencies without the neighbors being aware that anything changed.
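Conceptually, that means every adjacency change on the active control plane is checkpointed to the standby as it happens, so the standby’s picture is always current. Below is a minimal sketch of the idea; it is purely illustrative, and real NSR implementations replicate far more state (protocol timers, TCP session state for BGP, and so on):

```python
# Illustrative sketch of NSR-style checkpointing: the active control plane
# replicates adjacency state to the standby as it changes, so a switchover
# does not reset adjacencies. Not a real implementation.

class StandbyControlPlane:
    def __init__(self):
        self.adjacencies = {}    # mirrored copy of the active plane's view

    def checkpoint(self, neighbor_id, state):
        self.adjacencies[neighbor_id] = state

    def take_over(self):
        # On switchover, continue with the mirrored adjacencies instead of
        # re-forming them; neighbors never see the node go down.
        print(f"taking over {len(self.adjacencies)} adjacencies without resetting them")
        return self.adjacencies


class ActiveControlPlane:
    def __init__(self, standby):
        self.standby = standby
        self.adjacencies = {}

    def adjacency_event(self, neighbor_id, state):
        # Update local state, then immediately replicate it to the standby.
        self.adjacencies[neighbor_id] = state
        self.standby.checkpoint(neighbor_id, state)


standby = StandbyControlPlane()
active = ActiveControlPlane(standby)
active.adjacency_event("10.0.0.2", "Full")   # e.g. an OSPF neighbor reaching Full state
active.adjacency_event("10.0.0.6", "Full")
standby.take_over()                          # switchover: adjacencies survive intact
```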
I’ve posted before on mechanisms designed to prevent these kinds of routing protocol disruptions.
But while this “non-stop routing” capability is easy to describe, it is quite complex to implement. Router vendors developing such solutions can sink as much funding into an ISSU/NSR project as they might spend on development of a new hardware platform. Those costs are of course passed on to you.
And that’s the point (at long last) of this post. Vendors invest millions in the development of these kinds of complex HA solutions because their customers demand them. Yet those same customers are often negligent about implementing the simplest, cheapest procedural rules for preventing network outages. I’m amazed at how often network operators are willing to accept tens of thousands of dollars in added cost to get HA features in their routers but do not have – or do not enforce – configuration standards. Or have clear change management procedures. Or even implement multiple layers of configuration permissions.
And modeling network changes before performing them on the production network? That’s the exception rather than the rule.
Don’t get me wrong, I think redundancy and features like ISSU/NSR are essential to any network that must meet stringent SLAs. It’s just that the most prevalent source of outages – simple human error – gets the least attention and is the easiest and cheapest problem to remedy.