I was sitting in on the Peering BOF at NANOG a couple of weeks ago, and there was a discussion of Non-Stop Forwarding (NSF), Non-Stop Routing (NSR), and Graceful Restart (GR). It became apparent in the discussion that a couple of the participants were not making clear distinctions among these functions (or at least the acronyms), which are in fact quite different. Confusion about these and a few related functions is quite common, and vendors’ marketing tends to add to the circus.
So in this post I’d like to dig into this particular bowl of alphabet soup.
Modern high-performance routers split the forwarding plane and the control plane into separate physical components, each with its own memory and processors. The control plane runs the routing protocols, maintains the necessary databases for route processing, and derives a forwarding table (the Forwarding Information Base, or FIB). The FIB is given to the forwarding plane, which is responsible for packet forwarding.
The fundamental advantage of physically separating the forwarding and control planes is that if the traffic load becomes very heavy, and hence the forwarding plane becomes very busy, it doesn’t adversely affect the control plane’s ability to process new routing information. Conversely, if the routing protocol, and hence the control plane, becomes very busy due to a flood of new route information, it doesn’t adversely affect the ability of the forwarding plane to continue forwarding packets at high speed.
In fact, the control plane could stop functioning altogether and, because the forwarding plane is a separate entity with its own processors, it can continue forwarding packets based on its copy of the FIB. This is Non-Stop Forwarding (NSF): the ability of the forwarding plane to continue running “headless” if the control plane stops.
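To make the "headless" idea concrete, here is a minimal sketch in Python. It is a toy model, not how any vendor's hardware works: the class names `ControlPlane` and `ForwardingPlane`, the `alive` flag, and the example prefixes are all my own assumptions for illustration. The point is only that the forwarding plane holds its own copy of the FIB and never consults the control plane when looking up a next hop.

```python
import ipaddress


class ForwardingPlane:
    """Toy forwarding plane: forwards using only its own copy of the FIB."""

    def __init__(self):
        self.fib = {}  # ip_network -> next-hop string

    def install_fib(self, fib):
        # Take a copy, so losing the control plane later does not touch it.
        self.fib = dict(fib)

    def lookup(self, dst):
        # Longest-prefix match over the locally held FIB.
        addr = ipaddress.ip_address(dst)
        matches = [net for net in self.fib if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self.fib[best]


class ControlPlane:
    """Toy control plane: learns routes and pushes a FIB to the forwarding plane."""

    def __init__(self):
        self.alive = True  # hypothetical flag standing in for "control plane is up"
        self.routes = {}

    def learn(self, prefix, next_hop):
        if not self.alive:
            raise RuntimeError("control plane is down; cannot process updates")
        self.routes[ipaddress.ip_network(prefix)] = next_hop

    def push_fib(self, forwarding_plane):
        forwarding_plane.install_fib(self.routes)


cp = ControlPlane()
fp = ForwardingPlane()
cp.learn("10.0.0.0/8", "192.0.2.1")
cp.learn("10.1.0.0/16", "192.0.2.2")
cp.push_fib(fp)

cp.alive = False  # the control plane stops

# Forwarding continues from the forwarding plane's own FIB copy.
print(fp.lookup("10.1.2.3"))    # 192.0.2.2 (the more specific /16 wins)
print(fp.lookup("10.200.0.1"))  # 192.0.2.1
```

Notice that once `cp.alive` is `False`, nothing can update `fp.fib`: the lookups keep working, but against a table that no longer tracks the network.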
Of course this is dangerous; if the network topology changes while the control plane is down there is no way to process new route information and the forwarding plane’s FIB can become invalid, resulting in incorrectly forwarded packets. So why would you even want NSF?
The answer is redundant control planes (Cisco calls their control planes Route Processors; Juniper calls them Routing Engines). NSF allows you to switch from a primary to a backup control plane without disrupting forwarding. The FIB could still become invalid during the period between when the primary control plane goes down and the backup control plane takes over, but the risk in this period is usually an acceptable compromise.
The shorter you can make this switchover time, the less risk is incurred. So if the backup control plane maintains a copy of the active configuration and the current state of system components such as interfaces, it can become active much faster than if it had to learn all this information first. That, then, is the second ring of our acronym circus: Cisco calls this Stateful Switchover (SSO) and Juniper calls it Graceful Routing Engine Switchover (GRES).
In the third ring of the circus is Non-Stop Routing (NSR), and this is the most confusing part of the show. The problem with control plane switchovers as described so far, even if they use stateful procedures to decrease the switchover time, is that routing protocol adjacencies are broken by the switchover. When a primary control plane goes down, any neighboring router that had a peering session with it sees the peering session fail. When the backup control plane becomes active it reestablishes the adjacency, but in the interim the neighbor has advertised to its own neighbors that router X is no longer a valid next hop to any destinations beyond it, and that they should find another path. And of course when the backup control plane comes online and reestablishes adjacencies, its neighbors advertise that router X is again available as a next hop and everyone should again recalculate best paths. All of this can be highly disruptive to the network.
The objective of NSR is to prevent, or at least minimize, the effect of broken peering sessions.
A first attempt at controlling broken adjacencies during control plane switchovers is Graceful Restart (GR) protocol extensions. Each routing protocol has its own specific GR extensions, but they all work pretty much the same. When a router’s control plane goes down, its neighbors, rather than immediately reporting to their own neighbors that the router has become unavailable, wait a certain amount of time (the grace period). If the router’s control plane comes back up and reestablishes its peering sessions before the grace period expires, as would be the case during a control plane switchover, the temporarily broken peering sessions do not affect the network beyond the neighbors.
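The neighbor's side of that behavior can be sketched in a few lines of Python. This is a deliberately simplified model under my own assumptions: the class `GracefulRestartHelper`, its method names, and the 120-second grace period are made up for illustration, time is passed in explicitly rather than read from a clock, and the capability negotiation and stale-route refresh that real GR extensions perform are left out.

```python
class GracefulRestartHelper:
    """Toy model of a neighbor that holds routes during a peer's grace period."""

    def __init__(self, grace_period):
        self.grace_period = grace_period  # seconds; value is illustrative
        self.routes = {}       # peer -> set of prefixes learned from it
        self.stale_since = {}  # peer -> time its session was lost

    def learn(self, peer, prefix):
        self.routes.setdefault(peer, set()).add(prefix)

    def session_down(self, peer, now):
        # Keep the peer's routes, but remember when the session dropped.
        self.stale_since[peer] = now

    def session_up(self, peer, now):
        # True if the peer returned while its routes were still being held.
        lost_at = self.stale_since.pop(peer, None)
        return lost_at is not None and now - lost_at <= self.grace_period

    def tick(self, now):
        # Withdraw routes for peers whose grace period has expired.
        withdrawn = []
        for peer, lost_at in list(self.stale_since.items()):
            if now - lost_at > self.grace_period:
                withdrawn.extend(sorted(self.routes.pop(peer, set())))
                del self.stale_since[peer]
        return withdrawn

    def usable_prefixes(self):
        return sorted(p for prefixes in self.routes.values() for p in prefixes)


# Switchover: router X is back well inside the grace period.
helper = GracefulRestartHelper(grace_period=120)
helper.learn("routerX", "203.0.113.0/24")
helper.session_down("routerX", now=0)
print(helper.tick(now=30))                   # [] (nothing withdrawn yet)
print(helper.session_up("routerX", now=45))  # True
print(helper.usable_prefixes())              # ['203.0.113.0/24']

# Real failure: router X never returns, so the withdrawal is delayed.
helper2 = GracefulRestartHelper(grace_period=120)
helper2.learn("routerX", "203.0.113.0/24")
helper2.session_down("routerX", now=0)
print(helper2.tick(now=121))                 # ['203.0.113.0/24']
```

In the second case the neighbor keeps treating router X's routes as usable for the full 120 seconds before withdrawing them, even though router X is truly gone.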
There are, however, a couple of problems with GR:
- Neighbors are required to support the GR protocol extensions. Control plane switchovers are most disruptive on provider edge (PE) routers, where there are many peering sessions to customer edge (CE) routers; yet small CE routers are less likely to support GR.
- If there is a complete control plane or router failure rather than just a switchover, the GR grace period can slow network reconvergence.
A newer generation of NSR uses internal processes to keep the backup control plane aware of routing protocol state and adjacency maintenance activities, so that after a switchover the backup control plane can take charge of the existing peering sessions rather than having to establish new ones. The switchover is then transparent to the neighbors, and because the NSR process is internal (and vendor-specific) there is no need for the neighbors to support any kind of protocol extension.
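Here is a rough Python sketch of that difference. Since the real mechanisms are internal and vendor-specific, everything in it is an assumption of mine: the names `RedundantRouter`, `ControlPlaneUnit`, and `SessionState`, the `nsr_enabled` switch, and the idea of mirroring each update as it arrives. It only illustrates why a backup that already holds the session state has nothing to re-establish.

```python
from dataclasses import dataclass, field


@dataclass
class SessionState:
    """Per-peer routing protocol state (hypothetical, heavily simplified)."""
    peer: str
    routes_received: list = field(default_factory=list)


class ControlPlaneUnit:
    def __init__(self, name):
        self.name = name
        self.sessions = {}  # peer -> SessionState


class RedundantRouter:
    def __init__(self, nsr_enabled):
        self.nsr_enabled = nsr_enabled
        self.primary = ControlPlaneUnit("primary")
        self.backup = ControlPlaneUnit("backup")
        self.active = self.primary

    def update_session(self, peer, route):
        state = self.primary.sessions.setdefault(peer, SessionState(peer))
        state.routes_received.append(route)
        if self.nsr_enabled:
            # Mirror protocol state to the backup as it changes.
            mirror = self.backup.sessions.setdefault(peer, SessionState(peer))
            mirror.routes_received.append(route)

    def switchover(self):
        # Return the peers whose sessions the backup cannot simply take over.
        known = set(self.primary.sessions)
        self.active = self.backup
        return sorted(known - set(self.backup.sessions))


with_nsr = RedundantRouter(nsr_enabled=True)
without_nsr = RedundantRouter(nsr_enabled=False)
for router in (with_nsr, without_nsr):
    router.update_session("ce1", "198.51.100.0/24")
    router.update_session("ce2", "203.0.113.0/24")

print(with_nsr.switchover())     # [] (neighbors see nothing)
print(without_nsr.switchover())  # ['ce1', 'ce2'] (sessions must be rebuilt)
```

The neighbors `ce1` and `ce2` appear nowhere in the mirroring logic, which is the whole appeal: they need no protocol extension for the first case to work.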
Here’s where the confusion comes in: Different vendors use these terms differently. Juniper, for example, calls its graceful restart implementation Graceful Restart, whereas Cisco calls its graceful restart implementation Non-Stop Forwarding Awareness (even though GR applies to routing, not forwarding). Juniper users often confuse GRES and GR: Although the “G” in both acronyms stands for “Graceful,” GRES and GR are two different things. And both Cisco and Juniper have internal NSR capabilities, but the circumstances in which each can be used are quite different.
So enjoy the circus, but be aware that different vendors sometimes use different names for essentially the same act. When a vendor talks about NSF, GR, and NSR, be sure you know that vendor’s definitions.