by Charles Goldberg, special to Network World

Making sure net routing doesn’t fail

Nov 18, 2002

When a route processor fails, two new software features have been designed to maintain edge router integrity: stateful switchover and nonstop forwarding.

If a route processor fails, is there a network outage? Not necessarily. If the network device recovers from the failure with no detectable disruption, the network has not failed: as far as end users are concerned, there was no outage and no downtime.

But even in cases when a route processor does fail, two new software features have been designed to maintain edge router integrity: stateful switchover (SSO) and nonstop forwarding (NSF).

Stateful switchover allows a hot-standby processor to take over from the failed route processor while maintaining connectivity. SSO also ensures that network management systems can manage a device with two route processors as one system and one manageable entity.

With SSO, both active and standby route processors maintain Layer 2 data-link connectivity information by checkpointing the minimal data required to maintain ATM, frame relay and Ethernet connections from the active route processor to the standby one. Maintaining the connection is imperative to minimize CPU utilization, reduce the amount of data loss during a switchover and quickly establish the standby processor in hot standby state.

Additionally, any method to create an SSO environment must be able to scale to tens of thousands of interfaces, because routers on the Internet keep connection information on tens of thousands of other routers to which they might need to connect. To accomplish this, the goal is to maintain across the route processors only the state that is necessary and cannot be re-created. Examples of state kept across the route processors are physical interface state, permanent virtual circuit state and command synchronization.
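The idea of checkpointing only what cannot be re-created can be sketched in a few lines of Python. This is an illustration, not how any router implements SSO: the `RouteProcessor` class, the `checkpoint` function and the key names are all hypothetical, chosen to mirror the examples of state named above.

```python
import copy
from dataclasses import dataclass, field

# Hypothetical names for the kinds of state the article says are kept
# across route processors. Anything else is assumed to be re-creatable.
CHECKPOINTED_KEYS = ("interface_state", "pvc_state", "config_commands")


@dataclass
class RouteProcessor:
    name: str
    state: dict = field(default_factory=dict)


def checkpoint(active: RouteProcessor, standby: RouteProcessor) -> None:
    """Copy only the non-re-creatable state from active to standby."""
    for key in CHECKPOINTED_KEYS:
        if key in active.state:
            # Deep copy so the standby holds its own independent snapshot.
            standby.state[key] = copy.deepcopy(active.state[key])


active = RouteProcessor("rp0", {
    "interface_state": {"atm1/0": "up", "serial2/0": "up"},
    "pvc_state": {"atm1/0:1/100": "active"},
    "config_commands": ["interface atm1/0", "no shutdown"],
    # Route table entries can be relearned from peers, so they are
    # deliberately left out of the checkpoint in this sketch.
    "route_table": {"10.0.0.0/8": "atm1/0"},
})
standby = RouteProcessor("rp1")
checkpoint(active, standby)
```

After the call, the standby holds the interface, circuit and configuration state but no route table, which keeps the amount of synchronized data small as the number of interfaces grows.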

In a failure, SSO switches the system to the hot standby route processor. The failed one will attempt to reboot and operate as the new standby. This handoff happens without rebooting line cards, and therefore without creating a link flap, which might cause connectivity protocols to drop their sessions.
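The handoff described above can be modeled as a simple role swap. The Python below is a toy sketch under stated assumptions: `Chassis`, `LineCard`, `fail_active` and the event strings are invented for illustration, and the failed processor's reboot is assumed to succeed.

```python
from dataclasses import dataclass


@dataclass
class RouteProcessor:
    name: str
    role: str  # "active" or "standby"


@dataclass
class LineCard:
    slot: int
    reboots: int = 0
    link_up: bool = True


class Chassis:
    """Toy model of a router with two route processors (hypothetical)."""

    def __init__(self, active, standby, line_cards):
        self.active = active
        self.standby = standby
        self.line_cards = line_cards
        self.events = []  # stand-in for notifications sent to management

    def fail_active(self):
        failed = self.active
        self.events.append(f"{failed.name} failed")
        # Promote the hot standby. Line cards are never touched here,
        # so their links stay up and no link flap occurs.
        self.active, self.standby = self.standby, failed
        self.active.role = "active"
        self.events.append(f"{self.active.name} is now active")
        # Assumption for this sketch: the failed processor reboots
        # successfully and rejoins as the new standby.
        failed.role = "standby"
        self.events.append(f"{failed.name} rebooted as standby")


chassis = Chassis(
    RouteProcessor("rp0", "active"),
    RouteProcessor("rp1", "standby"),
    [LineCard(slot=1), LineCard(slot=2)],
)
chassis.fail_active()
```

The point of the model is what `fail_active` does not do: it never increments a line card's reboot count or takes a link down.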

Every step of the SSO process is monitored through SNMP, informing the network management team that there was a route processor failure. This is critical: because their applications are never interrupted, customers won't call the network operations center to report a failure. The SNMP traps tell the network management systems the cause of the failure and whether the failed route processor was able to reboot. If not, it needs to be replaced, which is done without taking the router out of service.

Nonstop forwarding ensures IP packets are forwarded continuously during the switchover.

It is not practical to attempt to maintain all the route table states across two route processors, because route tables can have 100,000 to 200,000 route entries. So, the Internet Engineering Task Force has proposed protocol restart extensions that enable nonstop forwarding for Border Gateway Protocol (BGP), Intermediate System to Intermediate System and Open Shortest Path First protocols. Similar extensions will be available for Enhanced Interior Gateway Routing Protocol.

These extensions maintain the Layer 3 relationships between the restarting router and all its peer routers without maintaining any routing state between the route processors, thus eliminating scalability issues.

When two routers form a peering relationship, they exchange capabilities. New capabilities have been added that caution peers not to remove a failed router from the database because it could come back even before connectivity protocols time out.

These new routing protocol extensions allow a restarting router to notify peers when it has returned, to request all the information it needs to rebuild its route tables and, in the case of BGP, to reestablish the TCP session between peers.
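From a peer's point of view, the restart behavior amounts to keeping a neighbor's routes for a while after the session drops instead of deleting them at once. The Python sketch below is a simplified model, not the behavior of any specific protocol extension: `PeerTable`, `Neighbor`, the `restart_time` field and the 120-second value are assumptions made for illustration, and time is passed in explicitly rather than read from a clock.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Neighbor:
    name: str
    restart_capable: bool      # capability exchanged when peering formed
    restart_time: float        # seconds to wait for the neighbor (assumed)
    routes: dict = field(default_factory=dict)
    stale_since: Optional[float] = None


class PeerTable:
    """One router's view of its neighbors (hypothetical model)."""

    def __init__(self):
        self.neighbors = {}

    def add(self, neighbor):
        self.neighbors[neighbor.name] = neighbor

    def session_lost(self, name, now):
        n = self.neighbors[name]
        if n.restart_capable:
            # Keep the neighbor and its routes; just mark them stale.
            n.stale_since = now
        else:
            del self.neighbors[name]

    def session_restored(self, name):
        # The neighbor announced that it has returned.
        self.neighbors[name].stale_since = None

    def expire(self, now):
        # Drop neighbors that did not return within their restart time.
        for name, n in list(self.neighbors.items()):
            if n.stale_since is not None and now - n.stale_since > n.restart_time:
                del self.neighbors[name]


table = PeerTable()
table.add(Neighbor("edge1", True, 120.0, {"10.1.0.0/16": "edge1"}))
table.add(Neighbor("edge2", False, 0.0, {"10.2.0.0/16": "edge2"}))

table.session_lost("edge1", now=0.0)   # restart-capable: routes retained
table.session_lost("edge2", now=0.0)   # not capable: removed immediately
table.expire(now=30.0)                 # edge1 is still within its window
table.session_restored("edge1")        # edge1 came back in time
table.expire(now=500.0)                # nothing stale remains to expire
```

In this run, `edge1` stays in the table throughout because it advertised the restart capability and returned before its timer ran out, while `edge2` is dropped as soon as its session is lost.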

NSF and SSO preserve user sessions during a route processor failure. Even voice-over-IP calls have survived SSO tests.

SSO and NSF are just two of a wave of new features coming to networks that provide graceful recovery from different types of network failures. The result is a new level of end-to-end resiliency on networks.

Goldberg is manager of the product management Internet technologies division at Cisco.