Statefulness in building IP networks

Moore's Law gains in CPU and memory mean the upside of tackling WAN cost and latency outweighs most of the downsides in complexity, scaling and high availability

In our last column, we started to look at the consequences of WAN service history. We noted that almost all the innovations in WAN Optimization, and in the more recently introduced WAN Virtualization as well, have two things in common: they take advantage of recent Moore's Law price/performance advances in CPU and memory, and they are all stateful. Here, we'll take a closer look at the issues surrounding statefulness on data networks, and especially on WANs.

An information system or protocol that relies upon state is said to be stateful. As it relates to TCP/IP-based networks, the state in question is primarily whether the communications infrastructure is aware of, and does special handling of, TCP/IP flows or other IP-based flows such as UDP, or higher-level flows built on top of TCP, such as HTTP, HTTPS or Microsoft's SMB protocol. Routing protocols such as OSPF or RIP obviously must store state, but that state is lower-level information that addresses network reachability, and is distributed among the participating systems, and so the mere existence of a routing protocol does not make a solution "stateful" in this sense.
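To make the distinction concrete, here is a minimal Python sketch, not any vendor's implementation. The class names, the exact-match route table and the per-flow counters are all hypothetical, chosen only to contrast a device that decides per packet with one that remembers something about each flow.

```python
from dataclasses import dataclass
from typing import Dict, NamedTuple


class FlowKey(NamedTuple):
    """The classic 5-tuple that identifies a TCP or UDP flow."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str  # "tcp" or "udp"


@dataclass
class FlowState:
    """Hypothetical per-flow record; real devices track far more."""
    packets: int = 0
    byte_count: int = 0


class StatelessForwarder:
    """Picks a next hop from the destination alone and remembers nothing."""

    def __init__(self, routes: Dict[str, str]):
        # Assumption: exact-match destination table, for brevity.
        self.routes = routes

    def forward(self, key: FlowKey, length: int) -> str:
        return self.routes.get(key.dst_ip, "default")


class StatefulMiddlebox(StatelessForwarder):
    """Same forwarding decision, but keeps one record per flow it has seen."""

    def __init__(self, routes: Dict[str, str]):
        super().__init__(routes)
        self.flows: Dict[FlowKey, FlowState] = {}

    def forward(self, key: FlowKey, length: int) -> str:
        # The flow table grows with the number of distinct flows,
        # which is exactly the cost the stateless design avoids.
        state = self.flows.setdefault(key, FlowState())
        state.packets += 1
        state.byte_count += length
        return super().forward(key, length)
```

The stateless forwarder's memory depends only on the size of its route table, while the stateful one's also depends on how many flows pass through it.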

Historically, statefulness in data networks was highly controversial. Certainly this was true in the 1990s, especially among Internet cognoscenti and purists. The idea was that TCP/IP-based networking was designed around the end-to-end argument (the paper describing this design principle being one of the few things from my grad school days I've actually found useful during my career): since multiple complex functions have to happen at the higher layers in the end hosts anyway, the middle of the network should do as little as possible with packets other than getting them to their destination.

And in fact, going back a decade or more, there was very good reason for this approach. Keeping state was extremely expensive in terms of memory and processing power: given the cost of memory and the limited availability of CPU MIPS, keeping state on TCP flows wasn't cost effective. Routers were specifically designed to be stateless when it comes to IP flows, creating a relatively "dumb" network with almost all of the "intelligence" at the hosts - at least as it relates to applications and application flows. With limited CPU and memory, stateless middleboxes were the only reasonable way to go. There were companies at the turn of the millennium that tried to sell flow-based core routers, but they were never successful and have pretty much vanished.

I'd argue that, re: TCP flows, let alone higher-level application flows, for the high speed service provider WAN core, stateless is still pretty much the only reasonable way to go, and this might be the case "forever." Of course, the core itself might migrate from routers connected in a mesh to a layer 1 or layer 2 optical core, where someday all devices that look at layer 3 at the edge of that core can and do deal with flow states.

There is no doubt that statefulness makes the devices more complex, makes delivering high availability more difficult, makes multivendor interoperability geometrically more difficult, and usually increases the chance that applications "break" in the face of some network failures.

So given the efficiency inherent in the end-to-end argument, as well as the other potential difficulties with doing stateful communications infrastructure for TCP/IP-based networks, why have stateful solutions become so much more prevalent?

Arguably, for security alone it became necessary to have a stateful middlebox, which for at least some functions broke the end-to-end model. This is, after all, what a stateful firewall does. Without such detailed state, it's simply not possible for a firewall (or next-generation firewall, or IPS, or application layer firewall...) to perform its function.
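As a rough illustration of why a firewall needs this state, the Python sketch below admits an inbound packet only if it is the reverse of a flow that an inside host opened earlier. It is a deliberately simplified model: the class and field names are hypothetical, and it ignores TCP flags, timeouts and everything else a real firewall tracks.

```python
import ipaddress
from typing import NamedTuple, Set, Tuple


class Packet(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class StatefulFirewall:
    """Toy model: outbound traffic is allowed and remembered,
    inbound traffic is allowed only as a reply to a remembered flow."""

    def __init__(self, inside_network: str):
        self.inside = ipaddress.ip_network(inside_network)
        # Each entry is (inside_ip, inside_port, outside_ip, outside_port).
        self.established: Set[Tuple[str, int, str, int]] = set()

    def _is_inside(self, ip: str) -> bool:
        return ipaddress.ip_address(ip) in self.inside

    def allow(self, pkt: Packet) -> bool:
        if self._is_inside(pkt.src_ip):
            # Outbound: record the flow so the reply can be recognized later.
            self.established.add(
                (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port)
            )
            return True
        # Inbound: without the stored entry there is no way to tell
        # a legitimate reply from an unsolicited packet.
        return (pkt.dst_ip, pkt.dst_port, pkt.src_ip, pkt.src_port) in self.established
```

A packet from an outside server to an inside host is dropped until that host has first sent traffic to the same address and port; after that, the identical packet is accepted. The decision depends on history, which is what makes the device stateful.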

Now, perhaps network security is a unique function that requires statefully breaking the end-to-end model? Maybe. But application developers made sure that their applications worked in environments with stateful firewalls (and things like NAT functionality got more sophisticated), and customers got used to at least one piece of network infrastructure breaking the end-to-end model - for those who cared at all in the first place.

For IP WANs, two things came together. Frame Relay and MPLS were (and remain) so expensive that most last-mile WAN pipes are thin, while the inexorable march of Moore's Law has made CPU power and memory ever more affordable. That combination is what really made it compelling for companies to develop, and enterprise customers to adopt, technologies such as WAN Optimization and WAN Virtualization, which store TCP flow state and break the end-to-end model.

There are other reasons as well that stateful solutions have succeeded. The overlay model of deployment pioneered by WAN Optimization vendors such as Peribit and Riverbed made it "safer" to do statefulness. The underlying routed network was unchanged, avoiding thorny interoperability issues with proprietary Cisco routing protocols. Fail-to-wire Ethernet pairs in appliances for overlay solutions make it "transparent" in failure cases. Overlay plus fail-to-wire meant that vendors could take longer to implement true high availability solutions, and that deployments without a lot of redundant hardware at smaller sites could still deliver reliable network connectivity. Finally, the underlying network is standard enough - TCP/IP and Ethernet, typically using Linux-on-Intel-based appliances - that it's cost-effective and straightforward to implement such solutions.

WAN Optimization demonstrated that WAN pipes have gotten so thin relative to LAN speeds and the state of computing and storage technology that solutions could be stateful, and even include hard disk access, and still make sense!

Now, it is important to acknowledge the disadvantages and limitations of stateful solutions. The two biggest ones are scaling and interoperability. Per above, I'd argue that the network core for a large network, and certainly for something as large and complex as the Internet, probably should not be stateful in terms of TCP/IP flows.

Re: scaling, there is no doubt that stateful solutions do not scale as well as stateless ones. Trying to build the public Internet itself with a stateful solution would IMO be a disaster, in addition to being much more expensive. Indeed, the Internet was built upon the previously referenced end-to-end argument. The number of flows to be tracked is simply too large, and coordinating flow handling across multiple networks poses too many challenges.
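A back-of-the-envelope calculation shows how the cost of flow state grows with the number of flows. The figures in the comments (256 bytes of state per flow, ten thousand flows at a branch, a hundred million in a core) are purely hypothetical assumptions picked for illustration, not measurements, and the function names are made up for this sketch.

```python
def flow_table_bytes(concurrent_flows: int, bytes_per_flow: int) -> int:
    """Memory needed to hold one state record per concurrent flow."""
    return concurrent_flows * bytes_per_flow


def to_mib(num_bytes: int) -> float:
    """Convert bytes to mebibytes (2**20 bytes)."""
    return num_bytes / 2**20


# Hypothetical branch office: 10,000 flows at 256 bytes each.
branch = flow_table_bytes(10_000, 256)        # 2,560,000 bytes

# Hypothetical core device: 100,000,000 flows at 256 bytes each.
core = flow_table_bytes(100_000_000, 256)     # 25,600,000,000 bytes

print(f"branch: {to_mib(branch):.1f} MiB")
print(f"core:   {to_mib(core):.1f} MiB")
```

Under these assumptions the branch table fits in a few megabytes while the core table runs to tens of gigabytes. Memory is only part of the story, since every packet also needs a table lookup at line rate, but it gives a feel for why the edge and the core are such different problems.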

A quick digression comparing this with the scaling issues around OpenFlow and SDN might be useful here. Note that this is not a fully apples-to-apples comparison, since OpenFlow and SDN aren't about keeping per-flow TCP state in the network switches, and the state being discussed there really is more about the network and network addressing. As Omar Baldonado points out on the Big Switch Networks blog, having one config to rule them all is a big advantage of an OpenFlow-based solution. It's a centrally stored and managed single state, single configuration for the "entire network." This works for a single data center, even a fairly large one - and is arguably a very good idea. But that's for a single data center. Such a centralized configuration approach can work for an enterprise WAN domain (this is how the WAN Virtualization solution from Talari Networks, the company I founded, does it). It might even work for a single Internet Service Provider's internal network, although that's a pretty risky step for an ISP to take. However, it's unlikely to work for the public Internet, where network ownership and scalability, among other factors, pretty much mandate a distributed approach rather than one with all the state stored in a central location.

Back to statefulness re: TCP flows. With today's enterprise WAN solutions, statefulness is probably about 70% intended to deal with the expensive WAN edge-pipe problem, and 30% to address speed-of-light and congestion-based jitter and packet loss issues. If you've got a truly random any-any WAN with multi-gigabit connections at most locations, stateful solutions might not be for you (and you're probably a company with a network the size and complexity of Microsoft or Google...). But even if you've got hundreds to thousands of locations, if like most companies you've consolidated your data centers to a handful of locations, then the technology today is such that stateful solutions to enterprise WAN edge problems are quite feasible. As we'll see in a future column, this is true even for the two applications - voice and videoconferencing - which actually are any-any, thanks to the combination of WAN Virtualization technology and judiciously chosen colocation facilities, two key parts of the Next-generation Enterprise WAN (NEW) architecture.

Interoperability is a legitimate question; typically this is solved over time after the biggest phase of innovation is completed, and there is no question that such interoperability for software-intensive networking solutions is even more difficult when TCP or application state is involved. But before thinking that this is a deal-breaker for deployment, we should acknowledge that for the last 25 years, even though standards and interoperability of individual components have been desirable, enterprises in practice have typically bought any one component of a system from a single vendor for all sorts of rational reasons: manageability, support, etc.

In future columns, we'll cover more of the specific advantages that statefulness in network solutions can bring to the enterprise WAN. Two NEW architecture components illustrate the point: WAN Optimization reduces application-specific protocol chattiness and adds compression and deduplication, while WAN Virtualization can avoid inbound last-mile congestion and replicate real-time or time-sensitive interactive flows across multiple connections. These are just a few examples of the power that stateful infrastructure solutions can bring to the enterprise WAN.
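Deduplication is a good example of a feature that only works because both ends keep state. The Python sketch below is a simplified illustration under stated assumptions: fixed-size chunks, SHA-256 digests as references, and in-memory stores. The class names and the token format are hypothetical, and commercial products are generally more sophisticated than this.

```python
import hashlib
from typing import Dict, List, Set, Tuple

# A token is either ("raw", chunk_bytes) or ("ref", sha256_digest).
Token = Tuple[str, bytes]


class DedupEncoder:
    """Sending side: remembers which chunks it has already sent."""

    def __init__(self, chunk_size: int = 1024):
        self.chunk_size = chunk_size
        self.seen: Set[bytes] = set()

    def encode(self, payload: bytes) -> List[Token]:
        tokens: List[Token] = []
        for i in range(0, len(payload), self.chunk_size):
            chunk = payload[i:i + self.chunk_size]
            digest = hashlib.sha256(chunk).digest()
            if digest in self.seen:
                # Already sent once: ship the 32-byte digest instead.
                tokens.append(("ref", digest))
            else:
                self.seen.add(digest)
                tokens.append(("raw", chunk))
        return tokens


class DedupDecoder:
    """Receiving side: stores every raw chunk so references can be resolved."""

    def __init__(self) -> None:
        self.store: Dict[bytes, bytes] = {}

    def decode(self, tokens: List[Token]) -> bytes:
        out = bytearray()
        for kind, value in tokens:
            if kind == "raw":
                self.store[hashlib.sha256(value).digest()] = value
                out += value
            else:
                out += self.store[value]
        return bytes(out)
```

Sending the same payload twice through one encoder produces raw chunks the first time and only references the second time, and the decoder reconstructs the original bytes in both cases. If either side loses its store, the references become meaningless, which is the high-availability cost of statefulness in miniature.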

A twenty-five year data networking veteran, Andy founded Talari Networks, a pioneer in WAN Virtualization technology, and served as its first CEO. Andy is the author of an upcoming book on Next-generation Enterprise WANs.
