The scourges of application performance over the WAN

Packet loss, latency and "chattiness" problems are the key issues to be addressed.

Last time, I asserted that WAN-specific application performance issues are driven entirely by three factors: latency, packet loss and bandwidth.

This time, we'll cover three other major factors affecting WAN application performance, which themselves are hugely impacted by latency and packet loss, as well as covering the main causes of packet loss and high latency issues on the WAN.

The first two additional factors affecting application performance have to do with the nature of how TCP (Transmission Control Protocol) works: the bandwidth-delay product and how TCP does congestion control. The other factor, application "chattiness," concerns how certain higher-level protocols built on top of TCP, or even applications themselves, are written.

I won't attempt in this short(!) column to give a full tutorial on how TCP works, but for purposes of understanding WAN effects, these three sentences sum it up nicely: "TCP uses a 'congestion window' to determine how many packets it can send at one time. The larger the congestion window size, the higher the throughput. The TCP 'slow start' and 'congestion avoidance' algorithms determine the size of the congestion window."

The bandwidth-delay product is the product of a data link's capacity (in bits per second) and its end-to-end delay (in seconds). The result, an amount of data measured in bits (or bytes), is the maximum amount of data that can be on the network circuit at any given time, i.e. data that has been transmitted but not yet received. The bandwidth-delay product, which has essentially no practical effect on LAN performance, is a well-known limit on how fast data transfers can occur over high-delay Wide Area Networks, because of the TCP window size.
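
To make the definition concrete, here is a minimal sketch in Python; the link speed and RTT figures are illustrative assumptions, not from the article:

```python
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    """Bandwidth-delay product: link capacity times delay, in bytes."""
    return bandwidth_bps * rtt_s / 8

# Assumed example: a 100 Mbit/s WAN path with an 80 ms round-trip delay
# must keep roughly 1 MB of data in flight to stay full.
print(round(bdp_bytes(100e6, 0.080)))  # 1000000
```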

Historically, TCP window sizes were limited to 64KB (64 kilobytes = 512 kilobits) of data. Modern TCP and OS implementations typically use the TCP window scaling option, enabling window sizes of 256KB or even more. This can drastically increase TCP transfer rates across the WAN, so long as there is minimal packet loss. In the face of meaningful loss, however, a larger maximum allowable amount of data in flight has little practical effect on performance, because the congestion window (per the following paragraph) is usually limited to well below the bandwidth-delay product.
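
The throughput ceiling a fixed window imposes is simply window size divided by RTT. A short sketch, with an assumed 80 ms coast-to-coast RTT:

```python
def max_throughput_bps(window_bytes: int, rtt_s: float) -> float:
    """Best-case TCP throughput when limited only by window size: window / RTT."""
    return window_bytes * 8 / rtt_s

# Classic 64 KB window over an assumed 80 ms RTT -- regardless of link speed:
print(max_throughput_bps(64 * 1024, 0.080))   # ~6.55 Mbit/s
# Window scaling to 256 KB raises the ceiling fourfold:
print(max_throughput_bps(256 * 1024, 0.080))  # ~26.2 Mbit/s
```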

A closely related issue is the manner in which TCP does congestion control: "TCP uses a network congestion avoidance algorithm that includes various aspects of an additive-increase-multiplicative-decrease (AIMD) scheme, with other schemes such as slow-start in order to achieve congestion avoidance." In particular, the amount of traffic allowed to be "in flight" after a lost packet is detected is reduced by 50%. TCP's congestion control algorithm, and AIMD in particular, is the primary reason the Internet has not "collapsed" under the weight of everyone using it; it is the amazingly elegant method TCP's designers came up with to use available bandwidth efficiently and provide fairness "on average." For an individual application, however, it means that performance suffers notably with packet loss, and interactive or real-time applications in particular can perform very badly when packet loss rates exceed ~1%.
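
A toy simulation illustrates the AIMD sawtooth: the window grows by one unit per RTT and is halved on each loss event. This is only a sketch of the additive-increase/multiplicative-decrease idea, not a full TCP implementation (no slow start, timeouts, or fast recovery):

```python
def aimd(rtts: int, loss_at: set[int], start: float = 10.0) -> list[float]:
    """Window size (in MSS units) after each RTT under a simple AIMD policy."""
    window, history = start, []
    for t in range(rtts):
        if t in loss_at:
            window = max(1.0, window / 2)  # multiplicative decrease on loss
        else:
            window += 1.0                  # additive increase per loss-free RTT
        history.append(window)
    return history

# A single loss at RTT 5 halves the window; recovery takes many RTTs:
print(aimd(10, {5}))  # [11.0, 12.0, 13.0, 14.0, 15.0, 7.5, 8.5, 9.5, 10.5, 11.5]
```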

Those are the major TCP issues affecting almost all applications – even those that don't use TCP themselves but are affected by the packet loss and latency increases caused by TCP-based flows competing for the same bandwidth. In addition to these general TCP issues, there is the "chattiness" of certain applications or protocols. Essentially, chattiness refers to how many round-trip communications – largely if not wholly serialized – between client and server are required to perform a given application function. A fantastic explanation of this issue for web-based applications, first published more than 10 years ago, can be found in a NetForecast paper on Web performance over the Internet.
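
The arithmetic behind chattiness is simple but brutal: serialized round trips multiply directly with RTT. The round-trip count and RTTs below are illustrative assumptions:

```python
def operation_time_ms(round_trips: int, rtt_ms: float) -> float:
    """Lower bound on elapsed time for fully serialized request/response pairs."""
    return round_trips * rtt_ms

# An assumed 200 serialized round trips for one application operation:
print(operation_time_ms(200, 1))   # LAN, 1 ms RTT: 200 ms -- barely noticeable
print(operation_time_ms(200, 80))  # WAN, 80 ms RTT: 16000 ms -- 16 seconds
```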

Two common protocols that are very chatty are Microsoft's CIFS protocol for file service, and HTTP, the dominant protocol used for web applications. CIFS in XP and earlier versions was especially chatty and problematic over the WAN, and some would argue this was the single biggest reason for the rapid growth in the WAN Optimization business in the 2000s. The SMB 2.0 file service protocol in Windows 7 and Windows Server 2008 and later versions is better over the WAN, but some "chattiness" performance issues remain, especially under any meaningful amount of packet loss.

Much like the bandwidth-delay product issue, "chattiness," which doesn't hurt performance much on a LAN, can have major consequences on a WAN facing packet loss and/or high latency. For public Internet applications, including but not limited to web apps, large numbers of DNS (Domain Name System) requests are a hidden form of application chattiness.

While there are innumerable other application-specific factors, it's not too much of an oversimplification to suggest that, in the end, their impact is quite similar to the "chattiness" issue noted above. Interactive applications such as desktop virtualization are a good example here.

So, if all network-specific performance issues can eventually be traced back to bandwidth, latency, and packet loss, let's look a bit deeper into the causes of latency and packet loss in IP WANs.

[Some sharp-eyed readers might be wondering about now, "Why hasn't this guy mentioned jitter yet? We know jitter is a huge issue in real-time application performance." In fact, jitter is a measure of the variability over time of packet latency across a network – i.e. a component of latency! Perhaps the only meaningful exceptions to the "app-specific factors are usually a variant of 'chattiness'" rule are real-time applications, where the end user directly experiences poor voice and/or video quality from either packet loss or excessive jitter (jitter buffers provide a cushion against moderate amounts of jitter), and occasionally horrible performance or even loss of connection when video key frames are lost or packet loss rates get too high.]

If we break down WAN latency into its constituent parts, we see both "fixed" components and variable ones. The largest fixed component of WAN latency relates to the number of route miles a packet must travel between source and destination – and is thus limited by the speed of light – with a smaller component from the number of routers the packet must traverse, each adding a small, fixed transit time even when links are lightly loaded. The typical one-way latency across the continental U.S. is ~40 ms, meaning a typical coast-to-coast RTT is ~80 ms.
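
A back-of-the-envelope sketch of the fixed component, with all constants as assumptions: light travels at roughly 200,000 km/s in fiber (about two-thirds of c), and each lightly loaded router adds a small fixed transit delay.

```python
FIBER_KM_PER_MS = 200.0    # assumed: ~200,000 km/s propagation in glass
ROUTER_TRANSIT_MS = 0.05   # assumed per-hop cost on lightly loaded links

def one_way_latency_ms(route_km: float, hops: int) -> float:
    """Fixed (propagation + per-hop) component of one-way WAN latency."""
    return route_km / FIBER_KM_PER_MS + hops * ROUTER_TRANSIT_MS

# An assumed ~8,000 route-km coast to coast (fiber rarely runs straight),
# through an assumed 20 router hops:
print(one_way_latency_ms(8000, 20))  # ~41 ms one way
```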

The variable component of WAN latency – the jitter, in other words – is caused by queuing congestion at routers (or other IP forwarding devices) anywhere along the way. Queuing congestion occurs when more data arrives at a device destined for a given link than the bandwidth available on that link. In typical IP WAN routers, queuing congestion at any given router can add 100 to 200 ms of latency. Beyond that amount of delay, packets will typically be dropped – causing packet loss.
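
The queuing delay at a single congested hop is just the backlog ahead of a packet divided by the outgoing link rate. The queue depth and link speed below are assumed for illustration:

```python
def queuing_delay_ms(queue_bytes: int, link_bps: float) -> float:
    """Time for a packet to drain through a backlog of queued bytes."""
    return queue_bytes * 8 / link_bps * 1000

# Assumed example: a 250 KB backlog ahead of a 10 Mbit/s WAN link adds
# 200 ms of delay -- right around the point where routers start dropping.
print(queuing_delay_ms(250_000, 10e6))  # 200.0
```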

Overflowing queues in forwarding devices, as just noted, are the primary cause of packet loss. Some routers use a technique called WRED (Weighted Random Early Detection) to drop a small percentage of packets (i.e. far less than 100% of them) as queues begin to fill, to avoid excessive jitter and better promote "fairness" across flows. Finally, while less common than in the past, bit errors can also cause packet loss. Fairly rare on wired networks these days (beyond the occasional flaky DSL connection), bit errors are sometimes responsible for a moderate amount of packet loss on wireless networks.

High latency causes obvious problems in application performance. Packet loss usually has an even bigger negative impact on WAN app performance. Why is this? Because of TCP's congestion control, loss rates above ~1% mean that an application can use only a small fraction of the bandwidth on a WAN for its TCP flow, no matter how large the link. Over longer WAN distances – higher RTTs – throughput is lower still. And because TCP is a windowed protocol, forwarding of additional packets from the source quickly comes to a halt until a lost packet is retransmitted and acknowledged.
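
A widely used rule of thumb for this loss/RTT interaction (not from the article itself) is the Mathis et al. approximation: steady-state TCP throughput ≈ MSS / (RTT × √loss). A sketch with assumed values:

```python
import math

def mathis_throughput_bps(mss_bytes: int, rtt_s: float, loss_rate: float) -> float:
    """Approximate loss-limited TCP throughput: MSS / (RTT * sqrt(p))."""
    return mss_bytes * 8 / (rtt_s * math.sqrt(loss_rate))

# Assumed values: 1460-byte MSS, 80 ms RTT, 1% packet loss:
print(mathis_throughput_bps(1460, 0.080, 0.01))  # ~1.46 Mbit/s ceiling
# Doubling the RTT to 160 ms halves the ceiling again:
print(mathis_throughput_bps(1460, 0.160, 0.01))  # ~0.73 Mbit/s
```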

Even for applications where the bandwidth-delay product per se is not an issue, the "chattiness" problem has much the same – and occasionally worse – effect. In the face of packet loss, all data transmission halts until the lost packet is retransmitted and acknowledged; in the face of high latency, the serialization effect of waiting for one operation to complete before starting the next will slow performance substantially even in the absence of packet loss.

There is a lot to digest here, I realize, and we've really only scratched the surface. The key point is that to deliver excellent, predictable application performance, we know that we want to avoid packet loss and high latency as much as possible.

In my next post, we'll start to look at how the history of WAN services has shaped the evolution of WAN technologies – including Next-generation Enterprise WAN (NEW) architecture components WAN Optimization, WAN Virtualization, and distributed replicated/synchronized file service – and see which techniques address which of these factors, combating the negative effects of loss and latency directly and indirectly.

A twenty-five year data networking veteran, Andy founded Talari Networks, a pioneer in WAN Virtualization technology, and served as its first CEO. Andy is the author of an upcoming book on Next-generation Enterprise WANs.

Copyright © 2012 IDG Communications, Inc.