Latency in the Data Center Matters

Poor TCP Window Sizes and Slight Changes in Delay Can Make a Huge Difference in Throughput

It's been an interesting week at work. Make that two weeks. For the last couple of weeks, we have been working on a throughput performance issue between a very expensive Fujitsu M8000 server and network storage over our 10GbE network. The M8000 was accessing the storage via NFS over TCP. Cutting to the end of the story, it wasn't (well, at this point, I should say it "isn't") a network issue, despite the ignorant claims to the contrary. As most of us network engineers know, as soon as something is slow, people will proclaim "it's gotta be a network issue." It took a while to dispel that theory (sadly, again) and focus on what now appears to be the actual problem: drivers and patches on the server.

That all being said, during our troubleshooting, the application team was dumbfounded that they couldn't achieve a pure 10Gbps from server to storage device. "It's a 10GbE network, what's the problem?" A solid class in TCP was obviously needed. For a quick review, TCP works on an acknowledgement basis. The sender transmits up to one window's worth of data (the "window size") and then must wait for an ACK from the receiver before it can send more. If the network delay between sender and receiver is large (like a WAN from the US to India), there's a lot of waiting for that ACK, and that waiting limits the sustained throughput of a single data transfer. Most people intuitively grasp this when dealing with WANs, even if only in the form of "well, the bits have to go all the way across the ocean." OK, close enough.

In a data center with a 10GbE network, though, it's not at all intuitive. Non-networking people's perception becomes, "it's 10Gbps, it should transfer at 10Gbps. There's no ocean to cross, just some patch panels and some switches." That leads to the simple extrapolation that any slowness must be an actual "problem" in the network, since it's all inside the same building. Unfortunately, TCP, delay, and window size play the same role inside a data center network as they do on an intercontinental WAN. And, now that you've paid for that fancy 10GbE network, people will start to complain when they don't get 10Gbps of throughput.

So, let's assume you have a beefy server with a 10GbE NIC, a 10GbE network with 3 hops (L3 or L2 doesn't really matter with ASICs), and another server with a 10GbE NIC. The application is going to transfer some data using NFS over TCP with default TCP settings:

  • Bandwidth = 10Gbps
  • TCP Window Size = 64K (65,536 bytes)
  • MTU = 1500
  • Delay = 1.00 ms RTT

Plugging these figures into this handy TCP calculator, the result is a very poor 525Mbps (that's an "M", not a "G"). From the system admin's perspective: "I plugged my million-dollar servers into your 10GbE network and all I get is half a gig; your network sucks." Sigh, if only that were true. The network does not see TCP. The end hosts do. So, how do we really get some performance out of these servers? Right now they're million-dollar Yugos on our 16-lane super-freeway.

First, if you can buy some better NICs and switches, you could cut the delay to 0.3 ms (300 microseconds) round trip. The throughput then roughly triples to 1.7 Gbps. This is much better, but still only about 17% of the underlying network. Next, let's get the system admins to change the TCP window size. Say they tweak the Unix settings to get a TCP window of 262K instead of 64K. With 0.3 ms and a 262K TCP window, the throughput skyrockets to 6.99 Gbps! Now we're talking. Finally, let's "goal seek" the TCP window size that, with 0.3 ms of RTT delay, fills the 10GbE pipe. That works out to a TCP window of roughly 375K. Not unrealistic for enterprise-class systems in a data center.
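If you want to check the arithmetic yourself, here is a minimal Python sketch of the same back-of-the-envelope math (my own throwaway script, not the online calculator or the spreadsheet mentioned below; the function names are mine). It uses the classic rule of thumb that a single TCP stream can keep at most one window of data in flight per round trip, so throughput tops out at roughly window size divided by RTT, and it ignores loss, protocol overhead, and queuing:

    # Back-of-the-envelope TCP throughput: one window per round trip.

    def max_tcp_throughput_bps(window_bytes, rtt_seconds):
        """Theoretical ceiling for one TCP stream: window size / RTT."""
        return window_bytes * 8 / rtt_seconds

    def window_to_fill_pipe_bytes(link_bps, rtt_seconds):
        """Bandwidth-delay product: the window needed to keep the link full."""
        return link_bps * rtt_seconds / 8

    scenarios = [
        ("64K window, 1.0 ms RTT",  64 * 1024,  0.001),
        ("64K window, 0.3 ms RTT",  64 * 1024,  0.0003),
        ("262K window, 0.3 ms RTT", 256 * 1024, 0.0003),
    ]

    for label, window, rtt in scenarios:
        print(f"{label}: {max_tcp_throughput_bps(window, rtt) / 1e9:.2f} Gbps")

    # Window required to fill a 10GbE link at 0.3 ms RTT
    print(f"Window to fill 10GbE at 0.3 ms: "
          f"{window_to_fill_pipe_bytes(10e9, 0.0003) / 1000:.0f} KB")

Run it and you get essentially the numbers quoted above: about 0.52 Gbps, 1.75 Gbps, 6.99 Gbps, and a 375K window to fill the pipe.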


I do need to caveat these calculations a bit. This is all theoretical. It is much better than the traditional theory of "it should run at 10Gbps", but still theoretical. Different environments, packet loss, errors, queuing delays, etc. will all influence the actual throughput between two hosts in a data center. Furthermore, I wrote this blog based on a handy Internet TCP tool. While it seems to be right, it may not be exact. I have a trusty Excel spreadsheet that I coded 6 years ago to do the same math. (Download the Excel spreadsheet.) With the next-to-last scenario above (a 262K window at 0.3 ms), my spreadsheet reports 4.2 Gbps, not 6.99 Gbps. Nonetheless, the theme holds. Without monitoring end-to-end delay in the data center and working with system teams to tune host TCP settings, it is unlikely you will ever realize the full value of your investment in a 10GbE network. This is true network engineering!
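As one rough illustration of the packet-loss caveat, here is a sketch using the well-known Mathis et al. approximation for loss-limited TCP throughput. This is my own addition, not part of the calculator or spreadsheet above, and real stacks (SACK, modern congestion control) will deviate from it, but it shows how little loss it takes to cap a single stream well below 10 Gbps:

    from math import sqrt

    # Mathis et al. approximation for loss-limited TCP throughput:
    #   throughput <= (MSS / RTT) * (1 / sqrt(loss_rate))
    def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_rate):
        return (mss_bytes * 8 / rtt_seconds) / sqrt(loss_rate)

    # 1460-byte MSS (1500 MTU minus IP/TCP headers), 0.3 ms RTT, 0.01% loss
    print(f"{mathis_throughput_bps(1460, 0.0003, 0.0001) / 1e9:.2f} Gbps")

In that model, even a 0.01% loss rate caps a single stream at roughly 3.9 Gbps, which is another reason the theoretical ceiling rarely shows up on a real network.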


More From the Field blog entries:

Net Neutrality for Dummies

Cisco Nexus 5000's Poor Results in Data Center Switching Test

Cisco's Overlay Transport Virtualization (OTV) is New, but has Potential

How is the CCDE Coming Along?

IPv4 Space is Getting Low - Really Low

Cisco's on to Something with Borderless Networking
