As I sit here in the lobby of the Hyatt after attending CloudConnect this week (and even speaking a bit), I was listening to The Shins' song 'New Slang' from the Garden State soundtrack [a good one, btw, especially if you want to chill out, type for a bit, and not be forced into eavesdropping on three conversations around you that are all tangentially interesting]. As I listened to 'New Slang' I kept changing the lyrics in my head to 'New Scale.' So while the title of this post brings back Huey Lewis and the News memories from the '80s, I must confess it was Garden State that was the actual genesis.
So why do I need a new scale?
It used to be that when we discussed network switches and routers we would use a pretty consistent set of metrics to determine which product or technology was superior or best fit a given mission. These have consisted of metrics such as:
How many interfaces does the device have, and at what speed?
Does it forward each of them at wire speed?
Can it forward all of them at wire speed at the same time?
If it cannot forward all of them at wire speed, how and where does it oversubscribe?
If it is oversubscribing, how does it determine what is to be dropped? Does it report the drop?
Does it allow QoS to influence and help select what is to be dropped and what is not to be dropped?
Can I tell whether traffic is being dropped because my network element lacks the capacity to forward it, or because the egress interface selected by the forwarding engine/tables is congested, its buffers are exhausted, and it can't handle any more data?
How many routing peers and adjacencies can my device handle?
How fast does it converge?
How many ACLs, and how many lines per ACL, can it handle before it runs out of capacity to process them?
How many IP routes can it handle? How about for IPv6?
You get the drift...
Now, I’ve heard a constant complaint from many network-savvy people when talking about a technology, product, etc.: ‘X won’t scale.’ [It’s kind of a death knell for many otherwise inspiring arguments about networking technology. Most discussions end at this point.]
The real question regarding ‘will it scale’ is always best answered in context rather than in abstract. So ‘it doesn’t scale’ should actually be restated in many cases: ‘It may meet my requirements today; however, given my anticipated growth rates and demands it is not likely to meet them in the future as my IT demand is outstripping this product’s capability to meet that demand.’
In this rather daunting age of virtualization, cloud, sometimes mobile workloads, fabrics, overlay segmentation models, and machine-readable programmatic APIs, the discussion around scale needs to encompass a few new metrics people have historically overlooked. See, most networks in the data center were built with the assertion that the access tier will have one host per port. This one host in most cases has one MAC address and one IP address.
However, in the virtualized data center or cloud, a port is likely to have multiple. As an example, taking the now well-known Amazon ECU (Elastic Compute Unit), you can cram about 25-40 of them onto a dual-socket Intel Westmere server, and shortly thereafter you should be able to do north of 50 with the Intel Romley platform. If your application is more VDI, you could easily see 100-200 VDI sessions on a powerful server.
This changes the scale at which we measure a network capacity and its ability to grow with our business requirements. Here’s why:
New Scale Metric #1: The ARP Table
Most vendors planned on reasonably sized network broadcast domains. A switch needs to know the MAC address of every host present within a broadcast domain, or it floods the unknown frames out all interfaces participating in that broadcast domain. Historically, most switches at the access tier had MAC address tables of ~8000 to 16,000 entries. This was fine when the switch was the default gateway for its subnet, or the aggregation-tier switch above it was, or only a few switches were within the same subnet - figure 5-10 of them or so, maybe 20 if you are getting a little crazy with your big subnets.
Recently, some vendors came out with larger MAC tables - for instance a Broadcom Trident+ chipset supports 128k MAC addresses. This used to be the size you would see in the aggregation tier where it may have to support 40-50 switches in multiple subnets if it is acting as the default gateway for each of them.
ARP tables (or IP host tables) have become one of the new major bottlenecks, though - while the Broadcom Trident+ chip supports 128k MAC entries, it only has 16,000 ARP entries. This may sound like a lot to you, but let’s evaluate the following scenario:
A vendor recently described its Large, Flat, L2 topology as being capable of over 6000 ports. 16,000 ARP entries divided by 6000 ports gives me about 2.6 MAC/IP pairs per port. If we wanted to take those 6000 ports and put them in a private cloud running an average of 20 virtual machines per server with a typical 2 vNICs per VM, we would need 6000 servers * 20 VMs * 2 vNICs = 240,000 IP/MAC pairs.
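If you want to run the same exercise against your own environment, the arithmetic is trivial to script. A minimal sketch in Python - the 16,000-entry table is the Trident+ figure from above, while the server count, VM density, and vNIC count are just the scenario's illustrative assumptions:

```python
# Back-of-the-envelope ARP table check for a flat L2 fabric.
# Inputs are the illustrative figures from the scenario above,
# not measured values from any specific deployment.

def arp_demand(servers, vms_per_server, vnics_per_vm):
    """Total IP/MAC pairs the default gateway must hold in its ARP table."""
    return servers * vms_per_server * vnics_per_vm

ARP_TABLE_SIZE = 16_000   # e.g. a Trident+-class chip

demand = arp_demand(servers=6000, vms_per_server=20, vnics_per_vm=2)
print(f"ARP entries required: {demand}")          # 240000
print(f"Table capacity:       {ARP_TABLE_SIZE}")
if demand > ARP_TABLE_SIZE:
    print(f"Oversubscribed {demand / ARP_TABLE_SIZE:.0f}x -> expect flooding")
```

Swap in your own port counts and VM densities; the point is simply that the demand side grows multiplicatively while the table is fixed.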
Lesson: If you are planning on deploying a Large, Flat, L2 Network or a Fabric, check your ARP table sizing and planned consumption. I’m not going to get into which vendor is the best or worst here - just check it and make sure you are not setting yourself up to flood everywhere.
New Scale Metric #2: Buffering
For about the last 16 years we haven’t had to talk about buffer capacity much, because Cisco got it pretty well right on the Catalyst 5000 switches. The Catalyst 5000 used per-port buffer memory of 192KB per 10Mb Ethernet port, divided on a 7:1 ratio in the SAINT ASIC to give 24KB of input buffer and 168KB of transmit buffer. The buffers were somewhat flexible in that the buffer memory used a memory unit size that wasted little space, so you could fill the buffer with 64-byte frames or jumbo frames and they would still be adequately buffered.
But time marches on and architectures change. Input port buffers faded in popularity in favor of input-based virtual output queue (VOQ) architectures with fabric arbitration. This pushed all congestion to the ingress buffer: traffic is not moved across the switch fabric until the egress port is capable of receiving it and serializing it onto the wire. By moving all points of congestion to the ingress buffer, it becomes markedly simpler to count congestion-based drops, and you gain the ability to detect congestion before you drop - VOQ is essentially what let us build lossless Ethernet in crossbar-based modular systems.
By taking all the congestion on ingress you can PAUSE the incoming traffic to allow the egress buffer the time to drain before you forward to it and before you take on so much data you have to drop.
One problem that emerged is that a lot of us banked on QCN working and being broadly accepted in the market. What we expected was that QCN would let us spread congestion back to the N interfaces feeding a congested node, so we could take advantage of the aggregate of the distributed buffers - and thus wouldn’t have to build large buffers into our ASICs anymore. For a variety of reasons QCN didn’t see widespread adoption, and some folks are quite scared of it. So let’s look at the buffer formula for the Catalyst 5000 10Mb port and scale it forward…
10Mb = 192KB
100Mb = 1.92 MB
1000Mb = 19.2MB
10Gb = 192MB
I’ll admit that you can reduce this somewhat based on a VOQ architecture so you may not need 192MB of packet buffer per port at 10Gb in a VOQ system because you get the aggregate of ports on ingress. But I would at least double the 24KB the Catalyst 5000 used in this scenario, so for a VOQ-based system you should expect to see the following on per-port ingress buffering:
10Mb = 48KB
100Mb = 480KB
1000Mb = 4.8MB
10Gb = 48MB
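The extrapolation behind both tables is just linear scaling from the Catalyst 5000's per-port figures; a small sketch makes the rule of thumb reusable (192KB total per 10Mb port is the Cat 5000 number from above, and 48KB is the doubled 24KB ingress share for the VOQ case):

```python
# Linear extrapolation of the Catalyst 5000 per-port buffer sizing.
# 192KB per 10Mb port is the Cat 5000 total; for a VOQ system we
# scale only a doubled ingress share (2 x 24KB = 48KB per 10Mb).

CAT5K_TOTAL_KB = 192      # total buffer per 10Mb port
VOQ_INGRESS_KB = 48       # 2 x the Cat 5000's 24KB input buffer

def scaled_buffer_kb(port_speed_mbps, kb_per_10mb):
    """Scale a per-10Mb-port buffer figure linearly with port speed."""
    return kb_per_10mb * port_speed_mbps / 10

for speed in (10, 100, 1000, 10_000):
    total = scaled_buffer_kb(speed, CAT5K_TOTAL_KB)
    voq = scaled_buffer_kb(speed, VOQ_INGRESS_KB)
    print(f"{speed:>6} Mb/s: {total:>9,.0f} KB total, {voq:>8,.0f} KB VOQ ingress")
```

At 10Gb this reproduces the 192MB and 48MB figures above (using decimal units, as the tables do).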
The real test of a buffer though is how well it delivers ‘goodput’ under congestion with TCP traffic as opposed to just measuring raw throughput. For this I like to whip out my handy bandwidth delay product calculator and assert that most 3-5 switch hop networks consisting of modular and fixed switch at the edge have an average end-to-end latency of 10usec:
10,000,000,000 bits/sec * .00002s = 200,000 bits * 1 byte/8 bits = 25,000 bytes
This means in a 10Gb network with 2-3 switch hops you need about 25KB of packet buffer per TCP flow you intend to support, if you assert that you need to be able to buffer an entire TCP window - so that congestion doesn’t create a tail drop and force the entire window to be retransmitted, which would reduce the overall TCP goodput of the infrastructure.
So at 48MB, as we highlighted above, you can support about 1,920 concurrent TCP sessions with enough buffer capacity to handle the entire TCP window. This is reasonably in line with supporting a larger number of VMs per port as well.
Lesson: Check the buffer sizes of what you are looking at. For single-switch hop, low-latency networks or for multicast/UDP where flow control isn’t paramount a small buffer may be acceptable. But if your applications are TCP-based, you’ll need to get a feel for how many flows you will need to support and if your infrastructure has the buffer capacity to handle that number.
So I thought about adding another one around scalability of L2 topology construction using TRILL or other such technologies. I’m kind of on the fence right now because of the variety of implementation paths different vendors are taking. I certainly know I’d hate to manage a 100-switch flat network, and I am not sure of the upper-boundaries of IS-IS scale within a TRILL environment, but I can’t say where a practical cut-off is today.
Ivan Pepelnjak made a decent point in a conversation with me last night - you can’t fit more than 1000 hosts in a vSphere instance today, you can’t get more than 300 in a single vNetwork Distributed Switch, and you can’t have more than 32 in an active DRS group. With 1000 hosts being the biggest number here, and with most racks holding about 40 1RU servers [and being both space- and thermal-bound from going above that in many cases], it’s hard to imagine a pragmatic requirement for going above 25 cabinets, or about 50 total switches.
Fifty switches with 4xQSFP uplinks each would be 200 40GbE interfaces or 800 10GbE interfaces that would need to be terminated in the aggregation tier. This would mean that, depending on your product’s density support, you would need 2 or 3 aggregation switches to support this topology. Interestingly enough, if you can support this topology with just two switches, the need for TRILL is rather negated, but I agree that if your actual business/application demand is for more than 1000 hosts or less oversubscription then you may need a wider aggregation tier. [As you can see this is sort of uncharted territory and I’d love a few well-thought-out opinions on what the right way to measure this area may be].
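That sizing exercise is easy to parameterize as well. A rough sketch - the 96-port aggregation density is a hypothetical placeholder, not any particular product, and each QSFP is counted as either 1x40GbE or 4x10GbE as in the text:

```python
import math

# Rough aggregation-tier sizing for a flat L2 fabric.
EDGE_SWITCHES = 50
QSFP_UPLINKS_PER_SWITCH = 4        # each QSFP = 1x40GbE or 4x10GbE

uplinks_40g = EDGE_SWITCHES * QSFP_UPLINKS_PER_SWITCH
uplinks_10g = uplinks_40g * 4

AGG_40G_PORTS = 96                 # hypothetical aggregation-switch density

agg_switches = math.ceil(uplinks_40g / AGG_40G_PORTS)
print(f"{uplinks_40g} x 40GbE (= {uplinks_10g} x 10GbE) uplinks to terminate")
print(f"Aggregation switches needed: {agg_switches}")
```

With the placeholder density this lands on three aggregation boxes; a denser (or wider) box gets you to two, which is where the TRILL question above starts to bite.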
In lieu of definitive answers I would look to the following:
- How many switches can exist within the L2 topology? [Be sure to know the topology construction protocol limits/boundaries, not just the number of bits reserved for RBRIDGEs in the TRILL spec]
- How many ports do I have on each aggregation box and how many multi-paths are supported in my TRILL/Fabric implementation? [This will define upper boundary condition on number of aggregate ports available in the subnet]
- Then be sure to check your default gateways - make sure you can get traffic OUT of and IN to this L2 network. Ideally the default gateways would exist on each of the aggregation devices and would have the ARP scale to handle the number of IP/MAC pairs generated within this infrastructure.
OK, that’s probably enough for today, and I am sure I missed several things worth discussing as scaling parameters, which could be done in a follow-up. If anyone has any ideas, feel free to reach out to me in the comments or via LinkedIn.