As I sit here in the lobby of the Hyatt after attending CloudConnect this week, and even speaking a bit, I was listening to The Shins song 'New Slang' from the Garden State soundtrack [a good one btw, especially if you want to chill out, type for a bit, and not be forced into eavesdropping on three conversations around you that are all tangentially interesting], and as I listened I kept changing the lyrics in my head to 'New Scale.' So while the title of this post brings back Huey Lewis and the News memories from the '80s, I must confess it was Garden State that was the actual genesis.

So why do I need a new scale? It used to be that when we discussed network switches and routers we would use a pretty consistent set of metrics to determine which product or technology was superior or best fit a given mission. These have consisted of metrics such as:

How many interfaces does the device have, and at what speed?
Does it forward each of them at wire speed?
Can it forward all of them at wire speed at the same time?
If it cannot forward all of them at wire speed, how and where does it oversubscribe?
If it is oversubscribing, how does it determine what is to be dropped? Does it report the drop?
Does it allow QoS to influence and help select what is to be dropped and what is not to be dropped?
Can I tell if what I am dropping is being dropped because my network element does not have the capacity to forward it, or because the egress interface that was selected by the forwarding engine/tables is congested, the buffers are exhausted, and it can't handle any more data?
How many routing peers and adjacencies can my device handle?
How fast does it converge?
How many ACLs, and lines per ACL, can it handle before it runs out of capacity to process ACLs?
How many IP routes can it handle?
How about for IPv6?

You get the drift...

Now I’ve heard a constant complaint from many network savvy people when talking about a technology, product, etc: ‘X won’t scale.’ [It’s kind of a death-knell for many otherwise inspiring arguments about networking technology. Most discussions end at this point.]

The real question regarding ‘will it scale’ is always best answered in context rather than in abstract. So ‘it doesn’t scale’ should actually be restated in many cases: ‘It may meet my requirements today; however, given my anticipated growth rates and demands it is not likely to meet them in the future as my IT demand is outstripping this product’s capability to meet that demand.’

In this rather daunting age of virtualization, cloud, sometimes mobile workloads, fabrics, overlay segmentation models, and machine-readable programmatic APIs, the discussion around scale needs to encompass a few new metrics people have historically overlooked. See, most networks in the data center were built with the assertion that the access tier will have one host per port. This one host in most cases has one MAC address and one IP address.

However, in the virtualized data center or cloud each port is likely to carry multiple MAC and IP addresses. As an example, taking the now well-known Amazon ECU (EC2 Compute Unit), you can cram about 25-40 of them onto a dual-socket Intel Westmere and, shortly thereafter, you should be able to do north of 50 with the Intel Romley chipset. If your application is more VDI then you could easily see 100-200 VDI sessions on a powerful server.

This changes the scale at which we measure a network’s capacity and its ability to grow with our business requirements.
Here’s why:

New Scale Metric #1: The ARP Table

Most vendors planned on reasonably sized network broadcast domains. A switch needs to know every MAC address present within a broadcast domain, or it floods frames for the unknown destination out all interfaces participating in that broadcast domain. Historically, most switches at the access tier had MAC address tables of ~8,000 to 16,000 entries. This was fine when it was either the default gateway for its subnet, or the aggregation tier switch above it was the default gateway, or only a few switches were all within the same subnet - figure 5-10 of them or so, maybe 20 if you are getting a little crazy with your big subnets.

Recently, some vendors came out with larger MAC tables - for instance a Broadcom Trident+ chipset supports 128k MAC addresses. This used to be the size you would see in the aggregation tier, where it may have to support 40-50 switches in multiple subnets if it is acting as the default gateway for each of them. ARP tables or IP host tables have become one of the new major bottlenecks though - while the Broadcom Trident+ chip supports 128k MAC entries, it only has 16,000 ARP entries. This may sound like a lot to you, but let’s evaluate the following scenario:

A vendor recently described its Large, Flat, L2 topology as being capable of over 6,000 ports. 16,000 ARP entries divided by 6,000 ports gives me about 2.7 MAC/IP pairs per port. If we wanted to take 6,000 ports and put them in a private cloud running an average of 20 virtual machines per server with a normal 2 VNICs per VM, it could take 6,000 servers * 20 VMs * 2 VNICs = 240,000 IP/MAC pairs.
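
The ARP arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the inputs (16,000 ARP entries, 6,000 ports, 20 VMs per server, 2 VNICs per VM) are simply the figures from the scenario above, not measurements from any particular product.

```python
def pairs_per_port(arp_entries: int, ports: int) -> float:
    """Average number of IP/MAC pairs each port can use before the ARP table fills."""
    return arp_entries / ports


def arp_demand(servers: int, vms_per_server: int, vnics_per_vm: int) -> int:
    """Total IP/MAC pairs generated, assuming one IP/MAC pair per VNIC."""
    return servers * vms_per_server * vnics_per_vm


# Figures taken from the scenario in the text (assumed, not vendor data).
ARP_ENTRIES = 16_000
PORTS = 6_000

budget = pairs_per_port(ARP_ENTRIES, PORTS)
demand = arp_demand(servers=PORTS, vms_per_server=20, vnics_per_vm=2)
oversubscription = demand / ARP_ENTRIES

print(f"ARP budget per port: {budget:.2f} pairs")
print(f"IP/MAC pairs needed: {demand:,}")
print(f"Demand is {oversubscription:.0f}x the ARP table size")
```

Plugging in your own port count, VM density, and the ARP table size from a vendor data sheet gives a quick first read on whether a flat L2 design is headed for trouble.
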
Lesson: If you are planning on deploying a Large, Flat, L2 Network or a Fabric, check your ARP table sizing and planned consumption. I’m not going to get into which vendor is the best or worst here - just check it and make sure you are not setting yourself up to flood everywhere.

New Scale Metric #2: Buffering

For about the last 16 years we haven’t had to talk about buffer capacity much because Cisco got it pretty well right on the Cat 5000 switches. The Catalyst 5000 used per-port buffer memory at 192KB per 10Mb Ethernet port. It was divided based on a 7:1 ratio in the SAINT ASIC to give 24KB for input buffer and 168KB for transmit buffer. The buffers were somewhat flexible in that the buffer memory used a memory unit size that ensured not much memory was wasted, so you could fill the buffer with 64B frames or jumbo frames and they would still get adequately buffered.

But time marches on and architectures change. Input port-buffers gave way in popularity to input-based virtual output queue (VOQ) architectures with fabric arbitration. This pushed all congestion to the ingress buffer, with arbitration so that traffic would not be moved across the switch fabric until the egress port was capable of receiving it and serializing it onto the wire. Moving all points of congestion to the ingress buffer makes it markedly simpler to count congestion-based drops, and it also gives you the ability to detect ahead of time when you are congesting, before you drop - VOQ is what essentially let us build a lossless Ethernet capability in cross-bar based modular systems.

By taking all the congestion on ingress you can PAUSE the incoming traffic to allow the egress buffer the time to drain before you forward to it and before you take on so much data you have to drop.

One problem that came out is that a lot of us banked on QCN working and being broadly accepted in the market.
See, what we thought was that QCN would let us spread the congestion back to N interfaces that were feeding a congested node, and we could take advantage of the aggregate of the distributed buffers - thus we wouldn’t have to build large buffers into our ASICs anymore. For a variety of reasons QCN didn’t get widespread adoption and some folks are quite scared of it. So let’s look at the buffer formula for the Catalyst 5000 10Mb port and scale it forward...

10Mb = 192KB
100Mb = 1.92MB
1000Mb = 19.2MB
10Gb = 192MB

I’ll admit that you can reduce this somewhat based on a VOQ architecture, so you may not need 192MB of packet buffer per port at 10Gb in a VOQ system because you get the aggregate of ports on ingress. But I would at least double the 24KB the Catalyst 5000 used in this scenario, so for a VOQ-based system you should expect to see the following on per-port ingress buffering:

10Mb = 48KB
100Mb = 480KB
1000Mb = 4.8MB
10Gb = 48MB

The real test of a buffer, though, is how well it delivers ‘goodput’ under congestion with TCP traffic, as opposed to just measuring raw throughput.
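
As a quick check on the two per-port buffer lists above, here is a minimal sketch of the linear scaling they assume. The function name and the decimal units (1MB = 1,000KB) are assumptions for illustration; the two baselines are the 192KB Catalyst 5000 total and the doubled 48KB ingress figure quoted above for a 10Mb port.

```python
BASE_SPEED_MBPS = 10      # the Catalyst 5000 10Mb Ethernet port used as the baseline
CAT5000_TOTAL_KB = 192    # total per-port buffer quoted in the text
VOQ_INGRESS_KB = 48       # doubled 24KB input buffer, as suggested in the text


def scaled_buffer_kb(base_kb: int, port_speed_mbps: int) -> int:
    """Scale a 10Mb-port buffer size linearly with port speed (assumed model)."""
    return base_kb * port_speed_mbps // BASE_SPEED_MBPS


for speed in (10, 100, 1_000, 10_000):
    total = scaled_buffer_kb(CAT5000_TOTAL_KB, speed)
    ingress = scaled_buffer_kb(VOQ_INGRESS_KB, speed)
    print(f"{speed:>6} Mb/s: total {total:>7,} KB, VOQ ingress {ingress:>6,} KB")
```

Linear scaling is only a rule of thumb here; real ASICs share buffer pools across ports, so treat the output as a sizing target to compare against data sheets rather than a hard requirement.
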
To estimate goodput, I like to whip out my handy bandwidth delay product calculator and assert that most 3-5 switch hop networks consisting of modular switches and fixed switches at the edge have an average end-to-end latency of 10usec, or about 20usec round trip:

10,000,000,000 bits/sec * 0.00002s = 200,000 bits * 1 byte/8 bits = 25,000 bytes

This means in a 10Gb network with 3-5 switch hops you need about 25KB of packet buffer per TCP flow you intend to support, if you assert that you need to be able to buffer an entire TCP window so that congestion doesn’t create a tail-drop and thus cause the entire window to be re-transmitted, which would reduce the overall TCP goodput of the infrastructure.

So at 48MB, as we highlighted above, you can support 1,920 concurrent TCP sessions with enough buffer capacity to handle the entire TCP window. This is reasonably in line with supporting a larger number of VMs per port as well.

Lesson: Check the buffer sizes of what you are looking at. For single-switch hop, low-latency networks, or for multicast/UDP where flow control isn’t paramount, a small buffer may be acceptable. But if your applications are TCP-based, you’ll need to get a feel for how many flows you will need to support and whether your infrastructure has the buffer capacity to handle that number.

So I thought about adding another one around scalability of L2 topology construction using TRILL or other such technologies. I’m kind of on the fence right now because of the variety of implementation paths different vendors are taking. I certainly know I’d hate to manage a 100-switch flat network, and I am not sure of the upper boundaries of IS-IS scale within a TRILL environment, but I can’t say where a practical cut-off is today.
Ivan Pepelnjak made a decent point in our conversation last night - you can’t fit more than 1000 hosts in a vSphere instance today, you can’t get more than 300 in a single vNetwork Distributed Switch, and you can’t have more than 32 in an active DRS group. With 1000 hosts being the biggest number here, and with most racks holding about 40 1RU servers [and being both space and thermal bound from going above that in many cases], it’s hard to imagine a pragmatic requirement for going above 25 cabinets or about 50 total switches. Fifty switches with 4xQSFP uplinks each would be 200 40GbE interfaces or 800 10GbE interfaces that would need to be terminated in the aggregation tier. This would mean that, depending on your product’s density support, you would need 2 or 3 aggregation switches to support this topology. Interestingly enough, if you can support this topology with just two switches, the need for TRILL is rather negated, but I agree that if your actual business/application demand is for more than 1000 hosts or less oversubscription then you may need a wider aggregation tier. [As you can see this is sort of uncharted territory and I’d love a few well-thought-out opinions on what the right way to measure this area may be.]

In lieu of definitive answers I would look to the following:

How many switches can exist within the L2 topology?
[Be sure to know the topology construction protocol limits/boundaries, not just the number of bits reserved for RBRIDGEs in the TRILL spec.]
How many ports do I have on each aggregation box and how many multi-paths are supported in my TRILL/Fabric implementation? [This will define the upper boundary condition on the number of aggregate ports available in the subnet.]

Then be sure to check your default gateways - make sure you can get traffic OUT and IN to this L2 network. Ideally the default gateways would exist on each of the aggregation devices and they would have the ARP scale to handle the number of IP/MAC pairs generated within this infrastructure.

Ok, that’s probably enough for today, and I am sure I missed several things that are worth discussing as scaling parameters, which could be done in a follow-up. If anyone has any ideas, feel free to reach out to me in the comments or via LinkedIn.