
Why FCoE is Dead, But Not Buried Yet

A simple treatise on the value of FCoE and what inhibits it

By Douglas Gourlay on Tue, 09/13/11 - 4:53pm.

Have you ever watched one of those spy films where someone gets fatally poisoned, but it hasn't hit them yet, and they go on acting like it's any other day rather than the last day of their life, and you just want to tell them to enjoy it while they can?  That's how I feel about Fibre Channel over Ethernet.  My assertion is simple: FCoE is a dead technology, and a few vendors have not figured it out yet.  Here's why.

Storage Value
The value of storage is a function of price per gigabyte stored, the I/O speed in and out of the medium, and the 'reach' of the storage - is it mountable by only one host, one subnet, one data center, or globally?

  1. The cheaper I can store all my bits the better - cue better areal densities, tape pundits, etc.
  2. The faster I can access those bits the better - so much for the tape pundits, insert SSD and fast caching systems, Fusion I/O, etc.
  3. The more devices that can mount my storage the better - so much for FCoE and for DAS without a global file system; kudos to NAS, iSCSI, ZFS, Nexenta, etc.

FCoE doesn't hit the price target to make it usable; it requires too much infrastructure churn.  FCoE doesn't have the reach - it's bound to a flat layer-2 subnet.  Smart network operators know that large, flat layer-2 networks not only don't work well, they are a pretty heinous thing to troubleshoot and operate.  FCoE doesn't support the distances I need for a BC/DR strategy.  And FCoE doesn't work with overlay technologies like OTV and VXLAN (discussed further below).

Summary: I can't build an FCoE network economically.  I don't get increased storage reach from it.  It performs slower than AoE, DAS, and many global file systems, and it offers little incremental value in return because it doesn't perform well enough for long-haul storage synchronization.

Why Large Flat Layer-2 Networks Are Not the Right Call, for FCoE or otherwise...
All the talk over the last 2-3 years about TRILL intrigues me simply because, of the four main networking vendors playing in this area, their multipath solutions stack up as follows:

  • Cisco FabricPath: requires a full network upgrade, is not TRILL standards compliant, and offers no interoperability.  Currently a proprietary architecture designed to lock you in to the F1 boards on the Nexus 7000.  Oddly enough, it will be fun to see how well these F1 boards work with the forthcoming M2 and F2 boards.
  • Brocade: Apparently Brocade thought it would be better to use their FSPF protocol than IS-IS to construct the topologies.  Brilliant... if you don't want your switch to plug into anyone else's switches.  Probably doable if you have Cisco-style dominant market share, but if not, it's not the smartest move.
  • Juniper: QFabric - rumored to be shipping this week, but I still haven't seen a data sheet.  In fact, I think I learned more about it from Cisco's tawdry pot-calling-kettle-black video series than from Juniper's published docs.  QFabric is the ultimate in lock-in, forgoing all standards and building out an extremely proprietary system - even on the wire, where few other vendors have dared to lock you in.  I don't see why single-point-of-administration mandates proprietary links and lock-in, but that seems to be Juniper's modus operandi recently.
  • Arista: I work here, all disclaimers apply; this is my personal opinion though.  Arista hasn't built TRILL support yet and has adopted a wait-and-see attitude.  In the meantime it is focusing on MLAG, based on IEEE LACP, to signal to hosts and switches that two devices 'look like one' and to eliminate Spanning Tree loops.

So why did companies chase L2 multipath solutions?  Hint: it wasn't for FCoE.  It was for virtualization!  The main driver for large, flat layer-2 networks is that virtualization admins want to move VMs from one server to another, even across our vaunted routed boundaries.  They don't want IP addressing decisions we made 10 years ago to limit their ability to put a workload on a server when they want and where they want.  And when they do move a VM, they prefer that the TCP sockets stay open and the IP address not change - so they want a stateful live migration across existing L3 boundaries.  The only realistic way to do this in the data center to date has been to build a large, flat layer-2 network.

Cisco thought they would overcome this by introducing OTV, Overlay Transport Virtualization.  OTV is nice, but again, a proprietary, single-vendor solution that requires the entire network to be forklift-upgraded - great if you are Cisco or a Cisco AM.

VMware introduced VXLAN, Virtual Extensible LAN, into the IETF as an experimental draft.  It takes L2 frames, encapsulates them in UDP in the vSwitch, and tunnels them from one ESX host to another.  It emulates broadcast and unknown-unicast forwarding by carrying these frames over IP multicast, across both L2 and L3 boundaries.  Short version: with no network upgrade (assuming your network supports IP routing and multicast routing), you can move a VM from one host to another over your existing network, and gateway into and out of the VNI (a unique segment and mroute) via VMware's vShield Edge.
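To make the encapsulation concrete, here is a minimal sketch of the 8-byte VXLAN header the draft defines (flags byte with the VNI-valid bit, three reserved bytes, a 24-bit VNI, one more reserved byte).  The function names are mine, and a real implementation would also build the outer Ethernet/IP/UDP headers around this:

```python
import struct

def vxlan_encap(vni: int, inner_frame: bytes) -> bytes:
    """Prepend the 8-byte VXLAN header to an inner Ethernet frame.

    Layout: flags (0x08 = VNI-valid bit), 3 reserved bytes,
    24-bit VNI, 1 reserved byte.
    """
    if not 0 <= vni < 2**24:
        raise ValueError("VNI is a 24-bit field")
    header = struct.pack("!B3s3sB", 0x08, b"\x00" * 3, vni.to_bytes(3, "big"), 0)
    return header + inner_frame

def vxlan_decap(packet: bytes) -> tuple[int, bytes]:
    """Strip the VXLAN header, returning (vni, inner_frame)."""
    if not packet[0] & 0x08:
        raise ValueError("VNI-valid flag not set")
    vni = int.from_bytes(packet[4:7], "big")
    return vni, packet[8:]
```

The 24-bit VNI is the point: it gives roughly 16 million segments versus the 4,094 usable VLAN IDs a 12-bit 802.1Q tag allows.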

One important thing to point out: neither Cisco's OTV nor VMware's VXLAN (co-authored by Cisco's Dinesh Dutt, Arista's Ken Duda, and others) supports the transport of FCoE.

Summary: Networking vendors have built technologies that enable large, flat layer-2 networks for L3 stateful VM live migration while churning their installed base; at the same time, virtualization vendors have built technologies that do the same thing without requiring network upgrades.  FCoE will not work with any L2-over-L3 mechanism described to date - it's a dead end.

Why Smart Engineers Like L3 Networks Better
I make the claim that no smart network operator would prefer a large, flat layer-2 network over an L3 routed ECMP design.  I suppose many people with the surname 'Anon' and an IP address beginning with 171.68.x.x will beg to differ, so let me give my reasons:

Many large network operators I have talked with have built one out for a project.  But when you look at network architectures like Facebook's, or those of other similarly sized providers, no one is running large, flat layer-2 networks.  In fact, I can't think of anyone who has built one with over 10,000 hosts; if anyone did, it was an experiment, not a production service.

I know some large customers who have tried to build these, but without classic tools like traceroute and ping, they found that L3 was easier to troubleshoot and operate in a real-world environment.  Again, I can't speak to academic projects where the goal was to see if you could build it, not to operate a service on it.

Then there is the lack of broadcast containment, and the fact that some companies talk a good game about building the L2 network but never figured out the L3 default gateway in and out of it, or how to handle those fun times when your MAC table has aged out but the ARP table hasn't, and now you are flooding like a dam broke.

Toolset: simply put, while we spent 15-20 years building lots of tools in Unix and Linux (people even built companies out of a better TFTP server), we lack basic tools to support large L2 networks.  Building them will take years and is never the focus of most major networking vendors; once the churn of your network is done, they are happy.  In the L3 world we have simple commands like 'show ip route' that tell me the path traffic will take - no need to look at CAM tables and guess at hashing.

Every network admin, from CCNA through CCIE, knows how OSPF works; anyone worth the title Network Engineer definitely does.  How many people can you hire who know TRILL intricately?  Remember, no one is shipping a standards-compliant version today...

For the career-focused network engineer still reading this and thinking, "By learning TRILL and deploying it I will be 'the man'," please rewind and re-read the section 'Why Large, Flat Layer-2 Networks Are Not the Right Call' and ask whether that's the value you want to bring to your business.
   

FCOE LMFAO
So with FCoE we have a technology that seems to be taking off only in the proprietary compute systems offered by a single emerging server vendor - Cisco.  It requires a complete network upgrade, special-purpose NICs, and a new network architecture that no smart network operator would prefer, and it doesn't work with any technology enabling the stateful VM live migrations that are the main driver for flatter networks in the first place.

How well will FCoE run when things come full circle and people are back to deploying their default gateways on the Top-of-Rack switches?  Oh wait, it won't...
