Why Data Centers Must Fundamentally Change -- Part 2.2

Workload Mobility or 'Dude, Where's your Server?'

The second change that is creating disruption in the data center is a result of how we solved the space, cooling, and power issues highlighted in my previous post. The industry created denser servers, denser storage, and denser networks - I actually used to be convinced that some product managers were the densest people I ever met, because every time they had a chance to create something new the first thing they would do was focus on density. (I know, bad joke.)

CPUs got faster, and things were great up until we hit the wall at around 4GHz. While the Prescott core was projected to reach 10GHz, timing issues came into play north of 4GHz, and CPU vendors started going wider - more cores. Nowadays, rather than faster clock rates, the doubling of transistor density under Moore's Law is producing an increase in core count every couple of years. Enter my inverse of Moore's Law: "Legacy application processing efficiency will be halved every two years." It's a generalization and not completely accurate, because there is still a lot of room for improvement in memory and I/O access and such, but notwithstanding, a single-threaded application can only use one core at a time. Using only one core means you are fabulously efficient on a single-core machine, capped at 50% on a dual-core, 25% on a 4-way, 12.5% on an 8-way, and so on (a quick back-of-the-napkin sketch follows the list below). Fortunately virtualization came to the rescue - why rewrite your applications to take advantage of multi-core when you can run multiple instances of an application and operating system combination, with the hypervisor providing a layer of abstraction between the hardware and the operating system? This solved quite a few issues:

1) Applications could use multi-core CPUs without a rewrite for parallelism

2) Servers could be consolidated, as low-use static workloads could be virtualized onto a smaller number of more efficient and powerful machines

3) Thus, power could be reduced and the lifecycle of the data center could be extended
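
To put rough numbers on that "inverse of Moore's Law" comment above, here is the back-of-the-napkin sketch promised earlier - plain Python, nothing vendor-specific, just the arithmetic of a single-threaded application on an ever-wider CPU:

    # Best-case CPU utilization of one single-threaded app on an N-core server,
    # ignoring the memory and I/O effects called out above.
    def max_utilization(cores: int) -> float:
        return 1.0 / cores

    for cores in (1, 2, 4, 8, 16):
        print(f"{cores}-core server: {max_utilization(cores):.1%} max utilization")

    # 1-core: 100.0%, 2-core: 50.0%, 4-core: 25.0%, 8-core: 12.5%, 16-core: 6.2%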

I like the saying that "every problem in computer science can be solved with another level of abstraction/indirection." With the advent of the hypervisor as a layer of abstraction between the CPU and the OS, what prevents the VM from moving from one server to another? Thus live migration, or vMotion, was born: moving a virtual machine from one physical server to another. Now in an ideal world you would be able to do the following:

1) move a VM from any server to any other, even across routed boundaries and data center edges

2) not drop an established connection during the move

3) balance workload in real-time against compute and I/O/network capacity (a rough sketch of this follows the list)

4) integrate with power management to reduce cost during non-peak hours

5) follow-the-sun, follow-the-kilowatt/hr, follow-the-demand

6) move my workload to a provider, and move it back when I want <-- a cloud challenge
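
To make items 3) through 5) above concrete, here is a minimal, hypothetical sketch - every name, weight, and number below is invented for illustration - of the kind of decision a mobility-aware scheduler would have to make: pick a target host for a VM based on spare compute and network headroom and the current power price at each site.

    from dataclasses import dataclass

    @dataclass
    class Host:
        name: str
        site: str
        cpu_free: float       # fraction of CPU headroom, 0.0 - 1.0
        net_free_gbps: float  # spare network/I-O bandwidth
        kwh_price: float      # current $/kWh at the host's site

    def score(host: Host, vm_cpu: float, vm_gbps: float) -> float:
        # Disqualify hosts without headroom; otherwise prefer cheap power
        # and remaining capacity. The weights are arbitrary, for illustration only.
        if host.cpu_free < vm_cpu or host.net_free_gbps < vm_gbps:
            return float("-inf")
        return (host.cpu_free - vm_cpu) + (host.net_free_gbps - vm_gbps) / 10 - host.kwh_price

    hosts = [
        Host("esx-a1", "dc-east", cpu_free=0.30, net_free_gbps=4.0, kwh_price=0.14),
        Host("esx-b7", "dc-west", cpu_free=0.55, net_free_gbps=6.0, kwh_price=0.09),
    ]
    target = max(hosts, key=lambda h: score(h, vm_cpu=0.20, vm_gbps=1.0))
    print("move VM to", target.name)   # picks esx-b7: more headroom, cheaper power

The scoring formula itself is not the point; the point is that the inputs span compute, network, and facilities data that today live in separate systems.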

I am sure there are more things we would all like, but it's a quick list to illustrate a point - while we have been making progress in integrating the network with virtualization, the potential for disruption is just starting. A few things that we must continue to develop and work on are as follows:

1) Real-time integration and visibility. The network needs to react, automatically and in real time, to changes in the virtual environment. This can include, but is in no way limited to, VLAN provisioning, access control, QoS policy that mirrors the VM as it moves, location and identification of the virtual machine, etc. (a toy sketch of this follows the storage options below).

2) Ability to access storage and data. The network should provide a transport capable of supporting centralized storage access for the virtual machine, regardless of which host it is running on. This will greatly reduce the amount of data that needs to be moved as a VM migrates from one host to another. Another sub-point that is quite relevant here is that centralized data stores need to either:

        a) move with the virtual machine

        b) sync-on-demand, i.e. stream the data to a local site near where the VM is now located

        c) be pre-staged to all sites where the VM is likely to move

        d) do nothing, but abstract away the fact that the VM is now remote from its data with WAN-optimization-type tech
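
As a rough illustration of what 1) above implies - this is a hypothetical sketch, not any real switch or hypervisor API - the access switch would look up the VM's profile the moment its MAC shows up on a new port and re-apply the same VLAN, access control, and QoS it had before the move:

    class Switch:
        """Stand-in for a real switch API; it just prints the provisioning calls."""
        def set_vlan(self, port, vlan):
            print(f"{port}: access vlan {vlan}")
        def apply_acl(self, port, acl):
            print(f"{port}: acl {acl}")
        def set_qos(self, port, qos):
            print(f"{port}: qos class {qos}")
        def shutdown(self, port):
            print(f"{port}: shutdown (unknown VM)")

    # Hypothetical port-profile store keyed on the VM's vNIC MAC address.
    PROFILES = {
        "00:50:56:aa:bb:cc": {"vlan": 110, "acl": "web-dmz", "qos": "gold"},
    }

    def on_vm_detected(switch, port, vm_mac):
        """React when a VM's MAC is learned on a new port after a migration."""
        profile = PROFILES.get(vm_mac)
        if profile is None:
            switch.shutdown(port)                  # unknown VM: fail closed
            return
        switch.set_vlan(port, profile["vlan"])     # VLAN follows the VM
        switch.apply_acl(port, profile["acl"])     # access control follows the VM
        switch.set_qos(port, profile["qos"])       # QoS policy follows the VM

    on_vm_detected(Switch(), "Eth1/14", "00:50:56:aa:bb:cc")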

3) VM Movement Scope. We need to be able to move a VM from:

    a) one server to another that share a broadcast domain

    b) one server to another in different broadcast domains

    c) one site to another site, across routed boundaries and geographical distribution

    d) from one site in one AS to another site in a different AS

To do this, the best body of work I have seen to date is the IETF draft for LISP - the Locator/ID Separation Protocol (and I want to add that I think it is just AWESOME that the authors still spell 'cisco Systems' in lower case and haven't forgotten that cisco is, and should be, a 'Systems' company, not just a consumer brand - excellent!). What I also like about LISP is that it is designed to scale well, supports 3a-3d for VM mobility, preserves the IP address of the host through the move so established TCP connections and cached DNS entries don't break, and, for the network nerd in me, it does not create a triangle-routing problem (common in tunneled environments). A toy sketch of the mapping idea behind it follows the list of alternatives below. Other mechanisms that other vendors, and sometimes even other groups within the same company, have been exploring include VPLS, EoMPLS, and L2 extension across WDM:

    e) VPLS, when coupled with BGP for automated tunnel creation and setup, is nice tech; however, I don't like that MPLS is generally carrier-specific and not routed from one carrier to the next, requiring me to stay with one carrier for all sites or to have secondary and tertiary carriers at every site.

    f) EoMPLS is just too static and suffers from the same carrier lock-in as above

    g) L2 extension over WDM is just too costly in most cases, and many companies don't have dedicated fiber plants running between their data centers. Granted, in some scenarios the two data centers may be co-located within 100km of each other for synchronous storage replication in FC environments - it would be a reasonably effective solution there, except good luck having the VM find its default gateway router, SLB, and/or firewall when it booted in Data Center A and now resides in Data Center B. I would also plan on a lot of bandwidth linking the two data centers together if you really want to actively move workloads between them.
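
Grossly simplified, what makes LISP attractive for 3a-3d is the split between the endpoint identifier (EID - the address the VM and its TCP peers keep using) and the routing locator (RLOC - the site router it currently sits behind), with a mapping system tying the two together. Here is a toy sketch of that mapping idea only - not the actual protocol machinery, message formats, or map-server behavior; the addresses are invented:

    # Toy EID-to-RLOC map: the VM keeps its EID (10.1.1.25) across a move;
    # only the locator the mapping system returns changes.
    eid_to_rloc = {"10.1.1.25": "192.0.2.1"}      # VM lives behind Data Center A's xTR

    def encapsulate(dst_eid: str) -> str:
        """Return the locator an ingress tunnel router would tunnel toward."""
        return eid_to_rloc[dst_eid]

    print(encapsulate("10.1.1.25"))               # 192.0.2.1 (Data Center A)

    # vMotion the VM to Data Center B: update the mapping, not the VM's address,
    # so established TCP connections and cached DNS entries keep working.
    eid_to_rloc["10.1.1.25"] = "198.51.100.1"     # now behind Data Center B's xTR
    print(encapsulate("10.1.1.25"))               # 198.51.100.1 (Data Center B)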

4) The dirty secret that I think a lot of people have not figured out, because if most of my postulations above are held to be reasonably accurate then there is a problem: FCoE is block-based and constrained to a single hop today. Even with future multi-hop and FIP support it still cannot cross a routed boundary. So how do I combine an FCoE design with the desire to move virtual machines from one section of my data center to another, or from one data center to another? If I use IP storage - iSCSI, NFS, or other NAS protocols - it becomes very simple and not even worth much thought. But it strikes me that moving to FCoE would be counter to the desire to move VMs around in a fluid data center design. I am not saying it cannot be done, just that it strikes me as much more complicated and most likely requiring a rather sordid assortment of tech thrown at it to achieve the desires outlined above. (Note: it CAN be done, but requires you to go from ESX1 --> FCoE --> FC --> FCIP --> FC --> FCoE --> ESX2 - not ideal.) So if I were designing a VM farm and expected to be moving VMs around, today or in the future, I would look to a more scalable storage implementation that works across whatever technology you think is best for supporting the addressing extensions necessary to move your VMs' IP addresses within the constraints imposed by maintaining TCP connections and cached DNS entries.
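
To make that constraint explicit in a few lines - purely illustrative, the reachability rule below is a deliberate simplification and not a design tool:

    # Crude illustration of the point above: block FCoE stops at the routed
    # boundary, while IP storage (iSCSI/NFS) simply routes to the target.
    CROSSES_L3 = {"iscsi": True, "nfs": True, "fcoe": False}

    def datastore_reachable(protocol: str, crosses_routed_boundary: bool) -> bool:
        if not crosses_routed_boundary:
            return True                           # same L2 domain: everything works
        return CROSSES_L3[protocol.lower()]       # across L3: only IP storage does

    for proto in ("iSCSI", "NFS", "FCoE"):
        ok = datastore_reachable(proto, crosses_routed_boundary=True)
        print(f"VM moved across a routed boundary, {proto} datastore reachable: {ok}")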

5) Management and Reporting. Let's assume the addressing issues get resolved and continue to improve in scope, scale, and hopefully simplicity over the next few years, and, maybe even more importantly, that they continue the networking tradition of multi-vendor open interoperability rather than one company trying to use the capability as a way of growing the 'Empire' or creating customer lock-in. The big challenge awaiting after that is an IP addressing and management problem. Ideally, in a few seconds we need to be able to identify which port hosts which virtual machines, what the dependencies are in the network, and, if something goes awry, what the impact is. Try doing this today: get a CAM/MAC table from one switch, correlate it with the ARP table on another, then do a reverse-DNS lookup to get the host name. Not necessarily fast, real-time, or intuitive. This will be the area of competitive differentiation as our VM environments become increasingly larger and more fluid.
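
To underline how manual that correlation is today, here is roughly what the exercise looks like if you script it yourself. The table contents below are invented, and in practice you would first have to scrape them from each device over SNMP or the CLI, which is exactly the problem:

    import socket

    # Invented data standing in for what you would pull from two different devices.
    cam_table = {"00:50:56:aa:bb:cc": "Eth1/14"}      # switch A: MAC -> port
    arp_table = {"10.1.1.25": "00:50:56:aa:bb:cc"}    # router/switch B: IP -> MAC

    def locate_vm(ip: str):
        """Correlate the ARP and CAM tables, then reverse-DNS the IP for a name."""
        mac = arp_table[ip]
        port = cam_table[mac]
        try:
            hostname = socket.gethostbyaddr(ip)[0]    # reverse-DNS lookup
        except socket.herror:
            hostname = "unknown"
        return port, hostname

    port, name = locate_vm("10.1.1.25")
    print(f"10.1.1.25 ({name}) is behind port {port}")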

dg
