When the assessment was finished, we could better judge which machines could be migrated and which could not. Luckily, the customer had a strict "one application per machine" rule, which was enforced and which removed possible application conflicts and migration concerns. With the details of their current infrastructure in hand, it was possible to determine that the necessary hardware was already in use and could be reused with minor upgrades. Each machine would require a dual-port NIC, Fibre Channel cards, and an increase in memory and local disk space. To run the number of VMs required and to enable the use of VMotion, the machines were at the very least paired up at each site, with a further recommendation to purchase another machine per site (because there were no more hosts to reuse) at the earliest convenience to alleviate possible machine failures in the future. To perform the first migrations, seed units would be borrowed from the manufacturer and LUNs carved from the customer's own SANs, allowing physical-to-virtual migration onto the seed units. Each physical host would then be converted to an ESX Server and the just-migrated VM VMotioned off the borrowed seed host. The seed host would in turn be sent to the other sites as their seed unit when the time came to migrate the hosts at the next datacenter. This initial plan would be revised once the capacity planner was run and analyzed.
Example 2: Office in a Box
One of the author's earliest questions came from a company that wanted to use ESX to condense hundreds of remote locations into one easy-to-use and easy-to-administer package: a single host running ESX with the remote office servers running as VMs. Because the remote offices were running on outdated hardware, the customer felt that ESX would also provide better remote management capability, and the hardware at these remote offices all over the world needed to be upgraded anyway. The goal was to ship a box to the remote location, have it plugged in and powered up, and then manage the server remotely. If there were a machine failure of some sort, they would ship out a new box. The customer's concerns were the initial configuration of the box and how to perform backups appropriately.
One of the very first questions we ask customers is whether they will be using Microsoft Clusters now or at any point in the future of their ESX deployment. When we first started the discussions, they claimed this would never be the case. Just in case, we made sure that they set up their six-drive dual-processor machines with a full complement of memory and disks, an extra dual-port Ethernet card, an external tape device attached via an Adaptec card (see Figure 1.11), and enough file system space for a possible shared system. We discussed a SAN and the use of VMotion, but the customer thought this would be overkill for their remote offices. For their datacenter this was a necessity, but not for a remote office.
Figure 1.11 Office in a box server with tape library
However, the best-laid plan was not implemented correctly, and a year after the initial confirmation of the customer's design, they needed to implement Microsoft Clustering as a cluster in a box. Because of this oversight, the customer had to reinstall all the ESX Servers to allocate a small shared-mode VMFS. Before reinstalling, they set up their operating system disk as a RAID 1, making use of hardware mirroring between disks 1 and 2, and left the last four 146GB disks as a RAID 5 + 1 hot-spare configuration. The smaller disks met their VM load quite nicely. On the RAID 5 LUN, they created two file systems: a VMFS for the public (nonclustered) VMs and a smaller shared-mode partition for the cluster's data drives.
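To see how the six 146GB drives break down under this layout, the following is a quick back-of-the-envelope sketch; it assumes only the standard RAID 1 and RAID 5 capacity formulas, with the hot spare contributing no usable space until a disk fails.

```python
# Back-of-the-envelope capacity check for the six-drive layout described above.
DISK_GB = 146  # drive size from the example

# RAID 1: two mirrored disks for the ESX operating system -> capacity of one disk.
os_raid1_gb = DISK_GB

# Remaining four disks: a three-disk RAID 5 set plus one hot spare.
# RAID 5 usable capacity = (N - 1) * disk size; the spare adds nothing until a failure.
raid5_disks = 3
vm_raid5_gb = (raid5_disks - 1) * DISK_GB

print(f"OS LUN (RAID 1):          {os_raid1_gb} GB")
print(f"VM LUN (RAID 5 + spare):  {vm_raid5_gb} GB")
# The VM LUN is then carved into two partitions: a public VMFS for the
# nonclustered VMs and a smaller shared-mode partition for the cluster data drives.
```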
Although placing the two distinct VMFSs on a single LUN is not generally recommended because of LUN-locking considerations, it can be and has been done in a single-host environment such as the one we are discussing. If an entry-level SAN (refer to Figure 1.8) were used, another host would have been added, and the multiple-partition approach would no longer be a best practice because of the nature of SCSI reservations, which are discussed further in Chapter 5. In a single-host configuration, however, SCSI reservations are less of a concern, so the use of multiple partitions on the same LUN does not go against any best practices. Ideally, there would be three LUNs: a RAID 1 LUN for the OS and a RAID 5 LUN for each of the two necessary VMFSs. However, three LUNs would require at least eight disks, and a disk array would have been necessary, increasing the expense for not much gain, because the VMs in question were small in number and size.
Example 3: The Latest and Greatest
One of our opportunities dealt with a customer's need to use the latest and greatest hardware with ESX Server and, in doing so, to plan for the next release of the OS at the same time. The customer decided to go with a full blade enclosure using diskless dual-CPU blades and many TOE cards so that they could boot their ESX Servers via iSCSI from a NAS. The customer also required an easier, automated way to deploy their ESX Servers.
This presented several challenges up front. The first was that the next release of the OS was not ready at the time, and the HCLs for the current release and for the first release of the next version of ESX showed that some of their desired options would not be supported. So, to use ESX, the hardware mix needed to be changed for both ESX version 2.5 and version 3.0. The customer therefore traded in the TOE cards for Fibre Channel cards or blanks. They also realized that iSCSI and NAS would receive limited support in the first release of ESX version 3.0, so they needed access to local disks to implement their desired virtualization.
The main concern here is that the customer, wanting the latest and greatest, instead got a mixed bag of goodies that was not compatible with the current release, and the prerelease HCL for the next release did not list their desired hardware either. If hardware is not on the HCL now, it most likely will not be on the list in the future; if you can get a prerelease HCL, this can be verified. This customer had to change their plans based on the release schedules, which made for quite a few headaches and required a redesign to get started, including the use of on-board SCSI drives and a SAN. In essence, always check the HCL on the VMware website before purchasing anything.
As for the deployment of ESX, the on-board remote management cards and the multiple methods available to deploy ESX made life much easier. Because these concepts are covered elsewhere, we do not go into a lot of detail here. ESX provides its own method for scripted installations, which suits blades well, and many vendors also provide mechanisms to script the installation of operating systems onto their blades. The key to scripted installations is adding in all the extra bits often required that are outside of ESX, including hardware agents and other necessary software.
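As a rough illustration of the templating half of such a deployment, the sketch below generates a per-blade answer file from a common template. The directives, file names, and blade inventory are all hypothetical placeholders rather than the actual ESX scripted-install syntax; a vendor deployment tool or PXE framework would consume the generated files.

```python
# Hypothetical sketch: generate one scripted-install answer file per blade from a
# shared template. The directives are illustrative placeholders, not the actual
# ESX scripted-install syntax.
from string import Template

TEMPLATE = Template("""\
hostname $hostname
ip $ip
netmask 255.255.255.0
gateway 10.0.0.1
# post-install: pull in the extra bits that live outside of ESX,
# such as hardware agents and other required software
run_post_script /opt/deploy/install_hw_agents.sh
""")

BLADES = {                      # hypothetical blade inventory
    "esx-blade01": "10.0.0.11",
    "esx-blade02": "10.0.0.12",
}

for hostname, ip in BLADES.items():
    with open(f"{hostname}.cfg", "w") as cfg:
        cfg.write(TEMPLATE.substitute(hostname=hostname, ip=ip))
    print(f"wrote {hostname}.cfg")
```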
Example 4: The SAN
Our fourth example is a customer who brought in consulting to do a bake-off between competing products using vendor-supplied small SANs. Eventually, the customer made a choice and implemented the results of the bake-off in a production environment that used a completely different SAN with some significant differences in functionality. Although this information was available during the bake-off, it was pretty much a footnote, and it led to issues with how ESX was implemented in production that later had to be reengineered. What made this customer unique is that they wanted to get ESX 3.0-style functionality while using ESX 2.5. Although a noble goal, this led to setting up version 2.5 in a mode that does not follow best practices but that is supportable. The customer wanted to store all VM data on the SAN, including the VM configuration and log files, and wrote up this desire, wanting confirmation that it was a supportable option.
The architecture decided upon called for each ESX Server to mount a home directory from the SAN so that the VM configuration files, like the VMFS that was already there, would live on the SAN; everything related to a VM would therefore be stored on the SAN using two distinctly different file systems. To enable multiple SAN-based file systems on ESX versions before 3.0, it is necessary to share the Fibre Channel adapters between the COS and the VMs. Sharing Fibre Channel adapters is not a best practice and often causes problems; to limit issues, it is best to have one file system per LUN. Because the customer wanted the configuration files available to each possible server, they created multiple Linux ext3 file systems sharing the same LUN, which also does not follow the one-file-system-per-LUN best practice. However, they did not mix file system types, so no Linux file system shared a portion of a LUN with a VMFS. This was important because both the VMkernel and the Linux kernel can lock a LUN independently when Fibre Channel adapters are shared, which causes SCSI reservations and other SCSI issues. We discuss these issues in Chapter 5.
Even though this customer used several configurations that do not follow best practices, this example is here to point out that although best practices exist, they do not define what is supported or even possible with ESX. We confirmed that their architecture was supportable, but we also pointed out the best practices and possible problems. Many of the items that were not best practices with ESX versions earlier than 3.0 are now part of ESX version 3.0; for instance, ESX version 3.0 stores VM configuration and disk files together on a VMFS, instead of needing multiple file systems and possibly problematic configurations. Understanding the limitations of ESX will aid in its use with various hardware.
Example 5: Secure Environment
It is increasingly common for ESX to be placed into secure environments, as long as the security specialist understands how ESX works and why it is safe to do so. In this case, however, the security specialist assumed that because the VMs share the same air, they were therefore at risk. Although we could prove this was not the case, the design of the secure environment had to work within this limitation. The initial hardware was two dual-CPU machines and a small SAN that would later be removed once everything was proven to work and the large corporate SANs took over. The customer also wanted secure data to be visible to no one but the people on the teams using the information.
This presented several concerns. The first is that the administrators of the ESX box must also be part of the secure teams, have the proper corporate clearances, or be given an exception, because anyone with administrator access to an ESX Server also has access to all the VMDKs available on that server. Chapter 4, "Auditing, Monitoring, and Securing," goes into securing your ESX environment in quite a bit of detail, but suffice it to say that virtualization has its own issues. Because the customer wanted to secure their data completely, it was important to keep the service console, VMotion, and the VM networks on their own secure networks, too. Why secure VMotion as well? Because VMotion passes the memory footprint of the server across an Ethernet cable and, combined with access to the service console, would give a hacker everything a VM is doing. If not properly secured, this is quite a frightening situation.
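One way to reason about this separation is to model each traffic type and its physical uplinks and then check that no secure network shares a NIC with a general-purpose one. The sketch below is purely illustrative; the NIC and network names are made up, and this is not a VMware interface.

```python
# Illustrative model of the network separation argued for above: the service
# console, VMotion, and secure VM traffic each get their own uplinks, and we
# check that no secure network shares a physical NIC with a non-secure one.
# All names are hypothetical; this is not a VMware API.
NETWORKS = {
    "service_console": {"uplinks": {"vmnic0"}, "secure": True},
    "vmotion":         {"uplinks": {"vmnic1"}, "secure": True},
    "secure_vms":      {"uplinks": {"vmnic2"}, "secure": True},
    "public_vms":      {"uplinks": {"vmnic3"}, "secure": False},
}

def shared_uplink_violations(networks):
    """Return (network, network, shared NICs) triples that mix secure and non-secure traffic."""
    violations = []
    names = list(networks)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = networks[a]["uplinks"] & networks[b]["uplinks"]
            if shared and networks[a]["secure"] != networks[b]["secure"]:
                violations.append((a, b, shared))
    return violations

print(shared_uplink_violations(NETWORKS) or "no secure/non-secure uplink sharing")
```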
Whereas the company had a rule governing the use of SANs to present secure data LUNs, they had no such policy concerning ESX. In essence, it was important to create an architecture that kept all the secure VMs on their own set of ESX Servers and placed everything not belonging to the secure environment on another set of ESX Servers. This kept all the networking separated by external firewalls and kept the data from being accessed by anyone not part of the secure team. If a new secure environment were necessary, another pair of ESX Servers (so that VMs could be VMotioned between them) would be added behind its own firewall.
The preceding could easily have been performed on a single ESX Server, but it would require the administrators to have the proper corporate clearances to be allowed to manipulate secured files. Given this and the appropriate network configuration inside ESX, it is possible to create many different secure environments within a single ESX host, including access to other secure machines external to ESX. However, this customer did not choose that option.
Example 6: Disaster Recovery
We were asked to do a DR plan for a customer that had two datacenters in close proximity to each other. The customer wanted a duplicate set of everything at each site so that they could run remotely if necessary. This is not an uncommon desire; in effect, they wanted a hot-site implementation. Their current ESX Server load was two dual-CPU hosts at each location, two distinctly different SANs, and some slightly different operational procedures. The currently light load on each ESX Server would eventually grow to the point where new machines would have to be placed in the environment.
Because of the disparate SAN environments, it was impossible to create a SAN copy of the data; the SANs spoke different languages, so a hardware solution to the problem was out of the question. This in turn led to political issues that had to be ironed out. Once allowed to proceed, the decision was made to create backups using another mechanism and to physically copy the VMs from site to site with some form of automated script. Although there are plenty of tools that already do this, ESX comes equipped with the necessary script to make backups of VMs while they are still running, so in essence a hot copy can be made by ESX with a bit of scripting. Tie this to a local tape drive (which the customer also wanted in the mix), and a powerful local and remote backup solution emerges.
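To show the shape of such a script, here is a minimal sketch that hot-exports the running VMs, ships the exports to the remote site, and writes a copy to the local tape drive. The export command name and options, the hostnames, paths, and tape device are all assumptions for illustration; the actual ESX-supplied hot-backup script would stand in for the placeholder export command.

```python
# Minimal sketch of the hot-copy workflow described above. The export command,
# hostnames, paths, and tape device are illustrative assumptions; the ESX-supplied
# hot-backup script would stand in for the placeholder export command.
import subprocess

EXPORT_CMD = ["/usr/sbin/vm-hot-export", "--all", "--dest", "/vmimages/backup"]  # placeholder
REMOTE_SITE = "backup@esx-remote:/vmimages/incoming/"   # hypothetical second datacenter
TAPE_DEVICE = "/dev/st0"                                # local tape drive

def run(cmd):
    """Run a command, echoing it first, and stop the workflow if it fails."""
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)

# 1. Hot-export the running VMs (snapshot/redo-log based, per the ESX tooling).
run(EXPORT_CMD)

# 2. Ship the exports to the second datacenter.
run(["rsync", "-av", "/vmimages/backup/", REMOTE_SITE])

# 3. Keep a local copy on tape as well.
run(["tar", "-cf", TAPE_DEVICE, "-C", "/vmimages", "backup"])
```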
Various other approaches were discussed, but unfortunately they would not work. A key idea was to use VMotion, but the distances involved meant the VMs would be shipped over a very long, albeit dedicated, wire from site to site, which would put the memory footprints of the VMs at risk. Earlier versions of ESX avoid this issue by not allowing VMotion to work through a gateway or router; ESX version 3, on the other hand, does allow VMotion to work through a router and gateway. Another possibility was the use of an offsite backup repository, but that would make restoration slower.