Book excerpt from VMware ESX and ESXi in the Enterprise

Chapter 6: Effects on Operations

Excerpt from VMware ESX and ESXi in the Enterprise: Planning Deployment of Virtualization Servers, 2nd Edition, by Edward Haletky.

Excerpt from VMware ESX and ESXi in the Enterprise: Planning Deployment of Virtualization Servers, 2nd Edition.

By Edward Haletky

Published by Prentice Hall

ISBN-10: 0-13-705897-7

ISBN-13: 978-0-13-705897-6

Newsletters: Sign-Up & Save! Receive Special Offers, Try Safari Books Online NOW.

The introduction of virtualization using VMware ESX and ESXi creates a myriad of operational problems for administrators, specifically problems having to do with the scheduling of various operations around the use of normal tools and other everyday activities, such as deployments, antivirus and other agent and agentless operational tasks (performance gathering, and so forth), virtual machine agility (vMotion and Storage vMotion), and backups. In the past, prior to quad-core CPUs, many of these limitations were based on CPU utilization, but now the limitations are in the areas of disk and network throughput.

The performance-gathering issues dictate which tools to use to gather performance data and how to use the tools that gather this data. A certain level of understanding is required to interpret the results, and this knowledge will assist in balancing the VMs across multiple ESX or ESXi hosts.

The disk throughput issues are based on the limited pipe between the virtualization host and the remote storage, as well as reservation or locking issues. Locking issues dictate quite a bit how ESX should be managed. As discussed in Chapter 5, “Storage with ESX,” SCSI reservations occur whenever the metadata of the VMFS is changed and the reservation happens for the whole LUN and not just an extent of the VMFS. This also dictates the layout of VMFS on each LUN; specifically, a VMFS should take up a whole LUN and not a part of the LUN. Disk throughput is becoming much more of an issue and will continue to be. Which is why with vSphere 4.1, Storage IO Control (SIOC) was introduced to traffic shape egress from the ESX host to Fibre Channel arrays. SIOC comes into play if the LUN latency is greater than 20ms. SIOC should improve overall throughput for those VMs marked as needing more of the limited pipe between the host and remote storage.

The network throughput issues are based on the limited pipes between the virtual machines and the outside physical network. Because these pipes are shared among many VMs, and most likely networks, via the use of VLANs, network I/O issues come to the forefront. This is especially true when discussing operational issues such as when to run network intensive tasks: VM backups, antivirus scans, and queries against other agents within VMs.

Virtual machine agility has its own operational and security concerns. Basically, the question is, “Can you ever be sure where your data is at any time?” Outside of the traditional operational concerns, virtual machine agility adds complexity to your environment.

Note that some of the solutions discussed within this chapter are utopian and not easy to implement within large-scale ESX environments. These are documented for completeness and to provide information that will aid in debugging these common problems. In addition, in this chapter unless otherwise mentioned we use the term ESX to also imply ESXi.

SCSI-2 Reservation Issues

With the possibility of drastic failures during crucial operations, we need to understand how we can alleviate the possibility of SCSI Reservation conflicts. We can eliminate SCSI Reservations by changing our operational behaviors to cover the possibility of failure. But what is a SCSI Reservation?

SCSI Reservations occur when an ESX host attempts to write to a LUN on a remote storage array. Because the VMFS is a clustered file system, there needs to be a way to ensure that when a write is made, that all previous writes have finished. In the simplest sense, SCSI Reservation is a lock that allows one write to finish before the next. We discuss this in detail in Chapter 5.

Although the changes to operational practices are generally simple, they are nonetheless fairly difficult to implement unless all the operators and administrators know how to tell whether an operation is occurring and whether the new operation would cause a SCSI Reservation conflict if it were implemented. This is where monitoring tools make the biggest impact.

VMware has made two major changes within the VMFS v3.31 to alleviate SCSI-2 Reservation issues. The first change was to raise the number of SCSI-2 Reservation retries that occur before a failure is reported. The second change was to allocate to each ESX host within a cluster a section of a VMFS so that simple updates do not always require a SCSI-2 Reservation. Even with these changes, SCSI-2 Reservations still occur, and we need to consider how to alleviate them.

The easiest way to alleviate SCSI-2 Reservations is to manage your ESX hosts using a common interface such as VMware vCenter Server, because vCenter has the capability to limit some actions that impact the number of simultaneous LUN actions. However, with the proliferation of PowerShell scripts, other vCenter management entities, and direct to host actions, this becomes much more difficult. Therefore, as we discussed in Chapter 4, “Auditing and Monitoring,” it behooves you to perform adequate logging so that you can determine what caused the SCSI-2 Reservation, and then work to alleviate this from an operational perspective.

Best Practice - Verify that any other operation has first completed on a given file, LUN, or set of LUNs before proceeding with the next operation.

The primary way to avoid SCSI-2 Reservations is to verify in your management tool that all operations upon a given LUN or set of LUNs have been completed before proceeding with the next operation. In other words, serialize your actions per LUN or set of LUNs. In addition to checking your management tools, check the state of your backups and whether any current open service console operations have also completed. If a VMDK backup is running, let that take precedence and proceed with the next operation after the backup has completed. The easiest way to determine if a backup is running is to look on your backup tool’s management console. However, you can also check for a snapshot that is created by your backup software using the snapshot manager that is part of the vSphere client or one of the snapshot hunter tools available. Most snapshots created by backup tools will have a very specific snapshot name. For example, if you use VCB, the snapshot will be named “_VCB-BACKUP_.”

Multiple concurrent vMotions or Storage vMotions are a common cause for SCSI-2 Reservations, and this is why VMware has limited the number of simultaneous vMotions and Storage vMotions that can take place to six (increased to 8 in vSphere 4.1). Note that although vSphere will allow this number of migrations take place concurrently, it is not recommended for all arrays. For high-end arrays, the maximum can be performed simultaneously.

To check to see whether service console operations that could affect a LUN or set of LUNs have completed, judicious use of sudo is recommended. sudo can log all your operations to a file called /var/log/secure that you can peruse for file manipulation commands (cp, rm, tar, mv, and so on). Hopefully, this is being redirected to your log server, which has a script written to tell you if any LUN operations are taking place. Additionally, as the administrator, you can check the process lists for all servers for similar operations. No VMware user interface combines backups, vMotion, and service console actions. However, the HyTrust appliance is one such device that does provide a central place to audit for LUN requests (but not the completion of such requests).

When you work with ESXi, filesystem actions can still take place via Tech Support Mode, VMware Management Appliance (vMA) and the use of the vifs command. Even for ESXi, logging will be required.

For example, let’s look at a system of three ESX hosts with five identical LUNs presented to the servers via Hitachi storage. Because each of the servers shares LUNs we need, we should limit our LUN activity to one operation per LUN at any given time. In this case, we could perform five operations simultaneously as long as those operations were LUN specific. After LUN boundaries are crossed, the number of simultaneous operations drops. To illustrate the second case, consider a VM with two disk files, one for the C: drive and one for the D: drive. Normally in ESX, we would place the C: and D: drives on separate LUNs to improve performance, among other things. In this case, because the C: and D: drives live on separate LUNs, manipulation of this VM, say with vMotion, counts as four simultaneous VM operations. This count is due to one operation affecting two LUNs, and the locks need to be set up on both the source and target of the vMotion. Therefore, five LUN operations could equate to fewer VM operations.

This leads to a set of operational behaviors with respect to SCSI Reservations.

Using the preceding examples as a basis, the suggested operational behaviors are as follows:

1 2 3 Page 1
Page 1 of 3
The 10 most powerful companies in enterprise networking 2022