High availability for Microsoft Exchange with VMware ESX Server and SteelEye LifeKeeper

Popular Linux-based virtualization and failover technologies help put together a robust high availability and disaster-recovery solution for Microsoft Exchange -- without upgrading the Exchange software itself.

One government agency recently set out to provide both high availability and disaster recovery (DR) for its Microsoft Exchange 2003 server. After reviewing the available options, the agency selected SteelEye’s LifeKeeper Protection Suite for Exchange to provide real-time data protection plus application monitoring and recovery for Exchange. It selected VMware ESX Server 2.5 to host all servers in the DR site, to help reduce the cost of building and managing the DR infrastructure.

Key factors in the decision included:

• Support for physical-to-virtual clustering

• Support for protecting the existing Exchange server

With the choices made, implementation was next on the agenda. First, the infrastructure was put in place. A point-to-point 45Mbps WAN connection was established between the primary data center and the DR site. The DR site also had a direct T1 connection to the Internet, which in the event of a disaster would serve as the gateway for sending and receiving SMTP e-mail and provide remote access to Outlook Web Access.

In the DR site, VMware ESX Server 2.5.4, a Linux-based host operating system, was installed on an IBM System x3650 with a quad-core Intel Xeon 2.66GHz processor, 4GB of RAM and four 73GB 15K hot-swap SAS drives. The first VMs were then brought online. These VMs were Microsoft Active Directory (AD) controllers running core infrastructure services required by Exchange, such as DNS and the Global Catalog. AD and AD-integrated DNS replicate automatically, so no further steps were required to protect the AD controllers and DNS at the DR site. The steps for seizing the AD FSMO roles were documented in case the original AD controllers were no longer available after a disaster.

The next step was to add an Exchange server to the DR site. The SteelEye LifeKeeper Protection Suite for Exchange is completely different from Microsoft Cluster Server (MSCS) Exchange clusters. Where MSCS would require identical, cluster-certified hardware, shared storage and an upgrade of the existing Exchange server to Enterprise Edition, LifeKeeper simply requires that another Exchange mailbox server be added to the existing Exchange Site. Because LifeKeeper supports physical-to-virtual clustering, a virtual machine was created on the ESX server for this new Exchange server, and Exchange was installed just as it would be for any other mailbox server in the same Site. The only LifeKeeper requirement is that the names of the Storage Groups and Mailbox Stores be identical to those on the primary Exchange server.

All of the network infrastructure and hardware was now in place to begin the implementation of LifeKeeper for Exchange. Because the implementation was also going to include extensive failover testing, time was scheduled on a Friday and Saturday evening to complete installation, configuration and testing with minimal impact on users.

Figure 1 - The Network Configuration

Friday Evening – Installation and Configuration

Before LifeKeeper was installed, a basic health check of Exchange and the network was performed. In addition to a review of the system and application logs for existing errors, utilities such as the Exchange Best Practices Analyzer (ExBPA), DCDiag and NetDiag were run to make sure there were no existing issues. ExBPA identified that Exchange SP1 had never been installed on the new Exchange server. Once this issue was fixed, we were ready to move forward.

The installation of LifeKeeper was straightforward: the LifeKeeper Protection Suite was installed and licensed on both the primary and secondary Exchange servers. Once the software was installed and the servers rebooted, the installation was complete. It was now time to configure LifeKeeper.

The primary Exchange server had been in production for over two years and was running on an HP ProLiant DL380 with 2GB of RAM and direct-attached SCSI disks. The server had a 30 GB RAID 1 drive for the system partition and a 160 GB RAID 5 volume for the log and database files. About the only hardware requirement LifeKeeper has is that the replicated volumes on the secondary server be at least as large as the volumes on the primary server. Therefore, the VM acting as the secondary Exchange server had to be assigned a 160 GB volume.

Once the volume on the VM was created, partitioned and formatted, we configured the LifeKeeper cluster. Configuring the cluster involved creating communication paths for heartbeats, along with a volume resource, an Exchange resource, a DNS resource and a generic application resource. Each of these was created by invoking the appropriate wizard in the LifeKeeper GUI.

Creating a communication path between the two servers was the first step. Because these servers were connected by a single WAN link, only one communication path was created. To eliminate the possibility of a split-brain scenario, where both servers become active in the event that all communication links fail, it was decided to disable automatic failover and rely strictly on manual failover. In the future, a VPN connection between the two servers will be created across the public network so that a secondary communication path can be established and automatic failover can be enabled.

After the creation of the communication path, the Volume, DNS and Exchange resources were also created. After the configuration of the resources was complete, the GUI appeared as in Figure 2.

Figure 2 - The LifeKeeper GUI after the creation of all the resources

Each resource has specialized code that gives LifeKeeper the intelligence to provide monitoring and recovery of that resource. The DNS resource also does a dynamic update of the DNS server to provide client redirection when migrating Exchange between different subnets. When you combine these resources, as illustrated in Figure 2, you provide complete protection for the entire Exchange application stack.

The creation of the volume resource included the creation of the data mirror. Because the initial replication of 34GB was going to take a few hours to complete across the WAN link, it was decided that this was a good breaking point for the evening.
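As a rough check on that estimate, the best-case transfer time follows from the data size and the link speed quoted above. The sketch below assumes decimal units and a link running at full line rate with no protocol overhead or competing traffic, so real replication would take longer; the function name transfer_seconds is illustrative only.

```shell
# Lower-bound transfer time for the initial mirror.
# Assumes decimal units (1 GB = 8000 megabits) and an otherwise idle link.
# usage: transfer_seconds <gigabytes> <link_mbps>
transfer_seconds() {
    # gigabytes -> megabits, then divide by megabits per second
    echo $(( $1 * 8 * 1000 / $2 ))
}

secs=$(transfer_seconds 34 45)
echo "about $(( secs / 60 )) minutes at full line rate"
# prints: about 100 minutes at full line rate
```

At line rate the copy would need roughly 100 minutes, which squares with allowing a few hours once real-world overhead is factored in.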

Saturday Evening – Completing the Configuration

Saturday evening we continued with the configuration of the LifeKeeper resources. At this point, the only remaining configuration issue was to address the protection of the third-party applications that interacted with Exchange. These applications included Esker Fax, PageMasterEX 2003 and Trend Micro. Protection of these resources was accomplished by creating a LifeKeeper Generic Application Recovery Kit (GenApp).

A GenApp gives users the ability to easily protect third-party and custom applications that do not have a prepackaged recovery kit. The basic requirements to build a GenApp include separate scripts that know how to start and stop the application. Optionally, a script that can check the health of the application can be written.

It was decided that only basic start and stop operations were required. The resulting start (Restore.ksh) and stop (Remove.ksh) scripts are shown below.

Remove.ksh

net stop FGExchge

net stop EUQ_Monitor

net stop PageMasterEX

exit 0

Restore.ksh

net start FGExchge

net start EUQ_Monitor

net start PageMasterEX

exit 0
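The optional health-check script was not used in this deployment, but a sketch helps show what one could look like for the same three services. It assumes the Windows sc query command and a grep utility are available to the script's shell, and that a nonzero return is how a failed check would be signaled; the function name check_services is hypothetical.

```shell
# Health-check sketch for the services handled by Remove.ksh and Restore.ksh.
# Assumptions: "sc query" and grep are reachable from the script's shell,
# and a nonzero status means "unhealthy".
check_services() {
    for svc in FGExchge EUQ_Monitor PageMasterEX; do
        # "sc query <name>" prints a STATE line containing RUNNING
        # when the service is started
        sc query "$svc" | grep -q "RUNNING" || return 1
    done
    return 0
}
```

A script built on this function would call check_services and exit with its return status. Unlike the start and stop scripts above, which always exit 0, it would report a stopped service as a failure.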

Once the scripts were complete, and the GenApp resource creation wizard was run, the configuration of LifeKeeper was complete. It was time to test the solution.

Saturday Evening – Testing the Solution

MANUAL SWITCHOVER

The first test was a simple manual switchover. A handful of clients, including Outlook 2003, OWA and POP3 clients, were launched and connected to the Exchange server. Some test e-mails were sent before the switchover, and then the switchover was initiated from the LifeKeeper GUI. During the switchover, Exchange was unavailable for about 1½ minutes. After the switchover completed, client connectivity was successfully tested.

LOCAL RECOVERY

One of the features included with LifeKeeper is local recovery. This feature allows LifeKeeper to attempt to fix a problem locally before a failover occurs. To test this feature, we simply stopped the Exchange Information Store service manually through the Services control panel. We then verified that LifeKeeper detected this failure and restarted the service automatically without causing a failover.

HARD FAILOVER

The last test, and the most important one, was to simulate a hard failure of the Exchange server. One way to test this type of disaster is to pull the power cord on the server. After a little coaxing, the admin agreed to go ahead and do that. Because automatic failover had been disabled earlier, the secondary server was just sitting there waiting for us to tell it to come into service. Once we brought it into service through the GUI, the secondary server came online in about a minute, with no loss of data.

SWITCHBACK

After both the manual switchover test and the hard failover test, it was imperative that we were able to bring Exchange back into service on the primary server. This was done easily through the LifeKeeper GUI by selecting the primary server and telling it to come into service. Because of LifeKeeper’s intent log, which tracks the changes on a replicated volume, only the changes that occurred while the primary server was offline needed to be synced before it was brought back online.
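The idea behind an intent log can be sketched in a few lines. This is only an illustration of change tracking in general, not LifeKeeper's actual log format or resynchronization code; the block numbers and the function names mark_dirty and resync are hypothetical.

```shell
# Change-tracking sketch: remember which blocks were written while the
# mirror target was unreachable, then copy only those blocks on resync.
intent_log=""          # space-separated list of dirty block numbers

# Record a write to a block; repeated writes to the same block are logged once.
mark_dirty() {
    case " $intent_log " in
        *" $1 "*) ;;                          # already recorded
        *) intent_log="$intent_log $1" ;;
    esac
}

# Copy only the logged blocks, then clear the log.
resync() {
    for block in $intent_log; do
        echo "copy block $block"
    done
    intent_log=""
}

mark_dirty 7; mark_dirty 42; mark_dirty 7
resync
# prints: copy block 7
#         copy block 42
```

Only two blocks are copied even though three writes occurred, which is why a partial resync finishes far faster than re-mirroring the whole volume.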

TEST RESULTS

All of the tests completed successfully, with failover times always under two minutes. The only change required was the addition of a secondary public DNS MX record with a priority of 20 pointing at EX02, so that if EX01 was unavailable, EX02 would receive all incoming SMTP e-mail.
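The effect of the second MX record can be sketched by listing the records in the order a sending mail server would try them, lowest preference value first. Only the value 20 for EX02 comes from the deployment described here; the preference of 10 for EX01, the example.com host names and the function name mx_order are assumptions made for illustration.

```shell
# Hypothetical MX data, one "preference host" pair per line.
# Only the preference of 20 for EX02 is taken from the text above.
mx_records='20 ex02.example.com
10 ex01.example.com'

# Print hosts in delivery order: the lowest preference value is tried first.
mx_order() {
    printf '%s\n' "$1" | sort -n | awk '{print $2}'
}

mx_order "$mx_records"
# prints: ex01.example.com
#         ex02.example.com
```

Senders try EX01 first and fall back to EX02 only when EX01 cannot be reached, so no DNS change is needed at failover time for inbound SMTP.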

Conclusion

By combining LifeKeeper for Exchange with VMware ESX in the DR site, the customer was able to meet its Exchange disaster recovery RPO and RTO requirements while staying within its budgetary and space constraints. With VMware ESX Server in place at the DR site, the customer is considering putting other DR servers in place without having to purchase additional hardware.

David A. Bermingham, MCSE, MCSA:Messaging, is director of product management for SteelEye Technology.

This story, "High availability for Microsoft Exchange with VMware ESX Server and SteelEye LifeKeeper" was originally published by LinuxWorld-(US).

Copyright © 2007 IDG Communications, Inc.
