When your Exchange server goes down

MS Exchange disaster-recovery wares

WANSync HA Exchange beats out three others for our Clear Choice designation in our test of Microsoft Exchange disaster-recovery wares

E-mail availability is an enterprise business mandate. To that end, we tested four high-availability products for arguably the most popular enterprise e-mail system: Microsoft's Exchange Server.

We tested Fujitsu/Softek's Softek Replicator 2.1.2; LeftHand Networks' SAN/iQ Software, Remote IP Copy Software and NSM 200 SAN Module combination; NSI Software's Double-Take for Windows 4.3; and XOSoft's WANSync HA Exchange 3.5.2 Build 45.




WANSync HA Exchange earns our Clear Choice designation because it adapted quickly to our setup, presented clear Exchange installation-specific options, and required no subsequent intervention to complete the processes of failover detection, failover and bringing our hotsite/back-up site online.

WANSync HA Exchange also is clearly built for an Exchange environment: it reads the Active Directory, the server registry and Exchange 2000/2003 files, and quickly gives an administrator a profile of where things are, how they're set and what options exist for synchronization intervals. The other products tested treat Exchange more as a minor option - leaving the Exchange-specific details to administrators' customization and script-writing skills.

There are a number of ways to increase availability - for example, shoring up the application platform with power protection, routing techniques and monitoring. Our tests, however, assume a site's Exchange servers have become unavailable - gone from the network map entirely. This mandates that Exchange services become available from an alternate site.

This disaster simulation was simple for us. Using an SNMP trigger, we merely pulled the power feeding our primary site's two servers, one of them running Exchange, including an Active Directory Global Catalog server, and the other a forest-partitioned server in the same Active Directory domain.

The real test was to find out whether vendors' implementations could sense the primary site was down, then recast the hotsite's mirror to Exchange users. After the power failure we assessed how products detected the outage (all but Softek Replicator could). We then wanted to see the products bring Exchange services back online by using our VPN connection to fail over operations to the secondary site. We tracked how many messages were lost during failover and clocked time to availability.

WANSync HA Exchange and Double-Take for Windows do this automatically, but the latter lost some messages. LeftHand and Fujitsu/Softek require manual intervention to bring Exchange 2000/2003 back online at our simulated hotsite/back-up site location - and both lost messages. It should be noted that none of the products lost messages from the Exchange message stores - only messages in progress.

WANSync HA Exchange

A self-described 'switchover solution', WANSync HA Exchange is the only Exchange-specific product reviewed in this comparison, although it does retain XOSoft's WANSync technology that's used for other applications such as Microsoft SQL Server and Oracle 9.

Of two available deployment options, we chose the high-availability one (as opposed to the WANSync "file" method), which dictated the necessary steps to achieve a replicated server. These steps involved developing a scenario that brought the masters' settings together with their replicants'. This scenario mirrored all the settings necessary for the Active Directory, DNS, registry entries, Exchange-specific logs/file locations, and Exchange settings to be replicated to the secondary site.

We linked the source and replication server by making the replication server a 'switchover host' and chose a replication name. The auto-discovery process in WANSync HA Exchange queried our Active Directory and each host for its information. It then offered appropriate default selections of items such as source host monitoring (heartbeat, timeouts, IP pinging, for example) and let us run scripts before switchover or switchback.
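The heartbeat-and-timeout style of source host monitoring described above can be pictured with a minimal sketch. This is not XOSoft's implementation; the class name, the five-second interval and the three-miss threshold are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Flags a source host as down after consecutive missed heartbeats.

    Every name and default here is hypothetical, not a WANSync setting.
    """
    interval_s: float = 5.0   # assumed seconds between expected heartbeats
    max_missed: int = 3       # assumed consecutive misses before switchover
    missed: int = 0

    def record(self, heartbeat_received: bool) -> bool:
        """Record one polling interval; return True when switchover should start."""
        if heartbeat_received:
            self.missed = 0   # any successful heartbeat resets the count
        else:
            self.missed += 1
        return self.missed >= self.max_missed

    def detection_delay_s(self) -> float:
        """Approximate time from failure to detection under these settings."""
        return self.interval_s * self.max_missed


monitor = HeartbeatMonitor()
# One isolated miss is tolerated; three misses in a row trigger switchover.
observations = [True, True, False, True, False, False, False]
decisions = [monitor.record(ok) for ok in observations]
print(decisions)
print(monitor.detection_delay_s())
```

A lower threshold detects an outage sooner but is more likely to fail over on a brief network hiccup, which is presumably why such values are offered as adjustable defaults.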

WANSync HA Exchange can redirect DNS to point to the failover Exchange server, then change it back if necessary at a failback point. Also, it can change the IP addresses so that the remaining live Exchange server can be found if DNS can't (or shouldn't) be changed. In both cases, users might have to exit Outlook or their mail readers, and flush their DNS cache as we had to do.

Oddly, WANSync HA Exchange was the slowest of all four products tested to perform an initial synchronization of the message and public folder store in our tests, which took more than 10 hours (see the Tracking failover performance chart).

There are three levels of replication between WANSync "master" and "replica" servers. First, the initial synchronization takes place, using options that let large chunks of data be replicated at a time. Moving the chunk size from small to large made no difference in WANSync's slow data copying as far as the initial replication is concerned. Subsequent replication is done at either the block level or the file level. Either was sufficient to let WANSync HA Exchange keep all 24 of the messages in queue at the failure point on the replica server.

Failback is the reverse of failover, and for all of the products tested, the time to resynchronize was slightly faster (but proportional) because most files were already populated on our disabled Exchange servers.

Of the four products tested, only WANSync HA Exchange prevented any outgoing messages from being lost at the time of failure. WANSync HA Exchange can auto-discover many Exchange facets such as file locations, Exchange specifics and logs.

The WANSync Manager is the core management application. It uses a Microsoft Management Console-like layout that allows fast perusal of paired (master and replica) servers, their settings and the settings that are made for failover.

Tracking failover performance

While XOSoft’s WANSync HA Exchange required the longest initial server synchronization time, the product earned our Clear Choice honors for its ability to fail over quickly without dropping pending messages.

Product | Time for initial synchronization via 10M bit/sec link | Average time to availability after failure (1) | Number of dropped message transactions during failure (2)
XOSoft WANSync HA Exchange | 643 minutes | 18 minutes | 0
NSI Software Double-Take for Windows | 530 minutes | 23 minutes | 5
LeftHand Networks SAN/iQ Software, Remote IP Copy Software, NSM 200 SAN Module | 521 minutes | 72 minutes | 6
Fujitsu/Softek Replicator | 536 minutes | 82 minutes | 24

Note that these scores are specifically for Exchange 2000/2003, not for other uses or applications.

(1) Average of two synchronizations.

(2) Twenty-four transactions are pending when primary server is shut down.

NSI Double-Take For Windows

Double-Take for Windows is a byte-level replication system. Sources of datasets (drives, volumes and folders) are identified. Then target storage areas are calculated for size and subsequently allocated for replication. In many installations, one target might suit several data sources, but for Exchange, NSI recommends a one-to-one allocation if the targets are going to be used for subsequent/possible failover. The mirroring took 530 minutes to replicate our datasets, consisting of Exchange executables and Exchange stores for both of our two primary site servers.

This initial synchronization was simple to set up and deploy, but additional steps, such as configuring batch files and locating Exchange files, are required.

In our two-forest, two-Exchange server example in which we simulated a headquarters and a manufacturing branch topology, four licenses of Double-Take were required, but only two licenses of Exchange server were required.

When failover occurs - as detected by a failure threshold that can be set for communication between servers - the software triggers options you set in a recommended batch file that calls NSI's ExchFailover command. This command starts Exchange on the failover hardware and sets DNS information to the target server.

Although we had no failures in our tests, NSI recommends that you closely monitor processes that fail over because of unpredictable results when Exchange services start. We also successfully attempted failback with this product, which is a method to repopulate the source when it becomes available again.

Double-Take for Windows uses a 32-bit Windows application called the Management Console by which you can control all aspects of the product. Icons display server status and the properties of the server (such as items managed, logs info and so on) can be easily seen in the GUI. There's also a command-line interface available to manually enter stop, start, failover and other commands.

LeftHand Networks

LeftHand provided the full-meal deal in the form of a storage-area network (SAN) running its SAN/iQ Remote IP Copy application. Because LeftHand completes 100% of its clients' installations, we let its engineers install the SAN and software in our lab. The SAN and software combination - called the Network Storage Module (NSM) - is mirrored from site to site.

The SAN comprised two rackable RAID frames with eight drives in each. The drives were configured in what LeftHand calls a RAID 10 configuration (actually a RAID 5/2 combination) along with its software. Installation of the hardware took about an hour with two people; initial software installation took half that.

We set up two stores at each 'site' for each Exchange server running at that site on one NSM.

NSMs swap snapshots of data from a source to a target. If the data link speed between NSMs supports it, frequent snapshots can be taken and sent from source to target. Because there are fluctuations in mail server transactions, frequent snapshots need to be taken to ensure transactions aren't lost. The product can be tuned so that frequent snapshots are taken.
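The link between snapshot frequency and lost transactions can be illustrated with a toy calculation. The sketch below assumes snapshots fire on a fixed schedule starting at time zero and transfer instantly, which real links will not do; the function name and the message timestamps are hypothetical and are not measurements from our tests.

```python
from bisect import bisect_right


def unreplicated_messages(arrivals, snapshot_interval_s, failure_time_s):
    """Count messages that arrived after the last snapshot taken before failure.

    arrivals must be a sorted list of arrival times in seconds.
    Assumes snapshots at 0, interval, 2 * interval, ... with instant transfer.
    """
    last_snapshot = (failure_time_s // snapshot_interval_s) * snapshot_interval_s
    arrived_before_failure = bisect_right(arrivals, failure_time_s)
    captured_by_snapshot = bisect_right(arrivals, last_snapshot)
    return arrived_before_failure - captured_by_snapshot


# Hypothetical arrival times (seconds) and a failure at t = 295 seconds.
arrivals = [10, 70, 130, 190, 250, 290]
at_risk = {
    interval: unreplicated_messages(arrivals, interval, 295)
    for interval in (300, 60, 10)
}
print(at_risk)
```

In this simplified model, shortening the interval shrinks the window of unprotected messages, at the cost of more frequent transfers across the link between sites.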

We initially connected to the LeftHand Networks SAN with a notebook and null-modem serial cable. The devices are mirrored together over IP - outside of the connection(s) to the Exchange Servers. The required IP connectivity doesn't like network address translation, and therefore requires a virtual LAN or VPN connection between sites. We used a Point-to-Point Tunneling Protocol VPN connection.

The NSM is configured for four parameters: size and type of Windows NT File System volume, output bandwidth to use (bandwidth can be throttled to optimize links), snapshot information and scheduling.

The management tools provided with this product were useful, but not necessarily intuitive. Replacing a failed drive stymied us until we read through the help screens because we couldn't understand the procedure from the documentation.

The Exchange-specific recommendation we followed asked that, once a failover had occurred, you make your way through 27 steps to bring Exchange back online at the hotsite/back-up site. We did this twice, and the steps worked.

The most onerous of these steps was to bring Exchange online at the recovery site in Recovery mode, which added more than an hour to the point where we could bring either version of Exchange back online.

Documentation comes in three sets: one for Remote IP Copy, one for SAN/iQ and one for the NSM 200 installation. They're good and descriptive documents, although the Exchange-specific information came via a PDF rather than through the documents. The Exchange instructions were good enough.

The fact that the LeftHand Networks NSMs can replicate a number of platform stores - including Linux, Windows 2000, Windows 2003 and numerous file systems - is to its benefit, but in terms of high availability specifically for Exchange, it lagged the competition.

Softek Replicator 2.1.2

Similar in many ways to Double-Take, Softek Replicator is hardware-independent and uses a device driver that intercepts data to be written to disk. The driver then manages writing the data to disk and transferring it to a target server. These pairs are mirrors of each other. Softek journals entries to store updates that might leave the mirrored pair replicated incorrectly.

Softek pays special attention to transactional commitment in the relational database sense because its driver notifies applications when transactions are complete. Softek Replicator can operate in asynchronous, synchronous or near-synchronous modes. The synchronous mode won't return control to an application until the primary and mirrored devices have been successfully written to. Asynchronous mode will wait until buffers fill to transmit. We couldn't find tuning points that optimized this without losing messages. The near-synchronous mode is a trade-off that lets a pre-defined ceiling be established before Softek Replicator makes an application wait until records have been written to the primary and mirrored pair servers.
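The trade-off among the three modes can be pictured with a toy model that counts how many writes would be missing from the mirror if the primary died at a given moment. This is a sketch of the general technique only, not Softek's driver; the class, the buffer size of eight and the ceiling of two are hypothetical.

```python
from collections import deque
from enum import Enum


class Mode(Enum):
    SYNCHRONOUS = "synchronous"
    ASYNCHRONOUS = "asynchronous"
    NEAR_SYNCHRONOUS = "near-synchronous"


class MirroredPair:
    """Toy model of a primary/mirror pair; all parameters are hypothetical."""

    def __init__(self, mode: Mode, buffer_size: int = 8, ceiling: int = 2):
        self.mode = mode
        self.buffer_size = buffer_size   # async: transmit once the buffer fills
        self.ceiling = ceiling           # near-sync: max unmirrored writes allowed
        self.mirror = []
        self.pending = deque()           # written to primary, not yet on mirror

    def _transmit(self, count: int) -> None:
        for _ in range(count):
            self.mirror.append(self.pending.popleft())

    def write(self, record: str) -> None:
        self.pending.append(record)
        if self.mode is Mode.SYNCHRONOUS:
            # The caller waits until the mirror holds every record.
            self._transmit(len(self.pending))
        elif self.mode is Mode.ASYNCHRONOUS:
            # Nothing is sent until the buffer is full.
            if len(self.pending) >= self.buffer_size:
                self._transmit(len(self.pending))
        else:
            # The caller waits only when the backlog would exceed the ceiling.
            excess = len(self.pending) - self.ceiling
            if excess > 0:
                self._transmit(excess)

    def unmirrored(self) -> int:
        return len(self.pending)


def backlog_after(mode: Mode, writes: int = 6) -> int:
    pair = MirroredPair(mode)
    for i in range(writes):
        pair.write(f"msg-{i}")
    return pair.unmirrored()


results = {mode.value: backlog_after(mode) for mode in Mode}
print(results)
```

In this model the synchronous pair never has a backlog, the asynchronous pair can hold everything short of a full buffer, and the near-synchronous pair is capped at its ceiling, which mirrors the exposure-versus-wait trade-off described above.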

Softek Replicator doesn't automatically start a mirrored server recovery of Exchange. It might not even know that the primary site has been blown to bits.

Recovery to availability of Exchange 2000 or Exchange 2003 required that we use a manual restoration with the Exchange recovery utility. We started Exchange 2000 or 2003, then started the recovery process of the database (message store) and transaction logs. All of Exchange's components must be on a volume that is not used for paging (such as the 'C:' volume of most every Exchange server) because Softek Replicator can't mirror that volume.

Softek's expertise in MS SQL Server replication doesn't translate well to Exchange in terms of high availability. Softek Replicator can mirror Exchange quite effectively, but in terms of knowing what Exchange is doing, the software is barely there.

Custom tools and scripts will have to be built by the user/administrator to 1) detect the state of the back-up server, 2) automate the process of terminating the replicant pair's relationship, and then 3) restore Exchange server in an orderly manner - the data's there but Softek Replicator doesn't do this for Exchange.
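The three tasks listed above could be strung together along the following lines. This is only a skeleton under stated assumptions: the function and its three callables are hypothetical placeholders, and the real work (checking the back-up server, breaking the mirror relationship, running Exchange recovery) would have to be supplied by the administrator's own tools.

```python
from typing import Callable, List


def run_manual_failover(
    backup_is_healthy: Callable[[], bool],
    break_mirror_pair: Callable[[], None],
    start_exchange_recovery: Callable[[], None],
) -> List[str]:
    """Run the three steps in order and return a log of what happened.

    Each callable is a hypothetical hook for a site-specific script.
    """
    log: List[str] = []
    # Step 1: detect the state of the back-up server before touching anything.
    if not backup_is_healthy():
        log.append("abort: back-up server not ready")
        return log
    log.append("back-up server healthy")
    # Step 2: terminate the mirrored pair's relationship.
    break_mirror_pair()
    log.append("mirror pair relationship terminated")
    # Step 3: restore Exchange in an orderly manner.
    start_exchange_recovery()
    log.append("Exchange recovery started")
    return log


# Stub hooks that only record the order in which they were called.
calls: List[str] = []
log = run_manual_failover(
    backup_is_healthy=lambda: True,
    break_mirror_pair=lambda: calls.append("break"),
    start_exchange_recovery=lambda: calls.append("recover"),
)
print(log)
```

Passing the steps in as callables keeps the ordering logic testable with stubs before it is pointed at production servers.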

Conclusion

These products represent four ways to increase Exchange availability.

At the high end of the pricing chart, LeftHand Networks provides a holistic solution - all of the storage components are included for organizations that find themselves facing mushrooming storage problems - and in need of all of the bits to make the solution work.

But it's very clear that the features of an Exchange-specific product, XOSoft's WANSync HA Exchange, make a big difference in how effective it can be in getting Exchange servers back up and running after a disastrous situation.

Henderson is managing director of ExtremeLabs of Indianapolis. He can be reached at thenderson@extremelabs.com.

Henderson is also a member of the Network World Lab Alliance, a cooperative of the premier reviewers in the network industry, each bringing to bear years of practical experience on every review. For more Lab Alliance information, including what it takes to become a member, go to www.nwfusion.com/alliance.

WANSync HA Exchange 3.5.2 Build 45
Overall rating: 4.65
Company: XOSoft. Cost: $18,240. Pros: Well-tuned for both Exchange 2000 and 2003; astute deployment; very good availability. Con: Initial replication is slow.

Double-Take for Windows 4.3
Overall rating: 4
Company: NSI Software. Cost: $17,980. Pros: Management Console very good if devoid of Exchange specifics; second-fastest availability time. Con: Requires more initial configuration.

SAN/iQ Software, Remote IP Copy Software, NSM 200 SAN Module
Overall rating: 3.5
Company: LeftHand Networks. Cost: $41,500. Pros: Integrated with SAN; fast replication; manageable. Cons: Weak setup for Exchange; slow to availability.

Softek Replicator 2.1.2
Overall rating: 2.8
Company: Fujitsu/Softek. Cost: $26,000*. Pro: Least expensive if single CPU licenses used. Cons: Mirrors Exchange, but doesn’t trigger failover; slow to availability.

The breakdown

Category (weight) | WANSync HA Exchange | Double-Take for Windows | SAN/iQ, Remote IP Copy and NSM 200 | Softek Replicator
Availability (40%) | 5 | 4 | 3 | 2
Installation/Configuration (30%) | 4.5 | 4 | 4 | 3
Management (20%) | 4.5 | 4 | 4 | 4
Documentation/Support (10%) | 4 | 4 | 3 | 3
Total score | 4.65 | 4 | 3.5 | 2.8

Scoring Key: 5: Exceptional; 4: Very good; 3: Average; 2: Below average; 1: Consistently subpar
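
For readers who want to check the arithmetic, each total score is the weighted sum of the four category scores. The short sketch below recomputes the totals from the category weights and scores in the breakdown; the variable and function names are ours, for illustration only.

```python
# Category weights from the breakdown.
WEIGHTS = {
    "Availability": 0.40,
    "Installation/Configuration": 0.30,
    "Management": 0.20,
    "Documentation/Support": 0.10,
}

# Category scores per product, in the same order as WEIGHTS.
SCORES = {
    "WANSync HA Exchange": [5, 4.5, 4.5, 4],
    "Double-Take for Windows": [4, 4, 4, 4],
    "SAN/iQ, Remote IP Copy and NSM 200": [3, 4, 4, 3],
    "Softek Replicator": [2, 3, 4, 3],
}


def total_score(category_scores):
    """Weighted sum of category scores, rounded to two decimal places."""
    weighted = sum(w * s for w, s in zip(WEIGHTS.values(), category_scores))
    return round(weighted, 2)


totals = {product: total_score(scores) for product, scores in SCORES.items()}
for product, total in totals.items():
    print(f"{product}: {total}")
```

Because Availability carries 40% of the weight, the spread in that one category accounts for most of the gap between the top and bottom products.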

*Pricing listed here reflects licenses for dual CPU machines as tested.