Disaster Recovery using Exchange 2010 Database Availability Groups (DAGs)

Leveraging Exchange 2010's Built-in High Availability and Site Failover Technology

For this blog post, I'm going to jump right into the topic of greatest interest to organizations deploying Exchange Server 2010: disaster recovery of databases. New to Exchange 2010 is the concept of the Database Availability Group, or DAG, which effectively allows an organization to have up to 16 replicated copies of an Exchange Database (EDB).

With Exchange 2003 and prior, there was only a single EDB that held a user's mailbox.  If the EDB became corrupt, or went offline because of a server, disk, or site failure, the affected users could not access their mail, calendars, or contacts.  A whole industry arose around recovering that single Exchange database.  Storage Area Network (SAN) vendors took snapshots of Exchange databases; Network Appliance (NetApp), for example, has SnapManager for Exchange (SME), which effectively allows its SANs to replicate Exchange databases for redundancy.  Other solutions included software-based database replication from companies like DoubleTake and appliance-based Exchange availability solutions from companies like Teneros.  All of these 3rd party products effectively replicated the Exchange database so that the organization could either recover quickly from a server or database failure, or have real-time failover to a secondary copy of mail.

Then Exchange 2007 came out, in which Microsoft replicated the entire Exchange EDB database from a primary active server to a secondary passive server.  This technology, called Cluster Continuous Replication (CCR), gave an organization a duplicate copy of the Exchange database on a secondary system, with failover from the primary to the secondary server taking about 1-2 minutes.  What's nice about CCR is that it also provided failback from the secondary system to the primary system in the same 1-2 minute timeframe.  And with Exchange 2007 SP1 and its support for Windows Server 2008 failover clustering, an organization was able to fail over and fail back Exchange CCR between sites in different geographies.

My company has helped hundreds of organizations (including some of the largest companies in the world) set up, test, and implement geo-cluster failovers of Exchange 2007 databases so that if a server in Site A fails, a server in Site B comes online and hosts the organization's email automatically.

Microsoft also released Standby Continuous Replication, or SCR, with Exchange 2007 SP1.  SCR provided a 3rd copy of the mail in yet another location, so that an organization would effectively have an active and a passive copy of their mail, plus a replica in a 3rd location purely for DR reasons.  CCR and SCR were revolutionary in terms of providing "out of the box" high availability and disaster recovery of Exchange databases and servers.

The major challenge with CCR and SCR is that failover is done server to server, and is primary to secondary in nature.  So Site A / Server 1 fails over to a backup server, Site B / Server 2.  However, if Site B wanted to have a local copy of their own mail, they had to set up a completely separate pair of servers so that Site B / Server 3 would fail over to Site A / Server 4.  This meant that an organization would have several servers running that were badly under-utilized, as the passive nodes would only come online in the event of an active node failure.

This is where Exchange 2010 Database Availability Groups come in.  Failover is now done at the database level, and each Exchange 2010 Enterprise Edition server can host up to 100 databases.  So effectively Site A / Server 1 could have 10 databases, of which 5 fail over to Site A / Server 2 and 5 fail over to, say, Site B / Server 3.  And since the failover is done database by database, Site B / Server 3 can also host, say, 20 databases, of which 10 fail over to Site A / Server 2 and 10 fail over to Site A / Server 1.  With this fully meshed failover / failback arrangement and support for up to 16 copies of a database across the enterprise, an organization can have fully meshed high availability and disaster recovery of Exchange databases.  Failover and failback of Exchange 2010 databases take 30 to 60 seconds, and running on top of Windows Server 2008 SP2 or higher, the organization can fail over and back across a wide area network.
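To make the meshed layout above concrete, here is a minimal Python sketch of the idea, not of anything Exchange actually runs. It assumes each database is simply an ordered list of servers holding a copy, with the first surviving server becoming the active one. The server names, database names, and the `Database`, `build_mesh`, and `active_counts` helpers are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional


@dataclass
class Database:
    name: str
    # Servers holding a copy, in order of preference (index 0 is the normal host).
    copies: List[str]

    def active_server(self, failed: FrozenSet[str]) -> Optional[str]:
        """Return the first surviving server in preference order, or None."""
        for server in self.copies:
            if server not in failed:
                return server
        return None


def build_mesh() -> List[Database]:
    """Build the example from the text: 10 databases on Server 1, 20 on Server 3."""
    dbs = []
    # Server 1's databases: 5 fail over to Server 2, 5 to Server 3.
    for i in range(1, 11):
        standby = "SiteA-Server2" if i <= 5 else "SiteB-Server3"
        dbs.append(Database(f"S1-DB{i:02d}", ["SiteA-Server1", standby]))
    # Server 3's databases: 10 fail over to Server 2, 10 to Server 1.
    for i in range(1, 21):
        standby = "SiteA-Server2" if i <= 10 else "SiteA-Server1"
        dbs.append(Database(f"S3-DB{i:02d}", ["SiteB-Server3", standby]))
    return dbs


def active_counts(dbs: List[Database], failed: FrozenSet[str] = frozenset()) -> Dict[str, int]:
    """Count how many databases each server is actively hosting."""
    counts: Dict[str, int] = {}
    for db in dbs:
        server = db.active_server(failed)
        key = server if server is not None else "offline"
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    mesh = build_mesh()
    print("normal:       ", active_counts(mesh))
    print("Server 1 down:", active_counts(mesh, frozenset({"SiteA-Server1"})))
```

Running a sketch like this against a planned layout is a quick way to see how much extra load each surviving server would pick up after a failure, which feeds directly into hardware sizing.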

The basic process of creating a Database Availability Group is as follows:

Install Exchange 2010 with the Mailbox Server role onto a server running Windows Server 2008 SP2 or higher.

1. Launch the Exchange Management Console.

2. Expand Organization Configuration.

3. Click Mailbox.

4. In the middle pane, click the Database Availability Group tab.

5. In the right pane, click New Database Availability Group.

6. When prompted, enter a unique name for the Database Availability Group along with the file share witness path and directory which were created earlier. Click New.

7. When the wizard has completed, click Finish.
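The same DAG can also be created from the Exchange Management Shell with the `New-DatabaseAvailabilityGroup` cmdlet. As a rough sketch, the Python helper below only assembles that command line as text so it can be reviewed or pasted into the shell; it does not talk to Exchange. The `new_dag_command` function, the sample names, and the witness path are hypothetical, and the 15-character name check reflects my understanding of the cmdlet's limit, so verify it for your build.

```python
def new_dag_command(name: str, witness_server: str, witness_directory: str) -> str:
    """Build a New-DatabaseAvailabilityGroup command line as a string.

    Assumption: DAG names are limited to 15 characters.
    """
    if not name or len(name) > 15:
        raise ValueError("DAG name must be 1 to 15 characters")
    return (
        f"New-DatabaseAvailabilityGroup -Name '{name}' "
        f"-WitnessServer '{witness_server}' "
        f"-WitnessDirectory '{witness_directory}'"
    )


if __name__ == "__main__":
    # Hypothetical DAG name, witness server, and witness directory.
    print(new_dag_command("DAG1", "FS1", r"C:\DAG1"))
```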

At this point, the DAG has been created, however it has no members. Add member mailbox servers to the DAG with the following steps:

1. Launch the Exchange Management Console.

2. Expand Organization Configuration.

3. Click Mailbox.

4. In the middle pane, click the Database Availability Group tab.

5. Right click the DAG created in the previous steps and choose Manage Database Availability Group Membership.

6. When the wizard appears, click Add and choose the mailbox servers from the list that you want to join to the DAG. Click Manage.

7. The wizard might take several minutes to complete. When it has added all the necessary nodes, click Finish.
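The shell equivalent of the membership wizard is `Add-DatabaseAvailabilityGroupServer`, run once per mailbox server. The sketch below, again plain Python that only produces command text, generates one line per server and refuses to go past 16 members. The `add_member_commands` helper and the server names are illustrative assumptions.

```python
from typing import Iterable, List

MAX_DAG_MEMBERS = 16  # a DAG holds at most 16 mailbox servers


def add_member_commands(dag: str, servers: Iterable[str], existing: int = 0) -> List[str]:
    """Build one Add-DatabaseAvailabilityGroupServer command per new member."""
    unique = list(dict.fromkeys(servers))  # drop duplicates, keep order
    if existing + len(unique) > MAX_DAG_MEMBERS:
        raise ValueError(f"a DAG cannot exceed {MAX_DAG_MEMBERS} members")
    return [
        f"Add-DatabaseAvailabilityGroupServer -Identity '{dag}' -MailboxServer '{server}'"
        for server in unique
    ]


if __name__ == "__main__":
    # Hypothetical mailbox server names.
    for line in add_member_commands("DAG1", ["MBX1", "MBX2", "MBX3"]):
        print(line)
```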

When this process has been completed on one or more nodes, the system(s) are ready for the rest of the configuration process to continue.

1. Return to Exchange Management Console and expand Organization Configuration.

2. Click Mailbox. In the middle pane, click the Database Management tab.

3. In the lower pane, right-click the database you wish to replicate within the DAG.

4. Choose Add Mailbox Database Copy.

5. When the wizard launches, browse for the server in the DAG to which you want to replicate the mailbox database. Pick a Replay lag time and a truncation lag time.

6. Enter a unique preferred list sequence number and click Add.

7. When the wizard completes, click Finish.
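For scripted deployments, the copy can be added with the `Add-MailboxDatabaseCopy` cmdlet, whose `-ReplayLagTime`, `-TruncationLagTime`, and `-ActivationPreference` parameters line up with the wizard fields above (I'm assuming the wizard's preferred list sequence number maps to the activation preference). This Python sketch formats the lag times in the `d.hh:mm:ss` form the shell accepts and builds the command as text. The helper names and sample values are hypothetical, and the 14-day cap is my understanding of the Exchange 2010 limit.

```python
from datetime import timedelta

MAX_LAG = timedelta(days=14)  # assumed upper bound for both lag settings


def format_lag(lag: timedelta) -> str:
    """Format a timedelta as d.hh:mm:ss for a TimeSpan parameter."""
    if lag < timedelta(0) or lag > MAX_LAG:
        raise ValueError("lag must be between 0 and 14 days")
    total = int(lag.total_seconds())
    days, rem = divmod(total, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}.{hours:02d}:{minutes:02d}:{seconds:02d}"


def add_copy_command(database: str, server: str,
                     replay_lag: timedelta = timedelta(0),
                     truncation_lag: timedelta = timedelta(0),
                     preference: int = 2) -> str:
    """Build an Add-MailboxDatabaseCopy command line as a string."""
    if preference < 1:
        raise ValueError("activation preference starts at 1")
    return (
        f"Add-MailboxDatabaseCopy -Identity '{database}' -MailboxServer '{server}' "
        f"-ReplayLagTime {format_lag(replay_lag)} "
        f"-TruncationLagTime {format_lag(truncation_lag)} "
        f"-ActivationPreference {preference}"
    )


if __name__ == "__main__":
    # Hypothetical database and server names; one hour of replay lag.
    print(add_copy_command("DB01", "MBX2", replay_lag=timedelta(hours=1)))
```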

When the Database Availability Group is created, a computer object is created in Active Directory to represent the failover cluster virtual network name account. If a DAG is going to be recreated with the same name, this computer account must first be disabled or deleted, or the process will error out and fail.

{note: the preceding content is excerpts from my book "Exchange Server 2010 Unleashed" from Sams Publishing (authors: Morimoto, Noel, et al) where I cover DAG specifics in more detail such as getting into the pre-requisites, debugging, creating failover sites, etc}

While the failover of the Mailbox Server role makes sense, the next question that gets asked is "what about Client Access Server (CAS) failover and Hub Transport Server (HT) failover?"  The answer is quite simple: CAS and HT servers by basic definition can be set up to fail over across a LAN or WAN through simple Network Load Balancing (NLB).  As a new CAS or HT is added to a Site, the server(s) take on the failover and redundancy of other CAS and/or HT servers in the Site.  By default, the CAS server front-ending a user's mailbox passes client communications through to the Mailbox server.  In the event of a CAS server failure, NLB will fail the user's connection over to another CAS server.

Over the past two years that we have been on the early adopter program for Exchange 2010, we have tested the DAG failover and failback process, including site to site failover.  This is a proven process that leverages the same failover cluster continuous replication technology that was originally released with Exchange 2007 for server to server failover, but now extends that replication across multiple servers.

Some Best Practices we've come up with relative to using Database Availability Groups:

  • Run an additional network adapter in the DAG member nodes to properly support Windows clustering.
  • Ensure that hardware is chosen not only to support its dedicated load, but also to take over additional load when it's acting as a replica for other master copies of a mailbox database.
  • Base your disk subsystem primarily on storage capacity, as the disk performance requirements have dropped drastically.
  • Always plan for a sufficient number of TCP/IP addresses in advance to support current and future cluster needs.
  • Do not run both clustering and NLB on the same computer; it is unsupported by Microsoft because of potential hardware-sharing conflicts between MSCS and NLB.
  • Always plan for the additional WAN traffic created by adding another DAG replica that isn’t on the local LAN.
  • To avoid unwanted failover, power management should be disabled on each of the cluster nodes, both in the motherboard BIOS and in the power applet in Control Panel.
  • Thoroughly test failover and failback mechanisms after the configuration is complete and before migrating users to a Database Availability Group.
  • Make sure that mailbox databases have unique names.
  • When utilizing load balancing, make sure to load balance only the necessary ports. This will avoid the possibility of network related issues when talking to Active Directory.
  • Be sure to regularly monitor replication between DAG nodes to ensure that it is healthy.
  • Periodically test the move of master status between various copies of mailbox database groups to ensure that the data is valid and the cluster is working correctly.
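As one way to act on the monitoring advice above, the Python sketch below flags database copies that look unhealthy. It assumes you have already exported copy status into a list of dictionaries using the `Name`, `Status`, `CopyQueueLength`, and `ReplayQueueLength` fields that `Get-MailboxDatabaseCopyStatus` reports. The `unhealthy_copies` helper, the sample records, and the queue thresholds are placeholders of mine, not Microsoft guidance.

```python
from typing import Dict, List, Tuple

HEALTHY_STATES = {"Mounted", "Healthy"}


def unhealthy_copies(copies: List[Dict], max_copy_queue: int = 10,
                     max_replay_queue: int = 50) -> List[Tuple[str, List[str]]]:
    """Return (copy name, reasons) for every copy that needs attention."""
    problems = []
    for copy in copies:
        reasons = []
        if copy["Status"] not in HEALTHY_STATES:
            reasons.append(f"status is {copy['Status']}")
        if copy["CopyQueueLength"] > max_copy_queue:
            reasons.append("copy queue is backing up")
        if copy["ReplayQueueLength"] > max_replay_queue:
            reasons.append("replay queue is backing up")
        if reasons:
            problems.append((copy["Name"], reasons))
    return problems


if __name__ == "__main__":
    # Made-up sample records in the shape described above.
    sample = [
        {"Name": "DB01\\MBX1", "Status": "Mounted", "CopyQueueLength": 0, "ReplayQueueLength": 0},
        {"Name": "DB01\\MBX2", "Status": "Healthy", "CopyQueueLength": 42, "ReplayQueueLength": 3},
        {"Name": "DB02\\MBX3", "Status": "Failed", "CopyQueueLength": 0, "ReplayQueueLength": 0},
    ]
    for name, reasons in unhealthy_copies(sample):
        print(name, "->", "; ".join(reasons))
```

A copy configured with a replay lag will carry a long replay queue on purpose, so raise or skip that threshold for lagged copies.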

I'll post more on DAGs and storage in general in upcoming postings, as Database Availability Groups are one of the most innovative technologies to come out of Redmond in a very long time.  It is a technology that has helped organizations save hundreds of thousands of dollars on 3rd party high availability and disaster recovery products while providing better failover and failback capabilities straight out of the box.  Stay tuned for more on best practices and strategies for leveraging DAGs and cheap storage to drive down costs and increase recoverability.


Copyright © 2009 IDG Communications, Inc.