Exchange 2010 Client Access Server (CAS) and Hub Transport Server (HT) Redundancy

Providing High Availability and Failover in a Server or Site Failure

The question has come up a couple of times on my blog about high availability and redundancy of Database Availability Groups (DAGs), specifically: “What happens to the Client Access Server (CAS) and Hub Transport (HT) roles when the Exchange 2010 server fails over to another site?” or “What is the best practice for configuring CAS and HT so that when the DAG fails over, the CAS and HT will also fail over?”

These are EXCELLENT questions, as it is quite amazing how many organizations put in HA and DR for their Exchange databases but do nothing when it comes to client or mail access.  The best one I’ve seen is an organization that spent millions on high availability, disaster recovery, and security so that they were able to fail over their primary headquarters facility to their secondary datacenter with a click of a mouse button, and the datacenter was secured tight as a drum.  HOWEVER, one afternoon a fire broke out on a lower floor of the building and asbestos was found, effectively putting the entire building under quarantine.  No employees could get into the building to work, and the IT department couldn’t even get to the datacenter in the building to “click the button” to fail over to their secondary datacenter because the security they had didn’t allow remote access to initiate the failover.  So, the best laid plans for High Availability and Disaster Recovery really need to be thought through and tested to make sure they work.

So on to the whole CAS/HT failover piece: how should CAS and HT servers be placed so that when the databases fail over to another datacenter, the CAS and HT are set up to “frontend” the data?

I pinged my co-author of the book “Exchange 2010 Unleashed,” Andrew Abbate, who is by far the top expert in the world when it comes to high availability and disaster recovery of Exchange, for his input.  Andrew has designed and implemented Exchange 2003, 2007, and 2010 environments where there are 10,000, 25,000, or more mailboxes failing over to 1, 2, or many datacenters.  I figured querying Andrew would net out the best “best practice” around.  Sure enough, Andrew had some great info to share, so I decided to cut/paste an email thread I had with Andrew and put his input into this blog…

The following is a (cleaned up) excerpt of Andrew’s suggestions:

Designing CAS/HT for Exchange 2007

{as much as I’ve been focusing my blog posts around Exchange 2010, Andrew provided some great input on Exchange 2007 CAS/HT configuration, so figured I’d slip this in}

With Exchange 2007 Cluster Continuous Replication (CCR), you've got both mailbox nodes in the same Active Directory site since Exchange 2007 CCR by definition has to “stretch the cluster” and thus “stretch the LAN segment for the cluster”.  Thus the CAS/HT systems servicing the CCR mailboxes are in the same AD site.  Because the cluster site is stretched, users routinely connect through a CAS server in a remote site.  Let's say you've got 2 CAS/HT in each site, load balanced amongst themselves for OWA - then statistically, half the time the users connect through the CAS/HT server in your primary datacenter site and half the time users connect through the CAS/HT server in your secondary (DR) site.  This is not an elegant outcome of Exchange 2007 CCR.

Admittedly in Exchange 2007, the CAS isn't doing a whole lot related to Outlook (initial connection, Offline Address Book sync, autodiscover and thus availability, etc).  Hub Transport, on the other hand, isn't as inert in 2007, so in the "2 in each site" scenario above, half your messages are going from primary site to secondary site and back to primary site for a local send.  Not the end of the world, but something to be aware of.  Because this type of routing is not very efficient, the best practice in the case of Exchange 2007 for Hub Transport routing is to have mail submission ignore a given Hub Transport server (here, the ones in the remote site).  There is a PowerShell command that provides that functionality, and it can be combined with something like Microsoft System Center Operations Manager (SCOM) to monitor Exchange so that if it sees the "local" HTs are unavailable, it'll activate the remote ones again.  CCR in Exchange 2007 really focuses on mailbox redundancy.  It would seem that CAS/HT redundancy was something of an afterthought in Exchange 2007.  We have come up with a number of best practice workarounds to address the limitations caused by the CAS/HT failover when CCR fails over, but thankfully Exchange 2010 is now here and these Exchange 2007 issues are behind us.
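
{To put a number on the "half your messages" point, here is a quick back-of-the-envelope sketch in Python.  It is only an illustration: it assumes submissions are spread evenly across every eligible Hub Transport server, and the function name and parameters are made up for this post rather than taken from Exchange.}

```python
def remote_hop_fraction(local_hubs: int, remote_hubs: int,
                        ignore_remote: bool = False) -> float:
    """Expected fraction of submissions that land on a remote-site HT server.

    Assumption (not Exchange behavior verbatim): submissions are spread
    evenly across all eligible Hub Transport servers.
    """
    # Remote HTs are only ignored while at least one local HT is available;
    # if the local ones are gone, the remote ones are "activated" again.
    if ignore_remote and local_hubs > 0:
        eligible_remote = 0
    else:
        eligible_remote = remote_hubs

    total = local_hubs + eligible_remote
    if total == 0:
        raise ValueError("no Hub Transport servers available")
    return eligible_remote / total


if __name__ == "__main__":
    print(remote_hop_fraction(2, 2))                      # 2 HT in each site
    print(remote_hop_fraction(2, 2, ignore_remote=True))  # remote HTs ignored
    print(remote_hop_fraction(0, 2, ignore_remote=True))  # local HTs down
```

{With 2 HT servers in each site the function returns 0.5, which is the "half your messages" case; ignoring the remote servers drops it to zero until the local ones become unavailable.}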

Designing CAS/HT for Exchange 2010

With Exchange 2010 and Database Availability Group (DAG) failover of the database, the Client Access Server / Hub Transport Server design is a very different scenario since you don't have to stretch the Active Directory sites in Exchange 2010.  Exchange 2010 uses CAS Arrays, which define the available CAS servers for clients.  (Incidentally, this means you can have a CAS in a site that isn't part of the array, so that it could be dedicated to "non-client" services.)  The CAS arrays are referenced in Active Directory by DNS names, which can be altered so that "Site A now goes to Site B for CAS array as their local is down".  Normally the CAS array is associated with a specific site, which is how Autodiscover finds the RPC endpoint for the user and how it decides which InternalURLs and ExternalURLs to pass to the client.  The CAS array itself has no "intelligence" associated with it - there is no Exchange level communication between the members.  Any "load balancing" is entirely separate from Exchange - Windows Network Load Balancing (NLB) or a 3rd party appliance.  Once again, System Center Operations Manager (SCOM) to the rescue - if all members of the CAS array in Site A are unavailable, make the array object for Site A resolve to the one in Site B until it's back up and running.
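
{The "if all members in Site A are unavailable, resolve to Site B" rule is easy to picture as a small decision function.  The Python below is a sketch of that logic only, not something Exchange or SCOM ships: the class, the host names, and the health flags are all hypothetical, and in a real deployment the monitoring system would supply the health data and update DNS.}

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CasArray:
    fqdn: str                       # DNS name of the array (hypothetical)
    site: str                       # AD site the array is associated with
    member_health: Dict[str, bool]  # CAS host name -> True if it responds


def resolve_target(local: CasArray, remote: CasArray) -> str:
    """Return the array name that the local array's DNS record should point to."""
    # Any healthy local member means the local array can keep serving clients.
    if any(local.member_health.values()):
        return local.fqdn
    # Every local member is down: point clients at the other site's array.
    if any(remote.member_health.values()):
        return remote.fqdn
    # Nothing is healthy anywhere; leave DNS alone rather than flap it.
    return local.fqdn


if __name__ == "__main__":
    site_a = CasArray("cas-a.example.com", "SiteA", {"cas1": False, "cas2": False})
    site_b = CasArray("cas-b.example.com", "SiteB", {"cas3": True, "cas4": True})
    print(resolve_target(site_a, site_b))
```

{Once a Site A member is healthy again, the same function returns the Site A name, which matches the "until it's back up and running" part of the rule.}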

The bigger question is "What do you need to accomplish"?  If the goal is "How do I make sure my services are up in a site?" the answer is 'create a CAS array and load balance its members'.  If the question is "What do I do if the whole site goes down?" then one of the slickest things to do is to configure Outlook to "on fast network, connect via RPC, on slow network, connect via HTTPS" - that way when the "local site" CAS array is down, the user’s Outlook client cannot make a local RPC connection, so it’ll fail over to going out to the internet and connecting as Outlook Anywhere.  This solution for Outlook works for Exchange 2007 and Exchange 2003 as well, so long as you're in cached mode.
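
{The Outlook trick boils down to "try the preferred transport first, then fall back to the other one."  Here is a simplified Python model of that ordering.  It is not Outlook's actual algorithm, and the function and parameter names are invented for illustration.}

```python
from typing import Optional


def choose_transport(fast_network: bool, rpc_ok: bool, https_ok: bool) -> Optional[str]:
    """Pick a connection method the way the Outlook setting above describes.

    fast_network: whether Outlook considers the link fast
    rpc_ok:       a direct RPC connection to the local CAS array succeeds
    https_ok:     an Outlook Anywhere (HTTPS) connection succeeds
    """
    # Fast network prefers RPC; slow network prefers HTTPS.
    order = ["rpc", "https"] if fast_network else ["https", "rpc"]
    reachable = {"rpc": rpc_ok, "https": https_ok}
    for transport in order:
        if reachable[transport]:
            return transport
    return None  # neither path works, so the client stays disconnected


if __name__ == "__main__":
    # Local CAS array is down, but Outlook Anywhere is reachable over the internet.
    print(choose_transport(fast_network=True, rpc_ok=False, https_ok=True))
```

{In the site-down case, the RPC attempt fails and the function returns "https", which is the Outlook Anywhere failover described above.}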

As for redundancy for mailflow?  In Exchange 2010, you have to have a Hub Transport server in each site no matter what, so you can't really "site failover" that, but the fact that you have a HT server in each site means that when the primary site is down, the HT server in the secondary site will handle incoming and outgoing email requests from the Exchange 2010 mailbox server automatically.

What it comes down to is that High Availability in Exchange 2010 is totally automated, in that the Database Availability Groups will automatically fail over a database from one server to another.  Disaster Recovery in Exchange 2010 is “mostly” automated, as there are scenarios where the failover of the CAS and HT roles can be helped along by configuring the user’s Outlook client, as well as by using something like Microsoft System Center Operations Manager 2007 to fill in areas of product limitation if you need a better level of redundancy management.

Bottom line, for CAS servers in Exchange 2010, using the CAS Array capabilities of Exchange 2010 will allow you to create a CAS Array in each Exchange site and then configure the system to make an array object in your primary site resolve to a CAS Array in your secondary site until the primary site is back up and running.  And for HT servers, putting at least one in each site, which is a requirement anyway, will provide routing of mail from the Exchange 2010 mailbox servers in the same site as the HT server(s).

So, I’m hoping this provides a better snapshot on how Client Access and Hub Transport servers can best be configured in an environment where you are failing over databases using the Database Availability Groups (DAGs) in Exchange 2010.


Copyright © 2009 IDG Communications, Inc.