Bulletproofing NT
A new crop of tools provides server and NIC redundancy, ensuring high availability for your NT servers.
Thirty seconds is just a blip in a lifetime, but for many network managers it is also an annual downtime ceiling that, if exceeded, could unravel their careers.
Network downtime costs domestic businesses $4 billion annually, according to various estimates. On average, a single network outage in the retail industry costs $140,000, while in the securities sector the figure is $450,000.
That's why as Microsoft Corp.'s Windows NT Server increases in popularity, a cottage industry is springing up to deliver hardware and software aimed at improving the fault tolerance and availability of NT servers, whether they're used for traditional applications or as Web servers.
Many of the products address one of two key issues: server clustering, which is intended to guarantee server availability, even when a mission-critical system or application collapses; and preservation of the server's network connection, even in the face of network adapter, link or switch failures.
For the purposes of this article, The Tolly Group chose two products to represent each type of offering and put them through their paces in the lab. In the server clustering category, we examined the long-awaited Microsoft Cluster Server (MSCS), formerly known as Wolfpack, and Bright Tiger Technologies' ClusterCATS, a powerful offering that focuses on NT-based Web servers. We focused on MSCS because it is the only clustering software integrated within the server operating system. We examined ClusterCATS because it represents a new class of software that brings clustering to NT-based Web servers.
For providing fault tolerance at the network interface card (NIC) level, we explored two of the most innovative offerings - Adaptec, Inc.'s Duo ANA-6922A PCI and Intel Corp.'s EtherExpress PRO/100. While we found both work as advertised, we also uncovered some important limitations in terms of the network configurations in which you can deploy them.
Wolfpack attack
Microsoft recently delivered MSCS, nearly three years after the company first started talking it up, back in October 1995. Integrated as a service within NT Server 4.0 Enterprise Edition, MSCS allows two servers to work together as a single logical system, sharing a disk array subsystem and acting as hot standbys for each other.
Hardware compatibility with MSCS is limited, so don't think you can slap any two servers together and run MSCS on top.
With MSCS, NT Servers "adopt" various resources, which can be any physical or logical entity that provides a service to clients, such as a SCSI-attached disk, a set of applications or file shares. When a server fails, MSCS transfers its resources to a hot standby server.
For the evaluation, we configured a pair of Data General Corp. AViiON 200-MHz Pentium Pros as Web servers running a real-time simulated stockbroker transaction application. Each server had 64M bytes of memory, 2.1G bytes of local storage and 1.9G bytes of external shared disk storage. The servers ran Windows NT Server 4.0 Enterprise Edition (Service Pack 3), Microsoft Message Queue (MSMQ) and Internet Information Server (IIS) 3.0.
The servers were linked to a 3Com Corp. SuperStack II Switch 3300 Fast Ethernet switch, which also supported a connection to a downstream client workstation, used to enter trading orders. A SCSI cable linked both servers to a shared disk array.
To test the failover capability, we powered off Server A. It took 16 seconds for MSCS on Server B to realize its partner had disappeared. One reason it took so long is that we left the backup server at its default setting, which has it kick in only after the third attempt at polling the downed server. You can fine-tune MSCS software so the backup server steps in sooner, if necessary.
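The detection delay described above can be pictured as a missed-poll counter. The following is a minimal Python sketch of that idea, not MSCS code: the class name, the function name and the 5-second poll interval used in the usage note are assumptions made for illustration. The article reports only the three-poll default and the 16-second result.

```python
class HeartbeatMonitor:
    """Flags a cluster partner as failed after consecutive missed polls."""

    def __init__(self, misses_before_failover=3):
        # Hypothetical tunable: the number of consecutive unanswered polls
        # that triggers failover (three mirrors the default in the tests).
        self.misses_before_failover = misses_before_failover
        self.consecutive_misses = 0

    def record_poll(self, partner_answered):
        """Record one poll result; return True once failover should begin."""
        if partner_answered:
            # Any answer resets the count, so isolated drops are tolerated.
            self.consecutive_misses = 0
        else:
            self.consecutive_misses += 1
        return self.consecutive_misses >= self.misses_before_failover


def worst_case_detection_seconds(poll_interval_seconds, misses_before_failover):
    """Detection time if the partner dies immediately after a good poll."""
    return poll_interval_seconds * misses_before_failover
```

With an assumed 5-second poll interval, `worst_case_detection_seconds(5, 3)` gives 15 seconds. Lowering either value shortens detection, though it also makes it more likely that a briefly busy partner is mistaken for a dead one.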
After Server B realized its cluster partner had vanished, it took 76 seconds for the applications and processes running on Server A to fail over to Server B. Microsoft says a failed server usually recovers in about 30 seconds. MSCS required more than double that time in The Tolly Group tests because failover time is largely dependent on the applications and server processes running on your cluster.
In our tests, the server cluster was configured as a strategic business server, supporting more than a dozen applications and other resources. As the primary server failed, each of these resources had to be restarted on the backup server. Users were thus left without access to certain applications, databases and files for a total of more than 90 seconds. You'll have to determine how long your users can stand to be without access to each application as you decide which resources your various servers will support.
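The outage total follows directly from the two measurements reported above. The short Python calculation below simply adds them and compares the result with the 30-second annual ceiling cited at the start of the article. The variable names are ours.

```python
# Figures measured in the MSCS failover test.
detection_seconds = 16  # time for Server B to notice Server A was gone
failover_seconds = 76   # time to restart Server A's resources on Server B

total_outage_seconds = detection_seconds + failover_seconds

# The annual downtime ceiling mentioned in the article's opening.
annual_ceiling_seconds = 30
ceiling_multiple = total_outage_seconds / annual_ceiling_seconds

print(total_outage_seconds)        # 92
print(round(ceiling_multiple, 1))  # 3.1
```

In other words, one failover in this configuration would consume roughly three years' worth of a 30-second annual downtime budget.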
In addition to its failover capabilities, MSCS can redistribute a group of resources to a primary, or preferred, server once it has been restarted. You can define a clustered server as the preferred server to support a specific group of resources, including applications, disks and file shares.
The Tolly Group defined Server B in its cluster as a preferred server, giving it rights to support a file share group. With those file share resources running on the server, we powered off Server B; MSCS failed over the file share group to Server A. We then powered up Server B, logged on to the NT domain, and the file share group failed back to Server B.
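The failover and failback sequence in this test boils down to one rule: the preferred server hosts the group whenever it is online. Here is a minimal Python sketch of that rule. The function name and the tie-breaking choice among surviving nodes are assumptions for illustration and do not describe how MSCS is implemented.

```python
def choose_owner(preferred, online_nodes):
    """Return the node that should host a resource group, or None if none is up."""
    if preferred in online_nodes:
        # Failback: the preferred server reclaims the group once it returns.
        return preferred
    # Failover: pick a surviving node. min() only makes the choice
    # deterministic; a two-node cluster has at most one candidate anyway.
    return min(online_nodes) if online_nodes else None


# Replaying the file share test: both up, Server B powered off, Server B back.
owners = [
    choose_owner("B", {"A", "B"}),
    choose_owner("B", {"A"}),
    choose_owner("B", {"A", "B"}),
]
print(owners)  # ['B', 'A', 'B']
```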
MSCS also works with MSMQ software, previously code-named Falcon. When real-time applications are running on clustered servers, MSMQ software is deployed on the servers and attached client workstations. If a server running a real-time application fails, MSMQ running on the client detects the loss of service and builds a local message queue listing transactions yet to be performed. Once failover is complete, the client software ships the queued transactions to the new server.
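The client-side behavior described here is a store-and-forward pattern. The Python sketch below illustrates that pattern only. It does not use or imitate the MSMQ API, and the class names and the `send` callable are hypothetical.

```python
from collections import deque


class StoreAndForwardClient:
    """Queues transactions locally while the server is unreachable."""

    def __init__(self, send):
        # `send` is a hypothetical callable that delivers one transaction
        # and raises ConnectionError when the server cannot be reached.
        self._send = send
        self._pending = deque()

    def submit(self, transaction):
        """Queue a transaction, then try to deliver everything queued."""
        self._pending.append(transaction)
        self.flush()

    def flush(self):
        """Send queued transactions in order; return how many were delivered."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # server still down; keep the rest queued
            self._pending.popleft()
            sent += 1
        return sent

    @property
    def pending(self):
        return list(self._pending)


class FakeServer:
    """Stand-in for a clustered server, used only to exercise the client."""

    def __init__(self):
        self.up = True
        self.received = []

    def deliver(self, transaction):
        if not self.up:
            raise ConnectionError("server unreachable")
        self.received.append(transaction)
```

A transaction is removed from the local queue only after it has been delivered, so orders entered during the failover window are retained and then shipped in their original order once a server is reachable again.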
One slick feature of MSCS is the capability for users to link clustered resources to an NT Registry key, which stores information about what the resource was doing before it crashed. In the event of a failover, the backup server picking up those resources looks in the Registry key and learns what the application, disk device or other resource was doing so it can restore service to that state.
A downside is that MSCS does not support Dynamic Host Configuration Protocol (DHCP) servers as a resource type. Only static IP addresses or IP addresses permanently secured from a DHCP server are supported.
One cool ClusterCATS
While MSCS provides innovative clustering for NT Server Enterprise Edition, Bright Tiger brings a different brand of clustering to NT-based Web servers with its ClusterCATS (Content, Applications and Transaction Smart) software.
ClusterCATS enables you to build and manage SmartClusters, or groups of servers, applications, databases and other resources that span one or more locations. ClusterCATS performs many duties, but its chief function is to provide server load management and failover capabilities. Moreover, the software enables you to easily, and often automatically, replicate entire Web servers or specific Web server content.
To test its failover capabilities, we first set up two NT-based Web servers and replicated content across both. The Web servers were attached to a Fast Ethernet network with Internet Explorer browser clients. The environment simulated an intranet, although the software is designed to work just as well across the Internet.
With ClusterCATS operating on two Web servers, we took one of the devices out of service for maintenance. Consequently, ClusterCATS redirected all HTTP requests to the backup server. The only clue that the client browser had shifted to an alternative Web server was the backup server's URL displayed onscreen.
We used a simulated order-processing application to see how ClusterCATS would fail over transactions in midstream. Typically, when a Web server crashes, users lose their sessions and cannot reconnect. With ClusterCATS, a backup server will redirect users to a Web page that informs them of the loss of transaction data and instructs them to reenter order data. In another scenario, if you want to take a Web server down for maintenance or other reasons, ClusterCATS will cache session variables - such as user name, credit card number and order entries - and ensure that all transactions in progress are completed before the server comes down.
ClusterCATS also supports failover of Oracle Corp. and SQL databases that reside on NT-based Web sites. Bright Tiger plans to offer support for Open Database Connectivity-compliant databases in the near future.
Using ClusterCATS Monitor Agent, you can configure a probe that monitors an entire database or select tables. In the event the probe detects a loss of database service, it redirects queries to a replicated database on an alternate Web server.
In the evaluation, an SQL database was queried for product data and to create orders. We began browsing products in the database on Web Server 1, then failed the SQL database on that Web server. The transaction continued uninterrupted as queries shifted to Web Server 2.
On top of its failover capabilities, ClusterCATS manages content versions; it will redirect users' requests for data to the Web server containing the most current data.
ClusterCATS also handles load balancing among Web servers. Administrators can set two thresholds: a minimum setting that enables a Web server to redirect a percentage of queries as it becomes increasingly busy, and a maximum threshold that offloads all queries once a server reaches the limit.
In our tests, we lowered the maximum threshold to 10% of the Web server's processing capacity, meaning that when the Web server hit that load, it directed all other HTTP requests to an alternate Web server. During testing, the Web client transferred HTTP requests to another server without a hitch.
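The two-threshold scheme can be expressed as a single function of server load. The Python sketch below is an illustration under stated assumptions and is not ClusterCATS code. The function name is ours, and the linear ramp between the two thresholds is an assumption, since the article says only that a percentage of queries is redirected as the server gets busier.

```python
def redirect_fraction(load_pct, min_threshold_pct, max_threshold_pct):
    """Fraction of new requests (0.0 to 1.0) to send to an alternate server."""
    if load_pct >= max_threshold_pct:
        return 1.0  # at or above the maximum: offload everything
    if load_pct <= min_threshold_pct:
        return 0.0  # at or below the minimum: serve everything locally
    # Assumption: redirect a share that grows linearly between thresholds.
    span = max_threshold_pct - min_threshold_pct
    return (load_pct - min_threshold_pct) / span
```

With the 10% maximum used in the tests and a hypothetical 5% minimum, `redirect_fraction(10, 5, 10)` returns 1.0, matching the observed behavior in which all further HTTP requests went to the alternate server.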
Adapter fault tolerance
While clustering software can help you survive server failures, for added fault tolerance you also need to protect the server's connection to the network.
Companies such as 3Com, Intel, Adaptec and ZNYX Corp. now offer adapters that support redundant Fast Ethernet links, or connections that support Cisco Systems, Inc.'s proprietary Fast EtherChannel. Traditionally, resilient server connections have supported only dual-homed FDDI links.
In addition to providing link and port redundancy, many of these vendors are now offering features such as load balancing and port aggregation, which increase bandwidth between a server and its switch.
The products we examined - Adaptec's Duo ANA-6922A PCI and Intel's EtherExpress PRO/100 server adapters - represent different design schools.
Adaptec's Duo ANA-6922A represents a design approach shared by ZNYX, 3Com and others to populate multiple Fast Ethernet ports (up to four for Adaptec) on a single card that fills one PCI slot, enabling you to preserve valuable server slots. The idea is that the benefit of the increased port density per card outweighs the potential risk of a single card crashing, which would knock out the redundant links supported by that card.
Intel, on the other hand, only supports one active Fast Ethernet link per board, meaning it takes two server slots to nail up redundant server links. Intel emphasizes the need for component redundancy; if one board fails, the backup server connection stays intact because it resides on a physically separate board.
In our view, Intel's EtherExpress PRO/100 has a better fault-tolerant design because it removes the possibility of one adapter or PCI bus slot becoming a single point of failure, rare as such an event may be.
We tested the Adaptec and Intel adapters in a variety of scenarios. Our aim was to evaluate how the adapters failed over to a backup link after a port, link or switch outage.
In the first scenario, we plugged the Adaptec NIC into a 133-MHz Pentium-based Micron Millennia server with 64M bytes of memory and 1.2G bytes of storage. The server ran Windows NT 4.0 Server (Service Pack 3) and IIS 3.0. We nailed up two Fast Ethernet connections from the Adaptec NIC to a 3Com SuperStack II Switch 3300. A PC client also connected to the switch.
After initiating a File Transfer Protocol download of a 300M-byte file from the server to the client, we disconnected the primary Fast Ethernet link and verified that the secondary link kicked in. Failover occurred within three seconds. We then conducted a drag-and-drop file copy from the server to the client and failed the primary port again. Once the copy was complete, we used NT's "comp" command to compare file sizes and ensure the transfer occurred without a hiccup; a Wandel & Goltermann, Inc. DA-30C internetwork analyzer confirmed all frames were transmitted and compared them against previous tests conducted without port failure.
We repeated the same process using dual Intel EtherExpress adapters, pulling the primary card to determine if the hot backup would step in. Failover occurred within four to six seconds for the Intel EtherExpress.
Connecting redundant server links to a single switch works well for the server adapters tested, but it leaves something to be desired in terms of fault tolerant design. A better approach would be to connect each server link to a different switch so if one switch crashes, it doesn't bring down the primary and backup server links with it.
In this test, the Adaptec Duo ANA-6922A server adapters failed over to backup links within three seconds, and the Intel EtherExpress PRO/100 failed over within an average of four seconds. The tests prove servers can communicate Layer 2 switched traffic over their redundant links without service interruption.
Both server adapters, however, ran into trouble with their redundant links when we connected them to physically separate switches that pass traffic through an intermediate router. The upshot is you cannot set up redundant server connections to separate switches if the switches reside in different subnets.
During testing under this scenario, neither product cut over to a backup link when the primary link failed. That's because the adapters typically broadcast discovery packets onto the network to ensure the primary link is intact. However, such broadcast traffic is screened by a router and is not passed between subnetworks.
Lessons learned
All four NT fault tolerant products tested impressed our engineers.
On the server clustering front, Bright Tiger's ClusterCATS is providing a level of fault tolerance suitable for high-performance electronic commerce servers. The software is especially powerful in maintaining availability of Web-based order-entry and transaction-processing applications. In fact, its array of fault tolerant services should make it a requisite for business-class Web-based servers.
Microsoft, meanwhile, seems to have a winner with MSCS. The software provides the necessary failover services to enable a hot standby server to step into the breach in the event a primary server crashes. But our experience shows MSCS will require a fair amount of user fine-tuning to shrink failover times to acceptable levels.
On the fault tolerant adapter side, Adaptec, with its Duo ANA-6922A PCI, certainly will offer the more attractive price/performance because it packs more fault tolerant ports per card than Intel's offering. If you can live with the fact that Adaptec's design is based on the belief that its NIC won't fail, the Duo ANA-6922A will emerge as an attractive choice. Intel's EtherExpress PRO/100, however, is the preferred choice for users looking for maximum fault tolerance.
The big picture, though, is that all these tools are delivering the type of high-availability services required to make NT Server capable of supporting mission-critical applications.
Bruno is managing editor of publishing products and Kilmartin is an engineer with The Tolly Group, a Manasquan, N.J., strategic consulting and independent testing laboratory. They can be reached at cbruno@tolly.com and gkilmartin@tolly.com.