Tracking UTM high availability
The high-availability and scalability features in the enterprise UTM firewalls we tested range from very fancy to dead simple.
We gave the highest scores to products that recovered within four seconds and took points off when products took more than a minute to restart traffic flows.
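For the curious, the measurement itself is simple. Here's a minimal sketch of the idea in Python (not our actual test harness; the echo responder address is a placeholder in the RFC 5737 test range): send a probe through the firewall roughly every 100 ms and report the longest silence observed while a node is killed and the standby takes over.

```python
# Sketch of measuring an HA failover gap: probe a UDP echo service on the
# far side of the firewall and track the longest stretch with no replies.
import socket
import time

TARGET = ("192.0.2.10", 7)   # placeholder echo responder beyond the firewall
INTERVAL = 0.1               # probe roughly every 100 ms
FAIL_THRESHOLD = 4.0         # our "traffic kept flowing" cutoff, in seconds

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(INTERVAL)

last_ok = time.monotonic()
worst_gap = 0.0
try:
    while True:
        sock.sendto(b"probe", TARGET)
        try:
            sock.recvfrom(64)                     # echo came back: path is up
            now = time.monotonic()
            worst_gap = max(worst_gap, now - last_ok)
            last_ok = now
        except socket.timeout:
            pass                                  # no reply this interval
        time.sleep(INTERVAL)
except KeyboardInterrupt:
    print(f"worst gap: {worst_gap:.1f}s "
          f"({'pass' if worst_gap < FAIL_THRESHOLD else 'fail'})")
```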
While most vendors -- SonicWall and WatchGuard were the exceptions -- also offer active/active HA in which two firewalls load-balance automatically between themselves, we tested active/passive HA in which a hot standby system takes over when the active node goes down.
The argument here is that any performance benefit from an active/active configuration would pale in comparison to the guarantee that when an HA event occurs in an active/passive configuration, you'll still have the same performance you had before the event. Because a typical HA event might be a hardware failure that could take a box out for 24 to 72 hours, having the same performance before and after would be pretty important.
We made an exception to this rule for Check Point firewalls, because we had four platforms running the same software, and we wanted to see whether the different HA approaches behaved differently. On Check Point's own hardware we tested Check Point's active/active HA; on Nokia hardware we tested Nokia's IPSO clustering.
Our tests showed that the HA features in Check Point's software, running on all hardware platforms, and in Juniper products fail over with no traffic blocked (by our four-second definition). We turned off a system and sessions kept flowing through both vendors' failover UTM firewalls. This was true for the Check Point UTM-1 2050, Crossbeam C25, Nokia IP290, and both the Juniper ISG-1000 and SSG-520M firewalls.
The biggest key to nonstop success is the willingness to waste IP addresses. With Check Point HA (called ClusterXL), Nokia IPSO clustering and Juniper HA, each device has its own IP address, and the pair also shares a third IP address as well as an additional (virtual) MAC address. When an HA event occurs, the remaining node takes over the shared HA IP and MAC addresses, ensuring that nothing outside the cluster has to adjust and traffic can keep flowing as soon as the HA event is detected -- always within our four-second limit.
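The takeover mechanics resemble VRRP: when the standby goes active, it starts answering for the shared virtual IP and MAC, typically broadcasting gratuitous ARP so switches relearn which port the virtual MAC lives on. Here's a rough illustration in Python with Scapy; the addresses and interface name are made-up examples, not any vendor's actual implementation.

```python
# Illustrative sketch of the takeover step: the surviving node broadcasts
# gratuitous ARP for the cluster's shared virtual IP/MAC so switches and
# neighbors on the segment relearn the path immediately.
from scapy.all import ARP, Ether, sendp

VIRTUAL_IP = "192.0.2.1"             # cluster's shared (third) IP address
VIRTUAL_MAC = "00:00:5e:00:01:0a"    # cluster's shared virtual MAC
IFACE = "eth0"                       # LAN-facing interface on surviving node

# Gratuitous ARP: an unsolicited reply claiming the virtual IP at the
# virtual MAC, broadcast so every device on the segment updates at once.
announce = (
    Ether(src=VIRTUAL_MAC, dst="ff:ff:ff:ff:ff:ff")
    / ARP(op=2, hwsrc=VIRTUAL_MAC, psrc=VIRTUAL_IP,
          hwdst="ff:ff:ff:ff:ff:ff", pdst=VIRTUAL_IP)
)
sendp(announce, iface=IFACE, count=3, verbose=False)  # repeat for reliability
```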
Contrast that configuration with those required by the Astaro ASG425a, the FortiGate 3600A, the SonicWall PRO 5060 and the WatchGuard Firebox X8500e. All of these products make setting up high availability much easier, requiring only a single IP address for each LAN segment. However, that simplicity cost these implementations between eight and 72 seconds of zero data flow through the test bed when an HA event occurred. In most businesses, one minute of downtime after a hardware failure would be considered fantastic, but to get our highest score, systems had to detect the HA event and keep traffic flowing in less than four seconds.
Two products -- Nokia’s IP290 and Astaro’s ASG425a -- offer multinode clustering, which is a potential solution to the problem of losing a single node in a high-availability environment. With multinode clustering, you can keep adding devices into the cluster, making it (in theory) increasingly reliable and fast.
Although this seems like a particularly effective solution to high availability and scalability, remember that the entire cluster still can't go any faster than the 1Gbps physical Ethernet interfaces that feed it. With base throughput of greater than 1Gbps on a single node in half of the configurations we tested, there aren't many normal enterprise UTM firewall deployment architectures that would really take advantage of this feature.
Our adventures in HA had only two real glitches. The first was that we found the Astaro ASG425a HA to be problematic and unreliable. For example, after we rebooted one node in an HA pair, the second node decided that it was the HA “master,” leaving us with two different firewalls, each claiming to represent the cluster. That was a particularly frightening situation: if you weren't watching the HA status, you wouldn't realize that the systems were running independently, on the same IP address, with the potential for instability and the loss of any configuration changes made while in this strange state.
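A simple watchdog that polls both nodes and complains when more than one claims to be master would catch this condition early. Here's a sketch, assuming a hypothetical JSON status endpoint on each node; a real check would use whatever status interface the firewall actually exposes.

```python
# Minimal split-brain watchdog sketch. The management URLs and the JSON
# shape ({"ha_role": "master" | "backup"}) are hypothetical examples.
import json
import urllib.request

NODES = ["https://fw-a.example.net/ha/status",   # hypothetical endpoints
         "https://fw-b.example.net/ha/status"]

def ha_role(url):
    # Fetch the node's self-reported HA role from its status endpoint.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp).get("ha_role")

roles = [ha_role(url) for url in NODES]
if roles.count("master") > 1:
    print("SPLIT BRAIN: multiple nodes claim HA master:", roles)
```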
The second HA glitch we found was in Nokia's IPSO clustering feature and was specifically related to load balancing and NAT. During testing we saw throughput of the load-sharing cluster go up when we shut one of the nodes off. We did our performance testing with NAT disabled to compensate for this problem. Nokia was researching this issue at press time.