Is your business playing Russian Roulette with system availability?

Server clustering and data mirroring for high availability

Every year, I hear dozens of horror stories from customers about server and network outages and the resulting loss of data and productivity. For a brief moment, some network users may find an outage a bit charming, as older colleagues lean back and reflect, “This is the way it was back in the seventies – no Internet, no e-mail, not even a fax machine. Just typewriters, phones, and Uncle Sam’s mail.”

Such nostalgia is invariably short-lived, though. Today, it's all about immediacy of access to information, applications and one another. Even small enterprises are increasingly online, mobile and Web 2.0-driven, to the point where IT is no longer just a business tool. It is business – the heart and the circulatory system through which most transactions flow. If your IT systems fail, your daily operations follow – and if the outage lasts too long, your business may fail.

So small-to-midsize businesses (SMBs) should ask themselves how they can create a high-availability infrastructure that responds robustly to new-age business challenges and disruptions. Server clustering and data mirroring can play an important role in implementing high availability. They can also serve as a cornerstone of an effective business continuity and disaster-recovery strategy, and – good news – they can be very affordable.

Clustering and mirroring for high availability

Server clustering serves several objectives: scalability, load balancing and, of course, higher system availability. Clustering for high availability enables automated failover between servers in the cluster, backed by close monitoring of applications and all their components, including operating system, server hardware, networking and storage.

The clustering software determines when to perform a failover by continually checking each application's "heartbeat" signal, and if one system has a problem, the application on another server in the cluster takes over. To the outside world, the cluster appears to be a single system, but intelligent redundancy within it creates high availability.
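The heartbeat-and-failover logic described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular clustering product works: the `Node` and `Cluster` classes, the server names and the three-second timeout are all assumptions made for the example, and timestamps are passed in explicitly so the behavior is easy to follow.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    last_heartbeat: float  # time (in seconds) of the most recent heartbeat


class Cluster:
    """Toy cluster: one active node, the rest on standby (illustrative only)."""

    def __init__(self, nodes, timeout=3.0):
        self.nodes = nodes
        self.timeout = timeout        # assumed: seconds of silence before a node counts as failed
        self.active = nodes[0].name   # the first node starts as the active one

    def heartbeat(self, name, now):
        """Record a heartbeat received from the named node."""
        for node in self.nodes:
            if node.name == name:
                node.last_heartbeat = now

    def is_alive(self, node, now):
        return now - node.last_heartbeat <= self.timeout

    def check(self, now):
        """Fail over to the first healthy standby if the active node has gone silent."""
        current = next(n for n in self.nodes if n.name == self.active)
        if self.is_alive(current, now):
            return self.active
        for node in self.nodes:
            if node.name != self.active and self.is_alive(node, now):
                self.active = node.name
                break
        return self.active


cluster = Cluster([Node("server-a", 0.0), Node("server-b", 0.0)])
cluster.heartbeat("server-b", 4.0)   # server-a has stopped sending heartbeats
print(cluster.check(now=5.0))        # server-b takes over
```

Clients keep talking to "the cluster"; only the value of `active` changes, which is the single-system appearance the paragraph above describes.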

Application availability is only half of the IT requirement. The data that applications create and use must be equally available in order for business to continue. Disk mirroring is the recording of redundant data on two partitions of the same disk or two separate disks, for fault-tolerant operation.

Mirroring is a central component in the highest level of data protection and disaster recovery, and it differs from ordinary backups, which simply replicate a complete volume at specific points in time, often for use in testing. Mirroring creates dynamic, real-time copies of data volumes, which further reduces the amount of data at risk of loss. Mirroring can be done using RAID (Redundant Array of Independent Disks) Level 1. RAID can be provided through the motherboard or a controller card, or built into a dedicated disk array.

Benefits and challenges

Server clustering provides three key benefits:

 High availability: Designed to avoid a single point of failure.

 Scalability: Computing power can be increased by adding more processors or computers.

 Manageability: Appears as a single-system image with a single point of control.

While clustering provides significant benefits, IT managers must also be aware of the related challenges. A clustered environment can be complicated to manage, especially if your staff is new to the technology. If IT is unable to perform basic checks, such as confirming that a patch has been applied correctly to all nodes in the cluster, serious outages could result. Finally, if the SMB uses a service-oriented architecture, in which applications work in tandem, it will need solutions that understand the dependencies among them.

The benefits of data mirroring:

 Protects against data loss: Added redundancy offers backup in case of hardware failure.

 Disaster protection: Offers quick recovery from site- and region-wide incidents.

 Individual disk access: Each disk or set of disks in the mirror can be accessed separately for reading purposes.

Although mirroring is essential to ensuring high availability of data, it's not a complete data protection solution by itself. Mirroring is ineffective if the data is corrupted. For example, a virus might corrupt or erase data, or a user might accidentally delete data. This is why data protection in the form of regular backups is also necessary for file-level protection.
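A toy model makes the distinction concrete. The sketch below is only an illustration under stated assumptions: `MirroredVolume` is a hypothetical class that stores files in two in-memory dictionaries standing in for two disks, and `snapshot` stands in for a point-in-time backup. It shows that a mirror survives the loss of one disk but faithfully repeats an accidental deletion on both copies, while the backup still holds the data.

```python
import copy


class MirroredVolume:
    """Toy RAID 1-style volume: every write and delete is applied to both disks."""

    def __init__(self):
        self.disks = [{}, {}]  # two dictionaries stand in for two physical disks

    def write(self, path, data):
        for disk in self.disks:
            disk[path] = data

    def delete(self, path):
        for disk in self.disks:
            disk.pop(path, None)

    def read(self, path, failed_disk=None):
        """Read from the first disk that has not failed."""
        for index, disk in enumerate(self.disks):
            if index != failed_disk:
                return disk.get(path)
        return None

    def snapshot(self):
        """Point-in-time copy, standing in for a regular backup."""
        return copy.deepcopy(self.disks[0])


volume = MirroredVolume()
volume.write("orders.db", "v1")
backup = volume.snapshot()                       # backup taken before the mistake

print(volume.read("orders.db", failed_disk=0))   # mirror survives a disk failure
volume.delete("orders.db")                       # accidental deletion hits both disks
print(volume.read("orders.db"))                  # the mirror no longer has the file
print(backup["orders.db"])                       # the backup still does
```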

Advice for IT

When SMBs decide to implement clustering and mirroring as part of a healthy high-availability solution and business continuity/disaster-recovery plan, the implementation should be managed carefully to maximize the benefits. Consider the following:

 It's all about the bucks: Systems that provide data protection and recovery in an hour, a day or a week are less expensive than ones that deliver business-critical service, which should experience close to zero downtime. You and your business's key managers need to look at all of the business functions and processes that depend on IT. Then ask, "What is the financial impact on each of these services if IT goes down?"

 Always start with the application: A critical first step is determining which applications require 24x7 availability. To help with this task, SMBs can build a dependency tree for each application that should be available. Make a list of what makes the application work (such as switch, server, desktop).

 RPO, RTO: Determine your business's recovery point objective (RPO) and recovery time objective (RTO). The RPO, in effect, is the amount of data loss your business can sustain, while the RTO is the amount of time you can afford your systems to be down – the maximum tolerable outage. If a disaster occurs, how much time can your business afford to lose? An hour? A day? A week? This depends on the nature of your particular business and your owners' or managers' appetite for business risk, so it's important that IT alone does not decide what the RPO and RTO are.

 Five nines: SMBs should aim toward five-nines reliability, which means systems are available 99.999% of the time. That said, not all businesses need or can achieve five nines; four or three nines may be adequate in some cases. The decimal-point differences may seem like hair splitting, but they reflect significant differences in the duration or frequency of outages. Think about it this way – a system that is 99.999% available to a business that operates only 40 hours per week (and most operate more hours than that) is unavailable for a little over one minute per year. One that is 99.99% available is unavailable for about 12 minutes per year. One that is 99.9% available is unavailable for about two hours per year – and of course, management doesn't get to decide which two.

How much does two hours of down time matter to your business, especially if you can't pick and choose which two hours you lose? That question demonstrates the Russian Roulette of ignoring system availability in your business plan.
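The downtime behind each "nines" level is simple arithmetic: operating hours per year multiplied by the fraction of time the system is unavailable. The sketch below assumes a 52-week year of 40-hour weeks (2,080 operating hours), matching the example above; the function name and constants are chosen for illustration only.

```python
HOURS_PER_WEEK = 40   # assumed business hours, as in the example above
WEEKS_PER_YEAR = 52


def downtime_minutes(availability_percent, hours_per_week=HOURS_PER_WEEK):
    """Expected minutes of downtime per year at a given availability level."""
    operating_hours = hours_per_week * WEEKS_PER_YEAR
    unavailable_fraction = 1 - availability_percent / 100
    return operating_hours * unavailable_fraction * 60


for level in (99.999, 99.99, 99.9):
    print(f"{level}% available: {downtime_minutes(level):.1f} minutes down per year")
```

Passing `hours_per_week=168` gives the figures for a business that runs around the clock, where each level of availability costs roughly four times as much downtime.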

 To outsource or not, that is the question: What is the level of service you'll need? Is there an in-house IT expert who has the bandwidth to manage server clustering and disk mirroring? If not, consider bringing in your solutions provider to do it for you, or even consider hosted services to support your business-critical infrastructure.

 Don't forget business continuity/disaster recovery: Because clustering and mirroring are part of a healthy business continuity/disaster-recovery plan, you should test your systems regularly. How often an organization can test depends on its disaster-recovery budget, but as a benchmark, SMBs should test at least twice a year. If it is impossible to test the entire system, periodically test the most critical applications and systems.
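The dependency tree suggested under "Always start with the application" can start as something very simple. The sketch below is one possible way to record it, assuming a hypothetical order-entry application; every component name in `DEPENDS_ON` is invented for the example. Walking the tree yields the full list of items that must stay up for the application to work, which is also a reasonable checklist of what to test.

```python
# Hypothetical dependency map: each item lists what it needs in order to work.
DEPENDS_ON = {
    "order-entry app": ["app server", "database server"],
    "app server": ["core switch"],
    "database server": ["core switch", "disk array"],
    "core switch": [],
    "disk array": [],
}


def all_dependencies(item, depends_on):
    """Return every component the item relies on, directly or indirectly."""
    found = set()
    stack = list(depends_on.get(item, []))
    while stack:
        component = stack.pop()
        if component not in found:   # skip anything already visited
            found.add(component)
            stack.extend(depends_on.get(component, []))
    return found


print(sorted(all_dependencies("order-entry app", DEPENDS_ON)))
# ['app server', 'core switch', 'database server', 'disk array']
```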

According to Gartner, improving availability will help to reduce direct loss of revenue and loss of future revenue, revenue loss through failure to meet contractual obligations, productivity loss or overtime costs, and damaged reputation. Remember, your system is your business, and your business is your system.

Copyright © 2009 IDG Communications, Inc.
