High availability and Heartbeat

We mentioned Heartbeat a couple of columns ago when we started on Linux Enterprise Clusters, so let's dig deeper.

Heartbeat is a subsystem that lets a primary and a back-up Linux server each determine whether the other is "alive" and, if the primary isn't, fail its resources over to the backup. Heartbeat uses inter-server signals called "heartbeats," sent over serial, User Datagram Protocol (UDP) and PPP/UDP connections, and handles the transfer of the failed server's IP addresses.

Heartbeat arose from the Heart project in 1999 and is one of the foundational technologies of the High Availability Linux Project.

Now, as simple as failover might sound, we're talking computers and networking and so, of course, it isn't. In fact, the problem is so complex that the current release only supports a pair of nodes. This will change with the forthcoming release of HA Linux Release 2 (HAL-R2) within the next couple of months.

HAL-R2 will be a major revision of the entire HA Linux system. It will extend Heartbeat's functionality to support multiple nodes, monitor resources for correct operation and handle configuration dependencies.

Being able to support multiple nodes in a cluster is crucial, as is monitoring. Resource monitoring ensures that the failure of a service provided by a node can be detected even without the node actually "dying."
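To make the distinction concrete, here is a minimal Python sketch of the idea, not HAL-R2's actual code. The `NodeStatus` type, the `failed_services` function and the service names are all our own inventions for illustration. It shows that a node can be alive while one of its services still needs attention.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeStatus:
    node_alive: bool            # are heartbeats still arriving from this node?
    services: Dict[str, bool]   # hypothetical per-service health check results


def failed_services(status: NodeStatus) -> List[str]:
    """Return the services that need to be recovered elsewhere."""
    if not status.node_alive:
        # The whole node is gone, so everything it ran has failed.
        return sorted(status.services)
    # The node is up; only services whose own check failed need recovery.
    return sorted(name for name, ok in status.services.items() if not ok)
```

Without the per-service check, the second branch would not exist and a hung Web server on a healthy node would go unnoticed.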

Dependencies, otherwise called "constraints," are important: You might never want database servers to run on the same node as Web servers, or you might want data replication services to run only on nodes that are also running the database services.
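The two kinds of constraint just described can be sketched as a simple placement check. This is an illustration under our own assumptions, not HAL-R2's configuration syntax; the function name `placement_ok` and its parameters are hypothetical.

```python
def placement_ok(placement, must_not_share, must_follow):
    """Check a proposed layout of services against two kinds of constraint.

    placement:      dict mapping node name -> set of services on that node
    must_not_share: (a, b) pairs that may never run on the same node
    must_follow:    (dependent, required) pairs; the dependent service may
                    run only on a node that also runs the required one
    """
    for services in placement.values():
        for a, b in must_not_share:
            if a in services and b in services:
                return False
        for dependent, required in must_follow:
            if dependent in services and required not in services:
                return False
    return True
```

For example, with `must_not_share=[("database", "web")]` and `must_follow=[("replication", "database")]`, a layout that puts replication on the Web node is rejected.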

The version of Heartbeat available today is a stable and effective way of ensuring that two nodes in a cluster act in a coordinated manner. Each server runs the Heartbeat daemon and exchanges messages called heartbeats that tell the other machine the sender is alive.
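The liveness test at the core of this exchange can be sketched in a few lines of Python. This is a simplified model rather than the Heartbeat daemon's implementation: the `PeerMonitor` class and its methods are hypothetical, and `deadtime` here simply means "how many seconds of silence we tolerate before declaring the peer dead."

```python
class PeerMonitor:
    """Tracks when we last heard from the other node."""

    def __init__(self, deadtime: float):
        self.deadtime = deadtime   # seconds of silence before the peer is presumed dead
        self.last_seen = None      # timestamp of the most recent heartbeat, if any

    def record_heartbeat(self, now: float) -> None:
        # Call this whenever a heartbeat message arrives from the peer.
        self.last_seen = now

    def peer_is_alive(self, now: float) -> bool:
        if self.last_seen is None:
            return False           # nothing has been heard from the peer yet
        return (now - self.last_seen) < self.deadtime
```

Passing the clock in as `now` keeps the sketch easy to test; a real daemon would read the system clock and receive the heartbeats over the serial or UDP links mentioned earlier.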

If the primary node fails, the Heartbeat daemon on the back-up node is responsible for taking over any IP addresses that must remain available after failover.

A highly reliable communications channel is required to avoid the split-brain, or (less sexily) the partitioned, cluster problem. In a split-brain situation both servers are alive and functioning, but both also believe the other is dead because the Heartbeats can no longer be seen. You now have the problem of both servers trying to provide the same services and use the same IP address for crucial client services. Even worse is when both servers share disk resources and compete for access to the same data at the same time.
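A toy Python model shows how quickly this goes wrong. The `Node` class and `ip_holders` function are hypothetical names of our own; the point is only that each node acts on its own view of the world, and a broken link makes both views wrong.

```python
class Node:
    def __init__(self, name: str, holds_ip: bool):
        self.name = name
        self.holds_ip = holds_ip   # does this node currently answer on the service IP?

    def on_heartbeat_timeout(self) -> None:
        # From this node's point of view the peer is dead, so it claims the IP.
        self.holds_ip = True


def ip_holders(nodes):
    """Return the names of all nodes that believe they own the service IP."""
    return [n.name for n in nodes if n.holds_ip]


primary = Node("primary", holds_ip=True)
backup = Node("backup", holds_ip=False)

# The heartbeat link fails while both machines keep running,
# so each one sees the other's heartbeats stop.
primary.on_heartbeat_timeout()
backup.on_heartbeat_timeout()
```

After the link failure, `ip_holders([primary, backup])` lists both nodes, which is exactly the split-brain condition.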

The solution for this problem is a component of Heartbeat referred to rather eccentrically as "Shoot The Other Node In The Head," otherwise called Stonith.

Stonith uses a controllable power control device such as the Western Telematic network power switch we discussed ages ago.

The simplest and most conventional configuration, as discussed by Karl Kopper in his book The Linux Enterprise Cluster, is to have the power control device controlled by the back-up server. This allows only for a one-way, one-time failover, and it requires operator intervention to reset the back-up server when the primary is restored.
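The ordering that matters in this one-way setup can be sketched as follows. This is our own illustration, not Stonith's API: `PowerSwitch` is a hypothetical stand-in for a network power switch, and `fail_over` is a made-up function that runs on the back-up server.

```python
class PowerSwitch:
    """Hypothetical stand-in for a network-controlled power switch."""

    def __init__(self):
        self.outlets_on = {"primary": True}

    def power_off(self, outlet: str) -> bool:
        # Cut power and report whether the outlet is now confirmed off.
        self.outlets_on[outlet] = False
        return not self.outlets_on[outlet]


def fail_over(switch, take_over_resources) -> bool:
    """Run on the back-up server once the primary is presumed dead."""
    if not switch.power_off("primary"):
        # Fencing was not confirmed, so taking over could cause split-brain.
        return False
    # The primary is definitely off; it is now safe to claim its resources.
    take_over_resources()
    return True
```

The design choice to notice is that the takeover happens only after the power-off is confirmed. If the switch cannot be reached, the back-up server does nothing rather than risk both nodes serving the same IP address.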

Heartbeat and Stonith are the foundations for a Linux Enterprise Cluster, and while building such a beast is definitely not simple, the "bang for the buck" is undeniable.

If you are nervous about building your own High Availability Linux Enterprise Cluster, you can purchase commercial implementations based on HA Linux; see the HA Linux Commercial High-Availability Software for Linux page.

We're highly available at gearhead@gibbs.com as is Gearblog.

Copyright © 2005 IDG Communications, Inc.