One of the great challenges of modern networking is the need to support services such as voice and real-time video that are quite sensitive to packet loss, transmission errors, delay, and jitter using a technology - IP - that is designed to be tolerant of packet loss, transmission errors, delay, and jitter.
We design our networks with backup links, alternate routes, redundant node components, and resiliency features such as Fast Reroute (FRR) to ensure that we can quickly recover from a detected link failure.
But detecting the link failure is the catch.
SONET detects and reacts to link failures very quickly, but SONET interfaces are outrageously expensive. Ethernet links are increasingly being used wherever possible because the interfaces are cheap and the technology is simple and well understood. But Ethernet is not adept at quickly detecting link failures. The keepalives it sends (Ethertype 0x9000) do little more than check the electrical integrity of the interface's connection to the link; they do nothing to verify functional bidirectional communication with a neighboring node. Ethernet is also slow to detect half-open links. And if a failure occurs on the far side of a switch, or in any other situation where the local interface continues to see a good signal on the wire or fiber, the failure is not detected until some Layer 3 protocol - usually the routing protocol - misses enough keepalives or Hello messages to declare its neighbor down.
This problem leads many operators to rely on their IGP to detect link loss. For example, the Hello and Dead intervals of OSPF can be cranked down to 1 second; Cisco's hello-multiplier keyword even allows you to configure a Hello interval in the sub-second range. But in any case it is unlikely that OSPF will detect a neighbor loss due to a link failure in less than 2 seconds - a very long time if voice and video packets are being dropped. And if the IGP Hellos are being processed in software at the control plane, there is a performance price to be paid for short Hello intervals.
This is where Bidirectional Forwarding Detection (BFD) can help. BFD is a lightweight, protocol-independent Hello protocol that can detect link failures in the millisecond range. Typical detection times are around 50ms. The "lightweight" nature of the protocol means that it can run in hardware on routers with ASIC-based forwarding planes, avoiding the performance price you must pay for aggressive IGP Hello intervals.
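To make the "millisecond range" concrete, here is a minimal sketch of how a BFD detection time falls out of the negotiated timers. It assumes asynchronous mode as described in RFC 5880, where the interval the neighbor actually transmits at is the larger of what we are willing to receive and what it wants to send, and the detection time is that interval multiplied by the neighbor's detect multiplier. The function name and the timer values are illustrative only; they do not come from any particular vendor's defaults.

```python
def detection_time_ms(local_required_min_rx_ms, remote_desired_min_tx_ms,
                      remote_detect_mult):
    """Return the time (ms) we wait without a Control packet before
    declaring the neighbor down (asynchronous mode, simplified)."""
    # The neighbor may not send faster than we are willing to receive,
    # so the slower of the two timers is the one that takes effect.
    negotiated_interval_ms = max(local_required_min_rx_ms,
                                 remote_desired_min_tx_ms)
    # The session fails after this many consecutive intervals of silence.
    return negotiated_interval_ms * remote_detect_mult


# Hypothetical timer values, chosen only for illustration.
fast = detection_time_ms(15, 15, 3)   # both sides agree on 15 ms
slow = detection_time_ms(50, 10, 3)   # we only accept a packet every 50 ms
print(fast, slow)
```

With these assumed values the first session lands close to the typical figure quoted above, while the second shows that the more conservative side of the link sets the pace.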
BFD is usually called a "liveness detection" protocol because it does not itself take any remedial action when a neighbor loss is detected; instead it informs the protocols running on the interface in question.
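The division of labor described here, where BFD only detects and its client protocols react, can be sketched in a few lines. This is a toy model, not a BFD implementation: the class name, method names, neighbor address, and timer values are all hypothetical, time is passed in explicitly rather than read from a clock, and the real protocol's session state machine is reduced to a single up/down flag.

```python
from typing import Callable, List


class LivenessSession:
    """Toy liveness detector: notices silence and tells its clients."""

    def __init__(self, neighbor: str, detection_time_ms: float) -> None:
        self.neighbor = neighbor
        self.detection_time_ms = detection_time_ms
        self.last_rx_ms = 0.0
        self.up = True
        self._clients: List[Callable[[str], None]] = []

    def register_client(self, on_neighbor_down: Callable[[str], None]) -> None:
        # Routing protocols (or anything else) subscribe to loss events.
        self._clients.append(on_neighbor_down)

    def control_packet_received(self, now_ms: float) -> None:
        # Any Control packet from the neighbor restarts the detection timer.
        self.last_rx_ms = now_ms

    def poll(self, now_ms: float) -> None:
        # No rerouting happens here; the session only reports the loss, once.
        if self.up and now_ms - self.last_rx_ms > self.detection_time_ms:
            self.up = False
            for notify in self._clients:
                notify(self.neighbor)


events = []
session = LivenessSession("192.0.2.1", detection_time_ms=45)
session.register_client(lambda n: events.append(("ospf", n)))
session.register_client(lambda n: events.append(("bgp", n)))
session.control_packet_received(15)
session.poll(50)    # 35 ms of silence: still within the detection time
session.poll(61)    # 46 ms of silence: both clients are informed
session.poll(100)   # already down: no duplicate notification
print(events)
```

What each client does with the notification, such as tearing down an adjacency or triggering a reroute, is entirely its own business, which is the point of keeping the detector protocol-independent.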
Currently BFD is being deployed primarily for failure detection on Ethernet links - particularly where intervening switches hide failures from the interfaces at either end.
There are a few caveats:
Although the timers can be set as low as 1ms, experience has shown that below 50ms, processing delays - even in hardware - begin to cause timing jitter. Overly aggressive timers can also cause BFD to incorrectly declare a neighbor down when its Control packets are delayed by even a brief period of congestion, or when a short burst of noise corrupts all the BFD Control packets within the expected Detection time.
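The trade-off in this caveat can be illustrated with a small deterministic model. The sketch below assumes a neighbor that sends a Control packet at every multiple of the transmit interval, a single burst of congestion or noise during which every packet is lost, and a detection time equal to the interval times the detect multiplier. The function name and all of the numbers are hypothetical; real networks lose packets far less tidily.

```python
def declared_down(tx_interval_ms, detect_mult, burst_start_ms, burst_len_ms,
                  horizon_ms=200):
    """Return True if a healthy neighbor would be declared down because of
    one loss burst (simplified model, see the assumptions above)."""
    detection_time_ms = tx_interval_ms * detect_mult
    last_rx_ms = 0                      # assume a packet arrived at t = 0
    t = tx_interval_ms
    while t <= horizon_ms:
        lost = burst_start_ms <= t < burst_start_ms + burst_len_ms
        if not lost:
            # The detection timer expired before this packet showed up.
            if t - last_rx_ms > detection_time_ms:
                return True
            last_rx_ms = t
        t += tx_interval_ms
    return horizon_ms - last_rx_ms > detection_time_ms


# The same hypothetical 40 ms burst, starting at t = 5 ms, against three timers.
aggressive = declared_down(1, 3, burst_start_ms=5, burst_len_ms=40)
moderate = declared_down(10, 3, burst_start_ms=5, burst_len_ms=40)
relaxed = declared_down(50, 3, burst_start_ms=5, burst_len_ms=40)
print(aggressive, moderate, relaxed)
```

Under these assumptions the two shorter timers tear down a session whose link never actually failed, while the 50ms interval rides out the burst. That is the reason to size timers against the disturbances you expect the link to survive, not only against the failover time you want.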
Of course your network must still reconverge around the link failure, but detecting the failure as quickly as possible is the first step in the recovery process. If you support or are planning to support applications that require high network resiliency, BFD is worth a look as a component of fast failure recovery.