It has not been a good few months for the health and consistency of airline information technology. Two huge outages within a couple of weeks of each other -- caused by simple component failures -- resulted in massive passenger disruptions and cost two U.S. airlines millions of dollars in lost revenue and customer compensation.
These events, while of course most painful for those who have experienced them, present quite a few opportunities for learning and improving our own processes, and that's what I'd like to explore in this piece.
The Delta and Southwest outages show how a single IT failure in the wrong place at the wrong time -- even after all these years of disaster-recovery planning and talk of its importance -- can cost millions in a matter of hours.
We have had decades of high-availability options: methodologies to scale up with beefier redundant hardware or scale out across cheap commodity boxes on hot standby and in clusters; failover options for both Windows and Linux that can move operations across geographies in a matter of seconds; and now infrastructure-as-a-service offerings that let you run backup operations in someone else's data center when you need to.
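To make the failover idea concrete, here is a minimal sketch of the core logic an active/standby setup performs: probe the primary site first and route to the next healthy standby in order. The endpoint names and health-check function here are hypothetical, purely for illustration -- in practice this decision is made by cluster managers, DNS failover, or load balancers, not application code like this.

```python
def first_healthy(endpoints, is_healthy):
    """Return the first endpoint that passes its health check.

    endpoints  -- list of endpoints, primary first, standbys after
    is_healthy -- callable returning True if an endpoint is usable
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    raise RuntimeError("no healthy endpoint available")


# Simulated outage: the primary data center is down, so traffic
# should move to the hot standby. All names are hypothetical.
down = {"primary-dc"}
route = first_healthy(
    ["primary-dc", "standby-dc", "cloud-dr"],
    lambda e: e not in down,
)
```

The ordering of the list encodes the failover priority: the on-premises standby is preferred over the cloud disaster-recovery site, which is only used when everything else is unreachable.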
These options have all come down in cost, too. Where building any sort of failover capacity once required a budget in the millions, failover can now be as simple as purchasing a few hours' worth of runtime services with a credit card. (That is certainly too simplistic for a billion-dollar airline, but most of us do not run billion-dollar airline operations.)