No excuse for airline system outages

In 2016, several airlines' reputations plummeted when passengers were stalled due to system outages that could have been prevented.

In 2016, as multiple system outages led to long check-in lines, flight cancellations and passengers camping out in airports, several airlines’ reputations made unplanned descents. What could they have done differently to prevent these crises or to recover from them more rapidly?

Let’s take a look at a couple of examples.

Southwest Airlines' problem in July came down to a failed network router. Delta Air Lines' nationwide system outage in August resulted from a power surge that caused an automatic transfer switch to malfunction and take down 500 servers. Unfortunately, systems and equipment did not switch over automatically to backup power. The result of these mishaps? Each airline cancelled more than 2,000 flights and suffered negative media coverage, along with a tweet storm from frustrated customers.

It’s clear the airlines have to do more to ensure their customers experience seamless service. Yet despite the critical need for uptime, the airline industry still limps along on a core of old technology and does not appear to safeguard it by using best practices for IT infrastructure and network management.

Here are my thoughts on how airlines could strive to prevent downtime occurrences and, should they still fall victim to them, get back up and running rapidly.

Preventing airline system downtime

To avoid downtime that threatens customer service, airlines need redundant technology that’s load balanced, used and monitored as follows:

  • It takes two: The common denominator in these airline outages is a lack of redundancy. As much as possible, airlines (and other businesses with critical operations) should duplicate their technology so that if one component fails, another is standing by, ready to take over. For example, why didn’t Southwest have a redundant router? That addition to its infrastructure would have allowed operations to keep running. Instead, the airline suffered more than 2,000 flight cancellations, a severe cost for lack of preparation.

  • Maintain a balance: In addition to creating a Noah’s Ark of technology, the airlines need to load-balance it. The combined workload on two routers that act as failovers for each other, for instance, cannot exceed 100 percent of the maximum load either one can handle. If one is running at 80 percent of its maximum load and the other at 40 percent, the survivor would have to carry 120 percent after a failure, so failing over from one to the other will not be possible. To ensure a workload does not go beyond safe limits, you need to monitor your IT infrastructure.

  • Use it: Redundant technology should not sit twiddling its thumbs just waiting for a failover event. It should be used to alleviate the workload and reduce the wear and tear on other components. But what’s even more important is that if it’s up and running and administrators are constantly monitoring it, they know if it works or if it doesn’t and they can fix it before there’s a crisis.
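
The load-balancing rule above can be checked with simple arithmetic. The following is a minimal sketch in Python, assuming two devices of equal capacity whose loads are expressed as percentages of that capacity. The function names and the 100 percent default are illustrative, not taken from any particular monitoring product.

```python
def can_absorb_failover(loads_pct, capacity_pct=100.0):
    """Return True if one surviving device could carry the combined load.

    loads_pct: current load on each device in a failover pair, as a
    percentage of one device's maximum capacity (assumed equal for both).
    """
    return sum(loads_pct) <= capacity_pct


def headroom_pct(loads_pct, capacity_pct=100.0):
    """Capacity left on the survivor after a failover; negative means overload."""
    return capacity_pct - sum(loads_pct)
```

With the figures from the example above, can_absorb_failover([80, 40]) returns False because the survivor would need 120 percent of its capacity, while a pair running at 50 and 40 percent passes with 10 points of headroom. A monitoring system could run a check like this continuously and raise an alert before the pair drifts out of its safe range.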

Ensure rapid system recovery

Despite best intentions, sometimes Murphy’s Law rules: “Anything that can go wrong, will go wrong.” When it does, it should not take six to 12 hours, as in the case of Delta and Southwest, to stage a recovery. There are four ingredients to a seamless recovery: automatic failover, failover testing, geographically diverse disaster recovery sites and comprehensive monitoring.

  • Plan for flawless failovers: Because airlines need zero downtime, automatic failover can help. When a system fails, it moves an application to another server without human intervention. It can be costly, but the expense needs to be weighed against the benefit of keeping passengers travelling to their destinations on schedule.

    Whether failover is manual or automatic, airlines need to test the process and the equipment to ensure they work. They might, for example, operate from the main location for the first quarter and from the disaster recovery site for the second quarter. While they’re running operations at the disaster recovery site, they can use the time productively to do maintenance and upgrades at the main data center.
     
  • Diversify geographically: An airline based, for example, in Florida needs a disaster recovery site in another geographic region. After all, there’s a high risk that a hurricane could lay waste to its infrastructure. But even in states with lower risk profiles, disasters can still strike in the form of snowstorms, floods and fire. So, airlines should select disaster recovery sites far enough away that the same event is unlikely to hit both locations, yet close enough to keep latency low so they can mirror data efficiently.

  • Monitor the IT infrastructure 24/7/365: Often it takes too long to find the cause of the issue at the heart of a downtime event. So, airlines need to use a comprehensive IT infrastructure monitoring system. It should oversee servers, storage, SAN and applications so they can quickly spot the problem areas and drill down, for example, to the failed router. Without detailed data about their systems, administrators run in all directions, unproductively looking for a needle in a haystack. A monitoring solution, however, can highlight the issue and turn six hours of troubleshooting into 20 minutes.
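
To illustrate how monitoring and automatic failover fit together, here is a minimal sketch in Python of a controller that switches to a standby after several consecutive failed health checks. The class name, the site names used in the note below and the threshold of three missed checks are all hypothetical. A real deployment would rely on the failover features of its own clustering or load-balancing products rather than hand-written logic like this.

```python
class FailoverController:
    """Tracks health checks on a primary and promotes a standby when needed."""

    def __init__(self, primary, standby, max_missed=3):
        self.primary = primary
        self.standby = standby
        self.max_missed = max_missed  # assumed threshold, tune per system
        self.missed = 0               # consecutive failed checks so far
        self.active = primary

    def record_health_check(self, primary_healthy):
        """Record one health-check result and return the currently active system."""
        if self.active != self.primary:
            # Already failed over; failing back is left as a manual decision.
            return self.active
        if primary_healthy:
            self.missed = 0  # a single good check resets the counter
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.active = self.standby  # no human intervention needed
        return self.active
```

For instance, a controller created with FailoverController("main-dc", "dr-site") stays on "main-dc" through one or two isolated failed checks and switches to "dr-site" only after the third consecutive failure. Requiring several consecutive failures is a design choice that trades a slightly slower reaction for fewer false failovers caused by a single dropped check.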


While the airlines probably do need to upgrade some of their technology, they can take other steps to minimize downtime. To prevent problems, they should install, use and monitor redundant technology to make sure it works and is load balanced. And to minimize the length of any downtime events, they need geographically diverse disaster recovery sites, a plan for well-executed failovers, and comprehensive system monitoring.


Copyright © 2017 IDG Communications, Inc.
