Our bullet-proof LAN failed. Here’s what we learned

Three risks and three remedies to help ensure this doesn't happen to you


In my organization we manage almost all IT in-house, including the LAN, which is highly redundant. We have 35 Floor Wiring Concentrators around campus, each with around 300 active ports, and the concentrators have dual Gigabit uplinks to network cores in two data centers that are 300 meters apart. The data centers are linked by multiple 10Gb links, and each is connected to the Internet via two trunks that follow different paths.

That should be just about bullet-proof, right?

Well, we recently suffered a four-hour outage. One of the data center distribution switches began generating malformed packets, which propagated to the two backbone switches. These became unstable and the spanning tree algorithm broke down, killing all network connectivity for end-users. To make matters worse, restarting the two core switches caused two of their modules to fail – a known problem that had been flagged in the vendor updates, which we hadn’t read.

Our response to the outage was professional, but ad-hoc, and the minutes trying to resolve the problem slipped into hours. We didn’t have a plan for responding to this type of incident, and, as luck would have it, our one and only network guru was away on leave. In the end, we needed vendor experts to identify the cause and recover the situation.

From this experience we identified three key risks to network continuity and three corresponding remedies.

Risk 1: The greater the complexity of failover, the greater the risk of failure. The original network design from around ten years ago had been simple, with one core switch in each data center and one interconnection. As the network grew, it was necessary to add multiple distribution layers within each of the data centers, and this additional complexity increases the difficulty of troubleshooting, especially when the root cause is a subtle problem rather than outright hardware failure. Basically, the network is more complex than it needs to be.

Remedy 1: Make the network no more complex than it needs to be. This is a key architecture design principle that is often overlooked by zealous network engineers (or zealous vendor salespeople). The philosophical question that also arises is whether it is actually worthwhile to invest in a zero-downtime system, or whether it is better to have simpler and cheaper manual failover mechanisms, which potentially imply more outages but with shorter recovery times. That decision will depend on the organization, but for ours we are seriously considering dumbing down the network in order to minimize downtime when outages do occur.
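One rough way to frame that trade-off is to treat expected annual downtime as outage frequency multiplied by mean recovery time. The short Python sketch below does only that. Every figure in it is a hypothetical placeholder chosen for illustration, not a measurement from our network, and the function name is my own invention.

```python
# Illustrative sketch only: all numbers are hypothetical assumptions,
# not data from the outage described in this article.

def expected_annual_downtime_hours(outages_per_year: float,
                                   mean_hours_to_recover: float) -> float:
    """Expected downtime per year = outage frequency x mean time to recover."""
    return outages_per_year * mean_hours_to_recover

# Hypothetical complex design with automatic failover:
# outages are rare, but diagnosing them takes a long time.
complex_design = expected_annual_downtime_hours(outages_per_year=0.5,
                                                mean_hours_to_recover=4.0)

# Hypothetical simpler design with manual failover:
# outages are more frequent, but recovery is quick.
simple_design = expected_annual_downtime_hours(outages_per_year=3.0,
                                               mean_hours_to_recover=0.5)

print(f"complex: {complex_design:.1f} h/year, simple: {simple_design:.1f} h/year")
```

With these made-up inputs the simpler design comes out ahead, but the result flips easily with different assumptions, which is why the decision has to be made per organization.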

Risk 2: The greater the reliability, the greater the risk of not having operational procedures in place to respond to a crisis. As the saying goes, success leads to complacency, and it’s easy for a successful and reliable network to lead to complacency in terms of operational monitoring and business continuity response plans. With a bullet-proof configuration such as ours, the network can’t fail, so who would need a plan, right? Even after our outage, some of our technicians were saying it was a one-off and couldn’t happen again.

Remedy 2: Plan, document and test. Needless to say, having good up-to-date documentation of the network configuration is essential, but it should also be kept in hard copy, off the network, to ensure it is accessible when the network fails. Having an incident response plan is also crucial for understanding roles and responsibilities and avoiding a crowd-sourcing approach to problem solving.

Similarly, setting up effective communication channels, both within IT and with end-users, is something that needs to be planned. During our outage, we were left without any way to communicate with end-users beyond walking around the campus informing them. We did identify an option of using the public address system, but since no policy was in place for using it outside of emergency situations, that option couldn’t be used.

Risk 3: The greater the reliability, the greater the risk of not having people who can fix a problem. I used to own an old car that required a good deal of routine maintenance – oil, water, and a change of a tire once every two months. I’m no mechanic, but I became quietly confident that I could fix any minor problems that came up. About four years ago, I bought a brand new car that never fails, and should anything go wrong, a warning light will tell me to take it to the dealer. I’m no longer expected to fix my car and therefore I’ve lost the skill to do so. The car analogy can be applied to network equipment. Network staff are in danger of having increasingly superficial knowledge of equipment and may be at a loss when it comes to in-depth troubleshooting.

Remedy 3: Get the right people in-house or outsource it. IT is known for attracting a certain type of personality, but within IT there are certain specializations – and networking is one of them – that require a certain obsession that is difficult for outsiders to understand. To build and maintain a reliable network, at least two of these people are required. The alternative is to outsource the problem.

While some may feel uncomfortable about outsourcing such a key element of IT operations, the reality is that there are many outsourcing companies with specialists who can help you address critical needs. The choice will vary by organization, but in an increasingly industrialized IT landscape, the case for maintaining such skills in-house is becoming harder to make.

While these risks and remedies primarily concern the operational level, IT of course has to operate within the broader organizational business continuity plans, driven by business requirements. Gartner, for example, has identified five principles of organizational resilience, of which only one relates tightly to IT (Systems, with the other four being Leadership, Culture, People and Settings). Using scenario planning, our intention is to analyze some of the broad risks we face, understand the potential impact at the business level, and then identify options for reducing those risks.

The identification of these three risks and remedies has helped us move forward after our outage. I hope they can help you avoid such outages in your organization.

Whimpenny is the Senior Officer for IT Architecture in the IT Division of the Food and Agriculture Organization of the United Nations (FAO). The views expressed here are those of the author and do not necessarily reflect the views of the Food and Agriculture Organization of the United Nations (FAO).
