The latest outages at Amazon and RIM have been attributed to capacity and storage upgrades, respectively. In both cases, the real root cause of the problem is neither the capacity nor the upgrade. Why did the capacity problem occur? Why did the upgrade have an adverse effect?
There are two major categories of problems: those above the water line and those below it. As with an iceberg, the visible problems above the water line are few in number but major in scale, while those below the water line are larger in number but smaller in scale.
The problems at Amazon and RIM are above the water line due to their visible impact, but generic studies show that for each problem occurring above the water line, 600 have happened below it.
This is a big number, but a simple strategy exists to deal with it. It is derived from the No Broken Windows theory. If you deal with these small problems and prevent them from occurring, it follows that it will take longer before a major visible problem presents itself.
600 seems a really big number, but then the Pareto principle kicks in. If we were to determine a root cause for each problem, 80% of the problems could be addressed with only 20% of the root causes. Let us assume you identify 20 root causes. Then, to achieve better stability, you need to address only 4 of them. Now, I think that makes the whole process more manageable!
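To make the Pareto step concrete, here is a minimal Python sketch of how you might pick the "vital few" root causes from a tally of incidents. The function name `vital_few`, the cause names and the counts are all made up for illustration; they are not taken from any real incident log.

```python
def vital_few(cause_counts, coverage=0.8):
    """Return the most frequent root causes that together account for
    at least `coverage` of all recorded problems."""
    total = sum(cause_counts.values())
    if total == 0:
        return []
    selected = []
    covered = 0
    # Walk through the causes from most to least frequent.
    ranked = sorted(cause_counts.items(), key=lambda kv: kv[1], reverse=True)
    for cause, count in ranked:
        if covered >= coverage * total:
            break
        selected.append(cause)
        covered += count
    return selected


# Hypothetical tally: root cause -> number of problems traced back to it.
incidents = {
    "nic_mismatch": 50,
    "dns": 35,
    "cabling": 10,
    "firmware": 3,
    "other": 2,
}

print(vital_few(incidents))
```

With this invented tally, two of the five causes already cover 85 of the 100 problems, so those are the ones to tackle first. Real data will rarely split exactly 80/20, which is why the sketch works from a coverage threshold rather than a fixed number of causes.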
In dealing with network problems, the same root causes have come up so often that I devised a checklist. I know that this list is by no means perfect, so I would be interested to know: what are the most common network problems? A Google search led me to an interesting article published by Network World in 2003, "Ten common management mistakes".
The biggest problem that I still encounter is the NIC settings mismatch. On large networks this problem can exist on 20% to 30% of the network nodes, and with a concerted effort it can be brought under control. Taken into context, the impact is much larger than expected because many of the systems are distributed and require the collaboration of multiple nodes and servers to function correctly. Even a small system relies on a minimum of 10 network nodes, and if 2 or 3 of them are not functioning correctly, it influences the whole system. Left unchecked and unmanaged, the NIC problem could influence and undermine ALL your systems.
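A rough back-of-the-envelope calculation shows why the impact is so much larger than the raw percentage suggests. The Python sketch below assumes, purely for illustration, that each node has the mismatch independently of the others, which is a simplification; the function name `p_system_affected` is likewise just illustrative.

```python
def p_system_affected(p_node_faulty, nodes_required):
    """Probability that at least one of the nodes a system relies on is
    faulty, assuming nodes are faulty independently of each other."""
    return 1 - (1 - p_node_faulty) ** nodes_required


# 20% and 30% of nodes affected; a small system relying on 10 nodes.
for p in (0.2, 0.3):
    chance = p_system_affected(p, 10)
    print(f"{p:.0%} of nodes affected: {chance:.0%} chance the system touches a bad node")
```

Under that assumption, with 20% of nodes affected, a system relying on 10 nodes has roughly an 89% chance of depending on at least one bad node, and at 30% the figure rises to about 97%. In other words, nearly every distributed system on the network is exposed.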
Besides the NIC problem what are the other most common network problems that you encounter?