You might have been alarmed to read recently that half of all network problems are due to human error. Well, bad news: that's true if you count the number of problems. If you count the hours of degraded or failed operation, three-quarters of it is due to human error. Furthermore, the great majority of degraded or failed operation can be traced to four specific activities:

Fault analysis and response, which network professionals and their management say creates 36% of error-induced outage time
Configuration changes (attributed to 27% of error-induced outage time)
Scaling and failover tasks (attributed to 19% of error-induced outage time)
Security policies (attributed to 18% of error-induced outage time)

Not surprisingly, network professionals are eager to find remedies for each of these four culprits. Before that can happen, it's important to understand why the human error occurs.
My research points to a handful of specific errors, and each is associated with more than one of the four activities. In fact, almost all the common errors can impact all of the activities, but it's best to focus on the error conditions that are the major contributors to outage time. They are:

Events overwhelm the operations staff
Operations staff "loses the picture"
Cross-dependencies between IT/software configuration and network configuration
Incorrect, incomplete, and dated documentation
Troublesome gear
Under-qualified and under-trained staff

Event flood
The first of our error causes, cited as a problem by every enterprise I've talked with, is that events overwhelm the operations staff. Most planned improvements to network operations centers (NOCs) focus on trying to reduce "event load" through things like root cause analysis, and AI tools (not generative AI) hold a lot of promise here.
However, enterprises say that most of these overload errors are caused by the lack of a single person in charge. Ops centers often go off on multiple tangents when there's a flood of alerts, and this puts staff at cross-purposes. "If you divide your NOC staff by geographic or technical responsibility, you're inviting colliding responses," one user said. A NOC coordinator sitting at a "single pane of glass" and driving the overall response to a problem is the only way to go.
Losing the picture
Event floods relate to the second of our error causes: the operations staff "loses the picture," which is reported by 83% of enterprises. In fact, NOC tools that filter errors or suggest root causes contribute to this problem by disguising some potential issues or creating tunnel vision among the NOC staff. According to enterprises, people making "local" changes regularly forget to consider the impact of those changes on the rest of the network. They suggest that before any configuration change is made anywhere, even in response to a fault, the rest of the NOC team should be consulted and should sign off on the approach.
Network/IT dependencies
Just over three-quarters of enterprises say that cross-dependencies between IT/software configuration and network configuration are a significant source of errors. Almost all of these users say they've experienced failures because application hosting or configuration was changed without checking whether the changes could impact the network (the reverse is reported by only half that number). Overall, this source of human error is responsible for nearly all the problems with configuration changes and most of the problems with scaling and failover.
Enterprises think the best solution to this problem is explicit coordination between the IT and network operations teams on any change in application deployment or network configuration.
That can reduce problems, but it won't do much to find and fix the ones that slip through. The solution there is to improve application observability within the NOC, something only a quarter of enterprises say they support. If there's an overall NOC coordinator with a network single pane of glass, then that pane should also provide an overview of application state, at least in terms of input/output rates. Users also suggest that any time steps are taken to change a network/IT configuration, parallel steps to reverse the changes should be prepared.
Documentation
The next error cause is one most users sympathize with, even though only 70% say it results in significant network outages. Incorrect, incomplete, and dated documentation on operations software and network equipment is sometimes a root cause in itself, but it more often contributes to operations confusion. A third of enterprises say their operations library "should be better organized and maintained," and I suspect that's true of almost every operations library. A little less than ten percent of enterprises say they don't really have a formal library at all. For a problem reported this often, the solution is fairly easy: enterprises need both a formal technical library and a technical librarian responsible for checking regularly with vendors to keep it up to date. One in five enterprises say they have a "procedure" for library maintenance, but less than half that number say they have even a part-time librarian, and frankly I don't believe the real number is even that high. The library should also collect anecdotal sources like tech media, and file stories and documents with the proper vendor/product information.
That means having anyone who follows tech publications feed appropriate material to the tech librarian.
Troublesome gear
Next on our list is a troublesome piece of equipment or service connection. Remember the old "cry wolf" story? Repeated problems that generate events not only tend to immunize operations people to the specific problem but can also desensitize them to the event type overall. A repeated line-error problem, for example, may cause the staff to overlook line errors elsewhere. Only 23% of enterprises say this is a significant problem, but all of those who have something constantly generating events that demand attention say it's caused their staff to overlook something else. The solution is to change out gear that creates repeated alerts and to report service issues to the provider, escalating the complaint as needed. NOC procedures should require that a digest of faults be prepared at least once per shift and reviewed to spot trouble areas.
Staff, skills and training
Last on our list is under-qualified and/or under-trained staff, but it's not last because it's least. This problem is cited by just under 85% of enterprises, and I suspect from my longer-term exposure that it's even more widespread than that. There are two faces to this problem. First, the staff may not be able to handle their jobs properly because they lack general skills and training. Second, the staff may have issues with a newly introduced technology, whether a feature, a package, or a piece of equipment.
Addressing the first face of the problem, according to enterprises, requires thinking in terms of apprenticeship. A new employee should serve a period under close supervision, during which they're trained in an organized way on the specific requirements of your own network, its equipment, and its management tools.
The apprenticeship might be extended to add formal training if required, and it doesn't end until the mentor signs off. Certifications, which enterprises say are helpful for the second face of the problem, aren't as useful for the first. "Certifications tell you how to do something. Mentoring tells you what to do," according to one network professional.
Mapping errors to error-prone activities
What's the impact of errors on the four error-prone activities? Below is a breakdown of the four activities, the specific errors committed, and enterprise IT professionals' views on how often the errors happen and how serious they are. (For my research, a common occurrence is one that's reported at least monthly, an occasional one four to six times a year, and a rare one once a year or less. A serious impact refers to a major disruption, and a significant impact refers to an outage that impacts operations.)
Fault analysis and response
Event flood: Common occurrence, serious impact
Losing the picture: Common occurrence, serious impact
Network/IT dependencies: Occasional occurrence, serious impact
Documentation: Common occurrence, serious impact
Troublesome gear: Occasional occurrence, significant impact
Staff, skills and training: Common occurrence, serious impact
Configuration changes
Event flood: Rare occurrence, significant to serious impact
Losing the picture: Common occurrence, significant impact
Network/IT dependencies: Common occurrence, serious impact
Documentation: Occasional occurrence, significant impact
Troublesome gear: Rare occurrence, significant impact
Staff, skills and training: Common occurrence, serious impact
Scaling and failover
Event flood: Occasional occurrence, serious impact
Losing the picture: Occasional occurrence, significant impact
Network/IT dependencies: Common occurrence, serious impact
Documentation: Occasional occurrence, significant impact
Troublesome gear: Occasional occurrence, significant impact
Staff, skills and training: Common occurrence, serious impact
Security policies
Event flood: Rare occurrence, serious impact
Losing the picture: Occasional occurrence, serious impact
Network/IT dependencies: Occasional occurrence, serious impact
Documentation: Occasional occurrence, significant impact
Troublesome gear: Rare occurrence, significant impact
Staff, skills and training: Common occurrence, serious impact
Gauging the impact
How can enterprises organize the solutions to all these issues? The first step is to plot your own network problems in a similar way, and focus on the areas where the problems have the greatest impact. The second step is to look for tools and procedures that address specific problems, not ones that "improve" management or serve some other vague mission; layers of tools with marginal value can be a problem in themselves. The third step is to test any changes systemically, even though you've justified them with a specific problem in mind. It's not uncommon to find that a solution to one problem exacerbates another.
Don't fall into a simplification trap here. "Top-down" or "certification" or "single pane of glass" aren't fail-safe. They may not even be useful. Your problems are a result of your situation, and your solutions have to be tuned to your own operations. Take the time to do a thoughtful analysis, and you might be surprised at how quickly you see results.
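To make that first step concrete, here is a minimal sketch of plotting your own problems, seeded with the occurrence and impact ratings from the survey breakdown above. The numeric weights are an illustrative assumption, not part of the research; in practice you would substitute ratings drawn from your own incident records.

```python
# Sketch: rank the four error-prone activities by a rough priority score.
# Weights below are assumptions for illustration, not from the survey.
OCCURRENCE = {"rare": 1, "occasional": 2, "common": 3}
IMPACT = {"significant": 1, "serious": 2}

# (activity, error cause) -> (occurrence, impact), transcribed from the
# breakdown above. "Significant to serious" is recorded here as "serious".
RATINGS = {
    ("fault analysis and response", "event flood"): ("common", "serious"),
    ("fault analysis and response", "losing the picture"): ("common", "serious"),
    ("fault analysis and response", "network/IT dependencies"): ("occasional", "serious"),
    ("fault analysis and response", "documentation"): ("common", "serious"),
    ("fault analysis and response", "troublesome gear"): ("occasional", "significant"),
    ("fault analysis and response", "staff, skills and training"): ("common", "serious"),
    ("configuration changes", "event flood"): ("rare", "serious"),
    ("configuration changes", "losing the picture"): ("common", "significant"),
    ("configuration changes", "network/IT dependencies"): ("common", "serious"),
    ("configuration changes", "documentation"): ("occasional", "significant"),
    ("configuration changes", "troublesome gear"): ("rare", "significant"),
    ("configuration changes", "staff, skills and training"): ("common", "serious"),
    ("scaling and failover", "event flood"): ("occasional", "serious"),
    ("scaling and failover", "losing the picture"): ("occasional", "significant"),
    ("scaling and failover", "network/IT dependencies"): ("common", "serious"),
    ("scaling and failover", "documentation"): ("occasional", "significant"),
    ("scaling and failover", "troublesome gear"): ("occasional", "significant"),
    ("scaling and failover", "staff, skills and training"): ("common", "serious"),
    ("security policies", "event flood"): ("rare", "serious"),
    ("security policies", "losing the picture"): ("occasional", "serious"),
    ("security policies", "network/IT dependencies"): ("occasional", "serious"),
    ("security policies", "documentation"): ("occasional", "significant"),
    ("security policies", "troublesome gear"): ("rare", "significant"),
    ("security policies", "staff, skills and training"): ("common", "serious"),
}

def priority(activity):
    """Sum occurrence-weight times impact-weight over all error causes."""
    return sum(
        OCCURRENCE[occ] * IMPACT[imp]
        for (act, _cause), (occ, imp) in RATINGS.items()
        if act == activity
    )

# Rank activities so remediation effort goes where the impact is greatest.
activities = sorted({act for act, _ in RATINGS}, key=priority, reverse=True)
for act in activities:
    print(f"{act}: priority score {priority(act)}")
```

With these assumed weights, fault analysis and response comes out on top, which matches the outage-time shares reported at the start of the article; the point of the exercise is to see whether your own data produces the same ordering.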