The outage that hit Amazon Web Services' Simple Storage Service earlier this month might have been unusual in its impact, but not in its cause—a configuration error. Service providers suffer outages for all sorts of reasons. Backhoes take out local access loops. Seismic events and fishing trawlers cut underwater cables. And, yes, humans make mistakes.
About three years ago, I was called in by a customer to help address a problem in their MPLS network. At the time, I ran MPLS Experts, a predecessor of SD-WAN Experts, and had developed a reputation for knowing a thing or two about global MPLS/VPLS services. The customer was noticing packets with unknown IPs on its carrier-managed private network. After we reviewed the logs, the cause became apparent: One of the carrier techs had misconfigured the VRF/VFI identifiers, accidentally connecting a different customer to their private network.
The appliance-by-appliance configuration that’s a necessary part of most enterprise WANs opens gaping holes for manual misconfiguration to creep in. Some of these errors become routine annoyances, like configuration drift between locations.
How SD-WANs can prevent configuration errors
One of the many benefits of SD-WANs is that they solve this problem by replacing site-by-site configuration with policies: some defining application behavior, others specifying node configuration, and still others governing business logic. All of this helps eliminate inconsistencies between locations.
But even with policy-based operations, configuration errors can still take down your SD-WAN. At one recent customer, for example, I purposely created a conflicting route policy to see how the SD-WAN environment would behave. One vendor’s product alerted me to the error, but the other? Not a word. Notifications are important to help prevent accidentally eliminating a service chain or misconfiguring a traffic policy that will disrupt your backbone.
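To make the idea of a conflict check concrete, here is a minimal sketch of the kind of validation a platform could run before accepting a route policy. The `RoutePolicy` shape, the policy names, and the definition of "conflict" (overlapping prefixes at the same priority with different next hops) are all assumptions for illustration, not any vendor's actual schema or logic.

```python
from dataclasses import dataclass
from ipaddress import ip_network
from itertools import combinations


@dataclass(frozen=True)
class RoutePolicy:
    # Hypothetical policy shape, not taken from any SD-WAN product.
    name: str
    prefix: str    # destination prefix, e.g. "10.1.0.0/16"
    next_hop: str  # where matching traffic is sent
    priority: int  # assumed: policies at equal priority must not disagree


def find_conflicts(policies):
    """Return pairs of policy names whose prefixes overlap at the same
    priority but which point at different next hops."""
    conflicts = []
    for a, b in combinations(policies, 2):
        overlap = ip_network(a.prefix).overlaps(ip_network(b.prefix))
        if overlap and a.priority == b.priority and a.next_hop != b.next_hop:
            conflicts.append((a.name, b.name))
    return conflicts


# Made-up policies: the first two overlap at priority 10 and disagree.
policies = [
    RoutePolicy("branch-to-dc", "10.1.0.0/16", "hub-east", 10),
    RoutePolicy("voice-to-dc", "10.1.20.0/24", "hub-west", 10),
    RoutePolicy("guest-internet", "0.0.0.0/0", "local-breakout", 50),
]
conflicts = find_conflicts(policies)
for first, second in conflicts:
    print(f"ALERT: {first} conflicts with {second}")
```

A platform that stays silent in this situation is effectively skipping a check like this, or running it and not surfacing the result.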
Alerts and notifications alone, though, are not enough. Ideally the SD-WAN platform should allow for different levels of access. Some roles may be able to define policies and deploy them in test environments or smaller locations, for example, but not across the SD-WAN. That final step, pushing a policy out across the SD-WAN or impacting critical site connectivity, should require additional privileges, minimizing the likelihood of misconfigurations from inexperienced users.
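The tiered access described above might look roughly like the following sketch. The role names, the deployment scopes, and the mapping between them are hypothetical; a real platform would define its own.

```python
from enum import IntEnum


class Role(IntEnum):
    # Hypothetical roles, ordered from least to most privileged.
    VIEWER = 0
    ENGINEER = 1
    ADMIN = 2


# Assumed mapping: the minimum role needed to deploy to each scope.
REQUIRED_ROLE = {
    "test": Role.ENGINEER,    # lab or test environment
    "branch": Role.ENGINEER,  # a single smaller location
    "global": Role.ADMIN,     # the whole SD-WAN
}


def can_deploy(role, scope):
    """True if the role is privileged enough for the given scope."""
    return role >= REQUIRED_ROLE[scope]


def deploy(policy_name, scope, role):
    """Pretend to push a policy, refusing if the role lacks privileges."""
    if not can_deploy(role, scope):
        raise PermissionError(
            f"{role.name} may not deploy {policy_name!r} to scope {scope!r}"
        )
    return f"deployed {policy_name} to {scope}"


print(deploy("voice-qos", "test", Role.ENGINEER))
try:
    deploy("voice-qos", "global", Role.ENGINEER)
except PermissionError as err:
    print(f"blocked: {err}")
```

The point of the design is that the risky step fails closed: an engineer can iterate freely in a test scope, while the network-wide push needs someone with additional privileges.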
No doubt the usability of the management interface plays a big role in whether or not an engineer might “AWS” your network. Some vendors have invested significant time in building their GUIs (head nod to you folks, Silver Peak).
Silver Peak’s Unity Orchestrator is an easy-to-use interface into the company’s SD-WAN.
But even slick GUIs can lead to operator mistakes. Other vendors augment their GUIs with IOS-like CLIs. Viptela is a great example on this score.
Which is better? It really depends on the talent and expertise of your teams. GUIs are obviously easier to learn, but for old networking hounds operating in Cisco-like environments, the CLI option might be best. What’s more, with a CLI you can often script actions that might lead to mistakes in a GUI.
Plan for configuration errors
Even then, though, configuration errors can and will happen, and you need to plan for those events. The SD-WAN interface should allow you to roll back your configurations for the network, individual locations, applications and groups of users, if relevant. Rollbacks should be time-stamped to make reverting to the last functional iteration easier. If the SD-WAN logs every network configuration change, you have the record you need to recover.
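As a rough illustration of time-stamped rollback, the sketch below keeps a log of configuration versions per scope and looks up the last version committed before a given moment. The `ConfigHistory` class, the scope names, the settings, and the timestamps are all invented for the example; they do not reflect how any particular SD-WAN stores its history.

```python
from datetime import datetime, timezone


class ConfigHistory:
    """Hypothetical in-memory log of time-stamped configuration versions."""

    def __init__(self):
        self._versions = []  # entries of (timestamp, scope, config)

    def commit(self, scope, config, when=None):
        """Record a configuration for a scope (site, app, user group...)."""
        when = when or datetime.now(timezone.utc)
        self._versions.append((when, scope, dict(config)))

    def last_good(self, scope, before):
        """Return the newest config for `scope` committed before `before`."""
        candidates = [
            (ts, cfg) for ts, s, cfg in self._versions
            if s == scope and ts < before
        ]
        if not candidates:
            raise LookupError(f"no earlier version recorded for {scope}")
        # Compare on the timestamp only, then hand back a copy.
        return dict(max(candidates, key=lambda item: item[0])[1])


# Made-up timeline: a good change in the morning, a bad one after lunch.
history = ConfigHistory()
t1 = datetime(2017, 3, 1, 9, 0, tzinfo=timezone.utc)
t2 = datetime(2017, 3, 1, 14, 30, tzinfo=timezone.utc)
history.commit("site:london", {"qos": "voice-first"}, when=t1)
history.commit("site:london", {"qos": "bulk-first"}, when=t2)

restored = history.last_good("site:london", before=t2)
print(restored)
```

Keying the log by scope is what lets you revert one location or application without touching the rest of the network.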
Nobody likes outages, but if there’s something positive about the AWS outage, it’s that it reminded us of the limits of technological advancement. Whether it’s the cloud or SD-WAN, IT might have become easier, but good network engineering practices and requirements still apply.