This column is available in a weekly newsletter called IT Best Practices.
Seth Lyons is a Senior Systems Engineer with a financial services firm. The company had just installed a number of new firewalls when Lyons received an email alert about a problem. One of the clusters had failed over, but the packet counts weren't incrementing correctly after the failover; they were still incrementing on the primary firewall. The alert advised him to check the configurations on the switch.
Here is the actual alert that Lyons received:
RX traffic drastically reduced post fail over, possible ARP issue
A fail over was identified at Device time: Sep 18 00:31 2013 UTC, Indeni time: Sep 17 20:31 2013 EDT. This device is now the active member of the cluster and in the period immediately following the fail over (3 minutes more or less) it received 0 packets compared to 2067098 packets that were received by sfdc-wanfw1 (18.104.22.168) in a similar amount of time immediately BEFORE the fail over. This indicates the possibility that the surrounding network equipment may not be aware of the fail over on the layer 2 level.
Manual Remediation Steps:
It is possible this is caused by the fact that during a fail over the responsibility for the virtual IPs moves from one cluster member to the other and the MAC addresses change. ClusterXL issues gratuitous ARPs to deal with this but it may not work with your equipment. Please review SK50840 for more information.
Without that level of detail in the alert and its remediation advice, how long do you think it would have taken Lyons and his team to discover and resolve the issue? We'll never know, because they followed the remediation steps and quickly got everything working properly.
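For readers who want to picture the comparison the alert describes, here is a minimal sketch in Python. It is not Indeni's implementation: the `FailoverSample` type, the `rx_dropped_after_failover` function, and the 10% ratio threshold are all hypothetical names and values chosen for illustration. The only figures taken from the alert are the two packet counts.

```python
from dataclasses import dataclass


@dataclass
class FailoverSample:
    """Packet counts for similar time windows around a failover (hypothetical type)."""
    rx_before: int  # packets received by the previously active member before the failover
    rx_after: int   # packets received by the newly active member after the failover


def rx_dropped_after_failover(sample: FailoverSample, min_ratio: float = 0.1) -> bool:
    """Return True when post-failover RX is drastically lower than pre-failover RX.

    min_ratio is an assumed threshold, not a value from the alert.
    """
    if sample.rx_before == 0:
        # No baseline traffic, so a drop cannot be inferred.
        return False
    return sample.rx_after / sample.rx_before < min_ratio


# Figures quoted in the alert: 2067098 packets before, 0 packets after.
sample = FailoverSample(rx_before=2067098, rx_after=0)
print(rx_dropped_after_failover(sample))  # True
```

A check like this only flags the symptom. The alert's value lies in pairing it with a likely cause, namely gratuitous ARPs not being honored by the surrounding equipment.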
What kind of monitoring and analysis solution has this kind of visibility into networking issues and provides this type of expert advice on how to resolve them? This is Indeni, a next-generation network operations tool that searches for and identifies difficult issues and configuration mistakes on networks and uses globally sourced expert knowledge to resolve them.
Most network monitoring solutions today use SNMP-based technology built in the 1990s that analyzes basic parameters to determine when conditions warrant administrative attention. These types of tools will tell a network engineer when something is broken, such as a firewall or router being down, or traffic not flowing the way it should. They don't say why the issue is happening, and they certainly don't send a notification about a problem before it actually occurs.
Indeni turns this model on its ear. The solution is based on having collective expert knowledge about network equipment and common configurations. Indeni automatically acquires knowledge from different sources and then uses what it knows to find issues and make recommendations on how to fix them. While traditional monitoring solutions use SNMP traps to look for symptoms of problems, Indeni says it knows how to analyze the innermost workings of a network to look for the causes of issues and identify a problem before its symptoms occur.
Just what are these special sources of knowledge? The first is vendors of network equipment; for example, Cisco, HP and Check Point. Vendors often have deep knowledge bases for customers – best practices, information from the user guide, common issues that users encounter, and so on. Indeni has developed a way to automatically consume this information and convert it into a form that the Indeni monitoring solution can use.
The second source of knowledge is Indeni's own customers. The company has found its customers are willing to share anecdotal information about the issues they have run into, best practices they have learned, and the do's and don'ts for maintaining their network. Indeni takes in this information manually and formulates it for use in its system.
A third source of knowledge is actual data from customer networks. When the solution is installed, with the customer's permission Indeni sends some configuration, statistics and log data back to the company’s data center for analysis. Indeni uses that information to learn more about how networks in general should run. For example, if 90% of Indeni's customers enable a certain configuration on a specific type of router or firewall, it is assumed to be a best practice because so many companies have enabled it. This crowd-sourced information grows more valuable as more companies implement Indeni.
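The 90% example above amounts to a simple adoption-share calculation. The sketch below shows one way it could be expressed in Python. It is an illustration under stated assumptions, not Indeni's method: the function name `widely_adopted_settings`, the setting names, and the sample data are all hypothetical, and the 0.9 threshold simply mirrors the figure in the example.

```python
from collections import Counter


def widely_adopted_settings(customer_configs, threshold=0.9):
    """Return {setting: adoption share} for settings enabled by at least
    `threshold` of customers. Each item in customer_configs is the collection
    of settings one customer has enabled (hypothetical data model)."""
    total = len(customer_configs)
    if total == 0:
        return {}
    counts = Counter()
    for enabled in customer_configs:
        counts.update(set(enabled))  # count each setting once per customer
    return {name: n / total for name, n in counts.items() if n / total >= threshold}


# Hypothetical sample: 10 customers, 9 of whom enable "anti_spoofing".
configs = [{"anti_spoofing"} for _ in range(9)] + [{"legacy_mode"}]
print(widely_adopted_settings(configs))  # {'anti_spoofing': 0.9}
```

Popularity is only a proxy for correctness, which is presumably why the column describes this source alongside vendor knowledge bases and customer-reported experience rather than on its own.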
Network equipment vendors are taking notice of this approach and partnering with Indeni for preemptive maintenance. Check Point Software Technologies just announced integration of Indeni with Check Point management servers and gateways to allow network security operations teams to constantly validate their environments' configurations against an ever-growing set of best practices. Some best practices come from Check Point's own knowledge base while others come from the practices learned in the field by users. In addition, Indeni conducts automated root cause analyses whenever an issue actually occurs.
Managed service providers are also finding interesting uses for Indeni. For example, Fujitsu manages customers' networks in the UK. Before Fujitsu onboards a new customer, it uses Indeni to sweep the network and establish a baseline of the configuration and existing issues. Fujitsu and the new customer can then agree on how to remediate those issues before Fujitsu becomes responsible for them, which reduces Fujitsu's liability. Once the customer's network is under Fujitsu's care, the MSP continues to use Indeni to catch issues that might otherwise become problems resulting in a service credit. Fujitsu has a high customer satisfaction level and spends less time troubleshooting and fixing problems.
Today the Indeni solution covers routers, switches, firewalls and load balancers. The company eventually wants to cover servers, VoIP, and devices meant to service the Internet of Things. Indeni executives say they hope to create a paradigm shift from being reactive when something breaks to being proactive in troubleshooting bad configurations and resolving issues before something breaks. They must be doing something right because they say that they have never had a customer not renew a subscription.