Network World
Sunday, September 7, 2008

Andabatae

The ultimate test pilots - a network troubleshooting take on the pilot's checklist

Mike Melvill is the ultimate test pilot. He became the first commercial astronaut after flying SpaceShipOne to above 100 km on 21 June 2004. He flew to the edge of space without government support. Melvill has a dangerous occupation, but he has survived mainly because of his perfectionism in following the pilot's basic tool, the checklist. The checklist has important parallels in Information Technology and, in this case, networks. Another of the ultimate test pilots is Chuck Yeager. "Chuck Yeager is a pilot of unsurpassed skill and determination," said Mike Melvill. "I've met General Yeager several times, and hold him in very high esteem." Yeager’s flying experience before, during and after the war honed his skills to absolute precision. Twelve air victories, including five in one day, were an indication of his piloting prowess. But there was also a willingness to push himself to the edge of his limitations while still maintaining coolness under pressure. These ultimate test pilots are a model for any technologist and, as stated in the Project Lite blog entry, the pilot's checklist is an important tool.

Thus in the spirit of Melvill and Yeager, who are experts in the use of checklists, here is a Network Troubleshooting checklist (please add comments to the bottom of this blog if you don't agree or think I have missed something):

#1. Assumptions! What is really wrong? Is it the network that is being blamed for something else? Ask why? Up to five times! A method to work out if it really is a network problem or has its origins elsewhere is to use IPSLA. IPSLA is the renamed Cisco Service Assurance Agent. A good commercial tool is from Entuity which has an IPSLA module.

#2. Check the auto-negotiation settings. Many problems are the result of misconfigured switch or host settings. Tip: Auto is best! Charles Spurgeon has a great resource on Ethernet which includes a section on auto-negotiation. Also get hold of the Fluke poster.

#3. Check the network drivers. Most of the network drivers that ship with the operating systems are rubbish! Visit the NIC (Network Interface Card) manufacturer's web site and update them. The most popular NICs are Broadcom and Intel.

#4. Walk through the configuration. Are the IP addresses correct? Are the subnets correct? Is the right VLAN being used? Is the gateway correct? Do the basic ping and tracert tests. Don't assume that the test results from a router will be the same as from a desktop! Here are Cisco's IP troubleshooting tips.
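
The address, subnet and gateway questions can be checked mechanically before reaching for ping. What follows is a minimal sketch using Python's standard ipaddress module; the function name check_host_config and the example addresses are made up for illustration and are not from any tool mentioned here.

```python
import ipaddress


def check_host_config(address, netmask, gateway):
    """Return a list of problems found in a host's basic IPv4 settings."""
    problems = []
    iface = ipaddress.ip_interface(f"{address}/{netmask}")
    network = iface.network
    gw = ipaddress.ip_address(gateway)

    # The default gateway must be reachable on the local subnet.
    if gw not in network:
        problems.append(f"gateway {gw} is outside {network}")
    if gw == iface.ip:
        problems.append("gateway is the same as the host address")
    # On ordinary subnets the first and last addresses are not usable hosts.
    if network.prefixlen < 31 and iface.ip in (
        network.network_address,
        network.broadcast_address,
    ):
        problems.append(f"{iface.ip} is the network or broadcast address")
    return problems


# Hypothetical desktop: the mask was typed as /25 but the gateway sits in the upper half.
print(check_host_config("192.168.10.25", "255.255.255.128", "192.168.10.129"))
```

An empty list means the three values are at least consistent with each other; it says nothing about whether the VLAN or the gateway itself is working.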

#5. Kick the tyres. As is the case in the eyeball blog entry, do a visual inspection. I was once called to a factory where there was a problem. Upon inspection, the network equipment was covered in pigeon poo! The chassis had rusted and the PCBs were being affected by the stuff. No wonder there was chaos.

#6. Changes. Compare and determine differences. Firewall rule changes are often candidate changes for review. And don't discount desktop firewalls! What: conditions, activity, equipment. When: schedule, occurrence, status. Where: local, environment. How: practice, actions, procedures. Who: personnel, supervision. Review the network documentation. Is what is written there reflected in reality? A good integrated service management philosophy is required and BMC has a good change management product set.
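
If configuration snapshots are kept as text, the "compare and determine differences" step can be automated. This is only a sketch using Python's standard difflib module; the function name config_changes and the two firewall rule sets are invented for the example.

```python
import difflib


def config_changes(old_text, new_text):
    """Return the lines removed ('- ') and added ('+ ') between two snapshots."""
    diff = difflib.ndiff(old_text.splitlines(), new_text.splitlines())
    # ndiff also emits unchanged lines ('  ') and hint lines ('? '); drop those.
    return [line for line in diff if line.startswith(("- ", "+ "))]


# Hypothetical firewall rules before and after a change window.
before = "permit tcp any host 10.0.0.5 eq 80\ndeny ip any any"
after = "permit tcp any host 10.0.0.5 eq 443\ndeny ip any any"

for line in config_changes(before, after):
    print(line)
```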

#7. Power! Refer to APC's white paper on power problems. Often network equipment does not start up correctly after a power outage or is adversely affected by brownouts.

#8. Refer to those Release Notes. Somewhere in the world someone has had the same problem as you. Download and read the latest release notes for your network equipment. Often there are NIC and switch issues that are highlighted in these notes. Here is an example.

#9. Cabling. Wear and tear on cabling cannot be discounted. As a minimum invest in a tester like Fluke's LinkRunner or even the NetTool. Check for power cable runs that are in parallel to network cables. Check for dust on fibre optic connectors.

#10. Black holes. It is amazing how common black holes really are in networks and it is usually down to incorrect MTU settings. Use this guide from Microsoft to help locate the issue.
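
A common way to locate an MTU black hole is to send pings with the Don't Fragment bit set and shrink the payload until they get through. The sketch below only shows the search logic: probe is a hypothetical callback that you would implement by running such a ping (for example "ping -f -l SIZE host" on Windows or "ping -M do -s SIZE host" on Linux) and returning True when a reply arrives. The 28 bytes of overhead assume plain IPv4 with no options.

```python
IP_ICMP_OVERHEAD = 28  # 20-byte IPv4 header + 8-byte ICMP header


def find_path_mtu(probe, low=548, high=1472):
    """Binary-search the largest ICMP payload that probe() reports as delivered.

    probe(payload_size) -> bool is supplied by the caller (hypothetical here).
    Returns the path MTU in bytes, or None if even the smallest payload fails.
    """
    if not probe(low):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low + IP_ICMP_OVERHEAD


# Simulated path with a 1400-byte MTU somewhere in the middle (e.g. a tunnel).
def simulated_probe(size):
    return size + IP_ICMP_OVERHEAD <= 1400


print(find_path_mtu(simulated_probe))
```

If the result is below the 1500 bytes the hosts expect, and large transfers hang while small ones work, the black hole is a likely suspect.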

#11. Sniff free or die. Wireshark's powerful features make it the tool of choice for network troubleshooting. Load the software and capture a copy of the packets involved in the problem. This forms the basis of any extended analysis.

#12. Are the routing tables correct? "show ip route". NEDI has a feature to view the routing tables and, if these are saved on a regular basis, comparing them is a way of checking what has gone wrong.
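
Once routing tables are saved regularly, finding what changed is a simple comparison. The sketch below assumes each snapshot has already been reduced to a dictionary of prefix to next hop; parsing the router output into that form is left out, and the name diff_routes and the sample routes are made up.

```python
def diff_routes(before, after):
    """Compare two routing snapshots given as {prefix: next_hop} dictionaries.

    Returns three dictionaries: added, removed, and changed routes.
    """
    added = {p: after[p] for p in after.keys() - before.keys()}
    removed = {p: before[p] for p in before.keys() - after.keys()}
    changed = {
        p: (before[p], after[p])
        for p in before.keys() & after.keys()
        if before[p] != after[p]
    }
    return added, removed, changed


# Hypothetical snapshots taken on two different days.
monday = {"10.1.0.0/16": "192.168.1.1", "10.2.0.0/16": "192.168.1.2"}
tuesday = {"10.1.0.0/16": "192.168.1.9", "10.3.0.0/16": "192.168.1.2"}

added, removed, changed = diff_routes(monday, tuesday)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```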

#13. Is the bandwidth being saturated? FTP and email are bandwidth killers and the usual suspects. Cisco hasn't always done network management right but two of the things that really work from Cisco are IPSLA (see #1 in the checklist) and Netflow. Crannog's Netflow tracker is a great tool and you can always use the manual way with the CLI.
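
Netflow tells you who is using the link; a quick first answer to whether it is saturated can also come from two readings of an interface octet counter. This is a different and simpler technique than Netflow, sketched here with an invented function name and made-up numbers. It assumes a 32-bit counter that wraps at most once between readings.

```python
def utilization_percent(octets_start, octets_end, interval_seconds, link_bps,
                        counter_bits=32):
    """Link utilization in percent from two octet-counter readings."""
    # The modulo handles a single counter wrap between the two readings.
    delta_octets = (octets_end - octets_start) % (2 ** counter_bits)
    bits_sent = delta_octets * 8
    return 100.0 * bits_sent / (interval_seconds * link_bps)


# Hypothetical readings taken 60 seconds apart on a 10 Mbit/s link.
print(utilization_percent(0, 37_500_000, 60, 10_000_000))
```

Keep the polling interval short: on a fast link a 32-bit counter can wrap more than once between readings, which silently understates the load.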

#14. Spanning tree. Spanning tree must be set up in a deterministic fashion and not in a default manner. And hubs in a switched network are disasters in waiting. Also make sure a techie hasn't left a span port enabled and then reallocated it later. When you have some time, study the following spanning tree guide.

#15. QoS settings. Have the correct bandwidth allocations been made and are they correct end to end?

#16. Buffers and peaks. Don't be caught out by the averages. A 20-minute average on a link graph will hide the small 5-second 100% utilization peak that is breaking everything.
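
The arithmetic is worth seeing once. The sketch below builds 20 minutes of one-second utilization samples; the 10% background load is an assumption chosen for illustration, and only the 5-second burst at 100% comes from the example above.

```python
def average_and_peak(samples):
    """Return (average, peak) of a list of utilization percentages."""
    return sum(samples) / len(samples), max(samples)


# 1200 one-second samples: 1195 s at an assumed 10% load, 5 s at 100%.
samples = [10.0] * 1195 + [100.0] * 5

average, peak = average_and_peak(samples)
print(f"average {average:.3f}%, peak {peak:.0f}%")  # average 10.375%, peak 100%
```

The burst moves the 20-minute average by less than half a percentage point, which is why the graph looks healthy while the buffers overflow.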

#17. Are the security nutters up to anything? Those vulnerability scans often cause more trouble than they prevent. Death by firing squad at dawn is the only punishment for those doing vulnerability scans across a WAN link.

#18. Vendor finger pointing. Never trust a carrier or service provider when their lips are moving.

#19. Name resolution. Is name resolution working correctly?
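
A quick way to answer that question from a script is to ask the operating system's resolver directly. This is a minimal sketch using Python's standard socket module; the function name resolve is made up, and the resolver parameter exists only so the check can be exercised without a live DNS server.

```python
import socket


def resolve(name, resolver=socket.getaddrinfo):
    """Return the sorted unique addresses for name, or [] if resolution fails."""
    try:
        results = resolver(name, None)
    except socket.gaierror:
        return []
    # Each result is (family, type, proto, canonname, sockaddr); the address
    # is the first element of sockaddr.
    return sorted({entry[4][0] for entry in results})


# Stand-in resolver returning a documentation address, for illustration only.
def fake_resolver(name, port):
    return [
        (socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", 0)),
        (socket.AF_INET, socket.SOCK_DGRAM, 17, "", ("192.0.2.10", 0)),
    ]


print(resolve("intranet.example", resolver=fake_resolver))
```

Run it without the resolver argument on the affected desktop and on a working one; an empty list or a different address on one of them points at name resolution rather than the network.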

#20. Complexity. Often network engineers try to show their worth and justify large pay packages by designing complex networks. The true worth of a good design is shown when it is normalized and taken down to its simplest form. A simple network is less likely to go belly up. Unfortunately, the greater the design complexity, the more money the salesmen make. Additionally, hoof those pizza-box network switches into the trash. Why stack twelve 3950's when one 4510 will do? I suggest that at the next Burning Man all those silly 8, 12, and 24 port switches are stacked, covered in petrol and lit!

#21. Finally, the best check to use is the one that pre-empts the issues. Fundamentally, this requires a good network configuration management tool and continuous reviews. Is this being done proactively? Check out NEDI.

Did you know the origin of the pilot's checklist was in Boeing's Flying Fortress? Read here to find out more.


About Ronald Bartels

Ronald is an IT firefighter who enjoys the thrill of solving and analyzing problems. He was painted into a corner to become an IT firefighter because, as a network engineer, he quickly learned that everyone blamed the network when there was a problem. He now works in the field of infrastructure architecture and service management.
