
The fine art of Fibre Channel troubleshooting

By Jim Bahn, director of product marketing, Virtual Instruments, special to Network World
January 25, 2012 04:37 PM ET

Network World - This vendor-written tech primer has been edited by Network World to eliminate product promotion, but readers should note it will likely favor the submitter's approach.

Troubleshooting Fibre Channel networks can be as much an art as it is a science, but there are some basic best practices you can follow to reduce the guessing and speed resolution. Here are 10 tips to help you get to the bottom of pesky problems:

1. Generally, problems are reported by the application user. As a first step, the SAN admin will usually gather dumps, logs and traces. At the same time, he'll sometimes remove less critical users or applications, stop backups, and clear out other potential bottlenecks. While this may fix the immediate problem, it often prevents the underlying cause from being discovered. If you've only removed the symptom and you stop there, you're likely to see trouble later on.


2. Use real-time monitoring. Ask your vendors what they mean by "real time"; a five-minute polling interval is not real time. If a fire starts in your kitchen, would you like to be alerted to it immediately or in five minutes? Use the real-time alerting subsystem to get in front of issues before the application users feel the pain. We recently saw an example where we examined the I/O history leading up to an application outage and found plenty of obvious pointers four hours before the outage. If best-practice alerting had been set up, it's likely the outage could have been avoided.
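As a rough illustration of the idea, here is a minimal Python sketch of per-second threshold alerting. The `read_link_metrics` stub, the `exchange_completion_ms` metric name and the 20 ms threshold are all placeholders; substitute whatever real-time data source and baselines your own monitoring platform provides.

```python
import random
import time

# Illustrative values only -- tune thresholds to your own baselines.
LATENCY_WARN_MS = 20.0   # alert when exchange completion time exceeds this
POLL_INTERVAL_S = 1.0    # seconds, not the 5-minute intervals some tools call "real time"

def read_link_metrics(link_id):
    """Stub data source: replace with calls to your real-time probe or API.
    Here we simulate a latency reading so the sketch runs standalone."""
    return {"exchange_completion_ms": random.uniform(1.0, 30.0)}

def watch(link_ids, alert=print):
    """Poll every link each second and alert as soon as a threshold is crossed."""
    while True:
        for link in link_ids:
            latency = read_link_metrics(link)["exchange_completion_ms"]
            if latency > LATENCY_WARN_MS:
                alert(f"{link}: exchange completion {latency:.1f} ms "
                      f"exceeds {LATENCY_WARN_MS} ms threshold")
        time.sleep(POLL_INTERVAL_S)

if __name__ == "__main__":
    watch(["fabric-a/port-17", "fabric-b/port-42"])
```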

3. One of the first steps is to determine whether the user-reported problem correlates with what's happening on the SAN. But if you only investigate what the user is reporting, you may miss larger issues that affect other, slightly less latency-sensitive apps. It's useful to broaden the scope beyond just the immediate issue.
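One way to avoid that tunnel vision is to sweep every application's metrics for the incident window, not just the app that was reported. The sketch below assumes you already have per-application latency samples as (timestamp, app, latency_ms) tuples; the sample data and the 15 ms threshold are hypothetical.

```python
from collections import defaultdict

def anomalies_in_window(samples, window_start, window_end, threshold_ms=15.0):
    """Return every application whose latency breached the threshold inside
    the incident window -- not just the app the user complained about."""
    hits = defaultdict(list)
    for ts, app, latency_ms in samples:
        if window_start <= ts <= window_end and latency_ms > threshold_ms:
            hits[app].append((ts, latency_ms))
    return dict(hits)

# Hypothetical samples: (epoch seconds, application, latency in ms)
samples = [
    (1000, "billing-db",   42.0),   # the app the user reported
    (1010, "mail-archive", 31.5),   # quietly suffering too
    (1020, "web-frontend",  3.2),
]

print(anomalies_in_window(samples, window_start=990, window_end=1030))
```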

4. Having said that, you should customize existing, canned reports to quickly focus on the suspected application or infrastructure and isolate the condition. We recently talked with a customer who quickly eliminated about 4,380 of 4,400 SAN links, enabling the team to focus on the remaining 20 links for in-depth trace analysis.
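A simple filter over per-link counters captures the spirit of that elimination step. The metric names and cut-offs below are made up for illustration; in practice you would drive this from whatever error and latency counters your SAN reporting tools export.

```python
def suspect_links(link_stats, max_crc_errors=0, max_latency_ms=10.0):
    """Drop links with clean counters so trace analysis can focus on the rest."""
    return [
        link for link, stats in link_stats.items()
        if stats["crc_errors"] > max_crc_errors
        or stats["avg_latency_ms"] > max_latency_ms
    ]

# Hypothetical counters for a handful of the 4,400 links
link_stats = {
    "switch1/port3":  {"crc_errors": 0, "avg_latency_ms": 2.1},
    "switch1/port9":  {"crc_errors": 4, "avg_latency_ms": 2.4},   # suspect
    "switch2/port12": {"crc_errors": 0, "avg_latency_ms": 18.7},  # suspect
}

print(suspect_links(link_stats))   # -> ['switch1/port9', 'switch2/port12']
```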

5. Review environment inventories by device type and automatically discovered properties. Details such as manufacturer and link rate can help explain special circumstances, such as the behavior of a tape device or configuration settings the admin might not be aware of, like links set to run at 1G instead of 4G. Enable users to provide their own context about devices, such as the applications they support, location, version, and relationship to other equipment.
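An inventory sweep like the one below can surface those misconfigurations automatically. The record layout and the 1G/4G example are assumptions; adapt the fields to whatever your discovery tool actually reports.

```python
def flag_slow_links(inventory):
    """Flag links negotiated below the rate their hardware supports,
    e.g. an HBA capable of 4G stuck running at 1G."""
    return [
        dev for dev in inventory
        if dev["negotiated_gbps"] < dev["capable_gbps"]
    ]

# Hypothetical discovered inventory records
inventory = [
    {"device": "hba-esx07", "manufacturer": "VendorA",
     "capable_gbps": 4, "negotiated_gbps": 1},   # misconfigured
    {"device": "tape-lib2", "manufacturer": "VendorB",
     "capable_gbps": 4, "negotiated_gbps": 4},
]

for dev in flag_slow_links(inventory):
    print(f"{dev['device']}: running at {dev['negotiated_gbps']}G, "
          f"capable of {dev['capable_gbps']}G")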

6. As they isolate, correlate and analyze, our customers often report that most of the time, troubleshooting shows the SAN is not to blame. Tools that report on the effect of SAN latency alone on the application are very helpful in making that determination; tools that lump SAN and server latency together can't help with this.
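To see why separating the two matters, consider the arithmetic: if end-to-end response time is dominated by server-side queuing, shrinking SAN latency won't help. The sketch below assumes you have both a host-measured total I/O time and a wire-measured exchange completion time for the same operations; the field names and sample figures are illustrative.

```python
def attribute_latency(samples):
    """Split total I/O latency into the SAN-attributable portion (measured on
    the wire as exchange completion time) and everything else (server-side)."""
    report = []
    for s in samples:
        server_ms = s["total_io_ms"] - s["exchange_completion_ms"]
        report.append({
            "op": s["op"],
            "san_ms": s["exchange_completion_ms"],
            "server_ms": server_ms,
            "san_share": s["exchange_completion_ms"] / s["total_io_ms"],
        })
    return report

# Hypothetical measurements for two I/O operations
samples = [
    {"op": "read-1",  "total_io_ms": 25.0, "exchange_completion_ms": 2.0},
    {"op": "write-7", "total_io_ms": 30.0, "exchange_completion_ms": 24.0},
]

for row in attribute_latency(samples):
    print(f"{row['op']}: SAN {row['san_ms']:.1f} ms, server {row['server_ms']:.1f} ms "
          f"({row['san_share']:.0%} of the total is SAN)")
```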
