When your system breaks: Unix troubleshooting basics

Generally not taught in any formal classes, troubleshooting is one of the things that most of us end up picking up the hard way. How to proceed, where to look, how to determine the root cause of the problems that have crept up -- all of these are skills that we generally develop over time.

The life cycle of a troubleshooting session usually involves:

  • detection -- noticing that a problem exists
  • identification -- getting a handle on what the problem is
  • analysis -- determining what caused the problem
  • correction -- fixing whatever was wrong
  • prevention -- taking steps to ensure the problem doesn't happen again

A systematic approach to troubleshooting can help you pinpoint the root cause of a problem that breaks a server or application more quickly. Here are some steps to take and questions to ask yourself.

What just changed?

The most common first reaction to something that stops working is to ask "OK, so what changed?" Looking into recent changes is also the action most likely to pay off if, in fact, some significant change was just made. Look for files, especially configuration files, that might have been modified, applications or packages that were just added, services that were just started, etc.
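
One way to hunt for recently modified files is to let find do the looking. The sketch below wraps it in a small function; the name recent_changes and the defaults (/etc, one day) are assumptions picked for illustration, so adjust them to wherever your application keeps its configuration.

```shell
# recent_changes [DIR] [DAYS]
# List regular files under DIR that were modified within the last DAYS days.
# The function name and the defaults are illustrative assumptions.
recent_changes() {
    dir=${1:-/etc}
    days=${2:-1}
    # -mtime -N matches files modified less than N*24 hours ago.
    # Errors from directories we cannot read are discarded.
    find "$dir" -type f -mtime "-$days" 2>/dev/null
}
```

For example, recent_changes /etc 1 lists configuration files touched in the last 24 hours, and recent_changes /var/www 7 widens the search to a week in a different directory.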

Don't overlook the fact that many system problems are slow to emerge and looking for something that just changed might not lead you any closer to the cause of whatever problem you're grappling with.

Examples of problems that are not tied to a recent change include:

  • slowly running out of disk space
  • bumping into a configuration flaw that simply never got activated before because certain conditions hadn't yet been met
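
Slow-building problems like a filling disk are easy to check for. The following is a minimal sketch that reads df -P output and reports file systems at or above a usage limit; the function name over_threshold and the 90 percent default are assumptions, and mount points containing spaces are not handled.

```shell
# over_threshold [LIMIT]
# Read `df -P` output on stdin and print the mount point and use% of every
# file system whose usage is at or above LIMIT percent (default 90).
# The function name and default limit are illustrative assumptions.
over_threshold() {
    limit=${1:-90}
    awk -v limit="$limit" 'NR > 1 {      # skip the header line
        use = $5                         # column 5 is Capacity, e.g. "95%"
        sub(/%/, "", use)
        if (use + 0 >= limit) print $6, $5
    }'
}

# Example: df -P | over_threshold 90
```

The -P option asks df for the portable one-line-per-file-system format, which is what makes the column positions predictable.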

What errors am I seeing?

Pay close attention to any errors that are being displayed on the system console or in your log files. Do those errors point to any particular cause?

Have you seen errors like these before? Do you see any evidence of the same errors in older log files or on other systems? What do online searches tell you? No matter what kind of problem you've run into, you're not likely to be the first sysadmin who has run into it.
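
To see whether an error is new or has been showing up for a while, you can count matches across current and older log files. This is only a sketch: the function name error_counts is an assumption, the log file names in the example are hypothetical, and compressed (rotated) logs would need to be uncompressed first.

```shell
# error_counts PATTERN FILE...
# Print "count filename" for each readable log file so you can tell whether
# an error is new or also appears in older logs.
# The function name is an illustrative assumption.
error_counts() {
    pattern=$1
    shift
    for f in "$@"; do
        [ -r "$f" ] || continue
        # grep -c prints the number of matching lines; -i ignores case.
        # grep exits non-zero when the count is 0, hence the "|| true".
        n=$(grep -ci -- "$pattern" "$f" || true)
        printf '%s %s\n' "$n" "$f"
    done
}

# Example (hypothetical file names):
#   error_counts "connection refused" /var/log/messages /var/log/messages.1
```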

How is the system or service behaving?

Looking into the symptoms of the problem is also likely to pay off. Is the system or service slow or completely unusable? Maybe only some people cannot log in. Maybe only some functions are not working. Noticing what works and what doesn't might help you focus on what's wrong.

How is this system different from one that is still working?

If you're lucky enough to have redundant systems and have a chance to compare the one that isn't working with one that is, you may be able to identify key differences that can help lead to the cause.
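
A comparison like this often comes down to running diff on two copies of the same file. The sketch below assumes you have already copied the configuration file from the working system to the broken one (with scp, for example); the function name compare_configs and the file names in the example are hypothetical.

```shell
# compare_configs WORKING_COPY BROKEN_COPY
# Show a unified diff of the two files, or say so if they are identical.
# Lines starting with "-" exist only in the working copy and lines starting
# with "+" exist only in the broken one.
# The function name is an illustrative assumption.
compare_configs() {
    if diff -u "$1" "$2"; then
        echo "no differences"
    fi
}

# Example (hypothetical file names):
#   compare_configs /tmp/app.conf.from-good-host /etc/app.conf
```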

What are the likely break points?

Think about how the application or service works and how/where it is likely to have problems. Does it rely on a configuration file? Does it need to communicate with other servers? Is a database involved? Does it write to specific log files? Does it involve multiple processes? Can you easily determine whether all of the required processes are running? If you can, systematically eliminate the potential causes.
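
Checking whether all of the required processes are running can be scripted. Here is a rough sketch that assumes pgrep is installed (it is common, but not part of every system) and that you know the exact process names; the function name check_procs and the process names in the example are assumptions.

```shell
# check_procs NAME...
# Report whether a process with each exact name is running.
# Returns 0 if all were found and 1 if any is missing.
# The function name is an illustrative assumption.
check_procs() {
    status=0
    for name in "$@"; do
        # pgrep -x requires the process name to match exactly
        if pgrep -x "$name" >/dev/null 2>&1; then
            echo "running: $name"
        else
            echo "MISSING: $name"
            status=1
        fi
    done
    return $status
}

# Example (process names are assumptions): check_procs sshd crond httpd
```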

What troubleshooting tools do I have on hand that might be helpful?

Think about the tools that you have on hand for looking into system problems. Some that might prove useful include:

  • top -- for looking at performance, including some memory, swap space, and load issues
  • df -- for examining disk usage
  • find -- for locating files that have been modified in the last day or so
  • tail -f -- for viewing recent log entries and watching to see if errors are still arriving
  • lsof -- to determine what files a particular process has open
  • ping -- quick network checking
  • ifconfig -- checking network interfaces
  • traceroute -- checking connections to remote systems
  • netstat -- examining network connections
  • nslookup -- checking host resolutions
  • route -- verifying routing tables
  • arp -- checking IP address to MAC address entries in your cache
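
Several of these tools can be run once and their output saved for later review. The sketch below is one way to do that, not a standard utility: the function names capture and collect_snapshot are assumptions, the options shown (-k, -an, -rn, -a) are common but vary between platforms, and on newer Linux systems netstat, ifconfig, and arp may not be installed at all, so anything missing is simply noted.

```shell
# capture DEST NAME COMMAND [ARGS...]
# Run COMMAND and save its output (including errors) to DEST/NAME.txt.
capture() {
    dest=$1 name=$2
    shift 2
    if command -v "$1" >/dev/null 2>&1; then
        # A failing tool should not stop the rest of the collection
        "$@" > "$dest/$name.txt" 2>&1 || true
    else
        echo "$1: not installed" > "$dest/$name.txt"
    fi
}

# collect_snapshot DIR
# Save one-shot output from a few of the tools listed above into DIR.
collect_snapshot() {
    dir=$1
    mkdir -p "$dir" || return 1
    capture "$dir" df       df -k          # disk usage
    capture "$dir" netstat  netstat -an    # network connections
    capture "$dir" ifconfig ifconfig -a    # network interfaces
    capture "$dir" route    netstat -rn    # routing table
    capture "$dir" arp      arp -an        # IP to MAC address cache
}

# Example: collect_snapshot /tmp/snapshot.$(date +%m%d%y)
```

Interactive tools such as top and tail -f are left out because they keep running until you stop them.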

Is anything nasty going on?

Don't rule out the possibility that someone has been messing with your system, although most hackers would prefer to do their work without you noticing anything.

What should I NOT do?

Don't confuse symptoms and causes. Whenever you identify a problem, ask yourself why the problem exists.

Be careful not to destroy "evidence" as you work feverishly to get your system back online. Copy log files to another system if you need to recover disk space to get the system back to an operational state. Then you can examine them later to help figure out what caused the problems you're working to resolve. If you need to repair a configuration file, first make a copy of the file (e.g., cp -p config config.save) so that you can more easily look into how and when the file was modified and what you had to do to get things working.
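
If you expect to edit a file more than once, a time-stamped variation of the cp -p command above keeps each earlier version. This is a small sketch; the function name save_copy and the .save.TIMESTAMP naming are assumptions.

```shell
# save_copy FILE
# Copy FILE to FILE.save.YYYYMMDDHHMMSS, preserving its modification time,
# ownership (where permitted), and permissions, then print the copy's name.
# The function name and naming scheme are illustrative assumptions.
save_copy() {
    src=$1
    stamp=$(date +%Y%m%d%H%M%S)
    cp -p "$src" "$src.save.$stamp" && echo "$src.save.$stamp"
}

# Example: save_copy config
```

Because cp -p preserves the original modification time, the saved copy still tells you when the file was last changed before you touched it.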

Keep in mind that you might end up making a lot of changes in the process of tracking down your problem. Later on, you might want to think through which of those changes actually resolved the problem.

What should I do?

  • Record your actions. If you're using PuTTY to connect (or some other tool that allows you to record your system interactions), turn on logging. This will help you when you have to review what happened and how you got past the problem. If you're not out of disk space, you also have the option of using the script command to record your login session (e.g., script troubleshooting.`date +%m%d%y`).
  • If you can't record, keep notes on what you did and what you saw. You might not remember it all later, especially if you're stressed. You might remember the steps, but not the order in which you ran them.
  • After the problem is resolved, document what happened. You might see it again and you might need to explain to your boss or your customers what happened and how you're going to prevent it from happening in the future.
  • Whenever possible, think about how the problem could be avoided in the future. Can you improve your monitoring services so that disk space, memory and network issues, configuration changes, etc. are brought to your attention long before they affect running services?
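
If you can't record the session, even a tiny helper makes note keeping less painful and preserves the order of your steps. The following is only a sketch; the function name note, the NOTES_FILE variable, and the default file location are all assumptions.

```shell
# Where notes are written; override by setting NOTES_FILE beforehand.
# The variable name and default location are illustrative assumptions.
NOTES_FILE=${NOTES_FILE:-$HOME/troubleshooting-notes.txt}

# note TEXT...
# Append a time-stamped line to the notes file so that the order of your
# actions is preserved.
note() {
    printf '%s  %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$NOTES_FILE"
}

# Example:
#   note "restarted httpd, errors still arriving in the log"
```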

Wrap up

Good troubleshooting skills can really save the day and having a plan of attack when a problem arises can play a major role in getting your systems and applications back online and you back home at a decent hour.
