DevOps Troubleshooting: Linux Server Best Practices

Effective troubleshooting requires that you know how to break a problem into pieces and track down evidence, and that you understand your systems and applications -- as well as the tools at your disposal -- well enough to analyze problems when they rear their ugly heads. You can learn these skills over decades of working with Linux systems, or you can jumpstart the process by reading a book that provides you with someone else's insights -- or both!

DevOps Troubleshooting: Linux Server Best Practices provides a lot of practical insights and tricks to help you get up to speed as a competent Linux troubleshooter. The "DevOps" part of this title refers to systems administrators working together with quality assurance engineers and developers -- a collaboration that can bring additional insights to bear on a wide range of problems. The approach that this multidisciplinary team takes involves a lot of sharing and communication and can make quick work of even complicated problems. But even if you're completely on your own, the techniques and suggestions in this book can help you resolve problems more quickly and effectively and might even get you thinking about what you can do NOW that will come in handy when a problem arises.

To be good at troubleshooting, you first need an overall approach to problem solving. You are likely, for example, to ask yourself questions like "When did this last work properly?" or "What has changed recently?" You might even compare a system or application that isn't functioning properly with another that isn't exhibiting the problem. Why does one work while the other keeps crashing? The book covers many approaches like these in its "Troubleshooting Best Practices" chapter.

Insights provided in this chapter can help you to characterize the nature of a failure. Does it happen all the time or just once in a while? Is it reproducible or completely random? What, if any, error messages or log entries are in evidence that might help you gain insights into what is going wrong?

The author also warns us not to be too quick to reboot. You might erase evidence that you need and never understand what went wrong, leaving yourself vulnerable to a likely recurrence with nothing to show for your efforts.

Troubleshooting is an acquired skill. You need to learn how to quickly characterize a problem, how to track down clues about its nature, how to simplify a problem or break it into manageable pieces, and how to bring the proper tools and maybe even the proper colleagues to bear.

This book gives a quick but effective introduction to the art of troubleshooting and then homes in on some of the most common problem areas:

  • slow systems, RAM shortages, taxed CPUs, excessive disk I/O
  • booting problems
  • full or corrupt disks
  • network problems
  • name resolution problems
  • email problems
  • web site problems
  • slow database problems
  • faulty hardware

Each of these chapters provides suggestions on how to approach the particular problem. In the chapter on tracking down web site problems, for example, questions like "Is the server running?" and "Is the remote port open?" walk you through the logical steps. Suggestions like "test the remote port locally" and tips on how to check your firewall rules guide you as you close in on what's wrong.
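The "Is the remote port open?" question lends itself to a quick scripted check. What follows is a minimal sketch in Python, not something taken from the book (which works from the command line); the function name is_port_open and its timeout default are assumptions made for illustration.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests whether the port
    accepts connections, not whether the web server behind it is healthy.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Running is_port_open("localhost", 80) on the server itself and then the same check from another machine mirrors the "test the remote port locally" advice: if the local check succeeds and the remote one fails, a firewall rule between the two is a reasonable suspect.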

The chapter then offers suggestions for testing from the command line, using tools such as curl and telnet. It explains HTTP return codes so that you can understand what you are seeing in your web logs.
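As a rough illustration of how return codes group into classes, here is a small Python sketch. It is not drawn from the book; the function name describe_status and the wording of the class labels are assumptions, while the reason phrases come from Python's standard http.HTTPStatus enum.

```python
from http import HTTPStatus

# The first digit of an HTTP status code identifies its class.
STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code: int) -> str:
    """Return a short description of an HTTP status code (hypothetical helper)."""
    status_class = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # The code is not one of the statuses known to the standard library.
        phrase = "unregistered code"
    return f"{code} {phrase} ({status_class})"


for code in (200, 404, 503):
    print(describe_status(code))
```

When scanning a web log, the class is often the first thing worth noticing: a run of 4xx codes points toward the requests (bad URLs, permissions), while 5xx codes point toward the server or something behind it.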

It also suggests that, if you can get your hands on some server stats, you might be able to tell whether the server is overwhelmed or barely moving. It shows you how to run some simple tests on your Apache configuration file, how to spot permissions problems, and how to recognize sluggish or unavailable servers.

One of the things that leaves so many of us unprepared for problems is that we don't pay enough attention to how a system behaves when it's working properly to pick out what is different when it is not. In keeping with this, one of the take-home messages of this book is that we should all be careful to document problems and their solutions. How many times have you encountered a problem that looks familiar, but not been able to recover anything from past incidents that could help you this time around? If you document the problem's symptoms, its root cause, and the steps that resolved it, you may not be haunted by déjà vu!

DevOps Troubleshooting: Linux Server Best Practices is an extremely helpful book and one that any member of a DevOps team or Linux administrator should read.
