The staged approach to troubleshooting database issues

If you've got database issues, you don't have to put the entire business on hold while you troubleshoot.

I've had a bit of a time lately troubleshooting some database issues and I finally got them worked out so I thought I'd talk about my layered approach to pinning down issues on a busy box.

The problem is simple.  You've got a busy box that can't afford to stay down very long.  Yet sometimes everything slows down so much nobody can get any work done because the app keeps timing out.  And when you look at the DB, you see that there are blocks.  Unfortunately that's nothing to write home about because there are always blocks in this DB.  And as many of us know blocking is a part of any healthy DB application.  But, you can see that today the blocking is different than it usually is.  The blocks are hanging around longer than usual and of course there are all those user complaints. 

So looking in sys.sysprocesses you see a couple spids that are at the head of the blocking chain and when you refresh it, they're gone, but there are a couple others.  You refresh it again, and there are yet a number of others.  So the blocking, while it's taking longer than usual, is moving constantly and there never seems to be any one single spid or query that's causing all this commotion.  So how do you troubleshoot a moving target?  This is the crux of the problem.

The system is too busy for you to sit there and chase these ghosts all day.  You've got to get it up and running fast.  And this is where layering your troubleshooting can come into play.  Figure out the nature of the problem first.  Blocking like this can be caused by a number of things from bad execution plans, to out of date stats, to index fragmentation, so it can be hit or miss.  Still there are some easy things you can do right away to help discover the nature of the problem.  What I tend to do is recompile all the stored procedures by running dbcc freeproccache.  This can tell you that the nature of the blocking is either stats related or execution plan related.  The point is though that if this works, you can get the server back up and running quickly even though you don't know exactly what the problem is.  And the next time it happens you can apply the same fix, but don't apply it right away.  Take some of the SPs that are blocking and recompile them manually, and keep a list of the ones you recompile.  If you don't find one in fairly short order then run freeproccache again and get the system going. 

You can also look at some of the SPs and see if they have a table in common.  If they do you can update that stats on that table the next time the blocking occurs.  If that works then you know the problem was stats.  If it doesn't work, then run sp_recompile against that table.  That forces any code that hits that table to recompile the next time it runs.  And if that works, then now you know it's a query that goes against that table, so that narrows it down even more.  And you can just keep going like this until you get down to the exact issue.

Anyway, this is just a quick lesson on how you can troubleshoot issues on busy DB boxes without holding up the entire business while you try to hit a moving target.  So don't hold everybody up, just do what you can each time until you get what you need to figure out the real issue.

Join the Network World communities on Facebook and LinkedIn to comment on topics that are top of mind.
Take IDG’s 2020 IT Salary Survey: You’ll provide important data and have a chance to win $500.