Troubleshooting database problems

I had an interesting talk with a dev today about how to troubleshoot SQL issues. The issue was this: all of a sudden the DB just slowed down to a crawl. Everything typically moves along just fine, but yesterday we got so slow we were effectively down. And when I went in there to look around, there were in the neighborhood of 60 blocked processes. Now, it seems easy enough to go in there and see who's at the top of the blocking chain, but these blocking processes kept moving. Every time I'd refresh, a different set of processes would be doing the blocking, so sniping the exact process wasn't going to happen easily.

What I ended up doing was dumping the procedure cache and forcing all new plans. I took an educated guess at the problem based on my knowledge of the workload and of SQL in general, and it worked. All the queries started grabbing fresh plans and all was good. Now, I don't advocate this shotgun approach in your environment; you should know the impact of something like that before you do it on a production server. But it worked for us.

So now the questions come: how do we prevent this in the future, and how do we snipe exactly which SP it was so we can just recompile that one instead next time? And this is where the problem comes in, because with our current setup there's no way to do that. We don't have a performance baseline for our queries. We've got one for the system itself, but not for the queries.

Let me explain. In order to have an effective baseline for your queries, you have to profile their execution stats in some way and save them off to a table. Then when you have a rough patch you can go back, look at the normal execution stats, compare them with what you're seeing now, and know when there's a spike. If you do anything else, you're speculating.

Now, that led him to say that you really don't need to go through that process, because if you saw an SP that ran, say, 1,000 times/hr and took 2 seconds, then it's highly likely that it's not supposed to run that long. Nobody writes SPs that take that long and then runs them that often (he says). And as DB people we cringe when someone says something like that, because we've all seen the horrendous SQL that gets written with little regard for anything else on the box.

My stance is that any plan that relies on conjecture isn't a solid plan. You never want to rely on guessing or intuition unless you have to. Now, I'm not saying that he wouldn't have been effective to a degree with his plan. After all, he knows that system and he knows the queries. But he's in charge of a single system, and my DBA group is in charge of hundreds. We don't have the luxury of guessing our way through a problem. And yes, we've seen every type of query known to man in the sense of how little regard they can have for users or other processes.

So when you're putting together your monitoring plans, make sure you're putting together something that'll actually lead you to an answer. Guessing is fine if that's all you've got; sometimes you have nothing else to go on, so you have no choice. Even looking at long-running queries in a profiling tool is fairly useless if you don't know what the queries are supposed to be doing. I mean, so what if an SP takes 5 minutes to run? Maybe it usually takes 10 minutes and you're chasing an issue that doesn't exist. To make all of this concrete, I've sketched out below roughly what each of these pieces looks like in T-SQL.
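
First, the blocking chase. When the blockers keep shifting, what you really want is the head of the chain: a session that's blocking others but isn't blocked itself. Here's a minimal sketch of that on SQL Server 2005 and up using the sys.dm_exec_requests DMV. The DMV and columns are real; the query shape is just one way to slice it, and it won't catch an idle head blocker that's sleeping on an open transaction:

    -- Heads of the blocking chains: sessions that are blocking
    -- others but aren't themselves blocked by anyone.
    SELECT r.blocking_session_id AS head_blocker,
           COUNT(*) AS sessions_blocked
    FROM sys.dm_exec_requests r
    WHERE r.blocking_session_id <> 0
      AND r.blocking_session_id NOT IN
          (SELECT r2.session_id
           FROM sys.dm_exec_requests r2
           WHERE r2.blocking_session_id <> 0)
    GROUP BY r.blocking_session_id
    ORDER BY sessions_blocked DESC;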
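
The shotgun I actually fired, and the scalpel we wished we could have used instead, are both one-liners. These are standard commands; dbo.MyProc is just a stand-in for whatever procedure you've fingered:

    -- The shotgun: throw away every cached plan on the instance.
    -- Everything compiles fresh on its next run, so expect a CPU hit.
    DBCC FREEPROCCACHE;

    -- The scalpel: mark one procedure for recompile so it picks up
    -- a new plan on its next execution and nothing else is touched.
    EXEC sp_recompile N'dbo.MyProc';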
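
As for the baseline itself, here's the kind of thing I mean: snapshot sys.dm_exec_query_stats into a table on a schedule. This is only a sketch under my own assumptions; the table name and layout are made up for illustration, and a real version would run from a SQL Agent job and probably capture the query text too:

    -- One-time setup: a table to hold the snapshots (assumed schema).
    CREATE TABLE dbo.QueryStatsBaseline (
        capture_time        DATETIME      NOT NULL,
        sql_handle          VARBINARY(64) NOT NULL,
        plan_handle         VARBINARY(64) NOT NULL,
        execution_count     BIGINT        NOT NULL,
        total_worker_time   BIGINT        NOT NULL,  -- CPU time, microseconds
        total_elapsed_time  BIGINT        NOT NULL,  -- duration, microseconds
        total_logical_reads BIGINT        NOT NULL
    );

    -- The scheduled capture, run every 15 minutes or so:
    INSERT INTO dbo.QueryStatsBaseline
    SELECT GETDATE(), qs.sql_handle, qs.plan_handle,
           qs.execution_count, qs.total_worker_time,
           qs.total_elapsed_time, qs.total_logical_reads
    FROM sys.dm_exec_query_stats qs;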
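
And once you've got that history, spotting a spike stops being guesswork. Something along these lines (again, only a sketch against the hypothetical table above; the 3x threshold is an arbitrary number I picked, and the per-execution averages here are cumulative, so a real version would diff successive snapshots) compares each query's current average duration against its historical average:

    -- Flag queries whose average duration per execution is running
    -- well above their historical baseline.
    WITH history AS (
        SELECT sql_handle,
               AVG(total_elapsed_time * 1.0 / NULLIF(execution_count, 0))
                   AS baseline_avg
        FROM dbo.QueryStatsBaseline
        GROUP BY sql_handle
    )
    SELECT qs.sql_handle,
           qs.total_elapsed_time * 1.0 / NULLIF(qs.execution_count, 0)
               AS current_avg,
           h.baseline_avg
    FROM sys.dm_exec_query_stats qs
    JOIN history h
      ON h.sql_handle = qs.sql_handle
    WHERE qs.total_elapsed_time * 1.0 / NULLIF(qs.execution_count, 0)
          > 3 * h.baseline_avg
    ORDER BY current_avg DESC;

That's the difference between "this SP takes 5 minutes" and "this SP takes 5 minutes but normally takes 10." The second one tells you whether you're even chasing a real problem.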
