Stack Dumps and Corruption

Had something I hadn’t had to deal with in a while appear yesterday on a client site, it’s something that we as DBA’s hope not to see, the dreaded Stack Dump. I’ll talk you through how I dealt with it.

Problem: Customer reported their hard drive was filling up on their SQL Server extremely fast and they were quickly running out of space.

Triage: A check later and there were 50GB of stack dump files appearing in their Log folder for SQL Server, now that’s something we need to look at. If your folder looks something like this then you’ve probably got a problem;

Yep, definitely going to need to check that...
Uh Oh

Checked the SQL Server error log and it was enormous with all of these stack dump errors. OK, let’s do this.

Treatment: The first place to look was those stack dump files to see what was going on. When a stack dump occurs SQL Server will create 3 separate files, a .txt, a .log and a .mdmp file. If you take a look in the txt file you will see a whole lot of data that’s only really useful to Microsoft support, but there’s some useful information in there if you take the time to dig through it. Take a look at the Input Buffer output, this contains 255 bytes of the query that’s being dumped, recognise it? In my scenario I had dealt with this query before so I could quickly find it but you may have to search through your code repository to find the query that’s being the victim of the dump.

Exhibit A – The Culprit

Now this is a query that runs regularly to keep some summary tables updated. Fortunately I was able to switch this off while investigating the issue, if you can’t switch it off then you’re going to have to clear out some of those dump files if you start running out of space while you investigate.

Now the bleed is stopped, time to work out what’s causing the dump. Take a look at the major and minor exception codes, a quick google shows that problem code 5242 is likely a corruption issue. An out of hours run of DBCC CHECKDB (it’s high impact, you don’t want to do this in production) showed that one particular index was showing as having corruption. Damn. Let’s try to fix that now shall we. CHECKDB gives us the object_id and index_id of the culprit, let’s use this information to find out which particular index in the issue;

USE [YourDatabaseName]

SELECT Table_Name
    , Index_Name
FROM sys.objects o
JOIN sys.indexes i
     ON o.object_id = i.object_id
WHERE o.object_id = 'YourObjectID'
     AND i.index_id = 'YourIndexID'

It’s a Non Clustered Index: If this is a non-clustered index then you’re golden. Navigate to the index in management studio, then you can create the drop and create script;

Bear in mind that this will affect the performance of your live system and may cause or be the victim of blocking. Once you’ve got a maintenance window and your users are aware then go ahead and run this script, it’s going to drop the index and create a fresh one using the corrected data. Once this has run, go ahead and run DBCC CHECKDB again to ensure that the corruption issue is fixed. If it is then great, you can clear down those stack dumps to free up hard drive space, tidy up after yourself and go get a cuppa and a biscuit and pat yourself on the back for a job well done.

It’s a Clustered Index: This is different, and a whole different beast. A clustered index is the table, it’s not just an index on the table, the clustered index is how it is stored on disk. Ultimately you’re going to have to restore your database from your last good backup (please restore to a different location, don’t instantly overwrite your production database) and retrieve your data that way. It’s not ideal but it is one of the few ways of recovering from this corruption.

Plus, remember, you always have the option of raising a ticket with Microsoft Support if you can’t fix it yourself. That’s what all of that other junk is for in the dump files.

Leave a Reply