Something that has been bugging me for a while is that you can’t easily tell when the query in an SSMS window was last run. I’ve opened a connect item to see if we can get this addressed. I’d really appreciate if you could take the time to vote on it if you agree with the suggestion
Had something I hadn’t had to deal with in a while appear yesterday on a client site, it’s something that we as DBA’s hope not to see, the dreaded Stack Dump. I’ll talk you through how I dealt with it.
Problem: Customer reported their hard drive was filling up on their SQL Server extremely fast and they were quickly running out of space.
Triage: A check later and there were 50GB of stack dump files appearing in their Log folder for SQL Server, now that’s something we need to look at. If your folder looks something like this then you’ve probably got a problem;
Checked the SQL Server error log and it was enormous with all of these stack dump errors. OK, let’s do this.
Treatment: The first place to look was those stack dump files to see what was going on. When a stack dump occurs SQL Server will create 3 separate files, a .txt, a .log and a .mdmp file. If you take a look in the txt file you will see a whole lot of data that’s only really useful to Microsoft support, but there’s some useful information in there if you take the time to dig through it. Take a look at the Input Buffer output, this contains 255 bytes of the query that’s being dumped, recognise it? In my scenario I had dealt with this query before so I could quickly find it but you may have to search through your code repository to find the query that’s being the victim of the dump.
Now this is a query that runs regularly to keep some summary tables updated. Fortunately I was able to switch this off while investigating the issue, if you can’t switch it off then you’re going to have to clear out some of those dump files if you start running out of space while you investigate.
Now the bleed is stopped, time to work out what’s causing the dump. Take a look at the major and minor exception codes, a quick google shows that problem code 5242 is likely a corruption issue. An out of hours run of DBCC CHECKDB (it’s high impact, you don’t want to do this in production) showed that one particular index was showing as having corruption. Damn. Let’s try to fix that now shall we. CHECKDB gives us the object_id and index_id of the culprit, let’s use this information to find out which particular index in the issue;
FROM sys.objects o
JOIN sys.indexes i
ON o.object_id = i.object_id
WHERE o.object_id = 'YourObjectID'
AND i.index_id = 'YourIndexID'
It’s a Non Clustered Index: If this is a non-clustered index then you’re golden. Navigate to the index in management studio, then you can create the drop and create script;
Bear in mind that this will affect the performance of your live system and may cause or be the victim of blocking. Once you’ve got a maintenance window and your users are aware then go ahead and run this script, it’s going to drop the index and create a fresh one using the corrected data. Once this has run, go ahead and run DBCC CHECKDB again to ensure that the corruption issue is fixed. If it is then great, you can clear down those stack dumps to free up hard drive space, tidy up after yourself and go get a cuppa and a biscuit and pat yourself on the back for a job well done.
It’s a Clustered Index: This is different, and a whole different beast. A clustered index is the table, it’s not just an index on the table, the clustered index is how it is stored on disk. Ultimately you’re going to have to restore your database from your last good backup (please restore to a different location, don’t instantly overwrite your production database) and retrieve your data that way. It’s not ideal but it is one of the few ways of recovering from this corruption.
Plus, remember, you always have the option of raising a ticket with Microsoft Support if you can’t fix it yourself. That’s what all of that other junk is for in the dump files.
I always enjoy reading through Microsoft Connect items and seeing what people smarter than me are looking for out of SQL Server. Inspired by this post from Mr Brent Ozar himself, I thought I’d make a note about one of the more interesting ones and the advantages/disadvantages as I see them.
A bit of background: I work for a software vendor that provides document information management systems (among other things). The servers that run this can be anything since Microsoft 2008 and are run on the customer’s hardware. The SQL Server may well be virtualised and it’s not unusual to see the server being a shared SQL/Web box (although this can be a nightmare to troubleshoot). This makes an interesting situation on each customer’s machine as they’re all unique, any triage tools need to be compatible back to SQL 2008 and versatile to catch as much information as possible. It also means that troubleshooting usually needs to be completed on the customer’s live site as we do not have anything locally that will mirror the specs exactly.
While browsing Microsoft connect one day I came across this connect item which seemed like a good idea. The premise is that there would be functionality to export and import some sort of settings file which, in my case, would allow for a certain amount of troubleshooting remotely by grabbing this file and installing onto a test server.
Would definitely help troubleshooting.
Could even post on Stack Overflow to help diagnose settings issues remotely.
Would avoid needing to go through a long list of settings on a new install to ensure they’re all set correctly.
Won’t allow for any troubleshooting of the data itself (although a .bak file would help there).
This would be a nightmare at a server setting level, and some settings are based upon hardware configuration (think MAXDOP).
What about tempdb files and backup locations, they’re going to need physical file locations passing through. This is going to require a certain level of GUI.
It’s going to reduce the level of knowledge of people performing new installs, this seems like a short term time saver at the detriment of people’s knowledge overall.
This one has been closed by Microsoft as “Won’t Fix” but this doesn’t always mean a death knell for the suggestion. This particular one is a bit of a minefield though so I’m not expecting it to get implemented any time soon.
I’d be interested to know your thoughts around this.
Edit: As Paul mentioned in the comments, the connect item above is to do with SSMS settings, not server level. Because of this I’ve opened the connect item below, please feel free to upvote it if you believe it may be useful;
I’ve always been a fan of giving back wherever possible.
I work for a software vendor where each customer hosts their own SQL Server on site, which means that I indirectly support about 400 servers around the world, all of which have different hardware specifications and unique things to think about. A lot of what I do is jumping on customer’s sick SQL Servers, working out what’s wrong and working out a solution. I use a number of tools so that I can gather information automatically rather than manually pull it each time.
A major set of tools that I use is the Brent Ozar First Responder Kit. This is an excellent set of tools that give a great range of stats about your SQL Server and links to read further if you’re unsure what the stats mean. This is great for getting people with less experience up to speed and still extremely useful for more experienced users.
I’ve got a .NET tool which runs these scripts (among others) and exports to Excel, such a time saver. I added additional checks on my local versions for things that were specific to what I wanted that weren’t in the main scripts (I still do for certain things that may be unsuitable to be included in the main scripts, see my recent post on sp_regread).
In June 2016 these scripts were open sourced the team open sourced these scripts and put them up on GitHub. After browsing the issue list I decided to help out a little in order to give back and make these scripts even more useful for others. The changes I’ve made aren’t large ones but there is definitely a sense of accomplishment when those pull requests get accepted.
This kit gets released once a month and they credit who has coded on the release, it’s nice seeing your name on their release notes (e.g. here and here). It will also not be a bad thing for the old CV when it comes to it, you get your own profile which details what you’ve been contributing on. You even get a cool little graph which tells you the volume of your own contributions.
If you’ve ever seen an open source project that you thought it might be fun to contribute to, I urge you to go ahead and give it a go.
I came across an interesting undocumented stored proc today and thought I’d share what I found out.
Situation: I currently have a set of scripts to run when I jump on a server to bring me up to speed with what’s happening. They check all manner of things so that I don’t have to check everything manually. They’re extremely useful and save me a lot of time.
Problem: One thing customers do is run multiple instances of SQL Server on the same machine (not virtualised). This is bad because it’s extremely difficult to work out which server is causing certain statistics. It’s like a crowded lift, you have no idea who farted.
Solution: xp_regread can tell you via T-SQL how many instances of SQL Server you have running on the box.
Issues: As a software vendor we always connect as an admin account so this always works. You do need a certain level of permissions for this to work, if you’re only in a basic group then you’re probably out of luck I’m afraid. Time to buy the DBA team a bottle of whisky and ask really nicely for your permissions to be elevated.
OK, let’s say this straight off the bat, SQL Server is a bully. It will sit on your server and use as much RAM as it can get its hands on if you let it. This is great for the performance of your SQL queries but if it’s fighting other processes for those resources then you may start running into issues.
But I’m running other processes on my SQL box, how do I throttle SQL?
Simple, this is a SQL box, not a crowded nightclub where everyone is knocking into each other and shouting for the bartender’s attention. You should be using dedicated hardware for your SQL server, move your other processors onto a separate server. SQL licences are expensive, why would you restrict its ability to do its job?
OK, I’ve got it on dedicated hardware, do I need to do anything?
Now this all depends on how your SQL instance is behaving. SQL will attempt to load all of its data into RAM where it can. Accessing data from here is much faster than having to go and retrieve it from disk (even if you’ve got it on lovely SSD’s). If you’ve got 5GB of data on a box with 12GB RAM then I doubt you’re seeing an issue, whereas if you’ve got 15GB of data on that same box then you’re likely to start running into problems (however, more data and you’re going to want to think about investing in more RAM, the cost of this pales in comparison to the cost of SQL licences).
OK, so I might have a RAM issue, what do I do about it?
SQL does have settings to enable you to set a maximum amount of memory that it can use, this can ease contention with things like the operating system or other background tasks. A good starting point is to leave 4GB available for the OS and limit SQL to using the rest. This does depend upon what else you have running on the box so may require tweaking. Try setting your max server memory to 4GB less than you have on the box and go from there. Monitor your memory usage and see if you can give SQL that extra bit more memory.
You know that wait stat you sometimes see; ‘CXPACKET’? You know all of that blocking you’re getting on your server? You know of those times when a user is waiting on the system and there’s nothing obvious causing it? That’s why you should care. Now, parallelism isn’t a silver bullet to fix all of your problems but the standard settings for this are so ancient that this could be causing you major headaches.
First of all, a quick pop quiz
In SSMS, right click on your server and go to properties. Open the ‘Advanced’ tab and scroll to the bottom. Take a look at Cost Threshold for Parallelism and Max Degree of Parallelism. I bet this is set to 5 and 0 right? If not, good for you, but do you understand these settings? If you are seeing a 5 and a 0, don’t worry, these are the defaults but you should be aware that they’re rubbish and you’ll want to be changing these.
Big queries are hungry, they take a lot of CPU to process, it would be great if we could split up the work that they can do so they run faster right? Yep, it’s a great idea. Instead of using a single core of that quad core processor your server is running on, wouldn’t it be a great idea if we could use more and run our query in parallel? Yep, but there’s some gotchas. Let’s talk about these two settings before we think about changing them.
Cost Threshold for Parallelism
Let’s get the first question out of the way, what does this number stand for? What exactly is (the default) of 5? Is it seconds?
Short answer: Yes but no.
Long answer: it was a count of seconds a long time ago in a land far far away (Microsoft HQ, Washington). Now it’s just a number that we can use to compare the relative size of queries.
This number is calculated by the optimizer when a query is compiled. Statistics are compared and SQL decides how much it thinks the query is going to cost. If the threshold is over this number then it’s going to try to run parallel. But how many cores is it going to use?
Max Degree of Parallelism (MAXDOP)
Optimizer has decided that the query it’s going to run is big enough to run parallel, how parallel it goes is based upon this setting. The SQL default is set to 0 which for this setting means unlimited. This means that this massive query we have is going to use as many cores as it likes, on that quad core processor it’s going to split itself into 4 and use all of the CPU so that it can run as fast as it can. Sounds great right, this big query is going to run really fast? Not in the real world.
Why are the defaults rubbish?
Technology has come quite a way since SQL was designed. What used to be a large query is, comparatively, tiny by today’s standards. Queries with a cost of 5 or more is likely to be quite a number of queries, chances are that you’re going to have relatively small queries going parallel. When they do, your MAXDOP being set to 0 (unlimited) is going to mean that the majority of your queries are going to be taking up the whole CPU every time they are run. They’re going to be fighting over your resource which is going to be killing your performance and it likely to be a cause of all of those blocks that you’re seeing.
Damn, how do I fix it?
Great thing is, this can be a quick win but it might take a bit of tinkering to get the best possible settings for your application.
A good place to start for the Cost Threshold setting is to stick it to 50 to begin with. You may feel that you want to reduce or increase this setting as time moves on but give it a go.
MAXDOP will require a bit of working out for you, but only a bit! How many cores does your server have on the CPU? 4? 8? Try setting your MAXDOP to half of the number of available cores up to a maximum of 4 without hyper threading and 8 with hyper threading turned on.
Don’t leave these settings and walk away. They’ll be fine, but what if a little tweak helps further? Continue to monitor your blocking to see if this alleviates the problem.