
Another kind of contention in TEMPDB

I’ve been helping troubleshoot a SQL Server 2014 instance that kind of tips over every hour when a job is run on one of the application servers.

All eight cores of the server go through the roof together with the counter for batch requests/sec – and then everything grinds to a halt. This behavior continues for about ten minutes (until the application servers are done with their horrors) and then the server recovers back down to 5-10% CPU load and everyone breathes a sigh of relief. For about an hour.

So how does one troubleshoot an issue like this? Well, the first thing I did was to look at the wait statistics during one of these windows of disaster. This can be done in one of several ways, but I ran a script that takes a snapshot of sys.dm_os_wait_stats, waits a number of seconds (30 in my specific case), takes a new snapshot and calculates the difference (I used Paul Randal’s script HERE). This way I’m not burdened with whatever the server has been doing up to this point in time.
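
To illustrate the idea, here is a simplified sketch of the snapshot-and-diff approach (my own bare-bones version, not the actual script referenced above): capture a baseline, wait 30 seconds and then look only at what accumulated in between.

-- Bare-bones snapshot/diff of the cumulative wait statistics
SELECT wait_type, waiting_tasks_count, wait_time_ms
INTO   #baseline
FROM   sys.dm_os_wait_stats;

WAITFOR DELAY '00:00:30';

SELECT  w.wait_type,
        w.waiting_tasks_count - b.waiting_tasks_count AS waiting_tasks_delta,
        w.wait_time_ms        - b.wait_time_ms        AS wait_time_ms_delta
FROM    sys.dm_os_wait_stats AS w
JOIN    #baseline            AS b ON b.wait_type = w.wait_type
WHERE   w.wait_time_ms - b.wait_time_ms > 0
ORDER BY wait_time_ms_delta DESC;

DROP TABLE #baseline;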

From the result one thing sticks out somewhat – PAGELATCH_EX. So what IS PAGELATCH_EX anyway? A full explanation is beyond the scope of this blog post, but Paul Randal (as usual) has a great writeup about the wait HERE.

The next step was to find out what query was causing this wait using sys.dm_os_waiting_tasks and filtering on wait_type=’PAGELATCH_EX’.

SELECT *
FROM sys.dm_os_waiting_tasks
WHERE wait_type = 'PAGELATCH_EX'


I ran this several times to see if there was a pattern to the madness, and it turned out there was. All waits were concentrated in database ID 2 – TEMPDB. Many people perk up by now and jump to the conclusion that this is your garden-variety SGAM/PFS contention – easily remedied with more TEMPDB files and a trace flag. But alas – this was further inside TEMPDB. The output from the query above gave me the exact page number, and plugging that into DBCC PAGE gives the metadata object ID.
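
As an aside, the resource_description column for a PAGELATCH wait comes back in the form database_id:file_id:page_id, so it can be split into exactly the pieces DBCC PAGE wants. A quick-and-dirty sketch (the PARSENAME trick simply reuses the four-part-name splitter):

-- resource_description looks like 'dbid:fileid:pageid', e.g. '2:7:1639'
SELECT  wt.session_id,
        wt.wait_type,
        wt.resource_description,
        PARSENAME(REPLACE(wt.resource_description, ':', '.'), 3) AS database_id,
        PARSENAME(REPLACE(wt.resource_description, ':', '.'), 2) AS file_id,
        PARSENAME(REPLACE(wt.resource_description, ':', '.'), 1) AS page_id
FROM    sys.dm_os_waiting_tasks AS wt
WHERE   wt.wait_type = 'PAGELATCH_EX';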

DBCC TRACEON(3604)
GO
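-- DBCC PAGE arguments: database_id (2 = TEMPDB), file_id, page_id, print option.
-- Trace flag 3604 above sends the DBCC output to the client instead of the error log.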
DBCC PAGE(2,7,1639,0)
GO


This ID can then be run through the OBJECT_NAME function to get the name of the object in question.
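
OBJECT_NAME takes the database ID as an optional second argument, so there is no need to switch context into TEMPDB to do the lookup. The object ID below is just a placeholder – substitute the Metadata: ObjectId value from the DBCC PAGE output.

-- 12345 is a placeholder - use the Metadata: ObjectId value from DBCC PAGE
SELECT OBJECT_NAME(12345, 2) AS object_name;   -- 2 = database ID of TEMPDB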


It turned out to be sys.sysschobjs, which can be read about in detail HERE. In short, it is the system base table that holds one row per object in a database – in this case, one row per object in TEMPDB.

Thus there is apparently contention on an object that tracks the existence of tables in TEMPDB. This pretty much has to do with the server creating and demolishing TEMPDB tables at a rate that is simply too high for the poor server to handle. To see just how many TEMPDB tables were actually created and destroyed during these windows of despair, I took a look at the Performance Monitor counters SQL Server:General Statistics – Temp Tables Creation Rate and Temp Tables For Destruction. During most of the day these were running at around 1-2 tables created per second and 7-8 in the queue for destruction. Then this happened.
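
For the record, the same counters can also be read from inside SQL Server via sys.dm_os_performance_counters if Performance Monitor isn’t handy. A rough sketch – note that the creation rate counter is cumulative since startup, so two samples and a diff are needed to get an actual per-second figure:

-- Snapshot of the two temp table counters from inside SQL Server
SELECT  RTRIM(counter_name) AS counter_name,
        cntr_value
FROM    sys.dm_os_performance_counters
WHERE   object_name LIKE '%General Statistics%'
  AND   counter_name IN ('Temp Tables Creation Rate', 'Temp Tables For Destruction');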

[Performance Monitor graph: Temp Tables Creation Rate and Temp Tables For Destruction spiking as the job kicks in]

It is kind of obvious WHEN this window opens, yes? The numbers soar to 27 tables created per second and some 280 queued for destruction.

So, to recap: the job on the application servers is creating temporary tables at a rate the database simply can’t keep up with. This leads to PAGELATCH_EX waits on sysschobjs, which in turn shows up as high CPU.

The million-dollar question boils down to “what can be done about it?” Not much, it turns out. In this case I was assisting the vendor, and they have some homework to do with their code. There are essentially two things that can be done from their end: either cut down on the use of TEMPDB tables or implement an in-memory table instead. Now, this behavior might very well be due to a bug where the TEMPDB table caching is not working as expected, or someone has mistakenly put a CREATE TABLE #WHATNOT on the wrong side of a loop, but I’ll have to get back to you on that one.
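
To illustrate the caching angle: a temporary table created inside a stored procedure can normally be cached and reused between executions, which avoids hammering sysschobjs on every call. The cache is bypassed if, among other things, the table has a named constraint or is altered after creation. A contrived sketch (the procedure and table names are made up):

-- Caching-friendly: unnamed constraint, no DDL after the CREATE TABLE
CREATE PROCEDURE dbo.UsesCachedTempTable
AS
BEGIN
    CREATE TABLE #work
    (
        id  int NOT NULL PRIMARY KEY,   -- unnamed constraint, caching still possible
        val varchar(100) NULL
    );
    -- ... work with #work ...
END;
GO

-- Caching-hostile: the named constraint prevents the temp table from being cached,
-- so every execution creates and drops a real object in TEMPDB
CREATE PROCEDURE dbo.DefeatsTempTableCaching
AS
BEGIN
    CREATE TABLE #work
    (
        id  int NOT NULL CONSTRAINT PK_work PRIMARY KEY,
        val varchar(100) NULL
    );
    -- ... work with #work ...
END;
GO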

From the customer’s side there is nothing to do but throw hardware at the problem and impatiently wait for a fix from the vendor.


Three amigos screw up a cluster

Three amigos joined me for a cluster installation the other day. None of them was very friendly, so I decided to call them all out in one place so that others can steer clear.

The opinionated folder

Using mount point folders is a great idea for neatness and to avoid having a gazillion drive letters in play. It is also a great idea to review the best practices for SQL Server clusters on mount point folders, as there are a few gotchas. This blog post lays it all out.

If you fail to set up the correct permissions, funky stuff will ensue. The issue I’ve seen most often is something along the lines of missing privileges on SYSVOL\{some-long-GUID}. In that case, it’s time to read the blog post and berate oneself for trying to put stuff in the root folder.

The unwilling mover

The following error stumped me for a while:

“The resource ‘LogDisk (INST1)’ could not be moved from cluster group ‘Available Storage’ to cluster group ‘SQL Server (INST)’ “

For some reason the cluster didn’t feel like moving the cluster disks into the SQL Server group. Unfortunately I didn’t have enough time to really dig into why this occurs, but it does have something to do with running with mount point folders. Fortunately the solution is fairly simple – just create an empty cluster group, call it what you want your SQL Server cluster group to be called, and move the disks into it before you install SQL Server. This is usually very simple, but in this case the GUI didn’t let me move the disks. Basically nothing happened, which made it somewhat difficult to troubleshoot… When searching for the correct syntax for moving disks via PowerShell, I came across this blog post that covers this issue and the PowerShell workaround.

The lost agent (with a slight detour via insufficient privileges)


The SQL Server installation failed thoroughly when it became evident that the cluster name object (CNO) for some reason lacked the privileges to create computer objects. This led to the cluster failing to bring up the name resource, and subsequently to the installation being aborted. The solution is to make sure the CNO has the permission to create computer objects. This blog post goes into excellent detail.

After rectifying the above I decided to salvage the installation, and I noticed that there was something missing. Like the agent. In order to create that resource, take a look here. With that done, this specific issue was taken care of.


Back in the saddle – index time

Well, I’m back in the saddle. Four weeks of vacation, of which three resembled some kind of monsoon. The last one was great, and of course the weather became great as soon as I returned to work.

I decided to get myself a new script to show unused indexes, and started looking at Stack Overflow to see if someone had already taken care of it (someone usually has). It turns out that there was a blog post by John Pasquet of folioTek here, and a very interesting addition to said blog post (and scripts) by user HAL9000 of Stack Overflow here. It has become my go-to script for finding unused indexes. Thanks, HAL9000 and John!
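
For completeness, the core idea behind such scripts (this is my own bare-bones illustration, not John’s or HAL9000’s code) is to join sys.indexes to sys.dm_db_index_usage_stats and flag nonclustered indexes that have had no reads since the last restart – keeping in mind that the usage stats are wiped on every restart, so don’t trust the output on a freshly rebooted server.

-- Nonclustered indexes in the current database with no seeks, scans or lookups
-- since the last restart. Illustration only - not the script referenced above.
SELECT  OBJECT_NAME(i.object_id)  AS table_name,
        i.name                    AS index_name,
        us.user_seeks, us.user_scans, us.user_lookups, us.user_updates
FROM    sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
       ON  us.object_id   = i.object_id
       AND us.index_id    = i.index_id
       AND us.database_id = DB_ID()
WHERE   i.type_desc = 'NONCLUSTERED'
  AND   OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
  AND   ISNULL(us.user_seeks, 0) + ISNULL(us.user_scans, 0) + ISNULL(us.user_lookups, 0) = 0
ORDER BY us.user_updates DESC;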