I’ve been helping troubleshoot a SQL Server 2014 instance that tips over every hour, whenever a job is run on one of the application servers.
All eight cores of the server go through the roof together with the batch requests/sec counter – and then everything grinds to a halt. This goes on for about ten minutes (until the application servers are done with their horrors), after which the server settles back down to 5-10% CPU load and everyone breathes a sigh of relief. For about an hour.
So how does one troubleshoot an issue like this? Well, the first thing I did was to look at the wait statistics during one of these windows of disaster. This can be done in one of several ways, but I ran a script that takes a snapshot of sys.dm_os_wait_stats, waits a number of seconds (30 in my specific case), takes a new snapshot and calculates the difference (I used Paul Randal’s script HERE). This way I’m not burdened with whatever the server has been doing up to this point in time.
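The idea behind the differential snapshot can be sketched like this (a simplified version for illustration – Paul Randal’s actual script does considerably more, such as filtering out benign wait types and computing percentages):

```sql
-- Simplified sketch: diff two snapshots of the cumulative wait stats
SELECT wait_type, wait_time_ms, waiting_tasks_count
INTO #snap1
FROM sys.dm_os_wait_stats;

WAITFOR DELAY '00:00:30';   -- the 30-second window used in this case

SELECT s2.wait_type,
       s2.wait_time_ms        - s1.wait_time_ms        AS wait_time_ms_delta,
       s2.waiting_tasks_count - s1.waiting_tasks_count AS task_count_delta
FROM sys.dm_os_wait_stats AS s2
JOIN #snap1 AS s1 ON s1.wait_type = s2.wait_type
WHERE s2.wait_time_ms > s1.wait_time_ms
ORDER BY wait_time_ms_delta DESC;

DROP TABLE #snap1;
```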
From the result one thing sticks out somewhat – PAGELATCH_EX. So what IS PAGELATCH_EX anyway? A full explanation is beyond the scope of this blog post, but Paul Randal (as usual) has a great writeup about the wait HERE.
The next step was to find out which query was causing this wait, using sys.dm_os_waiting_tasks filtered on wait_type = 'PAGELATCH_EX'.
SELECT * FROM sys.dm_os_waiting_tasks WHERE wait_type='PAGELATCH_EX'
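To tie each waiting task back to the statement that is actually running, the view can be joined to sys.dm_exec_requests – a hypothetical extension of the query above, not something the original investigation required:

```sql
-- resource_description holds the page being waited on, as dbid:fileid:pageid
SELECT wt.session_id,
       wt.wait_type,
       wt.resource_description,   -- e.g. 2:7:1639
       st.text AS sql_text
FROM sys.dm_os_waiting_tasks AS wt
JOIN sys.dm_exec_requests AS er
    ON er.session_id = wt.session_id
CROSS APPLY sys.dm_exec_sql_text(er.sql_handle) AS st
WHERE wt.wait_type = 'PAGELATCH_EX';
```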
I ran this several times to see if there was a pattern to the madness, and there was: all the waits were concentrated in database ID 2 – TEMPDB. Many people perk up by now and jump to the conclusion that this is your garden-variety SGAM/PFS allocation contention – easily remedied with more TEMPDB data files and a trace flag. But alas, this was further inside TEMPDB. The resource_description column from the query above gave me the exact page number, and plugging that into DBCC PAGE reveals the object ID the page belongs to.
DBCC TRACEON(3604)
GO
DBCC PAGE(2,7,1639,0)
GO
This ID (shown as “Metadata: ObjectId” in the DBCC PAGE output) can then be passed to the OBJECT_NAME function to get the name of the object in question.
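Since the page lives in TEMPDB, the two-argument form of OBJECT_NAME resolves the ID without having to change database context. The ID below is a placeholder – substitute whatever ObjectId DBCC PAGE reported:

```sql
-- 2 = database ID of TEMPDB; 34 is a placeholder for the reported ObjectId
SELECT OBJECT_NAME(34, 2) AS object_name;
```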
It turned out to be sys.sysschobjs, which can be read about in detail HERE. In short, it is a system base table that exists in every database and contains one row per object – in this case, one row per object in TEMPDB.
Thus there is contention on the very object that tracks the existence of tables in TEMPDB. This boils down to the server creating and dropping TEMPDB tables at a rate that is simply too high for the poor server to handle. To see just how many TEMPDB tables were actually created and destroyed during these windows of despair, I took a look at the performance monitor counters SQL Server: General Statistics – Temp Tables Creation Rate and Temp Tables For Destruction. During most of the day these were running at around 1-2 tables created per second and 7-8 in the queue for destruction. Then this happened.
It is kind of obvious WHEN this window opens, yes? The numbers soar to 27 tables created per second and some 280 queued for destruction.
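The same counters can also be read from T-SQL via sys.dm_os_performance_counters, which is handy when you don’t have desktop access to the server. Note that Temp Tables Creation Rate is exposed as a cumulative value here, so you have to sample it twice and diff to get a per-second rate:

```sql
SELECT counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%General Statistics%'
  AND counter_name IN ('Temp Tables Creation Rate',
                       'Temp Tables For Destruction');
```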
So, to recap: the job on the application servers is creating temporary tables at a rate the database simply can’t keep up with. This leads to PAGELATCH_EX waits on sysschobjs, and the whole mess shows up as high CPU.
The million-dollar question boils down to “what can be done about it?”. Not much, it turns out. In this case I was assisting the vendor, and they have some homework to do with their code. There are essentially two things that can be done from their end: either cut down on the use of TEMPDB tables or switch to memory-optimized (In-Memory OLTP) tables instead. Now, this behavior might very well be due to a bug where the TEMPDB table caching is not working as expected, or someone has mistakenly put a CREATE TABLE #WHATNOT on the wrong side of a loop – but I’ll have to get back to you on that one.
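As an illustration of the kind of thing the vendor needs to hunt for – a hypothetical repro, not their actual code – here is the “wrong side of a loop” mistake, where every iteration performs a real create and drop against sysschobjs:

```sql
-- Bad: one create/drop per iteration hammers sysschobjs
DECLARE @i int = 0;
WHILE @i < 1000
BEGIN
    CREATE TABLE #whatnot (id int);
    -- ... work that could have reused one table ...
    DROP TABLE #whatnot;
    SET @i += 1;
END;
```

It is also worth knowing that SQL Server normally caches temp tables created inside stored procedures, but certain patterns – such as named constraints or DDL run against the table after creation – defeat that cache and force a genuine create/drop cycle on every execution.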
From the customer’s side there is nothing to do but throw hardware at the problem and impatiently wait for a fix from the vendor.