Thoughts on communities

I’ve just stepped on the train from Gothenburg and SQL Saturday #536, where I had the honor of giving a talk about consolidation with SQL Server 2016.

This SQL Saturday was organized by Mikael Wedham, a very friendly and exceedingly knowledgeable Swede. He had managed quite an amazing lineup of speakers this year: John Q. Martin, Erland Sommarskog, Mark Broadbent and Cathrine Wilhelmsen, just to mention a few. As events go, this one was a small-to-mid-sized event. I think I heard Mikael say that about 90 were registered and about 15-20 were no-shows. I’m sorry to see so many no-shows, especially as this is an all-volunteer event all around, but then I was very happy to see everybody else who DID show up and make the event spectacular.

This was my first SQL Saturday presentation despite having given talks and such for almost 20 years. I had the privilege of going first, meaning that I could relax and attend other sessions after mine. I’ve been to PASS Summit twice, and I’m going there this year as well, but I’ve had a surprisingly hard time getting to know new people there. Here it felt like I stepped into the legendary SQL family right through the door, and I’m very happy to say that I’ve made quite a few new acquaintances.

Listening to the attendees mingling about made it clear that a SQL Saturday is very much a social event; some people seemed to know each other from other PASS events, and some were at their very first. The feeling among the attendees was one of friendly curiosity, and everywhere I went I heard fragments of interesting conversations and people sharing their knowledge.

And that’s what the community is about – sharing knowledge. That’s why I’m very happy to be part of it and very happy to get the opportunity to share some of the knowledge I’ve spent 20 years accumulating. The IT landscape is always changing rapidly, but I’m here, now, during *this* change, and I intend to do what I can to help others. Come join me!

SQL Saturday #536

I had a fairly short vacation this year in anticipation of a trip to Japan next year, but the time running up to said vacation was … hectic, to say the least. Somewhere in the middle of this hectic workload I had the brilliant idea to send in a few abstracts for SQL Saturdays around the world. One fine day this summer I got a phone call from fellow Swede Mikael Wedham, organizer of SQL Saturday #536 in Gothenburg.

In short: I’ll be presenting “Unicorn safari – alleviating consolidation pains with SQL Server 2016” in two weeks’ time. The title does not entirely reflect the content, but I’m hoping my audience will forgive me. It’s an intermediate level talk, not necessarily because of the technical content but because a lot of business factors are involved.

SQL Saturdays are the best thing that’s happened since sliced bread, and what’s even better is that they’re free. Now is the time to go to Gothenburg and SQL Saturday!

Presentation skills – do you have them?

A large part of my work as a consultant is giving presentations and teaching. I’ve been presenting and teaching since 2000, and I feel I’m fairly good at it. When I started out I was young and inexperienced, something I compensated for with a huge ego instead. Needless to say, I had … mixed results. Thankfully I’m a quick learner and dialed back the ego and increased the use of my ears to a level where I consistently managed to both get my points across and get good scores.

I had my own company in parallel with my normal job at the University and taught for a company called Learning Tree. This was back in the early 2000s when there was still some money to be made for teachers and training companies alike, but after a couple of years this started to go downhill so I decided to shut down my company and focus on my main line of work.

True to form, I couldn’t keep my hands out of teaching so I did a few guest lectures and some internal courses here and there. Fast forward to my present job at Atea, where I’ve been actively involved in giving presentations from day one. I do presentations on anything imaginable when it comes to IT infrastructure, both for internal use and for customers. We also have a couple of events ranging from 100 people to almost 2000 people where I regularly talk or do demos.

In the summer of 2014 I attended a CompTIA CTT+ (certified technical trainer) course as a prerequisite to attaining MCT (Microsoft Certified Trainer) status in order to be able to teach Microsoft courses. This was held in London by a superb teacher by the name of Simon Garlick (of reZound UK). I’ve been through technical trainer courses a few times before, but it was a great refresher aimed at getting the students through CTT+ certification, something that involves videotaping yourself giving a presentation and keeping track of a myriad of details the judges want to see. I passed both the video and written exam parts and finally achieved the coveted MCT title – 16 years after first hearing about it.

Then along came my good friend and colleague Simon. Not only is he exceptionally bright, but he has a knack for finding interesting tidbits that he likes to share with me. This time he had found a book called “How to Avoid Death by PowerPoint” by a guy named David Phillips and claimed it was very good. I scanned it, realized he is a proponent of black backgrounds (something I have never understood), and decided to see what this was all about.

After reading said book, I found myself unable to look at Powerpoint in quite the same way again. I then watched his TEDx talk and very much liked what I saw.

A few months later Simon came along again and reminded me that this David Phillips person apparently gave a course on presentation skills and had taught a previous colleague of ours. Said colleague was very impressed, and since that is a rare occurrence we decided to book ourselves on a course to see what this was all about.

And boy, am I happy we did.

Having stayed over at the mansion next door, we walked the couple of hundred meters to the venue – another mansion. Here David greeted us and told us he had worked hard on creating the ultimate learning environment – complete with birdsong and absolutely no creaking floors.

We launched straight into the course and David mesmerized us from the moment he opened his mouth. The two days went by in a flash, and without giving too much away I can honestly say that it was the two best training days I’ve ever attended. Simon and I were pretty good when we arrived, but we were much, much better when we left. David managed to create an environment where eight people, most of whom had never met each other before, could present to and get feedback from each other as well as the instructor without any friction whatsoever. That is a feat in itself, but he went on to elevate all of us from our respective levels to a common standard that was higher than where I was when I came through the door.

We went through the neurological basis for learning, preparation, movement, body language and use of voice, to mention a few things. One aspect turned out to be key for me – preparation. I’ve been doing things more or less unplugged for 15 years – never again.

I can’t praise David highly enough, and I recommend anyone who is looking for a presentation skills course to look no further. The training is superb, the instructor is exceedingly good and the venue is outstanding. I can’t wait for the next level training.

Manhandling GoldenGate, part 3

Here are the different configuration files and such I used for the setup described in parts 1 and 2. Please note that I’ve removed the several hundred schemas for clarity, and I have exchanged the customer-specific schema names for the nonexistent SCOTT and TIGER schema names.

mgr parameter file on primary

extract parameter file on primary
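
A minimal sketch of what an integrated extract parameter file might look like in a setup like this – process name, trail path and credentials are illustrative, and only the two placeholder schemas are shown:

    EXTRACT ext1
    USERID ogg, PASSWORD ********
    EXTTRAIL ./dirdat/lt
    LOGALLSUPPCOLS
    UPDATERECORDFORMAT COMPACT
    DDL INCLUDE MAPPED
    TABLE SCOTT.*;
    TABLE TIGER.*;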

pump parameter file on primary
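
A similar sketch for the pump, assuming it runs in passthru mode and ships to the manager on the secondary (host name, port and paths are made up):

    EXTRACT pump1
    RMTHOST siteb.example.com, MGRPORT 7809
    RMTTRAIL ./dirdat/rt
    PASSTHRU
    TABLE SCOTT.*;
    TABLE TIGER.*;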

manager parameter file on secondary

replicat parameter file on secondary
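
A sketch of the replicat side, including the DBOPTIONS SUPPRESSTRIGGERS and discard file settings mentioned in parts 1 and 2 (names and paths are again illustrative):

    REPLICAT rep1
    USERID ogg, PASSWORD ********
    ASSUMETARGETDEFS
    DBOPTIONS SUPPRESSTRIGGERS
    DDL INCLUDE MAPPED
    DISCARDFILE ./dirrpt/rep1.dsc, APPEND, MEGABYTES 100
    -- REPERROR (1403, DISCARD) is only added temporarily during instantiation, as described in part 2
    MAP SCOTT.*, TARGET SCOTT.*;
    MAP TIGER.*, TARGET TIGER.*;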

Oracle parameters on primary/secondary
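
The Oracle side boils down to a handful of settings for integrated capture/apply; a sketch of the relevant instance parameters plus the database-level prerequisites (sizes are illustrative):

    -- required for integrated extract/replicat on 11.2.0.4 and later
    ALTER SYSTEM SET enable_goldengate_replication = TRUE SCOPE=BOTH;
    -- give the logmining/apply servers some dedicated memory
    ALTER SYSTEM SET streams_pool_size = 1G SCOPE=SPFILE;
    -- and at the database level
    ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;
    ALTER DATABASE FORCE LOGGING;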

GGSCI statements for the primary
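
A sketch of the GGSCI sequence on the primary, using the same illustrative names and paths as above:

    DBLOGIN USERID ogg, PASSWORD ********
    ADD EXTRACT ext1, INTEGRATED TRANLOG, BEGIN NOW
    REGISTER EXTRACT ext1 DATABASE
    ADD EXTTRAIL ./dirdat/lt, EXTRACT ext1
    ADD EXTRACT pump1, EXTTRAILSOURCE ./dirdat/lt
    ADD RMTTRAIL ./dirdat/rt, EXTRACT pump1
    START EXTRACT ext1
    START EXTRACT pump1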

GGSCI statements for the secondary
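
And the corresponding sketch for the secondary, where the SCN comes from the REGISTER EXTRACT step as described in part 2:

    DBLOGIN USERID ogg, PASSWORD ********
    ADD REPLICAT rep1, INTEGRATED, EXTTRAIL ./dirdat/rt
    START REPLICAT rep1, ATCSN <SCN from REGISTER EXTRACT>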

Manhandling GoldenGate, part 2

Let’s take a look at something that is not quite as well documented for anything other than a tiny, safe lab environment – instantiation of the secondary database.

There is a document from Oracle covering this (1276058.1), but it has not been updated since 2013 and is not very detailed. Instantiating an Oracle secondary can basically be done in one of three ways: initial load using GoldenGate, Data Pump or RMAN. The first alternative is the only way to go if you’re trying to set up a heterogeneous replication between, say, Oracle and SQL Server or similar; it strikes me as unnecessarily difficult for an Oracle to Oracle environment. The Data Pump variant seems to be the one Oracle and most bloggers prefer. In my case I discarded it as I’ve got over 300 different schemas and Data Pump simply isn’t as fast as RMAN. So I’m left with RMAN, and happily there are instructions in the Oracle document above to DUPLICATE a database. Trouble is, I can’t for the life of me get it to work – I’m always left with an error.

Another detail that is missing from the document is how to handle integrated extract and replicat. The classic extract can be started at a specific CSN (GoldenGate-speak for SCN) – if you’re running integrated extract you can’t.
When creating the extract process it needs to be registered with the database (which, as I said in the previous post, results in an SCN), and that SCN can be used for DUPLICATE-ing the secondary.

Time to test this out then.

  1. Set up Oracle and GoldenGate per all the standard tutorials
  2. Take a RMAN full backup
  3. Create extract and pump processes, register the extract process and write down the SCN
  4. Create a RMAN archivelog backup
  5. Transfer all the backup files to the secondary
  6. Run RMAN DUPLICATE UNTIL SCN using the SCN above to instantiate the secondary
  7. Set up the replicat process on the secondary and start it with START REPLICAT … ATCSN, using the SCN above (see the sketch below)
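
Steps 6 and 7 might look something like this in RMAN and GGSCI respectively – the SCN and all names are illustrative, and the sketch assumes the RMAN connections are set up as in the standard tutorials:

    RMAN> DUPLICATE TARGET DATABASE TO secdb
            UNTIL SCN 1234567
            NOFILENAMECHECK;

    GGSCI> START REPLICAT rep1, ATCSN 1234567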

Easy enough, right? Yes, unless there are open transactions. In the first tests I did there was no activity in my test database, and hence there were no open transactions. Everything worked like a charm on the first try (always an ominous sign).
Then it was time to do this on an actual test system running in a production-like environment. Steps 1-5 are identical, with the slight difference that the replicat process starts and then ABENDs (GoldenGate-speak for an abnormal stop) with OGG-01154 and SQL error 1403. Looking into this, it turns out that GoldenGate expects to find a row that for some reason is not there, and hence it will ABEND and kindly ask for help.

I decided to toy with different SCNs to DUPLICATE at, start the extract at and start the replicat at. No joy, everything amounted to the same thing – a data mismatch between the primary and the secondary. At this point I actually opened a case with Oracle Support, as I thought I had done something wrong or that something was broken. I won’t go into the less than enjoyable interactions with Oracle Support I’ve had through the years, but after several discussions to and fro they dropped a gem: “Whatever the type of initial load, the basic rule is to avoid open transactions.” Right – so how am I supposed to handle instantiation from a system in full swing? That’s a basic requirement for a high availability solution, I would think.
There is another support document (1347191.1) that covers “How to handle open transactions during initial load”, which is basically useless for integrated capture (remember REGISTER EXTRACT … DATABASE and the resulting SCN?), apart from the part where you can use Data Pump to manually handle the specific table(s) that have open transactions (how messy is that?).

Looking in more detail at what is actually happening in this case (checking the GoldenGate error log and toying with the logdump tool), it turns out that GoldenGate tries to DELETE a row that does not exist. This occurs since the INSERT was done prior to the SCN at which we REGISTERED the extract.

There are a few ways to handle this, but the way I settled on was to set the following parameter in the replicat parameter file:
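
    -- a sketch of the parameter in question; exact placement in the replicat parameter file is up to you
    REPERROR (1403, DISCARD)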

This tells GoldenGate to toss all the rows that result in error 1403 into the discard file for later analysis and handling. Mind you, having this parameter set permanently is a colossal risk in my view, as it could potentially lead to data loss. In this specific case I know what is happening and why, so I can live with having the 20-odd errors due to missing rows to DELETE end up in the discard file (and safely ignored).
Hence the recipe looks like this:

  1. Set up Oracle and GoldenGate per all the standard tutorials
  2. Take a RMAN full backup.
  3. Create extract and pump processes, register the extract process and write down the SCN.
  4. Run RMAN DUPLICATE UNTIL SCN using the SCN above to instantiate the secondary.
  5. Set up the replicat process on the secondary and start it with START ATCSN and use the SCN above.
  6. Wait for the replicat to ABEND and verify that the error returned is the one discussed above
  7. Set REPERROR 1403, DISCARD in the replicat parameter file and restart the replicat ATCSN as above
  8. When the lag time goes to zero (meaning that all the rows queued up in the trail file have been consumed), check the discard file and verify that the only “lost” transactions are related to DELETEs
  9. Stop the replicat process, remove the REPERROR parameter and restart without any ATCSN.

Going back to the primary is not quite as much fun, though. It would be quite possible to set up a unidirectional replication back to the primary, but that is outside the scope of this specific customer project. Instead the process looks something like this:

  1. Back up site B via RMAN (the old secondary, now running as the primary)
  2. Transfer the backup files to site A
  3. Back up archivelog on site B and shut down site B
  4. Transfer the final archivelog backup files to site A
  5. Restore site A
  6. Point the CNAME to site A
  7. Open site A for business
  8. Instantiate site B and recreate the GoldenGate relationship

Okay, having that in place, it is time to turn our gaze to the details of how to actually connect to the system. Remember that GoldenGate has no concept of a primary or secondary database, so both sides are open for business. This means that if anyone changes any data in any table that is included in the GoldenGate replication, the replicat process will ABEND and the net result is basically what we had above – a data mismatch. This is a fact of life and must be taken care of outside of Oracle/GoldenGate.

In our case we decided to go with a CNAME that points to the primary server. If the time comes to switch over to the secondary, we stop the database on the primary (to kill any remaining connections) and then point the CNAME to the secondary server. That’s it – no application servers or other systems have any need to know anything about GoldenGate, the secondary database or the like.

The reason for using a CNAME rather than a SERVICE_NAME in Oracle boils down to me not wanting to risk connections to the secondary. Since there is no built-in concept of primary/secondary database as with Data Guard, I can’t use a trigger to start or stop the SERVICE_NAME. And if I’m relying on a round-robin list in TNSNAMES to connect to the secondary in case the primary is down, Bad Things(tm) can occur should the old primary come up and accept connections for some reason. By using a CNAME this cannot happen, as the name will only resolve to the server currently set as primary (provided that no one screws the pooch and sets the IP address wrong, but let’s try to avoid that).
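
As an illustration (all names are made up), the TNSNAMES entry on the application side simply points at the CNAME, which in turn resolves to whichever server is currently primary:

    PRODDB =
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = TCP)(HOST = proddb.corp.example.com)(PORT = 1521))
        (CONNECT_DATA = (SERVICE_NAME = proddb))
      )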

To summarize: open transactions can bite you in the posterior, especially when running integrated extract. It does not have to be more than a nuisance if you know how to handle it, but it can be very difficult to find relevant information.

Using a CNAME to connect to the database means more flexibility and less risk than a SERVICE_NAME, even though it means relying on either a DNS administrator or delegating DNS administrative rights to the DBA.

Manhandling GoldenGate, part 1

I’ve been hard at work setting up a GoldenGate replication system between two sites to serve as an HA solution for a customer that is … unwilling to pay the rather high licensing fees for Oracle Enterprise Edition and Data Guard. Instead, we’ve opted to go with Standard Edition One 11g on two nodes and Oracle GoldenGate 12c as the way to synchronize data from site A to site B.

Before going into the setup side of things, there are a few things to consider when using this technology in this way. The first item to think about is that this is in fact a logical, transactional replication – not a binary log transfer (like physical Data Guard). That means that the secondary on site B is open for business – both read and write. That can be a very good thing, and it can be a very bad thing.

The second item to keep in mind is the list of limitations. With the new “integrated extract” that appeared in 12c the list keeps shrinking, but there are still some things that simply cannot be transferred using GoldenGate.

Third, GoldenGate is not too keen on instantiating the secondary with open transactions on the primary. Yes, you read that correctly – it might bite you when you start the replication. More about this in part 2.

Fourth, there is no concept of “primary” or “secondary”, so that has to be handled outside of Oracle. This is by design – an effect of the first item – and not a limitation of the software. I discuss this in more depth in part 2 as well.

Design

The design is fairly simple in itself: two databases that have no idea whatsoever that they have any kind of relationship; an extract process on the primary that is responsible for, well, extracting the data that goes into the transaction log on the primary; a pump process on the primary responsible for tossing the result from the extract over to the secondary; and finally a replicat process on the secondary that applies the transactions as they come in. The extract, pump and replicat processes are all part of GoldenGate.

Setup

There are quite a few tutorials on the internet that describe how to set up a simple unidirectional GoldenGate replication pair, so I won’t go into any details regarding the basic settings. With that said, my situation was a … bit different. I’ve got well over 300 schemas to keep track of, as well as both integrated capture (extract) and integrated replicat, and about those you can find a lot less information.
The integrated approach, as opposed to the classic GoldenGate processes, is Oracle’s recommended way of doing things these days, provided that you’re running on a new enough database (11.2+, preferably 11.2.0.4 or 12c). The devil is in the details, as the actual setup is very straightforward – basically just say TRANLOGOPTIONS INTEGRATED in the extract parameter file and specify INTEGRATED when creating the extract process definition (add extract ext1, integrated tranlog, begin now). There is one more thing needed – registering the extract process with the database. This is done in GGSCI like this: register extract ext1 database
There is actually some output from this: an SCN. This is the first SCN that the extract starts extracting from, and it can be used for instantiating the secondary database, either via Data Pump or RMAN.
Another slightly surprising detail is the way the 64-bit GoldenGate behaves from 11.2 onwards – it defaults the CACHESIZE parameter to 64GB. In short, this is very likely to result in a “cachesize too small” error, followed by abending (stopping) of the extract process. Oracle ACE Gavin Soorma has written a post about it here.

Other settings to note are:

ROLLOVER in the extract parameter file, which controls when to switch trail files (lest they become ginormous); PURGEOLDEXTRACTS in the manager parameter file to handle obsolete trail files; LOGALLSUPPCOLS, which together with the TRANDATA settings helps handle tables without keys; and a nice new feature in the shape of DBOPTIONS SUPPRESSTRIGGERS that, well, suppresses triggers on the secondary.

All right, the primary generally won’t provide any surprises. Sure, the parameter files for both the extract and the pump are by now rather large, as every schema that is to participate in the replication (in this case ALL of them) needs both DDL mappings and table definitions, but otherwise there is little that deviates from any tutorial out there.

Blog developments

Things are slowly getting back to normal after the sudden passing of one of our cats a few weeks ago. Sure, it was “only” a cat, but as my wife and I don’t have any kids (and have no intention of getting any either), the cats are our kids. Anyhow, I’ve updated the blog a bit to make it more useful to myself. I’m using Feedly on my iPad for RSS aggregation, and I had the bright idea the other day to do the same on the blog, as I don’t always carry my iPad around. Said and done, and there is a new link at the top for the aggregation page. I’m always on the lookout for new blogs, so if you have any suggestions, shoot me an e-mail or hit me up on Twitter.

I’m also in the process of revisiting and rewriting my best practice post for SQL Server 2008/2012/2014. Since this is always evolving I’ve decided to create a page specifically for it, and it will be appearing as soon as I’ve cleaned up and updated the code. Stay tuned!

A eulogy in a technical blog

I’ve been absent from the blog for a while. This has several reasons, ranging from work, to lots of things to do outside work and what I’d like to take a moment to talk about today – the death of one of my cats.

I got Salsa, as she was called, back in May of 2001 when she was about 8-9 weeks old. She and her sister Tesla (along with a couple of other kittens) were born on March 15th, 2001. Straight from the get-go it was obvious that they were very different from each other – most likely due to different fathers. Tesla is rather calm and thoughtful, and Salsa was everything you’d expect from a calico cat – all paws, more speed than control and an utter lack of consequence analysis. After less than a year my then-girlfriend and I separated, and I received sole custody of two small balls of fur.
This was the way of things until I met my wife 10 years ago.

Fast forward to May 2013, when we had a very bad week. Tesla broke a tooth and required surgery, Amys, my wife’s cat of 14 years, passed away, and Salsa was diagnosed with hyperthyroidism – a chronic, wide-ranging illness that would require medication morning and evening for the rest of her life. A year later Tesla was also diagnosed with this illness, something not unexpected as they were siblings.

The medication worked out fine despite some tricky periods where we had to adjust the dosage. She had a great life, lots of love, lots of toys. And food. She loved food. So much so that we had a blimp for a while (this was somewhat due to the medication, but mostly due to her love for all things edible). We weathered that as well, and have had the pleasure to share our lives with a generally healthy and always happy pair of cats. Until yesterday.

I had paramedical training back in the late stone age, and looking at her beside me in the morning I realized that Salsa showed signs of somewhat labored breathing. After a short phone discussion with the vet we drove the 10 minutes to the local clinic for an emergency chest X-ray. The news was not good. She had a rather large amount of fluid in her lungs and around her heart, pointing to congestive heart failure. This is not that uncommon in older cats, and especially not in an older cat with hyperthyroidism (and the resulting hypertension, despite medication). She received some diuretics and we were told to come back a few hours later for a follow-up X-ray to see if the fluid would clear.
After observing her for a few hours and noting no change in her breathing or level of activity, I decided to call my wife and ask her to meet us at the clinic. From Salsa’s clinical presentation and the knowledge that none of the reasons behind the heart failure would be in any way curable (and most likely not treatable to any larger degree), we decided then and there to put her to sleep and end her suffering.

So there we have it – one of my best friends of 15 years is no more.

I am very, very thankful for the 15 years we had together – and all the ups and downs associated with any life. The picture below was taken a few minutes before we went to the vet for the final time. She leaves a cat-shaped hole in our hearts and she will be sorely missed.

Sleep well, my beloved calico cat. Say hi to Amys for us.

(Photo: Salsa)

Three amigos screw up a cluster

Three amigos joined me for a cluster installation the other day. None of them were very friendly, so I decided to call them all out in one place in order for others to steer clear.

The opinionated folder

Using mount point folders is a great idea for neatness and to avoid having a gazillion drive letters in play. It is also a great idea to review the best practices for SQL Server clusters on mount point folders, as there are a few gotchas. This blog post lays it all out.

If you fail to set up the correct permissions, funky stuff will ensue. The issue I’ve seen most often is something along the lines of lack of privileges on the SYSVOL\{some-long-GUID}. In that case, it’s time to read the blog post and berate oneself for trying to put stuff in the root folder.

The unwilling mover

The following error stumped me for a while:

“The resource ‘LogDisk (INST1)’ could not be moved from cluster group ‘Available Storage’ to cluster group ‘SQL Server (INST)’ “

For some reason the cluster didn’t feel like moving the cluster disks into the SQL Server group. Unfortunately I didn’t have enough time to really dig into why this occurs, but it does have something to do with running with mounted folders. Fortunately the solution is fairly simple – just create an empty cluster group, call it whatever you want your SQL Server cluster group to be called, and move the disks into it before you install SQL Server. This is usually very simple, but in this case the GUI didn’t let me move the disks. Basically nothing happened, which made it somewhat difficult to troubleshoot… When searching for the correct syntax for moving disks via PowerShell, I came across this blog post that covers this issue and the PowerShell workaround.
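
The PowerShell workaround boils down to something along these lines – the resource and group names below are taken from the error message above, and this is a sketch rather than the exact commands from the linked post:

    # load the cluster cmdlets if needed
    Import-Module FailoverClusters

    # move the disk resource into the (pre-created) SQL Server group
    Move-ClusterResource -Name "LogDisk (INST1)" -Group "SQL Server (INST)"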

The lost agent (with a slight detour via insufficient privileges)

The SQL Server installation failed thoroughly when it became evident that the cluster network object (CNO) for some reason lacked the privileges to create computer objects. This led to a failure of the cluster to bring up the name resource, and subsequently to abort the installation. The solution is to make sure the CNO has the permission to create computer objects. This blog post goes into excellent detail.

After rectifying the above I decided to salvage the installation, and I noticed that there was something missing. Like the agent. In order to create that resource, take a look here. With that done this specific issue was taken care of.

GoldenGate surprises

I’ve been way busier than I expected, and I’ve got somewhat of a bad conscience for not having been able to update the blog. But here goes.

I’ve been tasked with setting up an Oracle GoldenGate solution for a customer. The environment is fairly straightforward, with two servers running Windows 2012R2, Oracle Standard Edition One 11.2.0.4 (yep, licenses bought before the new SE Two licensing) and GoldenGate 12.1.0.1. The idea is to get GoldenGate to handle replication from site A to site B in case of disaster.

GoldenGate is an interesting animal in itself, as it is *very* complex and has, in true Oracle fashion, somewhere along the lines of a gazillion parameters. In this case we’ve opted to set up GoldenGate on each of the two servers, despite the fact that the replication is strictly one-way (from site A to site B).

There are a lot of blogs (especially Gavin Soorma’s) that do an excellent job of explaining exactly how GoldenGate works, so I’ll focus on some of the more interesting findings so far:

Integrated or classic mode

GoldenGate 12c gives you the choice of using the classic way of sifting through the log files and shipping the result, or leveraging logminer to extract the data. Long story short: integrated extract and replicat support more data types and are faster. WAY faster. I tried banging 2M rows into a table and measuring how quickly the data crawled over to the secondary. It went from minutes to seconds. GoldenGate integrated extract/replicat *is* supported on Oracle Standard Edition (even though the integrated functionality leverages Oracle Streams, an Enterprise-only feature), but some things (like parallelism) are disabled, and Oracle for some reason recommends using classic capture/replicat on Standard Edition and Standard Edition One. Considering the speed increase with integrated, I’ll go with that, thank you very much.

The value of keys

Picture this: a single heap table with a couple of hundred thousand rows. Inserting the data is quick and GoldenGate does an excellent job of shuttling the data over to the secondary site B. Now let’s see what happens when we do a delete instead. The cursor comes back in a few seconds on the primary, and after hitting commit I started looking at the secondary. This is where things become interesting.
The CPU went to 50% and basically pegged there – this is due to the fact that I’ve got two CPUs in my testing box and one of them was completely saturated. Funny thing is that, apart from an initial flurry of disk and network activity, nothing seemed to be happening on the server. The rows were still in my table as well. Scratching my head, I took a look at long-running queries using a query by David Mann.
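
Not necessarily his exact query, but roughly along these lines – list what is currently executing, ordered by elapsed time, together with the statement text:

    SELECT r.session_id,
           r.start_time,
           r.status,
           r.wait_type,
           r.total_elapsed_time / 1000 AS elapsed_seconds,
           t.text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.session_id <> @@SPID
    ORDER BY r.total_elapsed_time DESC;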

It turned up this little gem: the DELETE was being replayed one single row at a time.

Wait, what? That basically means a RBAR (row by agonizing row) handling of the delete, which took about four HOURS(!). Lesson learned – ALWAYS make sure to have keys in your tables. And expect deletes to take a lot of time.

Memory woes

When stress testing the replication I started receiving quite a lot of bothersome memory problems (inability to reallocate, out of virtual memory, header corruptions, etc.). After searching the internet for answers and finding exactly nothing, it turned out that there is a small gotcha with the 64-bit version of GoldenGate – it defaults to 64GB of memory (which means a 128GB virtual memory pool). Since my machine only sported 8GB of memory, the process very much did get a memory problem. By changing the parameter CACHESIZE to a more reasonable number the problem went away.
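
In the extract parameter file that looks something along these lines (2GB is just an illustrative value – size it to what the box can actually spare):

    CACHEMGR CACHESIZE 2GB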

Finding the culprit

I had a case the other day where the customer called and told me that he had a bit of a performance problem. Well, I’ve heard that a few times over the years, and asked him to elaborate. He sent me a screenshot of SolarWinds Database Performance Analyzer and, well, it kind of points the finger to one thing:

(Screenshot: wait statistics completely dominated by ASYNC_NETWORK_IO)

Um. Yeah. The ASYNC_NETWORK_IO thing. It kind of … sticks out.

Okay, so how do we tackle this then? We know what waits are happening, but where do we go from here? Well, it’s time to find out exactly what is generating said waits. And this is where extended events come into play. Again. (Extended events are the coolest thing since sliced bread – there, I’ve said it.)

There are a few scripts I use when doing this, written by the venerable John Sterrett (see his blog post from 2013 here). There are a few small gotchas though – ASYNC_NETWORK_IO does not exist under that name in extended events; the wait type is called NETWORK_IO there. Hence the statements for capturing this specific workload look like this:
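
    -- A sketch along these lines (the session name is made up, and the map_key for the
    -- NETWORK_IO wait type varies between versions – hence the lookup):
    SELECT map_key, map_value
    FROM sys.dm_xe_map_values
    WHERE name = N'wait_types' AND map_value = N'NETWORK_IO';

    CREATE EVENT SESSION [capture_network_io] ON SERVER
    ADD EVENT sqlos.wait_info (
        ACTION (sqlserver.session_id, sqlserver.sql_text)
        WHERE opcode = 1              -- end of wait
          AND duration > 0
          AND wait_type = 99          -- replace with the map_key returned above
    )
    ADD TARGET package0.event_file (SET filename = N'capture_network_io.xel');

    ALTER EVENT SESSION [capture_network_io] ON SERVER STATE = START;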

 

 

Then you settle back and wait for the troublesome queries to crawl out of the woodwork. Little do they know that we’ve set a trap.

Upon emptying the trap after a few minutes, it became apparent that the culprit was one huge, automatically generated (meh) query. Time to get the execution plan for it using sys.dm_exec_query_plan. Trouble is (and this is the second gotcha of the scripts), said TVF requires the plan_handle, not the sql_handle. So we have to do a little digging around to get that – for instance by taking a peek in sys.dm_exec_query_stats (which might hold multiple plan_handles per sql_handle, mind you!). Plug said plan_handle into sys.dm_exec_query_plan and hey, presto! The execution plan in all its gory glory.
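
A hypothetical example of that digging (the sql_handle value is of course a placeholder):

    DECLARE @sql_handle varbinary(64) = 0x0;   -- replace with the sql_handle captured earlier

    SELECT qs.plan_handle,
           qs.sql_handle,
           qs.execution_count,
           qp.query_plan
    FROM sys.dm_exec_query_stats AS qs
    CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
    WHERE qs.sql_handle = @sql_handle;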

It was apparent that the queries could neither use indexes (hello, automatic code generation!) nor did they have a very good WHERE clause (why don’t we take ALL of our several million rows and send them to the client, yes?). And there you have the problem – bad code that tries to send too much data to the client. Bad code, no cookies for you.

I provided the customer with my findings and he went on his merry way, grumbling and heading for the developers.

PASS Summit 2015 – a quick reflection

PASS Summit 2015 is winding down and I am strolling through increasingly deserted hallways. This was my second PASS Summit, and I already know it will not be my last. I’ve been to Oracle OpenWorld in San Francisco six times, and compared to that conference this one is downright tiny. But the difference between the PASS community and the Oracle community is like night and day.

A quick background: I started out with Oracle back in ’97 and SQL Server shortly thereafter, and let’s face it – SQL Server was not a serious contender back then. Both SQL Server and Oracle have come a long way since then, and Oracle is ahead in many respects. I find that a product is very much dependent on its user base, and the more vibrant a community is, the greater success said product tends to enjoy.

The Oracle community is HUGE – quite a bit larger than the SQL Server ditto – but it is also, in my opinion, very much more closed and, to be honest, spiteful. The thought of walking up to one of the big names in the Oracle community at OpenWorld is fairly alien to me, especially after having been rather rudely dismissed doing just that a couple of years back. Looking at the SQL Server community, the opposite is true. Someone tweeted something along the lines of “don’t be afraid to talk to an MVP – everyone was a beginner at some point”.

Last year I watched in awe at these mythical MVP and MCM creatures, and I didn’t realize until the last day that they actually meant it when they said “come talk to me, I love to interact with the community”. This year, I’ve had the pleasure of talking to quite a few MVPs and other speakers, and everyone has been kind, courteous, genuinely interested in the discussion and above all – very, very helpful.
This community feels like coming home after a long day’s work – kicking off the shoes and falling onto the couch, not needing to worry about anyone trying to badmouth someone else, block you from speaking at events or just mess with you out of sheer spite.
I’m sure readers from the Oracle community might disagree, but, well, this has happened to me and I’ve had enough of it. I’m closing in on 40 and have been doing databases for half my life, and I’m fed up with people trying to screw me over.

I’ve been contemplating trying to get on the speaker’s circuit at events for quite some time, but have not had the balls to commit, to be honest. Sure, I’m an MCT and have no problem teaching a wide variety of subjects, but what if I screw my demos up and people subsequently hate me? Well, the tipping point came just a few hours ago.
I had the pleasure of attending a session on advanced restore methods given by the exceptional Jes Borland. Most of her demos bombed, and she smiled and told the audience she was terrified as this is the worst thing that can happen to a presenter.

And they laughed.

Not in spite, not at her incompetence, but at the fact that here is someone not so far removed from them showing that she’s human too. I’m sure most if not all of the people in the audience saw something of themselves in Jes at that moment.
Oh, and she did her usual stellar job of delivering the presentation despite the demo issues, and I’m sure everyone left the session happy and more informed. It dawned on me that I’m the only one holding myself back.

By putting this on my blog I’ve kind of painted myself into a corner – now I have to follow through. As soon as I get home I’ll start drawing up proposals and make a sort of map of where I am and where I want to go with my speaking career. Thank you PASS and thank you Jes for today. The future just got a lot more scary – just the way it’s supposed to be.

*UPDATE*

I think I was just taught a lesson in how Twitter works. I originally just tweeted the link to this post to Jes with a thank you note, and then she retweeted it. And then it got retweeted again. And again. More people than I could have imagined have seen the tweet (and link to this blog), and the website statistics have … changed somewhat from a normal day… 😛

Azure Automation and index maintenance gotchas

I decided to write up a blog post about index maintenance in SQL Azure databases (yes, I’m gravitating towards Azure just like everyone else these days). I had everything planned, what research to do, the case I was going to use, the whole nine yards. It would have been epic, had it not been for Pieter Vanhove, who’s already done it. Worse still, he did it way better than what I had planned.

http://pietervanhove.azurewebsites.net/?p=14137

Somewhat miffed, I decided to play with it anyway (just not write a blog post about it, no, I’m not bitter 😛 ).

Turns out there’s quite a gotcha with this:

AzureDB does not support cross database queries, which means that since the runbook PowerShell script creates a connection to the MGMTDB, it is ONLY within that database that the queries will execute. You can check this using a simple select * from sys.databases – it will ONLY show you the user database you’re in and the system databases. Ergo, Ola’s script will only execute on the database it is installed in. Okay, you say, let’s create the scripts in the MASTER database (generally a bad idea, but hey) and run them there, since if I create a connection to the MASTER database, sys.databases shows every other user database on the server. Turns out that, well, you can’t. There seems to be no way to create objects in the master database, so we’re stuck at the moment.

Ola’s solution is to create his scripts in every user database, but while this works most excellent from a performance and integrity standpoint, the amount of manual labor required to either create new runbooks for each and every database that gets created (and remove said runbook for decommissioned databases) OR create a potentially HUGE runbook that encompasses all databases (that also needs to be manually maintained, ugh) is simply staggering. The hunt for a solution is on.

 

*EDIT*

Aha! While attending the PASS Summit I got word from Brent Ozar that cross-database queries had arrived to Azure. I’ve yet to track down Ola and ask him about his scripts as the implementation is … maybe not quite as straight-forward as one might think, but the solution is a lot closer. Oh, and I spoke to Pieter as well – he’s updated the original blog post with his workaround. It’s a good example of what can be done with some ingenuity and a runbook. Check it out!

Back in the saddle – index time

Well, I’m back in the saddle. Four weeks of vacation, of which three resembled some kind of monsoon. The last week was great, and of course the weather became great as soon as I returned to work.

I decided to get myself a new script to show me unused indexes, and started looking at Stack Overflow to see if someone had already taken care of it (someone usually has). Turns out that there was a blog post by John Pasquet of folioTek here, and a very interesting addition to said blog post (and scripts) by user HAL9000 of Stack Overflow here. It just became my go-to script for finding unused indexes. Thanks, HAL9000 and John!

 

Vacation! And a few gotchas…

The blog has been slow lately due to the impending vacation. I’ll toss in a few things I stumbled on the last week though:

When dealing with a cluster, chances are that some of the disks just won’t be returned to the fold when deleted from available storage. For some reason, the cluster sometimes keeps the reservations on some disks, leading to some rather weird error messages. The solution is to use PowerShell to give said reservations the boot like this:
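
    # a sketch, assuming the FailoverClusters module – clears the stale SCSI reservation on disk number X
    Clear-ClusterDiskReservation -Disk X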

where X is the disk number from Get-Disk.

Speaking of clusters: trying to find which node SQL Server is currently running on can be a bit of a challenge from within SQL Server. Try this script that I found over at SQLMatters.com:
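
    -- probably not their exact script, but it answers the question
    SELECT SERVERPROPERTY('MachineName')                 AS VirtualServerName,
           SERVERPROPERTY('ComputerNamePhysicalNetBIOS') AS CurrentNodeName;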

I’ve also thoroughly enjoyed setting up a 2008R2 cluster on Windows 2012R2 and mount points. That’s also riddled with some interesting … features. A couple of good links are an mssqltips.com article here, a Microsoft Connect entry here and a blog post from Jacob Moran from 2009 here.

Found a nice script to set up instant file initialization with PowerShell, written by Ingo Karstein:

and finally a nice script from Eric Humphrey to set SQL Server trace flags with Powershell:

Differences between Oracle and SQL Server transaction log files

When moving between Oracle and SQL Server, one of the first gotchas is how the transaction log is handled differently by the two engines. In Oracle, the transaction log is a pure transaction log in that it contains only the change data and not the rollback information. SQL Server combines the transaction log and the rollback data into the same file.
This leads to some interesting differences in behavior. Let’s take a peek at Oracle first. Oracle divides the different workloads from each other; the transaction log only contains change data, while the rollback data is stored in a separate rollback tablespace. This also means that the transaction log size is fixed. When a transaction log file is filled up, a so-called log switch occurs, where the log writer process switches to the next log file in the group and starts to write there. This is a good time to point out that the transaction log on Oracle requires at least two (preferably at least three) log groups, and that each group can consist of several mirrored members (preferably on different drives).

When the log writer switches out of the now full transaction log file, the archiver process steps in and starts reading and copying the transaction log to a new file called the archive log (the destination is specified by the log_archive_dest_n parameters). This process then repeats itself as soon as the log writer is finished with a transaction log.

As previously mentioned, the transaction log size is fixed, but the number of archive logs produced varies depending on the workload. According to the classic best practices for Oracle, a log switch should occur about every 20-30 minutes. That means that the logs should be sized accordingly, but how many DBAs have a consistent workload that always fills up a transaction log in the exact same way?

Let’s think about the consequences of file size for a short while. Files that are too small mean that the log writer process might be blocked by the archiver process, since the archiver is slower (it has to read AND write). This means poor performance and very hammered transaction log and archive log disks.
The other alternative is even worse – should the transaction log disk disappear into a black hole (it *might* happen!), you will lose all the data that’s still in the transaction log (and has yet to be archived).

The solution is to create large enough (or numerous enough) transaction log files that the archiver has time to scurry out of the way before the log writer switches back to the first file in the log group during the most intensive workload. This might mean that the files are either too large or too numerous to get a log switch every 20-30 minutes during most of the day, but that is easily taken care of with the parameter archive_lag_target. This parameter, set in seconds, specifies the maximum amount of time that can pass before an explicit log switch (and hence archiving) is invoked.
The sad thing is that I’ve rarely ever seen anyone set this parameter.
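
For example, forcing a log switch at least every 30 minutes looks like this:

    -- 1800 seconds = 30 minutes
    ALTER SYSTEM SET archive_lag_target = 1800 SCOPE=BOTH;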

SQL Server, then. The transaction log has a way of growing, sometimes out of hand. The good thing is that we have a continuous log chain, and the bad thing is that its size varies wildly. Where Oracle switches the log file automatically (and hopefully in a consistent, reasonable way), SQL Server does not, and hence it continues to build rather large transaction log files. There is always a hot discussion about what a good interval between log backups is, but as with the Oracle discourse above, it all boils down to how much data you are prepared to lose. Some of my customers run a log backup every five minutes, some do it once per day (don’t get me started). The really bad thing about SQL Server when it comes to log handling is the fact that you can’t mirror the transaction log file from within the database.

Basic settings for SQL Server 2012/2014, part 4

Part 4 – scripts for the previous parts

Below are scripts for most of the SQL Server specific settings I discussed in the previous three parts. Please note that not quite everything is included (trace flags, instant file initialization, etc.).

Migrating databases, part 2

Having gone through the truly offline ways of transferring data, let’s take a look at log shipping and database mirroring. These are fairly old techniques – log shipping has been around since SQL Server 2000, and database mirroring arrived in SQL Server 2005. The concept is very simple – every once in a while, data from the transaction log is sent to the secondary site and applied there. Then the cycle restarts, and after a set time a new shipment of transactions occurs.

In the case of log shipping, this always happens asynchronously. There will always be a lag between the primary and the secondary instance, something to be very much aware of.
Okay, so what’s the fuss with this whole log shipping thing? Well, consider a system with 2TB worth of data. Stopping said system, backing up, transferring the backups and then restoring might be difficult due to time constraints. (If you have the time, go for it, as it is basically bulletproof.) With log shipping, you can take a backup on a running system, transfer said backup when convenient, restore the backup on the secondary also when convenient, and then prepare for log shipping.
By setting up log shipping, you shorten the time for the final switchover to a very small window. Say for instance that you set your log shipping job to execute every ten minutes – then you only ever have a maximum of ten minutes’ worth of data to back up, transfer and restore on the secondary site, shortening the switchover time considerably. It works fine in all editions of SQL Server. Sounds easy? Well, it is. With a few gotchas.

First and foremost, as with all the techniques described in this and the previous blog post: you have to manually transfer the logins and users. If you don’t, things will be somewhat difficult in a switchover scenario – difficult in the sense that nobody will be able to access the database. Bummer.
Second, there is no automatic failover to the secondary site. You can script a failover, but there is no automatic provision for doing it with SQL Server. If you need automatic failovers, then this is the wrong feature to use.
Third, there is no provision for client transfer. You need to take care of getting the client to the server by yourself, and my top tip here is to use DNS pointers – this way the client won’t need to be updated to point to the new server, as this is taken care of with DNS. Just remember to keep the TTL down…

One thing that makes log shipping vastly superior to, for instance, Data Guard is the fact that it’s quite possible to use log shipping to migrate to a newer version of SQL Server. This can potentially shorten the time for an upgrade by quite a bit, and it simplifies testing as well.

Brent (as usual) has a great FAQ about log shipping: http://www.brentozar.com/archive/2013/03/log-shipping-faq/

In order to shorten the switchover time even more, there is something else called database mirroring. It appeared in SQL Server 2005 and was marked for deprecation in SQL Server 2012. It still works, but, well, don’t use it for production stuff. Under the hood it’s a bit different from log shipping; where log shipping uses a basic backup-and-copy to get the files to the secondary, mirroring uses TCP endpoints and individual transaction log records. This means that it’s possible to have the log data transferred synchronously to the secondary – in fact, that’s the ONLY way to do it on Standard Edition. With Enterprise Edition you may have an asynchronous mirror, but on Standard you’re stuck with synchronous.

Mirroring has a longer list of requirements than log shipping (TCP endpoints, keys, certificates if used, database name, etc.) to keep track of, but generally speaking it’s more a case of “different” requirements than “more difficult” requirements.

Practical tips

There is an old but still very relevant blog post about the differences here: https://nilebride.wordpress.com/2011/07/24/log-shipping-vs-mirroring-vs-replication/

Log shipping step by step:
http://www.mssqltips.com/sqlservertip/2301/step-by-step-sql-server-log-shipping/

Database mirroring step by step:
http://www.mssqltips.com/sqlservertip/2464/configure-sql-server-database-mirroring-using-ssms/

Migrating databases, part 1

Migrating databases

Most of us poor souls who for some reason enjoy this work called “database administration” get to meet “Mr. Migration” from time to time. I’ve lost count of the number of customers in need of moving one or more (most often A LOT) of databases from one server to another. The reasons for this are as many as the databases themselves: licensing issues, new underlying design, performance problem alleviation or frankly anything under the sun. Since the ball usually lands squarely in our court when it comes to trying to please all members of the project – be it the end users that don’t want any downtime, the project management that doesn’t want any risk or the sysadmins that don’t want to do any work ( 😀 ) – I decided to write down my different takes on the issue. Basically there are four ways to transfer data, each with its own specific challenges. This is part 1, covering detach/attach and backup and recovery.

As always when doing I/O in SQL Server, it is well worth mentioning alignment and NTFS allocation unit size. By default, Windows formats a disk with an NTFS cluster size of 4KB. Since SQL Server generally works with 64KB blocks (eight 8KB pages), this means an increased load on the I/O system – up to 16(!) I/O operations per SQL Server block retrieved. This can lead to a serious performance degradation even before we’re out of the gate, so the best practice is to set this to 64KB. Trouble is, you have to reformat the drive to do it…
Alignment used to be a much larger problem than it is today, but it still rears its ugly head from time to time. Windows 2008 and above (usually) takes care of this by itself, but always double check your alignment.
Checking NTFS allocation size is done by putting the below into an administrator-level command prompt, changing the drive letter:
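
    fsutil fsinfo ntfsinfo D: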

and looking for Bytes per Cluster.

Checking the alignment can be done by putting the below into an administrator-level command prompt, looking for DeviceID and StartingOffset:
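
    wmic partition get DeviceID, StartingOffset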

Detach/attach

The simplest migration technique I know of is probably an old-fashioned detach-copy-attach. First of all, let’s get one thing out of the way: due to the way SQL Server is designed, most of the time the logins are not included in the backups or database file copies. (I say most of the time, since from SQL Server 2012 onwards there are provisions for a “contained” database, which can authenticate its own users.) This lack of logins is a problem associated with every type of migration I’ll go through here, so keep it in mind.

Detaching a database means excising it from the master database of the source server, copying the actual database and transaction log files to the destination and plugging them into the new master database. This means no extra time for backing up or restoring the data itself, and in the case of a move, no need for extra disk space. The detach and attach operations always take the same time regardless of database size. In my opinion, that’s where the appeal ends. Consider the scenario where the database detaches fine but for some reason won’t attach. Yes, it can happen, even if it is very rare, and no, it’s neither a great time to troubleshoot, nor is it very good for your blood pressure. I’ve been a DBA for quite some time, and I always strive to keep everything redundant. In the detach/attach scenario, I have the potential to saw off the branch of the tree I’m sitting on, and most of the time I don’t enjoy falling unceremoniously to the ground, as this does nothing for my image.
Also consider that the entire files are copied – even if there is 99% free space. Nothing is done to the data/transaction log files, so a top tip is to make sure that you’ve done your housekeeping well before embarking on the detach/attach journey.
Onward to a more elegant migration path instead!

Backup/recovery

Another solution with very low risk is a simple backup and recovery. The database(s) are backed up on the source system, transferred over to the destination and restored. There are a few issues that might sink this migration path, though – downtime and the need for a lot of disk space come to mind. The added disk space can be somewhat alleviated using SQL Server backup compression, available from SQL Server 2008 Enterprise Edition and SQL Server 2008R2 Standard Edition. It is turned off by default, but can yield surprisingly good results if turned on. Try it out and see what the result is for your specific data.
The downtime issue is a bit more difficult to handle. There will be downtime – from the moment you start the backup, through the transfer to the destination host and the restore. There are ways to shorten the backup/restore time though, but before starting to fiddle with the knobs, we first need to find out what settings are used by default. This is easily done via two trace flags:
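
    -- a sketch: 3213 prints the backup/restore configuration parameters, 3605 sends that output to the error log
    DBCC TRACEON (3213, 3605, -1);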

These flags will log the exact backup settings used to the SQL Server log file. Armed with the information about what settings are used at the moment, let’s see how we can increase the backup speed.

1. Use compression – this shortens the backup time and the restore time. The actual savings depend entirely on what kind of data you have and how compressible it is. But do try compression as it generally rocks.

2. Set BUFFERCOUNT – this specifies the total number of I/O buffers that will be used for the backup operation. The total space used by the backup buffers is determined by: buffercount * maxtransfersize. Increasing BUFFERCOUNT can significantly increase your backup performance, but keep in mind that you are using up memory in the SQLSERVR.EXE process. As always, test this out before throwing it in production.

3. Set MAXTRANSFERSIZE – this setting specifies the largest unit of transfer used between SQL Server and the backup media.
The possible values are multiples of 64 KB ranging up to 4194304 bytes (4 MB). The default is 1 MB. This is also displayed in the log with the trace flags above.

4. Set BLOCKSIZE – this setting specifies the physical block size for the backup files, in bytes. The trouble is that it defaults to 512 bytes for everything that isn’t a tape…

5. Increase the number of backup files. This does complicate the restore somewhat, but the rewards vastly compensate for this. As long as your disk subsystem can keep up (especially if you’re running SSDs), this setting alone will give you a serious performance boost.
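
Pulling the knobs above together might look something like this – database name, paths and values are purely illustrative, so test against your own environment:

    BACKUP DATABASE [MyBigDatabase]
    TO  DISK = N'D:\Backup\MyBigDatabase_1.bak',
        DISK = N'D:\Backup\MyBigDatabase_2.bak',
        DISK = N'D:\Backup\MyBigDatabase_3.bak',
        DISK = N'D:\Backup\MyBigDatabase_4.bak'
    WITH COMPRESSION,
         BUFFERCOUNT = 50,
         MAXTRANSFERSIZE = 4194304,
         BLOCKSIZE = 65536,
         STATS = 5;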

The only way I know of to increase the transfer speed between the hosts is to transfer less data – and this is where compression comes in. Depending on the data, I’ve seen between 30% and 80% reduction in backup file size. Less data to transfer, and a faster restore due to compression.

Combining all the ideas above with a differential backup solution is another interesting way to shorten the actual downtime. Let’s assume, for the sake of the argument, that the database is 1TB and it takes 1 hour to back up, transfer and restore. An idea is to first do a full backup, then transfer and restore it without recovering the database – this is done without downtime, whenever it is convenient for everybody involved. When it is time for the actual migration, stop the application, do a differential backup, transfer that and restore it. Depending on the transaction volume this will most likely be a much smaller backup file, meaning that the entire operation can be done much faster. This is nothing new in any way, but I’m just pointing it out so no one forgets the old school stuff.
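
A sketch of that old school approach (database name and paths are made up):

    -- ahead of time, with the application still running
    BACKUP DATABASE [MyBigDatabase] TO DISK = N'D:\Backup\full.bak' WITH COMPRESSION;
    -- ...transfer the file, then on the destination:
    RESTORE DATABASE [MyBigDatabase] FROM DISK = N'D:\Backup\full.bak' WITH NORECOVERY;

    -- at migration time, after stopping the application
    BACKUP DATABASE [MyBigDatabase] TO DISK = N'D:\Backup\diff.bak' WITH DIFFERENTIAL, COMPRESSION;
    -- ...transfer the (much smaller) file, then:
    RESTORE DATABASE [MyBigDatabase] FROM DISK = N'D:\Backup\diff.bak' WITH RECOVERY;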

While we’re on the subject of differential backups, what if there was a way to keep the destination host in sync or similar? Well, that’s the subject of the next post!

Basic settings for SQL Server 2012/2014, part 3

Part 3: simplifying the life of the DBA

When I’ve got the environment set up quite the way I like it, I usually set up a repository where I keep all my scripts. I designate one server as the “master” for my purposes, and set up a database called DBADB. In here I put all my standard scripts (like the bunch from Brent Ozar & co., SP_WHOISACTIVE from Adam Machanic, the log table for Ola Hallengren’s maintenance solution and such) to keep track of them, and to have only one place to update.

In my standard setup scripts for new servers, I also create a local DBADB as well as a local user for replication. When this is done, I enable replication between the “master” server and the new server, and presto! I’ve got a) a central repository for scripts and b) all the scripts available on all my servers.

Set up a good index maintenance solution, like either Ola Hallengren’s script suite or Minion Reindex from Sean McCown of MidnightDBA fame. These run circles around maintenance plans and do a much better job of keeping your database server in good working order.

Ola’s scripts are here, and the Minion Reindex scripts here.

Create DNS aliases for each and every database connection there is. After making sure that each and every system accesses the DNS alias instead of the actual server name, I’ve simplified my life tremendously should I ever have to move a database to another server. A good example of this was a customer of mine who had to move a whole lot of databases from an old server to a new one. Since we had the DNS alias solution in place, we only had to stop the databases, move them, change the DNS pointer and restart – we knew for a fact that we did not have to change anything in the client connection strings.

Document everything. Yes, this falls squarely under the category “simplifying the life of the DBA”. Good documentation simplifies your life. Every. Single. Time. Keep a good record of which jobs run when on which servers. Document which users have sysadmin privileges and why. Make sure that any deviations from the standard disk setup are meticulously written down. Catalog the install options you’ve chosen for each server. The list can go on and on, but I think you get the point.

Add operators, alerts and be proactive. There are several problems that can be proactively monitored by SQL Server itself. By creating operators and connecting these to e-mail addresses or pager alerts, it is possible to wake people up in the middle of the night if something goes wrong, as opposed to remaining completely oblivious to the fact that the server is slowly sinking in the depths. With alerts, whenever SQL Server logs an error of a specific severity, the alert is triggered and something interesting happens in the life of the DBA.

Brent Ozar has a script to create the alerts here: http://www.brentozar.com/blitz/configure-sql-server-alerts/ Also remember that when creating the operator, it is quite possible to specify several e-mail addresses if you just separate them with a comma.
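For the sake of illustration, a hand-rolled sketch of one operator and one alert – the names and address are made up, and Brent’s script covers the whole lot for you:

EXEC msdb.dbo.sp_add_operator
     @name = N'DBA Team',
     @enabled = 1,
     @email_address = N'dba-team@example.com';

EXEC msdb.dbo.sp_add_alert
     @name = N'Severity 019 - fatal error',
     @severity = 19,
     @delay_between_responses = 60;   -- seconds, to avoid an e-mail storm

EXEC msdb.dbo.sp_add_notification
     @alert_name = N'Severity 019 - fatal error',
     @operator_name = N'DBA Team',
     @notification_method = 1;        -- 1 = e-mail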

Set up a failsafe operator – this is basically the go-to guy if no-one else can be found to harass. Take a look at this post on MSDN: https://technet.microsoft.com/en-us/magazine/gg313743.aspx

Change the settings on the model database to something sensible. Increasing a database file by 1MB every time it becomes full is not only dumb but also very wasteful. The problem becomes even bigger with transaction logs and virtual log files (see this youtube link for more info – https://youtu.be/lcmYeE-cqQo). The solution is to change the settings for the model database to something sensible (a 50-100MB size increase is a good starting point – adapt to your specific needs!) so that every database that gets created at least has sensible growth settings.
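A minimal sketch, assuming the default logical file names in model (verify with sys.master_files first):

ALTER DATABASE model MODIFY FILE (NAME = N'modeldev', FILEGROWTH = 64MB);
ALTER DATABASE model MODIFY FILE (NAME = N'modellog', FILEGROWTH = 64MB);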

These are the things I do every time and at every client. Nothing is new, nothing is exotic, it’s just hard, consistent work. And despite it being nothing new, it can’t be said often enough…

Double Whammy

So there I was, gearing up for a trip to a customer 90 minutes away and my audiobook had just finished. Having listened to Matan Yungman at SQL PASS in Copenhagen, I vaguely remembered him saying something about SQL Server Radio, so I decided to check it out. Turns out they’ve been doing it for a while, and the show length (around 60 minutes) suits me just fine. Matan Yungman and Guy Glantser both work for Madeira SQL Services, and are well known speakers in the SQL Server community. The only problems I’ve found are that they’ve only done 19 shows so far (and I spend A LOT of time in my car 😛 ) and that they have some issues with sound and mixing levels. This I’m sure will be fixed over time, as it’s not an easy proposition to create and run a podcast.

So, 10 minutes into my journey I fire up episode 18, where they speak to Sean McCown of MidnightDBA fame. He’s running a consulting company with his wife Jen, as well as hosting a weekly SQL Server web show called DBAs@Midnight. The subject of the discussion on SQL Server Radio was index maintenance, and more specifically Sean’s take on it with his script Minion Reindex. Now, I’ve been using Ola Hallengren’s script suite for YEARS and like it very much. It has a few limitations and I wrote some wrapper scripts around it a few years back, but all in all, I find it to be a great solution for most of my clients. I was a bit put off by Sean’s bashing of Ola’s scripts, and personally I feel he was a bit too harsh. It’s one thing to feel that something is good but that you could do better; it’s something else entirely to say that it’s bad and then do something better. Of course, I’m Swedish and Ola is Swedish, so my countryman defense probably kicked into gear. But I digress.
The discussion was exceptionally interesting and I found myself wanting to pull over and download the documentation for the solution then and there. Fast forward to today: I’ve had the time to download the script and play with it a bit, and my first impression is that it might be the coolest thing since sliced bread, as the saying goes. The granularity is a real gem and the included documentation is excellent. I very much recommend that everyone check out Minion Reindex.

So there you have it – an ordinarily boring drive turned into a double whammy with both SQL Server Radio and Minion Reindex. Eagerly looking forward to both Minion Backup and the upcoming Minion CheckDB.


Links:

Ola Hallengren

MidnightDBA

SQL Server Radio


Basic settings for SQL Server 2012/2014, part 2

Part 2: settings inside SQL Server

When the server is all good and installed, it’s time to go through and set up a bunch of server settings. First of all – the dedicated admin connection (DAC). This is a “get out of jail free” card for a DBA if all the schedulers are cranky and won’t accept connections for one reason or another. Using the DAC, it might be possible to sneak in through the “back door” and solve whatever problem is plaguing the system without having to go through the hassle of a reboot or service restart. Turn. It. On. Period.

https://msdn.microsoft.com/en-us/library/ms190468.aspx
https://msdn.microsoft.com/en-us/library/ms189595.aspx
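Enabling the remote DAC is a one-liner (the local DAC is always available); a quick sketch:

EXEC sys.sp_configure N'remote admin connections', 1;
RECONFIGURE;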

Optimize for ad hoc workloads was introduced way, way back to minimize the impact of ad hoc plans in the plan cache. Simply put, it works by not storing a complete plan in the cache until it is reused at least once. Instead a “stub” is stored, and less memory is consumed by plans that are never reused. Having read the same recommendation from several sources, I always turn this on – BUT – keep an eye out for surprises in the plan cache.

Kimberly L. Tripp lays it all out here: http://www.sqlskills.com/blogs/kimberly/plan-cache-and-optimizing-for-adhoc-workloads/ and her post links to a gazillion other posts.
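Turning it on is simple enough; a quick sketch:

EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sys.sp_configure N'optimize for ad hoc workloads', 1;
RECONFIGURE;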

Memory, then. The default setting for MIN_SERVER_MEMORY is 0 and MAX_SERVER_MEMORY is 2147483647 MB. Yes. The technical term for that amount is “a lot”. It’s a really bad idea to leave this setting as is, as SQL Server is a bit like my late lawnmower – it will eat EVERYTHING in sight. This comes with some consequences for wildlife, geography and neighbours, or in this case – other consumers of memory. There are several ways to come up with a number for MAX_SERVER_MEMORY, but a simple rule of thumb is to leave the operating system 10% of total memory or 4GB, whichever is greater, and give SQL Server the rest. Suddenly it wasn’t so great to give the server 6GB of RAM, riiight?
NOTE: the previous number is ONLY applicable if you have ONLY one instance on the server and NO other consumers (Fulltext daemon, reporting/analysis/integration services, etc.)

http://www.brentozar.com/archive/2012/11/how-to-set-sql-server-max-memory-for-vmware/
Somewhat older but still very much sound:
https://sqlserverperformance.wordpress.com/2009/10/29/suggested-max-memory-settings-for-sql-server-20052008/
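As an example – the number below is just an illustration for a dedicated 64GB box following the rule of thumb above – the setting itself looks like this:

EXEC sys.sp_configure N'show advanced options', 1;
RECONFIGURE;
-- 64GB total, leave the OS roughly 10%: give SQL Server about 58000 MB
EXEC sys.sp_configure N'max server memory (MB)', 58000;
RECONFIGURE;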

TEMPDB has been a hot topic for contention for many years. I’ve read everything from 2 to 8 files depending on the number of cores, and then tuning upwards if you run into problems. I make it simple: on a SAN with lots of spindles or on SSDs, 8 files and be done with it. Too many files can actually hurt performance if you have too few spindles (but then again – everything hurts with too few spindles).

Paul Randal lays out a lot of information here: http://www.sqlskills.com/blogs/paul/correctly-adding-data-files-tempdb/
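A sketch of how I’d go about it – the paths, sizes and growth settings are placeholders, and the important thing is to keep all the files identical:

-- What does TEMPDB look like today?
SELECT name, physical_name, size * 8 / 1024 AS size_mb
FROM tempdb.sys.database_files;

-- Resize the existing data file and add more, all with the same settings
ALTER DATABASE tempdb MODIFY FILE (NAME = N'tempdev', SIZE = 4096MB, FILEGROWTH = 512MB);
ALTER DATABASE tempdb ADD FILE (NAME = N'tempdev2', FILENAME = N'T:\TempDB\tempdev2.ndf', SIZE = 4096MB, FILEGROWTH = 512MB);
-- ...repeat for tempdev3, tempdev4 and so on up to the number of files you settled on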

Lock pages in memory – perhaps the closest thing to a religious argument the community has ever had. I fall into the “on the fence” category of DBAs. I personally don’t use it most of the time, but it’s nice to know that it is there should I need it.

Jonathan Kehayias has a great blog post with both sides of the argument and lots of interesting information: https://www.simple-talk.com/sql/database-administration/great-sql-server-debates-lock-pages-in-memory/

Backup compression is another no-brainer. Or is it? The prevailing opinion is to turn it on, as the servers out there usually have a lot more CPU than I/O resources. That is undoubtedly the case, but keep in mind the environment outside the database. If you’re running a compressing backup solution like TSM, NetBackup or similar, the compression and subsequent deduplication of that system will suffer greatly. In essence: make sure you only compress your backups in one place, and if at all possible do it at the SQL Server.

Paul Randal again has a great post on the subject: http://www.sqlskills.com/blogs/paul/sql-server-2008-backup-compression-cpu-cost/
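If you do decide that SQL Server should do the compressing, the default is one setting away; a quick sketch:

EXEC sys.sp_configure N'backup compression default', 1;
RECONFIGURE;
-- Individual backups can still opt out with WITH NO_COMPRESSION if needed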

Trace flags are usually something to be wary of, but there are a few that are downright a good idea to use.
3226, for instance, suppresses the successful-backup messages in the SQL Server error log. There’s enough information in there already, so a little less stuff to wade through is only a good thing.
http://msdn.microsoft.com/en-us/library/ms188396.aspx

1118 tells the server to avoid mixed extents and give each new object its own 64KB of allocated data.
1117 tells the server to grow each file in a filegroup equally. This can come in handy to avoid hotspots as a result of uneven file growth. Both these flags are documented in http://support.microsoft.com/kb/2154845/en-us

2371 modifies the behavior of statistics auto update – very useful on large tables.
http://blogs.msdn.com/b/saponsqlserver/archive/2011/09/07/changes-to-automatic-update-statistics-in-sql-server-traceflag-2371.aspx
2562 and 2549 pertain to improvements for DBCC CHECKDB.
https://support.microsoft.com/en-us/kb/2634571
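Flags can be turned on for the running instance with DBCC TRACEON, but to survive a restart they need to go in as startup parameters; a quick sketch:

-- Turn a flag on globally for the running instance:
DBCC TRACEON (3226, -1);

-- To make the flags permanent, add them as startup parameters
-- (-T3226, -T1118 and so on) in SQL Server Configuration Manager.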


Dude, where’s my disk?

Today I took a look at a client’s 10g Oracle RAC environment where one node apparently had started misbehaving quite some time ago. The scenario was that both nodes were up, cluster ready services was up, nodeapps were up, ASM was up – but the instance was down. Starting from the top with the database alert log, it was apparent that ASM had no available disk groups for this database. Okay, let’s ask ASM how it feels, then. Looking in the ASM log file it was equally obvious that the ASM instance was online but didn’t service any disk groups. That’s odd.
Let’s bring out the whole toolbox and start from the top. It didn’t take very long for the first problem to appear: ASMTOOL couldn’t see any stamped disks. Even worse, running asmtool -list gave me OS Error 21 – disk not ready. The server saw the disks via Disk Manager, but ASM couldn’t see the disk headers. That’s novel. Not very helpful, but novel.
Considering what might interfere with disk headers and disk presentation in general, volume managers and the like come to mind. Knowing that the customer runs NetApp in the organization, I took a look at SnapDrive. SnapDrive should never be installed (or heaven forbid – be running!) on a RAC cluster machine. The node that worked actually HAD SnapDrive – albeit disabled. The node that didn’t work also had SnapDrive – running happily. Time to end that.
A reboot and back to square one. Still couldn’t see the ASM stamps. Time to check the automount setting in DISKPART. It turned out to be disabled, so that was rectified:

DISKPART> automount enable

Automatic mounting of new volumes enabled.

But still no disk groups for me, despite a reboot and a CRS restart. On a whim I took the disks offline and online again. To my surprise that was the push that awoke the ASM stack. From that point everything worked like a charm: ASM saw the stamps, ASM could be persuaded to mount the disk groups and the database started without any issues. Some points to take home:

  • Just because something is ancient, don’t expect it to be removed from use (i.e. old knowledge always comes in handy)
  • When people tell you that they haven’t done anything to the environment – politely smile and expect that they have.
  • Oracle RAC is a finicky beast – always pay attention to the support matrix.
  • RAC is only as highly available as the administrator tasked with supporting it.

Basic settings for SQL Server 2012/2014, part 1

Whenever I set up an SQL Server I usually do the same things. I’ve collected a list of best practice settings over the years (and most often put them in a script, as most of my friends are scripts). Over a few blog posts I’ll go through the settings I use and discuss a bit about why I’ve chosen to set the parameters the way I do.

Part 1: configuring the server

There are a few hard rules when it comes to setting up the environment for SQL Server. The first two apply to storage:

Always make sure you align your partitions with the underlying stripe size. Windows 2008 and above (usually) takes care of this by itself, but always double check your alignment. Failure to align the partition can lead to significant performance degradation.

Format your partitions where you will store SQL Server related files to a NTFS Allocation Size of 64KB. The reason is very simple – SQL Server primarily works with 64KB I/O (1 64KB extent = 8 pages of 8KB each). With a default NTFS Allocation Size set to 4KB, the server has to do 16 I/Os for each extent – not optimal in the least. (In some cases with Analysis Services, 32KB might yield better results. See the link below for further information).

More information:
https://technet.microsoft.com/en-us/library/dd758814%28v=sql.100%29.aspx

SQL Server loves memory, and in my opinion you can’t give it too much. Many of my clients have a tendency to give it way too little (6-8GB) where double that amount might be a good starting point. This all depends (as always) on the workload and the requirements, but remember – memory transfer is fast, disk transfer is slow – regardless of whether you’re sporting all SSDs.
Jeremiah Peschka has a great blog post that summarises my thinking:

http://www.brentozar.com/archive/2014/05/much-memory-sql-server-need/

Allen McGuire wrote a great set of scripts to harvest and report on SQL Server IO that can be useful:
http://allen-mcguire.blogspot.se/2014/03/harvest-and-report-on-sql-server-io.html

CPU choice is another interesting topic. Remember a few years ago when it was all about the MHz? Well, these days it’s all about cores, and since the licenses are purchased per core, that’s not necessarily a good thing. Parallelizing is a difficult concept, and depending on your workload it might be more or less tricky to pull off. In a worst case scenario with, say, SharePoint – where you are forced to set Max Degree of Parallelism to 1 – a CPU with many slow cores will give you worse performance than a CPU with fewer fast cores.
In general, I prefer few fast cores over many slow ones. Most of my clients’ workloads benefit more from this choice.

Glenn Berry lays it out here: http://sqlperformance.com/2014/01/system-configuration/selecting-a-processor-for-sql-server-2014-1

Instant File Initialization is something that should always be turned on. If it is not, then every time a data file is created or grown it needs to be filled with zeroes. Not zeroes like myself, but actual zero bits. This means a huge hit on the storage system whenever a file is created. With instant file initialization, the file is allocated without zeroing the actual data. This means instant file creation and no extra I/O. This affects restores too…
Sure, the blocks are not zeroed and hence are vulnerable to being read with a hex editor. But as Kevin Boles (TheSQLGuru) put it in a presentation for the SQL Server User Group Sweden: “if someone is reading your SQL Server data files with a hex editor, you are already pwnd”.

Check this blog post from the premier field engineer team at Microsoft: http://blogs.msdn.com/b/sql_pfe_blog/archive/2009/12/23/how-and-why-to-enable-instant-file-initialization.aspx
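A quick way to verify that it is actually working – a hedged sketch using trace flags 3004 and 3605 to log zeroing operations to the error log (IFI_Test is just a throwaway database):

DBCC TRACEON (3004, 3605, -1);   -- log zeroing operations to the error log
CREATE DATABASE IFI_Test;
EXEC xp_readerrorlog 0, 1, N'zeroing';
-- Zeroing messages for the data file mean instant file initialization is NOT working.
-- The log file is always zeroed, so one entry for the .ldf is expected.
DROP DATABASE IFI_Test;
DBCC TRACEOFF (3004, 3605, -1);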

Power settings have a funny way of being ignored. The bad thing is that Windows 2012 has a habit of setting the power plan to “Balanced”, meaning that you won’t get the maximum performance from your server. Set the power plan to “High performance” and keep an eye on it from time to time – this setting has been known to “magically” change itself.


Failover on Oracle Standard Edition?

We have all heard (and gnashed our teeth over the fact) that in order to get a proper automatic failover environment for Oracle (i.e. Data Guard) we need Enterprise Edition. Oracle Enterprise Edition (or as Brent Ozar so succinctly put it: Expensive Edition) is hideously expensive, especially if you’re only ever going to use the Data Guard feature.

Sure, you can always use the classic Oracle standby with a bunch of scripts – much like log shipping in SQL Server. It works (if your scripts are OK this is basically bomb-proof) and is fairly simple to set up. Trouble is that there is no (good) way of automating the failover, and when it is time to go back, well, you’re basically looking at a complete rebuild of the database previously known as the primary. Depending on the amount of data this can take a long time, as well as saturate the network links between the systems (and the rest of the datacenter), making end users (and sysadmins) less than overjoyed.

Oracle is a huge company and different divisions peddle different software – the database people don’t sell applications, the application people don’t sell databases, and neither of them sell … Golden Gate. Golden Gate is the result of an acquisition Oracle made several years ago, and it is an exceptionally versatile data transfer tool. It is mostly used for heterogeneous data transfer (that is, going from one database engine to another) or basically anything that involves massaging the data. That’s probably why so few people think of it when it comes to – yes, you guessed it – high availability.

First things first – what is it going to cost me? Well, the answer is ‘a lot’. But not nearly as much as a complete Enterprise Edition solution. Golden Gate is licensed per core, just like Enterprise Edition, but has a price tag of roughly half of an Enterprise Edition core license. That means that you will be paying A LOT more for Golden Gate than for your Standard Edition database, but again, not nearly as much as if you’d decided to bite the bullet (hailstorm?) and forked over the cash for an all-Enterprise Edition setup.

Golden Gate brings a lot of cool stuff to the party as well – remember the ‘heterogeneous’ part of the previous paragraph? Oh yes, that means that we can use Golden Gate to sync up two different Oracle versions. Thinking slightly outside the box, this means that we can use this solution for a transparent upgrade without any downtime at all. Or we can transfer data between the Oracle database and, say, a SQL Server data warehouse. Or – what you came here for – a plain, simple Data Guard analogue with automatic failover.

This is just a teaser of WHAT you can do, not HOW you do it. I’ll make a blog post about that in the future!