Manhandling GoldenGate, part 3

Here are the different configuration files and such I used for the setup described in parts 1 and 2. Please note that I’ve removed the several hundred schemas for clarity, and I have exchanged the customer-specific schema names for the nonexistent SCOTT and TIGER schema names.

mgr parameter file on primary
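Not the customer’s actual file, but a minimal sketch of what a manager parameter file in this kind of setup looks like (port numbers and retention values are made up):

    -- mgr.prm (illustrative)
    PORT 7809
    DYNAMICPORTLIST 7810-7820
    -- purge trail files once all processes are done with them
    PURGEOLDEXTRACTS ./dirdat/*, USECHECKPOINTS, MINKEEPHOURS 48
    -- restart the extract and pump automatically if they stop
    AUTORESTART EXTRACT *, RETRIES 3, WAITMINUTES 5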

extract parameter file on primary
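Again a sketch rather than the real file; the extract name, trail path and cache size are made up, and the schema list is cut down to the SCOTT and TIGER examples:

    -- ext1.prm (illustrative)
    EXTRACT EXT1
    USERIDALIAS ggadmin
    -- keep the cache manager well below the 64GB default (see part 1)
    CACHEMGR CACHESIZE 4GB
    EXTTRAIL ./dirdat/lt
    DDL INCLUDE MAPPED
    TABLE SCOTT.*;
    TABLE TIGER.*;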

pump parameter file on primary
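A sketch of the pump, with a made-up remote host and trail path:

    -- pump1.prm (illustrative)
    EXTRACT PUMP1
    USERIDALIAS ggadmin
    RMTHOST siteb.example.com, MGRPORT 7809
    RMTTRAIL ./dirdat/rt
    PASSTHRU
    TABLE SCOTT.*;
    TABLE TIGER.*;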

manager parameter file on secondary
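The manager on the secondary is essentially a mirror of the one on the primary; a sketch:

    -- mgr.prm on the secondary (illustrative)
    PORT 7809
    DYNAMICPORTLIST 7810-7820
    PURGEOLDEXTRACTS ./dirdat/*, USECHECKPOINTS, MINKEEPHOURS 48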

replicat parameter file on secondary
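A sketch of the replicat parameter file; names, paths and sizes are made up, and the schema list is again reduced to SCOTT and TIGER:

    -- rep1.prm (illustrative)
    REPLICAT REP1
    USERIDALIAS ggadmin
    ASSUMETARGETDEFS
    DBOPTIONS SUPPRESSTRIGGERS
    DDL INCLUDE MAPPED
    DISCARDFILE ./dirrpt/rep1.dsc, APPEND, MEGABYTES 100
    -- REPERROR (1403, DISCARD) is only added temporarily during instantiation (see part 2)
    MAP SCOTT.*, TARGET SCOTT.*;
    MAP TIGER.*, TARGET TIGER.*;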

Oracle parameters on primary/secondary
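Not the actual values, but the database-side configuration this kind of setup typically requires looks something like this:

    -- on both databases
    ALTER SYSTEM SET enable_goldengate_replication = TRUE SCOPE=BOTH;
    ALTER SYSTEM SET streams_pool_size = 1G SCOPE=SPFILE;
    -- on the primary, make sure the redo contains what the extract needs
    ALTER DATABASE FORCE LOGGING;
    ALTER DATABASE ADD SUPPLEMENTAL LOG DATA;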

GGSCI statements for the primary
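A sketch of the GGSCI commands on the primary, with made-up names and trail paths:

    GGSCI> DBLOGIN USERIDALIAS ggadmin
    GGSCI> REGISTER EXTRACT EXT1 DATABASE
    GGSCI> ADD EXTRACT EXT1, INTEGRATED TRANLOG, BEGIN NOW
    GGSCI> ADD EXTTRAIL ./dirdat/lt, EXTRACT EXT1, MEGABYTES 500
    GGSCI> ADD EXTRACT PUMP1, EXTTRAILSOURCE ./dirdat/lt
    GGSCI> ADD RMTTRAIL ./dirdat/rt, EXTRACT PUMP1, MEGABYTES 500
    GGSCI> START EXTRACT EXT1
    GGSCI> START EXTRACT PUMP1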

GGSCI statements for the secondary
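And the corresponding sketch for the secondary (the SCN comes from the REGISTER EXTRACT step on the primary):

    GGSCI> DBLOGIN USERIDALIAS ggadmin
    GGSCI> ADD REPLICAT REP1, INTEGRATED, EXTTRAIL ./dirdat/rt
    GGSCI> START REPLICAT REP1, ATCSN 1234567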

Manhandling GoldenGate, part 2

Let’s take a look at something that is not quite as well documented for anything other than a tiny, safe lab environment – instantiation of the secondary database.

There is a document from Oracle covering this (1276058.1) but it has not been updated since 2013 and is not very detailed. Instantiating an Oracle secondary can basically be done in one of three ways: initial load using GoldenGate, Data Pump or RMAN. The first alternative is the only way to go if you’re trying to set up heterogeneous replication between, say, Oracle and SQL Server, but it strikes me as unnecessarily difficult for an Oracle-to-Oracle environment. The Data Pump variant seems to be the one Oracle and most bloggers prefer. In my case I discarded it as I’ve got over 300 different schemas and Data Pump simply isn’t as fast as RMAN. So I’m left with RMAN, and happily there are instructions in the Oracle document above on how to DUPLICATE a database. Trouble is, I can’t for the life of me get it to work; I’m always left with an error.

Another detail that is missing from the document is how to handle integrated extract and replicat. The classic extract can be started at a specific CSN (GoldenGate-speak for SCN) – if you’re running integrated extract you can’t.
When creating the extract process it needs to be registered with the database, and this registration returns an SCN (as I said in the previous post) that can be used for DUPLICATing the secondary.

Time to test this out then.

  1. Set up Oracle and GoldenGate per all the standard tutorials
  2. Take an RMAN full backup
  3. Create extract and pump processes, register the extract process and write down the SCN
  4. Create an RMAN archivelog backup
  5. Transfer all the backup files to the secondary
  6. Run RMAN DUPLICATE using the SCN above to instantiate the secondary
  7. Set up the replicat process on the secondary and start it with START REPLICAT … ATCSN, using the SCN above (the key commands are sketched after this list)
Easy enough, right? Yes, unless there are open transactions. In the first tests I did there was no activity in my test database, and hence there were no open transactions. Everything worked like a charm on the first try (always an ominous sign).
Then it was time to do this on an actual test system running in a production-like environment. Steps 1-5 are identical, with the slight difference that the replicat process starts and then ABENDs (GoldenGate-speak for an abnormal end) with a couple of errors. Looking into this it turns out that GoldenGate expects to find a row that for some reason is not there, and hence it will ABEND and kindly ask for help.

I decided to toy with different SCNs to duplicate at, start the extract at and start the replicat at. No joy, everything amounted to the same thing – a data mismatch between the primary and the secondary. At this point I actually opened a case with Oracle Support, as I thought I had done something wrong or that something was broken. I won’t go into the less than enjoyable interactions with Oracle Support I’ve had through the years, but after several discussions to and fro they dropped a gem: “Whatever the type of initial load, the basic rule is to avoid open transactions.” Right – so how am I supposed to handle instantiation from a system in full swing? That’s a basic requirement for a high availability solution, I would think.
There is another support document (1347191.1) that covers “How to handle open transactions during initial load”, but it is basically useless for integrated capture (remember the registration of the extract and the resulting SCN?), apart from the part where you can use Data Pump to manually handle the specific table(s) that have open transactions (and how messy is that?).

Looking in more detail at what is actually happening in this case (checking the ggs_error log file and toying with the logdump tool), GoldenGate tries to update a row that does not exist on the secondary. This occurs since the insert was done, in a then-open transaction, prior to the SCN where we registered the extract – the row never made it into either the duplicate or the trail.

There are a few ways to handle this, but the way I settled on was to set a parameter in the replicat parameter file that tells GoldenGate to toss all the rows that result in error 1403 into the discard file for later analysis and handling. Mind you, having this parameter set permanently is a colossal risk in my view, as it could potentially lead to data loss. In this specific case I know what is happening and why, so I can live with the 20-odd errors due to missing rows ending up in the discard file (and being safely ignored).
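The standard replicat parameter for this behaviour is REPERROR; a hedged sketch of the relevant lines (the discard file name and size are made up):

    -- only while catching up after instantiation: discard rather than ABEND on ORA-01403
    REPERROR (1403, DISCARD)
    DISCARDFILE ./dirrpt/rep1.dsc, APPEND, MEGABYTES 100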
Hence the recipe looks like this:

  1. Set up Oracle and GoldenGate per all the standard tutorials.
  2. Take an RMAN full backup.
  3. Create extract and pump processes, register the extract process and write down the SCN.
  4. Run RMAN DUPLICATE using the SCN above to instantiate the secondary.
  5. Set up the replicat process on the secondary and start it with START REPLICAT … ATCSN, using the SCN above.
  6. Wait for the replicat to ABEND and verify that the error returned is the one discussed above.
  7. Set the error-discarding parameter in the replicat parameter file and restart the replicat as above.
  8. When the lag time goes to zero (meaning that all the rows queued up in the trail file have been consumed), check the discard file and verify that the only “lost” transactions are related to the 1403 errors discussed above.
  9. Stop the replicat process, remove the parameter and restart without it.

Going back to the primary is not quite as much fun, though. It would be quite possible to set up a unidirectional replication back to the primary, but that is outside the scope of this specific customer project. Instead the process looks something like this:

  1. Back up site B via RMAN (the old secondary, now running as the primary)
  2. Transfer the backup files to site A
  3. Back up archivelog on site B and shut down site B
  4. Transfer the final archivelog backup files to site A
  5. Restore site A
  6. Point the CNAME to site A
  7. Open site A for business
  8. Instantiate site B and recreate the GoldenGate relationship

Okay, having that in place, it is time to move our gaze to the details of how to actually connect to the system. Remember that GoldenGate has no concept of a primary or secondary database, so both sides are open for business. This means that if anyone changes any data in any table that is included in the GoldenGate replication, the replicat process will ABEND and the net result is basically what we had above – a data mismatch. This is a fact of life and must be taken care of outside of Oracle/GoldenGate.

In our case we decided to go with a CNAME that points to the primary server. If the time comes to switch over to the secondary, we stop the database on the primary (to kill any remaining connections) and then point the CNAME to the secondary server. That’s it – no application servers or other systems have any need to know anything about GoldenGate, the secondary database or the like.

The reason for using a CNAME rather than a SERVICE_NAME in Oracle boils down to me not wanting to risk connections to the secondary. Since there is no built-in concept of a primary/secondary database as with Data Guard, I can’t use a trigger to start or stop the SERVICE_NAME. And if I’m relying on a round-robin list in TNSNAMES to connect to the secondary in case the primary is down, Bad Things(tm) can occur should the old primary come up and accept connections for some reason. By using a CNAME this cannot happen, as the IP address will only resolve to the server currently set to primary (provided that no one screws the pooch and sets the IP address wrong, but let’s try to avoid that).
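A sketch of what the client side looks like under this scheme; the host and service names are made up, and the only moving part during a failover is the CNAME itself:

    # tnsnames.ora on the application servers (illustrative)
    PROD =
      (DESCRIPTION =
        (ADDRESS = (PROTOCOL = TCP)(HOST = proddb.example.com)(PORT = 1521))
        (CONNECT_DATA = (SERVICE_NAME = prod))
      )
    # proddb.example.com is a CNAME that resolves to whichever server is
    # currently acting as primary; a switchover only changes the CNAME.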

To summarize: open transactions can bite you in the posterior, especially when running integrated extract. It does not have to be more than a nuisance if you know how to handle it, but it can be very difficult to find relevant information.

Using a CNAME to connect to the database means more flexibility and less risk than a SERVICE_NAME, even though it means relying on either a DNS administrator or delegating DNS administrative rights to the DBA.

Manhandling GoldenGate, part 1

I’ve been hard at work setting up a GoldenGate replication system between two sites to serve as an HA solution for a customer that is … unwilling to pay the rather high licensing fees required for Oracle Enterprise Edition and Data Guard. Instead, we’ve opted to go with Standard Edition One 11g on two nodes and Oracle GoldenGate 12c as the way to synchronize data from site A to site B.

Before going into the setup side of things, there are a few things to consider when using this technology in this way. The first item to think about is the fact that this is a logical, transactional replication – not a binary log transfer (like physical Data Guard). That means that the secondary on site B is open for business – both read and write. That can be a very good thing, and it can be a very bad thing.

The second item to keep in mind is the list of limitations. With the new “integrated extract” that appeared in 12c the list keeps shrinking, but there are still some things that simply cannot be transferred using GoldenGate.

Third, GoldenGate is not too keen on instantiating the secondary with open transactions on the primary. Yes, you read that correctly – it might bite you when you start the replication. More about this in part 2.

Fourth, there is no concept of “primary” or “secondary”, so that has to be handled outside of Oracle. This is by design – an effect of the first item – and not a limitation of the software. I discuss this in more depth in part 2 as well.

Design

The design is fairly simple in itself – two databases that have no idea whatsoever that they have any kind of relationship, an extract process on the primary that is responsible for, well, extracting the data that goes into the transaction log on the primary, a pump process on the primary responsible for tossing the result from the extract over to the secondary, and finally a replicat process on the secondary that applies the transactions as they come in. The extract, pump and replicat processes are all part of GoldenGate.

Setup

There are quite a few tutorials and such on the internet that describe how to set up a simple unidirectional GoldenGate replication pair, so I won’t go into any details regarding the basic settings. With that said, my situation was a … bit different. I’ve got well over 300 schemas to keep track of, as well as both integrated capture (extract) and replicat, and about those you can find a lot less information.
The integrated approach, as opposed to the classic GoldenGate processes, is Oracle’s recommended way of doing things these days, provided that you’re running on a new enough database (11.2+, preferably 11.2.0.4 or 12c). The devil is in the details, as the actual setup is very straightforward – basically a parameter in the extract process parameter file plus specifying INTEGRATED TRANLOG when creating the extract process definition (add extract ext1, integrated tranlog, begin now). There is one more thing needed – registering the extract process with the database, which is done in GGSCI.

There is actually some output from this: an SCN. This is the first SCN that the extract starts extracting from, and it can be used for instantiating the secondary database, either via Data Pump or RMAN.
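A sketch of the registration step and the kind of output it produces (the extract name and the SCN are of course made up):

    GGSCI> DBLOGIN USERIDALIAS ggadmin
    GGSCI> REGISTER EXTRACT EXT1 DATABASE
    Extract EXT1 successfully registered with database at SCN 1234567.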
Another slightly surprising detail is the way that the 64-bit GoldenGate behaves from 11.2 and onwards – it defaults the CACHESIZE parameter to 64GB. In short, this is very likely to result in a “cachesize too small” error, followed by abending (stopping) of the extract process. Oracle ACE Gavin Soorma has written a post about it here.

Other settings to note are: the setting in the extract parameter file that controls when to switch trail files (lest they become ginormous), the manager parameter that handles purging of obsolete trail files, the settings that together help handle tables without keys, and a nice new feature that, well, suppresses triggers on the secondary.
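As an illustration of the kind of settings being referred to (names and values are made up, not the actual files):

    -- manager parameter file: purge trail files that all processes are done with
    PURGEOLDEXTRACTS ./dirdat/*, USECHECKPOINTS, MINKEEPHOURS 48

    -- extract/replicat: point out a usable key for a table that lacks one
    TABLE SCOTT.SOME_TABLE, KEYCOLS (ID_COL);

    -- replicat parameter file: do not fire triggers on the secondary
    DBOPTIONS SUPPRESSTRIGGERS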

All right, the primary generally won’t provide any surprises. Sure, the parameter files for both the extract and the pump are by now rather large, as every schema that is to participate in the replication (in this case ALL of them) needs both DDL mappings and table definitions, but otherwise there is little that deviates from any tutorial out there.

GoldenGate surprises

I’ve been way busier than I expected, and I’ve got somewhat of a bad conscience for not having been able to update the blog. But here goes.

I’ve been tasked with setting up an Oracle GoldenGate solution for a customer. The environment is fairly straightforward, with two servers running Windows 2012R2, Oracle Standard Edition One 11.2.0.4 (yep, licenses bought before the new SE Two licensing) and GoldenGate 12.1.0.1. The idea is to have GoldenGate handle replication from site A to site B in case of disaster.

GoldenGate is an interesting animal in itself, as it is *very* complex and has, in true Oracle fashion, somewhere along the lines of a gazillion parameters. In this case we’ve opted to set up GoldenGate on each of the two servers, despite the fact that the replication is strictly one-way (from site A to site B).

There are a lot of blogs (especially Gavin Soorma’s) that do an excellent job of explaining exactly how GoldenGate works, so I’ll focus on some of the more interesting findings so far:

Integrated or classic mode

GoldenGate 12c gives you the choice of using the classic way of sifting through the log files and shipping the result, or leveraging LogMiner to extract the data. Long story short: integrated extract and replicat support more data types and are faster. WAY faster. I tried banging 2M rows into a table and measuring how quickly the data crawled over to the secondary – it went from minutes to seconds. GoldenGate integrated extract/replicat *is* supported on Oracle Standard Edition (despite the fact that the integrated functionality leverages Oracle Streams, an Enterprise Edition-only feature), but some things (like parallelism) are disabled, and Oracle for some reason recommends using classic capture/replicat on Standard Edition and Standard Edition One. Considering the speed increase with integrated I’ll go with that, thank you very much.

The value of keys

Picture this: a single heap table with a couple of hundred thousand rows. Inserting the data is quick and GoldenGate does an excellent job of shuttling the data over to the secondary site B. Now let’s see what happens when we do a delete instead. The cursor comes back in a few seconds on the primary, and after hitting commit I started looking at the secondary. This is where things become interesting.
The CPU went to 50% and basically pegged there – this is due to the fact that I’ve got two CPUs in my testing box and one of them was completely saturated. Funny thing is, apart from an initial flurry of disk and network activity, nothing seemed to be happening on the server. The rows were still in my table as well. Scratching my head, I took a look at long-running queries using a query by David Mann

and found this little gem:

Wait, what? This basically means a RBAR (row by agonizing row) handling of the delete, which took about four HOURS(!). Lesson learned – ALWAYS make sure to have keys in your tables. And expect deletes to take a lot of time.
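David Mann’s original query isn’t included above, but a generic query along the same lines – active sessions joined to their current SQL, with elapsed time and rows processed so far – looks something like this:

    SELECT s.sid,
           s.username,
           q.sql_text,
           q.rows_processed,
           ROUND(q.elapsed_time / 1000000 / 60) AS elapsed_minutes
    FROM   v$session s
           JOIN v$sql q ON q.sql_id = s.sql_id
                       AND q.child_number = s.sql_child_number
    WHERE  s.status = 'ACTIVE'
    AND    s.username IS NOT NULL
    ORDER  BY q.elapsed_time DESC;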

Memory woes

When stress testing the replication I started receiving quite a lot of bothersome memory problems (inability to reallocate, out of virtual memory, header corruptions, etc.). After searching the internet for answers and finding exactly nothing, it turned out that there is a small gotcha with the 64-bit version of GoldenGate – it defaults to 64GB of memory (which means a 128GB virtual memory pool). Since my machine only sported 8GB of memory, the process very much did get a memory problem. By changing the parameter CACHESIZE to a more reasonable number the problem went away.
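The cache size is set with the CACHEMGR parameter in the extract parameter file; a sketch with a value picked for an 8GB box (the exact number is a judgement call):

    -- extract parameter file
    CACHEMGR CACHESIZE 2GB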

Differences between Oracle and SQL Server transaction log files

When moving between Oracle and SQL Server, one of the first gotchas is how the transaction log is handled differently between the two engines. In Oracle, the transaction log is a pure transaction log in that it contains only the change data and not the rollback information. SQL Server combines the transaction log and the rollback data in the same file.
This leads to some interesting differences in behavior. Let’s take a peek at Oracle first. Oracle divides the different workloads from each other; the transaction log only contains change data, while the data required for rollback is stored in a separate rollback (undo) tablespace. This also means that the transaction log size is fixed. When a transaction log file is filled up, a so-called log switch occurs, where the log writer process switches to the next log group and starts to write there. This is a good time to point out that Oracle requires at least two (preferably at least three) transaction log groups, and that each group can consist of several members – mirrors of the transaction log, preferably placed on different drives.
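As a hedged example, adding a redo log group with two mirrored members might look like this (paths and size are made up):

    ALTER DATABASE ADD LOGFILE GROUP 4
      ('/u01/oradata/prod/redo04a.log',
       '/u02/oradata/prod/redo04b.log') SIZE 1G;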

When the log writer switches out of the now-full transaction log file, the archiver process steps in and starts reading and copying the transaction log to a new file called the archive log (the destination is specified by the log_archive_dest_n parameters). This process then repeats itself as soon as the log writer is finished with a transaction log.

As previously mentioned the transaction log size is fixed, but the number of archive logs produced varies depending on the workload. According to the classic best practices for Oracle, a log switch should occur about every 20-30 minutes. That means that the logs should be sized accordingly, but how many DBAs have a consistent workload that always fills up a transaction log in exactly the same way?

Let’s think about the consequences of file size for a short while. Files that are too small mean that the log writer process might be blocked by the archiver, since the archiver is slower (it has to both read AND write). This means poor performance and very hammered transaction log and archive log disks.
The other alternative is even worse – should the transaction log disk disappear into a black hole (it *might* happen!), you will lose all the data that is still in the transaction log (and has yet to be archived).

The solution is to create transaction log files large enough (or numerous enough) that the archiver has time to scurry out of the way before the log writer switches back to the first file in the log group during the most intensive workload. This might mean that the files are either too large or too numerous to get a log switch every 20-30 minutes during most of the day, but this is easily taken care of with the parameter archive_lag_target. This parameter, set in seconds, specifies the maximum amount of time that can pass before an explicit log switch (and hence archiving) is invoked.
The sad thing is that I’ve rarely ever seen anyone set this parameter.
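For example, forcing a log switch at least every 30 minutes looks like this:

    -- 1800 seconds = 30 minutes
    ALTER SYSTEM SET archive_lag_target = 1800 SCOPE=BOTH;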

SQL Server, then. The transaction log has a way of growing, sometimes out of hand. The good thing is that we get a continuous log chain, and the bad thing is that it varies wildly in size. Where Oracle switches the log file automatically (and hopefully in a consistent, reasonable way), SQL Server does not, and hence continues to build rather large transaction log files. There is always a hot discussion about what is a good interval between log backups, but as with the Oracle discourse above, it all boils down to how much data you are prepared to lose. Some of my customers run a log backup every five minutes, some do it once per day (don’t get me started). The really bad thing about SQL Server when it comes to log handling is the fact that you can’t mirror the transaction log file from within the database.


Dude, where’s my disk?

Today I took a look at a client’s 10g Oracle RAC environment where one node apparently had started misbehaving quite some time ago. The scenario was that both nodes were up, Cluster Ready Services was up, nodeapps were up, ASM was up, but the instance was down. Starting from the top with the database alert log, it was apparent that ASM had no available disk groups for this database. Okay, let’s ask ASM how it feels about that, then. Looking in the ASM log file it was equally obvious that the ASM instance was online but didn’t service any disk groups. That’s odd.
Let’s bring out the whole toolbox and start from the top. It didn’t take very long for the first problem to appear: ASMTOOL couldn’t see any stamped disks. Even worse, listing the disks gave me OS Error 21 – disk not ready. The server saw the disks via Disk Manager, but ASM couldn’t see the disk headers. That’s novel. Not very helpful, but novel.
Considering what might interfere with disk headers and disk presentation in general, volume managers and the like come to mind. Knowing that the customer runs NetApp in the organization, I took a look at SnapDrive. SnapDrive should never be installed (or heaven forbid – be running!) on a RAC cluster machine. The node that worked actually HAD SnapDrive – albeit disabled. The node that didn’t work also had SnapDrive – running happily. Time to end that.
A reboot and back to square one – still no ASM stamps to be seen. Time to check the automount setting in DISKPART. It turned out to be disabled, so that was rectified.
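The fix in DISKPART is along these lines:

    DISKPART> automount enable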

But still no disk groups for me, despite a reboot and a CRS restart. On a whim I took the disks offline and online again. To my surprise that was the push that awoke the ASM stack. From that point everything worked like a charm: ASM saw the stamps, ASM could be persuaded to mount the disk groups and then the database started without any issues. Some points to take home:

  • Just because something is ancient, don’t expect it to be removed from use (i.e. old knowledge always comes in handy)
  • When people tell you that they haven’t done anything to the environment – politely smile and expect that they have.
  • Oracle RAC is a finicky beast – always pay attention to the support matrix.
  • RAC is only as highly available as the administrator tasked with supporting it.

Failover on Oracle Standard Edition?

We have all heard (and gnashed our teeth over) the fact that in order to get a proper automatic failover environment for Oracle (i.e. Data Guard) we need Enterprise Edition. Oracle Enterprise Edition (or as Brent Ozar so succinctly put it: Expensive Edition) is hideously expensive, especially if you’re only ever going to use the Data Guard feature.

Sure, you can always use the classic Oracle standby with a bunch of scripts – much like log shipping in SQL Server. It works (if your scripts are OK it is basically bomb-proof) and is fairly simple to set up. Trouble is that there is no (good) way of automating the failover, and when it is time to go back, well, you’re basically looking at a complete rebuild of the database previously known as the primary. Depending on the amount of data this can take a long time, as well as saturate the network links between the systems (and the rest of the datacenter), making end users (and sysadmins) less than overjoyed.

Oracle is a huge company and different divisions peddle different software – the database people don’t sell applications, the application people don’t sell databases, and neither of them sell … GoldenGate. GoldenGate is the result of a company purchase Oracle made several years ago, and it is an exceptionally versatile data transfer tool. It is mostly used for heterogeneous data transfer (that is, going from one database engine to another) or basically anything that involves massaging the data. That’s probably why so few people think of it when it comes to – yes, you guessed it – high availability.

First things first – what is it going to cost me? Well, the answer is ‘a lot’. But not nearly as much as a complete Enterprise Edition solution. GoldenGate is licensed per core, just like Enterprise Edition, but has a price tag of roughly half of an Enterprise Edition core license. That means that you will be paying A LOT more for GoldenGate than for your Standard Edition database, but again, not nearly as much as if you’d decided to bite the bullet (hailstorm?) and forked over the cash for an all-Enterprise Edition setup.

GoldenGate brings a lot of cool stuff to the party as well – remember the ‘heterogeneous’ part of the previous paragraph? Oh yes, that means that we can use GoldenGate to sync up two different Oracle versions. Thinking slightly outside the box, this means that we can use this solution for a transparent upgrade without any downtime at all. Or we can do data transfer between the Oracle database and, say, a SQL Server data warehouse. Or, what you came here for – a plain, simple Data Guard analogy with automatic failover.

This is just a teaser of WHAT you can do, not HOW you do it. I’ll make a blog post about that in the future!