Manhandling GoldenGate, part 2

Let’s take a look at something that is not quite as well documented for anything else than a tiny, safe lab environment – instantiation of the secondary database.

There is a document from Oracle covering this (1276058.1) but it has not bee updated since 2013 and is not very detailed. Instantiating an Oracle secondary can basically be done in one of three ways: initial load using GoldenGate, Data Pump or RMAN. The first alternative is the only way to go if you’re trying to set up a heterogeneous replication between, say, Oracle and SQL Server or similar. It strikes me as unnecessarily difficult for an Oracle to Oracle environment. The Data Pump variant seems to be the one Oracle and most bloggers prefer. In my case I discarded it as I’ve got over 300 different schemas and Data Pump simply isn’t as fast as RMAN. So I’m left with RMAN, and happily there are instructions in the Oracle document above to DUPLICATE a database. Trouble is, I can’t for the life of me get it to work. I’m always left with an error:

RMAN-00571: ===========================================================
RMAN-00569: =============== ERROR MESSAGE STACK FOLLOWS ===============
RMAN-00571: ===========================================================
RMAN-03002: failure of Duplicate Db command at 03/25/2016 10:06:38
RMAN-05501: aborting duplication of target database
RMAN-03015: error occurred in stored script Memory Script
RMAN-06026: some targets not found - aborting restore
RMAN-06101: no channel to restore a backup or copy of the control file

Another detail that is missing from the document is how to handle integrated extract and replicat. The classic extract can be started at a specific CSN (GoldenGate-speak for SCN) – if you’re running integrated extract you can’t.
When creating the extract process it needs to be registered with the database (and this results in a SCN as I said in the previous post) that can be used for DUPLICATing the secondary.

Time to test this out then.

  1. Set up Oracle and GoldenGate per all the standard tutorials
  2. Take a RMAN full backup
  3. Create extract and pump processes, register the extract process and write down the SCN
  4. Create a RMAN archivelog backup
  5. Transfer all the backup files to the secondary
  6. Run RMAN DUPLICATE UNTIL SCN using the SCN above to instantiate the secondary
  7. Set up the replicat process on the secondary and start it with START ATCSN and use the SCN above

Easy enough, right? Yes, unless there are open transactions. In the first tests I did there were no activity in my test database, and hence there were no open transactions. Everything worked like charm at the first try (always an ominous sign).
Then it was time to do this on an actual test system running in a production-like environment. Steps 1-5 are identical with the slight difference that the replicat process starts and then ABENDs (GoldenGate-speak for paused) with OGG-01154 and SQL error 1403. Looking into this it turns out that GoldenGate expects to find a row that for some reason is not there, and hence it will ABEND and kindly ask for help.

I decided to toy with different SCNs to DUPLICATE, start extract at and start replicat at. No joy, everything amounted to the same thing – a data mismatch between the primary and the secondary. At this time I actually opened a case with Oracle Support as I thought I had done something wrong or that something was broken.I won’t go into the less than enjoyable interactions with Oracle Support I’ve had through the years, but after several discussions to and fro they dropped a gem: “Whatever the type of initial load, the basic rule is to avoid open transactions.” Right – so how am I supposed to handle instantiation from a system in full swing? That’s a basic requirement for a high availability solution I would think.
There is another support document (1347191.1) that covers “How to handle open transactions during initial load” that is basically useless for integrated capture (remember the REGISTER DATABASE and the resulting SCN?) apart from the part where you can use Data Pump to manually handle the specific table(s) that have open transactions (how messy is that?)

Looking into more detail about what actually is happening in this case (checking the ggs_error log file and toying with the logdump tool) GoldenGate tries to DELETE a row that does not exist. This occurs since the INSERT was done prior to the SCN where we REGISTERED the extract.

There are a few ways to handle this, but the way I settled on was to set the following parameter in the replicat parameter file:

REPERROR 1403, DISCARD

This tells GoldenGate to toss all the rows that results in error 1403 into the discard file for later analysis and handling. Mind you, having this parameter set permanently is a collosal risk in my view, as this potentially could lead to data loss. In this specific case I know what is happening and why, so I can live with having the 20-odd errors due to missing rows to DELETE end up in the discard file (and safely ignored).
Hence the recepie looks like this:

  1. Set up Oracle and GoldenGate per all the standard tutorials
  2. Take a RMAN full backup.
  3. Create extract and pump processes, register the extract process and write down the SCN.
  4. Run RMAN DUPLICATE UNTIL SCN using the SCN above to instantiate the secondary.
  5. Set up the replicat process on the secondary and start it with START ATCSN and use the SCN above.
  6. Wait for the replicat to ABEND and verify that the error returned is the one discussed above
  7. Set REPERROR 1403, DISCARD in the replicat parameter file and restart the replicat ATCSN as above
  8. When the lag time goes to zero (meaning that all the rows queued up in the trail file have been consumed), check the discard file and verify that the only “lost” transactions are related to DELETEs
  9. Stop the replicat process, remove the REPERROR parameter and restart without any ATCSN.

Going back to the primary is not quite as much fun, though. It would be quite possible to set up a unidirectional replication back to the primary, but that is outside the scope of this specific customer project. Instead the process looks something like this:

  1. Back up site B via RMAN (the old secondary, now running as the primary)
  2. Transfer the backup files to site A
  3. Back up archivelog on site B and shut down site B
  4. Transfer the final archivelog backup files to site A
  5. Restore site A
  6. Point the CNAME to site A
  7. Open site A for business
  8. Instantiate site B and recreate the GoldenGate relationship

Okay, having that in place, it is time to move our gaze to the detail of how to actually connect to the system. Remember that GoldenGate have no concept of a primary or secondary database, so both sides are open for business. This means that if anyone changes any data in any table that is included in the GoldenGate replication, the replicat process will ABEND and the net result is basically what we had above – a data mismatch. This is a fact of life and must be taken care of outside of Oracle/GoldenGate.

In our case we decided to go with a CNAME that points to the primary server. If the time comes to switch over to the secondary, we stop the database on the primary (to kill any remaining connections) and then point the CNAME to the secondary server. That’s it – no application servers or other systems have any need to know anything about GoldenGate, the secondary database or the like.

The reason for using a CNAME rather than a SERVICE_NAME in Oracle boils down to me not wanting to risk connections to the secondary. Since there is no concept of primary/secondary database built in as with Data Guard, I can’t use a trigger to start or stop the SERVICE_NAME. And if I’m relying a round-robin list in TNSNAMES to connect to the secondary in case the primary is up, Bad Things(tm) can occur should the old primary come up and accept connections for some reasons. By using a CNAME this cannot happen as the IP address will only resolve to the server currently set to primary (provided that noone screws the pooch and set the IP address wrong, but let’s try to avoid that).

To summarize: open transactions can bite you in the posterior, especially when running integrated extract. It does not have to be more than a nuiscance if you know how to handle it, but it can be very difficult to find relevant information.

Using a CNAME to connect to the database means more flexibility and less risk than a SERVICE_NAME, even though it means relying on either a DNS administrator or delegating DNS administrative rights to the DBA.

0 replies

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply

Your email address will not be published. Required fields are marked *