I’ve been way busier than I exepected, and I’ve got somewhat of a bad conscience for not having been able to update the blog. But here goes.
I’ve been tasked with setting up an Oracle GoldenGate solution for a customer. The environment is fairly straight forward with two servers running Windows 2012R2 and Oracle Standard Edition One (yep, licnesens bought before the new SE Two licensing) 220.127.116.11 and GoldenGate 18.104.22.168. The idea is to get GoldenGate to handle replication from site A to site B in case of disaster.
GoldenGate is an interesting animal in itself as it is *very* complex and have in true Oracle fashion somewhere along the lines of a gazillion parameters. In this case we’ve opted to set up GoldenGate on each of the two servers, despite the fact that teh replication is stricly one-way (from site A to site B).
There are a lot of blogs (especially Gavin Soorma) that do an execellent job of explaining exactly how GoldenGate works, so I’ll focus on some of the more interesting findings so far:
Integrated or classic mode
GoldenGate 12c gives you the choice of using the classic way of sifting through the log files and shipping the result or leveraging logminer to extract the data. Long story short: integrated extract and replicat supports more data types and is faster. WAY faster. I tried banging in 2M rows into a table and measuring how quickly the data crawled over to the secondary. It went from minutes to seconds. GoldenGate integrated extract/replicat *is* supported on Oracle Standard Edition (despite that the integrated functionality leverages Oracle Streams, an enterprise-only feature) but some things (like parallelism) is disabled, despite the fact that Oracle for some reason recommends using classic capture/replicat on Standard Edition and Standard Edition One. Considering the speed increase with integrated I’ll go with that, thank you very much.
The value of keys
Picture this: a single, heap table with a couple of hundred thousand rows. Inserting the data is quick and GoldenGate does an excellent job of shuttling the data over to the secondary site B. Now let’s see what happens when we do a delete instead. The cursor comes back in a few seconds on the pirmary, and after hitting commit I started looking at the secondary. This is where things become interesting.
The CPU went to 50% and basically pegged there – this is due to the fact that I’ve got two CPUs in my testing box and one of them was completely saturated. Funny thing is, that apart from an initial flurry of disk- and network activity, nothing seemed to be happening on the server. The rows are still in my table as well. Scratching my head I took a look at long-running queries using a query by David Mann
SELECT nvl(ses.username,'ORACLE PROC')||' ('||ses.sid||')' USERNAME, SID, MACHINE, REPLACE(SQL.SQL_TEXT,CHR(10),'') STMT, ltrim(to_char(floor(SES.LAST_CALL_ET/3600), '09')) || ':' || ltrim(to_char(floor(mod(SES.LAST_CALL_ET, 3600)/60), '09')) || ':' || ltrim(to_char(mod(SES.LAST_CALL_ET, 60), '09')) RUNT FROM V$SESSION SES, V$SQLtext_with_newlines SQL where SES.STATUS = 'ACTIVE' and SES.USERNAME is not null and SES.SQL_ADDRESS = SQL.ADDRESS and SES.SQL_HASH_VALUE = SQL.HASH_VALUE and Ses.AUDSID <> userenv('SESSIONID') order by runt desc, 1,sql.piece;
and found this little gem:
DELETE /*+ restrict_all_ref_cons */ FROM "DBUSER"."AAA" WHERE "A"=:1 and rownum < 2
Wait, what? This basically means a RBAR (row by agonizing row) handling of a delete which took about four HOURS(!). Lesson learned – ALWAYS make sure to have keys in your tables. And expect deletes to take a lot of time.
When stress testing the replication I started receiving quite a lot of bothersome memory problems (inability to reallocate, out of virtual memory, header corruptions, etc). After searching the internet for answers and finding exacly nothing, it turned out that there is a small gotcha with the 64 bit version of GoldenGate – it defaults to 64GB memory (which means 128GB virtual memory pool). Since my machine only sported 8GB of memory, the process very much did get a memory problem. By changing the parameter CACHESIZE to a more reasonable number the problem went away.