  • Dataguard SYNC - how it really works.

Connor McDonald

Thanks for the question, Koh.

Asked: April 01, 2016 - 2:16 am UTC

Last updated: April 22, 2016 - 4:36 am UTC

Version: 11.2.0.4

You Asked

Hi all,

I have been wondering what impact maximum protection mode would have on my current setup. There are a few questions I have in mind, and I hope the gurus here can advise. Do forgive me if I sound oblivious.

q1) We know that the redo log buffer contains redo entries for both committed and uncommitted transactions. When the redo log buffer is flushed by LGWR to the ORL as well as to the standby redo logs, are all redo entries (for both committed and uncommitted transactions) flushed to the SRL as well?


q2) In SYNC mode, it is mentioned that LGWR synchronously sends redo entries over to RFS/SRL. If there is a network error, what happens during a log flush?
Will the redo entries stay inside the log buffer because LGWR cannot send them over to RFS/SRL?


q3) I simulated a network failure by shutting down the standby database's network interface, then traced the LGWR process on the primary using dbms_monitor.session_trace_enable. In the LGWR trace file, I found no log file parallel write at all.

Does that mean that during a network failure in Data Guard maximum protection mode, LGWR does not flush the redo buffer (since it is not able to send synchronously to the standby)?

q4) I realize that we are still able to perform DML. Does that mean that redo entries fill up the log buffer without being sent out to the standby?

Regards,
Alan

and Connor said...

The docs are pretty clear on this one:

"This protection mode ensures that no data loss will occur if the primary database fails. To provide this level of protection, the redo data needed to recover a transaction must be written to both the online redo log and to the standby redo log on at least one synchronized standby database before the transaction commits. To ensure that data loss cannot occur, the primary database will shut down, rather than continue processing transactions, if it cannot write its redo stream to at least one synchronized standby database"

It is almost as if the standby becomes "part" of your production instance, in that if you can't write to it, your *entire* database becomes unavailable, because we will shut the primary down. That is, you cannot issue a commit unless both the primary and a standby can correctly process and acknowledge it.

So the normal semantics apply: if your standby is unreachable, it's like an uncommitted transaction when you pull the power plug. It *never* committed, so it's undone on instance restart.

Hope this helps.
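
As a rough illustration of the rule quoted from the docs, here is a minimal Python sketch of the commit decision in maximum protection mode. It is a toy model, not Oracle code: the function name, its arguments and the returned strings are all made up for this example, and the local-write-failure branch is an assumption that the quoted passage does not cover.

```python
def commit_outcome(orl_written: bool, srl_acks: list) -> str:
    """Toy model of the maximum protection commit rule (hypothetical names).

    orl_written: redo reached the primary's online redo log.
    srl_acks:    one boolean per synchronized standby, True if that
                 standby confirmed the write to its standby redo log.
    """
    if not any(srl_acks):
        # No synchronized standby acknowledged the redo: per the quoted
        # documentation the primary shuts down rather than commit.
        return "primary shuts down"
    if not orl_written:
        # Assumption for this sketch only: without the local write the
        # transaction is simply not committed.
        return "not committed"
    return "committed"


print(commit_outcome(True, [False, True]))
print(commit_outcome(True, [False, False]))
```

One acknowledging standby is enough, which is why the second standby's failure in the first call does not matter.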

Comments

Alan, April 04, 2016 - 12:43 pm UTC

Hi Connor,

Thanks for your reply. Yep, I did read through the documentation.

But what is puzzling me is the ability to carry on executing DML statements without commit during a standby failure (before the primary shuts down), as well as LGWR's behavior when it is unable to flush to the standby.

If, as per the documentation, LGWR flushes the redo buffer every 3 seconds, how will LGWR react when it finds that it is unable to send/flush to RFS / the standby logs?

I tried shutting down the standby network interface, and found that I can still execute DML statements (without commit), and an LGWR trace shows that there are no parallel writes.

Are you able to shed more light on the above ?


Regards,
Alan
Connor McDonald
April 05, 2016 - 1:38 am UTC

(Without any genuine evidence to back this up) I would imagine that we (aka the primary) don't really care about the standby until the point of commit, because it is at that point that the primary is going to insist upon a response.

But that does raise the interesting scenario of how long we could go before we ultimately decide "enough is enough". Try it: see what happens if you generate more redo than the size of your log buffer. I'd anticipate you'd have to see log file parallel write at *some* stage.

Even in this case, it's not LGWR directly that gets the redo over to the standby; it's the log write network service (LNS) that reads from the redo logs and sends the data over to the standby. So as long as you don't commit, your primary will probably still "linger"... but once you commit, and the standby cannot... then we'd have to abort to maintain the max protection mode guarantee.

A reader, April 05, 2016 - 1:46 am UTC

Hi Connor,

Glad to hear your reply and thanks for the feedback.

You mentioned that it is not LGWR but LNS that sends the redo over to RFS. Does that apply even in the case of "SYNC"?

Looking at the 11.2 documentation, it seems that LGWR is the one sending it over, though; hence I got curious and tried to see its behavior when it can't send to the standby.

Regards,
Alan
Connor McDonald
April 05, 2016 - 2:31 am UTC

Yeah, there's some ambiguity here. For example, the wait event descriptions for 11.2 are:

"LNS wait on SENDREQ:
Total time spent waiting for redo data to be written to all ASYNC and SYNC redo transport destinations"

which suggests LNS will be handling sync.

However, in 12, the events are renamed, and this is one of the new ones:

"SYNC Remote Write:
The time spent by LGWR doing SYNC RFSWRITE operations"

which of course suggests the opposite :-)

And just for good measure, one of the features of moving to 11.2 for DataGuard was that 'sync' performance was improved, because:

"transmitting redo to the remote standby is done in parallel with LGWR writing redo to the local online log file of the primary database"
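
To see why doing the two writes in parallel helps, here is a back-of-the-envelope Python sketch. It assumes a deliberately simple model in which a commit waits for the sum of the two writes when they are serialized and only for the slower of the two when they overlap; the function name and the timings are invented for illustration and are not measurements.

```python
def sync_commit_wait_ms(local_write_ms: float, remote_write_ms: float,
                        parallel: bool) -> float:
    """Time a commit waits for redo in SYNC mode under a toy model.

    remote_write_ms is assumed to include the network round trip plus
    the standby redo log write.
    """
    if parallel:
        # Overlapped: the commit waits only for the slower operation.
        return max(local_write_ms, remote_write_ms)
    # Serialized: the remote send starts after the local write ends.
    return local_write_ms + remote_write_ms


print(sync_commit_wait_ms(2.0, 5.0, parallel=False))  # 7.0
print(sync_commit_wait_ms(2.0, 5.0, parallel=True))   # 5.0
```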

You'll also see a whole lot of new background processes from 11.2 onwards, solely for redo transport:

SQL> select name, description from v$bgprocess
  2  where lower(description) like '%redo%' order by name;

NAME  DESCRIPTION
----- -----------------------------------------------------
LGWR  Redo etc.
NSS1  Redo transport NSS1
NSS2  Redo transport NSS2
NSS3  Redo transport NSS3
NSS4  Redo transport NSS4
NSS5  Redo transport NSS5
NSS6  Redo transport NSS6
...
...


so the world changes rapidly :-)

Some good reading here

http://www.oracle.com/technetwork/database/features/availability/maa-096107.html

Hope this helps.


Alan Koh, April 08, 2016 - 6:53 pm UTC

Hi Connor,

Sorry for the late reply, and I truly appreciate your prompt response.

Yep. If we look at the 10.2 documentation, it seems that LGWR is still passing the redo to the LNS:

https://docs.oracle.com/cd/B19306_01/server.102/b14239/img_text/lgwrsync.htm

But the 11.2 documentation says otherwise (there, LGWR is sending directly).

Regardless of whether it is LGWR -> RFS, LNS -> RFS, or even the new processes in 12c sending over to RFS,

I would just like to seek your assurance that my understanding of the following is correct:

- The redo buffer, containing redo entries for both uncommitted and committed transactions, will still be flushed from the primary to be applied on the standby.

- The log buffer will still be flushed and written to the ORL by LGWR under several conditions (1/3 full, 3 seconds, checkpoint, etc.),

but the way the primary database's LGWR behaves when the standby database's RFS is uncontactable is neither documented nor confirmed (and might vary between versions/implementations).

Right ?

Regards,
Alan
Connor McDonald
April 09, 2016 - 7:04 am UTC

That is my understanding.

But I'll drop a message to the DataGuard product manager to see if I can get more info for you.

A reader, April 11, 2016 - 3:47 pm UTC

Hi Connor,

Thanks. Appreciate your reply.
Will wait for your update.

Thank you!
Chris Saxon
April 14, 2016 - 1:35 am UTC

Hi, I got this back from the Data Guard product manager:

"When the standby is no longer reachable, meaning that the Data Guard NSA process has not responded by NET_TIMEOUT seconds, the LGWR process begins its retry logic. This is a fixed number of retries to reconnect to the standby that usually takes about 5 minutes. During this time users can continue to execute DML which will write redo to the Log Buffer. But when they say COMMIT they will hang because the LGWR will not respond. The LGWR will not though, be writing any redo out to the Online Redo Log file at all as it is busy in its retry logic. When the retries finish with no response from the Standby the LGWR will abort the Primary database. All redo in the Log Buffer will be thrown away. Users will not have 'lost' any transactions because none of those transactions will have been acknowledged as committed to the user. The maximum length of time that the database could be in this mode would be ~5 minutes plus NET_TIMEOUT for the Standby (which defaults to 30 seconds so about 5 1/2 minutes say?) OR until the Log Buffer fills up since nothing is being flushed out to the ORL during this period."

He also scolded me for using "DataGuard" not "Data Guard", so keep that in mind as well :-)
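
The timing in the product manager's reply can be worked through with a small Python sketch. The 5-minute retry window and the 30-second NET_TIMEOUT default come from the quote above; the function names, the log buffer size and the redo rate are assumptions made up for this example, and the model assumes the log buffer is empty when the standby disappears.

```python
RETRY_WINDOW_S = 5 * 60   # "usually takes about 5 minutes" per the quote
NET_TIMEOUT_S = 30        # default quoted above


def max_stall_seconds(net_timeout_s: float = NET_TIMEOUT_S,
                      retry_window_s: float = RETRY_WINDOW_S) -> float:
    """Longest the primary lingers before LGWR aborts it."""
    return net_timeout_s + retry_window_s


def dml_window_seconds(log_buffer_bytes: float, redo_bytes_per_s: float,
                       net_timeout_s: float = NET_TIMEOUT_S,
                       retry_window_s: float = RETRY_WINDOW_S) -> float:
    """How long uncommitted DML can keep generating redo.

    Whichever comes first: the retries run out, or the log buffer
    fills because nothing is being flushed to the ORL.
    """
    deadline = max_stall_seconds(net_timeout_s, retry_window_s)
    if redo_bytes_per_s <= 0:
        return deadline
    return min(deadline, log_buffer_bytes / redo_bytes_per_s)


print(max_stall_seconds())  # 330 seconds, i.e. about 5 1/2 minutes
# Hypothetical 16 MiB log buffer filled at 100 KiB of redo per second:
print(dml_window_seconds(16 * 1024 * 1024, 100 * 1024))
```

With those made-up numbers the log buffer fills well before the retries finish, so the buffer size rather than the retry logic bounds how long DML can continue.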

A reader, April 20, 2016 - 12:08 pm UTC

Hi Connor,

I deeply appreciate the reply and the link-up with the "Data Guard" product manager.

This proves that we were somehow right: there is no redo flush when the standby is uncontactable ;)!

Thanks, I just couldn't get the answer anywhere else, but I am so glad I got it here.

100 million thanks!

Regards,
Alan
Connor McDonald
April 22, 2016 - 4:36 am UTC

Glad we could help, and I learned plenty myself in the discussion.
