Thank you for the answer, but to be honest I am not really satisfied with it.
1. In the alert log we see a similar pattern for each SCN:
Beginning log switch checkpoint up to RBA [0x100a.2.10], SCN: 39289565
Thread 1 advanced to log sequence 4106 (LGWR switch), current SCN: 39289565
Current log# 2 seq# 4106 mem# 0: /data/oradata/PROD/redo2_0.log
Checkpoint not complete
and then directly:
Completed checkpoint up to RBA [0x100a.2.10], SCN: 39289565
Is it normal that more than 4 minutes pass between the "Beginning log switch checkpoint" and "Completed checkpoint" messages for the same SCN?
2. We always get the "Checkpoint not complete" message for each SCN, regardless of the number of redo log groups (we tried 3, 6, and 10 - no difference). Is this normal for a log-shipping environment with a standby database?
3. And the 'fast_start_mttr_target' parameter you recommended is not applicable because, as I mentioned above, we have a Standard Edition database, and this parameter only works in Enterprise Edition...
I kindly ask you to review the issue.
Thanks in advance!
November 16, 2020 - 2:54 am UTC
Apologies for not seeing the Standard Edition mention - I skimmed over it too quickly.
I'm going to make a hypothesis about a potential cause - you can confirm it by checking some data on your system that's not provided in this question.
With archive_lag_target set to "nnn", there are two likely scenarios:
1) Your redo logs are typically nearly full (let's say 90%) when they switch due to archive_lag_target, OR
2) Your redo logs get nowhere near full (let's say 20-30%) and they switch anyway due to archive_lag_target.
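One way to see which scenario applies is to compare how much redo was actually written per archived log against the online log size - a sketch using the standard V$ARCHIVED_LOG and V$LOG views (verify column availability on your version):

```sql
-- Average redo actually written per archived log, by day,
-- versus the configured online redo log size.
select trunc(first_time)                                 day,
       round(avg(blocks * block_size) / 1024 / 1024, 1) avg_mb_per_log,
       (select max(bytes) / 1024 / 1024 from v$log)     log_size_mb
from   v$archived_log
group  by trunc(first_time)
order  by 1;
```

If avg_mb_per_log sits well below log_size_mb, you are in scenario (2).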
In the latter case, there is a good chance we will not be checkpointing aggressively enough. Our checkpointing tries to do "as little as possible, as late as possible" because:
a) "as little" - it's I/O work that doesn't contribute directly to online performance, and
b) "as late" - because if we change a block 10 times and only checkpoint once, we flush it once instead of 10 times, which brings us back to (a) (less work).
So checkpointing will try to leave as much outstanding work as possible until we are getting close to "running out" of redo log. If you're in scenario (2) above, it never looks like we have come close to running out of redo, so we're very passive about checkpointing... and then "splat" - we find you've cycled around and want to reuse a redo log again.
Can you tell us what is in V$INSTANCE_RECOVERY.LOG_FILE_SIZE_REDO_BLKS?
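For context, you can pull that column together with its neighbours in the same view - all standard V$INSTANCE_RECOVERY columns, though the exact set varies a little by version:

```sql
-- Checkpoint targets, expressed in redo blocks:
--   log_file_size_redo_blks      - target imposed by the online redo log size
--   log_chkpt_timeout_redo_blks  - target imposed by log_checkpoint_timeout
--   log_chkpt_interval_redo_blks - target imposed by log_checkpoint_interval
--   actual_redo_blks / target_redo_blks - current position vs enforced target
select log_file_size_redo_blks,
       log_chkpt_timeout_redo_blks,
       log_chkpt_interval_redo_blks,
       actual_redo_blks,
       target_redo_blks
from   v$instance_recovery;
```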
In any event, you might indeed want to tinker with log_checkpoint_timeout and log_checkpoint_interval, but be aware that both have impacts (which is why we typically don't recommend them), i.e.
i) timeout - doesn't *guarantee* improved checkpointing, because the dirty-buffer volume is not necessarily consistent per unit of time, so you can still end up with the same issue;
ii) interval - can bump your I/O volume, because there's a probability we will (re)checkpoint the same blocks over and over - see (b) above.
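Both parameters are dynamic, so if you do decide to experiment, no restart is needed - the values below are purely illustrative, not recommendations:

```sql
-- log_checkpoint_timeout: max age (seconds) of a dirty buffer before it must be written.
alter system set log_checkpoint_timeout = 300 scope=both;

-- log_checkpoint_interval: checkpoint position lags the redo tail
-- by at most this many OS blocks.
alter system set log_checkpoint_interval = 100000 scope=both;
```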