Hello,
I found out that "Checkpoint not complete" messages appear at each log switch in the alert log of our Oracle Database 19c Standard Edition 2 Release 19.0.0.0.0.
We use log shipping to a standby database, therefore we force a log switch every 2 minutes with the archive_lag_target parameter (see PFILE below).
Our checkpoints seem to be quite long (~4 minutes). It also seems that our checkpoints are incomplete all the time, independent of the database load.
On the other hand, our processes can run DML statements normally without waiting a significant time on the event "Checkpoint not complete".
That is why I kindly ask you to explain:
1. What is the meaning of the "Checkpoint not complete" messages in this case?
2. Why do our DML statements not hang during an incomplete checkpoint?
3. Would it make sense to set up incremental checkpoints every N seconds with the log_checkpoint_timeout parameter?
Thanks in advance,
===============================================================================
alert.log:
===============================================================================
Beginning log switch checkpoint up to RBA [0x100a.2.10], SCN: 39289565
2020-09-25T14:50:37.557027+02:00
Thread 1 advanced to log sequence 4106 (LGWR switch), current SCN: 39289565
Current log# 2 seq# 4106 mem# 0: /data/oradata/PROD/redo2_0.log
2020-09-25T14:50:37.709741+02:00
ARC1 (PID:12141): Archived Log entry 22746 added for T-1.S-4105 ID 0xf432ed53 LAD:1
2020-09-25T14:52:40.171692+02:00
Thread 1 cannot allocate new log, sequence 4107
Checkpoint not complete
Current log# 2 seq# 4106 mem# 0: /data/oradata/PROD/redo2_0.log
2020-09-25T14:52:40.469412+02:00
Completed checkpoint up to RBA [0x1009.2.10], SCN: 39288123
2020-09-25T14:52:43.269670+02:00
Beginning log switch checkpoint up to RBA [0x100b.2.10], SCN: 39290953
2020-09-25T14:52:43.270757+02:00
Thread 1 advanced to log sequence 4107 (LGWR switch), current SCN: 39290953
Current log# 3 seq# 4107 mem# 0: /data/oradata/PROD/redo3_0.log
2020-09-25T14:52:43.281539+02:00
ARC0 (PID:12135): Archived Log entry 22747 added for T-1.S-4106 ID 0xf432ed53 LAD:1
2020-09-25T14:54:42.798805+02:00
Thread 1 cannot allocate new log, sequence 4108
Checkpoint not complete
Current log# 3 seq# 4107 mem# 0: /data/oradata/PROD/redo3_0.log
2020-09-25T14:54:43.377453+02:00
Completed checkpoint up to RBA [0x100a.2.10], SCN: 39289565
2020-09-25T14:54:45.873354+02:00
Beginning log switch checkpoint up to RBA [0x100c.2.10], SCN: 39291331
===============================================================================
Statspack:
===============================================================================
Avg %Total
%Tim Total Wait wait Waits Call
Event Waits out Time (s) (ms) /txn Time
---------------------------- ------------ ---- ---------- ------ -------- ------
log file sync 2,363 0 143 60 1.4 3.2
direct path read 2,266 0 69 30 1.3 1.5
log buffer space 123 0 42 344 0.1 .9
Failed Logon Delay 32 100 33 1036 0.0 .7
log file switch (checkpoint 11 0 9 783 0.0 .2
direct path sync 14 0 7 529 0.0 .2
log file switch completion 25 0 7 262 0.0 .1
direct path write 1,360 0 6 5 0.8 .1
===============================================================================
PFILE:
===============================================================================
PROD.__data_transfer_cache_size=0
PROD.__db_cache_size=4177526784
PROD.__java_pool_size=0
PROD.__large_pool_size=16777216
PROD.__oracle_base='/opt/oracle'#ORACLE_BASE set from environment
PROD.__pga_aggregate_target=2147483648
PROD.__sga_target=6006243328
PROD.__shared_io_pool_size=134217728
PROD.__shared_pool_size=1627389952
PROD.__streams_pool_size=33554432
PROD.__unified_pga_pool_size=0
*.archive_lag_target=120
*.audit_file_dest='/data/oradata/PROD/adump'
*.audit_trail='NONE'
*.compatible='12.1.0.2'
*.control_files='/data/oradata/PROD/control01.ctl','/data/oradata/PROD/flash_recovery_area/control02.ctl'#Restore Controlfile
*.core_dump_dest='/data/oradata/PROD/cdump'
*.db_block_size=8192
*.db_domain='corp.de'
*.db_file_multiblock_read_count=128# 128*8K=1M
*.db_name='PROD'
*.db_recovery_file_dest='/data/oradata/PROD/flash_recovery_area'
*.db_recovery_file_dest_size=4G
*.db_unique_name='PROD'
*.diagnostic_dest='/data/oradata/PROD'
*.job_queue_processes=10
*.log_archive_dest_1='LOCATION=/data/oradata/PROD/archive'
*.log_archive_dest_state_1='ENABLE'
*.log_archive_format='%t_%s_%r.dbf'
*.log_archive_max_processes=1
*.log_archive_min_succeed_dest=1
*.log_checkpoints_to_alert=TRUE
*.open_cursors=300
*.optimizer_adaptive_plans=FALSE
*.optimizer_adaptive_statistics=FALSE
*.pga_aggregate_target=2G
*.processes=600
*.recyclebin='off'
*.remote_login_passwordfile='EXCLUSIVE'
*.sec_protocol_error_further_action='(drop,10)'
*.sga_max_size=5992435712
*.sga_target=5992435712
*.shared_servers=0
*.undo_management='AUTO'
*.undo_retention=10800
*.undo_tablespace='UNDOTBS1'
===============================================================================
Redo:
===============================================================================
SQL> SELECT * FROM V$LOG;
GROUP# THREAD# SEQUENCE# BYTES BLOCKSIZE MEMBERS ARCHIVED STATUS FIRST_CHANGE# FIRST_TIME NEXT_CHANGE# NEXT_TIME CON_ID
---------- ---------- ---------- ---------- ---------- ---------- --------- ------------------------------------------------ ------------- --------------- ------------ --------------- ----------
1 1 4126 104857600 512 1 YES ACTIVE 39299332 25-SEP-20 39300899 25-SEP-20 0
2 1 4127 104857600 512 1 NO CURRENT 39300899 25-SEP-20 9.2954E+18 0
3 1 4125 104857600 512 1 YES ACTIVE 39298149 25-SEP-20 39299332 25-SEP-20 0
SQL> SELECT * FROM V$LOGFILE;
GROUP# STATUS TYPE MEMBER IS_RECOVE CON_ID
---------- --------------------- --------------------- -------------------------------------------------- --------- ----------
1 ONLINE /data/oradata/PROD/redo1_0.log NO 0
2 ONLINE /data/oradata/PROD/redo2_0.log NO 0
3 ONLINE /data/oradata/PROD/redo3_0.log NO 0
DML *is* probably hanging, but most likely only for a tiny amount of time. For example:
2020-09-25T14:54:42.798805+02:00
Thread 1 cannot allocate new log, sequence 4108
2020-09-25T14:54:43.377453+02:00
Completed checkpoint up to RBA [0x100a.2.10], SCN: 39289565
so in less than half a second you were "back in business". It's unlikely anyone even noticed the blip.
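If you want to confirm just how small the impact has been, you could check the cumulative waits for the relevant event since instance startup. A sketch (V$SYSTEM_EVENT reports times in centiseconds, hence the conversions):

```sql
-- Sketch: total time the instance has waited on log switches that were
-- blocked by an incomplete checkpoint, since startup.
SELECT event,
       total_waits,
       time_waited / 100  AS seconds_waited,   -- TIME_WAITED is in centiseconds
       average_wait * 10  AS avg_wait_ms       -- AVERAGE_WAIT is in centiseconds
FROM   v$system_event
WHERE  event = 'log file switch (checkpoint incomplete)';
```

If seconds_waited is small relative to uptime, the blips are cosmetic rather than a performance problem.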
Two-minute switches are very, very frequent (but I can see the motivation in terms of a "home grown Data Guard"). So here are a couple of things I'd be looking at:
a) a lot more log groups. Your checkpoint issues arise when we've been on group 1 and then flipped through groups 2 and 3, all within about 4 minutes (due to archive_lag_target), so group 1 is needed again before its checkpoint has completed. I'd be looking at (say) 8 to 10 groups, which means a group will only be reused after approximately 20 minutes, bringing it more into alignment with a "typical" redo log group scenario.
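Adding the extra groups could look something like this. This is only a sketch: the file paths follow your existing redoN_0.log naming convention but are assumptions, and SIZE 100M matches your current groups (104857600 bytes in V$LOG):

```sql
-- Sketch: grow from 3 to 8 groups, same size and naming as the existing ones.
ALTER DATABASE ADD LOGFILE GROUP 4 ('/data/oradata/PROD/redo4_0.log') SIZE 100M;
ALTER DATABASE ADD LOGFILE GROUP 5 ('/data/oradata/PROD/redo5_0.log') SIZE 100M;
ALTER DATABASE ADD LOGFILE GROUP 6 ('/data/oradata/PROD/redo6_0.log') SIZE 100M;
ALTER DATABASE ADD LOGFILE GROUP 7 ('/data/oradata/PROD/redo7_0.log') SIZE 100M;
ALTER DATABASE ADD LOGFILE GROUP 8 ('/data/oradata/PROD/redo8_0.log') SIZE 100M;
```

With 8 groups and a switch every 2 minutes, DBWR gets roughly 14 minutes (7 switches) to complete a checkpoint before a group must be reused.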
b) if you're still seeing the messages, look at fast_start_mttr_target to give the incremental checkpointing a push.
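For example (a sketch only; the 60-second value is illustrative, not a recommendation for your workload):

```sql
-- Sketch: target roughly 60 seconds of crash-recovery time, which makes
-- DBWR write dirty buffers more aggressively between log switches.
ALTER SYSTEM SET fast_start_mttr_target = 60 SCOPE = BOTH;

-- Then watch how the incremental checkpoint is keeping up (values in seconds):
SELECT target_mttr, estimated_mttr, writes_mttr
FROM   v$instance_recovery;
```

Note that fast_start_mttr_target supersedes the older log_checkpoint_timeout approach from your question 3, so you would normally set one or the other, not both.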