You Asked

Hi Tom,

Can you explain to me how often an Oracle database needs to read from the control file? And can the placement of the control file on the same disk/diskgroup as the redo logs cause I/O contention?

Problem context:

We have a 3 node RAC cluster hosting 29 busy production databases. We are experiencing high wait spikes on all databases randomly which appears to root from “log file sync”. The redo logs are multiplexed across 2 ASM diskgroups (REDO1 and REDO2). We also multiplex control files which are located on the same ASM diskgroups.

Oddly, I am able to see thousands of kb/s being read from REDO1 at all times, REDO2 is next to nothing at all times. My understanding is that nothing reads from the redo logs, unless recovery is required.

This pointed me to the control files - i understand only the first controlfile entry from v$controlfile is “read” by Oracle, which would make sense as this entry is REDO1 on all 29 databases. Could controlfile “read” requests for 29 database systems accumulate thousands of kb/s worth of I/O?

I am concerned the placement of the controlfile’s on the same diskgroups as the redo logs are causing I/O contention thus causing wait for log file sync. As well as read I/O, the controlfile is stamped on every transaction, so will also be experiencing heavy write I/O. I have seen this file layout on many systems (redo and controlfiles together) but none with as many as 29 databases.

Many thanks.

and Connor said...

My understanding is that nothing reads from the redo logs, unless recovery is required.

Well, the archiver will and if you have standby database, then we'll be dispatching redo from the primary to standby pretty much continously. An AWR report can show you I/O distribution by file type, so you can use that confirm whether its redo or controlfile. But let's assume the latter.

There is still lots of *potential* causes here to be looked at.

First of all, 'log file sync' is time you waited for LGWR to get back to you, but that might be (or might *not* be) synonymous with LGWR I/O. You should check the ration of log file sync to log file parallel write to see how much of that time is truly based on waiting for LGWR I/O.

For example, smash a machine to 100% CPU and you'll sessions all stuck on log file sync because LGWR can't get a look in on the CPU, so everyone waits. So CPU is a good place to start.

Are you doing direct mode operations? Direct path loads, or NOLOGGING operations result in controlfile operations. Sometimes these can be happening even without knowledge, eg if you have SQL that uses the "WITH" clause, we might dynamically create and load global temporary tables to support this.

Backup activity can result in lots of controlfile operations - because RMAN is tracking a lot of its progress and results via the controlfile.

Do you have standby database? Again - controlfile used to coordinate activities there.

Similarly, how many archivelogs do you have "in play". Metadata for there will be stored in the controlfile - check the size of archivelog information in V$CONTROLFILE_RECORD_SECTION. Big = pain

Along similar lines, anything the interrupts archiving (eg disk full) will typically result in LGWR/ARCn etc losing the plot and going ballistic on the controlfile as they try/retry etc.

So that should give you some stuff to start with - if all else fails, might be time to get in touch with Support.

Is this answer out of date? If it is, please let us know via a Comment

More to Explore

Administration

Need more information on Administration? Check out the Administrators guide for the Oracle Database

Breadcrumb

Question and Answer

You Asked

and Connor said...

More to Explore

Administration

Redo log and controlfile I/O contention.

Breadcrumb

Question and Answer

You Asked

and Connor said...

More to Explore

Administration